Extracting text from HTML while retaining formatting can be tricky, but BeautifulSoup makes it easy. In this comprehensive guide, we'll cover how to use BeautifulSoup to extract text elements like headings, bold text, italic text, and more from an HTML document or web page.
Overview
BeautifulSoup is a popular Python library used for web scraping purposes. It allows you to parse HTML and XML documents and extract data from them.
When scraping a web page, you may want to extract not just plain text, but text with its formatting intact. For example, you want headings to be marked as headings, bold text to remain bold, etc.
BeautifulSoup provides methods to search for elements by tag name or CSS selector. We can use these to find all text marked up with a certain formatting, like <b> for bold text, and then extract just the text while omitting the HTML tags.
In this guide, we'll cover:
- Importing BeautifulSoup
- Parsing an HTML document
- Extracting headings
- Extracting bold text
- Extracting italic text
- Extracting other formatting like links and images
- Putting it all together in one script
We'll be using simple HTML examples, but the same principles apply to scraping text from actual web pages.
Importing BeautifulSoup
Before we can use BeautifulSoup, we need to import it. The main class we need is BeautifulSoup from the bs4 module:
from bs4 import BeautifulSoup
This gives us access to the BeautifulSoup class to parse HTML documents.
Parsing HTML
The first step is to get the HTML content we want to parse as a string. This might come from a local file, scraping a web page, an API response, etc.
For this guide, we'll just define a simple HTML string:
html = """
<h1>This is a title</h1>
<p>This is a <b>bold</b> paragraph with <i>italic</i> text.</p>
"""
To parse this using BeautifulSoup, we pass it to the BeautifulSoup constructor along with the name of the parser to use:
soup = BeautifulSoup(html, 'html.parser')
This creates a BeautifulSoup object that we can use to extract information from the HTML.
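The HTML does not have to be a string literal. As a minimal sketch, assuming the page has been saved to disk, the constructor also accepts an open file handle. The file name page.html and its contents are made up for illustration, and the temporary directory is only there to keep the example self-contained:

```python
import tempfile
from pathlib import Path

from bs4 import BeautifulSoup

# Hypothetical sample content, written to a temporary file for the demo
sample = "<h1>This is a title</h1><p>Some <b>bold</b> text.</p>"

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "page.html"
    path.write_text(sample, encoding="utf-8")

    # BeautifulSoup accepts an open file handle as well as a string
    with path.open(encoding="utf-8") as fh:
        file_soup = BeautifulSoup(fh, "html.parser")

print(file_soup.h1.get_text())
```

The resulting object behaves the same as one built from a string, so everything that follows applies to either source.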
Extracting Headings
Let's start with extracting headings. To find all <h1> tags, we can use:
headings = soup.find_all('h1')
This gives us a list-like ResultSet containing a Tag object for each <h1> tag. To extract just the text, we loop through and call .text on each:
for h in headings:
    print(h.text)
This prints out just the heading text without the HTML:
This is a title
We can adapt this to extract other heading tags like <h2>, <h3>, etc.
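Rather than repeating the search for every heading level, one option is to match all of them in a single call. The sketch below passes a compiled regular expression to find_all(), which matches it against tag names. The sample HTML and the variable names are illustrative only:

```python
import re

from bs4 import BeautifulSoup

html = """
<h1>This is a title</h1>
<h2>A subsection</h2>
<p>Body text.</p>
<h3>A smaller heading</h3>
"""
soup = BeautifulSoup(html, "html.parser")

# Matches the tag names h1 through h6 and nothing else
heading_pattern = re.compile(r"^h[1-6]$")

# Keep the tag name next to the text so the heading level is not lost
headings = [
    (tag.name, tag.get_text(strip=True))
    for tag in soup.find_all(heading_pattern)
]

for level, text in headings:
    print(f"{level}: {text}")
```

Storing the tag name alongside the text is what keeps the "formatting" here: we still know which line was an <h1> and which was an <h3>.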
Extracting Bold Text
To extract all bold text, we search for <b> tags:
bold = soup.find_all('b')
Then loop through and print out just the .text like before:
for b in bold:
    print(b.text)
This extracts the bold text from the paragraph:
bold
The same process allows us to extract italic text using <i> tags, links with <a> tags, etc.
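Real pages often mark bold text with <strong> and italic text with <em> instead of <b> and <i>. A hedged sketch of two ways to catch both spellings: passing a list of tag names to find_all(), or passing a CSS selector group to select(). The sample HTML is invented for the example:

```python
from bs4 import BeautifulSoup

html = (
    "<p><b>bold</b>, <strong>strong</strong>, "
    "<i>italic</i> and <em>emphasis</em>.</p>"
)
soup = BeautifulSoup(html, "html.parser")

# A list of names matches any of the listed tags
bold_texts = [tag.get_text() for tag in soup.find_all(["b", "strong"])]

# The equivalent idea written as a CSS selector group
italic_texts = [tag.get_text() for tag in soup.select("i, em")]

print(bold_texts)
print(italic_texts)
```

Which form to use is mostly a matter of taste. select() becomes more useful once the search depends on nesting or classes, for example "p b" for bold text inside paragraphs only.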
Extracting Other Formatting
Aside from text formatting like bold and italic, we can extract other HTML elements.
For example, to extract all images, we can search for <img> tags:
images = soup.find_all('img')
This gives us a Tag object for each image. We can then extract the src attribute to get the image URLs:
for img in images:
    print(img['src'])
We can also extract links by searching for <a> tags, then extract the href attribute for the URLs:
links = soup.find_all('a')
for link in links:
    print(link['href'])
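One caveat: indexing a tag with square brackets raises a KeyError when the attribute is missing, which happens on real pages (anchors without an href, images without a src). A small sketch of two defensive alternatives, using made-up sample HTML:

```python
from bs4 import BeautifulSoup

html = """
<a href="https://www.example.com">a link</a>
<a name="anchor-only">no href here</a>
<img src="image.jpg" alt="An image">
<img alt="no src here">
"""
soup = BeautifulSoup(html, "html.parser")

# href=True keeps only <a> tags that actually carry an href attribute
hrefs = [a["href"] for a in soup.find_all("a", href=True)]

# .get() returns None for a missing attribute instead of raising KeyError
srcs = [img.get("src") for img in soup.find_all("img")]

print(hrefs)
print(srcs)
```

Filtering in find_all() drops the incomplete tags entirely, while .get() keeps them and leaves it to us to handle the None.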
These are just a few examples – there are many more tags and attributes we could extract depending on what information we want from the HTML.
Putting It All Together
Now let's put everything together into one script that extracts:
- Headings
- Bold text
- Italic text
- Images
- Links
Here is the full code:
from bs4 import BeautifulSoup
html = """
<h1>This is a title</h1>
<p>This is a <b>bold</b> paragraph with <i>italic</i> text and <a href="https://www.example.com">a link</a>.</p>
<p>Here is an image: <img src="image.jpg"></p>
"""
soup = BeautifulSoup(html, 'html.parser')
# Extract headings
headings = soup.find_all('h1')
for h in headings:
    print(h.text)
# Extract bold text
bold = soup.find_all('b')
for b in bold:
    print(b.text)
# Extract italic text
italic = soup.find_all('i')
for i in italic:
    print(i.text)
# Extract links
links = soup.find_all('a')
for link in links:
    print(link['href'])
# Extract images
images = soup.find_all('img')
for img in images:
    print(img['src'])
This script prints out:
This is a title
bold
italic
https://www.example.com
image.jpg
And there we have it: the text of each formatted element, plus the link and image URLs, pulled out by tag type!
The same technique works on actual web pages by passing the HTML from the page's response object to BeautifulSoup instead of a hardcoded string.
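As a rough sketch of that last step, the helper below downloads a page and hands the response body to BeautifulSoup. It uses urllib from the standard library rather than any particular HTTP client. The function names extract_bold and extract_bold_from_url are hypothetical, and the URL in the comment is a placeholder:

```python
from urllib.request import urlopen

from bs4 import BeautifulSoup


def extract_bold(html):
    """Return the text of every <b> tag in an HTML string (or bytes)."""
    soup = BeautifulSoup(html, "html.parser")
    return [b.get_text() for b in soup.find_all("b")]


def extract_bold_from_url(url, timeout=10):
    """Download a page and pass its body to extract_bold()."""
    with urlopen(url, timeout=timeout) as response:
        return extract_bold(response.read())


# Example call (needs network access; the URL is a placeholder):
# print(extract_bold_from_url("https://www.example.com"))
```

Keeping the parsing in its own function means it can be tried on a hardcoded string first and pointed at a live page afterwards.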
Summary
Extracting text with formatting from HTML using BeautifulSoup involves:
- Importing the BeautifulSoup class
- Parsing the HTML with BeautifulSoup()
- Using find_all() and CSS selectors to extract elements
- Looping through the results and printing element.text to get just the text
- Accessing tag attributes like href and src for links and images
This allows you to scrape text from web pages and preserve useful formatting and links.
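If the goal is a single string where the formatting stays inline rather than separate lists per tag, one possible approach is to walk the parsed tree and wrap each formatted element in plain-text markers. The sketch below is only an illustration under that assumption: the Markdown-style markers, the MARKERS mapping, and the to_marked_text function are all invented here, not part of BeautifulSoup.

```python
from bs4 import BeautifulSoup, NavigableString

# Hypothetical mapping from tag name to a Markdown-style marker
MARKERS = {"b": "**", "strong": "**", "i": "*", "em": "*"}


def to_marked_text(node):
    """Recursively convert a parsed node to text with inline markers."""
    if isinstance(node, NavigableString):
        return str(node)
    inner = "".join(to_marked_text(child) for child in node.children)
    if node.name == "a" and node.get("href"):
        return f"[{inner}]({node['href']})"
    marker = MARKERS.get(node.name, "")
    return f"{marker}{inner}{marker}"


html = (
    "<p>This is a <b>bold</b> paragraph with <i>italic</i> text and "
    '<a href="https://www.example.com">a link</a>.</p>'
)
soup = BeautifulSoup(html, "html.parser")
marked = to_marked_text(soup.p)
print(marked)
```

Tags that are not in the mapping simply contribute their text, so the function degrades to plain text extraction for markup it does not know about.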
BeautifulSoup is handy not just for extracting plain text, but also for extracting structured data from HTML and XML documents.