How to Extract Text with Formatting Using BeautifulSoup in Python

Extracting text from HTML while retaining formatting can be tricky, but BeautifulSoup makes it easy. In this comprehensive guide, we'll cover how to use BeautifulSoup to extract text elements like headings, bold text, italic text, and more from an HTML document or web page.

Overview

BeautifulSoup is a popular Python library used for web scraping purposes. It allows you to parse HTML and XML documents and extract data from them.

When scraping a web page, you may want to extract not just plain text, but text with its formatting intact. For example, you want headings to be marked as headings, bold text to remain bold, etc.

BeautifulSoup provides methods to search for and extract specific HTML tags or CSS selectors. We can use this to find all text marked up with a certain formatting, like <b> for bold text, and then extract just the text while omitting the HTML tags.

In this guide, we'll cover:

  • Importing BeautifulSoup
  • Parsing an HTML document
  • Extracting headings
  • Extracting bold text
  • Extracting italic text
  • Extracting other formatting like links and images
  • Putting it all together in one script

We'll be using simple HTML examples, but the same principles apply to scraping text from actual web pages.

Importing BeautifulSoup

Before we can use BeautifulSoup, we need to import it. The main class we need is BeautifulSoup from the bs4 module:

from bs4 import BeautifulSoup

This gives us access to the BeautifulSoup class to parse HTML documents.

Parsing HTML

The first step is to get the HTML content we want to parse as a string. This might come from a local file, scraping a web page, an API response, etc.

For this guide, we'll just define a simple HTML string:

html = """
<h1>This is a title</h1>
<p>This is a <b>bold</b> paragraph with <i>italic</i> text.</p>
"""

To parse this using BeautifulSoup, we pass it to the BeautifulSoup constructor:

soup = BeautifulSoup(html, 'html.parser')

This creates a BeautifulSoup object that we can use to extract information from the HTML.
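As a quick sanity check, a minimal sketch reusing the HTML string above: BeautifulSoup lets us access the first matching tag directly as an attribute of the soup object.

```python
from bs4 import BeautifulSoup

html = '<h1>This is a title</h1><p>This is a <b>bold</b> paragraph with <i>italic</i> text.</p>'
soup = BeautifulSoup(html, 'html.parser')

# Attribute access returns the first matching tag in the document
paragraph = soup.p
print(paragraph.text)  # prints: This is a bold paragraph with italic text.
```

Calling .text on a tag strips all nested markup and returns only the text content.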

Extracting Headings

Let's start with extracting headings. To find all <h1> tags, we can use:

headings = soup.find_all('h1')

This gives us a list of Tag objects, one for each <h1> element. To extract just the text, we loop through and call .text on each:

for h in headings:
    print(h.text)

This prints out just the heading text without the HTML:

This is a title

We can adapt this to extract other heading tags like <h2>, <h3>, etc.
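One convenient variation: find_all() also accepts a list of tag names, so every heading level can be collected in a single call. A sketch using a small made-up snippet:

```python
from bs4 import BeautifulSoup

html = """
<h1>Main title</h1>
<h2>First section</h2>
<h3>A subsection</h3>
"""
soup = BeautifulSoup(html, 'html.parser')

# A list of tag names matches any of them, returned in document order
for heading in soup.find_all(['h1', 'h2', 'h3']):
    print(heading.name, heading.text)
```

Each result keeps its .name attribute, so we can tell which heading level matched.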

Extracting Bold Text

To extract all bold text, we search for <b> tags:

bold = soup.find_all('b')

Then loop through and print out just the .text like before:

for b in bold:
    print(b.text)

This extracts the bold text from the paragraph:

bold

The same process allows us to extract italic text using <i> tags, links with <a> tags, etc.

Extracting Other Formatting

Aside from text formatting like bold and italic, we can extract other HTML elements.

For example, to extract all images, we can search for <img> tags:

images = soup.find_all('img')

This gives us Tag objects representing each image. We can then extract the src attribute to get the image URLs:

for img in images:
    print(img['src'])

We can also extract links by searching for <a> tags, then extract the href attribute for the URLs:

links = soup.find_all('a')

for link in links:
    print(link['href'])

These are just a few examples – there are many more tags and attributes we could extract depending on what information we want from the HTML.
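Beyond find_all(), BeautifulSoup also supports CSS selectors through the select() method, which can be more concise when combining tags, classes, and nesting. A sketch with hypothetical markup:

```python
from bs4 import BeautifulSoup

html = '<div class="article"><p>Intro with <b>key term</b>.</p></div><p><b>Outside</b></p>'
soup = BeautifulSoup(html, 'html.parser')

# Only <b> tags inside a <p> within div.article match this selector
for b in soup.select('div.article p b'):
    print(b.text)  # prints: key term
```

The second <b> tag is skipped because it sits outside the div.article container.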

Putting It All Together

Now let's put everything together into one script that extracts:

  • Headings
  • Bold text
  • Italic text
  • Images
  • Links

Here is the full code:

from bs4 import BeautifulSoup

html = """
<h1>This is a title</h1>

<p>This is a <b>bold</b> paragraph with <i>italic</i> text and <a href="https://www.example.com">a link</a>.</p>

<p>Here is an image: <img src="image.jpg"></p>
"""

soup = BeautifulSoup(html, 'html.parser')

# Extract headings
headings = soup.find_all('h1')
for h in headings:
    print(h.text)

# Extract bold text
bold = soup.find_all('b')
for b in bold:
    print(b.text)

# Extract italic text
italic = soup.find_all('i')
for i in italic:
    print(i.text)

# Extract links
links = soup.find_all('a')
for link in links:
    print(link['href'])

# Extract images
images = soup.find_all('img')
for img in images:
    print(img['src'])

This script prints out:

This is a title
bold
italic
https://www.example.com
image.jpg

And there we have it – text extracted with formatting intact!

The same technique works on actual web pages by passing the HTML from the page's response object to BeautifulSoup instead of a hardcoded string.
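For example, with the requests library (an assumption; any HTTP client works), the downloaded HTML goes straight into the BeautifulSoup constructor:

```python
import requests
from bs4 import BeautifulSoup

# Download a live page; https://www.example.com is just a placeholder URL
response = requests.get('https://www.example.com')
soup = BeautifulSoup(response.text, 'html.parser')

# The same find_all() calls now work on the real page
for h in soup.find_all('h1'):
    print(h.text)
```

response.text holds the decoded HTML body, so everything shown earlier applies unchanged.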

Summary

Extracting text with formatting from HTML using BeautifulSoup involves:

  • Importing the BeautifulSoup class
  • Parsing the HTML with BeautifulSoup()
  • Using find_all() (or select() with CSS selectors) to locate elements
  • Looping through the results and printing element.text to get just the text
  • Accessing tag attributes like href and src for links and images

This allows you to scrape text from web pages and preserve useful formatting and links.

BeautifulSoup is a handy library for not just extracting plain text, but extracting structured data from HTML and XML documents.

Written by Python Scraper
