How to Extract Text with Formatting Using BeautifulSoup in Python

Extracting text from HTML while retaining formatting can be tricky, but BeautifulSoup makes it easy. In this comprehensive guide, we'll cover how to use BeautifulSoup to extract text elements like headings, bold text, italic text, and more from an HTML document or web page.

Contents

Overview
Importing BeautifulSoup
Parsing HTML
Extracting Headings
Extracting Bold Text
Extracting Other Formatting
Putting It All Together
Summary

Overview

BeautifulSoup is a popular Python library for web scraping. It parses HTML and XML documents and lets you extract data from them.

When scraping a web page, you may want to extract not just plain text, but text with its formatting intact. For example, you want headings to be marked as headings, bold text to remain bold, etc.

BeautifulSoup provides methods to search for elements by tag name or by CSS selector. We can use these to find all text marked up with a certain formatting, like <b> for bold text, and then extract just the text while omitting the HTML tags.

In this guide, we'll cover:

Importing BeautifulSoup
Parsing an HTML document
Extracting headings
Extracting bold text
Extracting italic text
Extracting other formatting like links and images
Putting it all together in one script

We'll be using simple HTML examples, but the same principles apply to scraping text from actual web pages.

Importing BeautifulSoup

Before we can use BeautifulSoup, we need to import it. The main class we need is BeautifulSoup from the bs4 module:

from bs4 import BeautifulSoup

This gives us access to the BeautifulSoup class to parse HTML documents.

Parsing HTML

The first step is to get the HTML content we want to parse as a string. This might come from a local file, scraping a web page, an API response, etc.

For this guide, we'll just define a simple HTML string:

html = """
<h1>This is a title</h1>
<p>This is a <b>bold</b> paragraph with <i>italic</i> text.</p>
"""

To parse this using BeautifulSoup, we pass it to the BeautifulSoup constructor:

soup = BeautifulSoup(html, 'html.parser')

This creates a BeautifulSoup object that we can use to extract information from the HTML.
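As a quick check that parsing worked, the short sketch below looks at the first paragraph in two ways: with its markup and as plain text. The variable names are illustrative only, and the sample HTML is assumed to contain the heading that the later examples print.

```python
from bs4 import BeautifulSoup

html = """
<h1>This is a title</h1>
<p>This is a <b>bold</b> paragraph with <i>italic</i> text.</p>
"""

soup = BeautifulSoup(html, "html.parser")

# Accessing a tag name as an attribute returns the first matching tag.
first_paragraph = soup.p

paragraph_markup = str(first_paragraph)      # the tag including its HTML
paragraph_text = first_paragraph.get_text()  # the text with tags stripped

print(paragraph_markup)
print(paragraph_text)
```

The difference between the two strings is the point of the rest of this guide: the markup carries the formatting, while get_text() (or .text) discards it.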

Extracting Headings

Let's start with extracting headings. To find all <h1> tags, we can use:

headings = soup.find_all('h1')

This gives us a list of Tag objects, one for each <h1> tag. To extract just the text, we loop through and call .text on each:

for h in headings:
    print(h.text)

This prints out just the heading text without the HTML:

This is a title

We can adapt this to extract other heading tags like <h2>, <h3>, etc.
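One way to adapt it, sketched here with made-up sample HTML, is to pass a list of tag names to find_all() so that several heading levels are matched in one call. Keeping the tag name next to the text means the heading level is not lost.

```python
from bs4 import BeautifulSoup

html = """
<h1>Main title</h1>
<h2>First section</h2>
<p>Some body text.</p>
<h3>A subsection</h3>
"""

soup = BeautifulSoup(html, "html.parser")

# A list of names matches any of those tags, in document order.
heading_tags = soup.find_all(["h1", "h2", "h3"])

# Pair each heading's tag name with its text so the level is preserved.
headings = [(tag.name, tag.get_text(strip=True)) for tag in heading_tags]

for level, text in headings:
    print(f"{level}: {text}")
```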

Extracting Bold Text

To extract all bold text, we search for <b> tags:

bold = soup.find_all('b')

Then loop through and print out just the .text like before:

for b in bold:
    print(b.text)

This extracts the bold text from the paragraph:

bold

The same process allows us to extract italic text using <i> tags, links with <a> tags, etc.
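Many pages mark up bold and italic text with <strong> and <em> rather than <b> and <i>. The sketch below, using an invented sample sentence, searches for both variants of each; whether a given page uses one, the other, or CSS styling instead is something you would need to check in its HTML.

```python
from bs4 import BeautifulSoup

html = (
    "<p>A <b>bold</b> word, a <strong>strong</strong> word, "
    "an <i>italic</i> word, and an <em>emphasized</em> word.</p>"
)

soup = BeautifulSoup(html, "html.parser")

# Treat <b> and <strong> as bold, and <i> and <em> as italic.
bold_texts = [tag.get_text() for tag in soup.find_all(["b", "strong"])]
italic_texts = [tag.get_text() for tag in soup.find_all(["i", "em"])]

print(bold_texts)
print(italic_texts)
```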

Extracting Other Formatting

Aside from text formatting like bold and italic, we can extract other HTML elements.

For example, to extract all images, we can search for <img> tags:

images = soup.find_all('img')

This gives us a Tag object for each image. We can then read the src attribute to get the image URLs:

for img in images:
    print(img['src'])

We can also extract links by searching for <a> tags, then read the href attribute for the URLs:

links = soup.find_all('a')

for link in links:
    print(link['href'])

These are just a few examples – there are many more tags and attributes we could extract depending on what information we want from the HTML.
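Indexing a tag with a missing attribute, as in img['src'], raises a KeyError. On real pages some tags lack the attribute you expect, so a more defensive version might look like the sketch below. The sample HTML is invented for illustration.

```python
from bs4 import BeautifulSoup

html = """
<p><a href="https://www.example.com">a link</a> and <a name="top">an anchor without href</a></p>
<p><img src="image.jpg" alt="An example image"> <img alt="no source"></p>
"""

soup = BeautifulSoup(html, "html.parser")

# href=True keeps only <a> tags that actually have an href attribute.
link_urls = [a["href"] for a in soup.find_all("a", href=True)]

# .get() returns None instead of raising KeyError when the attribute is missing.
image_sources = [img.get("src") for img in soup.find_all("img")]

print(link_urls)
print(image_sources)
```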

Putting It All Together

Now let's put everything together into one script that extracts:

Headings
Bold text
Italic text
Images
Links

Here is the full code:

from bs4 import BeautifulSoup

html = """
<h1>This is a title</h1>

<p>This is a <b>bold</b> paragraph with <i>italic</i> text and <a href="https://www.example.com">a link</a>.</p>

<p>Here is an image: <img src="image.jpg"></p>
"""

soup = BeautifulSoup(html, 'html.parser')

# Extract headings
headings = soup.find_all('h1')
for h in headings:
    print(h.text)

# Extract bold text
bold = soup.find_all('b')
for b in bold:
    print(b.text)

# Extract italic text
italic = soup.find_all('i')
for i in italic:
    print(i.text)

# Extract links
links = soup.find_all('a')
for link in links:
    print(link['href'])

# Extract images
images = soup.find_all('img')
for img in images:
    print(img['src'])

This script prints out:

This is a title
bold  
italic
https://www.example.com
image.jpg

And there we have it: each piece of text extracted according to its formatting, along with the link and image URLs.
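If the goal is a single string that still shows which words were bold or italic, one possible approach is to replace each formatting tag with a plain-text marker before extracting the text. The sketch below uses Markdown-style asterisks, which is a choice made for this example and not something BeautifulSoup prescribes.

```python
from bs4 import BeautifulSoup

html = "<p>This is a <b>bold</b> paragraph with <i>italic</i> text.</p>"
soup = BeautifulSoup(html, "html.parser")

# Swap each bold tag for its text wrapped in ** markers.
for tag in soup.find_all(["b", "strong"]):
    tag.replace_with(f"**{tag.get_text()}**")

# Swap each italic tag for its text wrapped in * markers.
for tag in soup.find_all(["i", "em"]):
    tag.replace_with(f"*{tag.get_text()}*")

marked_text = soup.get_text()
print(marked_text)
```

Note that replace_with() modifies the parsed tree in place, so parse the HTML again if you also need the original tags afterwards.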

The same technique works on actual web pages: pass the HTML from the page's response (for example, response.text when using the requests library) to BeautifulSoup instead of a hardcoded string.
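As a rough sketch of how that might look, the example below separates downloading from extraction so the extraction step can be tried on any HTML string. The function names extract_formatted_text and fetch_html are hypothetical, the URL in the usage comment is a placeholder, and the standard library's urllib is used here only to keep the example free of extra dependencies.

```python
from urllib.request import urlopen

from bs4 import BeautifulSoup


def extract_formatted_text(html):
    """Return headings, bold text, and italic text found in an HTML string."""
    soup = BeautifulSoup(html, "html.parser")
    return {
        "headings": [h.get_text(strip=True) for h in soup.find_all("h1")],
        "bold": [b.get_text(strip=True) for b in soup.find_all("b")],
        "italic": [i.get_text(strip=True) for i in soup.find_all("i")],
    }


def fetch_html(url):
    """Download a page and return its HTML as text.

    Assumes UTF-8 when the server does not declare a charset.
    """
    with urlopen(url) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


# Hypothetical usage; the URL is a placeholder:
# result = extract_formatted_text(fetch_html("https://www.example.com"))
```

Before scraping a real site, check its terms of use and robots.txt, and expect that some pages build their content with JavaScript, in which case the downloaded HTML may not contain the text you see in a browser.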

Summary

Extracting text with formatting from HTML using BeautifulSoup involves:

Importing the BeautifulSoup class
Parsing the HTML with BeautifulSoup()
Using find_all() to locate elements by tag name (select() does the same with CSS selectors)
Looping through the results and printing element.text to get just the text
Accessing tag attributes like href and src for links and images
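The examples in this guide use find_all(). For completeness, here is a minimal sketch of the CSS selector route with select() and select_one(), run against HTML similar to the sample used earlier.

```python
from bs4 import BeautifulSoup

html = """
<h1>This is a title</h1>
<p>This is a <b>bold</b> paragraph with <i>italic</i> text and <a href="https://www.example.com">a link</a>.</p>
"""

soup = BeautifulSoup(html, "html.parser")

# "p b" matches <b> tags nested anywhere inside a <p>.
bold_in_paragraphs = [tag.get_text() for tag in soup.select("p b")]

# "a[href]" matches only links that carry an href attribute.
link_urls = [a["href"] for a in soup.select("a[href]")]

# select_one() returns the first match, or None if nothing matches.
title = soup.select_one("h1").get_text()

print(title, bold_in_paragraphs, link_urls)
```

Selectors are convenient when the element you want is easier to describe by its position or attributes than by tag name alone.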

This allows you to scrape text from web pages and preserve useful formatting and links.

BeautifulSoup is handy not just for extracting plain text, but also for extracting structured data from HTML and XML documents.


Written by Python Scraper
