How to Get Text Using LXML

LXML is a powerful Python library for parsing and scraping HTML and XML documents. With LXML, you can easily extract text and data from web pages. In this comprehensive guide, you'll learn how to use LXML's etree module and XPath support to get text from HTML documents.

Overview of LXML

LXML is a Pythonic binding for the C libraries libxml2 and libxslt; its core API is exposed through the lxml.etree module. The key features of LXML include:

  • Fast XML and HTML parsing and XPath evaluations
  • Easy creation of XML documents
  • Powerful XML and HTML tree traversal
  • CSS selectors via cssselect
  • Built-in support for XSLT

Some key advantages of using LXML over Python's built-in XML parsing libraries:

  • Speed – LXML is very fast at parsing XML and HTML because it's built on top of the C libraries libxml2 and libxslt. It's generally faster than BeautifulSoup's default parser and the built-in xml.etree.ElementTree.

  • XPath support – LXML provides full XPath 1.0 support, allowing powerful element selection capabilities. XPath makes selecting elements super easy.

  • CSS Selectors – You can use CSS selectors for selecting elements via the cssselect module. This is handy for folks familiar with CSS.

  • HTML Parsing – LXML can parse broken HTML pages and fix errors on-the-fly. This makes it more robust than regular XML parsers when dealing with web data.

  • Extensions – LXML provides a wide range of extensions, like HTML cleaning, serialization, and XML Schema validation.

Overall, LXML combines speed, rich features, and simplicity. It's a great choice for web scraping and processing XML/HTML documents in Python.

Install LXML

LXML supports all maintained Python 3 versions (older releases also supported Python 2.7). It can be installed via pip:

pip install lxml

For certain Linux distros, you may need to install the libxml2 and libxslt development packages beforehand (for example, libxml2-dev and libxslt1-dev on Debian/Ubuntu).
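
To confirm the installation, you can print the version tuples lxml exposes for itself and for the underlying libxml2 (a quick sanity check, not required):

from lxml import etree

print(etree.LXML_VERSION)    # version of lxml itself, e.g. (4, 9, 3, 0)
print(etree.LIBXML_VERSION)  # version of the underlying libxml2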

Now let's see LXML in action for scraping text from HTML.

Parse HTML with LXML

The first step is to parse the HTML content into an LXML document that we can query.

LXML provides the lxml.html.fromstring() function to parse HTML:

from lxml import html

html_doc = """
<html>
<head>
  <title>My Page</title>
</head>

<body>

  <h1>Hello World</h1>
  <p>This is a page</p>
</body>
</html>
"""

page = html.fromstring(html_doc)

This parses the HTML into an Element tree. We can also parse from a file:

page = html.parse('page.html').getroot()  # parse() returns an ElementTree; getroot() gives the root element

Or from a URL:

import requests
page = html.fromstring(requests.get('http://example.com').text)

No matter the input source, LXML parses it into an element tree that we can start querying.

Use XPath to Get Text

Now that we have a parsed LXML document, we can use XPath expressions to extract data.

XPath allows selecting elements by referencing their position in the document tree. Some examples:

  • /html/body/h1 – Select <h1> under <body>
  • //p – Select all <p> tags anywhere
  • /html/head/title/text() – Get text inside <title>

Let's grab the <h1> text from the parsed page:

h1 = page.xpath('/html/body/h1/text()')[0]
print(h1)
# Prints "Hello World"

The xpath() method runs the XPath expression against the element tree. It returns a list of matches – element objects, or plain strings for text() and attribute selections. For text, we take the first result with [0].
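
Since xpath() always returns a list, indexing with [0] raises an IndexError when nothing matches. A small helper like the one below (first_text is just an illustrative name, not part of LXML) keeps a scraper from crashing on missing elements, reusing the page parsed above:

def first_text(tree, expression, default=''):
    """Return the first text match for an XPath expression, or a default value."""
    results = tree.xpath(expression)
    return results[0] if results else default

print(first_text(page, '/html/head/title/text()'))             # My Page
print(first_text(page, '//h3/text()', default='(not found)'))  # (not found)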

Some more examples of getting text:

title = page.xpath('/html/head/title/text()')[0]

paras = page.xpath('//p/text()')

link_text = page.xpath('//*[@id="link"]/text()')

The XPath expressions select just the text nodes we want. This makes extracting text content super easy!
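
Note that /text() only returns text nodes that are direct children of the matched element. To collect all the text inside an element, including text nested in child tags, lxml.html elements provide a text_content() method. A quick sketch:

from lxml import html

# text() vs text_content() on nested markup
snippet = html.fromstring('<p>Hello <b>brave</b> world</p>')

print(snippet.xpath('text()'))   # ['Hello ', ' world'] – direct text nodes only
print(snippet.text_content())    # Hello brave world – all descendant text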

Get Text from a Web Page

Let's see how to scrape text from a real web page. We'll extract the headlines from the New York Times homepage:

import requests
from lxml import html

page = requests.get('https://www.nytimes.com/')
tree = html.fromstring(page.content)

# Get headlines (the exact tags may change as the site's markup evolves)
headlines = tree.xpath('//h2/text()')

print('Headlines:')
for h in headlines:
    print(h)

This prints out all the <h2> headlines from the NYTimes homepage. The key steps are:

  1. Fetch page HTML using requests
  2. Parse HTML into LXML tree
  3. Use XPath to select //h2/text() text elements
  4. Print out text of matching elements

And we have successfully extracted text content from a live web page with LXML!
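
Raw text nodes scraped this way often carry stray whitespace, and some matches may be whitespace-only. A small clean-up pass on the same results keeps the output tidy:

# Strip whitespace and drop empty matches before printing
headlines = [h.strip() for h in tree.xpath('//h2/text()') if h.strip()]

for h in headlines:
    print(h)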

Namespaces in XPath

When parsing XML documents, you may encounter namespaces in tag names like <h:h1> and <content:p>.

Namespaces help avoid element naming conflicts but require a tweak to the XPath:

<html xmlns:content="http://example.com">

  <content:body>
    <content:p>Hello world!</content:p>
  </content:body>

</html>

To reference namespaced elements without binding a prefix, match on the element's local name:

para = tree.xpath('//*[local-name()="p"]/text()')

This will match <content:p> elements regardless of their namespace.

You can also bind a namespace to a prefix like c:

nsmap = {'c': 'http://example.com'}
para = tree.xpath('//c:p/text()', namespaces=nsmap)

Binding namespaces to prefixes helps write more readable XPath selectors.
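
Putting it together, here is a minimal, self-contained sketch that parses the namespaced snippet above with lxml.etree and extracts the paragraph text both ways:

from lxml import etree

xml_doc = """<html xmlns:content="http://example.com">
  <content:body>
    <content:p>Hello world!</content:p>
  </content:body>
</html>"""

ns_tree = etree.fromstring(xml_doc)

# Match on local name, ignoring the namespace
print(ns_tree.xpath('//*[local-name()="p"]/text()'))      # ['Hello world!']

# Match via a prefix bound to the namespace URI
nsmap = {'c': 'http://example.com'}
print(ns_tree.xpath('//c:p/text()', namespaces=nsmap))    # ['Hello world!']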

Use CSS Selectors

LXML also supports CSS selectors for element selection via the cssselect module (a separate package, installed with pip install cssselect).

For example:

from lxml.cssselect import CSSSelector

sel = CSSSelector('h1')
headers = sel(tree)

This finds all <h1> elements from the parsed tree.

Some more examples of CSS selectors:

elts = CSSSelector('#content p.text')
links = CSSSelector('a[href]')

CSS selectors provide an alternative way to select elements, especially useful for those with CSS experience.
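
Parsed lxml elements also expose a cssselect() convenience method, so you can skip constructing a CSSSelector object explicitly (assuming the cssselect package is installed):

# Equivalent shortcut on a parsed element
for h1 in tree.cssselect('h1'):
    print(h1.text_content())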

Clean Up HTML

Real-world HTML is often malformed and dirty. LXML has tools to clean up and normalize documents:

  • html.tostring – Serialize a document or element back to a string
  • html.clean – Module with Cleaner and clean_html() for stripping scripts, styles, and other unwanted markup
  • html.document_fromstring – Parse a string (even a fragment) into a complete document with <html> and <body>

For example:

dirty = '<p>Text</b>'
clean = html.document_fromstring(dirty)
print(html.tostring(clean))

# b'<html><body><p>Text</p></body></html>'

This can help normalize documents to produce consistent XPath queries.
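
For heavier clean-up, such as stripping <script> and <style> tags, the Cleaner class is handy. A minimal sketch; note that recent lxml releases ship this functionality in the separate lxml_html_clean package, while older versions provide it as lxml.html.clean:

from lxml import html
from lxml.html.clean import Cleaner  # or: from lxml_html_clean import Cleaner

cleaner = Cleaner(scripts=True, style=True, comments=True)

messy = '<div><script>alert(1)</script><p>Keep this text</p></div>'
print(html.tostring(cleaner.clean_html(html.fromstring(messy))))
# roughly b'<div><p>Keep this text</p></div>'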

LXML Tips

Here are some additional tips when using LXML for HTML scraping:

  • Use // for recursive descent through subelements at any depth.
  • Get element attributes with /@attribute, e.g. //a/@href (see the sketch after this list).
  • xpath() always returns a list; take just the first match with [0].
  • Pretty print XML using lxml.etree.tostring(el, pretty_print=True).
  • Remove unwanted elements like ads using element.remove() or XPath.
  • Watch out for duplicate IDs when scraping multiple pages.
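
As a quick illustration of the attribute and pretty-printing tips:

from lxml import etree, html

link_page = html.fromstring('<p><a href="https://example.com">Example</a></p>')

# /@attribute returns attribute values as plain strings
print(link_page.xpath('//a/@href'))   # ['https://example.com']

# pretty_print makes nested markup easier to read while debugging
print(etree.tostring(link_page, pretty_print=True).decode())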

And refer to the LXML tutorials for more details on using this excellent library!

Conclusion

LXML is a versatile library for parsing, scraping, and processing XML and HTML documents in Python.

Its main features include:

  • Fast HTML/XML parsing with lxml.html
  • XPath and CSS selector support for element selection
  • Handy utilities like cleaning and serialization
  • Extensions for validation and other functionality

With LXML and XPath, it's easy to scrape and extract text or data from web pages. It provides all the tools needed for HTML scraping and processing.

To recap, the key steps covered in this guide:

  1. Install LXML (pip install lxml)
  2. Parse HTML/XML into document tree
  3. Use XPath or CSS selectors to get elements
  4. Extract text with /text() or get attributes
  5. Clean and normalize documents

LXML is a great choice for any project involving screen scraping, web data extraction, or XML processing. Its speed and power will help you scrape text faster and easier than ever!
