LXML is a powerful Python library for parsing and scraping HTML and XML documents. With LXML, you can easily extract text and data from web pages. In this comprehensive guide, you'll learn how to use LXML's etree and XPath to get text from HTML documents.
Overview of LXML
LXML is a Pythonic binding for the C libraries libxml2 and libxslt, exposed mainly through its lxml.etree module. The key features of LXML include:
- Fast XML and HTML parsing and XPath evaluations
- Easy creation of XML documents
- Powerful XML and HTML tree traversal
- CSS selectors via cssselect
- Built-in support for XSLT
Some key advantages of using LXML over Python's built-in XML parsing libraries:
- Speed – LXML is very fast at parsing XML and HTML because it's built on the C libraries libxml2 and libxslt. It's faster than BeautifulSoup and built-in libraries like xml.etree.ElementTree.
- XPath support – LXML provides full XPath 1.0 support, making element selection powerful and concise.
- CSS selectors – You can select elements with CSS selectors via the cssselect module, which is handy if you already know CSS.
- HTML parsing – LXML can parse broken HTML pages and fix errors on the fly, making it more robust than strict XML parsers when dealing with web data.
- Extensions – LXML provides a wide range of extras, like HTML cleaning, serialization, and XML Schema validation.
Overall, LXML combines speed, rich features, and simplicity. It's a great choice for web scraping and processing XML/HTML documents in Python.
Install LXML
LXML supports all modern Python 3 versions (legacy releases also ran on Python 2.7). It can be installed via pip:
pip install lxml
On some Linux distros, you may need to install the libxml2 and libxslt development packages first.
Now let's see LXML in action for scraping text from HTML.
Parse HTML with LXML
The first step is to parse the HTML content into an LXML document that we can query.
LXML provides the lxml.html.fromstring() function to parse HTML:
from lxml import html
html_doc = """
<html>
<head>
<title>My Page</title>
</head>
<body>
<h1>Hello World</h1>
<p>This is a page</p>
</body>
</html>
"""
page = html.fromstring(html_doc)
This parses the HTML into an Element tree. We can also parse from a file (note that html.parse() returns an ElementTree rather than an Element; call .getroot() if you need the root element):
page = html.parse('page.html')
Or from a URL:
import requests
page = html.fromstring(requests.get('http://example.com').text)
No matter the input source, LXML parses it into an element tree that we can start querying.
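If you need more control over parsing, you can pass a custom parser; for example, lxml's HTMLParser supports options like comment removal (a minimal sketch):
from lxml import etree, html

parser = etree.HTMLParser(remove_comments=True)  # strip HTML comments while parsing
page = html.fromstring('<html><body><!-- ad --><p>Hi</p></body></html>', parser=parser)
print(html.tostring(page))  # the comment is gone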
Use XPath to Get Text
Now that we have a parsed LXML document, we can use XPath expressions to extract data.
XPath allows selecting elements by referencing their position in the document tree. Some examples:
- /html/body/h1 – Select <h1> under <body>
- //p – Select all <p> tags anywhere in the document
- /html/head/title/text() – Get the text inside <title>
Let's grab the <h1> text from the parsed page:
h1 = page.xpath('/html/body/h1/text()')[0]
print(h1)
# Prints "Hello World"
The xpath() method runs the XPath expression against the element tree and returns a list of matches. For a single text value, we take the first result with [0].
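One thing to watch: xpath() returns an empty list when nothing matches, so indexing with [0] raises an IndexError. A small defensive pattern:
results = page.xpath('/html/body/h1/text()')
h1 = results[0] if results else None  # avoid IndexError when there is no match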
Some more examples of getting text:
title = page.xpath('/html/head/title/text()')[0]
paras = page.xpath('//p/text()')
link_text = page.xpath('//*[@id="link"]/text()')
The XPath expressions select just the text nodes we want, making text extraction super easy!
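Note that //p/text() only returns the direct text nodes of each <p>; text inside nested tags like <b> is skipped. The text_content() method gathers all descendant text instead:
para = html.fromstring('<p>Hello <b>world</b>!</p>')
print(para.xpath('text()'))   # ['Hello ', '!'] – direct text nodes only
print(para.text_content())    # 'Hello world!' – includes nested tags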
Get Text from a Web Page
Let's see how to scrape text from a real web page. We'll extract the headlines from the New York Times homepage:
import requests
from lxml import html
page = requests.get('https://www.nytimes.com/')
tree = html.fromstring(page.content)

# Get headlines
headlines = tree.xpath('//h2/text()')

print('Headlines:')
for h in headlines:
    print(h)
This prints out all the <h2> headlines from the NYTimes homepage. The key steps are:
- Fetch the page HTML using requests
- Parse the HTML into an LXML tree
- Use XPath to select the //h2/text() text nodes
- Print the text of each match
And we have successfully extracted text content from a live web page with LXML!
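In practice, some sites block requests that lack a browser-like User-Agent header; if you get back an empty list, try sending one (the header string below is just an illustrative value):
headers = {'User-Agent': 'Mozilla/5.0 (compatible; my-scraper/1.0)'}  # example value
page = requests.get('https://www.nytimes.com/', headers=headers)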
Namespaces in XPath
When parsing XML documents, you may encounter namespaced tag names like <h:h1> and <content:p>.
Namespaces help avoid element naming conflicts but require a tweak to the XPath:
<html xmlns:content="http://example.com">
<content:body>
<content:p>Hello world!</content:p>
</content:body>
</html>
To match namespaced elements without binding a prefix, test the local name in the XPath:
para = tree.xpath('//*[local-name()="p"]/text()')
This will match <content:p> elements in any namespace.
You can also bind a namespace to a prefix like c:
nsmap = {'c': 'http://example.com'}
para = tree.xpath('//c:p/text()', namespaces=nsmap)
Binding namespaces to prefixes helps write more readable XPath selectors.
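Here's a minimal, self-contained sketch of prefix binding using the sample document from this section:
from lxml import etree

doc = '''<html xmlns:content="http://example.com">
<content:body>
<content:p>Hello world!</content:p>
</content:body>
</html>'''

tree = etree.fromstring(doc)
nsmap = {'c': 'http://example.com'}
print(tree.xpath('//c:p/text()', namespaces=nsmap))
# ['Hello world!']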
Use CSS Selectors
LXML also supports CSS selectors for element selection via the cssselect module.
For example:
from lxml.cssselect import CSSSelector
sel = CSSSelector('h1')
headers = sel(tree)
This finds all <h1> elements in the parsed tree.
Some more examples of CSS selectors:
elts = CSSSelector('#content p.text')
links = CSSSelector('a[href]')
CSS selectors provide an alternative way to select elements, especially useful for those with CSS experience.
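Parsed lxml.html elements also expose a cssselect() convenience method directly, so you can skip creating a selector object (this requires the cssselect package to be installed):
headers = tree.cssselect('h1')  # equivalent to CSSSelector('h1')(tree)
for h in headers:
    print(h.text_content())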
Clean Up HTML
Real-world HTML is often malformed and dirty. LXML has tools to clean up and normalize documents:
- html.tostring – Serialize a document back to a string
- lxml.html.clean – Clean up markup (strip scripts, styles, and other unwanted content)
- html.document_fromstring – Parse any snippet into a full document with <html> and <body>
For example:
dirty = '<p>Text</b>'
clean = html.document_fromstring(dirty)
print(html.tostring(clean))
# b'<html><body><p>Text</p></body></html>'
This can help normalize documents to produce consistent XPath queries.
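For heavier cleanup, the Cleaner class strips scripts, styles, and other unwanted markup (note that in recent lxml versions this functionality lives in the separate lxml_html_clean package):
from lxml.html.clean import Cleaner

cleaner = Cleaner(scripts=True, style=True)  # remove <script> tags and style markup
dirty = '<html><body><script>alert(1)</script><p>Text</p></body></html>'
print(cleaner.clean_html(dirty))
# roughly: '<html><body><p>Text</p></body></html>'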
LXML Tips
Here are some additional tips when using LXML for HTML scraping:
- Use // for recursive descent through subelements at any depth.
- Get element attributes with /@attribute (see the sketch after this list).
- Return only the first match with [0] instead of a list.
- Pretty-print XML using lxml.etree.tostring(el, pretty_print=True).
- Remove unwanted elements like ads with parent.remove(child), lxml.html's drop_tree(), or by filtering with XPath.
- Watch out for duplicate IDs when scraping multiple pages.
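A small sketch of the attribute and removal tips:
from lxml import html

tree = html.fromstring('<div><a id="link" href="/about">About</a><div class="ad">Buy!</div></div>')

# Get an attribute value with /@attribute
print(tree.xpath('//a/@href'))  # ['/about']

# Drop unwanted elements (drop_tree() is specific to lxml.html elements)
for ad in tree.xpath('//div[@class="ad"]'):
    ad.drop_tree()
print(html.tostring(tree))  # the ad <div> is gone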
And refer to the LXML tutorials for more details on using this excellent library!
Conclusion
LXML is a versatile library for parsing, scraping, and processing XML and HTML documents in Python.
Its main features include:
- Fast HTML/XML parsing with lxml.html and lxml.etree
- XPath and CSS selector support for element selection
- Handy utilities like cleaning and serialization
- Extensions for validation and other functionality
With LXML and XPath, it's easy to scrape and extract text or data from web pages. It provides all the tools needed for HTML scraping and processing.
To recap, the key steps covered in this guide:
- Install LXML (pip install lxml)
- Parse HTML/XML into a document tree
- Use XPath or CSS selectors to get elements
- Extract text with /text() or get attributes with /@attribute
- Clean and normalize documents
LXML is a great choice for any project involving screen scraping, web data extraction, or XML processing. Its speed and power will help you scrape text faster and easier than ever!