Selenium is a popular open-source automated testing framework used for web applications. With Selenium, you can easily scrape dynamic websites and extract data like URLs.
In this comprehensive guide, I'll walk you through the step-by-step process to find all URLs on a page using Python and Selenium.
Overview
Here's a quick overview of what we'll cover:
- Setting up Selenium in Python
- Locating elements with Selenium
- Extracting URLs from <a> tags
- Using CSS selectors to find URLs
- Using XPath to locate URLs
- Removing duplicate URLs
- Handling pagination to get URLs from all pages
Let's get started!
Setting Up Selenium with Python
First, you'll need to install Selenium and a browser driver. I'll be using Chrome in this tutorial, with the webdriver-manager package handling the driver download (pip install selenium webdriver-manager).
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager
driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))
This automatically installs the ChromeDriver binary and creates a Selenium webdriver instance to control Chrome.
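By default this opens a visible Chrome window. If you'd rather run without one, you can pass browser options when creating the driver; a minimal sketch using Chrome's standard --headless switch:
options = webdriver.ChromeOptions()
options.add_argument("--headless")  # run Chrome without opening a window
driver = webdriver.Chrome(
    service=Service(ChromeDriverManager().install()),
    options=options,
)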
Locating Page Elements
Selenium provides a few options to find elements on a page. Note that Selenium 4 removed the old find_element_by_* helpers, so we pass a By locator to find_element or find_elements instead:
By.ID
By.NAME
By.XPATH
By.LINK_TEXT
By.PARTIAL_LINK_TEXT
By.TAG_NAME
By.CLASS_NAME
By.CSS_SELECTOR
For our URL scraping task, By.TAG_NAME and CSS selectors will be the most useful; see the short sketch below.
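Here's a minimal sketch of the modern locator API; the By import is needed for all the examples that follow:
from selenium.webdriver.common.by import By

# find_element returns the first match; find_elements returns a list (possibly empty)
first_link = driver.find_element(By.TAG_NAME, "a")
all_links = driver.find_elements(By.TAG_NAME, "a")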
Let's open up an example page:
driver.get("https://example.com")
Extract URLs from Anchor Tags
To get all URLs, we first need to locate all <a> anchor tags on the page.
We can use find_elements with By.TAG_NAME to find all <a> elements:
links = driver.find_elements(By.TAG_NAME, "a")
This returns a list of WebElement objects representing each <a> tag.
To extract the URLs, we loop through each element and get the href attribute:
for link in links:
    url = link.get_attribute("href")
    print(url)
This prints out every URL contained in an anchor tag on the page.
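Note that get_attribute("href") returns None for anchors that have no href attribute, so it's worth guarding against that; a small sketch:
for link in links:
    url = link.get_attribute("href")
    if url:  # skip anchors with no href (e.g., JavaScript-only links)
        print(url)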
Using CSS Selectors
While getting URLs from anchors works, it's better to use a more targeted locator.
CSS selectors allow us to precisely locate elements with certain attributes.
For example, to find all links containing "category":
links = driver.find_elements(By.CSS_SELECTOR, "a[href*='category']")
The *='category' part matches elements with "category" anywhere in the href attribute.
We can then extract the URLs like before:
for link in links:
    url = link.get_attribute("href")
    print(url)
This only prints URLs with "category", filtering out others.
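CSS attribute selectors support other operators besides *=; for example (the selector values here are just illustrations):
# ^= matches hrefs that start with a prefix
https_links = driver.find_elements(By.CSS_SELECTOR, "a[href^='https://']")
# $= matches hrefs that end with a suffix
pdf_links = driver.find_elements(By.CSS_SELECTOR, "a[href$='.pdf']")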
Using XPath to Find Links
Another option is to use XPath selectors to locate URLs.
The XPath to find links with "category":
links = driver.find_elements(By.XPATH, "//a[contains(@href, 'category')]")
This looks for <a> tags where the href attribute contains "category".
Extract the URLs:
for link in links:
    url = link.get_attribute("href")
    print(url)
XPath can help find elements that CSS selectors can't easily match.
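For example, standard CSS selectors can't match on an element's visible text, but XPath can; a quick sketch (the "Download" link text is a hypothetical example):
# Find links whose visible text contains "Download"
download_links = driver.find_elements(By.XPATH, "//a[contains(text(), 'Download')]")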
Removing Duplicate URLs
When extracting URLs from a page, you'll often get duplicates.
To remove duplicates, we can store the URLs in a set:
url_set = set()
for link in links:
    url = link.get_attribute("href")
    # Add URL to set
    url_set.add(url)
# Print unique URLs
for url in url_set:
    print(url)
Now each URL will only be printed once.
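A set doesn't preserve the order in which the URLs appeared. If order matters, a plain dict works as an ordered set, since Python 3.7+ dicts preserve insertion order; a quick sketch:
seen = {}
for link in links:
    url = link.get_attribute("href")
    if url and url not in seen:
        seen[url] = True
for url in seen:  # iterates in first-seen order
    print(url)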
Scraping Pagination Links
Many sites use pagination on category and search pages.
To extract URLs from all pages, we need to handle the pagination links.
# Get initial links
links = driver.find_elements(By.TAG_NAME, "a")
while True:
    for link in links:
        url = link.get_attribute("href")
        print(url)
    # Check if there's a next page
    next_page = driver.find_elements(By.XPATH, "//a[text()='Next Page »']")
    if not next_page:
        break
    # Click the next page
    next_page[0].click()
    # Update links
    links = driver.find_elements(By.TAG_NAME, "a")
This loops through the pagination, clicking "Next" and re-collecting the links on each page.
We break out of the loop when no next-page link is found.
Now we can extract all URLs across pagination.
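One caveat: clicking "Next" loads a new page, and reusing old WebElement references afterwards can raise a StaleElementReferenceException. An explicit wait makes the loop more robust; a sketch (the 10-second timeout is an arbitrary choice):
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

next_page[0].click()
# Wait up to 10 seconds for anchors to appear on the new page
WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.TAG_NAME, "a"))
)
links = driver.find_elements(By.TAG_NAME, "a")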
Final Thoughts
That covers the main techniques to find and extract all URLs from a web page using Selenium in Python.
The key points are:
- Use find_elements with By.TAG_NAME and anchor tags to get an initial list of URLs
- Leverage CSS selectors or XPath to target specific URLs
- Remove duplicates by storing URLs in a set
- Handle pagination by clicking "Next" buttons and re-locating links (a combined sketch follows below)
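Putting the pieces together, here's a minimal end-to-end sketch; the start URL and the "Next Page »" link text are assumptions you'd adapt to your target site:
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from webdriver_manager.chrome import ChromeDriverManager

driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))
driver.get("https://example.com")  # assumed starting page

all_urls = set()
while True:
    # Collect every href on the current page
    for link in driver.find_elements(By.TAG_NAME, "a"):
        url = link.get_attribute("href")
        if url:
            all_urls.add(url)
    # Stop when there is no next-page link
    next_page = driver.find_elements(By.XPATH, "//a[text()='Next Page »']")
    if not next_page:
        break
    next_page[0].click()

for url in sorted(all_urls):
    print(url)
driver.quit()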
Selenium is a powerful tool for automated web scraping. Mastering element location and URL extraction is key to building scrapers that can interact with pages the way a real user would.
Let me know if you have any other questions! I'm always happy to help fellow web scraping enthusiasts.