How to Find All URLs Using Selenium

Selenium is a popular open-source browser automation framework, originally built for testing web applications. Because it drives a real browser, it can also scrape dynamic, JavaScript-rendered websites and extract data such as URLs.

In this guide, I'll walk you through the step-by-step process of finding all URLs on a page using Python and Selenium.

Overview

Here's a quick overview of what we'll cover:

  • Setting up Selenium in Python
  • Locating elements with Selenium
  • Extracting URLs from <a> tags
  • Using CSS selectors to find URLs
  • Using XPath to locate URLs
  • Removing duplicate URLs
  • Handling pagination to get URLs from all pages

Let's get started!

Setting Up Selenium with Python

First, you'll need to install Selenium and a browser driver. I'll be using Chrome in this tutorial.
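Both packages are available on PyPI, so a quick pip install is enough to get set up:

pip install selenium webdriver-manager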

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from webdriver_manager.chrome import ChromeDriverManager

# Download the matching ChromeDriver (if needed) and start a Chrome session
driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))

This automatically downloads the matching ChromeDriver binary and creates a Selenium WebDriver instance that controls Chrome. (On Selenium 4.6 and newer, the built-in Selenium Manager can also resolve the driver for you, so webdriver-manager is optional.)

Locating Page Elements

Selenium provides the find_element and find_elements methods, each of which takes a By locator strategy (the older find_element_by_* helpers from Selenium 3 have been removed in current releases):

  • By.ID
  • By.NAME
  • By.XPATH
  • By.LINK_TEXT
  • By.PARTIAL_LINK_TEXT
  • By.TAG_NAME
  • By.CLASS_NAME
  • By.CSS_SELECTOR

For our URL scraping task, By.TAG_NAME and By.CSS_SELECTOR will be the most useful.
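As a quick illustration of the call pattern (the "logo" ID below is just a hypothetical example):

# Single element -- raises NoSuchElementException if nothing matches
logo = driver.find_element(By.ID, "logo")

# All matching elements -- returns an empty list if nothing matches
links = driver.find_elements(By.TAG_NAME, "a")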

Let's open up an example page:

driver.get("https://example.com") 

Extract URLs from Anchor Tags

To get all URLs, we first need to locate all <a> anchor tags on the page.

We can use find_elements with By.TAG_NAME to find all <a> elements:

links = driver.find_elements(By.TAG_NAME, "a")

This returns a list of WebElement objects representing each <a> tag.

To extract the URLs, we loop through each element and get the href attribute:

for link in links:
  url = link.get_attribute("href")
  print(url)

This prints out every URL contained in an anchor tag on the page.
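One caveat: get_attribute("href") returns None for anchors that have no href attribute (for example, <a> tags used purely as JavaScript hooks), so it's worth guarding against that. A minimal sketch:

for link in links:
  url = link.get_attribute("href")
  # Skip anchors that have no href attribute
  if url:
    print(url)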

Using CSS Selectors

While getting URLs from anchors works, it's better to use a more targeted locator.

CSS selectors allow us to precisely locate elements with certain attributes.

For example, to find all links containing "category":

links = driver.find_elements(By.CSS_SELECTOR, "a[href*='category']")

The *='category' part matches elements with "category" anywhere in the href attribute.

We can then extract the URLs like before:

for link in links:
  url = link.get_attribute("href")
  print(url)

This only prints URLs with "category", filtering out others.
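CSS attribute selectors also support prefix (^=) and suffix ($=) matching, which can be handy for picking out, say, absolute links or links to PDF files. The selectors below are purely illustrative:

# Links whose href starts with "https"
absolute_links = driver.find_elements(By.CSS_SELECTOR, "a[href^='https']")

# Links whose href ends with ".pdf"
pdf_links = driver.find_elements(By.CSS_SELECTOR, "a[href$='.pdf']")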

Using XPath to Find Links

Another option is to use XPath selectors to locate URLs.

Here's the XPath to find links whose href contains "category":

links = driver.find_elements(By.XPATH, "//a[contains(@href, 'category')]")

This looks for <a> tags where the href contains "category".

Extract the URLs:

for link in links:
  url = link.get_attribute("href")
  print(url) 

XPath can help find elements that CSS selectors can't easily match.
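For example, CSS selectors cannot match on an element's visible text, but XPath can. A small sketch, assuming the site labels some of its links "Read more":

read_more_links = driver.find_elements(By.XPATH, "//a[contains(text(), 'Read more')]")

for link in read_more_links:
  print(link.get_attribute("href"))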

Removing Duplicate URLs

When extracting URLs from a page, you'll often get duplicates.

To remove duplicates, we can store the URLs in a set:

url_set = set()

for link in links:
  url = link.get_attribute("href")

  # Add the URL to the set (duplicates are ignored automatically)
  url_set.add(url)

# Print the unique URLs
for url in url_set:
  print(url)

Now each URL will only be printed once.
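The same deduplication can also be written as a set comprehension (this is just an alternative sketch of the loop above, with the None guard from earlier folded in):

urls = [link.get_attribute("href") for link in links]
url_set = {url for url in urls if url}  # drop anchors without an href

Keep in mind that sets are unordered; wrap the result in sorted() if you need a stable order.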

Scraping Pagination Links

Many sites use pagination on category and search pages.

To extract URLs from all pages, we need to handle the pagination links.

# Get the links from the first page
links = driver.find_elements(By.TAG_NAME, "a")

while True:

  for link in links:
    url = link.get_attribute("href")
    print(url)

  # Check if there's a "Next Page" link
  next_page = driver.find_elements(By.XPATH, "//a[text()='Next Page »']")

  if not next_page:
    break

  # Click through to the next page
  next_page[0].click()

  # Re-locate the links on the new page
  links = driver.find_elements(By.TAG_NAME, "a")

This loops through the pagination: it prints the URLs on the current page, clicks "Next Page", and then re-locates the links on the newly loaded page.

We break out of the loop when no next-page link is found.

Now we can extract all URLs across pagination.
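One refinement worth considering: if the next page loads slowly or renders its links with JavaScript, re-locating the links immediately after the click can come up short. An explicit wait after the click guards against that; the 10-second timeout below is just an assumption:

from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

next_page[0].click()

# Wait up to 10 seconds for at least one <a> tag to appear on the new page
WebDriverWait(driver, 10).until(
  EC.presence_of_element_located((By.TAG_NAME, "a"))
)

links = driver.find_elements(By.TAG_NAME, "a")

And once the scrape is finished, call driver.quit() to close the browser session.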

Final Thoughts

That covers the main techniques to find and extract all URLs from a web page using Selenium in Python.

The key points are:

  • Use find_elements with By.TAG_NAME on anchor tags to get an initial list of URLs
  • Leverage CSS selectors or XPath to target specific URLs
  • Remove duplicates by storing URLs in a set
  • Handle pagination by clicking "Next" buttons and re-locating links

Selenium is a powerful tool for automated web scraping. Mastering element location and extraction is key for building scrapers that can imitate human actions.

Let me know if you have any other questions! I'm always happy to help fellow web scraping enthusiasts.

Written by Python Scraper

As an accomplished Proxies & Web scraping expert with over a decade of experience in data extraction, my expertise lies in leveraging proxies to maximize the efficiency and effectiveness of web scraping projects. My journey in this field began with a fascination for the vast troves of data available online and a passion for unlocking its potential.

Over the years, I've honed my skills in Python, developing sophisticated scraping tools that navigate complex web structures. A critical component of my work involves using various proxy services, including BrightData, Soax, Smartproxy, Proxy-Cheap, and Proxy-seller. These services have been instrumental in my ability to obtain multiple IP addresses, bypass IP restrictions, and overcome geographical limitations, thus enabling me to access and extract data seamlessly from diverse sources.

My approach to web scraping is not just technical; it's also strategic. I understand that every scraping task has unique challenges, and I tailor my methods accordingly, ensuring compliance with legal and ethical standards. By staying up-to-date with the latest developments in proxy technologies and web scraping methodologies, I continue to provide top-tier services in data extraction, helping clients transform raw data into actionable insights.