Selenium is a popular open-source automated testing framework used for web applications. With Selenium, you can easily scrape dynamic websites and extract data like URLs.
In this comprehensive guide, I'll walk you through the step-by-step process to find all URLs on a page using Python and Selenium.
Overview
Here's a quick overview of what we'll cover:
- Setting up Selenium in Python
- Locating elements with Selenium
- Extracting URLs from <a> tags
- Using CSS selectors to find URLs
- Using XPath to locate URLs
- Removing duplicate URLs
- Handling pagination to get URLs from all pages
Let's get started!
Setting Up Selenium with Python
First, you'll need to install Selenium and a browser driver. I'll be using Chrome in this tutorial, with the webdriver-manager package handling the driver download (pip install selenium webdriver-manager).
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager
driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))
This automatically installs the ChromeDriver binary and creates a Selenium webdriver instance to control Chrome.
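By default this opens a visible Chrome window. If you'd rather run without one, you can pass browser options when creating the driver; a minimal sketch using Chrome's standard --headless switch:
options = webdriver.ChromeOptions()
options.add_argument("--headless")  # run Chrome without opening a window
driver = webdriver.Chrome(
    service=Service(ChromeDriverManager().install()),
    options=options,
)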
Locating Page Elements
Selenium provides a few options to find elements on a page. Note that Selenium 4 removed the old find_element_by_* helpers, so we pass a By locator to find_element or find_elements instead:
By.ID
By.NAME
By.XPATH
By.LINK_TEXT
By.PARTIAL_LINK_TEXT
By.TAG_NAME
By.CLASS_NAME
By.CSS_SELECTOR
For our URL scraping task, By.TAG_NAME and CSS selectors will be the most useful; see the short sketch below.
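Here's a minimal sketch of the modern locator API; the By import is needed for all the examples that follow:
from selenium.webdriver.common.by import By

# find_element returns the first match; find_elements returns a list (possibly empty)
first_link = driver.find_element(By.TAG_NAME, "a")
all_links = driver.find_elements(By.TAG_NAME, "a")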
Let's open up an example page:
driver.get("https://example.com")
Extract URLs from Anchor Tags
To get all URLs, we first need to locate all <a> anchor tags on the page.
We can use find_elements with By.TAG_NAME to find all <a> elements:
links = driver.find_elements(By.TAG_NAME, "a")
This returns a list of WebElement objects representing each <a> tag.
To extract the URLs, we loop through each element and get the href attribute:
for link in links:
    url = link.get_attribute("href")
    print(url)
This prints out every URL contained in an anchor tag on the page.
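Note that get_attribute("href") returns None for anchors that have no href attribute, so it's worth guarding against that; a small sketch:
for link in links:
    url = link.get_attribute("href")
    if url:  # skip anchors with no href (e.g., JavaScript-only links)
        print(url)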
Using CSS Selectors
While getting URLs from anchors works, it's better to use a more targeted locator.
CSS selectors allow us to precisely locate elements with certain attributes.
For example, to find all links containing "category":
links = driver.find_elements(By.CSS_SELECTOR, "a[href*='category']")
The *='category' part matches elements with "category" anywhere in the href attribute.
We can then extract the URLs like before:
for link in links:
    url = link.get_attribute("href")
    print(url)
This only prints URLs with "category", filtering out others.
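CSS attribute selectors support other operators besides *=; for example (the selector values here are just illustrations):
# ^= matches hrefs that start with a prefix
https_links = driver.find_elements(By.CSS_SELECTOR, "a[href^='https://']")
# $= matches hrefs that end with a suffix
pdf_links = driver.find_elements(By.CSS_SELECTOR, "a[href$='.pdf']")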
Using XPath to Find Links
Another option is to use XPath selectors to locate URLs.
The XPath to find links with "category":
links = driver.find_elements(By.XPATH, "//a[contains(@href, 'category')]")
This looks for <a> tags where the href attribute contains "category".
Extract the URLs:
for link in links:
    url = link.get_attribute("href")
    print(url)
XPath can help find elements that CSS selectors can't easily match.
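For example, standard CSS selectors can't match on an element's visible text, but XPath can; a quick sketch (the "Download" link text is a hypothetical example):
# Find links whose visible text contains "Download"
download_links = driver.find_elements(By.XPATH, "//a[contains(text(), 'Download')]")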
Removing Duplicate URLs
When extracting URLs from a page, you'll often get duplicates.
To remove duplicates, we can store the URLs in a set:
url_set = set()
for link in links:
    url = link.get_attribute("href")
    # Add URL to set
    url_set.add(url)
# Print unique URLs
for url in url_set:
    print(url)
Now each URL will only be printed once.
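A set doesn't preserve the order in which the URLs appeared. If order matters, a plain dict works as an ordered set, since Python 3.7+ dicts preserve insertion order; a quick sketch:
seen = {}
for link in links:
    url = link.get_attribute("href")
    if url and url not in seen:
        seen[url] = True
for url in seen:  # iterates in first-seen order
    print(url)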
Scraping Pagination Links
Many sites use pagination on category and search pages.
To extract URLs from all pages, we need to handle the pagination links.
# Get initial links
links = driver.find_elements(By.TAG_NAME, "a")
while True:
    for link in links:
        url = link.get_attribute("href")
        print(url)
    # Check if there's a next page
    next_page = driver.find_elements(By.XPATH, "//a[text()='Next Page »']")
    if not next_page:
        break
    # Click the next page
    next_page[0].click()
    # Update links
    links = driver.find_elements(By.TAG_NAME, "a")
This loops through the pagination, clicking "Next" and re-collecting the links on each page.
We break out of the loop when no next-page link is found.
Now we can extract all URLs across pagination.
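One caveat: clicking "Next" loads a new page, and reusing old WebElement references afterwards can raise a StaleElementReferenceException. An explicit wait makes the loop more robust; a sketch (the 10-second timeout is an arbitrary choice):
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

next_page[0].click()
# Wait up to 10 seconds for anchors to appear on the new page
WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.TAG_NAME, "a"))
)
links = driver.find_elements(By.TAG_NAME, "a")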
Final Thoughts
That covers the main techniques to find and extract all URLs from a web page using Selenium in Python.
The key points are:
- Use find_elements with By.TAG_NAME and anchor tags to get an initial list of URLs
- Leverage CSS selectors or XPath to target specific URLs
- Remove duplicates by storing URLs in a set
- Handle pagination by clicking "Next" buttons and re-locating links (a combined sketch follows below)
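Putting the pieces together, here's a minimal end-to-end sketch; the start URL and the "Next Page »" link text are assumptions you'd adapt to your target site:
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from webdriver_manager.chrome import ChromeDriverManager

driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))
driver.get("https://example.com")  # assumed starting page

all_urls = set()
while True:
    # Collect every href on the current page
    for link in driver.find_elements(By.TAG_NAME, "a"):
        url = link.get_attribute("href")
        if url:
            all_urls.add(url)
    # Stop when there is no next-page link
    next_page = driver.find_elements(By.XPATH, "//a[text()='Next Page »']")
    if not next_page:
        break
    next_page[0].click()

for url in sorted(all_urls):
    print(url)
driver.quit()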
Selenium is a powerful tool for automated web scraping. Mastering element location and URL extraction is key to building scrapers that can interact with pages the way a real user would.
Let me know if you have any other questions! I'm always happy to help fellow web scraping enthusiasts.