If you've done any amount of web scraping, you've likely encountered CAPTCHAs – those pesky boxes asking you to identify street signs or click images with cars. CAPTCHAs can be frustrating roadblocks that grind your web scraper to a halt.
In this comprehensive guide, we'll dig deep into techniques for avoiding and bypassing CAPTCHAs to keep your web scraper running efficiently. By the end, you'll have an arsenal of tactics to power through CAPTCHAs and scrape to your heart's content!
What Exactly Are CAPTCHAs?
First, a quick primer on what CAPTCHAs are and why sites use them.
CAPTCHA stands for "Completely Automated Public Turing test to tell Computers and Humans Apart". The goal is to tell if a user is a real human or an automated bot.
Websites use CAPTCHAs to prevent abusive bots from accessing their content. Some examples of activities CAPTCHAs aim to block:
- Scraping or crawling a site – extracting large amounts of data
- Brute force attacks – trying to crack passwords
- Spam posting – repeatedly posting spam content
- Fake account creation – mass creating accounts for misuse
- DDoS attacks – overloading servers by flooding them with traffic
In short, CAPTCHAs act as gatekeepers that bots must get past in order to access a site's content or services.
A Brief History of CAPTCHAs
The first CAPTCHA-style tests appeared in 1997, when the search engine AltaVista used distorted text to stop bots from submitting URLs automatically. The term "CAPTCHA" itself was coined in 2003 by researchers at Carnegie Mellon University.
Over the years, CAPTCHAs have evolved:
- 1997 – First text CAPTCHAs using distorted letters/numbers
- Early 2000s – Gimpy CAPTCHAs add backgrounds to make text hard to segment
- 2003 – The term "CAPTCHA" coined by Carnegie Mellon researchers
- 2007 – reCAPTCHA launched by Carnegie Mellon researchers
- 2009 – Google acquires reCAPTCHA
- 2014 – Google introduces No CAPTCHA reCAPTCHA with automatic detection
- 2018 – hCaptcha launched as reCAPTCHA competitor
- 2020 – Cloudflare switches to hCaptcha, fuels hCaptcha's rise
Today, CAPTCHAs are used by millions of sites to filter out unwanted bots, and humans solve over 100 million CAPTCHAs per day!
How Do CAPTCHAs Actually Work?
When a website detects high volumes of traffic or other suspicious signals, how does it decide whether to serve a CAPTCHA?
There are a few common techniques:
Simple Triggers
Sites look for simple signals like:
- High traffic volume – Unusually large amounts of traffic from a single IP
- Request rate – Too many requests per second
- Suspicious IP – Originating from a data center or Tor exit node
If traffic exceeds thresholds, the site may trigger a CAPTCHA to confirm the visitor is human.
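To make the threshold idea concrete, here is a minimal sketch of a per-IP sliding-window counter. The class name `RateTrigger`, the limits, and the example IP are all hypothetical; real sites combine many more signals and rarely publish their thresholds.

```python
from collections import defaultdict, deque


class RateTrigger:
    """Flags an IP once it exceeds max_requests within window_seconds."""

    def __init__(self, max_requests=10, window_seconds=1.0):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self._hits = defaultdict(deque)  # ip -> timestamps of recent requests

    def should_challenge(self, ip, now):
        hits = self._hits[ip]
        hits.append(now)
        # Drop timestamps that have fallen out of the sliding window.
        while hits and now - hits[0] > self.window_seconds:
            hits.popleft()
        return len(hits) > self.max_requests


# Hypothetical limits: more than 3 requests per second triggers a challenge.
trigger = RateTrigger(max_requests=3, window_seconds=1.0)
timestamps = (0.0, 0.1, 0.2, 0.3, 2.0)
results = [trigger.should_challenge("203.0.113.7", t) for t in timestamps]
print(results)  # [False, False, False, True, False]
```

The fourth request inside one second trips the trigger, while the request at 2.0 seconds passes because the earlier hits have aged out of the window. This is why spacing requests out matters more than the total number sent.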
Passive Fingerprinting
Sites passively collect information about a visitor‘s:
- OS, browser, and hardware via user agent
- Screen size from viewport
- Installed fonts
- Browser/OS quirks from navigator properties
- IP address from WebRTC leaks
This creates a fingerprint of the visitor's device. If it doesn't look like a real device, the site may require a CAPTCHA to continue.
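As a rough illustration of how collected attributes might be turned into a fingerprint and sanity-checked, consider the sketch below. The attribute names and the three consistency rules are assumptions made for this example, not the checks any particular vendor is known to run.

```python
import hashlib
import json


def fingerprint_id(attrs):
    """Hash the collected attributes into a short, stable identifier."""
    canonical = json.dumps(attrs, sort_keys=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:16]


def looks_inconsistent(attrs):
    """Return True if the attributes contradict each other (illustrative rules only)."""
    user_agent = attrs.get("user_agent", "")
    platform = attrs.get("platform", "")
    # A Windows user agent should come with a Windows navigator.platform value.
    if "Windows" in user_agent and not platform.startswith("Win"):
        return True
    # Real displays have a positive size; a zero size hints at a bare automation setup.
    if attrs.get("screen_width", 0) <= 0 or attrs.get("screen_height", 0) <= 0:
        return True
    # An empty font list is unusual for a desktop browser.
    if not attrs.get("fonts"):
        return True
    return False


visitor = {
    "user_agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
    "platform": "Linux x86_64",  # contradicts the user agent
    "screen_width": 1920,
    "screen_height": 1080,
    "fonts": ["Arial", "Verdana"],
}
print(fingerprint_id(visitor), looks_inconsistent(visitor))
```

The takeaway for scrapers is that spoofing one attribute in isolation, such as the user agent, can make a fingerprint less believable rather than more.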
Active Fingerprinting
For stronger detection, sites actively probe and analyze a device by:
- Checking the browser runtime environment (for example, automation flags such as navigator.webdriver)
- Fingerprinting canvas element pixels
- Testing WebGL capabilities
- Checking for ad blockers
- Benchmarking device performance
By creating a detailed fingerprint, sites can determine if a visitor is likely a bot and serve a CAPTCHA.
Types of CAPTCHA Challenges
Once triggered, what types of challenges might you encounter? Here are a few common CAPTCHA types:
Text CAPTCHAs
The classic CAPTCHA – recognize distorted text in an image:
Pros:
- Simple implementation
- Used by many sites for decades
Cons:
- Susceptible to OCR
- Frustrating user experience
Used by sites like:
- Online forums
- Ticketmaster
- Craigslist
- Wikipedia
- Amazon
Image CAPTCHAs
Select images that match a description, common in reCAPTCHA:
Pros:
- Easy for humans, hard for bots
- Flexible image selection
Cons:
- Annoying user experience
- Need large corpus of images
Used by:
- reCAPTCHA
- hCaptcha
- FunCAPTCHA
Audio CAPTCHAs
Enter text you hear in an audio clip:
Pros:
- Provides accessibility
Cons:
- Annoying distorted audio
- Speech recognition improving
Often used as a fallback for accessibility rather than as the default option.
Puzzle CAPTCHAs
Solve puzzles like simple math problems:
Pros:
- Fun and engaging for users
- Hard to automate
Cons:
- Limited puzzle types
- Accessibility challenges
Used by:
- KittenAuth
- FunCAPTCHA
Invisible CAPTCHAs
Run silently in the background, looking for suspicious signals.
Pros:
- No user interaction needed
- Good user experience
Cons:
- High false positive rate
- Privacy concerns
Used by:
- reCAPTCHA v3
- hCaptcha (Sitekey option)
Social Media Sign Ins
Require login with Facebook, Google, etc. to proceed:
Pros:
- Leverage existing accounts
- Very effective
Cons:
- Requires users to have accounts
- Privacy concerns
Used by:
- Disqus
- Medium
- Nextdoor
This covers the most common CAPTCHA types you'll encounter when scraping. But there are always innovative new challenges being developed to thwart bots!
Popular CAPTCHA Systems
Now let's look at some of the top CAPTCHA systems used across the web:
reCAPTCHA v2
Google's reCAPTCHA is the most widely used system. reCAPTCHA v2 displays the "I'm not a robot" checkbox.
It attempts to detect bots by analyzing:
- How the box was clicked
- Mouse movements
- Browser/hardware fingerprint
If it fails, reCAPTCHA presents an image, text or audio challenge.
reCAPTCHA v2 is often criticized for adding friction and for false positives, which users report more often on non-Chrome browsers.
Adoption:
- Over 300,000 participating sites
- Over 5 million CAPTCHAs solved daily!
reCAPTCHA v3
reCAPTCHA v3, released by Google in 2018, attempts to be completely invisible to users.
It runs in the background analyzing user actions and assigns each request a score between 0.0 (likely a bot) and 1.0 (likely human). Sites take action based on the score.
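A site-side policy built on that score might look like the sketch below. reCAPTCHA v3 scores run from 0.0 to 1.0, with higher values indicating a more likely human; the function name and the two cut-offs are assumptions for illustration, since each site picks its own thresholds.

```python
def action_for_score(score, allow_at=0.7, challenge_at=0.3):
    """Map a 0.0-1.0 score to a site action (higher score = more likely human)."""
    if not 0.0 <= score <= 1.0:
        raise ValueError("score must be between 0.0 and 1.0")
    if score >= allow_at:
        return "allow"
    if score >= challenge_at:
        return "challenge"  # e.g. fall back to a visible CAPTCHA
    return "block"


for score in (0.9, 0.5, 0.1):
    print(score, action_for_score(score))
```

Because the decision is a threshold on a continuous score, a scraper is not simply "detected" or "undetected": small behavioral improvements can move a session from the block band into the challenge or allow band.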
Adoption:
- Over 15% of top 10,000 sites
- 40+ million monthly active users
However, v3 has received criticism for being overzealous and using excessive fingerprinting and tracking.
hCaptcha
hCaptcha is an emerging alternative to reCAPTCHA focused more on user privacy.
The flow is similar – users tick an "I am human" checkbox and hCaptcha analyzes the behavior.
hCaptcha has gained popularity in part by paying sites to use it. In 2020, Cloudflare began using hCaptcha by default.
Adoption:
- Used by over 26% of the top 1000 sites
- More than 5 billion CAPTCHAs solved yearly
The popularity of privacy-focused hCaptcha reflects rising concerns over Google's dominance and data practices.
Amazon CAPTCHA
Retail giant Amazon uses its own custom text-based CAPTCHA system.
It's known for being unpredictable – CAPTCHAs don't always correlate to blocks and can happen erratically.
The proliferation of different CAPTCHA systems makes bypassing them a moving target. What works for one system may not help with another. We'll need an array of techniques.
Bypassing CAPTCHAs
Alright, let's move on to the good stuff – how do you actually bypass CAPTCHAs?
There are two main approaches:
- Solve the challenges – Pass CAPTCHAs when presented
- Avoid triggers – Prevent CAPTCHAs from appearing
Let's explore techniques for each approach:
Using CAPTCHA Solving Services
The most straightforward way to bypass CAPTCHAs is using a CAPTCHA solving service.
Services like 2Captcha, AntiCaptcha, and CapMonster solve CAPTCHAs via APIs, typically using large networks of human solvers, sometimes supplemented by automated recognition.
How it works:
- When your bot encounters a CAPTCHA, send the challenge details to the API
- Human solvers in the network manually solve the challenge
- The service returns the solution to your bot to input
This allows practically any type of CAPTCHA to be solved without needing to figure out a programmatic solution.
Pros:
- Simple API integration
- Solves any CAPTCHA type thrown at it
- Reasonably affordable at scale
Cons:
- Additional setup and servers required
- Costs add up at high volumes
- Relies on 3rd party availability
CAPTCHA solving services are fantastic solutions for small to medium scraping volumes. But costs and limits make them impractical for large scale scraping.
Training Machine Learning Models
For large scale CAPTCHA solving, a more scalable solution is training AI and machine learning models.
This approach is viable for text and some image CAPTCHAs. It involves:
Text CAPTCHAs
- Extracting individual text characters from CAPTCHA images
- Cleaning and preprocessing images
- Training convolutional neural network on CAPTCHA text datasets
- Using CNN to recognize new text CAPTCHAs
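The first two steps, cleaning the image and extracting characters, can be sketched without any ML library. The example below assumes the CAPTCHA has already been loaded as a grid of grayscale values (0 = black, 255 = white) and that characters are separated by blank columns. Overlapping or connected characters, which many CAPTCHAs use on purpose, would need a more robust approach.

```python
def binarize(gray, threshold=128):
    """Convert grayscale rows to 1 (ink) / 0 (background)."""
    return [[1 if px < threshold else 0 for px in row] for row in gray]


def segment_columns(binary):
    """Return (start, end) column ranges, end exclusive, that contain ink."""
    width = len(binary[0]) if binary else 0
    ink_per_col = [sum(row[x] for row in binary) for x in range(width)]
    spans, start = [], None
    for x, ink in enumerate(ink_per_col):
        if ink and start is None:
            start = x                 # a character begins
        elif not ink and start is not None:
            spans.append((start, x))  # a blank column ends the character
            start = None
    if start is not None:
        spans.append((start, width))
    return spans


# Tiny 2x7 "image" with two dark blobs separated by white columns.
gray = [
    [255, 0, 0, 255, 255, 0, 255],
    [255, 0, 255, 255, 255, 0, 255],
]
print(segment_columns(binarize(gray)))  # [(1, 3), (5, 6)]
```

Each returned span can then be cropped and fed to the classifier as a single-character image.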
Image CAPTCHAs
- Collecting categorized CAPTCHA images
- Training computer vision CNN on image datasets
- Using model to automatically solve image CAPTCHAs
With sufficient training data, these models can reliably solve text and image CAPTCHAs without human involvement.
Pros:
- No manual work required
- Scales to solve millions of CAPTCHAs
- Constant availability
Cons:
- Significant ML expertise needed
- Models degrade if CAPTCHAs change
- Training data requirements
For large organizations, developing custom ML CAPTCHA solvers is preferable to paying for services. But it's not feasible for smaller operations.
Abusing Audio CAPTCHAs
Many CAPTCHAs provide audio alternatives for accessibility purposes.
We can take advantage of these audio CAPTCHAs to bypass visual challenges:
Process:
- Detect whether an audio CAPTCHA option is available
- Trigger it and download the audio file
- Extract the spoken text using a speech-to-text API
- Submit the recognized text to pass the audio challenge
Speech-to-text services like Google Cloud Speech or AWS Transcribe will handle the speech recognition.
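Raw transcripts rarely match the expected answer exactly, so a cleanup step usually sits between recognition and submission. The sketch below assumes the challenge is a sequence of spoken digits, which is only one of several audio formats in use. The function name and word table are hypothetical, and the speech-to-text call itself is left out.

```python
import re

# Spoken forms a transcript might contain for each digit (assumed, English only).
_DIGIT_WORDS = {
    "zero": "0", "oh": "0", "one": "1", "two": "2", "three": "3",
    "four": "4", "five": "5", "six": "6", "seven": "7",
    "eight": "8", "nine": "9",
}


def normalize_transcript(text):
    """Reduce a speech-to-text transcript to a plain digit string."""
    tokens = re.findall(r"[a-z]+|\d", text.lower())
    digits = []
    for token in tokens:
        if token.isdigit():
            digits.append(token)
        elif token in _DIGIT_WORDS:
            digits.append(_DIGIT_WORDS[token])
        # Anything else (filler words, noise) is ignored.
    return "".join(digits)


print(normalize_transcript("Seven, three 9 oh!"))  # 7390
```

If the normalized answer comes out shorter than the challenge length you expect, it is usually safer to request a fresh audio clip than to submit a guess.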
Pros:
- Bypass visual CAPTCHAs
- Leverage cheap/free speech APIs
Cons:
- Audio not always available
- Speech quality is often poor
Audio CAPTCHAs can be a quick win if visual analysis is proving difficult.
Avoiding CAPTCHA Triggers
The best way to bypass CAPTCHAs is avoiding triggering them in the first place. This involves mimicking real user behavior as closely as possible:
Use residential proxies
Proxy IPs from residential ISPs appear more trustworthy than datacenter IPs.
With residential proxies, sites see requests coming from hundreds or thousands of real users rather than from one datacenter.
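A simple way to spread requests across a proxy pool is round-robin rotation, sketched below. The proxy URLs are placeholders, and the yielded dictionaries follow the `proxies` mapping format accepted by the `requests` library. How your provider exposes its endpoints is an assumption you would need to check.

```python
import itertools


def proxy_rotation(proxy_urls):
    """Yield proxy mappings in round-robin order, forever."""
    if not proxy_urls:
        raise ValueError("at least one proxy URL is required")
    for url in itertools.cycle(proxy_urls):
        # Same proxy for both schemes, in the shape requests expects.
        yield {"http": url, "https": url}


# Placeholder endpoints; substitute the ones from your proxy provider.
pool = proxy_rotation([
    "http://user:pass@proxy-a.example.com:8000",
    "http://user:pass@proxy-b.example.com:8000",
])
first, second, third = next(pool), next(pool), next(pool)
print(first["https"], second["https"], third["https"])
```

Each mapping can be passed as `requests.get(url, proxies=next(pool))`. Many providers also offer a single rotating gateway, in which case client-side rotation is unnecessary.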
Throttle requests
Gradually ramp up requests and add random delays between them. Don't barrage sites continuously.
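One way to combine ramp-up with randomness is to start with long pauses and shrink them toward a base delay, adding jitter each time. The parameter values below are arbitrary assumptions and should be tuned per site.

```python
import random


def next_delay(request_index, base=1.0, ramp_requests=20,
               start_factor=5.0, jitter=0.5, rng=random):
    """Seconds to wait before request number request_index (0-based).

    Early requests wait about start_factor * base; the wait shrinks linearly
    to about base over ramp_requests requests, with +/- jitter randomness.
    """
    progress = min(request_index / ramp_requests, 1.0)
    factor = start_factor - (start_factor - 1.0) * progress
    return base * factor * rng.uniform(1.0 - jitter, 1.0 + jitter)


rng = random.Random(42)  # seeded only to make the demo repeatable
for i in (0, 10, 20, 50):
    print(i, round(next_delay(i, rng=rng), 2))
```

In a scraper you would call `time.sleep(next_delay(i))` before each request. The jitter matters as much as the average delay, because perfectly regular intervals are themselves a bot signal.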
Rotate user agents
Spoof a variety of user agents from real browser/device combos. Don't use a single constant user agent.
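A minimal sketch of user agent rotation follows. The strings are illustrative examples of common browser formats and will go stale, so a real pool should be refreshed regularly. The helper name `build_headers` is hypothetical.

```python
import random

# Illustrative user agent strings; keep a larger, up-to-date pool in practice.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 "
    "(KHTML, like Gecko) Version/17.1 Safari/605.1.15",
    "Mozilla/5.0 (X11; Linux x86_64; rv:121.0) Gecko/20100101 Firefox/121.0",
]


def build_headers(rng=random):
    """Pick a user agent at random and return request headers using it."""
    return {
        "User-Agent": rng.choice(USER_AGENTS),
        "Accept-Language": "en-US,en;q=0.9",
    }


print(build_headers()["User-Agent"])
```

Keep the user agent fixed for the lifetime of a session (same cookies, same proxy). Switching it mid-session looks less like a real visitor, not more.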
Use headless browsers
Headless browsers, driven by tools like Puppeteer, execute JavaScript and so impersonate real browsers more convincingly than plain HTTP clients.
Add human-like actions
Click page elements, scroll, and move the mouse in organic patterns.
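For mouse movement, one common trick is to follow a curved path instead of a straight line. The sketch below generates points along a cubic Bezier curve with randomized control points. The function name and the `wobble` parameter are assumptions, and feeding the points to a browser automation tool is left out.

```python
import random


def mouse_path(start, end, steps=25, wobble=80.0, rng=random):
    """Return steps + 1 points along a randomly curved path from start to end."""
    (x0, y0), (x3, y3) = start, end
    # Two control points placed a third and two thirds of the way, then nudged.
    x1 = x0 + (x3 - x0) / 3 + rng.uniform(-wobble, wobble)
    y1 = y0 + (y3 - y0) / 3 + rng.uniform(-wobble, wobble)
    x2 = x0 + 2 * (x3 - x0) / 3 + rng.uniform(-wobble, wobble)
    y2 = y0 + 2 * (y3 - y0) / 3 + rng.uniform(-wobble, wobble)
    points = []
    for i in range(steps + 1):
        t = i / steps
        # Cubic Bezier basis weights for the four points.
        a, b = (1 - t) ** 3, 3 * (1 - t) ** 2 * t
        c, d = 3 * (1 - t) * t ** 2, t ** 3
        points.append((a * x0 + b * x1 + c * x2 + d * x3,
                       a * y0 + b * y1 + c * y2 + d * y3))
    return points


path = mouse_path((10, 20), (400, 300), rng=random.Random(7))
print(len(path), path[0], path[-1])
```

Each point would be passed to the automation tool's mouse-move call with a short, slightly varied pause in between. Evenly timed steps can look as artificial as a straight line.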
Pros:
- No CAPTCHAs to solve if you don't trigger them
- Very effective for less sophisticated sites
Cons:
- Doesn't work against advanced fingerprinting
- Complex setup and infrastructure
For lightly protected sites, meticulous human mimicry can prevent CAPTCHAs without solving challenges. But it quickly grows difficult to scale.
Key Takeaways
Here are the core lessons for bypassing CAPTCHAs effectively:
- Invest in quality residential proxies – Avoid simple IP-based triggers
- Leverage CAPTCHA solving services – Easy to implement for small/medium scraping
- Train ML models – For large scale operations and advanced CAPTCHAs
- Abuse audio challenges – When visual CAPTCHAs get tough
- Mimic users meticulously – Prevent simple triggers on less protected sites
- Combine multiple techniques – There's no silver bullet to bypass CAPTCHAs universally
With the right toolkit of proxies, services, automation, and mimicry, CAPTCHAs can be overcome at any scale.
Hopefully this guide gives you a framework for tackling those pesky CAPTCHAs that seem intent on ruining the lives of web scrapers everywhere! Please reach out if you have any other questions.
Frequently Asked Questions
Here are answers to some common questions about bypassing CAPTCHAs:
Q: What is the best CAPTCHA bypass service?
There is no definitive "best" service, but some popular options include 2Captcha, AntiCaptcha, CapMonster, and DeathByCaptcha. The right service depends on your budget, language needs, and API options.
Q: Can machine learning solve all CAPTCHAs?
Not quite – ML can reliably solve text and some image CAPTCHAs, but many advanced CAPTCHAs use random images, video, or other challenges specifically designed to be difficult for computers. ML can help but may need to be combined with other solutions.
Q: Are residential proxies always better than datacenter?
Residential proxies are extremely useful for avoiding IP-based blocks and rate limiting. However, datacenter proxies can sometimes work for low volume scraping where residential IPs aren't critical. It depends on the site protections.
Q: Is it illegal to bypass CAPTCHAs?
CAPTCHA bypassing tools themselves are not illegal. However, how you ultimately use the access CAPTCHA bypassing provides may breach a site's terms of service or violate laws. You're responsible for ensuring your web scraping respects sites' access policies and applicable laws.
Q: What's better – solve CAPTCHAs or avoid them?
Avoiding CAPTCHAs is preferable if feasible, as it's more reliable and scalable. But on highly protected sites, solving challenges may be necessary. The best approach combines avoidance where possible with robust CAPTCHA solving for when they inevitably appear.