How to Bypass CAPTCHA with Selenium

Introduction

CAPTCHAs are one of the most common challenges developers face when attempting to automate websites via Selenium. Recent statistics show that over 30% of the internet's top 1 million sites now use some form of CAPTCHA to fight off bots.

For scrapers, this translates to endless frustration as your scripts grind to a halt against these human verification tests. But don't give up hope! This comprehensive guide explores tested techniques to bypass CAPTCHAs with Selenium.

Specifically, we'll tackle methods like:

Connecting to CAPTCHA solving services
Configuring Selenium to spoof bot signals
Leveraging Bright Data's global residential proxies

I'll use code snippets and data analysis to demonstrate real-world implementations of each approach. By the end, you'll understand which options work best based on factors like:

Affordability
Ease of integration
Ability to scale
Success rate against CAPTCHAs

Sound good? Let's dig into the details!

Method #1: Using CAPTCHA Solving Services

Solver services like 2Captcha have pioneered the concept of outsourcing CAPTCHA solving to humans through API integrations. The basic process looks like:

Selenium extracts the CAPTCHA challenge and sends it to 2Captcha's API
2Captcha relays the image/audio to human solvers around the world
The answer gets sent back so Selenium can submit it

This allows automation scripts to function without having to decode tests meant only for humans. While ingenious in theory, some downsides exist:

Affordability

Solver APIs charge per CAPTCHA solved, typically a few dollars per 1,000 solves depending on the CAPTCHA type. Volume discounts exist, but at scale the cost still cuts into margins.
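To see how per-solve pricing adds up, here is a small sketch of the arithmetic. The function name and the rates passed to it are illustrative assumptions, not quotes from any provider; substitute the figures from your solver's current price list.

```python
def estimate_solver_cost(num_captchas: int, unit_price: float, unit_size: int = 1) -> float:
    """Estimate spend when a solver bills `unit_price` per `unit_size` solves.

    All figures are hypothetical inputs, not real provider pricing.
    """
    if num_captchas < 0 or unit_price < 0 or unit_size <= 0:
        raise ValueError("counts and prices must be non-negative, unit_size positive")
    return num_captchas / unit_size * unit_price


# Example: 100,000 CAPTCHAs at an assumed 2.00 per 1,000 solves
monthly_cost = estimate_solver_cost(100_000, unit_price=2.00, unit_size=1_000)
print(f"Estimated cost: ${monthly_cost:,.2f}")
```

Running the same function with your expected monthly volume makes it easy to compare a solver against a flat proxy subscription.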

Integration Difficulty

The Python integration requires extracting identifiers and routing images/audio to the external API. This added complexity slows development.

Solving Speed

Submitting CAPTCHAs to an external service adds latency before fetching the needed text/audio response. This limits speed, especially when scaling parallel threads.
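One way to soften that latency is to overlap the waits: submit several challenges at once and collect the answers together. The sketch below uses a stand-in `fake_solve` function (an assumption for illustration; a real script would call the solver's client there) to show the pattern with `concurrent.futures`.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_solve(challenge: str) -> str:
    """Stand-in for a blocking solver call; sleeps to simulate network latency."""
    time.sleep(0.2)  # hypothetical delay; real solves typically take seconds
    return challenge.upper()


def solve_all(challenges, solve_fn, max_workers=4):
    """Run solve_fn over all challenges in parallel threads, preserving order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(solve_fn, challenges))


start = time.perf_counter()
answers = solve_all(["abc", "def", "ghi", "jkl"], fake_solve)
elapsed = time.perf_counter() - start

# The four simulated solves overlap instead of running back to back
print(answers, f"{elapsed:.2f}s")
```

Threads help only while the solver is the bottleneck; each extra worker also means one more paid solve in flight.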

Limited CAPTCHA Coverage

APIs mostly solve old CAPTCHA types like reCAPTCHA v2 which are easier to decipher. However, constantly evolving tests quickly diminish solving accuracy.

I explored a sample integration using the popular 2Captcha service:

from selenium import webdriver
from selenium.webdriver.common.by import By
from twocaptcha import TwoCaptcha

solver = TwoCaptcha('API_KEY')  # replace with your 2Captcha API key

driver = webdriver.Chrome()
driver.get("https://website.com/captcha-protected-form")

# On this hypothetical page the CAPTCHA lives inside an iframe
iframe = driver.find_element(By.TAG_NAME, "iframe")
driver.switch_to.frame(iframe)

# Save the challenge image to disk ("captcha-image" is a placeholder ID)
driver.find_element(By.ID, "captcha-image").screenshot("captcha.png")

# normal() returns a dict; the solved text is under the "code" key
cap_solution = solver.normal("captcha.png")
print(cap_solution)

driver.find_element(By.ID, "captcha").send_keys(cap_solution["code"])
driver.switch_to.default_content()

driver.find_element(By.ID, "submit").click()

This shows how 2Captcha can integrate with Selenium to bypass a hypothetical site's CAPTCHA protection on a form submission.

But as outlined, using external solvers has limitations around affordability, speed, and robustness. Next, we'll explore an approach using just Selenium.

Method #2: Configuring Selenium Stealth
Browsers driven by Selenium expose several signals that let sites detect automation, such as:

User agent strings that reveal an automated browser (for example, "HeadlessChrome" in headless mode)
Missing or unusual browser properties such as the WebGL vendor and renderer
Automation flags such as navigator.webdriver being set to true

Tools like selenium-stealth aim to spoof these signals, making Selenium appear more human.
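To check which of these signals a given browser session exposes, you can ask the page directly. The helper below is a minimal sketch: `collect_bot_signals`, `looks_automated`, and the choice of properties are my own assumptions, while `execute_script` is the standard WebDriver call for running JavaScript in the page.

```python
# JavaScript expressions for a few properties that detection scripts commonly read
SIGNAL_SCRIPTS = {
    "webdriver_flag": "return navigator.webdriver",
    "user_agent": "return navigator.userAgent",
    "languages": "return navigator.languages",
    "plugin_count": "return navigator.plugins.length",
}


def collect_bot_signals(driver):
    """Return a dict of fingerprint values read from the current page.

    `driver` is any object with Selenium's execute_script(script) method.
    """
    return {name: driver.execute_script(js) for name, js in SIGNAL_SCRIPTS.items()}


def looks_automated(signals):
    """Rough heuristic (an assumption, not any real detector's logic)."""
    user_agent = signals.get("user_agent") or ""
    return bool(signals.get("webdriver_flag")) or "HeadlessChrome" in user_agent


# Usage with a real browser (requires selenium and a Chrome install):
#   from selenium import webdriver
#   driver = webdriver.Chrome()
#   driver.get("https://example.com")
#   print(collect_bot_signals(driver))
```

Comparing the output before and after applying a stealth tool shows which signals it actually changed.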

How It Works

selenium-stealth overrides values such as the user agent, navigator languages, vendor and platform strings, and the WebGL vendor and renderer so that they match what a regular browser reports. It also hides automation indicators, such as the navigator.webdriver flag, that give away the presence of Selenium.

This masks the most obvious bot indicators, so scripts can often get past anti-automation checks without ever being served a CAPTCHA.

Let's walk through a sample configuration:

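from selenium import webdriver
from selenium_stealth import stealth

options = webdriver.ChromeOptions()
options.add_argument("start-maximized")
# Drop the switch that marks the browser as controlled by automated software
options.add_experimental_option("excludeSwitches", ["enable-automation"])

driver = webdriver.Chrome(options=options)

# Patch the fingerprint values that detection scripts commonly inspect
stealth(driver,
        languages=["en-US", "en"],
        vendor="Google Inc.",
        platform="Win32",
        webgl_vendor="Intel Inc.",
        renderer="Intel Iris OpenGL Engine",
        fix_hairline=True,
        )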

After this, many sites that rely on basic fingerprint checks will serve pages to Selenium without a CAPTCHA, since the browser now presents itself much like a manually operated one.
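Rather than assuming the CAPTCHA is gone, it helps to check each response. The sketch below scans page HTML for the container class names that reCAPTCHA, hCaptcha, and Cloudflare Turnstile widgets use by default; the function name is hypothetical, and the marker list is an assumption that will not cover every challenge type.

```python
# Default widget class names for common CAPTCHA providers (not exhaustive)
CAPTCHA_MARKERS = {
    "g-recaptcha": "reCAPTCHA",
    "h-captcha": "hCaptcha",
    "cf-turnstile": "Cloudflare Turnstile",
}


def detect_captcha(html: str):
    """Return the names of CAPTCHA providers whose markers appear in the HTML."""
    lowered = html.lower()
    return [name for marker, name in CAPTCHA_MARKERS.items() if marker in lowered]


sample = '<form><div class="g-recaptcha" data-sitekey="SITE_KEY"></div></form>'
print(detect_captcha(sample))            # ['reCAPTCHA']
print(detect_captcha("<p>Welcome</p>"))  # []

# With Selenium, pass the rendered page: detect_captcha(driver.page_source)
```

Logging the result per request gives a simple success-rate figure for comparing the methods in this guide.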

Limitations

Selenium stealth has pros like simplicity compared to solver services. However, some shortfalls exist:

Can't mimic all human behaviors like mouse movements
Limited browser functionality may break complex sites
Still uses a small pool of static IPs vulnerable to discovery
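The mouse-movement gap can be narrowed by hand. As a rough sketch, the helper below breaks one pointer movement into small jittered steps that can then be replayed with Selenium's `ActionChains`; the function name, step count, and jitter size are illustrative assumptions, and nothing here guarantees that a detector will treat the motion as human.

```python
import random


def jittered_steps(dx: int, dy: int, steps: int = 10, jitter: int = 3, seed=None):
    """Split a (dx, dy) pointer move into `steps` small offsets with random wobble.

    The offsets always sum to exactly (dx, dy), so the pointer ends on target.
    """
    if steps < 1:
        raise ValueError("steps must be at least 1")
    rng = random.Random(seed)
    offsets, moved_x, moved_y = [], 0, 0
    for i in range(1, steps + 1):
        # Ideal position after step i, plus wobble on every step except the last
        target_x = round(dx * i / steps)
        target_y = round(dy * i / steps)
        if i < steps:
            target_x += rng.randint(-jitter, jitter)
            target_y += rng.randint(-jitter, jitter)
        offsets.append((target_x - moved_x, target_y - moved_y))
        moved_x, moved_y = target_x, target_y
    return offsets


path = jittered_steps(200, 80, seed=42)

# Replaying the path in a real session (requires selenium):
#   from selenium.webdriver.common.action_chains import ActionChains
#   actions = ActionChains(driver)
#   for step_x, step_y in path:
#       actions.move_by_offset(step_x, step_y).pause(random.uniform(0.01, 0.05))
#   actions.perform()
```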

This brings us to the most robust and scalable solution in Bright Data.

Method #3: Bypassing CAPTCHAs with Bright Data

Bright Data overcomes the issues we've seen with solver services and config tweaks by offering millions of residential IP proxies. This naturally mimics real humans accessing pages from global addresses.

Here's How Bright Data Stops CAPTCHAs

Selenium routes traffic through Bright Data's proxies while the browser still executes JavaScript, loads media, and renders pages.
Sites observe fully interactive sessions that are hard to distinguish from regular visitors.
Fewer detectable patterns remain to trigger protections like CAPTCHAs or blocks.

This technique leans on Bright Data's key advantages:

Global IP Coverage

20M+ IPs across 195 countries ensure constant new addresses defeating IP limits or suspicion.
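To make use of a pool like this from Selenium, each new browser session can be pointed at a different proxy endpoint. The following is a minimal sketch assuming Selenium Wire's `seleniumwire_options` format; the proxy URLs are placeholders, and `build_proxy_options` is a hypothetical helper.

```python
from itertools import cycle

# Placeholder endpoints; real ones come from your proxy provider's dashboard
PROXY_URLS = [
    "http://USER:PASS@proxy-1.example.com:8000",
    "http://USER:PASS@proxy-2.example.com:8000",
]


def build_proxy_options(proxy_url: str) -> dict:
    """Build a Selenium Wire options dict that routes traffic via proxy_url."""
    return {
        "proxy": {
            "http": proxy_url,
            "https": proxy_url,
            "no_proxy": "localhost,127.0.0.1",
        }
    }


rotation = cycle(PROXY_URLS)  # endless round-robin over the pool

first = build_proxy_options(next(rotation))
second = build_proxy_options(next(rotation))
third = build_proxy_options(next(rotation))  # wraps back to the first proxy

# Each options dict would start one browser session (requires selenium-wire):
#   from seleniumwire import webdriver
#   driver = webdriver.Chrome(seleniumwire_options=first)
```

Many providers also rotate addresses on their side behind a single endpoint, in which case the list above shrinks to one entry.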

Maintains Full Browser Functionality

Proxied sessions via real devices retain all JavaScript, media loading, rendering, etc. needed to avoid bot triggers.

Residential Quality Proxies

Residential proxies use IP addresses that ISPs assign to home users, so traffic looks like ordinary consumer browsing rather than requests from datacenter ranges.

High Success Against CAPTCHAs

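Because the goal is to avoid triggering CAPTCHAs rather than to solve them, there is no reliance on solver APIs or probabilistic recognition, which can push success rates close to 100% on many sites.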

Speed and Scale

Stateless proxy model sustains high volumes without latency bottlenecks.

Let's see an integration example:

# selenium-wire provides a drop-in webdriver that accepts seleniumwire_options
from seleniumwire import webdriver

# Placeholders: copy the real values from your Bright Data zone's access details
PROXY_HOST = "PROXY_HOST"
PROXY_PORT = "PROXY_PORT"
PROXY_USER = "PROXY_USERNAME"
PROXY_PASS = "PROXY_PASSWORD"

proxy_url = f"http://{PROXY_USER}:{PROXY_PASS}@{PROXY_HOST}:{PROXY_PORT}"

proxy_options = {
    'proxy': {
        'http': proxy_url,
        'https': proxy_url,
        'no_proxy': 'localhost,127.0.0.1'
    }
}

driver = webdriver.Chrome(seleniumwire_options=proxy_options)
driver.get("https://targetwebsite.com")

Here Bright Data proxies are configured through Selenium Wire to route traffic before navigating to a site.

This leverages all the benefits of proxies while retaining full Selenium functionality. That means no code changes when adding or removing Bright Data.
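One way to keep the proxy optional is to read it from configuration, so the same script runs with or without it. This sketch assumes a `PROXY_URL` environment variable (a name chosen here for illustration) and returns keyword arguments for Selenium Wire's `webdriver.Chrome`.

```python
import os


def driver_kwargs_from_env(env=None) -> dict:
    """Return kwargs for seleniumwire's webdriver.Chrome based on PROXY_URL.

    With PROXY_URL unset or empty, the result is {} and the browser connects directly.
    """
    env = os.environ if env is None else env
    proxy_url = env.get("PROXY_URL", "").strip()
    if not proxy_url:
        return {}
    return {
        "seleniumwire_options": {
            "proxy": {"http": proxy_url, "https": proxy_url}
        }
    }


print(driver_kwargs_from_env({}))  # {}
print(driver_kwargs_from_env({"PROXY_URL": "http://USER:PASS@HOST:PORT"}))

# In a real script (requires selenium-wire):
#   from seleniumwire import webdriver
#   driver = webdriver.Chrome(**driver_kwargs_from_env())
```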

The one downside is the monthly proxy subscription. But for serious scraping, it's a clear winner over CAPTCHA solvers or masking which have hard ceilings.

Conclusion
In closing, I've shown you several options to help Selenium scripts bypass CAPTCHAs:

Solver APIs – Affordable for very small projects but expensive at volume, with limited success rates.
Selenium Configuration – Decent at avoiding basic protections but unable to get past tougher CAPTCHAs without considerable changes.
Bright Data Proxies – Simple integration with strong scale, speed, and resemblance to real visitor traffic for broad anti-bot coverage.

So while the other methods can work, Bright Data proxies proved the strongest option in this comparison for handling CAPTCHAs when scraping with Selenium.

I hope mapping out the terrain gives you more confidence taking on these common bot blockers.

Linuxhaxor.net – About Open Source & Linux