How to Use Selenium with Scrapy
Websites built with modern JavaScript frameworks have become the norm, offering smooth scrolling, dynamic content loading and reactive interfaces. However, this shift towards dynamic websites poses major challenges for scrapers written in Python. In fact, by some estimates over 85% of sites now rely on JavaScript to load data, build interfaces and handle interactivity.
Traditional scraping tools like Beautiful Soup, Scrapy and Requests shine for static websites, but aren’t equipped to handle pages rendered by JavaScript. While frameworks like Scrapy execute quickly and are easy to parallelize, they operate at the HTTP request-response level – never actually rendering the web pages fetched.
This is where Selenium enters the picture. It allows controlling actual browsers programmatically, helping interact with and scrape dynamic content. Combine it with a battle-tested web scraping framework like Scrapy, and you can scrape even the most complex single page apps with ease!
In this comprehensive guide, we’ll learn how to leverage Selenium to handle JavaScript web pages within Scrapy scrapers step-by-step:
Table of Contents
- Overview of Selenium & Scrapy
- Installing Required Libraries
- Integrating Selenium Middleware
- Creating Selenium Requests in Scrapy
- Scraping Challenges and Use Cases
- Infinite Scroll
- Clicking Elements
- Submitting Forms
- Solving CAPTCHAs
- Setting up Proxies for Rotation
- Tradeoffs of Using Selenium
- Considerations for Production
So let’s get started!
Why Combine Selenium & Scrapy?
Scrapy is one of the most popular open source crawling and scraping frameworks for building robust web spiders at scale. With its clean API, built-in selector engine and middleware pipeline – it makes extracting data a breeze.
However, here are some areas where Scrapy falls short:
❌ Doesn’t execute JavaScript code
❌ No browser emulation for dynamic actions
❌ Limited options for handling modern web protections
Selenium on the other hand is a renowned browser automation suite. It provides an elegant API for:
✅ Launching and controlling browsers like Chrome
✅ Interacting with page elements
✅ Executing JavaScript code on pages
✅ Mimicking complex user actions
Combining them together lets you leverage Selenium’s dynamic capabilities directly within Scrapy spiders!
This lets you build robust crawlers that can scrape any modern website – while retaining Scrapy’s speed, efficiency and pipelines.
Key Benefits
Here are some of the advantages you get:
👉 Handles JavaScript Heavy Sites – Executes JS code to render full pages with dynamic content loaded
👉 Interacts With Pages – Clicks buttons, fills forms, scrolls pages seamlessly
👉 Simulates User Journey – Mimics clicks, mouse movements and scrolls to avoid bot detection
👉 Manages Modern Protections – Can programmatically solve CAPTCHAs and other protections
👉 Leverages Existing Scrapy Pipelines – Integrates easily into existing Scrapy project and workflows
Now that you know why Selenium is critical for modern scrapers, let’s look at how to integrate it within Scrapy.
Installing Required Libraries
To get started, you need a Python 3 environment with Scrapy installed. Check out our guide if you need help.
Next, install the scrapy-selenium middleware bridge:
pip install scrapy-selenium
This bridge connects Scrapy to Selenium, passing requests between them.
You’ll also need the Selenium bindings:
pip install selenium
I recommend always keeping it updated to leverage the latest browser and webdriver enhancements:
pip install --upgrade selenium
And that’s it for requirements!
Integrating Selenium Middleware
The scrapy-selenium middleware connects Scrapy to Selenium by intercepting requests and routing them to a controlled browser.
To enable, open settings.py file of your Scrapy project and add:
DOWNLOADER_MIDDLEWARES = {
    'scrapy_selenium.SeleniumMiddleware': 800
}
This instructs Scrapy to use the selenium downloader middleware.
Next, configure the actual browser driver under the SELENIUM_DRIVER_* settings:
SELENIUM_DRIVER_NAME = 'chrome'
SELENIUM_DRIVER_EXECUTABLE_PATH = '/path/to/chromedriver'
SELENIUM_DRIVER_ARGUMENTS = ['--headless']
Here we define Chrome as the target browser, and launch it in headless mode for faster performance.
On your machine, install ChromeDriver and provide the absolute path to the binary as shown above.
And done! Scrapy will now coordinate with Selenium through the middleware to launch browsers and execute JavaScript when needed.
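For reference, here is everything from this section gathered into one settings.py fragment (the chromedriver path is machine-specific, so adjust it to your install):

```python
# settings.py – everything scrapy-selenium needs in one place

DOWNLOADER_MIDDLEWARES = {
    'scrapy_selenium.SeleniumMiddleware': 800,
}

SELENIUM_DRIVER_NAME = 'chrome'
SELENIUM_DRIVER_EXECUTABLE_PATH = '/path/to/chromedriver'  # adjust to your machine
SELENIUM_DRIVER_ARGUMENTS = ['--headless']                 # drop for a visible browser
```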
Creating Selenium Requests
To instruct Scrapy to render pages using Selenium, use SeleniumRequest instead of the default Request object:
from scrapy_selenium import SeleniumRequest

def start_requests(self):
    yield SeleniumRequest(url="https://www.example.com", callback=self.parse)
This replaces start_urls, loading URLs directly in Chrome via Selenium.
Any action or data extraction on these pages has to happen through the Selenium driver, which you can access within callbacks via:
def parse(self, response):
    driver = response.meta['driver']
    driver.find_element_by_css_selector(...)
Equipped with these basics, you can start scraping pages with dynamic content served by JavaScript!
Now let’s look at some real-world examples.
Use Case 1: Scraping Infinite Scroll Pages
Infinite scrolling websites like Twitter and Facebook, which load content continuously as you scroll, are immensely popular – but they pose a challenge for scrapers!
Here’s how to leverage Selenium to scroll through and extract data from such pages in Scrapy:
1. Create Selenium Request
Replace the initial request with a Selenium request:
def start_requests(self):
    yield SeleniumRequest(
        url="https://twitter.com",
        callback=self.parse
    )
2. Access Driver & Scroll
In parse(), access the driver from meta to interact with the page:
import time

def parse(self, response):
    driver = response.meta['driver']

    # Scroll to bottom of page
    for i in range(5):
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight)")
        time.sleep(2)
Here we use execute_script() to scroll down, pausing between scrolls to allow dynamic content to load.
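If you do not know in advance how many scrolls a page needs, a common refinement is to keep scrolling until the document height stops growing. Here is a sketch of that loop as a standalone helper – the function name and parameters are ours, not part of scrapy-selenium:

```python
import time

def scroll_to_end(driver, pause=2.0, max_rounds=20):
    """Scroll until document.body.scrollHeight stops growing.

    Works with any Selenium driver; max_rounds caps the loop so a
    page that loads forever cannot hang the spider.
    """
    last_height = driver.execute_script("return document.body.scrollHeight")
    for _ in range(max_rounds):
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight)")
        time.sleep(pause)  # give dynamic content time to load
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            break  # no new content appeared; we reached the end
        last_height = new_height
    return last_height
```

Call it from parse() with the driver from response.meta, then extract data as usual.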
3. Extract Data
Finally, extract scraped data using Selenium instead of Scrapy:
tweets = driver.find_elements_by_css_selector(".tweet-text")

for tweet in tweets:
    text = tweet.text
    username = tweet.find_element_by_css_selector(".username").text
    yield {
        'text': text,
        'username': username
    }
And you have successfully scraped an infinite scroll page leveraging Selenium within Scrapy!
While the concepts are similar, the specific strategy would differ based on the page. Let’s check out some more examples next.
Use Case 2: Clicking Elements
Websites often reveal or load additional data when specific buttons, tabs or other elements are clicked via JavaScript.
Selenium provides a robust way to interact with such click events. Here is an example:
1. Create Request
As always, initiate a Selenium powered request:
def start_requests(self):
    yield SeleniumRequest(url="https://www.example.com", callback=self.parse)
2. Click Button
In parse(), identify the element to click and trigger it:
def parse(self, response):
    driver = response.meta['driver']
    driver.find_element_by_id("loadMore").click()
Here we locate the button by its ID and invoke .click() to trigger its event handler.
3. Wait For New Content
Since the click loads additional data asynchronously, use an explicit wait for the new content:
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.CSS_SELECTOR, ".new-content"))
)
Now elements with the new-content class are guaranteed to be present in the DOM.
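The same wait-then-proceed pattern can be expressed as a small polling helper, which is handy when no expected_conditions predicate fits your case. This is a sketch of roughly what WebDriverWait does under the hood – the helper name and defaults are ours:

```python
import time

def wait_until(predicate, timeout=10.0, poll=0.5):
    """Poll predicate() until it returns a truthy value or timeout expires.

    Returns the truthy value, or raises TimeoutError on timeout.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        result = predicate()
        if result:
            return result
        time.sleep(poll)
    raise TimeoutError("condition not met within %.1fs" % timeout)
```

For example, wait_until(lambda: driver.find_elements_by_css_selector(".new-content")) blocks until at least one matching element exists.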
4. Extract New Data
Finally, extract the freshly loaded content using Selenium:
panels = driver.find_elements_by_css_selector(".panel")

for panel in panels:
    title = panel.find_element_by_tag_name("h3").text
    content = panel.find_element_by_css_selector(".panel-text").text
    yield {
        'title': title,
        'content': content
    }
And that’s another real-world example under your belt leveraging click events!
Use Case 3: Submitting Forms
Websites often collect user data via multi-page forms that dynamically update as they are filled out.
Here is an example workflow to automate form filling:
1. Initialize Session
Start by creating a new Selenium session:
def start_requests(self):
    yield SeleniumRequest(url="https://example.com/form", callback=self.parse)
2. Identify Fields
In parse(), locate each <input> field to populate:
def parse(self, response):
    driver = response.meta['driver']

    firstname = driver.find_element_by_name("firstname")
    lastname = driver.find_element_by_name("lastname")
    email = driver.find_element_by_name("email")
    phone = driver.find_element_by_name("phone")
3. Populate Values
Interact with each field by calling .send_keys():
firstname.send_keys("John")
lastname.send_keys("Doe")
email.send_keys("john.doe@example.com")
phone.send_keys("9876543210")
This will fill the entire form automatically!
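When a form has many fields, looping over a dict keeps the spider tidy. A minimal sketch – the helper name is ours, not a Selenium API:

```python
def fill_form(driver, values):
    """Type each value into the input located by its name attribute."""
    for name, value in values.items():
        field = driver.find_element_by_name(name)
        field.clear()           # wipe any pre-filled text
        field.send_keys(value)  # simulate typing

# usage inside parse():
# fill_form(driver, {"firstname": "John", "lastname": "Doe"})
```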
4. Submit
Locate and click the submit button:
driver.find_element_by_tag_name("button").click()
Voila! Form submitted programmatically via Selenium in Scrapy.
Use Case 4: Bypassing CAPTCHAs
Sites often try to block bots using protections like CAPTCHAs. Let's see how to automate solving them.
1. Switch Frames
Identify and switch to the CAPTCHA iframe:
iframe = driver.find_element_by_tag_name("iframe")
driver.switch_to.frame(iframe)
2. Parse Challenge
Analyze the CAPTCHA type – text, image, audio etc. – and extract the challenge:
# If image captcha
captcha_base64 = driver.find_element_by_id("captcha").screenshot_as_base64

# If audio captcha
driver.find_element_by_id("audioChallenge")
3. Solve Challenge
Pass the extracted challenge to an external solving service such as Anti-CAPTCHA. The client calls below are illustrative – check your provider's SDK for the exact API:
solver = AnticaptchaClient("API_KEY")
solution = solver.solve_captcha(captcha_base64)
4. Submit Solution
Finally, input the solved text and submit:
driver.find_element_by_id("captchaTextInput").send_keys(solution)
driver.find_element_by_id("submitCaptcha").click()
This automates the entire CAPTCHA solving flow!
And that’s just a sample – you can handle reCAPTCHA, hCaptcha and more this way!
Using Proxies for Rotation
While Selenium mimics users well, scrapers can still get blocked by sophisticated bot mitigation services.
The easiest way to avoid this is to route traffic through proxy servers, making requests originate from diverse IPs.
Here’s how to configure a proxy service like Luminati or Smartproxy within scrapy-selenium:
1. Get Credentials
Register and get access credentials – typically a username/password combination.
2. Set Proxy Endpoint
Configure the browser to route its traffic through the proxy endpoint. Note that scrapy-selenium's SELENIUM_COMMAND_EXECUTOR setting points at a remote Selenium server rather than a proxy, so for a plain proxy endpoint it is more reliable to pass Chrome's --proxy-server argument:

SELENIUM_DRIVER_ARGUMENTS = [
    '--headless',
    '--proxy-server=http://proxy.luminati.io:22225',
]

Chrome ignores credentials embedded in a proxy URL, so authenticate by whitelisting your server's IP with the provider instead.
This funnels all browser traffic through Luminati IPs.
Doing so hides the scraper origin across locations and avoids IP blocks!
For best results, use a backconnect rotating proxy service to keep switching assigned IPs automatically.
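If your provider gives you a list of endpoints instead of a single rotating gateway, you can cycle through them yourself. A minimal round-robin sketch – the class and the pool contents are placeholders, not part of any library:

```python
from itertools import cycle

class ProxyPool:
    """Hand out proxy endpoints in round-robin order."""

    def __init__(self, proxies):
        if not proxies:
            raise ValueError("proxy pool cannot be empty")
        self._cycle = cycle(proxies)

    def next_proxy(self):
        return next(self._cycle)

# usage: feed pool.next_proxy() into '--proxy-server=...' per browser launch
pool = ProxyPool([
    "http://proxy1.example.com:8000",  # placeholder endpoints
    "http://proxy2.example.com:8000",
])
```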
Tradeoffs of Using Selenium
While Selenium is extremely powerful, it also comes with some downsides including:
Performance Overhead – Controlling real browsers is slower than making direct HTML requests. Plan for longer page load times.
Resource Intensive – Browsers are memory and compute heavy compared to lean Scrapy spiders.
Config Complexity – With great flexibility comes more DevOps overhead to manage drivers, binaries etc.
Headless-Only in Practice – Browser GUIs add minimal value in scraping yet consume many resources, so production runs are effectively restricted to headless mode.
Scalability Limits – Distributing Selenium across multiple machines is hard.
This is where services like Scrapy Cloud help simplify production management.
Or you can investigate lower level libraries like Playwright as an alternative.
But for most standard use cases, Selenium + Scrapy offers the best blend of capabilities and ease of use.
Considerations for Production
Here are some best practices when deploying scrapy-selenium pipelines to production:
- Containerize environments with Docker for easy portability and deployment
- Leverage tools like Selenium Grid for distributed execution
- Enable headless mode and disable images/media for performance
- Use proxies and rotate IPs/User-Agents to avoid blocks
- Monitor resource usage, failures and errors closely
- Setup automatic restarts in case of crashes
- Consider moving to a fully managed scraping platform over time
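Two of the bullets above – headless mode and disabling images – translate directly into driver arguments. A sketch for Chrome (these are real Chrome command-line switches, but verify them against your Chrome version):

```python
# settings.py – leaner headless Chrome for production
SELENIUM_DRIVER_ARGUMENTS = [
    '--headless',
    '--disable-gpu',                         # avoids GPU initialization overhead
    '--blink-settings=imagesEnabled=false',  # skip downloading images
]
```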
And that’s a wrap! You are now equipped to harness the power of Selenium to build rock solid Scrapy spiders for even the most complex JavaScript-driven sites!
Key Takeaways
- JavaScript usage has exploded, making HTTP-only frameworks like Scrapy insufficient on their own
- Selenium helps bridge this gap by controlling real browsers like Chrome
- Integrate via a simple middleware bridge – scrapy-selenium
- Create SeleniumRequest and access the driver via response.meta
- Helps scrape infinite scroll pages, click elements, submit forms
- Configure Luminati/Smartproxy proxies for IP rotation
- Containerize deployments and monitor resource usage