How to Use Selenium with Scrapy

Websites built with modern JavaScript frameworks have become the norm, offering smooth scrolling, dynamic content loading and reactive interfaces. However, this shift towards dynamic websites poses major challenges for scrapers written in Python. In fact, by some estimates over 85% of sites now rely on JavaScript to load data, build interfaces and handle interactivity.

Traditional scraping tools like Beautiful Soup, Scrapy and Requests shine for static websites, but aren’t equipped to handle pages rendered by JavaScript. While frameworks like Scrapy execute quickly and are easy to parallelize, they operate at the HTTP request-response level – never actually rendering the web pages fetched.

This is where Selenium enters the picture. It allows controlling actual browsers programmatically, helping interact with and scrape dynamic content. Combine it with a battle-tested web scraping framework like Scrapy, and you can scrape even the most complex single page apps with ease!

In this comprehensive guide, we’ll learn how to leverage Selenium to handle JavaScript web pages within Scrapy scrapers step-by-step:

Table of Contents

  • Why Combine Selenium & Scrapy?
  • Installing Required Libraries
  • Integrating Selenium Middleware
  • Creating Selenium Requests
  • Scraping Challenges and Use Cases
    • Scraping Infinite Scroll Pages
    • Clicking Elements
    • Submitting Forms
    • Bypassing CAPTCHAs
  • Using Proxies for Rotation
  • Tradeoffs of Using Selenium
  • Considerations for Production

So let’s get started!

Why Combine Selenium & Scrapy?

Scrapy is one of the most popular open source crawling and scraping frameworks for building robust web spiders at scale. With its clean API, built-in selector engine and middleware pipeline – it makes extracting data a breeze.

However, here are some areas where Scrapy falls short:

❌ Doesn’t execute JavaScript code
❌ No browser emulation for dynamic actions
❌ Limited options for handling modern web protections

Selenium on the other hand is a renowned browser automation suite. It provides an elegant API for:

✅ Launching and controlling browsers like Chrome
✅ Interacting with page elements
✅ Executing JavaScript code on pages
✅ Mimicking complex user actions

Combining them together lets you leverage Selenium’s dynamic capabilities directly within Scrapy spiders!

This lets you build robust crawlers that can scrape any modern website – while retaining Scrapy’s speed, efficiency and pipelines.

Key Benefits

Here are some of the advantages you get:

👉 Handles JavaScript Heavy Sites – Executes JS code to render full pages with dynamic content loaded

👉 Interacts With Pages – Clicks buttons, fills forms, scrolls pages seamlessly

👉 Simulates User Journey – Mimics clicks, mouse movements and scrolls to avoid bot detection

👉 Manages Modern Protections – Can programmatically solve CAPTCHAs and other protections

👉 Leverages Existing Scrapy Pipelines – Integrates easily into existing Scrapy project and workflows

Now that you know why Selenium is critical for modern scrapers, let’s look at how to integrate it within Scrapy.

Installing Required Libraries

To get started, you need a Python 3 environment with Scrapy installed. Check out our guide if you need help.

Next, install the scrapy-selenium middleware bridge:

pip install scrapy-selenium

This connects Scrapy to Selenium, passing requests and responses between them.

You’ll also need the Selenium bindings:

pip install selenium

I recommend always keeping it updated to leverage the latest browser and webdriver enhancements:

pip install --upgrade selenium

And that’s it for requirements!
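
A quick way to sanity-check the installs from a Python shell (a small sketch – it just prints the installed versions):

import scrapy
import selenium

print("Scrapy:", scrapy.__version__)
print("Selenium:", selenium.__version__)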

Integrating Selenium Middleware

The scrapy-selenium middleware connects Scrapy to Selenium by intercepting requests and routing them to a controlled browser.

To enable it, open the settings.py file of your Scrapy project and add:

DOWNLOADER_MIDDLEWARES = {
    'scrapy_selenium.SeleniumMiddleware': 800
}

This instructs Scrapy to use the selenium downloader middleware.

Next, configure the actual browser driver under SELENIUM_DRIVER_*:

SELENIUM_DRIVER_NAME = 'chrome'
SELENIUM_DRIVER_EXECUTABLE_PATH = '/path/to/chromedriver'  
SELENIUM_DRIVER_ARGUMENTS = ['--headless']

Here we define Chrome as the target browser, and launch it in headless mode for faster performance.

On your machine, install ChromeDriver and provide the absolute path to the binary as shown above.

And done! Scrapy will now coordinate with Selenium through the middleware to launch browsers and execute JavaScript when needed.

Creating Selenium Requests

To instruct Scrapy to render pages using Selenium, use SeleniumRequest instead of the default Request object:

from scrapy_selenium import SeleniumRequest

def start_requests(self):
    yield SeleniumRequest(url="https://www.example.com", callback=self.parse)

Defining start_requests this way replaces start_urls – each URL is loaded in Chrome via Selenium instead of being fetched over plain HTTP.

Dynamic interactions with these pages happen through the Selenium driver, which you can access within callbacks via:

def parse(self, response):
    driver = response.meta['driver']
    driver.find_element_by_css_selector(...)
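
Putting these pieces together, a minimal spider might look like the sketch below (the URL, selector and item fields are placeholders):

import scrapy
from scrapy_selenium import SeleniumRequest


class ExampleSpider(scrapy.Spider):
    name = "example"

    def start_requests(self):
        # Load the page in a real browser instead of over plain HTTP
        yield SeleniumRequest(url="https://www.example.com", callback=self.parse)

    def parse(self, response):
        # The rendered HTML is exposed to Scrapy's usual selectors
        for heading in response.css("h1::text").getall():
            yield {"heading": heading}

        # The live browser is also available for dynamic interaction
        driver = response.meta['driver']
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight)")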

Equipped with these basics, you can start scraping pages with dynamic content served by JavaScript!

Now let’s look at some real-world examples.

Use Case 1: Scraping Infinite Scroll Pages

Infinite scrolling websites like Twitter and Facebook that load content continuously as you scroll are immensely popular. But a challenge for scrapers!

Here’s how to leverage Selenium to scroll through and extract data from such pages in Scrapy:

1. Create Selenium Request

Replace the initial request with a Selenium request:

def start_requests(self):
   yield SeleniumRequest(
     url="https://twitter.com",
     callback=self.parse
   )

2. Access Driver & Scroll

In parse(), access the driver from meta to interact with the page:

import time

def parse(self, response):
  driver = response.meta['driver']

  # Scroll to the bottom of the page, pausing so dynamic content can load
  for i in range(5):
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight)")
    time.sleep(2)

Here we use execute_script() to scroll down, pausing between scrolls to give dynamically loaded content time to appear.
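
If you don’t know how many scrolls a page needs, a common variation – sketched below, reusing the time import from the previous snippet – is to keep scrolling until the page height stops growing:

last_height = driver.execute_script("return document.body.scrollHeight")
while True:
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight)")
    time.sleep(2)  # give newly requested content time to arrive
    new_height = driver.execute_script("return document.body.scrollHeight")
    if new_height == last_height:
        break  # page height stopped growing – no more content
    last_height = new_height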

3. Extract Data

Finally, extract scraped data using Selenium instead of Scrapy:

tweets = driver.find_elements_by_css_selector(".tweet-text")
for tweet in tweets:
  text = tweet.text
  username = tweet.find_element_by_css_selector(".username").text
  
  yield {
      'text': text,
      'username': username
  }

And you have successfully scraped an infinite scroll page leveraging Selenium within Scrapy!

While the concepts are similar, the specific strategy would differ based on the page. Let’s check out some more examples next.

Use Case 2: Clicking Elements

Websites often reveal or load additional data when specific buttons, tabs or other elements are clicked via JavaScript.

Selenium provides a robust way to interact with such click events. Here is an example:

1. Create Request

As always, initiate a Selenium powered request:

def start_requests(self):
  yield SeleniumRequest(url="https://www.example.com", callback=self.parse)

2. Click Button

In parse(), identify the element to click and trigger it:

def parse(self, response):
  driver = response.meta['driver']

  driver.find_element_by_id("loadMore").click()

Here we locate the button by its ID, and invoke .click() to trigger the event handler.
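
One practical note: if the button sits below the fold or behind another element, the click can fail with an ElementClickInterceptedException. Scrolling the element into view first – as in this sketch – usually resolves that:

button = driver.find_element_by_id("loadMore")
# Bring the element into the viewport before clicking it
driver.execute_script("arguments[0].scrollIntoView({block: 'center'});", button)
button.click()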

3. Wait For New Content

Since the click loads additional data asynchronously, use an explicit wait so the new content has time to appear:

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

WebDriverWait(driver, 10).until(
  EC.presence_of_element_located((By.CSS_SELECTOR, ".new-content"))
)

This blocks until an element with the new-content class appears in the DOM, raising a TimeoutException if nothing shows up within 10 seconds.
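
In a long-running spider you’ll usually want to catch that timeout rather than crash the callback. A minimal sketch, assuming it runs inside the spider callback (so self.logger is available):

from selenium.common.exceptions import TimeoutException

try:
    WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((By.CSS_SELECTOR, ".new-content"))
    )
except TimeoutException:
    # Nothing appeared within 10 seconds – log it and bail out of this callback
    self.logger.warning("New content never appeared after the click")
    return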

4. Extract New Data

Finally, extract the freshly loaded content using Selenium:

panels = driver.find_elements_by_css_selector(".panel")
for panel in panels:
  title = panel.find_element_by_tag_name("h3").text
  content = panel.find_element_by_css_selector(".panel-text").text

  yield {
      'title': title,
      'content': content
  }

And that’s another real-world example under your belt leveraging click events!

Use Case 3: Submitting Forms

Websites often collect user data via multi-page forms that dynamically update as they are filled out.

Here is an example workflow to automate form filling:

1. Initialize Session

Start by creating a new Selenium session:

def start_requests(self):
   yield SeleniumRequest(url="https://example.com/form", callback=self.parse)

2. Identify Fields

In parse(), locate each <input> field to populate:

def parse(self, response):
  driver = response.meta['driver']

  firstname = driver.find_element_by_name("firstname") 
  lastname = driver.find_element_by_name("lastname")

  email = driver.find_element_by_name("email")
  phone = driver.find_element_by_name("phone")

3. Populate Values

Interact with each field by calling .send_keys():

firstname.send_keys("John")
lastname.send_keys("Doe")

email.send_keys("john.doe@example.com")
phone.send_keys("9876543210")

This will fill the entire form automatically!
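
If the form arrives pre-populated (remembered values, placeholders filled by JavaScript and so on), it can help to clear each field before typing – a small sketch using the fields located above:

for field, value in [(firstname, "John"), (lastname, "Doe"),
                     (email, "john.doe@example.com"), (phone, "9876543210")]:
    field.clear()           # wipe any pre-filled value
    field.send_keys(value)  # then type the new one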

4. Submit

Locate and click the submit button:

driver.find_element_by_tag_name("button").click()

Voila! Form submitted programmatically via Selenium in Scrapy.
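
Since submission usually triggers a navigation or an AJAX response, it’s worth waiting for some confirmation before scraping the result. A sketch – the .confirmation selector is a placeholder:

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# Wait for a confirmation element (placeholder selector) before continuing
WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.CSS_SELECTOR, ".confirmation"))
)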

Use Case 4: Bypassing CAPTCHAs

Sites often try to block bots using protections like CAPTCHAs. Let’s see how to automate solving them.

1. Switch Frames

Identify and switch to the CAPTCHA iframe:

iframe = driver.find_element_by_tag_name("iframe")
driver.switch_to.frame(iframe)

2. Parse Challenge

Analyze the CAPTCHA type – text, image, audio etc. – and extract the challenge.

# If image captcha
captcha_base64 = driver.find_element_by_id("captcha").screenshot_as_base64

# If audio captcha 
driver.find_element_by_id("audioChallenge")

3. Solve Challenge

Pass the extracted challenge to an external solving service such as Anti-Captcha.

# Illustrative only – consult your solving service's SDK for the exact API
solver = AnticaptchaClient("API_KEY")
solution = solver.solve_captcha(captcha_base64)

4. Submit Solution

Finally, input the solved text and submit:

driver.find_element_by_id("captchaTextInput").send_keys(solution)
driver.find_element_by_id("submitCaptcha").click()

This automated the entire CAPTCHA solving flow!

And that’s just a sample – you can handle reCAPTCHA, hCaptcha and more this way!

Using Proxies for Rotation

While Selenium mimics users well, scrapers can still get blocked by sophisticated bot mitigation services.

The easiest way to avoid this is to route traffic through proxy servers, making requests originate from diverse IPs.

Here’s how to configure a proxy service like Luminati or Smartproxy within scrapy-selenium:

1. Get Credentials

Register and get access credentials – a username/password combination typically.

2. Set Proxy Endpoint

Point the browser at the proxy gateway by adding a --proxy-server argument to the driver settings in settings.py:

SELENIUM_DRIVER_ARGUMENTS = [
    '--headless',
    '--proxy-server=http://proxy.luminati.io:22225',
]

This funnels all browser traffic through Luminati IPs. Note that Chrome does not support credentials embedded in --proxy-server, so authenticated proxies usually require whitelisting your server’s IP with the provider (or a helper such as selenium-wire).

Doing so hides the scraper’s origin across locations and avoids IP blocks!

For best results, use a backconnect rotating proxy service to keep switching assigned IPs automatically.

Tradeoffs of Using Selenium

While Selenium is extremely powerful, it also comes with some downsides including:

Performance Overhead – Controlling real browsers is slower than making direct HTML requests. Plan for longer page load times.

Resource Intensive – Browsers are memory and compute heavy compared to lean Scrapy spiders.

Config Complexity – With great flexibility comes more DevOps overhead to manage drivers, binaries etc.

GUI Overhead – Browser GUIs add little value for scraping yet consume significant resources, so you’ll almost always run headless.

Scalability Limits – Distributing Selenium across multiple machines is harder than scaling plain Scrapy spiders.

This is where services like Scrapy Cloud help simplify production management.

Or you can investigate alternative browser automation libraries like Playwright.

But for most standard use cases, Selenium + Scrapy offers the best blend of capabilities and ease of use.

Considerations for Production

Here are some best practices when deploying scrapy-selenium pipelines to production:

  • Containerize environments with Docker for easy portability and deployment
  • Leverage tools like Selenium Grid for distributed execution
  • Enable headless mode and disable images/media for performance (see the sketch after this list)
  • Use proxies and rotate IPs/User-Agents to avoid blocks
  • Monitor resource usage, failures and errors closely
  • Set up automatic restarts in case of crashes
  • Consider moving to a fully managed scraping platform over time
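
For instance, the headless/media bullet often boils down to a couple of extra Chrome arguments in settings.py – a sketch (flag support can vary across Chrome versions):

SELENIUM_DRIVER_ARGUMENTS = [
    '--headless',
    '--disable-gpu',                         # avoid GPU overhead in headless runs
    '--blink-settings=imagesEnabled=false',  # skip downloading images
]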

And that’s a wrap! You are now equipped to harness the power of Selenium to build rock solid Scrapy spiders for even the most complex JavaScript-driven sites!

Key Takeaways

  • JavaScript usage has exploded, making HTTP-only frameworks like Scrapy insufficient on their own
  • Selenium bridges this gap by controlling real browsers like Chrome
  • Integrate via the scrapy-selenium middleware bridge
  • Create SeleniumRequest objects and interact with the page via the driver in response.meta
  • This lets you scrape infinite scroll pages, click elements and submit forms
  • Configure Luminati/Smartproxy proxies for IP rotation
  • Containerize deployments and monitor resource usage
