How to Scrape with Scrapy Splash

Splash provides the missing browser piece that takes Scrapy web scraping to the next level. By controlling a headless browser with Splash, you can render and interact with the modern web – no matter how much JavaScript a site uses.

In this guide, you'll gain true expertise in integrating and leveraging Splash in your Scrapy spiders. I'll share advanced techniques based on my years as a web scraping specialist, so you can overcome scraping's biggest headaches.

Laying the Foundation with Prerequisites

Before utilizing Splash, you need a proper environment setup. Scrapy and Splash rely on specific components working together:

Python 3 – The latest Python 3 release includes critical bug fixes, optimizations, and features to support modern scraping. I highly recommend Python 3.10 or higher.

pip – The PyPA recommended tool to install Python packages. Make sure you have the latest version.

Basic Scrapy knowledge – You should understand spiders, crawling, and parsing before diving into more advanced Splash usage.

Docker Desktop – Docker containers provide isolated environments to easily run services like Splash. The convenience of the Desktop app makes it well worth installing.

Take the time to get these prerequisites installed and learn them now, before trying to use Splash. It'll save you a lot of headaches debugging fundamental environment issues down the road!
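If Scrapy itself isn't installed yet, pip handles that too:

pip install scrapy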

Next up, running the browser automation service…

Setting Up the Headless Browser Service

The Splash project provides an official Docker image to launch their browser instance. This neatly bundles everything you need in an isolated container:

  • Splash itself
  • The QtWebKit rendering engine
  • Supporting libraries like PyQt5
  • Python bindings to control it all

It's by far the fastest way to get Splash running.

Make sure Docker Desktop is launched, then pull the image:

docker pull scrapinghub/splash

Wait for the download to complete (it's ~600MB).

With the image ready, start a container that exposes the Splash service on port 8050:

docker run -p 8050:8050 scrapinghub/splash

Visit http://localhost:8050 and you should see the Splash welcome page in your browser.

We now have a headless browser ready to automate!
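As a quick smoke test (a minimal sketch, assuming the requests package is installed), you can hit the render.html endpoint directly and check that Splash returns rendered HTML:

import requests

# Ask Splash to load a page, wait a second for JavaScript, and return the HTML
resp = requests.get(
    'http://localhost:8050/render.html',
    params={'url': 'https://example.com', 'wait': 1},
)
print(resp.status_code)  # 200 means the render succeeded
print(len(resp.text))    # size of the rendered HTML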

With Splash running, we can shift focus to the Python integration…

Connecting Scrapy and Splash

To leverage Splash programmatically for browser automation, you need:

  1. The Python package to interface with the service
  2. Middleware configured in Scrapy settings

Use pip to install the scrapy-splash package:

pip install scrapy-splash

This provides a SplashRequest class and other helpers to communicate with the browser API.

Then configure settings.py so Scrapy knows how to access Splash:

SPLASH_URL = 'http://localhost:8050'  

DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}

This registers the middleware needed to redirect requests through Splash, and sets the URL of your local instance.
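The scrapy-splash README also recommends registering its spider middleware and Splash-aware dupe filter, so that equivalent Splash requests are correctly deduplicated:

SPIDER_MIDDLEWARES = {
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}

DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'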

With that wiring in place, Scrapy spiders can now leverage Splash to render pages. Nice!

Using SplashRequest to Render Pages

The SplashRequest class gives your Scrapy spiders control over the headless browser. It behaves similarly to Scrapy's built-in Request, but performs an isolated browser request and returns the rendered HTML.

For example:

import scrapy
from scrapy_splash import SplashRequest

class MySpider(scrapy.Spider):
    name = 'myspider'

    def start_requests(self):
        yield SplashRequest(
            url="https://dynamicpage.com",
            callback=self.parse,
        )

    def parse(self, response):
        # Scrape the rendered page
        for product in response.css('div.product'):
            yield {
                'name': product.css('h3::text').get(),
                'price': product.css('p.price::text').get(),
            }

This mimics a regular Scrapy crawler, but SplashRequest handles loading the site in Splash's headless browser, running any necessary JavaScript, and returning the fully rendered HTML.

Your spider then parses the complete DOM – including dynamically loaded content!
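You can also tune how Splash renders through its arguments. For example, the standard wait argument pauses after the page loads so asynchronous JavaScript has time to populate the DOM:

yield SplashRequest(
    url="https://dynamicpage.com",
    callback=self.parse,
    args={'wait': 2},  # wait 2 seconds after loading before returning HTML
)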

Let's look at more ways to control the automated browser…

Scripting Browser Interactions in Lua

The rendered HTML from SplashRequest alone is a game changer. But often you need to simulate complex user actions like:

  • Scrolling through infinitely loading feeds
  • Filling out and submitting forms
  • Clicking elements or links to expand content

Splash makes this possible using Lua scripts for browser automation.

For example, here is a script to scroll down a page:

function main(splash)

  splash:go(splash.args.url)

  local num_scrolls = 10
  
  local scroll_to = splash:jsfunc("window.scrollTo")
  local get_body_height = splash:jsfunc("function() {return document.body.scrollHeight;}")

  for _ = 1, num_scrolls do
     scroll_to(0, get_body_height())
     splash:wait(1)
  end

  return splash:html()
end

It leverages Splash's jsfunc method to wrap the window.scrollTo call and a scrollHeight lookup as callable Lua functions. By running them in a loop, you can scroll as many times as needed.

To use a Lua script, provide it in the lua_source argument:

yield SplashRequest(
    url=url, 
    callback=self.parse,
    endpoint='execute',
    args={
        'lua_source': lua_script
    }
)

The script runs within the context of the browser page, allowing complex interaction logic.
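In practice, lua_script is just a Python string holding the Lua code. Here is a compact variant of the scrolling script from above (a sketch using splash:runjs instead of jsfunc):

lua_script = """
function main(splash)
    splash:go(splash.args.url)
    -- scroll to the bottom ten times, pausing for new content each time
    for _ = 1, 10 do
        splash:runjs("window.scrollTo(0, document.body.scrollHeight)")
        splash:wait(1)
    end
    return splash:html()
end
"""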

JavaScript can also be executed using js_source instead, for quick manipulation without Lua.

For example:

// Click the first product
document.querySelectorAll("div.product")[0].click();
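To send that snippet along with a request, pass it through the js_source argument (a short sketch; js_source is a standard argument of Splash's render endpoints and runs in the page context):

js_click = 'document.querySelectorAll("div.product")[0].click();'

yield SplashRequest(
    url="https://dynamicpage.com",
    callback=self.parse,
    endpoint='render.html',
    args={
        'js_source': js_click,  # executed in the page before the HTML is returned
        'wait': 1,              # give the click's effects a moment to render
    },
)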

So between Lua scripts and raw JS, you can simulate virtually any user action.

Scraper-Friendly Proxies for Avoiding Blocks

Dynamic content support makes Splash incredibly powerful. However, complex sites also frequently employ strict anti-bot defenses.

Aggressively blocking IP addresses is a common tactic to stop scrapers – often before you've successfully rendered more than a page or two.

Proxy rotation is necessary to avoid these blocks when using Splash. Each request should come from a different residential IP, ideally spread across different geographic locations.

Rotating free public proxies seems an easy fix, but these have severe downsides for web scraping:

  • Very unstable, often going offline mid-request
  • Get blacklisted frequently, rendering them useless
  • No geography control, so you may keep hitting the site from the same area
  • Major slowdowns from constantly hunting for working replacements

A paid proxy service like BrightData solves these problems, providing:

  • 72+ million fresh residential IPs to rotate
  • 99.99% uptime with automatic failover
  • Targeting specific cities, states, or countries
  • Blazing fast speeds by caching proxy assignments

And they work with Splash requests out of the box – pass the credentialed proxy URL from your BrightData dashboard through Splash's proxy argument:

# Placeholder – use the proxy endpoint and credentials from your BrightData zone
proxy = 'http://USERNAME:PASSWORD@PROXY_HOST:PORT'

yield SplashRequest(
    url=url,
    callback=self.parse,
    args={'proxy': proxy},  # Splash routes this render through the proxy
)

This routes every rendered page through a different IP. By appearing as thousands of worldwide visitors, you avoid triggering anti-bot protections.

BrightData also has a built-in browser engine to further mimic users, as well as CAPTCHA solving APIs. Well worth checking out!

Take Splash Further with a Scraping Platform

Between orchestrating Docker, Scrapy, proxies, parsing, storing data, and monitoring – web scraping involves juggling many components.

An integrated platform like ParseHub brings everything together into one workflow.

The visual interface and templates make it easy to model complex sites. Under the hood, ParseHub uses Splash with a pool of 30,000 proxies for rendering and bot prevention.

You get out-of-box support for:

  • JavaScript sites and single page applications
  • Data exports to CSV, JSON, databases
  • Scheduled crawls and incremental scraping
  • Performance tracking and alerts

Definitely check them out to level up your Splash scraping!

Let's Recap Scrapy Splash Essentials

Wow, that was a lot of in-depth Splash guidance! Let's recap the key points:

  • Splash provides a headless browser to render JavaScript pages – Critical for scraping modern sites.
  • The Docker image bundles all dependencies – the QtWebKit engine, PyQt5, and Splash itself.
  • Lua scripting controls browser interactions – Scroll, click, wait, fill forms, etc.
  • JavaScript execution supports advanced logic – run raw JS via js_source.
  • Proxy rotation prevents blocking – Residential IPs like BrightData are ideal.
  • Platforms take care of orchestration – ParseHub delivers the full package.

Splash truly elevates Scrapy to support the dynamic web. I hope all these tips help you start leveraging it for your own projects!
