How to Scrape with Scrapy Splash
Splash provides the missing browser piece that takes Scrapy web scraping to the next level. By controlling a headless browser with Splash, you can render and interact with the modern web – no matter how much JavaScript a site uses.
In this guide, you'll gain practical expertise in integrating and leveraging Splash in your Scrapy spiders. I'll share advanced techniques from my years as a web scraping specialist, so you can overcome scrapers' biggest headaches.
Laying the Foundation with Prerequisites
Before utilizing Splash, you need a proper environment setup. Scrapy and Splash rely on specific components working together:
Python 3 – The latest Python 3 release includes critical bug fixes, optimizations, and features to support modern scraping. I highly recommend Python 3.10 or higher.
pip – The PyPA recommended tool to install Python packages. Make sure you have the latest version.
Basic Scrapy knowledge – You should understand spiders, crawling, and parsing before diving into more advanced Splash usage.
Docker Desktop – Docker containers provide isolated environments to easily run services like Splash. The convenience of the Desktop app is well worth installing.
Take the time to install and learn these prerequisites now, before trying to use Splash. It'll save you a lot of headaches debugging fundamental environment issues down the road!
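If you like, you can sanity-check the basics from Python itself – a quick, purely illustrative sketch that catches the most common setup gaps:

import shutil
import sys

# Python 3.10+ is recommended for modern Scrapy releases
assert sys.version_info >= (3, 10), "Upgrade to Python 3.10 or newer"

# The Docker CLI must be on PATH to run the Splash container later
assert shutil.which("docker") is not None, "Install Docker Desktop first"

print("Environment looks good")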
Next up, running the browser automation service…
Setting Up the Headless Browser Service
The Splash project provides an official Docker image to launch their browser instance. This neatly bundles everything you need in an isolated container:
- Splash itself
- The QtWebKit rendering engine
- Supporting libraries such as PyQt5
- Python bindings to control it all
It's by far the fastest way to get Splash running.
Make sure Docker Desktop is launched, then pull the image:
docker pull scrapinghub/splash
Wait for the download to complete (it's ~600MB).
With the image ready, start a container that exposes the Splash service on port 8050:
docker run -p 8050:8050 scrapinghub/splash
Visit http://localhost:8050 and you should see Splash's web UI, confirming the browser service is up.
We now have a headless browser ready to automate!
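Before wiring up Scrapy, you can also confirm rendering works by calling Splash's HTTP API directly – a minimal sketch using the requests library, with https://example.com as a stand-in URL:

import requests

# Ask Splash to render a page and return the resulting HTML
resp = requests.get(
    "http://localhost:8050/render.html",
    params={"url": "https://example.com", "wait": 1},
    timeout=60,
)
resp.raise_for_status()
print(resp.text[:500])  # first 500 characters of the rendered HTML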
With Splash running, we can shift focus to the Python integration…
Connecting Scrapy and Splash
To leverage Splash programmatically for browser automation, you need:
- The Python package to interface with the service
- Middleware configured in Scrapy settings
Use pip to install the scrapy-splash package:
pip install scrapy-splash
This provides a SplashRequest class and other helpers to communicate with the browser API.
Then configure settings.py so Scrapy knows how to access Splash:
SPLASH_URL = 'http://localhost:8050'

DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}
This registers the middleware needed to redirect requests through Splash, and sets the URL of your local instance.
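The scrapy-splash documentation also recommends a deduplication spider middleware and a Splash-aware dupe filter, so identical render requests aren't re-queued. Adding these to settings.py is optional but worthwhile:

SPIDER_MIDDLEWARES = {
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}

DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'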
With that wiring in place, Scrapy spiders can now leverage Splash to render pages. Nice!
Using SplashRequest to Render Pages
The SplashRequest class gives your Scrapy spiders control over the headless browser. It behaves similarly to Scrapy's built-in Request, but performs an isolated browser request and returns the rendered HTML.
For example:
import scrapy
from scrapy_splash import SplashRequest


class MySpider(scrapy.Spider):
    name = 'myspider'

    def start_requests(self):
        yield SplashRequest(
            url="https://dynamicpage.com",
            callback=self.parse,
        )

    def parse(self, response):
        # Scrape the rendered page
        for product in response.css('div.product'):
            yield {
                'name': product.css('h3::text').get(),
                'price': product.css('p.price::text').get(),
            }
This mimics a regular Scrapy crawler, but SplashRequest handles loading the site in Splash's WebKit-based browser, running any necessary JavaScript, and returning the fully rendered HTML.
Your spider then parses the complete DOM – including dynamically loaded content!
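SplashRequest also accepts rendering options through args – for example, a wait so JavaScript has time to finish, or a timeout for hung renders. A minimal sketch (the URL and values are placeholders):

yield SplashRequest(
    url="https://dynamicpage.com",   # placeholder URL
    callback=self.parse,
    args={
        'wait': 2,       # give JavaScript two seconds to settle
        'timeout': 30,   # abort renders that take too long
    },
)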
Let's look at more ways to control the automated browser…
Scripting Browser Interactions in Lua
The rendered HTML from SplashRequest alone is a game changer. But often you need to simulate complex user actions like:
- Scrolling through infinitely loading feeds
- Filling out and submitting forms
- Clicking elements or links to expand content
Splash makes this possible using Lua scripts for browser automation.
For example, here is a script to scroll down a page:
function main(splash)
    splash:go(splash.args.url)
    local num_scrolls = 10
    local scroll_to = splash:jsfunc("window.scrollTo")
    local get_body_height = splash:jsfunc("function() {return document.body.scrollHeight;}")
    for _ = 1, num_scrolls do
        scroll_to(0, get_body_height())
        splash:wait(1)
    end
    return splash:html()
end
It leverages Splash's jsfunc method to expose the JavaScript scroll and page-height functions to Lua. By calling them in a loop, you can scroll any number of times.
To use a Lua script, provide it in the lua_source argument:
yield SplashRequest(
    url=url,
    callback=self.parse,
    endpoint='execute',
    args={
        'lua_source': lua_script
    }
)
The script runs within the context of the browser page, allowing complex interaction logic.
JavaScript can also be executed using js_source instead, for quick manipulation without Lua.
For example:
// Click the first product
document.querySelectorAll("div.product")[0].click();
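From a spider, that snippet can be passed through the js_source argument so Splash runs it before returning the HTML – a minimal sketch, with the selector and URL carried over as placeholders:

js = 'document.querySelectorAll("div.product")[0].click();'

yield SplashRequest(
    url=url,
    callback=self.parse,
    args={
        'js_source': js,  # executed in the page after it loads
        'wait': 1,        # give any triggered content time to render
    },
)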
So between Lua scripts and raw JS, you can simulate virtually any user action.
Scraper-Friendly Proxies for Avoiding Blocks
Dynamic content support makes Splash incredibly powerful. However, complex sites also frequently employ strict anti-bot defenses.
Aggressively blocking IP addresses is a common tactic to stop scrapers – often after you've successfully rendered just a page or two.
Proxy rotation is necessary to avoid these blocks when using Splash. Each request should come from a different residential IP, ideally spread across different geographic locations.
Rotating free public proxies seems an easy fix, but these have severe downsides for web scraping:
- Very unstable, often going offline mid-request
- Get blacklisted frequently, rendering them useless
- No geography control, so you may keep hitting the site from the same area
- Major speed penalties from constantly hunting for working ones
A paid proxy service like BrightData solves these problems, providing:
- 72+ million fresh residential IPs to rotate
- 99.99% uptime with automatic failover
- Targeting specific cities, states, or countries
- Blazing fast speeds by caching proxy assignments
They also plug straight into Splash requests through the proxy argument (the host, port, and credentials below are placeholders – substitute your own zone details):

# Build the proxy URL for your BrightData zone (all values are placeholders)
proxy = "http://USERNAME:PASSWORD@your-proxy-host:22225"

yield SplashRequest(
    url=url,
    callback=self.parse,
    args={'proxy': proxy},  # Splash routes the render through this proxy
)
This rotates to a new IP for every rendered page. By appearing as thousands of worldwide visitors, you avoid triggering the site's anti-bot protections.
BrightData also has a built-in browser engine to further mimic users, as well as CAPTCHA solving APIs. Well worth checking out!
Take Splash Further with a Scraping Platform
Between orchestrating Docker, Scrapy, proxies, parsing, storing data, and monitoring – web scraping involves juggling many components.
An integrated platform like ParseHub brings everything together into one workflow.
The visual interface and templates make it easy to model complex sites. Under the hood, ParseHub uses Splash with a pool of 30,000 proxies for rendering and bot prevention.
You get out-of-box support for:
- JavaScript sites and single page applications
- Data exports to CSV, JSON, databases
- Scheduled crawls and incremental scraping
- Performance tracking and alerts
Definitely check them out to level up your Splash scraping!
Let's Recap Scrapy Splash Essentials
Wow, that was a lot of in-depth Splash guidance! Let's recap the key points:
- Splash provides a headless browser to render JavaScript pages – Critical for scraping modern sites.
- The Docker image bundles all dependencies – the QtWebKit engine, PyQt5, and the Splash HTTP API.
- Lua scripting controls browser interactions – Scroll, click, wait, fill forms, etc.
- JavaScript execution supports advanced logic – run raw JS via js_source for quick tweaks.
- Proxy rotation prevents blocking – Residential IPs like BrightData are ideal.
- Platforms take care of orchestration – ParseHub delivers the full package.
Splash truly elevates Scrapy to support the dynamic web. I hope all these tips help you start leveraging it for your own projects!