Playwright is a browser automation library, used here through its Python package, that lets you control Chromium, Firefox and WebKit with a single API. It can speed up web scraping and testing by blocking unnecessary resources so that only what you need is loaded.
In this comprehensive guide, you'll learn:
- How Playwright makes it easy to track network requests
- Techniques to block different resource types
- Ways to measure the performance boost
Let's dive in!
To follow along, you'll need:
- Python 3 installed on your machine (often pre-installed)
- Playwright and browser binaries for Chromium, Firefox and WebKit:
```shell
pip install playwright
playwright install
```
Intro to Playwright
With just a few lines of code, Playwright allows you to:
- Launch a headless browser
- Navigate to a web page
- Extract information
Here's an example to open Chromium, go to a page and print the title:
```python
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    # Launch headless Chromium and open a new page
    browser = p.chromium.launch()
    page = browser.new_page()

    # Navigate and print the page title
    page.goto("https://www.example.com/")
    print(page.title())

    page.close()
    browser.close()
```
This simplicity makes Playwright a popular choice for web automation and scraping projects.
Logging Network Events
To understand what resources a page loads, you can log the requests and responses.
Add these event handlers before navigating:
```python
page.on("request", lambda req: print(">>", req.method, req.url))
page.on("response", lambda res: print("<<", res.status, res.url))
```
When you load a page, it'll print information like:
```
>> GET https://www.example.com/
<< 200 https://www.example.com/
>> GET https://cdn.example.com/header.png
<< 200 https://cdn.example.com/header.png
```
For a typical site, this output will be dozens of lines long, with many requests for images, stylesheets, scripts, etc.
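With that much output, it can help to tally requests by resource type instead of reading every line. The sketch below is one way to do that. `make_request_counter` is a hypothetical helper name, not part of Playwright; it relies only on the `resource_type` attribute of the request object passed to the handler.

```python
from collections import Counter


def make_request_counter():
    """Return (handler, counts); the handler tallies requests by resource type.

    Hypothetical helper for illustration, not part of Playwright.
    """
    counts = Counter()

    def on_request(request):
        # Playwright request objects expose resource_type,
        # e.g. "document", "image", "script", "stylesheet", "font".
        counts[request.resource_type] += 1

    return on_request, counts
```

Assuming a `page` created as in the earlier example, register the handler with `page.on("request", on_request)` before calling `page.goto(...)`, then inspect `counts.most_common()` afterwards to see which resource types dominate.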
Most of these requests are irrelevant for scraping. Next, let's see how to block unnecessary resources and speed things up!
Blocking by Glob Pattern
One way to block resources is to use glob patterns with the page.route() method.
For example, to block SVG images, you can use:
```python
page.route("**/*.svg", lambda route: route.abort())
```
This will match any URL ending with “.svg” and abort the request.
When you load the page now, any SVG images won't be downloaded.
Blocking by Regex
For more flexibility, you can use regular expressions to match routes.
Say you want to block JPG, PNG and SVG image requests:
```python
import re

page.route(re.compile(r".*\.(jpg|png|svg)$"), lambda route: route.abort())
```
The regex will match and abort routes for those image extensions.
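One caveat worth checking: a pattern anchored with `$` only matches when the extension is the very end of the URL, so image URLs that carry a query string or an upper-case extension slip through. The sketch below compares the pattern from the text with a more lenient variant. The lenient pattern and the sample URLs are assumptions for illustration, and the check uses `search` on the assumption that it mirrors how the pattern is applied to request URLs.

```python
import re

# Pattern from the text: the extension must be the very end of the URL.
strict = re.compile(r".*\.(jpg|png|svg)$")

# Assumed variant: also accept a trailing query string, and ignore case.
lenient = re.compile(r"\.(jpg|png|svg)(\?.*)?$", re.IGNORECASE)

# Hypothetical sample URLs, not taken from a real site.
urls = [
    "https://cdn.example.com/header.png",
    "https://cdn.example.com/header.png?v=2",
    "https://cdn.example.com/LOGO.SVG",
    "https://www.example.com/",
]

strict_hits = [u for u in urls if strict.search(u)]
lenient_hits = [u for u in urls if lenient.search(u)]

print(len(strict_hits), len(lenient_hits))
```

Either compiled pattern can be passed to `page.route(...)` in place of the one above; which is appropriate depends on how the target site names its assets.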
Blocking by Resource Type
An even better approach is to block by resource type directly.
The route.request object exposes a resource_type field that you can check.
For example, to block all images:
```python
# "**/*" is the glob that matches every URL
page.route(
    "**/*",
    lambda route: route.abort()
    if route.request.resource_type == "image"
    else route.continue_(),
)
```
This will match all routes, and abort any that are image type.
You can block stylesheet, script, font and other resource types this way too.
Instead of inline lambdas, you can define handler functions for better reusability:
```python
def block_images(route):
    # Abort image requests, let everything else through
    if route.request.resource_type == "image":
        route.abort()
    else:
        route.continue_()


page.route("**/*", block_images)
```
You can make this more aggressive by blocking a list of unnecessary resource types:
```python
excluded = ["stylesheet", "script", "image", "font"]


def block_unneeded(route):
    # Abort any request whose type is in the excluded list
    if route.request.resource_type in excluded:
        route.abort()
    else:
        route.continue_()


page.route("**/*", block_unneeded)
```
And to only allow the HTML document through:
```python
def block_all_but_html(route):
    # Only the main HTML document is allowed to load
    if route.request.resource_type != "document":
        route.abort()
    else:
        route.continue_()


page.route("**/*", block_all_but_html)
```
This gives you complete control over what gets loaded!
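The handlers above differ only in which resource types they reject, so they can be folded into one factory. The following is a minimal sketch under that assumption; `make_blocker` and its `stats` dictionary are hypothetical names introduced here, and the handler uses only `route.abort()`, `route.continue_()` and `route.request.resource_type`.

```python
def make_blocker(blocked_types):
    """Build a route handler that aborts requests whose type is in blocked_types.

    Hypothetical helper for illustration, not part of Playwright.
    Returns (handler, stats), where stats counts aborted and continued requests.
    """
    blocked = frozenset(blocked_types)
    stats = {"aborted": 0, "continued": 0}

    def handler(route):
        if route.request.resource_type in blocked:
            stats["aborted"] += 1
            route.abort()
        else:
            stats["continued"] += 1
            route.continue_()

    return handler, stats
```

For example, `handler, stats = make_blocker({"image", "font"})` followed by `page.route("**/*", handler)` would block images and fonts, and `stats` can be printed after the page loads to see how many requests were dropped.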
Measuring the Boost
Blocking resources improves performance. But how much exactly?
Let's look at some ways to measure and quantify the benefits.
Using HAR Files
HAR (HTTP Archive) files record details about all network requests made by a page.
To generate a HAR file with Playwright:
```python
# The HAR file is saved when the page (and its context) is closed
page = browser.new_page(record_har_path="playwright.har")
```
You can then import the HAR file into Chrome DevTools to analyze performance:
Compare the Waterfall, number of requests, and data transferred between a normal page load and one with blocked resources.
You'll clearly see the improvement from blocking unnecessary resources!
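If you prefer numbers to a visual comparison, a HAR file is plain JSON and can be summarized with the standard library. The sketch below assumes the HAR 1.2 layout, where each entry under `log.entries` carries a `response.content.size` field; `summarize_har` and `summarize_har_file` are hypothetical helper names.

```python
import json


def summarize_har(har):
    """Return request count and total response content bytes for a parsed HAR dict.

    Assumes the HAR 1.2 layout (log.entries[].response.content.size).
    Sizes that are missing or recorded as -1 (unknown) are counted as 0.
    """
    entries = har.get("log", {}).get("entries", [])
    total = 0
    for entry in entries:
        size = entry.get("response", {}).get("content", {}).get("size", 0)
        if isinstance(size, (int, float)) and size > 0:
            total += size
    return {"requests": len(entries), "content_bytes": total}


def summarize_har_file(path):
    """Load a HAR file from disk and summarize it."""
    with open(path, encoding="utf-8") as f:
        return summarize_har(json.load(f))
```

Running `summarize_har_file` on one HAR recorded without blocking and one recorded with blocking gives two small dictionaries that are easy to compare side by side.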
Browser Performance API
We can also use the Browser Performance API to get precise timing metrics:
```python
page.goto("https://example.com")
perf_data = page.evaluate("JSON.stringify(window.performance)")
```
This returns a JSON string containing navigation start and end times, response end, DOM parsing timestamps and more.
Compute loadEventEnd - navigationStart to see how much faster the page loads with blocked resources.
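That subtraction can be wrapped in a small helper. This is a sketch that assumes the serialized object contains a `timing` entry with the legacy Navigation Timing fields named above; `load_time_ms` is a hypothetical name, and it returns `None` rather than a misleading number when the fields are missing or the load event had not finished.

```python
import json


def load_time_ms(perf_json):
    """Return loadEventEnd - navigationStart in milliseconds, or None.

    perf_json is the string produced by JSON.stringify(window.performance).
    Assumes it carries a "timing" entry; returns None if that entry is
    missing or loadEventEnd was still 0 when the snapshot was taken.
    """
    timing = json.loads(perf_json).get("timing") or {}
    start = timing.get("navigationStart", 0)
    end = timing.get("loadEventEnd", 0)
    if not start or not end:
        return None
    return end - start
```

Calling `load_time_ms(perf_data)` once for a normal load and once with a blocking route installed gives two figures to compare; averaging several runs should smooth out network noise.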
For more advanced performance metrics, you can use Playwright's Chrome DevTools Protocol (CDP) interface, which is available only in Chromium-based browsers.
Create a CDP session from the page context:
```python
client = page.context.new_cdp_session(page)
client.send("Performance.enable")
page.goto("https://example.com")
metrics = client.send("Performance.getMetrics")
```
This will return low-level performance data including:
- DOM Nodes
- Layout Count
- Recalc Style Count
- JS Event Listeners
- JS Heap Size used
Many other granular metrics are also available to quantify the improvements from blocking resources.
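To compare two runs, it helps to flatten the result into a dictionary first. The sketch below assumes the `Performance.getMetrics` response has the shape `{"metrics": [{"name": ..., "value": ...}, ...]}`; `metrics_to_dict` and `diff_metrics` are hypothetical helper names.

```python
def metrics_to_dict(result):
    """Flatten a Performance.getMetrics result into {name: value}.

    Assumes the response shape {"metrics": [{"name": ..., "value": ...}]}.
    """
    return {m["name"]: m["value"] for m in result.get("metrics", [])}


def diff_metrics(baseline, blocked):
    """Return blocked - baseline for every metric present in both runs."""
    return {
        name: blocked[name] - baseline[name]
        for name in baseline
        if name in blocked
    }
```

With `baseline = metrics_to_dict(metrics)` from a normal load and `blocked` from a load with a blocking route, negative values in `diff_metrics(baseline, blocked)` indicate metrics that dropped once resources were blocked.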
Here are the core concepts to remember:
- Logging network requests helps identify resources loaded
- Route handlers allow blocking by pattern, regex or resource type
- Measure speedup with HAR files, performance API and CDP metrics
Unnecessary resources just waste bandwidth and slow things down.
With Playwright's blocking capabilities, you can scrape and test sites much faster by eliminating anything irrelevant.
Put these techniques into practice for your next web automation project!