How to Block Resources in Playwright

Playwright is a browser automation library, available for Python among other languages, that drives Chromium, Firefox and WebKit through a single API. Blocking unnecessary resources and loading only what you need can speed up web scraping and testing.

In this comprehensive guide, you'll learn:

  • How Playwright makes it easy to track network requests
  • Techniques to block different resource types
  • Ways to measure the performance boost

Let's dive in!

Prerequisites

To follow along, you'll need:

  • Python 3 installed on your machine (often pre-installed)
  • Playwright and browser binaries for Chromium, Firefox and WebKit:

pip install playwright
playwright install

Intro to Playwright

With just a few lines of code, Playwright allows you to:

  • Launch a headless browser
  • Navigate to a web page
  • Extract information

Here's an example to open Chromium, go to a page and print the title:

from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    # Launch Chromium (headless by default) and open a new page
    browser = p.chromium.launch()
    page = browser.new_page()

    # Navigate and print the page title
    page.goto("https://www.example.com/")
    print(page.title())

    page.close()
    browser.close()

This simplicity makes Playwright a popular choice for web automation and scraping projects.

Logging Network Events

To understand what resources a page loads, you can log the requests and responses.

Add these event handlers before navigating:

page.on("request", lambda req: print(">>", req.method, req.url)) 

page.on("response", lambda res: print("<<", res.status, res.url))

When you load a page, it'll print information like:

>> GET https://www.example.com/
<< 200 https://www.example.com/
>> GET https://cdn.example.com/header.png 
<< 200 https://cdn.example.com/header.png

For a typical site, this output will be dozens of lines long, with many requests for images, stylesheets, scripts, etc.
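Reading dozens of log lines by eye gets tedious, so a tally per resource type can be easier to act on. The following is a minimal sketch, assuming `page` is a Playwright page like the one created earlier; the helper name `tally_resource_types` is hypothetical and not part of Playwright.

```python
from collections import Counter


def tally_resource_types(page, url):
    """Load url and count the requests it triggers, grouped by resource type."""
    counts = Counter()
    # Register the listener before navigating so early requests are counted too.
    page.on("request", lambda req: counts.update([req.resource_type]))
    page.goto(url)
    return counts
```

Calling `tally_resource_types(page, "https://www.example.com/").most_common()` would list the resource types from most to least frequent, which hints at what is worth blocking.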

Most scraping jobs don't need all of this. Next, let's see how to block unnecessary resources and speed things up!

Blocking by Glob Pattern

One way to block resources is by using glob patterns with the page.route() method.

For example, to block SVG images, you can use:

page.route("**/*.svg", lambda route: route.abort())

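This will match any URL ending with ".svg" and abort the request. The pattern is compared against the full URL, so a URL with a query string after the extension, such as logo.svg?v=2, would not match.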

When you load the page now, any SVG images won't be downloaded.

Blocking by Regex

For more flexibility, you can use regular expressions to match routes.

Say you want to block JPG, PNG and SVG image requests:

import re

page.route(re.compile(r".*\.(jpg|png|svg)$"), lambda route: route.abort())

The regex will match and abort routes for those image extensions.
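Real image URLs often carry a query string or an upper-case extension, which the pattern above would miss. Here is a hedged variation, not taken from the original pattern: the name `IMAGE_URL`, the extra `.jpeg` extension and the helper `block_image_urls` are all assumptions made for illustration.

```python
import re

# Assumed variation: also matches .jpeg, ignores case, and tolerates
# a trailing query string such as ?v=2.
IMAGE_URL = re.compile(r".*\.(jpe?g|png|svg)(\?.*)?$", re.IGNORECASE)


def block_image_urls(page):
    """Abort every request whose URL matches IMAGE_URL."""
    page.route(IMAGE_URL, lambda route: route.abort())
```

Because the pattern is an ordinary `re` object, you can check it against sample URLs with `IMAGE_URL.search(url)` before wiring it into a route.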

Blocking by Resource Type

A more robust approach is to block by resource type directly, since it doesn't depend on the URL having a recognizable file extension.

The route.request object exposes a resource_type property that you can check.

For example, to block all images:

page.route("**/*", lambda route: route.abort() if route.request.resource_type == "image" else route.continue_())

The **/* pattern matches every request; the handler aborts those whose type is image and lets all others continue.

You can block stylesheet, script, font and other resource types this way too.

Function Handler

Instead of inline lambdas, you can define handler functions for better reusability:

def block_images(route):
    # Abort image requests, let everything else through
    if route.request.resource_type == "image":
        route.abort()
    else:
        route.continue_()

page.route("**/*", block_images)

You can make this more aggressive by blocking a list of unnecessary resource types:

excluded = ["stylesheet", "script", "image", "font"]

def block_unneeded(route):
    # Abort any request whose type is in the excluded list
    if route.request.resource_type in excluded:
        route.abort()
    else:
        route.continue_()

page.route("**/*", block_unneeded)

And to only allow the HTML document through:

def block_all_but_html(route):
    # Only the main HTML document is allowed to load
    if route.request.resource_type != "document":
        route.abort()
    else:
        route.continue_()

page.route("**/*", block_all_but_html)

This gives you complete control over what gets loaded!
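The handlers above differ only in which resource types they abort, so one factory could generate them. This is a sketch under that assumption; `make_blocker` and `block_types` are hypothetical helper names, and `page` is expected to be a Playwright page.

```python
def make_blocker(blocked_types):
    """Return a route handler that aborts requests of the given resource types."""
    blocked = frozenset(blocked_types)

    def handler(route):
        if route.request.resource_type in blocked:
            route.abort()
        else:
            route.continue_()

    return handler


def block_types(page, blocked_types):
    """Install the handler for every request the page makes."""
    page.route("**/*", make_blocker(blocked_types))
```

For example, `block_types(page, ["image", "font"])` would behave like a handler that blocks images and fonts. Keeping the blocked types as a parameter makes it easy to try different combinations when measuring.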

Measuring the Boost

Blocking resources improves performance. But how much exactly?

Let's look at some ways to measure and quantify the benefits.

Using HAR Files

HAR (HTTP Archive) files record details about all network requests made by a page.

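To generate a HAR file with Playwright, pass record_har_path when creating the page. The file is written once the page's context is closed, so close the page before reading it: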

page = browser.new_page(record_har_path="playwright.har")

You can then import the HAR file into Chrome DevTools to analyze performance.

Compare the waterfall, the number of requests, and the data transferred between a normal page load and one with blocked resources.

The drop in request count and transferred data shows how much the blocking saved.
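The same comparison can be done in code, since a HAR file is plain JSON. The sketch below assumes the standard HAR layout (`log.entries[].response.bodySize`); the function names `summarize_har` and `summarize_har_file` are hypothetical.

```python
import json


def summarize_har(har):
    """Return (request count, total response body bytes) for a parsed HAR dict."""
    entries = har["log"]["entries"]
    # bodySize is -1 when the size is unknown, so count only non-negative values.
    total = sum(max(e["response"].get("bodySize", 0), 0) for e in entries)
    return len(entries), total


def summarize_har_file(path):
    """Load a HAR file from disk and summarize it."""
    with open(path, encoding="utf-8") as f:
        return summarize_har(json.load(f))
```

Running `summarize_har_file("playwright.har")` for one recording with blocking and one without gives two pairs of numbers to compare directly.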

Browser Performance API

We can also use the browser's Performance API to get precise timing metrics:

page.goto("https://example.com")

perf_data = page.evaluate("JSON.stringify(window.performance)")

This returns a JSON string whose timing section holds timestamps for navigation start, response end, DOM parsing, the load event and more.

Subtract timing.navigationStart from timing.loadEventEnd to get the load time in milliseconds, then compare the result with and without blocked resources.
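Since the value comes back as a string, it needs parsing before any arithmetic. A minimal sketch, assuming the serialized object contains a `timing` section with `navigationStart` and `loadEventEnd`; the helper name `load_time_ms` is hypothetical.

```python
import json


def load_time_ms(perf_json):
    """Return loadEventEnd - navigationStart in milliseconds, or None if unavailable."""
    timing = json.loads(perf_json).get("timing", {})
    start = timing.get("navigationStart", 0)
    end = timing.get("loadEventEnd", 0)
    # loadEventEnd stays 0 until the load event has finished.
    if not start or not end:
        return None
    return end - start
```

With the snippet above, `load_time_ms(perf_data)` would give one number per run. Averaging several runs is a sensible precaution, because network timing varies between loads.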

CDP Metrics

For more advanced performance metrics, you can use Playwright's Chrome DevTools Protocol (CDP) interface. CDP sessions are only available in Chromium-based browsers.

Create a CDP session from the page context:

client = page.context.new_cdp_session(page)

client.send("Performance.enable")
page.goto("https://example.com")

metrics = client.send("Performance.getMetrics")

This will return low-level performance data including:

  • DOM Nodes
  • Layout Count
  • Recalc Style Count
  • JS Event Listeners
  • Documents
  • JS Heap Size used

Many other granular metrics are available to quantify the improvements from blocking resources.
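The response is easier to work with once flattened into a dictionary. The sketch below assumes the response has the shape `{"metrics": [{"name": ..., "value": ...}, ...]}` and that metric names such as `Nodes`, `LayoutCount` and `JSHeapUsedSize` are present; both helper names are hypothetical.

```python
def metrics_to_dict(response):
    """Flatten a Performance.getMetrics response into {name: value}."""
    return {m["name"]: m["value"] for m in response["metrics"]}


def compare_metrics(before, after, names=("Nodes", "LayoutCount", "JSHeapUsedSize")):
    """Return {name: after - before} for the chosen metric names."""
    return {n: after[n] - before[n] for n in names if n in before and n in after}
```

Collecting `metrics_to_dict(metrics)` once without blocking and once with it, then passing both to `compare_metrics`, yields the change per metric; negative values mean the blocked run did less work.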

Key Takeaways

Here are the core concepts to remember:

  • Logging network requests helps identify resources loaded
  • Route handlers allow blocking by pattern, regex or resource type
  • Measure speedup with HAR files, performance API and CDP metrics

Unnecessary resources just waste bandwidth and slow things down.

With Playwright's blocking capabilities, you can scrape and test sites much faster by eliminating anything irrelevant.

Put these techniques into practice for your next web automation project!
