How to Web Scrape with Playwright

Web scraping allows you to extract large amounts of data from websites automatically. However, many sites have implemented anti-scraping measures that can detect and block scrapers.

To overcome these protections, you need a browser automation tool like Playwright. With Playwright, you can mimic real user interactions to access dynamic content and evade detection.

In this comprehensive guide, you'll learn:

  • What Playwright is and why it's useful for web scraping
  • How to set up a Python Playwright scraper
  • Locating page elements and extracting text/data
  • Downloading images
  • Exporting scraped data to CSV
  • Navigating between pages
  • Taking automated screenshots

We'll also cover how Playwright compares to popular alternatives like Puppeteer and Selenium. Let's get started!

What is Playwright?

Playwright is an open-source browser automation library that lets you control Chromium, Firefox and WebKit browsers from code, which makes it a popular choice for web scraping.

It works across Windows, macOS, and Linux – making Playwright scripts highly portable. You can write scripts in JavaScript, Python, C# (.NET), and Java.

The key advantage of Playwright is that it enables headless browser automation. This means you can scrape dynamically rendered sites by programmatically interacting with pages the way a real user would.

For example, Playwright scripts can:

  • Click buttons
  • Scroll down to load more content
  • Fill out and submit forms

All of this happens behind the scenes, without launching a visible browser GUI.

Headless browsers are extremely fast because they don't render a graphical interface. Playwright has no built-in stealth mode, but community plugins such as playwright-stealth can mask automation fingerprints (navigator properties, WebRTC leaks and similar signals) to help avoid detection.

This makes Playwright well suited to large-scale web scraping, even on sites with anti-bot protections.
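
Here's a minimal sketch of those interactions. example.com stands in for a real target, and the commented-out selectors are hypothetical placeholders:

import asyncio
from playwright.async_api import async_playwright

async def demo():
    async with async_playwright() as p:
        # headless=True is the default; set headless=False to watch the browser
        browser = await p.chromium.launch(headless=True)
        page = await browser.new_page()
        await page.goto("https://example.com")

        # Simulated user interactions (placeholder selectors):
        # await page.click("button#load-more")             # click a button
        # await page.fill("input[name='email']", "a@b.co") # fill a form field
        await page.mouse.wheel(0, 1000)  # scroll down 1000 pixels

        await browser.close()

asyncio.run(demo())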

Installing & Launching Playwright with Python

We'll focus on Python code examples here, but the Playwright API is very similar across languages.

First install Playwright:

pip install playwright

Also install browser binaries for Chromium, Firefox and WebKit:

playwright install
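
If you only need one engine, you can install a single browser to keep the download small:

playwright install chromium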

Now import Playwright and launch a browser:

import asyncio
from playwright.async_api import async_playwright

async def main():
    async with async_playwright() as p:
        browser = await p.chromium.launch()
        # ... scraping logic goes here ...
        await browser.close()

asyncio.run(main())

This launches a Playwright-controlled Chromium instance in headless mode. Let's start scraping!

Step 1 – Locate Elements & Extract Text

Suppose we want to scrape the products listed on https://scrapeme.live/shop/. The first step is to analyze the page structure and identify CSS selectors for the data we want.

I want to grab:

  • Product name
  • Price
  • Image URL

Using developer tools, I can see product entries are under <li class="product">. Product names sit inside <h2> tags, prices within <span class="woocommerce-Price-amount">, and image URLs in the src attribute of <img> tags.

Let's grab these elements with Playwright selectors:

import asyncio
from playwright.async_api import async_playwright

async def scrape(page):
    # Get all product entries
    products = await page.query_selector_all("li.product")

    for product in products:
        name = await (await product.query_selector("h2")).inner_text()
        price = await (await product.query_selector("span.woocommerce-Price-amount")).text_content()
        image_url = await (await product.query_selector("img")).get_attribute("src")

        print({
            'name': name,
            'price': price,
            'img_url': image_url
        })

async def main():
    async with async_playwright() as p:
        browser = await p.chromium.launch()
        page = await browser.new_page()
        await page.goto("https://scrapeme.live/shop/")

        await scrape(page)

        await browser.close()

asyncio.run(main())

And we have successfully extracted key data points into a nice Python dictionary!
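
As a side note, newer Playwright versions favor the locator API, which auto-waits for elements. Here's the same extraction sketched with locators:

# Same extraction using Playwright's locator API
items = page.locator("li.product")
for i in range(await items.count()):
    item = items.nth(i)
    print({
        'name': await item.locator("h2").inner_text(),
        'price': await item.locator("span.woocommerce-Price-amount").text_content(),
        'img_url': await item.locator("img").get_attribute("src")
    })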

Step 2 – Downloading Images

What if you also wanted to download the product images locally?

We already have the image URLs. We just need a step that fetches each image and writes it to disk. A simple route is the page's built-in request context, which reuses the browser's cookies and headers:

async def scrape(page):

    # Same extraction logic as before
    products = await page.query_selector_all("li.product")

    for product in products:

        # Extract name, price, etc. as in Step 1
        # ...

        image_url = await (await product.query_selector("img")).get_attribute("src")

        # Fetch the image bytes through the page's request context
        response = await page.request.get(image_url)
        path = image_url.split("/")[-1]  # use the file name from the URL
        with open(path, "wb") as f:
            f.write(await response.body())

        print(f'Saved image to {path}')

        print({
            'name': name,
            'price': price,
            'img_url': image_url,
            'img_path': path
        })

Here we fetch each image directly with page.request.get() and write the bytes to disk ourselves; simulating a double-click on an inline image would not reliably trigger a browser download. For pages that expose real download links, Playwright also has a dedicated download API, sketched below.
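
A minimal sketch of that download API; the a.download-link selector is a hypothetical placeholder for an actual download link on your target page:

# Wait for the download triggered by clicking a download link
async with page.expect_download() as download_info:
    await page.click("a.download-link")  # hypothetical selector
download = await download_info.value
path = await download.path()  # path of the completed download
print(f'Saved file to {path}')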

Step 3 – Export Data to CSV

For data analytics, it's useful to export scraped data to a spreadsheet format like CSV.

First, let's create a Python list results = [] to store data entries. Then append dicts with each product's info:

results = []

async def scrape(page):

    products = await page.query_selector_all("li.product")

    for product in products:

        data = {
            'name': await (await product.query_selector("h2")).inner_text(),
            'price': await (await product.query_selector("span.woocommerce-Price-amount")).text_content(),
            # ...
        }

        results.append(data)

Finally, save the CSV:

import csv

with open('output.csv', 'w', newline='') as f:
    writer = csv.DictWriter(f, fieldnames=results[0].keys()) 
    writer.writeheader()  
    writer.writerows(results)

print('Saved CSV')

Now you've extracted web data and exported it to a spreadsheet-ready file, all without opening a visible browser window!
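
If you already use pandas, the same export is a one-liner (this assumes pandas is installed; Playwright doesn't require it):

import pandas as pd

pd.DataFrame(results).to_csv('output.csv', index=False)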

Step 4 – Browser Automation

A key benefit of Playwright is simulating user interactions like clicks, scrolls and form submissions. This allows scraping content that renders dynamically with JavaScript.

Let's try automating paginated results.

The site we're scraping only shows a limited number of products per page. To scrape them all, we need to handle clicking through the pagination.

Using the page inspector, we can see the next-page link carries the classes next and page-numbers. Let's click it programmatically with a CSS selector:

# Count the pagination links to estimate the number of pages
pages = await page.query_selector_all(".page-numbers")
num_pages = len(pages)

for i in range(1, num_pages):

    await page.click("a.next.page-numbers")     # Click next
    await page.wait_for_selector("li.product")  # Wait for the next page to load

    products = await page.query_selector_all("li.product")

    for product in products:
        # Extract and store data as in Step 1
        ...

This loop automates clicking the next button, extracts data, and waits for the DOM to update across all pages.
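
A more defensive variant, sketched below, loops until the next link disappears instead of pre-counting pages:

# Keep clicking while a "next" link exists
while await page.locator("a.next.page-numbers").count() > 0:
    await page.click("a.next.page-numbers")
    await page.wait_for_selector("li.product")
    # ... extract and store data ...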

Step 5 – Taking Automated Screenshots

Playwright can also capture screenshots of pages during scraping for visual testing:

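A minimal example; the output filenames are arbitrary:

await page.screenshot(path="screenshot.png")

# Or capture the full scrollable page:
await page.screenshot(path="screenshot_full.png", full_page=True)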

This saves a PNG image of the site to disk without needing to open a visible browser!

Playwright vs Puppeteer vs Selenium

Playwright is extremely versatile for web scraping. But how does it compare to popular alternatives like Puppeteer and Selenium?

Playwright

  • Supports Chromium, Firefox and WebKit
  • Works across Windows, macOS and Linux
  • APIs for Python, JavaScript, C# & Java
  • Fast and lightweight
  • Stealth available via community plugins to evade detection
  • Robust browser automation features

Puppeteer

  • JavaScript/Node.js API only
  • Primarily Chromium (Firefox support is experimental)
  • Headless scraping
  • Fast performance
  • More limited feature set than Playwright

Selenium

  • Java, Python, C#, Ruby and JavaScript bindings
  • Supports Chrome, Firefox, Safari and Edge
  • Heavier architecture that can hurt performance
  • No native stealth options

While all three can handle scraping, Playwright stands out for its speed, scalability and flexibility. Its easy-to-use API is also well documented, making it beginner friendly.

Conclusion

This tutorial covered the basics of using Playwright for Python based web scraping. You learned how to:

  • Set up headless Chromium and extract page data
  • Automate clicks, fills and navigation
  • Export structured data to CSV
  • Capture screenshots

With these capabilities, Playwright provides a complete web scraping toolbox. Even if sites use heavy JavaScript or try blocking bots via sneaky methods, Playwright has all the tools needed to extract data at scale.

To learn more, visit the Playwright site which has additional docs, API references and sample code repos to help you become a pro web scraper with Playwright!
