How to Do Web Scraping with Playwright
Web scraping allows you to extract large amounts of data from websites automatically. However, many sites have implemented anti-scraping measures that can detect and block scrapers.
To overcome these protections, you need a browser automation tool like Playwright. With Playwright, you can mimic real user interactions to access dynamic content and evade detection.
In this comprehensive guide, you'll learn:
- What Playwright is and why it's useful for web scraping
- How to set up a Python Playwright scraper
- Locating page elements and extracting text/data
- Downloading images
- Exporting scraped data to CSV
- Navigating between pages
- Taking automated screenshots
We'll also cover how Playwright compares to popular alternatives like Puppeteer and Selenium. Let's get started!
What is Playwright?
Playwright is an open-source browser automation library that allows controlling Chromium, Firefox, and WebKit browsers via code – which makes it a powerful tool for web scraping.
It works across platforms like Windows, macOS, and Linux – making Playwright scripts highly portable. You can write scripts in languages like JavaScript, Python, C# (.NET), and Java.
The key advantage of Playwright is that it enables headless browser automation. This means you can scrape dynamically rendered sites by programmatically interacting with pages the way a real user would.
For example, Playwright scripts can:
- Click buttons
- Scroll down to load more content
- Fill out and submit forms
All of this happens behind the scenes, without launching a visible browser GUI – as shown in the sketch below.
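Here's a minimal sketch of what those interactions look like in Playwright's Python API. The selectors are hypothetical – adjust them for your target site – and the calls assume an async context with a page object already open:

# Hypothetical selectors – adjust for your target site
await page.click("button#load-more")               # click a button
await page.mouse.wheel(0, 2000)                    # scroll down to trigger lazy loading
await page.fill("input[name='q']", "playwright")   # fill out a search form
await page.press("input[name='q']", "Enter")       # submit it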
Headless browsers are extremely fast because they don't render a graphical interface. Playwright itself doesn't ship a stealth mode, but community plugins such as playwright-stealth can mask automation signals – modifying navigator fingerprints, for example – to help avoid detection.
This makes it very effective for large scale web scraping while defeating anti-bot protections.
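As a rough sketch, applying the community playwright-stealth package (assuming it's installed with pip install playwright-stealth) looks something like this:

# Assumes: pip install playwright-stealth
from playwright_stealth import stealth_async

page = await browser.new_page()
await stealth_async(page)   # patch navigator properties before visiting the site
await page.goto("https://example.com")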
Installing & Launching Playwright with Python
We'll focus on Python code examples here, but the Playwright API is very similar across languages.
First install Playwright:
pip install playwright
Also install browser binaries for Chromium, Firefox and WebKit:
playwright install
Now import Playwright and launch a browser:
import asyncio
from playwright.async_api import async_playwright

async def main():
    async with async_playwright() as p:
        browser = await p.chromium.launch()
        # ... scraping logic goes here ...
        await browser.close()

asyncio.run(main())
This launches a Playwright controlled Chromium instance in headless mode. Let's start scraping!
Step 1 – Locate Elements & Extract Text
Suppose we want to scrape the products listed on the demo site https://scrapeme.live/shop/. The first step is to analyze the page structure and identify CSS selectors for the data we want.
I want to grab:
- Product name
- Price
- Image URL
Using developer tools, I can see product entries are under <li class="product"> elements. Product names are inside <h2> tags, prices within <span class="woocommerce-Price-amount"> spans, and image links are in <img src=...> attributes.
Let's grab these elements with Playwright selectors:
import asyncio
from playwright.async_api import async_playwright

async def scrape(page):
    # Get all product entries on the page
    products = await page.query_selector_all("li.product")
    for product in products:
        name_el = await product.query_selector("h2")
        name = await name_el.inner_text()
        price_el = await product.query_selector("span.woocommerce-Price-amount")
        price = await price_el.text_content()
        img_el = await product.query_selector("img")
        image_url = await img_el.get_attribute("src")
        print({'name': name, 'price': price, 'img_url': image_url})

async def main():
    async with async_playwright() as p:
        browser = await p.chromium.launch()
        page = await browser.new_page()
        await page.goto("https://scrapeme.live/shop/")
        await scrape(page)
        await browser.close()

asyncio.run(main())
And we have successfully extracted key data points into a nice Python dictionary!
Step 2 – Downloading Images
What if you also wanted to download the product images locally?
We already have the image URLs, so we just need a step to fetch and save them. Images aren't downloads in the browser sense, so rather than simulating clicks, we can fetch the bytes directly with Playwright's request API:
import os

async def scrape(page):
    # Same logic as before
    os.makedirs("images", exist_ok=True)
    products = await page.query_selector_all("li.product")
    for product in products:
        # Extract name, price, etc. as before ...
        img_el = await product.query_selector("img")
        image_url = await img_el.get_attribute("src")

        # Fetch the image bytes through the browser's network stack
        response = await page.request.get(image_url)
        path = os.path.join("images", image_url.split("/")[-1])
        with open(path, "wb") as f:
            f.write(await response.body())
        print(f"Saved image to {path}")
Here we request each image URL through Playwright's APIRequestContext (page.request), which shares cookies and headers with the browser session, then write the bytes to a local file – and we have the image saved locally! For files served behind a real download link, see the sketch below.
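For sites that expose actual download links (PDFs, ZIPs, and so on), Playwright does provide a dedicated download API. A minimal sketch, assuming a hypothetical a.download-link element on the page:

# Hypothetical selector for a real download link
async with page.expect_download() as download_info:
    await page.click("a.download-link")
download = await download_info.value
path = await download.path()   # temporary file path once the download finishes
print(f"Saved file to {path}")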
Step 3 – Export Data to CSV
For data analytics, it's useful to export scraped data to a spreadsheet format like CSV.
First, let's create a Python list, results = [], to store data entries. Then append a dict with each product's info:
results = []

async def scrape(page):
    products = await page.query_selector_all("li.product")
    for product in products:
        name_el = await product.query_selector("h2")
        price_el = await product.query_selector("span.woocommerce-Price-amount")
        data = {
            'name': await name_el.inner_text(),
            'price': await price_el.text_content(),
            # ...
        }
        results.append(data)
Finally, save the CSV:
import csv

with open('output.csv', 'w', newline='') as f:
    writer = csv.DictWriter(f, fieldnames=results[0].keys())
    writer.writeheader()
    writer.writerows(results)

print('Saved CSV')
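As a quick sanity check, you can read the file straight back with the same csv module:

import csv

with open('output.csv') as f:
    for row in csv.DictReader(f):
        print(row['name'], row['price'])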
Now you've extracted and exported web data without ever opening a visible browser window!
Step 4 – Browser Automation
A key benefit of Playwright is simulating user interactions like clicks, scrolls and form submissions. This allows scraping content that renders dynamically with JavaScript.
Let's try automating paginated results.
The site we're scraping only shows a limited number of products per page. To scrape them all, we need to handle clicking between pages.
Using the page inspector, we can see the next-page button has the text →. Let's click it programmatically with Playwright using CSS selectors:
# Count the pagination links to know how many pages exist
pages = await page.query_selector_all(".page-numbers")
num_pages = len(pages)

for i in range(1, num_pages):
    await page.click(".next.page-numbers")        # Click the next-page button
    await page.wait_for_selector("li.product")    # Wait for the new page to load
    products = await page.query_selector_all("li.product")
    for product in products:
        # Extract and store data as before
        ...
This loop automates clicking the next button, waits for the DOM to update, and extracts data across all pages.
Step 5 – Taking Automated Screenshots
Playwright can also capture screenshots of pages during scraping for visual testing:
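For example, after navigating to a page, a single call captures it to disk (the filename here is just an example):

# Capture the whole page, not just the visible viewport
await page.screenshot(path="products.png", full_page=True)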
This saves a PNG image of the site to disk without needing to open a visible browser!
Playwright vs Puppeteer vs Selenium
Playwright is extremely versatile for web scraping. But how does it compare to popular alternatives like Puppeteer and Selenium?
Playwright
- Supports Chromium, Firefox and WebKit
- Works across Windows, macOS and Linux
- APIs for Python, JavaScript, C# & Java
- Fast and lightweight
- Stealth mode to evade detection
- Robust browser automation features
Puppeteer
- JavaScript API
- Chromium-only
- Headless scraping
- Fast performance
- More limited compared to Playwright
Selenium
- Python, Java, C#, Ruby, and JavaScript bindings
- Supports Chrome, Firefox, Safari
- Very heavyweight, which can hurt performance
- No native stealth options
While all three can scrape well, Playwright stands out for speed, scalability, and flexibility. Its easy-to-use API also has great documentation, making it beginner-friendly.
Conclusion
This tutorial covered the basics of using Playwright for Python based web scraping. You learned how to:
- Set up headless Chromium and extract page data
- Automate clicks, fills and navigation
- Export structured data to CSV
- Capture screenshots
With these capabilities, Playwright provides a complete web scraping toolbox. Even if sites use heavy JavaScript or try blocking bots via sneaky methods, Playwright has all the tools needed to extract data at scale.
To learn more, visit the Playwright site which has additional docs, API references and sample code repos to help you become a pro web scraper with Playwright!