How to Set User Agents for Web Scraping
If you want to scrape websites successfully, using the proper user agent string is crucial. I'll explain why it matters, how to configure user agents in Python, and techniques to avoid bot blocks.
What Are User Agents and Why Do They Matter?
First, what exactly is a user agent?
A user agent string provides information about the browser, operating system, and device making a web request. Here's an example:
Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/101.0.4951.64 Safari/537.36
This user agent indicates:
- Browser: Chrome
- Browser version: 101.0.4951.64
- Operating system: Windows NT (Windows 10)
- OS architecture: x64 (64-bit)
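If you ever need to do this breakdown in code, the third-party user-agents package can parse a string into these components. A quick sketch (this library is just one option and isn't needed for the rest of this guide):

# Sketch using the third-party "user-agents" package (pip install user-agents)
from user_agents import parse

ua_string = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/101.0.4951.64 Safari/537.36"
ua = parse(ua_string)

print(ua.browser.family)          # e.g. Chrome
print(ua.browser.version_string)  # e.g. 101.0.4951
print(ua.os.family)               # e.g. Windows
print(ua.is_mobile)               # False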
User agents are sent in the HTTP headers of requests from all major browsers like Chrome, Firefox and Safari.
Websites rely on user agents for many reasons:
- To identify the browser and adapt content accordingly
- Detect mobile vs desktop visitors
- Block known bot user agents from scraping content
That last reason is why user agents are so important for web scraping.
Why Invalid User Agents Get Scrapers Blocked
Many sites actively analyze user agent strings to detect and block bots.
Using an incorrect user agent or leaving it blank is a dead giveaway that you're not a real browser. This quickly leads to blocks on sites protected by bot defense systems like Distil Networks or Imperva.
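You can see the problem for yourself: by default, Python's requests library announces itself with a python-requests user agent, which is an obvious non-browser signature. Checking against httpbin (the same echo service used later in this guide) makes it visible:

import requests

# No custom headers, so requests sends its default user agent
response = requests.get("https://httpbin.org/headers")

# Prints something like "python-requests/2.28.2", a clear bot fingerprint
print(response.json()["headers"]["User-Agent"])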
Bots may get blocked in various ways depending on the site:
- HTTP status code 403 Forbidden
- CAPTCHAs requiring human verification
- Redirects to bot warning pages
- IP address blocking
- JavaScript checks for valid browser properties
This hinders your ability to scrape target sites successfully.
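As a rough illustration (the looks_blocked helper below is hypothetical, not tied to any particular site or defense system), a scraper can watch for a couple of these signals and back off instead of retrying blindly:

import requests

def looks_blocked(response):
    # 403 Forbidden and 429 Too Many Requests are common block responses
    if response.status_code in (403, 429):
        return True
    # Many CAPTCHA and bot-warning pages mention "captcha" somewhere in the body
    if "captcha" in response.text.lower():
        return True
    return False

response = requests.get("https://example.com")
if looks_blocked(response):
    print("Request looks blocked: slow down, rotate user agents, or switch proxies")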
The key is to set a valid user agent that impersonates a real browser. This makes your scraper appear to be normal website traffic. I'll cover the best practices next.
Recommended User Agents for Web Scraping
To avoid bot blocks, I suggest using legitimate user agent strings from major browsers like Chrome, Firefox, Safari and Edge.
Here are some examples I recommend for web scraping in 2023:
Chrome on Windows
Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/110.0.0.0 Safari/537.36
Chrome on macOS
Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/110.0.0.0 Safari/537.36
Firefox on Linux
Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/110.0
Microsoft Edge on Windows
Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/110.0.0.0 Safari/537.36 Edg/110.0.1587.41
I recommend sticking to current Chrome, Firefox, Safari and Edge browsers, which have significant market share. The examples above are up-to-date as of February 2023 but will need to be refreshed over time as new browser versions are released.
For a huge list of other user agents to choose from, I suggest checking sites like WhatIsMyBrowser and UserAgentString. Just be sure to always verify they represent real browser and OS combinations.
Next, I'll go over how to set these user agent strings in Python scripts for web scraping.
Setting the User Agent in Python Using Requests
Python's requests module makes it straightforward to set a custom user agent.
First, create your headers dictionary including the User-Agent header:
headers = { "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/110.0.0.0 Safari/537.36" }
Then pass the headers parameter when making requests:

import requests

url = 'https://example.com'
headers = {"User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/110.0.0.0 Safari/537.36"}

response = requests.get(url, headers=headers)
This will send the request with your specified user agent.
To validate it works, you can check the headers received by a service like httpbin:
import requests

custom_ua = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/110.0.0.0 Safari/537.36"
headers = {"User-Agent": custom_ua}

url = 'https://httpbin.org/headers'
response = requests.get(url, headers=headers)

print(response.json()["headers"]["User-Agent"])  # prints custom_ua
The response will include the exact user agent received by the server, allowing you to confirm your scraper is sending the expected string.
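If you're making many requests to the same site, you can also set the header once on a requests.Session instead of rebuilding the dictionary every time. A minimal sketch:

import requests

session = requests.Session()

# Every request made through this session carries the custom user agent
session.headers.update({
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/110.0.0.0 Safari/537.36"
})

response = session.get("https://httpbin.org/headers")
print(response.json()["headers"]["User-Agent"])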
Other Approaches to Setting the User Agent
Besides requests, here are some other ways to configure a custom user agent in Python:
- selenium: configure the browser's user agent, e.g. via Chrome's options (sketched below)
- pycurl: use the CURLOPT_USERAGENT option
- urllib: add a User-Agent header to the request, just like with requests (sketched below)
- Scrapy: set the USER_AGENT setting in your project settings
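To make a couple of these concrete, here is roughly how the urllib and Selenium approaches look (the Selenium snippet assumes Chrome and the Selenium 4 options API):

# urllib: pass the User-Agent header when building the Request
import urllib.request

ua = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/110.0.0.0 Safari/537.36"
req = urllib.request.Request("https://httpbin.org/headers", headers={"User-Agent": ua})
with urllib.request.urlopen(req) as resp:
    print(resp.read().decode())

# Selenium (Chrome): pass the user agent as a browser launch argument
from selenium import webdriver

options = webdriver.ChromeOptions()
options.add_argument(f"user-agent={ua}")
driver = webdriver.Chrome(options=options)
driver.get("https://httpbin.org/headers")
print(driver.page_source)
driver.quit()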
No matter the approach, the same best practices apply around selecting valid browser-like user agents.
Now that we can set user agents, let's look at why rotating them is important.
Rotating User Agents to Avoid Detection
Using the exact same static user agent for every request can still get your scraper flagged as a bot. The key is rotating between multiple user agents.
Here is an example of how to randomly rotate user agents:
import requests
import random

user_agents = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/110.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 12_6) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/16.1 Safari/605.1.15",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:109.0) Gecko/20100101 Firefox/110.0"
]

url = 'https://example.com'

for i in range(10):
    # Pick a random user agent
    user_agent = random.choice(user_agents)

    # Set it in the headers
    headers = {"User-Agent": user_agent}

    # Make request
    response = requests.get(url, headers=headers)

    print(response.status_code)
By varying the user agent with each request, your scraper appears more like a real person switching browsers.
Other tips for avoiding bot blocks:
- Add random delays between requests using time.sleep() (see the sketch after this list)
- Frequently update your user agent list with new browser versions
- Use a proxy rotation service to vary IPs
- Don't overload sites with too many requests too quickly
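For example, here's a small sketch combining random delays with the user agent rotation shown earlier (the 2-6 second range and the URLs are placeholders; tune them to the target site):

import random
import time

import requests

user_agents = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/110.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:109.0) Gecko/20100101 Firefox/110.0",
]

urls = ["https://example.com/page1", "https://example.com/page2"]  # placeholder URLs

for url in urls:
    headers = {"User-Agent": random.choice(user_agents)}
    response = requests.get(url, headers=headers)
    print(url, response.status_code)

    # Wait a random 2-6 seconds so the request pattern doesn't look machine-like
    time.sleep(random.uniform(2, 6))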
Next, I'll quickly cover using web scraping APIs as an alternative to dealing with user agents.
Web Scraping APIs: Let Them Handle User Agents
While managing user agents yourself can work, an easier solution is to use a web scraping API.
Tools like ScrapingBee, ScraperAPI, ZenRows, and others handle user agents, proxies, and browsers under the hood.
For example, here's how to scrape a page with ZenRows without needing to worry about user agents at all:
from zenrows import ZenRowsClient

# The SDK client takes your ZenRows API key (placeholder below)
client = ZenRowsClient("YOUR_API_KEY")

response = client.get("https://example.com")
print(response.text)
The API abstracts away user agent rotation and other low-level details. Most commercial APIs offer free tiers to try them out.
Just be aware that an API isn't a license to abuse target sites: you still need to respect each site's terms of service and data usage policies. I'd suggest keeping requests to a few thousand per day on any given domain as a safe ceiling.
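If you want to enforce a ceiling like that in code, a very simple per-domain counter is enough for most scripts. This is just a sketch, and the 2,000-request limit is an arbitrary example figure:

from collections import defaultdict
from urllib.parse import urlparse

MAX_REQUESTS_PER_DOMAIN = 2000  # example daily ceiling, adjust to your own policy
request_counts = defaultdict(int)

def allowed(url):
    # Count requests per domain and refuse once the daily cap is hit
    domain = urlparse(url).netloc
    if request_counts[domain] >= MAX_REQUESTS_PER_DOMAIN:
        return False
    request_counts[domain] += 1
    return True

if allowed("https://example.com/page"):
    pass  # make the scraping or API call here
else:
    print("Daily limit reached for this domain, stopping")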
Recap of Best Practices
To recap, here are my top tips for setting user agents when scraping:
- Use legitimate user agents from major browsers like Chrome and Firefox
- Frequently rotate between multiple user agents
- Add random delays between requests to appear human
- Keep your user agent list updated with latest versions
- Consider using a web scraping API to simplify things
Configuring your user agent properly is crucial for avoiding bot blocks and scraping sites successfully.