How to Set User Agents for Web Scraping

If you want to scrape websites successfully, using the proper user agent string is crucial. I'll explain why it matters, how to configure user agents in Python, and techniques to avoid bot blocks.

What Are User Agents and Why Do They Matter?

First, what exactly is a user agent?

A user agent string provides information about the browser, operating system, and device making a web request. Here's an example:

Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/101.0.4951.64 Safari/537.36

This user agent indicates:

  • Browser: Chrome
  • Browser version: 101.0.4951.64
  • Operating system: Windows NT (Windows 10)
  • OS architecture: x64 (64-bit)

User agents are sent in the HTTP headers of requests from all major browsers like Chrome, Firefox and Safari.

Websites rely on user agents for several reasons:

  • To identify the browser and adapt content accordingly
  • To detect mobile vs. desktop visitors
  • To block known bot user agents from scraping content

That last reason is why user agents are so important for web scraping.

Why Invalid User Agents Get Scrapers Blocked

Many sites actively analyze user agent strings to detect and block bots.

Using an incorrect user agent or leaving it blank is a dead giveaway that you're not a real browser. This quickly leads to blocks on sites protected by bot defense systems like Distil Networks or Imperva.
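
For instance, Python's requests library identifies itself by default with a user agent along the lines of python-requests/2.28.1, which bot detection systems flag instantly. You can confirm this yourself (the exact version number depends on your installation):

import requests

# Without custom headers, requests announces itself as python-requests/<version>
response = requests.get("https://httpbin.org/headers")
print(response.json()["headers"]["User-Agent"])  # e.g. python-requests/2.28.1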

Bots may get blocked in various ways depending on the site:

  • HTTP status code 403 Forbidden
  • CAPTCHAs requiring human verification
  • Redirects to bot warning pages
  • IP address blocking
  • JavaScript checks for valid browser properties

This hinders your ability to scrape target sites successfully.

The key is to set a valid user agent that impersonates a real browser. This makes your scraper appear to be normal website traffic. I'll cover the best practices next.

Recommended User Agents for Web Scraping

To avoid bot blocks, I suggest using legitimate user agent strings from major browsers like Chrome, Firefox, Safari and Edge.

Here are some examples I recommend for web scraping in 2023:

Chrome on Windows

Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/110.0.0.0 Safari/537.36

Chrome on macOS

Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/110.0.0.0 Safari/537.36

Firefox on Linux

Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/110.0

Microsoft Edge on Windows

Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/110.0.0.0 Safari/537.36 Edg/110.0.1587.41

I recommend sticking to current Chrome, Firefox, Safari and Edge browsers, which have significant market share. The examples above are up to date as of February 2023 but will need to be refreshed over time as new browser versions are released.

For a huge list of other user agents to choose from, I suggest checking sites like WhatIsMyBrowser and UserAgentString. Just be sure to always verify they represent real browser and OS combinations.

Next, I'll go over how to set these user agent strings in Python scripts for web scraping.

Setting the User Agent in Python Using Requests

Python's requests module makes it straightforward to set a custom user agent.

First, create your headers dictionary including the User-Agent header:

headers = {
  "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/110.0.0.0 Safari/537.36" 
}

Then pass the headers parameter when making requests:

import requests

url = 'https://example.com'

headers = {"User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/110.0.0.0 Safari/537.36"}

response = requests.get(url, headers=headers)

This will send the request with your specified user agent.
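
If you're making many requests, it can be cleaner to set the header once on a requests.Session so every request through that session reuses it. A minimal sketch, using the same Chrome user agent as above:

import requests

session = requests.Session()

# Every request made through this session reuses these headers
session.headers.update({
  "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/110.0.0.0 Safari/537.36"
})

response = session.get("https://example.com")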

To validate it works, you can check the headers received by a service like httpbin:

import requests

custom_ua = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/110.0.0.0 Safari/537.36"

headers = {"User-Agent": custom_ua} 

url = 'https://httpbin.org/headers'
response = requests.get(url, headers=headers)

print(response.json()["headers"]["User-Agent"]) # prints custom_ua

The response will include the exact user agent received by the server, allowing you to confirm your scraper is sending the expected string.

Other Approaches to Setting the User Agent

Besides requests, here are some other ways to configure a custom user agent in Python:

  • selenium: Pass a --user-agent argument via the browser options (e.g., ChromeOptions)
  • pycurl: Use the CURLOPT_USERAGENT option (pycurl.USERAGENT)
  • urllib: Add a User-Agent header to the Request object, much like with requests
  • Scrapy: Set the USER_AGENT setting

No matter the approach, the same best practices apply around selecting valid browser-like user agents.
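
As an illustration, here's a minimal sketch of two of those approaches: the standard library's urllib and Selenium driving Chrome. The Selenium part assumes a working Chrome driver setup and passes the user agent as a browser launch flag:

import urllib.request

from selenium import webdriver

ua = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/110.0.0.0 Safari/537.36"

# urllib: attach the User-Agent header to the Request object
req = urllib.request.Request("https://example.com", headers={"User-Agent": ua})
with urllib.request.urlopen(req) as response:
  html = response.read().decode("utf-8")

# Selenium: pass the user agent as a Chrome launch argument
options = webdriver.ChromeOptions()
options.add_argument(f"--user-agent={ua}")
driver = webdriver.Chrome(options=options)
driver.get("https://example.com")
driver.quit()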

Now that we can set user agents, let's look at why rotating them is important.

Rotating User Agents to Avoid Detection

Using the exact same static user agent for all requests can still get detected as a bot. The key is rotating between multiple user agents.

Here is an example of how to randomly rotate user agents:

import requests
import random 

user_agents = [
  "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/110.0.0.0 Safari/537.36",
  "Mozilla/5.0 (Macintosh; Intel Mac OS X 12_6) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/16.1 Safari/605.1.15",
  "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:109.0) Gecko/20100101 Firefox/110.0"
]

url = 'https://example.com'

for _ in range(10):
  # Pick a random user agent
  user_agent = random.choice(user_agents)

  # Set it in the headers 
  headers = {"User-Agent": user_agent}

  # Make request
  response = requests.get(url, headers=headers)

  print(response.status_code)

By varying the user agent with each request, your scraper's traffic looks more like requests coming from different real users and browsers.

Other tips for avoiding bot blocks:

  • Add random delays between requests using time.sleep() (see the sketch after this list)
  • Frequently update your user agent list with new browser versions
  • Use a proxy rotation service to vary IPs
  • Don't overload sites with too many requests too quickly
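
For instance, here is a minimal sketch of pacing requests with a randomized pause (the 1-4 second range is an arbitrary choice; tune it to the target site):

import random
import time

urls = ["https://example.com/page1", "https://example.com/page2"]

for url in urls:
  # ...fetch url with a rotated user agent as shown above...

  # Pause for a random 1-4 seconds so requests aren't fired at machine speed
  time.sleep(random.uniform(1, 4))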

Next, I'll quickly cover using web scraping APIs as an alternative to dealing with user agents.

Web Scraping APIs: Let Them Handle User Agents

While managing user agents yourself can work, an easier solution is to use a web scraping API.

Tools like ScrapingBee, ScraperAPI, ZenRows, and others handle user agents, proxies, and browsers under the hood.

For example, here's roughly how scraping a page with the ZenRows Python SDK looks, without needing to worry about user agents at all (the exact client interface may differ between SDK versions, so check their docs):

from zenrows import ZenRowsClient

# Instantiate the client with your API key
client = ZenRowsClient("YOUR_API_KEY")

response = client.get("https://example.com")
print(response.text)

The API abstracts away user agent rotation and other low-level details. Most commercial APIs offer free tiers to try them out.

Just be aware that you still shouldn't abuse these APIs, and you need to respect each site's terms of service and data usage policies. I'd suggest keeping requests to a few thousand per day on any given domain as a conservative ceiling.

Recap of Best Practices

To recap, here are my top tips for setting user agents when scraping:

  • Use legitimate user agents from major browsers like Chrome and Firefox
  • Frequently rotate between multiple user agents
  • Add random delays between requests to appear human
  • Keep your user agent list updated with latest versions
  • Consider using a web scraping API to simplify things

Configuring your user agent properly is crucial for avoiding bot blocks and scraping sites successfully.
