How to Rotate Proxies in Python (2023 Tutorial)

Rotating proxies is an indispensable technique for web scrapers. Using the same IP address for every request acts like a homing beacon, marking your traffic as a bot. Continuously changing the outgoing IP address, on the other hand, makes your scraper blend in with normal human traffic.

In this comprehensive guide, you’ll learn battle-tested techniques to build a robust proxy rotation system in Python. I’ll share the exact methods professional web scrapers use to avoid blocks at scale.

Here’s what we’ll cover:

  • Why proxy rotation is essential for web scraping
  • How to get lists of potential proxy servers
  • Techniques to check proxies and identify working ones
  • Organizing proxies into working, failed, and unchecked sets
  • Picking random working proxies for each request
  • Rechecking failed proxies periodically to find working ones
  • Integrating proxy rotation into a Python scraper

Follow along and you’ll have all the knowledge to implement bulletproof proxy rotation for your own scrapers. Let’s get started!

Why Proxy Rotation is Crucial for Web Scraping

Large sites actively try to detect and block scrapers. As soon as they notice unusual traffic levels or patterns, they’ll employ countermeasures to stop you.

The easiest way to identify a scraper is through the source IP address. Requests from a single IP, at higher speeds than humans can browse, are a dead giveaway.

For example, here’s traffic from a normal user over one hour:

Request 1 > IP: 82.193.113.203
Request 2 > IP: 82.193.113.203
Request 3 > IP: 82.193.113.203

Now compare that to a scraper without proxies:

Request 1 > IP: 92.223.87.192  
Request 2 > IP: 92.223.87.192
Request 3 > IP: 92.223.87.192
...repeated requests from 92.223.87.192...

The hack here is to use proxy servers. These act as intermediaries that forward your requests to the target site. So while your IP makes the request to the proxy, their outbound IP makes the actual call to the site.

For example:

Request 1 > Your IP > Proxy 1 > Target Site
Request 2 > Your IP > Proxy 2 > Target Site
Request 3 > Your IP > Proxy 3 > Target Site

Now from the site’s perspective, each request comes from a different source IP! This makes your scraper’s traffic blend in with normal users browsing the site.

The Problem With Static Proxies

Some tools like public proxy lists provide static proxies. They always use the same outbound IP address.

Rotating through a pool of static proxies can work for a while. But if you repeatedly hit sites with the same IPs, those will eventually get flagged too.

That’s why we need rotating proxies that automatically change their outbound IPs, either with each request or after a set time period, like every 5 minutes.

The best way to avoid blocks is combining proxy rotation with other evasion techniques like:

  • Random delays between requests
  • Mimicking human browsing patterns
  • Rotating user agents and other HTTP headers
  • Spreading scraping across IP ranges, like mobile vs residential vs datacenter

This guide focuses specifically on rotating proxies. But keep these other tips in mind to build a robust, foolproof scraping system.

Now that you know why proxy rotation is so important, let’s move on to building the system!

Getting a List of Proxy Servers

To start rotating proxies, we obviously need a list of potential proxies to use. Let’s go over the options:

Web Scraping Proxy Providers

For best results, you should use paid proxies from a reputable provider. These work more reliably compared to free public proxies.

Look for providers offering rotating residential or datacenter proxies. Popular options include:

  • Luminati – Over 40M IPs worldwide, starting at $500/month
  • Oxylabs – 2M+ rotating proxies, plans from $75/month
  • Smartproxy – 10M+ IPs, US proxies from $200/month
  • GeoSurf – Residential proxies in 130+ countries

These providers give you instant access to millions of proxies with just an API key. Under the hood, they maintain the infrastructure to provide fresh IPs.
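
To give you a feel for how that typically looks, here's a rough sketch of sending a request through a provider-style rotating gateway with the requests library. The hostname, port, and credentials below are placeholders, not any real provider's values:

import requests

# Hypothetical rotating gateway endpoint; swap in your provider's real host and credentials
gateway = "http://USERNAME:PASSWORD@gateway.example-provider.com:8000"

response = requests.get(
    "http://example.com",
    proxies={"http": gateway, "https": gateway},  # the provider rotates the outbound IP for you
    timeout=30,
)
print(response.status_code)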

Free Public Proxy Lists

You can also find free public proxy lists online. Sites like free-proxy-list.net publish lists anyone can use.

The downside is proxies from free lists tend to:

  • Have slower speeds
  • Get blacklisted more frequently on sites
  • Go offline without notice
  • Have no abuse controls, so their IPs often end up banned

But free lists can be useful for testing and experimentation. Just don’t rely on them for production scraping.

To get started, grab a list from any free source and save it as proxies.txt. The format is one proxy per line:

143.198.184.10:8080
88.247.108.19:3128
...

We can load these into a Python list:

proxies_list = open("proxies.txt").read().splitlines()

This gives us a list of potential proxies to start testing.

Web Scraping Forums

Another option is getting proxies from web scraping forums and communities. Members often share working proxies for mutual benefit.

For example, on the Black Hat World forum, there are threads like:

  • “Fast USA Proxies For All”
  • “Looking for cheap proxies for scraping/botting”

You can find proxy lists in these discussions. Or make requests for proxies in your required locations.

The quality varies so be sure to thoroughly test any proxies obtained this way. But communities like Black Hat World can be a good free resource.

Scraping Proxy Aggregator Sites

If you want an automated solution, you can scrape proxy aggregators that compile lists from multiple sources.

For example, sites like free-proxy-list.net publish their latest proxies in a simple HTML table. You can scrape these pages and extract the proxy lists.

Here's a sample Python script using Requests:

import requests
from bs4 import BeautifulSoup 

url = 'https://free-proxy-list.net/'

response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')

proxies = []
for row in soup.select('table tr')[1:]:
    tds = row.find_all('td')
    ip = tds[0].text.strip() 
    port = tds[1].text.strip()
    proxies.append(f"{ip}:{port}")
    
print(proxies)

This parses the free proxy table on that site and extracts the IP:Ports. Run it periodically to check for new proxies.

The same method works for any site publishing proxy lists in HTML tables. You can expand it to scrape multiple aggregators and combine their proxies.
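
To make that concrete, here's one way you could sketch it: wrap the parsing above in a helper (scrape_proxy_table() is my own name for it) and merge the results from several pages into one de-duplicated set:

import requests
from bs4 import BeautifulSoup

def scrape_proxy_table(url):
    # Same parsing logic as above, wrapped in a reusable function
    response = requests.get(url)
    soup = BeautifulSoup(response.text, 'html.parser')
    proxies = []
    for row in soup.select('table tr')[1:]:
        tds = row.find_all('td')
        if len(tds) >= 2:  # skip rows that aren't proxy entries
            proxies.append(f"{tds[0].text.strip()}:{tds[1].text.strip()}")
    return proxies

# Combine proxies from several aggregator pages and de-duplicate them
aggregator_urls = ['https://free-proxy-list.net/']  # add more aggregator pages here
all_proxies = set()
for url in aggregator_urls:
    all_proxies.update(scrape_proxy_table(url))

# Save the combined list, one proxy per line, for the checking step later
with open("proxies.txt", "w") as f:
    f.write("\n".join(all_proxies))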

Checking Proxies and Identifying Working Ones

Simply having a list of potential proxies isn’t enough. Many of them will be dead, blocked, or too slow to use. We need to test each one to filter out the working proxies.

Here is a systematic process to validate proxies:

The Proxy Checker Script

First, we need a script that tests a proxy by sending a request. Let’s break it down:

1. Use the requests library to download a page:

import requests

proxy = "143.198.184.10:8080" 

response = requests.get(
    "http://example.com", 
    proxies={"http": f"http://{proxy}"},
    timeout=30
)

2. Print the status code to check the result:

print(response.status_code)
# 200 means success; other status codes indicate a failure or block

3. Wrap in a try/except to catch errors:

try:
   response = requests.get(...)
   print(response.status_code)
except Exception as e:
   print('Request failed:', e)

4. Put it together into a reusable check_proxy() function:

import requests

def check_proxy(proxy):
   try:
      response = requests.get(
         "http://example.com",
         proxies={"http": f"http://{proxy}"}, 
         timeout=30, 
      )
      print(response.status_code)
   except Exception as e:
      print('Request failed:', e)

This gives us a script to test any proxy. Time to check the list!

Checking the Full Proxy List

To validate all proxies, we simply loop through them calling the checker:

proxy_list = open("proxies.txt").read().splitlines()

for proxy in proxy_list:
  check_proxy(proxy)

This will print either the status code or any errors for each proxy.

We can consider proxies returning status 200 as alive and working. The rest failed for some reason like timeout, connection issues, or being blocked.

Now we need to organize the working ones separately…

Separating Working Proxies from Failed Ones

To keep track of proxy states, we'll use three sets:

# Python's built-in set type needs no import
working = set()
failed = set()
unchecked = set(proxy_list)

  • working: proxies that returned 200
  • failed: proxies that failed or timed out
  • unchecked: newly added proxies not yet checked

Then we can write functions to move proxies between the sets:

def set_working(proxy):
  working.add(proxy)
  unchecked.discard(proxy)
  failed.discard(proxy) 

def set_failed(proxy):
  failed.add(proxy)
  working.discard(proxy)
  unchecked.discard(proxy)

For example, call set_working() after a successful check:

def check_proxy(proxy):
  try:
    # Send a test request through the proxy
    response = requests.get(
      "http://example.com",
      proxies={"http": f"http://{proxy}"},
      timeout=30,
    )
    if response.status_code == 200:
      set_working(proxy)
    else:
      set_failed(proxy)
  except Exception:
    set_failed(proxy)

This will automatically organize proxies after checking them.

Retesting Failed Proxies

When a proxy fails, we don't discard it permanently. Temporary network problems or downtime could have caused a one-off failure.

To give them another chance, we can recheck failed proxies after some time:

from threading import Timer

def revive(proxy):
  failed.discard(proxy)
  unchecked.add(proxy)
  
def set_failed(proxy):
  failed.add(proxy)
  working.discard(proxy)
  unchecked.discard(proxy)

  Timer(120, revive, [proxy]).start()  # recheck after 2 minutes

This moves failed proxies back to the unchecked set after 2 minutes. Now when we pull a random proxy, they'll get tried again.

Ok, now we have a robust system to validate proxy lists and identify the working ones. Let’s put it to use!

Using Working Proxies in a Python Scraper

To implement proxy rotation, our scraper needs a way to get a random working proxy for each request. Here’s how to achieve that:

Pick a Random Working Proxy

First, write a function that returns a random proxy from the pool:

import random

def get_proxy():

  proxies = list(working | unchecked) # combine working and unchecked
  
  if not proxies:
    raise Exception("No proxies available!")
  
  return random.choice(proxies)

This raises an exception if there are no working or unchecked proxies left.

Including the unchecked proxies gives newly added or revived ones a chance to prove themselves. For a more conservative strategy, pick only from the working set, as sketched below.
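
Here's one way you might sketch that conservative variant: prefer proxies that have already passed a check, and only fall back to unchecked ones when nothing has been verified yet.

import random

def get_proxy_conservative():
  # Prefer proxies that have already passed a check
  if working:
    return random.choice(list(working))

  # Fall back to unchecked proxies only when nothing is verified yet
  if unchecked:
    return random.choice(list(unchecked))

  raise Exception("No proxies available!")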

Make Requests Through the Random Proxy

Now we can write a get() helper that uses a random proxy on each call:

import requests

def get(url):

  proxy = get_proxy() # fetch a random proxy
  
  try:
    response = requests.get(url, proxies={"http": f"http://{proxy}"}, timeout=30)
    return response
  
  except Exception as e:
    # mark proxy as failed
    set_failed(proxy) 
    raise e

If a proxy fails, set_failed() removes it from the working pool, and a new random proxy gets used for the next request.

This is a simple but effective approach to rotate across working proxies.
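
For example, calling the wrapper looks just like a normal request (example.com is only a stand-in target):

# Each call goes out through a different randomly chosen proxy
response = get("http://example.com")
print(response.status_code)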

User Agents Are Key Too

A common mistake is focusing only on IP rotation using proxies. Sites also look at other headers like User-Agent to identify scrapers.

So make sure to also rotate random user agents with each request. For example:

import random 

user_agents = [
  'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36',
  'Mozilla/5.0 (iPhone; CPU iPhone OS 12_2 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Mobile/15E148'  
]

def get(url):

  # Pick a random user agent
  user_agent = random.choice(user_agents)

  headers = {
    "User-Agent": user_agent
  }

  # Make the request through a rotating proxy with the rotating user agent
  proxy = get_proxy()
  response = requests.get(url, headers=headers, proxies={"http": f"http://{proxy}"})

  return response

This helps make your traffic appear even more human.

Integrating Proxy Rotation Into a Scraper

Let's say we want to scrape a site and extract some data from each page. Here is a simple scraper with proxy rotation:

import requests
from bs4 import BeautifulSoup

# Storage for scraped data
results = []

def scrape(url):
  proxy = get_proxy()
  try:
    response = requests.get(url, proxies={"http": f"http://{proxy}"}, timeout=30)
  except Exception:
    set_failed(proxy)  # drop the bad proxy so it isn't picked again
    return

  soup = BeautifulSoup(response.text, 'html.parser')

  # Scrape data here...

  results.append({'title': soup.title.text})

# Run scraper across multiple URLs
for url in urls_list:  # urls_list holds the pages you want to scrape
  scrape(url)

print(results)

The key points are:

  • Call get_proxy() to fetch a random working proxy before each request
  • Pass the proxy into requests.get()
  • If a request fails, set_failed() removes that proxy so get_proxy() supplies a fresh one next time

This will let you scale up scraping without worrying about IP blocks. The proxies provide an endless supply of fresh IPs to rotate through.

Proxy Rotation Recap

Here are the key steps we covered to implement proxy rotation in Python:

  1. Get a list of potential proxy servers
  2. Check proxies and identify working ones
  3. Organize working, unchecked, and failed proxies into sets
  4. Write a method to provide a random working proxy
  5. Make requests using the rotating proxies
  6. Monitor failures and revive failed proxies periodically

Following this blueprint will let you integrate proxy rotation into any scraper. Combine it with other evasion techniques and you can scrape large sites without detection.

While I used basic Python tools like Requests here, the concepts work for any language or framework like Selenium, Puppeteer, or Scrapy.

The core logic remains the same – continuously pull from a pool of proven working proxies. This prevents the target site from seeing repeat traffic from the same IP.

Advanced Proxy Rotation Tips

Here are some additional tips to make your proxy rotation even more robust:

Asynchronous Checking

Checking proxies sequentially can be slow with a huge list. Speed it up by validating them asynchronously.

For example, with Python's asyncio module:

import asyncio

async def check(proxy):
  # Run the blocking check_proxy() in a worker thread so checks overlap (Python 3.9+)
  await asyncio.to_thread(check_proxy, proxy)

proxies = [...]  # the proxy list to validate

async def main():
  # Launch all the checks concurrently and wait for them to finish
  await asyncio.gather(*[check(proxy) for proxy in proxies])

asyncio.run(main())

This runs all the checks concurrently instead of one at a time.

Prioritize Higher Performing Proxies

Instead of picking proxies completely randomly, prioritize higher performing ones.

Keep metrics on proxies like success rate, average response time, locations, etc. Then bias selection toward faster, more reliable proxies.
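
Here's a rough sketch of that idea. The stats dictionary and the exact weighting scheme are my own assumptions, but the core trick is random.choices() with per-proxy weights:

import random

# Hypothetical per-proxy stats: proxy -> {"success": count, "fail": count}
stats = {}

def record_result(proxy, ok):
  # Call this after each request to update the proxy's track record
  entry = stats.setdefault(proxy, {"success": 0, "fail": 0})
  entry["success" if ok else "fail"] += 1

def get_weighted_proxy():
  proxies = list(working)
  if not proxies:
    raise Exception("No working proxies available!")

  # Weight each proxy by its success rate; untested proxies get a neutral 0.5,
  # and a small floor keeps every proxy's chance above zero
  weights = []
  for proxy in proxies:
    entry = stats.get(proxy, {"success": 0, "fail": 0})
    total = entry["success"] + entry["fail"]
    rate = entry["success"] / total if total else 0.5
    weights.append(rate + 0.05)

  return random.choices(proxies, weights=weights, k=1)[0]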

Proxy Health Checks

Actively check known working proxies by making periodic test requests. If they start failing, immediately remove from the working set.

This catches proxies going bad before they ruin actual scraping requests.
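
A minimal sketch of that idea, reusing the check_proxy() version that updates the sets and rescheduling itself with a timer (the 60-second interval is arbitrary):

import threading

def health_check():
  # Re-test a snapshot of the working set; check_proxy() demotes any proxy
  # that stops responding by calling set_failed()
  for proxy in list(working):
    check_proxy(proxy)

  # Schedule the next round of health checks
  threading.Timer(60, health_check).start()

health_check()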

Proxy Authentication

Some providers require authentication to use their proxies, like username/password or API keys.

Make sure to handle the authentication accordingly when integrating proxies into your scraper.
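
With the requests library, the usual pattern for username/password proxies is to embed the credentials in the proxy URL. The credentials and address below are placeholders:

import requests

# Placeholder credentials and proxy address; swap in your provider's real values
proxy = "USERNAME:PASSWORD@143.198.184.10:8080"

response = requests.get(
    "http://example.com",
    proxies={"http": f"http://{proxy}", "https": f"http://{proxy}"},
    timeout=30,
)
print(response.status_code)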

Proxy Provider APIs

For production scraping, use a paid proxy API instead of static lists. These make management easier by providing a single endpoint to request fresh proxies.

Behind the scenes, the provider handles rotating their huge IP pools through different technologies like residential VPNs, datacenters, etc.
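
The exact endpoint and response format depend entirely on the provider, but the flow usually looks something like this sketch. The URL, API key parameter, and plain-text response are all invented for illustration:

import requests

# Hypothetical provider endpoint that returns one proxy per line as plain text
PROVIDER_URL = "https://api.example-provider.com/v1/proxies?key=YOUR_API_KEY"

def fetch_fresh_proxies():
    response = requests.get(PROVIDER_URL, timeout=30)
    response.raise_for_status()
    return response.text.splitlines()

# Feed whatever the provider returns into the unchecked pool for validation
unchecked.update(fetch_fresh_proxies())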

Build a Scalable Proxy Infrastructure

Maintaining proxy health checks, rotation logic, authentication, etc. takes work.

For large scale scraping, it's best to use a ready-made proxy management system instead of coding from scratch.

Platforms like Bright Data handle all proxy operations.
