How to Scrape From a List of URLs

When you need to scrape a large number of pages, having the target URLs collected in a list up front streamlines the process. In this guide, we'll walk through how to scrape pages from a list of URLs using Bright Data proxies for anonymity and scale.

Why Use Proxies for Scraping?

Scraping large numbers of pages without taking precautions can get your IP address blocked by target sites. Using rotating proxies helps avoid blocks by masking the originating IP address with thousands of IP addresses from the Bright Data proxy network.

Proxies also enable parallel scraping, which speeds things up when you have a long list of URLs. Fetching URLs one at a time would take quite a while for a large list, while making concurrent requests through multiple proxies lets you fetch many pages at once.

Getting Set Up

To follow the examples, you'll need:

  • Python 3
  • Pip installed
  • Bright Data API key

Install the required libraries:

pip install requests aiohttp beautifulsoup4

Import what we need:

import requests
import aiohttp
from bs4 import BeautifulSoup

Scraping Sequentially

A simple way to process a list of URLs is sequentially in a loop. We'll walk through an example using the requests library.

First, define the Bright Data API endpoint and your API key:

PROXY_HOST = 'proxy.brightdata.com' 
API_KEY = 'your_api_key'

We'll scrape a made-up list of product URLs:

urls = [
    'https://www.example.com/products/1',
    'https://www.example.com/products/2',
    'https://www.example.com/products/3' 
]
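In practice, you'll often load this list from a file rather than hard-coding it. Here's a minimal sketch assuming a urls.txt file with one URL per line (the filename is just an example):

with open('urls.txt') as f:
    urls = [line.strip() for line in f if line.strip()]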

Then we can set up the proxy configuration and loop through each URL, making each request through a Bright Data proxy:

proxy_url = f'http://{API_KEY}@{PROXY_HOST}'
proxies = {'http': proxy_url, 'https': proxy_url}

for url in urls:
    response = requests.get(url, proxies=proxies)

We build the proxy URL from the API key and proxy host (credentials before the @, host after) and pass it to the proxies parameter. This routes each request through the Bright Data proxy network.
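To sanity-check that requests are actually being routed through the proxy, you can hit an IP-echo service and confirm the reported address isn't your own. Here httpbin.org is just an example endpoint, not part of Bright Data:

check = requests.get('https://httpbin.org/ip', proxies=proxies)
print(check.json())  # should show a proxy IP, not your own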

To extract data, we can parse the response with BeautifulSoup:

soup = BeautifulSoup(response.text, 'html.parser')
name = soup.select_one('.product-name').text
price = soup.select_one('.price').text
print(name, price)

This will sequentially scrape each page through a Bright Data proxy and extract the product name and price.
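Putting the pieces together, here's a minimal end-to-end sketch of the sequential scraper. The .product-name and .price selectors are placeholders for the made-up product pages, and the 30-second timeout is just an example value:

proxy_url = f'http://{API_KEY}@{PROXY_HOST}'
proxies = {'http': proxy_url, 'https': proxy_url}

results = []
for url in urls:
    response = requests.get(url, proxies=proxies, timeout=30)
    soup = BeautifulSoup(response.text, 'html.parser')
    results.append({
        'url': url,
        'name': soup.select_one('.product-name').text,
        'price': soup.select_one('.price').text,
    })

print(results)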

The sequential approach is simple, but can be slow for large lists of URLs. Next we'll look at scaling with concurrent requests.

Scraping Concurrently with aiohttp

To speed up scraping, we can leverage the aiohttp library to make asynchronous, concurrent requests through multiple proxies.

The following example shows how to scrape the list of product URLs concurrently:

import aiohttp
import asyncio

async def scrape_url(session, url):
    proxy = f'http://{API_KEY}@{PROXY_HOST}'
    async with session.get(url, proxy=proxy) as response:
        soup = BeautifulSoup(await response.text(), 'html.parser')
        name = soup.select_one('.product-name').text
        price = soup.select_one('.price').text
        print(url, name, price)

async def main():
    async with aiohttp.ClientSession() as session:
        tasks = []
        for url in urls:
            task = asyncio.ensure_future(scrape_url(session, url))
            tasks.append(task)
        
        await asyncio.gather(*tasks)
        
asyncio.run(main())

The key points:

  • Create an aiohttp.ClientSession() inside an async with block to manage connections
  • Define an async scrape function to fetch each URL
  • Build a list of tasks for each URL to scrape
  • Use asyncio.gather to concurrently wait for all tasks to complete
  • Pass the ClientSession to each scrape function to reuse connections

By awaiting the tasks concurrently, we end up making all the scraping requests in parallel through the Bright Data proxies instead of waiting for each one to finish before moving to the next.

This allows us to scrape a large list of URLs much faster!
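For very large lists, you may also want to cap how many requests run at once rather than firing them all simultaneously. Here's a minimal sketch using asyncio.Semaphore; the limit of 10 and the scrape_url_limited name are just example choices:

async def scrape_url_limited(semaphore, session, url):
    # wait for a free slot before scraping
    async with semaphore:
        await scrape_url(session, url)

# inside main(), create the semaphore and schedule the limited version:
# semaphore = asyncio.Semaphore(10)
# task = asyncio.ensure_future(scrape_url_limited(semaphore, session, url))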

Handling Errors

In both examples, we aren't doing any error handling for simplicity. But in a real scraper you'll want to anticipate and handle errors like connectivity issues, proxy blocks, timeouts, HTTP errors and more.

Here's an example with some basic error handling:

async def scrape_url(session, url):
    try:
        proxy = f'http://{API_KEY}@{PROXY_HOST}'
        async with session.get(url, proxy=proxy) as response:
            # raise ClientResponseError for HTTP error statuses
            response.raise_for_status()
            soup = BeautifulSoup(await response.text(), 'html.parser')
            print(url, soup.select_one('.product-name').text)
    except aiohttp.ClientConnectorError as e:
        print(f'Connection error for {url}: {e}')
    except aiohttp.ClientResponseError as e:
        print(f'HTTP error {e.status} for {url}')
    except Exception as e:
        print(f'Error scraping {url}: {e}')

This prints any errors encountered while scraping each URL so you can handle them appropriately.

Some other things you may want to implement:

  • Retrying failed requests 2-3 times before giving up (see the sketch after this list)
  • Saving failed URLs to retry later
  • Using custom timeout and retry parameters
  • Handling rate limiting specifically
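
As a rough illustration of the first three points, here's a minimal sketch of a retry wrapper around the scrape_url coroutine. The scrape_with_retry name, the retry count, and the backoff delays are all example choices, and a per-request timeout can be set separately when creating the session (e.g. with aiohttp.ClientTimeout):

failed_urls = []

async def scrape_with_retry(session, url, max_retries=3):
    for attempt in range(1, max_retries + 1):
        try:
            return await scrape_url(session, url)
        except Exception as e:
            print(f'Attempt {attempt} failed for {url}: {e}')
            if attempt == max_retries:
                # give up and save the URL for a later retry pass
                failed_urls.append(url)
            else:
                # back off briefly before trying again
                await asyncio.sleep(2 ** attempt)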

Robust error handling helps ensure you gather as much data as possible from your list of URLs.

In Summary

Scraping from a pre-made list of URLs is a common need for large scraping projects. By leveraging Bright Data proxies, you can:

  • Rotate IP addresses to avoid blocks
  • Scrape concurrently for improved speed
  • Build robust error handling for reliability

Whether scraping a few pages or a few million, proxies help ensure successful extraction of data from all your target URLs.
