How to Scrape From a List of URLs
When you need to scrape a large number of pages, having a list of target URLs ready can streamline and optimize the scraping process. In this guide, we'll explore methods for scraping pages from a list of URLs using Bright Data proxies for anonymity and scale.
Why Use Proxies for Scraping?
Scraping large numbers of pages without taking precautions can get your IP address blocked by target sites. Using rotating proxies helps avoid blocks by masking the originating IP address with thousands of IP addresses from the Bright Data proxy network.
Proxies also enable parallel scraping to speed up the process when scraping a long list of URLs. Checking URLs sequentially would take quite a long time for a large list, but making concurrent requests through multiple proxies lets you fetch pages simultaneously.
Getting Set Up
To follow the examples, you'll need:
- Python 3
- Pip installed
- Bright Data API key
Install the required libraries:
```shell
pip install requests aiohttp beautifulsoup4
```
Import what we need:
```python
import requests
import aiohttp
from bs4 import BeautifulSoup
```
Scraping Sequentially
A simple way to process a list of URLs is sequentially in a loop. We'll walk through an example using the requests library.
First, define the Bright Data API endpoint and your API key:
```python
PROXY_HOST = 'proxy.brightdata.com'
API_KEY = 'your_api_key'
```
We'll scrape a made-up list of product URLs:
```python
urls = [
    'https://www.example.com/products/1',
    'https://www.example.com/products/2',
    'https://www.example.com/products/3',
]
```
Then we can loop through each URL and make a scraping request using a Bright Data proxy:
```python
for url in urls:
    # Credentials go before the @ in a proxy URL (user:pass@host:port)
    proxy_url = f'http://{API_KEY}@{PROXY_HOST}'
    proxies = {'http': proxy_url, 'https': proxy_url}
    response = requests.get(url, proxies=proxies)
```
We pass a proxy URL of the form `http://<credentials>@<host>` (your Bright Data zone may also require a specific port) to the `proxies` parameter. This routes each request through the Bright Data proxy network.
To extract data, we can parse the response with BeautifulSoup:
```python
soup = BeautifulSoup(response.text, 'html.parser')
name = soup.select_one('.product-name').text
price = soup.select_one('.price').text
print(name, price)
```
This will sequentially scrape each page through a Bright Data proxy and extract the product name and price.
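To see the extraction step in isolation, here's a minimal sketch that parses a sample HTML snippet instead of a live response (the `.product-name` and `.price` class names are assumptions about the target page's markup):

```python
from bs4 import BeautifulSoup

# Sample markup standing in for a fetched product page
html = '''
<div class="product">
  <h1 class="product-name">Widget</h1>
  <span class="price">$9.99</span>
</div>
'''

soup = BeautifulSoup(html, 'html.parser')
name = soup.select_one('.product-name').text
price = soup.select_one('.price').text
print(name, price)  # Widget $9.99
```

Testing your selectors against a saved snippet like this is a quick way to debug extraction logic without burning proxy bandwidth.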
The sequential approach is simple, but can be slow for large lists of URLs. Next we'll look at scaling with concurrent requests.
Scraping Concurrently with aiohttp
To speed up scraping, we can use the aiohttp library to make asynchronous, concurrent requests through multiple proxies.
The following example shows how to scrape the list of product URLs concurrently:
```python
import asyncio

import aiohttp
from bs4 import BeautifulSoup

async def scrape_url(session, url):
    # Credentials go before the @ in a proxy URL
    proxy = f'http://{API_KEY}@{PROXY_HOST}'
    async with session.get(url, proxy=proxy) as response:
        soup = BeautifulSoup(await response.text(), 'html.parser')
        name = soup.select_one('.product-name').text
        price = soup.select_one('.price').text
        print(url, name, price)

async def main():
    async with aiohttp.ClientSession() as session:
        tasks = [scrape_url(session, url) for url in urls]
        await asyncio.gather(*tasks)

asyncio.run(main())
```
The key points:
- Create an `aiohttp.ClientSession` with `async with` to handle connections
- Define an `async` scrape function to fetch each URL
- Build a list of tasks, one per URL to scrape
- Use `asyncio.gather` to concurrently wait for all tasks to complete
- Pass the `ClientSession` to each scrape function to reuse connections
By awaiting the tasks concurrently, we end up making all the scraping requests in parallel through the Bright Data proxies instead of waiting for each one to finish before moving to the next.
This allows us to scrape a large list of URLs much faster!
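In practice you'll often want to cap how many requests run at once so you don't overwhelm the target site or exceed your proxy plan's limits. One way to sketch this, using a stand-in fetch coroutine instead of real network calls, is an `asyncio.Semaphore` (the limit of 5 is an arbitrary assumption):

```python
import asyncio

MAX_CONCURRENT = 5  # assumed limit; tune to your proxy plan

async def fetch(url):
    # Stand-in for a real session.get(...) call through a proxy
    await asyncio.sleep(0.01)
    return f'<html>{url}</html>'

async def scrape_url(semaphore, url):
    async with semaphore:  # at most MAX_CONCURRENT tasks run past this point
        return await fetch(url)

async def main(urls):
    semaphore = asyncio.Semaphore(MAX_CONCURRENT)
    tasks = [scrape_url(semaphore, url) for url in urls]
    return await asyncio.gather(*tasks)

results = asyncio.run(main([f'https://www.example.com/products/{i}' for i in range(20)]))
print(len(results))  # 20
```

All twenty tasks are still created up front, but the semaphore ensures only five are in flight at a time.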
Handling Errors
In both examples, we aren't doing any error handling for simplicity. But in a real scraper you'll want to anticipate and handle errors like connectivity issues, proxy blocks, timeouts, HTTP errors and more.
Here's an example with some basic error handling:
```python
async def scrape_url(session, url):
    try:
        ...  # scraping logic here
    except aiohttp.ClientConnectorError as e:
        print(f'Connection error for {url}: {e}')
    except aiohttp.ClientResponseError as e:
        print(f'HTTP error {e.status} for {url}')
    except Exception as e:
        print(f'Error scraping {url}: {e}')
```
This prints any errors encountered while scraping each URL so you can handle them appropriately.
Some other things you may want to implement:
- Retrying failed requests 2-3 times before giving up
- Saving failed URLs to retry later
- Using custom timeout and retry parameters
- Handling rate limiting specifically
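As a sketch of the retry idea (the helper name, attempt count, and delays are just for illustration), you can wrap a fetch in a retry loop with exponential backoff:

```python
import asyncio

async def fetch_with_retries(fetch, url, retries=3, base_delay=0.1):
    # Try the fetch up to `retries` times, doubling the delay between attempts
    for attempt in range(retries):
        try:
            return await fetch(url)
        except Exception:
            if attempt == retries - 1:
                raise  # out of attempts; let the caller record the failed URL
            await asyncio.sleep(base_delay * 2 ** attempt)

# Demo: a flaky fetch that fails twice, then succeeds on the third attempt
attempts = {'n': 0}

async def flaky_fetch(url):
    attempts['n'] += 1
    if attempts['n'] < 3:
        raise ConnectionError('simulated proxy error')
    return f'ok: {url}'

result = asyncio.run(fetch_with_retries(flaky_fetch, 'https://www.example.com/products/1'))
print(result)  # ok: https://www.example.com/products/1
```

URLs that still fail after the final attempt can be collected into a list and retried in a later run.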
Robust error handling will ensure you gather as much data as possible from all the URLs.
In Summary
Scraping from a pre-made list of URLs is a common need for large scraping projects. By leveraging Bright Data proxies, you can:
- Rotate IP addresses to avoid blocks
- Scrape concurrently for improved speed
- Build robust error handling for reliability
Whether scraping a few pages or a few million, proxies help ensure successful extraction of data from all your target URLs.