How to Build a Web Crawler in Python

Web crawling is an important technique for gathering large amounts of data from the web. Python is one of the most popular languages for writing web crawlers thanks to its fast development cycle, rich ecosystem of libraries, and easy-to-read syntax. In this comprehensive guide, you'll learn step-by-step how to build a web crawler in Python.

What is a Web Crawler?

A web crawler is a program that browses the web in a methodical, automated manner. The crawler starts with a list of seed URLs to visit, identifies the hyperlinks in each page's content, and adds them to the list of URLs to crawl. This process continues recursively, following links between pages to index large swathes of the web.

Web crawlers power search engines, archive sites, price comparison services, news aggregators, and more. Without web crawling technology, many essential online services would not exist.

The key components of a web crawler include:

  • A URL frontier which contains the list of URLs to crawl. This is seeded with starting URLs and expanded by extracting links from downloaded pages.
  • A fetcher that downloads the content of a URL using the HTTP protocol.
  • A parser that extracts links and data from the downloaded content.
  • A duplicate eliminator that ensures the same URL is not crawled multiple times.
  • A data store to record extracted information.
  • A scheduler that prioritizes which URLs to crawl next.

By assembling these components intelligently, we can build a scalable Python web crawler capable of indexing even huge sites with millions of pages.
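To make these roles concrete, here is a minimal sketch of how the components might fit together. The class and the toy link graph are illustrative only, not from any library; fetching and parsing are injected as functions so the sketch runs without network access.

```python
from collections import deque

class Crawler:
  """Minimal crawler skeleton; fetch and parse steps are injected for clarity."""

  def __init__(self, seeds, fetch, extract_links):
    self.frontier = deque(seeds)        # URL frontier (FIFO scheduler)
    self.visited = set()                # duplicate eliminator
    self.store = {}                     # data store: url -> page content
    self.fetch = fetch                  # fetcher
    self.extract_links = extract_links  # parser

  def run(self, max_pages=10):
    while self.frontier and len(self.store) < max_pages:
      url = self.frontier.popleft()
      if url in self.visited:
        continue
      self.visited.add(url)
      content = self.fetch(url)
      self.store[url] = content
      for link in self.extract_links(content):
        if link not in self.visited:
          self.frontier.append(link)

# Toy link graph standing in for the real web (hypothetical URLs)
graph = {"a": ["b", "c"], "b": ["a"], "c": []}
crawler = Crawler(["a"], fetch=lambda u: graph.get(u, []),
                  extract_links=lambda page: page)
crawler.run()
print(sorted(crawler.store))  # ['a', 'b', 'c']
```

Each component maps onto one attribute or parameter, which makes it easy to swap implementations later (for example, replacing the deque with a priority queue).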


Prerequisites

Before you start building a Python web crawler, you should have:

  • Basic knowledge of Python.
  • Experience with HTML and CSS selectors.
  • Familiarity with HTTP requests and responses.

Additionally, you'll need to install the following libraries:

pip install requests beautifulsoup4

Requests will be used to download web pages, while Beautiful Soup parses the HTML content.
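To see what Beautiful Soup contributes on its own, here is a quick sketch that parses an inline HTML string (so it runs without any network access) and pulls out the title and links:

```python
from bs4 import BeautifulSoup

html = """
<html><head><title>Example Page</title></head>
<body><a href="/about">About</a> <a href="https://example.com">Home</a></body></html>
"""

soup = BeautifulSoup(html, "html.parser")
print(soup.title.text)                           # Example Page
links = [a["href"] for a in soup.find_all("a")]
print(links)                                     # ['/about', 'https://example.com']
```

In the crawler, the string passed to BeautifulSoup will instead come from `response.text` after a `requests.get` call.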

Create a Basic Crawler

Let's start by building a simple crawler to download and process pages from a test site.

First, we need to import the libraries:

import requests 
from bs4 import BeautifulSoup

Define a list urls to hold the URLs to crawl and populate it with the starting page:

urls = ["https://example.com"]  # replace with your seed URL

Then enter an infinite loop – we will crawl URLs forever until interrupted:

while True:

  # Get next URL to crawl
  current_url = urls.pop()

  # Download page content
  response = requests.get(current_url)

  # Parse HTML
  soup = BeautifulSoup(response.text, 'html.parser')

  # Find all links on page
  for anchor in soup.find_all("a"):
    link = anchor.attrs.get("href", "")
    # Add new URLs to the list
    if link.startswith('/'):
      urls.append(current_url + link)

This implements a basic crawler following all links between pages on a single site. We pop each URL from the list, download the content, parse the HTML, extract further links, and add them back to the list.
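One caveat: concatenating `current_url` and a relative link only works when the current URL is the site root. The standard library's urljoin resolves absolute paths, relative paths, and full URLs correctly against any base:

```python
from urllib.parse import urljoin

base = "https://example.com/blog/post.html"

print(urljoin(base, "/about"))      # absolute path, resolved against the site root
print(urljoin(base, "next.html"))   # relative path, resolved against the current directory
print(urljoin(base, "https://other.example/x"))  # full URLs pass through unchanged
```

Replacing the string concatenation with `urljoin(current_url, link)` makes the crawler robust on pages deeper than the root.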

However, it has some issues:

  • Crawls infinitely without stopping.
  • May crawl duplicates if pages link to each other.
  • Doesn't record any useful data.

Let's look at enhancing the crawler to address these problems.


Improve the Crawler

Here are some ways we can improve our basic crawler:

Limit Crawl Scope

To prevent infinite crawling, we can define a maximum number of pages to crawl:

max_urls = 100
num_crawled = 0

while urls and num_crawled < max_urls:
  # Crawling logic
  num_crawled += 1

Alternatively, we can limit crawl depth by storing each URL's depth in the frontier alongside the URL itself:

max_depth = 2
urls = [(start_url, 0)]

while urls:
  current_url, depth = urls.pop()
  # Crawling logic; when adding a discovered link:
  if depth < max_depth:
    urls.append((link, depth + 1))

This will restrict the crawler to a finite section of the site.
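Here is a self-contained illustration of depth-limited crawling over a toy link graph (the page names are hypothetical stand-ins for real URLs):

```python
from collections import deque

# Toy link graph standing in for real pages (hypothetical URLs)
graph = {
  "home": ["a", "b"],
  "a": ["a1"],
  "b": ["b1"],
  "a1": ["deep"],
}

def crawl_to_depth(seed, max_depth):
  visited = set()
  frontier = deque([(seed, 0)])  # (url, depth) pairs
  while frontier:
    url, depth = frontier.popleft()
    if url in visited:
      continue
    visited.add(url)
    if depth < max_depth:  # only expand links from pages above the depth limit
      for link in graph.get(url, []):
        frontier.append((link, depth + 1))
  return visited

print(sorted(crawl_to_depth("home", 2)))  # 'deep' at depth 3 is excluded
```

Because depth travels with each URL, pages at the limit are still fetched but their links are not followed, which is the behavior most people expect from a depth cap.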

Avoid Duplicates

To prevent re-crawling pages, we can maintain a set of visited URLs:

visited_urls = set()

while True:

  current_url = urls.pop()
  if current_url in visited_urls:
    continue
  visited_urls.add(current_url)
  # Crawling logic

Before processing a URL, we check whether it has already been visited; if so, we skip it. Otherwise we mark it as visited and crawl it.

Store Data

We can store useful data from each page in a database. For example:

from sqlalchemy import create_engine, text

engine = create_engine('sqlite:///crawler.db')

with engine.begin() as conn:
  conn.execute(text(
    "CREATE TABLE IF NOT EXISTS pages (url TEXT, title TEXT, body TEXT)"))

while True:

  # Crawl page
  data = {
    'url': current_url,
    'title': soup.title.text,
    'body': soup.get_text()
  }
  with engine.begin() as conn:
    conn.execute(
      text("INSERT INTO pages VALUES (:url, :title, :body)"), data)

This will save the URL, title and body content to a SQLite database table.
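If you prefer to avoid the SQLAlchemy dependency, the same idea works with the standard library's sqlite3 module. A minimal sketch with placeholder page data:

```python
import sqlite3

# In-memory database for illustration; use a file path for persistence
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE pages (url TEXT, title TEXT, body TEXT)")

# Placeholder values standing in for current_url / soup output
page = {"url": "https://example.com", "title": "Example", "body": "Hello"}
conn.execute("INSERT INTO pages VALUES (:url, :title, :body)", page)
conn.commit()

rows = conn.execute("SELECT url, title FROM pages").fetchall()
print(rows)  # [('https://example.com', 'Example')]
```

Note the same `:name` parameter style works in both libraries, so switching between them is mostly a matter of connection setup.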

Prioritize Links

Instead of a list, we can use a priority queue to crawl important pages first:

from queue import PriorityQueue

url_queue = PriorityQueue()
url_queue.put((1, start_url))

while not url_queue.empty():

  priority, url = url_queue.get()
  # Crawl page
  for link in page_links:
    url_queue.put((2, link))

Python's PriorityQueue retrieves the item with the lowest value first, so the starting URL (priority 1) is crawled before discovered links (priority 2).
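A quick demonstration of that ordering, with hypothetical labels in place of real URLs:

```python
from queue import PriorityQueue

q = PriorityQueue()
q.put((2, "discovered-link"))
q.put((1, "seed-url"))
q.put((2, "another-link"))

order = [q.get()[1] for _ in range(3)]
print(order)  # lowest priority number comes out first
```

Items with equal priority fall back to comparing the second tuple element, so entries sharing a priority are retrieved in lexicographic order here.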

Parallelize Crawling

To speed up crawling, we can launch multiple threads which call the crawling function in parallel:

from threading import Thread

# Crawl function
def crawl(url_queue):
  while True:
    url = url_queue.get()
    if url is None:  # sentinel value: no more work
      break
    # Process url

# Create thread pool
threads = []
for i in range(10):
  t = Thread(target=crawl, args=(url_queue,))
  t.start()
  threads.append(t)

# Signal shutdown, then join all threads
for t in threads:
  url_queue.put(None)
for t in threads:
  t.join()

Because crawling is I/O-bound, running multiple crawler threads concurrently can yield near-linear speedups despite Python's global interpreter lock.
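The standard library's ThreadPoolExecutor handles thread creation, distribution, and joining for you. A sketch, with a stand-in fetch function (hypothetical; a real crawler would call requests.get here) so it runs offline:

```python
from concurrent.futures import ThreadPoolExecutor

# Stand-in for the real fetch step (hypothetical; would call requests.get)
def fetch(url):
  return f"<html>content of {url}</html>"

urls = [f"https://example.com/page{i}" for i in range(5)]

with ThreadPoolExecutor(max_workers=4) as pool:
  # map runs fetch across worker threads and preserves input order
  pages = list(pool.map(fetch, urls))

print(len(pages))  # 5
```

`pool.map` returns results in the same order as the input URLs, which keeps downstream parsing simple even though fetches complete out of order.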

Production Crawler Tips

Here are some additional tips for building a production-grade crawler:

  • Respect robots.txt: Read this file on each site to avoid blocked or restricted areas.
  • Limit request rate: Avoid overloading servers by adding delays between requests and capping concurrency.
  • Cache content: Save a local copy of downloaded pages to avoid hitting the network for duplicates.
  • Randomize user-agent: Use a mix of user-agent strings to appear like a real browser.
  • Use a headless browser: For sites that require JavaScript, use Selenium or Playwright to drive a real browser.
  • Distribute crawling: Spread load across multiple servers to scale horizontally.
  • Persist data efficiently: Use a dedicated store like Elasticsearch to hold structured crawl data.

By incorporating these practices, you can crawl huge sites while avoiding problems like getting blocked or overloading resources.
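The first two tips are easy to wire in with the standard library alone. A sketch using urllib.robotparser and a fixed delay (the robots.txt rules are inlined here for illustration; a real crawler would fetch them from the site's /robots.txt):

```python
import time
import urllib.robotparser

# Parse robots.txt rules (inlined; normally fetched from https://site/robots.txt)
rp = urllib.robotparser.RobotFileParser()
rp.parse(["User-agent: *", "Disallow: /private/"])

urls = ["https://example.com/", "https://example.com/private/secret"]
allowed = [u for u in urls if rp.can_fetch("MyCrawler", u)]
print(allowed)  # the /private/ URL is filtered out

# Politeness: pause between requests to limit request rate
for url in allowed:
  # requests.get(url) would go here
  time.sleep(1.0)  # roughly 1 request/second; tune per site
```

Checking can_fetch before every request and sleeping between requests is usually enough to stay welcome on most sites; production crawlers also honor the Crawl-delay directive where present.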

Python Crawling Tools

Instead of coding a crawler from scratch, you can use Python libraries and frameworks that provide pre-built functionality:

Scrapy – A popular web scraping framework with built-in support for crawling. It handles queues, requests, parsing and more.

pyspider – Lightweight crawling framework with web UI and clustering support.

CrawlPy – Simple crawling library built on top of Requests and Beautiful Soup.

MechanicalSoup – Created for web scraping tasks that require browser state and forms.

Portia – Visual web scraper that autogenerates a Scrapy crawler for a site.

Splash – JavaScript rendering service that can be used with Scrapy for dynamic content.

Scrapoxy – Scraper API acting as a proxy between your crawler and target sites.

For large scale production systems, Scrapy is a great choice. For small to medium projects, libraries like CrawlPy and MechanicalSoup are simpler to get started with.


Conclusion

In this guide, you learned how to build a web crawler in Python step by step:

  • Web crawlers automatically browse the web to index or extract data from sites.
  • Python is a great language for writing crawlers thanks to its ecosystem of libraries.
  • A basic crawler downloads pages, extracts links, and repeats recursively.
  • Enhancements like avoiding duplicates, setting scope, and parallelizing improve the crawler.
  • Production crawlers require additional features like respecting robots.txt, caching, and browser emulation.
  • Scraping frameworks like Scrapy provide pre-built support for Python crawling.

Now you have all the building blocks to start writing Python crawlers for gathering data across the internet!
