How to Web Scrape With Axios and Cheerio

Web scraping continues to grow in both popularity and necessity. As increasing amounts of valuable data sit buried in the deep web and in JavaScript-heavy client-side apps, the ability to extract that information reliably through automated scripts becomes ever more important.

Why Web Scraping Matters

The volume of data published on websites has exploded over the last decade. Research shows there are over 1.9 billion websites online as of 2022, with millions of new pages getting created daily.

Unfortunately, websites don't always make their underlying data easily accessible. Important content is embedded across templates and page flows rather than exposed in structured databases.

Valuable business data such as customer reviews, pricing information, directory listings, and social media conversations has to be extracted (scraped) from front-end websites.

As the shift towards complex JavaScript front-ends and single page applications continues, more of this data is rendered and manipulated in the browser rather than delivered in the initial HTML, which complicates extraction.

APIs alleviate these complexities by providing direct access to an application’s logic – the structured data and functionality you actually want. But even with accelerating API adoption, currently only around 8% of websites have APIs.

This massive growth of data combined with the continued reliance on scraping it from complex UIs makes web scraping essential. When APIs aren’t available, scraping is the next best option for extracting large volumes of web data.

The Cost of Blocks

The challenge modern websites face is balancing open access with security. Too many scrapers slamming servers with requests triggers costly performance issues and risks data leaks.

This is why sites implement bot mitigation protections, which often end up blocking legitimate scraping traffic as well. Research shows:

  • 20%+ of web traffic now comes from scrapers and crawlers rather than real users according to sources like PerimeterX
  • Estimated >$720 million yearly loss attributed solely to blocked scraping bots according to Distil Networks

Getting blocked mid-scrape leads to lost data, wasted engineering time, and restarting projects from scratch. So utilizing patterns that are intentional, well-designed, and minimally intrusive becomes critical.

This guide will demonstrate robust, ethical techniques for scraping modern sites using Axios for requests and Cheerio for parsing – two lightweight libraries perfect for the job.

Why Use Axios and Cheerio for Web Scraping?

Axios is an HTTP client designed for easily making requests and handling responses. It runs both in Node.js backend environments and browser-based frontends, providing a consistent interface between the two:

// Make GET request 
axios.get('/products')
  .then(res => {
    // res.data holds response data
  })

// Make POST request
axios.post('/cart', { item: 'Shampoo' }) 
  .then(res => {
   // res.data holds response data   
  })

This makes Axios a very natural fit for querying APIs and websites. You can quickly fire off requests just like a browser does to fetch responses and pass that HTML to other tools like Cheerio.

Some helpful Axios features include:

Transforming Responses: Automatically parse JSON data or alter result formats

axios.get('/products.json', {
  transformResponse: data => {
    // Parse and alter the raw JSON response
    const parsed = JSON.parse(data);
    return parsed;
  }
})

Interceptors: Globally alter requests and responses

axios.interceptors.request.use(config => {

  // Modify headers 
  config.headers['Authorization'] = 'token'

  return config
  
})

Concurrency: Make multiple parallel requests to improve performance

function getUser(id) {
  // Fetch user logic
  return axios.get(`/users/${id}`);
}

const requests = ids.map(id => getUser(id));

axios.all(requests)
  .then(axios.spread((...responses) => {
    // All requests completed
  }))

Cheerio, on the other hand, provides fast, flexible DOM manipulation modeled on the jQuery API. This makes it perfect for parsing HTML scraped with Axios and extracting the parts you need.

For example, here is some sample website HTML:

<html>
  <body>
    <h1>My Site</h1>

    <div class="products">
      <div class="item">
        <h2 class="name">Product 1</h2>
        <p class="price">$29.95</p>
      </div>

      <div class="item">
        <h2 class="name">Product 2</h2>
        <p class="price">$39.95</p>
      </div>
    </div>
  </body>
</html>

And using Cheerio we can traverse through this structure and pull out data:

const $ = cheerio.load(html); 

const prices = $('.price').map((i, el) => $(el).text()).get();
// ["$29.95", "$39.95"]

const names = $('.name').map((i, el) => $(el).text()).get(); 
// ["Product 1", "Product 2"]

This makes Cheerio perfect for pairing with Axios to scrape content, walk DOM structures, extract info, and create datasets.
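
Putting the two together yields a compact pipeline: Axios fetches the raw HTML and Cheerio parses it. Here is a minimal sketch (the URL and selector below are placeholders, not a real store):

const axios = require('axios');
const cheerio = require('cheerio');

async function scrapePrices(pageUrl) {

  // Fetch the raw HTML with Axios
  const { data: html } = await axios.get(pageUrl);

  // Parse it with Cheerio and pull out the prices
  const $ = cheerio.load(html);
  return $('.price').map((i, el) => $(el).text()).get();

}

scrapePrices('https://store.example.com')
  .then(prices => console.log(prices));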

Detailed Web Scraping Walkthrough

Let's go through a more advanced web scraping example using Axios to fetch content and Cheerio to parse it.

We'll scrape product listings from an example ecommerce store, extract images, details, handle pagination, and more.

Setup

Install dependencies:

npm install axios cheerio

Require packages:

const axios = require('axios');
const cheerio = require('cheerio');  

// Config
const url = 'https://store.example.com';

Fetch Page HTML

Use Axios to grab the initial page HTML:

let page = 1;

async function fetchHtml(targetUrl = `${url}/products?page=${page}`) {

  try {

    const { data } = await axios.get(targetUrl);
    return data;

  } catch(error) {

    console.error(error.message);
    return null;

  }

}

The default URL carries the pagination parameter, while an explicit URL can be passed for individual pages. async/await keeps the request flow readable, and returning null on errors gives callers a clear stop signal.
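
With the optional parameter, the same helper covers both the paginated listing pages and individual product pages (the detail path below is just a placeholder):

// Inside an async function:
const listingHtml = await fetchHtml(); // defaults to https://store.example.com/products?page=1
const detailHtml = await fetchHtml(`${url}/products/example-item`); // explicit (placeholder) URL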

Parse Product Listings

Use Cheerio to parse HTML and find products:

function getProducts(html) {

  const $ = cheerio.load(html);

  const items = $('.product'); // Selector 

  const products = [];

  items.each((index, element) => {

    const detailsUrl =  $(element).attr('href'); // Get URLs
    const imageUrl = $(element).find('img').attr('src'); // Get Images

    products.push({
       url: detailsUrl,  
       image: imageUrl     
    });

  });

  return products;

}

Loop through the listings, extract key attributes like image source and links, and create an array of objects.
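
Assuming the .product elements are anchor tags wrapping images (a hypothetical layout), a call to getProducts(html) would return something like:

// Hypothetical output for a listing page with two items
[
  { url: '/products/widget-1', image: '/images/widget-1.jpg' },
  { url: '/products/widget-2', image: '/images/widget-2.jpg' }
]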

Get Additional Data

Follow the product URLs, make additional requests, and extract more data:

async function getDetails(product) {

  // Resolve relative listing URLs against the base store URL
  const html = await fetchHtml(new URL(product.url, url).href);
  const $ = cheerio.load(html);

  return {
    ...product,
    title: $('.title').text(),
    description: $('.description').text(),
  }

}

Build on existing product objects by requesting their pages and selectively pulling info using more Cheerio selectors.

Paginate

Advance to the next page as each batch of products is processed:

async function processPage(html) {

  // 1. Extract products
  const products = getProducts(html);

  // 2. Enhance with additional data (wait for every detail request)
  const enhanced = await Promise.all(products.map(getDetails));

  // 3. Advance the page counter for the next fetch
  page++;

  return enhanced;

}

Once every product on the current page has been enhanced, the page counter increments so the next fetch pulls fresh listings automatically.

Full Script Flow

The overall scraping script flow then looks like:

async function scraper() {

  const results = [];

  let html = await fetchHtml();

  while(html) {

    // Extract and enhance products
    results.push(...await processPage(html));

    // Get next page (using the incremented page counter)
    html = await fetchHtml();

  }

  return results;

}

scraper();

The loop keeps running while requests return HTML; once fetchHtml returns null (an error or the end of the catalog), scraping stops.

This provides a reusable scraper recipe covering:

  • Initial page fetching
  • Listings extraction
  • Detail enhancement
  • Automated pagination

The full code for this example is available on GitHub at axios-cheerio-scraper.

Handling Anti-Scraping Defenses

Large sites invest heavily in bot mitigation protections to prevent heavy scraping efforts from impacting their infrastructure. Some of these methods include:

Blocking Traffic: Flagging suspicious volumes of requests from a single source
ReCAPTCHAs: Forcing challenges that distinguish humans from bots
Obfuscating Text: Scrambling content in the markup that still renders clearly in browsers
Honeypots: Hidden links or pages that only bots follow, flagging those requests as non-human

There’s a reason why websites take measures to protect themselves. A single scraper can bring down critical systems, expose data, lower performance for all visitors, and lead to loss of revenue.

Luckily there are programmatic ways of working responsibly within anti-scraping systems.

Proxy Rotation

One of the most common signals that gets a scraper detected is too many requests coming from a single IP address. Traffic looks highly suspicious if thousands of product pages are hit every second from one location.

Implementing proxy rotation routes your requests through multiple IP addresses around the world.

Here is a simplified, illustrative example modeled on a proxy-manager SDK such as Bright Data's (the exact package and method names vary by provider):

const { ProxyManager } = require('brightdata-proxy-manager');

const manager = new ProxyManager('YOUR_API_KEY');

async function fetch(url) {

  // Residential proxies from random subnets
  const proxy = await manager.fetchResidentialProxy(); 
  
  return axios.get(url, {
    proxy: {
      host: proxy.ipAddress,  
      port: proxy.port
    },
    headers: {
      'User-Agent': proxy.userAgent  
    } 
  })

}

This allows each request to originate from a different proxy with corresponding user agent.

The key benefit is that your scraper appears as normal human traffic rather than a single bot hammering from one location. Large commercial pools (over 24 million proxies in some networks) further reduce the chance of blocks.

Additional features like city/country targeting, session handling, sticky routing, and failover provide more advanced control over how requests are routed and rotated.

Throttling Requests

Another signal beyond location is request velocity. If thousands of requests fire every second from one proxy, that will still appear highly irregular.

Implementing throttling to limit concurrent request volume avoids spikes in traffic:

let requests = 0;
const MAX_REQUESTS = 5;

function handleRequest(url) {

  if (requests >= MAX_REQUESTS) {
    // Pause before sending new requests
    wait(url);
    return;
  }

  requests++;

  axios.get(url)
    .then(() => {
      requests--;
      // Kick off the next URL here
    });

}

function wait(url) {

  setTimeout(() => {
    handleRequest(url);
  }, 10000); // 10s pause

}

This example caps the number of in-flight requests at 5. When the cap is reached, new requests wait 10 seconds before retrying.

Tuning these thresholds allows high volume without spikes that appear bot-like.
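
A simple complement to a concurrency cap is adding a randomized delay between sequential requests so traffic never arrives in perfectly regular bursts. A minimal sketch, assuming axios is already required (the 1-3 second range is an arbitrary example):

function sleep(ms) {
  return new Promise(resolve => setTimeout(resolve, ms));
}

async function fetchAllSlowly(urls) {

  const results = [];

  for (const pageUrl of urls) {
    const { data } = await axios.get(pageUrl);
    results.push(data);

    // Random 1-3 second pause before the next request
    await sleep(1000 + Math.random() * 2000);
  }

  return results;

}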

Mimicking Browsers

Websites also inspect headers, user agents, and other signals that differ between browsers and scrapers:

Browser

User-Agent: Mozilla/5.0...
Accept: text/html

Scraper

User-Agent: Python/3.8  
Accept: */*

Use a tool like whatismybrowser.com to look up a real browser's header values, then send them with your requests:

axios.get(url, {
  headers: {
    'User-Agent': 'Mozilla/5.0...', 
    'Accept': 'text/html, */*; q=0.01',
    // etc
  }  
})

Setting other headers like Accept, Accept-Encoding, and Accept-Language, and keeping fingerprint signals such as canvas, WebGL, and timezone consistent, all help requests appear more user-like.

Rotating user agents from a library like faker.js improves variability:

const faker = require('faker');

function getUserAgent() {

  const browserUserAgents = [
     // List of realistic user agents  
  ];

  return faker.random.arrayElement(browserUserAgents);

}


axios.get(url, {
  headers: {
     'User-Agent': getUserAgent()
  }
})

Dealing with JavaScript Sites

Heavily client-side driven sites pose challenges for scrapers relying on server-rendered HTML.

Frameworks like React and Vue often serve only a minimal HTML shell, then fetch and render the actual content dynamically in the browser.

So while Axios receives a response to the initial request, the returned source contains little or no usable data.

Solutions include:

Headless Browsers

Headless Chrome, driven by a library like Puppeteer, executes JavaScript in a real browser so the page is fully rendered before you scrape the resulting content.
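
For example, here is a minimal sketch (assuming puppeteer is installed alongside cheerio) that renders a page in headless Chrome before handing the HTML to Cheerio:

const puppeteer = require('puppeteer');
const cheerio = require('cheerio');

async function scrapeRendered(pageUrl) {

  // Launch headless Chrome and let the page's JavaScript run
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto(pageUrl, { waitUntil: 'networkidle0' });

  // Grab the fully rendered HTML, then parse it as before
  const html = await page.content();
  await browser.close();

  const $ = cheerio.load(html);
  return $('.price').map((i, el) => $(el).text()).get();

}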

Downsides are reduced performance, needing to install browsers, and extra DevOps complexity.

JavaScript Rendering Services

Managed web scraping APIs like ScrapeHero and ProxyCrawl handle JavaScript execution internally and return final rendered HTML.

This alleviates needing to orchestrate browser infrastructure at the cost of relying on external services.

API Reverse Engineering

For SPAs, identify the actual AJAX calls retrieving content dynamically. Rather than scrape initial shells, directly request those API endpoints for structured JSON data.

Tools like Replay or the browser's own network inspector record site activity to uncover these endpoints hidden in client-side traffic.
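
For instance, if the network inspector reveals that listings come from a JSON endpoint, Axios can call it directly and skip HTML parsing entirely (the endpoint and response shape below are hypothetical):

// Inside an async function; endpoint discovered via the network inspector (hypothetical)
const { data } = await axios.get('https://store.example.com/api/products?page=1', {
  headers: { 'Accept': 'application/json' }
});

// data is already structured JSON, e.g. data.products[0].price
console.log(data.products);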

Conclusion

This guide covered a wide range of techniques for using Axios and Cheerio to build scalable web scrapers. Key topics included:

  • Scraping HTML pages
  • Extracting and transforming data
  • Automating pagination
  • Avoiding blocks with proxies
  • Scaling via throttling and browsers
  • Tackling single page applications

For straightforward sites, these libraries provide an easy way to automate data extraction using JavaScript without needing browser emulation.

When dealing with more sophisticated browser-side applications and defenses, additional considerations around mimicking user patterns are required. Proxies, throttling, headers, and rendering challenges all come into play.

Compared to building custom scraping infrastructure, leveraging fully managed scraping APIs through services like ScrapeHero and Octoparse can save huge engineering effort and cost. But understanding core techniques is still essential.
