How to Do Web Scraping in R

Web scraping in R is one of the most popular methods data scientists use to extract data from websites. In this comprehensive tutorial, you'll learn how to do web scraping in R using libraries like rvest and RSelenium.

We'll also cover how to use a Bright Data proxy to avoid getting blocked while scraping.

Is R Good for Web Scraping?

Yes, R is a great programming language for web scraping! R is designed for data analysis and has many data-oriented libraries to support your web scraping goals.

Prerequisites

Here are the main tools we'll use:

  • R 4+: any recent version of R will work. We used R 4.2.2.
  • RStudio: A free and open source IDE for R.
  • Bright Data proxy: Rotating proxy service to avoid blocks.

If you don't have RStudio installed, download it from rstudio.com. You can also sign up for a free Bright Data trial on the Bright Data website.
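
Not sure which version of R you have? You can check it directly from the R console:

# Print the version of R currently in use
R.version.string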

Set Up an R Project in RStudio

After setting up the environment, let's initialize an R project in RStudio:

  1. Launch RStudio
  2. Click File > New Project > New Directory > R Project
  3. Give the project a name and click Create Project

This will create a new folder for your project.

Getting Started with rvest

rvest is the most popular R library for web scraping. It allows you to download an HTML page, parse it, select elements, and extract data.

To install:

install.packages("rvest")

Then load it:

library(rvest)

Retrieve the HTML Page

Let's scrape books.toscrape.com:

# Retrieve the page
page <- read_html("https://books.toscrape.com/")   

# Select products 
products <- page %>% 
  html_elements(".product_pod")

read_html() downloads and parses the HTML. We then use html_elements() with the CSS selector .product_pod to select all product cards on the page.
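
As a quick sanity check (assuming the page downloaded correctly), you can count how many nodes the selector matched. books.toscrape.com lists 20 books per page:

# Count the product cards matched by the CSS selector
length(products)  # should be 20 on books.toscrape.com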

Extract Data

To extract data, we chain together html_element() to select an element, and functions like html_text() and html_attr() to extract text or attribute values:

# Get title
title <- products %>%
  html_element("h3") %>%
  html_text()

# Get price 
price <- products %>% 
  html_element(".price_color") %>%
  html_text()
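
html_attr() works the same way for attribute values. For example, still using the products nodes from above, the anchor inside each h3 on books.toscrape.com carries the book's relative URL in its href attribute and the full, untruncated title in its title attribute:

# Get each book's relative URL from the href attribute
url <- products %>%
  html_element("h3 a") %>%
  html_attr("href")

# Get the full title from the title attribute (the h3 text is truncated)
full_title <- products %>%
  html_element("h3 a") %>%
  html_attr("title")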

Let's put this into a function to scrape all products on a page:

scrape_products <- function(page) {

  products <- page %>% 
    html_elements(".product_pod")
    
  title <- products %>%
    html_element("h3") %>%
    html_text()
  
  price <- products %>% 
    html_element(".price_color") %>%
    html_text()
  
  return(data.frame(title, price))
}

Test it out:

> products <- scrape_products(page)
> head(products)
               title  price
1 A Light in the ... £51.77
2 Tipping the Velvet £53.74
3         Soumission £50.10

It works! 🎉

Now we can collect every product by crawling each page of the site.

Crawl Paginated Pages

To scrape all pages:

  1. Extract links to other pages
  2. Add to a queue
  3. Loop through queue to scrape each page

Here is the full script:

# Configure the Bright Data proxy (replace HOST:PORT and the credentials with your own)
proxy_config <- httr::use_proxy("http://HOST:PORT", username = "username", password = "password")

# List of pages to scrape  
pages_to_scrape <- "https://books.toscrape.com/"
pages_crawled <- character()

all_products <- data.frame()

while(length(pages_to_scrape) > 0) {

  # Pop next page 
  current_page <- pages_to_scrape[1]
  
  # Remove page 
  pages_to_scrape <- pages_to_scrape[-1]
  
  # Fetch the page through the Bright Data proxy, then parse the HTML
  response <- httr::GET(current_page, proxy_config)
  page <- read_html(response)
  
  # Scrape products
  products <- scrape_products(page)
  
  # Find the "next page" link and convert it to an absolute URL
  links <- page %>% 
    html_elements(".next a") %>%
    html_attr("href") %>%
    xml2::url_absolute(current_page)
  
  pages_to_scrape <- c(pages_to_scrape, links) 
  
  # Track crawled
  pages_crawled <- c(pages_crawled, current_page)
  
  # Combine all data
  all_products <- rbind(all_products, products)
  
}

# Remove duplicates
all_products <- all_products[!duplicated(all_products),]

# Write to CSV
write.csv(all_products, "products.csv")

This script:

  • Configures the Bright Data proxy
  • Keeps list of pages to scrape
  • Pops and scrapes each page
  • Finds next page links and adds them to the queue
  • Removes duplicate product rows
  • Combines all data into one dataframe
  • Writes data to a CSV file

By routing each request through the Bright Data proxy with httr::use_proxy(), the script avoids blocks while crawling the entire site.

Scraping JavaScript Sites

For sites that load content dynamically with JavaScript, we need to use a headless browser.

Let's see how to scrape with RSelenium.

First, install RSelenium:

install.packages("RSelenium")

Now load Selenium and launch a browser:

# Load selenium
library(RSelenium)

# Launch headless Chrome
driver <- rsDriver(browser = "chrome",
                   chromever = "108.0.5359.22", 
                   extraCapabilities = list(
                     "goog:chromeOptions" = list(
                       args = c("--headless")
                     )
                   )
                  )
                  
web_driver <- driver[["client"]]

Next, navigate to the target page:

web_driver$navigate("https://books.toscrape.com/")

We can now scrape dynamically loaded content the same way by using the RSelenium element finder functions like findElement().
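
For example, here is a minimal sketch (assuming the web_driver session started above) that grabs the first book title on the page:

# Find the first product title and read its text
first_title <- web_driver$findElement(using = "css selector", ".product_pod h3")$getElementText()[[1]]
print(first_title)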

Let's put it together into a script:

# Launch headless Chrome, routing traffic through the Bright Data proxy.
# Note: the --proxy-server flag does not accept credentials, so with a
# username/password proxy you typically whitelist your machine's IP
# with the proxy provider instead.
driver <- RSelenium::rsDriver(
  browser = "chrome",
  extraCapabilities = list(
    "goog:chromeOptions" = list(
      args = c("--headless", "--proxy-server=http://HOST:PORT")
    )
  )
)
web_driver <- driver[["client"]]

# Navigate 
web_driver$navigate("https://books.toscrape.com/")

# Get products
products <- web_driver$findElements(using = "css selector", ".product_pod")

# Iterate and extract data 
for (product in products) {

  title <- product$findChildElement(using = "css selector", "h3")$getElementText()[[1]]

  price <- product$findChildElement(using = "css selector", ".price_color")$getElementText()[[1]]

  # Save data
  ...

}

This script uses headless Chrome to render the page, including any JavaScript-loaded content, and then selects elements to extract their data.

By routing traffic through the Bright Data proxy, it rotates IPs to avoid blocks.

Conclusion

In this comprehensive tutorial, we learned:

  • Basic web scraping in R with rvest
  • Crawling techniques to scrape entire sites
  • Using RSelenium for dynamic JavaScript sites
  • Avoiding blocks with Bright Data proxy

R is a very versatile language for web scraping. With a few key libraries like rvest and RSelenium, you can build scrapers for almost any website.

The main challenge is avoiding blocks. This is where a proxy service like Bright Data helps manage IP rotation seamlessly.
