How to Use Goquery for Web Scraping in Golang

Web scraping is growing rapidly as companies rely on data extracted from the web to gain a competitive edge. Some stats:

  • 72% of organizations use web scraping as a source of data collection according to Parsehub.
  • The web scraping market is projected to grow at over 20% CAGR and hit $13 billion by 2027 according to Mordor Intelligence.

With the right tools, web scraping can be easy to implement. This brings us to goquery – one of the most popular libraries for scraping HTML content using Golang. In this guide, we'll cover:

  • What is goquery and how it works
  • Step-by-step guide to install and set up goquery
  • CSS selector basics to extract any data
  • Practical real-world examples of using goquery
  • Tips to avoid getting blocked while scraping
  • How to integrate with other Go scraping libraries
  • When to use goquery vs other similar tools

Let's get started!

What is Goquery?

Goquery is an open source Go library that allows scraping data from HTML documents using a query syntax similar to jQuery.

It provides a very simple API to:

  • Load HTML pages
  • Find matching elements using CSS selectors
  • Extract data like text, attributes, HTML
  • Navigate and manipulate the DOM
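For a quick taste, here is a minimal, self-contained sketch that parses HTML from a string and prints the text of each matched element (the markup is made up for illustration):

package main

import (
  "fmt"
  "log"
  "strings"

  "github.com/PuerkitoBio/goquery"
)

func main() {
  // A made-up HTML snippet to demonstrate the API
  html := `<ul><li class="item">Go</li><li class="item">goquery</li></ul>`

  doc, err := goquery.NewDocumentFromReader(strings.NewReader(html))
  if err != nil {
    log.Fatal(err)
  }

  // Select every element with class "item" and print its text
  doc.Find(".item").Each(func(i int, s *goquery.Selection) {
    fmt.Println(s.Text())
  })
}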

Goquery builds on top of Go's built-in net/html parser and the robust cascadia CSS selector library under the hood.

This enables writing query strings to select matching elements from HTML pages just like jQuery. Once the desired content is matched, goquery provides various functions to extract and manipulate data from the selected elements.

Overall, goquery is a lightweight yet powerful library for scraping HTML content without needing browser emulation or JavaScript execution (which are complex to set up).

Let's now see how to install and start using it for your web scraping projects.

Getting Started with Goquery

To install and set up a Go scraper project using goquery, follow these steps:

Set up a Go Project

Create a dedicated folder for your project:

mkdir goquery-scraper
cd goquery-scraper

Next, initialize a Go module, which lets you track dependencies:

go mod init scraper

This will create a go.mod file tracking the module details.

Now install the goquery library:

go get github.com/PuerkitoBio/goquery

This will download goquery and add it to the go.mod file.
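After running go get, your go.mod should look roughly like this (the exact Go and goquery versions depend on when you run the commands, and a few indirect dependencies will be listed as well):

module scraper

go 1.21

require github.com/PuerkitoBio/goquery v1.9.2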

Let's create a main Go file called scraper.go with a basic skeleton along these lines:
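package main

import (
  "fmt"

  // Blank import keeps goquery listed in go.mod until we use it directly
  _ "github.com/PuerkitoBio/goquery"
)

func main() {
  fmt.Println("Scraper ready")
}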

This sets up the project structure and imports. We are ready to start using goquery now!

Make a GET Request

To download the HTML document of any webpage, we can use Go's built-in net/http package.

First import the http package:

import (
  "net/http"
)

Then use the http.Get() method to make a request and store the result:

res, err := http.Get("https://example.com")

if err != nil {
  log.Fatal(err)
}

This sends a GET request to the URL and returns the result in an http.Response struct res.

We should also check the HTTP status code, since a request can succeed at the network level yet still return an error page:

if res.StatusCode != 200 {
  log.Fatalf("Status code error: %d %s", res.StatusCode, res.Status)
}

The HTML document of the webpage is available as a stream in res.Body.
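Since res.Body is a stream, it should also be closed once we are done reading it:

defer res.Body.Close()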

Parse HTML using Goquery

To load the HTML into a goquery Document, we can use the goquery.NewDocumentFromReader() function:

doc, err := goquery.NewDocumentFromReader(res.Body)

This parses the HTML content from res.Body into a goquery Document which we can now traverse and extract data from.

Let's put it together in our scraper.go file:

package main

import (
  "log"
  "net/http"

  "github.com/PuerkitoBio/goquery"
)

func main() {

  // HTTP GET request
  res, err := http.Get("https://example.com")
  if err != nil {
    log.Fatal(err)
  }
  defer res.Body.Close()

  if res.StatusCode != 200 {
    log.Fatalf("Status code error: %d %s", res.StatusCode, res.Status)
  }

  // Parse HTML
  doc, err := goquery.NewDocumentFromReader(res.Body)
  if err != nil {
    log.Fatal(err)
  }

  // Web scraping logic...
  _ = doc // placeholder until we add extraction logic

}

This sends a GET request, checks for errors, and loads the HTML into a goquery document.

Now we are ready to implement the web scraping logic using goquery selectors and data extraction functions!

Goquery Selectors for Scraping HTML

Goquery allows selecting HTML elements using jQuery/CSS style selectors.

For example:

/* ID selector */
#container 

/* Class selector */
.item

/* Attribute selector */
a[href="https://www.example.com"] 

/* Contains text */  
div:contains(Text)

Some commonly used selection methods in goquery are:

Find – Finds all descendant elements matching a selector:

doc.Find(".product")

First – Gets the first matched element

Last – Gets the last matched element

Eq – Selects the element at a given index

Filter – Refines the selection with another selector

Has – Keeps elements that contain a matching descendant

Not – Excludes matching elements

Is – Checks whether any element in the selection matches a selector

Contains – Checks whether a given DOM node is contained in the selection

Using these functions you can select any elements within an HTML document parsed by goquery.
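Several of these methods can also be chained. For instance, a quick sketch (the class names here are hypothetical):

// Find all products, drop the sold-out ones, and take the first match
firstInStock := doc.Find(".product").Not(".sold-out").First()
fmt.Println(firstInStock.Text())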

Extracting Data from HTML using Goquery

Once you have selected required elements, goquery provides various functions to extract data from them:

Text – Extract text content from element:

name := doc.Find("h1").Text()

Html – Get inner HTML (returns the HTML and an error):

html, err := doc.Find(".content").Html()

Attr – Get an attribute value (returns the value and whether the attribute exists):

href, exists := doc.Find("a").Attr("href")

Each – Iterate through selection:

doc.Find("p").Each(func(i int, s *goquery.Selection) {
  // Extract data from <p>
})

Map – Map each element in the selection to a string:

links := doc.Find("a").Map(func(i int, s *goquery.Selection) string {
  href, _ := s.Attr("href")
  return href
})

Slice – Slice selection:

first3 := doc.Find("div").Slice(0, 3)

This covers the basic data extraction functions provided by goquery. Now let's look at some practical examples of using it for web scraping.

Real-World Web Scraping Examples Using Goquery

Let's go through some common scenarios of extracting data from HTML pages using goquery:

Scrape Job Listings From a Job Board

For example, to scrape job listing titles and companies from a site like Monster.com (the CSS classes below are illustrative – inspect the live page for the actual ones):

package main

import (
  "fmt"
  "log"
  "net/http"

  "github.com/PuerkitoBio/goquery"
)

type JobListing struct {
  Title   string
  Company string
}

func main() {

  res, err := http.Get("https://www.monster.com/jobs/search/")
  if err != nil {
    log.Fatal(err)
  }
  defer res.Body.Close()

  doc, err := goquery.NewDocumentFromReader(res.Body)
  if err != nil {
    log.Fatal(err)
  }

  var jobs []JobListing

  // Each listing container holds a title and a company name
  doc.Find(".job-listing").Each(func(i int, s *goquery.Selection) {

    job := JobListing{}
    job.Title = s.Find("h2").Text()
    job.Company = s.Find(".company").Text()

    jobs = append(jobs, job)
  })

  fmt.Println(jobs)
}

This example demonstrates:

  • Defining a struct JobListing to store scraped data
  • Using an illustrative .job-listing CSS selector to find all listing containers
  • Extracting title and company into struct
  • Appending results into jobs slice

Similarly, you can build scrapers for any job site to extract vacancies, skills, salaries etc.

Scrape Product Prices From Ecommerce Sites

Here is how to scrape product names and prices from an ecommerce site like Walmart into a CSV file:

package main

import (
  "encoding/csv"
  "log"
  "net/http"
  "os"

  "github.com/PuerkitoBio/goquery"
)

type Product struct {
  Name  string
  Price string
}

func main() {

  file, err := os.Create("products.csv")
  if err != nil {
    log.Fatal(err)
  }
  defer file.Close()

  writer := csv.NewWriter(file)
  defer writer.Flush()

  headers := []string{"Name", "Price"}
  writer.Write(headers)

  res, err := http.Get("https://www.walmart.com/browse/electronics")
  if err != nil {
    log.Fatal(err)
  }
  defer res.Body.Close()

  doc, err := goquery.NewDocumentFromReader(res.Body)
  if err != nil {
    log.Fatal(err)
  }

  // Class names are illustrative – inspect the live page for the actual ones
  doc.Find(".product-item").Each(func(i int, s *goquery.Selection) {

    product := Product{}
    product.Name = s.Find(".prod-title").Text()
    product.Price = s.Find(".price").Text()

    // csv.Writer.Write takes a slice of strings (one row)
    writer.Write([]string{product.Name, product.Price})
  })
}

This shows:

  • Opening and initializing a CSV file
  • Defining headers
  • Selecting product elements
  • Extracting name, price fields
  • Writing to CSV row

Many webpages these days are heavily JavaScript-rendered, so plain goquery might not see the content at all. In such cases, a headless browser or a scraping API like BrightData is recommended.

Extract Reviews From Sites Like Trustpilot

To scrape customer reviews from a site like Trustpilot:

package main

import (
  "encoding/csv"
  "log"
  "net/http"
  "os"
  "strconv"

  "github.com/PuerkitoBio/goquery"
)

type Review struct {
  Name    string
  Content string
  Rating  int
}

func main() {

  file, err := os.Create("reviews.csv")
  if err != nil {
    log.Fatal(err)
  }
  defer file.Close()

  writer := csv.NewWriter(file)
  defer writer.Flush()

  headers := []string{"Name", "Content", "Rating"}
  writer.Write(headers)

  res, err := http.Get("https://www.trustpilot.com/review/acme.com")
  if err != nil {
    log.Fatal(err)
  }
  defer res.Body.Close()

  doc, err := goquery.NewDocumentFromReader(res.Body)
  if err != nil {
    log.Fatal(err)
  }

  doc.Find(".review-content").Each(func(i int, s *goquery.Selection) {

    review := Review{}

    review.Name = s.Find(".consumer-information__name").Text()
    review.Content = s.Find(".review-content__text").Text()
    rating, _ := s.Find(".star-rating").Attr("data-rating")
    review.Rating, _ = strconv.Atoi(rating)

    writer.Write([]string{review.Name, review.Content, strconv.Itoa(review.Rating)})
  })
}

Here we are:

  • Defining struct Review to store fields
  • Selecting all review containers
  • Extracting name, content, rating
  • Writing to CSV file

This can be extended to scrape reviews from any site like Amazon, BestBuy etc.

Build a Simple Web Crawler with Goquery

Goquery can also be used to build simple web crawlers by recursively following links on pages:

package main

import (
  "fmt"
  "net/http"
  "strings"

  "github.com/PuerkitoBio/goquery"
)

func main() {

  startingUrl := "https://golang.org"

  visited := make(map[string]bool)

  var crawl func(url string)

  crawl = func(url string) {

    if visited[url] {
      return
    }
    visited[url] = true

    fmt.Println("Crawling:", url)

    res, err := http.Get(url)
    if err != nil {
      return
    }
    defer res.Body.Close()

    doc, err := goquery.NewDocumentFromReader(res.Body)
    if err != nil {
      return
    }

    doc.Find("a").Each(func(i int, s *goquery.Selection) {
      link, _ := s.Attr("href")

      // Only follow absolute links; resolving relative URLs is left as an exercise
      if strings.HasPrefix(link, "http") {
        crawl(link)
      }
    })
  }

  crawl(startingUrl)
}

Here:

  • The visited map tracks crawled URLs
  • The crawl() function recursively follows links
  • New links are extracted with doc.Find("a"), and only absolute links are followed

This basic crawl can be extended into a large-scale concurrent crawler using goroutines and channels.
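As a rough sketch of that idea, here is one way to parallelize the crawl with goroutines using a bounded breadth-first traversal (a channel-based URL queue is another common design; error handling, politeness delays, and relative-URL resolution are omitted for brevity):

package main

import (
  "fmt"
  "net/http"
  "strings"
  "sync"

  "github.com/PuerkitoBio/goquery"
)

// fetchLinks downloads a page and returns the absolute links found on it.
func fetchLinks(url string) []string {
  res, err := http.Get(url)
  if err != nil {
    return nil
  }
  defer res.Body.Close()

  doc, err := goquery.NewDocumentFromReader(res.Body)
  if err != nil {
    return nil
  }

  var links []string
  doc.Find("a").Each(func(i int, s *goquery.Selection) {
    if link, ok := s.Attr("href"); ok && strings.HasPrefix(link, "http") {
      links = append(links, link)
    }
  })
  return links
}

func main() {

  const maxDepth = 2 // keeps the sketch bounded

  visited := make(map[string]bool)
  var mu sync.Mutex

  level := []string{"https://golang.org"}

  // Breadth-first crawl: fetch each level's pages concurrently,
  // collect the next level's links, then move one level deeper.
  for depth := 0; depth < maxDepth; depth++ {
    var next []string
    var wg sync.WaitGroup

    for _, url := range level {
      if visited[url] {
        continue
      }
      visited[url] = true

      wg.Add(1)
      go func(url string) {
        defer wg.Done()
        fmt.Println("Crawling:", url)

        links := fetchLinks(url)

        // Guard the shared slice against concurrent appends
        mu.Lock()
        next = append(next, links...)
        mu.Unlock()
      }(url)
    }

    wg.Wait()
    level = next
  }
}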

Scrape Google SERPs

Goquery can also be used to parse and extract data from Google search result pages:

package main

import (
  "fmt"
  "log"
  "net/http"

  "github.com/PuerkitoBio/goquery"
)

type Result struct {
  Title string
  Link  string
}

func main() {

  googleUrl := "https://www.google.com/search?q=goquery"

  res, err := http.Get(googleUrl)
  if err != nil {
    log.Fatal(err)
  }
  defer res.Body.Close()

  doc, err := goquery.NewDocumentFromReader(res.Body)
  if err != nil {
    log.Fatal(err)
  }

  var results []Result

  // div.g is Google's organic result container (subject to change)
  doc.Find("div.g").Each(func(i int, s *goquery.Selection) {

    result := Result{}
    result.Title = s.Find("h3").Text()
    result.Link, _ = s.Find("a").Attr("href")
    results = append(results, result)
  })

  fmt.Println(results)
}

This extracts the title and link from each search result into a Result struct. Note that Google aggressively rate-limits and blocks automated clients, so in practice this needs the anti-blocking measures covered in the next section.

With some additional logic, you can build Google scrapers, SERP rank trackers and more.

These were just a few examples of how goquery can be used for practical web scraping tasks. Let's now look at how to avoid getting blocked while scraping.

Avoiding Blocks and Captchas While Scraping With Goquery

While scraping data from websites, you need to deal with anti-scraping mechanisms like:

  • IP bans
  • CAPTCHAs
  • Blocking scrapers and bots

Goquery itself doesn't provide any methods to bypass these blocks. To avoid getting blocked while scraping, here are some tips:

  • Use proxies – Rotate different IP addresses with each request
  • Custom headers – Mimic browsers by setting User-Agent, Accept and other headers (see the sketch after this list)
  • Limit request rate – Add delays between requests to respect site load
  • Handle CAPTCHAs – Use a CAPTCHA-solving service where feasible
  • Headless browser – Use browser automation tools like Selenium or go-rod to render JavaScript
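For instance, setting browser-like headers and pacing requests takes only a few lines (assuming "net/http" and "time" are imported; the User-Agent string is just an example):

req, _ := http.NewRequest("GET", "https://example.com", nil)

// Mimic a real browser
req.Header.Set("User-Agent", "Mozilla/5.0 (Windows NT 10.0; Win64; x64)")
req.Header.Set("Accept", "text/html,application/xhtml+xml")

res, _ := http.DefaultClient.Do(req)
defer res.Body.Close()

// Pause between requests to respect the site's load
time.Sleep(2 * time.Second)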

But even after applying these precautions, advanced anti-bot systems can identify and block scrapers.

The most reliable way to scrape any site while avoiding tough anti-scraping measures is to use a web scraping API like BrightData.

BrightData handles all the challenges of web scraping for you automatically:

  • Rotating Proxies – 10000+ premium residential IPs from diverse geo locations
  • Browser Simulation – Real Chrome browsers and automatic header rotation
  • Anti-CAPTCHA – Automated CAPTCHA solving
  • Inbuilt Anti-Bot Technology – Avoid blocks from protections like Cloudflare and Imperva
  • JavaScript Rendering – Built-in headless browser to render dynamic sites

Here is one way to use BrightData together with goquery. BrightData's proxy products are typically consumed as a standard HTTP proxy, so the sketch below routes a net/http request through a proxy endpoint (the host, port, and credentials are placeholders – take the real values from your BrightData dashboard):

import (
  "net/http"
  "net/url"

  "github.com/PuerkitoBio/goquery"
)

// Placeholder proxy address – replace with your BrightData zone credentials
proxyURL, _ := url.Parse("http://USERNAME:PASSWORD@proxy.example.brightdata.com:22225")

// Route all requests through the proxy
client := &http.Client{
  Transport: &http.Transport{Proxy: http.ProxyURL(proxyURL)},
}

// Make the request via BrightData
res, _ := client.Get("https://targetpage.com")
defer res.Body.Close()

// Pass the HTML to goquery as usual
doc, _ := goquery.NewDocumentFromReader(res.Body)

// Extract data with doc.Find(...) as shown earlier

Integrating Goquery with Other Go Libraries

Goquery provides a simple API for parsing HTML and extracting data. To build more advanced scrapers, it can be combined with other useful Go libraries:

Colly – Web Scraping Framework

Colly provides a higher level scraping framework for tasks like:

  • Managing multiple concurrent requests
  • Controlling rate limits
  • Automatic handling of cookies, sessions etc
  • Intercepting and manipulating requests
  • Managing queues for large scale scraping

It integrates naturally with goquery – colly parses each response with goquery under the hood and exposes the matched element as a goquery Selection via the DOM field:

import (
  "fmt"

  "github.com/gocolly/colly"
)

func main() {

  c := colly.NewCollector()

  c.OnHTML("html", func(h *colly.HTMLElement) {

    // h.DOM is a *goquery.Selection wrapping the matched element,
    // so all goquery methods work on it directly
    title := h.DOM.Find("title").Text()
    fmt.Println(title)
  })

  c.Visit("https://example.com")
}

This way you can build complex scrapers that leverage the capabilities of both libraries.

Gokogiri – XML/HTML Parser

Gokogiri provides an XML/HTML parser built on libxml2. It can be used as an alternative to goquery for parsing.

Key differences are:

  • goquery is a port of jQuery while gokogiri mirrors the Ruby library Nokogiri
  • gokogiri provides XPath support in addition to CSS selectors

An example parsing HTML using gokogiri:

import (
  "github.com/moovweb/gokogiri"
)  

doc, _ := gokogiri.ParseHtml(html)

// CSS selector
doc.Css("div.content")

// XPath expression  
doc.Search("//div[@class='content']")

So if you need XPath selectors, gokogiri is an option along with goquery.

Go-rod/rod – Headless Browser

go-rod is a headless browser automation library that drives Chromium-based browsers through the DevTools protocol.

It can render JavaScript heavy sites where simple HTTP requests might not work.

goquery can then extract data from the rendered HTML:

import (
  "fmt"
  "strings"

  "github.com/go-rod/rod"

  "github.com/PuerkitoBio/goquery"
)

func main() {

  // Launch a headless browser and load the page
  browser := rod.New().MustConnect()
  defer browser.MustClose()

  page := browser.MustPage("https://example.com")

  // Get the fully rendered HTML after JavaScript execution
  html := page.MustHTML()

  doc, _ := goquery.NewDocumentFromReader(strings.NewReader(html))

  // Extract data using goquery from the rendered HTML
  fmt.Println(doc.Find("title").Text())
}

So goquery and go-rod together provide complete scraping capabilities.

Goquery vs Similar Golang Libraries

Some other notable HTML/XML parsing libraries in Go and how they compare with goquery:

net/html – Golang's built-in HTML parser. Provides DOM traversal APIs. Goquery is built on top of it and adds jQuery-like functionality.

gokogiri – XML/HTML parser modeled after the Nokogiri Ruby library. Supports XPath along with CSS selector APIs.

Colly – High level web scraping framework. Can be used with goquery for selection and extraction.

go-rod – Headless browser automation for JS rendering. Useful with goquery for dynamic sites.

gjson – JSON parser that extracts values using a simple path syntax. It has no HTML/DOM selection capabilities.

So in summary:

  • Use goquery when you need a simple and familiar CSS selector based API for scraping HTML documents.
  • Choose Colly for a higher level scraping framework to handle requests, throttling, queues etc.
  • Pick gokogiri if you specifically need XPath selector support.
  • Use go-rod when you need to render JavaScript for scraping interactive sites.
  • Consider BrightData as a complete web data extraction API to avoid any anti-scraping blocks.

Goquery fills a useful niche as a Golang equivalent of popular libraries like BeautifulSoup and jQuery.

Conclusion

To summarize, here are the key things we learned about using goquery for web scraping:

  • Goquery provides simple jQuery-like syntax for parsing HTML and extracting data in Golang
  • It uses CSS selectors to match elements and offers traversal/manipulation methods similar to jQuery
  • Goquery can be used to easily scrape data from static HTML pages as well as build crawlers
  • For robust web-scale scraping, it can be combined with tools like Colly and go-rod
  • To avoid anti-scraping blocks, BrightData is the most reliable solution
  • Goquery is great for quickly writing scrapers but reaches its limits for complex sites

Overall, goquery is an indispensable tool to have in your Golang web scraping toolbox.
