How to Do Web Scraping in Golang

Web scraping in Golang is a popular approach to automatically retrieve data from the web. This step-by-step tutorial will teach you how to easily scrape data in Go using popular libraries like Colly and chromedp.

Prerequisites

Here are the prerequisites you need to follow this tutorial:

  • Go 1.19+: Any Go version greater than or equal to 1.19 will work. You'll see Go 1.19 used here as it's the latest at the time of writing.
  • A Go IDE: Visual Studio Code with the Go extension is recommended.

Before you start scraping, make sure you have Go and a Go IDE installed and configured on your machine.
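
You can verify your Go installation from the terminal with:

go version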

Set Up a Go Project

After installing Go, initialize your Golang web scraper project:

mkdir web-scraper-go
cd web-scraper-go

Then initialize a Go module called web-scraper:

go mod init web-scraper

This will create a go.mod file with:

module web-scraper

go 1.19

You're now ready to write your web scraping script. Create a scraper.go file and initialize it:

package main

import (
  "fmt"
)

func main() {

  fmt.Println("Let's scrape!")
  
}

This sets up the entry point main() function where your scraping logic will go.

Run it with go run scraper.go to verify it works.

Scrape a Website with Colly

To learn scraping, we'll use ScrapeMe as our target: an example e-commerce site that sells Pokémon products.

Our mission is to extract all the product data from this site.

Getting Started with Colly

Colly is a popular Golang scraping library. It makes it easy to scrape web pages with a clean API.

Install Colly:

go get github.com/gocolly/colly

Then import it:

import (
  "fmt"

  "github.com/gocolly/colly"
)

The main Colly entity is a Collector which performs HTTP requests. Initialize one:

c := colly.NewCollector()

Use it to visit and download a page:

c.Visit("https://scrapeme.live/shop/")

Attach callback functions to handle events:

c.OnRequest(func(r *colly.Request) {
  fmt.Println("Visiting", r.URL) 
})

c.OnResponse(func(r *colly.Response) {
  fmt.Println("Visited", r.URL)  
})

These callbacks execute on request and response events.
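
Colly also exposes other lifecycle callbacks, such as OnError, OnHTML, and OnScraped. For example, it's usually worth registering an error callback so failed requests don't go unnoticed; a minimal sketch:

c.OnError(func(r *colly.Response, err error) {
  // log failed requests instead of silently ignoring them
  fmt.Println("Request to", r.Request.URL, "failed:", err)
})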

Now let's scrape some data!

Visit the Target Page

First, visit the target page:

c.Visit("https://scrapeme.live/shop/")

This will fire the OnRequest callback we defined earlier. Make sure Visit() is called after registering your callbacks; callbacks attached after the call won't run for that request.

Find HTML Elements to Scrape

We want to extract all product data. Right-click any product and inspect it with your browser's DevTools.

You'll see that the key data lives inside these elements:

  • a – Product URL
  • img – Image
  • h2 – Name
  • .price – Price

All these elements live under a li.product container.

Use this CSS selector in Colly to find all products:

c.OnHTML("li.product", func(e *colly.HTMLElement) {

})

This callback will now execute for every li.product found.

Extract Data

First define a struct to store scraped data:

type Product struct {
  url, img, name, price string 
}

Then initialize a slice to collect data:

var products []Product

Now extract data inside the callback:

c.OnHTML("li.product", func(e *colly.HTMLElement) {

  p := Product{}
  
  p.url = e.ChildAttr("a", "href")
  p.img = e.ChildAttr("img", "src") 
  p.name = e.ChildText("h2")
  p.price = e.ChildText(".price")

  products = append(products, p)

})

We use ChildAttr() and ChildText() to extract attribute values and text from child elements.

The extracted Product struct is appended to the results slice with append().

And we're done scraping!

Export Data to CSV

Let's export the scraped data to CSV. This uses the encoding/csv, log, and os packages from the standard library, so add them to your import statement:

file, err := os.Create("products.csv")
if err != nil {
  log.Fatalln("Failed to create the output CSV file", err)
}
defer file.Close()

w := csv.NewWriter(file)

headers := []string{"url", "img", "name", "price"} 

w.Write(headers)

for _, p := range products {
  
  record := []string{
    p.url, 
    p.img,
    p.name,
    p.price,
  }

  w.Write(record)
}

w.Flush()

We create a file, initialize a CSV writer, write the header row, convert each Product into a CSV record, and write it to the file. The final Flush() call ensures all buffered data actually reaches the file.

Run your scraper and you'll get a products.csv file with all the extracted data!
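
If you've been following along snippet by snippet, your scraper.go should look roughly like this (a sketch that combines the code above; adapt it to your own file):

package main

import (
  "encoding/csv"
  "log"
  "os"

  "github.com/gocolly/colly"
)

type Product struct {
  url, img, name, price string
}

func main() {

  var products []Product

  c := colly.NewCollector()

  c.OnHTML("li.product", func(e *colly.HTMLElement) {

    p := Product{}

    p.url = e.ChildAttr("a", "href")
    p.img = e.ChildAttr("img", "src")
    p.name = e.ChildText("h2")
    p.price = e.ChildText(".price")

    products = append(products, p)

  })

  c.Visit("https://scrapeme.live/shop/")

  file, err := os.Create("products.csv")
  if err != nil {
    log.Fatalln("Failed to create the output CSV file", err)
  }
  defer file.Close()

  w := csv.NewWriter(file)
  defer w.Flush()

  // header row
  w.Write([]string{"url", "img", "name", "price"})

  // one CSV record per scraped product
  for _, p := range products {
    w.Write([]string{p.url, p.img, p.name, p.price})
  }
}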

This covers the basics of web scraping in Go with Colly. Next let's look at some more advanced techniques.

Advanced Web Scraping in Golang

Web Crawling

The ScrapeMe product listings are paginated across multiple URLs:

https://scrapeme.live/shop/
https://scrapeme.live/shop/page/2/
https://scrapeme.live/shop/page/3/

If you inspect the pagination element, you'll see that the pagination links all match the a.page-numbers CSS selector.

Here is some crawling logic:

var pages []string
visitedPages := make(map[string]bool)

firstPage := "https://scrapeme.live/shop/"
visitedPages[firstPage] = true

limit := 10
pagesVisited := 1

c.OnHTML("a.page-numbers", func(e *colly.HTMLElement) {

  page := e.Attr("href")

  // queue pagination links we haven't discovered yet
  if !visitedPages[page] {
    visitedPages[page] = true
    pages = append(pages, page)
  }

})

// OnScraped runs after the OnHTML callbacks, so newly found links are already queued
c.OnScraped(func(r *colly.Response) {

  if len(pages) > 0 && pagesVisited < limit {

    nextPage := pages[0]
    pages = pages[1:]
    pagesVisited++

    c.Visit(nextPage)
  }

})

c.Visit(firstPage)

We start from the first page and use the pagination link callback to discover new pages, adding each one to the queue and marking it in a visited set so it's only queued once.

The OnScraped callback runs after the page has been parsed, so newly discovered links are already in the queue; it then visits the next queued page until the page limit is reached.

This logic will crawl across all pages scraping data from each one!

Avoid Getting Blocked

Websites try to block scrapers by checking the User-Agent header:

c.UserAgent = "Mozilla/5.0 ..."

This sets a real browser User-Agent string, replacing Colly's default one, which identifies requests as coming from the colly library and is easy for sites to flag.
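
You can also set additional headers on every request from the OnRequest callback to make traffic look more like a real browser. A minimal sketch (the header values here are just illustrative):

c.OnRequest(func(r *colly.Request) {
  // extra browser-like headers; tweak the values to match your needs
  r.Headers.Set("Accept-Language", "en-US,en;q=0.9")
  r.Headers.Set("Referer", "https://www.google.com/")
})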

But there are many other anti-scraping systems you'll have to deal with. The best way is to use a web scraping API like BrightData which handles all anti-bot systems for you with just an API request.

Parallel Crawling

To speed up scraping we can visit pages in parallel:

c := colly.NewCollector(
  colly.Async(true),   
)

c.Limit(&colly.LimitRule{
  // the rule needs a domain pattern to know which requests it applies to
  DomainGlob:  "*",
  Parallelism: 2,
})

c.OnHTML(...) 

for _, page := range pages {
  c.Visit(page) 
} 

c.Wait()

We enable async mode so requests run in parallel, with the LimitRule capping concurrency at 2 simultaneous requests.

In async mode, Visit() returns immediately instead of blocking, so the visits are fired without waiting for them to complete. The final c.Wait() call blocks until all pending requests have finished.

This concurrent crawling extracts data much faster!
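
Keep in mind that with parallelism enabled, the OnHTML callbacks may run in separate goroutines at the same time, so shared state like the products slice must be synchronized. A minimal sketch, reusing the Product struct from earlier and assuming "sync" is in your imports:

var mu sync.Mutex
var products []Product

c.OnHTML("li.product", func(e *colly.HTMLElement) {
  p := Product{
    url:   e.ChildAttr("a", "href"),
    img:   e.ChildAttr("img", "src"),
    name:  e.ChildText("h2"),
    price: e.ChildText(".price"),
  }

  // guard the shared slice against concurrent appends
  mu.Lock()
  products = append(products, p)
  mu.Unlock()
})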

Again for best results it's better to use a web scraping API which handles these complexities automatically.

Scraping JavaScript Pages

So far we've scraped simple static HTML pages. Many modern sites rely on JavaScript to render content.

To scrape these pages, we need a headless browser tool like chromedp, a Go library that drives Chrome through the DevTools Protocol behind the scenes.

Let's scrape ScrapeMe with chromedp. Install it with go get github.com/chromedp/chromedp, then import it along with the other packages the snippet uses:

import (
  "context"
  "fmt"
  "log"

  "github.com/chromedp/cdproto/cdp"
  "github.com/chromedp/chromedp"
)

func main() {

  ctx, cancel := chromedp.NewContext(context.Background())
  defer cancel() 

  var nodes []*cdp.Node
  err := chromedp.Run(ctx,
    chromedp.Navigate("https://scrapeme.live/shop"),
    chromedp.WaitVisible(".product", chromedp.ByQueryAll),
    chromedp.Nodes(".product", &nodes, chromedp.ByQueryAll),
  )
  
  if err != nil {
    log.Fatal(err)
  }  
  
  for _, node := range nodes {

    url := scrapeAttributeValue(ctx, node, "a", "href")
    name := scrapeText(ctx, node, "h2")
    fmt.Println(url, name)
    // ...
  }

} 

func scrapeText(ctx context.Context, node *cdp.Node, selector string) string {
  
  var text string 
  err := chromedp.Run(ctx,
    chromedp.Text(selector, &text, chromedp.NodeVisible, chromedp.FromNode(node)),    
  )

  if err != nil {
    log.Fatal(err)
  }
  
  return text
}
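
The loop in main() also calls a scrapeAttributeValue() helper that isn't shown above. A minimal sketch of what it could look like, mirroring scrapeText() and built on chromedp.AttributeValue():

// scrapeAttributeValue reads an attribute from the first element matching
// selector inside node; error handling is kept simple for the example
func scrapeAttributeValue(ctx context.Context, node *cdp.Node, selector, attribute string) string {

  var value string
  var ok bool
  err := chromedp.Run(ctx,
    chromedp.AttributeValue(selector, attribute, &value, &ok, chromedp.ByQuery, chromedp.FromNode(node)),
  )

  if err != nil {
    log.Fatal(err)
  }

  return value
}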

Key differences vs Colly:

  • We navigate to the target page
  • Wait for elements to be visible
  • Extract nodes with JavaScript enabled
  • Query data from each node specifically

chromedp allows us to scrape dynamic JavaScript-rendered websites just like a real browser!

Conclusion

And there you have it! In this complete Golang web scraping tutorial you learned:

  • Scraping basics with Colly
  • Crawling paginated listings
  • Scraping JavaScript pages with a headless browser
  • Exporting scraped data to CSV

Web scraping can get complex with anti-bot systems and JavaScript rendering. The easiest way to scrape any site is to use a web scraping API service like BrightData.

BrightData handles all proxies, browsers, CAPTCHAs and other anti-scraping systems automatically so you can scrape any site easily.

It also offers a blazing fast scraping API and handles JavaScript rendering under the hood. I'd highly recommend BrightData over building and maintaining your own scrapers.
