# 🌍🤏 WebGrab

WebGrab is a simple Go library that makes it easy to scrape web pages. It is built on top of the GoQuery library.
## Installation

```sh
go get github.com/aljoni/webgrab
```
## Usage

```go
package main

import (
	"fmt"

	"github.com/aljoni/webgrab"
)

type Page struct {
	Title    string `grab:"title"`
	Body     string `grab:"body"`
	Keywords string `grab:"meta[name=keywords]" attr:"content"`
}

func main() {
	page := Page{}

	grabber := webgrab.New()
	grabber.Timeout = 30
	grabber.MaxRedirects = 10
	grabber.Grab("http://example.com", &page)

	fmt.Println(page.Title)
	fmt.Println(page.Body)
	fmt.Println(page.Keywords)
}
```
## Tag Syntax

The defined tags are:

- `grab:"selector"` - The GoQuery selector used to grab the value.
- `attr:"attribute"` - The attribute of the selected element to grab.
- `extract:"regexp"` - A regular expression used to extract a value from the grabbed string.
- `filter:"regexp"` - A regular expression a value must match to be kept.
- `context:"selector"` - Restricts the scraping context for a nested struct or slice of structs.

The selector is a GoQuery selector. The attribute is optional; if no attribute is
specified, the text of the selected element is grabbed.
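As a quick sketch of how these tags compose (the field names, selectors, and patterns below are illustrative, not part of the library):

```go
type Article struct {
	// No attr tag: the text of the first matching element is grabbed.
	Headline string `grab:"h1"`

	// attr grabs the named attribute instead of the element text.
	Canonical string `grab:"link[rel=canonical]" attr:"href"`

	// extract pulls a value out of the grabbed text with a capture group.
	Year string `grab:".published" extract:"([0-9]{4})"`

	// filter keeps only values matching the pattern, here hrefs ending in .pdf.
	PDFs []string `grab:"a[href]" attr:"href" filter:".*\\.pdf$"`
}
```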
## Arrays

If the field is a slice, all matching elements will be grabbed. For example, to
grab all links from a page:

```go
type Page struct {
	Links []string `grab:"a[href]" attr:"href"`
}
```
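A minimal usage sketch for this struct (the URL is just a placeholder):

```go
package main

import (
	"fmt"

	"github.com/aljoni/webgrab"
)

type Page struct {
	Links []string `grab:"a[href]" attr:"href"`
}

func main() {
	page := Page{}
	grabber := webgrab.New()
	grabber.Grab("http://example.com", &page)

	// Every matching href ends up as one element of the slice.
	for _, link := range page.Links {
		fmt.Println(link)
	}
}
```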
## Nested Structs and Context Windows

You can use nested structs to grab values from a specific section of the page.
With the `context` tag, you can restrict the scraping context for a struct or a
slice of structs.
### Single Nested Struct with Context

```go
type Profile struct {
	Name  string `grab:".name"`
	Email string `grab:".email"`
}

type Page struct {
	Profile Profile `context:".profile-section"`
}
```

This will only search for `.name` and `.email` inside the first `.profile-section` element.
### Slice of Structs with Context

```go
type Item struct {
	Title string `grab:".title"`
	Link  string `grab:"a" attr:"href"`
}

type Page struct {
	Items []Item `context:".item"`
}
```

This will find all elements matching `.item` and, for each one, scrape the `.title` and the first `<a>`'s `href` inside that `.item`.
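A usage sketch along the same lines (the URL is a placeholder), iterating over the scraped items:

```go
package main

import (
	"fmt"

	"github.com/aljoni/webgrab"
)

type Item struct {
	Title string `grab:".title"`
	Link  string `grab:"a" attr:"href"`
}

type Page struct {
	Items []Item `context:".item"`
}

func main() {
	page := Page{}
	grabber := webgrab.New()
	grabber.Grab("http://example.com/list", &page)

	// One Item is produced per element matching .item.
	for _, item := range page.Items {
		fmt.Printf("%s -> %s\n", item.Title, item.Link)
	}
}
```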
## Extract

The `extract` tag can be used to extract a value from a string using a regular
expression. For example, to extract the title from a Wikipedia page:

```go
type Page struct {
	Title string `grab:"title" extract:"(.+) - Wikipedia"`
}
```
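This README does not spell out which part of the match is kept; assuming the first capture group is used, the pattern above behaves like this plain `regexp` sketch:

```go
package main

import (
	"fmt"
	"regexp"
)

func main() {
	// The same pattern as in the extract tag above.
	re := regexp.MustCompile(`(.+) - Wikipedia`)

	title := "Go (programming language) - Wikipedia"
	if m := re.FindStringSubmatch(title); m != nil {
		// m[1] is the first capture group: the title without the " - Wikipedia" suffix.
		fmt.Println(m[1]) // Go (programming language)
	}
}
```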
## Filter

The `filter` tag can be used to filter the value of a field: only values that match
the regular expression are kept. For example, to get all links that end with
`.html`:

```go
type Page struct {
	Links []string `grab:"a[href]" attr:"href" filter:".*\\.html$"`
}
```
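As a standalone sketch of what an `.html`-suffix pattern accepts (this uses Go's regexp package directly and is independent of the library):

```go
package main

import (
	"fmt"
	"regexp"
)

func main() {
	// Matches any string that ends in .html.
	re := regexp.MustCompile(`.*\.html$`)

	for _, href := range []string{"/index.html", "/style.css", "/about.html"} {
		// Only hrefs ending in .html would pass the filter.
		fmt.Println(href, re.MatchString(href))
	}
}
```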