scrape

package
v0.4.0
Published: Sep 20, 2024 License: MIT Imports: 7 Imported by: 0

Documentation

Overview

Package scrape implements scraping functionality for extracting useful data from HTML text using jQuery-like selectors. It provides the Scraper struct, whose Scraper.Scrape method is built on the github.com/PuerkitoBio/goquery library.

Here is a simple example that scrapes a Product struct from an HTML document.

 htmlData := `<div class="product">
 	<img src="https://via.placeholder.com/200" alt="Product 1">
 	<h2>Product 1</h2>
 	<p>Great product for your needs.</p>
 	<p class="price">$29.99</p>
 </div>`
 r := bytes.NewBufferString(htmlData)
 doc, err := goquery.NewDocumentFromReader(r)
 if err != nil {
 	log.Fatal(err)
 }

 scraper := scrape.Scraper{}

 // Product describes where each field lives (select) and how to read it (extract).
 type Product struct {
 	Name        string `select:"h2" extract:"text"`
 	Description string `select:"p" extract:"text"`
 	Price       string `select:".price" extract:"text"`
 	Image       string `select:"img" extract:"@src"`
 }

 // scraping
 var product Product
 err = scraper.Scrape(doc, &product, ".product", "")

 // get output
 fmt.Println("Got Error:", err)
 fmt.Println("Got Output:")
 fmt.Println(product)

Constants

const (
	TextExtractTag     = "text"     // get the text of the node's child text nodes
	DeepTextExtractTag = "deeptext" // get the text of all descendant text nodes
	AttrExtractTag     = "@"        // get the value of an attribute ("@href", "@src")
)

Extractor tags to specify extract operations.

const (
	SelectorTag  = "select"  // jQuery-like selector to find the node
	ExtractorTag = "extract" // extract operation to get useful data from the node
)

The tags that let you specify where the valuable data is and how to get it from the html.Node.
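As a rough illustration of how the select and extract struct tags can be read, the sketch below uses only the standard reflect package. The fieldTags helper is hypothetical and is not part of this package; how Scraper.Scrape actually reads the tags is not shown in this documentation.

```go
package main

import (
	"fmt"
	"reflect"
)

// Product mirrors two fields of the struct from the overview example.
type Product struct {
	Name  string `select:"h2" extract:"text"`
	Image string `select:"img" extract:"@src"`
}

// fieldTags is a hypothetical helper (not part of the scrape package).
// It returns the values of the "select" and "extract" tags of a field.
func fieldTags(v any, field string) (selector, extract string, ok bool) {
	t := reflect.TypeOf(v)
	if t == nil {
		return "", "", false
	}
	if t.Kind() == reflect.Pointer {
		t = t.Elem()
	}
	if t.Kind() != reflect.Struct {
		return "", "", false
	}
	f, found := t.FieldByName(field)
	if !found {
		return "", "", false
	}
	return f.Tag.Get("select"), f.Tag.Get("extract"), true
}

func main() {
	sel, ext, _ := fieldTags(Product{}, "Image")
	fmt.Println(sel, ext) // prints "img @src"
}
```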

Functions

func ExtractAttribute added in v0.2.0

func ExtractAttribute(node *html.Node, attr string) (string, error)

ExtractAttribute returns the value of the given attribute. If the attribute is absent it returns an error.

func ExtractDeepText added in v0.2.0

func ExtractDeepText(node *html.Node) string

ExtractDeepText returns the text of all descendants' text nodes.

func ExtractText added in v0.2.0

func ExtractText(node *html.Node) string

ExtractText returns the text of all children's text nodes.
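The difference between child text and descendant text can be sketched with a toy tree. The node type and the childText and deepText functions below are hypothetical stand-ins; the real functions operate on *html.Node and may treat whitespace differently.

```go
package main

import (
	"fmt"
	"strings"
)

// node is a toy stand-in for an HTML node (an assumption of this sketch).
// A text node has IsText set; an element node has Children.
type node struct {
	IsText   bool
	Text     string
	Children []*node
}

// childText joins the text of direct child text nodes only.
func childText(n *node) string {
	var b strings.Builder
	for _, c := range n.Children {
		if c.IsText {
			b.WriteString(c.Text)
		}
	}
	return b.String()
}

// deepText joins the text of every text node below n, at any depth.
func deepText(n *node) string {
	if n.IsText {
		return n.Text
	}
	var b strings.Builder
	for _, c := range n.Children {
		b.WriteString(deepText(c))
	}
	return b.String()
}

// sample models <p>Price: <b>$29.99</b></p>.
func sample() *node {
	bold := &node{Children: []*node{{IsText: true, Text: "$29.99"}}}
	return &node{Children: []*node{{IsText: true, Text: "Price: "}, bold}}
}

func main() {
	p := sample()
	fmt.Printf("%q\n", childText(p)) // prints "Price: "
	fmt.Printf("%q\n", deepText(p))  // prints "Price: $29.99"
}
```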

func GetExtractorMap added in v0.2.0

func GetExtractorMap() map[*Match]Extractor

GetExtractorMap returns the default map to match extracting tags and extracting functions (or extractors).

func ValidateNotNil added in v0.4.0

func ValidateNotNil(o any, varName string) error

ValidateNotNil checks whether the given variable o is nil or is a pointer that points to nil, and returns [NilErr] if so. varName is a string that specifies the name of the variable that is nil.
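A check of this kind needs more than o == nil, because a typed nil pointer stored in an interface value does not compare equal to nil. The sketch below shows one way to do it with reflect. The names validateNotNil and nilErr are local stand-ins, the error message is invented, and the package's own implementation may differ.

```go
package main

import (
	"fmt"
	"reflect"
)

// nilErr is a local stand-in for the documented NilErr type.
type nilErr struct {
	Var string
}

// The message format is an assumption of this sketch.
func (e nilErr) Error() string { return e.Var + " is nil" }

// validateNotNil reports an error when o is an untyped nil
// or a nil pointer wrapped in an interface value.
func validateNotNil(o any, varName string) error {
	if o == nil {
		return nilErr{Var: varName}
	}
	v := reflect.ValueOf(o)
	if v.Kind() == reflect.Pointer && v.IsNil() {
		return nilErr{Var: varName}
	}
	return nil
}

func main() {
	var p *string // typed nil pointer
	s := "value"
	fmt.Println(validateNotNil(p, "p"))  // prints "p is nil"
	fmt.Println(validateNotNil(&s, "s")) // prints "<nil>"
}
```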

Types

type AttributeNotFoundErr added in v0.4.0

type AttributeNotFoundErr struct {
	Attr string
}

func (AttributeNotFoundErr) Error added in v0.4.0

func (e AttributeNotFoundErr) Error() string

type ExtractTagErr added in v0.4.0

type ExtractTagErr struct {
	ExtractTag string
}

func (ExtractTagErr) Error added in v0.4.0

func (e ExtractTagErr) Error() string

type Extractor

type Extractor func(node *html.Node, extract string) (string, error)

Extractor is a function that processes the given node and returns the valuable data in string format.

type KindErr added in v0.4.0

type KindErr struct {
	Var     any
	KindExp any
	KindAct any
}

func (KindErr) Error added in v0.4.0

func (e KindErr) Error() string

type Match added in v0.2.0

type Match func(extract string) (string, bool)

Match wraps the boolean logic that matches extract tag values to extracting functions (extractors). Match returns the already processed value of the extract tag (for example, "@href" -> "href").

func GetEqualMatch added in v0.2.0

func GetEqualMatch(expected string) Match

GetEqualMatch creates a Match function that compares the given value with the value of the extracting tag.

func GetPrefixMatch added in v0.2.0

func GetPrefixMatch(prefix string) Match

GetPrefixMatch creates a Match function that checks whether the extract tag value has the given prefix and returns a boolean result together with the extract tag value. If the prefix matches, it cuts the prefix from the returned value (for example, "@href" -> "href").
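The two matcher constructors can be pictured roughly as follows. This is a sketch built from the descriptions above, not the package source: equalMatch and prefixMatch are hypothetical names, Match is a local copy of the documented signature, and returning the tag value unchanged from the equality matcher is an assumption.

```go
package main

import (
	"fmt"
	"strings"
)

// Match is a local copy of the documented function signature.
type Match func(extract string) (string, bool)

// equalMatch matches when the tag value equals expected.
// Returning the value unchanged is an assumption of this sketch.
func equalMatch(expected string) Match {
	return func(extract string) (string, bool) {
		return extract, extract == expected
	}
}

// prefixMatch matches when the tag value starts with prefix
// and strips the prefix from the returned value on a match.
func prefixMatch(prefix string) Match {
	return func(extract string) (string, bool) {
		if strings.HasPrefix(extract, prefix) {
			return strings.TrimPrefix(extract, prefix), true
		}
		return extract, false
	}
}

func main() {
	attr := prefixMatch("@")
	fmt.Println(attr("@href")) // prints "href true"
	fmt.Println(attr("text"))  // prints "text false"

	text := equalMatch("text")
	fmt.Println(text("text")) // prints "text true"
}
```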

type Mode added in v0.4.0

type Mode uint
const (
	Strict Mode = iota
	Tolerant
	Silent
)

type NilErr added in v0.4.0

type NilErr struct {
	Var string
}

func (NilErr) Error added in v0.4.0

func (e NilErr) Error() string

type NoNodesFoundErr added in v0.4.0

type NoNodesFoundErr struct{}

func (NoNodesFoundErr) Error added in v0.4.0

func (e NoNodesFoundErr) Error() string

type ScrapeErr added in v0.4.0

type ScrapeErr struct {
	Cause error
}

func (ScrapeErr) Error added in v0.4.0

func (e ScrapeErr) Error() string

type Scraper

type Scraper struct {

	// Mode can take three states: [Strict], [Tolerant], and [Silent].
	// - [Strict] mode assumes that any error caused during scraping is fatal
	// and stops the following scraping.
	// - [Tolerant] mode assumes that scraping should not be prevented by
	// errors where possible and all errors are returned.
	// - [Silent] mode assumes that scraping should not be stopped by errors
	// where possible and these errors are not returned.
	Mode Mode

	// Extractors is a map that matches custom user extractors to extract tags.
	// Do not use reserved extractor tag names and patterns ([TextExtractTag],
	// [AttrExtractTag], and others), otherwise, the default implementation is executed.
	Extractors map[*Match]Extractor
}

Scraper is a struct that contains a method to scrape data from an HTML document (goquery.Document).

func (Scraper) Scrape

func (scraper Scraper) Scrape(doc *goquery.Document, o any, selector string, extract string) error

Scrape scrapes the given doc and writes the useful information into o.

o must be a pointer to a string, slice, or struct; otherwise Scrape returns an error. Slices and structs can both contain pointers, strings, slices, and structs, but the end value must be a string.

selector is a jQuery-like selector that specifies a path to nodes (it is passed to goquery.Selection.Find). If selector is empty, the document selection (goquery.Document.Selection) is used by default.

extract is a value that specifies how to get useful data from the node. extract is required only if o is a pointer to a string or slice; in all other cases you can leave it empty.

type ScrapingErr added in v0.4.0

type ScrapingErr struct {
	Selector string
	Cause    error
}

func (ScrapingErr) Error added in v0.4.0

func (e ScrapingErr) Error() string
