Documentation ¶
Overview ¶
Package scrape extracts useful data from HTML text using jQuery-like selectors. It provides the Scraper struct, whose Scraper.Scrape method is built on the github.com/PuerkitoBio/goquery library.
Here is a simple example that scrapes a Product struct from an HTML document.
htmlData := `<div class="product">
	<img src="https://via.placeholder.com/200" alt="Product 1">
	<h2>Product 1</h2>
	<p>Great product for your needs.</p>
	<p class="price">$29.99</p>
</div>`
r := bytes.NewBufferString(htmlData)
doc, _ := goquery.NewDocumentFromReader(r)
scraper := scrape.Scraper{}

// scraping
type Product struct {
	Name        string `select:"h2" extract:"text"`
	Description string `select:"p" extract:"text"`
	Price       string `select:".price" extract:"text"`
	Image       string `select:"img" extract:"@src"`
}
var product Product
err := scraper.Scrape(doc, &product, ".product", "")

// get output
fmt.Println("Got Error:", err)
fmt.Println("Got Output:")
fmt.Println(product)
Index ¶
- Constants
- func ExtractAttribute(node *html.Node, attr string) (string, error)
- func ExtractDeepText(node *html.Node) string
- func ExtractText(node *html.Node) string
- func GetExtractorMap() map[*Match]Extractor
- func ValidateNotNil(o any, varName string) error
- type AttributeNotFoundErr
- type ExtractTagErr
- type Extractor
- type KindErr
- type Match
- type Mode
- type NilErr
- type NoNodesFoundErr
- type ScrapeErr
- type Scraper
- type ScrapingErr
Constants ¶
const (
	TextExtractTag     = "text"     // get a text of children's text nodes
	DeepTextExtractTag = "deeptext" // get a text of descendants' text nodes
	AttrExtractTag     = "@"        // get a value of an attribute ("@href", "@src")
)
Extractor tags to specify extract operations.
const (
	SelectorTag  = "select"  // jQuery-like selector to find the node
	ExtractorTag = "extract" // extract operation to get useful data from the node
)
Struct tags that let you specify where the valuable data is and how to get it from the html.Node.
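As an illustration of how the select and extract struct tags can be read, here is a minimal sketch that uses only the standard reflect package. The product type and the fieldTags helper are hypothetical names introduced for this example; this is not necessarily how the package reads tags internally.

```go
package main

import (
	"fmt"
	"reflect"
)

// Tag keys, matching the values of SelectorTag and ExtractorTag above.
const (
	selectorTag  = "select"
	extractorTag = "extract"
)

// product is a hypothetical struct that mirrors the overview example.
type product struct {
	Name  string `select:"h2" extract:"text"`
	Image string `select:"img" extract:"@src"`
}

// fieldTags returns the select and extract tag values of the named field.
// v may be a struct or a pointer to a struct. This helper is an assumption
// made for illustration; it is not part of the package.
func fieldTags(v any, field string) (selector, extract string, ok bool) {
	t := reflect.TypeOf(v)
	if t == nil {
		return "", "", false
	}
	if t.Kind() == reflect.Pointer {
		t = t.Elem()
	}
	if t.Kind() != reflect.Struct {
		return "", "", false
	}
	f, found := t.FieldByName(field)
	if !found {
		return "", "", false
	}
	return f.Tag.Get(selectorTag), f.Tag.Get(extractorTag), true
}

func main() {
	sel, ext, _ := fieldTags(product{}, "Image")
	fmt.Println(sel, ext) // img @src
}
```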
Variables ¶
This section is empty.
Functions ¶
func ExtractAttribute ¶ added in v0.2.0
ExtractAttribute returns the value of the given attribute. If the attribute is absent it returns an error.
func ExtractDeepText ¶ added in v0.2.0
ExtractDeepText returns the text of all descendants' text nodes.
func ExtractText ¶ added in v0.2.0
ExtractText returns the text of all children's text nodes.
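The difference between children's text and descendants' text can be sketched with a simplified tree. The node type below is a hypothetical stand-in for html.Node, and childText and deepText are illustrative names; the package's real functions operate on *html.Node and may treat whitespace differently.

```go
package main

import (
	"fmt"
	"strings"
)

// node is a hypothetical, simplified stand-in for html.Node.
// A text node has non-empty text; an element node has children.
type node struct {
	text     string
	children []*node
}

// childText concatenates only the text nodes that are direct children of n.
func childText(n *node) string {
	var sb strings.Builder
	for _, c := range n.children {
		sb.WriteString(c.text)
	}
	return sb.String()
}

// deepText concatenates the text nodes of all descendants of n,
// visiting them in document order.
func deepText(n *node) string {
	var sb strings.Builder
	var walk func(cur *node)
	walk = func(cur *node) {
		for _, c := range cur.children {
			sb.WriteString(c.text)
			walk(c)
		}
	}
	walk(n)
	return sb.String()
}

func main() {
	// Models: <p>Price: <b>$29.99</b> only</p>
	p := &node{children: []*node{
		{text: "Price: "},
		{children: []*node{{text: "$29.99"}}},
		{text: " only"},
	}}
	fmt.Printf("%q\n", childText(p)) // "Price:  only"
	fmt.Printf("%q\n", deepText(p))  // "Price: $29.99 only"
}
```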
func GetExtractorMap ¶ added in v0.2.0
GetExtractorMap returns the default map to match extracting tags and extracting functions (or extractors).
func ValidateNotNil ¶ added in v0.4.0
ValidateNotNil checks whether the given o variable is nil or is a pointer that points to nil, and returns [NilErr] if so. varName is a string that specifies the name of the variable that is nil.
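A check of this kind can be sketched with the standard reflect package. The nilErr type, its message format, and the validateNotNil function below are hypothetical stand-ins written for illustration; the package's own implementation and error text may differ.

```go
package main

import (
	"fmt"
	"reflect"
)

// nilErr is a hypothetical stand-in for the package's NilErr type.
type nilErr struct{ VarName string }

func (e nilErr) Error() string { return e.VarName + " is nil" }

// validateNotNil returns an error when o is an untyped nil or a nil pointer.
// An interface holding a typed nil pointer is not equal to nil, so the
// pointer case needs reflection.
func validateNotNil(o any, varName string) error {
	if o == nil {
		return nilErr{VarName: varName}
	}
	if v := reflect.ValueOf(o); v.Kind() == reflect.Pointer && v.IsNil() {
		return nilErr{VarName: varName}
	}
	return nil
}

func main() {
	var p *string
	s := "ok"
	fmt.Println(validateNotNil(nil, "o")) // o is nil
	fmt.Println(validateNotNil(p, "p"))   // p is nil
	fmt.Println(validateNotNil(&s, "s"))  // <nil>
}
```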
Types ¶
type AttributeNotFoundErr ¶ added in v0.4.0
type AttributeNotFoundErr struct {
Attr string
}
func (AttributeNotFoundErr) Error ¶ added in v0.4.0
func (e AttributeNotFoundErr) Error() string
type ExtractTagErr ¶ added in v0.4.0
type ExtractTagErr struct {
ExtractTag string
}
func (ExtractTagErr) Error ¶ added in v0.4.0
func (e ExtractTagErr) Error() string
type Extractor ¶
Extractor is a function that processes the given node and returns the valuable data in string format.
type Match ¶ added in v0.2.0
Match wraps the boolean logic that matches extract tag values to extracting functions (extractors). It also returns the processed value of the extract tag (for example, "@href" -> "href").
func GetEqualMatch ¶ added in v0.2.0
GetEqualMatch creates a Match function that compares the given value with the value of the extracting tag.
func GetPrefixMatch ¶ added in v0.2.0
GetPrefixMatch creates a Match function that checks whether the extract tag value has the given prefix and returns the boolean result together with the tag value. On a match, it strips the matched prefix from the returned value (for example, "@href" -> "href").
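The two matching strategies can be sketched as closures. The matchFunc signature below is an assumption made for illustration (the actual definition of Match is not shown on this page), and equalMatch and prefixMatch are hypothetical names.

```go
package main

import (
	"fmt"
	"strings"
)

// matchFunc is a hypothetical signature: it reports whether the extract tag
// matches and returns the processed tag value.
type matchFunc func(tag string) (string, bool)

// equalMatch matches tags that are exactly equal to want.
func equalMatch(want string) matchFunc {
	return func(tag string) (string, bool) {
		return tag, tag == want
	}
}

// prefixMatch matches tags that start with prefix and, on a match,
// returns the tag with the prefix removed.
func prefixMatch(prefix string) matchFunc {
	return func(tag string) (string, bool) {
		if strings.HasPrefix(tag, prefix) {
			return strings.TrimPrefix(tag, prefix), true
		}
		return tag, false
	}
}

func main() {
	isText := equalMatch("text")
	isAttr := prefixMatch("@")

	fmt.Println(isText("text"))  // text true
	fmt.Println(isAttr("@href")) // href true
	fmt.Println(isAttr("text"))  // text false
}
```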
type NoNodesFoundErr ¶ added in v0.4.0
type NoNodesFoundErr struct{}
func (NoNodesFoundErr) Error ¶ added in v0.4.0
func (e NoNodesFoundErr) Error() string
type Scraper ¶
type Scraper struct {
	// Mode can take three states: [Strict], [Tolerant], and [Silent].
	//   - [Strict] mode assumes that any error caused during scraping is fatal
	//     and stops the following scraping.
	//   - [Tolerant] mode assumes that scraping should not be stopped by errors
	//     where possible and all errors are returned.
	//   - [Silent] mode assumes that scraping should not be stopped by errors
	//     where possible and these errors are not returned.
	Mode Mode
	// Extractors is a map that matches custom user extractors to extract tags.
	// Do not use reserved extractor tag names and patterns ([TextExtractTag],
	// [AttrExtractTag], and others), otherwise, the default implementation is executed.
	Extractors map[*Match]Extractor
}
Scraper is a struct that contains a method to scrape data from an HTML document (goquery.Document).
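The three modes described in the Mode comment can be modeled with a short sketch. Everything here is hypothetical and standard-library only: the mode constants, the runSteps helper, and the use of errors.Join to combine errors are assumptions for illustration, not the package's actual implementation.

```go
package main

import (
	"errors"
	"fmt"
)

// mode is a hypothetical stand-in for the package's Mode type.
type mode int

const (
	strict   mode = iota // stop at the first error and return it
	tolerant             // keep going and return all collected errors
	silent               // keep going and return no errors
)

// fail returns a step that always fails with msg.
func fail(msg string) func() error {
	return func() error { return errors.New(msg) }
}

// succeed is a step that never fails.
func succeed() error { return nil }

// runSteps runs each step and handles failures according to m.
// It returns how many steps ran and the error to report.
func runSteps(m mode, steps []func() error) (int, error) {
	var errs []error
	ran := 0
	for _, step := range steps {
		ran++
		err := step()
		if err == nil {
			continue
		}
		switch m {
		case strict:
			return ran, err
		case tolerant:
			errs = append(errs, err)
		}
	}
	// errors.Join returns nil when errs is empty.
	return ran, errors.Join(errs...)
}

func main() {
	steps := []func() error{fail("no nodes found"), succeed, fail("attribute not found")}
	for _, m := range []mode{strict, tolerant, silent} {
		ran, err := runSteps(m, steps)
		fmt.Printf("mode=%d ran=%d err=%v\n", m, ran, err)
	}
}
```

In this sketch, strict mode stops after the first failing step, while tolerant and silent modes run all three steps and differ only in whether the collected errors are returned.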
func (Scraper) Scrape ¶
Scrape scrapes the given doc and writes the useful information into o.
o must be a pointer to a string, slice, or struct; otherwise Scrape returns an error. Both slices and structs can contain pointers, strings, slices, and structs, but every end value must be a string.
selector is a jQuery-like selector that specifies the path to the nodes (it is used in goquery.Selection.Find). If selector is empty, the doc selection (goquery.Document.Selection) is used by default.
extract is a value that specifies how to get useful data from the node. extract is required only if o is a pointer to a string or slice; in all other cases you can leave it empty.
type ScrapingErr ¶ added in v0.4.0
func (ScrapingErr) Error ¶ added in v0.4.0
func (e ScrapingErr) Error() string