scrape

package

v0.4.0 (latest) | Published: Sep 20, 2024 | License: MIT | Imports: 7 | Imported by: 0

Repository

github.com/branow/htmlscraper

Documentation ¶

Overview ¶

Package scrape implements scraping functionality for extracting useful data from HTML text using jQuery-like selectors. It provides the Scraper struct, whose Scraper.Scrape method is built on the github.com/PuerkitoBio/goquery library.

Here is a simple example that scrapes a Product struct from an HTML document.

 htmlData := `<div class="product">
 	<img src="https://via.placeholder.com/200" alt="Product 1">
 	<h2>Product 1</h2>
 	<p>Great product for your needs.</p>
 	<p class="price">$29.99</p>
 </div>`

 // Parse the HTML into a goquery document.
 r := bytes.NewBufferString(htmlData)
 doc, err := goquery.NewDocumentFromReader(r)
 if err != nil {
 	panic(err)
 }

 // The zero value of Scraper runs in Strict mode with the default extractors.
 scraper := scrape.Scraper{}

 // Each field names the node to select and what to extract from it.
 type Product struct {
 	Name        string `select:"h2" extract:"text"`
 	Description string `select:"p" extract:"text"`
 	Price       string `select:".price" extract:"text"`
 	Image       string `select:"img" extract:"@src"`
 }

 // Scrape the first ".product" node into product.
 var product Product
 err = scraper.Scrape(doc, &product, ".product", "")

 // Print the result.
 fmt.Println("Got Error:", err)
 fmt.Println("Got Output:")
 fmt.Println(product)

Index ¶

Constants
func ExtractAttribute(node *html.Node, attr string) (string, error)
func ExtractDeepText(node *html.Node) string
func ExtractText(node *html.Node) string
func GetExtractorMap() map[*Match]Extractor
func ValidateNotNil(o any, varName string) error
type AttributeNotFoundErr
- func (e AttributeNotFoundErr) Error() string
type ExtractTagErr
- func (e ExtractTagErr) Error() string
type Extractor
type KindErr
- func (e KindErr) Error() string
type Match
- func GetEqualMatch(expected string) Match
- func GetPrefixMatch(prefix string) Match
type Mode
type NilErr
- func (e NilErr) Error() string
type NoNodesFoundErr
- func (e NoNodesFoundErr) Error() string
type ScrapeErr
- func (e ScrapeErr) Error() string
type Scraper
- func (scraper Scraper) Scrape(doc *goquery.Document, o any, selector string, extract string) error
type ScrapingErr
- func (e ScrapingErr) Error() string

Constants ¶

const (
	TextExtractTag     = "text"     // get a text of children's text nodes
	DeepTextExtractTag = "deeptext" // get a text of descendants' text nodes
	AttrExtractTag     = "@"        // get a value of an attribute ("@href", "@src")
)

Extractor tags to specify extract operations.

const (
	SelectorTag  = "select"  // jQuery-like selector to find the node
	ExtractorTag = "extract" // extract operation to get useful data from the node
)

Struct tags that let you specify where the valuable data is and how to get it from the html.Node.
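To make the role of the two struct tags concrete, here is a minimal sketch of how tag values like these can be read with the standard reflect package. It is not the package's implementation; the helper fieldTags and the lowercase constants are hypothetical names introduced only for this illustration.

```go
package main

import (
	"fmt"
	"reflect"
)

// Tag keys with the same values as the documented SelectorTag and
// ExtractorTag constants (redeclared here so the sketch is self-contained).
const (
	selectorTag  = "select"
	extractorTag = "extract"
)

// Product is a shortened version of the struct from the overview example.
type Product struct {
	Name  string `select:"h2" extract:"text"`
	Image string `select:"img" extract:"@src"`
}

// fieldTags is a hypothetical helper: for every field of a struct value it
// returns one line holding the field name and its two tag values.
func fieldTags(v any) []string {
	t := reflect.TypeOf(v)
	out := make([]string, 0, t.NumField())
	for i := 0; i < t.NumField(); i++ {
		f := t.Field(i)
		out = append(out, fmt.Sprintf("%s: select=%q extract=%q",
			f.Name, f.Tag.Get(selectorTag), f.Tag.Get(extractorTag)))
	}
	return out
}

func main() {
	for _, line := range fieldTags(Product{}) {
		fmt.Println(line)
	}
}
```

A field without one of the tags would simply yield an empty string from Tag.Get; how the real scraper treats that case is not stated in this documentation.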

Variables ¶

This section is empty.

Functions ¶

func ExtractAttribute ¶ added in v0.2.0

func ExtractAttribute(node *html.Node, attr string) (string, error)

ExtractAttribute returns the value of the given attribute. If the attribute is absent it returns an error.

func ExtractDeepText ¶ added in v0.2.0

func ExtractDeepText(node *html.Node) string

ExtractDeepText returns the text of all descendants' text nodes.

func ExtractText ¶ added in v0.2.0

func ExtractText(node *html.Node) string

ExtractText returns the text of all children's text nodes.
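The difference between "children's text nodes" and "descendants' text nodes" can be sketched on a toy tree. The node type below is a hypothetical stand-in for html.Node, and childText and deepText are illustrative helpers, not the package's functions; details such as whitespace handling in the real extractors are not covered by this documentation.

```go
package main

import (
	"fmt"
	"strings"
)

// node is a hypothetical, simplified stand-in for html.Node.
// A text node has an empty Tag and a non-empty Text.
type node struct {
	Tag      string
	Text     string
	Children []*node
}

// childText concatenates the text of direct child text nodes only.
func childText(n *node) string {
	var b strings.Builder
	for _, c := range n.Children {
		if c.Tag == "" {
			b.WriteString(c.Text)
		}
	}
	return b.String()
}

// deepText concatenates the text of all descendant text nodes.
func deepText(n *node) string {
	if n.Tag == "" {
		return n.Text
	}
	var b strings.Builder
	for _, c := range n.Children {
		b.WriteString(deepText(c))
	}
	return b.String()
}

// sample builds the tree for <p>Price: <b>$29.99</b> only</p>.
func sample() *node {
	return &node{Tag: "p", Children: []*node{
		{Text: "Price: "},
		{Tag: "b", Children: []*node{{Text: "$29.99"}}},
		{Text: " only"},
	}}
}

func main() {
	p := sample()
	fmt.Printf("%q\n", childText(p)) // text nested in <b> is skipped
	fmt.Printf("%q\n", deepText(p))  // text nested in <b> is included
}
```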

func GetExtractorMap ¶ added in v0.2.0

func GetExtractorMap() map[*Match]Extractor

GetExtractorMap returns the default map to match extracting tags and extracting functions (or extractors).

func ValidateNotNil ¶ added in v0.4.0

func ValidateNotNil(o any, varName string) error

ValidateNotNil checks whether the given variable o is nil or is a nil pointer and, if so, returns [NilErr]. varName is a string that specifies the name of the variable that is nil.
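The distinction between "nil" and "a pointer that points to nil" matters in Go because a nil pointer stored in an interface value does not compare equal to nil. The following is a minimal sketch of such a check, assuming reflection is used; nilErr and validateNotNil are hypothetical stand-ins, and the error message is invented for the example.

```go
package main

import (
	"fmt"
	"reflect"
)

// nilErr is a hypothetical stand-in for the package's NilErr type.
type nilErr struct{ Var string }

// Error uses an invented message; the real NilErr message may differ.
func (e nilErr) Error() string { return e.Var + " is nil" }

// validateNotNil returns an error when o is an untyped nil or a nil pointer.
func validateNotNil(o any, varName string) error {
	if o == nil {
		return nilErr{Var: varName}
	}
	// A typed nil pointer inside an interface is not == nil,
	// so it has to be detected through reflection.
	if v := reflect.ValueOf(o); v.Kind() == reflect.Pointer && v.IsNil() {
		return nilErr{Var: varName}
	}
	return nil
}

func main() {
	var p *string
	s := "ok"
	fmt.Println(validateNotNil(nil, "doc")) // error: untyped nil
	fmt.Println(validateNotNil(p, "o"))     // error: nil pointer
	fmt.Println(validateNotNil(&s, "o"))    // no error
}
```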

Types ¶

type AttributeNotFoundErr ¶ added in v0.4.0

type AttributeNotFoundErr struct {
	Attr string
}

func (AttributeNotFoundErr) Error ¶ added in v0.4.0

func (e AttributeNotFoundErr) Error() string

type ExtractTagErr ¶ added in v0.4.0

type ExtractTagErr struct {
	ExtractTag string
}

func (ExtractTagErr) Error ¶ added in v0.4.0

func (e ExtractTagErr) Error() string

type Extractor ¶

type Extractor func(node *html.Node, extract string) (string, error)

Extractor is a function that processes the given node and returns the valuable data in string format.

type KindErr ¶ added in v0.4.0

type KindErr struct {
	Var     any
	KindExp any
	KindAct any
}

func (KindErr) Error ¶ added in v0.4.0

func (e KindErr) Error() string

type Match ¶ added in v0.2.0

type Match func(extract string) (string, bool)

Match encapsulates the logic that decides whether an extract tag value belongs to a given extractor. It reports whether the value matched and returns the processed tag value (for example, "@href" -> "href").

func GetEqualMatch ¶ added in v0.2.0

func GetEqualMatch(expected string) Match

GetEqualMatch creates a Match function that compares the given value with the value of the extracting tag.

func GetPrefixMatch ¶ added in v0.2.0

func GetPrefixMatch(prefix string) Match

GetPrefixMatch creates a Match function that checks whether the extract tag value has the given prefix. It returns the tag value together with a boolean result; on a match, the prefix is cut from the returned value (for example, "@href" -> "href").
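The behavior described for the two constructors can be sketched with the standard strings package. The Match signature is taken from this documentation, but equalMatch and prefixMatch are hypothetical re-implementations written for illustration; in particular, what the real functions return as the string on a failed match is an assumption here.

```go
package main

import (
	"fmt"
	"strings"
)

// Match has the signature documented for the package's Match type.
type Match func(extract string) (string, bool)

// equalMatch sketches the documented behavior of GetEqualMatch:
// the tag value matches only if it equals expected.
func equalMatch(expected string) Match {
	return func(extract string) (string, bool) {
		return extract, extract == expected
	}
}

// prefixMatch sketches the documented behavior of GetPrefixMatch:
// on a match the prefix is cut from the returned value.
func prefixMatch(prefix string) Match {
	return func(extract string) (string, bool) {
		if strings.HasPrefix(extract, prefix) {
			return strings.TrimPrefix(extract, prefix), true
		}
		// Assumption: the value is returned unchanged when it does not match.
		return extract, false
	}
}

func main() {
	attr := prefixMatch("@")
	name, ok := attr("@href")
	fmt.Println(name, ok) // href true

	name, ok = attr("text")
	fmt.Println(name, ok) // text false

	name, ok = equalMatch("text")("text")
	fmt.Println(name, ok) // text true
}
```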

type Mode ¶ added in v0.4.0

type Mode uint

const (
	Strict Mode = iota
	Tolerant
	Silent
)

type NilErr ¶ added in v0.4.0

type NilErr struct {
	Var string
}

func (NilErr) Error ¶ added in v0.4.0

func (e NilErr) Error() string

type NoNodesFoundErr ¶ added in v0.4.0

type NoNodesFoundErr struct{}

func (NoNodesFoundErr) Error ¶ added in v0.4.0

func (e NoNodesFoundErr) Error() string

type ScrapeErr ¶ added in v0.4.0

type ScrapeErr struct {
	Cause error
}

func (ScrapeErr) Error ¶ added in v0.4.0

func (e ScrapeErr) Error() string

type Scraper ¶

type Scraper struct {

	// Mode is one of [Strict], [Tolerant], or [Silent].
	// - [Strict] mode treats any error raised during scraping as fatal
	// and stops further scraping.
	// - [Tolerant] mode does not let errors stop scraping where possible,
	// and returns all errors that occurred.
	// - [Silent] mode does not let errors stop scraping where possible,
	// and does not return these errors.
	Mode Mode

	// Extractors is a map that matches custom user extractors to extract tags.
	// Do not use reserved extractor tag names and patterns ([TextExtractTag],
	// [AttrExtractTag], and others), otherwise, the default implementation is executed.
	Extractors map[*Match]Extractor
}

Scraper is a struct that contains a method to scrape data from an HTML document (goquery.Document).
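The three modes described in the Mode field comment amount to three error-handling policies. The sketch below illustrates them on a list of generic steps; it is not the scraper's code. Mode and its constants are redeclared locally with the documented values, while runSteps and sampleSteps are hypothetical helpers, and joining errors with errors.Join is an assumption about how "all errors are returned" could be realized.

```go
package main

import (
	"errors"
	"fmt"
)

// Mode and its constants are redeclared with the documented values.
type Mode uint

const (
	Strict Mode = iota
	Tolerant
	Silent
)

// runSteps is a hypothetical helper that runs each step and applies the
// error policy of the given mode. It returns how many steps were run.
func runSteps(mode Mode, steps []func() error) (int, error) {
	var errs []error
	ran := 0
	for _, step := range steps {
		ran++
		err := step()
		if err == nil {
			continue
		}
		switch mode {
		case Strict:
			return ran, err // the first error is fatal
		case Tolerant:
			errs = append(errs, err) // remember the error and continue
		case Silent:
			// drop the error and continue
		}
	}
	return ran, errors.Join(errs...) // nil when no errors were collected
}

// sampleSteps returns one successful step followed by two failing ones.
func sampleSteps() []func() error {
	return []func() error{
		func() error { return nil },
		func() error { return errors.New("no nodes found") },
		func() error { return errors.New("attribute not found") },
	}
}

func main() {
	for _, m := range []Mode{Strict, Tolerant, Silent} {
		ran, err := runSteps(m, sampleSteps())
		fmt.Printf("mode=%d ran=%d failed=%t\n", m, ran, err != nil)
	}
}
```

Because Strict is the zero value of Mode, a Scraper created as scrape.Scraper{} runs in Strict mode.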

func (Scraper) Scrape ¶

func (scraper Scraper) Scrape(doc *goquery.Document, o any, selector string, extract string) error

Scrape scrapes the given doc and writes the useful information into o.

o must be a pointer to a string, slice, or struct; any other type results in an error. Slices and structs may contain pointers, strings, slices, and structs, but every leaf value must be a string.

selector is a jQuery-like selector that specifies the path to the nodes (it is passed to goquery.Selection.Find). If selector is empty, the document's root selection (goquery.Document.Selection) is used by default.

extract is a value that specifies how to get useful data from the node. extract is required only if o is a pointer to a string or slice, in all other cases you can leave it empty.

type ScrapingErr ¶ added in v0.4.0

type ScrapingErr struct {
	Selector string
	Cause    error
}

func (ScrapingErr) Error ¶ added in v0.4.0

func (e ScrapingErr) Error() string
