scraper package - github.com/luxas/speakerdeck-scraper/scraper

Documentation ¶

Overview ¶

The scraper package contains generic, high-level scraper functionality built on top of github.com/gocolly/colly In order to use it, create a struct (e.g. MyScraper) that embeds the BaseScraper, and implements the Scraper interface, e.g.

type MyScraper struct {}

func (ms *MyScraper) InitialData() interface{} {
	return &MyScrapedData{}
}

func (ms *MyScraper) Hooks() []scraper.Hook {
	return []scraper.Hook{
		{
			DOMPath: "#my-awesome-element",
			Handler: extractImportantMessage,
		},
	}
}

func extractImportantMessage(e *colly.HTMLElement, data interface{}) (*string, error) {
	myData := data.(*MyScrapedData)
	myData.ImportantMessage = e.Text
	return nil, nil
}

type MyScrapedData struct {
	ImportantMessage string
}

func main() {
	s := &MyScraper{}
	data, err := scraper.Scrape("example.com", s, nil)
	if err != nil {
		log.Fatal(err)
	}
	myData := data.(*MyScrapedData)
	log.Printf("Important message is: %s", myData.ImportantMessage)
}

Index ¶

func Scrape(url string, s Scraper, opts *ScrapeOptions) (interface{}, error)
type Extension
type Hook
type HookFn
type ScrapeOptions
type Scraper

Constants ¶

This section is empty.

Variables ¶

This section is empty.

Functions ¶

func Scrape ¶

func Scrape(url string, s Scraper, opts *ScrapeOptions) (interface{}, error)

Scrape takes in a Scraper struct, an URL to scrape, and optionally extra options. This function calls handlers from the the Scraper.Hooks() for the given DOM paths, and shares the Scraper.InitialData() struct pointer between them. The return value is that struct pointer, and/or possibly an error.

Types ¶

type Extension ¶

type Extension interface {
	// Name returns the name of the extension
	Name() string

	// Hook is the hook registered by this extension
	Hook() Hook
}

Extension is an interface which allows for adding extensions on-demand to scraping implementations. Upon calling Scrape(), you may pass extra extension implementations in ScrapeOptions. The extension can register its own extra hook for processing the DOM. The extension shares/manipulates the same data as the Scraper it's used together with.

type Hook ¶

type Hook struct {
	// DOMPath specifies one or many elements in the DOM tree using a CSS selector
	DOMPath string

	// Handler specifies the handler to be invoked for all of the elements on the HTML page matched by the CSS selector
	Handler HookFn
}

Hook maps a handler of type HookFn to a DOMPath in the tree. The DOMPath can be any valid CSS selector.

type HookFn ¶

type HookFn func(e *colly.HTMLElement, data interface{}) (*string, error)

HookFn is a callback function for processing HTML data at a given place in the DOM tree The first e argument gives access to the DOM, and the second data argument carries a pointer to the data struct you want to save important information in. You can cast data to what's returned by Scraper.InitialData(). The return values are an optional string which tells the scraper to also scrape an other page, and an error.

type ScrapeOptions ¶

type ScrapeOptions struct {
	// Extensions allows registering extensions to a Scrape() call
	Extensions []Extension
	// LogLevel specifies the logrus log level for the Scrape() function
	LogLevel *log.Level
}

ScrapeOptions contains extra parameters used when scraping

type Scraper ¶

type Scraper interface {
	// Name returns an user-friendly name of the scraper
	Name() string

	// Hooks returns the hooks for all HTML elements that should be matched and their handlers.
	Hooks() []Hook

	// InitialData returns the struct pointer which is then shared between/passed to all hook handlers.
	InitialData() interface{}
}

Scraper is an interface which scraping implementations should implement. Any struct that satisfies this interface, may be passed to the generic Scrape function in this package.

Source Files ¶

View all Source files

scraper.go

?	: This menu
/	: Search site
f or F	: Jump to
y or Y	: Canonical URL