listeater

Version: v0.0.0-...-fe636d4 | Published: Apr 9, 2016 | License: MIT | Imports: 12 | Imported by: 0

Repository

github.com/peterdeka/listeater

README ¶

listeater

A simple crawler aimed at eating paginated lists of elements.

Plenty of crawlers already exist, but I needed a simple, configurable one to do the hard work of crawling different kinds of lists.

List pages all follow the same pattern, but list elements differ widely, so crawling each element is delegated to your custom function. Each element is crawled in its own goroutine.

Have fun.

Documentation ¶

Index ¶

Variables
type CrawlDescriptor
type CrawlResult
type ElementCrawler
type HrefPaginationHandler
- func (hph HrefPaginationHandler) Paginate(r *http.Response) (*http.Request, bool, error)
type ListEater
- func (le *ListEater) Crawl(resChan chan CrawlResult, elementCrawler ElementCrawler, ...) error
type ListEaterConfig
type LoginCredentials
type LoginDescriptor
type PaginationHandler

Variables ¶

var ErrCannotLogin = errors.New("Could not login, check your credentials.")

var ErrInvalidConfig = errors.New("Invalid config, missing crawl ops.")

var ErrInvalidSelector = errors.New("Invalid selector for pagination.")

var ErrInvalidUrlHrefInPagination = errors.New("Pagination found but url in href not found.")

var ErrNoHrefInPagination = errors.New("Pagination found but no href inside.")

var ErrNoLoginCreds = errors.New("No login credentials provided, login needed.")

Types ¶

type CrawlDescriptor ¶

type CrawlDescriptor struct {
	ListUrl string `json:"url"`
	Element string `json:"element"`
}

type CrawlResult ¶

type CrawlResult struct {
	Element interface{}
	Error   error
	Done    bool
}

CrawlResult is a single crawl result, together with the associated error if one occurred.

type ElementCrawler ¶

type ElementCrawler interface {
	Extract(r *http.Response, resChan chan CrawlResult)
}

ElementCrawler is the interface that must be implemented to crawl a single element of the list.

type HrefPaginationHandler ¶

type HrefPaginationHandler struct {
	Selector string `json:"selector"`
}

HrefPaginationHandler is a simple pagination handler that extracts the href from the element matched by Selector.

func (HrefPaginationHandler) Paginate ¶

func (hph HrefPaginationHandler) Paginate(r *http.Response) (*http.Request, bool, error)

type ListEater ¶

type ListEater struct {
	LoginDesc *LoginDescriptor
	CrawlDesc *CrawlDescriptor
	Client    *http.Client
	Paginator PaginationHandler
}

ListEater is the main type of this package.

func (*ListEater) Crawl ¶

func (le *ListEater) Crawl(resChan chan CrawlResult, elementCrawler ElementCrawler, creds *LoginCredentials) error

Crawl is the main listeater function; it does the actual crawling.

type ListEaterConfig ¶

type ListEaterConfig struct {
	Login *LoginDescriptor `json:"login"`
	Crawl *CrawlDescriptor `json:"crawl"`
}
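The JSON tags on these types suggest the configuration is meant to be loaded from a JSON document. The following is a minimal sketch of that decoding, with the types redeclared locally so it compiles on its own. `parseConfig`, the sample URLs, and the field values are hypothetical; the `user_field` key assumes a well-formed tag on UserField, and treating `element` as a CSS-style selector is a guess.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Local mirrors of the documented types, including their JSON tags.
type LoginDescriptor struct {
	Url           string `json:"url"`
	UserField     string `json:"user_field"` // assumed key name
	PasswordField string `json:"psw_field"`
}

type CrawlDescriptor struct {
	ListUrl string `json:"url"`
	Element string `json:"element"`
}

type ListEaterConfig struct {
	Login *LoginDescriptor `json:"login"`
	Crawl *CrawlDescriptor `json:"crawl"`
}

// parseConfig is a hypothetical helper that decodes a JSON configuration.
func parseConfig(data []byte) (*ListEaterConfig, error) {
	var cfg ListEaterConfig
	if err := json.Unmarshal(data, &cfg); err != nil {
		return nil, err
	}
	return &cfg, nil
}

// sample uses made-up URLs and field names purely for illustration.
const sample = `{
  "login": {
    "url": "https://example.com/login",
    "user_field": "username",
    "psw_field": "password"
  },
  "crawl": {
    "url": "https://example.com/items",
    "element": "ul.items li a"
  }
}`

func main() {
	cfg, err := parseConfig([]byte(sample))
	if err != nil {
		fmt.Println("invalid config:", err)
		return
	}
	fmt.Println("list URL:", cfg.Crawl.ListUrl)
	fmt.Println("login required:", cfg.Login != nil)
}
```

Because Login and Crawl are pointers, a missing `login` object decodes to nil, which gives a simple way to tell a site that needs no login from one that does.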

type LoginCredentials ¶

type LoginCredentials struct {
	// contains filtered or unexported fields
}

LoginCredentials holds the credentials for login, if one is needed.

type LoginDescriptor ¶

type LoginDescriptor struct {
	Url           string `json:"url"`
	UserField     string `json:"user_field"`
	PasswordField string `json:"psw_field"`
}

type PaginationHandler ¶

type PaginationHandler interface {
	Paginate(r *http.Response) (*http.Request, bool, error)
}

PaginationHandler is the interface that must be implemented to paginate from a page. Paginate returns the request for the next page, a bool reporting whether a next page exists, and an error.
