listeater

v0.0.0-...-fe636d4
Published: Apr 9, 2016 | License: MIT | Imports: 12 | Imported by: 0

README

listeater

A simple crawler aimed at eating paginated lists of elements.

Plenty of crawlers already exist; however, I needed a simple, configurable one to do the hard work of crawling different types of lists.

Lists tend to share the same structure while their elements differ widely, so crawling each element is delegated to a custom function that you supply. Each element is crawled in its own goroutine.

Have fun.

Documentation

Constants

This section is empty.

Variables

var ErrCannotLogin = errors.New("Could not login, check your credentials.")
var ErrInvalidConfig = errors.New("Invalid config, missing crawl ops.")
var ErrInvalidSelector = errors.New("Invalid selector for pagination.")
var ErrInvalidUrlHrefInPagination = errors.New("Pagination found but url in href not found.")
var ErrNoHrefInPagination = errors.New("Pagination found but no href inside.")
var ErrNoLoginCreds = errors.New("No login credentials provided, login needed.")

Functions

This section is empty.

Types

type CrawlDescriptor

type CrawlDescriptor struct {
	ListUrl string `json:"url"`
	Element string `json:"element"`
}

type CrawlResult

type CrawlResult struct {
	Element interface{}
	Error   error
	Done    bool
}

CrawlResult is a single crawl result, together with the associated error, if any.

type ElementCrawler

type ElementCrawler interface {
	Extract(r *http.Response, resChan chan CrawlResult)
}

ElementCrawler is the interface that must be implemented to crawl a single element of the list.
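As a minimal sketch of what an implementation could look like, the example below defines a hypothetical `bodyCrawler` that sends the trimmed response body as the crawled element. `CrawlResult` and `ElementCrawler` are redeclared locally (copied from the definitions on this page) so the snippet compiles without the package; `bodyCrawler` and `extractOne` are names invented for illustration. Whether the implementation or the library is expected to close the response body is not documented here, so closing it inside `Extract` is an assumption.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"strings"
)

// Local copies of the documented types, so the sketch builds on its own.
type CrawlResult struct {
	Element interface{}
	Error   error
	Done    bool
}

type ElementCrawler interface {
	Extract(r *http.Response, resChan chan CrawlResult)
}

// bodyCrawler is a hypothetical ElementCrawler: it reports the trimmed
// response body as the element, or the read error if reading fails.
type bodyCrawler struct{}

func (bodyCrawler) Extract(r *http.Response, resChan chan CrawlResult) {
	defer r.Body.Close() // assumption: the implementation owns the body
	data, err := io.ReadAll(r.Body)
	if err != nil {
		resChan <- CrawlResult{Error: err}
		return
	}
	resChan <- CrawlResult{Element: strings.TrimSpace(string(data))}
}

// extractOne runs a crawler against an in-memory response, which makes
// the implementation easy to try without any network access.
func extractOne(c ElementCrawler, body string) CrawlResult {
	resChan := make(chan CrawlResult, 1) // buffered so Extract does not block
	resp := &http.Response{Body: io.NopCloser(strings.NewReader(body))}
	c.Extract(resp, resChan)
	return <-resChan
}

func main() {
	res := extractOne(bodyCrawler{}, "  first item \n")
	fmt.Printf("element=%q err=%v\n", res.Element, res.Error)
}
```

A real implementation would more likely parse the HTML and send a typed value; the point here is only the shape of `Extract` and the use of the result channel.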

type HrefPaginationHandler

type HrefPaginationHandler struct {
	Selector string `json:"selector"`
}

HrefPaginationHandler is a simple pagination handler that extracts the href from the element matched by Selector.

func (HrefPaginationHandler) Paginate

func (hph HrefPaginationHandler) Paginate(r *http.Response) (*http.Request, bool, error)
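How the handler turns the extracted href into the next request is not described on this page. The sketch below shows one plausible approach using only the standard library: resolve the href against the URL of the current page and build a GET request from it. The function names `nextPageRequest` and `resolveNext`, and the local `errNoHref` (a stand-in for `ErrNoHrefInPagination`), are assumptions for illustration and not the package's actual code; extracting the href by selector is left out.

```go
package main

import (
	"errors"
	"fmt"
	"net/http"
	"net/url"
)

// errNoHref is a local stand-in for the package's ErrNoHrefInPagination.
var errNoHref = errors.New("pagination found but no href inside")

// nextPageRequest builds a GET request for href, resolved against the
// URL of the page currently being crawled. Relative and absolute hrefs
// are both handled by ResolveReference.
func nextPageRequest(current *url.URL, href string) (*http.Request, error) {
	if href == "" {
		return nil, errNoHref
	}
	ref, err := url.Parse(href)
	if err != nil {
		return nil, err
	}
	return http.NewRequest(http.MethodGet, current.ResolveReference(ref).String(), nil)
}

// resolveNext is a string-based convenience wrapper around nextPageRequest.
func resolveNext(currentURL, href string) (string, error) {
	current, err := url.Parse(currentURL)
	if err != nil {
		return "", err
	}
	req, err := nextPageRequest(current, href)
	if err != nil {
		return "", err
	}
	return req.URL.String(), nil
}

func main() {
	next, err := resolveNext("http://example.com/list?page=1", "?page=2")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(next)
}
```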

type ListEater

type ListEater struct {
	LoginDesc *LoginDescriptor
	CrawlDesc *CrawlDescriptor
	Client    *http.Client
	Paginator PaginationHandler
}

ListEater is the main type of this package.

func (*ListEater) Crawl

func (le *ListEater) Crawl(resChan chan CrawlResult, elementCrawler ElementCrawler, creds *LoginCredentials) error

Crawl is the main ListEater method; it performs the actual crawling.
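The following sketch shows one way a caller might consume results from the channel passed to `Crawl`. It assumes that a `CrawlResult` with `Done` set marks the end of the crawl; the field name suggests this, but the page does not state it. `CrawlResult` is redeclared locally, and `fakeCrawl` is a hypothetical stand-in for the real method so the example runs without network access.

```go
package main

import (
	"errors"
	"fmt"
)

// Local copy of the documented CrawlResult type.
type CrawlResult struct {
	Element interface{}
	Error   error
	Done    bool
}

// collect drains resChan until a result with Done set arrives or the
// channel is closed, separating elements from errors.
func collect(resChan <-chan CrawlResult) ([]interface{}, []error) {
	var elements []interface{}
	var errs []error
	for res := range resChan {
		if res.Error != nil {
			errs = append(errs, res.Error)
		} else if res.Element != nil {
			elements = append(elements, res.Element)
		}
		if res.Done { // assumption: Done signals the end of the crawl
			break
		}
	}
	return elements, errs
}

// fakeCrawl is a hypothetical producer standing in for (*ListEater).Crawl.
func fakeCrawl(resChan chan CrawlResult) {
	resChan <- CrawlResult{Element: "item 1"}
	resChan <- CrawlResult{Error: errors.New("item 2 failed")}
	resChan <- CrawlResult{Element: "item 3"}
	resChan <- CrawlResult{Done: true}
}

func main() {
	resChan := make(chan CrawlResult)
	go fakeCrawl(resChan)
	elements, errs := collect(resChan)
	fmt.Println("elements:", len(elements), "errors:", len(errs))
}
```

Keeping per-element errors in the result stream, rather than aborting on the first one, fits a design where each element is crawled in its own goroutine.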

type ListEaterConfig

type ListEaterConfig struct {
	Login *LoginDescriptor `json:"login"`
	Crawl *CrawlDescriptor `json:"crawl"`
}
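Given the JSON tags above, a configuration file can presumably be decoded with `encoding/json`. The sketch below decodes only the `crawl` section into locally redeclared copies of the types (the `login` section is omitted from the local copy) and rejects a config without it. `parseConfig` and `errMissingCrawl` are hypothetical names, the latter loosely mirroring `ErrInvalidConfig`; the sample URL and the reading of `element` as a selector are assumptions as well.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
)

// Local copies of the documented types; Login is left out of this sketch.
type CrawlDescriptor struct {
	ListUrl string `json:"url"`
	Element string `json:"element"`
}

type ListEaterConfig struct {
	Crawl *CrawlDescriptor `json:"crawl"`
}

// errMissingCrawl is a local stand-in for the package's ErrInvalidConfig.
var errMissingCrawl = errors.New("invalid config, missing crawl section")

// parseConfig decodes a JSON config and checks that the crawl section exists.
func parseConfig(data []byte) (*ListEaterConfig, error) {
	var cfg ListEaterConfig
	if err := json.Unmarshal(data, &cfg); err != nil {
		return nil, err
	}
	if cfg.Crawl == nil {
		return nil, errMissingCrawl
	}
	return &cfg, nil
}

func main() {
	raw := []byte(`{
		"crawl": {
			"url": "http://example.com/list",
			"element": "ul.items li a"
		}
	}`)
	cfg, err := parseConfig(raw)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(cfg.Crawl.ListUrl, cfg.Crawl.Element)
}
```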

type LoginCredentials

type LoginCredentials struct {
	// contains filtered or unexported fields
}

LoginCredentials holds the credentials used for login, if login is needed.

type LoginDescriptor

type LoginDescriptor struct {
	Url           string `json:"url"`
	UserField     string `json:"user_field"`
	PasswordField string `json:"psw_field"`
}

type PaginationHandler

type PaginationHandler interface {
	Paginate(r *http.Response) (*http.Request, bool, error)
}

PaginationHandler is the interface that must be implemented to paginate from a page. Paginate returns the request for the next page, a bool reporting whether a next page exists, and an error.
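For lists that are not paged through a link, a custom handler can be written against this interface. The sketch below is a hypothetical `pageParamPaginator` that increments a `page` query parameter until a configured last page is reached. The interface is redeclared locally; the parameter name, the `LastPage` field, and the `nextURL` helper are assumptions for illustration. Returning `nil, false, nil` when no page is left follows the documented return values, though the exact convention the library expects is not spelled out.

```go
package main

import (
	"fmt"
	"net/http"
	"net/url"
	"strconv"
)

// Local copy of the documented interface.
type PaginationHandler interface {
	Paginate(r *http.Response) (*http.Request, bool, error)
}

// pageParamPaginator is a hypothetical handler for lists paged through a
// "page" query parameter; it stops once LastPage has been reached.
type pageParamPaginator struct {
	LastPage int
}

func (p pageParamPaginator) Paginate(r *http.Response) (*http.Request, bool, error) {
	if r == nil || r.Request == nil || r.Request.URL == nil {
		return nil, false, fmt.Errorf("response carries no request URL")
	}
	next := *r.Request.URL // copy, so the original URL is left untouched
	q := next.Query()
	page := 1 // a missing parameter is treated as the first page
	if raw := q.Get("page"); raw != "" {
		n, err := strconv.Atoi(raw)
		if err != nil {
			return nil, false, fmt.Errorf("invalid page parameter %q: %w", raw, err)
		}
		page = n
	}
	if page >= p.LastPage {
		return nil, false, nil // no further page
	}
	q.Set("page", strconv.Itoa(page+1))
	next.RawQuery = q.Encode()
	req, err := http.NewRequest(http.MethodGet, next.String(), nil)
	if err != nil {
		return nil, false, err
	}
	return req, true, nil
}

// nextURL exercises a handler against a synthetic response, no network needed.
func nextURL(p PaginationHandler, current string) (string, bool, error) {
	u, err := url.Parse(current)
	if err != nil {
		return "", false, err
	}
	resp := &http.Response{Request: &http.Request{URL: u}}
	req, hasNext, err := p.Paginate(resp)
	if err != nil || !hasNext {
		return "", hasNext, err
	}
	return req.URL.String(), true, nil
}

func main() {
	var p PaginationHandler = pageParamPaginator{LastPage: 3}
	fmt.Println(nextURL(p, "http://example.com/list?page=2"))
	fmt.Println(nextURL(p, "http://example.com/list?page=3"))
}
```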
