brink

package module
v0.0.0-...-44ee8c0
Published: Feb 27, 2018 License: MIT Imports: 16 Imported by: 0

README

brink

Web crawler library

Documentation

Constants

const (
	AuthNone = iota
	AuthBasic
)

AuthType constants represent the type of authentication to use when visiting pages.
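
For example, given a CrawlOptions value opts (see the CrawlOptions type below), HTTP Basic authentication might be requested like this; the credentials are illustrative:

opts.AuthType = brink.AuthBasic // authenticate with HTTP Basic auth
opts.User = "user"
opts.Pass = "secret"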

Variables

This section is empty.

Functions

This section is empty.

Types

type ContentTooLarge

type ContentTooLarge struct {
	// contains filtered or unexported fields
}

ContentTooLarge error is returned by Fetch when the content length of a page is larger than what is allowed.

func (ContentTooLarge) Error

func (ctl ContentTooLarge) Error() string

type CrawlOptions

type CrawlOptions struct {
	AuthType int    `toml:"auth-type"`
	User     string `toml:"user"`
	Pass     string `toml:"pass"`

	// URLBufferSize is the number of URLs that can be queued waiting to be visited.
	URLBufferSize int `toml:"url-buffer-size"`

	// WorkerCount specifies the number of goroutines that will work on crawling the domains.
	WorkerCount int `toml:"worker-count"`

	// IdleWorkCheckInterval configures how frequently the crawler checks whether there is
	// any work to do. If there are no URLs left to process, the crawler gracefully stops
	// itself. Setting it to 0 uses the default value of 5000 milliseconds.
	IdleWorkCheckInterval int `toml:"idle-work-check-interval"`

	// MaxContentLength specifies the maximum size of pages to be crawled. Setting it to 0
	// defaults to 512KB. Set it to -1 to allow unlimited size.
	MaxContentLength int64 `toml:"max-content-length"`

	// EntryPoint is the first URL that will be fetched.
	EntryPoint string `toml:"entrypoint"`

	// AllowedDomains will be used to check whether a domain is allowed to be crawled or not.
	AllowedDomains []string `toml:"allowed-domains"`

	// Cookies holds a list of cookies to be added to all requests in addition to the ones
	// sent by the servers.
	Cookies map[string]*http.Cookie `toml:"cookies"`

	// Headers holds a key->value mapping to be added to all requests.
	Headers map[string]string `toml:"headers"`

	// IgnoreGETParameters lists GET parameters to ignore when comparing whether a URL has
	// been visited or not.
	IgnoreGETParameters []string `toml:"ignore-get-parameters"`

	// FuzzyGETParameterChecks decides whether parameters are matched exactly. If set to
	// false, GET parameters are only ignored if they are an exact match. If set to true,
	// they are checked in a substring fashion.
	FuzzyGETParameterChecks bool `toml:"fuzzy-get-parameter-checks"`

	// ForbiddenPaths lists URL path sections to avoid. URLs whose paths contain any section
	// specified in this list will not be visited.
	ForbiddenPaths []string `toml:"ignore-path-visits"`

	// SessionCookieNames holds all the cookie names that can represent a session ID. It is
	// used to check whether authorization has succeeded, so the crawler does not try to
	// re-authorize on every request.
	SessionCookieNames []string `toml:"session-cookie-names"`
}

CrawlOptions contains options for the crawler
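
A minimal sketch of populating CrawlOptions in code; the values are illustrative, and whether AllowedDomains expects full URLs or bare hostnames is not spelled out here:

opts := brink.CrawlOptions{
	EntryPoint:          "https://example.com",
	AllowedDomains:      []string{"https://example.com"},
	WorkerCount:         4,
	URLBufferSize:       1024,
	MaxContentLength:    -1, // no size limit
	IgnoreGETParameters: []string{"utm_source", "utm_medium"},
}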

type Crawler

type Crawler struct {
	RootDomain string
	// contains filtered or unexported fields
}

Crawler represents a web crawler, starting from a RootDomain and visiting all the links in the AllowedDomains map. It will only download the body of a URL if its size is less than MaxContentLength.

func NewCrawler

func NewCrawler(rootDomain string) (*Crawler, error)

NewCrawler returns a Crawler initialized with default values.
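
A minimal usage sketch:

c, err := brink.NewCrawler("https://example.com")
if err != nil {
	log.Fatal(err)
}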

func NewCrawlerFromToml

func NewCrawlerFromToml(filename string) (*Crawler, error)

NewCrawlerFromToml reads the given file and parses it as a TOML property file.
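
A sketch of loading a crawler from a TOML file; the keys follow the toml tags on CrawlOptions, and the file contents shown in the comment are illustrative:

// crawler.toml might contain, for example:
//
//	entrypoint      = "https://example.com"
//	allowed-domains = ["https://example.com"]
//	worker-count    = 4
//	url-buffer-size = 1024
//
c, err := brink.NewCrawlerFromToml("crawler.toml")
if err != nil {
	log.Fatal(err)
}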

func NewCrawlerWithOpts

func NewCrawlerWithOpts(rootDomain string, userOptions CrawlOptions) (*Crawler, error)

NewCrawlerWithOpts returns a Crawler initialized with the provided CrawlOptions struct.
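
Continuing the CrawlOptions sketch above, construction might look like:

c, err := brink.NewCrawlerWithOpts("https://example.com", opts)
if err != nil {
	log.Fatal(err)
}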

func (*Crawler) AllowDomains

func (c *Crawler) AllowDomains(domains ...string)

AllowDomains instructs the crawler which domains it is allowed to visit. The RootDomain is automatically added to this list. Domains not allowed will be checked for HTTP status, but will not be traversed.

Subsequent calls to AllowDomains add to the list of domains the crawler is allowed to traverse.
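
For example, given a Crawler c (the values are illustrative; whether bare hostnames or full URLs are expected is not documented here):

c.AllowDomains("https://blog.example.com", "https://docs.example.com")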

func (*Crawler) Fetch

func (c *Crawler) Fetch(url string) (status int, body []byte, err error)

Fetch fetches the URL and returns its status, body and/or any errors it encountered.
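
A sketch of calling Fetch on an existing Crawler c and distinguishing the error types this package defines, assuming (from their value receivers) that they are returned by value:

status, body, err := c.Fetch("https://example.com/somepage")
switch err.(type) {
case nil:
	log.Printf("got status %d and %d bytes", status, len(body))
case brink.NotAllowed:
	log.Println("domain not allowed:", err)
case brink.ContentTooLarge:
	log.Println("page exceeds MaxContentLength:", err)
default:
	log.Println("fetch failed:", err)
}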

func (*Crawler) HandleDefaultFunc

func (c *Crawler) HandleDefaultFunc(h func(linkedFrom string, url string, status int, body string, cached bool))

HandleDefaultFunc will be called for all pages returned with a status that doesn't have a separate handler defined by HandleFunc. Subsequent calls to HandleDefaultFunc will overwrite the previously set handler, if any.

func (*Crawler) HandleFunc

func (c *Crawler) HandleFunc(status int, h func(linkedFrom string, url string, status int, body string, cached bool))

HandleFunc is used to register a function to be called when a new page is found with the specified status. Subsequent calls to register functions to the same statuses will silently overwrite previously set handlers, if any.
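
A sketch of registering handlers on a Crawler c before calling Start:

// Pages that came back with HTTP 200.
c.HandleFunc(200, func(linkedFrom, url string, status int, body string, cached bool) {
	log.Printf("200 %s (linked from %s)", url, linkedFrom)
})

// Everything without a dedicated handler.
c.HandleDefaultFunc(func(linkedFrom, url string, status int, body string, cached bool) {
	log.Printf("%d %s", status, url)
})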

func (*Crawler) Start

func (c *Crawler) Start() error

Start starts the crawler at the specified rootDomain. It will scrape the page for links and then visit each of them, provided the domains are allowed. It will keep repeating this process on each page until it runs out of pages to visit.

Start requires at least one handler to be registered, otherwise it returns an error.

func (*Crawler) Stop

func (c *Crawler) Stop()

Stop attempts to stop the crawler.
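
Putting it together, a minimal end-to-end crawl might look like the sketch below; the import path is a placeholder for this module's real path, and everything else follows the API described above.

package main

import (
	"log"

	brink "example.com/brink" // placeholder import path; substitute the module's real path
)

func main() {
	c, err := brink.NewCrawler("https://example.com")
	if err != nil {
		log.Fatal(err)
	}

	// Start errors out unless at least one handler is registered.
	c.HandleDefaultFunc(func(linkedFrom, url string, status int, body string, cached bool) {
		log.Printf("%d %s (found on %s, cached=%v)", status, url, linkedFrom, cached)
	})

	if err := c.Start(); err != nil {
		log.Fatal(err)
	}
}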

type Link

type Link struct {
	LinkedFrom string
	Href       string
	Target     string
}

Link represents a very basic HTML anchor tag. LinkedFrom is the page on which it is found, Href is where it is pointing to.

func AbsoluteLinksIn

func AbsoluteLinksIn(hostURL, linkedFrom string, body []byte, ignoreAnchors bool) ([]Link, error)

AbsoluteLinksIn expects valid HTML to parse and returns a slice of the links (anchors) contained inside. If "ignoreAnchors" is set to true, then links which point to "#someAnchor" type locations are ignored.

If any link within the HTML starts with a forward slash (i.e. is a dynamic link), it is prefixed with the passed hostURL.

func LinksIn

func LinksIn(linkedFrom string, body []byte, ignoreAnchors bool) []Link

LinksIn expects valid HTML to parse and returns a slice of the links (anchors) contained inside. If "ignoreAnchors" is set to true, then links which point to "#someAnchor" type locations are ignored.
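
A sketch of extracting links from a fetched page with AbsoluteLinksIn; LinksIn works the same way, minus the hostURL parameter and the error return:

_, body, err := c.Fetch("https://example.com")
if err != nil {
	log.Fatal(err)
}

links, err := brink.AbsoluteLinksIn("https://example.com", "https://example.com", body, true)
if err != nil {
	log.Fatal(err)
}
for _, l := range links {
	log.Println(l.Href, "found on", l.LinkedFrom)
}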

type NotAllowed

type NotAllowed struct {
	// contains filtered or unexported fields
}

NotAllowed error is returned by Fetch when a domain is not allowed to be visited.

func (NotAllowed) Error

func (na NotAllowed) Error() string
