brink

package module
v0.0.0-...-44ee8c0
Published: Feb 27, 2018 License: MIT Imports: 16 Imported by: 0

README

brink

Web crawler library

Documentation

Constants

const (
	AuthNone = iota
	AuthBasic
)

AuthType constants represent the type of authentication to use when visiting pages.
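
For example, given a CrawlOptions value opts (see the CrawlOptions type below), HTTP Basic authentication might be requested like this; the credentials are illustrative:

opts.AuthType = brink.AuthBasic // authenticate with HTTP Basic auth
opts.User = "user"
opts.Pass = "secret"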

Variables

This section is empty.

Functions

This section is empty.

Types

type ContentTooLarge

type ContentTooLarge struct {
	// contains filtered or unexported fields
}

ContentTooLarge error is returned by Fetch when the content length of a page is larger than what is allowed.

func (ContentTooLarge) Error

func (ctl ContentTooLarge) Error() string

type CrawlOptions

type CrawlOptions struct {
	AuthType int    `toml:"auth-type"`
	User     string `toml:"user"`
	Pass     string `toml:"pass"`

	// URLBufferSize is the number of URLs that can be queued waiting to be visited.
	URLBufferSize int `toml:"url-buffer-size"`

	// WorkerCount specifies the number of goroutines that will work on crawling the domains.
	WorkerCount int `toml:"worker-count"`

	// IdleWorkCheckInterval configures how frequently the crawler checks whether there is
	// any work to do. If there are no URLs left to process, the crawler gracefully stops
	// itself. Setting it to 0 uses the default value of 5000 milliseconds.
	IdleWorkCheckInterval int `toml:"idle-work-check-interval"`

	// MaxContentLength specifies the maximum size of pages to be crawled. Setting it to 0
	// defaults to 512KB. Set it to -1 to allow unlimited size.
	MaxContentLength int64 `toml:"max-content-length"`

	// EntryPoint is the first URL that will be fetched.
	EntryPoint string `toml:"entrypoint"`

	// AllowedDomains will be used to check whether a domain is allowed to be crawled or not.
	AllowedDomains []string `toml:"allowed-domains"`

	// Cookies holds a list of cookies to be added to all requests in addition to the ones
	// sent by the servers.
	Cookies map[string]*http.Cookie `toml:"cookies"`

	// Headers holds a key->value mapping to be added to all requests.
	Headers map[string]string `toml:"headers"`

	// IgnoreGETParameters lists GET parameters to ignore when comparing whether a URL has
	// been visited or not.
	IgnoreGETParameters []string `toml:"ignore-get-parameters"`

	// FuzzyGETParameterChecks decides whether parameters are matched exactly. If set to
	// false, GET parameters are only ignored if they are an exact match. If set to true,
	// they are checked in a substring fashion.
	FuzzyGETParameterChecks bool `toml:"fuzzy-get-parameter-checks"`

	// ForbiddenPaths lists URL path sections to avoid. URLs whose paths contain any section
	// specified in this list will not be visited.
	ForbiddenPaths []string `toml:"ignore-path-visits"`

	// SessionCookieNames holds all the cookie names that can represent a session ID. It is
	// used to check whether authorization has succeeded, so the crawler does not try to
	// re-authorize on every request.
	SessionCookieNames []string `toml:"session-cookie-names"`
}

CrawlOptions contains options for the crawler
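
A minimal sketch of populating CrawlOptions in code; the values are illustrative, and whether AllowedDomains expects full URLs or bare hostnames is not spelled out here:

opts := brink.CrawlOptions{
	EntryPoint:          "https://example.com",
	AllowedDomains:      []string{"https://example.com"},
	WorkerCount:         4,
	URLBufferSize:       1024,
	MaxContentLength:    -1, // no size limit
	IgnoreGETParameters: []string{"utm_source", "utm_medium"},
}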

type Crawler

type Crawler struct {
	RootDomain string
	// contains filtered or unexported fields
}

Crawler represents a web crawler, starting from a RootDomain and visiting all the links in the AllowedDomains map. It will only download the body of a URL if its size is less than MaxContentLength.

func NewCrawler

func NewCrawler(rootDomain string) (*Crawler, error)

NewCrawler returns a Crawler initialized with default values.
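
A minimal usage sketch:

c, err := brink.NewCrawler("https://example.com")
if err != nil {
	log.Fatal(err)
}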

func NewCrawlerFromToml

func NewCrawlerFromToml(filename string) (*Crawler, error)

NewCrawlerFromToml reads the given file and parses it as a TOML property file.
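
A sketch of loading a crawler from a TOML file; the keys follow the toml tags on CrawlOptions, and the file contents shown in the comment are illustrative:

// crawler.toml might contain, for example:
//
//	entrypoint      = "https://example.com"
//	allowed-domains = ["https://example.com"]
//	worker-count    = 4
//	url-buffer-size = 1024
//
c, err := brink.NewCrawlerFromToml("crawler.toml")
if err != nil {
	log.Fatal(err)
}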

func NewCrawlerWithOpts

func NewCrawlerWithOpts(rootDomain string, userOptions CrawlOptions) (*Crawler, error)

NewCrawlerWithOpts returns a Crawler initialized with the provided CrawlOptions struct.
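
Continuing the CrawlOptions sketch above, construction might look like:

c, err := brink.NewCrawlerWithOpts("https://example.com", opts)
if err != nil {
	log.Fatal(err)
}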

func (*Crawler) AllowDomains

func (c *Crawler) AllowDomains(domains ...string)

AllowDomains instructs the crawler which domains it is allowed to visit. The RootDomain is automatically added to this list. Domains not allowed will be checked for HTTP status, but will not be traversed.

Subsequent calls to AllowDomains add to the list of domains the crawler is allowed to traverse.
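
For example, given a Crawler c (the values are illustrative; whether bare hostnames or full URLs are expected is not documented here):

c.AllowDomains("https://blog.example.com", "https://docs.example.com")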

func (*Crawler) Fetch

func (c *Crawler) Fetch(url string) (status int, body []byte, err error)

Fetch fetches the URL and returns its status, body and/or any errors it encountered.
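
A sketch of calling Fetch on an existing Crawler c and distinguishing the error types this package defines, assuming (from their value receivers) that they are returned by value:

status, body, err := c.Fetch("https://example.com/somepage")
switch err.(type) {
case nil:
	log.Printf("got status %d and %d bytes", status, len(body))
case brink.NotAllowed:
	log.Println("domain not allowed:", err)
case brink.ContentTooLarge:
	log.Println("page exceeds MaxContentLength:", err)
default:
	log.Println("fetch failed:", err)
}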

func (*Crawler) HandleDefaultFunc

func (c *Crawler) HandleDefaultFunc(h func(linkedFrom string, url string, status int, body string, cached bool))

HandleDefaultFunc will be called for all pages returned with a status that doesn't have a separate handler defined by HandleFunc. Subsequent calls to HandleDefaultFunc will overwrite the previously set handler, if any.

func (*Crawler) HandleFunc

func (c *Crawler) HandleFunc(status int, h func(linkedFrom string, url string, status int, body string, cached bool))

HandleFunc is used to register a function to be called when a new page is found with the specified status. Subsequent calls to register functions to the same statuses will silently overwrite previously set handlers, if any.
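
A sketch of registering handlers on a Crawler c before calling Start:

// Pages that came back with HTTP 200.
c.HandleFunc(200, func(linkedFrom, url string, status int, body string, cached bool) {
	log.Printf("200 %s (linked from %s)", url, linkedFrom)
})

// Everything without a dedicated handler.
c.HandleDefaultFunc(func(linkedFrom, url string, status int, body string, cached bool) {
	log.Printf("%d %s", status, url)
})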

func (*Crawler) Start

func (c *Crawler) Start() error

Start starts the crawler at the specified rootDomain. It will scrape the page for links and then visit each of them, provided the domains are allowed. It will keep repeating this process on each page until it runs out of pages to visit.

Start requires at least one handler to be registered, otherwise it returns an error.

func (*Crawler) Stop

func (c *Crawler) Stop()

Stop attempts to stop the crawler.
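
Putting it together, a minimal end-to-end crawl might look like the sketch below; the import path is a placeholder for this module's real path, and everything else follows the API described above.

package main

import (
	"log"

	brink "example.com/brink" // placeholder import path; substitute the module's real path
)

func main() {
	c, err := brink.NewCrawler("https://example.com")
	if err != nil {
		log.Fatal(err)
	}

	// Start errors out unless at least one handler is registered.
	c.HandleDefaultFunc(func(linkedFrom, url string, status int, body string, cached bool) {
		log.Printf("%d %s (found on %s, cached=%v)", status, url, linkedFrom, cached)
	})

	if err := c.Start(); err != nil {
		log.Fatal(err)
	}
}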

type Link

type Link struct {
	LinkedFrom string
	Href       string
	Target     string
}

Link represents a very basic HTML anchor tag. LinkedFrom is the page on which it is found, Href is where it is pointing to.

func AbsoluteLinksIn

func AbsoluteLinksIn(hostURL, linkedFrom string, body []byte, ignoreAnchors bool) ([]Link, error)

AbsoluteLinksIn expects valid HTML to parse and returns a slice of the links (anchors) contained inside. If "ignoreAnchors" is set to true, then links which point to "#someAnchor" type locations are ignored.

If any link within the HTML starts with a forward slash (i.e. is a dynamic link), it is prefixed with the passed hostURL.

func LinksIn

func LinksIn(linkedFrom string, body []byte, ignoreAnchors bool) []Link

LinksIn expects valid HTML to parse and returns a slice of the links (anchors) contained inside. If "ignoreAnchors" is set to true, then links which point to "#someAnchor" type locations are ignored.
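
A sketch of extracting links from a fetched page with AbsoluteLinksIn; LinksIn works the same way, minus the hostURL parameter and the error return:

_, body, err := c.Fetch("https://example.com")
if err != nil {
	log.Fatal(err)
}

links, err := brink.AbsoluteLinksIn("https://example.com", "https://example.com", body, true)
if err != nil {
	log.Fatal(err)
}
for _, l := range links {
	log.Println(l.Href, "found on", l.LinkedFrom)
}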

type NotAllowed

type NotAllowed struct {
	// contains filtered or unexported fields
}

NotAllowed error is returned by Fetch when a domain is not allowed to be visited.

func (NotAllowed) Error

func (na NotAllowed) Error() string
