Documentation ¶
Index ¶
- Constants
- type ContentTooLarge
- type CrawlOptions
- type Crawler
- func (c *Crawler) AllowDomains(domains ...string)
- func (c *Crawler) Fetch(url string) (status int, body []byte, err error)
- func (c *Crawler) HandleDefaultFunc(h func(linkedFrom string, url string, status int, body string, cached bool))
- func (c *Crawler) HandleFunc(status int, ...)
- func (c *Crawler) Start() error
- func (c *Crawler) Stop()
- type Link
- type NotAllowed
Constants ¶
const (
	AuthNone = iota
	AuthBasic
)
AuthType constants represent the type of authentication to use when visiting pages.
Variables ¶
This section is empty.
Functions ¶
This section is empty.
Types ¶
type ContentTooLarge ¶
type ContentTooLarge struct {
// contains filtered or unexported fields
}
ContentTooLarge error is returned by Fetch when the content length of a page is larger than what is allowed.
func (ContentTooLarge) Error ¶
func (ctl ContentTooLarge) Error() string
type CrawlOptions ¶
type CrawlOptions struct {
	AuthType int    `toml:"auth-type"`
	User     string `toml:"user"`
	Pass     string `toml:"pass"`

	// URLBufferSize is the number of URLs that can be waiting to be visited.
	URLBufferSize int `toml:"url-buffer-size"`

	// WorkerCount specifies the number of goroutines that will work on crawling the domains.
	WorkerCount int `toml:"worker-count"`

	// IdleWorkCheckInterval configures how frequently the crawler checks whether there is
	// any work to do. If there is no URL to be processed, it will gracefully stop itself.
	// Setting it to 0 will use the default value of 5000 milliseconds.
	IdleWorkCheckInterval int `toml:"idle-work-check-interval"`

	// MaxContentLength specifies the maximum size of pages to be crawled. Setting it to 0
	// will default to 512Kb. Set it to -1 to allow unlimited size.
	MaxContentLength int64 `toml:"max-content-length"`

	// EntryPoint is the first URL that will be fetched.
	EntryPoint string `toml:"entrypoint"`

	// AllowedDomains will be used to check whether a domain is allowed to be crawled or not.
	AllowedDomains []string `toml:"allowed-domains"`

	// Cookies holds a list of cookies to be added to all requests, in addition to the ones
	// sent by the servers.
	Cookies map[string]*http.Cookie `toml:"cookies"`

	// Headers holds a key->value mapping to be added to all requests.
	Headers map[string]string `toml:"headers"`

	// IgnoreGETParameters lists GET parameters to ignore when deciding whether a URL has
	// already been visited.
	IgnoreGETParameters []string `toml:"ignore-get-parameters"`

	// FuzzyGETParameterChecks decides whether to do exact matches for parameters. If set
	// to false, GET parameters are only ignored if they are an exact match. If set to
	// true, they are checked in a substring fashion.
	FuzzyGETParameterChecks bool `toml:"fuzzy-get-parameter-checks"`

	// ForbiddenPaths ignores certain URL paths. URLs whose paths contain sections that are
	// specified in this list will not be visited.
	ForbiddenPaths []string `toml:"ignore-path-visits"`

	// SessionCookieNames holds all the cookie names that can represent a session ID. It is
	// necessary in order to check whether authorization has been successful, to make sure
	// not to try to re-authorize on every request.
	SessionCookieNames []string `toml:"session-cookie-names"`
}
CrawlOptions contains options for the crawler.
type Crawler ¶
type Crawler struct {
	RootDomain string
	// contains filtered or unexported fields
}
Crawler represents a web crawler, starting from a RootDomain and visiting all the links in the AllowedDomains map. It will only download the body of a URL if it is smaller than MaxContentLength.
func NewCrawler ¶
NewCrawler returns a Crawler initialized with default values.
func NewCrawlerFromToml ¶
NewCrawlerFromToml reads a file and parses it as a TOML configuration file.
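The TOML keys correspond to the struct tags on CrawlOptions. A minimal configuration file might look like this (all values illustrative):

```toml
# Illustrative crawler configuration; keys match the CrawlOptions toml tags.
entrypoint = "https://example.com"
allowed-domains = ["example.com", "blog.example.com"]
worker-count = 4
url-buffer-size = 1000
idle-work-check-interval = 5000 # milliseconds; 0 uses the default of 5000
max-content-length = 524288     # bytes; 0 defaults to 512Kb, -1 allows unlimited size
ignore-get-parameters = ["utm_source", "utm_medium"]
fuzzy-get-parameter-checks = true
ignore-path-visits = ["/logout"]
session-cookie-names = ["PHPSESSID", "JSESSIONID"]

[headers]
User-Agent = "my-crawler/1.0"
```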
func NewCrawlerWithOpts ¶
func NewCrawlerWithOpts(rootDomain string, userOptions CrawlOptions) (*Crawler, error)
NewCrawlerWithOpts returns a Crawler initialized with the provided CrawlOptions struct.
func (*Crawler) AllowDomains ¶
func (c *Crawler) AllowDomains(domains ...string)
AllowDomains instructs the crawler which domains it is allowed to visit. The RootDomain is automatically added to this list. Domains not allowed will be checked for http status, but will not be traversed.
Subsequent calls to AllowDomains add to the list of domains the crawler is allowed to traverse.
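A short sketch of the additive behavior described above (the crawler variable `c` is assumed to be an initialized *Crawler):

```go
c.AllowDomains("example.com", "cdn.example.com")
c.AllowDomains("blog.example.com") // appends; earlier domains remain allowed
```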
func (*Crawler) Fetch ¶
func (c *Crawler) Fetch(url string) (status int, body []byte, err error)
Fetch fetches the URL and returns its status, body and/or any errors it encountered.
func (*Crawler) HandleDefaultFunc ¶
func (c *Crawler) HandleDefaultFunc(h func(linkedFrom string, url string, status int, body string, cached bool))
HandleDefaultFunc registers a handler that will be called for all pages returned with a status which doesn't have a separate handler defined by HandleFunc. Subsequent calls to HandleDefaultFunc will overwrite the previously set handler, if any.
func (*Crawler) HandleFunc ¶
func (c *Crawler) HandleFunc(status int, h func(linkedFrom string, url string, status int, body string, cached bool))
HandleFunc is used to register a function to be called when a new page is found with the specified status. Subsequent calls to register functions to the same statuses will silently overwrite previously set handlers, if any.
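For example, a handler registered for status 404 can report broken links, while all other statuses fall through to the default handler. This is a sketch, assuming `c` is an initialized *Crawler:

```go
// Report broken links; http.StatusNotFound == 404.
c.HandleFunc(http.StatusNotFound, func(linkedFrom, url string, status int, body string, cached bool) {
	fmt.Printf("broken link: %s (linked from %s)\n", url, linkedFrom)
})
```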
func (*Crawler) Start ¶
Start starts the crawler at the specified rootDomain. It will scrape the page for links and then visit each of them, provided the domains are allowed. It will keep repeating this process on each page until it runs out of pages to visit.
Start requires at least one handler to be registered; otherwise it returns an error.
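Putting the pieces together, a typical run registers a handler and then starts the crawl. A minimal sketch, assuming the package is imported as `crawler` (placeholder import path):

```go
c, err := crawler.NewCrawlerWithOpts("example.com", crawler.CrawlOptions{
	WorkerCount:    4,
	AllowedDomains: []string{"example.com"},
})
if err != nil {
	log.Fatal(err)
}

// At least one handler must be registered before Start.
c.HandleDefaultFunc(func(linkedFrom, url string, status int, body string, cached bool) {
	fmt.Printf("%d %s\n", status, url)
})

if err := c.Start(); err != nil {
	log.Fatal(err)
}
```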
type Link ¶
Link represents a very basic HTML anchor tag. LinkedFrom is the page on which it is found, Href is where it is pointing to.
func AbsoluteLinksIn ¶
AbsoluteLinksIn expects valid HTML to parse and returns a slice of the links (anchors) contained inside. If "ignoreAnchors" is set to true, links which point to "#someAnchor"-style locations are ignored.
If any link within the HTML starts with a forward slash (i.e. a root-relative link), it is prefixed with the passed url.
type NotAllowed ¶
type NotAllowed struct {
// contains filtered or unexported fields
}
NotAllowed error is returned by Fetch when a domain is not allowed to be visited.
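Callers of Fetch can distinguish these error types from ordinary network errors. A sketch using a type switch (the docs don't state whether errors are wrapped, so plain type assertions are assumed; `c` is an initialized *Crawler and `crawler` a placeholder package name):

```go
status, body, err := c.Fetch("https://other.example/page")
if err != nil {
	switch err.(type) {
	case crawler.NotAllowed:
		// The domain is not in the allowed list; status may still be useful.
	case crawler.ContentTooLarge:
		// The page body exceeded MaxContentLength and was not downloaded.
	default:
		// Network or other error.
	}
}
_, _ = status, body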
func (NotAllowed) Error ¶
func (na NotAllowed) Error() string