crawler

package
v0.0.0-...-27509e8
Published: Jun 17, 2022 License: MIT Imports: 10 Imported by: 0

Documentation

Index

Constants

This section is empty.

Variables

var (
	// ErrInvalidURL is returned when the URL to visit
	// has an invalid format.
	ErrInvalidURL = errors.New("Invalid URL")
	// ErrForbidden is returned when the URL is not allowed to be visited.
	ErrForbidden = errors.New("Forbidden")
	// ErrAlreadyVisited is returned for an already visited URL.
	ErrAlreadyVisited = errors.New("Already visited")
)

Functions

This section is empty.

Types

type CrawlResult

type CrawlResult struct {
	URL   *url.URL
	Body  string
	Links []string
}

type Crawler

type Crawler struct {
	// contains filtered or unexported fields
}

func NewCrawler

func NewCrawler(URL string, maxDepth int) *Crawler

NewCrawler returns a new `*Crawler` for the given URL and maximum depth.

func NewCrawlerWithLimitRule

func NewCrawlerWithLimitRule(URL string, maxDepth int, limitRule *LimitRule) *Crawler

NewCrawlerWithLimitRule returns a new `*Crawler` that applies the given LimitRule.

func (*Crawler) Crawl

func (c *Crawler) Crawl()

Crawl starts crawling.

func (*Crawler) OnError

func (c *Crawler) OnError(f ErrorCallback)

OnError registers a function that is executed when an error occurs.

func (*Crawler) OnVisit

func (c *Crawler) OnVisit(f VisitCallback)

OnVisit registers a function that is executed when a site is visited.

func (*Crawler) OnVisited

func (c *Crawler) OnVisited(f VisitedCallback)

OnVisited registers a function that is executed after a site has been visited.

func (*Crawler) SetParallelism

func (c *Crawler) SetParallelism(n int)

SetParallelism sets the limit on crawling parallelism. The default is 5.

func (*Crawler) UseHeadlessChrome

func (c *Crawler) UseHeadlessChrome()

UseHeadlessChrome makes the crawler use headless Chrome for requests. By default, requests are made with `http.Get()`.

type ErrorCallback

type ErrorCallback func(error)

ErrorCallback is the type for OnError callback functions.

type Fetcher

type Fetcher interface {
	Fetch(URL string) (body []byte, err error)
}

Fetcher sends a GET request to the given URL and returns the response body.

type LimitRule

type LimitRule struct {
	// AllowedHosts define accessible hosts.
	// When AllowedHosts is empty, all hosts are allowed.
	AllowedHosts []string
	AllowedUrls  []*regexp.Regexp
}

func NewLimitRule

func NewLimitRule() *LimitRule

NewLimitRule returns an empty LimitRule. Rules can be added to it by calling AddAllowedHosts.

func (*LimitRule) AddAllowedHosts

func (lr *LimitRule) AddAllowedHosts(hosts ...string)

AddAllowedHosts adds hosts to the list of accessible hosts.

func (*LimitRule) AddAllowedUrls

func (lr *LimitRule) AddAllowedUrls(re *regexp.Regexp)

AddAllowedUrls adds a regular expression that allowed URLs must match.

func (*LimitRule) IsAllow

func (lr *LimitRule) IsAllow(requestURL *url.URL) bool

IsAllow reports whether requestURL is allowed to be crawled.

type VisitCallback

type VisitCallback func(response []byte)

VisitCallback is the type for OnVisit callback functions.

type VisitedCallback

type VisitedCallback func(*CrawlResult)

VisitedCallback is the type for OnVisited callback functions.
