Documentation ¶
Index ¶
- func DefaultLinkExtractor(c *Client, currLink string, resp []byte) []string
- func IsClientErrorResponse(resp *http.Response) bool
- func IsHtmlContent(resp *http.Response) bool
- func IsNoopResponse(resp *http.Response) bool
- func IsOkResponse(resp *http.Response) bool
- func IsServerErrorResponse(resp *http.Response) bool
- type Client
- type Config
- type IPInfo
- type LinkExtractor
- type NetworkInfo
- type PageInfo
- type ResponseMatcher
Constants ¶
This section is empty.
Variables ¶
This section is empty.
Functions ¶
func DefaultLinkExtractor ¶
DefaultLinkExtractor looks for <a href="..."> tags and extracts the link if the host is not blacklisted. This function assumes that if the href value is a relative path, it is relative to the current URL.
func IsClientErrorResponse ¶
This matches all responses that return a 4xx status code.
func IsHtmlContent ¶
This matches all responses that return a 2xx status code and have a Content-Type header that contains "text/html".
func IsOkResponse ¶
This matches all responses that return a 200 status code.
func IsServerErrorResponse ¶
This matches all responses that return a 5xx status code.
Types ¶
type Client ¶
type Client struct {
MaxDepth int
NetMutex sync.RWMutex
PageMutex sync.RWMutex
HostBlacklist map[string]struct{}
VisitedNetInfo map[string][]NetworkInfo
VisitedPageInfo map[string]PageInfo
// contains filtered or unexported fields
}
func New ¶
func New(ctx context.Context, config *Config, rm []ResponseMatcher, le LinkExtractor) *Client
New creates a new crawler client using the given context (to allow for cancellation), the crawler config, a list of response matchers to filter out responses, and a link extractor.
Note that the ordering of the response matchers matters: the first matcher to return false will cause the link to be skipped.
type Config ¶
type Config struct {
BlacklistHosts map[string]struct{} // hosts to blacklist
MaxDepth int // max depth from seed
MaxRetries int // max retries for HTTP requests
MaxRPS float64 // max requests per second
ProxyURL *url.URL // proxy URL, if any. useful to avoid IP bans
SeedURLs []string // where to start crawling from
Timeout time.Duration // timeout for HTTP requests
}
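For illustration, a Config might be populated like this. The field names come from the struct above; the package name `crawler` and all field values are assumptions, not prescribed by this package:

```go
proxyURL, _ := url.Parse("http://127.0.0.1:8080")

cfg := &crawler.Config{
	BlacklistHosts: map[string]struct{}{"ads.example.com": {}},
	MaxDepth:       3,                // stop three links away from the seeds
	MaxRetries:     2,                // retry failed HTTP requests twice
	MaxRPS:         5.0,              // throttle to five requests per second
	ProxyURL:       proxyURL,         // optional; nil to connect directly
	SeedURLs:       []string{"https://example.com"},
	Timeout:        10 * time.Second, // per-request timeout
}
```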
type LinkExtractor ¶
Takes in the crawler client (whose HostBlacklist holds the blacklisted hosts), the current link, and the response body, and returns a slice of extracted links.
type NetworkInfo ¶
type NetworkInfo struct {
RemoteIPInfo []IPInfo `json:"remote_ip_info"`
AvgResponseMs int64 `json:"avg_response_ms"`
PathCount int `json:"path_count"`
VisitedPaths []string `json:"visited_paths"`
// These values are not exported to JSON
TotalResponseTimeMs int64 `json:"-"`
VisitedPathSet map[string]struct{} `json:"-"`
}
type ResponseMatcher ¶
ResponseMatcher is a function that takes an http.Response and returns a boolean indicating whether the contents of the URL should be processed (e.g. extracting links).