Documentation ¶
Index ¶
- func DefaultLinkExtractor(c *Client, currLink string, resp []byte) []string
- func IsClientErrorResponse(resp *http.Response) bool
- func IsHtmlContent(resp *http.Response) bool
- func IsNoopResponse(resp *http.Response) bool
- func IsOkResponse(resp *http.Response) bool
- func IsServerErrorResponse(resp *http.Response) bool
- type Client
- type Config
- type IPInfo
- type LinkExtractor
- type NetworkInfo
- type PageInfo
- type ResponseMatcher
Constants ¶
This section is empty.
Variables ¶
This section is empty.
Functions ¶
func DefaultLinkExtractor ¶
DefaultLinkExtractor looks for <a href="..."> tags and extracts the link if the host is not blacklisted. This function assumes that if the href value is a relative path, it is relative to the current URL.
func IsClientErrorResponse ¶
This matches all responses that return a 4xx status code.
func IsHtmlContent ¶
This matches all responses that return a 2xx status code and have a Content-Type header that contains "text/html".
func IsOkResponse ¶
This matches all responses that return a 200 status code.
func IsServerErrorResponse ¶
This matches all responses that return a 5xx status code.
Types ¶
type Client ¶
type Client struct {
MaxDepth int
NetMutex sync.RWMutex
PageMutex sync.RWMutex
HostBlacklist map[string]struct{}
VisitedNetInfo map[string][]NetworkInfo
VisitedPageInfo map[string]PageInfo
// contains filtered or unexported fields
}
func New ¶
func New(ctx context.Context, config *Config, rm []ResponseMatcher, le LinkExtractor) *Client
New creates a new crawler client using the given context (to allow for cancellation), the crawler config, a list of response matchers to filter out responses, and a link extractor.
Note that the ordering of the response matchers matters: the first matcher to return false will cause the link to be skipped.
type Config ¶
type Config struct {
BlacklistHosts map[string]struct{} // hosts to blacklist
MaxDepth int // max depth from seed
MaxRetries int // max retries for HTTP requests
MaxRPS float64 // max requests per second
ProxyURL *url.URL // proxy URL, if any. useful to avoid IP bans
SeedURLs []string // where to start crawling from
Timeout time.Duration // timeout for HTTP requests
}
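For illustration, a Config might be populated like this. The field names come from the struct above; the package name `crawler` and all field values are assumptions, not prescribed by this package:

```go
proxyURL, _ := url.Parse("http://127.0.0.1:8080")

cfg := &crawler.Config{
	BlacklistHosts: map[string]struct{}{"ads.example.com": {}},
	MaxDepth:       3,                // stop three links away from the seeds
	MaxRetries:     2,                // retry failed HTTP requests twice
	MaxRPS:         5.0,              // throttle to five requests per second
	ProxyURL:       proxyURL,         // optional; nil to connect directly
	SeedURLs:       []string{"https://example.com"},
	Timeout:        10 * time.Second, // per-request timeout
}
```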
type LinkExtractor ¶
Takes in the crawler client (whose HostBlacklist holds the blacklisted hosts), the current link, and the response body, and returns a slice of extracted links.
type NetworkInfo ¶
type NetworkInfo struct {
RemoteIPInfo []IPInfo `json:"remote_ip_info"`
AvgResponseMs int64 `json:"avg_response_ms"`
PathCount int `json:"path_count"`
VisitedPaths []string `json:"visited_paths"`
// These values are not exported to JSON
TotalResponseTimeMs int64 `json:"-"`
VisitedPathSet map[string]struct{} `json:"-"`
}
type ResponseMatcher ¶
ResponseMatcher is a function that takes an http.Response and returns a boolean indicating whether the contents of the URL should be processed (e.g. extracting links).