fetchbot

Published: May 19, 2021 License: BSD-3-Clause Imports: 11 Imported by: 93

README


Package fetchbot provides a simple and flexible web crawler that follows the robots.txt policies and crawl delays.

It is very much a rewrite of gocrawl, with a simpler API and fewer built-in features, but at the same time more flexibility. As with Go itself, sometimes less is more!

Installation

To install, simply run in a terminal:

go get github.com/PuerkitoBio/fetchbot

The package has a single external dependency, robotstxt. It also integrates code from the iq package.

The API documentation is available on godoc.org.

Changes

  • 2019-09-11 (v1.2.0): update robotstxt dependency (import path/repo URL has changed, issue #31, thanks to @michael-stevens for raising the issue).
  • 2017-09-04 (v1.1.1): fix a goroutine leak when cancelling a Queue (issue #26, thanks to @ryu-koui for raising the issue).
  • 2017-07-06 (v1.1.0): add Queue.Done to get the done channel on the queue, allowing callers to wait on it in a select statement (thanks to @DennisDenuto).
  • 2015-07-25 (v1.0.0): add a Cancel method on the Queue to close and drain it without requesting any pending commands, unlike Close, which waits for all pending commands to be processed (thanks to @buro9 for the feature request).
  • 2015-07-24 : add HandlerCmd and call the Command's Handler function if it implements the Handler interface, bypassing the Fetcher's handler. Support a Custom matcher on the Mux, using a predicate (thanks to @mmcdole for the feature requests).
  • 2015-06-18 : add Scheme criteria on the muxer (thanks to @buro9).
  • 2015-06-10 : add DisablePoliteness field on the Fetcher to optionally bypass robots.txt checks (thanks to @oli-g).
  • 2014-07-04 : change the type of Fetcher.HttpClient from *http.Client to the Doer interface. Low chance of breaking existing code, but it's a possibility if someone used the fetcher's client to run other requests (e.g. f.HttpClient.Get(...)).

Usage

The following example (taken from /example/short/main.go) shows how to create and start a Fetcher, one way to send commands, and how to stop the fetcher once all commands have been handled.

package main

import (
	"fmt"
	"net/http"

	"github.com/PuerkitoBio/fetchbot"
)

func main() {
	f := fetchbot.New(fetchbot.HandlerFunc(handler))
	queue := f.Start()
	queue.SendStringHead("http://google.com", "http://golang.org", "http://golang.org/doc")
	queue.Close()
}

func handler(ctx *fetchbot.Context, res *http.Response, err error) {
	if err != nil {
		fmt.Printf("error: %s\n", err)
		return
	}
	fmt.Printf("[%d] %s %s\n", res.StatusCode, ctx.Cmd.Method(), ctx.Cmd.URL())
}

A more complex and complete example can be found in the repository, at /example/full/.

Fetcher

Basically, a Fetcher is an instance of a web crawler, independent of other Fetchers. It receives Commands via the Queue, executes the requests, and calls a Handler to process the responses. A Command is an interface that tells the Fetcher which URL to fetch and which HTTP method to use (e.g. "GET", "HEAD", ...).

A call to Fetcher.Start() returns the Queue associated with this Fetcher. This is the thread-safe object that can be used to send commands, or to stop the crawler.

Both the Command and the Handler are interfaces, and may be implemented in various ways. They are defined like so:

type Command interface {
	URL() *url.URL
	Method() string
}
type Handler interface {
	Handle(*Context, *http.Response, error)
}

A Context is a struct that holds the Command and the Queue, so that the Handler always knows which Command initiated this call, and has a handle to the Queue.

A Handler is similar to the net/http Handler, and middleware-style combinations can be built on top of it. A HandlerFunc type is provided so that simple functions with the right signature can be used as Handlers (like net/http.HandlerFunc), and there is also a multiplexer Mux that can be used to dispatch calls to different Handlers based on some criteria.

The Fetcher recognizes a number of interfaces that the Command may implement, for more advanced needs.

  • BasicAuthProvider: Implement this interface to specify the basic authentication credentials to set on the request.

  • CookiesProvider: If the Command implements this interface, the provided Cookies will be set on the request.

  • HeaderProvider: Implement this interface to specify the headers to set on the request.

  • ReaderProvider: Implement this interface to set the body of the request, via an io.Reader.

  • ValuesProvider: Implement this interface to set the body of the request, as form-encoded values. If the Content-Type is not specifically set via a HeaderProvider, it is set to "application/x-www-form-urlencoded". ReaderProvider and ValuesProvider should be mutually exclusive as they both set the body of the request. If both are implemented, the ReaderProvider interface is used.

  • Handler: Implement this interface if the Command's response should be handled by a specific callback function. By default, the response is handled by the Fetcher's Handler, but if the Command implements this, this handler function takes precedence and the Fetcher's Handler is ignored.

Since the Command is an interface, it can be a custom struct that holds additional information, such as an ID for the URL (e.g. from a database), or a depth counter so that the crawling stops at a certain depth, etc. For basic commands that don't require additional information, the package provides the Cmd struct that implements the Command interface. This is the Command implementation used when using the various Queue.SendString* methods.

There is also a convenience HandlerCmd struct for the commands that should be handled by a specific callback function. It is a Command with a Handler interface implementation.

Fetcher Options

The Fetcher has a number of fields that provide further customization:

  • HttpClient : By default, the Fetcher uses the net/http default Client to make requests. A different client can be set on the Fetcher.HttpClient field.

  • CrawlDelay : The time to wait between requests to the same host. This value is used only if there is no delay specified by the robots.txt of a given host.

  • UserAgent : Sets the user agent string to use for the requests and to validate against the robots.txt entries.

  • WorkerIdleTTL : Sets the duration that a worker goroutine can wait without receiving new commands to fetch. If the idle time-to-live is reached, the worker goroutine is stopped and its resources are released. This can be especially useful for long-running crawlers.

  • AutoClose : If true, closes the queue automatically once the number of active hosts reaches 0.

  • DisablePoliteness : If true, ignores the robots.txt policies of the hosts.

What fetchbot doesn't do, especially compared to gocrawl, is keep track of already-visited URLs or normalize the URLs. This is outside the scope of this package: all commands sent on the Queue will be fetched. Normalization can easily be done (e.g. using purell) before sending the Command to the Fetcher. How to keep track of visited URLs depends on the use case of the specific crawler, but for an example, see /example/full/main.go.
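Since fetchbot never deduplicates, a crawler typically keeps its own record of enqueued URLs. A minimal thread-safe sketch of such a duplicate filter (the names here are illustrative):

```go
package main

import (
	"fmt"
	"sync"
)

// visited is a minimal duplicate filter: MarkNew reports whether a URL is
// seen for the first time. A crawler would consult it before enqueuing a
// Command, since fetchbot itself fetches everything it is sent.
type visited struct {
	mu   sync.Mutex
	seen map[string]bool
}

func (v *visited) MarkNew(u string) bool {
	v.mu.Lock()
	defer v.mu.Unlock()
	if v.seen[u] {
		return false
	}
	v.seen[u] = true
	return true
}

func main() {
	v := &visited{seen: map[string]bool{}}
	fmt.Println(v.MarkNew("http://golang.org")) // true
	fmt.Println(v.MarkNew("http://golang.org")) // false
}
```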

License

The BSD 3-Clause license, the same as the Go language. The iq package source code is under the CDDL-1.0 license (details in the source file).

Documentation

Constants

const (
	// DefaultCrawlDelay represents the delay to use if there is no robots.txt
	// specified delay.
	DefaultCrawlDelay = 5 * time.Second

	// DefaultUserAgent is the default user agent string.
	DefaultUserAgent = "Fetchbot (https://github.com/PuerkitoBio/fetchbot)"

	// DefaultWorkerIdleTTL is the default time-to-live of an idle host worker goroutine.
	// If no URL is sent for a given host within this duration, this host's goroutine
	// is disposed of.
	DefaultWorkerIdleTTL = 30 * time.Second
)

Variables

var (
	// ErrEmptyHost is returned if a command to be enqueued has a URL with an empty host.
	ErrEmptyHost = errors.New("fetchbot: invalid empty host")

	// ErrDisallowed is returned when the requested URL is disallowed by the robots.txt
	// policy.
	ErrDisallowed = errors.New("fetchbot: disallowed by robots.txt")

	// ErrQueueClosed is returned when a Send call is made on a closed Queue.
	ErrQueueClosed = errors.New("fetchbot: send on a closed queue")
)

Functions

This section is empty.

Types

type BasicAuthProvider

type BasicAuthProvider interface {
	BasicAuth() (user string, pwd string)
}

BasicAuthProvider interface gets the credentials to use to perform the request with Basic Authentication.

type Cmd

type Cmd struct {
	U *url.URL
	M string
}

Cmd defines a basic Command implementation.

func (*Cmd) Method

func (c *Cmd) Method() string

Method returns the HTTP verb to use to process this command (e.g. "GET", "HEAD", etc.).

func (*Cmd) URL

func (c *Cmd) URL() *url.URL

URL returns the resource targeted by this command.

type Command

type Command interface {
	URL() *url.URL
	Method() string
}

Command interface defines the methods required by the Fetcher to request a resource.

type Context

type Context struct {
	Cmd Command
	Q   *Queue
}

Context is a Command's fetch context, passed to the Handler. It gives access to the original Command and the associated Queue.

type CookiesProvider

type CookiesProvider interface {
	Cookies() []*http.Cookie
}

CookiesProvider interface gets the cookies to send with the request.

type DebugInfo

type DebugInfo struct {
	NumHosts int
}

The DebugInfo holds information to introspect the Fetcher's state.

type Doer

type Doer interface {
	Do(*http.Request) (*http.Response, error)
}

Doer defines the method required to use a type as HttpClient. The net/*http.Client type satisfies this interface.

type Fetcher

type Fetcher struct {
	// The Handler to be called for each request. All successfully enqueued requests
	// produce a Handler call.
	Handler Handler

	// DisablePoliteness disables fetching and using the robots.txt policies of
	// hosts.
	DisablePoliteness bool

	// Default delay to use between requests to a same host if there is no robots.txt
	// crawl delay or if DisablePoliteness is true.
	CrawlDelay time.Duration

	// The *http.Client to use for the requests. If nil, defaults to the net/http
	// package's default client. Should be HTTPClient to comply with go lint, but
	// this is a breaking change, won't fix.
	HttpClient Doer

	// The user-agent string to use for robots.txt validation and URL fetching.
	UserAgent string

	// The time a host-dedicated worker goroutine can stay idle, with no Command to enqueue,
	// before it is stopped and cleared from memory.
	WorkerIdleTTL time.Duration

	// AutoClose makes the fetcher close its queue automatically once the number
	// of hosts reach 0. A host is removed once it has been idle for WorkerIdleTTL
	// duration.
	AutoClose bool
	// contains filtered or unexported fields
}

A Fetcher defines the parameters for running a web crawler.

func New

func New(h Handler) *Fetcher

New returns an initialized Fetcher.

func (*Fetcher) Debug

func (f *Fetcher) Debug() <-chan *DebugInfo

Debug returns the channel to use to receive the debugging information. It is not intended to be used by package users.

func (*Fetcher) Start

func (f *Fetcher) Start() *Queue

Start starts the Fetcher, and returns the Queue to use to send Commands to be fetched.

type Handler

type Handler interface {
	Handle(*Context, *http.Response, error)
}

The Handler interface is used to process the Fetcher's requests. It is similar to the net/http.Handler interface.

type HandlerCmd

type HandlerCmd struct {
	*Cmd
	HandlerFunc
}

HandlerCmd is a basic Command with its own Handler function that is called to handle the HTTP response.

func NewHandlerCmd

func NewHandlerCmd(method, rawURL string, fn func(*Context, *http.Response, error)) (*HandlerCmd, error)

NewHandlerCmd creates a HandlerCmd for the provided request and callback handler function.

type HandlerFunc

type HandlerFunc func(*Context, *http.Response, error)

A HandlerFunc is a function signature that implements the Handler interface. A function with this signature can thus be used as a Handler.

func (HandlerFunc) Handle

func (h HandlerFunc) Handle(ctx *Context, res *http.Response, err error)

Handle is the Handler interface implementation for the HandlerFunc type.

type HeaderProvider

type HeaderProvider interface {
	Header() http.Header
}

HeaderProvider interface gets the headers to set on the request. If an Authorization header is set, it will be overridden by the BasicAuthProvider, if implemented.

type Mux

type Mux struct {
	DefaultHandler Handler
	// contains filtered or unexported fields
}

Mux is a simple multiplexer for the Handler interface, similar to net/http.ServeMux. It is itself a Handler, and dispatches the calls to the matching Handlers.

For error Handlers, if there is a Handler registered for the same error value, it will be called. Otherwise, if there is a Handler registered for any error, this Handler will be called.

For Response Handlers, a match with a path criteria has higher priority than other matches, and the longest path match gets called.

If multiple Response handlers with the same path length (or no path criteria) match a response, the actual handler called is undefined, but one and only one will be called.

In any case, if no Handler matches, the DefaultHandler is called, and it defaults to a no-op.

func NewMux

func NewMux() *Mux

NewMux returns an initialized Mux.

func (*Mux) Handle

func (mux *Mux) Handle(ctx *Context, res *http.Response, err error)

Handle is the Handler interface implementation for Mux. It dispatches the calls to the matching Handler.

func (*Mux) HandleError

func (mux *Mux) HandleError(err error, h Handler)

HandleError registers a Handler for a specific error value. Multiple calls with the same error value override previous calls. As a special case, a nil error value registers a Handler for any error that doesn't have a specific Handler.

func (*Mux) HandleErrors

func (mux *Mux) HandleErrors(h Handler)

HandleErrors registers a Handler for any error that doesn't have a specific Handler.

func (*Mux) Response

func (mux *Mux) Response() *ResponseMatcher

Response initializes an entry for a Response Handler based on various criteria. The Response Handler is not registered until Handle is called.

type Queue

type Queue struct {
	// contains filtered or unexported fields
}

Queue offers methods to send Commands to the Fetcher, and to Stop the crawling process. It is safe to use from concurrent goroutines.

func (*Queue) Block

func (q *Queue) Block()

Block blocks the current goroutine until the Queue is closed and all pending commands are drained.

func (*Queue) Cancel

func (q *Queue) Cancel() error

Cancel closes the Queue and drains the pending commands without processing them, allowing for a fast "stop immediately"-ish operation.

func (*Queue) Close

func (q *Queue) Close() error

Close closes the Queue so that no more Commands can be sent. It blocks until the Fetcher drains all pending commands. After the call, the Fetcher is stopped. Attempts to enqueue new URLs after Close has been called will always result in an ErrQueueClosed error.

func (*Queue) Done (added in v1.1.0)

func (q *Queue) Done() <-chan struct{}

Done returns a channel that is closed when the Queue is closed (either via Close or Cancel). Multiple calls always return the same channel.

func (*Queue) Send

func (q *Queue) Send(c Command) error

Send enqueues a Command into the Fetcher. If the Queue has been closed, it returns ErrQueueClosed. The Command's URL must have a Host.

func (*Queue) SendString

func (q *Queue) SendString(method string, rawurl ...string) (int, error)

SendString enqueues a method and some URL strings into the Fetcher. It returns an error if the URL string cannot be parsed, or if the Queue has been closed. The first return value is the number of URLs successfully enqueued.

func (*Queue) SendStringGet

func (q *Queue) SendStringGet(rawurl ...string) (int, error)

SendStringGet enqueues the URL strings to be fetched with a GET method. It returns an error if the URL string cannot be parsed, or if the Queue has been closed. The first return value is the number of URLs successfully enqueued.

func (*Queue) SendStringHead

func (q *Queue) SendStringHead(rawurl ...string) (int, error)

SendStringHead enqueues the URL strings to be fetched with a HEAD method. It returns an error if the URL string cannot be parsed, or if the Queue has been closed. The first return value is the number of URLs successfully enqueued.

type ReaderProvider

type ReaderProvider interface {
	Reader() io.Reader
}

ReaderProvider interface gets the Reader to use as the Body of the request. It has higher priority than the ValuesProvider interface, so that if both interfaces are implemented, the ReaderProvider is used.

type ResponseMatcher

type ResponseMatcher struct {
	// contains filtered or unexported fields
}

A ResponseMatcher holds the criteria for a response Handler.

func (*ResponseMatcher) ContentType

func (r *ResponseMatcher) ContentType(ct string) *ResponseMatcher

ContentType sets a criteria based on the Content-Type header for the Response Handler. Its Handler will only be called if the response has this content type, ignoring any additional parameter on the header value (anything following the semicolon, e.g. the "; charset=utf-8" in "text/html; charset=utf-8").

func (*ResponseMatcher) Custom

func (r *ResponseMatcher) Custom(predicate func(*http.Response) bool) *ResponseMatcher

Custom sets a criteria based on a function that receives the HTTP response and returns true if the matcher should be used to handle this response, false otherwise.

func (*ResponseMatcher) Handler

func (r *ResponseMatcher) Handler(h Handler) *ResponseMatcher

Handler sets the Handler to be called when this Response Handler is the match for a given response. It registers the Response Handler in its parent Mux.

func (*ResponseMatcher) Host

func (r *ResponseMatcher) Host(host string) *ResponseMatcher

Host sets a criteria based on the host of the URL for the Response Handler. Its Handler will only be called if the host of the URL matches exactly the specified host.

func (*ResponseMatcher) Method

func (r *ResponseMatcher) Method(m string) *ResponseMatcher

Method sets a method criteria for the Response Handler. Its Handler will only be called if it has this HTTP method (i.e. "GET", "HEAD", ...).

func (*ResponseMatcher) Path

func (r *ResponseMatcher) Path(p string) *ResponseMatcher

Path sets a criteria based on the path of the URL for the Response Handler. Its Handler will only be called if the path of the URL starts with this path. Longer matches have priority over shorter ones.

func (*ResponseMatcher) Scheme

func (r *ResponseMatcher) Scheme(scheme string) *ResponseMatcher

Scheme sets a criteria based on the scheme of the URL for the Response Handler. Its Handler will only be called if the scheme of the URL matches exactly the specified scheme.

func (*ResponseMatcher) Status

func (r *ResponseMatcher) Status(code int) *ResponseMatcher

Status sets a criteria based on the Status code of the response for the Response Handler. Its Handler will only be called if the response has this status code.

func (*ResponseMatcher) StatusRange

func (r *ResponseMatcher) StatusRange(min, max int) *ResponseMatcher

StatusRange sets a criteria based on the Status code of the response for the Response Handler. Its Handler will only be called if the response has a status code between the min and max. If min is greater than max, the values are switched.

type ValuesProvider

type ValuesProvider interface {
	Values() url.Values
}

ValuesProvider interface gets the values to send as the Body of the request. It has lower priority than the ReaderProvider interface, so that if both interfaces are implemented, the ReaderProvider is used. If the request has no explicit Content-Type set, it will be automatically set to "application/x-www-form-urlencoded".

Directories

  • example
