ant


README




ant (alpha) is a web crawler for Go.








Declarative

The package includes functions that can scan data from the page into your structs or slices of structs, which lets you reduce the noise and complexity in your source code.

You can also use a jQuery-like API that allows you to scrape complex HTML pages if needed.


var data struct { Title string `css:"title"` }
page, _ := ant.Fetch(ctx, "https://apple.com")
page.Scan(&data)
data.Title // => Apple
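
For the jQuery-like API, a minimal sketch re-using the page fetched above (selectors are illustrative and error handling is elided):

title := page.Text("title")              // inner text of the first <title>, "" if none matches.
href, ok := page.Query("a").Attr("href") // href of the first anchor, if any.
footer := page.Query(".footer a")        // a List of all matching nodes.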

Headless

By default the crawler uses http.Client; however, if you're crawling SPAs you can use the antcdp.Client implementation, which crawls pages with a headless Chrome browser.

eng, err := ant.NewEngine(ant.EngineConfig{
  Fetcher: &ant.Fetcher{
    Client: antcdp.Client{},
  },
})

Polite

The crawler automatically fetches and caches robots.txt, making sure that it never causes issues for small website owners. Of course, you can disable this behavior.

eng, err := ant.NewEngine(ant.EngineConfig{
  Impolite: true,
})
eng.Run(ctx)

Concurrent

The crawler maintains a configurable number of "worker" goroutines that read URLs off the queue and spawn a goroutine for each URL.

Depending on your configuration, you may want to increase the number of workers to speed up URL reads; of course, if you don't have enough resources you can reduce the number of workers too.

eng, err := ant.NewEngine(ant.EngineConfig{
  // Spawn 5 worker goroutines that dequeue
  // URLs and spawn a new goroutine for each URL.
  Workers: 5,
})
eng.Run(ctx)

Rate limits

The package includes a powerful ant.Limiter interface that allows you to define rate limits per URL. There are some built-in limiters as well.

ant.Limit(1) // 1 rps on all URLs.
ant.LimitHostname(5, "amazon.com") // 5 rps on amazon.com hostname.
ant.LimitPattern(5, "amazon.com.*") // 5 rps on URLs matching the pattern `amazon.com.*`.
ant.LimitRegexp(5, "^apple\\.com/iphone/.*") // 5 rps on URLs that match the regexp.

Note that LimitPattern and LimitRegexp only match on the host and path of the URL.
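
A limiter is plugged into the engine via EngineConfig.Limiter. A sketch, re-using the myscraper placeholder that appears elsewhere in this README:

eng, err := ant.NewEngine(ant.EngineConfig{
  Scraper: myscraper{},                        // your scraper implementation.
  Limiter: ant.LimitHostname(5, "amazon.com"), // applied before every fetch.
})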


Matchers

Another powerful interface is ant.Matcher, which allows you to define URL matchers; the matchers are called before URLs are queued.

ant.MatchHostname("amazon.com") // scrape amazon.com URLs only.
ant.MatchPattern("amazon.com/help/*")
ant.MatchRegexp("amazon\\.com/help/.+")

Robust

The crawl engine automatically retries any error that implements a `Temporary() bool` method returning true.

Because the standard library returns errors that implement that interface, the engine will retry most temporary network and HTTP errors.

eng, err := ant.NewEngine(ant.EngineConfig{
  Scraper: myscraper{},
  MaxAttempts: 5,
})

// Blocks until one of the following is true:
//
// 1. No more URLs to crawl (the scraper stops returning URLs)
// 2. A non-temporary error occurred.
// 3. MaxAttempts was reached.
//
err = eng.Run(ctx)
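
If you want errors from your own scraper to be retried as well, you can return an error that implements the same interface. A rough sketch (the tempError type is hypothetical):

type tempError struct{ err error }

func (e tempError) Error() string   { return e.err.Error() }
func (e tempError) Temporary() bool { return true } // tells the engine to retry.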

Built-in Scrapers

The whole point of scraping is to extract data from websites into a machine-readable format such as CSV or JSON. ant comes with built-in scrapers to make this ridiculously easy; here's a full crawler that extracts quotes to stdout.

func main() {
	var url = "http://quotes.toscrape.com"
	var ctx = context.Background()
	var start = time.Now()

	type quote struct {
		Text string   `css:".text"   json:"text"`
		By   string   `css:".author" json:"by"`
		Tags []string `css:".tag"    json:"tags"`
	}

	type page struct {
		Quotes []quote `css:".quote" json:"quotes"`
	}

	eng, err := ant.NewEngine(ant.EngineConfig{
		Scraper: ant.JSON(os.Stdout, page{}, `li.next > a`),
		Matcher: ant.MatchHostname("quotes.toscrape.com"),
	})
	if err != nil {
		log.Fatalf("new engine: %s", err)
	}

	if err := eng.Run(ctx, url); err != nil {
		log.Fatal(err)
	}

	log.Printf("scraped in %s :)", time.Since(start))
}

Testing

The anttest package makes it easy to test your scraper implementation: it fetches a page by URL, caches it in the OS's temporary directory, and re-uses it.

The function depends on the file's modtime; the file expires daily. You can adjust the TTL by setting anttest.FetchTTL.

// Fetch calls `t.Fatal` on errors.
page := anttest.Fetch(t, "https://apple.com")
_, err := myscraper.Scrape(ctx, page)
assert.NoError(err)
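
Put together, a scraper test might look roughly like this (a sketch; myscraper is the same placeholder used above):

func TestMyScraper(t *testing.T) {
	ctx := context.Background()

	// Fetch calls `t.Fatal` on errors and caches the page on disk.
	page := anttest.Fetch(t, "https://apple.com")

	if _, err := (myscraper{}).Scrape(ctx, page); err != nil {
		t.Fatalf("scrape: %s", err)
	}
}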


Documentation

Overview

Package ant implements a web crawler.

Index

Constants

This section is empty.

Variables

var (
	// UserAgent is the default user agent to use.
	//
	// The user agent is used by default when fetching
	// pages and robots.txt.
	UserAgent = StaticAgent("antbot")

	// DefaultFetcher is the default fetcher to use.
	//
	// It uses the default client and default user agent.
	DefaultFetcher = &Fetcher{
		Client:    DefaultClient,
		UserAgent: UserAgent,
	}
)
var DefaultClient = &http.Client{
	Transport: &http.Transport{
		Proxy: http.ProxyFromEnvironment,
		DialContext: (&net.Dialer{
			Timeout:   30 * time.Second,
			KeepAlive: 30 * time.Second,
			DualStack: true,
		}).DialContext,
		ForceAttemptHTTP2:     true,
		MaxIdleConns:          0,
		MaxIdleConnsPerHost:   1000,
		IdleConnTimeout:       90 * time.Second,
		TLSHandshakeTimeout:   10 * time.Second,
		ExpectContinueTimeout: 1 * time.Second,
	},
	Timeout: 10 * time.Second,
}

DefaultClient is the default client to use.

It is configured the same way as `http.DefaultClient` except for 3 changes:

  • Timeout => 10s
  • Transport.MaxIdleConns => infinity
  • Transport.MaxIdleConnsPerHost => 1,000

Note that this default client is used for all robots.txt requests when they're enabled.

Functions

This section is empty.

Types

type Client

type Client interface {
	// Do sends an HTTP request and returns an HTTP response.
	//
	// The method does not rely on the HTTP response code to return an error.
	// A non-nil error does not guarantee that the response is nil; if a
	// response is returned, its body must be read until EOF and closed so
	// that the underlying resources may be reused.
	Do(req *http.Request) (*http.Response, error)
}

Client represents an HTTP client.

A client is used by the fetcher to turn URLs into pages; it is up to the client to decide how it manages the underlying connections, redirects, or cookies.

A client must be safe to use from multiple goroutines.
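
Because Client is a single-method interface, it is easy to wrap the default client. A hypothetical sketch that adds a header to every request:

type headerClient struct {
	inner ant.Client
}

func (c headerClient) Do(req *http.Request) (*http.Response, error) {
	req.Header.Set("X-Example", "1") // hypothetical header.
	return c.inner.Do(req)
}

// Usage: &ant.Fetcher{Client: headerClient{inner: ant.DefaultClient}}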

type Deduper

type Deduper interface {
	// Dedupe de-duplicates the given URLs.
	//
	// The method returns a new slice of URLs
	// that were not visited yet; it must be
	// thread-safe.
	//
	// The method is not required to normalize the URLs;
	// the engine normalizes them before calling the method.
	//
	// If an error is returned that implements
	// `Temporary() bool` and returns true, the
	// engine will retry.
	Dedupe(ctx context.Context, urls URLs) (URLs, error)
}

Deduper represents a URL de-duplicator.

A deduper must be safe to use from multiple goroutines.

func DedupeBF

func DedupeBF(k, m uint) Deduper

DedupeBF returns a new deduper backed by bloom filter.

The de-duplicator uses an in-memory bloom filter to check whether a URL has been visited. When `Dedupe()` is called with a set of URLs, it loops over them and checks whether they exist in the filter; any URLs that do not are added to the filter and returned.

func DedupeMap

func DedupeMap() Deduper

DedupeMap returns a new deduper backed by sync.Map.

The de-duplicator is inefficient and is meant for smaller crawls; it keeps all URLs in memory.

If you're concerned about memory use, either supply your own de-duplicator implementation or use `DedupeBF()`.
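
Either deduper is plugged in through EngineConfig.Deduper. A sketch (myscraper is a placeholder scraper, and the bloom filter parameters are illustrative rather than recommendations):

eng, err := ant.NewEngine(ant.EngineConfig{
	Scraper: myscraper{},
	Deduper: ant.DedupeBF(7, 16_000_000), // k and m configure the bloom filter (assumed: k hash functions, m bits).
})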

type Engine

type Engine struct {
	// contains filtered or unexported fields
}

Engine implements a web crawler engine.

func NewEngine

func NewEngine(c EngineConfig) (*Engine, error)

NewEngine returns a new engine.

func (*Engine) Enqueue

func (eng *Engine) Enqueue(ctx context.Context, rawurls ...string) error

Enqueue enqueues the given set of URLs.

The method blocks until all URLs are queued or the given context is canceled.

The method will also de-duplicate the URLs, ensuring that URLs will not be visited more than once.

func (*Engine) Run

func (eng *Engine) Run(ctx context.Context, urls ...string) error

Run runs the engine with the given start URLs.

type EngineConfig

type EngineConfig struct {
	// Scraper is the scraper to use.
	//
	// If nil, NewEngine returns an error.
	Scraper Scraper

	// Deduper is the URL de-duplicator to use.
	//
	// If nil, DedupeMap is used.
	Deduper Deduper

	// Fetcher is the page fetcher to use.
	//
	// If nil, the default HTTP fetcher is used.
	Fetcher *Fetcher

	// Queue is the URL queue to use.
	//
	// If nil, the default in-memory queue is used.
	Queue Queue

	// Limiter is the rate limiter to use.
	//
	// The limiter is called with each URL before
	// it is fetched.
	//
	// If nil, no limits are used.
	Limiter Limiter

	// Matcher is the URL matcher to use.
	//
	// The matcher is called with a URL before it is queued;
	// if it returns false, the URL is discarded.
	//
	// If nil, all URLs are queued.
	Matcher Matcher

	// Impolite skips any robots.txt checking.
	//
	// Note that it does not affect any configured
	// ratelimiters or matchers.
	//
	// By default the engine checks robots.txt; it uses
	// the default ant.UserAgent.
	Impolite bool

	// Workers specifies the amount of workers to use.
	//
	// Every worker the engine starts consumes URLs from the queue
	// and starts a goroutine for each URL.
	//
	// If <= 0, defaults to 1.
	Workers int

	// Concurrency is the maximum amount of URLs to process
	// at any given time.
	//
	// The engine uses a global semaphore to limit the amount
	// of goroutines started by the workers.
	//
	// If <= 0, there's no limit.
	Concurrency int
}

EngineConfig configures the engine.
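
A fuller configuration might look like this (a sketch; myscraper is a placeholder, and every field shown is optional except Scraper):

eng, err := ant.NewEngine(ant.EngineConfig{
	Scraper:     myscraper{},                      // required.
	Matcher:     ant.MatchHostname("example.com"), // queue example.com URLs only.
	Limiter:     ant.Limit(10),                    // 10 requests per second.
	Deduper:     ant.DedupeMap(),                  // the default.
	Workers:     5,                                // queue consumers.
	Concurrency: 100,                              // max in-flight URLs.
})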

type FetchError

type FetchError struct {
	URL    *url.URL
	Status int
}

FetchError represents a fetch error.

func (FetchError) Error

func (err FetchError) Error() string

Error implementation.

func (FetchError) Temporary

func (err FetchError) Temporary() bool

Temporary returns true if the HTTP status code generally means the error is temporary.

type Fetcher

type Fetcher struct {
	// Client is the client to use.
	//
	// If nil, ant.DefaultClient is used.
	Client Client

	// UserAgent is the user agent to use.
	//
	// It implements the fmt.Stringer interface
	// to allow user agent spoofing when needed.
	//
	// If nil, the client decides the user agent.
	UserAgent fmt.Stringer

	// MaxAttempts is the maximum request attempts to make.
	//
	// When <= 0, it defaults to 5.
	MaxAttempts int

	// MinBackoff to use when the fetcher retries.
	//
	// Must be less than MaxBackoff, otherwise
	// the fetcher returns an error.
	//
	// Defaults to `50ms`.
	MinBackoff time.Duration

	// MaxBackoff to use when the fetcher retries.
	//
	// Must be greater than MinBackoff, otherwise the
	// fetcher returns an error.
	//
	// Defaults to `1s`.
	MaxBackoff time.Duration
}

Fetcher implements a page fetcher.

func (*Fetcher) Fetch

func (f *Fetcher) Fetch(ctx context.Context, url *URL) (*Page, error)

Fetch fetches a page by URL.

The method uses the configured client to make a new request, parse the response, and return a page.

The method returns a nil page and nil error when the status code is 404.

The method retries the request when the status code indicates a temporary error or when a temporary network error occurs.

The returned page contains the response's body; the body must be read until EOF and closed so that the client can re-use the underlying TCP connection.
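
A custom fetcher is plugged in through EngineConfig.Fetcher. A sketch (myscraper is a placeholder, and the agent string and timings are illustrative):

eng, err := ant.NewEngine(ant.EngineConfig{
	Scraper: myscraper{},
	Fetcher: &ant.Fetcher{
		Client:      ant.DefaultClient,
		UserAgent:   ant.StaticAgent("mybot/1.0"),
		MaxAttempts: 3,
		MinBackoff:  50 * time.Millisecond,
		MaxBackoff:  time.Second,
	},
})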

type Limiter

type Limiter interface {
	// Limit blocks until a request is allowed to happen.
	//
	// The method receives a URL and must block until a request
	// to the URL is allowed to happen.
	//
	// If the given context is canceled, the method returns immediately
	// with the context's err.
	Limit(ctx context.Context, u *url.URL) error
}

Limiter controls how many requests can be made by the engine.

A limiter receives a context and a URL and blocks until a request is allowed to happen or returns an error if the context is canceled.

A limiter must be safe to use from multiple goroutines.

type LimiterFunc

type LimiterFunc func(context.Context, *url.URL) error

LimiterFunc implements a limiter.

func Limit

func Limit(n int) LimiterFunc

Limit returns a new limiter.

The limiter allows `n` requests per second.

func LimitHostname

func LimitHostname(n int, name string) LimiterFunc

LimitHostname returns a hostname limiter.

The limiter allows `n` requests for the hostname per second.

func LimitPattern

func LimitPattern(n int, pattern string) LimiterFunc

LimitPattern returns a pattern limiter.

The limiter allows `n` requests for any URLs that match the pattern per second.

The provided pattern is matched against a URL that does not contain the query string or the scheme.

func LimitRegexp

func LimitRegexp(n int, expr string) LimiterFunc

LimitRegexp returns a new regexp limiter.

The limiter limits all URLs that match the regexp; the matched URL does not contain the scheme or the query parameters.

func (LimiterFunc) Limit

func (f LimiterFunc) Limit(ctx context.Context, u *url.URL) error

Limit implementation.
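
Since LimiterFunc is just a function, you can also supply your own limiter. A sketch built on golang.org/x/time/rate (an assumption; any blocking implementation works):

var rl = rate.NewLimiter(rate.Every(200*time.Millisecond), 1)

var custom ant.LimiterFunc = func(ctx context.Context, u *url.URL) error {
	return rl.Wait(ctx) // blocks until a token is available or ctx is canceled.
}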

type List

type List []*html.Node

List represents a list of nodes.

The list wraps the html node slice with helper methods to extract data and manipulate the list.

func (List) At

func (l List) At(i int) List

At returns a list that contains the node at index i.

If a negative index is provided, the method returns the node from the end of the list.

func (List) Attr

func (l List) Attr(key string) (string, bool)

Attr returns the attribute value of key of the first node.

func (List) Is

func (l List) Is(selector string) (matched bool)

Is returns true if any of the nodes matches selector.

func (List) Query

func (l List) Query(selector string) List

Query returns a list of nodes matching selector.

If the selector is invalid, the method returns a nil list.

func (List) Scan

func (l List) Scan(dst interface{}) error

Scan scans all items into struct `dst`.

The method scans data from the 1st node.

func (List) Text

func (l List) Text() string

Text returns the inner text of the first node.
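
A short sketch of how the List helpers compose, assuming a *Page named page (the selectors are illustrative):

links := page.Query("a") // List of all anchors on the page.
first := links.At(0)     // List containing only the first anchor.

if first.Is(".next") {
	href, _ := first.Attr("href")
	_ = href
}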

type Matcher

type Matcher interface {
	// Match returns true if the URL matches.
	//
	// The method is called just before a URL is queued;
	// if it returns false, the URL will not be queued.
	Match(url *url.URL) bool
}

Matcher represents a URL matcher.

A matcher must be safe to use from multiple goroutines.

type MatcherFunc

type MatcherFunc func(*url.URL) bool

MatcherFunc implements a Matcher.

func MatchHostname

func MatchHostname(host string) MatcherFunc

MatchHostname returns a new hostname matcher.

The matcher returns true for all URLs that match the host.

func MatchPattern

func MatchPattern(pattern string) MatcherFunc

MatchPattern returns a new pattern matcher.

The matcher returns true for all URLs that match the pattern; the matched URL does not contain the scheme or the query parameters.

func MatchRegexp

func MatchRegexp(expr string) MatcherFunc

MatchRegexp returns a new regexp matcher.

The matcher returns true for all URLs that match the regexp; the matched URL does not contain the scheme or the query parameters.

func (MatcherFunc) Match

func (mf MatcherFunc) Match(url *url.URL) bool

Match implementation.
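
Custom matchers are plain functions as well. A sketch that only queues URLs under a given path prefix (the prefix is illustrative):

var helpOnly ant.MatcherFunc = func(u *url.URL) bool {
	return strings.HasPrefix(u.Path, "/help/")
}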

type Page

type Page struct {
	URL    *url.URL
	Header http.Header
	// contains filtered or unexported fields
}

Page represents a page.

func Fetch

func Fetch(ctx context.Context, rawurl string) (*Page, error)

Fetch fetches a page from URL.

func (*Page) Next

func (p *Page) Next(selector string) (URLs, error)

Next returns all URLs matching the given selector.

func (*Page) Query

func (p *Page) Query(selector string) List

Query returns all nodes matching selector.

The method returns an empty list if no nodes were found.

func (*Page) Scan

func (p *Page) Scan(dst interface{}) error

Scan scans data into the given value dst.

func (*Page) Text

func (p *Page) Text(selector string) string

Text returns the text of the selected node.

The method returns an empty string if the node is not found.

func (*Page) URLs

func (p *Page) URLs() URLs

URLs returns all URLs on the page.

The method skips any invalid URLs.
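
A small sketch of the Page helpers, in the same terse style as the README examples (selectors are illustrative, errors elided):

page, _ := ant.Fetch(ctx, "http://quotes.toscrape.com")

heading := page.Text("h1")            // "" if no node matches.
all := page.URLs()                    // every valid URL on the page.
next, _ := page.Next("li.next > a")   // URLs extracted from the pagination links.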

type Queue

type Queue interface {
	// Enqueue enqueues the given set of URLs.
	//
	// The method returns io.EOF if the queue was
	// closed and a context error if the context was
	// canceled.
	//
	// Any other error will be treated as a critical
	// error and will be propagated.
	Enqueue(ctx context.Context, urls URLs) error

	// Dequeue dequeues a URL.
	//
	// The method returns a URL or io.EOF error if
	// the queue was stopped.
	//
	// The method blocks until a URL is available or
	// until the queue is closed.
	Dequeue(ctx context.Context) (*URL, error)

	// Done acknowledges a URL.
	//
	// When a URL has been handled by the engine, the method
	// is called with the URL.
	Done(ctx context.Context, url *URL) error

	// Wait blocks until the queue is closed.
	//
	// When the engine encounters an error, or there are
	// no more URLs to handle, the method should unblock.
	Wait()

	// Close closes the queue.
	//
	// The method blocks until the queue is closed;
	// any queued URLs are discarded.
	Close(context.Context) error
}

Queue represents a URL queue.

A queue must be safe to use from multiple goroutines.

func MemoryQueue

func MemoryQueue(size int) Queue

MemoryQueue returns a new memory queue.
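
The queue is plugged in through EngineConfig.Queue. A sketch (myscraper is a placeholder and the buffer size is illustrative):

eng, err := ant.NewEngine(ant.EngineConfig{
	Scraper: myscraper{},
	Queue:   ant.MemoryQueue(1024),
})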

type Scraper

type Scraper interface {
	// Scrape scrapes the given page.
	//
	// The method can return a set of URLs that should
	// be queued and scraped next.
	//
	// If the scraper returns an error and it implements
	// a `Temporary() bool` method that returns true, it will
	// be retried.
	Scrape(ctx context.Context, p *Page) (URLs, error)
}

Scraper represents a scraper.

A scraper must be safe to use from multiple goroutines.
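
A custom scraper only needs to implement Scrape. A sketch that scans each page into a struct and follows pagination links (the type and selectors are hypothetical):

type quotesScraper struct{}

func (quotesScraper) Scrape(ctx context.Context, p *ant.Page) (ant.URLs, error) {
	var data struct {
		Quotes []string `css:".quote .text"`
	}

	if err := p.Scan(&data); err != nil {
		return nil, err
	}

	// Returned URLs are matched, de-duplicated and queued by the engine.
	return p.Next("li.next > a")
}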

func JSON

func JSON(w io.Writer, t interface{}, selectors ...string) Scraper

JSON returns a new JSON scraper.

The scraper receives the writer to write JSON lines into, the type to scrape from pages, and optional selectors from which to extract the next set of pages to crawl.

The provided type `t` must be a struct, otherwise the scraper will return an error on the initial scrape and the crawl engine will abort.

The scraper uses the `encoding/json` package to encode the provided type into JSON; any errors received from the encoder are returned from the scraper.

If no selectors are provided, the scraper will return all valid URLs on the page.

type StaticAgent

type StaticAgent string

StaticAgent is a static user agent string.

func (StaticAgent) String

func (sa StaticAgent) String() string

String implementation.

type URL

type URL = url.URL

URL represents a parsed URL.

type URLs

type URLs = []*URL

URLs represents a slice of parsed URLs.

Directories

Path Synopsis
_examples
cdp
antcache
	Package antcache implements an HTTP client that caches responses.
antcdp
	Package antcdp is an experimental package that implements an `ant.Client` that performs HTTP requests using chrome and returns a rendered response.
anttest
	Package anttest implements scraper test helpers.
internal
normalize
	Package normalize provides URL normalization.
robots
	Package robots implements a higher-level robots.txt interface.
scan
	Package scan implements structures that can scan HTML into go values.
selectors
	Package selectors provides utilities to compile and cache CSS selectors.
