ant


README




ant (alpha) is a web crawler for Go.








Declarative

The package includes functions that can scan data from the page into your structs or slices of structs, which lets you reduce the noise and complexity in your source code.

You can also use a jQuery-like API that allows you to scrape complex HTML pages if needed.


var data struct { Title string `css:"title"` }
page, _ := ant.Fetch(ctx, "https://apple.com")
page.Scan(&data)
data.Title // => Apple
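
For the jQuery-like API, a minimal sketch re-using the page fetched above (selectors are illustrative and error handling is elided):

title := page.Text("title")              // inner text of the first <title>, "" if none matches.
href, ok := page.Query("a").Attr("href") // href of the first anchor, if any.
footer := page.Query(".footer a")        // a List of all matching nodes.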

Headless

By default the crawler uses http.Client; however, if you're crawling SPAs you can use the antcdp.Client implementation, which crawls pages with a headless Chrome browser.

eng, err := ant.NewEngine(ant.EngineConfig{
  Fetcher: &ant.Fetcher{
    Client: antcdp.Client{},
  },
})

Polite

The crawler automatically fetches and caches robots.txt, making sure that it never causes issues for small website owners. Of course, you can disable this behavior.

eng, err := ant.NewEngine(ant.EngineConfig{
  Impolite: true,
})
eng.Run(ctx)

Concurrent

The crawler maintains a configurable number of "worker" goroutines that read URLs off the queue and spawn a goroutine for each URL.

Depending on your configuration, you may want to increase the number of workers to speed up URL reads; of course, if you don't have enough resources you can reduce the number of workers too.

eng, err := ant.NewEngine(ant.EngineConfig{
  // Spawn 5 worker goroutines that dequeue
  // URLs and spawn a new goroutine for each URL.
  Workers: 5,
})
eng.Run(ctx)

Rate limits

The package includes a powerful ant.Limiter interface that allows you to define rate limits per URL. There are some built-in limiters as well.

ant.Limit(1) // 1 rps on all URLs.
ant.LimitHostname(5, "amazon.com") // 5 rps on amazon.com hostname.
ant.LimitPattern(5, "amazon.com.*") // 5 rps on URLs matching the pattern `amazon.com.*`.
ant.LimitRegexp(5, "^apple\\.com/iphone/.*") // 5 rps on URLs that match the regexp.

Note that LimitPattern and LimitRegexp only match on the host and path of the URL.
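
A limiter is plugged into the engine via EngineConfig.Limiter. A sketch, re-using the myscraper placeholder that appears elsewhere in this README:

eng, err := ant.NewEngine(ant.EngineConfig{
  Scraper: myscraper{},                        // your scraper implementation.
  Limiter: ant.LimitHostname(5, "amazon.com"), // applied before every fetch.
})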


Matchers

Another powerful interface is ant.Matcher, which allows you to define URL matchers; the matchers are called before URLs are queued.

ant.MatchHostname("amazon.com") // scrape amazon.com URLs only.
ant.MatchPattern("amazon.com/help/*")
ant.MatchRegexp("amazon\\.com/help/.+")

Robust

The crawl engine automatically retries any error that implements a `Temporary() bool` method returning true.

Because the standard library returns errors that implement that interface, the engine will retry most temporary network and HTTP errors.

eng, err := ant.NewEngine(ant.EngineConfig{
  Scraper: myscraper{},
  MaxAttempts: 5,
})

// Blocks until one of the following is true:
//
// 1. No more URLs to crawl (the scraper stops returning URLs)
// 2. A non-temporary error occurred.
// 3. MaxAttempts was reached.
//
err = eng.Run(ctx)
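
If you want errors from your own scraper to be retried as well, you can return an error that implements the same interface. A rough sketch (the tempError type is hypothetical):

type tempError struct{ err error }

func (e tempError) Error() string   { return e.err.Error() }
func (e tempError) Temporary() bool { return true } // tells the engine to retry.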

Built-in Scrapers

The whole point of scraping is to extract data from websites into a machine-readable format such as CSV or JSON. ant comes with built-in scrapers to make this ridiculously easy; here's a full crawler that extracts quotes to stdout.

func main() {
	var url = "http://quotes.toscrape.com"
	var ctx = context.Background()
	var start = time.Now()

	type quote struct {
		Text string   `css:".text"   json:"text"`
		By   string   `css:".author" json:"by"`
		Tags []string `css:".tag"    json:"tags"`
	}

	type page struct {
		Quotes []quote `css:".quote" json:"quotes"`
	}

	eng, err := ant.NewEngine(ant.EngineConfig{
		Scraper: ant.JSON(os.Stdout, page{}, `li.next > a`),
		Matcher: ant.MatchHostname("quotes.toscrape.com"),
	})
	if err != nil {
		log.Fatalf("new engine: %s", err)
	}

	if err := eng.Run(ctx, url); err != nil {
		log.Fatal(err)
	}

	log.Printf("scraped in %s :)", time.Since(start))
}

Testing

The anttest package makes it easy to test your scraper implementation: it fetches a page by URL, caches it in the OS's temporary directory, and re-uses it.

The function depends on the file's modtime; the file expires daily. You can adjust the TTL by setting anttest.FetchTTL.

// Fetch calls `t.Fatal` on errors.
page := anttest.Fetch(t, "https://apple.com")
_, err := myscraper.Scrape(ctx, page)
assert.NoError(err)
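
Put together, a scraper test might look roughly like this (a sketch; myscraper is the same placeholder used above):

func TestMyScraper(t *testing.T) {
	ctx := context.Background()

	// Fetch calls `t.Fatal` on errors and caches the page on disk.
	page := anttest.Fetch(t, "https://apple.com")

	if _, err := (myscraper{}).Scrape(ctx, page); err != nil {
		t.Fatalf("scrape: %s", err)
	}
}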


Documentation

Overview

Package ant implements a web crawler.

Index

Constants

This section is empty.

Variables

var (
	// UserAgent is the default user agent to use.
	//
	// The user agent is used by default when fetching
	// pages and robots.txt.
	UserAgent = StaticAgent("antbot")

	// DefaultFetcher is the default fetcher to use.
	//
	// It uses the default client and default user agent.
	DefaultFetcher = &Fetcher{
		Client:    DefaultClient,
		UserAgent: UserAgent,
	}
)
var DefaultClient = &http.Client{
	Transport: &http.Transport{
		Proxy: http.ProxyFromEnvironment,
		DialContext: (&net.Dialer{
			Timeout:   30 * time.Second,
			KeepAlive: 30 * time.Second,
			DualStack: true,
		}).DialContext,
		ForceAttemptHTTP2:     true,
		MaxIdleConns:          0,
		MaxIdleConnsPerHost:   1000,
		IdleConnTimeout:       90 * time.Second,
		TLSHandshakeTimeout:   10 * time.Second,
		ExpectContinueTimeout: 1 * time.Second,
	},
	Timeout: 10 * time.Second,
}

DefaultClient is the default client to use.

It is configured the same way as `http.DefaultClient` except for 3 changes:

  • Timeout => 10s
  • Transport.MaxIdleConns => infinity
  • Transport.MaxIdleConnsPerHost => 1,000

Note that this default client is used for all robots.txt requests when they're enabled.

Functions

This section is empty.

Types

type Client

type Client interface {
	// Do sends an HTTP request and returns an HTTP response.
	//
	// The method does not rely on the HTTP response code to return an error.
	// A non-nil error does not guarantee that the response is nil; if a
	// response is returned, its body must be read until EOF and closed so
	// that the underlying resources may be reused.
	Do(req *http.Request) (*http.Response, error)
}

Client represents an HTTP client.

A client is used by the fetcher to turn URLs into pages; it is up to the client to decide how it manages the underlying connections, redirects, or cookies.

A client must be safe to use from multiple goroutines.
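
Because Client is a single-method interface, it is easy to wrap the default client. A hypothetical sketch that adds a header to every request:

type headerClient struct {
	inner ant.Client
}

func (c headerClient) Do(req *http.Request) (*http.Response, error) {
	req.Header.Set("X-Example", "1") // hypothetical header.
	return c.inner.Do(req)
}

// Usage: &ant.Fetcher{Client: headerClient{inner: ant.DefaultClient}}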

type Deduper

type Deduper interface {
	// Dedupe de-duplicates the given URLs.
	//
	// The method returns a new slice of URLs
	// that were not visited yet; it must be
	// thread-safe.
	//
	// The method is not required to normalize the URLs;
	// the engine normalizes them before calling the method.
	//
	// If an error is returned that implements
	// `Temporary() bool` and returns true, the
	// engine will retry.
	Dedupe(ctx context.Context, urls URLs) (URLs, error)
}

Deduper represents a URL de-duplicator.

A deduper must be safe to use from multiple goroutines.

func DedupeBF

func DedupeBF(k, m uint) Deduper

DedupeBF returns a new deduper backed by bloom filter.

The de-duplicator uses an in-memory bloom filter to check whether a URL has been visited. When `Dedupe()` is called with a set of URLs, it loops over them and checks whether they exist in the filter; any URLs that do not are added to the filter and returned.

func DedupeMap

func DedupeMap() Deduper

DedupeMap returns a new deduper backed by sync.Map.

The de-duplicator is inefficient and is meant for smaller crawls; it keeps all URLs in memory.

If you're concerned about memory use, either supply your own de-duplicator implementation or use `DedupeBF()`.
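
Either deduper is plugged in through EngineConfig.Deduper. A sketch (myscraper is a placeholder scraper, and the bloom filter parameters are illustrative rather than recommendations):

eng, err := ant.NewEngine(ant.EngineConfig{
	Scraper: myscraper{},
	Deduper: ant.DedupeBF(7, 16_000_000), // k and m configure the bloom filter (assumed: k hash functions, m bits).
})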

type Engine

type Engine struct {
	// contains filtered or unexported fields
}

Engine implements a web crawler engine.

func NewEngine

func NewEngine(c EngineConfig) (*Engine, error)

NewEngine returns a new engine.

func (*Engine) Enqueue

func (eng *Engine) Enqueue(ctx context.Context, rawurls ...string) error

Enqueue enqueues the given set of URLs.

The method blocks until all URLs are queued or the given context is canceled.

The method will also de-duplicate the URLs, ensuring that URLs will not be visited more than once.

func (*Engine) Run

func (eng *Engine) Run(ctx context.Context, urls ...string) error

Run runs the engine with the given start URLs.

type EngineConfig

type EngineConfig struct {
	// Scraper is the scraper to use.
	//
	// If nil, NewEngine returns an error.
	Scraper Scraper

	// Deduper is the URL de-duplicator to use.
	//
	// If nil, DedupeMap is used.
	Deduper Deduper

	// Fetcher is the page fetcher to use.
	//
	// If nil, the default HTTP fetcher is used.
	Fetcher *Fetcher

	// Queue is the URL queue to use.
	//
	// If nil, the default in-memory queue is used.
	Queue Queue

	// Limiter is the rate limiter to use.
	//
	// The limiter is called with each URL before
	// it is fetched.
	//
	// If nil, no limits are used.
	Limiter Limiter

	// Matcher is the URL matcher to use.
	//
	// The matcher is called with a URL before it is queued;
	// if it returns false, the URL is discarded.
	//
	// If nil, all URLs are queued.
	Matcher Matcher

	// Impolite skips any robots.txt checking.
	//
	// Note that it does not affect any configured
	// ratelimiters or matchers.
	//
	// By default the engine checks robots.txt; it uses
	// the default ant.UserAgent.
	Impolite bool

	// Workers specifies the amount of workers to use.
	//
	// Every worker the engine starts consumes URLs from the queue
	// and starts a goroutine for each URL.
	//
	// If <= 0, defaults to 1.
	Workers int

	// Concurrency is the maximum amount of URLs to process
	// at any given time.
	//
	// The engine uses a global semaphore to limit the amount
	// of goroutines started by the workers.
	//
	// If <= 0, there's no limit.
	Concurrency int
}

EngineConfig configures the engine.
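
A fuller configuration might look like this (a sketch; myscraper is a placeholder, and every field shown is optional except Scraper):

eng, err := ant.NewEngine(ant.EngineConfig{
	Scraper:     myscraper{},                      // required.
	Matcher:     ant.MatchHostname("example.com"), // queue example.com URLs only.
	Limiter:     ant.Limit(10),                    // 10 requests per second.
	Deduper:     ant.DedupeMap(),                  // the default.
	Workers:     5,                                // queue consumers.
	Concurrency: 100,                              // max in-flight URLs.
})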

type FetchError

type FetchError struct {
	URL    *url.URL
	Status int
}

FetchError represents a fetch error.

func (FetchError) Error

func (err FetchError) Error() string

Error implementation.

func (FetchError) Temporary

func (err FetchError) Temporary() bool

Temporary returns true if the HTTP status code generally means the error is temporary.

type Fetcher

type Fetcher struct {
	// Client is the client to use.
	//
	// If nil, ant.DefaultClient is used.
	Client Client

	// UserAgent is the user agent to use.
	//
	// It implements the fmt.Stringer interface
	// to allow user agent spoofing when needed.
	//
	// If nil, the client decides the user agent.
	UserAgent fmt.Stringer

	// MaxAttempts is the maximum request attempts to make.
	//
	// When <= 0, it defaults to 5.
	MaxAttempts int

	// MinBackoff to use when the fetcher retries.
	//
	// Must be less than MaxBackoff, otherwise
	// the fetcher returns an error.
	//
	// Defaults to `50ms`.
	MinBackoff time.Duration

	// MaxBackoff to use when the fetcher retries.
	//
	// Must be greater than MinBackoff, otherwise the
	// fetcher returns an error.
	//
	// Defaults to `1s`.
	MaxBackoff time.Duration
}

Fetcher implements a page fetcher.

func (*Fetcher) Fetch

func (f *Fetcher) Fetch(ctx context.Context, url *URL) (*Page, error)

Fetch fetches a page by URL.

The method uses the configured client to make a new request, parse the response, and return a page.

The method returns a nil page and nil error when the status code is 404.

The method retries the request when the status code indicates a temporary error or when a temporary network error occurs.

The returned page contains the response's body; the body must be read until EOF and closed so that the client can re-use the underlying TCP connection.
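
A custom fetcher is plugged in through EngineConfig.Fetcher. A sketch (myscraper is a placeholder, and the agent string and timings are illustrative):

eng, err := ant.NewEngine(ant.EngineConfig{
	Scraper: myscraper{},
	Fetcher: &ant.Fetcher{
		Client:      ant.DefaultClient,
		UserAgent:   ant.StaticAgent("mybot/1.0"),
		MaxAttempts: 3,
		MinBackoff:  50 * time.Millisecond,
		MaxBackoff:  time.Second,
	},
})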

type Limiter

type Limiter interface {
	// Limit blocks until a request is allowed to happen.
	//
	// The method receives a URL and must block until a request
	// to the URL is allowed to happen.
	//
	// If the given context is canceled, the method returns immediately
	// with the context's err.
	Limit(ctx context.Context, u *url.URL) error
}

Limiter controls how many requests can be made by the engine.

A limiter receives a context and a URL and blocks until a request is allowed to happen or returns an error if the context is canceled.

A limiter must be safe to use from multiple goroutines.

type LimiterFunc

type LimiterFunc func(context.Context, *url.URL) error

LimiterFunc implements a limiter.

func Limit

func Limit(n int) LimiterFunc

Limit returns a new limiter.

The limiter allows `n` requests per second.

func LimitHostname

func LimitHostname(n int, name string) LimiterFunc

LimitHostname returns a hostname limiter.

The limiter allows `n` requests for the hostname per second.

func LimitPattern

func LimitPattern(n int, pattern string) LimiterFunc

LimitPattern returns a pattern limiter.

The limiter allows `n` requests for any URLs that match the pattern per second.

The provided pattern is matched against a URL that does not contain the query string or the scheme.

func LimitRegexp

func LimitRegexp(n int, expr string) LimiterFunc

LimitRegexp returns a new regexp limiter.

The limiter limits all URLs that match the regexp; the matched URL does not contain the scheme or the query parameters.

func (LimiterFunc) Limit

func (f LimiterFunc) Limit(ctx context.Context, u *url.URL) error

Limit implementation.
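
Since LimiterFunc is just a function, you can also supply your own limiter. A sketch built on golang.org/x/time/rate (an assumption; any blocking implementation works):

var rl = rate.NewLimiter(rate.Every(200*time.Millisecond), 1)

var custom ant.LimiterFunc = func(ctx context.Context, u *url.URL) error {
	return rl.Wait(ctx) // blocks until a token is available or ctx is canceled.
}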

type List

type List []*html.Node

List represents a list of nodes.

The list wraps the html node slice with helper methods to extract data and manipulate the list.

func (List) At

func (l List) At(i int) List

At returns a list that contains the node at index i.

If a negative index is provided, the method returns the node from the end of the list.

func (List) Attr

func (l List) Attr(key string) (string, bool)

Attr returns the attribute value of key of the first node.

func (List) Is

func (l List) Is(selector string) (matched bool)

Is returns true if any of the nodes matches selector.

func (List) Query

func (l List) Query(selector string) List

Query returns a list of nodes matching selector.

If the selector is invalid, the method returns a nil list.

func (List) Scan

func (l List) Scan(dst interface{}) error

Scan scans all items into struct `dst`.

The method scans data from the 1st node.

func (List) Text

func (l List) Text() string

Text returns the inner text of the first node.
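
A short sketch of how the List helpers compose, assuming a *Page named page (the selectors are illustrative):

links := page.Query("a") // List of all anchors on the page.
first := links.At(0)     // List containing only the first anchor.

if first.Is(".next") {
	href, _ := first.Attr("href")
	_ = href
}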

type Matcher

type Matcher interface {
	// Match returns true if the URL matches.
	//
	// The method is called just before a URL is queued;
	// if it returns false, the URL will not be queued.
	Match(url *url.URL) bool
}

Matcher represents a URL matcher.

A matcher must be safe to use from multiple goroutines.

type MatcherFunc

type MatcherFunc func(*url.URL) bool

MatcherFunc implements a Matcher.

func MatchHostname

func MatchHostname(host string) MatcherFunc

MatchHostname returns a new hostname matcher.

The matcher returns true for all URLs that match the host.

func MatchPattern

func MatchPattern(pattern string) MatcherFunc

MatchPattern returns a new pattern matcher.

The matcher returns true for all URLs that match the pattern; the matched URL does not contain the scheme or the query parameters.

func MatchRegexp

func MatchRegexp(expr string) MatcherFunc

MatchRegexp returns a new regexp matcher.

The matcher returns true for all URLs that match the regexp; the matched URL does not contain the scheme or the query parameters.

func (MatcherFunc) Match

func (mf MatcherFunc) Match(url *url.URL) bool

Match implementation.
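
Custom matchers are plain functions as well. A sketch that only queues URLs under a given path prefix (the prefix is illustrative):

var helpOnly ant.MatcherFunc = func(u *url.URL) bool {
	return strings.HasPrefix(u.Path, "/help/")
}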

type Page

type Page struct {
	URL    *url.URL
	Header http.Header
	// contains filtered or unexported fields
}

Page represents a page.

func Fetch

func Fetch(ctx context.Context, rawurl string) (*Page, error)

Fetch fetches a page from URL.

func (*Page) Next

func (p *Page) Next(selector string) (URLs, error)

Next returns all URLs matching the given selector.

func (*Page) Query

func (p *Page) Query(selector string) List

Query returns all nodes matching selector.

The method returns an empty list if no nodes were found.

func (*Page) Scan

func (p *Page) Scan(dst interface{}) error

Scan scans data into the given value dst.

func (*Page) Text

func (p *Page) Text(selector string) string

Text returns the text of the selected node.

The method returns an empty string if the node is not found.

func (*Page) URLs

func (p *Page) URLs() URLs

URLs returns all URLs on the page.

The method skips any invalid URLs.
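
A small sketch of the Page helpers, in the same terse style as the README examples (selectors are illustrative, errors elided):

page, _ := ant.Fetch(ctx, "http://quotes.toscrape.com")

heading := page.Text("h1")            // "" if no node matches.
all := page.URLs()                    // every valid URL on the page.
next, _ := page.Next("li.next > a")   // URLs extracted from the pagination links.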

type Queue

type Queue interface {
	// Enqueue enqueues the given set of URLs.
	//
	// The method returns io.EOF if the queue was
	// closed and a context error if the context was
	// canceled.
	//
	// Any other error will be treated as a critical
	// error and will be propagated.
	Enqueue(ctx context.Context, urls URLs) error

	// Dequeue dequeues a URL.
	//
	// The method returns a URL or io.EOF error if
	// the queue was stopped.
	//
	// The method blocks until a URL is available or
	// until the queue is closed.
	Dequeue(ctx context.Context) (*URL, error)

	// Done acknowledges a URL.
	//
	// When a URL has been handled by the engine, the method
	// is called with the URL.
	Done(ctx context.Context, url *URL) error

	// Wait blocks until the queue is closed.
	//
	// When the engine encounters an error, or there are
	// no more URLs to handle, the method should unblock.
	Wait()

	// Close closes the queue.
	//
	// The method blocks until the queue is closed;
	// any queued URLs are discarded.
	Close(context.Context) error
}

Queue represents a URL queue.

A queue must be safe to use from multiple goroutines.

func MemoryQueue

func MemoryQueue(size int) Queue

MemoryQueue returns a new memory queue.
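
The queue is plugged in through EngineConfig.Queue. A sketch (myscraper is a placeholder and the buffer size is illustrative):

eng, err := ant.NewEngine(ant.EngineConfig{
	Scraper: myscraper{},
	Queue:   ant.MemoryQueue(1024),
})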

type Scraper

type Scraper interface {
	// Scrape scrapes the given page.
	//
	// The method can return a set of URLs that should
	// be queued and scraped next.
	//
	// If the scraper returns an error and it implements
	// a `Temporary() bool` method that returns true, it will
	// be retried.
	Scrape(ctx context.Context, p *Page) (URLs, error)
}

Scraper represents a scraper.

A scraper must be safe to use from multiple goroutines.
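
A custom scraper only needs to implement Scrape. A sketch that scans each page into a struct and follows pagination links (the type and selectors are hypothetical):

type quotesScraper struct{}

func (quotesScraper) Scrape(ctx context.Context, p *ant.Page) (ant.URLs, error) {
	var data struct {
		Quotes []string `css:".quote .text"`
	}

	if err := p.Scan(&data); err != nil {
		return nil, err
	}

	// Returned URLs are matched, de-duplicated and queued by the engine.
	return p.Next("li.next > a")
}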

func JSON

func JSON(w io.Writer, t interface{}, selectors ...string) Scraper

JSON returns a new JSON scraper.

The scraper receives the writer to write JSON lines into, the type to scrape from pages, and optional selectors from which to extract the next set of pages to crawl.

The provided type `t` must be a struct, otherwise the scraper will return an error on the initial scrape and the crawl engine will abort.

The scraper uses the `encoding/json` package to encode the provided type into JSON; any errors received from the encoder are returned from the scraper.

If no selectors are provided, the scraper will return all valid URLs on the page.

type StaticAgent

type StaticAgent string

StaticAgent is a static user agent string.

func (StaticAgent) String

func (sa StaticAgent) String() string

String implementation.

type URL

type URL = url.URL

URL represents a parsed URL.

type URLs

type URLs = []*URL

URLs represents a slice of parsed URLs.

Directories

Path Synopsis
_examples
cdp
antcache
	Package antcache implements an HTTP client that caches responses.
antcdp
	Package antcdp is an experimental package that implements an `ant.Client` that performs HTTP requests using chrome and returns a rendered response.
anttest
	Package anttest implements scraper test helpers.
internal
normalize
	Package normalize provides URL normalization.
robots
	Package robots implements a higher-level robots.txt interface.
scan
	Package scan implements structures that can scan HTML into go values.
selectors
	Package selectors provides utilities to compile and cache CSS selectors.
