colly

package module
v0.0.0-...-05f1932 Latest
Warning

This package is not in the latest version of its module.

Published: Oct 6, 2017 License: Apache-2.0 Imports: 14 Imported by: 0

README

Colly

Lightning Fast and Elegant Scraping Framework for Gophers

Colly provides a clean interface to write any kind of crawler/scraper/spider.

With Colly you can easily extract structured data from websites, which can be used for a wide range of applications, like data mining, data processing or archiving.

Documentation

Features

  • Clean API
  • Fast (>1k requests/sec on a single core)
  • Manages request delays and maximum concurrency per domain
  • Automatic cookie and session handling
  • Sync/async/parallel scraping

Example

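package main

import (
	"fmt"

	// Import path assumed for this snapshot of the module; adjust it to
	// the path you fetched the package from.
	"github.com/asciimoo/colly"
)

func main() {
	c := colly.NewCollector()

	// Find and visit all links
	c.OnHTML("a", func(e *colly.HTMLElement) {
		link := e.Attr("href")
		fmt.Println(link)
		c.Visit(e.Request.AbsoluteURL(link))
	})

	c.Visit("https://en.wikipedia.org/")
}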

See examples folder for more detailed examples.

Bugs

Bugs or suggestions? Visit the issue tracker or join #colly on freenode

Documentation

Overview

Package colly implements an HTTP scraping framework.

Types

type Collector

type Collector struct {
	// UserAgent is the User-Agent string used by HTTP requests
	UserAgent string
	// MaxDepth limits the recursion depth of visited URLs.
	// Set it to 0 for infinite recursion (default).
	MaxDepth int
	// AllowedDomains is a domain whitelist.
	// Leave it blank to allow any domains to be visited
	AllowedDomains []string
	// AllowURLRevisit allows multiple downloads of the same URL
	AllowURLRevisit bool
	// MaxBodySize limits the retrieved response body in bytes. `0` means unlimited.
	// The default value for MaxBodySize is 10MB (10 * 1024 * 1024 bytes)
	MaxBodySize int
	// contains filtered or unexported fields
}

Collector provides the scraper instance for a scraping job
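The MaxDepth and AllowedDomains fields describe a simple admission check for each URL. The following standard-library sketch illustrates that idea only; it is not colly's implementation. The crawlPolicy type and its permits method are hypothetical names, and treating a depth greater than MaxDepth as rejected is an assumption.

```go
package main

import (
	"fmt"
	"net/url"
)

// crawlPolicy is a hypothetical stand-in for the MaxDepth and
// AllowedDomains fields; it is not part of colly.
type crawlPolicy struct {
	MaxDepth       int
	AllowedDomains []string
}

// permits reports whether rawURL may be fetched at the given depth.
func (p crawlPolicy) permits(rawURL string, depth int) bool {
	// MaxDepth == 0 means no depth limit.
	if p.MaxDepth > 0 && depth > p.MaxDepth {
		return false
	}
	u, err := url.Parse(rawURL)
	if err != nil {
		return false
	}
	// An empty whitelist allows every domain.
	if len(p.AllowedDomains) == 0 {
		return true
	}
	for _, d := range p.AllowedDomains {
		if d == u.Hostname() {
			return true
		}
	}
	return false
}

func main() {
	p := crawlPolicy{MaxDepth: 2, AllowedDomains: []string{"en.wikipedia.org"}}
	fmt.Println(p.permits("https://en.wikipedia.org/wiki/Go", 1)) // true
	fmt.Println(p.permits("https://example.com/", 1))             // false: domain not whitelisted
	fmt.Println(p.permits("https://en.wikipedia.org/wiki/Go", 3)) // false: too deep
}
```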

func NewCollector

func NewCollector() *Collector

NewCollector creates a new Collector instance with default configuration

func (*Collector) DisableCookies

func (c *Collector) DisableCookies()

DisableCookies turns off cookie handling for this collector

func (*Collector) Init

func (c *Collector) Init()

Init initializes the Collector's private variables and sets default configuration for the Collector

func (*Collector) Limit

func (c *Collector) Limit(rule *LimitRule) error

Limit adds a new `LimitRule` to the collector

func (*Collector) Limits

func (c *Collector) Limits(rules []*LimitRule) error

Limits adds new `LimitRule`s to the collector

func (*Collector) OnError

func (c *Collector) OnError(f ErrorCallback)

OnError registers a function that is executed on every error.

func (*Collector) OnHTML

func (c *Collector) OnHTML(goquerySelector string, f HTMLCallback)

OnHTML registers a function that is executed on every HTML element matched by the `goquerySelector` parameter. `goquerySelector` is a selector in the syntax used by https://github.com/PuerkitoBio/goquery

func (*Collector) OnRequest

func (c *Collector) OnRequest(f RequestCallback)

OnRequest registers a function that is executed on every request made by the Collector.

func (*Collector) OnResponse

func (c *Collector) OnResponse(f ResponseCallback)

OnResponse registers a function that is executed on every response.
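The registration methods above follow a common pattern: callbacks are stored first and invoked later, when a request is made. The sketch below shows that pattern with the standard library only. The miniCollector type, its fetch field, and the trace helper are hypothetical and do not reflect colly's internals; the real callbacks receive *Request and *Response values rather than plain strings.

```go
package main

import "fmt"

// miniCollector is a hypothetical, heavily simplified collector that only
// stores callbacks and invokes them around a fetch.
type miniCollector struct {
	onRequest  []func(url string)
	onResponse []func(url string, body []byte)
	fetch      func(url string) []byte // stand-in for the HTTP round trip
}

// OnRequest registers f to run before every fetch.
func (c *miniCollector) OnRequest(f func(url string)) {
	c.onRequest = append(c.onRequest, f)
}

// OnResponse registers f to run after every fetch.
func (c *miniCollector) OnResponse(f func(url string, body []byte)) {
	c.onResponse = append(c.onResponse, f)
}

// Visit runs the request callbacks, fetches the URL, then runs the
// response callbacks, each group in registration order.
func (c *miniCollector) Visit(url string) {
	for _, f := range c.onRequest {
		f(url)
	}
	body := c.fetch(url)
	for _, f := range c.onResponse {
		f(url, body)
	}
}

// trace visits url with a canned response and records the callback order.
func trace(url string) []string {
	var events []string
	c := &miniCollector{fetch: func(string) []byte { return []byte("<html></html>") }}
	c.OnRequest(func(u string) { events = append(events, "request "+u) })
	c.OnResponse(func(u string, body []byte) {
		events = append(events, fmt.Sprintf("response %s (%d bytes)", u, len(body)))
	})
	c.Visit(url)
	return events
}

func main() {
	for _, e := range trace("https://example.com/") {
		fmt.Println(e)
	}
}
```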

func (*Collector) Post

func (c *Collector) Post(URL string, requestData map[string]string) error

Post starts a collecting job by creating a POST request. Post also calls the previously provided OnRequest, OnResponse, OnHTML callbacks
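Post takes its payload as a map[string]string. A plausible way to turn such a map into a request body is form encoding; the sketch below shows that with net/url. That Post uses exactly this encoding is an assumption, and encodeForm is a hypothetical helper, not part of the package.

```go
package main

import (
	"fmt"
	"net/url"
)

// encodeForm converts a map shaped like Post's requestData parameter into
// an application/x-www-form-urlencoded body.
func encodeForm(data map[string]string) string {
	form := url.Values{}
	for k, v := range data {
		form.Set(k, v)
	}
	// Encode sorts by key and escapes the values.
	return form.Encode()
}

func main() {
	body := encodeForm(map[string]string{"user": "gopher", "q": "web scraping"})
	fmt.Println(body) // q=web+scraping&user=gopher
}
```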

func (*Collector) SetRequestTimeout

func (c *Collector) SetRequestTimeout(timeout time.Duration)

SetRequestTimeout overrides the default timeout (10 seconds) for this collector

func (*Collector) Visit

func (c *Collector) Visit(URL string) error

Visit starts the Collector's collecting job by creating a request to the URL specified in the URL parameter. Visit also calls the previously provided OnRequest, OnResponse, OnHTML callbacks

func (*Collector) Wait

func (c *Collector) Wait()

Wait blocks until the collector's jobs are finished.
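Waiting for outstanding jobs is commonly built on sync.WaitGroup. The sketch below shows that pattern in isolation; whether Wait is implemented this way is an assumption, and jobTracker and runJobs are hypothetical names.

```go
package main

import (
	"fmt"
	"sync"
)

// jobTracker is a hypothetical helper that counts running jobs.
type jobTracker struct {
	wg sync.WaitGroup
}

// Go starts job in its own goroutine and tracks its completion.
func (t *jobTracker) Go(job func()) {
	t.wg.Add(1)
	go func() {
		defer t.wg.Done()
		job()
	}()
}

// Wait blocks until every job started with Go has returned.
func (t *jobTracker) Wait() {
	t.wg.Wait()
}

// runJobs starts n trivial jobs, waits for them, and returns how many ran.
func runJobs(n int) int {
	var t jobTracker
	var mu sync.Mutex
	done := 0
	for i := 0; i < n; i++ {
		t.Go(func() {
			mu.Lock()
			done++
			mu.Unlock()
		})
	}
	t.Wait()
	return done
}

func main() {
	fmt.Println(runJobs(5)) // 5
}
```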

func (*Collector) WithTransport

func (c *Collector) WithTransport(transport *http.Transport)

WithTransport allows you to set a custom http.Transport for this collector.

type Context

type Context struct {
	// contains filtered or unexported fields
}

Context provides a tiny layer for passing data between callbacks

func NewContext

func NewContext() *Context

NewContext initializes a new Context instance

func (*Context) Get

func (c *Context) Get(key string) string

Get retrieves a value from Context. Get returns an empty string if no value is found for `key`.

func (*Context) Put

func (c *Context) Put(key, value string)

Put stores a value in Context
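The Put and Get contract (string keys, string values, empty string for a missing key) can be illustrated with a mutex-guarded map. This is a sketch of the contract only, not the actual Context type; kvContext and newKVContext are hypothetical names, and the locking is an assumption about safe use across callbacks.

```go
package main

import (
	"fmt"
	"sync"
)

// kvContext is a hypothetical string-to-string store that is safe for
// concurrent use.
type kvContext struct {
	mu   sync.RWMutex
	data map[string]string
}

// newKVContext returns an empty, ready-to-use store.
func newKVContext() *kvContext {
	return &kvContext{data: make(map[string]string)}
}

// Put stores value under key, replacing any previous value.
func (c *kvContext) Put(key, value string) {
	c.mu.Lock()
	c.data[key] = value
	c.mu.Unlock()
}

// Get returns the value stored under key, or "" if there is none.
func (c *kvContext) Get(key string) string {
	c.mu.RLock()
	defer c.mu.RUnlock()
	return c.data[key] // a missing key yields the zero value ""
}

func main() {
	ctx := newKVContext()
	ctx.Put("referrer", "https://en.wikipedia.org/")
	fmt.Println(ctx.Get("referrer"))      // https://en.wikipedia.org/
	fmt.Println(ctx.Get("missing") == "") // true
}
```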

type ErrorCallback

type ErrorCallback func(*Request, *Response, error)

ErrorCallback is the function type of OnError callbacks

type HTMLCallback

type HTMLCallback func(*HTMLElement)

HTMLCallback is the function type of OnHTML callbacks

type HTMLElement

type HTMLElement struct {
	// Name is the name of the tag
	Name string
	Text string

	// Request is the request object of the element's HTML document
	Request *Request
	// Response is the Response object of the element's HTML document
	Response *Response
	// DOM is the goquery parsed DOM object of the page. DOM is relative
	// to the current HTMLElement
	DOM *goquery.Selection
	// contains filtered or unexported fields
}

HTMLElement is the representation of an HTML tag.

func (*HTMLElement) Attr

func (h *HTMLElement) Attr(k string) string

Attr returns the selected attribute of an HTMLElement, or an empty string if the attribute is not found

type LimitRule

type LimitRule struct {
	// DomainRegexp is a regular expression to match against domains
	DomainRegexp string
	// DomainGlob is a glob pattern to match against domains
	DomainGlob string
	// Delay is the duration to wait before creating a new request to the matching domains
	Delay time.Duration
	// Parallelism is the maximum number of concurrent requests allowed to the matching domains
	Parallelism int
	// contains filtered or unexported fields
}

LimitRule provides connection restrictions for domains. There are two kinds of limitations:

  • Parallelism: Set limit for the number of concurrent requests to a domain
  • Delay: Set rate limit for a domain (this means no parallelism on the matching domains)
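The two kinds of limitation can be sketched with a buffered channel used as a semaphore plus a pause between jobs. This is a standard-library illustration, not colly's implementation. The limiter type, newLimiter, and peakConcurrency are hypothetical names, and sleeping after each job while still holding the slot is an assumed way to space out requests.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// limiter is a hypothetical combination of a parallelism cap and a delay.
type limiter struct {
	slots chan struct{}
	delay time.Duration
}

// newLimiter builds a limiter. A positive delay forces parallelism to 1,
// mirroring the note that a delay means no parallelism.
func newLimiter(parallelism int, delay time.Duration) *limiter {
	if delay > 0 || parallelism < 1 {
		parallelism = 1
	}
	return &limiter{slots: make(chan struct{}, parallelism), delay: delay}
}

// do runs job once a slot is free, then holds the slot for the delay.
func (l *limiter) do(job func()) {
	l.slots <- struct{}{}        // acquire a slot, blocking while all are taken
	defer func() { <-l.slots }() // release the slot when done
	job()
	time.Sleep(l.delay)
}

// peakConcurrency pushes n short jobs through l and returns the highest
// number of jobs observed running at the same time.
func peakConcurrency(l *limiter, n int) int {
	var wg sync.WaitGroup
	var mu sync.Mutex
	running, peak := 0, 0
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			l.do(func() {
				mu.Lock()
				running++
				if running > peak {
					peak = running
				}
				mu.Unlock()
				time.Sleep(time.Millisecond) // simulated request
				mu.Lock()
				running--
				mu.Unlock()
			})
		}()
	}
	wg.Wait()
	return peak
}

func main() {
	fmt.Println(peakConcurrency(newLimiter(2, 0), 8) <= 2)                // true
	fmt.Println(peakConcurrency(newLimiter(4, time.Millisecond), 3) == 1) // true
}
```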

func (*LimitRule) Init

func (r *LimitRule) Init() error

Init initializes the private members of LimitRule

func (*LimitRule) Match

func (r *LimitRule) Match(domain string) bool

Match reports whether the domain parameter triggers the rule
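Matching a domain against either a regular expression or a glob can be sketched as follows. This is not the LimitRule implementation: domainRule and newDomainRule are hypothetical names, and path.Match from the standard library is swapped in as the glob engine, so its pattern syntax may differ from what DomainGlob accepts.

```go
package main

import (
	"fmt"
	"path"
	"regexp"
)

// domainRule is a hypothetical matcher holding an optional compiled
// regular expression and an optional glob pattern.
type domainRule struct {
	re   *regexp.Regexp // nil when no regular expression was given
	glob string         // empty when no glob was given
}

// newDomainRule compiles domainRegexp (if non-empty) and stores domainGlob.
func newDomainRule(domainRegexp, domainGlob string) (*domainRule, error) {
	r := &domainRule{glob: domainGlob}
	if domainRegexp != "" {
		re, err := regexp.Compile(domainRegexp)
		if err != nil {
			return nil, err
		}
		r.re = re
	}
	return r, nil
}

// match reports whether domain satisfies the regular expression or the glob.
func (r *domainRule) match(domain string) bool {
	if r.re != nil && r.re.MatchString(domain) {
		return true
	}
	if r.glob != "" {
		ok, err := path.Match(r.glob, domain)
		return err == nil && ok
	}
	return false
}

func main() {
	wiki, err := newDomainRule("", "*.wikipedia.org")
	if err != nil {
		panic(err)
	}
	fmt.Println(wiki.match("en.wikipedia.org")) // true
	fmt.Println(wiki.match("example.com"))      // false
}
```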

type Request

type Request struct {
	// URL is the parsed URL of the HTTP request
	URL *url.URL
	// Headers contains the Request's HTTP headers
	Headers *http.Header
	// Ctx is a context between a Request and a Response
	Ctx *Context
	// Depth is the number of the parents of this request
	Depth int
	// contains filtered or unexported fields
}

Request is the representation of an HTTP request made by a Collector

func (*Request) AbsoluteURL

func (r *Request) AbsoluteURL(u string) string

AbsoluteURL returns the resolved absolute URL of a URL chunk. AbsoluteURL returns an empty string if the URL chunk is a fragment or could not be parsed
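The behaviour described for AbsoluteURL can be approximated with net/url. The sketch below is a standalone function, not the method itself: absoluteURL is a hypothetical helper, and passing the base as a string instead of reading it from the Request is a simplification for illustration.

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// absoluteURL resolves ref against base. It returns "" if ref is a
// fragment or if either argument cannot be parsed.
func absoluteURL(base, ref string) string {
	if strings.HasPrefix(ref, "#") {
		return ""
	}
	b, err := url.Parse(base)
	if err != nil {
		return ""
	}
	r, err := url.Parse(ref)
	if err != nil {
		return ""
	}
	return b.ResolveReference(r).String()
}

func main() {
	base := "https://en.wikipedia.org/wiki/Main_Page"
	fmt.Println(absoluteURL(base, "/wiki/Go"))    // https://en.wikipedia.org/wiki/Go
	fmt.Println(absoluteURL(base, "Gopher"))      // https://en.wikipedia.org/wiki/Gopher
	fmt.Println(absoluteURL(base, "#top") == "")  // true
}
```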

func (*Request) Post

func (r *Request) Post(URL string, requestData map[string]string) error

Post continues a collector job by creating a POST request. Post also calls the previously provided OnRequest, OnResponse, OnHTML callbacks

func (*Request) Visit

func (r *Request) Visit(URL string) error

Visit continues the Collector's collecting job by creating a request to the URL specified in the URL parameter. Visit also calls the previously provided OnRequest, OnResponse, OnHTML callbacks

type RequestCallback

type RequestCallback func(*Request)

RequestCallback is the function type of OnRequest callbacks

type Response

type Response struct {
	// StatusCode is the status code of the Response
	StatusCode int
	// Body is the content of the Response
	Body []byte
	// Ctx is a context between a Request and a Response
	Ctx *Context
	// Request is the Request object of the response
	Request *Request
	// Headers contains the Response's HTTP headers
	Headers *http.Header
}

Response is the representation of an HTTP response received by a Collector

type ResponseCallback

type ResponseCallback func(*Response)

ResponseCallback is the function type of OnResponse callbacks

Directories

Path Synopsis
examples
