colly

package module

Version: v0.0.0-...-05f1932 (latest)
Published: Oct 6, 2017
License: Apache-2.0
Imports: 14
Imported by: 0

Repository

github.com/tampajohn/colly

README ¶

Colly

Lightning Fast and Elegant Scraping Framework for Gophers

Colly provides a clean interface to write any kind of crawler/scraper/spider.

With Colly you can easily extract structured data from websites, which can be used for a wide range of applications, like data mining, data processing or archiving.

Documentation

Features

Clean API
Fast (>1k requests/sec on a single core)
Manages request delays and maximum concurrency per domain
Automatic cookie and session handling
Sync/async/parallel scraping

Example

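package main

import (
	"fmt"

	// Import path taken from the Repository entry of this module.
	"github.com/tampajohn/colly"
)

func main() {
	c := colly.NewCollector()

	// Find and visit all links
	c.OnHTML("a", func(e *colly.HTMLElement) {
		link := e.Attr("href")
		fmt.Println(link)
		// Resolve relative links against the current page before visiting them
		c.Visit(e.Request.AbsoluteURL(link))
	})

	// Start the crawl from the entry page
	c.Visit("https://en.wikipedia.org/")
}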

See the examples folder for more detailed examples.

Bugs

Bugs or suggestions? Visit the issue tracker or join #colly on freenode

Documentation ¶

Overview ¶

Package colly implements an HTTP scraping framework.

Index ¶

type Collector
- func NewCollector() *Collector
type Context
- func NewContext() *Context
- func (c *Context) Get(key string) string
- func (c *Context) Put(key, value string)
type ErrorCallback
type HTMLCallback
type HTMLElement
- func (h *HTMLElement) Attr(k string) string
type LimitRule
- func (r *LimitRule) Init() error
- func (r *LimitRule) Match(domain string) bool
type Request
type RequestCallback
type Response
type ResponseCallback

Types ¶

type Collector ¶

type Collector struct {
	// UserAgent is the User-Agent string used by HTTP requests
	UserAgent string
	// MaxDepth limits the recursion depth of visited URLs.
	// Set it to 0 for infinite recursion (default).
	MaxDepth int
	// AllowedDomains is a domain whitelist.
	// Leave it blank to allow any domains to be visited
	AllowedDomains []string
	// AllowURLRevisit allows multiple downloads of the same URL
	AllowURLRevisit bool
	// MaxBodySize limits the size of the retrieved response body in bytes. `0` means unlimited.
	// The default value for MaxBodySize is 10MB (10 * 1024 * 1024 bytes)
	MaxBodySize int
	// contains filtered or unexported fields
}

Collector provides the scraper instance for a scraping job
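The exported fields above jointly decide whether a URL gets fetched. The following is a minimal sketch of such a check, not the package's actual implementation: `visitFilter` and `shouldVisit` are hypothetical names, and the exact depth comparison is an assumption. Only the field semantics (0 means unlimited depth, an empty whitelist allows every domain) come from the field comments.

```go
package main

import "fmt"

// visitFilter is a hypothetical stand-in that mirrors three Collector fields.
type visitFilter struct {
	MaxDepth        int
	AllowedDomains  []string
	AllowURLRevisit bool
	visited         map[string]bool
}

// shouldVisit reports whether a URL on the given domain and depth may be fetched.
// It records the URL as visited when it returns true.
func (f *visitFilter) shouldVisit(url, domain string, depth int) bool {
	// MaxDepth == 0 means no depth limit (assumed comparison: depth > MaxDepth).
	if f.MaxDepth > 0 && depth > f.MaxDepth {
		return false
	}
	// An empty whitelist allows every domain.
	if len(f.AllowedDomains) > 0 {
		allowed := false
		for _, d := range f.AllowedDomains {
			if d == domain {
				allowed = true
				break
			}
		}
		if !allowed {
			return false
		}
	}
	// Reject repeated URLs unless revisits are explicitly allowed.
	if !f.AllowURLRevisit && f.visited[url] {
		return false
	}
	if f.visited == nil {
		f.visited = make(map[string]bool)
	}
	f.visited[url] = true
	return true
}

func main() {
	f := &visitFilter{MaxDepth: 2, AllowedDomains: []string{"en.wikipedia.org"}}
	fmt.Println(f.shouldVisit("https://en.wikipedia.org/", "en.wikipedia.org", 1))
	fmt.Println(f.shouldVisit("https://en.wikipedia.org/", "en.wikipedia.org", 1))
	fmt.Println(f.shouldVisit("https://example.com/", "example.com", 1))
}
```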

func NewCollector ¶

func NewCollector() *Collector

NewCollector creates a new Collector instance with default configuration

func (*Collector) DisableCookies ¶

func (c *Collector) DisableCookies()

DisableCookies turns off cookie handling for this collector

func (*Collector) Init ¶

func (c *Collector) Init()

Init initializes the Collector's private variables and sets default configuration for the Collector

func (*Collector) Limit ¶

func (c *Collector) Limit(rule *LimitRule) error

Limit adds a new `LimitRule` to the collector

func (*Collector) Limits ¶

func (c *Collector) Limits(rules []*LimitRule) error

Limits adds new `LimitRule`s to the collector

func (*Collector) OnError ¶

func (c *Collector) OnError(f ErrorCallback)

OnError registers a function. Function will be executed on every error.

func (*Collector) OnHTML ¶

func (c *Collector) OnHTML(goquerySelector string, f HTMLCallback)

OnHTML registers a function. Function will be executed on every HTML element matched by the `goquerySelector` parameter. `goquerySelector` is a selector used by https://github.com/PuerkitoBio/goquery

func (*Collector) OnRequest ¶

func (c *Collector) OnRequest(f RequestCallback)

OnRequest registers a function. Function will be executed on every request made by the Collector

func (*Collector) OnResponse ¶

func (c *Collector) OnResponse(f ResponseCallback)

OnResponse registers a function. Function will be executed on every response

func (*Collector) Post ¶

func (c *Collector) Post(URL string, requestData map[string]string) error

Post starts a collecting job by creating a POST request. Post also calls the previously registered OnRequest, OnResponse, and OnHTML callbacks

func (*Collector) SetRequestTimeout ¶

func (c *Collector) SetRequestTimeout(timeout time.Duration)

SetRequestTimeout overrides the default timeout (10 seconds) for this collector

func (*Collector) Visit ¶

func (c *Collector) Visit(URL string) error

Visit starts the Collector's collecting job by creating a request to the URL given in the URL parameter. Visit also calls the previously registered OnRequest, OnResponse, and OnHTML callbacks

func (*Collector) Wait ¶

func (c *Collector) Wait()

Wait returns when the collector jobs are finished
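Blocking until concurrent jobs finish is commonly done in Go with `sync.WaitGroup`. The sketch below shows that general pattern using only the standard library. Whether this package uses the same mechanism internally is an assumption, and `runJobs` is a hypothetical helper.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// runJobs starts n concurrent jobs, blocks until all of them finish,
// and returns how many completed.
func runJobs(n int) int64 {
	var wg sync.WaitGroup
	var done int64
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			atomic.AddInt64(&done, 1) // stands in for fetching and parsing a page
		}()
	}
	wg.Wait() // plays the role that Wait plays for a collector
	return atomic.LoadInt64(&done)
}

func main() {
	fmt.Println(runJobs(8))
}
```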

func (*Collector) WithTransport ¶

func (c *Collector) WithTransport(transport *http.Transport)

WithTransport allows you to set a custom http.Transport for this collector.

type Context ¶

type Context struct {
	// contains filtered or unexported fields
}

Context provides a tiny layer for passing data between callbacks

func NewContext ¶

func NewContext() *Context

NewContext initializes a new Context instance

func (*Context) Get ¶

func (c *Context) Get(key string) string

Get retrieves a value from Context. If no value is found for `key`, Get returns an empty string

func (*Context) Put ¶

func (c *Context) Put(key, value string)

Put stores a value in Context
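The Get and Put semantics described above can be sketched with a string map guarded by a mutex. This is an assumption about one reasonable design, not the package's source; `kvContext` and `newKVContext` are hypothetical names.

```go
package main

import (
	"fmt"
	"sync"
)

// kvContext is a hypothetical stand-in for Context: a string map guarded by a mutex
// so that callbacks running on different goroutines can share it safely.
type kvContext struct {
	mu   sync.RWMutex
	data map[string]string
}

func newKVContext() *kvContext {
	return &kvContext{data: make(map[string]string)}
}

// Put stores value under key, overwriting any previous value.
func (c *kvContext) Put(key, value string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.data[key] = value
}

// Get returns the stored value, or an empty string when key is absent.
func (c *kvContext) Get(key string) string {
	c.mu.RLock()
	defer c.mu.RUnlock()
	return c.data[key]
}

func main() {
	ctx := newKVContext()
	ctx.Put("url", "https://en.wikipedia.org/")
	fmt.Println(ctx.Get("url"))
	fmt.Println(ctx.Get("missing") == "")
}
```

In the package itself, a value stored on a Request's Ctx in an OnRequest callback would be read back from the Response's Ctx in an OnResponse callback.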

type ErrorCallback ¶

type ErrorCallback func(*Request, *Response, error)

ErrorCallback is the function type for OnError callbacks

type HTMLCallback ¶

type HTMLCallback func(*HTMLElement)

HTMLCallback is the function type for OnHTML callbacks

type HTMLElement ¶

type HTMLElement struct {
	// Name is the name of the tag
	Name string
	Text string

	// Request is the request object of the element's HTML document
	Request *Request
	// Response is the Response object of the element's HTML document
	Response *Response
	// DOM is the goquery parsed DOM object of the page. DOM is relative
	// to the current HTMLElement
	DOM *goquery.Selection
	// contains filtered or unexported fields
}

HTMLElement is the representation of an HTML tag.

func (*HTMLElement) Attr ¶

func (h *HTMLElement) Attr(k string) string

Attr returns the selected attribute of an HTMLElement, or an empty string if the attribute is not found

type LimitRule ¶

type LimitRule struct {
	// DomainRegexp is a regular expression to match against domains
	DomainRegexp string
	// DomainGlob is a glob pattern to match against domains
	DomainGlob string
	// Delay is the duration to wait before creating a new request to the matching domains
	Delay time.Duration
	// Parallelism is the maximum number of concurrent requests allowed to the matching domains
	Parallelism int
	// contains filtered or unexported fields
}

LimitRule provides connection restrictions for domains. There are two kinds of limits:

Parallelism: Set limit for the number of concurrent requests to a domain
Delay: Set rate limit for a domain (this means no parallelism on the matching domains)
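The two kinds of limits can be illustrated with a buffered channel used as a semaphore plus a pause before each slot is released. This is a standard-library sketch of the idea, not the package's implementation; `limiter`, `newLimiter`, and `maxConcurrent` are hypothetical names. A delay-only rule corresponds to a parallelism of 1 here.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// limiter is a hypothetical helper that caps concurrent work and spaces out requests.
type limiter struct {
	slots chan struct{}
	delay time.Duration
}

func newLimiter(parallelism int, delay time.Duration) *limiter {
	if parallelism < 1 {
		parallelism = 1
	}
	return &limiter{slots: make(chan struct{}, parallelism), delay: delay}
}

// do runs job once a slot is free, then waits delay before releasing the slot.
func (l *limiter) do(job func()) {
	l.slots <- struct{}{}
	defer func() {
		time.Sleep(l.delay)
		<-l.slots
	}()
	job()
}

// maxConcurrent pushes n jobs through the limiter and reports the peak concurrency seen.
func maxConcurrent(l *limiter, n int) int {
	var mu sync.Mutex
	var wg sync.WaitGroup
	running, peak := 0, 0
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			l.do(func() {
				mu.Lock()
				running++
				if running > peak {
					peak = running
				}
				mu.Unlock()
				time.Sleep(time.Millisecond) // stands in for an HTTP request
				mu.Lock()
				running--
				mu.Unlock()
			})
		}()
	}
	wg.Wait()
	return peak
}

func main() {
	fmt.Println(maxConcurrent(newLimiter(2, 0), 10) <= 2)
}
```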

func (*LimitRule) Init ¶

func (r *LimitRule) Init() error

Init initializes the private members of LimitRule

func (*LimitRule) Match ¶

func (r *LimitRule) Match(domain string) bool

Match checks that the domain parameter triggers the rule
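A rough sketch of how a domain could be tested against the two pattern fields is shown below. It is an assumption-laden illustration: `domainRule` and `match` are hypothetical names, and the standard library's `path.Match` is swapped in for whatever glob implementation the package actually uses, so glob syntax may differ in edge cases.

```go
package main

import (
	"fmt"
	"path"
	"regexp"
)

// domainRule is a hypothetical stand-in carrying the two pattern fields of LimitRule.
type domainRule struct {
	DomainRegexp string
	DomainGlob   string
}

// match reports whether domain satisfies the rule's regexp or glob pattern.
// Empty patterns are skipped and invalid patterns are treated as non-matching.
func (r domainRule) match(domain string) bool {
	if r.DomainRegexp != "" {
		if re, err := regexp.Compile(r.DomainRegexp); err == nil && re.MatchString(domain) {
			return true
		}
	}
	if r.DomainGlob != "" {
		if ok, err := path.Match(r.DomainGlob, domain); err == nil && ok {
			return true
		}
	}
	return false
}

func main() {
	r := domainRule{DomainGlob: "*.wikipedia.org"}
	fmt.Println(r.match("en.wikipedia.org"))
	fmt.Println(r.match("example.com"))
}
```

Compiling the regular expression once up front, rather than on every call as this sketch does, is the more efficient design and is presumably what an initialization step such as Init is for.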

type Request ¶

type Request struct {
	// URL is the parsed URL of the HTTP request
	URL *url.URL
	// Headers contains the Request's HTTP headers
	Headers *http.Header
	// Ctx is a context between a Request and a Response
	Ctx *Context
	// Depth is the number of parents of this request
	Depth int
	// contains filtered or unexported fields
}

Request is the representation of an HTTP request made by a Collector

func (*Request) AbsoluteURL ¶

func (r *Request) AbsoluteURL(u string) string

AbsoluteURL returns the resolved absolute URL of a URL chunk. AbsoluteURL returns an empty string if the URL chunk is a fragment or could not be parsed

func (*Request) Post ¶

func (r *Request) Post(URL string, requestData map[string]string) error

Post continues a collector job by creating a POST request. Post also calls the previously provided OnRequest, OnResponse, OnHTML callbacks

func (*Request) Visit ¶

func (r *Request) Visit(URL string) error

Visit continues the Collector's collecting job by creating a request to the URL given in the URL parameter. Visit also calls the previously registered OnRequest, OnResponse, and OnHTML callbacks

type RequestCallback ¶

type RequestCallback func(*Request)

RequestCallback is the function type for OnRequest callbacks

type Response ¶

type Response struct {
	// StatusCode is the status code of the Response
	StatusCode int
	// Body is the content of the Response
	Body []byte
	// Ctx is a context between a Request and a Response
	Ctx *Context
	// Request is the Request object of the response
	Request *Request
	// Headers contains the Response's HTTP headers
	Headers *http.Header
}

Response is the representation of an HTTP response received by a Collector

type ResponseCallback ¶

type ResponseCallback func(*Response)

ResponseCallback is the function type for OnResponse callbacks

Directories ¶

Path
examples/bad_request
examples/basic
examples/coursera_courses
examples/max_depth
examples/parallel
examples/rate_limit
examples/request_context