colly

v0.0.0-...-2837d3a | Published: Nov 12, 2017 | License: Apache-2.0 | Imports: 30 | Imported by: 0

Repository

github.com/schmorrison/colly

README ¶

Colly

Lightning Fast and Elegant Scraping Framework for Gophers

Colly provides a clean interface to write any kind of crawler/scraper/spider.

With Colly you can easily extract structured data from websites, which can be used for a wide range of applications, like data mining, data processing or archiving.

Features

Clean API
Fast (>1k requests/sec on a single core)
Manages request delays and maximum concurrency per domain
Automatic cookie and session handling
Sync/async/parallel scraping
Caching
Automatic encoding of non-unicode responses
Robots.txt support

Example

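package main

import (
	"fmt"

	"github.com/gocolly/colly"
)

func main() {
	c := colly.NewCollector()

	// Find and visit all links on each page
	c.OnHTML("a[href]", func(e *colly.HTMLElement) {
		e.Request.Visit(e.Attr("href"))
	})

	// Log every request before it is sent
	c.OnRequest(func(r *colly.Request) {
		fmt.Println("Visiting", r.URL)
	})

	// Start the crawl from the seed URL
	c.Visit("https://en.wikipedia.org/")
}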

See examples folder for more detailed examples.

Installation

go get -u github.com/gocolly/colly/...

Bugs

Bugs or suggestions? Visit the issue tracker or join #colly on freenode

Other Projects Using Colly

Below is a list of public, open source projects that use Colly:

greenpeace/check-my-pages Scraping script to test the Spanish Greenpeace web archive

If you are using Colly in a project please send a pull request to add it to the list.

Documentation ¶

Overview ¶

Package colly implements an HTTP scraping framework.

Index ¶

func SanitizeFileName(fileName string) string
func UnmarshalHTML(v interface{}, s *goquery.Selection) error
type Collector
- func NewCollector() *Collector
- func (c *Collector) Appengine(req *http.Request)
- func (c *Collector) Clone() *Collector
- func (c *Collector) Cookies(URL string) []*http.Cookie
- func (c *Collector) DisableCookies()
- func (c *Collector) Init()
- func (c *Collector) Limit(rule *LimitRule) error
- func (c *Collector) Limits(rules []*LimitRule) error
- func (c *Collector) OnError(f ErrorCallback)
- func (c *Collector) OnHTML(goquerySelector string, f HTMLCallback)
- func (c *Collector) OnHTMLDetach(goquerySelector string)
- func (c *Collector) OnRequest(f RequestCallback)
- func (c *Collector) OnResponse(f ResponseCallback)
- func (c *Collector) Post(URL string, requestData map[string]string) error
- func (c *Collector) PostMultipart(URL string, requestData map[string][]byte) error
- func (c *Collector) PostRaw(URL string, requestData []byte) error
- func (c *Collector) Request(method, URL string, requestData io.Reader, ctx *Context, hdr http.Header) error
- func (c *Collector) SetCookieJar(j *cookiejar.Jar)
- func (c *Collector) SetCookies(URL string, cookies []*http.Cookie) error
- func (c *Collector) SetProxy(proxyURL string) error
- func (c *Collector) SetRequestTimeout(timeout time.Duration)
- func (c *Collector) String() string
- func (c *Collector) Visit(URL string) error
- func (c *Collector) Wait()
- func (c *Collector) WithTransport(transport http.RoundTripper)
type Context
- func NewContext() *Context
- func (c *Context) Get(key string) string
- func (c *Context) GetAny(key string) interface{}
- func (c *Context) MarshalBinary() (_ []byte, _ error)
- func (c *Context) Put(key string, value interface{})
- func (c *Context) UnmarshalBinary(_ []byte) error
type ErrorCallback
type HTMLCallback
type HTMLElement
- func (h *HTMLElement) Attr(k string) string
- func (h *HTMLElement) ChildAttr(goquerySelector, attrName string) string
- func (h *HTMLElement) ChildText(goquerySelector string) string
- func (h *HTMLElement) Unmarshal(v interface{}) error
type LimitRule
- func (r *LimitRule) Init() error
- func (r *LimitRule) Match(domain string) bool
type Request
- func (r *Request) AbsoluteURL(u string) string
- func (r *Request) Post(URL string, requestData map[string]string) error
- func (r *Request) PostMultipart(URL string, requestData map[string][]byte) error
- func (r *Request) PostRaw(URL string, requestData []byte) error
- func (r *Request) Visit(URL string) error
type RequestCallback
type Response
- func (r *Response) FileName() string
- func (r *Response) Save(fileName string) error
type ResponseCallback

Constants ¶

This section is empty.

Variables ¶

This section is empty.

Functions ¶

func SanitizeFileName ¶

func SanitizeFileName(fileName string) string

SanitizeFileName replaces dangerous characters in a string so the return value can be used as a safe file name.

func UnmarshalHTML ¶

func UnmarshalHTML(v interface{}, s *goquery.Selection) error

UnmarshalHTML declaratively extracts text or attributes to a struct from HTML response using struct tags composed of css selectors. Allowed struct tags:

"selector" (required): CSS (goquery) selector of the desired data
"attr" (optional): Selects the matching element's attribute's value. Leave it blank or omit to get the text of the element.

Example struct declaration:

type Nested struct {
	String  string   `selector:"div > p"`
	Classes []string `selector:"li" attr:"class"`
	Struct  *Nested  `selector:"div > div"`
}

Supported types: struct, *struct, string, []string
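
To make the tag convention above concrete, here is a minimal sketch that only reads the "selector" and "attr" tags from a struct with the standard reflect package. It does not parse HTML and is not Colly's implementation; the names fieldRule and tagRules are hypothetical helpers introduced for illustration.

```go
package main

import (
	"fmt"
	"reflect"
)

// Nested mirrors the example struct declaration above.
type Nested struct {
	String  string   `selector:"div > p"`
	Classes []string `selector:"li" attr:"class"`
	Struct  *Nested  `selector:"div > div"`
}

// fieldRule is a hypothetical type (not part of Colly) describing
// what one tagged field asks for.
type fieldRule struct {
	Field    string
	Selector string
	Attr     string // empty means "use the element's text"
}

// tagRules lists the selector/attr tags declared on a struct or *struct.
func tagRules(v interface{}) []fieldRule {
	t := reflect.TypeOf(v)
	if t == nil {
		return nil
	}
	for t.Kind() == reflect.Ptr {
		t = t.Elem()
	}
	if t.Kind() != reflect.Struct {
		return nil
	}
	var rules []fieldRule
	for i := 0; i < t.NumField(); i++ {
		f := t.Field(i)
		sel, ok := f.Tag.Lookup("selector")
		if !ok {
			continue // assumption: untagged fields are ignored
		}
		rules = append(rules, fieldRule{Field: f.Name, Selector: sel, Attr: f.Tag.Get("attr")})
	}
	return rules
}

func main() {
	for _, r := range tagRules(&Nested{}) {
		fmt.Printf("%-8s selector=%q attr=%q\n", r.Field, r.Selector, r.Attr)
	}
}
```

A field with an empty Attr corresponds to the "leave it blank or omit" case, where the element's text is extracted instead of an attribute value.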

Types ¶

type Collector ¶

type Collector struct {
	// UserAgent is the User-Agent string used by HTTP requests
	UserAgent string
	// MaxDepth limits the recursion depth of visited URLs.
	// Set it to 0 for infinite recursion (default).
	MaxDepth int
	// AllowedDomains is a domain whitelist.
	// Leave it blank to allow any domains to be visited
	AllowedDomains []string
	// DisallowedDomains is a domain blacklist.
	DisallowedDomains []string
	// URLFilters is a list of regular expressions that restricts
	// which URLs are visited. If any of the expressions matches a URL,
	// the request is not blocked.
	// Leave it blank to allow any URLs to be visited
	URLFilters []*regexp.Regexp
	// AllowURLRevisit allows multiple downloads of the same URL
	AllowURLRevisit bool
	// MaxBodySize is the limit of the retrieved response body in bytes.
	// 0 means unlimited.
	// The default value for MaxBodySize is 10MB (10 * 1024 * 1024 bytes).
	MaxBodySize int
	// CacheDir specifies a location where GET requests are cached as files.
	// When it's not defined, caching is disabled.
	CacheDir string
	// IgnoreRobotsTxt allows the Collector to ignore any restrictions set by
	// the target host's robots.txt file.  See http://www.robotstxt.org/ for more
	// information.
	IgnoreRobotsTxt bool
	// contains filtered or unexported fields
}

Collector provides the scraper instance for a scraping job
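
The interaction of AllowedDomains, DisallowedDomains and URLFilters can be sketched with the standard library alone. The snippet below is an illustration of the semantics described in the field comments, not Colly's own code; the rules type, its permits method, and exampleRules are hypothetical, and the order in which the checks run is an assumption.

```go
package main

import (
	"fmt"
	"net/url"
	"regexp"
)

// rules is a hypothetical stand-in for the three filtering fields.
type rules struct {
	AllowedDomains    []string
	DisallowedDomains []string
	URLFilters        []*regexp.Regexp
}

func contains(list []string, s string) bool {
	for _, item := range list {
		if item == s {
			return true
		}
	}
	return false
}

// permits reports whether rawURL passes all three checks.
// The check order is an assumption made for this sketch.
func (r rules) permits(rawURL string) bool {
	u, err := url.Parse(rawURL)
	if err != nil {
		return false
	}
	host := u.Hostname()
	if contains(r.DisallowedDomains, host) {
		return false
	}
	// An empty whitelist allows every domain.
	if len(r.AllowedDomains) > 0 && !contains(r.AllowedDomains, host) {
		return false
	}
	// An empty filter list allows every URL.
	if len(r.URLFilters) == 0 {
		return true
	}
	for _, f := range r.URLFilters {
		if f.MatchString(rawURL) {
			return true // any single match is enough
		}
	}
	return false
}

// exampleRules restricts a crawl to article pages on example.com.
func exampleRules() rules {
	return rules{
		AllowedDomains: []string{"example.com"},
		URLFilters:     []*regexp.Regexp{regexp.MustCompile(`^https://example\.com/articles/`)},
	}
}

func main() {
	r := exampleRules()
	for _, u := range []string{
		"https://example.com/articles/1",
		"https://example.com/login",
		"https://other.org/articles/1",
	} {
		fmt.Println(u, r.permits(u))
	}
}
```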

func NewCollector ¶

func NewCollector() *Collector

NewCollector creates a new Collector instance with default configuration

func (*Collector) Appengine ¶

func (c *Collector) Appengine(req *http.Request)

Appengine replaces the Collector's backend http.Client with an http.Client provided by appengine/urlfetch. This function should be used when the scraper is initiated by an http.Request to Google App Engine.

func (*Collector) Clone ¶

func (c *Collector) Clone() *Collector

Clone creates an exact copy of a Collector without callbacks. HTTP backend, robots.txt cache and cookie jar are shared between collectors.

func (*Collector) Cookies ¶

func (c *Collector) Cookies(URL string) []*http.Cookie

Cookies returns the cookies to send in a request for the given URL.

func (*Collector) DisableCookies ¶

func (c *Collector) DisableCookies()

DisableCookies turns off cookie handling

func (*Collector) Init ¶

func (c *Collector) Init()

Init initializes the Collector's private variables and sets default configuration for the Collector

func (*Collector) Limit ¶

func (c *Collector) Limit(rule *LimitRule) error

Limit adds a new LimitRule to the collector

func (*Collector) Limits ¶

func (c *Collector) Limits(rules []*LimitRule) error

Limits adds new LimitRules to the collector

func (*Collector) OnError ¶

func (c *Collector) OnError(f ErrorCallback)

OnError registers a function. Function will be executed if an error occurs during the HTTP request.

func (*Collector) OnHTML ¶

func (c *Collector) OnHTML(goquerySelector string, f HTMLCallback)

OnHTML registers a function. Function will be executed on every HTML element matched by the GoQuery Selector parameter. GoQuery Selector is a selector used by https://github.com/PuerkitoBio/goquery

func (*Collector) OnHTMLDetach ¶

func (c *Collector) OnHTMLDetach(goquerySelector string)

OnHTMLDetach deregisters a function. The function will not be executed after it is detached.

func (*Collector) OnRequest ¶

func (c *Collector) OnRequest(f RequestCallback)

OnRequest registers a function. Function will be executed on every request made by the Collector

func (*Collector) OnResponse ¶

func (c *Collector) OnResponse(f ResponseCallback)

OnResponse registers a function. Function will be executed on every response

func (*Collector) Post ¶

func (c *Collector) Post(URL string, requestData map[string]string) error

Post starts a collector job by creating a POST request. Post also calls the previously provided callbacks

func (*Collector) PostMultipart ¶

func (c *Collector) PostMultipart(URL string, requestData map[string][]byte) error

PostMultipart starts a collector job by creating a Multipart POST request with raw binary data. PostMultipart also calls the previously provided callbacks

func (*Collector) PostRaw ¶

func (c *Collector) PostRaw(URL string, requestData []byte) error

PostRaw starts a collector job by creating a POST request with raw binary data. PostRaw also calls the previously provided callbacks

func (*Collector) Request ¶

func (c *Collector) Request(method, URL string, requestData io.Reader, ctx *Context, hdr http.Header) error

Request starts a collector job by creating a custom HTTP request where method, context, headers and request data can be specified. Set requestData, ctx, hdr parameters to nil if you don't want to use them. Valid methods:

"GET"
"POST"
"PUT"
"DELETE"
"PATCH"
"OPTIONS"

func (*Collector) SetCookieJar ¶

func (c *Collector) SetCookieJar(j *cookiejar.Jar)

SetCookieJar overrides the previously set cookie jar

func (*Collector) SetCookies ¶

func (c *Collector) SetCookies(URL string, cookies []*http.Cookie) error

SetCookies handles the receipt of the cookies in a reply for the given URL
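
The wording of Cookies and SetCookies matches the standard http.CookieJar interface, so the round trip can be sketched with net/http/cookiejar directly. This is an illustration under that assumption and does not touch a Collector; storedCookies is a hypothetical helper.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/cookiejar"
	"net/url"
)

// storedCookies saves one cookie for rawURL in a fresh jar, then returns
// the "name=value" pairs the jar would send with a request to that URL.
func storedCookies(rawURL, name, value string) ([]string, error) {
	u, err := url.Parse(rawURL)
	if err != nil {
		return nil, err
	}
	jar, err := cookiejar.New(nil)
	if err != nil {
		return nil, err
	}
	// Record the cookie as if it arrived in a reply from u.
	jar.SetCookies(u, []*http.Cookie{{Name: name, Value: value}})

	var out []string
	for _, c := range jar.Cookies(u) {
		out = append(out, c.Name+"="+c.Value)
	}
	return out, nil
}

func main() {
	pairs, err := storedCookies("https://example.com/", "session", "abc123")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(pairs)
}
```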

func (*Collector) SetProxy ¶

func (c *Collector) SetProxy(proxyURL string) error

SetProxy sets a proxy for the collector. This overrides the previously used http.Transport if the type of the transport is not http.RoundTripper

func (*Collector) SetRequestTimeout ¶

func (c *Collector) SetRequestTimeout(timeout time.Duration)

SetRequestTimeout overrides the default timeout (10 seconds) for this collector
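
At the net/http level, a proxy and a request timeout are usually configured on the client and its transport. The sketch below shows that underlying mechanism with the standard library only; it is an assumption about what such settings translate to, not a copy of Colly's internals, and newClient and proxyFor are hypothetical helpers.

```go
package main

import (
	"fmt"
	"net/http"
	"net/url"
	"time"
)

// newClient builds an http.Client that routes requests through proxyURL
// and gives up on a request after timeout.
func newClient(proxyURL string, timeout time.Duration) (*http.Client, error) {
	u, err := url.Parse(proxyURL)
	if err != nil {
		return nil, err
	}
	return &http.Client{
		Transport: &http.Transport{Proxy: http.ProxyURL(u)},
		Timeout:   timeout,
	}, nil
}

// proxyFor reports which proxy the client would use for target,
// or an empty string if none is configured.
func proxyFor(c *http.Client, target string) (string, error) {
	req, err := http.NewRequest("GET", target, nil)
	if err != nil {
		return "", err
	}
	t, ok := c.Transport.(*http.Transport)
	if !ok || t.Proxy == nil {
		return "", nil
	}
	p, err := t.Proxy(req)
	if err != nil || p == nil {
		return "", err
	}
	return p.String(), nil
}

func main() {
	// 10 seconds matches the default timeout mentioned above.
	c, err := newClient("http://127.0.0.1:8080", 10*time.Second)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	p, _ := proxyFor(c, "https://example.com/")
	fmt.Println("proxy:", p, "timeout:", c.Timeout)
}
```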

func (*Collector) String ¶

func (c *Collector) String() string

String is the text representation of the collector. It contains useful debug information about the collector's internals

func (*Collector) Visit ¶

func (c *Collector) Visit(URL string) error

Visit starts Collector's collecting job by creating a request to the URL specified in parameter. Visit also calls the previously provided callbacks

func (*Collector) Wait ¶

func (c *Collector) Wait()

Wait returns when the collector jobs are finished

func (*Collector) WithTransport ¶

func (c *Collector) WithTransport(transport http.RoundTripper)

WithTransport allows you to set a custom http.RoundTripper (transport)

type Context ¶

type Context struct {
	// contains filtered or unexported fields
}

Context provides a tiny layer for passing data between callbacks

func NewContext ¶

func NewContext() *Context

NewContext initializes a new Context instance

func (*Context) Get ¶

func (c *Context) Get(key string) string

Get retrieves a string value from Context. Get returns an empty string if key not found

func (*Context) GetAny ¶

func (c *Context) GetAny(key string) interface{}

GetAny retrieves a value from Context. GetAny returns nil if key not found

func (*Context) MarshalBinary ¶

func (c *Context) MarshalBinary() (_ []byte, _ error)

MarshalBinary encodes Context value. This function is used by request caching.

func (*Context) Put ¶

func (c *Context) Put(key string, value interface{})

Put stores a value of any type in Context

func (*Context) UnmarshalBinary ¶

func (c *Context) UnmarshalBinary(_ []byte) error

UnmarshalBinary decodes Context value to nil. This function is used by request caching.
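
The Put, Get and GetAny semantics described for Context can be sketched as a small mutex-guarded map. This is an illustration only: the store type is hypothetical, and returning an empty string from Get when the stored value is not a string is an assumption of this sketch rather than documented Colly behavior.

```go
package main

import (
	"fmt"
	"sync"
)

// store is a hypothetical key/value holder with Context-like accessors.
type store struct {
	mu   sync.RWMutex
	data map[string]interface{}
}

func newStore() *store {
	return &store{data: make(map[string]interface{})}
}

// Put stores a value of any type under key.
func (s *store) Put(key string, value interface{}) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.data[key] = value
}

// GetAny returns the stored value, or nil if key is not found.
func (s *store) GetAny(key string) interface{} {
	s.mu.RLock()
	defer s.mu.RUnlock()
	return s.data[key]
}

// Get returns the stored string, or "" if key is not found.
// Assumption: non-string values also yield "".
func (s *store) Get(key string) string {
	if v, ok := s.GetAny(key).(string); ok {
		return v
	}
	return ""
}

func main() {
	s := newStore()
	s.Put("user", "alice")
	s.Put("retries", 3)
	fmt.Printf("%q %q %v %v\n", s.Get("user"), s.Get("missing"), s.GetAny("retries"), s.GetAny("missing"))
}
```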

type ErrorCallback ¶

type ErrorCallback func(*Response, error)

ErrorCallback is a type alias for OnError callback functions

type HTMLCallback ¶

type HTMLCallback func(*HTMLElement)

HTMLCallback is a type alias for OnHTML callback functions

type HTMLElement ¶

type HTMLElement struct {
	// Name is the name of the tag
	Name string
	Text string

	// Request is the request object of the element's HTML document
	Request *Request
	// Response is the Response object of the element's HTML document
	Response *Response
	// DOM is the goquery parsed DOM object of the page. DOM is relative
	// to the current HTMLElement
	DOM *goquery.Selection
	// contains filtered or unexported fields
}

HTMLElement is the representation of an HTML tag.

func (*HTMLElement) Attr ¶

func (h *HTMLElement) Attr(k string) string

Attr returns the selected attribute of an HTMLElement or an empty string if no attribute is found

func (*HTMLElement) ChildAttr ¶

func (h *HTMLElement) ChildAttr(goquerySelector, attrName string) string

ChildAttr returns the stripped text content of the first matching element's attribute.

func (*HTMLElement) ChildText ¶

func (h *HTMLElement) ChildText(goquerySelector string) string

ChildText returns the concatenated and stripped text content of the matching elements.

func (*HTMLElement) Unmarshal ¶

func (h *HTMLElement) Unmarshal(v interface{}) error

Unmarshal is a shorthand for colly.UnmarshalHTML

type LimitRule ¶

type LimitRule struct {
	// DomainRegexp is a regular expression to match against domains
	DomainRegexp string
	// DomainGlob is a glob pattern to match against domains
	DomainGlob string
	// Delay is the duration to wait before creating a new request to the matching domains
	Delay time.Duration
	// Parallelism is the number of the maximum allowed concurrent requests of the matching domains
	Parallelism int
	// contains filtered or unexported fields
}

LimitRule provides connection restrictions for domains. Both DomainRegexp and DomainGlob can be used to specify the included domain patterns, but at least one is required. There are two kinds of limitations:

Parallelism: Set limit for the number of concurrent requests to matching domains
Delay: Wait specified amount of time between requests (parallelism is 1 in this case)
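
The two kinds of limitation can be illustrated with a buffered channel used as a semaphore. The sketch below is not Colly's implementation: limiter, newLimiter and maxConcurrency are hypothetical names, and holding a slot during the delay is just one possible way to space out requests.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// limiter is a hypothetical stand-in for a rule with Parallelism and Delay.
type limiter struct {
	slots chan struct{}
	delay time.Duration
}

func newLimiter(parallelism int, delay time.Duration) *limiter {
	if parallelism < 1 {
		parallelism = 1
	}
	return &limiter{slots: make(chan struct{}, parallelism), delay: delay}
}

// do runs job once a slot is free and keeps the slot for delay afterwards.
func (l *limiter) do(job func()) {
	l.slots <- struct{}{} // blocks while all slots are taken
	defer func() {
		time.Sleep(l.delay)
		<-l.slots
	}()
	job()
}

// maxConcurrency starts n jobs and reports the peak number running at once.
func maxConcurrency(l *limiter, n int) int {
	var (
		mu            sync.Mutex
		running, peak int
		wg            sync.WaitGroup
	)
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			l.do(func() {
				mu.Lock()
				running++
				if running > peak {
					peak = running
				}
				mu.Unlock()
				time.Sleep(time.Millisecond) // simulated request
				mu.Lock()
				running--
				mu.Unlock()
			})
		}()
	}
	wg.Wait()
	return peak
}

func main() {
	peak := maxConcurrency(newLimiter(2, 5*time.Millisecond), 10)
	fmt.Println("peak concurrent jobs:", peak)
}
```

With a parallelism of 1 and a non-zero delay, this limiter runs jobs strictly one after another with a pause between them, which corresponds to the Delay case in the list above.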

func (*LimitRule) Init ¶

func (r *LimitRule) Init() error

Init initializes the private members of LimitRule

func (*LimitRule) Match ¶

func (r *LimitRule) Match(domain string) bool

Match checks that the domain parameter triggers the rule

type Request ¶

type Request struct {
	// URL is the parsed URL of the HTTP request
	URL *url.URL
	// Headers contains the Request's HTTP headers
	Headers *http.Header
	// Ctx is a context between a Request and a Response
	Ctx *Context
	// Depth is the number of the parents of this request
	Depth int
	// contains filtered or unexported fields
}

Request is the representation of an HTTP request made by a Collector

func (*Request) AbsoluteURL ¶

func (r *Request) AbsoluteURL(u string) string

AbsoluteURL returns the resolved absolute URL of a URL chunk. AbsoluteURL returns an empty string if the URL chunk is a fragment or could not be parsed
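
Resolving a relative link against the page it was found on is what net/url's ResolveReference does. The following sketch shows that mechanism with the behavior described above (fragments and unparseable chunks yield an empty string); absoluteURL is a hypothetical free function, not the method itself.

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// absoluteURL resolves chunk against base. It returns "" when chunk is a
// fragment or when either argument cannot be parsed.
func absoluteURL(base, chunk string) string {
	if strings.HasPrefix(chunk, "#") {
		return ""
	}
	b, err := url.Parse(base)
	if err != nil {
		return ""
	}
	ref, err := url.Parse(chunk)
	if err != nil {
		return ""
	}
	return b.ResolveReference(ref).String()
}

func main() {
	base := "https://example.com/a/b"
	for _, chunk := range []string{"../c", "/x?y=1", "https://other.org/z", "#top"} {
		fmt.Printf("%-22q -> %q\n", chunk, absoluteURL(base, chunk))
	}
}
```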

func (*Request) Post ¶

func (r *Request) Post(URL string, requestData map[string]string) error

Post continues a collector job by creating a POST request and preserves the Context of the previous request. Post also calls the previously provided callbacks

func (*Request) PostMultipart ¶

func (r *Request) PostMultipart(URL string, requestData map[string][]byte) error

PostMultipart starts a collector job by creating a Multipart POST request with raw binary data. PostMultipart also calls the previously provided callbacks

func (*Request) PostRaw ¶

func (r *Request) PostRaw(URL string, requestData []byte) error

PostRaw starts a collector job by creating a POST request with raw binary data. PostRaw preserves the Context of the previous request and calls the previously provided callbacks

func (*Request) Visit ¶

func (r *Request) Visit(URL string) error

Visit continues Collector's collecting job by creating a request and preserves the Context of the previous request. Visit also calls the previously provided callbacks

type RequestCallback ¶

type RequestCallback func(*Request)

RequestCallback is a type alias for OnRequest callback functions

type Response ¶

type Response struct {
	// StatusCode is the status code of the Response
	StatusCode int
	// Body is the content of the Response
	Body []byte
	// Ctx is a context between a Request and a Response
	Ctx *Context
	// Request is the Request object of the response
	Request *Request
	// Headers contains the Response's HTTP headers
	Headers *http.Header
}

Response is the representation of an HTTP response made by a Collector

func (*Response) FileName ¶

func (r *Response) FileName() string

FileName returns the sanitized file name parsed from "Content-Disposition" header or from URL
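
A file name can be derived from a "Content-Disposition" header with mime.ParseMediaType, falling back to the last segment of the URL path. The sketch below illustrates that idea with the standard library; fileNameFor is a hypothetical helper, the "index" fallback is an assumption, and no sanitizing step is applied here.

```go
package main

import (
	"fmt"
	"mime"
	"net/url"
	"path"
)

// fileNameFor picks a file name from the Content-Disposition header value
// if it carries one, otherwise from the last segment of the URL path.
func fileNameFor(contentDisposition, rawURL string) string {
	if contentDisposition != "" {
		if _, params, err := mime.ParseMediaType(contentDisposition); err == nil {
			if name := params["filename"]; name != "" {
				return path.Base(name) // drop any directory part
			}
		}
	}
	u, err := url.Parse(rawURL)
	if err != nil {
		return ""
	}
	name := path.Base(u.Path)
	if name == "." || name == "/" {
		return "index" // assumption: placeholder for URLs without a file segment
	}
	return name
}

func main() {
	fmt.Println(fileNameFor(`attachment; filename="report.pdf"`, "https://example.com/download"))
	fmt.Println(fileNameFor("", "https://example.com/a/b.html"))
	fmt.Println(fileNameFor("", "https://example.com/"))
}
```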

func (*Response) Save ¶

func (r *Response) Save(fileName string) error

Save writes response body to disk

type ResponseCallback ¶

type ResponseCallback func(*Response)

ResponseCallback is a type alias for OnResponse callback functions

Directories ¶

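Path	Synopsis
_examples/basic
_examples/coursera_courses
_examples/error_handling
_examples/google_groups
_examples/hackernews_comments
_examples/instagram
_examples/login
_examples/max_depth
_examples/multipart
_examples/openedx_courses
_examples/parallel
_examples/rate_limit
_examples/request_context
_examples/url_filter
_examples/xkcd_store
cmd/colly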
