crawl

package module
v0.0.0-...-a1d1ef6
Published: Jul 24, 2017 License: Apache-2.0 Imports: 20 Imported by: 4

README

crawl

Lightweight library for crawlers in Go.

HTML parsing and extraction are handled by goquery.

Usage

Take a look at the example crawler code in the examples directory. The sketch below shows the basic shape.
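
A minimal end-to-end sketch assembled from the API documented below; the import path, seed URL, and callback name are assumptions, and Start is assumed to block while queued jobs are processed:

package main

import (
	"context"
	"log"

	"github.com/crackcomm/crawl" // import path assumed
)

func main() {
	c := crawl.New(crawl.WithConcurrency(10))

	// Handlers are registered under callback names; requests refer
	// to them through their Callbacks list.
	c.Register("links", func(ctx context.Context, resp *crawl.Response) error {
		defer resp.Close() // responses must ALWAYS be released
		for _, href := range resp.Find("a").Map(crawl.NodeResolveURL(resp)) {
			log.Println(href)
		}
		return nil
	})

	// Schedule the seed request before starting the crawler.
	seed := &crawl.Request{URL: "http://example.com/", Callbacks: crawl.Callbacks("links")}
	if err := c.Schedule(context.Background(), seed); err != nil {
		log.Fatal(err)
	}

	// Drain the errors channel so it never blocks.
	go func() {
		for err := range c.Errors() {
			log.Println("crawl:", err)
		}
	}()

	defer c.Close()
	c.Start() // assumed to block while the queue is processed
}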

License

                             Apache License
                       Version 2.0, January 2004
                    http://www.apache.org/licenses/

Documentation

Constants

This section is empty.

Variables

var DefaultHeaders = map[string]string{
	"Accept":          "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8",
	"Accept-Language": "en-US,en;q=0.8",
	"User-Agent":      "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/46.0.2490.86 Safari/537.36",
}

DefaultHeaders - Default crawler headers.

var NodeDataPhoto = NodeAttr("data-photo")

NodeDataPhoto - Node "data-photo" attribute selector.

var NodeHref = NodeAttr("href")

NodeHref - Node "href" attribute selector.

var NodeSrc = NodeAttr("src")

NodeSrc - Node "src" attribute selector.

Functions

func Attr

func Attr(n Finder, attr, selector string) string

Attr - Finds a node in the response and returns the content of the named attribute.

func Callbacks

func Callbacks(v ...string) []string

Callbacks - Helper for creating a list of callback names.

func ConstructHTTPRequest

func ConstructHTTPRequest(req *Request) (r *http.Request, err error)

ConstructHTTPRequest - Constructs an http.Request from a crawl Request.

func FindAny

func FindAny(finder Finder, selectors ...string) (node *goquery.Selection)

FindAny - Returns the first node in the response matching any of the given selectors.

func NodeAttr

func NodeAttr(attr string) func(int, *goquery.Selection) string

NodeAttr - Returns a node attribute selector. Helper for (*goquery.Selection).Map().
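
The returned function matches goquery's Map signature, so attribute values can be collected in one call. A sketch; the helper name is illustrative:

// collectImageSources returns the "src" attribute of every image in
// the response, using the NodeSrc selector defined above.
func collectImageSources(resp *crawl.Response) []string {
	return resp.Find("img").Map(crawl.NodeSrc)
}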

func NodeResolveURL

func NodeResolveURL(resp *Response) func(int, *goquery.Selection) string

NodeResolveURL - Returns a selector that takes a node's href and resolves it against the response URL. Helper for (*goquery.Selection).Map().

func NodeText

func NodeText(_ int, n *goquery.Selection) string

NodeText - Returns the node text. Helper for (*goquery.Selection).Map().

func ParseFloat

func ParseFloat(n Finder, selector string) (res float64, err error)

ParseFloat - Finds a node in the response and parses its text as a float64. When no text is found, it returns 0.0 and a nil error. Returned errors originate from strconv.ParseFloat.

func ParseUint

func ParseUint(n Finder, selector string) (res uint64, err error)

ParseUint - Finds a node in the response and parses its text as a uint64. When no text is found, it returns 0 and a nil error. Returned errors originate from strconv.ParseUint.
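
Both parsers take any Finder, and *Response implements Finder, so it can be passed directly. A sketch with an assumed ".rating" selector:

// parseRating reads a numeric field from a hypothetical ".rating" node.
// A missing node yields 0.0 and a nil error, as documented above.
func parseRating(resp *crawl.Response) (float64, error) {
	return crawl.ParseFloat(resp, ".rating")
}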

func ProxyFromContext

func ProxyFromContext(ctx context.Context) (addrs []string, ok bool)

ProxyFromContext - Returns proxy addresses from context metadata.

func Text

func Text(n Finder, selector string) string

Text - Finds a node in the response and returns its text.

func WithProxy

func WithProxy(ctx context.Context, addrs ...string) context.Context

WithProxy - Sets proxies in context metadata.
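
Since the context given to Schedule travels with the job, proxies can be attached per request. A sketch; the helper name and proxy address are assumptions:

// scheduleViaProxy schedules a request whose job context carries
// proxy addresses read later via ProxyFromContext.
func scheduleViaProxy(c crawl.Crawler, req *crawl.Request) error {
	ctx := crawl.WithProxy(context.Background(), "http://127.0.0.1:8080")
	return c.Schedule(ctx, req)
}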

func WriteResponseFile

func WriteResponseFile(r *Response, fname string) (err error)

WriteResponseFile - Reads the response body into memory and writes it to a file.

Types

type Crawler

type Crawler interface {
	// Schedule - Schedules a request.
	// The context is passed to the queue within a job.
	Schedule(context.Context, *Request) error

	// Execute - Makes an HTTP request respecting the context deadline.
	// If the request's Raw field is not true, ParseHTML() is called on the Response.
	// Then all callbacks are executed with the context.
	Execute(context.Context, *Request) (*Response, error)

	// Handlers - Returns all registered handlers.
	Handlers() map[string][]Handler

	// Register - Registers crawl handler.
	Register(name string, h Handler)

	// Middleware - Registers a middleware.
	// Request is not executed if middleware returns an error.
	Middleware(Middleware)

	// Start - Starts the crawler.
	// All errors should be received from Errors() channel.
	Start()

	// Close - Closes the queue and the crawler.
	Close() error

	// Errors - Returns a channel that receives all crawl errors.
	// Only errors from queued requests are sent here,
	// including queue errors as well as request errors.
	Errors() <-chan error
}

Crawler - Crawler interface.

func New

func New(opts ...Option) Crawler

New - Creates a new crawler. If no queue is provided, an in-memory queue is created with the capacity set by WithQueueCapacity (default: 10000).

type Finder

type Finder interface {
	Find(string) *goquery.Selection
}

Finder - HTML finder.

type Handler

type Handler func(context.Context, *Response) error

Handler - Crawler handler.
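
A handler receives the response after HTML parsing (unless the request was Raw). A sketch; the handler name and "title" selector are illustrative:

// printTitle logs the page title and releases the response.
func printTitle(ctx context.Context, resp *crawl.Response) error {
	defer resp.Close() // responses must ALWAYS be released
	log.Println("title:", crawl.Text(resp, "title"))
	return nil
}

It would be registered with Register("title", printTitle) and referenced from requests via Callbacks("title").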

type Job

type Job interface {
	// Request - Returns the job's request.
	Request() *Request

	// Context - Returns job context.
	Context() context.Context

	// Done - Sets job as done.
	Done()
}

Job - Crawl job interface.

type Middleware

type Middleware func(context.Context, *Request, *http.Request) error

Middleware - Crawler middleware.

type Option

type Option func(*crawl)

Option - Crawl option.

func WithConcurrency

func WithConcurrency(n int) Option

WithConcurrency - Sets crawl concurrency. Default: 1000.

func WithDefaultHeaders

func WithDefaultHeaders(headers map[string]string) Option

WithDefaultHeaders - Sets crawl default headers. Default: empty.

func WithDefaultTimeout

func WithDefaultTimeout(d time.Duration) Option

WithDefaultTimeout - Sets default request timeout duration.

func WithQueue

func WithQueue(queue Queue) Option

WithQueue - Sets the crawl queue. Default: a queue created with NewQueue() using the capacity from WithQueueCapacity().

func WithQueueCapacity

func WithQueueCapacity(n int) Option

WithQueueCapacity - Sets the queue capacity. The value is used as the capacity of the in-memory channel queue, when one needs to be created, and as the capacity of the buffered errors channel. Default: 10000.

func WithSpiders

func WithSpiders(spiders ...func(Crawler)) Option

WithSpiders - Registers spiders on the crawler.
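
A spider is just a function that registers its handlers on the crawler. A sketch; the spider name is illustrative and printTitle is the handler sketched under Handler above:

// NewsSpider registers handlers for a hypothetical news site.
func NewsSpider(c crawl.Crawler) {
	c.Register("news.title", printTitle)
}

It would then be attached at construction time with crawl.New(crawl.WithSpiders(NewsSpider)).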

func WithTransport

func WithTransport(transport *http.Transport) Option

WithTransport - Sets crawl HTTP transport.

func WithUserAgent

func WithUserAgent(ua string) Option

WithUserAgent - Sets crawl default user-agent.

type Queue

type Queue interface {
	// Get - Gets a job from the queue channel.
	// Returns io.EOF if the queue is empty.
	Get() (Job, error)

	// Schedule - Schedules a Request.
	// Returns io.ErrClosedPipe if queue is closed.
	Schedule(context.Context, *Request) error

	// Close - Closes the queue.
	Close() error
}

Queue - Requests queue.

func NewQueue

func NewQueue(capacity int) Queue

NewQueue - Makes a new in-memory queue. The capacity argument is the capacity of the requests channel.
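
A sketch of wiring an explicit queue; the constructor name is illustrative:

// newCrawler injects an explicit in-memory queue; per the option docs
// above, crawl.New(crawl.WithQueueCapacity(500)) would be equivalent.
func newCrawler() crawl.Crawler {
	return crawl.New(crawl.WithQueue(crawl.NewQueue(500)))
}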

type Request

type Request struct {
	// URL - Can be an absolute URL or, if Referer is set, a URL relative to it.
	URL string `json:"url,omitempty"`
	// Method - "GET" by default.
	Method string `json:"method,omitempty"`
	// Referer - Request referer.
	Referer string `json:"referer,omitempty"`
	// Form - Form values to be sent as the request body.
	Form url.Values `json:"form,omitempty"`
	// Query - Values to be set as the URL query.
	Query url.Values `json:"query,omitempty"`
	// Cookies - Request cookies.
	Cookies url.Values `json:"cookies,omitempty"`
	// Header - Header values.
	Header map[string]string `json:"header,omitempty"`
	// Raw - When false, an HTML response is expected.
	Raw bool `json:"raw,omitempty"`
	// Callbacks - Crawl callback list.
	Callbacks []string `json:"callbacks,omitempty"`
}

Request - HTTP Request. Multipart form is not implemented.
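
A sketch of a form-submitting request (imports context and net/url); the URL, field, and callback names are assumptions:

// scheduleSearch submits a hypothetical search form.
func scheduleSearch(c crawl.Crawler, query string) error {
	return c.Schedule(context.Background(), &crawl.Request{
		URL:       "http://example.com/search",
		Method:    "POST",
		Form:      url.Values{"q": {query}},
		Callbacks: crawl.Callbacks("search.results"),
	})
}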

func (*Request) GetMethod

func (req *Request) GetMethod() string

GetMethod - Returns request Method. If empty returns "GET".

func (*Request) ParseURL

func (req *Request) ParseURL() (u *url.URL, err error)

ParseURL - Parses the request URL. If the request Referer is set, the parsed URL is resolved with reference to it.

func (*Request) String

func (req *Request) String() string

String - Returns "{method} {url}" formatted string.

type RequestError

type RequestError struct {
	*Request
	Err error
}

RequestError - Crawl error.

func (*RequestError) Error

func (err *RequestError) Error() string

Error - Returns request error message.

type Response

type Response struct {
	*Request
	*http.Response
	// contains filtered or unexported fields
}

Response - Crawl HTTP response. It is expected to be an HTML response, but this is not required. It ALWAYS has to be released using the Close() method.
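
A sketch of the typical lifecycle inside a handler, using fields promoted from the embedded *http.Response; the "h1" selector is an assumption:

// handle checks the status, extracts some text, and releases the body.
func handle(ctx context.Context, resp *crawl.Response) error {
	defer resp.Close() // the response must ALWAYS be released
	if resp.StatusCode != http.StatusOK { // StatusCode promoted from *http.Response
		return fmt.Errorf("unexpected status: %s", resp.Status())
	}
	log.Println(crawl.Text(resp, "h1"))
	return nil
}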

func (*Response) Bytes

func (r *Response) Bytes() (body []byte, err error)

Bytes - Reads the response body and returns it as a byte slice.

func (*Response) Close

func (r *Response) Close() error

Close - Closes response body.

func (*Response) Find

func (r *Response) Find(selector string) *goquery.Selection

Find - Short for: r.Query().Find(selector).

func (*Response) ParseHTML

func (r *Response) ParseHTML() (err error)

ParseHTML - Reads response body and parses it as HTML.

func (*Response) Query

func (r *Response) Query() *goquery.Document

Query - Returns goquery.Document.

func (*Response) Status

func (r *Response) Status() string

Status - Gets response status.

func (*Response) URL

func (r *Response) URL() *url.URL

URL - Gets response request URL.

Directories

Path Synopsis
examples
imdb
This is only an example; please don't harm imdb's servers. If you need movie data, check out http://www.imdb.com/interfaces. I can also recommend the source code of https://github.com/BurntSushi/goim, which implements importing the data into SQL databases and comes with a command-line search tool.
imdb/spider
Package spider implements the imdb spider.
forms
Package forms implements helpers for filling forms.
nsq
consumer
Package consumer implements a command-line crawl consumer from nsq.
