Documentation
¶
Index ¶
- type Context
- type DefaultDownloader
- type Downloader
- type Engine
- type FIFOScheduler
- type Items
- type Option
- func MaxCrawlingDepth(depth int) Option
- func SetConcurrency(num int) Option
- func UseDownloader(d Downloader) Option
- func UseLogger(lg logger.Logger) Option
- func UseScheduler(s Scheduler) Option
- func WithDelay(delay time.Duration) Option
- func WithRequestMiddlewares(middlewares ...RequestHandleFunc) Option
- func WithResponseMiddlewares(middlewares ...ResponseHandleFunc) Option
- type Pipeline
- type Request
- type RequestHandleFunc
- type Response
- type ResponseHandleFunc
- type Scheduler
- type Spider
- type StringMatcher
- type URLMatcher
- type URLRegExpMatcher
- type WeightedScheduler
- func (sched *WeightedScheduler) HasMore() bool
- func (sched *WeightedScheduler) Len() int
- func (sched *WeightedScheduler) Less(i, j int) bool
- func (sched *WeightedScheduler) Pop() interface{}
- func (sched *WeightedScheduler) PopRequest() (req *Request, ok bool)
- func (sched *WeightedScheduler) Push(x interface{})
- func (sched *WeightedScheduler) PushRequest(req *Request) (ok bool)
- func (sched *WeightedScheduler) Start() error
- func (sched *WeightedScheduler) Stop() error
- func (sched *WeightedScheduler) Swap(i, j int)
Constants ¶
This section is empty.
Variables ¶
This section is empty.
Functions ¶
This section is empty.
Types ¶
type Context ¶
Context represents the scraping and crawling context
type DefaultDownloader ¶
type DefaultDownloader struct {
// contains filtered or unexported fields
}
DefaultDownloader is a simple Downloader implementation.
func (*DefaultDownloader) Download ¶
func (dd *DefaultDownloader) Download(req *Request) (*Response, error)
Download sends an HTTP request and uses goquery to build the HTML document from the response.
func (*DefaultDownloader) SetHTTPClient ¶
func (dd *DefaultDownloader) SetHTTPClient(client *http.Client)
SetHTTPClient sets the HTTP client used to fetch pages.
type Downloader ¶
Downloader is an interface representing the ability to download data from the internet. It is responsible for fetching web pages; the downloaded response is taken over by the engine and, in turn, fed to spiders.
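A custom Downloader can be installed with the UseDownloader option. The interface body is not reproduced in this index; the sketch below assumes it declares the same Download method as DefaultDownloader, so treat it as illustrative rather than authoritative. The goscrapy package qualifier and the wrapped downloader are placeholders.

// Minimal sketch of a custom Downloader, assuming the interface declares
// Download(req *Request) (*Response, error), mirroring DefaultDownloader.
type loggingDownloader struct {
	inner goscrapy.Downloader // the downloader actually doing the work
}

func (d *loggingDownloader) Download(req *goscrapy.Request) (*goscrapy.Response, error) {
	log.Printf("downloading %s %s", req.Method, req.URL) // uses the standard "log" package
	return d.inner.Download(req)
}

// Installed via the UseDownloader option:
//   goscrapy.UseDownloader(&loggingDownloader{inner: someDownloader})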
type Engine ¶
type Engine struct {
// contains filtered or unexported fields
}
Engine represents the scraping engine. It is responsible for managing the data flow among the scheduler, downloader and spiders.
func (*Engine) RegisterPipelines ¶
RegisterPipelines registers pipelines with the engine.
func (*Engine) RegisterSipders ¶
RegisterSipders adds working spiders to the engine.
type FIFOScheduler ¶
type FIFOScheduler struct {
// contains filtered or unexported fields
}
FIFOScheduler is the default scheduler implementation.
func NewFIFOScheduler ¶
func NewFIFOScheduler() *FIFOScheduler
NewFIFOScheduler creates a new FIFO scheduler with a queue size of 100.
func (*FIFOScheduler) HasMore ¶
func (ds *FIFOScheduler) HasMore() bool
HasMore returns true if there are more requests to be scheduled.
func (*FIFOScheduler) PopRequest ¶
func (ds *FIFOScheduler) PopRequest() (req *Request, hadMore bool)
PopRequest returns the next request.
func (*FIFOScheduler) PushRequest ¶
func (ds *FIFOScheduler) PushRequest(req *Request) (ok bool)
PushRequest adds a request to the scheduler.
type Items ¶
Items holds the structured data (items) extracted by spiders and consumed by pipelines.
type Option ¶
type Option func(e *Engine)
Option is a functional option for configuring an Engine.
func MaxCrawlingDepth ¶
MaxCrawlingDepth returns an Option that sets the maximum crawling depth. The engine will drop Requests whose current depth exceeds the limit.
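The option functions combine naturally in a functional-options constructor. No engine constructor is listed in this index, so the NewEngine name below is hypothetical; the Option calls themselves match the functions shown above (time is the standard library package).

e := goscrapy.NewEngine( // hypothetical constructor name
	goscrapy.MaxCrawlingDepth(3),             // drop requests deeper than 3
	goscrapy.SetConcurrency(8),               // download with 8 workers
	goscrapy.WithDelay(500*time.Millisecond), // politeness delay between requests
	goscrapy.UseScheduler(goscrapy.NewWeightedScheduler()),
)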
func WithRequestMiddlewares ¶
func WithRequestMiddlewares(middlewares ...RequestHandleFunc) Option
WithRequestMiddlewares registers request middlewares. Requests are processed by the request middlewares just before being passed to the downloader.
To abort a request inside a middleware, use Request.Abort().
For example:
func ReqMiddleware(req *goscrapy.Request) error {
	if req.URL == "http://www.example.com" {
		req.Abort()
		return nil
	}
	return nil
}
func WithResponseMiddlewares ¶
func WithResponseMiddlewares(middlewares ...ResponseHandleFunc) Option
WithResponseMiddlewares registers response middlewares. Responses are processed by the response middlewares right after the downloader finishes downloading and hands the response over to the engine.
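The signature of ResponseHandleFunc is not reproduced in this index; by analogy with the request middleware example above, the sketch below assumes func(*Response) error.

// Assumed signature: func(resp *goscrapy.Response) error.
func RespMiddleware(resp *goscrapy.Response) error {
	if resp.StatusCode >= 500 {
		return fmt.Errorf("server error for %s: %s", resp.Request.URL, resp.Status)
	}
	return nil
}

// Registered with:
//   goscrapy.WithResponseMiddlewares(RespMiddleware)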
type Pipeline ¶
type Pipeline interface {
	Name() string              // returns the pipeline's name
	ItemList() []string        // returns the names of all items this pipeline cares about
	Handle(items *Items) error
}
Pipeline processes items produced by spiders.
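A minimal Pipeline sketch. The accessors on Items are not documented in this index, so Handle only acknowledges what it receives; the item name "article" is an illustrative placeholder.

type consolePipeline struct{}

// Name returns the pipeline's name.
func (p *consolePipeline) Name() string { return "console" }

// ItemList declares which item names this pipeline cares about ("article" is made up).
func (p *consolePipeline) ItemList() []string { return []string{"article"} }

// Handle receives scraped items; Items' accessors aren't shown here, so just log receipt.
func (p *consolePipeline) Handle(items *goscrapy.Items) error {
	log.Println("console pipeline received items")
	return nil
}

// Registered with (*Engine).RegisterPipelines.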
type Request ¶
type Request struct {
	Method string      `json:"method,omitempty"`
	URL    string      `json:"url,omitempty"`
	Header http.Header `json:"header,omitempty"`
	Query  url.Values  `json:"query,omitempty"`

	// Weight is used to decide the scheduling sequence. It only means something when using a
	// scheduler that schedules requests based on request weight.
	Weight int
	// contains filtered or unexported fields
}
Request represents a crawling request.
func (*Request) Abort ¶
func (r *Request) Abort()
Abort aborts the current request. You can use it in your request middlewares to make sure certain requests will not be handled by the downloader and spiders.
func (*Request) ContextValue ¶
ContextValue returns the value associated with this request for key, or nil if no value is associated with key.
func (*Request) WithContextValue ¶
WithContextValue stores a value in the request, associated with the given key.
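The parameter and return types of WithContextValue and ContextValue are not reproduced in this index; the round-trip below assumes context.Context-style keys and interface{} values.

req := &goscrapy.Request{Method: "GET", URL: "https://example.com"} // placeholder request
req.WithContextValue("retries", 2)                                  // assumed (key, value) signature

if v := req.ContextValue("retries"); v != nil {
	retries := v.(int) // value is assumed to come back as an empty interface
	_ = retries
}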
type RequestHandleFunc ¶
RequestHandleFunc is a request handler function used as request middleware.
type Response ¶
type Response struct {
	Status     string `json:"status,omitempty"`      // e.g. "200 OK"
	StatusCode int    `json:"status_code,omitempty"` // e.g. 200

	// Request represents the request that was sent to obtain this response.
	Request *Request `json:"request,omitempty"`

	// Document represents an HTML document to be manipulated.
	Document *goquery.Document `json:"-"`

	// Body represents the response body.
	Body []byte `json:"-"`

	// ContentLength records the length of the associated content. For more details see http.Response.
	ContentLength int64 `json:"content_length,omitempty"`

	// Header represents the response header, mapping header keys to values.
	Header http.Header `json:"header,omitempty"`
}
Response represents a crawling response.
type ResponseHandleFunc ¶
ResponseHandleFunc is a response handler function used as response middleware.
type Scheduler ¶
type Scheduler interface {
	Start() error                             // start the scheduler
	Stop() error                              // stop the scheduler
	PopRequest() (req *Request, hasMore bool) // return the next request from the scheduler
	PushRequest(req *Request) (ok bool)       // add a new request to the scheduler
	HasMore() bool                            // returns true if there are more requests to be scheduled
}
Scheduler is responsible for managing all scraping and crawling requests.
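The interface is small enough to implement directly. Below is a toy LIFO scheduler sketch that satisfies the method set shown above; it is illustrative only (it uses the standard sync package and ignores backpressure and shutdown concerns).

// lifoScheduler pops the most recently pushed request first.
type lifoScheduler struct {
	mu    sync.Mutex
	stack []*goscrapy.Request
}

func (s *lifoScheduler) Start() error { return nil } // nothing to start
func (s *lifoScheduler) Stop() error  { return nil } // nothing to stop

func (s *lifoScheduler) PushRequest(req *goscrapy.Request) (ok bool) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.stack = append(s.stack, req)
	return true
}

func (s *lifoScheduler) PopRequest() (req *goscrapy.Request, hasMore bool) {
	s.mu.Lock()
	defer s.mu.Unlock()
	if len(s.stack) == 0 {
		return nil, false
	}
	req = s.stack[len(s.stack)-1]
	s.stack = s.stack[:len(s.stack)-1]
	return req, true
}

func (s *lifoScheduler) HasMore() bool {
	s.mu.Lock()
	defer s.mu.Unlock()
	return len(s.stack) > 0
}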
type Spider ¶
type Spider interface {
	Name() string
	StartRequests() []*Request
	URLMatcher() URLMatcher
	Parse(ctx *Context) (*Items, []*Request, error)
}
Spider is an interface that defines how a certain site (or a group of sites) will be scraped, including how to perform the crawl (i.e. follow links) and how to extract structured data from their pages (i.e. scraping items). In other words, Spiders are the place where you define the custom behavior for crawling and parsing pages for a particular site (or, in some cases, a group of sites). For spiders, the scraping cycle goes through something like this:
- Using initial Requests generated by StartRequests to crawl the first URLs.
- Parsing the response (web page), then returning an items object (structured data) and request objects. Those requests will be added to the scheduler by the goscrapy engine and downloaded by the downloader in the future.
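A compact Spider sketch that ties this cycle together. The fields of Context are not documented in this index, so Parse is left as a stub; the spider name, start URL, and regexp are placeholders.

type quotesSpider struct{}

func (s *quotesSpider) Name() string { return "quotes" }

// StartRequests seeds the engine with the first URLs to crawl.
func (s *quotesSpider) StartRequests() []*goscrapy.Request {
	return []*goscrapy.Request{
		{Method: "GET", URL: "http://quotes.toscrape.com/"},
	}
}

// URLMatcher declares which URLs this spider handles.
func (s *quotesSpider) URLMatcher() goscrapy.URLMatcher {
	return goscrapy.NewRegexpMatcher(`^http://quotes\.toscrape\.com/`)
}

// Parse would normally read the downloaded page from ctx, extract structured
// data into Items, and return follow-up Requests; Context's fields aren't
// shown in this index, so this stub returns nothing.
func (s *quotesSpider) Parse(ctx *goscrapy.Context) (*goscrapy.Items, []*goscrapy.Request, error) {
	return nil, nil, nil
}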
type StringMatcher ¶
type StringMatcher struct {
// contains filtered or unexported fields
}
StringMatcher is a static string matcher.
func NewStaticStringMatcher ¶
func NewStaticStringMatcher(str string) *StringMatcher
NewStaticStringMatcher creates a new static string matcher.
func (*StringMatcher) Match ¶
func (m *StringMatcher) Match(url string) bool
Match returns true if url is matched
type URLMatcher ¶
URLMatcher is a URL matcher; a Spider returns one from URLMatcher() to declare which URLs it handles.
type URLRegExpMatcher ¶
type URLRegExpMatcher struct {
// contains filtered or unexported fields
}
URLRegExpMatcher is a regular-expression-based URL matcher.
func NewRegexpMatcher ¶
func NewRegexpMatcher(str string) *URLRegExpMatcher
NewRegexpMatcher creates a new regexp-based URL matcher.
func (*URLRegExpMatcher) Match ¶
func (m *URLRegExpMatcher) Match(url string) bool
Match returns true if url is matched
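A small usage sketch of the two matchers; the exact comparison performed by the static matcher is whatever its implementation defines, so only the regexp results are described in the comments.

static := goscrapy.NewStaticStringMatcher("https://example.com/")
re := goscrapy.NewRegexpMatcher(`^https://example\.com/articles/\d+$`)

fmt.Println(static.Match("https://example.com/"))        // static string comparison
fmt.Println(re.Match("https://example.com/articles/42")) // true: URL matches the pattern
fmt.Println(re.Match("https://example.com/about"))       // false: URL does not match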
type WeightedScheduler ¶
type WeightedScheduler struct {
// contains filtered or unexported fields
}
WeightedScheduler is a scheduler that pops requests in order of their Weight, backed by a max-heap.
func NewWeightedScheduler ¶
func NewWeightedScheduler() *WeightedScheduler
NewWeightedScheduler creates a new weighted scheduler implemented on top of a max-heap.
func (*WeightedScheduler) HasMore ¶
func (sched *WeightedScheduler) HasMore() bool
HasMore returns true if the queue has more requests.
func (*WeightedScheduler) Len ¶
func (sched *WeightedScheduler) Len() int
Len returns the number of elements in the collection.
func (*WeightedScheduler) Less ¶
func (sched *WeightedScheduler) Less(i, j int) bool
Less reports whether the element with index i should sort before the element with index j.
func (*WeightedScheduler) Pop ¶
func (sched *WeightedScheduler) Pop() interface{}
Pop removes and returns the element at index Len() - 1. It exists to implement heap.Interface. To pop a request from the scheduler, use PopRequest instead.
func (*WeightedScheduler) PopRequest ¶
func (sched *WeightedScheduler) PopRequest() (req *Request, ok bool)
PopRequest pops the next request from the scheduler.
func (*WeightedScheduler) Push ¶
func (sched *WeightedScheduler) Push(x interface{})
Push pushes a value onto the heap. It exists to implement heap.Interface. To push a request onto the scheduler, use PushRequest instead.
func (*WeightedScheduler) PushRequest ¶
func (sched *WeightedScheduler) PushRequest(req *Request) (ok bool)
PushRequest pushes a request onto the scheduler.
func (*WeightedScheduler) Swap ¶
func (sched *WeightedScheduler) Swap(i, j int)
Swap swaps the elements with indexes i and j.
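A short usage sketch: because the scheduler is backed by a max-heap and Request.Weight decides the scheduling sequence, the higher-weight request is expected to be popped first. Start/Stop error handling is omitted.

sched := goscrapy.NewWeightedScheduler()
_ = sched.Start()
defer sched.Stop()

sched.PushRequest(&goscrapy.Request{URL: "https://example.com/low", Weight: 1})
sched.PushRequest(&goscrapy.Request{URL: "https://example.com/high", Weight: 10})

for sched.HasMore() {
	req, ok := sched.PopRequest()
	if !ok {
		break
	}
	fmt.Println(req.Weight, req.URL) // higher weight first
}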