Documentation ¶
Overview ¶
Installation:
go get -u github.com/celrenheit/spider
This package is built around spiders and the contexts passed between them.
ctx, err := spider.Setup(nil)
err = spider.Spin(ctx)
If you have many spiders, you can use a scheduler. This package provides a basic one.
scheduler := spider.NewScheduler()
scheduler.Add(schedule.Every(20 * time.Second), spider1)
scheduler.Add(schedule.Every(20 * time.Second), spider2)
scheduler.Start()
This will launch both spiders every 20 seconds.
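To make the idea of a fixed-interval schedule concrete, here is a minimal standard-library sketch of how successive run times could be computed. The `everySchedule` type, its `Next` method and the `offsetsInSeconds` helper are hypothetical names introduced for illustration; they are not taken from this package, whose `Schedule` type may work differently.

```go
package main

import (
	"fmt"
	"time"
)

// everySchedule is a hypothetical fixed-interval schedule.
// It is an illustration only, not the package's own type.
type everySchedule struct {
	interval time.Duration
}

// Next returns the run time that follows t.
func (e everySchedule) Next(t time.Time) time.Time {
	return t.Add(e.interval)
}

// offsetsInSeconds returns how many seconds after the start each of the
// first n runs would happen for a schedule firing every intervalSeconds.
func offsetsInSeconds(intervalSeconds, n int) []int {
	s := everySchedule{interval: time.Duration(intervalSeconds) * time.Second}
	start := time.Date(2020, 1, 1, 0, 0, 0, 0, time.UTC)
	offsets := make([]int, 0, n)
	next := start
	for i := 0; i < n; i++ {
		next = s.Next(next)
		offsets = append(offsets, int(next.Sub(start)/time.Second))
	}
	return offsets
}

func main() {
	fmt.Println(offsetsInSeconds(20, 3)) // prints [20 40 60]
}
```

A scheduler built on this idea would keep the next run time for each spider and launch whichever one is due first.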
You can create your own spider by implementing the Spider interface:
package main

import (
	"fmt"
	"log"

	"github.com/celrenheit/spider"
)

func main() {
	wikiSpider := &WikipediaHTMLSpider{
		Title: "Albert Einstein",
	}
	// Setup builds the root context; Spin does the actual work.
	ctx, err := wikiSpider.Setup(nil)
	if err != nil {
		log.Fatal(err)
	}
	if err := wikiSpider.Spin(ctx); err != nil {
		log.Fatal(err)
	}
}

type WikipediaHTMLSpider struct {
	Title string
}

// Setup creates an HTTP context targeting the article page.
func (w *WikipediaHTMLSpider) Setup(ctx *spider.Context) (*spider.Context, error) {
	url := fmt.Sprintf("https://en.wikipedia.org/wiki/%s", w.Title)
	return spider.NewHTTPContext("GET", url, nil)
}

// Spin fetches the page and prints the first paragraph of the article.
func (w *WikipediaHTMLSpider) Spin(ctx *spider.Context) error {
	if _, err := ctx.DoRequest(); err != nil {
		return err
	}

	html, err := ctx.HTMLParser()
	if err != nil {
		return err
	}
	summary := html.Find("#mw-content-text p").First().Text()

	fmt.Println(summary)
	return nil
}
Index ¶
- Variables
- func Add(sched Schedule, spider Spider)
- func AddFunc(sched Schedule, url string, fn func(*Context) error)
- func Delete(url string, fn spinFunc) *spiderFunc
- func Get(url string, fn spinFunc) *spiderFunc
- func NewHTTPSpider(method, url string, body io.Reader, fn spinFunc) *spiderFunc
- func NewKVStore() *store
- func Post(url string, body io.Reader, fn spinFunc) *spiderFunc
- func Put(url string, body io.Reader, fn spinFunc) *spiderFunc
- func Start()
- func Stop()
- type BackoffCondition
- type Context
- func (c *Context) Cookies() []*http.Cookie
- func (c *Context) DoRequest() (*http.Response, error)
- func (c *Context) DoRequestWithExponentialBackOff(condition BackoffCondition, b backoff.BackOff) (*http.Response, error)
- func (c *Context) ExtendWithRequest(ctx Context, r *http.Request) *Context
- func (c *Context) Get(key string) interface{}
- func (c *Context) HTMLParser() (*goquery.Document, error)
- func (c *Context) JSONParser() (*simplejson.Json, error)
- func (c *Context) NewClient() (*http.Client, error)
- func (c *Context) NewCookieJar() (*cookiejar.Jar, error)
- func (c *Context) RAWContent() ([]byte, error)
- func (c *Context) Request() *http.Request
- func (c *Context) ResetClient() (*http.Client, error)
- func (c *Context) ResetCookies() error
- func (c *Context) Response() *http.Response
- func (c *Context) Set(key string, value interface{})
- func (c *Context) SetParent(parent *Context)
- func (c *Context) SetRequest(req *http.Request)
- func (c *Context) SetResponse(res *http.Response)
- type Entries
- type Entry
- type InMemory
- type Schedule
- type Spider
Constants ¶
This section is empty.
Variables ¶
var (
	ErrNoClient  = errors.New("No request has been set")
	ErrNoRequest = errors.New("No request has been set")
)
Functions ¶
func AddFunc ¶ added in v0.3.0
AddFunc adds a spider to the standard scheduler using a URL and a closure.
func Delete ¶ added in v0.3.0
func Delete(url string, fn spinFunc) *spiderFunc
Delete returns a new DELETE HTTP Spider.
func Get ¶ added in v0.3.0
func Get(url string, fn spinFunc) *spiderFunc
Get returns a new GET HTTP Spider.
func NewHTTPSpider ¶ added in v0.3.0
NewHTTPSpider creates a new spider from the HTTP method, URL and body. The last argument is a closure that does the actual work.
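The pattern of pairing a request description with a closure can be sketched with the standard library alone. Everything below is an assumption made for illustration: the local `spinFunc` and `funcSpider` types, the lower-case `newHTTPSpider`, `get`, `del` and `post` constructors, and the `methodOf` helper are hypothetical and do not reproduce the package's unexported types or its closure signature (which takes a `*Context`).

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"strings"
)

// spinFunc is a hypothetical closure type that receives the prepared request.
type spinFunc func(req *http.Request) error

// funcSpider is a hypothetical adapter pairing request parameters with a closure.
type funcSpider struct {
	method, url string
	body        io.Reader
	fn          spinFunc
}

// newHTTPSpider stores the method, URL, body and closure for later use.
func newHTTPSpider(method, url string, body io.Reader, fn spinFunc) *funcSpider {
	return &funcSpider{method: method, url: url, body: body, fn: fn}
}

// get, del and post are thin wrappers that only fix the HTTP method.
func get(url string, fn spinFunc) *funcSpider {
	return newHTTPSpider(http.MethodGet, url, nil, fn)
}

func del(url string, fn spinFunc) *funcSpider {
	return newHTTPSpider(http.MethodDelete, url, nil, fn)
}

func post(url string, body io.Reader, fn spinFunc) *funcSpider {
	return newHTTPSpider(http.MethodPost, url, body, fn)
}

// Spin builds the request and hands it to the closure. No network call is made.
func (s *funcSpider) Spin() error {
	req, err := http.NewRequest(s.method, s.url, s.body)
	if err != nil {
		return err
	}
	return s.fn(req)
}

// methodOf spins a spider built by mk and reports the method its closure saw.
func methodOf(mk func(url string, fn spinFunc) *funcSpider) (string, error) {
	var seen string
	s := mk("https://example.com/", func(req *http.Request) error {
		seen = req.Method
		return nil
	})
	err := s.Spin()
	return seen, err
}

func main() {
	m, err := methodOf(get)
	fmt.Println(m, err)

	p := post("https://example.com/", strings.NewReader("a=1"), func(req *http.Request) error {
		fmt.Println(req.Method, req.URL.Host)
		return nil
	})
	if err := p.Spin(); err != nil {
		fmt.Println("spin failed:", err)
	}
}
```

The design choice illustrated here is that the method-specific constructors carry no logic of their own; the closure is the only place where work happens.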
Types ¶
type BackoffCondition ¶
func ErrorIfStatusCodeIsNot ¶
func ErrorIfStatusCodeIsNot(status int) BackoffCondition
type Context ¶
type Context struct {
	Client   *http.Client
	Parent   *Context
	Children []*Context
	// contains filtered or unexported fields
}
Context is the element that can be shared across different spiders. It contains an HTTP Client and an HTTP Request. Context can execute an HTTP Request.
func NewHTTPContext ¶ added in v0.3.0
NewHTTPContext returns a new Context.
It creates a new http.Client and a new http.Request with the provided arguments.
func (*Context) DoRequest ¶
DoRequest makes an http request using the http.Client and http.Request associated with this context.
This will store the response in this context. To access the response you should do:
ctx.Response() // to get the http.Response
func (*Context) DoRequestWithExponentialBackOff ¶
func (c *Context) DoRequestWithExponentialBackOff(condition BackoffCondition, b backoff.BackOff) (*http.Response, error)
DoRequestWithExponentialBackOff makes an http request using the http.Client and http.Request associated with this context. You can pass a condition and a BackOff configuration. See https://github.com/cenkalti/backoff to know more about backoff. If no BackOff is provided it will use the default exponential BackOff configuration. See also ErrorIfStatusCodeIsNot function that provides a basic condition based on status code.
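As a rough illustration of retrying a request with exponential backoff under a status-code condition, the sketch below uses only the standard library. It is not this package's implementation and does not use the cenkalti/backoff package; `condition`, `errorIfStatusIsNot`, `doWithBackoff`, `flakyTransport` and `attemptsNeeded` are hypothetical names, and the fake transport stands in for a real server so that no network access is needed.

```go
package main

import (
	"errors"
	"fmt"
	"net/http"
	"time"
)

// condition returns an error when the response should be retried.
type condition func(*http.Response) error

// errorIfStatusIsNot builds a condition that accepts only the given status.
func errorIfStatusIsNot(status int) condition {
	return func(res *http.Response) error {
		if res.StatusCode != status {
			return fmt.Errorf("unexpected status %d", res.StatusCode)
		}
		return nil
	}
}

// doWithBackoff retries a GET, doubling the delay after each failed attempt.
// It returns the response, the number of attempts made, and the last error.
func doWithBackoff(client *http.Client, url string, cond condition, initial time.Duration, maxTries int) (*http.Response, int, error) {
	if maxTries < 1 {
		return nil, 0, errors.New("maxTries must be at least 1")
	}
	delay := initial
	var lastErr error
	for attempt := 1; attempt <= maxTries; attempt++ {
		res, err := client.Get(url)
		if err == nil {
			if err = cond(res); err == nil {
				return res, attempt, nil
			}
			res.Body.Close()
		}
		lastErr = err
		if attempt < maxTries {
			time.Sleep(delay)
			delay *= 2
		}
	}
	return nil, maxTries, lastErr
}

// flakyTransport is a fake transport that answers 503 for the first
// `failures` calls and 200 afterwards.
type flakyTransport struct {
	failures, calls int
}

func (t *flakyTransport) RoundTrip(req *http.Request) (*http.Response, error) {
	t.calls++
	status := http.StatusOK
	if t.calls <= t.failures {
		status = http.StatusServiceUnavailable
	}
	return &http.Response{
		StatusCode: status,
		Header:     make(http.Header),
		Body:       http.NoBody,
		Request:    req,
	}, nil
}

// attemptsNeeded reports how many attempts were made against the fake transport.
func attemptsNeeded(failures, maxTries int) (int, error) {
	client := &http.Client{Transport: &flakyTransport{failures: failures}}
	res, attempts, err := doWithBackoff(client, "http://example.invalid/",
		errorIfStatusIsNot(http.StatusOK), time.Millisecond, maxTries)
	if err != nil {
		return attempts, err
	}
	res.Body.Close()
	return attempts, nil
}

func main() {
	fmt.Println(attemptsNeeded(2, 5)) // prints 3 <nil>
}
```

With two initial failures and up to five tries, the third attempt succeeds; when the failures outnumber the allowed tries, the last condition error is returned instead.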
func (*Context) ExtendWithRequest ¶
ExtendWithRequest returns a new Context that is a child of the provided context and is associated with the provided http.Request.
func (*Context) HTMLParser ¶
HTMLParser returns an HTML parser.
It uses PuerkitoBio's goquery package, which can be found at this URL: https://github.com/PuerkitoBio/goquery.
func (*Context) JSONParser ¶
JSONParser returns a JSON parser.
It uses Bitly's go-simplejson package, which can be found at: https://github.com/bitly/go-simplejson
func (*Context) NewCookieJar ¶
NewCookieJar creates a new *cookiejar.Jar.
func (*Context) RAWContent ¶
RAWContent returns the raw data of the response's body.
func (*Context) ResetClient ¶
ResetClient creates a new http.Client and replaces the existing one if there is one.
func (*Context) ResetCookies ¶
ResetCookies creates a new cookie jar.
Note: All previously stored cookies will be deleted.
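The effect of swapping in a fresh cookie jar can be shown with the standard library's `net/http/cookiejar`. This is a sketch of the general technique, not of how this package resets cookies internally; the `cookieCountAfterReset` helper and the example.com URL are assumptions made for illustration.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/cookiejar"
	"net/url"
)

// cookieCountAfterReset stores one cookie in a client's jar, replaces the jar
// with a fresh one, and reports the cookie count before and after the swap.
func cookieCountAfterReset() (before, after int, err error) {
	u, err := url.Parse("https://example.com/")
	if err != nil {
		return 0, 0, err
	}

	jar, err := cookiejar.New(nil)
	if err != nil {
		return 0, 0, err
	}
	client := &http.Client{Jar: jar}
	client.Jar.SetCookies(u, []*http.Cookie{{Name: "session", Value: "abc"}})
	before = len(client.Jar.Cookies(u))

	// Replacing the jar discards every cookie held by the old one.
	fresh, err := cookiejar.New(nil)
	if err != nil {
		return before, 0, err
	}
	client.Jar = fresh
	after = len(client.Jar.Cookies(u))
	return before, after, nil
}

func main() {
	before, after, err := cookieCountAfterReset()
	fmt.Println(before, after, err) // prints 1 0 <nil>
}
```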
func (*Context) SetParent ¶
SetParent sets a parent context for the current context. It also adds the current context to the parent's list of children.
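The two-way link described here (the child points at its parent, and the parent lists the child) can be sketched as follows. The `node` type is a hypothetical stand-in for Context that reuses only the `Parent` and `Children` field names shown in the struct definition; the package's real method may do more.

```go
package main

import "fmt"

// node is a hypothetical stand-in for Context with the same link fields.
type node struct {
	Name     string
	Parent   *node
	Children []*node
}

// SetParent records the parent on the child and appends the child to the
// parent's list of children.
func (n *node) SetParent(parent *node) {
	n.Parent = parent
	parent.Children = append(parent.Children, n)
}

func main() {
	root := &node{Name: "root"}
	child := &node{Name: "child"}
	child.SetParent(root)

	fmt.Println(child.Parent.Name, len(root.Children)) // prints root 1
}
```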
func (*Context) SetRequest ¶
SetRequest sets an http.Request.
func (*Context) SetResponse ¶
SetResponse sets an http.Response.
type Entries ¶ added in v0.3.0
type Entries []*Entry
Entries is a collection of Entry. Sortable by time.
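Sorting a slice of entries by time is typically done by implementing `sort.Interface`. The sketch below shows that technique with hypothetical local types `entry` and `entries` and a hypothetical `orderByNextRun` helper; the field names and the package's actual sorting code are assumptions, not taken from the source.

```go
package main

import (
	"fmt"
	"sort"
	"time"
)

// entry is a hypothetical stand-in holding a name and its next run time.
type entry struct {
	Name string
	Next time.Time
}

// entries implements sort.Interface, ordering by the Next field.
type entries []*entry

func (e entries) Len() int           { return len(e) }
func (e entries) Less(i, j int) bool { return e[i].Next.Before(e[j].Next) }
func (e entries) Swap(i, j int)      { e[i], e[j] = e[j], e[i] }

// orderByNextRun builds entries due offsetsSeconds[i] seconds after a fixed
// base time and returns their names, earliest first.
// Both slices are expected to have the same length.
func orderByNextRun(names []string, offsetsSeconds []int) []string {
	base := time.Date(2020, 1, 1, 0, 0, 0, 0, time.UTC)
	list := make(entries, 0, len(names))
	for i, name := range names {
		next := base.Add(time.Duration(offsetsSeconds[i]) * time.Second)
		list = append(list, &entry{Name: name, Next: next})
	}
	sort.Sort(list)

	ordered := make([]string, 0, len(list))
	for _, e := range list {
		ordered = append(ordered, e.Name)
	}
	return ordered
}

func main() {
	fmt.Println(orderByNextRun([]string{"a", "b", "c"}, []int{30, 10, 20})) // prints [b c a]
}
```

Keeping the collection sorted this way lets a scheduler look only at the first element to find the spider that is due next.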
type Entry ¶ added in v0.3.0
Entry groups a spider, its root context, a Schedule and the next time the spider must be launched.
type InMemory ¶ added in v0.3.0
type InMemory struct {
// contains filtered or unexported fields
}
InMemory is the default scheduler.
func NewScheduler ¶ added in v0.3.0
func NewScheduler() *InMemory
NewScheduler returns a new InMemory scheduler.
func (*InMemory) AddFunc ¶ added in v0.3.0
AddFunc adds a spider using a URL and a closure. It uses the GET HTTP method by default.
func (*InMemory) AddWithCtx ¶ added in v0.3.0
AddWithCtx adds a spider with a root Context passed in the arguments.