README

Spider

This package provides a simple, yet extensible, way to scrape HTML and JSON pages. It uses spiders, scheduled at configurable intervals, to fetch data from around the web. It is written in Go and is MIT licensed.

You can see an example app using this package here: https://github.com/celrenheit/trending-machine

Installation

$ go get -u github.com/celrenheit/spider

Usage

package main

import (
	"fmt"
	"time"

	"github.com/celrenheit/spider"
	"github.com/celrenheit/spider/schedule"
)

// LionelMessiSpider scrapes Wikipedia's page for Lionel Messi.
// It is defined below in the init function.
var LionelMessiSpider spider.Spider

func main() {
	// Create a new scheduler
	scheduler := spider.NewScheduler()

	// Register the spider to be scheduled every 15 seconds
	scheduler.Add(schedule.Every(15*time.Second), LionelMessiSpider)
	// Alternatively, you can choose a cron schedule
	// This will run every minute of every day
	scheduler.Add(schedule.Cron("* * * * *"), LionelMessiSpider)

	// Start the scheduler
	scheduler.Start()

	// Exit 65 seconds later to leave time for the requests to complete.
	// Adjust this to your internet connection.
	<-time.After(65 * time.Second)
}

func init() {
	LionelMessiSpider = spider.Get("https://en.wikipedia.org/wiki/Lionel_Messi", func(ctx *spider.Context) error {
		fmt.Println(time.Now())
		// Execute the request
		if _, err := ctx.DoRequest(); err != nil {
			return err
		}

		// Get goquery's html parser
		htmlparser, err := ctx.HTMLParser()
		if err != nil {
			return err
		}
		// Get the first paragraph of the wikipedia page
		summary := htmlparser.Find("#mw-content-text > p").First().Text()

		fmt.Println(summary)
		return nil
	})
}

To create your own spiders, implement the spider.Spider interface. It has two methods, Setup and Spin.

Setup receives a Context and returns a new Context, along with an error if something went wrong. This is usually where you create the HTTP client and the HTTP request.

Spin receives a Context and does the actual work (performing the request, handling the response, parsing HTML or JSON, and so on). It should return an error if anything goes wrong.
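The call order described above (Setup first, then Spin with the Context that Setup returned) can be sketched without any network access. In this minimal sketch, Context, greetSpider and run are hypothetical stand-ins written for illustration; only the shape of the Spider interface is taken from the package.

```go
package main

import (
	"errors"
	"fmt"
)

// Context is a hypothetical stand-in for spider.Context, reduced to a
// key/value map so the example runs offline.
type Context struct {
	values map[string]string
}

// Spider mirrors the shape of the spider.Spider interface.
type Spider interface {
	Setup(*Context) (*Context, error)
	Spin(*Context) error
}

// greetSpider is a hypothetical spider: Setup prepares a value and
// Spin consumes it.
type greetSpider struct {
	name string
	out  []string
}

// Setup validates the spider and returns a fresh Context.
func (g *greetSpider) Setup(parent *Context) (*Context, error) {
	if g.name == "" {
		return nil, errors.New("greetSpider: empty name")
	}
	return &Context{values: map[string]string{"name": g.name}}, nil
}

// Spin does the work using the Context produced by Setup.
func (g *greetSpider) Spin(ctx *Context) error {
	g.out = append(g.out, "hello "+ctx.values["name"])
	return nil
}

// run drives one spider: Setup first, then Spin, stopping at the first error.
func run(s Spider) error {
	ctx, err := s.Setup(nil)
	if err != nil {
		return fmt.Errorf("setup: %w", err)
	}
	if err := s.Spin(ctx); err != nil {
		return fmt.Errorf("spin: %w", err)
	}
	return nil
}

func main() {
	s := &greetSpider{name: "wiki"}
	if err := run(s); err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(s.out[0])
}
```

A real spider would build an HTTP request in Setup and parse the response in Spin; the error handling pattern stays the same.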

Documentation

The documentation is hosted on GoDoc.

Examples

$ cd $GOPATH/src/github.com/celrenheit/spider/examples
$ go run wiki.go

Contributing

Contributions are welcome! Feel free to submit a pull request. Improving the documentation and examples is a good place to start. You can also contribute spiders and better schedulers.

If you have developed your own spiders or schedulers, I will be pleased to review your code and eventually merge it into the project.

License

MIT License

Inspiration

Dkron, for the new in-memory scheduler (as of 0.3)

Documentation

Overview

    Installation:

    go get -u github.com/celrenheit/spider
    

    Using this package revolves around spiders and the contexts passed to them.

    ctx, err := spider.Setup(nil)
    err = spider.Spin(ctx)
    

    If you have many spiders, you can use a scheduler. This package provides a basic one.

    scheduler := spider.NewScheduler()
    
    scheduler.Add(schedule.Every(20 * time.Second), spider1)
    
    scheduler.Add(schedule.Every(10 * time.Second), spider2)
    
    scheduler.Start()
    

    This launches two spiders: the first every 20 seconds and the second every 10 seconds.

    You can create your own spider by implementing the Spider interface:

    package main
    
    import (
    	"fmt"
    
    	"github.com/celrenheit/spider"
    )
    
    func main() {
    	wikiSpider := &WikipediaHTMLSpider{
    		Title: "Albert Einstein",
    	}
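    	// Setup builds the HTTP context; stop early if it fails.
    	ctx, err := wikiSpider.Setup(nil)
    	if err != nil {
    		fmt.Println(err)
    		return
    	}
    	if err := wikiSpider.Spin(ctx); err != nil {
    		fmt.Println(err)
    	}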
    }
    
    type WikipediaHTMLSpider struct {
    	Title string
    }
    
    func (w *WikipediaHTMLSpider) Setup(ctx *spider.Context) (*spider.Context, error) {
    	url := fmt.Sprintf("https://en.wikipedia.org/wiki/%s", w.Title)
    	return spider.NewHTTPContext("GET", url, nil)
    }
    
    func (w *WikipediaHTMLSpider) Spin(ctx *spider.Context) error {
    	if _, err := ctx.DoRequest(); err != nil {
    		return err
    	}
    
    	html, err := ctx.HTMLParser()
    	if err != nil {
    		return err
    	}
    	// Get the first paragraph of the page
    	summary := html.Find("#mw-content-text p").First().Text()
    
    	fmt.Println(summary)
    	return nil
    }
    

    Variables

    var (
    	ErrNoClient  = errors.New("No request has been set")
    	ErrNoRequest = errors.New("No request has been set")
    )

    Functions

    func Add

    func Add(sched Schedule, spider Spider)

      Add adds a spider to the standard scheduler

    func AddFunc

    func AddFunc(sched Schedule, url string, fn func(*Context) error)

      AddFunc adds a spider to the standard scheduler using a URL and a closure.

    func Delete

    func Delete(url string, fn spinFunc) *spiderFunc

      Delete returns a new DELETE HTTP Spider.

    func Get

    func Get(url string, fn spinFunc) *spiderFunc

      Get returns a new GET HTTP Spider.

    func NewHTTPSpider

    func NewHTTPSpider(method, url string, body io.Reader, fn spinFunc) *spiderFunc

      NewHTTPSpider creates a new spider according to the HTTP method, URL and body. The last argument is a closure that does the actual work.

    func NewKVStore

    func NewKVStore() *store

      NewKVStore returns a new store.

    func Post

    func Post(url string, body io.Reader, fn spinFunc) *spiderFunc

      Post returns a new POST HTTP Spider.

    func Put

    func Put(url string, body io.Reader, fn spinFunc) *spiderFunc

      Put returns a new PUT HTTP Spider.

    func Start

    func Start()

      Start starts the standard scheduler.

    func Stop

    func Stop()

      Stop stops the standard scheduler.

    Types

    type BackoffCondition

    type BackoffCondition func(*http.Response) error

    func ErrorIfStatusCodeIsNot

    func ErrorIfStatusCodeIsNot(status int) BackoffCondition

    type Context

    type Context struct {
    	Client *http.Client
    
    	Parent   *Context
    	Children []*Context
    	// contains filtered or unexported fields
    }

      Context is the element that can be shared across different spiders. It contains an HTTP Client and an HTTP Request. Context can execute an HTTP Request.

    func NewContext

    func NewContext() *Context

      NewContext returns a new Context.

    func NewHTTPContext

    func NewHTTPContext(method, url string, body io.Reader) (*Context, error)

      NewHTTPContext returns a new Context.

      It creates a new http.Client and a new http.Request with the provided arguments.

    func (*Context) Cookies

    func (c *Context) Cookies() []*http.Cookie

      Cookies returns the list of cookies for the given request URL.

    func (*Context) DoRequest

    func (c *Context) DoRequest() (*http.Response, error)

      DoRequest makes an HTTP request using the http.Client and http.Request associated with this context.

      The response is stored in this context. To access it afterwards, call:

      ctx.Response() // to get the http.Response

    func (*Context) DoRequestWithExponentialBackOff

    func (c *Context) DoRequestWithExponentialBackOff(condition BackoffCondition, b backoff.BackOff) (*http.Response, error)

      DoRequestWithExponentialBackOff makes an HTTP request using the http.Client and http.Request associated with this context. You can pass a condition and a BackOff configuration. See https://github.com/cenkalti/backoff to learn more about backoff. If no BackOff is provided, the default exponential BackOff configuration is used. See also the ErrorIfStatusCodeIsNot function, which provides a basic condition based on the status code.

    func (*Context) ExtendWithRequest

    func (c *Context) ExtendWithRequest(ctx Context, r *http.Request) *Context

      ExtendWithRequest returns a new Context that is a child of the provided context and is associated with the provided http.Request.

    func (*Context) Get

    func (c *Context) Get(key string) interface{}

      Get returns a value from this context.

    func (*Context) HTMLParser

    func (c *Context) HTMLParser() (*goquery.Document, error)

      HTMLParser returns an HTML parser.

      It uses PuerkitoBio's awesome goquery package, which can be found at this URL: https://github.com/PuerkitoBio/goquery.

    func (*Context) JSONParser

    func (c *Context) JSONParser() (*simplejson.Json, error)

      JSONParser returns a JSON parser.

      It uses Bitly's go-simplejson package, which can be found at: https://github.com/bitly/go-simplejson

    func (*Context) NewClient

    func (c *Context) NewClient() (*http.Client, error)

      NewClient creates a new http.Client.

    func (*Context) NewCookieJar

    func (c *Context) NewCookieJar() (*cookiejar.Jar, error)

      NewCookieJar creates a new *cookiejar.Jar.

    func (*Context) RAWContent

    func (c *Context) RAWContent() ([]byte, error)

      RAWContent returns the raw data of the response's body.

    func (*Context) Request

    func (c *Context) Request() *http.Request

      Request returns the http.Request.

    func (*Context) ResetClient

    func (c *Context) ResetClient() (*http.Client, error)

      ResetClient creates a new http.Client and replaces the existing one, if any.

    func (*Context) ResetCookies

    func (c *Context) ResetCookies() error

      ResetCookies creates a new cookie jar.

      Note: All previously stored cookies will be deleted.

    func (*Context) Response

    func (c *Context) Response() *http.Response

      Response returns the http.Response.

    func (*Context) Set

    func (c *Context) Set(key string, value interface{})

      Set stores a value in this context.

    func (*Context) SetParent

    func (c *Context) SetParent(parent *Context)

      SetParent sets a parent context for the current context. It also adds the current context to the parent's list of children.
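The two-way link that SetParent describes can be sketched with a small stand-in type. Here node is a hypothetical substitute for Context that keeps only the exported Parent and Children fields; the real method may do more.

```go
package main

import "fmt"

// node is a hypothetical stand-in for Context with only the Parent and
// Children fields shown in the struct definition.
type node struct {
	Name     string
	Parent   *node
	Children []*node
}

// SetParent links n to parent in both directions: n records its parent,
// and the parent records n as one of its children.
func (n *node) SetParent(parent *node) {
	n.Parent = parent
	parent.Children = append(parent.Children, n)
}

func main() {
	root := &node{Name: "root"}
	child := &node{Name: "child"}
	child.SetParent(root)
	fmt.Println(child.Parent.Name, len(root.Children))
}
```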

    func (*Context) SetRequest

    func (c *Context) SetRequest(req *http.Request)

      SetRequest sets the http.Request.

    func (*Context) SetResponse

    func (c *Context) SetResponse(res *http.Response)

      SetResponse sets the http.Response.

    type Entries

    type Entries []*Entry

      Entries is a collection of Entry. Sortable by time.

    func (Entries) Len

    func (e Entries) Len() int

    func (Entries) Less

    func (e Entries) Less(i, j int) bool

    func (Entries) Swap

    func (e Entries) Swap(i, j int)

    type Entry

    type Entry struct {
    	Spider   Spider
    	Schedule Schedule
    	Ctx      *Context
    	Next     time.Time
    }

      Entry groups a spider, its root context, a Schedule and the Next time the spider must be launched.

    type InMemory

    type InMemory struct {
    	// contains filtered or unexported fields
    }

      InMemory is the default scheduler.

    func NewScheduler

    func NewScheduler() *InMemory

      NewScheduler returns a new InMemory scheduler.

    func (*InMemory) Add

    func (in *InMemory) Add(sched Schedule, spider Spider)

      Add adds a spider using a nil root Context.

    func (*InMemory) AddFunc

    func (in *InMemory) AddFunc(sched Schedule, url string, fn func(*Context) error)

      AddFunc adds a spider using a URL and a closure. It uses the GET HTTP method by default.

    func (*InMemory) AddWithCtx

    func (in *InMemory) AddWithCtx(sched Schedule, spider Spider, ctx *Context)

      AddWithCtx adds a spider with the root Context passed in the arguments.

    func (*InMemory) Start

    func (in *InMemory) Start()

      Start launches the scheduler. It runs in its own goroutine, so your code continues to execute after calling this function.

    func (*InMemory) Stop

    func (in *InMemory) Stop()

      Stop stops the scheduler. It should be called after Start.

    type Schedule

    type Schedule interface {
    	Next(time.Time) time.Time
    }

      Schedule is an interface with a single Next method. Next returns the next time the spider should run, given the current time as a parameter.

    type Spider

    type Spider interface {
    	Setup(*Context) (*Context, error)
    	Spin(*Context) error
    }

      Spider is an interface with two methods. It is the primary element of the package.
