spider

package module

Version: v0.3.1 (latest) | Published: Jan 20, 2016 | License: MIT | Imports: 13 | Imported by: 2

Repository

github.com/celrenheit/spider

README ¶

Spider

This package provides a simple yet extensible way to scrape HTML and JSON pages. It uses spiders, scheduled at configurable intervals, to fetch data from the web. It is written in Go and is MIT licensed.

You can see an example app using this package here: https://github.com/celrenheit/trending-machine

Installation

$ go get -u github.com/celrenheit/spider

Usage

package main

import (
	"fmt"
	"time"

	"github.com/celrenheit/spider"
	"github.com/celrenheit/spider/schedule"
)

// LionelMessiSpider scrapes the Wikipedia page for Lionel Messi.
// It is defined below in the init function.
var LionelMessiSpider spider.Spider

func main() {
	// Create a new scheduler
	scheduler := spider.NewScheduler()

	// Register the spider to be scheduled every 15 seconds
	scheduler.Add(schedule.Every(15*time.Second), LionelMessiSpider)
	// Alternatively, you can choose a cron schedule
	// This will run every minute of every day
	scheduler.Add(schedule.Cron("* * * * *"), LionelMessiSpider)

	// Start the scheduler
	scheduler.Start()

	// Exit 65 seconds later to give the scheduled requests time to complete.
	// The time needed depends on your internet connection.
	<-time.After(65 * time.Second)
}

func init() {
	LionelMessiSpider = spider.Get("https://en.wikipedia.org/wiki/Lionel_Messi", func(ctx *spider.Context) error {
		fmt.Println(time.Now())
		// Execute the request
		if _, err := ctx.DoRequest(); err != nil {
			return err
		}

		// Get goquery's html parser
		htmlparser, err := ctx.HTMLParser()
		if err != nil {
			return err
		}
		// Get the first paragraph of the wikipedia page
		summary := htmlparser.Find("#mw-content-text > p").First().Text()

		fmt.Println(summary)
		return nil
	})
}

To create your own spiders, implement the spider.Spider interface. It has two methods, Setup and Spin.

Setup receives a Context and returns a new Context, along with an error if something went wrong. This is usually where you create a new HTTP client and HTTP request.

Spin receives a Context and does the actual work (perform a request, handle the response, parse HTML or JSON, etc.). It should return an error if something went wrong.

Documentation

The documentation is hosted on GoDoc.

Examples

$ cd $GOPATH/src/github.com/celrenheit/spider/examples
$ go run wiki.go

Contributing

Contributions are welcome! Feel free to submit a pull request. Improving the documentation and examples is a good place to start. You can also provide spiders and better schedulers.

If you have developed your own spiders or schedulers, I will be pleased to review your code and eventually merge it into the project.

License

MIT License

Inspiration

Dkron, for the new in-memory scheduler (as of 0.3)

Documentation ¶

Overview ¶

Installation:

go get -u github.com/celrenheit/spider

This package revolves around spiders and the contexts passed between them.

ctx, err := spider.Setup(nil)
err = spider.Spin(ctx)

If you have many spiders, you can use a scheduler. This package provides a basic one.

scheduler := spider.NewScheduler()

scheduler.Add(schedule.Every(20 * time.Second), spider1)

scheduler.Add(schedule.Every(10 * time.Second), spider2)

scheduler.Start()

This launches two spiders: the first every 20 seconds and the second every 10 seconds.

You can create your own spider by implementing the Spider interface:

package main

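import (
	"fmt"
	"log"

	"github.com/celrenheit/spider"
)

func main() {
	wikiSpider := &WikipediaHTMLSpider{
		Title: "Albert Einstein",
	}
	// Setup builds the HTTP context; stop early if it fails.
	ctx, err := wikiSpider.Setup(nil)
	if err != nil {
		log.Fatal(err)
	}
	if err := wikiSpider.Spin(ctx); err != nil {
		log.Fatal(err)
	}
}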

type WikipediaHTMLSpider struct {
	Title string
}

func (w *WikipediaHTMLSpider) Setup(ctx *spider.Context) (*spider.Context, error) {
	url := fmt.Sprintf("https://en.wikipedia.org/wiki/%s", w.Title)
	return spider.NewHTTPContext("GET", url, nil)
}

func (w *WikipediaHTMLSpider) Spin(ctx *spider.Context) error {
	if _, err := ctx.DoRequest(); err != nil {
		return err
	}

	html, err := ctx.HTMLParser()
	if err != nil {
		return err
	}
	summary := html.Find("#mw-content-text p").First().Text()

	fmt.Println(summary)
	return nil
}

Index ¶

Variables
func Add(sched Schedule, spider Spider)
func AddFunc(sched Schedule, url string, fn func(*Context) error)
func Delete(url string, fn spinFunc) *spiderFunc
func Get(url string, fn spinFunc) *spiderFunc
func NewHTTPSpider(method, url string, body io.Reader, fn spinFunc) *spiderFunc
func NewKVStore() *store
func Post(url string, body io.Reader, fn spinFunc) *spiderFunc
func Put(url string, body io.Reader, fn spinFunc) *spiderFunc
func Start()
func Stop()
type BackoffCondition
- func ErrorIfStatusCodeIsNot(status int) BackoffCondition
type Context
- func NewContext() *Context
- func NewHTTPContext(method, url string, body io.Reader) (*Context, error)
- func (c *Context) Cookies() []*http.Cookie
- func (c *Context) DoRequest() (*http.Response, error)
- func (c *Context) DoRequestWithExponentialBackOff(condition BackoffCondition, b backoff.BackOff) (*http.Response, error)
- func (c *Context) ExtendWithRequest(ctx Context, r *http.Request) *Context
- func (c *Context) Get(key string) interface{}
- func (c *Context) HTMLParser() (*goquery.Document, error)
- func (c *Context) JSONParser() (*simplejson.Json, error)
- func (c *Context) NewClient() (*http.Client, error)
- func (c *Context) NewCookieJar() (*cookiejar.Jar, error)
- func (c *Context) RAWContent() ([]byte, error)
- func (c *Context) Request() *http.Request
- func (c *Context) ResetClient() (*http.Client, error)
- func (c *Context) ResetCookies() error
- func (c *Context) Response() *http.Response
- func (c *Context) Set(key string, value interface{})
- func (c *Context) SetParent(parent *Context)
- func (c *Context) SetRequest(req *http.Request)
- func (c *Context) SetResponse(res *http.Response)
type Entries
- func (e Entries) Len() int
- func (e Entries) Less(i, j int) bool
- func (e Entries) Swap(i, j int)
type Entry
type InMemory
- func NewScheduler() *InMemory
- func (in *InMemory) Add(sched Schedule, spider Spider)
- func (in *InMemory) AddFunc(sched Schedule, url string, fn func(*Context) error)
- func (in *InMemory) AddWithCtx(sched Schedule, spider Spider, ctx *Context)
- func (in *InMemory) Start()
- func (in *InMemory) Stop()
type Schedule
type Spider

Constants ¶

This section is empty.

Variables ¶

var (
	ErrNoClient  = errors.New("No request has been set")
	ErrNoRequest = errors.New("No request has been set")
)

Functions ¶

func Add ¶ added in v0.3.0

func Add(sched Schedule, spider Spider)

Add adds a spider to the standard scheduler

func AddFunc ¶ added in v0.3.0

func AddFunc(sched Schedule, url string, fn func(*Context) error)

AddFunc adds a spider to the standard scheduler using a URL and a closure.

func Delete ¶ added in v0.3.0

func Delete(url string, fn spinFunc) *spiderFunc

Delete returns a new DELETE HTTP Spider.

func Get ¶ added in v0.3.0

func Get(url string, fn spinFunc) *spiderFunc

Get returns a new GET HTTP Spider.

func NewHTTPSpider ¶ added in v0.3.0

func NewHTTPSpider(method, url string, body io.Reader, fn spinFunc) *spiderFunc

NewHTTPSpider creates a new spider from the HTTP method, URL and body. The last argument is a closure that does the actual work.

func NewKVStore ¶

func NewKVStore() *store

NewKVStore returns a new store.

func Post ¶ added in v0.3.0

func Post(url string, body io.Reader, fn spinFunc) *spiderFunc

Post returns a new POST HTTP Spider.

func Put ¶ added in v0.3.0

func Put(url string, body io.Reader, fn spinFunc) *spiderFunc

Put returns a new PUT HTTP Spider.

func Start ¶ added in v0.3.0

func Start()

Start starts the standard scheduler

func Stop ¶ added in v0.3.0

func Stop()

Stop stops the standard scheduler

Types ¶

type BackoffCondition ¶

type BackoffCondition func(*http.Response) error

func ErrorIfStatusCodeIsNot ¶

func ErrorIfStatusCodeIsNot(status int) BackoffCondition
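
Since a BackoffCondition is just a function from a response to an error, you can write your own. The sketch below is an illustration only: it redeclares the type locally so it compiles without the spider package, and errorIfServerError and check are hypothetical helpers, not part of this package.

```go
package main

import (
	"fmt"
	"net/http"
)

// BackoffCondition mirrors the type documented above so that this sketch
// compiles without importing the spider package.
type BackoffCondition func(*http.Response) error

// errorIfServerError is a hypothetical condition (not part of spider).
// It reports an error for any 5xx status so that a caller could retry.
func errorIfServerError() BackoffCondition {
	return func(res *http.Response) error {
		if res.StatusCode >= 500 {
			return fmt.Errorf("server error: %d", res.StatusCode)
		}
		return nil
	}
}

// check applies a condition to a bare response that carries only a status code.
func check(cond BackoffCondition, status int) error {
	return cond(&http.Response{StatusCode: status})
}

func main() {
	cond := errorIfServerError()
	fmt.Println(check(cond, http.StatusOK))                 // <nil>
	fmt.Println(check(cond, http.StatusServiceUnavailable)) // server error: 503
}
```

A condition written this way could be passed wherever a BackoffCondition is expected, for example to DoRequestWithExponentialBackOff.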

type Context ¶

type Context struct {
	Client *http.Client

	Parent   *Context
	Children []*Context
	// contains filtered or unexported fields
}

Context is the element that can be shared across different spiders. It contains an HTTP Client and an HTTP Request. Context can execute an HTTP Request.

func NewContext ¶

func NewContext() *Context

NewContext returns a new Context.

func NewHTTPContext ¶ added in v0.3.0

func NewHTTPContext(method, url string, body io.Reader) (*Context, error)

NewHTTPContext returns a new Context.

It creates a new http.Client and a new http.Request with the provided arguments.

func (*Context) Cookies ¶

func (c *Context) Cookies() []*http.Cookie

Cookies returns a list of cookies for the given request URL.

func (*Context) DoRequest ¶

func (c *Context) DoRequest() (*http.Response, error)

DoRequest makes an http request using the http.Client and http.Request associated with this context.

This will store the response in this context. To access the response you should do:

ctx.Response() // to get the http.Response

func (*Context) DoRequestWithExponentialBackOff ¶

func (c *Context) DoRequestWithExponentialBackOff(condition BackoffCondition, b backoff.BackOff) (*http.Response, error)

DoRequestWithExponentialBackOff makes an http request using the http.Client and http.Request associated with this context. You can pass a condition and a BackOff configuration. See https://github.com/cenkalti/backoff to know more about backoff. If no BackOff is provided it will use the default exponential BackOff configuration. See also ErrorIfStatusCodeIsNot function that provides a basic condition based on status code.
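
To make the retry idea concrete, here is a minimal standard-library sketch of retrying with exponentially growing delays. It is not the implementation used by this package or by cenkalti/backoff: retryWithBackoff, failNTimes and demo are hypothetical names, and plain doubling is a simplifying assumption (real backoff libraries typically add jitter and an overall time limit).

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// retryWithBackoff is a hypothetical helper. It calls op until it succeeds or
// maxAttempts (which must be at least 1) is reached, doubling the wait between
// attempts. It returns the number of attempts made and the last error.
func retryWithBackoff(op func() error, initial time.Duration, maxAttempts int) (int, error) {
	delay := initial
	var err error
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		if err = op(); err == nil {
			return attempt, nil
		}
		if attempt < maxAttempts {
			time.Sleep(delay)
			delay *= 2
		}
	}
	return maxAttempts, err
}

// failNTimes returns an operation that fails n times and then succeeds.
func failNTimes(n int) func() error {
	calls := 0
	return func() error {
		calls++
		if calls <= n {
			return errors.New("temporary failure")
		}
		return nil
	}
}

// demo runs an operation that fails the given number of times, starting with a
// one millisecond delay.
func demo(failures, maxAttempts int) (int, error) {
	return retryWithBackoff(failNTimes(failures), time.Millisecond, maxAttempts)
}

func main() {
	attempts, err := demo(2, 5)
	fmt.Println(attempts, err) // 3 <nil>
}
```

In the package itself, the operation being retried is the HTTP request, and the BackoffCondition decides whether a response counts as a failure.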

func (*Context) ExtendWithRequest ¶

func (c *Context) ExtendWithRequest(ctx Context, r *http.Request) *Context

ExtendWithRequest returns a new Context that is a child of the provided context and is associated with the provided http.Request.

func (*Context) Get ¶

func (c *Context) Get(key string) interface{}

Get a value from this context

func (*Context) HTMLParser ¶

func (c *Context) HTMLParser() (*goquery.Document, error)

HTMLParser returns an HTML parser.

It uses PuerkitoBio's awesome goquery package, which can be found at this URL: https://github.com/PuerkitoBio/goquery.

func (*Context) JSONParser ¶

func (c *Context) JSONParser() (*simplejson.Json, error)

JSONParser returns a JSON parser.

It uses Bitly's go-simplejson package, which can be found at: https://github.com/bitly/go-simplejson

func (*Context) NewClient ¶

func (c *Context) NewClient() (*http.Client, error)

NewClient creates a new http.Client.

func (*Context) NewCookieJar ¶

func (c *Context) NewCookieJar() (*cookiejar.Jar, error)

NewCookieJar creates a new *cookiejar.Jar.

func (*Context) RAWContent ¶

func (c *Context) RAWContent() ([]byte, error)

RAWContent returns the raw data of the response's body.

func (*Context) Request ¶

func (c *Context) Request() *http.Request

Request returns the http.Request associated with this context.

func (*Context) ResetClient ¶

func (c *Context) ResetClient() (*http.Client, error)

ResetClient creates a new http.Client and replaces the existing one, if any.

func (*Context) ResetCookies ¶

func (c *Context) ResetCookies() error

ResetCookies creates a new cookie jar.

Note: All previously stored cookies will be deleted.

func (*Context) Response ¶

func (c *Context) Response() *http.Response

Response returns an http.Response

func (*Context) Set ¶

func (c *Context) Set(key string, value interface{})

Set a value to this context

func (*Context) SetParent ¶

func (c *Context) SetParent(parent *Context)

SetParent sets the parent of the current context. It also adds the current context to the parent's list of children.
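
The two-way link described here can be pictured with a small stand-alone sketch. The node type and its setParent method are hypothetical stand-ins for Context and SetParent, reduced to the Parent and Children fields; the real method may do more than this.

```go
package main

import "fmt"

// node is a local stand-in for spider.Context, reduced to the two exported
// fields that the description above mentions.
type node struct {
	Name     string
	Parent   *node
	Children []*node
}

// setParent links both directions: the child points at its parent, and the
// parent records the child in its Children slice.
func (n *node) setParent(parent *node) {
	n.Parent = parent
	parent.Children = append(parent.Children, n)
}

func main() {
	root := &node{Name: "root"}
	child := &node{Name: "child"}
	child.setParent(root)
	fmt.Println(child.Parent.Name, len(root.Children)) // root 1
}
```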

func (*Context) SetRequest ¶

func (c *Context) SetRequest(req *http.Request)

SetRequest sets the http.Request.

func (*Context) SetResponse ¶

func (c *Context) SetResponse(res *http.Response)

SetResponse sets the http.Response.

type Entries ¶ added in v0.3.0

type Entries []*Entry

Entries is a collection of Entry. Sortable by time.

func (Entries) Len ¶ added in v0.3.0

func (e Entries) Len() int

func (Entries) Less ¶ added in v0.3.0

func (e Entries) Less(i, j int) bool

func (Entries) Swap ¶ added in v0.3.0

func (e Entries) Swap(i, j int)
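
Len, Less and Swap together satisfy sort.Interface, which is what makes Entries sortable. The sketch below uses local stand-in types (entry, entries) and hypothetical helpers (soonest, demoEntries). Ordering by the Next field is an assumption based on the "Sortable by time" description, not a copy of the package's Less.

```go
package main

import (
	"fmt"
	"sort"
	"time"
)

// entry and entries are local stand-ins for spider.Entry and spider.Entries.
// Only the Next field matters for ordering in this sketch.
type entry struct {
	Name string
	Next time.Time
}

type entries []*entry

func (e entries) Len() int { return len(e) }

func (e entries) Swap(i, j int) { e[i], e[j] = e[j], e[i] }

// Less orders by Next; this is an assumption about how entries are compared.
func (e entries) Less(i, j int) bool { return e[i].Next.Before(e[j].Next) }

// soonest sorts the slice in place and returns the name of the entry that is
// due first, or an empty string if there are no entries.
func soonest(es entries) string {
	if len(es) == 0 {
		return ""
	}
	sort.Sort(es)
	return es[0].Name
}

// demoEntries builds three entries that are due at different times.
func demoEntries() entries {
	now := time.Now()
	return entries{
		{Name: "hourly", Next: now.Add(time.Hour)},
		{Name: "soon", Next: now.Add(time.Second)},
		{Name: "minutely", Next: now.Add(time.Minute)},
	}
}

func main() {
	fmt.Println(soonest(demoEntries())) // soon
}
```

Keeping entries sorted this way lets a scheduler look only at the first element to know which spider is due next.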

type Entry ¶ added in v0.3.0

type Entry struct {
	Spider   Spider
	Schedule Schedule
	Ctx      *Context
	Next     time.Time
}

Entry groups a spider, its root context, a Schedule and the Next time the spider must be launched

type InMemory ¶ added in v0.3.0

type InMemory struct {
	// contains filtered or unexported fields
}

InMemory is the default scheduler

func NewScheduler ¶ added in v0.3.0

func NewScheduler() *InMemory

NewScheduler returns a new InMemory scheduler

func (*InMemory) Add ¶ added in v0.3.0

func (in *InMemory) Add(sched Schedule, spider Spider)

Add adds a spider using a nil root Context

func (*InMemory) AddFunc ¶ added in v0.3.0

func (in *InMemory) AddFunc(sched Schedule, url string, fn func(*Context) error)

AddFunc adds a spider using a URL and a closure. It uses the GET HTTP method by default.

func (*InMemory) AddWithCtx ¶ added in v0.3.0

func (in *InMemory) AddWithCtx(sched Schedule, spider Spider, ctx *Context)

AddWithCtx adds a spider with a root Context passed in the arguments

func (*InMemory) Start ¶ added in v0.3.0

func (in *InMemory) Start()

Start launches the scheduler. It runs in its own goroutine, so your code continues to execute after calling this function.

func (*InMemory) Stop ¶ added in v0.3.0

func (in *InMemory) Stop()

Stop the scheduler. Should be called after Start.

type Schedule ¶ added in v0.2.0

type Schedule interface {
	Next(time.Time) time.Time
}

Schedule is an interface with only a Next method. Next will return the next time it should run given the current time as a parameter.
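
Because the interface has a single method, a custom schedule is short to write. The sketch below redeclares Schedule locally so it compiles on its own. fixedInterval, everySeconds and offsetsInSeconds are hypothetical names, and this is not necessarily how schedule.Every is implemented.

```go
package main

import (
	"fmt"
	"time"
)

// Schedule mirrors the interface documented above.
type Schedule interface {
	Next(time.Time) time.Time
}

// fixedInterval is a hypothetical Schedule that fires a fixed duration after
// the time it is given.
type fixedInterval struct {
	period time.Duration
}

func (f fixedInterval) Next(now time.Time) time.Time {
	return now.Add(f.period)
}

// everySeconds builds a fixedInterval from a number of seconds.
func everySeconds(n int) Schedule {
	return fixedInterval{period: time.Duration(n) * time.Second}
}

// offsetsInSeconds feeds each result of Next back into Next and returns the
// first n run times as offsets, in seconds, from an arbitrary start time.
func offsetsInSeconds(s Schedule, n int) []int {
	start := time.Date(2016, time.January, 20, 0, 0, 0, 0, time.UTC)
	offsets := make([]int, 0, n)
	t := start
	for i := 0; i < n; i++ {
		t = s.Next(t)
		offsets = append(offsets, int(t.Sub(start)/time.Second))
	}
	return offsets
}

func main() {
	fmt.Println(offsetsInSeconds(everySeconds(20), 3)) // [20 40 60]
}
```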

type Spider ¶

type Spider interface {
	Setup(*Context) (*Context, error)
	Spin(*Context) error
}

Spider is an interface with two methods. It is the primary element of the package.

Directories ¶

Path	Synopsis
examples
schedule
