goribot

package module
Version: v0.1.9 Latest Latest
Warning

This package is not in the latest version of its module.

Go to latest
Published: May 4, 2020 License: Apache-2.0 Imports: 32 Imported by: 0

README

Goribot

一个分布式友好的轻量的 Golang 爬虫框架。

完整文档 | Document

GitHub go.mod Go version GitHub tag (latest by date) codecov go-report license code-size

🚀Feature

版本警告

Goribot 仅支持 Go1.13 及以上版本。

👜获取 Goribot

go get -u github.com/zhshch2002/goribot

Goribot 包含一个历史开发版本,如果您需要使用过那个版本,请拉取 Tag 为 v0.0.1 版本。

⚡建立你的第一个项目

package main

import (
	"fmt"
	"github.com/zhshch2002/goribot"
)

func main() {
	s := goribot.NewSpider()

	s.AddTask(
		goribot.GetReq("https://httpbin.org/get"),
		func(ctx *goribot.Context) {
			fmt.Println(ctx.Resp.Text)
			fmt.Println(ctx.Resp.Json("headers.User-Agent"))
		},
	)

	s.Run()
}

🎉完成

至此你已经可以使用 Goribot 了。更多内容请从 开始使用 了解。

🙏感谢

万分感谢以上项目的帮助🙏。

Documentation

Index

Constants

View Source
const DeduplicateSuffix = "_deduplicate"
View Source
const ItemsSuffix = "_items"
View Source
const TasksSuffix = "_tasks"

Variables

View Source
var Do = D.Do
View Source
var ErrRunFinishedSpider = errors.New("running a spider which is finished,you could recreate this spider and run the new one")
View Source
var GetReq = Get

Deprecated: will be remove at next major version

View Source
var Log = logging.MustGetLogger("goribot")
View Source
var PostReq = Post

Deprecated: will be remove at next major version

Functions

func AddCookieToJar added in v0.1.5

func AddCookieToJar(urlAddr string, cookies ...*http.Cookie) func(s *Spider)

AddCookieToJar is an extension add a cookie to downloader's cookie jar

func GetRequestHash

func GetRequestHash(r *Request) [md5.Size]byte

GetRequestHash return a hash of url,header,cookie and body data from a request

func Limiter added in v0.1.1

func Limiter(WhiteList bool, rules ...*LimitRule) func(s *Spider)

func RandomProxy added in v0.1.1

func RandomProxy(p ...string) func(s *Spider)

RandomUserAgent is an extension can set random proxy url for new task

func RandomUserAgent

func RandomUserAgent() func(s *Spider)

RandomUserAgent is an extension can set random User-Agent for new task

func RedisDistributed added in v0.1.0

func RedisDistributed(ro *redis.Options, sName string, useDeduplicate bool, onSeedHandler CtxHandlerFun) func(s *Spider)

func RedisReqDeduplicate added in v0.1.0

func RedisReqDeduplicate(r *redis.Client, sName string) func(s *Spider)

ReqDeduplicate is an extension can deduplicate new task based on redis to support distributed

func RefererFiller

func RefererFiller() func(s *Spider)

RefererFiller is an extension can add Referer for new task

func ReqDeduplicate

func ReqDeduplicate() func(s *Spider)

ReqDeduplicate is an extension can deduplicate new task

func Retry added in v0.1.1

func Retry(maxTimes int, okcode ...int) func(s *Spider)

Retry is a extension make a new request when get response with error

func RobotsTxt

func RobotsTxt(baseUrl, ua string) func(s *Spider)

RobotsTxt is an extension can parse the robots.txt and follow it

func SaveItemsAsCSV added in v0.1.1

func SaveItemsAsCSV(f *os.File) func(s *Spider)

SaveItemsAsCSV is a extension save items to a csv file

func SaveItemsAsJSON added in v0.1.1

func SaveItemsAsJSON(f *os.File) func(s *Spider)

SaveItemsAsCSV is a extension save items to a json file

func SetDepthFirst added in v0.1.0

func SetDepthFirst(d bool) func(s *Spider)

SetDepthFirst is an extension change Scheduler DepthFirst setting

func SpiderLogError added in v0.1.1

func SpiderLogError(f *os.File) func(s *Spider)

SpiderLogError is a extension logs special or error response

func SpiderLogPrint added in v0.1.1

func SpiderLogPrint() func(s *Spider)

SpiderLogPrint is a extension print spider working status

Types

type BaseDownloader added in v0.1.0

type BaseDownloader struct {
	Client *http.Client
	// contains filtered or unexported fields
}

BaseDownloader is default downloader of goribot

func NewBaseDownloader added in v0.1.0

func NewBaseDownloader() *BaseDownloader

func (*BaseDownloader) AddMiddleware added in v0.1.2

func (s *BaseDownloader) AddMiddleware(fn func(req *Request, next func(*Request) (*Response, error)) (*Response, error))

func (*BaseDownloader) Do added in v0.1.0

func (s *BaseDownloader) Do(req *Request) (resp *Response, err error)

type BaseScheduler added in v0.1.0

type BaseScheduler struct {

	// DepthFirst sets push new tasks to the top of the queue
	DepthFirst bool
	// contains filtered or unexported fields
}

Scheduler is default scheduler of goribot

func NewBaseScheduler added in v0.1.0

func NewBaseScheduler(depthFirst bool) *BaseScheduler

func (*BaseScheduler) AddItem added in v0.1.0

func (s *BaseScheduler) AddItem(i interface{})

func (*BaseScheduler) AddTask added in v0.1.0

func (s *BaseScheduler) AddTask(t *Task)

func (*BaseScheduler) GetItem added in v0.1.0

func (s *BaseScheduler) GetItem() interface{}

func (*BaseScheduler) GetTask added in v0.1.0

func (s *BaseScheduler) GetTask() *Task

func (*BaseScheduler) IsItemEmpty added in v0.1.0

func (s *BaseScheduler) IsItemEmpty() bool

func (*BaseScheduler) IsTaskEmpty added in v0.1.0

func (s *BaseScheduler) IsTaskEmpty() bool

type Context

type Context struct {
	// Req is the origin request
	Req *Request
	// Resp is the response object
	Resp *Response

	// Meta the request task created by NewTaskWithMeta func will have a k-y pair
	Meta map[string]interface{}

	Handlers []CtxHandlerFun
	// contains filtered or unexported fields
}

Context is a wrap of response,origin request,new task,etc

func (*Context) Abort added in v0.1.0

func (c *Context) Abort()

Abort this context to break the handler chain and stop handling

func (*Context) AddItem

func (c *Context) AddItem(i interface{})

AddItem add an item to new item list. After every handler func return, spider will collect these items and call OnItem handler func

func (*Context) AddTask

func (c *Context) AddTask(request *Request, handlers ...CtxHandlerFun)

AddTask add a task to new task list. After every handler func return,spider will collect these tasks

func (*Context) IsAborted added in v0.1.0

func (c *Context) IsAborted() bool

IsAborted return was the context dropped

type CsvItem added in v0.1.1

type CsvItem []string

type CtxHandlerFun added in v0.1.0

type CtxHandlerFun func(ctx *Context)

type Downloader added in v0.1.0

type Downloader interface {
	Do(req *Request) (resp *Response, err error)
	AddMiddleware(func(req *Request, next func(req *Request) (resp *Response, err error)) (resp *Response, err error))
}

Downloader tool download response from request

type DownloaderErr added in v0.1.0

type DownloaderErr struct {

	// Request is the Request object when the error occurred
	Request *Request
	// Response is the Request object when the error occurred.It could be nil.
	Response *Response
	// contains filtered or unexported fields
}

DownloaderErr is a error create by Downloader

type ErrorItem added in v0.1.1

type ErrorItem struct {
	Ctx *Context
	Msg string
}

type JsonItem added in v0.1.1

type JsonItem struct {
	Data interface{}
}

type LimitRule added in v0.1.1

type LimitRule struct {
	Regexp, Glob string
	Allow        LimitRuleAllow
	Parallelism  int64

	Rate int64

	Delay       time.Duration
	RandomDelay time.Duration
	MaxReq      int64

	MaxDepth int64
	// contains filtered or unexported fields
}

func (*LimitRule) Match added in v0.1.1

func (s *LimitRule) Match(u *url.URL) bool

type LimitRuleAllow added in v0.1.1

type LimitRuleAllow uint8
const (
	NotSet LimitRuleAllow = iota
	Allow
	Disallow
)

type Manager added in v0.1.0

type Manager struct {
	// contains filtered or unexported fields
}

func NewManager added in v0.1.0

func NewManager(redis *redis.Client, sName string) *Manager

func (*Manager) GetItem added in v0.1.0

func (s *Manager) GetItem() interface{}

func (*Manager) OnItem added in v0.1.0

func (s *Manager) OnItem(fn func(i interface{}) interface{})

func (*Manager) Run added in v0.1.0

func (s *Manager) Run()

func (*Manager) SendReq added in v0.1.0

func (s *Manager) SendReq(req *Request)

func (*Manager) SetItemPoolSize added in v0.1.0

func (s *Manager) SetItemPoolSize(i int)

type RedisScheduler added in v0.1.0

type RedisScheduler struct {
	// contains filtered or unexported fields
}

Scheduler is default scheduler of goribot

func NewRedisScheduler added in v0.1.0

func NewRedisScheduler(redis *redis.Client, sName string, bs int, fn ...CtxHandlerFun) *RedisScheduler

func (*RedisScheduler) AddItem added in v0.1.0

func (s *RedisScheduler) AddItem(i interface{})

func (*RedisScheduler) AddTask added in v0.1.0

func (s *RedisScheduler) AddTask(t *Task)

func (*RedisScheduler) GetItem added in v0.1.0

func (s *RedisScheduler) GetItem() interface{}

func (*RedisScheduler) GetTask added in v0.1.0

func (s *RedisScheduler) GetTask() *Task

func (*RedisScheduler) IsItemEmpty added in v0.1.0

func (s *RedisScheduler) IsItemEmpty() bool

func (*RedisScheduler) IsTaskEmpty added in v0.1.0

func (s *RedisScheduler) IsTaskEmpty() bool

type Request

type Request struct {
	*http.Request
	Depth int
	// ResponseCharacterEncoding is the character encoding of the response body.
	// Leave it blank to allow automatic character encoding of the response body.
	// It is empty by default and it can be set in OnRequest callback.
	ResponseCharacterEncoding string
	// ProxyURL is the proxy address that handles the request
	ProxyURL string
	// Meta contains data between a Request and a Response
	Meta map[string]interface{}
	Err  error
	// contains filtered or unexported fields
}

Request is a object of HTTP request

func Get added in v0.1.9

func Get(urladdr string) *Request

Get creates a get request

func Post added in v0.1.9

func Post(urladdr string, body io.Reader) *Request

Post creates a post request

func PostFormReq added in v0.1.0

func PostFormReq(urladdr string, requestData map[string]string) *Request

PostFormReq creates a post request with form data

func PostJsonReq added in v0.1.0

func PostJsonReq(urladdr string, requestData interface{}) *Request

PostJsonReq creates a post request with json data

func PostRawReq added in v0.1.0

func PostRawReq(urladdr string, body []byte) *Request

PostReq creates a post request with raw data

func (*Request) AddCookie

func (s *Request) AddCookie(c *http.Cookie) *Request

AddCookie adds a cookie to the request.

func (*Request) AddParam added in v0.1.7

func (s *Request) AddParam(k, v string) *Request

AddParam adds a query param of request url.

func (*Request) GetBody added in v0.1.0

func (s *Request) GetBody() []byte

GetBody returns the body as bytes of request

func (*Request) SetHeader

func (s *Request) SetHeader(key, value string) *Request

SetHeader sets the header entries associated with key to the single element value.

func (*Request) SetParam added in v0.1.0

func (s *Request) SetParam(p map[string]string) *Request

SetParam sets query param of request url. Deprecated: will be remove at next major version

func (*Request) SetProxy added in v0.1.0

func (s *Request) SetProxy(p string) *Request

SetProxy sets proxy url of request.

func (*Request) SetUA added in v0.1.0

func (s *Request) SetUA(ua string) *Request

SetProxy sets user-agent url of request header.

func (*Request) WithMeta added in v0.1.0

func (s *Request) WithMeta(k string, v interface{}) *Request

SetParam sets the meta data of request.

type Response

type Response struct {
	*http.Response
	// Body is the content of the Response
	Body []byte
	// Text is the content of the Response parsed as string
	Text string
	// Request is the Req object from goribot of the response.Tip: there is another Request attr come from *http.Response
	Req *Request
	// Dom is the parsed html object
	Dom *goquery.Document
	// Meta contains data between a Request and a Response
	Meta map[string]interface{}
}

Response is a object of HTTP response

func (*Response) DecodeAndParse added in v0.1.0

func (s *Response) DecodeAndParse() error

DecodeAndParas decodes the body to text and try to parse it to html or json.

func (*Response) IsHTML added in v0.1.3

func (s *Response) IsHTML() bool

func (*Response) IsJSON added in v0.1.3

func (s *Response) IsJSON() bool

func (*Response) Json

func (s *Response) Json(q string) gjson.Result

Json returns json result parsed from response

type Scheduler added in v0.1.0

type Scheduler interface {
	// GetTask pops a task
	GetTask() *Task
	// GetItem pops a item
	GetItem() interface{}

	// AddTask push a task
	AddTask(t *Task)
	// AddItem push a item
	AddItem(i interface{})

	// IsTaskEmpty returns is tasks queue empty
	IsTaskEmpty() bool
	// IsItemEmpty returns is items queue empty
	IsItemEmpty() bool
}

Scheduler is a queue of tasks and items

type Spider

type Spider struct {
	Scheduler  Scheduler
	Downloader Downloader
	AutoStop   bool
	// contains filtered or unexported fields
}

func NewSpider

func NewSpider(exts ...func(s *Spider)) *Spider

func (*Spider) AddTask

func (s *Spider) AddTask(request *Request, handlers ...CtxHandlerFun)

func (*Spider) OnAdd added in v0.1.0

func (s *Spider) OnAdd(fn func(ctx *Context, t *Task) *Task)

***********************************************************************************

func (*Spider) OnError

func (s *Spider) OnError(fn func(ctx *Context, err error))

***********************************************************************************

func (*Spider) OnFinish added in v0.1.0

func (s *Spider) OnFinish(fn func(s *Spider))

***********************************************************************************

func (*Spider) OnHTML added in v0.1.4

func (s *Spider) OnHTML(selector string, fn func(ctx *Context, sel *goquery.Selection))

func (*Spider) OnItem

func (s *Spider) OnItem(fn func(i interface{}) interface{})

***********************************************************************************

func (*Spider) OnJSON added in v0.1.4

func (s *Spider) OnJSON(q string, fn func(ctx *Context, j gjson.Result))

func (*Spider) OnReq added in v0.1.0

func (s *Spider) OnReq(fn func(ctx *Context, req *Request) *Request)

***********************************************************************************

func (*Spider) OnResp

func (s *Spider) OnResp(fn CtxHandlerFun)

***********************************************************************************

func (*Spider) OnStart added in v0.1.0

func (s *Spider) OnStart(fn func(s *Spider))

***********************************************************************************

func (*Spider) Run

func (s *Spider) Run()

func (*Spider) SetItemPoolSize added in v0.1.0

func (s *Spider) SetItemPoolSize(i int)

func (*Spider) SetTaskPoolSize added in v0.1.0

func (s *Spider) SetTaskPoolSize(i int)

func (*Spider) Use added in v0.1.0

func (s *Spider) Use(fn ...func(s *Spider))

type Task

type Task struct {
	Request  *Request
	Handlers []CtxHandlerFun
}

func NewTask

func NewTask(request *Request, handlers ...CtxHandlerFun) *Task

Directories

Path Synopsis

Jump to

Keyboard shortcuts

? : This menu
/ : Search site
f or F : Jump to
y or Y : Canonical URL