Documentation
¶
Index ¶
- type Context
- type DefaultDownloader
- type Downloader
- type Engine
- type FIFOScheduler
- type Items
- type Option
- func MaxCrawlingDepth(depth int) Option
- func SetConcurrency(num int) Option
- func UseDownloader(d Downloader) Option
- func UseLogger(lg logger.Logger) Option
- func UseScheduler(s Scheduler) Option
- func WithDelay(delay time.Duration) Option
- func WithRequestMiddlewares(middlewares ...RequestHandleFunc) Option
- func WithResponseMiddlewares(middlewares ...ResponseHandleFunc) Option
- type Pipeline
- type Request
- type RequestHandleFunc
- type Response
- type ResponseHandleFunc
- type Scheduler
- type Spider
- type StringMatcher
- type URLMatcher
- type URLRegExpMatcher
- type WeightedScheduler
- func (sched *WeightedScheduler) HasMore() bool
- func (sched *WeightedScheduler) Len() int
- func (sched *WeightedScheduler) Less(i, j int) bool
- func (sched *WeightedScheduler) Pop() interface{}
- func (sched *WeightedScheduler) PopRequest() (req *Request, ok bool)
- func (sched *WeightedScheduler) Push(x interface{})
- func (sched *WeightedScheduler) PushRequest(req *Request) (ok bool)
- func (sched *WeightedScheduler) Start() error
- func (sched *WeightedScheduler) Stop() error
- func (sched *WeightedScheduler) Swap(i, j int)
Constants ¶
This section is empty.
Variables ¶
This section is empty.
Functions ¶
This section is empty.
Types ¶
type Context ¶
Context represents the scraping and crawling context
type DefaultDownloader ¶
type DefaultDownloader struct {
// contains filtered or unexported fields
}
DefaultDownloader is a simple Downloader implementation.
func (*DefaultDownloader) Download ¶
func (dd *DefaultDownloader) Download(req *Request) (*Response, error)
Download sends an HTTP request and uses goquery to build the HTML document from the response.
func (*DefaultDownloader) SetHTTPClient ¶
func (dd *DefaultDownloader) SetHTTPClient(client *http.Client)
SetHTTPClient sets the HTTP client used to fetch pages.
type Downloader ¶
Downloader is an interface representing the ability to download data from the internet. It is responsible for fetching web pages; the downloaded response is taken over by the engine and, in turn, fed to spiders.
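A custom Downloader can be installed with the UseDownloader option. The interface body is not reproduced in this index; the sketch below assumes it declares the same Download method as DefaultDownloader, so treat it as illustrative rather than authoritative. The goscrapy package qualifier and the wrapped downloader are placeholders.

// Minimal sketch of a custom Downloader, assuming the interface declares
// Download(req *Request) (*Response, error), mirroring DefaultDownloader.
type loggingDownloader struct {
	inner goscrapy.Downloader // the downloader actually doing the work
}

func (d *loggingDownloader) Download(req *goscrapy.Request) (*goscrapy.Response, error) {
	log.Printf("downloading %s %s", req.Method, req.URL) // uses the standard "log" package
	return d.inner.Download(req)
}

// Installed via the UseDownloader option:
//   goscrapy.UseDownloader(&loggingDownloader{inner: someDownloader})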
type Engine ¶
type Engine struct {
// contains filtered or unexported fields
}
Engine represents the scraping engine. It is responsible for managing the data flow among the scheduler, downloader and spiders.
func (*Engine) RegisterPipelines ¶
RegisterPipelines registers pipelines with the engine.
func (*Engine) RegisterSipders ¶
RegisterSipders adds working spiders to the engine.
type FIFOScheduler ¶
type FIFOScheduler struct {
// contains filtered or unexported fields
}
FIFOScheduler is the default scheduler implementation.
func NewFIFOScheduler ¶
func NewFIFOScheduler() *FIFOScheduler
NewFIFOScheduler creates a new FIFO scheduler with a queue size of 100.
func (*FIFOScheduler) HasMore ¶
func (ds *FIFOScheduler) HasMore() bool
HasMore returns true if there are more requests to be scheduled.
func (*FIFOScheduler) PopRequest ¶
func (ds *FIFOScheduler) PopRequest() (req *Request, hadMore bool)
PopRequest returns the next request.
func (*FIFOScheduler) PushRequest ¶
func (ds *FIFOScheduler) PushRequest(req *Request) (ok bool)
PushRequest adds a request to the scheduler.
type Items ¶
Items holds the structured data (items) extracted by spiders and consumed by pipelines.
type Option ¶
type Option func(e *Engine)
Option is a functional option for configuring an Engine.
func MaxCrawlingDepth ¶
MaxCrawlingDepth returns an Option that sets the maximum crawling depth. The engine will drop Requests whose current depth exceeds the limit.
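The option functions combine naturally in a functional-options constructor. No engine constructor is listed in this index, so the NewEngine name below is hypothetical; the Option calls themselves match the functions shown above (time is the standard library package).

e := goscrapy.NewEngine( // hypothetical constructor name
	goscrapy.MaxCrawlingDepth(3),             // drop requests deeper than 3
	goscrapy.SetConcurrency(8),               // download with 8 workers
	goscrapy.WithDelay(500*time.Millisecond), // politeness delay between requests
	goscrapy.UseScheduler(goscrapy.NewWeightedScheduler()),
)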
func WithRequestMiddlewares ¶
func WithRequestMiddlewares(middlewares ...RequestHandleFunc) Option
WithRequestMiddlewares registers request middlewares. Requests are processed by the request middlewares just before being passed to the downloader.
To abort a request inside a middleware, use Request.Abort().
For example:
func ReqMiddleware(req *goscrapy.Request) error {
	if req.URL == "http://www.example.com" {
		req.Abort()
		return nil
	}
	return nil
}
func WithResponseMiddlewares ¶
func WithResponseMiddlewares(middlewares ...ResponseHandleFunc) Option
WithResponseMiddlewares registers response middlewares. Responses are processed by the response middlewares right after the downloader finishes downloading and hands the response over to the engine.
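The signature of ResponseHandleFunc is not reproduced in this index; by analogy with the request middleware example above, the sketch below assumes func(*Response) error.

// Assumed signature: func(resp *goscrapy.Response) error.
func RespMiddleware(resp *goscrapy.Response) error {
	if resp.StatusCode >= 500 {
		return fmt.Errorf("server error for %s: %s", resp.Request.URL, resp.Status)
	}
	return nil
}

// Registered with:
//   goscrapy.WithResponseMiddlewares(RespMiddleware)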
type Pipeline ¶
type Pipeline interface {
	Name() string              // returns the pipeline's name
	ItemList() []string        // returns the names of all items this pipeline cares about
	Handle(items *Items) error
}
Pipeline processes items produced by spiders.
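A minimal Pipeline sketch. The accessors on Items are not documented in this index, so Handle only acknowledges what it receives; the item name "article" is an illustrative placeholder.

type consolePipeline struct{}

// Name returns the pipeline's name.
func (p *consolePipeline) Name() string { return "console" }

// ItemList declares which item names this pipeline cares about ("article" is made up).
func (p *consolePipeline) ItemList() []string { return []string{"article"} }

// Handle receives scraped items; Items' accessors aren't shown here, so just log receipt.
func (p *consolePipeline) Handle(items *goscrapy.Items) error {
	log.Println("console pipeline received items")
	return nil
}

// Registered with (*Engine).RegisterPipelines.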
type Request ¶
type Request struct {
	Method string      `json:"method,omitempty"`
	URL    string      `json:"url,omitempty"`
	Header http.Header `json:"header,omitempty"`
	Query  url.Values  `json:"query,omitempty"`

	// Weight is used to decide the scheduling sequence. It only means something when using a
	// scheduler that schedules requests based on request weight.
	Weight int
	// contains filtered or unexported fields
}
Request represents a crawling request.
func (*Request) Abort ¶
func (r *Request) Abort()
Abort aborts the current request. You can use it in your request middlewares to make sure certain requests will not be handled by the downloader and spiders.
func (*Request) ContextValue ¶
ContextValue returns the value associated with this request for key, or nil if no value is associated with key.
func (*Request) WithContextValue ¶
WithContextValue stores a value in the request, associated with the given key.
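The parameter and return types of WithContextValue and ContextValue are not reproduced in this index; the round-trip below assumes context.Context-style keys and interface{} values.

req := &goscrapy.Request{Method: "GET", URL: "https://example.com"} // placeholder request
req.WithContextValue("retries", 2)                                  // assumed (key, value) signature

if v := req.ContextValue("retries"); v != nil {
	retries := v.(int) // value is assumed to come back as an empty interface
	_ = retries
}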
type RequestHandleFunc ¶
RequestHandleFunc is a request handler function used as request middleware.
type Response ¶
type Response struct {
	Status     string `json:"status,omitempty"`      // e.g. "200 OK"
	StatusCode int    `json:"status_code,omitempty"` // e.g. 200

	// Request represents the request that was sent to obtain this response.
	Request *Request `json:"request,omitempty"`

	// Document represents an HTML document to be manipulated.
	Document *goquery.Document `json:"-"`

	// Body represents the response body.
	Body []byte `json:"-"`

	// ContentLength records the length of the associated content. For more details see http.Response.
	ContentLength int64 `json:"content_length,omitempty"`

	// Header represents the response header, mapping header keys to values.
	Header http.Header `json:"header,omitempty"`
}
Response represents a crawling response.
type ResponseHandleFunc ¶
ResponseHandleFunc is a response handler function used as response middleware.
type Scheduler ¶
type Scheduler interface {
	Start() error                             // start the scheduler
	Stop() error                              // stop the scheduler
	PopRequest() (req *Request, hasMore bool) // return the next request from the scheduler
	PushRequest(req *Request) (ok bool)       // add a new request to the scheduler
	HasMore() bool                            // returns true if there are more requests to be scheduled
}
Scheduler is responsible for managing all scraping and crawling requests.
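The interface is small enough to implement directly. Below is a toy LIFO scheduler sketch that satisfies the method set shown above; it is illustrative only (it uses the standard sync package and ignores backpressure and shutdown concerns).

// lifoScheduler pops the most recently pushed request first.
type lifoScheduler struct {
	mu    sync.Mutex
	stack []*goscrapy.Request
}

func (s *lifoScheduler) Start() error { return nil } // nothing to start
func (s *lifoScheduler) Stop() error  { return nil } // nothing to stop

func (s *lifoScheduler) PushRequest(req *goscrapy.Request) (ok bool) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.stack = append(s.stack, req)
	return true
}

func (s *lifoScheduler) PopRequest() (req *goscrapy.Request, hasMore bool) {
	s.mu.Lock()
	defer s.mu.Unlock()
	if len(s.stack) == 0 {
		return nil, false
	}
	req = s.stack[len(s.stack)-1]
	s.stack = s.stack[:len(s.stack)-1]
	return req, true
}

func (s *lifoScheduler) HasMore() bool {
	s.mu.Lock()
	defer s.mu.Unlock()
	return len(s.stack) > 0
}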
type Spider ¶
type Spider interface {
	Name() string
	StartRequests() []*Request
	URLMatcher() URLMatcher
	Parse(ctx *Context) (*Items, []*Request, error)
}
Spider is an interface that defines how a certain site (or a group of sites) will be scraped, including how to perform the crawl (i.e. follow links) and how to extract structured data from their pages (i.e. scraping items). In other words, Spiders are the place where you define the custom behavior for crawling and parsing pages for a particular site (or, in some cases, a group of sites). For spiders, the scraping cycle goes through something like this:
- Using initial Requests generated by StartRequests to crawl the first URLs.
- Parsing the response (web page), then returning an items object (structured data) and request objects. Those requests will be added to the scheduler by the goscrapy engine and downloaded by the downloader in the future.
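A compact Spider sketch that ties this cycle together. The fields of Context are not documented in this index, so Parse is left as a stub; the spider name, start URL, and regexp are placeholders.

type quotesSpider struct{}

func (s *quotesSpider) Name() string { return "quotes" }

// StartRequests seeds the engine with the first URLs to crawl.
func (s *quotesSpider) StartRequests() []*goscrapy.Request {
	return []*goscrapy.Request{
		{Method: "GET", URL: "http://quotes.toscrape.com/"},
	}
}

// URLMatcher declares which URLs this spider handles.
func (s *quotesSpider) URLMatcher() goscrapy.URLMatcher {
	return goscrapy.NewRegexpMatcher(`^http://quotes\.toscrape\.com/`)
}

// Parse would normally read the downloaded page from ctx, extract structured
// data into Items, and return follow-up Requests; Context's fields aren't
// shown in this index, so this stub returns nothing.
func (s *quotesSpider) Parse(ctx *goscrapy.Context) (*goscrapy.Items, []*goscrapy.Request, error) {
	return nil, nil, nil
}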
type StringMatcher ¶
type StringMatcher struct {
// contains filtered or unexported fields
}
StringMatcher is a static string matcher.
func NewStaticStringMatcher ¶
func NewStaticStringMatcher(str string) *StringMatcher
NewStaticStringMatcher creates a new static string matcher.
func (*StringMatcher) Match ¶
func (m *StringMatcher) Match(url string) bool
Match returns true if url is matched
type URLMatcher ¶
URLMatcher is a URL matcher; a Spider returns one from URLMatcher() to declare which URLs it handles.
type URLRegExpMatcher ¶
type URLRegExpMatcher struct {
// contains filtered or unexported fields
}
URLRegExpMatcher is a regular-expression-based URL matcher.
func NewRegexpMatcher ¶
func NewRegexpMatcher(str string) *URLRegExpMatcher
NewRegexpMatcher creates a new regexp-based URL matcher.
func (*URLRegExpMatcher) Match ¶
func (m *URLRegExpMatcher) Match(url string) bool
Match returns true if url is matched
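A small usage sketch of the two matchers; the exact comparison performed by the static matcher is whatever its implementation defines, so only the regexp results are described in the comments.

static := goscrapy.NewStaticStringMatcher("https://example.com/")
re := goscrapy.NewRegexpMatcher(`^https://example\.com/articles/\d+$`)

fmt.Println(static.Match("https://example.com/"))        // static string comparison
fmt.Println(re.Match("https://example.com/articles/42")) // true: URL matches the pattern
fmt.Println(re.Match("https://example.com/about"))       // false: URL does not match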
type WeightedScheduler ¶
type WeightedScheduler struct {
// contains filtered or unexported fields
}
WeightedScheduler is a scheduler that pops requests in order of their Weight, backed by a max-heap.
func NewWeightedScheduler ¶
func NewWeightedScheduler() *WeightedScheduler
NewWeightedScheduler creates a new weighted scheduler implemented on top of a max-heap.
func (*WeightedScheduler) HasMore ¶
func (sched *WeightedScheduler) HasMore() bool
HasMore returns true if the queue has more requests.
func (*WeightedScheduler) Len ¶
func (sched *WeightedScheduler) Len() int
Len returns the number of elements in the collection.
func (*WeightedScheduler) Less ¶
func (sched *WeightedScheduler) Less(i, j int) bool
Less reports whether the element with index i should sort before the element with index j.
func (*WeightedScheduler) Pop ¶
func (sched *WeightedScheduler) Pop() interface{}
Pop removes and returns the element at index Len() - 1. It exists to implement heap.Interface. To pop a request from the scheduler, use PopRequest instead.
func (*WeightedScheduler) PopRequest ¶
func (sched *WeightedScheduler) PopRequest() (req *Request, ok bool)
PopRequest pops the next request from the scheduler.
func (*WeightedScheduler) Push ¶
func (sched *WeightedScheduler) Push(x interface{})
Push pushes a value onto the heap. It exists to implement heap.Interface. To push a request onto the scheduler, use PushRequest instead.
func (*WeightedScheduler) PushRequest ¶
func (sched *WeightedScheduler) PushRequest(req *Request) (ok bool)
PushRequest pushes a request onto the scheduler.
func (*WeightedScheduler) Swap ¶
func (sched *WeightedScheduler) Swap(i, j int)
Swap swaps the elements with indexes i and j.
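A short usage sketch: because the scheduler is backed by a max-heap and Request.Weight decides the scheduling sequence, the higher-weight request is expected to be popped first. Start/Stop error handling is omitted.

sched := goscrapy.NewWeightedScheduler()
_ = sched.Start()
defer sched.Stop()

sched.PushRequest(&goscrapy.Request{URL: "https://example.com/low", Weight: 1})
sched.PushRequest(&goscrapy.Request{URL: "https://example.com/high", Weight: 10})

for sched.HasMore() {
	req, ok := sched.PopRequest()
	if !ok {
		break
	}
	fmt.Println(req.Weight, req.URL) // higher weight first
}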