antch

package module
v0.3.6
Published: Mar 17, 2021 License: MIT Imports: 31 Imported by: 0

README

Antch

Antch is inspired by Scrapy. If you're familiar with Scrapy, you can get started quickly.

Antch is a fast, powerful and extensible web crawling & scraping framework for Go, used to crawl websites and extract structured data from their pages.

Get Started

Follow the Getting Started instructions to start your first spider.
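
The sketch below gives a first impression using only the API documented on this page; the start URL, the XPath expression, and the companion github.com/antchfx/htmlquery package used for querying are illustrative assumptions rather than requirements of antch.

package main

import (
	"log"
	"net/http"
	"time"

	"github.com/antchfx/antch"
	"github.com/antchfx/htmlquery" // assumed XPath helper for the *html.Node returned by ParseHTML
)

func main() {
	crawler := antch.NewCrawler()

	// Spider: extract the page title from every HTML response.
	crawler.Handle("*", antch.HandlerFunc(func(c chan<- antch.Item, resp *http.Response) {
		doc, err := antch.ParseHTML(resp)
		if err != nil {
			return
		}
		if title := htmlquery.FindOne(doc, "//title"); title != nil {
			c <- htmlquery.InnerText(title)
		}
	}))

	// Pipeline: log every scraped item, then hand it to the next stage.
	crawler.UsePipeline(func(next antch.PipelineHandler) antch.PipelineHandler {
		return antch.PipelineHandlerFunc(func(v antch.Item) {
			log.Println("item:", v)
			next.ServePipeline(v)
		})
	})

	crawler.StartURLs([]string{"http://example.com/"}) // illustrative start URL

	// Give the crawler time to work; a real program would wire up its own
	// shutdown signal, e.g. via the Exit channel documented below.
	time.Sleep(30 * time.Second)
}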

Features

  • Polite, highly concurrent web crawler.
  • Powerful and customizable HTTP middleware.
  • Item data pipeline for the web spider.
  • Built-in proxy support (HTTP, HTTPS, SOCKS5).
  • Built-in XPath query support for HTML/XML documents.
  • Easy to use and integrate with your project.

Examples

BingWallpaper - Bing daily wallpaper.

Documentation

See https://github.com/antchfx/antch/wiki

Documentation

Index

Constants

This section is empty.

Variables

var NilLogger nilLogger

NilLogger is a Logger that does not log any messages.

Functions

func ParseHTML

func ParseHTML(resp *http.Response) (*html.Node, error)

ParseHTML parses an HTTP response as an HTML document.
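
A hedged usage sketch, assuming the usual net/http, fmt, and log imports plus the companion github.com/antchfx/htmlquery package for the XPath query (the URL is illustrative):

resp, err := http.Get("http://example.com/") // illustrative URL
if err != nil {
	log.Fatal(err)
}
defer resp.Body.Close()

doc, err := antch.ParseHTML(resp) // *html.Node for the whole document
if err != nil {
	log.Fatal(err)
}
// Query the parsed document with XPath via the htmlquery helper package.
for _, href := range htmlquery.Find(doc, "//a/@href") {
	fmt.Println(htmlquery.InnerText(href))
}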

func ParseJSON

func ParseJSON(resp *http.Response) (*jsonquery.Node, error)

ParseJSON parses an HTTP response as a JSON document.
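
A similar hedged sketch for JSON responses, assuming the companion github.com/antchfx/jsonquery package for the query (the endpoint and field name are illustrative):

resp, err := http.Get("https://api.example.com/user.json") // illustrative endpoint
if err != nil {
	log.Fatal(err)
}
defer resp.Body.Close()

doc, err := antch.ParseJSON(resp)
if err != nil {
	log.Fatal(err)
}
// Find a "name" field anywhere in the JSON document.
if node := jsonquery.FindOne(doc, "//name"); node != nil {
	fmt.Println(node.InnerText())
}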

func ParseXML

func ParseXML(resp *http.Response) (*xmlquery.Node, error)

ParseXML parses an HTTP response as an XML document.

Types

type Crawler

type Crawler struct {
	// CheckRedirect specifies the policy for handling redirects.
	CheckRedirect func(req *http.Request, via []*http.Request) error

	// MaxConcurrentRequests specifies the maximum number of concurrent
	// requests that will be performed.
	// Default is 16.
	MaxConcurrentRequests int

	// MaxConcurrentRequestsPerSite specifies the maximum number of
	// concurrent requests that will be performed to any single domain.
	// Default is 1.
	MaxConcurrentRequestsPerSite int

	// RequestTimeout specifies a time to wait before the HTTP Request times out.
	// Default is 30s.
	RequestTimeout time.Duration

	// DownloadDelay specifies the delay to wait between accesses to the same website.
	// Default is 0.25s.
	DownloadDelay time.Duration

	// MaxConcurrentItems specifies the maximum number of concurrent items
	// to process in parallel in the pipeline.
	// Default is 32.
	MaxConcurrentItems int

	// MaxRetries specifies the maximum number of retries we'll make for a particular URL.
	// Default is 0 (excluding the original attempt)
	MaxRetries int

	// RetryHTTPResponseCodes specifies the response codes for which we'll retry for a particular URL.
	RetryHTTPResponseCodes []int

	// Default duration between retries. If the response headers have a Retry-After, we'll respect that instead.
	// Default is 10s.
	DefaultRetryPeriod time.Duration

	// Transport specifies the mechanism by which HTTP requests are made. Note that if populated, then this field would
	// take precedence over other configs that determine the transport configuration e.g.: MaxConcurrentRequestsPerSite.
	// If nil, then a default value is used.
	Transport http.RoundTripper

	// UserAgent specifies the user-agent for the remote server.
	UserAgent string

	// ErrorLog specifies an optional logger for errors HTTP transports
	// and unexpected behavior from handlers.
	// If nil, logging goes to os.Stderr via the log package's
	// standard logger.
	ErrorLog Logger

	// Exit is an optional channel whose closure indicates that the Crawler
	// instance should stop work and exit.
	Exit <-chan struct{}
	// contains filtered or unexported fields
}

Crawler is the core of the web crawl server; it crawls websites and calls the pipeline to process the data received from their pages.

func NewCrawler

func NewCrawler() *Crawler

NewCrawler returns a new Crawler with default settings.
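
Fields on the returned Crawler may be adjusted before crawling starts. A hedged sketch tuning a few of the options documented above (time and antch assumed imported; the values are arbitrary):

crawler := antch.NewCrawler()
crawler.MaxConcurrentRequests = 32             // overall request concurrency
crawler.MaxConcurrentRequestsPerSite = 2       // per-domain concurrency
crawler.RequestTimeout = 20 * time.Second      // per-request timeout
crawler.DownloadDelay = 500 * time.Millisecond // politeness delay between hits to the same site
crawler.MaxRetries = 2
crawler.RetryHTTPResponseCodes = []int{500, 502, 503}
crawler.UserAgent = "my-crawler/0.1" // illustrative user agent string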

func (*Crawler) Crawl

func (c *Crawler) Crawl(req *http.Request) error

Crawl puts an HTTP request into the working queue for crawling.

func (*Crawler) EnqueueURL

func (c *Crawler) EnqueueURL(URL string) error

EnqueueURL puts the given URL into the backup URL queue.

func (*Crawler) Handle

func (c *Crawler) Handle(pattern string, handler Handler)

Handle registers the Handler for the given pattern. The pattern "*" matches any request that no other pattern matches.
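
A hedged sketch registering a site-specific handler plus a catch-all fallback, given a crawler created with NewCrawler; host-style patterns follow the project's own examples and the hostname here is illustrative:

// Handler for a specific site.
crawler.Handle("www.example.com", antch.HandlerFunc(func(c chan<- antch.Item, resp *http.Response) {
	// ... extract items from this site's pages ...
}))

// Fallback for any response that no other pattern matches.
crawler.Handle("*", antch.VoidHandler())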

func (*Crawler) Handler

func (c *Crawler) Handler(res *http.Response) (h Handler, pattern string)

Handler returns a Handler for the given HTTP Response.

func (*Crawler) StartURLs

func (c *Crawler) StartURLs(URLs []string)

StartURLs starts crawling for the given URL list.

func (*Crawler) UseCompression

func (c *Crawler) UseCompression() *Crawler

UseCompression enables the HTTP compression middleware, which supports gzip and deflate for HTTP requests and responses.

func (*Crawler) UseCookies

func (c *Crawler) UseCookies() *Crawler

UseCookies enables the cookies middleware.

func (*Crawler) UseMiddleware

func (c *Crawler) UseMiddleware(m ...Middleware) *Crawler

UseMiddleware adds a Middleware to the crawler.

func (*Crawler) UsePipeline

func (c *Crawler) UsePipeline(p ...Pipeline) *Crawler

UsePipeline adds a Pipeline to the crawler.

func (*Crawler) UseProxy

func (c *Crawler) UseProxy(proxyURL *url.URL) *Crawler

UseProxy enables a proxy for all HTTP requests.

func (*Crawler) UseRobotstxt

func (c *Crawler) UseRobotstxt() *Crawler

UseRobotstxt enables robots.txt support.
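
Since the Use* helpers return the Crawler itself, they can be chained; a brief illustrative sketch:

crawler := antch.NewCrawler()
crawler.UseCompression().
	UseCookies().
	UseRobotstxt()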

type Handler

type Handler interface {
	ServeSpider(chan<- Item, *http.Response)
}

Handler is the HTTP Response handler interface that defines how to extract scraped items from their pages.

ServeSpider should write each extracted Item to the channel.

func VoidHandler

func VoidHandler() Handler

VoidHandler returns a Handler that does nothing.

type HandlerFunc

type HandlerFunc func(chan<- Item, *http.Response)

HandlerFunc is an adapter to allow the use of ordinary functions as spider Handlers.

func (HandlerFunc) ServeSpider

func (f HandlerFunc) ServeSpider(c chan<- Item, resp *http.Response)

ServeSpider extracts data from the received HTTP response and writes it into the channel c.

type HttpMessageHandler

type HttpMessageHandler interface {
	Send(*http.Request) (*http.Response, error)
}

HttpMessageHandler is an interface that receives an HTTP request and returns an HTTP response.

type HttpMessageHandlerFunc

type HttpMessageHandlerFunc func(*http.Request) (*http.Response, error)

HttpMessageHandlerFunc is an adapter to allow the use of ordinary functions as HttpMessageHandler.

func (HttpMessageHandlerFunc) Send

func (f HttpMessageHandlerFunc) Send(req *http.Request) (*http.Response, error)

Send sends an HTTP request and returns an HTTP response.

type Item

type Item interface{}

Item represents an item object.

type Logger

type Logger interface {
	Output(maxdepth int, s string) error
}

Logger is an interface for logging messages.

type MediaType

type MediaType struct {
	// Type is the HTTP content type, such as
	// "text/html" or "image/jpeg".
	Type string
	// Charset is the character encoding of the HTTP content.
	Charset string
}

MediaType describes the content type of an HTTP request or response.

func ParseMediaType

func ParseMediaType(v string) MediaType

ParseMediaType parses the string v into a MediaType struct.

func (MediaType) ContentType

func (m MediaType) ContentType() string

ContentType returns the HTTP Content-Type header value.
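
A hedged round-trip sketch, assuming ParseMediaType splits the value into Type and Charset as the field comments suggest and that ContentType reassembles it:

m := antch.ParseMediaType("text/html; charset=utf-8")
fmt.Println(m.Type)          // text/html
fmt.Println(m.Charset)       // utf-8
fmt.Println(m.ContentType()) // assumed to reproduce "text/html; charset=utf-8"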

type Middleware

type Middleware func(HttpMessageHandler) HttpMessageHandler

Middleware is the HTTP message transport middle layer; it passes an HTTP request from one message handler to the next until an HTTP response is returned.
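
Because Middleware is simply a function from one HttpMessageHandler to another, a custom layer can be built with the HttpMessageHandlerFunc adapter. A hedged sketch that sets a request header before delegating to the next handler (the function and header names are illustrative; net/http and antch assumed imported):

func headerMiddleware(key, value string) antch.Middleware {
	return func(next antch.HttpMessageHandler) antch.HttpMessageHandler {
		return antch.HttpMessageHandlerFunc(func(req *http.Request) (*http.Response, error) {
			req.Header.Set(key, value) // decorate the outgoing request
			return next.Send(req)      // delegate to the next handler in the chain
		})
	}
}

Such a middleware can then be installed with crawler.UseMiddleware(headerMiddleware("X-Requested-With", "antch")).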

func CompressionMiddleware

func CompressionMiddleware() Middleware

CompressionMiddleware is a middleware that allows compressed (gzip, deflate) traffic to be sent to and received from sites.

func CookiesMiddleware

func CookiesMiddleware() Middleware

CookiesMiddleware is an HTTP cookies middleware that tracks cookies across HTTP requests.

func ProxyMiddleware

func ProxyMiddleware(f func(*http.Request) (*url.URL, error)) Middleware

ProxyMiddleware is an HTTP proxy middleware that routes HTTP requests through the given proxy to access remote sites.

ProxyMiddleware supports the HTTP, HTTPS, and SOCKS5 protocols, e.g. http://127.0.0.1:8080, https://127.0.0.1:8080, or socks5://127.0.0.1:1080.
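
In practice a proxy is typically enabled through Crawler.UseProxy; a hedged sketch with an illustrative local SOCKS5 address (net/url, log, and antch assumed imported):

proxyURL, err := url.Parse("socks5://127.0.0.1:1080") // illustrative proxy address
if err != nil {
	log.Fatal(err)
}
crawler := antch.NewCrawler()
crawler.UseProxy(proxyURL) // route every request through the proxy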

func RobotstxtMiddleware

func RobotstxtMiddleware() Middleware

RobotstxtMiddleware is a middleware for robots.txt that makes HTTP requests more polite.

type Pipeline

type Pipeline func(PipelineHandler) PipelineHandler

Pipeline passes an Item value from one PipelineHandler to the next PipelineHandler in the chain.
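
Like Middleware, a Pipeline stage wraps the next PipelineHandler in the chain. A hedged sketch of a stage that logs each Item before passing it on (the function name is illustrative; log and antch assumed imported):

func logPipeline() antch.Pipeline {
	return func(next antch.PipelineHandler) antch.PipelineHandler {
		return antch.PipelineHandlerFunc(func(v antch.Item) {
			log.Printf("scraped: %v", v) // inspect the item
			next.ServePipeline(v)        // forward to the next stage
		})
	}
}

It would be installed with crawler.UsePipeline(logPipeline()).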

type PipelineHandler

type PipelineHandler interface {
	ServePipeline(Item)
}

PipelineHandler is an interface for a handler in the pipeline.

type PipelineHandlerFunc

type PipelineHandlerFunc func(Item)

PipelineHandlerFunc is an adapter to allow the use of ordinary functions as PipelineHandler.

func (PipelineHandlerFunc) ServePipeline

func (f PipelineHandlerFunc) ServePipeline(v Item)

ServePipeline processes the given Item data.

type ProxyKey

type ProxyKey struct{}

ProxyKey is a key for the proxy URL used by the Crawler.

Directories

Path Synopsis
contrib
