antch

package module
v0.3.6
Published: Mar 17, 2021 License: MIT Imports: 31 Imported by: 0

README

Antch

Antch is inspired by Scrapy. If you're familiar with Scrapy, you can get started quickly.

Antch is a fast, powerful and extensible web crawling & scraping framework for Go, used to crawl websites and extract structured data from their pages.

Get Started

Follow the Getting Started instructions to start your first spider.
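
The sketch below gives a first impression using only the API documented on this page; the start URL, the XPath expression, and the companion github.com/antchfx/htmlquery package used for querying are illustrative assumptions rather than requirements of antch.

package main

import (
	"log"
	"net/http"
	"time"

	"github.com/antchfx/antch"
	"github.com/antchfx/htmlquery" // assumed XPath helper for the *html.Node returned by ParseHTML
)

func main() {
	crawler := antch.NewCrawler()

	// Spider: extract the page title from every HTML response.
	crawler.Handle("*", antch.HandlerFunc(func(c chan<- antch.Item, resp *http.Response) {
		doc, err := antch.ParseHTML(resp)
		if err != nil {
			return
		}
		if title := htmlquery.FindOne(doc, "//title"); title != nil {
			c <- htmlquery.InnerText(title)
		}
	}))

	// Pipeline: log every scraped item, then hand it to the next stage.
	crawler.UsePipeline(func(next antch.PipelineHandler) antch.PipelineHandler {
		return antch.PipelineHandlerFunc(func(v antch.Item) {
			log.Println("item:", v)
			next.ServePipeline(v)
		})
	})

	crawler.StartURLs([]string{"http://example.com/"}) // illustrative start URL

	// Give the crawler time to work; a real program would wire up its own
	// shutdown signal, e.g. via the Exit channel documented below.
	time.Sleep(30 * time.Second)
}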

Features

  • Polite, highly concurrent web crawler.
  • Powerful and customizable HTTP middleware.
  • Item data pipeline for the web spider.
  • Built-in proxy support (HTTP, HTTPS, SOCKS5).
  • Built-in XPath query support for HTML/XML documents.
  • Easy to use and integrate with your project.

Examples

BingWallpaper - Bing daily wallpaper.

Documentation

See https://github.com/antchfx/antch/wiki

Documentation

Index

Constants

This section is empty.

Variables

var NilLogger nilLogger

NilLogger is a Logger that does not log any messages.

Functions

func ParseHTML

func ParseHTML(resp *http.Response) (*html.Node, error)

ParseHTML parses an HTTP response as an HTML document.
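
A hedged usage sketch, assuming the usual net/http, fmt, and log imports plus the companion github.com/antchfx/htmlquery package for the XPath query (the URL is illustrative):

resp, err := http.Get("http://example.com/") // illustrative URL
if err != nil {
	log.Fatal(err)
}
defer resp.Body.Close()

doc, err := antch.ParseHTML(resp) // *html.Node for the whole document
if err != nil {
	log.Fatal(err)
}
// Query the parsed document with XPath via the htmlquery helper package.
for _, href := range htmlquery.Find(doc, "//a/@href") {
	fmt.Println(htmlquery.InnerText(href))
}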

func ParseJSON

func ParseJSON(resp *http.Response) (*jsonquery.Node, error)

ParseJSON parses an HTTP response as a JSON document.
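
A similar hedged sketch for JSON responses, assuming the companion github.com/antchfx/jsonquery package for the query (the endpoint and field name are illustrative):

resp, err := http.Get("https://api.example.com/user.json") // illustrative endpoint
if err != nil {
	log.Fatal(err)
}
defer resp.Body.Close()

doc, err := antch.ParseJSON(resp)
if err != nil {
	log.Fatal(err)
}
// Find a "name" field anywhere in the JSON document.
if node := jsonquery.FindOne(doc, "//name"); node != nil {
	fmt.Println(node.InnerText())
}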

func ParseXML

func ParseXML(resp *http.Response) (*xmlquery.Node, error)

ParseXML parses an HTTP response as an XML document.

Types

type Crawler

type Crawler struct {
	// CheckRedirect specifies the policy for handling redirects.
	CheckRedirect func(req *http.Request, via []*http.Request) error

	// MaxConcurrentRequests specifies the maximum number of concurrent
	// requests that will be performed.
	// Default is 16.
	MaxConcurrentRequests int

	// MaxConcurrentRequestsPerSite specifies the maximum number of
	// concurrent requests that will be performed to any single domain.
	// Default is 1.
	MaxConcurrentRequestsPerSite int

	// RequestTimeout specifies a time to wait before the HTTP Request times out.
	// Default is 30s.
	RequestTimeout time.Duration

	// DownloadDelay specifies the delay to wait between accesses to the same website.
	// Default is 0.25s.
	DownloadDelay time.Duration

	// MaxConcurrentItems specifies the maximum number of concurrent items
	// to process in parallel in the pipeline.
	// Default is 32.
	MaxConcurrentItems int

	// MaxRetries specifies the maximum number of retries we'll make for a particular URL.
	// Default is 0 (excluding the original attempt)
	MaxRetries int

	// RetryHTTPResponseCodes specifies the response codes for which we'll retry for a particular URL.
	RetryHTTPResponseCodes []int

	// Default duration between retries. If the response headers have a Retry-After, we'll respect that instead.
	// Default is 10s.
	DefaultRetryPeriod time.Duration

	// Transport specifies the mechanism by which HTTP requests are made. Note that if populated, then this field would
	// take precedence over other configs that determine the transport configuration e.g.: MaxConcurrentRequestsPerSite.
	// If nil, then a default value is used.
	Transport http.RoundTripper

	// UserAgent specifies the user-agent for the remote server.
	UserAgent string

	// ErrorLog specifies an optional logger for errors HTTP transports
	// and unexpected behavior from handlers.
	// If nil, logging goes to os.Stderr via the log package's
	// standard logger.
	ErrorLog Logger

	// Exit is an optional channel whose closure indicates that the Crawler
	// instance should stop work and exit.
	Exit <-chan struct{}
	// contains filtered or unexported fields
}

Crawler is the core of the web crawl server; it crawls websites and calls the pipeline to process the data received from their pages.

func NewCrawler

func NewCrawler() *Crawler

NewCrawler returns a new Crawler with default settings.
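
Fields on the returned Crawler may be adjusted before crawling starts. A hedged sketch tuning a few of the options documented above (time and antch assumed imported; the values are arbitrary):

crawler := antch.NewCrawler()
crawler.MaxConcurrentRequests = 32             // overall request concurrency
crawler.MaxConcurrentRequestsPerSite = 2       // per-domain concurrency
crawler.RequestTimeout = 20 * time.Second      // per-request timeout
crawler.DownloadDelay = 500 * time.Millisecond // politeness delay between hits to the same site
crawler.MaxRetries = 2
crawler.RetryHTTPResponseCodes = []int{500, 502, 503}
crawler.UserAgent = "my-crawler/0.1" // illustrative user agent string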

func (*Crawler) Crawl

func (c *Crawler) Crawl(req *http.Request) error

Crawl puts an HTTP request into the working queue for crawling.

func (*Crawler) EnqueueURL

func (c *Crawler) EnqueueURL(URL string) error

EnqueueURL puts the given URL into the backup URL queue.

func (*Crawler) Handle

func (c *Crawler) Handle(pattern string, handler Handler)

Handle registers the Handler for the given pattern. The pattern "*" matches any request that no other pattern matches.
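
A hedged sketch registering a site-specific handler plus a catch-all fallback, given a crawler created with NewCrawler; host-style patterns follow the project's own examples and the hostname here is illustrative:

// Handler for a specific site.
crawler.Handle("www.example.com", antch.HandlerFunc(func(c chan<- antch.Item, resp *http.Response) {
	// ... extract items from this site's pages ...
}))

// Fallback for any response that no other pattern matches.
crawler.Handle("*", antch.VoidHandler())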

func (*Crawler) Handler

func (c *Crawler) Handler(res *http.Response) (h Handler, pattern string)

Handler returns a Handler for the given HTTP Response.

func (*Crawler) StartURLs

func (c *Crawler) StartURLs(URLs []string)

StartURLs starts crawling for the given URL list.

func (*Crawler) UseCompression

func (c *Crawler) UseCompression() *Crawler

UseCompression enables the HTTP compression middleware, which supports gzip and deflate for HTTP requests and responses.

func (*Crawler) UseCookies

func (c *Crawler) UseCookies() *Crawler

UseCookies enables the cookies middleware.

func (*Crawler) UseMiddleware

func (c *Crawler) UseMiddleware(m ...Middleware) *Crawler

UseMiddleware adds a Middleware to the crawler.

func (*Crawler) UsePipeline

func (c *Crawler) UsePipeline(p ...Pipeline) *Crawler

UsePipeline adds a Pipeline to the crawler.

func (*Crawler) UseProxy

func (c *Crawler) UseProxy(proxyURL *url.URL) *Crawler

UseProxy enables a proxy for all HTTP requests.

func (*Crawler) UseRobotstxt

func (c *Crawler) UseRobotstxt() *Crawler

UseRobotstxt enables robots.txt support.
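
Since the Use* helpers return the Crawler itself, they can be chained; a brief illustrative sketch:

crawler := antch.NewCrawler()
crawler.UseCompression().
	UseCookies().
	UseRobotstxt()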

type Handler

type Handler interface {
	ServeSpider(chan<- Item, *http.Response)
}

Handler is the HTTP Response handler interface that defines how to extract scraped items from their pages.

ServeSpider should write each extracted Item to the channel.

func VoidHandler

func VoidHandler() Handler

VoidHandler returns a Handler that does nothing.

type HandlerFunc

type HandlerFunc func(chan<- Item, *http.Response)

HandlerFunc is an adapter to allow the use of ordinary functions as spider Handlers.

func (HandlerFunc) ServeSpider

func (f HandlerFunc) ServeSpider(c chan<- Item, resp *http.Response)

ServeSpider extracts data from the received HTTP response and writes it into the channel c.

type HttpMessageHandler

type HttpMessageHandler interface {
	Send(*http.Request) (*http.Response, error)
}

HttpMessageHandler is an interface that receives an HTTP request and returns an HTTP response.

type HttpMessageHandlerFunc

type HttpMessageHandlerFunc func(*http.Request) (*http.Response, error)

HttpMessageHandlerFunc is an adapter to allow the use of ordinary functions as HttpMessageHandler.

func (HttpMessageHandlerFunc) Send

func (f HttpMessageHandlerFunc) Send(req *http.Request) (*http.Response, error)

Send sends an HTTP request and returns an HTTP response.

type Item

type Item interface{}

Item represents an item object.

type Logger

type Logger interface {
	Output(maxdepth int, s string) error
}

Logger is an interface for logging messages.

type MediaType

type MediaType struct {
	// Type is the HTTP content type, such as
	// "text/html" or "image/jpeg".
	Type string
	// Charset is the character encoding of the HTTP content.
	Charset string
}

MediaType describes the content type of an HTTP request or response.

func ParseMediaType

func ParseMediaType(v string) MediaType

ParseMediaType parses the string v into a MediaType struct.

func (MediaType) ContentType

func (m MediaType) ContentType() string

ContentType returns the HTTP Content-Type header value.
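
A hedged round-trip sketch, assuming ParseMediaType splits the value into Type and Charset as the field comments suggest and that ContentType reassembles it:

m := antch.ParseMediaType("text/html; charset=utf-8")
fmt.Println(m.Type)          // text/html
fmt.Println(m.Charset)       // utf-8
fmt.Println(m.ContentType()) // assumed to reproduce "text/html; charset=utf-8"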

type Middleware

type Middleware func(HttpMessageHandler) HttpMessageHandler

Middleware is the HTTP message transport middle layer; it passes an HTTP request from one message handler to the next until an HTTP response is returned.
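
Because Middleware is simply a function from one HttpMessageHandler to another, a custom layer can be built with the HttpMessageHandlerFunc adapter. A hedged sketch that sets a request header before delegating to the next handler (the function and header names are illustrative; net/http and antch assumed imported):

func headerMiddleware(key, value string) antch.Middleware {
	return func(next antch.HttpMessageHandler) antch.HttpMessageHandler {
		return antch.HttpMessageHandlerFunc(func(req *http.Request) (*http.Response, error) {
			req.Header.Set(key, value) // decorate the outgoing request
			return next.Send(req)      // delegate to the next handler in the chain
		})
	}
}

Such a middleware can then be installed with crawler.UseMiddleware(headerMiddleware("X-Requested-With", "antch")).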

func CompressionMiddleware

func CompressionMiddleware() Middleware

CompressionMiddleware is a middleware that allows compressed (gzip, deflate) traffic to be sent to and received from sites.

func CookiesMiddleware

func CookiesMiddleware() Middleware

CookiesMiddleware is an HTTP cookies middleware that tracks cookies across HTTP requests.

func ProxyMiddleware

func ProxyMiddleware(f func(*http.Request) (*url.URL, error)) Middleware

ProxyMiddleware is an HTTP proxy middleware that routes HTTP requests through the given proxy to access remote sites.

ProxyMiddleware supports the HTTP, HTTPS, and SOCKS5 protocols, e.g. http://127.0.0.1:8080, https://127.0.0.1:8080, or socks5://127.0.0.1:1080.
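
In practice a proxy is typically enabled through Crawler.UseProxy; a hedged sketch with an illustrative local SOCKS5 address (net/url, log, and antch assumed imported):

proxyURL, err := url.Parse("socks5://127.0.0.1:1080") // illustrative proxy address
if err != nil {
	log.Fatal(err)
}
crawler := antch.NewCrawler()
crawler.UseProxy(proxyURL) // route every request through the proxy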

func RobotstxtMiddleware

func RobotstxtMiddleware() Middleware

RobotstxtMiddleware is a middleware for robots.txt that makes HTTP requests more polite.

type Pipeline

type Pipeline func(PipelineHandler) PipelineHandler

Pipeline passes an Item value from one PipelineHandler to the next PipelineHandler in the chain.
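
Like Middleware, a Pipeline stage wraps the next PipelineHandler in the chain. A hedged sketch of a stage that logs each Item before passing it on (the function name is illustrative; log and antch assumed imported):

func logPipeline() antch.Pipeline {
	return func(next antch.PipelineHandler) antch.PipelineHandler {
		return antch.PipelineHandlerFunc(func(v antch.Item) {
			log.Printf("scraped: %v", v) // inspect the item
			next.ServePipeline(v)        // forward to the next stage
		})
	}
}

It would be installed with crawler.UsePipeline(logPipeline()).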

type PipelineHandler

type PipelineHandler interface {
	ServePipeline(Item)
}

PipelineHandler is an interface for a handler in the pipeline.

type PipelineHandlerFunc

type PipelineHandlerFunc func(Item)

PipelineHandlerFunc is an adapter to allow the use of ordinary functions as PipelineHandler.

func (PipelineHandlerFunc) ServePipeline

func (f PipelineHandlerFunc) ServePipeline(v Item)

ServePipeline processes the given Item data.

type ProxyKey

type ProxyKey struct{}

ProxyKey is a key for the proxy URL used by the Crawler.

Directories

Path Synopsis
contrib
