README

Antch

Antch is inspired by Scrapy. If you're familiar with Scrapy, you can get started quickly.

Antch is a fast, powerful and extensible web crawling & scraping framework for Go, used to crawl websites and extract structured data from their pages.

Getting Started

Follow the Getting Started instructions to start your first spider.

Features

  • Polite, highly concurrent web crawler.
  • Powerful and customizable HTTP middleware.
  • Item data pipeline for the web spider.
  • Built-in proxy support (HTTP, HTTPS, SOCKS5).
  • Built-in XPath query support for HTML/XML documents.
  • Easy to use and integrate with your project.

Examples

BingWallpaper - Bing daily wallpaper.

Documentation

See https://github.com/antchfx/antch/wiki

Documentation

Constants

This section is empty.

Variables

var NilLogger nilLogger

NilLogger is a Logger that does not log any messages.

Functions

func ParseHTML

func ParseHTML(resp *http.Response) (*html.Node, error)

ParseHTML parses an HTTP response as an HTML document.

func ParseJSON

func ParseJSON(resp *http.Response) (*jsonquery.Node, error)

ParseJSON parses an HTTP response as a JSON document.

func ParseXML

func ParseXML(resp *http.Response) (*xmlquery.Node, error)

ParseXML parses an HTTP response as an XML document.

Types

type Crawler

type Crawler struct {
	// CheckRedirect specifies the policy for handling redirects.
	CheckRedirect func(req *http.Request, via []*http.Request) error

	// MaxConcurrentRequests specifies the maximum number of concurrent
	// requests that will be performed.
	// Default is 16.
	MaxConcurrentRequests int

	// MaxConcurrentRequestsPerSite specifies the maximum number of
	// concurrent requests that will be performed to any single domain.
	// Default is 1.
	MaxConcurrentRequestsPerSite int

	// RequestTimeout specifies how long to wait before an HTTP request times out.
	// Default is 30s.
	RequestTimeout time.Duration

	// DownloadDelay specifies how long to wait before accessing the same
	// website again.
	// Default is 0.25s.
	DownloadDelay time.Duration

	// MaxConcurrentItems specifies the maximum number of items
	// processed in parallel in the pipeline.
	// Default is 32.
	MaxConcurrentItems int

	// UserAgent specifies the User-Agent sent to the remote server.
	UserAgent string

	// ErrorLog specifies an optional logger for HTTP transport errors
	// and unexpected behavior from handlers.
	// If nil, logging goes to os.Stderr via the log package's
	// standard logger.
	ErrorLog Logger

	// Exit is an optional channel whose closure indicates that the Crawler
	// instance should stop working and exit.
	Exit <-chan struct{}
	// contains filtered or unexported fields
}

Crawler is the core of the web crawling server: it crawls websites and calls the pipeline to process the data received from their pages.
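
To illustrate what a limit such as MaxConcurrentRequests means in practice, here is a minimal sketch that bounds concurrency with a buffered channel used as a semaphore. It is not Antch's implementation, which this page does not show; runLimited and its parameters are hypothetical names chosen for the example.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// runLimited runs every task, allowing at most limit tasks to execute at the
// same time, and returns the highest concurrency level it observed.
// Hypothetical helper for illustration; not part of Antch.
func runLimited(limit int, tasks []func()) int {
	sem := make(chan struct{}, limit) // buffered channel as a counting semaphore
	var wg sync.WaitGroup
	var running, peak int32

	for _, task := range tasks {
		wg.Add(1)
		sem <- struct{}{} // blocks while limit tasks are already running
		go func(task func()) {
			defer wg.Done()
			defer func() { <-sem }() // release the slot when the task ends

			n := atomic.AddInt32(&running, 1)
			// Record the peak concurrency seen so far.
			for {
				p := atomic.LoadInt32(&peak)
				if n <= p || atomic.CompareAndSwapInt32(&peak, p, n) {
					break
				}
			}
			task()
			atomic.AddInt32(&running, -1)
		}(task)
	}
	wg.Wait()
	return int(atomic.LoadInt32(&peak))
}

func main() {
	tasks := make([]func(), 10)
	for i := range tasks {
		tasks[i] = func() {}
	}
	fmt.Println("peak concurrency <= 3:", runLimited(3, tasks) <= 3)
}
```

A per-site limit such as MaxConcurrentRequestsPerSite could plausibly be modeled the same way, with one semaphore per host.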

func NewCrawler

func NewCrawler() *Crawler

NewCrawler returns a new Crawler with default settings.

func (*Crawler) Crawl

func (c *Crawler) Crawl(req *http.Request) error

Crawl puts an HTTP request into the working queue to be crawled.

func (*Crawler) EnqueueURL

func (c *Crawler) EnqueueURL(URL string) error

EnqueueURL puts the given URL into the backup URL queue.

func (*Crawler) Handle

func (c *Crawler) Handle(pattern string, handler Handler)

Handle registers the Handler for the given pattern. The pattern "*" matches every request that no other pattern matches.
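
The fallback role of "*" can be sketched with a small lookup table. This page does not say how Antch matches a pattern against a request, so the sketch assumes exact matching on the host name; registry and lookup are hypothetical names, and handlers are represented by plain strings to keep the example short.

```go
package main

import "fmt"

// registry maps a pattern to a handler name. It is a hypothetical stand-in
// for the crawler's internal handler table.
type registry map[string]string

// lookup returns the entry registered for host and falls back to the "*"
// entry when no exact match exists (exact matching is an assumption).
func (r registry) lookup(host string) (name, pattern string, ok bool) {
	if name, ok := r[host]; ok {
		return name, host, true
	}
	if name, ok := r["*"]; ok {
		return name, "*", true
	}
	return "", "", false
}

func main() {
	r := registry{"example.com": "exampleHandler", "*": "defaultHandler"}
	for _, host := range []string{"example.com", "other.org"} {
		name, pattern, _ := r.lookup(host)
		fmt.Printf("%s -> %s (pattern %q)\n", host, name, pattern)
	}
}
```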

func (*Crawler) Handler

func (c *Crawler) Handler(res *http.Response) (h Handler, pattern string)

Handler returns the Handler for the given HTTP response.

func (*Crawler) StartURLs

func (c *Crawler) StartURLs(URLs []string)

StartURLs starts crawling the given list of URLs.

func (*Crawler) UseCompression

func (c *Crawler) UseCompression() *Crawler

UseCompression enables the HTTP compression middleware, which supports gzip and deflate for HTTP requests and responses.

func (*Crawler) UseCookies

func (c *Crawler) UseCookies() *Crawler

UseCookies enables the cookies middleware.

func (*Crawler) UseMiddleware

func (c *Crawler) UseMiddleware(m ...Middleware) *Crawler

UseMiddleware adds one or more Middleware to the crawler.

func (*Crawler) UsePipeline

func (c *Crawler) UsePipeline(p ...Pipeline) *Crawler

UsePipeline adds one or more Pipelines to the crawler.

func (*Crawler) UseProxy

func (c *Crawler) UseProxy(proxyURL *url.URL) *Crawler

UseProxy enables the given proxy for every HTTP request.

func (*Crawler) UseRobotstxt

func (c *Crawler) UseRobotstxt() *Crawler

UseRobotstxt enables robots.txt support.

type Handler

type Handler interface {
	ServeSpider(chan<- Item, *http.Response)
}

Handler is the HTTP response handler interface that defines how to extract scraped items from pages.

ServeSpider should write each extracted Item to the channel.

func VoidHandler

func VoidHandler() Handler

VoidHandler returns a Handler that does nothing.

type HandlerFunc

type HandlerFunc func(chan<- Item, *http.Response)

HandlerFunc is an adapter that allows ordinary functions to be used as a Handler.

func (HandlerFunc) ServeSpider

func (f HandlerFunc) ServeSpider(c chan<- Item, resp *http.Response)

ServeSpider extracts data from the received HTTP response and writes it to the channel c.

type HttpMessageHandler

type HttpMessageHandler interface {
	Send(*http.Request) (*http.Response, error)
}

HttpMessageHandler is an interface that receives an HTTP request and returns an HTTP response.

type HttpMessageHandlerFunc

type HttpMessageHandlerFunc func(*http.Request) (*http.Response, error)

HttpMessageHandlerFunc is an adapter that allows ordinary functions to be used as an HttpMessageHandler.

func (HttpMessageHandlerFunc) Send

Send sends an HTTP request and returns the HTTP response.

type Item

type Item interface{}

Item represents an item object.

type Logger

type Logger interface {
	Output(maxdepth int, s string) error
}

Logger is an interface for logging messages.

type MediaType

type MediaType struct {
	// Type is the HTTP content type, such as
	// "text/html" or "image/jpeg".
	Type string
	// Charset is the character encoding of the HTTP content.
	Charset string
}

MediaType describes the content type of an HTTP request or HTTP response.

func ParseMediaType

func ParseMediaType(v string) MediaType

ParseMediaType parses the string v into a MediaType.

func (MediaType) ContentType

func (m MediaType) ContentType() string

ContentType returns the value for the HTTP Content-Type header.
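
To show how a Content-Type value splits into a type and a charset, here is a sketch built on the standard library's mime.ParseMediaType. It does not call Antch's ParseMediaType, whose exact behavior (case handling, defaults, malformed input) is not described on this page; the local MediaType copy and the parse helper are for illustration only.

```go
package main

import (
	"fmt"
	"mime"
)

// MediaType is a local copy of the documented struct.
type MediaType struct {
	Type    string
	Charset string
}

// parse splits a Content-Type header value into its media type and charset
// parameter using the standard library. Hypothetical stand-in, not Antch's
// ParseMediaType.
func parse(v string) (MediaType, error) {
	t, params, err := mime.ParseMediaType(v)
	if err != nil {
		return MediaType{}, err
	}
	// A missing charset parameter yields an empty string.
	return MediaType{Type: t, Charset: params["charset"]}, nil
}

func main() {
	m, err := parse("text/html; charset=utf-8")
	if err != nil {
		fmt.Println("parse error:", err)
		return
	}
	fmt.Printf("%+v\n", m) // prints {Type:text/html Charset:utf-8}
}
```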

type Middleware

type Middleware func(HttpMessageHandler) HttpMessageHandler

Middleware is a layer in the HTTP message transport: it passes an HTTP request from one message handler to the next until one of them returns an HTTP response.

func CompressionMiddleware

func CompressionMiddleware() Middleware

CompressionMiddleware is a middleware that allows compressed (gzip, deflate) traffic to be sent to and received from sites.

func CookiesMiddleware

func CookiesMiddleware() Middleware

CookiesMiddleware is an HTTP cookies middleware that tracks cookies across HTTP requests.

func ProxyMiddleware

func ProxyMiddleware(f func(*http.Request) (*url.URL, error)) Middleware

ProxyMiddleware is an HTTP proxy middleware that sends HTTP requests to remote sites through a proxy.

ProxyMiddleware supports HTTP, HTTPS, and SOCKS5 proxies, e.g. http://127.0.0.1:8080, https://127.0.0.1:8080, or socks5://127.0.0.1:1080.

func RobotstxtMiddleware

func RobotstxtMiddleware() Middleware

RobotstxtMiddleware is a middleware that honors robots.txt, making HTTP requests more polite.

type Pipeline

type Pipeline func(PipelineHandler) PipelineHandler

Pipeline passes an Item from one PipelineHandler to the next PipelineHandler in the chain.

type PipelineHandler

type PipelineHandler interface {
	ServePipeline(Item)
}

PipelineHandler is an interface for a handler in the pipeline.

type PipelineHandlerFunc

type PipelineHandlerFunc func(Item)

PipelineHandlerFunc is an adapter that allows ordinary functions to be used as a PipelineHandler.

func (PipelineHandlerFunc) ServePipeline

func (f PipelineHandlerFunc) ServePipeline(v Item)

ServePipeline processes the given Item.
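
A pipeline chain can be sketched the same way as a middleware chain. The example declares local copies of the four types above; the ServePipeline body (calling f directly) and the trimStrings, dropEmpty, build and process names are assumptions for illustration. Each stage either forwards the item to the next handler or drops it.

```go
package main

import (
	"fmt"
	"strings"
)

// Local copies of the documented declarations.
type Item interface{}

type PipelineHandler interface {
	ServePipeline(Item)
}

type PipelineHandlerFunc func(Item)

// ServePipeline calls f(v). Assumed body; the page shows only the signature.
func (f PipelineHandlerFunc) ServePipeline(v Item) { f(v) }

type Pipeline func(PipelineHandler) PipelineHandler

// trimStrings trims surrounding whitespace from string items.
func trimStrings(next PipelineHandler) PipelineHandler {
	return PipelineHandlerFunc(func(v Item) {
		if s, ok := v.(string); ok {
			v = strings.TrimSpace(s)
		}
		next.ServePipeline(v)
	})
}

// dropEmpty discards empty strings and forwards everything else.
func dropEmpty(next PipelineHandler) PipelineHandler {
	return PipelineHandlerFunc(func(v Item) {
		if s, ok := v.(string); ok && s == "" {
			return
		}
		next.ServePipeline(v)
	})
}

// build wraps sink so that the first stage in ps sees each item first.
func build(sink PipelineHandler, ps ...Pipeline) PipelineHandler {
	for i := len(ps) - 1; i >= 0; i-- {
		sink = ps[i](sink)
	}
	return sink
}

// process pushes items through trimStrings then dropEmpty and returns
// whatever reaches the end of the chain.
func process(items []Item) []Item {
	var out []Item
	sink := PipelineHandlerFunc(func(v Item) { out = append(out, v) })
	h := build(sink, trimStrings, dropEmpty)
	for _, it := range items {
		h.ServePipeline(it)
	}
	return out
}

func main() {
	fmt.Println(process([]Item{"  title  ", "   ", 42})) // prints [title 42]
}
```

With the real package, stages shaped like trimStrings and dropEmpty are what Crawler.UsePipeline accepts.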

type ProxyKey

type ProxyKey struct{}

ProxyKey is the key for the proxy URL used by the Crawler.

Directories

Path Synopsis
contrib