Documentation ¶
Overview ¶
Package fetch of the Dataflow kit is used by the fetch.d service, which downloads HTML content from web pages to feed Dataflow kit scrapers.
Fetcher is the interface that must be satisfied by things that can fetch remote URLs and return their contents.
Currently two types of fetcher are available: Chrome Fetcher and Base Fetcher.
Base Fetcher downloads HTML web pages using the Go standard library's http client.
Chrome Fetcher connects to Headless Chrome, which renders JavaScript pages.
RobotsTxtMiddleware checks whether scraping of the specified resource is allowed by robots.txt.
Index ¶
- func AllowedByRobots(rawurl string, robotsData *robotstxt.RobotsData) bool
- func AssembleRobotstxtURL(rawurl string) (string, error)
- func RobotstxtData(url string) (robotsData *robotstxt.RobotsData, err error)
- type BaseFetcher
- type ChromeFetcher
- type Config
- type FetchService
- type Fetcher
- type HTMLServer
- type Request
- type Service
- type ServiceMiddleware
- type Type
Constants ¶
This section is empty.
Variables ¶
This section is empty.
Functions ¶
func AllowedByRobots ¶
func AllowedByRobots(rawurl string, robotsData *robotstxt.RobotsData) bool
AllowedByRobots checks whether scraping of the specified URL is allowed by robots.txt.
func AssembleRobotstxtURL ¶
func AssembleRobotstxtURL(rawurl string) (string, error)
AssembleRobotstxtURL assembles the robots.txt URL from the given URL.
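The package source is not shown here, so the following is only a minimal sketch of how a robots.txt URL can be derived from a page URL with the standard library. The helper name robotsURL is hypothetical, and the real AssembleRobotstxtURL may handle edge cases differently.

```go
package main

import (
	"fmt"
	"net/url"
)

// robotsURL is a hypothetical stand-in for AssembleRobotstxtURL: it keeps the
// scheme and host of rawurl and replaces the rest with /robots.txt.
func robotsURL(rawurl string) (string, error) {
	u, err := url.Parse(rawurl)
	if err != nil {
		return "", err
	}
	// A relative or scheme-less URL gives no host to ask for robots.txt.
	if u.Scheme == "" || u.Host == "" {
		return "", fmt.Errorf("cannot build robots.txt URL from %q: missing scheme or host", rawurl)
	}
	return u.Scheme + "://" + u.Host + "/robots.txt", nil
}

func main() {
	s, err := robotsURL("https://example.com/catalog/page?id=7")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(s) // prints "https://example.com/robots.txt"
}
```

The path and query of the page URL are dropped on purpose, since robots.txt is always served from the root of the host.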
func RobotstxtData ¶
func RobotstxtData(url string) (robotsData *robotstxt.RobotsData, err error)
RobotstxtData generates the robots.txt URL and retrieves its content through the API fetch endpoint.
Types ¶
type BaseFetcher ¶
type BaseFetcher struct {
// contains filtered or unexported fields
}
BaseFetcher is a Fetcher that uses the Go standard library's http client to fetch URLs.
func (*BaseFetcher) Fetch ¶
func (bf *BaseFetcher) Fetch(request Request) (io.ReadCloser, error)
Fetch retrieves a document from the remote server. It returns the web page content along with cache and expiration information.
type ChromeFetcher ¶
type ChromeFetcher struct {
// contains filtered or unexported fields
}
ChromeFetcher is used to fetch JavaScript-rendered pages.
func (*ChromeFetcher) Fetch ¶
func (f *ChromeFetcher) Fetch(request Request) (io.ReadCloser, error)
Fetch retrieves a document from the remote server. It returns the web page content along with cache and expiration information.
type FetchService ¶
type FetchService struct {
}
FetchService implements the Service interface with an empty struct.
func (FetchService) Fetch ¶
func (fs FetchService) Fetch(req Request) (io.ReadCloser, error)
Fetch retrieves the content of a web page with either the Base or the Chrome fetcher.
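How the service chooses between the two fetchers is not documented here. The sketch below shows one plausible approach, dispatching on the Type field of the request. The type values "base" and "chrome", the stubFetcher type, and the newFetcher helper are all assumptions made for illustration, not the package's actual API.

```go
package main

import (
	"fmt"
	"io"
	"strings"
)

// request mirrors only the Request fields this sketch needs.
type request struct {
	Type string
	URL  string
}

// fetcher mirrors the exported method of the Fetcher interface.
type fetcher interface {
	Fetch(req request) (io.ReadCloser, error)
}

// stubFetcher is a hypothetical placeholder that does no network I/O.
type stubFetcher struct{ name string }

func (s stubFetcher) Fetch(req request) (io.ReadCloser, error) {
	return io.NopCloser(strings.NewReader(s.name + " fetched " + req.URL)), nil
}

// newFetcher picks a fetcher by type name. The accepted names are assumed.
func newFetcher(kind string) (fetcher, error) {
	switch strings.ToLower(kind) {
	case "base":
		return stubFetcher{name: "base"}, nil
	case "chrome":
		return stubFetcher{name: "chrome"}, nil
	default:
		return nil, fmt.Errorf("unknown fetcher type %q", kind)
	}
}

// fetch selects a fetcher for the request and returns the body as a string.
func fetch(req request) (string, error) {
	f, err := newFetcher(req.Type)
	if err != nil {
		return "", err
	}
	body, err := f.Fetch(req)
	if err != nil {
		return "", err
	}
	defer body.Close()
	b, err := io.ReadAll(body)
	return string(b), err
}

func main() {
	out, err := fetch(request{Type: "chrome", URL: "http://example.com"})
	fmt.Println(out, err)
}
```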
type Fetcher ¶
type Fetcher interface {
// Fetch is called to retrieve HTML content of a document from the remote server.
Fetch(request Request) (io.ReadCloser, error)
// contains filtered or unexported methods
}
Fetcher is the interface that must be satisfied by things that can fetch remote URLs and return their contents.
Note: Fetchers may or may not be safe to use concurrently. Please read the documentation for each fetcher for more details.
type HTMLServer ¶
type HTMLServer struct {
// contains filtered or unexported fields
}
HTMLServer represents the web service that serves up HTML.
type Request ¶
type Request struct {
Type string `json:"type"`
// URL to be retrieved
URL string `json:"url"`
// HTTP method: GET, POST
Method string
// FormData is a string value for passing formdata parameters.
//
// For example it may be used for processing pages which require authentication
//
// Example:
//
// "auth_key=880ea6a14ea49e853634fbdc5015a024&referer=http%3A%2F%2Fexample.com%2F&ips_username=user&ips_password=userpassword&rememberMe=1"
//
FormData string `json:"formData,omitempty"`
// UserToken identifies the user so that personal cookie information can be kept.
UserToken string `json:"userToken"`
// InfiniteScroll is used for fetching web pages with continuous scrolling.
InfiniteScroll bool `json:"infiniteScroll"`
}
Request contains the request information sent to Fetchers.
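To illustrate how the struct tags above shape the JSON payload, the sketch below marshals a local copy of the struct. The local type name request, the encode helper, and the value "chrome" for Type are assumptions for illustration. The field names and tags are copied from the definition above.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// request copies the field names and struct tags of the Request type.
type request struct {
	Type           string `json:"type"`
	URL            string `json:"url"`
	Method         string // no tag, so the JSON key is "Method"
	FormData       string `json:"formData,omitempty"`
	UserToken      string `json:"userToken"`
	InfiniteScroll bool   `json:"infiniteScroll"`
}

// encode returns the JSON form of r as a string.
func encode(r request) (string, error) {
	b, err := json.Marshal(r)
	return string(b), err
}

func main() {
	out, err := encode(request{Type: "chrome", URL: "http://example.com", Method: "GET"})
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(out)
	// prints {"type":"chrome","url":"http://example.com","Method":"GET","userToken":"","infiniteScroll":false}
}
```

Two details follow from the tags: an empty FormData is omitted because of omitempty, and Method is encoded under the capitalized key "Method" because it has no json tag.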
type Service ¶
type Service interface {
Fetch(req Request) (io.ReadCloser, error)
}
Service defines the Fetch service interface.
func NewHTTPClient ¶
NewHTTPClient returns a Fetch Service backed by an HTTP server living at the remote instance. We expect instance to come from a service discovery system, so it is likely of the form "host:port". We bake in certain middlewares, implementing the client library pattern.
type ServiceMiddleware ¶
ServiceMiddleware defines a middleware for a Fetch service.
func LoggingMiddleware ¶
func LoggingMiddleware(logger *logrus.Logger) ServiceMiddleware
LoggingMiddleware logs calls to the Service endpoints.
func RobotsTxtMiddleware ¶
func RobotsTxtMiddleware() ServiceMiddleware
RobotsTxtMiddleware checks whether scraping of the specified resource is allowed by robots.txt.