fetch

package
v0.0.0-...-7d74a43 Latest
Warning

This package is not in the latest version of its module.

Published: Aug 6, 2018 License: BSD-3-Clause Imports: 33 Imported by: 0

Documentation

Overview

Package fetch of the Dataflow kit is used by the fetch.d service, which downloads HTML content from web pages to feed Dataflow kit scrapers.

Fetcher is the interface that must be satisfied by things that can fetch remote URLs and return their contents.

Currently two types of fetcher are available: Chrome Fetcher and Base Fetcher.

Base Fetcher downloads HTML web pages using the Go standard library's http package.

Chrome Fetcher connects to Headless Chrome, which renders JavaScript pages.

RobotsTxtMiddleware checks whether scraping of the specified resource is allowed by robots.txt.

Constants

This section is empty.

Variables

This section is empty.

Functions

func AllowedByRobots

func AllowedByRobots(rawurl string, robotsData *robotstxt.RobotsData) bool

AllowedByRobots checks whether scraping of the specified URL is allowed by robots.txt.

func AssembleRobotstxtURL

func AssembleRobotstxtURL(rawurl string) (string, error)

AssembleRobotstxtURL assembles the robots.txt URL from the given URL.
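
The idea of deriving a robots.txt location from a page URL can be sketched with the standard library alone. The helper name robotsTxtURL is hypothetical, and the assumption that robots.txt sits at the root of the same scheme and host is ours; this is not the package's implementation.

```go
package main

import (
	"fmt"
	"net/url"
)

// robotsTxtURL is a hypothetical helper, not part of package fetch.
// It keeps the scheme and host of rawurl and replaces everything else
// with the path /robots.txt.
func robotsTxtURL(rawurl string) (string, error) {
	u, err := url.Parse(rawurl)
	if err != nil {
		return "", err
	}
	// A robots.txt location only makes sense for an absolute URL.
	if u.Scheme == "" || u.Host == "" {
		return "", fmt.Errorf("%q is not an absolute URL", rawurl)
	}
	robots := url.URL{Scheme: u.Scheme, Host: u.Host, Path: "/robots.txt"}
	return robots.String(), nil
}

func main() {
	s, err := robotsTxtURL("https://example.com/catalog/items?page=2")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(s) // https://example.com/robots.txt
}
```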

func RobotstxtData

func RobotstxtData(url string) (robotsData *robotstxt.RobotsData, err error)

RobotstxtData generates the robots.txt URL and retrieves its content through the API fetch endpoint.

Types

type BaseFetcher

type BaseFetcher struct {
	// contains filtered or unexported fields
}

BaseFetcher is a Fetcher that uses the Go standard library's http client to fetch URLs.

func (*BaseFetcher) Fetch

func (bf *BaseFetcher) Fetch(request Request) (io.ReadCloser, error)

Fetch retrieves a document from the remote server. It returns web page content along with cache and expiration information.

type ChromeFetcher

type ChromeFetcher struct {
	// contains filtered or unexported fields
}

ChromeFetcher is used to fetch JavaScript-rendered pages.

func (*ChromeFetcher) Fetch

func (f *ChromeFetcher) Fetch(request Request) (io.ReadCloser, error)

Fetch retrieves a document from the remote server. It returns web page content along with cache and expiration information.

type Config

type Config struct {
	Host string
}

Config provides basic configuration.

type FetchService

type FetchService struct {
}

FetchService implements the Service interface with an empty struct.

func (FetchService) Fetch

func (fs FetchService) Fetch(req Request) (io.ReadCloser, error)

Fetch fetches content from a web page using either the Base or the Chrome fetcher.

type Fetcher

type Fetcher interface {
	//  Fetch is called to retrieve HTML content of a document from the remote server.
	Fetch(request Request) (io.ReadCloser, error)
	// contains filtered or unexported methods
}

Fetcher is the interface that must be satisfied by things that can fetch remote URLs and return their contents.

Note: Fetchers may or may not be safe to use concurrently. Please read the documentation for each fetcher for more details.

type HTMLServer

type HTMLServer struct {
	// contains filtered or unexported fields
}

HTMLServer represents the web service that serves up HTML

func Start

func Start(cfg Config) *HTMLServer

Start launches the Parsing service.

func (*HTMLServer) Stop

func (htmlServer *HTMLServer) Stop() error

Stop turns off the HTML Server.

type Request

type Request struct {
	Type string `json:"type"`
	// URL to be retrieved.
	URL string `json:"url"`
	// HTTP method: GET, POST.
	Method string
	// FormData is a string value for passing form data parameters.
	//
	// For example, it may be used for processing pages which require authentication.
	//
	// Example:
	//
	// "auth_key=880ea6a14ea49e853634fbdc5015a024&referer=http%3A%2F%2Fexample.com%2F&ips_username=user&ips_password=userpassword&rememberMe=1"
	//
	FormData string `json:"formData,omitempty"`
	// UserToken identifies the user in order to keep personal cookie information.
	UserToken string `json:"userToken"`
	// InfiniteScroll is used for fetching web pages with continuous scrolling.
	InfiniteScroll bool `json:"infiniteScroll"`
}

Request contains the request information sent to Fetchers.
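
To illustrate how the struct tags above shape the JSON form of a Request, here is a minimal sketch. The struct is mirrored locally so the example compiles on its own, the helper encode is hypothetical, and using "Base" as the Type value is an assumption taken from the Type constants, not a documented requirement.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Request mirrors the documented struct so this example is self-contained.
type Request struct {
	Type           string `json:"type"`
	URL            string `json:"url"`
	Method         string
	FormData       string `json:"formData,omitempty"`
	UserToken      string `json:"userToken"`
	InfiniteScroll bool   `json:"infiniteScroll"`
}

// encode is a hypothetical helper that returns the JSON form of a Request.
func encode(r Request) (string, error) {
	b, err := json.Marshal(r)
	return string(b), err
}

func main() {
	// Assumption: "Base" is an accepted value for the Type field.
	out, err := encode(Request{Type: "Base", URL: "https://example.com", Method: "GET"})
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(out)
}
```

Because Method carries no json tag, it is encoded under its Go field name, and an empty FormData is dropped by omitempty.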

func (Request) Host

func (req Request) Host() (string, error)

Host returns the host value from the Request.

type Service

type Service interface {
	Fetch(req Request) (io.ReadCloser, error)
}

Service defines the Fetch service interface.
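
Since Service has a single exported method, any type with a matching Fetch method satisfies it. The sketch below mirrors a trimmed-down Request and the Service interface locally; staticService and the body helper are hypothetical names used only for illustration, for example as a test double.

```go
package main

import (
	"fmt"
	"io"
	"strings"
)

// Request is a trimmed local mirror of the documented struct.
type Request struct {
	URL string
}

// Service mirrors the documented interface.
type Service interface {
	Fetch(req Request) (io.ReadCloser, error)
}

// staticService is a hypothetical Service that serves canned HTML per URL.
type staticService struct {
	pages map[string]string
}

func (s staticService) Fetch(req Request) (io.ReadCloser, error) {
	html, ok := s.pages[req.URL]
	if !ok {
		return nil, fmt.Errorf("no page for %s", req.URL)
	}
	return io.NopCloser(strings.NewReader(html)), nil
}

// body fetches req through svc, reads the whole response and closes it.
func body(svc Service, req Request) (string, error) {
	rc, err := svc.Fetch(req)
	if err != nil {
		return "", err
	}
	defer rc.Close()
	b, err := io.ReadAll(rc)
	return string(b), err
}

func main() {
	svc := staticService{pages: map[string]string{"https://example.com": "<h1>hi</h1>"}}
	html, err := body(svc, Request{URL: "https://example.com"})
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(html)
}
```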

func NewHTTPClient

func NewHTTPClient(instance string) (Service, error)

NewHTTPClient returns a Fetch Service backed by an HTTP server living at the remote instance. We expect instance to come from a service discovery system, so it is likely of the form "host:port". We bake in certain middlewares, implementing the client library pattern.
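
A client that receives a bare "host:port" instance has to turn it into a base URL before issuing requests. The following is a sketch of one way to do that; instanceURL is a hypothetical helper, and defaulting to plain HTTP when no scheme is present is an assumption, not documented behaviour of NewHTTPClient.

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// instanceURL is a hypothetical helper: it turns a "host:port" instance
// into a base URL, assuming plain HTTP when no scheme is given.
func instanceURL(instance string) (string, error) {
	if !strings.HasPrefix(instance, "http://") && !strings.HasPrefix(instance, "https://") {
		instance = "http://" + instance
	}
	u, err := url.Parse(instance)
	if err != nil {
		return "", err
	}
	if u.Host == "" {
		return "", fmt.Errorf("instance %q has no host", instance)
	}
	return u.String(), nil
}

func main() {
	base, err := instanceURL("127.0.0.1:8000")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(base) // http://127.0.0.1:8000
}
```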

type ServiceMiddleware

type ServiceMiddleware func(Service) Service

ServiceMiddleware defines a middleware for a Fetch service.
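
A func(Service) Service middleware wraps one Service in another, so the middleware applied last runs first. The sketch below shows that ordering with locally mirrored types; echoService, recorder, recordingMiddleware and callOrder are hypothetical names, not part of the package.

```go
package main

import (
	"fmt"
	"io"
	"strings"
)

// Local mirrors of the documented types, trimmed for the example.
type Request struct{ URL string }

type Service interface {
	Fetch(req Request) (io.ReadCloser, error)
}

type ServiceMiddleware func(Service) Service

// echoService is a hypothetical innermost Service that returns the URL as body.
type echoService struct{}

func (echoService) Fetch(req Request) (io.ReadCloser, error) {
	return io.NopCloser(strings.NewReader(req.URL)), nil
}

// recorder appends its label to a shared trace, then delegates to next.
type recorder struct {
	label string
	trace *[]string
	next  Service
}

func (r recorder) Fetch(req Request) (io.ReadCloser, error) {
	*r.trace = append(*r.trace, r.label)
	return r.next.Fetch(req)
}

// recordingMiddleware builds a middleware that records each call under label.
func recordingMiddleware(label string, trace *[]string) ServiceMiddleware {
	return func(next Service) Service {
		return recorder{label: label, trace: trace, next: next}
	}
}

// callOrder wraps echoService twice and reports the order in which the
// wrappers saw a single request.
func callOrder() (string, error) {
	var trace []string
	var svc Service = echoService{}
	svc = recordingMiddleware("inner", &trace)(svc)
	svc = recordingMiddleware("outer", &trace)(svc)
	rc, err := svc.Fetch(Request{URL: "https://example.com"})
	if err != nil {
		return "", err
	}
	rc.Close()
	return strings.Join(trace, ","), nil
}

func main() {
	order, err := callOrder()
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(order) // outer,inner
}
```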

func LoggingMiddleware

func LoggingMiddleware(logger *logrus.Logger) ServiceMiddleware

LoggingMiddleware logs Service endpoints.

func RobotsTxtMiddleware

func RobotsTxtMiddleware() ServiceMiddleware

RobotsTxtMiddleware checks whether scraping of the specified resource is allowed by robots.txt.
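
The general shape of a middleware that refuses disallowed URLs before they reach the wrapped Service might look like the sketch below. It is a deliberately simplified stand-in: it matches URL paths against a fixed list of disallowed prefixes instead of parsing a real robots.txt file, and robotsGuardMiddleware, robotsGuard, pageService and fetchOK are hypothetical names. The real RobotsTxtMiddleware takes no arguments and its internals are not shown in this documentation.

```go
package main

import (
	"errors"
	"fmt"
	"io"
	"net/url"
	"strings"
)

// Local mirrors of the documented types, trimmed for the example.
type Request struct{ URL string }

type Service interface {
	Fetch(req Request) (io.ReadCloser, error)
}

type ServiceMiddleware func(Service) Service

var errDisallowed = errors.New("disallowed by robots.txt")

// pageService is a hypothetical innermost Service returning a fixed body.
type pageService struct{}

func (pageService) Fetch(Request) (io.ReadCloser, error) {
	return io.NopCloser(strings.NewReader("<html></html>")), nil
}

// robotsGuard rejects requests whose URL path starts with a disallowed prefix.
// Simplification: real robots.txt handling has user agents, Allow rules, etc.
type robotsGuard struct {
	disallow []string
	next     Service
}

func (g robotsGuard) Fetch(req Request) (io.ReadCloser, error) {
	u, err := url.Parse(req.URL)
	if err != nil {
		return nil, err
	}
	for _, prefix := range g.disallow {
		if strings.HasPrefix(u.Path, prefix) {
			return nil, errDisallowed
		}
	}
	return g.next.Fetch(req)
}

// robotsGuardMiddleware builds the guard around any Service.
func robotsGuardMiddleware(disallow ...string) ServiceMiddleware {
	return func(next Service) Service {
		return robotsGuard{disallow: disallow, next: next}
	}
}

// fetchOK reports whether svc serves rawurl without an error.
func fetchOK(svc Service, rawurl string) bool {
	rc, err := svc.Fetch(Request{URL: rawurl})
	if err != nil {
		return false
	}
	rc.Close()
	return true
}

func main() {
	svc := robotsGuardMiddleware("/private/")(pageService{})
	fmt.Println(fetchOK(svc, "https://example.com/index.html")) // true
	fmt.Println(fetchOK(svc, "https://example.com/private/a"))  // false
}
```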

type Type

type Type string

Type represents the types of fetcher.

const (
	// Base fetcher is used for downloading HTML web pages using the Go standard library's http package.
	Base Type = "Base"
	// Headless Chrome is used to download content from JS-driven web pages.
	Chrome = "Chrome"
)

Fetcher types
