Version: v0.0.0-...-d33463d

Published: Jun 12, 2020 License: BSD-3-Clause Imports: 10 Imported by: 0



The Fetcher service of Dataflow kit downloads HTML content from web pages to feed Dataflow kit scrapers.

Two fetcher types are currently available: Headless Chrome Fetcher and Base Fetcher.

Base Fetcher downloads HTML pages with Go's standard HTTP library.

Chrome Fetcher connects to Headless Chrome, which processes JavaScript-driven pages and returns the rendered content.

Accessing Fetcher endpoints


Fetch a web page using Chrome Fetcher:
curl -XPOST  localhost:8000/fetch -d '{"type":"chrome", "url":"http://example.com","formData":"auth_key=880ea6a14ea49e853634fbdc5015a024&referer=http%3A%2F%2Fexample.com%2F&ips_username=user&ips_password=userpassword&rememberMe=1"}'

Set type to either "chrome" or "base". formData is a string of URL-encoded form parameters. It can be used, for example, to fetch pages that require authentication:
"auth_key=880ea6a14ea49e853634fbdc5015a024&referer=http%3A%2F%2Fexample.com%2F&ips_username=user&ips_password=userpassword&rememberMe=1"
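
The formData string uses standard URL encoding, so it does not have to be assembled by hand. The following is a minimal sketch, assuming the field names from the example above; buildFormData is a hypothetical helper, not part of the Dataflow kit API.

```go
package main

import (
	"fmt"
	"net/url"
)

// buildFormData is a hypothetical helper: it URL-encodes form fields
// into a single string suitable for the "formData" request parameter.
func buildFormData(fields map[string]string) string {
	values := url.Values{}
	for key, value := range fields {
		values.Set(key, value)
	}
	// Encode escapes reserved characters and sorts the pairs by key.
	return values.Encode()
}

func main() {
	formData := buildFormData(map[string]string{
		"referer":      "http://example.com/",
		"ips_username": "user",
		"ips_password": "userpassword",
		"rememberMe":   "1",
	})
	fmt.Println(formData)
}
```

Because url.Values.Encode sorts by key, the parameter order may differ from the hand-written string above; for ordinary form submissions the order usually does not matter.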

Fetch a web page using Base Fetcher. The type parameter may be omitted for Base Fetcher:
curl -XPOST  localhost:8000/fetch -d '{"url":"http://example.com"}'
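
The same requests can be issued from Go. This is a sketch under stated assumptions: the fetchRequest struct and the encodeRequest and fetchPage helpers are hypothetical names that simply mirror the JSON fields in the curl examples, and the application/json content type is an assumption (curl -d sends a form content type by default, and the source does not say whether the service inspects it).

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"io"
	"net/http"
)

// fetchRequest is a hypothetical type mirroring the JSON body
// used in the curl examples. Empty optional fields are omitted.
type fetchRequest struct {
	Type     string `json:"type,omitempty"`
	URL      string `json:"url"`
	FormData string `json:"formData,omitempty"`
}

// encodeRequest serializes the request to JSON.
func encodeRequest(req fetchRequest) ([]byte, error) {
	return json.Marshal(req)
}

// fetchPage posts the request to a Fetcher endpoint and returns
// the raw response body. The content type is an assumption.
func fetchPage(endpoint string, req fetchRequest) ([]byte, error) {
	body, err := encodeRequest(req)
	if err != nil {
		return nil, err
	}
	resp, err := http.Post(endpoint, "application/json", bytes.NewReader(body))
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return nil, fmt.Errorf("fetch failed: %s", resp.Status)
	}
	return io.ReadAll(resp.Body)
}

func main() {
	// Print the payloads only; no network call is made here.
	for _, req := range []fetchRequest{
		{URL: "http://example.com"},
		{Type: "chrome", URL: "http://example.com"},
	} {
		body, err := encodeRequest(req)
		if err != nil {
			fmt.Println("encode error:", err)
			continue
		}
		fmt.Println(string(body))
	}
}
```

With a Fetcher service listening on localhost:8000 as in the curl examples, a call would look like fetchPage("http://localhost:8000/fetch", fetchRequest{URL: "http://example.com"}).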

Flags and configuration settings

General settings

DFK_FETCH: HTTP listen address of Fetch service (defaults to "")
CHROME: Headless Chrome address. It is used for fetching JS driven web pages (defaults to
PROXY: Proxy address in the form http://username:password@proxy-host:port (defaults to "")

Storage settings

STORAGE_TYPE: Storage type, either Diskv or Cassandra (defaults to "Diskv").
Storage holds auxiliary information generated by the fetcher.
DISKV_BASE_DIR: Base directory for the Diskv storage type (defaults to "diskv").
More information about Diskv storage is available at https://github.com/peterbourgon/diskv
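
The settings listed above can be thought of as a small set of named values with defaults. The sketch below assumes they are supplied as environment variables of the same names, which the source does not confirm; the settings struct and loadSettings function are hypothetical and are not the service's own configuration code. The CHROME default is not shown in the source, so it is left empty here.

```go
package main

import (
	"fmt"
	"os"
)

// settings is a hypothetical container for the options listed above.
type settings struct {
	FetchAddr    string // DFK_FETCH
	ChromeAddr   string // CHROME
	Proxy        string // PROXY
	StorageType  string // STORAGE_TYPE
	DiskvBaseDir string // DISKV_BASE_DIR
}

// valueOr returns the looked-up value for key, or def when the key
// is unset or empty.
func valueOr(lookup func(string) (string, bool), key, def string) string {
	if v, ok := lookup(key); ok && v != "" {
		return v
	}
	return def
}

// loadSettings applies the documented defaults. The lookup function
// is injectable so the logic can be exercised without touching the
// real environment.
func loadSettings(lookup func(string) (string, bool)) settings {
	return settings{
		FetchAddr:    valueOr(lookup, "DFK_FETCH", ""),
		ChromeAddr:   valueOr(lookup, "CHROME", ""), // default not shown in the source
		Proxy:        valueOr(lookup, "PROXY", ""),
		StorageType:  valueOr(lookup, "STORAGE_TYPE", "Diskv"),
		DiskvBaseDir: valueOr(lookup, "DISKV_BASE_DIR", "diskv"),
	}
}

func main() {
	s := loadSettings(os.LookupEnv)
	fmt.Printf("%+v\n", s)
}
```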
