downloader

package
v0.0.0-...-f065d94 Latest

This package is not in the latest version of its module.

Published: Jun 21, 2018 License: MIT Imports: 18 Imported by: 0

Documentation


Constants

const (
	// AudistoAPIDomain the domain name endpoint for Audisto API
	AudistoAPIDomain = "api.audisto.com"

	// AudistoAPIEndpoint URL endpoint for Audisto API; use "" or "/" if the endpoint is at the root domain
	AudistoAPIEndpoint = "/crawls/"

	// AudistoAPIVersion the version of Audisto API this downloader will talk to
	AudistoAPIVersion = "2.0"

	// EndpointSchema http or https; this probably won't change, hence it is set here
	EndpointSchema = "https"

	// DefaultRequestMethod used when http request method is not explicitly set
	DefaultRequestMethod = "GET"

	// DefaultOutputFormat the default format (file extension) for the response we get from Audisto API if not explicitly set
	DefaultOutputFormat = "tsv"

	// DefaultChunkSize the default chunk size for interacting with Audisto API if NOT explicitly set
	// This should not affect the way throttling works
	DefaultChunkSize = 10000

	// ContentType type of the http request to send using the http client
	ContentType = "application/x-www-form-urlencoded"

	// AcceptEncoding Content encoding for the http request
	AcceptEncoding = "gzip, deflate"

	// ConnectionType is the value of the "Connection" http header to be sent using the http client
	ConnectionType = "Keep-Alive"
)
const (
	// SMOOTHINGFACTOR -
	SMOOTHINGFACTOR = 0.005

	// SelfTargetSuffix used when --targets=self, the output filename will be appended this suffix
	SelfTargetSuffix = "_links"
)
const (
	// DebugEnvKey debug environment variable
	DebugEnvKey = "DD_DEBUG"
)
const (
	// ETAFactor ETA milliseconds estimation factor
	ETAFactor = 175
)

Variables

var (

	// RefreshInterval time between two progress updates
	// Exported so the caller can fine-tune this
	RefreshInterval = time.Millisecond * 100
)

progress bar elements

var StatusCodesErrors = map[int]string{
	401: "Wrong credentials",
	403: "Access denied. Wrong credentials?",
	404: "Not found. Correct crawl ID?",
	429: "Error while getting total number of elements: 429, multiple requests",
	504: "Error while getting total number of elements: 504, server timeout",
}

StatusCodesErrors maps HTTP status codes returned by the Audisto API to human-readable error messages.
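A lookup against this map can be sketched as follows. The helper `describeStatus` and its fallback message are hypothetical and not part of the package; the map is copied locally so the example compiles on its own.

```go
package main

import "fmt"

// statusCodesErrors is a local copy of the StatusCodesErrors map documented above.
var statusCodesErrors = map[int]string{
	401: "Wrong credentials",
	403: "Access denied. Wrong credentials?",
	404: "Not found. Correct crawl ID?",
	429: "Error while getting total number of elements: 429, multiple requests",
	504: "Error while getting total number of elements: 504, server timeout",
}

// describeStatus is a hypothetical helper: it returns the mapped message,
// or a generic fallback for status codes that are not in the map.
func describeStatus(code int) string {
	if msg, ok := statusCodesErrors[code]; ok {
		return msg
	}
	return fmt.Sprintf("unexpected HTTP status %d", code)
}

func main() {
	fmt.Println(describeStatus(404))
	fmt.Println(describeStatus(500))
}
```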

Functions

func DownloadCompleted

func DownloadCompleted(outputFilename, resumeFilename string) bool

DownloadCompleted is a helper function that checks whether the download for a given output filename has completed. A download is considered completed when the output file exists and its resume file does not. This is a best guess rather than a certainty, because without the resume file the meta-information about the download is no longer available.
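The rule described here (output file present, resume file absent) could be sketched like this. It is a minimal illustration and not the package's actual code; `fileExists`, `downloadCompleted` and the file names are assumptions.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// fileExists reports whether os.Stat succeeds for path.
func fileExists(path string) bool {
	_, err := os.Stat(path)
	return err == nil
}

// downloadCompleted mirrors the documented rule: the output file exists
// and its resume file does not.
func downloadCompleted(outputFilename, resumeFilename string) bool {
	return fileExists(outputFilename) && !fileExists(resumeFilename)
}

func main() {
	dir, err := os.MkdirTemp("", "downloader-example")
	if err != nil {
		fmt.Println("temp dir:", err)
		return
	}
	defer os.RemoveAll(dir)

	output := filepath.Join(dir, "crawl.tsv")
	resume := filepath.Join(dir, "crawl.tsv.resume") // hypothetical resume file name

	// Both files present: the download is still in progress.
	if err := os.WriteFile(output, []byte("data\n"), 0o644); err != nil {
		fmt.Println("write output:", err)
		return
	}
	if err := os.WriteFile(resume, []byte("{}"), 0o644); err != nil {
		fmt.Println("write resume:", err)
		return
	}
	fmt.Println("with resume file:", downloadCompleted(output, resume))

	// Resume file removed: the download is considered completed.
	if err := os.Remove(resume); err != nil {
		fmt.Println("remove resume:", err)
		return
	}
	fmt.Println("without resume file:", downloadCompleted(output, resume))
}
```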

func IsInDebugMode

func IsInDebugMode() bool

IsInDebugMode checks if the app is running in debug mode
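The documentation does not say which values of the DD_DEBUG variable enable debug mode. The sketch below assumes that any non-empty value does; `isInDebugMode` and its injected `getenv` parameter are hypothetical and only there to make the check easy to exercise.

```go
package main

import (
	"fmt"
	"os"
)

// debugEnvKey mirrors the DebugEnvKey constant documented above.
const debugEnvKey = "DD_DEBUG"

// isInDebugMode assumes that any non-empty value of the variable enables
// debug mode. The real package may apply a stricter rule.
func isInDebugMode(getenv func(string) string) bool {
	return getenv(debugEnvKey) != ""
}

func main() {
	fmt.Println("debug mode:", isInDebugMode(os.Getenv))
}
```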

Types

type AudistoAPIClient

type AudistoAPIClient struct {

	// request path / DSN
	BasePath string
	Username string
	Password string
	Mode     string
	CrawlID  uint64

	// request query params
	Deep        bool
	Filter      string
	Order       string
	Output      string
	ChunkNumber uint64
	ChunkSize   uint64
	// contains filtered or unexported fields
}

AudistoAPIClient a struct holding all information required to construct a URL with query params for Audisto API

func NewClient

func NewClient(username string, password string, crawl uint64, mode string,
	noDetails bool, chunknumber uint64, chunkSize uint64, filter string,
	order string) (*AudistoAPIClient, error)

NewClient creates a new Audisto API client and checks that it is valid.

func (*AudistoAPIClient) Do

func (api *AudistoAPIClient) Do(request *http.Request) (*http.Response, error)

Do executes an HTTP request, adding the Audisto API header values. It also validates the client's variables before executing the request, to avoid unnecessary HTTP round trips.
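Attaching the header values to a request might look like the sketch below. It assumes the three header constants from this package map to the Content-Type, Accept-Encoding and Connection headers; `newAudistoRequest` is a hypothetical helper, and no request is actually sent.

```go
package main

import (
	"fmt"
	"net/http"
)

// Local copies of the header-related constants documented above.
const (
	contentType    = "application/x-www-form-urlencoded"
	acceptEncoding = "gzip, deflate"
	connectionType = "Keep-Alive"
)

// newAudistoRequest is a hypothetical helper that builds a GET request
// and sets the header values before it would be handed to an http.Client.
func newAudistoRequest(rawURL string) (*http.Request, error) {
	req, err := http.NewRequest(http.MethodGet, rawURL, nil)
	if err != nil {
		return nil, err
	}
	req.Header.Set("Content-Type", contentType)
	req.Header.Set("Accept-Encoding", acceptEncoding)
	req.Header.Set("Connection", connectionType)
	return req, nil
}

func main() {
	req, err := newAudistoRequest("https://api.audisto.com/2.0/crawls/123456/links")
	if err != nil {
		fmt.Println("build request:", err)
		return
	}
	for _, name := range []string{"Content-Type", "Accept-Encoding", "Connection"} {
		fmt.Printf("%s: %s\n", name, req.Header.Get(name))
	}
}
```

One thing to keep in mind with this approach: when Accept-Encoding is set explicitly, Go's default transport no longer decompresses gzip responses transparently, so the caller has to decode the body itself.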

func (*AudistoAPIClient) FetchRawChunk

func (api *AudistoAPIClient) FetchRawChunk(forTheFirstRequest bool) ([]byte, int, error)

FetchRawChunk makes an http request to the server for a given chunk

func (*AudistoAPIClient) FetchTotalElements

func (api *AudistoAPIClient) FetchTotalElements() ([]byte, int, error)

FetchTotalElements sets up the request for the first chunk in json, containing the total number of elements.

func (*AudistoAPIClient) GetAPIEndpoint

func (api *AudistoAPIClient) GetAPIEndpoint() string

GetAPIEndpoint constructs the Audisto API endpoint without the query params or the DSN part.

func (*AudistoAPIClient) GetBaseURL

func (api *AudistoAPIClient) GetBaseURL() string

GetBaseURL constructs the base URL for querying the Audisto API, in the form: username:password@api.audisto.com

func (*AudistoAPIClient) GetFullQueryURL

func (api *AudistoAPIClient) GetFullQueryURL(forTheFirstRequest bool) string

GetFullQueryURL returns the full url for interacting with Audisto API, INCLUDING query params

func (*AudistoAPIClient) GetQueryParams

func (api *AudistoAPIClient) GetQueryParams(forTheFirstRequest bool) url.Values

GetQueryParams uses the net/url package to construct the query params. If forTheFirstRequest is true, chunk_size and deep are set to 0 and the output is forced to be JSON. This is used to request the first chunk as JSON and read the total number of elements from it.
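The first-request behaviour can be sketched with url.Values. Only chunk_size and deep are named in the description above; the other parameter names (chunk, output, filter, order), the `queryOptions` type and the choice to send filter and order on the first request are assumptions made for illustration.

```go
package main

import (
	"fmt"
	"net/url"
	"strconv"
)

// queryOptions is a hypothetical subset of the AudistoAPIClient fields.
type queryOptions struct {
	Deep        bool
	Filter      string
	Order       string
	Output      string
	ChunkNumber uint64
	ChunkSize   uint64
}

// values builds the query params. Parameter names other than chunk_size
// and deep are assumed, not taken from the package.
func (o queryOptions) values(forTheFirstRequest bool) url.Values {
	v := url.Values{}
	if o.Filter != "" {
		v.Set("filter", o.Filter)
	}
	if o.Order != "" {
		v.Set("order", o.Order)
	}
	if forTheFirstRequest {
		// Ask for an empty JSON chunk, only to learn the total number of elements.
		v.Set("chunk_size", "0")
		v.Set("deep", "0")
		v.Set("output", "json")
		return v
	}
	deep := "0"
	if o.Deep {
		deep = "1"
	}
	v.Set("chunk", strconv.FormatUint(o.ChunkNumber, 10))
	v.Set("chunk_size", strconv.FormatUint(o.ChunkSize, 10))
	v.Set("deep", deep)
	v.Set("output", o.Output)
	return v
}

func main() {
	opts := queryOptions{Deep: true, Output: "tsv", ChunkNumber: 2, ChunkSize: 10000}
	fmt.Println(opts.values(true).Encode())
	fmt.Println(opts.values(false).Encode())
}
```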

func (*AudistoAPIClient) GetRelativePath

func (api *AudistoAPIClient) GetRelativePath() string

GetRelativePath returns the path relative to the API domain name, e.g. /2.0/crawls/123456/links
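Judging from the example path and the AudistoAPIVersion and AudistoAPIEndpoint constants, the relative path is presumably assembled from the version, the endpoint, the crawl ID and the mode. A minimal sketch under that assumption (`relativePath` is a hypothetical name):

```go
package main

import "fmt"

// Local copies of the constants documented above.
const (
	audistoAPIEndpoint = "/crawls/"
	audistoAPIVersion  = "2.0"
)

// relativePath joins version, endpoint, crawl ID and mode. This layout is
// inferred from the example path in the documentation.
func relativePath(crawlID uint64, mode string) string {
	return fmt.Sprintf("/%s%s%d/%s", audistoAPIVersion, audistoAPIEndpoint, crawlID, mode)
}

func main() {
	fmt.Println(relativePath(123456, "links"))
}
```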

func (*AudistoAPIClient) GetRequestMethod

func (api *AudistoAPIClient) GetRequestMethod() string

GetRequestMethod returns the HTTP request method, GET (by default)

func (*AudistoAPIClient) GetRequestURL

func (api *AudistoAPIClient) GetRequestURL() (*url.URL, error)

GetRequestURL returns a validated instance of url.URL, and an error if the validation fails
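One way to obtain a validated url.URL is to assemble the parts and round-trip the result through url.Parse. The sketch below assumes that approach; `requestURL` is a hypothetical function, and the credentials are placed in the userinfo part as the username:password@api.audisto.com form suggests.

```go
package main

import (
	"fmt"
	"net/url"
)

// requestURL assembles scheme, credentials, host, path and query, then
// re-parses the string so that malformed input surfaces as an error.
func requestURL(username, password, path string, query url.Values) (*url.URL, error) {
	u := url.URL{
		Scheme:   "https",
		User:     url.UserPassword(username, password),
		Host:     "api.audisto.com",
		Path:     path,
		RawQuery: query.Encode(),
	}
	return url.Parse(u.String())
}

func main() {
	query := url.Values{}
	query.Set("output", "tsv") // assumed parameter name
	u, err := requestURL("user", "secret", "/2.0/crawls/123456/links", query)
	if err != nil {
		fmt.Println("invalid URL:", err)
		return
	}
	fmt.Println(u.String())
}
```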

func (*AudistoAPIClient) GetTotalElements

func (api *AudistoAPIClient) GetTotalElements() (uint64, error)

GetTotalElements asks the server for the total number of elements.

func (*AudistoAPIClient) GetURLPath

func (api *AudistoAPIClient) GetURLPath() string

GetURLPath returns the full url for interacting with Audisto API, WITHOUT query params e.g. username:password@api.audisto.com/crawls/pages|links

func (*AudistoAPIClient) IsValid

func (api *AudistoAPIClient) IsValid() error

IsValid checks whether the struct fields look valid. It does not make any remote request.

func (*AudistoAPIClient) ResetChunkSize

func (api *AudistoAPIClient) ResetChunkSize()

func (*AudistoAPIClient) SetChunkSize

func (api *AudistoAPIClient) SetChunkSize(size uint64)

SetChunkSize sets AudistoAPIClient.ChunkSize to a new size.

func (*AudistoAPIClient) SetNextChunkNumber

func (api *AudistoAPIClient) SetNextChunkNumber(number uint64)

SetNextChunkNumber sets AudistoAPIClient.ChunkNumber to the next chunk number.
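How chunk numbers relate to the total number of elements is not spelled out here. As a rough sketch, assuming zero-based chunk numbers and a fixed chunk size, the number of chunks is a ceiling division and each chunk covers a contiguous range of elements; `totalChunks` is a hypothetical helper.

```go
package main

import "fmt"

// defaultChunkSize mirrors the DefaultChunkSize constant documented above.
const defaultChunkSize = 10000

// totalChunks returns how many chunks are needed to cover totalElements,
// using ceiling division. A zero chunk size yields zero chunks.
func totalChunks(totalElements, chunkSize uint64) uint64 {
	if chunkSize == 0 {
		return 0
	}
	return (totalElements + chunkSize - 1) / chunkSize
}

func main() {
	total := uint64(25000)
	// Zero-based chunk numbering is an assumption of this sketch.
	for chunk := uint64(0); chunk < totalChunks(total, defaultChunkSize); chunk++ {
		start := chunk * defaultChunkSize
		end := start + defaultChunkSize
		if end > total {
			end = total
		}
		fmt.Printf("chunk %d: elements %d to %d\n", chunk, start, end-1)
	}
}
```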

func (*AudistoAPIClient) SetRequestMethod

func (api *AudistoAPIClient) SetRequestMethod(method string) error

SetRequestMethod sets the HTTP request method for interacting with the Audisto API. Allowed methods: GET, POST, PATCH, DELETE.
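Restricting the method to the allowed set could be done with a simple lookup, as in the sketch below. `validateMethod` is hypothetical, and normalising the input to upper case is an assumption; the real method may reject lower-case input instead.

```go
package main

import (
	"fmt"
	"strings"
)

// allowedMethods lists the methods named in the documentation.
var allowedMethods = map[string]bool{
	"GET":    true,
	"POST":   true,
	"PATCH":  true,
	"DELETE": true,
}

// validateMethod normalises the method to upper case (an assumption of
// this sketch) and returns an error if it is not in the allowed set.
func validateMethod(method string) (string, error) {
	m := strings.ToUpper(strings.TrimSpace(method))
	if !allowedMethods[m] {
		return "", fmt.Errorf("unsupported request method %q", method)
	}
	return m, nil
}

func main() {
	for _, candidate := range []string{"get", "PATCH", "PUT"} {
		m, err := validateMethod(candidate)
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Println("ok:", m)
	}
}
```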

func (*AudistoAPIClient) SetTargetPageFilter

func (api *AudistoAPIClient) SetTargetPageFilter(pageID uint64)

type Downloader

type Downloader struct {
	OutputFilename            string        `json:"outputFilename"`
	TargetsFilename           string        `json:"targetsFilename"`
	DoneElements              uint64        `json:"doneElements"`
	TotalElements             uint64        `json:"totalElements"`
	NoDetails                 bool          `json:"noDetails"`
	TargetsFileMD5            string        `json:"targetsFileMD5"`
	TargetsFileNextID         int           `json:"targetsFileNextID"`
	CurrentTarget             currentTarget `json:"currentTarget"`
	PagesSelfTargetsCompleted bool          `json:"pagesSelfTargetsCompleted"`

	// Stop a switch to stop the current download
	Stop bool
	// contains filtered or unexported fields
}

Downloader initiates or resumes a persisted download using AudistoAPIClient. It also tracks and increments the chunk number, taking into account the total number of elements to download.

func New

func New(reportProgress chan<- StatusReport) *Downloader

New creates a new downloader

func (*Downloader) PersistConfig

func (d *Downloader) PersistConfig() error

PersistConfig saves the resume state to a file.
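Since the Downloader fields carry json tags, the resume state is presumably stored as JSON. The sketch below round-trips a subset of those fields through encoding/json and a temporary file; `resumeState`, the helper functions and the file name are hypothetical.

```go
package main

import (
	"encoding/json"
	"fmt"
	"os"
	"path/filepath"
)

// resumeState is a hypothetical subset of the Downloader fields, using the
// same json tags as the struct documented above.
type resumeState struct {
	OutputFilename string `json:"outputFilename"`
	DoneElements   uint64 `json:"doneElements"`
	TotalElements  uint64 `json:"totalElements"`
	NoDetails      bool   `json:"noDetails"`
}

// marshalState encodes the state as JSON.
func marshalState(s resumeState) ([]byte, error) {
	return json.Marshal(s)
}

// unmarshalState decodes a previously persisted state.
func unmarshalState(data []byte) (resumeState, error) {
	var s resumeState
	err := json.Unmarshal(data, &s)
	return s, err
}

func main() {
	state := resumeState{OutputFilename: "out.tsv", DoneElements: 10000, TotalElements: 50000}
	data, err := marshalState(state)
	if err != nil {
		fmt.Println("encode:", err)
		return
	}

	// The resume file name is an assumption made for this example.
	path := filepath.Join(os.TempDir(), "out.tsv.resume-example")
	if err := os.WriteFile(path, data, 0o600); err != nil {
		fmt.Println("write:", err)
		return
	}
	defer os.Remove(path)

	raw, err := os.ReadFile(path)
	if err != nil {
		fmt.Println("read:", err)
		return
	}
	restored, err := unmarshalState(raw)
	if err != nil {
		fmt.Println("decode:", err)
		return
	}
	fmt.Printf("resume at element %d of %d\n", restored.DoneElements, restored.TotalElements)
}
```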

func (*Downloader) ProgressReport

func (d *Downloader) ProgressReport() StatusReport

ProgressReport returns the downloader's current status.

func (*Downloader) Setup

func (d *Downloader) Setup(username string, password string, crawl uint64, mode string,
	noDetails bool, chunknumber uint64, chunkSize uint64, output string,
	filter string, noResume bool, order string, targets string) error

Setup assigns the parameters and executes the Run() function.

func (*Downloader) Start

func (d *Downloader) Start() error

Start runs the overall download logic after the initialization and validation steps

type LogType

type LogType string

LogType is a string type with predefined log levels.

const (
	INFO    LogType = "INFO"
	WARNING LogType = "WARNING"
	DEBUG   LogType = "DEBUG"
)

type StatusReport

type StatusReport struct {
	ETA                         time.Duration
	ChunkSize                   uint64
	TotalElements, DoneElements uint64
	Mode                        string
	TimeoutsCount, ErrorsCount  int
	ProgressPercentage          float64
	OutputFilename              string
	Logs                        []map[LogType]string
	IsIngTargetMode             bool
	TotalIDsCount               int
	CurrentIDOrderNumber        int
}

StatusReport is a struct holding the progress status of the current download.

func (*StatusReport) IsDone

func (ps *StatusReport) IsDone() bool

IsDone is a helper function that reports whether the download is considered done.
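The exact condition behind IsDone is not documented. The sketch below assumes a download is done once DoneElements reaches TotalElements, and shows how a caller might consume reports from a channel like the one passed to New. The `statusReport` type, `newReport` and the simulated producer are all hypothetical.

```go
package main

import "fmt"

// statusReport is a hypothetical subset of the StatusReport fields.
type statusReport struct {
	TotalElements, DoneElements uint64
	ProgressPercentage          float64
}

// isDone assumes the download is done once every element has been fetched.
func (ps *statusReport) isDone() bool {
	return ps.DoneElements >= ps.TotalElements
}

// newReport fills in the percentage from the element counters.
func newReport(done, total uint64) statusReport {
	r := statusReport{TotalElements: total, DoneElements: done}
	if total > 0 {
		r.ProgressPercentage = float64(done) * 100 / float64(total)
	}
	return r
}

func main() {
	reports := make(chan statusReport)

	// Simulated downloader: reports progress after each chunk, then closes
	// the channel.
	go func() {
		defer close(reports)
		const total, chunk = 25000, 10000
		done := uint64(0)
		for done < total {
			done += chunk
			if done > total {
				done = total
			}
			reports <- newReport(done, total)
		}
	}()

	for r := range reports {
		fmt.Printf("%.0f%% (%d/%d)\n", r.ProgressPercentage, r.DoneElements, r.TotalElements)
		if r.isDone() {
			fmt.Println("download considered done")
		}
	}
}
```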
