crawler

package
v0.0.0-...-c5d5a31
Published: Nov 21, 2020 License: Apache-2.0 Imports: 20 Imported by: 1

Documentation

Overview

Package crawler is a distributed web crawler.

Index

Constants

This section is empty.

Variables

var RobotsPath, _ = url.Parse("/robots.txt")

RobotsPath is the path to robots.txt
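
A minimal sketch of resolving it against a page URL (the base URL is illustrative, and the variable is redeclared locally because the package's import path is not shown on this page):

package main

import (
	"fmt"
	"net/url"
)

// Mirrors the package variable above.
var RobotsPath, _ = url.Parse("/robots.txt")

func main() {
	base, _ := url.Parse("https://example.com/a/deep/page")
	// ResolveReference keeps the scheme and host and swaps in the path.
	fmt.Println(base.ResolveReference(RobotsPath)) // https://example.com/robots.txt
}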

Functions

This section is empty.

Types

type Backend

type Backend interface {
	Setup() error
	// CrawledAndCount returns when u was last crawled (zero time if never)
	// and how many documents domain already has.
	CrawledAndCount(u, domain string) (time.Time, int, error)
	Upsert(*document.Document) error
}

Backend outlines the methods needed to save documents and to count the documents a domain has
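
To make the contract concrete, here is a hypothetical in-memory implementation suitable for tests; document.Document is the project's document type, and its fields are not inspected here:

type memBackend struct {
	mu      sync.Mutex
	crawled map[string]time.Time // last-crawled time per URL
	counts  map[string]int       // document count per domain
}

func (m *memBackend) Setup() error {
	m.crawled = map[string]time.Time{}
	m.counts = map[string]int{}
	return nil
}

func (m *memBackend) CrawledAndCount(u, domain string) (time.Time, int, error) {
	m.mu.Lock()
	defer m.mu.Unlock()
	// The zero time signals that u has never been crawled.
	return m.crawled[u], m.counts[domain], nil
}

func (m *memBackend) Upsert(doc *document.Document) error {
	// A real backend would persist doc here; this stub only satisfies the interface.
	return nil
}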

type Crawler

type Crawler struct {
	HTTPClient *http.Client
	UserAgent

	Robots robots.Cacher
	Queue  queue.Queuer

	Backend
	ImageBackend
	// contains filtered or unexported fields
}

Crawler holds crawler settings for our UserAgent, Seed URLs, etc.

func New

func New(cfg config.Provider) *Crawler

New creates a Crawler from a config Provider

func (*Crawler) Close

func (c *Crawler) Close()

Close the crawler

func (*Crawler) Start

func (c *Crawler) Start(t time.Duration) error

Start the crawler
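
Taken together, New, Start, and Close suggest a run loop like the fragment below. Building the config.Provider is project-specific, and this sketch assumes Start blocks until the crawl finishes or fails; the duration is illustrative:

c := crawler.New(cfg) // cfg is your config.Provider
defer c.Close()

if err := c.Start(10 * time.Minute); err != nil {
	log.Fatal(err)
}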

type ElasticSearch

type ElasticSearch struct {
	*document.ElasticSearch
	Bulk *elastic.BulkProcessor
	sync.Mutex
}

ElasticSearch satisfies the crawler's Backend interface

func (*ElasticSearch) CrawledAndCount

func (e *ElasticSearch) CrawledAndCount(u, domain string) (time.Time, int, error)

CrawledAndCount returns the last-crawled date of the URL (if any) and the total number of links the domain has

func (*ElasticSearch) Upsert

func (e *ElasticSearch) Upsert(doc *document.Document) error

Upsert updates a document, or inserts it if it doesn't exist. NOTE: Elasticsearch has a 512-byte limit on an insert operation; Upsert does not have that limit.
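
Wiring this backend up means supplying a BulkProcessor from the olivere/elastic client. A sketch, leaving out the embedded *document.ElasticSearch since its construction is project-specific:

client, err := elastic.NewClient(elastic.SetURL("http://127.0.0.1:9200"))
if err != nil {
	log.Fatal(err)
}

bulk, err := client.BulkProcessor().
	Name("crawler").
	Workers(2).
	Do(context.Background())
if err != nil {
	log.Fatal(err)
}
defer bulk.Close()

e := &crawler.ElasticSearch{Bulk: bulk} // the embedded *document.ElasticSearch must also be set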

type ImageBackend

type ImageBackend interface {
	Setup() error
	Upsert(*img.Image) error
}

ImageBackend outlines methods to save image links
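
As with Backend, a no-op stub is enough for tests; img.Image is the project's image type:

type noopImages struct{}

func (noopImages) Setup() error            { return nil }
func (noopImages) Upsert(*img.Image) error { return nil } // discard image links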

type Stats

type Stats struct {
	sync.Mutex
	Start time.Time

	StatusCodes map[int]int64
	// contains filtered or unexported fields
}

Stats keeps track of time elapsed and status codes

func (*Stats) Elapsed

func (s *Stats) Elapsed() *Stats

Elapsed sets the total time the crawler has been running and returns the updated Stats

func (*Stats) String

func (s *Stats) String() string

String prints our stats in human-readable form

func (*Stats) Update

func (s *Stats) Update(code int)

Update our stats with a document's response status code
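
A short usage sketch, assuming Update increments the counter for the given status code:

s := &crawler.Stats{
	Start:       time.Now(),
	StatusCodes: map[int]int64{},
}

s.Update(200)
s.Update(200)
s.Update(404)

// Elapsed refreshes the running time before printing.
fmt.Println(s.Elapsed().String())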

type UserAgent

type UserAgent struct {
	Full  string
	Short string
}

UserAgent holds the full and short versions of the crawler's user agent
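
The values themselves are up to the operator; one common convention (hypothetical here) is a full string for the HTTP User-Agent header and a short token for matching rules in robots.txt:

ua := crawler.UserAgent{
	Full:  "Mozilla/5.0 (compatible; examplebot/1.0; +https://example.com/bot)",
	Short: "examplebot",
}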

Directories

Path	Synopsis
cmd	Command crawler demonstrates how to run the crawler
new
queue	Package queue manages the queue for a distributed crawler
robots	Package robots handles caching robots.txt files
