crawler

package
v0.0.0-...-c5d5a31
Published: Nov 21, 2020 License: Apache-2.0 Imports: 20 Imported by: 1

Documentation

Overview

Package crawler is a distributed web crawler.

Index

Constants

This section is empty.

Variables

var RobotsPath, _ = url.Parse("/robots.txt")

RobotsPath is the path to robots.txt
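
A minimal sketch of resolving it against a page URL (the base URL is illustrative, and the variable is redeclared locally because the package's import path is not shown on this page):

package main

import (
	"fmt"
	"net/url"
)

// Mirrors the package variable above.
var RobotsPath, _ = url.Parse("/robots.txt")

func main() {
	base, _ := url.Parse("https://example.com/a/deep/page")
	// ResolveReference keeps the scheme and host and swaps in the path.
	fmt.Println(base.ResolveReference(RobotsPath)) // https://example.com/robots.txt
}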

Functions

This section is empty.

Types

type Backend

type Backend interface {
	Setup() error
	// CrawledAndCount returns when u was last crawled (zero time if never)
	// and how many documents domain already has.
	CrawledAndCount(u, domain string) (time.Time, int, error)
	Upsert(*document.Document) error
}

Backend outlines the methods needed to save documents and to count the documents a domain has
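
To make the contract concrete, here is a hypothetical in-memory implementation suitable for tests; document.Document is the project's document type, and its fields are not inspected here:

type memBackend struct {
	mu      sync.Mutex
	crawled map[string]time.Time // last-crawled time per URL
	counts  map[string]int       // document count per domain
}

func (m *memBackend) Setup() error {
	m.crawled = map[string]time.Time{}
	m.counts = map[string]int{}
	return nil
}

func (m *memBackend) CrawledAndCount(u, domain string) (time.Time, int, error) {
	m.mu.Lock()
	defer m.mu.Unlock()
	// The zero time signals that u has never been crawled.
	return m.crawled[u], m.counts[domain], nil
}

func (m *memBackend) Upsert(doc *document.Document) error {
	// A real backend would persist doc here; this stub only satisfies the interface.
	return nil
}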

type Crawler

type Crawler struct {
	HTTPClient *http.Client
	UserAgent

	Robots robots.Cacher
	Queue  queue.Queuer

	Backend
	ImageBackend
	// contains filtered or unexported fields
}

Crawler holds crawler settings for our UserAgent, Seed URLs, etc.

func New

func New(cfg config.Provider) *Crawler

New creates a Crawler from a config Provider

func (*Crawler) Close

func (c *Crawler) Close()

Close the crawler

func (*Crawler) Start

func (c *Crawler) Start(t time.Duration) error

Start the crawler
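
Taken together, New, Start, and Close suggest a run loop like the fragment below. Building the config.Provider is project-specific, and this sketch assumes Start blocks until the crawl finishes or fails; the duration is illustrative:

c := crawler.New(cfg) // cfg is your config.Provider
defer c.Close()

if err := c.Start(10 * time.Minute); err != nil {
	log.Fatal(err)
}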

type ElasticSearch

type ElasticSearch struct {
	*document.ElasticSearch
	Bulk *elastic.BulkProcessor
	sync.Mutex
}

ElasticSearch satisfies the crawler's Backend interface

func (*ElasticSearch) CrawledAndCount

func (e *ElasticSearch) CrawledAndCount(u, domain string) (time.Time, int, error)

CrawledAndCount returns the last-crawled date of the URL (if any) and the total number of links the domain has

func (*ElasticSearch) Upsert

func (e *ElasticSearch) Upsert(doc *document.Document) error

Upsert updates a document, or inserts it if it doesn't exist. NOTE: Elasticsearch has a 512-byte limit on an insert operation; Upsert does not have that limit.
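
Wiring this backend up means supplying a BulkProcessor from the olivere/elastic client. A sketch, leaving out the embedded *document.ElasticSearch since its construction is project-specific:

client, err := elastic.NewClient(elastic.SetURL("http://127.0.0.1:9200"))
if err != nil {
	log.Fatal(err)
}

bulk, err := client.BulkProcessor().
	Name("crawler").
	Workers(2).
	Do(context.Background())
if err != nil {
	log.Fatal(err)
}
defer bulk.Close()

e := &crawler.ElasticSearch{Bulk: bulk} // the embedded *document.ElasticSearch must also be set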

type ImageBackend

type ImageBackend interface {
	Setup() error
	Upsert(*img.Image) error
}

ImageBackend outlines methods to save image links
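
As with Backend, a no-op stub is enough for tests; img.Image is the project's image type:

type noopImages struct{}

func (noopImages) Setup() error            { return nil }
func (noopImages) Upsert(*img.Image) error { return nil } // discard image links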

type Stats

type Stats struct {
	sync.Mutex
	Start time.Time

	StatusCodes map[int]int64
	// contains filtered or unexported fields
}

Stats keeps track of time elapsed and status codes

func (*Stats) Elapsed

func (s *Stats) Elapsed() *Stats

Elapsed sets the total time the crawler has been running and returns the updated Stats

func (*Stats) String

func (s *Stats) String() string

String prints our stats in human-readable form

func (*Stats) Update

func (s *Stats) Update(code int)

Update our stats with a document's response status code
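
A short usage sketch, assuming Update increments the counter for the given status code:

s := &crawler.Stats{
	Start:       time.Now(),
	StatusCodes: map[int]int64{},
}

s.Update(200)
s.Update(200)
s.Update(404)

// Elapsed refreshes the running time before printing.
fmt.Println(s.Elapsed().String())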

type UserAgent

type UserAgent struct {
	Full  string
	Short string
}

UserAgent holds the full and short versions of the crawler's user agent
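
The values themselves are up to the operator; one common convention (hypothetical here) is a full string for the HTTP User-Agent header and a short token for matching rules in robots.txt:

ua := crawler.UserAgent{
	Full:  "Mozilla/5.0 (compatible; examplebot/1.0; +https://example.com/bot)",
	Short: "examplebot",
}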

Directories

Path	Synopsis
cmd	Command crawler demonstrates how to run the crawler
new
queue	Package queue manages the queue for a distributed crawler
robots	Package robots handles caching robots.txt files
