crawler

package
v0.0.0-...-8b501b0 Latest
Published: Jul 6, 2023 License: MIT Imports: 16 Imported by: 0

Documentation

Constants

This section is empty.

Variables

This section is empty.

Functions

This section is empty.

Types

type Config

type Config struct {
	// A PrivateNetworkDetector instance
	PrivateNetworkDetector PrivateNetworkDetector

	// A URLGetter instance for fetching links.
	URLGetter URLGetter

	// A Graph instance for adding new links to the link graph.
	Graph Graph

	// An Indexer instance for indexing the content of each retrieved link.
	Indexer Indexer

	// The number of concurrent workers used for retrieving links.
	FetchWorkers int
}

Config encapsulates the configuration options for creating a new Crawler.

type Crawler

type Crawler struct {
	// contains filtered or unexported fields
}

Crawler implements a web-page crawling pipeline consisting of the following stages:

  • Given a URL, retrieve the web-page contents from the remote server.
  • Extract and resolve absolute and relative links from the retrieved page.
  • Extract page title and text content from the retrieved page.
  • Update the link graph: add new links and create edges between the crawled page and the links within it.
  • Index crawled page title and text content.

func NewCrawler

func NewCrawler(cfg Config) *Crawler

NewCrawler returns a new crawler instance.

func (*Crawler) Crawl

func (c *Crawler) Crawl(ctx context.Context, linkIt graph.LinkIterator) (int, error)

Crawl iterates linkIt and sends each link through the crawler pipeline, returning the total count of links that went through the pipeline. Calls to Crawl block until the link iterator is exhausted, an error occurs, or the context is cancelled.

type Graph

type Graph interface {
	// UpsertLink creates a new link or updates an existing link.
	UpsertLink(link *graph.Link) error

	// UpsertEdge creates a new edge or updates an existing edge.
	UpsertEdge(edge *graph.Edge) error

	// RemoveStaleEdges removes any edge that originates from the specified
	// link ID and was updated before the specified timestamp.
	RemoveStaleEdges(fromID uuid.UUID, updatedBefore time.Time) error
}

Graph is implemented by objects that can upsert links and edges into a link graph instance.

type Indexer

type Indexer interface {
	// Index inserts a new document into the index or updates the index entry
	// for an existing document.
	Index(doc *index.Document) error
}

Indexer is implemented by objects that can index the contents of web-pages retrieved by the crawler pipeline.

type PrivateNetworkDetector

type PrivateNetworkDetector interface {
	IsPrivate(host string) (bool, error)
}

PrivateNetworkDetector is implemented by objects that can detect whether a host resolves to a private network address.

type URLGetter

type URLGetter interface {
	Get(url string) (*http.Response, error)
}

URLGetter is implemented by objects that can perform HTTP GET requests.

Directories

Path Synopsis
Package mocks is a generated GoMock package.
