crawl

package
v0.0.0-...-e2c53ed
Published: Mar 26, 2024 · License: Apache-2.0 · Imports: 7 · Imported by: 1

Package cloudeng.io/file/crawl

import "cloudeng.io/file/crawl"

Documentation

Overview

Package crawl provides a framework for multilevel/recursive crawling of files. As files are downloaded, they may be processed by an outlinks extractor which yields more files to be crawled. Typically such a multilevel crawl is limited to a set number of iterations, referred to as the depth of the crawl. The interface to a crawler is channel-based to allow for concurrency. The outlink extractor is called for all downloaded files and should implement duplicate detection and removal.

Functions

func CrawledObjects

func CrawledObjects(crawled Crawled) (objs []content.Object[[]byte, download.Result])

CrawledObjects returns the downloaded objects as a slice of content.Objects using the download.AsObjects function.

Types

type Crawled

type Crawled struct {
	download.Downloaded
	Outlinks []download.Request
	Depth    int // The depth at which the document was crawled.
}

Crawled represents all of the downloaded content in response to a given crawl request.

type DownloaderFactory

type DownloaderFactory interface {
	New(ctx context.Context, depth int) (
		downloader download.T,
		input chan download.Request,
		output chan download.Downloaded)
}

DownloaderFactory is used to create a new downloader for each 'depth' in a multilevel crawl. The depth argument can be used to create different configurations of the downloader tailored to the depth of the crawl. For example, lower depths would use less concurrency in the downloader, since there are very likely fewer files to be downloaded than at higher depths (where more links will have been extracted).
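
A minimal sketch of such a factory follows, scaling downloader concurrency with depth. It is illustrative only: download.New and download.WithNumDownloaders are assumptions about the companion download package, and the scaling policy is hypothetical.

// scaledFactory implements DownloaderFactory, creating downloaders
// whose concurrency grows with crawl depth, on the assumption that
// deeper levels have more extracted links to fetch.
type scaledFactory struct{}

func (scaledFactory) New(ctx context.Context, depth int) (download.T, chan download.Request, chan download.Downloaded) {
	concurrency := 1 + 4*depth // hypothetical scaling policy
	// download.New and download.WithNumDownloaders are assumed here;
	// consult cloudeng.io/file/download for its actual constructor
	// and options.
	dl := download.New(download.WithNumDownloaders(concurrency))
	input := make(chan download.Request, concurrency)
	output := make(chan download.Downloaded, concurrency)
	return dl, input, output
}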

type Option

type Option func(o *options)

Option is used to configure the behaviour of a newly created crawler.

func WithCrawlDepth

func WithCrawlDepth(depth int) Option

WithCrawlDepth sets the depth of the crawl.

func WithNumExtractors

func WithNumExtractors(concurrency int) Option

WithNumExtractors sets the number of outlink extractors to run concurrently.
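
For example, to configure a crawler limited to three levels with four concurrent extractors (a usage sketch of the two documented options):

crawler := crawl.New(crawl.WithCrawlDepth(3), crawl.WithNumExtractors(4))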

type Outlinks

type Outlinks interface {
	// Note that the implementation of Extract is responsible for removing
	// duplicates from the set of extracted links returned.
	Extract(ctx context.Context, depth int, download download.Downloaded) []download.Request
}

Outlinks is the interface to an 'outlink' extractor, that is, an entity that determines additional items to be downloaded based on the contents of an already downloaded one.
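
The following is a minimal sketch of an Outlinks implementation with the required duplicate removal. The extractLinks helper is hypothetical (a real extractor would parse the downloaded content), and the sketch assumes crawl.SimpleRequest satisfies download.Request by value via its embedded download.SimpleRequest, whose fields must be populated per the download package.

// dedupingExtractor implements Outlinks, dropping any link already
// returned by a previous extraction.
type dedupingExtractor struct {
	mu   sync.Mutex
	seen map[string]bool
}

func (e *dedupingExtractor) Extract(ctx context.Context, depth int, downloaded download.Downloaded) []download.Request {
	e.mu.Lock()
	defer e.mu.Unlock()
	var requests []download.Request
	for _, link := range extractLinks(downloaded) { // extractLinks is a hypothetical parser.
		if e.seen[link] {
			continue // duplicate: already extracted earlier in the crawl.
		}
		e.seen[link] = true
		// Record the depth at which this request was created; the embedded
		// download.SimpleRequest must also be filled in (names, filesystem,
		// etc.) per the download package.
		requests = append(requests, crawl.SimpleRequest{Depth: depth + 1})
	}
	return requests
}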

type SimpleRequest

type SimpleRequest struct {
	download.SimpleRequest
	Depth int
}

SimpleRequest is a simple implementation of download.Request with an additional field recording the depth at which the request was created. This will typically be set by an outlink extractor.

type T

type T interface {
	Run(ctx context.Context,
		factory DownloaderFactory,
		extractor Outlinks,
		input <-chan download.Request,
		output chan<- Crawled) error
}

T represents the interface to a crawler.

func New

func New(opts ...Option) T

New creates a new instance of T that implements a multilevel, concurrent crawl. The crawl is implemented as a chain of downloaders and extractors, one per depth requested. This allows for concurrency within each level of the crawl as well as across each level.
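
Putting the pieces together, a hedged end-to-end sketch using the scaledFactory and dedupingExtractor sketched above, plus a hypothetical seedRequest helper; it also assumes Run closes the output channel once the crawl completes:

func runCrawl(ctx context.Context) error {
	crawler := crawl.New(crawl.WithCrawlDepth(3), crawl.WithNumExtractors(4))

	input := make(chan download.Request)
	output := make(chan crawl.Crawled)

	errCh := make(chan error, 1)
	go func() {
		errCh <- crawler.Run(ctx, scaledFactory{},
			&dedupingExtractor{seen: map[string]bool{}}, input, output)
	}()

	input <- seedRequest() // hypothetical helper returning the initial download.Request.
	close(input)

	// Drain results; output is assumed to be closed by Run on completion.
	for crawled := range output {
		for _, obj := range crawl.CrawledObjects(crawled) {
			_ = obj // process each content.Object[[]byte, download.Result].
		}
	}
	return <-errCh
}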

Directories

Path      Synopsis
crawlcmd  Package crawlcmd provides support for building command line tools for crawling.
