crawlcmd

Published: Mar 26, 2024 License: Apache-2.0 Imports: 19 Imported by: 8

README

Package cloudeng.io/file/crawl/crawlcmd

import cloudeng.io/file/crawl/crawlcmd

Package crawlcmd provides support for building command line tools for crawling. In particular it provides support for managing the configuration of a crawl via yaml.

Types

Type Config
type Config struct {
	Name          string           `yaml:"name"`
	Depth         int              `yaml:"depth"`
	Seeds         []string         `yaml:"seeds"`
	NoFollowRules []string         `yaml:"nofollow"`
	FollowRules   []string         `yaml:"follow"`
	RewriteRules  []string         `yaml:"rewrite"`
	Download      DownloadConfig   `yaml:"download"`
	NumExtractors int              `yaml:"num_extractors"`
	Extractors    []content.Type   `yaml:"extractors"`
	Cache         CrawlCacheConfig `yaml:"cache"`
}

Config represents the configuration for a single crawl.

Methods
func (c Config) CreateSeedCrawlRequests(ctx context.Context, factories map[string]file.FSFactory, seeds map[string][]cloudpath.Match) ([]download.Request, error)

CreateSeedCrawlRequests creates a set of crawl requests for the supplied seeds. It uses the factories to create a file.FS for the URI scheme of each seed.

func (c Config) ExtractorRegistry(avail map[content.Type]outlinks.Extractor) (*content.Registry[outlinks.Extractor], error)

ExtractorRegistry returns a content.Registry containing the outlinks.Extractor that can be used with outlinks.Extract.

func (c Config) NewLinkProcessor() (*outlinks.RegexpProcessor, error)

NewLinkProcessor creates an outlinks.RegexpProcessor using the nofollow, follow and rewrite specifications in the configuration.

func (c Config) SeedsByScheme(matchers cloudpath.MatcherSpec) (map[string][]cloudpath.Match, []string)

SeedsByScheme returns the crawl seeds grouped by their scheme and any seeds that are not recognised by the supplied cloudpath.MatcherSpec.

Type CrawlCacheConfig
type CrawlCacheConfig struct {
	Prefix            string `yaml:"cache_prefix"`
	ClearBeforeCrawl  bool   `yaml:"cache_clear_before_crawl"`
	Checkpoint        string `yaml:"cache_checkpoint"`
	ShardingPrefixLen int    `yaml:"cache_sharding_prefix_len"`
}

Each crawl may specify its own cache directory and configuration. This will be used to store the results of the crawl. The cache is intended to be relative to the root directory supplied to Initialize.

Methods
func (c CrawlCacheConfig) Initialize(root string) (cachePath, checkpointPath string, err error)

Initialize creates the cache and checkpoint directories relative to the specified root, and optionally clears them before the crawl (if Cache.ClearBeforeCrawl is true). Any environment variables in the root or Cache.Prefix will be expanded.

Type Crawler
type Crawler struct {
	Config
	Extractors func() map[content.Type]outlinks.Extractor
	// contains filtered or unexported fields
}

Crawler represents a crawler instance and contains global configuration information.

Methods
func (c *Crawler) Run(ctx context.Context, fsMap map[string]file.FSFactory, cacheRoot string, displayOutlinks, displayProgress bool) error

Run runs the crawler.

Type DownloadConfig
type DownloadConfig struct {
	DownloadFactoryConfig `yaml:",inline"`
	RateControlConfig     RateControl `yaml:",inline"`
}
Type DownloadFactoryConfig
type DownloadFactoryConfig struct {
	DefaultConcurrency       int   `yaml:"default_concurrency"`
	DefaultRequestChanSize   int   `yaml:"default_request_chan_size"`
	DefaultCrawledChanSize   int   `yaml:"default_crawled_chan_size"`
	PerDepthConcurrency      []int `yaml:"per_depth_concurrency"`
	PerDepthRequestChanSizes []int `yaml:"per_depth_request_chan_sizes"`
	PerDepthCrawledChanSizes []int `yaml:"per_depth_crawled_chan_sizes"`
}

DownloadFactoryConfig is the configuration for a crawl.DownloaderFactory.

Methods
func (df DownloadFactoryConfig) Depth0Chans() (chan download.Request, chan crawl.Crawled)

Depth0Chans creates the channels required to start the crawl with their capacities set to the values specified in the DownloadFactoryConfig for a depth 0 crawl, or the default values if none are specified.

func (df DownloadFactoryConfig) NewFactory(ch chan<- download.Progress) crawl.DownloaderFactory

NewFactory returns a new instance of a crawl.DownloaderFactory which is parameterised via its DownloadFactoryConfig receiver.

Type ExponentialBackoff
type ExponentialBackoff struct {
	InitialDelay time.Duration `yaml:"initial_delay"`
	Steps        int           `yaml:"steps"`
	StatusCodes  []int         `yaml:"status_codes,flow"`
}

ExponentialBackoff is the configuration for an exponential backoff retry strategy for downloads.

Type Rate
type Rate struct {
	Tick            time.Duration `yaml:"tick"`
	RequestsPerTick int           `yaml:"requests_per_tick"`
	BytesPerTick    int           `yaml:"bytes_per_tick"`
}

Rate specifies a rate in one of several forms; only one should be used.

Type RateControl
type RateControl struct {
	Rate               Rate               `yaml:"rate_control"`
	ExponentialBackoff ExponentialBackoff `yaml:"exponential_backoff"`
}

RateControl is the configuration for rate based control of download requests.

Methods
func (c RateControl) NewRateController() (*ratecontrol.Controller, error)

NewRateController creates a new rate controller based on the values contained in RateControl.

Documentation

Overview

Package crawlcmd provides support for building command line tools for crawling. In particular it provides support for managing the configuration of a crawl via yaml.

Constants

This section is empty.

Variables

This section is empty.

Functions

This section is empty.

Types

type Config

type Config struct {
	Name          string           `yaml:"name" cmd:"the name of the crawl"`
	Depth         int              `yaml:"depth" cmd:"the maximum depth to crawl"`
	Seeds         []string         `yaml:"seeds" cmd:"the initial set of URIs to crawl"`
	NoFollowRules []string         `` /* 161-byte string literal not displayed */
	FollowRules   []string         `` /* 155-byte string literal not displayed */
	RewriteRules  []string         `` /* 138-byte string literal not displayed */
	Download      DownloadConfig   `yaml:"download" cmd:"the configuration for downloading documents"`
	NumExtractors int              `yaml:"num_extractors" cmd:"the number of concurrent link extractors to use"`
	Extractors    []content.Type   `yaml:"extractors" cmd:"the content types to extract links from"`
	Cache         CrawlCacheConfig `yaml:"cache" cmd:"the configuration for the cache of downloaded documents"`
}

Config represents the configuration for a single crawl.
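
As an illustration, a minimal crawl specification can be parsed from YAML into a Config using the same cmdyaml helper that appears in the CrawlCacheConfig example below; the field values here are placeholders, not defaults.

package main

import (
	"fmt"

	"cloudeng.io/cmdutil/cmdyaml"
	"cloudeng.io/file/crawl/crawlcmd"
)

func main() {
	var cfg crawlcmd.Config
	// A minimal, illustrative crawl specification.
	err := cmdyaml.ParseConfig([]byte(`
name: example-crawl
depth: 2
seeds:
  - https://example.com
num_extractors: 4
`), &cfg)
	if err != nil {
		fmt.Printf("error: %v\n", err)
		return
	}
	fmt.Println(cfg.Name, cfg.Depth, len(cfg.Seeds))
}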

func (Config) CreateSeedCrawlRequests

func (c Config) CreateSeedCrawlRequests(
	ctx context.Context,
	factories map[string]FSFactory,
	seeds map[string][]cloudpath.Match,
) ([]download.Request, error)

CreateSeedCrawlRequests creates a set of crawl requests for the supplied seeds. It uses the factories to create a file.FS for the URI scheme of each seed.

func (Config) ExtractorRegistry

func (c Config) ExtractorRegistry(avail map[content.Type]outlinks.Extractor) (*content.Registry[outlinks.Extractor], error)

ExtractorRegistry returns a content.Registry containing the outlinks.Extractor that can be used with outlinks.Extract.

func (Config) NewLinkProcessor

func (c Config) NewLinkProcessor() (*outlinks.RegexpProcessor, error)

NewLinkProcessor creates an outlinks.RegexpProcessor using the nofollow, follow and rewrite specifications in the configuration.

func (Config) SeedsByScheme

func (c Config) SeedsByScheme(matchers cloudpath.MatcherSpec) (map[string][]cloudpath.Match, []string)

SeedsByScheme returns the crawl seeds grouped by their scheme and any seeds that are not recognised by the supplied cloudpath.MatcherSpec.
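
A hedged sketch of combining SeedsByScheme with CreateSeedCrawlRequests follows; the cloudpath.MatcherSpec and the per-scheme FSFactory map are left for the caller to supply, and the import paths for the cloudpath and download packages are assumed to be cloudeng.io/path/cloudpath and cloudeng.io/file/download.

import (
	"context"
	"fmt"

	"cloudeng.io/file/crawl/crawlcmd"
	"cloudeng.io/file/download"
	"cloudeng.io/path/cloudpath"
)

// seedRequests is a sketch only: the matchers and factories arguments must be
// populated by the caller for the URI schemes used by the configured seeds.
func seedRequests(ctx context.Context, cfg crawlcmd.Config,
	matchers cloudpath.MatcherSpec,
	factories map[string]crawlcmd.FSFactory,
) ([]download.Request, error) {
	byScheme, unrecognised := cfg.SeedsByScheme(matchers)
	if len(unrecognised) > 0 {
		return nil, fmt.Errorf("unrecognised seeds: %v", unrecognised)
	}
	return cfg.CreateSeedCrawlRequests(ctx, factories, byScheme)
}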

type CrawlCacheConfig

type CrawlCacheConfig struct {
	Downloads         string    `` /* 147-byte string literal not displayed */
	ClearBeforeCrawl  bool      `yaml:"clear_before_crawl" cmd:"if true, the cache and checkpoint will be cleared before the crawl starts."`
	Checkpoint        string    `yaml:"checkpoint" cmd:"the location of any checkpoint data used to resume a crawl, this is an absolute path."`
	ShardingPrefixLen int       `` /* 187-byte string literal not displayed */
	Concurrency       int       `yaml:"concurrency" cmd:"the number of concurrent operations to use when reading/writing to the cache."`
	ServiceConfig     yaml.Node `yaml:"service_config,omitempty" cmd:"cache service specific configuration, eg. AWS specific configuration"`
}

Each crawl may specify its own cache directory and configuration. This will be used to store the results of the crawl. The ServiceConfig field is intended to carry service specific configuration for cache services that require it, such as AWS S3. Interpreting it is deliberately left to client packages to avoid dependency bloat in core packages such as this one. The type of the ServiceConfig contents is generally determined by the scheme of the Downloads path (e.g. s3://... would imply an AWS specific configuration).

Example
package main

import (
	"fmt"

	"cloudeng.io/cmdutil/cmdyaml"
	"cloudeng.io/file/crawl/crawlcmd"
)

func main() {
	type cloudConfig struct {
		Region string `yaml:"region"`
	}
	var cfg crawlcmd.CrawlCacheConfig
	var service cloudConfig

	err := cmdyaml.ParseConfig([]byte(`
downloads: cloud-service://bucket/downloads
service_config:
  region: us-west-2
`), &cfg)
	if err != nil {
		fmt.Printf("error: %v\n", err)
	}
	if err := cfg.ServiceConfig.Decode(&service); err != nil {
		fmt.Printf("error: %v\n", err)
	}
	fmt.Println(cfg.Downloads)
	fmt.Println(service.Region)
}
Output:

cloud-service://bucket/downloads
us-west-2

func (CrawlCacheConfig) CheckpointPath

func (c CrawlCacheConfig) CheckpointPath() string

CheckpointPath returns the expanded checkpoint path.

func (CrawlCacheConfig) DownloadPath

func (c CrawlCacheConfig) DownloadPath() string

DownloadPath returns the expanded downloads path.

func (CrawlCacheConfig) PrepareCheckpoint

func (c CrawlCacheConfig) PrepareCheckpoint(ctx context.Context, op checkpoint.Operation) error

PrepareCheckpoint initializes the checkpoint operation (i.e. calls op.Init(ctx, checkpointPath)) and optionally clears the checkpoint if ClearBeforeCrawl is true. It returns an error if the checkpoint cannot be initialized or cleared.

func (CrawlCacheConfig) PrepareDownloads

func (c CrawlCacheConfig) PrepareDownloads(ctx context.Context, fs content.FS) error

PrepareDownloads ensures that the cache directory exists and is empty if ClearBeforeCrawl is true. It returns an error if the directory cannot be created or cleared.

type Crawler

type Crawler struct {
	// contains filtered or unexported fields
}

Crawler represents a crawler instance and contains global configuration information.

func NewCrawler

func NewCrawler(cfg Config, resources Resources) *Crawler

NewCrawler creates a new crawler instance using the supplied configuration and resources.

func (*Crawler) Run

func (c *Crawler) Run(ctx context.Context,
	displayOutlinks, displayProgress bool) error

Run runs the crawler.

type DownloadConfig

type DownloadConfig struct {
	DownloadFactoryConfig `yaml:",inline"`
	RateControlConfig     RateControl `yaml:",inline"`
}

type DownloadFactoryConfig

type DownloadFactoryConfig struct {
	DefaultConcurrency       int   `` /* 174-byte string literal not displayed */
	DefaultRequestChanSize   int   `` /* 282-byte string literal not displayed */
	DefaultCrawledChanSize   int   `` /* 274-byte string literal not displayed */
	PerDepthConcurrency      []int `yaml:"per_depth_concurrency" cmd:"per crawl depth values for the number of concurrent downloads"`
	PerDepthRequestChanSizes []int `yaml:"per_depth_request_chan_sizes" cmd:"per crawl depth values for the size of the channel used to queue download requests"`
	PerDepthCrawledChanSizes []int `yaml:"per_depth_crawled_chan_sizes" cmd:"per crawl depth values for the size of the channel used to queue downloaded items"`
}

DownloadFactoryConfig is the configuration for a crawl.DownloaderFactory.

func (DownloadFactoryConfig) Depth0Chans

func (df DownloadFactoryConfig) Depth0Chans() (chan download.Request, chan crawl.Crawled)

Depth0Chans creates the channels required to start the crawl with their capacities set to the values specified in the DownloadFactoryConfig for a depth 0 crawl, or the default values if none are specified.

func (DownloadFactoryConfig) NewFactory

func (df DownloadFactoryConfig) NewFactory(ch chan<- download.Progress) crawl.DownloaderFactory

NewFactory returns a new instance of a crawl.DownloaderFactory which is parameterised via its DownloadFactoryConfig receiver.

type ExponentialBackoff

type ExponentialBackoff struct {
	InitialDelay time.Duration `yaml:"initial_delay" cmd:"the initial delay between retries for exponential backoff"`
	Steps        int           `yaml:"steps" cmd:"the number of steps of exponential backoff before giving up"`
	StatusCodes  []int         `yaml:"status_codes,flow" cmd:"the status codes that trigger a retry"`
}

ExponentialBackoff is the configuration for an exponential backoff retry strategy for downloads.

type FSFactory

type FSFactory func(context.Context) (file.FS, error)

FSFactory is a function that returns a file.FS used to crawl a given FS.
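
As a sketch, and assuming that cloudeng.io/file/localfs provides a local file.FS implementation via localfs.New(), an FSFactory for local file seeds might look like the following:

import (
	"context"

	"cloudeng.io/file"
	"cloudeng.io/file/crawl/crawlcmd"
	"cloudeng.io/file/localfs"
)

// localFSFactory returns a file.FS for crawling local files; the use of
// localfs.New() is an assumption about the localfs package.
func localFSFactory(ctx context.Context) (file.FS, error) {
	return localfs.New(), nil
}

// The factory is then registered against the scheme it serves.
var crawlStores = map[string]crawlcmd.FSFactory{
	"file": localFSFactory,
}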

type Rate

type Rate struct {
	Tick            time.Duration `yaml:"tick" cmd:"the duration of a tick"`
	RequestsPerTick int           `yaml:"requests_per_tick" cmd:"the number of requests per tick"`
	BytesPerTick    int           `yaml:"bytes_per_tick" cmd:"the number of bytes per tick"`
}

Rate specifies a rate in one of several forms; only one should be used.

type RateControl

type RateControl struct {
	Rate               Rate               `yaml:"rate_control" cmd:"the rate control parameters"`
	ExponentialBackoff ExponentialBackoff `yaml:"exponential_backoff" cmd:"the exponential backoff parameters"`
}

RateControl is the configuration for rate based control of download requests.

func (RateControl) NewRateController

func (c RateControl) NewRateController() (*ratecontrol.Controller, error)

NewRateController creates a new rate controller based on the values contained in RateControl.
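
For illustration, a RateControl value might be constructed directly with placeholder limits (ten requests per second, exponential backoff starting at one second for HTTP 429 and 503 responses) and turned into a controller:

import (
	"time"

	"cloudeng.io/file/crawl/crawlcmd"
)

// newController builds a rate controller from placeholder limits.
func newController() error {
	rc := crawlcmd.RateControl{
		Rate: crawlcmd.Rate{
			Tick:            time.Second,
			RequestsPerTick: 10,
		},
		ExponentialBackoff: crawlcmd.ExponentialBackoff{
			InitialDelay: time.Second,
			Steps:        5,
			StatusCodes:  []int{429, 503},
		},
	}
	ctrl, err := rc.NewRateController()
	if err != nil {
		return err
	}
	_ = ctrl // hand the controller to the download machinery
	return nil
}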

type Resources

type Resources struct {
	// Extractors are used to extract outlinks from crawled documents
	// based on their content type.
	Extractors map[content.Type]outlinks.Extractor
	// CrawlStoreFactories are used to create file.FS instances for
	// the files being crawled based on their scheme.
	CrawlStoreFactories map[string]FSFactory
	// ContentStoreFactory is a function that returns a content.FS used to store
	// the downloaded content.
	NewContentFS func(context.Context, CrawlCacheConfig) (content.FS, error)
}

Resources contains the resources required by the crawler.
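
A hedged outline of wiring these pieces together follows: the extractor map, the per-scheme store factories and the content store constructor are all supplied by the caller, and the import paths for the content and outlinks packages are assumed to be cloudeng.io/file/content and cloudeng.io/file/crawl/outlinks.

import (
	"context"

	"cloudeng.io/file/content"
	"cloudeng.io/file/crawl/crawlcmd"
	"cloudeng.io/file/crawl/outlinks"
)

// runCrawl sketches how a command line tool might assemble its resources,
// create a crawler and run it.
func runCrawl(ctx context.Context, cfg crawlcmd.Config,
	extractors map[content.Type]outlinks.Extractor,
	storeFactories map[string]crawlcmd.FSFactory,
	newContentFS func(context.Context, crawlcmd.CrawlCacheConfig) (content.FS, error),
) error {
	resources := crawlcmd.Resources{
		Extractors:          extractors,
		CrawlStoreFactories: storeFactories,
		NewContentFS:        newContentFS,
	}
	crawler := crawlcmd.NewCrawler(cfg, resources)
	// Run without displaying outlinks or progress.
	return crawler.Run(ctx, false, false)
}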
