crawler

package module
v0.7.0
Published: Jul 29, 2025 License: MIT Imports: 22 Imported by: 0

README

Crawler

This module crawls the given web page and returns a Page object on the channel configured for each response status code.

You can configure new channels using the Channels map, for example:

	chans := crawler.Channels{
		404: make(chan crawler.Page),
		200: make(chan crawler.Page),
	}
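
For example, each channel can be drained in its own goroutine. A minimal sketch, assuming the chans map above and the standard log package:

	// Log broken links arriving on the 404 channel.
	go func() {
		for page := range chans[404] {
			log.Printf("broken link: %s", page.URL)
		}
	}()

	// Handle successful fetches arriving on the 200 channel.
	go func() {
		for page := range chans[200] {
			log.Printf("fetched %s (%d bytes)", page.URL, len(page.Body))
		}
	}()

The map is then passed to the crawler via Config.Channels.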

Documentation

Index

Constants

This section is empty.

Variables

This section is empty.

Functions

This section is empty.

Types

type Channels

type Channels map[int]chan Page

Channels is a map of Page channels indexed by HTTP response code, so different behavior can be defined for different response codes

type Config added in v0.1.6

type Config struct {
	StartURL            string
	AllowedDomains      []string // Domains to stay within
	UserAgents          []string
	CrawlDelay          time.Duration // Delay between requests to the same domain
	MaxDepth            int           // Maximum crawl depth
	MaxRetries          int           // Max retries for a failed request
	RequestTimeout      time.Duration
	QueueIdleTimeout    time.Duration
	ProxyURL            string // e.g., "http://user:pass@host:port"
	RobotsUserAgent     string // User agent to use for robots.txt checks
	ConcurrentRequests  int    // Number of concurrent fetch workers
	Channels            Channels
	Headers             map[string]string
	LanguageCode        string
	Filters             []func(*Page, *Config) bool
	MaxIdleConnsPerHost int
	MaxIdleConns        int
	Proxies             []string
	RequireHeadless     bool
}

Config holds crawler configuration
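
A sketch of populating it; the values below are illustrative assumptions, not documented defaults, and chans is the map from the README example:

	cfg := crawler.Config{
		StartURL:           "https://example.com",
		AllowedDomains:     []string{"example.com"}, // stay within this domain
		UserAgents:         []string{"examplebot/1.0"},
		CrawlDelay:         2 * time.Second, // delay between requests to the same domain
		MaxDepth:           3,
		MaxRetries:         2,
		RequestTimeout:     10 * time.Second,
		QueueIdleTimeout:   30 * time.Second,
		ConcurrentRequests: 4,
		Channels:           chans,
		Headers:            map[string]string{"Accept-Language": "en"},
		LanguageCode:       "en",
	}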

type Crawler

type Crawler struct {
	// contains filtered or unexported fields
}

Crawler represents the web crawler

func NewCrawler

func NewCrawler(config Config, queue queue.QueueInterface) (*Crawler, error)

NewCrawler initializes a new Crawler

func (*Crawler) Start added in v0.6.0

func (c *Crawler) Start()

Start begins the crawling process
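
Putting the pieces together, a sketch: the queue implementation is assumed to come from elsewhere, since the method set of queue.QueueInterface is not shown on this page, and whether Start blocks is not documented.

	var q queue.QueueInterface // assume an implementation is provided elsewhere

	c, err := crawler.NewCrawler(cfg, q)
	if err != nil {
		log.Fatal(err)
	}
	c.Start() // run in a goroutine if other work must proceed concurrently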

type Headless added in v0.7.0

type Headless struct{}

func NewHeadless added in v0.7.0

func NewHeadless() *Headless

type Page

type Page struct {
	URL  *url.URL      // Page url
	Resp *PageResponse // Page response as returned from the GET request
	Body string        // Response body string
}

Page carries the scanned URL, the response, and the response body string
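
Functions in Config.Filters receive a *Page together with the *Config. A sketch of one, assuming the standard strings package; the meaning of the boolean return is not documented here, so this sketch assumes true means the page is kept:

	// Keep only pages under the /blog/ path prefix (assumption: true = keep).
	onlyBlog := func(p *crawler.Page, cfg *crawler.Config) bool {
		return strings.HasPrefix(p.URL.Path, "/blog/")
	}
	cfg.Filters = append(cfg.Filters, onlyBlog)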

type PageResponse added in v0.7.0

type PageResponse struct {
	StatusCode int
	Headers    map[string]any
	Body       io.Reader
}
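
A handler receiving a Page can inspect these fields directly, for example (a sketch, reusing the chans map from the README example):

	page := <-chans[200]
	log.Println(page.Resp.StatusCode)              // HTTP status code
	log.Println(page.Resp.Headers["Content-Type"]) // header values are any-typed
	log.Println(len(page.Body))                    // the body is also available as Page.Body, a string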

