engine

package v0.0.23
Published: May 5, 2026 License: MIT Imports: 26 Imported by: 0

Documentation

Overview

Package engine implements the core orchestration layer for the Foxhound scraping framework: Hunt (campaign coordinator), Walker (virtual user), Trail (navigation path), Scheduler (goroutine pool), RetryPolicy, and Stats.

Index

Constants

This section is empty.

Variables

This section is empty.

Functions

This section is empty.

Types

type Checkpoint

type Checkpoint struct {
	HuntID         string    `json:"hunt_id"`
	HuntName       string    `json:"hunt_name"`
	Domain         string    `json:"domain"`
	ItemsProcessed int64     `json:"items_processed"`
	RequestsDone   int64     `json:"requests_done"`
	ErrorCount     int64     `json:"errors"`
	LastURL        string    `json:"last_url"`
	Timestamp      time.Time `json:"timestamp"`
	QueueLen       int       `json:"queue_len"`
	ElapsedMs      int64     `json:"elapsed_ms"`
}

Checkpoint captures the observable state of a Hunt at a point in time. All fields are exported so they round-trip cleanly through JSON.

func LoadCheckpoint

func LoadCheckpoint(path string) (*Checkpoint, error)

LoadCheckpoint reads a Checkpoint from the JSON file at path.
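
For example, a resumable run can inspect a prior checkpoint before seeding. A minimal sketch; the file path is illustrative:

cp, err := engine.LoadCheckpoint("hunt.checkpoint.json")
if err != nil {
    log.Fatal(err)
}
log.Printf("resuming %s at %s (%d items, %d errors)",
    cp.HuntName, cp.LastURL, cp.ItemsProcessed, cp.ErrorCount)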

type CheckpointConfig

type CheckpointConfig struct {
	// Enabled turns auto-checkpointing on. When false, no file is written.
	Enabled bool
	// Path is the file path where the checkpoint JSON is written.
	Path string
	// Interval is how many items must be processed between saves.
	// A value of 0 is treated as 100 (the default) to avoid division by zero.
	Interval int
}

CheckpointConfig controls automatic checkpoint saving.
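
A sketch of enabling auto-checkpointing on a HuntConfig; the path and interval are illustrative:

cfg.Checkpoint = engine.CheckpointConfig{
    Enabled:  true,
    Path:     "hunt.checkpoint.json",
    Interval: 250, // save after every 250 items
}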

type DomainStats

type DomainStats struct {
	Requests atomic.Int64
	Errors   atomic.Int64
	Blocked  atomic.Int64
	// contains filtered or unexported fields
}

DomainStats holds per-domain request counters and latency tracking. All fields use atomic operations so callers can read/write without locks.

func (*DomainStats) AvgLatency

func (ds *DomainStats) AvgLatency() time.Duration

AvgLatency returns the average fetch latency for this domain.

func (*DomainStats) AvgProcessLatency

func (ds *DomainStats) AvgProcessLatency() time.Duration

AvgProcessLatency returns the average end-to-end processing latency.

type Hunt

type Hunt struct {
	// contains filtered or unexported fields
}

Hunt is the top-level campaign coordinator. It owns the lifecycle of all walkers, applies middleware to the fetcher, seeds the queue, drains results, and emits stats.

Typical usage:

h := engine.NewHunt(cfg)
if err := h.Run(ctx); err != nil {
    log.Fatal(err)
}

func NewHunt

func NewHunt(cfg HuntConfig) *Hunt

NewHunt creates a Hunt from cfg. It does not start any goroutines; call Run to begin processing.

func (*Hunt) AdaptiveExtractor

func (h *Hunt) AdaptiveExtractor() *parse.AdaptiveExtractor

AdaptiveExtractor returns the Hunt-scoped adaptive extractor configured via WithAdaptive, or nil when adaptive mode has not been enabled.

func (*Hunt) AddSession

func (h *Hunt) AddSession(name string, cfg SessionConfig) *Hunt

AddSession registers a named session bundle. When a Job's SessionID equals name, the walker uses cfg.Fetcher instead of the hunt's default fetcher. Subsequent calls with the same name overwrite the previous registration.

AddSession is safe to call before Run; calling it after Run starts is also supported, but a job already in flight on a walker will not pick up the new registration until that walker's next Pop.

func (*Hunt) Pause

func (h *Hunt) Pause()

Pause signals all walkers to suspend work. It transitions the state to HuntPaused. Resuming is done via Resume.

func (*Hunt) Resume

func (h *Hunt) Resume()

Resume transitions the hunt back to HuntRunning after a Pause.

func (*Hunt) Run

func (h *Hunt) Run(ctx context.Context) error

Run executes the campaign and blocks until all jobs are processed or ctx is cancelled. It returns nil on clean completion and a non-nil error on failure.

Run lifecycle:

  1. Build the middleware-wrapped fetcher.
  2. Push seed jobs to the queue.
  3. Launch N walker goroutines.
  4. Wait until the queue is empty AND all walkers are idle.
  5. Flush all writers.
  6. Transition state to HuntDone.

func (*Hunt) SaveCheckpoint

func (h *Hunt) SaveCheckpoint(path string) error

SaveCheckpoint writes the current hunt state to a JSON file at path. The file is written atomically (temp file + rename) so a partial write cannot corrupt a previously valid checkpoint.

func (*Hunt) Session

func (h *Hunt) Session(name string) *foxhound.Session

Session returns the named session previously registered via AddSession, or nil when no session with that name exists.

func (*Hunt) SetLogger

func (h *Hunt) SetLogger(logger *slog.Logger)

SetLogger replaces the Hunt's logger. Walkers created after this call will inherit the new logger. Intended for testing.

func (*Hunt) State

func (h *Hunt) State() HuntState

State returns the current HuntState.

func (*Hunt) Stats

func (h *Hunt) Stats() *Stats

Stats returns the live statistics for this Hunt.

func (*Hunt) Stop

func (h *Hunt) Stop()

Stop cancels the hunt context, causing all walkers to exit after finishing their current job. Run will return shortly after Stop is called.
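
A common pattern is to wire Stop to an interrupt signal so Run shuts down cleanly. A sketch using os/signal:

go func() {
    sig := make(chan os.Signal, 1)
    signal.Notify(sig, os.Interrupt)
    <-sig
    h.Stop() // walkers finish their current job, then Run returns
}()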

func (*Hunt) Stream

func (h *Hunt) Stream(ctx context.Context) (<-chan *foxhound.Item, error)

Stream starts the hunt in a background goroutine and returns a channel that receives each item as it is produced. The channel is closed when the hunt completes or the context is cancelled, making it safe to use in a range loop:

items, err := hunt.Stream(ctx)
if err != nil {
    log.Fatal(err)
}
for item := range items { ... }

Stream returns an error only when the hunt configuration is invalid. The channel is buffered (100 items) so that slow consumers do not block walkers; items are dropped with a warning log when the buffer is full.

func (*Hunt) StreamWithStats

func (h *Hunt) StreamWithStats(ctx context.Context, statsInterval time.Duration) (<-chan StreamEvent, error)

StreamWithStats starts the hunt and returns a channel of StreamEvent values. Item events arrive as items are scraped; Stats events are emitted every statsInterval. The channel is closed when the hunt finishes.

A statsInterval of 0 disables periodic stats events (only item events are sent). Use Stream instead when stats events are not needed.
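
A sketch of consuming the event channel, logging a stats summary every 10 seconds; handleItem stands in for user code:

events, err := h.StreamWithStats(ctx, 10*time.Second)
if err != nil {
    log.Fatal(err)
}
for ev := range events {
    switch {
    case ev.Item != nil:
        handleItem(ev.Item)
    case ev.Stats != nil:
        log.Println(ev.Stats.Summary())
    }
}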

func (*Hunt) WithAdaptive

func (h *Hunt) WithAdaptive(savePath string) *Hunt

WithAdaptive enables adaptive selector mode for the Hunt. The savePath argument is the file where learned element signatures are persisted as JSON across runs; pass an empty string for in-memory only (signatures are lost between runs).

Once enabled, the walker attaches a shared *parse.AdaptiveExtractor to every Response, so user code can call resp.Adaptive("name"), resp.CSSAdaptive(selector, name), or resp.CSSAdaptiveAll(selector, name) without manually constructing an extractor.

WithAdaptive returns the Hunt for fluent chaining and is safe to call before Run.

func (*Hunt) WithBlockedDomains

func (h *Hunt) WithBlockedDomains(domains ...string) *Hunt

WithBlockedDomains registers fully-qualified domain names whose requests must be aborted by the browser fetcher. Subdomains are also matched, so "example.com" blocks "tracker.example.com". Only effective when the Hunt's Fetcher is a *fetch.CamoufoxFetcher created and managed by the application; otherwise the call is ignored at Run time and a warning is logged.

func (*Hunt) WithDevelopmentMode

func (h *Hunt) WithDevelopmentMode(cacheDir string) *Hunt

WithDevelopmentMode enables on-disk response replay. The first time a job for a given URL is fetched, the real fetcher is invoked and the response is serialised to cacheDir/<sha256(url)>.json. Subsequent fetches for the same URL hit the cache and skip the network entirely, which is the standard fast inner-loop pattern when iterating on a parser.

Pass an empty cacheDir to disable. Errors creating the cache directory are returned at Run time, not here.

func (*Hunt) WithDisableResources

func (h *Hunt) WithDisableResources(types ...string) *Hunt

WithDisableResources registers browser resource types to abort. Valid values: "image", "font", "media", "stylesheet", "object", "imageset", "texttrack", "websocket", "csp_report", "beacon". Unknown values are dropped at apply time. Only effective when the Hunt's Fetcher is a *fetch.CamoufoxFetcher; otherwise the call is ignored at Run time and a warning is logged.
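
The With* options return the Hunt, so they compose fluently before Run. A sketch; the paths and domains are illustrative:

h := engine.NewHunt(cfg).
    WithAdaptive("signatures.json").
    WithDevelopmentMode(".cache/responses").
    WithBlockedDomains("doubleclick.net", "google-analytics.com").
    WithDisableResources("image", "font", "media")
if err := h.Run(ctx); err != nil {
    log.Fatal(err)
}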

type HuntConfig

type HuntConfig struct {
	// Name is a human-readable label used in logs and metrics.
	Name string
	// Domain is the primary target domain (used for metrics grouping).
	Domain string
	// Walkers is the number of concurrent virtual-user goroutines.
	Walkers int
	// MaxConcurrency is the global cap on simultaneous in-flight requests.
	// When 0, defaults to Walkers count.
	MaxConcurrency int
	// Seeds are the initial jobs pushed to the queue before walkers start.
	Seeds []*foxhound.Job
	// Processor is the user-supplied response handler.
	Processor foxhound.Processor
	// Fetcher is the base fetcher before middleware wrapping.
	Fetcher foxhound.Fetcher
	// Queue is the job storage backend.
	Queue foxhound.Queue
	// Pipelines are applied to each extracted Item in order.
	Pipelines []foxhound.Pipeline
	// Writers receive items that survive the pipeline chain.
	Writers []foxhound.Writer
	// Middlewares are wrapped around the Fetcher (first middleware is outermost).
	Middlewares []foxhound.Middleware
	// BehaviorProfile selects the human-simulation preset applied by each
	// Walker: "careful", "moderate", or "aggressive". Defaults to "moderate"
	// when empty so walkers always apply timing and rhythm delays.
	BehaviorProfile string
	// Checkpoint controls automatic state saving. Optional — checkpointing is
	// inactive when Checkpoint.Enabled is false (the zero value).
	Checkpoint CheckpointConfig
	// ItemCallback is invoked for every item that survives the pipeline chain,
	// before it is written. This enables streaming item processing during the
	// crawl without needing to use Stream(). The callback runs synchronously
	// in the walker goroutine so it must be fast.
	ItemCallback func(ctx context.Context, item *foxhound.Item)
	// OnStart is called once when the hunt begins (after seeds are queued).
	OnStart func(ctx context.Context)
	// OnClose is called once when the hunt completes (after writers flush).
	OnClose func(ctx context.Context, stats *Stats)
	// OnError is called when a fetch or process error occurs. Errors are
	// still logged; this hook enables custom error handling.
	OnError func(ctx context.Context, job *foxhound.Job, err error)
	// OnItem is called for each item after pipeline processing. Unlike
	// ItemCallback, OnItem receives the originating Job for context.
	OnItem func(ctx context.Context, job *foxhound.Job, item *foxhound.Item)
	// PageActions are JavaScript snippets executed after page load when using
	// the browser fetcher. They are injected as JobSteps of type
	// JobStepEvaluate on every job that uses browser mode.
	PageActions []string
	// Pool is an optional URL pool from a collect phase. When set, all URLs
	// in the pool are drained and added as seed jobs before walkers start.
	// This enables the two-phase pattern: collect URLs first, process concurrently.
	Pool Pool
	// PoolFetchMode sets the FetchMode for jobs created from pool URLs.
	// Defaults to FetchBrowser when PoolFetchModeSet is false.
	PoolFetchMode foxhound.FetchMode
	// PoolFetchModeSet indicates the user explicitly set PoolFetchMode.
	// When false and pool URLs exist, the mode defaults to FetchBrowser.
	PoolFetchModeSet bool
}

HuntConfig holds all dependencies and settings for a single scraping campaign.
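
A minimal configuration sketch. The fetcher, queue, and processor values are illustrative stand-ins for real implementations, and the seed literal assumes foxhound.Job exposes a URL field:

cfg := engine.HuntConfig{
    Name:      "products",
    Domain:    "example.com",
    Walkers:   4,
    Seeds:     []*foxhound.Job{{URL: "https://example.com/catalog"}},
    Fetcher:   fetcher,   // any foxhound.Fetcher
    Queue:     queue,     // any foxhound.Queue
    Processor: processor, // user-supplied response handler
}
h := engine.NewHunt(cfg)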

type HuntMetrics

type HuntMetrics struct {
	RequestsCount       int64
	FailedRequestsCount int64
	OffsiteRequests     int64
	BlockedRequests     int64
	ItemsScraped        int64
	ItemsDropped        int64
	ResponseBytes       int64
	StartTime           time.Time
	EndTime             time.Time
	RequestDelay        time.Duration
	ParallelRequests    int

	StatusCounts map[int]int64
	DomainBytes  map[string]int64
	LogCounts    map[string]int64
	// contains filtered or unexported fields
}

HuntMetrics holds extended statistics for a hunt beyond what Stats tracks. It adds offsite/blocked counters, status code breakdown, and per-domain byte tracking.

func NewHuntMetrics

func NewHuntMetrics() *HuntMetrics

NewHuntMetrics creates a HuntMetrics initialised with the current time.

func (*HuntMetrics) ElapsedSeconds

func (hm *HuntMetrics) ElapsedSeconds() float64

ElapsedSeconds returns the duration of the hunt in seconds.

func (*HuntMetrics) IncrementResponseBytes

func (hm *HuntMetrics) IncrementResponseBytes(domain string, count int64)

IncrementResponseBytes adds count bytes to the running total for domain.

func (*HuntMetrics) IncrementStatus

func (hm *HuntMetrics) IncrementStatus(status int)

IncrementStatus records an HTTP status code occurrence.

func (*HuntMetrics) RequestsPerSecond

func (hm *HuntMetrics) RequestsPerSecond() float64

RequestsPerSecond returns the average requests per second.

func (*HuntMetrics) ToMap

func (hm *HuntMetrics) ToMap() map[string]any

ToMap returns the metrics as a map for structured logging or JSON export.

type HuntResult

type HuntResult struct {
	// Metrics holds the hunt metrics.
	Metrics *HuntMetrics
	// Items holds all scraped items.
	Items *ItemList
	// Paused is true if the hunt was paused (not completed).
	Paused bool
}

HuntResult is the complete result from a hunt execution.

func (*HuntResult) Completed

func (hr *HuntResult) Completed() bool

Completed returns true if the hunt finished normally (was not paused).

func (*HuntResult) Len

func (hr *HuntResult) Len() int

Len returns the number of scraped items.

type HuntState

type HuntState int

HuntState represents the lifecycle state of a Hunt.

const (
	// HuntIdle means the hunt has not started yet.
	HuntIdle HuntState = iota
	// HuntRunning means walkers are actively processing jobs.
	HuntRunning
	// HuntPaused means the hunt is temporarily suspended.
	HuntPaused
	// HuntDone means all jobs have been processed successfully.
	HuntDone
	// HuntFailed means the hunt terminated with an unrecoverable error.
	HuntFailed
)

func (HuntState) String

func (s HuntState) String() string

String returns a human-readable state name.

type ItemList

type ItemList struct {
	// contains filtered or unexported fields
}

ItemList is a thread-safe collection of scraped items with batch export methods (JSON, JSONL, CSV). Use it with Hunt.ItemCallback or Hunt.StreamWithStats to accumulate items during a hunt.
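
A sketch of accumulating items via HuntConfig.ItemCallback and exporting them after Run; the output path is illustrative:

items := engine.NewItemList()
cfg.ItemCallback = func(ctx context.Context, item *foxhound.Item) {
    items.Append(item) // safe to call from concurrent walkers
}
// ... after h.Run(ctx) returns:
if err := items.ToJSONL("items.jsonl"); err != nil {
    log.Fatal(err)
}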

func NewItemList

func NewItemList() *ItemList

NewItemList creates an empty ItemList.

func (*ItemList) Append

func (il *ItemList) Append(item *foxhound.Item)

Append adds an item to the list.

func (*ItemList) Clear

func (il *ItemList) Clear()

Clear removes all items.

func (*ItemList) Items

func (il *ItemList) Items() []*foxhound.Item

Items returns a copy of the items slice.

func (*ItemList) Len

func (il *ItemList) Len() int

Len returns the number of items.

func (*ItemList) ToCSV

func (il *ItemList) ToCSV(path string, columns []string) error

ToCSV exports items as CSV with the given column order.

func (*ItemList) ToJSON

func (il *ItemList) ToJSON(path string, indent bool) error

ToJSON exports all items to a JSON file. When indent is true, the output is pretty-printed with 2-space indentation.

func (*ItemList) ToJSONL

func (il *ItemList) ToJSONL(path string) error

ToJSONL exports items as JSON Lines (one JSON object per line).

type MemoryPool

type MemoryPool struct {
	// contains filtered or unexported fields
}

MemoryPool is an in-memory Pool backed by a slice + dedup set.

func NewMemoryPool

func NewMemoryPool() *MemoryPool

NewMemoryPool creates an empty in-memory pool.

func (*MemoryPool) Add

func (p *MemoryPool) Add(_ context.Context, url string) error

func (*MemoryPool) AddBatch

func (p *MemoryPool) AddBatch(_ context.Context, urls []string) error

func (*MemoryPool) Close

func (p *MemoryPool) Close() error

func (*MemoryPool) Drain

func (p *MemoryPool) Drain(_ context.Context) ([]string, error)

func (*MemoryPool) Len

func (p *MemoryPool) Len() int

type Pool

type Pool interface {
	// Add stores a URL. Duplicates are silently ignored.
	Add(ctx context.Context, url string) error
	// AddBatch stores multiple URLs. Duplicates are silently ignored.
	AddBatch(ctx context.Context, urls []string) error
	// Drain returns all stored URLs and empties the pool.
	Drain(ctx context.Context) ([]string, error)
	// Len returns the number of URLs in the pool.
	Len() int
	// Close releases resources.
	Close() error
}

Pool stores discovered URLs between collect and process phases. Implementations must be safe for concurrent use.
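
A sketch of the two-phase pattern with an in-memory pool; the URLs are illustrative:

pool := engine.NewMemoryPool()
if err := pool.AddBatch(ctx, []string{
    "https://example.com/item/1",
    "https://example.com/item/2",
}); err != nil {
    log.Fatal(err)
}
cfg.Pool = pool // drained into seed jobs before walkers start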

type PostgresPool

type PostgresPool struct {
	// contains filtered or unexported fields
}

func NewPostgresPool

func NewPostgresPool(dsn, table string) (*PostgresPool, error)

func (*PostgresPool) Add

func (p *PostgresPool) Add(_ context.Context, url string) error

func (*PostgresPool) AddBatch

func (p *PostgresPool) AddBatch(_ context.Context, urls []string) error

func (*PostgresPool) Close

func (p *PostgresPool) Close() error

func (*PostgresPool) Drain

func (p *PostgresPool) Drain(_ context.Context) ([]string, error)

func (*PostgresPool) Len

func (p *PostgresPool) Len() int

type RetryPolicy

type RetryPolicy struct {
	// MaxRetries is the maximum number of retry attempts (not counting the
	// original attempt). A value of 3 means up to 4 total attempts.
	MaxRetries int

	// BaseDelay is the initial delay before the first retry.
	BaseDelay time.Duration

	// MaxDelay caps the computed delay regardless of the attempt number.
	MaxDelay time.Duration

	// Backoff is the exponential multiplier applied to each successive delay.
	// A value of 2.0 doubles the delay each attempt.
	Backoff float64
}

RetryPolicy controls when and how often failed requests are retried.

func DefaultRetryPolicy

func DefaultRetryPolicy() *RetryPolicy

DefaultRetryPolicy returns a sensible retry policy suitable for most scraping workloads: 3 retries, starting at 1 second, doubling up to 30 seconds.

func (*RetryPolicy) Delay

func (rp *RetryPolicy) Delay(attempt int) time.Duration

Delay returns how long to wait before the given retry attempt. It uses exponential backoff with full-jitter so that concurrent walkers do not stampede the same target simultaneously.
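
A sketch of inspecting the delay schedule; because of the full-jitter, the printed values vary from run to run:

rp := engine.DefaultRetryPolicy() // 3 retries, 1s base, 2.0 backoff, 30s cap
for attempt := 0; attempt < rp.MaxRetries; attempt++ {
    fmt.Printf("retry %d: wait %v\n", attempt, rp.Delay(attempt))
}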

func (*RetryPolicy) ShouldRetry

func (rp *RetryPolicy) ShouldRetry(attempt int, err error, resp *foxhound.Response) bool

ShouldRetry reports whether the request should be retried. attempt is the zero-based number of retries already performed. err is the fetch error (may be nil). resp is the response (may be nil on network errors).

type SQLitePool

type SQLitePool struct {
	// contains filtered or unexported fields
}

func NewSQLitePool

func NewSQLitePool(dbPath string) (*SQLitePool, error)

func (*SQLitePool) Add

func (p *SQLitePool) Add(_ context.Context, url string) error

func (*SQLitePool) AddBatch

func (p *SQLitePool) AddBatch(_ context.Context, urls []string) error

func (*SQLitePool) Close

func (p *SQLitePool) Close() error

func (*SQLitePool) Drain

func (p *SQLitePool) Drain(_ context.Context) ([]string, error)

func (*SQLitePool) Len

func (p *SQLitePool) Len() int

type Scheduler

type Scheduler struct {
	// contains filtered or unexported fields
}

Scheduler manages a fixed-size pool of worker goroutines that consume jobs from a Queue and invoke a user-supplied handler for each one. It is designed for use as an internal component of Hunt but can also be used standalone.

func NewScheduler

func NewScheduler(queue foxhound.Queue, maxWorkers int) *Scheduler

NewScheduler creates a Scheduler backed by queue with a pool of at most maxWorkers concurrent workers.

func (*Scheduler) Start

func (s *Scheduler) Start(ctx context.Context, handler func(context.Context, *foxhound.Job) error) error

Start launches maxWorkers goroutines that each loop over queue.Pop → handler. It blocks until Stop is called or ctx is cancelled, and returns only after all in-flight handlers have returned.

Calling Start again after it returns is not supported.
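
A standalone sketch, assuming queue is any foxhound.Queue, seeds is a slice of jobs, and Job exposes a URL field:

s := engine.NewScheduler(queue, 8)
if err := s.Submit(ctx, seeds...); err != nil {
    log.Fatal(err)
}
if err := s.Start(ctx, func(ctx context.Context, job *foxhound.Job) error {
    log.Println("handling", job.URL)
    return nil
}); err != nil {
    log.Fatal(err)
}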

func (*Scheduler) Stop

func (s *Scheduler) Stop()

Stop signals all workers to stop after finishing their current job. It is safe to call more than once.

func (*Scheduler) Submit

func (s *Scheduler) Submit(ctx context.Context, jobs ...*foxhound.Job) error

Submit pushes one or more jobs directly to the underlying queue. It is safe to call from any goroutine.

func (*Scheduler) Wait

func (s *Scheduler) Wait()

Wait blocks until Start has returned and all worker goroutines have exited. It is safe to call concurrently with Stop.

type SessionConfig

type SessionConfig struct {
	// Name is the unique identifier for this session within a Hunt.
	// Must match Job.SessionID for routing to work.
	Name string
	// Fetcher is the underlying fetcher for the session. Required.
	Fetcher foxhound.Fetcher
	// Identity is the optional identity profile attached to the session.
	// Stored as `any` to avoid an import cycle with the identity package.
	Identity any
	// Proxy is the optional proxy URL recorded with the session for
	// inspection. Wire it through the fetcher's own option at construction.
	Proxy string
}

SessionConfig describes a named session bundle that a Hunt can route jobs to via Job.SessionID. Each session has its own fetcher, identity, and proxy URL — useful when one campaign needs to mix fast-static fetches for index pages with slow stealth fetches for detail pages, with separate cookie jars per role.
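
A sketch of registering a stealth session alongside the default fetcher; browserFetcher stands in for a real fetcher such as a *fetch.CamoufoxFetcher:

h := engine.NewHunt(cfg).AddSession("detail", engine.SessionConfig{
    Name:    "detail",
    Fetcher: browserFetcher,
})
// Jobs whose SessionID is "detail" use browserFetcher; all other
// jobs use the hunt's default fetcher.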

type Stats

type Stats struct {
	StartedAt      time.Time
	RequestCount   atomic.Int64
	SuccessCount   atomic.Int64
	ErrorCount     atomic.Int64
	BlockedCount   atomic.Int64
	ItemCount      atomic.Int64
	EscalatedCount atomic.Int64
	BytesReceived  atomic.Int64
	// contains filtered or unexported fields
}

Stats holds runtime metrics for a Hunt. All top-level counters use atomic.Int64 so callers can read them without holding any lock. Per-domain stats use a sync.Map for a lock-free read path.

func NewStats

func NewStats() *Stats

NewStats creates a Stats instance ready for use.

func (*Stats) DomainStatsFor

func (s *Stats) DomainStatsFor(domain string) *DomainStats

DomainStatsFor returns the DomainStats for the given domain, or nil if no requests have been recorded for it.

func (*Stats) RecordBlock

func (s *Stats) RecordBlock(domain string)

RecordBlock increments the request and blocked counters without double-counting the request as a success. Used when a block is detected outside the normal fetch path (e.g. CAPTCHA detection in the walker).

func (*Stats) RecordBytes

func (s *Stats) RecordBytes(n int64)

RecordBytes adds n to the total bytes-received counter.

func (*Stats) RecordEscalation

func (s *Stats) RecordEscalation()

RecordEscalation increments the count of requests that were escalated from the static fetcher to the browser fetcher.

func (*Stats) RecordItems

func (s *Stats) RecordItems(count int)

RecordItems increments the scraped item counter by count.

func (*Stats) RecordProcessDuration

func (s *Stats) RecordProcessDuration(domain string, duration time.Duration)

RecordProcessDuration records the end-to-end processing time for a job (fetch + process + pipeline + write) for the given domain.

func (*Stats) RecordRequest

func (s *Stats) RecordRequest(domain string, duration time.Duration, err error, blocked bool)

RecordRequest records a completed fetch attempt for the given domain. Pass err=nil and blocked=false for a clean success.

func (*Stats) Summary

func (s *Stats) Summary() string

Summary returns a human-readable snapshot of current statistics.

func (*Stats) ToMap

func (s *Stats) ToMap() map[string]any

ToMap returns a structured snapshot of current statistics suitable for JSON serialisation or structured logging.

type Step

type Step struct {
	// Action is the kind of step.
	Action StepAction
	// URL is the target for StepNavigate.
	URL string
	// Selector is the CSS selector for StepClick, StepWait, and StepExtract.
	// For InfiniteScroll, Selector is the scrollable container (empty = whole page).
	Selector string
	// Duration is the timeout/wait for StepWait.
	Duration time.Duration
	// Process is the extraction logic for StepExtract.
	Process foxhound.Processor
	// MaxScrolls limits InfiniteScroll iterations.
	MaxScrolls int
	// MaxClicks limits LoadMore button clicks.
	MaxClicks int
	// MaxPages limits Paginate page follows.
	MaxPages int
	// Script is the JavaScript code for StepEvaluate.
	Script string
	// StopSelector is a CSS selector; InfiniteScroll stops when
	// document.querySelectorAll(StopSelector).length >= StopCount.
	StopSelector string
	// StopCount is the target element count for StopSelector.
	StopCount int
	// ScrollWait is the duration to wait after each scroll iteration before
	// checking for new content. Defaults to 2s when zero.
	ScrollWait time.Duration
	// Optional marks this step as non-fatal: if it fails, execution continues.
	Optional bool
	// Value is the text to type into an input field for StepFill.
	Value string
}

Step is a single action within a Trail.

type StepAction

type StepAction int

StepAction identifies what a Trail Step should do.

const (
	// StepNavigate navigates to a URL (creates a Job).
	StepNavigate StepAction = iota
	// StepClick clicks a CSS selector (browser-mode only).
	StepClick
	// StepWait waits for a CSS selector to appear or a fixed duration.
	StepWait
	// StepExtract runs a Processor against the current page.
	StepExtract
	// StepScroll scrolls the page (browser-mode only).
	StepScroll
	// StepInfiniteScroll scrolls to bottom repeatedly until no new content
	// loads (for lazy-load / infinite scroll pages like Google Maps).
	StepInfiniteScroll
	// StepLoadMore clicks a "load more" button repeatedly until it
	// disappears or max clicks reached.
	StepLoadMore
	// StepPaginate detects pagination links ("Next", page numbers) and
	// follows them, collecting content from each page.
	StepPaginate
	// StepEvaluate executes custom JavaScript on the page.
	StepEvaluate
	// StepFill types text into an input field with human-like keystrokes.
	StepFill
	// StepCollect extracts URLs from matching elements into a Pool.
	StepCollect
)

type StreamEvent

type StreamEvent struct {
	// Item is non-nil for item events.
	Item *foxhound.Item
	// Stats is non-nil for periodic stats snapshot events.
	Stats *Stats
}

StreamEvent is emitted on the channel returned by StreamWithStats. Exactly one of Item or Stats is non-nil per event.

type Trail

type Trail struct {
	// Name is a human-readable label for this navigation path.
	Name string
	// Steps is the ordered sequence of actions.
	Steps []Step
	// contains filtered or unexported fields
}

Trail is a reusable navigation blueprint composed of ordered Steps. It is built via a fluent builder API and converted to Jobs when submitted to a Hunt.
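
A sketch of a browser-mode trail; the URL and selectors are illustrative:

trail := engine.NewTrail("catalog").
    Navigate("https://example.com/catalog").
    ClickOptional("#cookie-accept").
    Wait(".product-card", 10*time.Second).
    InfiniteScroll(20)
cfg.Seeds = append(cfg.Seeds, trail.ToJobs()...)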

func Login

func Login(name, loginURL, userSelector, passSelector, submitSelector, username, password string) *Trail

Login builds a login trail that navigates to the login page, fills credentials, and submits the form. The returned trail can be further chained with additional steps (e.g. WaitOptional for a post-login element).
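
A sketch; the URL, selectors, and credentials are illustrative:

trail := engine.Login("auth", "https://example.com/login",
    "#username", "#password", "button[type=submit]",
    user, pass).
    WaitOptional(".dashboard", 5*time.Second)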

func NewTrail

func NewTrail(name string) *Trail

NewTrail creates a new empty Trail with the given name.

func (*Trail) Adaptive

func (t *Trail) Adaptive(name, selector string) *Trail

Adaptive registers an adaptive selector that survives DOM rewrites by falling back to similarity matching when the primary CSS selector fails to match on a future run. The element signature is learned automatically on the first successful extraction and persisted via the Hunt's adaptive store (configured by Hunt.WithAdaptive).

Adaptive only records the registration intent on the Trail; the actual Register and signature learning happens when the produced Job is fetched by a walker, so the Hunt must have been configured with WithAdaptive before Run is called.

func (*Trail) CaptureXHR

func (t *Trail) CaptureXHR(urlPattern string) *Trail

CaptureXHR registers a URL regexp pattern. While any job produced by this Trail is fetched, the browser fetcher captures every XHR or fetch response whose URL matches the pattern, storing the request URL, status, headers, and body in Response.CapturedXHR. Use this to discover the JSON API behind a JavaScript-rendered page without parsing the DOM.

Multiple calls accumulate; all patterns are matched (logical OR). Patterns must be valid Go regexps; invalid patterns are silently dropped at fetch time.
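
A sketch of discovering the search API behind a rendered page; the URL and pattern are illustrative:

trail := engine.NewTrail("api-discovery").
    Navigate("https://example.com/app").
    CaptureXHR(`/api/.*/search`).
    Wait(".results", 10*time.Second)
// After the fetch, matching responses are available in Response.CapturedXHR.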

func (*Trail) Click

func (t *Trail) Click(selector string) *Trail

Click appends a StepClick step that clicks the element matching selector. This step is only meaningful when using the browser fetcher.

func (*Trail) ClickOptional

func (t *Trail) ClickOptional(selector string) *Trail

ClickOptional appends a StepClick step that does NOT abort the fetch on failure. Useful for dismissing elements that may or may not be present.

func (*Trail) Collect

func (t *Trail) Collect(selector, attr string) *Trail

Collect appends a step that extracts URLs from all elements matching selector, reading the given attribute (typically "href"). The collected URLs are stored in Response.StepResults as []string.

This step is implemented as a JS Evaluate that runs querySelectorAll(selector) and returns the attribute values.

func (*Trail) Evaluate

func (t *Trail) Evaluate(script string) *Trail

Evaluate appends a step that executes custom JavaScript on the page. The return value of the script is available in Response.StepResults.

func (*Trail) Extract

func (t *Trail) Extract(processor foxhound.Processor) *Trail

Extract appends a StepExtract step that runs processor against the current page response.

func (*Trail) Fill

func (t *Trail) Fill(selector, value string) *Trail

Fill appends a StepFill step that types value into the input matching selector with human-like keystrokes (using behavior.Keyboard).

func (*Trail) InfiniteScroll

func (t *Trail) InfiniteScroll(maxScrolls int) *Trail

InfiniteScroll appends a step that scrolls to the bottom repeatedly until no new content loads (for lazy-load / infinite scroll pages). maxScrolls limits iterations (0 = default 50). Scrolls the whole page.

func (*Trail) InfiniteScrollIn

func (t *Trail) InfiniteScrollIn(container string, maxScrolls int) *Trail

InfiniteScrollIn appends an InfiniteScroll step that scrolls inside a specific container element (e.g. Google Maps results panel, Facebook feed). container is a CSS selector for the scrollable element.

func (*Trail) InfiniteScrollInUntil

func (t *Trail) InfiniteScrollInUntil(container, stopSelector string, stopCount, maxScrolls int) *Trail

InfiniteScrollInUntil combines container scrolling with a stop condition.

func (*Trail) InfiniteScrollUntil

func (t *Trail) InfiniteScrollUntil(stopSelector string, stopCount int, maxScrolls int) *Trail

InfiniteScrollUntil appends an InfiniteScroll step that stops when stopSelector matches at least stopCount elements. This scrolls until the target is reached rather than until content stops loading.

func (*Trail) InfiniteScrollWithWait

func (t *Trail) InfiniteScrollWithWait(maxScrolls int, scrollWait time.Duration) *Trail

InfiniteScrollWithWait appends an InfiniteScroll with custom post-scroll wait.

func (*Trail) LoadMore

func (t *Trail) LoadMore(selector string, maxClicks int) *Trail

LoadMore appends a step that clicks the element matching selector repeatedly until it disappears or maxClicks is reached (0 = default 20).

func (*Trail) Navigate

func (t *Trail) Navigate(url string) *Trail

Navigate appends a StepNavigate step that fetches url.

func (*Trail) NoWarmup

func (t *Trail) NoWarmup() *Trail

NoWarmup disables the automatic homepage warm-up visit that ToJobs() prepends by default. Use this when speed is more important than stealth, or when the trail already starts at the homepage.

func (*Trail) Paginate

func (t *Trail) Paginate(selector string, maxPages int) *Trail

Paginate appends a step that detects pagination links matching selector (e.g. "a.next", "li.next a") and follows them, collecting content from each page. maxPages limits how many pages to follow (0 = default 10).

func (*Trail) Scroll

func (t *Trail) Scroll() *Trail

Scroll appends a StepScroll step that scrolls the page. This step is only meaningful when using the browser fetcher.

func (*Trail) ToJobs

func (t *Trail) ToJobs() []*foxhound.Job

ToJobs converts the Trail into foxhound.Jobs. Each StepNavigate starts a new Job; subsequent browser steps (Click, Wait, Scroll) are attached as JobSteps on that Job and set FetchMode to FetchBrowser.

Extract steps are NOT converted to JobSteps because their Processor (an interface) cannot survive JSON serialisation through queue backends. Extraction is handled by the hunt-level Processor after the fetch completes.

Steps that appear before the first Navigate are silently skipped.

By default, when the trail has browser steps and the first Navigate URL is not the site homepage, ToJobs prepends a warm-up Job that visits the homepage first to seed cookies and build a natural referrer chain. Call NoWarmup() to disable this behaviour.

func (*Trail) Wait

func (t *Trail) Wait(selector string, timeout time.Duration) *Trail

Wait appends a StepWait step that blocks until selector appears or timeout elapses.

func (*Trail) WaitOptional

func (t *Trail) WaitOptional(selector string, timeout time.Duration) *Trail

WaitOptional appends a StepWait step that does NOT abort the fetch on failure. Useful for waiting on elements that may not appear on every page.

type Walker

type Walker struct {
	// contains filtered or unexported fields
}

Walker is a virtual user that pops jobs from the queue, fetches them, processes the responses, runs items through the pipeline chain, and writes results — looping until the context is cancelled or the queue is drained.

func (*Walker) Run

func (w *Walker) Run(ctx context.Context) error

Run is the main loop. It exits when ctx is cancelled. The caller's WaitGroup must be decremented after Run returns.
