engine

package v0.0.23
Published: May 5, 2026 License: MIT Imports: 26 Imported by: 0

Documentation

Overview

Package engine implements the core orchestration layer for the Foxhound scraping framework: Hunt (campaign coordinator), Walker (virtual user), Trail (navigation path), Scheduler (goroutine pool), RetryPolicy, and Stats.

Index

Constants

This section is empty.

Variables

This section is empty.

Functions

This section is empty.

Types

type Checkpoint

type Checkpoint struct {
	HuntID         string    `json:"hunt_id"`
	HuntName       string    `json:"hunt_name"`
	Domain         string    `json:"domain"`
	ItemsProcessed int64     `json:"items_processed"`
	RequestsDone   int64     `json:"requests_done"`
	ErrorCount     int64     `json:"errors"`
	LastURL        string    `json:"last_url"`
	Timestamp      time.Time `json:"timestamp"`
	QueueLen       int       `json:"queue_len"`
	ElapsedMs      int64     `json:"elapsed_ms"`
}

Checkpoint captures the observable state of a Hunt at a point in time. All fields are exported so they round-trip cleanly through JSON.

func LoadCheckpoint

func LoadCheckpoint(path string) (*Checkpoint, error)

LoadCheckpoint reads a Checkpoint from the JSON file at path.
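
For example, a resumable run can inspect a prior checkpoint before seeding. A minimal sketch; the file path is illustrative:

cp, err := engine.LoadCheckpoint("hunt.checkpoint.json")
if err != nil {
    log.Fatal(err)
}
log.Printf("resuming %s at %s (%d items, %d errors)",
    cp.HuntName, cp.LastURL, cp.ItemsProcessed, cp.ErrorCount)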

type CheckpointConfig

type CheckpointConfig struct {
	// Enabled turns auto-checkpointing on. When false, no file is written.
	Enabled bool
	// Path is the file path where the checkpoint JSON is written.
	Path string
	// Interval is how many items must be processed between saves.
	// A value of 0 is treated as 100 (the default) to avoid division by zero.
	Interval int
}

CheckpointConfig controls automatic checkpoint saving.
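
A sketch of enabling auto-checkpointing on a HuntConfig; the path and interval are illustrative:

cfg.Checkpoint = engine.CheckpointConfig{
    Enabled:  true,
    Path:     "hunt.checkpoint.json",
    Interval: 250, // save after every 250 items
}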

type DomainStats

type DomainStats struct {
	Requests atomic.Int64
	Errors   atomic.Int64
	Blocked  atomic.Int64
	// contains filtered or unexported fields
}

DomainStats holds per-domain request counters and latency tracking. All fields use atomic operations so callers can read/write without locks.

func (*DomainStats) AvgLatency

func (ds *DomainStats) AvgLatency() time.Duration

AvgLatency returns the average fetch latency for this domain.

func (*DomainStats) AvgProcessLatency

func (ds *DomainStats) AvgProcessLatency() time.Duration

AvgProcessLatency returns the average end-to-end processing latency.

type Hunt

type Hunt struct {
	// contains filtered or unexported fields
}

Hunt is the top-level campaign coordinator. It owns the lifecycle of all walkers, applies middleware to the fetcher, seeds the queue, drains results, and emits stats.

Typical usage:

h := engine.NewHunt(cfg)
if err := h.Run(ctx); err != nil {
    log.Fatal(err)
}

func NewHunt

func NewHunt(cfg HuntConfig) *Hunt

NewHunt creates a Hunt from cfg. It does not start any goroutines; call Run to begin processing.

func (*Hunt) AdaptiveExtractor

func (h *Hunt) AdaptiveExtractor() *parse.AdaptiveExtractor

AdaptiveExtractor returns the Hunt-scoped adaptive extractor configured via WithAdaptive, or nil when adaptive mode has not been enabled.

func (*Hunt) AddSession

func (h *Hunt) AddSession(name string, cfg SessionConfig) *Hunt

AddSession registers a named session bundle. When a Job's SessionID equals name, the walker uses cfg.Fetcher instead of the hunt's default fetcher. Subsequent calls with the same name overwrite the previous registration.

AddSession is safe to call before Run; calling it after Run starts is also supported, but a job already in flight on a walker will not pick up the new registration until that walker's next Pop.

func (*Hunt) Pause

func (h *Hunt) Pause()

Pause signals all walkers to suspend work. It transitions the state to HuntPaused. Resuming is done via Resume.

func (*Hunt) Resume

func (h *Hunt) Resume()

Resume transitions the hunt back to HuntRunning after a Pause.

func (*Hunt) Run

func (h *Hunt) Run(ctx context.Context) error

Run executes the campaign and blocks until all jobs are processed or ctx is cancelled. It returns nil on clean completion and a non-nil error on failure.

Run lifecycle:

  1. Build the middleware-wrapped fetcher.
  2. Push seed jobs to the queue.
  3. Launch N walker goroutines.
  4. Wait until the queue is empty AND all walkers are idle.
  5. Flush all writers.
  6. Transition state to HuntDone.

func (*Hunt) SaveCheckpoint

func (h *Hunt) SaveCheckpoint(path string) error

SaveCheckpoint writes the current hunt state to a JSON file at path. The file is written atomically (temp file + rename) so a partial write cannot corrupt a previously valid checkpoint.

func (*Hunt) Session

func (h *Hunt) Session(name string) *foxhound.Session

Session returns the named session previously registered via AddSession, or nil when no session with that name exists.

func (*Hunt) SetLogger

func (h *Hunt) SetLogger(logger *slog.Logger)

SetLogger replaces the Hunt's logger. Walkers created after this call will inherit the new logger. Intended for testing.

func (*Hunt) State

func (h *Hunt) State() HuntState

State returns the current HuntState.

func (*Hunt) Stats

func (h *Hunt) Stats() *Stats

Stats returns the live statistics for this Hunt.

func (*Hunt) Stop

func (h *Hunt) Stop()

Stop cancels the hunt context, causing all walkers to exit after finishing their current job. Run will return shortly after Stop is called.
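
A common pattern is to wire Stop to an interrupt signal so Run shuts down cleanly. A sketch using os/signal:

go func() {
    sig := make(chan os.Signal, 1)
    signal.Notify(sig, os.Interrupt)
    <-sig
    h.Stop() // walkers finish their current job, then Run returns
}()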

func (*Hunt) Stream

func (h *Hunt) Stream(ctx context.Context) (<-chan *foxhound.Item, error)

Stream starts the hunt in a background goroutine and returns a channel that receives each item as it is produced. The channel is closed when the hunt completes or the context is cancelled, making it safe to use in a range loop:

items, err := hunt.Stream(ctx)
if err != nil {
    log.Fatal(err)
}
for item := range items { ... }

Stream returns an error only when the hunt configuration is invalid. The channel is buffered (100 items) so that slow consumers do not block walkers; items are dropped with a warning log when the buffer is full.

func (*Hunt) StreamWithStats

func (h *Hunt) StreamWithStats(ctx context.Context, statsInterval time.Duration) (<-chan StreamEvent, error)

StreamWithStats starts the hunt and returns a channel of StreamEvent values. Item events arrive as items are scraped; Stats events are emitted every statsInterval. The channel is closed when the hunt finishes.

A statsInterval of 0 disables periodic stats events (only item events are sent). Use Stream instead when stats events are not needed.
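
A sketch of consuming the event channel, logging a stats summary every 10 seconds; handleItem stands in for user code:

events, err := h.StreamWithStats(ctx, 10*time.Second)
if err != nil {
    log.Fatal(err)
}
for ev := range events {
    switch {
    case ev.Item != nil:
        handleItem(ev.Item)
    case ev.Stats != nil:
        log.Println(ev.Stats.Summary())
    }
}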

func (*Hunt) WithAdaptive

func (h *Hunt) WithAdaptive(savePath string) *Hunt

WithAdaptive enables adaptive selector mode for the Hunt. The savePath argument is the file where learned element signatures are persisted as JSON across runs; pass an empty string for in-memory only (signatures are lost between runs).

Once enabled, the walker attaches a shared *parse.AdaptiveExtractor to every Response, so user code can call resp.Adaptive("name"), resp.CSSAdaptive(selector, name), or resp.CSSAdaptiveAll(selector, name) without manually constructing an extractor.

WithAdaptive returns the Hunt for fluent chaining and is safe to call before Run.

func (*Hunt) WithBlockedDomains

func (h *Hunt) WithBlockedDomains(domains ...string) *Hunt

WithBlockedDomains registers fully-qualified domain names whose requests must be aborted by the browser fetcher. Subdomains are also matched, so "example.com" blocks "tracker.example.com". Only effective when the Hunt's Fetcher is a *fetch.CamoufoxFetcher created and managed by the application; otherwise the call is ignored at Run time and a warning is logged.

func (*Hunt) WithDevelopmentMode

func (h *Hunt) WithDevelopmentMode(cacheDir string) *Hunt

WithDevelopmentMode enables on-disk response replay. The first time a job for a given URL is fetched, the real fetcher is invoked and the response is serialised to cacheDir/<sha256(url)>.json. Subsequent fetches for the same URL hit the cache and skip the network entirely, which is the standard fast inner-loop pattern when iterating on a parser.

Pass an empty cacheDir to disable. Errors creating the cache directory are returned at Run time, not here.

func (*Hunt) WithDisableResources

func (h *Hunt) WithDisableResources(types ...string) *Hunt

WithDisableResources registers browser resource types to abort. Valid values: "image", "font", "media", "stylesheet", "object", "imageset", "texttrack", "websocket", "csp_report", "beacon". Unknown values are dropped at apply time. Only effective when the Hunt's Fetcher is a *fetch.CamoufoxFetcher; otherwise the call is ignored at Run time and a warning is logged.
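
The With* options return the Hunt, so they compose fluently before Run. A sketch; the paths and domains are illustrative:

h := engine.NewHunt(cfg).
    WithAdaptive("signatures.json").
    WithDevelopmentMode(".cache/responses").
    WithBlockedDomains("doubleclick.net", "google-analytics.com").
    WithDisableResources("image", "font", "media")
if err := h.Run(ctx); err != nil {
    log.Fatal(err)
}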

type HuntConfig

type HuntConfig struct {
	// Name is a human-readable label used in logs and metrics.
	Name string
	// Domain is the primary target domain (used for metrics grouping).
	Domain string
	// Walkers is the number of concurrent virtual-user goroutines.
	Walkers int
	// MaxConcurrency is the global cap on simultaneous in-flight requests.
	// When 0, defaults to Walkers count.
	MaxConcurrency int
	// Seeds are the initial jobs pushed to the queue before walkers start.
	Seeds []*foxhound.Job
	// Processor is the user-supplied response handler.
	Processor foxhound.Processor
	// Fetcher is the base fetcher before middleware wrapping.
	Fetcher foxhound.Fetcher
	// Queue is the job storage backend.
	Queue foxhound.Queue
	// Pipelines are applied to each extracted Item in order.
	Pipelines []foxhound.Pipeline
	// Writers receive items that survive the pipeline chain.
	Writers []foxhound.Writer
	// Middlewares are wrapped around the Fetcher (first middleware is outermost).
	Middlewares []foxhound.Middleware
	// BehaviorProfile selects the human-simulation preset applied by each
	// Walker: "careful", "moderate", or "aggressive". Defaults to "moderate"
	// when empty so walkers always apply timing and rhythm delays.
	BehaviorProfile string
	// Checkpoint controls automatic state saving. Optional — checkpointing is
	// inactive when Checkpoint.Enabled is false (the zero value).
	Checkpoint CheckpointConfig
	// ItemCallback is invoked for every item that survives the pipeline chain,
	// before it is written. This enables streaming item processing during the
	// crawl without needing to use Stream(). The callback runs synchronously
	// in the walker goroutine so it must be fast.
	ItemCallback func(ctx context.Context, item *foxhound.Item)
	// OnStart is called once when the hunt begins (after seeds are queued).
	OnStart func(ctx context.Context)
	// OnClose is called once when the hunt completes (after writers flush).
	OnClose func(ctx context.Context, stats *Stats)
	// OnError is called when a fetch or process error occurs. Errors are
	// still logged; this hook enables custom error handling.
	OnError func(ctx context.Context, job *foxhound.Job, err error)
	// OnItem is called for each item after pipeline processing. Unlike
	// ItemCallback, OnItem receives the originating Job for context.
	OnItem func(ctx context.Context, job *foxhound.Job, item *foxhound.Item)
	// PageActions are JavaScript snippets executed after page load when using
	// the browser fetcher. They are injected as JobSteps of type
	// JobStepEvaluate on every job that uses browser mode.
	PageActions []string
	// Pool is an optional URL pool from a collect phase. When set, all URLs
	// in the pool are drained and added as seed jobs before walkers start.
	// This enables the two-phase pattern: collect URLs first, process concurrently.
	Pool Pool
	// PoolFetchMode sets the FetchMode for jobs created from pool URLs.
	// Defaults to FetchBrowser when PoolFetchModeSet is false.
	PoolFetchMode foxhound.FetchMode
	// PoolFetchModeSet indicates the user explicitly set PoolFetchMode.
	// When false and pool URLs exist, the mode defaults to FetchBrowser.
	PoolFetchModeSet bool
}

HuntConfig holds all dependencies and settings for a single scraping campaign.
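
A minimal configuration sketch. The fetcher, queue, and processor values are illustrative stand-ins for real implementations, and the seed literal assumes foxhound.Job exposes a URL field:

cfg := engine.HuntConfig{
    Name:      "products",
    Domain:    "example.com",
    Walkers:   4,
    Seeds:     []*foxhound.Job{{URL: "https://example.com/catalog"}},
    Fetcher:   fetcher,   // any foxhound.Fetcher
    Queue:     queue,     // any foxhound.Queue
    Processor: processor, // user-supplied response handler
}
h := engine.NewHunt(cfg)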

type HuntMetrics

type HuntMetrics struct {
	RequestsCount       int64
	FailedRequestsCount int64
	OffsiteRequests     int64
	BlockedRequests     int64
	ItemsScraped        int64
	ItemsDropped        int64
	ResponseBytes       int64
	StartTime           time.Time
	EndTime             time.Time
	RequestDelay        time.Duration
	ParallelRequests    int

	StatusCounts map[int]int64
	DomainBytes  map[string]int64
	LogCounts    map[string]int64
	// contains filtered or unexported fields
}

HuntMetrics holds extended statistics for a hunt beyond what Stats tracks. It adds offsite/blocked counters, status code breakdown, and per-domain byte tracking.

func NewHuntMetrics

func NewHuntMetrics() *HuntMetrics

NewHuntMetrics creates a HuntMetrics initialised with the current time.

func (*HuntMetrics) ElapsedSeconds

func (hm *HuntMetrics) ElapsedSeconds() float64

ElapsedSeconds returns the duration of the hunt in seconds.

func (*HuntMetrics) IncrementResponseBytes

func (hm *HuntMetrics) IncrementResponseBytes(domain string, count int64)

IncrementResponseBytes adds count bytes to the running total for domain.

func (*HuntMetrics) IncrementStatus

func (hm *HuntMetrics) IncrementStatus(status int)

IncrementStatus records an HTTP status code occurrence.

func (*HuntMetrics) RequestsPerSecond

func (hm *HuntMetrics) RequestsPerSecond() float64

RequestsPerSecond returns the average requests per second.

func (*HuntMetrics) ToMap

func (hm *HuntMetrics) ToMap() map[string]any

ToMap returns the metrics as a map for structured logging or JSON export.

type HuntResult

type HuntResult struct {
	// Metrics holds the hunt metrics.
	Metrics *HuntMetrics
	// Items holds all scraped items.
	Items *ItemList
	// Paused is true if the hunt was paused (not completed).
	Paused bool
}

HuntResult is the complete result from a hunt execution.

func (*HuntResult) Completed

func (hr *HuntResult) Completed() bool

Completed returns true if the hunt finished normally (was not paused).

func (*HuntResult) Len

func (hr *HuntResult) Len() int

Len returns the number of scraped items.

type HuntState

type HuntState int

HuntState represents the lifecycle state of a Hunt.

const (
	// HuntIdle means the hunt has not started yet.
	HuntIdle HuntState = iota
	// HuntRunning means walkers are actively processing jobs.
	HuntRunning
	// HuntPaused means the hunt is temporarily suspended.
	HuntPaused
	// HuntDone means all jobs have been processed successfully.
	HuntDone
	// HuntFailed means the hunt terminated with an unrecoverable error.
	HuntFailed
)

func (HuntState) String

func (s HuntState) String() string

String returns a human-readable state name.

type ItemList

type ItemList struct {
	// contains filtered or unexported fields
}

ItemList is a thread-safe collection of scraped items with batch export methods (JSON, JSONL, CSV). Use it with Hunt.ItemCallback or Hunt.StreamWithStats to accumulate items during a hunt.
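
A sketch of accumulating items via HuntConfig.ItemCallback and exporting them after Run; the output path is illustrative:

items := engine.NewItemList()
cfg.ItemCallback = func(ctx context.Context, item *foxhound.Item) {
    items.Append(item) // safe to call from concurrent walkers
}
// ... after h.Run(ctx) returns:
if err := items.ToJSONL("items.jsonl"); err != nil {
    log.Fatal(err)
}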

func NewItemList

func NewItemList() *ItemList

NewItemList creates an empty ItemList.

func (*ItemList) Append

func (il *ItemList) Append(item *foxhound.Item)

Append adds an item to the list.

func (*ItemList) Clear

func (il *ItemList) Clear()

Clear removes all items.

func (*ItemList) Items

func (il *ItemList) Items() []*foxhound.Item

Items returns a copy of the items slice.

func (*ItemList) Len

func (il *ItemList) Len() int

Len returns the number of items.

func (*ItemList) ToCSV

func (il *ItemList) ToCSV(path string, columns []string) error

ToCSV exports items as CSV with the given column order.

func (*ItemList) ToJSON

func (il *ItemList) ToJSON(path string, indent bool) error

ToJSON exports all items to a JSON file. When indent is true, the output is pretty-printed with 2-space indentation.

func (*ItemList) ToJSONL

func (il *ItemList) ToJSONL(path string) error

ToJSONL exports items as JSON Lines (one JSON object per line).

type MemoryPool

type MemoryPool struct {
	// contains filtered or unexported fields
}

MemoryPool is an in-memory Pool backed by a slice + dedup set.

func NewMemoryPool

func NewMemoryPool() *MemoryPool

NewMemoryPool creates an empty in-memory pool.

func (*MemoryPool) Add

func (p *MemoryPool) Add(_ context.Context, url string) error

func (*MemoryPool) AddBatch

func (p *MemoryPool) AddBatch(_ context.Context, urls []string) error

func (*MemoryPool) Close

func (p *MemoryPool) Close() error

func (*MemoryPool) Drain

func (p *MemoryPool) Drain(_ context.Context) ([]string, error)

func (*MemoryPool) Len

func (p *MemoryPool) Len() int

type Pool

type Pool interface {
	// Add stores a URL. Duplicates are silently ignored.
	Add(ctx context.Context, url string) error
	// AddBatch stores multiple URLs. Duplicates are silently ignored.
	AddBatch(ctx context.Context, urls []string) error
	// Drain returns all stored URLs and empties the pool.
	Drain(ctx context.Context) ([]string, error)
	// Len returns the number of URLs in the pool.
	Len() int
	// Close releases resources.
	Close() error
}

Pool stores discovered URLs between collect and process phases. Implementations must be safe for concurrent use.
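
A sketch of the two-phase pattern with an in-memory pool; the URLs are illustrative:

pool := engine.NewMemoryPool()
if err := pool.AddBatch(ctx, []string{
    "https://example.com/item/1",
    "https://example.com/item/2",
}); err != nil {
    log.Fatal(err)
}
cfg.Pool = pool // drained into seed jobs before walkers start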

type PostgresPool

type PostgresPool struct {
	// contains filtered or unexported fields
}

func NewPostgresPool

func NewPostgresPool(dsn, table string) (*PostgresPool, error)

func (*PostgresPool) Add

func (p *PostgresPool) Add(_ context.Context, url string) error

func (*PostgresPool) AddBatch

func (p *PostgresPool) AddBatch(_ context.Context, urls []string) error

func (*PostgresPool) Close

func (p *PostgresPool) Close() error

func (*PostgresPool) Drain

func (p *PostgresPool) Drain(_ context.Context) ([]string, error)

func (*PostgresPool) Len

func (p *PostgresPool) Len() int

type RetryPolicy

type RetryPolicy struct {
	// MaxRetries is the maximum number of retry attempts (not counting the
	// original attempt). A value of 3 means up to 4 total attempts.
	MaxRetries int

	// BaseDelay is the initial delay before the first retry.
	BaseDelay time.Duration

	// MaxDelay caps the computed delay regardless of the attempt number.
	MaxDelay time.Duration

	// Backoff is the exponential multiplier applied to each successive delay.
	// A value of 2.0 doubles the delay each attempt.
	Backoff float64
}

RetryPolicy controls when and how often failed requests are retried.

func DefaultRetryPolicy

func DefaultRetryPolicy() *RetryPolicy

DefaultRetryPolicy returns a sensible retry policy suitable for most scraping workloads: 3 retries, starting at 1 second, doubling up to 30 seconds.

func (*RetryPolicy) Delay

func (rp *RetryPolicy) Delay(attempt int) time.Duration

Delay returns how long to wait before the given retry attempt. It uses exponential backoff with full-jitter so that concurrent walkers do not stampede the same target simultaneously.
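
A sketch of inspecting the delay schedule; because of the full-jitter, the printed values vary from run to run:

rp := engine.DefaultRetryPolicy() // 3 retries, 1s base, 2.0 backoff, 30s cap
for attempt := 0; attempt < rp.MaxRetries; attempt++ {
    fmt.Printf("retry %d: wait %v\n", attempt, rp.Delay(attempt))
}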

func (*RetryPolicy) ShouldRetry

func (rp *RetryPolicy) ShouldRetry(attempt int, err error, resp *foxhound.Response) bool

ShouldRetry reports whether the request should be retried. attempt is the zero-based number of retries already performed. err is the fetch error (may be nil). resp is the response (may be nil on network errors).

type SQLitePool

type SQLitePool struct {
	// contains filtered or unexported fields
}

func NewSQLitePool

func NewSQLitePool(dbPath string) (*SQLitePool, error)

func (*SQLitePool) Add

func (p *SQLitePool) Add(_ context.Context, url string) error

func (*SQLitePool) AddBatch

func (p *SQLitePool) AddBatch(_ context.Context, urls []string) error

func (*SQLitePool) Close

func (p *SQLitePool) Close() error

func (*SQLitePool) Drain

func (p *SQLitePool) Drain(_ context.Context) ([]string, error)

func (*SQLitePool) Len

func (p *SQLitePool) Len() int

type Scheduler

type Scheduler struct {
	// contains filtered or unexported fields
}

Scheduler manages a fixed-size pool of worker goroutines that consume jobs from a Queue and invoke a user-supplied handler for each one. It is designed for use as an internal component of Hunt but can also be used standalone.

func NewScheduler

func NewScheduler(queue foxhound.Queue, maxWorkers int) *Scheduler

NewScheduler creates a Scheduler backed by queue with a pool of at most maxWorkers concurrent workers.

func (*Scheduler) Start

func (s *Scheduler) Start(ctx context.Context, handler func(context.Context, *foxhound.Job) error) error

Start launches maxWorkers goroutines that each loop over queue.Pop → handler. It blocks until Stop is called or ctx is cancelled, and returns only after all in-flight handlers have returned.

Calling Start again after it returns is not supported.
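
A standalone sketch, assuming queue is any foxhound.Queue, seeds is a slice of jobs, and Job exposes a URL field:

s := engine.NewScheduler(queue, 8)
if err := s.Submit(ctx, seeds...); err != nil {
    log.Fatal(err)
}
if err := s.Start(ctx, func(ctx context.Context, job *foxhound.Job) error {
    log.Println("handling", job.URL)
    return nil
}); err != nil {
    log.Fatal(err)
}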

func (*Scheduler) Stop

func (s *Scheduler) Stop()

Stop signals all workers to stop after finishing their current job. It is safe to call more than once.

func (*Scheduler) Submit

func (s *Scheduler) Submit(ctx context.Context, jobs ...*foxhound.Job) error

Submit pushes one or more jobs directly to the underlying queue. It is safe to call from any goroutine.

func (*Scheduler) Wait

func (s *Scheduler) Wait()

Wait blocks until Start has returned and all worker goroutines have exited. It is safe to call concurrently with Stop.

type SessionConfig

type SessionConfig struct {
	// Name is the unique identifier for this session within a Hunt.
	// Must match Job.SessionID for routing to work.
	Name string
	// Fetcher is the underlying fetcher for the session. Required.
	Fetcher foxhound.Fetcher
	// Identity is the optional identity profile attached to the session.
	// Stored as `any` to avoid an import cycle with the identity package.
	Identity any
	// Proxy is the optional proxy URL recorded with the session for
	// inspection. Wire it through the fetcher's own option at construction.
	Proxy string
}

SessionConfig describes a named session bundle that a Hunt can route jobs to via Job.SessionID. Each session has its own fetcher, identity, and proxy URL — useful when one campaign needs to mix fast-static fetches for index pages with slow stealth fetches for detail pages, with separate cookie jars per role.
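
A sketch of registering a stealth session alongside the default fetcher; browserFetcher stands in for a real fetcher such as a *fetch.CamoufoxFetcher:

h := engine.NewHunt(cfg).AddSession("detail", engine.SessionConfig{
    Name:    "detail",
    Fetcher: browserFetcher,
})
// Jobs whose SessionID is "detail" use browserFetcher; all other
// jobs use the hunt's default fetcher.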

type Stats

type Stats struct {
	StartedAt      time.Time
	RequestCount   atomic.Int64
	SuccessCount   atomic.Int64
	ErrorCount     atomic.Int64
	BlockedCount   atomic.Int64
	ItemCount      atomic.Int64
	EscalatedCount atomic.Int64
	BytesReceived  atomic.Int64
	// contains filtered or unexported fields
}

Stats holds runtime metrics for a Hunt. All top-level counters use atomic.Int64 so callers can read them without holding any lock. Per-domain stats use a sync.Map for a lock-free read path.

func NewStats

func NewStats() *Stats

NewStats creates a Stats instance ready for use.

func (*Stats) DomainStatsFor

func (s *Stats) DomainStatsFor(domain string) *DomainStats

DomainStatsFor returns the DomainStats for the given domain, or nil if no requests have been recorded for it.

func (*Stats) RecordBlock

func (s *Stats) RecordBlock(domain string)

RecordBlock increments the request and blocked counters without double-counting the request as a success. Used when a block is detected outside the normal fetch path (e.g. CAPTCHA detection in the walker).

func (*Stats) RecordBytes

func (s *Stats) RecordBytes(n int64)

RecordBytes adds n to the total bytes-received counter.

func (*Stats) RecordEscalation

func (s *Stats) RecordEscalation()

RecordEscalation increments the count of requests that were escalated from the static fetcher to the browser fetcher.

func (*Stats) RecordItems

func (s *Stats) RecordItems(count int)

RecordItems increments the scraped item counter by count.

func (*Stats) RecordProcessDuration

func (s *Stats) RecordProcessDuration(domain string, duration time.Duration)

RecordProcessDuration records the end-to-end processing time for a job (fetch + process + pipeline + write) for the given domain.

func (*Stats) RecordRequest

func (s *Stats) RecordRequest(domain string, duration time.Duration, err error, blocked bool)

RecordRequest records a completed fetch attempt for the given domain. Pass err=nil and blocked=false for a clean success.

func (*Stats) Summary

func (s *Stats) Summary() string

Summary returns a human-readable snapshot of current statistics.

func (*Stats) ToMap

func (s *Stats) ToMap() map[string]any

ToMap returns a structured snapshot of current statistics suitable for JSON serialisation or structured logging.

type Step

type Step struct {
	// Action is the kind of step.
	Action StepAction
	// URL is the target for StepNavigate.
	URL string
	// Selector is the CSS selector for StepClick, StepWait, and StepExtract.
	// For InfiniteScroll, Selector is the scrollable container (empty = whole page).
	Selector string
	// Duration is the timeout/wait for StepWait.
	Duration time.Duration
	// Process is the extraction logic for StepExtract.
	Process foxhound.Processor
	// MaxScrolls limits InfiniteScroll iterations.
	MaxScrolls int
	// MaxClicks limits LoadMore button clicks.
	MaxClicks int
	// MaxPages limits Paginate page follows.
	MaxPages int
	// Script is the JavaScript code for StepEvaluate.
	Script string
	// StopSelector is a CSS selector; InfiniteScroll stops when
	// document.querySelectorAll(StopSelector).length >= StopCount.
	StopSelector string
	// StopCount is the target element count for StopSelector.
	StopCount int
	// ScrollWait is the duration to wait after each scroll iteration before
	// checking for new content. Defaults to 2s when zero.
	ScrollWait time.Duration
	// Optional marks this step as non-fatal: if it fails, execution continues.
	Optional bool
	// Value is the text to type into an input field for StepFill.
	Value string
}

Step is a single action within a Trail.

type StepAction

type StepAction int

StepAction identifies what a Trail Step should do.

const (
	// StepNavigate navigates to a URL (creates a Job).
	StepNavigate StepAction = iota
	// StepClick clicks a CSS selector (browser-mode only).
	StepClick
	// StepWait waits for a CSS selector to appear or a fixed duration.
	StepWait
	// StepExtract runs a Processor against the current page.
	StepExtract
	// StepScroll scrolls the page (browser-mode only).
	StepScroll
	// StepInfiniteScroll scrolls to bottom repeatedly until no new content
	// loads (for lazy-load / infinite scroll pages like Google Maps).
	StepInfiniteScroll
	// StepLoadMore clicks a "load more" button repeatedly until it
	// disappears or max clicks reached.
	StepLoadMore
	// StepPaginate detects pagination links ("Next", page numbers) and
	// follows them, collecting content from each page.
	StepPaginate
	// StepEvaluate executes custom JavaScript on the page.
	StepEvaluate
	// StepFill types text into an input field with human-like keystrokes.
	StepFill
	// StepCollect extracts URLs from matching elements into a Pool.
	StepCollect
)

type StreamEvent

type StreamEvent struct {
	// Item is non-nil for item events.
	Item *foxhound.Item
	// Stats is non-nil for periodic stats snapshot events.
	Stats *Stats
}

StreamEvent is emitted on the channel returned by StreamWithStats. Exactly one of Item or Stats is non-nil per event.

type Trail

type Trail struct {
	// Name is a human-readable label for this navigation path.
	Name string
	// Steps is the ordered sequence of actions.
	Steps []Step
	// contains filtered or unexported fields
}

Trail is a reusable navigation blueprint composed of ordered Steps. It is built via a fluent builder API and converted to Jobs when submitted to a Hunt.
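
A sketch of a browser-mode trail; the URL and selectors are illustrative:

trail := engine.NewTrail("catalog").
    Navigate("https://example.com/catalog").
    ClickOptional("#cookie-accept").
    Wait(".product-card", 10*time.Second).
    InfiniteScroll(20)
cfg.Seeds = append(cfg.Seeds, trail.ToJobs()...)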

func Login

func Login(name, loginURL, userSelector, passSelector, submitSelector, username, password string) *Trail

Login builds a login trail that navigates to the login page, fills credentials, and submits the form. The returned trail can be further chained with additional steps (e.g. WaitOptional for a post-login element).
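
A sketch; the URL, selectors, and credentials are illustrative:

trail := engine.Login("auth", "https://example.com/login",
    "#username", "#password", "button[type=submit]",
    user, pass).
    WaitOptional(".dashboard", 5*time.Second)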

func NewTrail

func NewTrail(name string) *Trail

NewTrail creates a new empty Trail with the given name.

func (*Trail) Adaptive

func (t *Trail) Adaptive(name, selector string) *Trail

Adaptive registers an adaptive selector that survives DOM rewrites by falling back to similarity matching when the primary CSS selector fails to match on a future run. The element signature is learned automatically on the first successful extraction and persisted via the Hunt's adaptive store (configured by Hunt.WithAdaptive).

Adaptive only records the registration intent on the Trail; the actual Register and signature learning happens when the produced Job is fetched by a walker, so the Hunt must have been configured with WithAdaptive before Run is called.

func (*Trail) CaptureXHR

func (t *Trail) CaptureXHR(urlPattern string) *Trail

CaptureXHR registers a URL regexp pattern. While any job produced by this Trail is fetched, the browser fetcher captures every XHR or fetch response whose URL matches the pattern, storing the request URL, status, headers, and body in Response.CapturedXHR. Use this to discover the JSON API behind a JavaScript-rendered page without parsing the DOM.

Multiple calls accumulate; all patterns are matched (logical OR). Patterns must be valid Go regexps; invalid patterns are silently dropped at fetch time.
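
A sketch of discovering the search API behind a rendered page; the URL and pattern are illustrative:

trail := engine.NewTrail("api-discovery").
    Navigate("https://example.com/app").
    CaptureXHR(`/api/.*/search`).
    Wait(".results", 10*time.Second)
// After the fetch, matching responses are available in Response.CapturedXHR.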

func (*Trail) Click

func (t *Trail) Click(selector string) *Trail

Click appends a StepClick step that clicks the element matching selector. This step is only meaningful when using the browser fetcher.

func (*Trail) ClickOptional

func (t *Trail) ClickOptional(selector string) *Trail

ClickOptional appends a StepClick step that does NOT abort the fetch on failure. Useful for dismissing elements that may or may not be present.

func (*Trail) Collect

func (t *Trail) Collect(selector, attr string) *Trail

Collect appends a step that extracts URLs from all elements matching selector, reading the given attribute (typically "href"). The collected URLs are stored in Response.StepResults as []string.

This step is implemented as a JS Evaluate that runs querySelectorAll(selector) and returns the attribute values.

func (*Trail) Evaluate

func (t *Trail) Evaluate(script string) *Trail

Evaluate appends a step that executes custom JavaScript on the page. The return value of the script is available in Response.StepResults.

func (*Trail) Extract

func (t *Trail) Extract(processor foxhound.Processor) *Trail

Extract appends a StepExtract step that runs processor against the current page response.

func (*Trail) Fill

func (t *Trail) Fill(selector, value string) *Trail

Fill appends a StepFill step that types value into the input matching selector with human-like keystrokes (using behavior.Keyboard).

func (*Trail) InfiniteScroll

func (t *Trail) InfiniteScroll(maxScrolls int) *Trail

InfiniteScroll appends a step that scrolls to the bottom repeatedly until no new content loads (for lazy-load / infinite scroll pages). maxScrolls limits iterations (0 = default 50). Scrolls the whole page.

func (*Trail) InfiniteScrollIn

func (t *Trail) InfiniteScrollIn(container string, maxScrolls int) *Trail

InfiniteScrollIn appends an InfiniteScroll step that scrolls inside a specific container element (e.g. Google Maps results panel, Facebook feed). container is a CSS selector for the scrollable element.

func (*Trail) InfiniteScrollInUntil

func (t *Trail) InfiniteScrollInUntil(container, stopSelector string, stopCount, maxScrolls int) *Trail

InfiniteScrollInUntil combines container scrolling with a stop condition.

func (*Trail) InfiniteScrollUntil

func (t *Trail) InfiniteScrollUntil(stopSelector string, stopCount int, maxScrolls int) *Trail

InfiniteScrollUntil appends an InfiniteScroll step that stops when stopSelector matches at least stopCount elements. This scrolls until the target is reached rather than until content stops loading.

func (*Trail) InfiniteScrollWithWait

func (t *Trail) InfiniteScrollWithWait(maxScrolls int, scrollWait time.Duration) *Trail

InfiniteScrollWithWait appends an InfiniteScroll with custom post-scroll wait.

func (*Trail) LoadMore

func (t *Trail) LoadMore(selector string, maxClicks int) *Trail

LoadMore appends a step that clicks the element matching selector repeatedly until it disappears or maxClicks is reached (0 = default 20).

func (*Trail) Navigate

func (t *Trail) Navigate(url string) *Trail

Navigate appends a StepNavigate step that fetches url.

func (*Trail) NoWarmup

func (t *Trail) NoWarmup() *Trail

NoWarmup disables the automatic homepage warm-up visit that ToJobs() prepends by default. Use this when speed is more important than stealth, or when the trail already starts at the homepage.

func (*Trail) Paginate

func (t *Trail) Paginate(selector string, maxPages int) *Trail

Paginate appends a step that detects pagination links matching selector (e.g. "a.next", "li.next a") and follows them, collecting content from each page. maxPages limits how many pages to follow (0 = default 10).

func (*Trail) Scroll

func (t *Trail) Scroll() *Trail

Scroll appends a StepScroll step that scrolls the page. This step is only meaningful when using the browser fetcher.

func (*Trail) ToJobs

func (t *Trail) ToJobs() []*foxhound.Job

ToJobs converts the Trail into foxhound.Jobs. Each StepNavigate starts a new Job; subsequent browser steps (Click, Wait, Scroll) are attached as JobSteps on that Job and set FetchMode to FetchBrowser.

Extract steps are NOT converted to JobSteps because their Processor (an interface) cannot survive JSON serialisation through queue backends. Extraction is handled by the hunt-level Processor after the fetch completes.

Steps that appear before the first Navigate are silently skipped.

By default, when the trail has browser steps and the first Navigate URL is not the site homepage, ToJobs prepends a warm-up Job that visits the homepage first to seed cookies and build a natural referrer chain. Call NoWarmup() to disable this behaviour.

func (*Trail) Wait

func (t *Trail) Wait(selector string, timeout time.Duration) *Trail

Wait appends a StepWait step that blocks until selector appears or timeout elapses.

func (*Trail) WaitOptional

func (t *Trail) WaitOptional(selector string, timeout time.Duration) *Trail

WaitOptional appends a StepWait step that does NOT abort the fetch on failure. Useful for waiting on elements that may not appear on every page.

type Walker

type Walker struct {
	// contains filtered or unexported fields
}

Walker is a virtual user that pops jobs from the queue, fetches them, processes the responses, runs items through the pipeline chain, and writes results — looping until the context is cancelled or the queue is drained.

func (*Walker) Run

func (w *Walker) Run(ctx context.Context) error

Run is the main loop. It exits when ctx is cancelled. The caller's WaitGroup must be decremented after Run returns.
