Documentation ¶
Overview ¶
Package engine implements the core orchestration layer for the Foxhound scraping framework: Hunt (campaign coordinator), Walker (virtual user), Trail (navigation path), Scheduler (goroutine pool), RetryPolicy, and Stats.
Index ¶
- type Checkpoint
- type CheckpointConfig
- type DomainStats
- type Hunt
- func (h *Hunt) AdaptiveExtractor() *parse.AdaptiveExtractor
- func (h *Hunt) AddSession(name string, cfg SessionConfig) *Hunt
- func (h *Hunt) Pause()
- func (h *Hunt) Resume()
- func (h *Hunt) Run(ctx context.Context) error
- func (h *Hunt) SaveCheckpoint(path string) error
- func (h *Hunt) Session(name string) *foxhound.Session
- func (h *Hunt) SetLogger(logger *slog.Logger)
- func (h *Hunt) State() HuntState
- func (h *Hunt) Stats() *Stats
- func (h *Hunt) Stop()
- func (h *Hunt) Stream(ctx context.Context) (<-chan *foxhound.Item, error)
- func (h *Hunt) StreamWithStats(ctx context.Context, statsInterval time.Duration) (<-chan StreamEvent, error)
- func (h *Hunt) WithAdaptive(savePath string) *Hunt
- func (h *Hunt) WithBlockedDomains(domains ...string) *Hunt
- func (h *Hunt) WithDevelopmentMode(cacheDir string) *Hunt
- func (h *Hunt) WithDisableResources(types ...string) *Hunt
- type HuntConfig
- type HuntMetrics
- type HuntResult
- type HuntState
- type ItemList
- func (il *ItemList) Append(item *foxhound.Item)
- func (il *ItemList) Clear()
- func (il *ItemList) Items() []*foxhound.Item
- func (il *ItemList) Len() int
- func (il *ItemList) ToCSV(path string, columns []string) error
- func (il *ItemList) ToJSON(path string, indent bool) error
- func (il *ItemList) ToJSONL(path string) error
- type MemoryPool
- type Pool
- type PostgresPool
- type RetryPolicy
- type SQLitePool
- type Scheduler
- type SessionConfig
- type Stats
- func (s *Stats) DomainStatsFor(domain string) *DomainStats
- func (s *Stats) RecordBlock(domain string)
- func (s *Stats) RecordBytes(n int64)
- func (s *Stats) RecordEscalation()
- func (s *Stats) RecordItems(count int)
- func (s *Stats) RecordProcessDuration(domain string, duration time.Duration)
- func (s *Stats) RecordRequest(domain string, duration time.Duration, err error, blocked bool)
- func (s *Stats) Summary() string
- func (s *Stats) ToMap() map[string]any
- type Step
- type StepAction
- type StreamEvent
- type Trail
- func (t *Trail) Adaptive(name, selector string) *Trail
- func (t *Trail) CaptureXHR(urlPattern string) *Trail
- func (t *Trail) Click(selector string) *Trail
- func (t *Trail) ClickOptional(selector string) *Trail
- func (t *Trail) Collect(selector, attr string) *Trail
- func (t *Trail) Evaluate(script string) *Trail
- func (t *Trail) Extract(processor foxhound.Processor) *Trail
- func (t *Trail) Fill(selector, value string) *Trail
- func (t *Trail) InfiniteScroll(maxScrolls int) *Trail
- func (t *Trail) InfiniteScrollIn(container string, maxScrolls int) *Trail
- func (t *Trail) InfiniteScrollInUntil(container, stopSelector string, stopCount, maxScrolls int) *Trail
- func (t *Trail) InfiniteScrollUntil(stopSelector string, stopCount int, maxScrolls int) *Trail
- func (t *Trail) InfiniteScrollWithWait(maxScrolls int, scrollWait time.Duration) *Trail
- func (t *Trail) LoadMore(selector string, maxClicks int) *Trail
- func (t *Trail) Navigate(url string) *Trail
- func (t *Trail) NoWarmup() *Trail
- func (t *Trail) Paginate(selector string, maxPages int) *Trail
- func (t *Trail) Scroll() *Trail
- func (t *Trail) ToJobs() []*foxhound.Job
- func (t *Trail) Wait(selector string, timeout time.Duration) *Trail
- func (t *Trail) WaitOptional(selector string, timeout time.Duration) *Trail
- type Walker
Constants ¶
This section is empty.
Variables ¶
This section is empty.
Functions ¶
This section is empty.
Types ¶
type Checkpoint ¶
type Checkpoint struct {
HuntID string `json:"hunt_id"`
HuntName string `json:"hunt_name"`
Domain string `json:"domain"`
ItemsProcessed int64 `json:"items_processed"`
RequestsDone int64 `json:"requests_done"`
ErrorCount int64 `json:"errors"`
LastURL string `json:"last_url"`
Timestamp time.Time `json:"timestamp"`
QueueLen int `json:"queue_len"`
ElapsedMs int64 `json:"elapsed_ms"`
}
Checkpoint captures the observable state of a Hunt at a point in time. All fields are exported so they round-trip cleanly through JSON.
func LoadCheckpoint ¶
func LoadCheckpoint(path string) (*Checkpoint, error)
LoadCheckpoint reads a Checkpoint from the JSON file at path.
type CheckpointConfig ¶
type CheckpointConfig struct {
// Enabled turns auto-checkpointing on. When false, no file is written.
Enabled bool
// Path is the file path where the checkpoint JSON is written.
Path string
// Interval is how many items must be processed between saves.
// A value of 0 is treated as 100 (the default) to avoid division by zero.
Interval int
}
CheckpointConfig controls automatic checkpoint saving.
type DomainStats ¶
type DomainStats struct {
Requests atomic.Int64
Errors atomic.Int64
Blocked atomic.Int64
// contains filtered or unexported fields
}
DomainStats holds per-domain request counters and latency tracking. All fields use atomic operations so callers can read/write without locks.
func (*DomainStats) AvgLatency ¶
func (ds *DomainStats) AvgLatency() time.Duration
AvgLatency returns the average fetch latency for this domain.
func (*DomainStats) AvgProcessLatency ¶
func (ds *DomainStats) AvgProcessLatency() time.Duration
AvgProcessLatency returns the average end-to-end processing latency.
type Hunt ¶
type Hunt struct {
// contains filtered or unexported fields
}
Hunt is the top-level campaign coordinator. It owns the lifecycle of all walkers, applies middleware to the fetcher, seeds the queue, drains results, and emits stats.
Typical usage:
h := engine.NewHunt(cfg)
if err := h.Run(ctx); err != nil {
	log.Fatal(err)
}
func NewHunt ¶
func NewHunt(cfg HuntConfig) *Hunt
NewHunt creates a Hunt from cfg. It does not start any goroutines; call Run to begin processing.
func (*Hunt) AdaptiveExtractor ¶
func (h *Hunt) AdaptiveExtractor() *parse.AdaptiveExtractor
AdaptiveExtractor returns the Hunt-scoped adaptive extractor configured via WithAdaptive, or nil when adaptive mode has not been enabled.
func (*Hunt) AddSession ¶
func (h *Hunt) AddSession(name string, cfg SessionConfig) *Hunt
AddSession registers a named session bundle. When a Job's SessionID equals name, the walker uses cfg.Fetcher instead of the hunt's default fetcher. Subsequent calls with the same name overwrite the previous registration.
AddSession is safe to call before Run; calling it after Run has started is also supported, but a walker that is mid-job will not see the new registration until its next Pop.
func (*Hunt) Pause ¶
func (h *Hunt) Pause()
Pause signals all walkers to suspend work. It transitions the state to HuntPaused. Resuming is done via Resume.
func (*Hunt) Resume ¶
func (h *Hunt) Resume()
Resume transitions the hunt back to HuntRunning after a Pause.
func (*Hunt) Run ¶
func (h *Hunt) Run(ctx context.Context) error
Run executes the campaign and blocks until all jobs are processed or ctx is cancelled. It returns nil on clean completion and a non-nil error on failure.
Run lifecycle:
- Build the middleware-wrapped fetcher.
- Push seed jobs to the queue.
- Launch N walker goroutines.
- Wait until the queue is empty AND all walkers are idle.
- Flush all writers.
- Transition state to HuntDone.
func (*Hunt) SaveCheckpoint ¶
func (h *Hunt) SaveCheckpoint(path string) error
SaveCheckpoint writes the current hunt state to a JSON file at path. The file is written atomically (temp file + rename) so a partial write cannot corrupt a previously valid checkpoint.
func (*Hunt) Session ¶
func (h *Hunt) Session(name string) *foxhound.Session
Session returns the named session previously registered via AddSession, or nil when no session with that name exists.
func (*Hunt) SetLogger ¶
func (h *Hunt) SetLogger(logger *slog.Logger)
SetLogger replaces the Hunt's logger. Walkers created after this call will inherit the new logger. Intended for testing.
func (*Hunt) Stop ¶
func (h *Hunt) Stop()
Stop cancels the hunt context, causing all walkers to exit after finishing their current job. Run will return shortly after Stop is called.
func (*Hunt) Stream ¶
func (h *Hunt) Stream(ctx context.Context) (<-chan *foxhound.Item, error)
Stream starts the hunt in a background goroutine and returns a channel that receives each item as it is produced. The channel is closed when the hunt completes or the context is cancelled, making it safe to use in a range loop:
items, err := hunt.Stream(ctx)
if err != nil {
	return err
}
for item := range items { ... }
Stream returns an error only when the hunt configuration is invalid. The channel is buffered (100 items) so that slow consumers do not block walkers; items are dropped with a warning log when the buffer is full.
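The drop-when-full behaviour can be illustrated with a non-blocking send. This is a minimal sketch of the standard select-with-default idiom, not the engine's actual code; the trySend name and the tiny buffer size are illustrative.

```go
package main

import "fmt"

// trySend delivers item without blocking: when the buffer is full, the
// item is dropped and false is returned (the engine logs a warning in
// that case rather than stalling the walker).
func trySend(ch chan int, item int) bool {
	select {
	case ch <- item:
		return true
	default:
		return false // buffer full: drop instead of blocking
	}
}

func main() {
	ch := make(chan int, 2) // the engine uses a 100-item buffer
	sent, dropped := 0, 0
	for i := 0; i < 5; i++ {
		if trySend(ch, i) {
			sent++
		} else {
			dropped++
		}
	}
	fmt.Println(sent, dropped) // 2 3: only the buffer capacity fits
}
```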
func (*Hunt) StreamWithStats ¶
func (h *Hunt) StreamWithStats(ctx context.Context, statsInterval time.Duration) (<-chan StreamEvent, error)
StreamWithStats starts the hunt and returns a channel of StreamEvent values. Item events arrive as items are scraped; Stats events are emitted every statsInterval. The channel is closed when the hunt finishes.
A statsInterval of 0 disables periodic stats events (only item events are sent). Use Stream instead when stats events are not needed.
func (*Hunt) WithAdaptive ¶
func (h *Hunt) WithAdaptive(savePath string) *Hunt
WithAdaptive enables adaptive selector mode for the Hunt. The savePath argument is the file where learned element signatures are persisted as JSON across runs; pass an empty string for in-memory only (signatures are lost between runs).
Once enabled, the walker attaches a shared *parse.AdaptiveExtractor to every Response, so user code can call resp.Adaptive("name"), resp.CSSAdaptive(selector, name), or resp.CSSAdaptiveAll(selector, name) without manually constructing an extractor.
WithAdaptive returns the Hunt for fluent chaining and is safe to call before Run.
func (*Hunt) WithBlockedDomains ¶
func (h *Hunt) WithBlockedDomains(domains ...string) *Hunt
WithBlockedDomains registers fully-qualified domain names whose requests the browser fetcher aborts. Subdomains are also matched, so "example.com" blocks "tracker.example.com". Only effective when the Hunt's Fetcher is a *fetch.CamoufoxFetcher created and managed by the application; otherwise the call is ignored at Run time with a warning log.
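The subdomain matching rule can be expressed as a suffix check. A minimal sketch, assuming the matching is host-based; the blockedHost name is illustrative, not the package's internals.

```go
package main

import (
	"fmt"
	"strings"
)

// blockedHost reports whether host equals domain or is a subdomain of it,
// mirroring the documented rule: "example.com" also blocks
// "tracker.example.com", but not "notexample.com".
func blockedHost(host, domain string) bool {
	return host == domain || strings.HasSuffix(host, "."+domain)
}

func main() {
	fmt.Println(blockedHost("tracker.example.com", "example.com")) // true
	fmt.Println(blockedHost("example.com", "example.com"))         // true
	fmt.Println(blockedHost("notexample.com", "example.com"))      // false
}
```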
func (*Hunt) WithDevelopmentMode ¶
func (h *Hunt) WithDevelopmentMode(cacheDir string) *Hunt
WithDevelopmentMode enables on-disk response replay. The first time a job for a given URL is fetched the real fetcher is invoked and the response is serialised to dir/<sha256(url)>.json. Subsequent fetches for the same URL hit the cache and skip the network entirely, which is the standard fast inner-loop pattern when iterating on a parser.
Pass an empty cacheDir to disable. Errors creating the cache directory are returned at Run time, not here.
func (*Hunt) WithDisableResources ¶
func (h *Hunt) WithDisableResources(types ...string) *Hunt
WithDisableResources registers browser resource types to abort. Valid values: "image", "font", "media", "stylesheet", "object", "imageset", "texttrack", "websocket", "csp_report", "beacon". Unknown values are dropped at apply time. Only effective when the Hunt's Fetcher is a *fetch.CamoufoxFetcher; otherwise the call is ignored at Run time with a warning log.
type HuntConfig ¶
type HuntConfig struct {
// Name is a human-readable label used in logs and metrics.
Name string
// Domain is the primary target domain (used for metrics grouping).
Domain string
// Walkers is the number of concurrent virtual-user goroutines.
Walkers int
// MaxConcurrency is the global cap on simultaneous in-flight requests.
// When 0, defaults to Walkers count.
MaxConcurrency int
// Seeds are the initial jobs pushed to the queue before walkers start.
Seeds []*foxhound.Job
// Processor is the user-supplied response handler.
Processor foxhound.Processor
// Fetcher is the base fetcher before middleware wrapping.
Fetcher foxhound.Fetcher
// Queue is the job storage backend.
Queue foxhound.Queue
// Pipelines are applied to each extracted Item in order.
Pipelines []foxhound.Pipeline
// Writers receive items that survive the pipeline chain.
Writers []foxhound.Writer
// Middlewares are wrapped around the Fetcher (first middleware is outermost).
Middlewares []foxhound.Middleware
// BehaviorProfile selects the human-simulation preset applied by each
// Walker: "careful", "moderate", or "aggressive". Defaults to "moderate"
// when empty so walkers always apply timing and rhythm delays.
BehaviorProfile string
// Checkpoint controls automatic state saving. Optional — checkpointing is
// inactive when Checkpoint.Enabled is false (the zero value).
Checkpoint CheckpointConfig
// ItemCallback is invoked for every item that survives the pipeline chain,
// before it is written. This enables streaming item processing during the
// crawl without needing to use Stream(). The callback runs synchronously
// in the walker goroutine so it must be fast.
ItemCallback func(ctx context.Context, item *foxhound.Item)
// OnStart is called once when the hunt begins (after seeds are queued).
OnStart func(ctx context.Context)
// OnClose is called once when the hunt completes (after writers flush).
OnClose func(ctx context.Context, stats *Stats)
// OnError is called when a fetch or process error occurs. Errors are
// still logged; this hook enables custom error handling.
OnError func(ctx context.Context, job *foxhound.Job, err error)
// OnItem is called for each item after pipeline processing. Unlike
// ItemCallback, OnItem receives the originating Job for context.
OnItem func(ctx context.Context, job *foxhound.Job, item *foxhound.Item)
// PageActions are JavaScript snippets executed after page load when using
// the browser fetcher. They are injected as JobSteps of type
// JobStepEvaluate on every job that uses browser mode.
PageActions []string
// Pool is an optional URL pool from a collect phase. When set, all URLs
// in the pool are drained and added as seed jobs before walkers start.
// This enables the two-phase pattern: collect URLs first, process concurrently.
Pool Pool
// PoolFetchMode sets the FetchMode for jobs created from pool URLs.
// Defaults to FetchBrowser when PoolFetchModeSet is false.
PoolFetchMode foxhound.FetchMode
// PoolFetchModeSet indicates the user explicitly set PoolFetchMode.
// When false and pool URLs exist, the mode defaults to FetchBrowser.
PoolFetchModeSet bool
}
HuntConfig holds all dependencies and settings for a single scraping campaign.
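For orientation, a minimal configuration might look like the following fragment. Illustrative only: the Job literal's URL field and the placeholder values (myProcessor, myFetcher) are assumptions not confirmed by this page; the HuntConfig field names come from the struct above.

```go
cfg := engine.HuntConfig{
	Name:       "catalog-crawl",
	Domain:     "example.com",
	Walkers:    4,
	Seeds:      []*foxhound.Job{{URL: "https://example.com/products"}}, // URL field assumed
	Processor:  myProcessor, // application-supplied foxhound.Processor
	Fetcher:    myFetcher,   // application-supplied foxhound.Fetcher
	Checkpoint: engine.CheckpointConfig{Enabled: true, Path: "hunt.json", Interval: 100},
}
h := engine.NewHunt(cfg)
```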
type HuntMetrics ¶
type HuntMetrics struct {
RequestsCount int64
FailedRequestsCount int64
OffsiteRequests int64
BlockedRequests int64
ItemsScraped int64
ItemsDropped int64
ResponseBytes int64
StartTime time.Time
EndTime time.Time
RequestDelay time.Duration
ParallelRequests int
StatusCounts map[int]int64
DomainBytes map[string]int64
LogCounts map[string]int64
// contains filtered or unexported fields
}
HuntMetrics holds extended statistics for a hunt beyond what Stats tracks. It adds offsite/blocked counters, status code breakdown, and per-domain byte tracking.
func NewHuntMetrics ¶
func NewHuntMetrics() *HuntMetrics
NewHuntMetrics creates a HuntMetrics initialised with the current time.
func (*HuntMetrics) ElapsedSeconds ¶
func (hm *HuntMetrics) ElapsedSeconds() float64
ElapsedSeconds returns the duration of the hunt in seconds.
func (*HuntMetrics) IncrementResponseBytes ¶
func (hm *HuntMetrics) IncrementResponseBytes(domain string, count int64)
IncrementResponseBytes adds byte count for a domain.
func (*HuntMetrics) IncrementStatus ¶
func (hm *HuntMetrics) IncrementStatus(status int)
IncrementStatus records an HTTP status code occurrence.
func (*HuntMetrics) RequestsPerSecond ¶
func (hm *HuntMetrics) RequestsPerSecond() float64
RequestsPerSecond returns the average requests per second.
func (*HuntMetrics) ToMap ¶
func (hm *HuntMetrics) ToMap() map[string]any
ToMap returns the metrics as a map for structured logging or JSON export.
type HuntResult ¶
type HuntResult struct {
// Metrics holds the hunt metrics.
Metrics *HuntMetrics
// Items holds all scraped items.
Items *ItemList
// Paused is true if the hunt was paused (not completed).
Paused bool
}
HuntResult is the complete result from a hunt execution.
func (*HuntResult) Completed ¶
func (hr *HuntResult) Completed() bool
Completed returns true if the hunt finished normally (was not paused).
type HuntState ¶
type HuntState int
HuntState represents the lifecycle state of a Hunt.
const (
	// HuntIdle means the hunt has not started yet.
	HuntIdle HuntState = iota
	// HuntRunning means walkers are actively processing jobs.
	HuntRunning
	// HuntPaused means the hunt is temporarily suspended.
	HuntPaused
	// HuntDone means all jobs have been processed successfully.
	HuntDone
	// HuntFailed means the hunt terminated with an unrecoverable error.
	HuntFailed
)
type ItemList ¶
type ItemList struct {
// contains filtered or unexported fields
}
ItemList is a thread-safe collection of scraped items with batch export methods (JSON, JSONL, CSV). Use it with Hunt.ItemCallback or Hunt.StreamWithStats to accumulate items during a hunt.
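The thread-safe accumulate-then-export pattern can be sketched with a mutex-guarded slice and a JSON-Lines writer. This is a simplified stand-in, not the package's implementation: the itemList type and the map-based Item are assumptions.

```go
package main

import (
	"bufio"
	"encoding/json"
	"fmt"
	"os"
	"strings"
	"sync"
)

// Item is a stand-in for *foxhound.Item.
type Item map[string]any

// itemList guards a slice with a mutex so walker goroutines can append
// concurrently, mirroring the documented thread-safety.
type itemList struct {
	mu    sync.Mutex
	items []Item
}

func (il *itemList) Append(item Item) {
	il.mu.Lock()
	defer il.mu.Unlock()
	il.items = append(il.items, item)
}

func (il *itemList) Len() int {
	il.mu.Lock()
	defer il.mu.Unlock()
	return len(il.items)
}

// ToJSONL writes one JSON object per line (the JSONL export format).
func (il *itemList) ToJSONL(path string) error {
	il.mu.Lock()
	defer il.mu.Unlock()
	f, err := os.Create(path)
	if err != nil {
		return err
	}
	defer f.Close()
	w := bufio.NewWriter(f)
	enc := json.NewEncoder(w) // Encode appends a newline after each object
	for _, it := range il.items {
		if err := enc.Encode(it); err != nil {
			return err
		}
	}
	return w.Flush()
}

func main() {
	il := &itemList{}
	var wg sync.WaitGroup
	for i := 0; i < 10; i++ {
		wg.Add(1)
		go func(n int) { defer wg.Done(); il.Append(Item{"n": n}) }(i)
	}
	wg.Wait()
	path := os.TempDir() + "/items.jsonl"
	defer os.Remove(path)
	if err := il.ToJSONL(path); err != nil {
		panic(err)
	}
	data, _ := os.ReadFile(path)
	lines := strings.Split(strings.TrimSpace(string(data)), "\n")
	fmt.Println(il.Len(), len(lines)) // 10 10
}
```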
type MemoryPool ¶
type MemoryPool struct {
// contains filtered or unexported fields
}
MemoryPool is an in-memory Pool backed by a slice + dedup set.
func NewMemoryPool ¶
func NewMemoryPool() *MemoryPool
NewMemoryPool creates an empty in-memory pool.
func (*MemoryPool) Close ¶
func (p *MemoryPool) Close() error
func (*MemoryPool) Len ¶
func (p *MemoryPool) Len() int
type Pool ¶
type Pool interface {
// Add stores a URL. Duplicates are silently ignored.
Add(ctx context.Context, url string) error
// AddBatch stores multiple URLs. Duplicates are silently ignored.
AddBatch(ctx context.Context, urls []string) error
// Drain returns all stored URLs and empties the pool.
Drain(ctx context.Context) ([]string, error)
// Len returns the number of URLs in the pool.
Len() int
// Close releases resources.
Close() error
}
Pool stores discovered URLs between collect and process phases. Implementations must be safe for concurrent use.
type PostgresPool ¶
type PostgresPool struct {
// contains filtered or unexported fields
}
func NewPostgresPool ¶
func NewPostgresPool(dsn, table string) (*PostgresPool, error)
func (*PostgresPool) AddBatch ¶
func (p *PostgresPool) AddBatch(_ context.Context, urls []string) error
func (*PostgresPool) Close ¶
func (p *PostgresPool) Close() error
func (*PostgresPool) Len ¶
func (p *PostgresPool) Len() int
type RetryPolicy ¶
type RetryPolicy struct {
// MaxRetries is the maximum number of retry attempts (not counting the
// original attempt). A value of 3 means up to 4 total attempts.
MaxRetries int
// BaseDelay is the initial delay before the first retry.
BaseDelay time.Duration
// MaxDelay caps the computed delay regardless of the attempt number.
MaxDelay time.Duration
// Backoff is the exponential multiplier applied to each successive delay.
// A value of 2.0 doubles the delay each attempt.
Backoff float64
}
RetryPolicy controls when and how often failed requests are retried.
func DefaultRetryPolicy ¶
func DefaultRetryPolicy() *RetryPolicy
DefaultRetryPolicy returns a sensible retry policy suitable for most scraping workloads: 3 retries, starting at 1 second, doubling up to 30 seconds.
func (*RetryPolicy) Delay ¶
func (rp *RetryPolicy) Delay(attempt int) time.Duration
Delay returns how long to wait before the given retry attempt. It uses exponential backoff with full-jitter so that concurrent walkers do not stampede the same target simultaneously.
func (*RetryPolicy) ShouldRetry ¶
ShouldRetry reports whether the request should be retried. attempt is the zero-based number of retries already performed. err is the fetch error (may be nil). resp is the response (may be nil on network errors).
type SQLitePool ¶
type SQLitePool struct {
// contains filtered or unexported fields
}
func NewSQLitePool ¶
func NewSQLitePool(dbPath string) (*SQLitePool, error)
func (*SQLitePool) Close ¶
func (p *SQLitePool) Close() error
func (*SQLitePool) Len ¶
func (p *SQLitePool) Len() int
type Scheduler ¶
type Scheduler struct {
// contains filtered or unexported fields
}
Scheduler manages a fixed-size pool of worker goroutines that consume jobs from a Queue and invoke a user-supplied handler for each one. It is designed for use as an internal component of Hunt but can also be used standalone.
func NewScheduler ¶
NewScheduler creates a Scheduler backed by queue with a pool of at most maxWorkers concurrent workers.
func (*Scheduler) Start ¶
func (s *Scheduler) Start(ctx context.Context, handler func(context.Context, *foxhound.Job) error) error
Start launches maxWorkers goroutines that each loop over queue.Pop → handler. It blocks until Stop is called or ctx is cancelled, and returns only after all in-flight handlers have returned.
Calling Start again after it returns is not supported.
func (*Scheduler) Stop ¶
func (s *Scheduler) Stop()
Stop signals all workers to stop after finishing their current job. It is safe to call more than once.
type SessionConfig ¶
type SessionConfig struct {
// Name is the unique identifier for this session within a Hunt.
// Must match Job.SessionID for routing to work.
Name string
// Fetcher is the underlying fetcher for the session. Required.
Fetcher foxhound.Fetcher
// Identity is the optional identity profile attached to the session.
// Stored as `any` to avoid an import cycle with the identity package.
Identity any
// Proxy is the optional proxy URL recorded with the session for
// inspection. Wire it through the fetcher's own option at construction.
Proxy string
}
SessionConfig describes a named session bundle that a Hunt can route jobs to via Job.SessionID. Each session has its own fetcher, identity, and proxy URL, which is useful when one campaign needs to mix fast static fetches for index pages with slow stealth fetches for detail pages, with a separate cookie jar per role.
type Stats ¶
type Stats struct {
StartedAt time.Time
RequestCount atomic.Int64
SuccessCount atomic.Int64
ErrorCount atomic.Int64
BlockedCount atomic.Int64
ItemCount atomic.Int64
EscalatedCount atomic.Int64
BytesReceived atomic.Int64
// contains filtered or unexported fields
}
Stats holds runtime metrics for a Hunt. All top-level counters use atomic.Int64 so callers can read them without holding any lock. Per-domain stats use sync.Map for lock-free read path.
func (*Stats) DomainStatsFor ¶
func (s *Stats) DomainStatsFor(domain string) *DomainStats
DomainStatsFor returns the DomainStats for the given domain, or nil if no requests have been recorded for it.
func (*Stats) RecordBlock ¶
func (s *Stats) RecordBlock(domain string)
RecordBlock increments the request and blocked counters without double-counting the request as a success. Used when a block is detected outside the normal fetch path (e.g. CAPTCHA detection in the walker).
func (*Stats) RecordBytes ¶
func (s *Stats) RecordBytes(n int64)
RecordBytes adds n to the total bytes-received counter.
func (*Stats) RecordEscalation ¶
func (s *Stats) RecordEscalation()
RecordEscalation increments the count of requests that were escalated from the static fetcher to the browser fetcher.
func (*Stats) RecordItems ¶
func (s *Stats) RecordItems(count int)
RecordItems increments the scraped item counter by count.
func (*Stats) RecordProcessDuration ¶
func (s *Stats) RecordProcessDuration(domain string, duration time.Duration)
RecordProcessDuration records the end-to-end processing time for a job (fetch + process + pipeline + write) for the given domain.
func (*Stats) RecordRequest ¶
func (s *Stats) RecordRequest(domain string, duration time.Duration, err error, blocked bool)
RecordRequest records a completed fetch attempt for the given domain. Pass err=nil and blocked=false for a clean success.
type Step ¶
type Step struct {
// Action is the kind of step.
Action StepAction
// URL is the target for StepNavigate.
URL string
// Selector is the CSS selector for StepClick, StepWait, and StepExtract.
// For InfiniteScroll, Selector is the scrollable container (empty = whole page).
Selector string
// Duration is the timeout/wait for StepWait.
Duration time.Duration
// Process is the extraction logic for StepExtract.
Process foxhound.Processor
// MaxScrolls limits InfiniteScroll iterations.
MaxScrolls int
// MaxClicks limits LoadMore button clicks.
MaxClicks int
// MaxPages limits Paginate page follows.
MaxPages int
// Script is the JavaScript code for StepEvaluate.
Script string
// StopSelector is a CSS selector; InfiniteScroll stops when
// document.querySelectorAll(StopSelector).length >= StopCount.
StopSelector string
// StopCount is the target element count for StopSelector.
StopCount int
// ScrollWait is the duration to wait after each scroll iteration before
// checking for new content. Defaults to 2s when zero.
ScrollWait time.Duration
// Optional marks this step as non-fatal: if it fails, execution continues.
Optional bool
// Value is the text to type into an input field for StepFill.
Value string
}
Step is a single action within a Trail.
type StepAction ¶
type StepAction int
StepAction identifies what a Trail Step should do.
const (
	StepNavigate StepAction = iota
	// StepClick clicks a CSS selector (browser-mode only).
	StepClick
	// StepWait waits for a CSS selector to appear or a fixed duration.
	StepWait
	// StepExtract runs a Processor against the current page.
	StepExtract
	// StepScroll scrolls the page (browser-mode only).
	StepScroll
	// StepInfiniteScroll scrolls to bottom repeatedly until no new content
	// loads (for lazy-load / infinite scroll pages like Google Maps).
	StepInfiniteScroll
	// StepLoadMore clicks a "load more" button repeatedly until it
	// disappears or max clicks reached.
	StepLoadMore
	// StepPaginate detects pagination links ("Next", page numbers) and
	// follows them, collecting content from each page.
	StepPaginate
	// StepEvaluate executes custom JavaScript on the page.
	StepEvaluate
	// StepFill types text into an input field with human-like keystrokes.
	StepFill
	// StepCollect extracts URLs from matching elements into a Pool.
	StepCollect
)
type StreamEvent ¶
type StreamEvent struct {
// Item is non-nil for item events.
Item *foxhound.Item
// Stats is non-nil for periodic stats snapshot events.
Stats *Stats
}
StreamEvent is emitted on the channel returned by StreamWithStats. Exactly one of Item or Stats is non-nil per event.
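A consumer distinguishes the two event kinds by checking which field is non-nil. The sketch below uses local stand-in Item and Stats types so it is self-contained; only the "exactly one field is non-nil" contract comes from this page.

```go
package main

import "fmt"

// Local stand-ins for the real foxhound.Item and engine.Stats types.
type Item struct{ URL string }
type Stats struct{ Items int64 }

// StreamEvent carries exactly one non-nil field per event.
type StreamEvent struct {
	Item  *Item
	Stats *Stats
}

func main() {
	events := make(chan StreamEvent, 4)
	events <- StreamEvent{Item: &Item{URL: "https://example.com/1"}}
	events <- StreamEvent{Stats: &Stats{Items: 1}}
	events <- StreamEvent{Item: &Item{URL: "https://example.com/2"}}
	close(events)

	var items, snapshots int
	for ev := range events { // the closed channel ends the loop cleanly
		switch {
		case ev.Item != nil:
			items++
		case ev.Stats != nil:
			snapshots++
		}
	}
	fmt.Println(items, snapshots) // 2 1
}
```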
type Trail ¶
type Trail struct {
// Name is a human-readable label for this navigation path.
Name string
// Steps is the ordered sequence of actions.
Steps []Step
// contains filtered or unexported fields
}
Trail is a reusable navigation blueprint composed of ordered Steps. It is built via a fluent builder API and converted to Jobs when submitted to a Hunt.
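The fluent-builder pattern Trail uses can be shown with a toy version: every step method appends to Steps and returns the same pointer, so calls chain in execution order. The lowercase trail/step types below are a self-contained sketch, not the package's code.

```go
package main

import "fmt"

type step struct{ action, arg string }

// trail mimics the builder: each method records a step and returns the
// receiver so calls chain left to right.
type trail struct {
	Name  string
	Steps []step
}

func newTrail(name string) *trail { return &trail{Name: name} }

func (t *trail) Navigate(url string) *trail {
	t.Steps = append(t.Steps, step{"navigate", url})
	return t
}

func (t *trail) Click(selector string) *trail {
	t.Steps = append(t.Steps, step{"click", selector})
	return t
}

func (t *trail) Wait(selector string) *trail {
	t.Steps = append(t.Steps, step{"wait", selector})
	return t
}

func main() {
	t := newTrail("listing").
		Navigate("https://example.com/products").
		Click("#load-more").
		Wait(".product-card")
	for _, s := range t.Steps {
		fmt.Println(s.action, s.arg)
	}
}
```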
func Login ¶
func Login(name, loginURL, userSelector, passSelector, submitSelector, username, password string) *Trail
Login builds a login trail that navigates to the login page, fills credentials, and submits the form. The returned trail can be further chained with additional steps (e.g. WaitOptional for a post-login element).
func (*Trail) Adaptive ¶
func (t *Trail) Adaptive(name, selector string) *Trail
Adaptive registers an adaptive selector that survives DOM rewrites by falling back to similarity matching when the primary CSS selector fails to match on a future run. The element signature is learned automatically on the first successful extraction and persisted via the Hunt's adaptive store (configured by Hunt.WithAdaptive).
Adaptive only records the registration intent on the Trail; the actual Register and signature learning happens when the produced Job is fetched by a walker, so the Hunt must have been configured with WithAdaptive before Run is called.
func (*Trail) CaptureXHR ¶
func (t *Trail) CaptureXHR(urlPattern string) *Trail
CaptureXHR registers a URL regexp pattern. While any job produced by this Trail is fetched, the browser fetcher captures every XHR or fetch response whose URL matches the pattern, storing the request URL, status, headers, and body in Response.CapturedXHR. Use this to discover the JSON API behind a JavaScript-rendered page without parsing the DOM.
Multiple calls accumulate; all patterns are matched (logical OR). Patterns must be valid Go regexps; invalid patterns are silently dropped at fetch time.
func (*Trail) Click ¶
func (t *Trail) Click(selector string) *Trail
Click appends a StepClick step that clicks the element matching selector. This step is only meaningful when using the browser fetcher.
func (*Trail) ClickOptional ¶
func (t *Trail) ClickOptional(selector string) *Trail
ClickOptional appends a StepClick step that does NOT abort the fetch on failure. Useful for dismissing elements that may or may not be present.
func (*Trail) Collect ¶
func (t *Trail) Collect(selector, attr string) *Trail
Collect appends a step that extracts URLs from all elements matching selector, reading the given attribute (typically "href"). The collected URLs are stored in Response.StepResults as []string.
This step is implemented as a JS Evaluate that runs querySelectorAll(selector) and returns the attribute values.
func (*Trail) Evaluate ¶
func (t *Trail) Evaluate(script string) *Trail
Evaluate appends a step that executes custom JavaScript on the page. The return value of the script is available in Response.StepResults.
func (*Trail) Extract ¶
func (t *Trail) Extract(processor foxhound.Processor) *Trail
Extract appends a StepExtract step that runs processor against the current page response.
func (*Trail) Fill ¶
func (t *Trail) Fill(selector, value string) *Trail
Fill appends a StepFill step that types value into the input matching selector with human-like keystrokes (using behavior.Keyboard).
func (*Trail) InfiniteScroll ¶
func (t *Trail) InfiniteScroll(maxScrolls int) *Trail
InfiniteScroll appends a step that scrolls to the bottom repeatedly until no new content loads (for lazy-load / infinite scroll pages). maxScrolls limits iterations (0 = default 50). Scrolls the whole page.
func (*Trail) InfiniteScrollIn ¶
func (t *Trail) InfiniteScrollIn(container string, maxScrolls int) *Trail
InfiniteScrollIn appends an InfiniteScroll step that scrolls inside a specific container element (e.g. Google Maps results panel, Facebook feed). container is a CSS selector for the scrollable element.
func (*Trail) InfiniteScrollInUntil ¶
func (t *Trail) InfiniteScrollInUntil(container, stopSelector string, stopCount, maxScrolls int) *Trail
InfiniteScrollInUntil combines container scrolling with a stop condition.
func (*Trail) InfiniteScrollUntil ¶
func (t *Trail) InfiniteScrollUntil(stopSelector string, stopCount int, maxScrolls int) *Trail
InfiniteScrollUntil appends an InfiniteScroll step that stops when stopSelector matches at least stopCount elements. This scrolls until the target is reached rather than until content stops loading.
func (*Trail) InfiniteScrollWithWait ¶
func (t *Trail) InfiniteScrollWithWait(maxScrolls int, scrollWait time.Duration) *Trail
InfiniteScrollWithWait appends an InfiniteScroll with custom post-scroll wait.
func (*Trail) LoadMore ¶
func (t *Trail) LoadMore(selector string, maxClicks int) *Trail
LoadMore appends a step that clicks the element matching selector repeatedly until it disappears or maxClicks is reached (0 = default 20).
func (*Trail) NoWarmup ¶
func (t *Trail) NoWarmup() *Trail
NoWarmup disables the automatic homepage warm-up visit that ToJobs() prepends by default. Use this when speed is more important than stealth, or when the trail already starts at the homepage.
func (*Trail) Paginate ¶
func (t *Trail) Paginate(selector string, maxPages int) *Trail
Paginate appends a step that detects pagination links matching selector (e.g. "a.next", "li.next a") and follows them, collecting content from each page. maxPages limits how many pages to follow (0 = default 10).
func (*Trail) Scroll ¶
func (t *Trail) Scroll() *Trail
Scroll appends a StepScroll step that scrolls the page. This step is only meaningful when using the browser fetcher.
func (*Trail) ToJobs ¶
func (t *Trail) ToJobs() []*foxhound.Job
ToJobs converts the Trail into foxhound.Jobs. Each StepNavigate starts a new Job; subsequent browser steps (Click, Wait, Scroll) are attached as JobSteps on that Job and set FetchMode to FetchBrowser.
Extract steps are NOT converted to JobSteps because their Processor (an interface) cannot survive JSON serialization through queue backends. Extraction is handled by the hunt-level Processor after the fetch completes.
Steps that appear before the first Navigate are silently skipped.
By default, when the trail has browser steps and the first Navigate URL is not the site homepage, ToJobs prepends a warm-up Job that visits the homepage first to seed cookies and build a natural referrer chain. Call NoWarmup() to disable this behaviour.