Documentation ¶
Overview ¶
Package foxhound is a high-performance Go web scraping framework with native anti-detection built on Camoufox, a Firefox fork designed to evade bot fingerprinting.
Foxhound is a scraping framework for Go — it handles the full lifecycle of web data extraction: fetching pages (with or without a real browser), navigating JavaScript-heavy sites, solving CAPTCHAs, rotating identities and proxies, extracting structured data, and exporting results. Think of it as Scrapy for Go, but with first-class browser automation and anti-detection built in from day one.
Why Foxhound ¶
Modern websites deploy increasingly sophisticated bot detection: TLS fingerprinting, JavaScript challenges (Cloudflare, DataDome, PerimeterX), canvas/WebGL fingerprint checks, and behavioral analysis. Traditional HTTP-only scrapers fail silently against these defenses. Headless Chrome is widely fingerprinted. Foxhound solves this by combining two fetching strategies behind a single API:
- A TLS-impersonating HTTP client for static pages (~5-50ms per request)
- A Camoufox browser (Firefox fork) via playwright-go for JS-heavy and protected pages (~500ms-5s per request)
The smart router starts with the fast static client and automatically escalates to the full browser when it detects blocks (403, 429, 503, CAPTCHA pages). This means you get HTTP-client speed on easy targets and browser-level evasion on hard ones, without changing your code.
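The escalation decision can be pictured as a simple signal check. The sketch below is illustrative only (the function name, marker list, and thresholds are assumptions, not Foxhound's actual API), and the real router additionally folds in Bayesian per-domain risk scoring:

```go
package main

import "strings"

// blockedStatus lists the response codes the text names as block signals.
var blockedStatus = map[int]bool{403: true, 429: true, 503: true}

// captchaMarkers are typical strings found on challenge pages (illustrative).
var captchaMarkers = []string{"cf-challenge", "g-recaptcha", "h-captcha"}

// shouldEscalate reports whether a static-client response looks blocked,
// meaning the job should be retried with the full Camoufox browser.
func shouldEscalate(status int, body string) bool {
	if blockedStatus[status] {
		return true
	}
	lower := strings.ToLower(body)
	for _, m := range captchaMarkers {
		if strings.Contains(lower, m) {
			return true
		}
	}
	return false
}
```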
Architecture Overview ¶
Foxhound is organized around five core concepts:
Hunt is the top-level campaign orchestrator. It owns the queue, spawns Walker goroutines, collects stats, and coordinates shutdown. You configure a Hunt with seed URLs, a Processor (your extraction logic), middleware, pipelines, and writers.
Trail is a fluent navigation path builder. It chains browser actions — Navigate, Click, Fill, Wait, Scroll, InfiniteScroll, Evaluate (custom JS), and CaptureXHR — into a reusable sequence that gets compiled into Jobs. Trails describe what a human would do on the page.
Walker is a goroutine that acts as a virtual user. Each Walker pops Jobs from the queue, fetches pages through the middleware chain, runs your Processor, writes extracted Items through the pipeline, and enqueues discovered follow-up Jobs. A Hunt runs multiple Walkers concurrently.
Job is the unit of work: a URL plus fetch mode, priority, browser steps, metadata, and optional session routing. Jobs flow through the queue and are consumed by Walkers.
Session is a stateful client that wraps a fetcher, cookie jar, identity profile, and proxy into a reusable unit. Use it standalone for ad-hoc scraping, or register multiple Sessions with a Hunt via Hunt.AddSession to route different Jobs through different identities and proxies.
Dual-Mode Fetching ¶
Every request flows through a middleware chain before reaching the fetcher:
Job → middleware (rate limit → dedup → autothrottle → cookies → referer → retry)
→ Smart Fetcher (static or browser) → Browser Steps → Parser → Processor
→ Result{Items, Jobs} → Pipeline (validate → clean → dedup → transform)
→ Writers (CSV/JSON/SQLite/Webhook) + Queue (new jobs)
The static fetcher (fetch.NewStealth) uses Go's HTTP client with precise header ordering and TLS fingerprints matched to the identity profile. The browser fetcher (fetch.NewCamoufox) launches a real Camoufox browser instance via the Juggler protocol (Firefox's native remote protocol, less targeted by anti-bot than CDP). The smart fetcher (fetch.NewSmart) wraps both and auto-escalates based on response signals and Bayesian domain risk scoring.
Identity System ¶
Every request uses a complete, internally consistent identity profile: user agent, TLS fingerprint, header order, OS, hardware specs, screen dimensions, locale, timezone, and geolocation all match. Randomness without consistency is the number one cause of bot detection — a Windows UA with a macOS font list, or a US locale with a Tokyo timezone, triggers instant blocks.
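To make the consistency requirement concrete, here is a standalone sketch of the kind of cross-field check the text describes. The `profile` type and `inconsistencies` function are illustrative stand-ins, not the identity package's API:

```go
package main

import "strings"

// profile is a pared-down stand-in for an identity profile; the real
// identity bundle carries many more correlated fields (TLS, fonts, hardware).
type profile struct {
	OS        string // "windows", "macos", "linux"
	UserAgent string
	Locale    string // e.g. "en-US"
	Timezone  string // e.g. "America/New_York"
}

// inconsistencies flags the mismatches the text warns about: a UA claiming
// a different OS, or a locale/timezone pairing no real user would have.
func inconsistencies(p profile) []string {
	var out []string
	if p.OS == "windows" && !strings.Contains(p.UserAgent, "Windows") {
		out = append(out, "windows profile with non-Windows UA")
	}
	if strings.HasSuffix(p.Locale, "-US") && !strings.HasPrefix(p.Timezone, "America/") {
		out = append(out, "US locale with non-American timezone")
	}
	return out
}
```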
Foxhound ships 60 embedded device profiles. The identity package generates profiles with functional options:
id := identity.Generate(
    identity.WithBrowser(identity.BrowserFirefox),
    identity.WithOS(identity.OSWindows),
)
When using Camoufox, the identity is serialized to a JSON config that sets navigator properties, WebGL vendor/renderer, canvas noise, OS-specific fonts, screen dimensions, and timezone at the C++ level inside the browser — not via JavaScript injection that anti-bot scripts can detect.
Human Behavior Simulation ¶
Foxhound models human behavior using statistical distributions observed from real user sessions:
- Timing uses Weibull and Gamma distributions (right-skewed, matching human reaction times), not uniform random
- Mouse movements follow Bezier curves with natural acceleration/deceleration
- Scroll patterns simulate reading speed with variable pause durations
- Keyboard input uses per-key timing with realistic inter-keystroke intervals
- Session fatigue: warmup slowdown at start, cruise speed mid-session, gradual fatigue buildup — with per-call noise to prevent smooth-curve detection
- Per-session jitter: all behavior parameters are perturbed ±15% to prevent anti-bot ML from clustering sessions into discrete archetypes
Three built-in profiles ("careful", "moderate", "aggressive") control the overall pacing. Configure via BehaviorConfig or Hunt options.
NopeCHA CAPTCHA Solving ¶
The NopeCHA browser extension is automatically downloaded from GitHub and loaded into Camoufox on first launch. It solves reCAPTCHA, hCAPTCHA, and Cloudflare Turnstile challenges without API keys. The extension is cached at ~/.cache/foxhound/extensions/nopecha/ and updated automatically.
The design philosophy: the goal is to never trigger a CAPTCHA. If one appears, earlier layers (identity, timing, proxy rotation) failed. NopeCHA is the safety net, not the primary strategy.
Disable with extension_path: "none" in config or WithExtensionPath("none").
Middleware Chain ¶
Foxhound provides 13 middleware layers that wrap the fetcher:
- Rate limiting (token bucket per domain)
- Request deduplication (URL + method fingerprint)
- Autothrottle (adaptive delay based on response times)
- Cookie persistence (file-backed or in-memory jar)
- Referer chain (natural browsing simulation)
- Blocked response detection (403/429/503/CAPTCHA triggers)
- Redirect following (with loop detection)
- Depth limiting (max crawl depth from seed)
- Retry with exponential backoff
- Delta-fetch (skip unchanged pages via ETag/Last-Modified)
- Circuit breaker (3-state FSM: closed → open → half-open)
- Metrics collection (Prometheus counters)
- Robots.txt compliance
Middleware is composable: each layer wraps a Fetcher and returns a Fetcher, so you can stack them in any order or add custom middleware.
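The wrap pattern is easiest to see in miniature. This self-contained sketch pares the types down to strings (the real interfaces take a *Job and return a *Response) so the composition mechanics stand out:

```go
package main

// fetcher mirrors the shape of the Fetcher interface with simplified types.
type fetcher interface {
	Fetch(url string) (string, error)
}

// fetcherFunc adapts an ordinary function to the interface, like FetcherFunc.
type fetcherFunc func(url string) (string, error)

func (f fetcherFunc) Fetch(url string) (string, error) { return f(url) }

// withLogging wraps a fetcher, recording each URL before delegating.
// Because it takes a fetcher and returns a fetcher, it stacks freely
// with any other middleware in any order.
func withLogging(next fetcher, log *[]string) fetcher {
	return fetcherFunc(func(url string) (string, error) {
		*log = append(*log, url)
		return next.Fetch(url)
	})
}
```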
Adaptive Selectors ¶
Websites frequently change their DOM structure — class names rotate, IDs are randomized, layouts shift. Foxhound's adaptive selector system survives these rewrites by building element signatures (tag, position, text patterns, ancestor structure) alongside CSS selectors. When a selector stops matching, the system falls back to similarity matching against saved signatures.
Enable with Hunt.WithAdaptive and use via Response.Adaptive, Response.CSSAdaptive, Response.CSSAdaptiveAll, or Trail.Adaptive. Signatures can be stored in JSON files or SQLite.
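The fallback idea, matching saved element signatures when a selector goes stale, can be sketched in a few lines. The `signature` type and scoring here are illustrative (the real system also records position and ancestor structure):

```go
package main

// signature is a pared-down element signature.
type signature struct {
	Tag  string
	Text string // representative text sample
}

// matchScore returns a crude similarity in [0, 1]: half weight for the
// tag, half for the text sample.
func matchScore(saved, candidate signature) float64 {
	score := 0.0
	if saved.Tag == candidate.Tag {
		score += 0.5
	}
	if saved.Text != "" && saved.Text == candidate.Text {
		score += 0.5
	}
	return score
}

// bestMatch picks the candidate most similar to the saved signature,
// used when the stored CSS selector no longer matches anything.
func bestMatch(saved signature, candidates []signature) (signature, float64) {
	var best signature
	bestScore := -1.0
	for _, c := range candidates {
		if s := matchScore(saved, c); s > bestScore {
			best, bestScore = c, s
		}
	}
	return best, bestScore
}
```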
Example: Hunt Campaign ¶
A Hunt is the standard way to scrape at scale. Define a Processor, configure middleware and writers, add seed URLs, and run:
hunt := engine.NewHunt("bookstore",
    engine.WithDomain("books.toscrape.com"),
    engine.WithWalkers(4),
    engine.WithProcessor(foxhound.ProcessorFunc(func(ctx context.Context, resp *foxhound.Response) (*foxhound.Result, error) {
        result := &foxhound.Result{}
        titles := resp.CSS("h3 a").Texts()
        prices := resp.CSS(".price_color").Texts()
        for i, title := range titles {
            item := foxhound.NewItem()
            item.Set("title", title)
            if i < len(prices) {
                item.Set("price", prices[i])
            }
            result.Items = append(result.Items, item)
        }
        // Follow pagination links
        result.Jobs = resp.Follow("li.next a[href]")
        return result, nil
    })),
)
hunt.AddSeed("https://books.toscrape.com/")
huntResult, err := hunt.Run(ctx)
Example: Trail Navigation ¶
Trails describe multi-step browser interactions for JS-heavy pages. This example searches Google Maps and scrolls through results:
trail := engine.NewTrail("maps-search").
    Navigate("https://www.google.com/maps").
    Fill("input#searchboxinput", "cafe in canggu").
    Click("button#searchbox-searchbutton").
    WaitOptional("div[role='feed']", 10*time.Second).
    InfiniteScrollInUntil("div[role='feed']", "div.Nv2PK", 20, 100).
    Evaluate("() => document.querySelectorAll('.Nv2PK').length")
jobs := trail.ToJobs()
Example: Session (Ad-Hoc Scraping) ¶
Session is the lightweight alternative to Hunt for quick, stateful fetches. Cookies persist across calls, and the identity stays consistent:
sess := foxhound.NewSession(
    foxhound.WithSessionFetcher(fetch.NewStealth()),
    foxhound.WithSessionIdentity(identity.Generate()),
    foxhound.WithSessionProxy("http://user:pass@proxy.example:8080"),
)
defer sess.Close()
resp, err := sess.Get(ctx, "https://example.com/login")
// cookies from login response are automatically persisted
resp2, err := sess.Get(ctx, "https://example.com/dashboard")
Example: CSS and XPath Selectors ¶
Response provides built-in CSS and XPath querying without importing the parse package:
// Single element
title := resp.CSS("h1.title").Text()
price := resp.CSS("span.price").Text()
image := resp.CSS("img.product").Attr("src")
// Multiple elements
allTitles := resp.CSS("h3 a").Texts()
allLinks := resp.CSS("a.product[href]").Attrs("href")
count := resp.CSS("div.result").Len()
// XPath (subset converted to CSS internally)
author := resp.XPath("//span[@class='author']")
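The "subset converted to CSS internally" note means only simple XPath forms are supported. As an illustration of that kind of conversion (this regex and function are an assumption, not Foxhound's actual converter), the example above maps to `span.author`:

```go
package main

import "regexp"

// classPattern matches the narrow //tag[@class='x'] XPath form.
var classPattern = regexp.MustCompile(`^//(\w+)\[@class='([^']+)'\]$`)

// xpathToCSS converts a supported XPath expression to a CSS selector,
// reporting false for forms outside the subset.
func xpathToCSS(expr string) (string, bool) {
	if m := classPattern.FindStringSubmatch(expr); m != nil {
		return m[1] + "." + m[2], true
	}
	return "", false
}
```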
Example: Follow Links ¶
Response.Follow extracts links from the page and generates follow-up Jobs:
// Follow all product links, route to a different handler
jobs := resp.Follow("a.product-link[href]",
    foxhound.WithFollowCallback("parseProduct"),
    foxhound.WithFollowReferer(true),
)
// Follow a single known URL
nextPage := resp.FollowURL("/api/products?page=2")
// Follow all anchor links on the page
allJobs := resp.FollowAll()
Example: XHR/Fetch Capture ¶
Capture background API calls that JavaScript makes after page load. This is essential for SPAs where data loads via XHR/fetch, not in the initial HTML:
trail := engine.NewTrail("api-capture").
    Navigate("https://example.com/app").
    CaptureXHR("*/api/v2/products*").
    Click("button.load-data").
    Wait("div.results", 5*time.Second)
The captured responses are available in Response.CapturedXHR as a slice of maps with keys: request_url, request_method, status, headers, body.
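Since the captured entries are plain maps with those keys, consuming them is ordinary map access. This sketch filters successful captures and collects their bodies; the sample data stands in for a real Response:

```go
package main

// successfulBodies pulls the body of every captured call with a 2xx
// status, using the documented keys: request_url, request_method,
// status, headers, body.
func successfulBodies(captured []map[string]any) []string {
	var bodies []string
	for _, c := range captured {
		status, _ := c["status"].(int)
		body, _ := c["body"].(string)
		if status >= 200 && status < 300 && body != "" {
			bodies = append(bodies, body)
		}
	}
	return bodies
}
```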
Example: Cloudflare Solve ¶
For sites behind Cloudflare's JavaScript challenge, Foxhound can detect and wait for the challenge to complete:
fetcher := fetch.NewCamoufox(
    fetch.WithSolveCloudflare(30 * time.Second),
)
// resp.CloudflareSolved is true when the challenge was detected and solved.
// Verification checks: cf_clearance cookie, absence of Turnstile DOM markers,
// and a non-empty cf-turnstile-response token.
Example: Multi-Session Campaigns ¶
Route different jobs through different identities and proxies within a single Hunt:
indexSession := foxhound.NewSession(
    foxhound.WithSessionFetcher(fetch.NewStealth()),
    foxhound.WithSessionProxy("http://proxy-a:8080"),
)
detailSession := foxhound.NewSession(
    foxhound.WithSessionFetcher(fetch.NewCamoufox()),
    foxhound.WithSessionProxy("http://proxy-b:8080"),
)
hunt := engine.NewHunt("multi-session", /* ... */)
hunt.AddSession("index", indexSession)
hunt.AddSession("detail", detailSession)
// Jobs with SessionID "index" use indexSession's fetcher and proxy;
// jobs with SessionID "detail" use detailSession's browser.
Example: Development Mode ¶
Cache responses on disk for zero-network iteration during development:
hunt := engine.NewHunt("dev",
    engine.WithDevelopmentMode("./dev-cache"),
    // ... other options
)
// First run: fetches from network, saves responses to ./dev-cache/
// Subsequent runs: replays cached responses instantly
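The cache sub-package documents SHA256-based keys for its file backend. As a sketch of how a cached response might be keyed on disk (the exact key material in the real implementation is an assumption here):

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
)

// cacheKey derives a stable filename for a cached response from the
// request method and URL. Deterministic, so repeat runs hit the same file.
func cacheKey(method, url string) string {
	sum := sha256.Sum256([]byte(method + " " + url))
	return hex.EncodeToString(sum[:])
}
```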
Sub-Packages ¶
The foxhound module is organized into focused sub-packages:
[engine] — Hunt, Trail, Walker, scheduler, retry logic, stats collection, and ItemList for thread-safe item accumulation with CSV/JSON/JSONL export.
[fetch] — Stealth HTTP client (TLS fingerprinting + header ordering), Camoufox browser automation (Juggler protocol), Smart router (auto-escalation), XHR capture, page pool management, domain risk scoring, and SOCKS5 auth relay.
[identity] — Profile generation with 60 embedded device profiles. Produces consistent identity bundles (UA, TLS, headers, OS, hardware, screen, locale, geo) and Camoufox fingerprint configs.
[behavior] — Human behavior simulation: timing (Weibull/Gamma distributions), mouse (Bezier curves), scroll patterns, keyboard input, navigation profiles, and session fatigue modeling.
[middleware] — 13 composable middleware layers: rate limiting, dedup, retry, autothrottle, cookies, referer, redirect, depth, delta-fetch, circuit breaker, metrics, blocked detection, and robots.txt.
[parse] — Content extraction: CSS (goquery), JSON (dot-path), XPath (subset), regex, structured schema, Markdown/text conversion, metadata (JSON-LD, OpenGraph, NextData, NuxtData), contact deobfuscation, sitemap/feed parsing, adaptive selectors, HTML table extraction, JS preload detection, directory listings, pagination detection, and auto-detection with Readability-style scoring.
[pipeline] — Item processing stages: validation, cleaning, deduplication, field transformation (regex, rename, type coercion), and chain composition.
pipeline/export — Output writers: JSON, JSONL, CSV, XML, SQLite, PostgreSQL, Markdown, Text, and Webhook.
[proxy] — Proxy pool management with geo-aware selection, health checking, cooldown tracking, and provider adapters (BrightData, Oxylabs, Smartproxy).
[queue] — Job queue implementations: in-memory (heap-based priority queue), Redis (sorted sets), and SQLite (persistent).
[cache] — Response caching: in-memory (LRU + TTL), file-based (SHA256 keys), Redis, and SQLite.
[captcha] — CAPTCHA detection (Cloudflare, reCAPTCHA, hCAPTCHA, GeeTest) and solving via NopeCHA, CapSolver, 2Captcha, and Turnstile.
[monitor] — Observability: atomic stat counters, Prometheus metrics (isolated registry), and webhook-based alerting rules.
cmd/foxhound — CLI tool: init, run, check, proxy-test, shell, browser-shell, resume, curl2fox, and preview commands.
Index ¶
- Constants
- func RegisterAdaptiveHooks(extractText func(extractor any, body []byte, name string) string, ...)
- func RegisterHTMLSelectors(textsFunc func(body []byte, selector string) []string, ...)
- func SetupLogging(cfg LoggingConfig, verbose int)
- type AlertingExportConfig
- type AutoThrottleMiddlewareConfig
- type BehaviorConfig
- type BrowserFetchConfig
- type CacheConfig
- type CaptchaConfig
- type CleanConfig
- type ConcurrencyConfig
- type Config
- type DedupConfig
- type DeltaFetchConfig
- type DepthLimitConfig
- type DownloadDelayConfig
- type Duration
- type ExportConfig
- type FetchConfig
- type FetchMode
- type Fetcher
- type FetcherFunc
- type FollowOption
- func WithFollowCallback(callback string) FollowOption
- func WithFollowDontFilter(dontFilter bool) FollowOption
- func WithFollowMeta(meta map[string]any) FollowOption
- func WithFollowMode(mode FetchMode) FollowOption
- func WithFollowPriority(p Priority) FollowOption
- func WithFollowReferer(referer bool) FollowOption
- type HuntConfig
- type IdentityConfig
- type Item
- func (it *Item) Get(key string) (any, bool)
- func (it *Item) GetFloat(key string) float64
- func (it *Item) GetInt(key string) int
- func (it *Item) GetString(key string) string
- func (it *Item) Has(key string) bool
- func (it *Item) Keys() []string
- func (it *Item) Set(key string, value any)
- func (it *Item) String() string
- func (it *Item) ToCSVRow(columns []string) []string
- func (it *Item) ToJSON() ([]byte, error)
- func (it *Item) ToJSONPretty() ([]byte, error)
- func (it *Item) ToMap() map[string]any
- func (it *Item) ToMarkdown() string
- func (it *Item) ToText() string
- type Job
- type JobStep
- type LoggingConfig
- type MetricsExportConfig
- type Middleware
- type MiddlewareConfig
- type MiddlewareFunc
- type MonitorConfig
- type PageActionsConfig
- type Pipeline
- type PipelineEntry
- type PipelineFunc
- type Priority
- type Processor
- type ProcessorFunc
- type ProviderEntry
- type ProxyConfig
- type Queue
- type QueueConfig
- type RateLimitConfig
- type Response
- func (r *Response) Adaptive(name string) string
- func (r *Response) AdaptiveExtractor() any
- func (r *Response) CSS(selector string) *Selection
- func (r *Response) CSSAdaptive(selector, name string) *Selection
- func (r *Response) CSSAdaptiveAll(selector, name string) *Selection
- func (r *Response) Follow(selector string, opts ...FollowOption) []*Job
- func (r *Response) FollowAll(opts ...FollowOption) []*Job
- func (r *Response) FollowURL(rawURL string, opts ...FollowOption) *Job
- func (r *Response) IsSuccess() bool
- func (r *Response) SetAdaptiveExtractor(ae any)
- func (r *Response) TextBody() string
- func (r *Response) XPath(expr string) string
- func (r *Response) XPathAll(expr string) []string
- type Result
- type RobotsTxtConfig
- type Selection
- type Selector
- type Session
- func (s *Session) Close() error
- func (s *Session) Cookies() []*http.Cookie
- func (s *Session) CookiesFor(rawURL string) []*http.Cookie
- func (s *Session) Fetch(ctx context.Context, job *Job) (*Response, error)
- func (s *Session) Fetcher() Fetcher
- func (s *Session) Get(ctx context.Context, rawURL string) (*Response, error)
- func (s *Session) Identity() any
- func (s *Session) Name() string
- func (s *Session) ProxyURL() string
- func (s *Session) SetFetcher(f Fetcher)
- func (s *Session) SetName(name string)
- type SessionOption
- type StaticFetchConfig
- type ValidateConfig
- type Writer
Constants ¶
const (
    JobStepClick          = 1
    JobStepWait           = 2
    JobStepExtract        = 3
    JobStepScroll         = 4
    JobStepInfiniteScroll = 5  // scroll to bottom until no new content loads
    JobStepLoadMore       = 6  // click "load more" button repeatedly until gone
    JobStepPaginate       = 7  // detect and follow pagination links
    JobStepEvaluate       = 8  // execute custom JavaScript on the page
    JobStepFill           = 9  // type text into input field with human-like keystrokes
    JobStepCollect        = 10 // collect URLs from page into pool
)
Step action constants for JobStep. These are package-level int constants (not engine.StepAction) to avoid an import cycle between foxhound ↔ engine.
Variables ¶
This section is empty.
Functions ¶
func RegisterAdaptiveHooks ¶
func RegisterAdaptiveHooks(
    extractText func(extractor any, body []byte, name string) string,
    register func(extractor any, body []byte, name, selector string, all bool),
)
RegisterAdaptiveHooks is called by the parse package to wire its AdaptiveExtractor implementation into Response.Adaptive / CSSAdaptive.
func RegisterHTMLSelectors ¶
func RegisterHTMLSelectors(
    textsFunc func(body []byte, selector string) []string,
    attrsFunc func(body []byte, selector, attr string) []string,
    countFunc func(body []byte, selector string) int,
    xpathFunc func(expr string) string,
)
RegisterHTMLSelectors is called by the parse package to provide the HTML selection implementations used by Response.CSS() and Response.XPath().
func SetupLogging ¶
func SetupLogging(cfg LoggingConfig, verbose int)
SetupLogging configures the global slog logger from a LoggingConfig. The verbose parameter overrides the config level:
0 = use config level (default info)
1 = debug (-v)
2 = debug with source location (-vv)
Types ¶
type AlertingExportConfig ¶
type AlertingExportConfig struct {
WebhookURL string `yaml:"webhook_url"`
ErrorRateThreshold float64 `yaml:"error_rate_threshold"`
BlockRateThreshold float64 `yaml:"block_rate_threshold"`
Cooldown Duration `yaml:"cooldown"`
}
AlertingExportConfig configures webhook alerting.
type AutoThrottleMiddlewareConfig ¶
type AutoThrottleMiddlewareConfig struct {
Enabled bool `yaml:"enabled"`
TargetConcurrency float64 `yaml:"target_concurrency"`
InitialDelay Duration `yaml:"initial_delay"`
MinDelay Duration `yaml:"min_delay"`
MaxDelay Duration `yaml:"max_delay"`
}
AutoThrottleMiddlewareConfig configures the adaptive per-domain throttle.
type BehaviorConfig ¶
type BehaviorConfig struct {
// Profile selects the preset behavior profile: "careful", "moderate", or
// "aggressive". Defaults to "moderate" when unset.
Profile string `yaml:"profile"`
}
BehaviorConfig configures the human-simulation behavior profile for walkers.
type BrowserFetchConfig ¶
type BrowserFetchConfig struct {
Timeout Duration `yaml:"timeout"`
BlockImages bool `yaml:"block_images"`
BlockWebRTC bool `yaml:"block_webrtc"`
Headless string `yaml:"headless"`
Instances int `yaml:"instances"`
ExtensionPath string `yaml:"extension_path"` // path to extension dir/xpi, or "none" to disable NopeCHA auto-load
}
BrowserFetchConfig configures the Camoufox browser.
type CacheConfig ¶
type CacheConfig struct {
Backend string `yaml:"backend"` // "memory" | "file" | "sqlite" | "redis" | "" (disabled)
TTL Duration `yaml:"ttl"`
MaxSize int `yaml:"max_size"` // max entries for memory cache
}
CacheConfig configures response caching.
type CaptchaConfig ¶
type CaptchaConfig struct {
Enabled bool `yaml:"enabled"`
Provider string `yaml:"provider"` // "capsolver" | "twocaptcha" | "nopecha"
APIKey string `yaml:"api_key"`
}
CaptchaConfig configures CAPTCHA detection and solving.
type CleanConfig ¶
type CleanConfig struct {
TrimWhitespace bool `yaml:"trim_whitespace"`
NormalizePrice bool `yaml:"normalize_price"`
NormalizeDate bool `yaml:"normalize_date"`
}
CleanConfig configures the cleaning pipeline stage.
type ConcurrencyConfig ¶
type ConcurrencyConfig struct {
PerDomain int `yaml:"per_domain"` // max concurrent requests per domain (default 2)
}
ConcurrencyConfig limits concurrent in-flight requests per domain.
type Config ¶
type Config struct {
Hunt HuntConfig `yaml:"hunt"`
Identity IdentityConfig `yaml:"identity"`
Proxy ProxyConfig `yaml:"proxy"`
Fetch FetchConfig `yaml:"fetch"`
Middleware MiddlewareConfig `yaml:"middleware"`
Pipeline []PipelineEntry `yaml:"pipeline"`
Queue QueueConfig `yaml:"queue"`
Cache CacheConfig `yaml:"cache"`
Monitor MonitorConfig `yaml:"monitor"`
Captcha CaptchaConfig `yaml:"captcha"`
Logging LoggingConfig `yaml:"logging"`
Behavior BehaviorConfig `yaml:"behavior"`
PageActions PageActionsConfig `yaml:"page_actions"`
}
Config is the top-level configuration for a Foxhound instance.
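A YAML file covering a few of these sections might look like the fragment below. The keys follow the yaml struct tags documented in this section; all values are illustrative:

```yaml
hunt:
  domain: "books.toscrape.com"
  walkers: 4
  max_concurrency: 8
identity:
  browser: "firefox"
  os: ["windows", "macos"]
fetch:
  browser:
    timeout: "30s"
    block_images: true
    instances: 2
    extension_path: ""   # "none" disables the NopeCHA auto-load
cache:
  backend: "memory"
  ttl: "1h"
  max_size: 1000
behavior:
  profile: "moderate"
logging:
  level: "info"
```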
func LoadConfig ¶
LoadConfig reads and parses a YAML configuration file.
type DedupConfig ¶
DedupConfig configures URL deduplication.
type DeltaFetchConfig ¶
type DeltaFetchConfig struct {
Enabled bool `yaml:"enabled"`
Strategy string `yaml:"strategy"` // "skip_seen" | "skip_recent"
TTL Duration `yaml:"ttl"`
Store string `yaml:"store"` // "memory" | "redis" | "sqlite"
}
DeltaFetchConfig configures cross-run URL deduplication.
type DepthLimitConfig ¶
type DepthLimitConfig struct {
Max int `yaml:"max"`
}
DepthLimitConfig configures crawl depth limiting.
type DownloadDelayConfig ¶
type DownloadDelayConfig struct {
Enabled bool `yaml:"enabled"`
Default Duration `yaml:"default"` // base delay between same-domain requests
Domains map[string]string `yaml:"domains"` // per-domain delay overrides (domain -> duration string)
Randomize bool `yaml:"randomize"` // add ±25% jitter
}
DownloadDelayConfig configures per-domain download delays.
type Duration ¶
Duration is a time.Duration that supports YAML marshaling.
func (Duration) MarshalYAML ¶
MarshalYAML serializes the duration as a string.
type ExportConfig ¶
type ExportConfig struct {
Type string `yaml:"type"`
Path string `yaml:"path,omitempty"`
Table string `yaml:"table,omitempty"`
UpsertKey string `yaml:"upsert_key,omitempty"`
BatchSize int `yaml:"batch_size,omitempty"`
}
ExportConfig defines an export destination.
type FetchConfig ¶
type FetchConfig struct {
Static StaticFetchConfig `yaml:"static"`
Browser BrowserFetchConfig `yaml:"browser"`
}
FetchConfig configures the fetch layer.
type Fetcher ¶
type Fetcher interface {
// Fetch performs an HTTP request and returns the response.
Fetch(ctx context.Context, job *Job) (*Response, error)
// Close releases any resources held by the fetcher.
Close() error
}
Fetcher defines the interface for making HTTP requests.
type FetcherFunc ¶
FetcherFunc is an adapter to allow use of ordinary functions as Fetchers.
func (FetcherFunc) Close ¶
func (f FetcherFunc) Close() error
Close is a no-op to satisfy the Fetcher interface.
type FollowOption ¶
type FollowOption func(*followConfig)
FollowOption configures how Follow generates jobs from discovered links.
func WithFollowCallback ¶
func WithFollowCallback(callback string) FollowOption
WithFollowCallback sets a callback name in Meta["callback"] for generated jobs, allowing spider-style routing of responses to different handlers.
func WithFollowDontFilter ¶
func WithFollowDontFilter(dontFilter bool) FollowOption
WithFollowDontFilter marks generated jobs to skip deduplication filtering. Useful for pages that need to be re-fetched (e.g. pagination, monitoring).
func WithFollowMeta ¶
func WithFollowMeta(meta map[string]any) FollowOption
WithFollowMeta sets metadata on generated follow-up jobs.
func WithFollowMode ¶
func WithFollowMode(mode FetchMode) FollowOption
WithFollowMode sets the FetchMode for generated follow-up jobs.
func WithFollowPriority ¶
func WithFollowPriority(p Priority) FollowOption
WithFollowPriority sets the Priority for generated follow-up jobs.
func WithFollowReferer ¶
func WithFollowReferer(referer bool) FollowOption
WithFollowReferer sets the current response URL as referer in the generated job's Meta["referer"]. This maintains referer chain for natural browsing simulation.
type HuntConfig ¶
type HuntConfig struct {
Domain string `yaml:"domain"`
Walkers int `yaml:"walkers"`
MaxConcurrency int `yaml:"max_concurrency"` // global max concurrent requests (0 = walkers count)
}
HuntConfig configures the scraping campaign.
type IdentityConfig ¶
type IdentityConfig struct {
Browser string `yaml:"browser"`
OS []string `yaml:"os"`
FingerprintDB string `yaml:"fingerprint_db"`
}
IdentityConfig configures identity generation.
type Item ¶
type Item struct {
// Fields holds the extracted data as key-value pairs.
Fields map[string]any
// Meta carries metadata from the originating job.
Meta map[string]any
// URL is the source URL.
URL string
// Timestamp is when the item was created.
Timestamp time.Time
}
Item represents a scraped data item passing through the pipeline.
func (*Item) GetFloat ¶
GetFloat returns the field value as float64. Accepts float64 and int/int64 stored in the Fields map. Returns 0 if the field is absent or non-numeric.
func (*Item) GetInt ¶
GetInt returns the field value as int. Accepts int, int64, and float64 stored in the Fields map (float64 is truncated). Returns 0 if the field is absent or non-numeric.
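The coercion rules can be restated as a standalone sketch (this mirrors the documented behavior but is not the Item implementation itself):

```go
package main

// getInt mirrors the documented coercion: int and int64 pass through,
// float64 is truncated, anything else (including a missing key) yields 0.
func getInt(fields map[string]any, key string) int {
	switch v := fields[key].(type) {
	case int:
		return v
	case int64:
		return int(v)
	case float64:
		return int(v)
	default:
		return 0
	}
}
```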
func (*Item) GetString ¶
GetString returns the field value as a string. Returns "" if the field is absent or its underlying type is not string.
func (*Item) Has ¶
Has reports whether the field exists and has a non-empty string representation. A field set to nil or "" is treated as absent.
func (*Item) String ¶
String implements fmt.Stringer. It returns a compact JSON representation of the item fields, falling back to a key=value format on marshal error.
func (*Item) ToCSVRow ¶
ToCSVRow returns field values as a string slice following the given column order. Missing fields are returned as empty strings.
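The column-ordering contract is easy to pin down with a sketch (again mirroring the documented behavior, not the actual method):

```go
package main

import "fmt"

// toCSVRow follows the given column order; missing fields become "".
func toCSVRow(fields map[string]any, columns []string) []string {
	row := make([]string, len(columns))
	for i, col := range columns {
		if v, ok := fields[col]; ok {
			row[i] = fmt.Sprint(v)
		}
	}
	return row
}
```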
func (*Item) ToJSONPretty ¶
ToJSONPretty returns item.Fields serialized as indented JSON bytes.
func (*Item) ToMap ¶
ToMap returns a shallow copy of item.Fields. Mutations to the returned map do not affect the Item.
func (*Item) ToMarkdown ¶
ToMarkdown returns a compact Markdown representation of the item as a bullet list: the first key (sorted) is bolded; the rest are appended.
type Job ¶
type Job struct {
// ID is a unique identifier for this job.
ID string
// URL is the target URL to fetch.
URL string
// Method is the HTTP method (default GET).
Method string
// Headers are additional HTTP headers to include.
Headers http.Header
// Body is the request body for POST/PUT requests.
Body []byte
// FetchMode determines which fetcher to use.
FetchMode FetchMode
// Priority determines processing order.
Priority Priority
// MaxRetries overrides the default retry count.
MaxRetries int
// Meta is arbitrary metadata passed through the pipeline.
Meta map[string]any
// Depth is the crawl depth from the seed URL.
Depth int
// Domain is the target domain extracted from URL.
Domain string
// CreatedAt is when the job was created.
CreatedAt time.Time
// Steps are browser-side actions to execute after page load (optional).
// When non-empty, the job requires a browser fetcher. The omitempty tag
// ensures backward compatibility with existing queue serialization.
Steps []JobStep `json:"steps,omitempty"`
// NavigationTimeout overrides the default navigation timeout
// for this specific job. Useful for pages known to be slow (e.g. later
// pagination pages on Google SERP). Zero means use the fetcher default.
NavigationTimeout time.Duration `json:"navigation_timeout,omitempty"`
// DontFilter when true skips deduplication for this specific job.
// Useful for pages that need to be re-fetched (e.g. pagination, monitoring).
DontFilter bool `json:"dont_filter,omitempty"`
// Callback is an optional handler name that the spider routes to a
// specific Parse method. When empty, the default processor is used.
Callback string `json:"callback,omitempty"`
// SessionID names a session previously registered with Hunt.AddSession.
// When set, the walker routes this job through the named session's
// fetcher (with its own cookie jar, identity, and proxy) instead of the
// hunt's default fetcher. When empty (the default), the hunt's default
// fetcher is used, preserving backward-compatible behavior.
SessionID string `json:"session_id,omitempty"`
}
Job represents a unit of work to be processed by the engine.
type JobStep ¶
type JobStep struct {
// Action identifies the step type (JobStepClick, JobStepWait, etc.).
// Zero value (JobStepNavigate) is intentionally NOT omitempty so it
// always serializes.
Action int `json:"action"`
// Selector is the CSS selector for Click, Wait, and Extract steps.
Selector string `json:"selector,omitempty"`
// Duration is the timeout for Wait steps.
Duration time.Duration `json:"duration,omitempty"`
// ScrollAxis is 0 for vertical, 1 for horizontal (only for Scroll steps).
ScrollAxis int `json:"scroll_axis,omitempty"`
// ScrollExtent is the target scroll distance in pixels. Defaults to 3000
// when zero.
ScrollExtent int `json:"scroll_extent,omitempty"`
// ScrollMode is 0 for ScrollReading, 1 for ScrollScan. Zero value
// (omitted in JSON) defaults to ScrollReading.
ScrollMode int `json:"scroll_mode,omitempty"`
// MaxScrolls is the maximum number of scroll-to-bottom iterations for
// InfiniteScroll steps. Defaults to 50 when zero.
MaxScrolls int `json:"max_scrolls,omitempty"`
// MaxClicks is the maximum number of "load more" button clicks for
// LoadMore steps. Defaults to 20 when zero.
MaxClicks int `json:"max_clicks,omitempty"`
// MaxPages is the maximum number of pagination pages to follow for
// Paginate steps. Defaults to 10 when zero.
MaxPages int `json:"max_pages,omitempty"`
// Script is the JavaScript code to execute for Evaluate steps.
Script string `json:"script,omitempty"`
// WaitState specifies what state to wait for in Wait steps:
// "attached" (default), "detached", "visible", or "hidden".
// Maps to playwright's WaitForSelectorState.
WaitState string `json:"wait_state,omitempty"`
// Optional marks this step as non-fatal: if it fails, execution continues
// instead of aborting the fetch. Useful for steps that may not always be
// present on the page (e.g. a cookie banner dismiss button).
Optional bool `json:"optional,omitempty"`
// StopSelector is a CSS selector that signals InfiniteScroll to stop
// when the target element count is reached. Used with StopCount to scroll
// until N items exist (e.g. "div.result" + StopCount=20).
StopSelector string `json:"stop_selector,omitempty"`
// StopCount is the target element count for StopSelector. InfiniteScroll
// stops when document.querySelectorAll(StopSelector).length >= StopCount.
// Only used when StopSelector is set. Defaults to 1 when zero.
StopCount int `json:"stop_count,omitempty"`
// ScrollWait is the duration to wait after each scroll iteration before
// checking for new content. Defaults to 2s when zero. Increase for slow
// sites like Google Maps (3-5s recommended).
ScrollWait time.Duration `json:"scroll_wait,omitempty"`
// Value is the text to type into an input field for Fill steps.
Value string `json:"value,omitempty"`
}
JobStep is a single browser-side action that should be executed after the page loads. Steps are attached to a Job by Trail.ToJobs() and executed by the CamoufoxFetcher before content extraction.
type LoggingConfig ¶
type LoggingConfig struct {
Level string `yaml:"level"`
Format string `yaml:"format"`
Output string `yaml:"output"`
}
LoggingConfig configures structured logging.
type MetricsExportConfig ¶
MetricsExportConfig configures Prometheus metrics.
type Middleware ¶
type Middleware interface {
// Wrap takes a Fetcher and returns a wrapped Fetcher.
Wrap(next Fetcher) Fetcher
}
Middleware wraps a Fetcher to add cross-cutting behavior.
type MiddlewareConfig ¶
type MiddlewareConfig struct {
RateLimit RateLimitConfig `yaml:"ratelimit"`
AutoThrottle AutoThrottleMiddlewareConfig `yaml:"autothrottle"`
Dedup DedupConfig `yaml:"dedup"`
DeltaFetch DeltaFetchConfig `yaml:"deltafetch"`
RobotsTxt RobotsTxtConfig `yaml:"robots_txt"`
DepthLimit DepthLimitConfig `yaml:"depth_limit"`
Concurrency ConcurrencyConfig `yaml:"concurrency"`
DownloadDelay DownloadDelayConfig `yaml:"download_delay"`
}
MiddlewareConfig configures request/response processing middleware.
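As a YAML sketch, the nested keys below come from the struct tags above, and the ratelimit / robots_txt fields from the RateLimitConfig and RobotsTxtConfig types documented later in this page. The top-level `middleware:` key and the example values are assumptions; the remaining sub-sections (autothrottle, dedup, deltafetch, depth_limit, concurrency, download_delay) take their own config types and are omitted here.

```yaml
middleware:              # top-level key is an assumption
  ratelimit:
    enabled: true
    requests_per_sec: 2.0
    burst_size: 4
  robots_txt:
    enabled: true
```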
type MiddlewareFunc ¶
MiddlewareFunc is an adapter for using functions as Middleware.
type MonitorConfig ¶
type MonitorConfig struct {
Metrics MetricsExportConfig `yaml:"metrics"`
Alerting AlertingExportConfig `yaml:"alerting"`
}
MonitorConfig configures observability.
type PageActionsConfig ¶
type PageActionsConfig struct {
Scripts []string `yaml:"scripts"` // JS snippets to run after page load
}
PageActionsConfig configures JavaScript execution after page load.
type Pipeline ¶
type Pipeline interface {
// Process takes an item and returns the (possibly modified) item.
// Return nil to drop the item. Return an error to log and continue.
Process(ctx context.Context, item *Item) (*Item, error)
}
Pipeline processes items after extraction.
type PipelineEntry ¶
type PipelineEntry struct {
Validate *ValidateConfig `yaml:"validate,omitempty"`
Clean *CleanConfig `yaml:"clean,omitempty"`
Dedup *DedupConfig `yaml:"dedup,omitempty"`
Export []ExportConfig `yaml:"export,omitempty"`
}
PipelineEntry is a polymorphic pipeline stage definition.
type PipelineFunc ¶
PipelineFunc is an adapter for using functions as Pipeline stages.
type Processor ¶
type Processor interface {
// Process takes a response and returns extracted items and new jobs.
Process(ctx context.Context, resp *Response) (*Result, error)
}
Processor defines the user-provided logic for handling responses. This is the main extension point: users implement this to extract data.
type ProcessorFunc ¶
ProcessorFunc is an adapter to allow use of ordinary functions as Processors.
type ProviderEntry ¶
type ProviderEntry struct {
Type string `yaml:"type"`
List []string `yaml:"list,omitempty"`
APIKey string `yaml:"api_key,omitempty"`
Username string `yaml:"username,omitempty"`
Password string `yaml:"password,omitempty"`
Product string `yaml:"product,omitempty"`
Country string `yaml:"country,omitempty"`
}
ProviderEntry defines a proxy provider in configuration.
type ProxyConfig ¶
type ProxyConfig struct {
Providers []ProviderEntry `yaml:"providers"`
Rotation string `yaml:"rotation"`
Cooldown Duration `yaml:"cooldown"`
MaxRequestsPerProxy int `yaml:"max_requests_per_proxy"`
HealthCheckInterval Duration `yaml:"health_check_interval"`
}
ProxyConfig configures proxy management.
type Queue ¶
type Queue interface {
// Push adds a job to the queue.
Push(ctx context.Context, job *Job) error
// Pop removes and returns the highest priority job. Blocks until available
// or context is cancelled.
Pop(ctx context.Context) (*Job, error)
// Len returns the number of jobs in the queue.
Len() int
// Close releases queue resources.
Close() error
}
Queue defines the interface for job storage and retrieval.
type QueueConfig ¶
type QueueConfig struct {
Backend string `yaml:"backend"`
}
QueueConfig configures the job queue backend.
type RateLimitConfig ¶
type RateLimitConfig struct {
Enabled bool `yaml:"enabled"`
RequestsPerSec float64 `yaml:"requests_per_sec"`
BurstSize int `yaml:"burst_size"`
}
RateLimitConfig configures per-domain rate limiting.
type Response ¶
type Response struct {
// StatusCode is the HTTP status code.
StatusCode int
// Headers are the response headers.
Headers http.Header
// Body is the response body bytes.
Body []byte
// URL is the final URL after redirects.
URL string
// FetchMode indicates which fetcher was used.
FetchMode FetchMode
// Duration is how long the fetch took.
Duration time.Duration
// Job is the original job that produced this response.
Job *Job
// StepResults holds return values from JobStepEvaluate steps, keyed by
// step index (e.g. "step_0", "step_2"). Only populated when steps
// produce output.
StepResults map[string]any
// CapturedXHR holds captured XHR/fetch responses when capture patterns are configured.
// Each entry is a map with keys: request_url, request_method, status, headers, body.
CapturedXHR []map[string]any
// Cookies contains cookies set by the response (Set-Cookie headers for
// static fetches, browser context cookies for browser fetches).
Cookies []*http.Cookie `json:"cookies,omitempty"`
// CloudflareSolved is true when a Cloudflare Turnstile / JS challenge was
// detected and verified as solved before the response was returned. The
// verification checks for the cf_clearance cookie, absence of Turnstile
// DOM markers, and a non-empty cf-turnstile-response token. Only set when
// the browser fetcher was launched with WithSolveCloudflare.
CloudflareSolved bool `json:"cloudflare_solved,omitempty"`
// contains filtered or unexported fields
}
Response wraps an HTTP response with additional metadata.
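The CapturedXHR entries documented above are generic maps, so a typical consumer decodes each body into a payload struct. The sketch below works on fabricated sample data: the entry keys (request_url, request_method, status, body) come from the field doc, but the JSON payload shape is invented for illustration.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// namesFromCaptured pulls "name" values out of captured XHR entries whose
// body is JSON of the (hypothetical) shape {"results":[{"name":...}]}.
func namesFromCaptured(captured []map[string]any) []string {
	var names []string
	for _, x := range captured {
		body, ok := x["body"].(string)
		if !ok {
			continue
		}
		var payload struct {
			Results []struct {
				Name string `json:"name"`
			} `json:"results"`
		}
		if json.Unmarshal([]byte(body), &payload) != nil {
			continue // skip non-JSON or mismatched bodies
		}
		for _, r := range payload.Results {
			names = append(names, r.Name)
		}
	}
	return names
}

func main() {
	captured := []map[string]any{{
		"request_url":    "https://example.com/api/search",
		"request_method": "GET",
		"status":         200,
		"body":           `{"results":[{"name":"Villa Ubud"},{"name":"Villa Canggu"}]}`,
	}}
	fmt.Println(namesFromCaptured(captured))
}
```

In real code, `captured` would be resp.CapturedXHR, populated only when capture patterns are configured (e.g. via Trail's CaptureXHR step).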
func (*Response) Adaptive ¶
Adaptive returns the text of a registered adaptive selector by name. Falls back to similarity matching if the primary CSS selector finds nothing on the current page. Returns an empty string when no extractor is attached or no element is matched.
Requires Hunt.WithAdaptive(...) to have been called.
func (*Response) AdaptiveExtractor ¶
AdaptiveExtractor returns the attached extractor as an opaque value. Callers in the parse package can type-assert it to *parse.AdaptiveExtractor. Returns nil when no extractor is configured for this response.
func (*Response) CSS ¶
CSS returns a Selector bound to this Response. Subsequent calls share the same parsed document, making it efficient to chain multiple selectors:
title := resp.CSS("h1").Text()
links := resp.CSS("a[href]").Attrs("href")
func (*Response) CSSAdaptive ¶
CSSAdaptive runs a CSS selector against this response, registering it as an adaptive selector under the given name. On future runs, if the selector no longer matches, similarity matching falls back to the saved signature.
Requires Hunt.WithAdaptive(...) to have been called. Returns a Selection supporting .Text(), .Attr(), .Texts(), and .Attrs(); when no extractor is configured, returns a Selection backed by a plain CSS query (degraded behaviour).
func (*Response) CSSAdaptiveAll ¶
CSSAdaptiveAll is like CSSAdaptive but registers the selector for multi-element extraction. Returns a Selection that, when queried via .Texts() / .Attrs(), yields all matches.
func (*Response) Follow ¶
func (r *Response) Follow(selector string, opts ...FollowOption) []*Job
Follow generates follow-up Jobs from all links matching the CSS selector in the response body. Links are resolved relative to the response URL, deduplicated, and filtered to HTTP(S) schemes. The generated jobs inherit Depth+1 from the originating job.
Example:
jobs := resp.Follow("a.product-link[href]", foxhound.WithFollowCallback("parseProduct"))
func (*Response) FollowAll ¶
func (r *Response) FollowAll(opts ...FollowOption) []*Job
FollowAll generates follow-up Jobs for all anchor links (a[href]) in the response body. It is shorthand for Follow("a[href]", opts...).
func (*Response) FollowURL ¶
func (r *Response) FollowURL(rawURL string, opts ...FollowOption) *Job
FollowURL creates a single follow-up Job for the given URL. The URL is resolved relative to the response URL. The generated job inherits Depth+1 from the originating job.
Unlike Follow, which extracts links from HTML via CSS selectors, FollowURL is for programmatically following a known URL (e.g. an API endpoint or a URL extracted from JSON data).
Example:
nextPage := resp.FollowURL("/api/products?page=2", foxhound.WithFollowReferer(true))
func (*Response) IsSuccess ¶
IsSuccess returns true when the HTTP status code indicates success (2xx).
func (*Response) SetAdaptiveExtractor ¶
SetAdaptiveExtractor attaches a *parse.AdaptiveExtractor to this response. Walker calls this before invoking the user processor when Hunt.WithAdaptive(...) is configured. The argument is typed as any to avoid an import cycle; pass a *parse.AdaptiveExtractor.
type Result ¶
type Result struct {
// Items are the extracted data items.
Items []*Item
// Jobs are new jobs to enqueue (discovered links, pagination, etc.).
Jobs []*Job
}
Result is the output of processing a job. It contains scraped items and optionally new jobs to enqueue (for crawling).
type RobotsTxtConfig ¶
type RobotsTxtConfig struct {
Enabled bool `yaml:"enabled"`
}
RobotsTxtConfig configures robots.txt compliance.
type Selection ¶
type Selection struct {
// contains filtered or unexported fields
}
Selection represents a CSS selector applied to an HTML body. It provides convenience methods for extracting text, attributes, and HTML from matched elements without requiring the user to import the parse package directly.
type Selector ¶
type Selector struct {
// contains filtered or unexported fields
}
Selector provides CSS-selector-based querying on the Response body. It wraps a lazily-parsed HTML document so multiple CSS/XPath calls share the same parse result.
type Session ¶
type Session struct {
// contains filtered or unexported fields
}
Session is a stateful client that survives across calls. Cookies are persisted in an internal CookieJar; identity, proxy, and fetcher are reused for every Get/Fetch.
Session is safe for concurrent use by multiple goroutines.
func NewSession ¶
func NewSession(opts ...SessionOption) *Session
NewSession constructs a Session with the supplied options. A fresh in-memory cookie jar is created when no WithSessionCookieJar option is given.
func (*Session) Close ¶
Close releases any resources held by the underlying fetcher. The cookie jar is in-memory and needs no cleanup. Safe to call multiple times.
func (*Session) Cookies ¶
Cookies returns all cookies currently held in the jar across every host. The returned slice is a fresh copy; mutating it does not affect the jar.
func (*Session) CookiesFor ¶
CookiesFor returns the cookies the jar would send for the given URL. This is the standard http.CookieJar query — use it when you need to inspect what the session has accumulated for a particular host.
func (*Session) Fetch ¶
Fetch executes a Job through the session's fetcher. Before the call any cookies the jar holds for the target URL are merged into the job's headers (so static fetchers without their own jar still see them). After a successful fetch any cookies returned in Response.Cookies are stored in the jar so the next call observes them.
func (*Session) Fetcher ¶
Fetcher returns the underlying fetcher. Returns nil if none was configured.
func (*Session) Get ¶
Get is the simple fetch shorthand. It builds a Job with method GET, the session's identity, and the URL, then delegates to Fetch.
func (*Session) Identity ¶
Identity returns the configured identity profile (as `any`). Callers type-assert to *identity.Profile.
func (*Session) Name ¶
Name returns the session's optional name. Returns empty string for stand-alone sessions; populated for sessions registered via Hunt.AddSession.
func (*Session) SetFetcher ¶
SetFetcher updates the underlying fetcher post-construction. Useful when the fetcher needs to reference the Session's cookie jar (a chicken-and-egg problem solved by constructing the Session first, then the fetcher).
type SessionOption ¶
type SessionOption func(*Session)
SessionOption configures a Session at construction time.
func WithSessionCookieJar ¶
func WithSessionCookieJar(jar http.CookieJar) SessionOption
WithSessionCookieJar replaces the default in-memory jar with a caller-supplied implementation. Use this when persisting cookies across processes (e.g. via a custom file-backed jar) or when sharing a jar across sessions.
func WithSessionFetcher ¶
func WithSessionFetcher(f Fetcher) SessionOption
WithSessionFetcher overrides the default fetcher. When omitted the caller must register one explicitly via SetFetcher before the first Get / Fetch call; calling Get on a Session without a fetcher returns an error.
The default Session does NOT auto-create a stealth fetcher to avoid an import cycle from foxhound → fetch. Wire one up at the application layer:
s := foxhound.NewSession(foxhound.WithSessionFetcher(fetch.NewStealth()))
func WithSessionIdentity ¶
func WithSessionIdentity(p any) SessionOption
WithSessionIdentity attaches an identity profile to the session. The value is stored as `any` to avoid an import cycle with the identity package; the caller passes a *identity.Profile and is responsible for using it on the fetcher (most fetchers accept it via their own option at construction).
func WithSessionProxy ¶
func WithSessionProxy(rawURL string) SessionOption
WithSessionProxy records the session's proxy URL. The value is stored for inspection but is NOT auto-applied to the fetcher; configure the fetcher's own proxy option at construction time. This is intentional: a Session is a thin wrapper, not a fetcher factory.
type StaticFetchConfig ¶
type StaticFetchConfig struct {
Timeout Duration `yaml:"timeout"`
MaxIdleConns int `yaml:"max_idle_conns"`
TLSImpersonate bool `yaml:"tls_impersonate"`
}
StaticFetchConfig configures the TLS-impersonating HTTP client.
type ValidateConfig ¶
type ValidateConfig struct {
Required []string `yaml:"required"`
}
ValidateConfig configures the validation pipeline stage.
type Writer ¶
type Writer interface {
// Write outputs an item to the destination.
Write(ctx context.Context, item *Item) error
// Flush ensures all buffered items are written.
Flush(ctx context.Context) error
// Close releases writer resources.
Close() error
}
Writer defines the interface for exporting scraped items.
Directories
¶
| Path | Synopsis |
|---|---|
| behavior | Package behavior provides human-like behavioral simulation for the Foxhound scraping engine. |
| cache | Package cache provides caching backends for Foxhound responses. |
| captcha | Package captcha provides CAPTCHA detection and solving for the Foxhound scraping framework. |
| cmd | |
| cmd/foxhound (command) | Command foxhound is the CLI entry point for the Foxhound scraping framework. |
| engine | Package engine implements the core orchestration layer for the Foxhound scraping framework: Hunt (campaign coordinator), Walker (virtual user), Trail (navigation path), Scheduler (goroutine pool), RetryPolicy, and Stats. |
| examples | |
| examples/adaptive (command) | Example: adaptive selectors that survive DOM changes. |
| examples/ecommerce (command) | Example: E-commerce product scraper |
| examples/realtime (command) | Example: Real-time price monitor with webhook notifications |
| examples/travel (command) | Example: Travel site scraper — hotel listings with JavaScript rendering |
| fetch | Package fetch provides the dual-mode fetching layer for Foxhound. |
| fetch/presets | Package presets ships a single curated Firefox JA3 fingerprint that the stealth fetcher applies automatically when an identity profile is configured. |
| identity | Package identity provides consistent anti-detection identity profiles. |
| middleware | Package middleware provides composable foxhound.Middleware implementations for rate limiting, deduplication, depth limiting, and retry logic. |
| monitor | Package monitor provides runtime statistics, Prometheus metrics, and alerting for Foxhound scraping hunts. |
| parse | Package parse provides HTML, JSON, and other response parsing utilities for the Foxhound scraping framework. |
| pipeline | Package pipeline provides composable data processing stages and export writers for the Foxhound scraping framework. |
| pipeline/export | Package export provides Writer implementations for exporting scraped items to various formats and destinations. |
| proxy | Package proxy manages HTTP/SOCKS proxy pools, health checking, and rotation strategies for the Foxhound scraping framework. |
| proxy/providers | Package providers contains third-party proxy provider adapters that implement the proxy.Provider interface. |
| queue | Package queue provides foxhound.Queue implementations. |
| tests | |
| tests/scrape_targets/alibaba (command) | Scrape Target 3: Alibaba — 10 yoga mat products |
| tests/scrape_targets/google_maps (command) | Scrape Target 2: Google Maps — "villa di bali" |
| tests/scrape_targets/google_serp (command) | Scrape Target 1: Google SERP — "wisata alam jawa timur" |
| tests/scrape_targets/yoga_alliance (command) | Scrape Target 4: Yoga Alliance School Directory |