foxhound

package module v0.0.23

Published: May 5, 2026 License: MIT Imports: 15 Imported by: 0

README

Foxhound - Go Scraping Framework

Go Scraping Framework with Native Camoufox Anti-Detection

High-performance Go scraping framework with native Camoufox anti-detection, dual-mode fetching, and 13-layer middleware.

Highlights

  • Dual-mode fetching: TLS-impersonating HTTP client (~5-50ms) + Camoufox browser (~500ms-5s), with automatic escalation on block detection
  • Consistent identity profiles: UA + TLS fingerprint + header order + OS + hardware + screen + locale all match — randomness without consistency causes instant blocks
  • 13-layer middleware chain: concurrency, metrics, rate limit, robots.txt, delta-fetch, dedup, autothrottle, cookies, referer, blocked detector, redirect, depth limit, retry
  • Trail API: fluent navigation builder with Fill, InfiniteScroll, Evaluate (custom JS), XHR/fetch capture, and optional steps
  • Structured data extraction: JSON-LD, OpenGraph, NextData, NuxtData extractors + contact deobfuscation (CloudFlare cfemail)
  • NopeCHA auto-download: CAPTCHA-solving extension fetched and configured automatically at runtime
  • 9 export formats: JSON, JSONL, CSV, Markdown, Text, XML, SQLite, PostgreSQL, Webhook
  • Parsing engine: HTML table extraction (colspan/rowspan), JS preloaded data (Next.js/Nuxt/Redux), directory listings (JSON-LD/Microdata/DOM), pagination detection, and auto-detection with Readability-style article scoring
  • Adaptive parsing: CSS pseudo-selectors (::text, ::attr), similarity matching, auto-selector generation + sitemap/RSS/Atom parsing
  • Streaming API: Hunt.Stream(ctx) for real-time item processing via Go channels
  • Checkpoint/resume: auto-save hunt state every N items
  • Stateful Session: foxhound.NewSession(...) wraps fetcher + cookie jar + identity + proxy for single-call ad-hoc scraping, with cookies persisted across calls
  • Multi-session campaigns: Hunt.AddSession(name, cfg) + Job.SessionID route individual jobs through distinct fetchers / identities / proxies inside one Hunt
  • Development mode: Hunt.WithDevelopmentMode(dir) caches responses on disk after the first run and replays them on subsequent runs for zero-network iteration
  • Verified Cloudflare solve: fetch.WithSolveCloudflare(timeout) polls cookie + DOM + token signals before declaring success and exposes Response.CloudflareSolved
  • Domain & resource blocking: Hunt.WithBlockedDomains(...) / Hunt.WithDisableResources(...) abort ad, tracker, image, and font requests at the browser layer
  • Trail XHR capture: Trail.CaptureXHR(pattern) attaches URL regexps to every produced job so matching XHR/fetch response bodies land in Response.CapturedXHR
  • TLS fingerprint customisation (build tag tls): fetch.WithIdentity auto-applies the curated Firefox JA3 from fetch/presets; fetch.WithJA3, fetch.WithJA3Pool, fetch.WithHTTP2Fingerprint, fetch.WithHTTP3Fingerprint available for advanced overrides
  • Build-mode safety: StealthFetcher.IsImpersonating() + startup log so consumers fail-fast when built without -tags tls
  • 19 packages, 1200+ tests

Key Capabilities

| Area | What you get |
|---|---|
| Performance | CSS parsing in ~8ms for 5K elements. Multi-core goroutines with per-domain concurrency control |
| Anti-detection | Real Camoufox binary (C++ fingerprint spoofing), human behavior simulation (log-normal timing, Bezier mouse, scroll rhythm), NopeCHA auto-download |
| Block avoidance | 9 vendor patterns (Cloudflare, Akamai, DataDome, PerimeterX) with auto-retry + reCAPTCHA checkbox click + Turnstile handler |
| Identity | 60+ device profiles with consistent UA + TLS + headers + OS + GPU + screen + locale + geo matching |
| Trail API | Fill forms (JobStepFill), infinite scroll with container + stop condition, Evaluate custom JS, XHR/fetch capture, optional steps, persistent cookies |
| Parsing | CSS + XPath + regex + JSON + structured schema + adaptive selectors + similarity matching + pseudo-selectors + sitemap/RSS/Atom |
| Structured data | JSON-LD, OpenGraph, NextData, NuxtData extractors + CloudFlare cfemail deobfuscation |
| Export | 9 formats: JSON, JSONL, CSV, Markdown (table/list/cards), Text, XML, SQLite, PostgreSQL, Webhook + field-level pipeline transforms |
| Proxy | Pool rotation, health checking, cooldown, geo-targeted selection matching identity locale |
| Queue | Memory, Redis (distributed), SQLite (persistent) — checkpoint/resume across restarts |
| Monitoring | Prometheus metrics + webhook alerting with error/block rate thresholds |
| Scaling | docker compose --scale foxhound=4 with shared Redis queue |

Quick Start

git clone https://github.com/sadewadee/foxhound.git
cd foxhound
go build -tags playwright -o foxhound ./cmd/foxhound/
foxhound init myproject && cd myproject
go mod tidy
foxhound run --config config.yaml

Google Maps — Scroll feed, collect businesses, extract contacts

// Generate a consistent identity (UA + TLS + headers + OS all match)
id := identity.Generate(identity.WithBrowser(identity.BrowserFirefox))
profile := behavior.CarefulProfile().Jitter() // ±15% per-session parameter variance

browser, _ := fetch.NewCamoufox(
    fetch.WithBrowserIdentity(id),
    fetch.WithBehaviorProfile(profile),
    fetch.WithStorageState("session.json"), // persist session across runs
)
defer browser.Close()

// SmartFetcher with Bayesian domain learning — auto-escalates to browser when blocked
static := fetch.NewStealth(fetch.WithIdentity(id)) // fast static client sharing the same identity
scorer := fetch.NewDomainScorer(fetch.SocialMediaScoreConfig())
smart := fetch.NewSmart(static, browser, fetch.WithDomainScorer(scorer))

// Trail: search → scroll feed → collect all business URLs
trail := engine.NewTrail("maps-search").
    Navigate("https://www.google.com/maps").
    Fill("input#searchboxinput", "restaurant in bali").
    Click("button#searchbox-searchbutton").
    WaitOptional("div[role='feed']", 10*time.Second).
    InfiniteScrollInUntil("div[role='feed']", "div.Nv2PK", 50, 200).
    Evaluate(`() => document.querySelectorAll('.Nv2PK').length`)

jsonlWriter := export.NewJSON("maps.jsonl", export.JSONLines) // JSONL writer used below (path illustrative)

h := engine.NewHunt(engine.HuntConfig{
    Name:            "maps",
    Walkers:         3,
    Seeds:           trail.ToJobs(),
    Fetcher:         middleware.Chain(
        middleware.NewCircuitBreaker(middleware.DefaultCircuitBreakerConfig()),
        middleware.NewAutoThrottle(middleware.AutoThrottleConfig{
            TargetConcurrency: 1, MinDelay: 2 * time.Second, MaxDelay: 15 * time.Second,
        }),
    ).Wrap(smart),
    Queue:           queue.NewReliable(queue.NewMemory(1000), queue.DefaultReliableConfig()),
    BehaviorProfile: profile,
    Processor: foxhound.ProcessorFunc(func(ctx context.Context, resp *foxhound.Response) (*foxhound.Result, error) {
        // Auto-detect page type and extract accordingly
        result, _ := parse.AutoExtract(resp)
        if result.Type == parse.ContentListing {
            var items []*foxhound.Item
            for _, l := range result.Listings {
                items = append(items, l.AsItem())
            }
            return &foxhound.Result{Items: items}, nil
        }
        // Fallback: extract contacts from business website
        item := foxhound.NewItem()
        item.Set("url", resp.URL)
        item.Set("emails", parse.ExtractEmails(resp))
        item.Set("phones", parse.ExtractPhones(resp))
        return &foxhound.Result{Items: []*foxhound.Item{item}}, nil
    }),
    Writers: []foxhound.Writer{jsonlWriter},
})
h.Run(context.Background())

Trail API — Login + Search + Infinite Scroll + JS Extract

// Login trail (reusable across sessions with WithStorageState)
login := engine.Login("ig-login",
    "https://www.instagram.com/accounts/login/",
    "input[name='username']", "input[name='password']", "button[type='submit']",
    os.Getenv("IG_USER"), os.Getenv("IG_PASS"),
)

// Feed scraping trail
feed := engine.NewTrail("ig-feed").
    Navigate("https://www.instagram.com/explore/").
    WaitOptional("article", 10*time.Second).
    InfiniteScrollUntil("article", 100, 500).
    Evaluate(`() => {
        const posts = document.querySelectorAll('a[href*="/p/"]');
        return Array.from(posts).map(a => a.href);
    }`)

Auto-Detection — Let foxhound figure out the page type

result, _ := parse.AutoExtract(resp)
switch result.Type {
case parse.ContentArticle:
    fmt.Println(result.Article.Title, result.Article.WordCount, "words")
case parse.ContentListing:
    for _, listing := range result.Listings {
        fmt.Println(listing.Name, listing.Phone, listing.Rating)
    }
case parse.ContentProduct:
    fmt.Println("Product page detected")
}

// Extract preloaded JS data (Next.js, Nuxt, Redux, Apollo)
data, _ := parse.ExtractPreloadedData(resp)
fmt.Println("Framework:", data.Framework) // "nextjs", "nuxt", "react"...

// Detect pagination and follow next pages
links := parse.DetectPagination(resp) // multi-signal scoring (50pt threshold)
for _, link := range links {
    fmt.Println(link.Direction, link.URL, "score:", link.Score)
}

Anti-fragility / Adaptive Selectors

Most scrapers break the moment a target site renames a CSS class. Foxhound's adaptive selectors learn an element signature (tag, classes, text prefix, parent, depth, position) on the first successful match, then fall back to similarity matching when the primary CSS selector stops working — so a class rename, a wrapper-div change, or a sibling reordering does not break extraction.

Enable adaptive mode on a Hunt with WithAdaptive(savePath) (pass an empty string for in-memory only, or a JSON path to persist learned signatures across runs), then use the adaptive helpers on Response:

hunt := engine.NewHunt(engine.HuntConfig{
    Name:      "shop",
    Fetcher:   fetcher,
    Queue:     q,
    Processor: foxhound.ProcessorFunc(func(ctx context.Context, resp *foxhound.Response) (*foxhound.Result, error) {
        // Inline: register and extract in one call. The signature is
        // learned automatically and persisted by the Hunt.
        title := resp.CSSAdaptive("h1.product-title", "title").Text()
        price := resp.CSSAdaptive(".price", "price").Text()

        // On future runs, even if .product-title gets renamed to
        // .item-name, similarity matching will recover the element.
        // Use Adaptive(name) for selectors registered earlier (e.g.
        // via Trail.Adaptive or a previous CSSAdaptive call).
        _ = resp.Adaptive("title")

        item := foxhound.NewItem()
        item.Set("title", title)
        item.Set("price", price)
        return &foxhound.Result{Items: []*foxhound.Item{item}}, nil
    }),
}).WithAdaptive("./adaptive_signatures.json")

You can also declare adaptive selectors at the Trail level:

trail := engine.NewTrail("books").
    Navigate("https://books.toscrape.com/").
    Adaptive("book_title", ".product_pod h3 a").
    Adaptive("book_price", ".product_pod .price_color")

See examples/adaptive/ for a complete runnable example demonstrating an adaptive selector surviving a CSS class rename.

TLS Fingerprint Customisation

fetch.NewStealth ships in two flavours selected by build tag:

  • default build (no tag): Go crypto/tls ClientHello — well-known JA3, trivially detected. Use for tests, CI, or non-bot-protected targets only.
  • -tags tls build: full JA3 / Akamai HTTP/2 / HTTP/3 impersonation via azuretls-client. Use for production scraping.

The same fetch.NewStealth API exists in both, but the underlying TLS layer is completely different. Confirm at startup:

f := fetch.NewStealth(fetch.WithIdentity(profile))
if !f.IsImpersonating() {
    log.Fatal("built without -tags tls; refusing to start in production")
}

Or check the binary directly:

go tool nm /path/to/binary | grep -q azuretls && echo "✅ TLS impersonation active" || echo "❌ Built without -tags tls"

TLS fingerprint comes from the identity

WithIdentity is the only thing you need for fingerprint consistency. It sets the azuretls browser family to match the profile ("firefox" for a Firefox profile — foxhound's primary target since Camoufox is Firefox-based) and lets azuretls's built-in GetLastFirefoxVersion produce the ClientHello at request time:

import "github.com/sadewadee/foxhound/fetch"

f := fetch.NewStealth(fetch.WithIdentity(profile))

The HTTP/2 layer is left to azuretls's browser-aware initHTTP2(browser) so TLS, headers, and HTTP/2 all agree on Firefox. Manual WithHTTP2Fingerprint is supported for power users but logs a startup warning when paired with WithJA3 (see issue #41).

Verified against https://www.bing.com/search and https://duckduckgo.com/ through a datacenter proxy: both return 200 with WithIdentity alone.

TLS certificate verification (v0.0.20)

NewStealth now sets InsecureSkipVerify=true by default. This disables azuretls's built-in DefaultPinManager, which performs an extra TLS handshake per new host to capture SPKI fingerprints and then fails on subsequent requests if a different CDN edge serves a different certificate. Multi-edge targets (Bing, Google, Cloudflare) rotate certificates continuously, making the default PinManager behaviour incompatible with sustained scraping.

foxhound's threat model is bot detection avoidance, not MITM prevention. The default is safe for scraping public sites over a controlled proxy path.

To re-enable full certificate chain, hostname, and pin verification:

f := fetch.NewStealth(
    fetch.WithIdentity(profile),
    fetch.WithStrictTLSVerify(),   // re-enables chain + hostname + pin checks
)

The startup log reports tls_verify=true when strict mode is active and tls_verify=false otherwise (the default).

Pin or rotate JA3 (advanced)

Capture your own Firefox JA3 from tls.peet.ws when the curated preset lags real Firefox:

f := fetch.NewStealth(
    fetch.WithIdentity(profile),
    fetch.WithJA3(myCapturedJA3),       // overrides the auto-applied preset
)

For per-recycle rotation, supply a pool of multiple Firefox captures:

pool := []string{ja3FromYesterday, ja3FromLastWeek, presets.FirefoxLatest().JA3}
f := fetch.NewStealth(
    fetch.WithIdentity(profile),
    fetch.WithJA3Pool(pool),
)

Without -tags tls these options compile but log an error at startup — the underlying net/http transport cannot customise the TLS ClientHello.

Real Scraping Results

| Target | Mode | Items | Block Avoidance | Notes |
|---|---|---|---|---|
| Google Maps (10 queries) | Camoufox + proxy | 100 places | 100% | 1,297 items/hour, 0 CAPTCHAs |
| Alibaba (yoga mat) | Camoufox + proxy | 10 products | 100% | Prices + suppliers extracted |
| bot.sannysoft.com | Camoufox | 29/30 | PASS | webdriver NOT detected |
| CreepJS | Camoufox | | Trust: HIGH | Fingerprint consistent |

Benchmarks

Measured on hachibi (AMD Ryzen 7 5700G, Docker container, 2 cores / 4GB RAM, Ubuntu 24.04).

CSS Selection — 5,000 elements

| Library | Language | Time | vs Foxhound |
|---|---|---|---|
| Foxhound CSS | Go | 13.6ms | 1.0x |
| Raw goquery | Go | 13.0ms | 0.96x |
| stdlib html | Go | 17.7ms | 1.3x slower |
| Raw lxml | Python/C | 195.8ms | 14.4x slower |
| BeautifulSoup | Python | 245.6ms | 18.1x slower |

Foxhound Internal Benchmarks (5,000 elements)

| Method | Time | Memory | Allocs | Notes |
|---|---|---|---|---|
| Foxhound CSS | 13.6ms | 6.5 MB | 100K | <1% overhead vs raw goquery |
| Foxhound Adaptive | 17.3ms | 6.2 MB | 95K | Zero overhead when selector works |
| Foxhound Schema | 31.3ms | 13.3 MB | 320K | 3 fields per item |
| Foxhound TextExtract | 22.5ms | 10.0 MB | 270K | 3 fields per item |
| FindByText | 24.6ms | 12.1 MB | 165K | Full DOM text search |
| Regex extract | 6.7ms | 1.1 MB | 15K | Pattern matching on body |
| Similarity score | 96ns | 0 B | 0 | Zero allocation |
| Item.ToJSON | 1.2µs | 432 B | 10 | |
| Item.ToMarkdown | 716ns | 376 B | 8 | |

Scaling by Document Size

| Benchmark | 1K elements | 5K elements | 10K elements | Scaling |
|---|---|---|---|---|
| Foxhound CSS | 2.3ms | 13.6ms | 29.6ms | ~linear |
| Regex extract | 1.5ms | 6.7ms | 15.7ms | ~linear |
| stdlib html | 3.1ms | 17.7ms | 31.4ms | ~linear |

# Run yourself
go test -bench=. -benchmem ./benchmarks/

# Run in Docker with resource limits
docker run --cpus=2 --memory=4g foxhound-benchmark:latest \
  go test -bench=. -benchmem ./benchmarks/

Documentation

| File | Contents |
|---|---|
| docs/getting-started.md | Install, first scrape, running modes |
| docs/configuration.md | Full config.yaml reference |
| docs/cli.md | All CLI commands and flags |
| docs/api.md | Go types, interfaces, Hunt/Stream API |
| docs/anti-detection.md | Identity system, TLS, behavior simulation |
| docs/parsing.md | Table, preload, directory, pagination, auto-detection parsers |
| docs/middleware.md | All 13 middleware, chain order |
| docs/pipeline.md | Pipeline stages and all 9 export formats |
| docs/proxy.md | Proxy pool, rotation, providers, geo matching |
| docs/browser.md | Camoufox setup, options, human simulation |
| docs/examples.md | E-commerce, Maps, adaptive parsing, streaming |
| docs/deployment.md | Docker, scaling, environment variables |

Export Formats

| Format | Constructor | Notes |
|---|---|---|
| JSON array | export.NewJSON(path, export.JSONArray) | Single file, full array |
| JSON Lines | export.NewJSON(path, export.JSONLines) | One object per line, streaming-friendly |
| CSV | export.NewCSV(path, cols...) | Fixed or auto-inferred columns |
| Markdown table | export.NewMarkdown(path, export.MarkdownTable) | GFM pipe table |
| Markdown list | export.NewMarkdown(path, export.MarkdownList) | Bullet list, first field bolded |
| Markdown cards | export.NewMarkdown(path, export.MarkdownCards) | H2 heading + bullet fields |
| Plain text lines | export.NewText(path, export.TextLines) | key=value per line |
| Plain text pretty | export.NewText(path, export.TextPretty) | Labelled blocks with separators |
| XML | export.NewXML(path, root, item) | Configurable root/item element names |
| SQLite | export.NewSQLite(dbPath, table) | Auto-creates and extends schema |
| PostgreSQL | export.NewPostgres(dsn, table) | Upsert support, batch inserts |
| Webhook | export.NewWebhook(url) | HTTP POST, optional batch size |
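
A sketch wiring two of the writers above into a Hunt; file names are illustrative, and the Writers field is the same one used in the Quick Start:

jsonl := export.NewJSON("items.jsonl", export.JSONLines)
csv := export.NewCSV("items.csv", "title", "price")

h := engine.NewHunt(engine.HuntConfig{
    Name:    "export-demo",
    Writers: []foxhound.Writer{jsonl, csv},
    // ... Fetcher, Queue, Processor as in the Quick Start
})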

Architecture

Job → rate limit → dedup → behavior timing → header enrichment
  → Smart Fetcher (static TLS or Camoufox browser)
    → Block detection (9 vendor patterns) → retry with backoff
  → Parser (CSS / XPath / JSON / Regex / Adaptive / Similarity)
  → User Process() → Result{Items, NextJobs}
  → Pipeline (validate, clean, dedup) → Writers (9 formats)
  → Queue (memory / Redis / SQLite)

License

MIT

Documentation

Overview

Package foxhound is a high-performance Go web scraping framework with native anti-detection built on Camoufox, a Firefox fork designed to evade bot fingerprinting.

Foxhound is a scraping framework for Go — it handles the full lifecycle of web data extraction: fetching pages (with or without a real browser), navigating JavaScript-heavy sites, solving CAPTCHAs, rotating identities and proxies, extracting structured data, and exporting results. Think of it as Scrapy for Go, but with first-class browser automation and anti-detection built in from day one.

Why Foxhound

Modern websites deploy increasingly sophisticated bot detection: TLS fingerprinting, JavaScript challenges (Cloudflare, DataDome, PerimeterX), canvas/WebGL fingerprint checks, and behavioral analysis. Traditional HTTP-only scrapers fail silently against these defenses. Headless Chrome is widely fingerprinted. Foxhound solves this by combining two fetching strategies behind a single API:

  • A TLS-impersonating HTTP client for static pages (~5-50ms per request)
  • A Camoufox browser (Firefox fork) via playwright-go for JS-heavy and protected pages (~500ms-5s per request)

The smart router starts with the fast static client and automatically escalates to the full browser when it detects blocks (403, 429, 503, CAPTCHA pages). This means you get HTTP-client speed on easy targets and browser-level evasion on hard ones, without changing your code.

Architecture Overview

Foxhound is organized around five core concepts:

Hunt is the top-level campaign orchestrator. It owns the queue, spawns Walker goroutines, collects stats, and coordinates shutdown. You configure a Hunt with seed URLs, a Processor (your extraction logic), middleware, pipelines, and writers.

Trail is a fluent navigation path builder. It chains browser actions — Navigate, Click, Fill, Wait, Scroll, InfiniteScroll, Evaluate (custom JS), and CaptureXHR — into a reusable sequence that gets compiled into Jobs. Trails describe what a human would do on the page.

Walker is a goroutine that acts as a virtual user. Each Walker pops Jobs from the queue, fetches pages through the middleware chain, runs your Processor, writes extracted Items through the pipeline, and enqueues discovered follow-up Jobs. A Hunt runs multiple Walkers concurrently.

Job is the unit of work: a URL plus fetch mode, priority, browser steps, metadata, and optional session routing. Jobs flow through the queue and are consumed by Walkers.

Session is a stateful client that wraps a fetcher, cookie jar, identity profile, and proxy into a reusable unit. Use it standalone for ad-hoc scraping, or register multiple Sessions with a Hunt via Hunt.AddSession to route different Jobs through different identities and proxies.
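
For reference, a hand-built Job (fields documented under the Job type below; the values are illustrative):

job := &foxhound.Job{
    URL:       "https://example.com/products",
    Method:    "GET",
    FetchMode: foxhound.FetchBrowser, // force the Camoufox fetcher
    Priority:  foxhound.PriorityHigh,
    Meta:      map[string]any{"category": "seed"},
    SessionID: "detail", // optional: route through a session registered via Hunt.AddSession
}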

Dual-Mode Fetching

Every request flows through a middleware chain before reaching the fetcher:

Job → middleware (rate limit → dedup → autothrottle → cookies → referer → retry)
  → Smart Fetcher (static or browser) → Browser Steps → Parser → Processor
  → Result{Items, Jobs} → Pipeline (validate → clean → dedup → transform)
  → Writers (CSV/JSON/SQLite/Webhook) + Queue (new jobs)

The static fetcher (fetch.NewStealth) uses Go's HTTP client with precise header ordering and TLS fingerprints matched to the identity profile. The browser fetcher (fetch.NewCamoufox) launches a real Camoufox browser instance via the Juggler protocol (Firefox's native remote protocol, less targeted by anti-bot than CDP). The smart fetcher (fetch.NewSmart) wraps both and auto-escalates based on response signals and Bayesian domain risk scoring.
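
Composed with the same calls as the Quick Start (a sketch; the domain scorer config shown is the only one named in this document):

id := identity.Generate(identity.WithBrowser(identity.BrowserFirefox))
static := fetch.NewStealth(fetch.WithIdentity(id))             // fast path
browser, _ := fetch.NewCamoufox(fetch.WithBrowserIdentity(id)) // evasion path
defer browser.Close()

scorer := fetch.NewDomainScorer(fetch.SocialMediaScoreConfig())
smart := fetch.NewSmart(static, browser, fetch.WithDomainScorer(scorer))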

Identity System

Every request uses a complete, internally consistent identity profile: user agent, TLS fingerprint, header order, OS, hardware specs, screen dimensions, locale, timezone, and geolocation all match. Randomness without consistency is the number one cause of bot detection — a Windows UA with a macOS font list, or a US locale with a Tokyo timezone, triggers instant blocks.

Foxhound ships 60 embedded device profiles. The identity package generates profiles with functional options:

id := identity.Generate(
    identity.WithBrowser(identity.BrowserFirefox),
    identity.WithOS(identity.OSWindows),
)

When using Camoufox, the identity is serialized to a JSON config that sets navigator properties, WebGL vendor/renderer, canvas noise, OS-specific fonts, screen dimensions, and timezone at the C++ level inside the browser — not via JavaScript injection that anti-bot scripts can detect.

Human Behavior Simulation

Foxhound models human behavior using statistical distributions observed from real user sessions:

  • Timing uses Weibull and Gamma distributions (right-skewed, matching human reaction times), not uniform random
  • Mouse movements follow Bezier curves with natural acceleration/deceleration
  • Scroll patterns simulate reading speed with variable pause durations
  • Keyboard input uses per-key timing with realistic inter-keystroke intervals
  • Session fatigue: warmup slowdown at start, cruise speed mid-session, gradual fatigue buildup — with per-call noise to prevent smooth-curve detection
  • Per-session jitter: all behavior parameters are perturbed ±15% to prevent anti-bot ML from clustering sessions into discrete archetypes

Three built-in profiles ("careful", "moderate", "aggressive") control the overall pacing. Configure via BehaviorConfig or Hunt options.
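
A sketch selecting a preset and attaching it to a Hunt, using the CarefulProfile and Jitter calls from the Quick Start:

profile := behavior.CarefulProfile().Jitter() // preset pacing with ±15% per-session variance
h := engine.NewHunt(engine.HuntConfig{
    Name:            "paced",
    BehaviorProfile: profile,
    // ... Fetcher, Queue, Processor
})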

NopeCHA CAPTCHA Solving

The NopeCHA browser extension is automatically downloaded from GitHub and loaded into Camoufox on first launch. It solves reCAPTCHA, hCAPTCHA, and Cloudflare Turnstile challenges without API keys. The extension is cached at ~/.cache/foxhound/extensions/nopecha/ and updated automatically.

The design philosophy: the goal is to never trigger a CAPTCHA. If one appears, earlier layers (identity, timing, proxy rotation) failed. NopeCHA is the safety net, not the primary strategy.

Disable with extension_path: "none" in config or WithExtensionPath("none").

Middleware Chain

Foxhound provides 13 middleware layers that wrap the fetcher:

  • Rate limiting (token bucket per domain)
  • Request deduplication (URL + method fingerprint)
  • Autothrottle (adaptive delay based on response times)
  • Cookie persistence (file-backed or in-memory jar)
  • Referer chain (natural browsing simulation)
  • Blocked response detection (403/429/503/CAPTCHA triggers)
  • Redirect following (with loop detection)
  • Depth limiting (max crawl depth from seed)
  • Retry with exponential backoff
  • Delta-fetch (skip unchanged pages via ETag/Last-Modified)
  • Circuit breaker (3-state FSM: closed → open → half-open)
  • Metrics collection (Prometheus counters)
  • Robots.txt compliance

Middleware is composable: each layer wraps a Fetcher and returns a Fetcher, so you can stack them in any order or add custom middleware.
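
A minimal custom layer, sketched with the MiddlewareFunc and FetcherFunc adapters documented below (log fields are illustrative; note that FetcherFunc.Close is a no-op, so Close does not propagate to the wrapped fetcher):

timing := foxhound.MiddlewareFunc(func(next foxhound.Fetcher) foxhound.Fetcher {
    return foxhound.FetcherFunc(func(ctx context.Context, job *foxhound.Job) (*foxhound.Response, error) {
        start := time.Now()
        resp, err := next.Fetch(ctx, job)
        slog.Info("fetch done", "url", job.URL, "elapsed", time.Since(start))
        return resp, err
    })
})
fetcher := timing.Wrap(base) // base is any Fetcher, e.g. fetch.NewStealth()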

Adaptive Selectors

Websites frequently change their DOM structure — class names rotate, IDs are randomized, layouts shift. Foxhound's adaptive selector system survives these rewrites by building element signatures (tag, position, text patterns, ancestor structure) alongside CSS selectors. When a selector stops matching, the system falls back to similarity matching against saved signatures.

Enable with Hunt.WithAdaptive and use via Response.Adaptive, Response.CSSAdaptive, Response.CSSAdaptiveAll, or Trail.Adaptive. Signatures can be stored in JSON files or SQLite.

Example: Hunt Campaign

A Hunt is the standard way to scrape at scale. Define a Processor, configure middleware and writers, add seed URLs, and run:

hunt := engine.NewHunt("bookstore",
    engine.WithDomain("books.toscrape.com"),
    engine.WithWalkers(4),
    engine.WithProcessor(foxhound.ProcessorFunc(func(ctx context.Context, resp *foxhound.Response) (*foxhound.Result, error) {
        result := &foxhound.Result{}
        titles := resp.CSS("h3 a").Texts()
        prices := resp.CSS(".price_color").Texts()
        for i, title := range titles {
            item := foxhound.NewItem()
            item.Set("title", title)
            if i < len(prices) {
                item.Set("price", prices[i])
            }
            result.Items = append(result.Items, item)
        }
        // Follow pagination links
        result.Jobs = resp.Follow("li.next a[href]")
        return result, nil
    })),
)
hunt.AddSeed("https://books.toscrape.com/")
huntResult, err := hunt.Run(ctx)

Example: Trail Navigation

Trails describe multi-step browser interactions for JS-heavy pages. This example searches Google Maps and scrolls through results:

trail := engine.NewTrail("maps-search").
    Navigate("https://www.google.com/maps").
    Fill("input#searchboxinput", "cafe in canggu").
    Click("button#searchbox-searchbutton").
    WaitOptional("div[role='feed']", 10*time.Second).
    InfiniteScrollInUntil("div[role='feed']", "div.Nv2PK", 20, 100).
    Evaluate("() => document.querySelectorAll('.Nv2PK').length")

jobs := trail.ToJobs()

Example: Session (Ad-Hoc Scraping)

Session is the lightweight alternative to Hunt for quick, stateful fetches. Cookies persist across calls, and the identity stays consistent:

sess := foxhound.NewSession(
    foxhound.WithSessionFetcher(fetch.NewStealth()),
    foxhound.WithSessionIdentity(identity.Generate()),
    foxhound.WithSessionProxy("http://user:pass@proxy.example:8080"),
)
defer sess.Close()

resp, err := sess.Get(ctx, "https://example.com/login")
// cookies from login response are automatically persisted
resp2, err := sess.Get(ctx, "https://example.com/dashboard")

Example: CSS and XPath Selectors

Response provides built-in CSS and XPath querying without importing the parse package:

// Single element
title := resp.CSS("h1.title").Text()
price := resp.CSS("span.price").Text()
image := resp.CSS("img.product").Attr("src")

// Multiple elements
allTitles := resp.CSS("h3 a").Texts()
allLinks  := resp.CSS("a.product[href]").Attrs("href")
count     := resp.CSS("div.result").Len()

// XPath (subset converted to CSS internally)
author := resp.XPath("//span[@class='author']")

Response.Follow extracts links from the page and generates follow-up Jobs:

// Follow all product links, route to a different handler
jobs := resp.Follow("a.product-link[href]",
    foxhound.WithFollowCallback("parseProduct"),
    foxhound.WithFollowReferer(true),
)

// Follow a single known URL
nextPage := resp.FollowURL("/api/products?page=2")

// Follow all anchor links on the page
allJobs := resp.FollowAll()

Example: XHR/Fetch Capture

Capture background API calls that JavaScript makes after page load. This is essential for SPAs where data loads via XHR/fetch, not in the initial HTML:

trail := engine.NewTrail("api-capture").
    Navigate("https://example.com/app").
    CaptureXHR("*/api/v2/products*").
    Click("button.load-data").
    Wait("div.results", 5*time.Second)

The captured responses are available in Response.CapturedXHR as a slice of maps with keys: request_url, request_method, status, headers, body.
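
Consuming the captures (a sketch; the concrete Go type of the body value is an assumption, so type-assert defensively):

for _, call := range resp.CapturedXHR {
    fmt.Println(call["request_url"], call["request_method"], call["status"])
    if body, ok := call["body"].(string); ok { // string body is an assumption
        fmt.Println(len(body), "bytes")
    }
}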

Example: Cloudflare Solve

For sites behind Cloudflare's JavaScript challenge, Foxhound can detect and wait for the challenge to complete:

fetcher, _ := fetch.NewCamoufox(
    fetch.WithSolveCloudflare(30 * time.Second),
)
// resp.CloudflareSolved is true when the challenge was detected and solved.
// Verification checks: cf_clearance cookie, absence of Turnstile DOM markers,
// and a non-empty cf-turnstile-response token.

Example: Multi-Session Campaigns

Route different jobs through different identities and proxies within a single Hunt:

indexSession := foxhound.NewSession(
    foxhound.WithSessionFetcher(fetch.NewStealth()),
    foxhound.WithSessionProxy("http://proxy-a:8080"),
)
browserFetcher, _ := fetch.NewCamoufox() // two-value return, as in the Quick Start
detailSession := foxhound.NewSession(
    foxhound.WithSessionFetcher(browserFetcher),
    foxhound.WithSessionProxy("http://proxy-b:8080"),
)

hunt := engine.NewHunt("multi-session", /* ... */)
hunt.AddSession("index", indexSession)
hunt.AddSession("detail", detailSession)

// Jobs with SessionID "index" use indexSession's fetcher and proxy;
// jobs with SessionID "detail" use detailSession's browser.

Example: Development Mode

Cache responses on disk for zero-network iteration during development:

hunt := engine.NewHunt("dev",
    engine.WithDevelopmentMode("./dev-cache"),
    // ... other options
)
// First run: fetches from network, saves responses to ./dev-cache/
// Subsequent runs: replays cached responses instantly

Sub-Packages

The foxhound module is organized into focused sub-packages:

  • [engine] — Hunt, Trail, Walker, scheduler, retry logic, stats collection, and ItemList for thread-safe item accumulation with CSV/JSON/JSONL export.

  • [fetch] — Stealth HTTP client (TLS fingerprinting + header ordering), Camoufox browser automation (Juggler protocol), Smart router (auto-escalation), XHR capture, page pool management, domain risk scoring, and SOCKS5 auth relay.

  • [identity] — Profile generation with 60 embedded device profiles. Produces consistent identity bundles (UA, TLS, headers, OS, hardware, screen, locale, geo) and Camoufox fingerprint configs.

  • [behavior] — Human behavior simulation: timing (Weibull/Gamma distributions), mouse (Bezier curves), scroll patterns, keyboard input, navigation profiles, and session fatigue modeling.

  • [middleware] — 13 composable middleware layers: rate limiting, dedup, retry, autothrottle, cookies, referer, redirect, depth, delta-fetch, circuit breaker, metrics, blocked detection, and robots.txt.

  • [parse] — Content extraction: CSS (goquery), JSON (dot-path), XPath (subset), regex, structured schema, Markdown/text conversion, metadata (JSON-LD, OpenGraph, NextData, NuxtData), contact deobfuscation, sitemap/feed parsing, adaptive selectors, HTML table extraction, JS preload detection, directory listings, pagination detection, and auto-detection with Readability-style scoring.

  • [pipeline] — Item processing stages: validation, cleaning, deduplication, field transformation (regex, rename, type coercion), and chain composition.

  • pipeline/export — Output writers: JSON, JSONL, CSV, XML, SQLite, PostgreSQL, Markdown, Text, and Webhook.

  • [proxy] — Proxy pool management with geo-aware selection, health checking, cooldown tracking, and provider adapters (BrightData, Oxylabs, Smartproxy).

  • [queue] — Job queue implementations: in-memory (heap-based priority queue), Redis (sorted sets), and SQLite (persistent).

  • [cache] — Response caching: in-memory (LRU + TTL), file-based (SHA256 keys), Redis, and SQLite.

  • [captcha] — CAPTCHA detection (Cloudflare, reCAPTCHA, hCAPTCHA, GeeTest) and solving via NopeCHA, CapSolver, 2Captcha, and Turnstile.

  • [monitor] — Observability: atomic stat counters, Prometheus metrics (isolated registry), and webhook-based alerting rules.

  • cmd/foxhound — CLI tool: init, run, check, proxy-test, shell, browser-shell, resume, curl2fox, and preview commands.

Index

Constants

const (
	JobStepNavigate       = 0
	JobStepClick          = 1
	JobStepWait           = 2
	JobStepExtract        = 3
	JobStepScroll         = 4
	JobStepInfiniteScroll = 5  // scroll to bottom until no new content loads
	JobStepLoadMore       = 6  // click "load more" button repeatedly until gone
	JobStepPaginate       = 7  // detect and follow pagination links
	JobStepEvaluate       = 8  // execute custom JavaScript on the page
	JobStepFill           = 9  // type text into input field with human-like keystrokes
	JobStepCollect        = 10 // collect URLs from page into pool
)

Step action constants for JobStep. These are package-level int constants (not engine.StepAction) to avoid an import cycle between foxhound ↔ engine.
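
For example, a hand-built fill step (Trail.ToJobs normally generates these; job here is an existing *foxhound.Job):

step := foxhound.JobStep{
    Action:   foxhound.JobStepFill, // type text with human-like keystrokes
    Selector: "input#searchboxinput",
    Value:    "restaurant in bali",
}
job.Steps = append(job.Steps, step)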

Variables

This section is empty.

Functions

func RegisterAdaptiveHooks

func RegisterAdaptiveHooks(
	extractText func(extractor any, body []byte, name string) string,
	register func(extractor any, body []byte, name, selector string, all bool),
)

RegisterAdaptiveHooks is called by the parse package to wire its AdaptiveExtractor implementation into Response.Adaptive / CSSAdaptive.

func RegisterHTMLSelectors

func RegisterHTMLSelectors(
	textsFunc func(body []byte, selector string) []string,
	attrsFunc func(body []byte, selector, attr string) []string,
	countFunc func(body []byte, selector string) int,
	xpathFunc func(expr string) string,
)

RegisterHTMLSelectors is called by the parse package to provide the HTML selection implementations used by Response.CSS() and Response.XPath().

func SetupLogging

func SetupLogging(cfg LoggingConfig, verbose int)

SetupLogging configures the global slog logger from a LoggingConfig. The verbose parameter overrides the config level:

0 = use config level (default info)
1 = debug  (-v)
2 = debug with source location (-vv)
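
For example, honouring a -v flag after loading configuration (a sketch using LoadConfig from this package):

cfg, err := foxhound.LoadConfig("config.yaml")
if err != nil {
    log.Fatal(err)
}
foxhound.SetupLogging(cfg.Logging, 1) // 1 = debug (-v); 0 keeps the config level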

Types

type AlertingExportConfig

type AlertingExportConfig struct {
	WebhookURL         string   `yaml:"webhook_url"`
	ErrorRateThreshold float64  `yaml:"error_rate_threshold"`
	BlockRateThreshold float64  `yaml:"block_rate_threshold"`
	Cooldown           Duration `yaml:"cooldown"`
}

AlertingExportConfig configures webhook alerting.

type AutoThrottleMiddlewareConfig

type AutoThrottleMiddlewareConfig struct {
	Enabled           bool     `yaml:"enabled"`
	TargetConcurrency float64  `yaml:"target_concurrency"`
	InitialDelay      Duration `yaml:"initial_delay"`
	MinDelay          Duration `yaml:"min_delay"`
	MaxDelay          Duration `yaml:"max_delay"`
}

AutoThrottleMiddlewareConfig configures the adaptive per-domain throttle.

type BehaviorConfig

type BehaviorConfig struct {
	// Profile selects the preset behavior profile: "careful", "moderate", or
	// "aggressive". Defaults to "moderate" when unset.
	Profile string `yaml:"profile"`
}

BehaviorConfig configures the human-simulation behavior profile for walkers.

type BrowserFetchConfig

type BrowserFetchConfig struct {
	Timeout       Duration `yaml:"timeout"`
	BlockImages   bool     `yaml:"block_images"`
	BlockWebRTC   bool     `yaml:"block_webrtc"`
	Headless      string   `yaml:"headless"`
	Instances     int      `yaml:"instances"`
	ExtensionPath string   `yaml:"extension_path"` // path to extension dir/xpi, or "none" to disable NopeCHA auto-load
}

BrowserFetchConfig configures the Camoufox browser.

type CacheConfig

type CacheConfig struct {
	Backend string   `yaml:"backend"` // "memory" | "file" | "sqlite" | "redis" | "" (disabled)
	TTL     Duration `yaml:"ttl"`
	MaxSize int      `yaml:"max_size"` // max entries for memory cache
}

CacheConfig configures response caching.

type CaptchaConfig

type CaptchaConfig struct {
	Enabled  bool   `yaml:"enabled"`
	Provider string `yaml:"provider"` // "capsolver" | "twocaptcha" | "nopecha"
	APIKey   string `yaml:"api_key"`
}

CaptchaConfig configures CAPTCHA detection and solving.

type CleanConfig

type CleanConfig struct {
	TrimWhitespace bool `yaml:"trim_whitespace"`
	NormalizePrice bool `yaml:"normalize_price"`
	NormalizeDate  bool `yaml:"normalize_date"`
}

CleanConfig configures the cleaning pipeline stage.

type ConcurrencyConfig

type ConcurrencyConfig struct {
	PerDomain int `yaml:"per_domain"` // max concurrent requests per domain (default 2)
}

ConcurrencyConfig limits concurrent in-flight requests per domain.

type Config

type Config struct {
	Hunt        HuntConfig        `yaml:"hunt"`
	Identity    IdentityConfig    `yaml:"identity"`
	Proxy       ProxyConfig       `yaml:"proxy"`
	Fetch       FetchConfig       `yaml:"fetch"`
	Middleware  MiddlewareConfig  `yaml:"middleware"`
	Pipeline    []PipelineEntry   `yaml:"pipeline"`
	Queue       QueueConfig       `yaml:"queue"`
	Cache       CacheConfig       `yaml:"cache"`
	Monitor     MonitorConfig     `yaml:"monitor"`
	Captcha     CaptchaConfig     `yaml:"captcha"`
	Logging     LoggingConfig     `yaml:"logging"`
	Behavior    BehaviorConfig    `yaml:"behavior"`
	PageActions PageActionsConfig `yaml:"page_actions"`
}

Config is the top-level configuration for a Foxhound instance.

func LoadConfig

func LoadConfig(path string) (*Config, error)

LoadConfig reads and parses a YAML configuration file.

type DedupConfig

type DedupConfig struct {
	Strategy string `yaml:"strategy"`
	Store    string `yaml:"store"`
}

DedupConfig configures URL deduplication.

type DeltaFetchConfig

type DeltaFetchConfig struct {
	Enabled  bool     `yaml:"enabled"`
	Strategy string   `yaml:"strategy"` // "skip_seen" | "skip_recent"
	TTL      Duration `yaml:"ttl"`
	Store    string   `yaml:"store"` // "memory" | "redis" | "sqlite"
}

DeltaFetchConfig configures cross-run URL deduplication.

type DepthLimitConfig

type DepthLimitConfig struct {
	Max int `yaml:"max"`
}

DepthLimitConfig configures crawl depth limiting.

type DownloadDelayConfig

type DownloadDelayConfig struct {
	Enabled   bool              `yaml:"enabled"`
	Default   Duration          `yaml:"default"`   // base delay between same-domain requests
	Domains   map[string]string `yaml:"domains"`   // per-domain delay overrides (domain -> duration string)
	Randomize bool              `yaml:"randomize"` // add ±25% jitter
}

DownloadDelayConfig configures per-domain download delays.

type Duration

type Duration struct {
	time.Duration
}

Duration is a time.Duration that supports YAML marshaling.

func (Duration) MarshalYAML

func (d Duration) MarshalYAML() (any, error)

MarshalYAML serializes the duration as a string.

func (*Duration) UnmarshalYAML

func (d *Duration) UnmarshalYAML(value *yaml.Node) error

UnmarshalYAML parses a duration string like "30s", "5m", "1h".

type ExportConfig

type ExportConfig struct {
	Type      string `yaml:"type"`
	Path      string `yaml:"path,omitempty"`
	Table     string `yaml:"table,omitempty"`
	UpsertKey string `yaml:"upsert_key,omitempty"`
	BatchSize int    `yaml:"batch_size,omitempty"`
}

ExportConfig defines an export destination.

type FetchConfig

type FetchConfig struct {
	Static  StaticFetchConfig  `yaml:"static"`
	Browser BrowserFetchConfig `yaml:"browser"`
}

FetchConfig configures the fetch layer.

type FetchMode

type FetchMode int

FetchMode indicates which fetcher to use for a request.

const (
	// FetchAuto lets the smart router decide between static and browser.
	FetchAuto FetchMode = iota
	// FetchStatic forces the TLS-impersonating HTTP client.
	FetchStatic
	// FetchBrowser forces the Camoufox browser.
	FetchBrowser
)

func (FetchMode) String

func (m FetchMode) String() string

String returns the string representation of a FetchMode.

type Fetcher

type Fetcher interface {
	// Fetch performs an HTTP request and returns the response.
	Fetch(ctx context.Context, job *Job) (*Response, error)
	// Close releases any resources held by the fetcher.
	Close() error
}

Fetcher defines the interface for making HTTP requests.

type FetcherFunc

type FetcherFunc func(ctx context.Context, job *Job) (*Response, error)

FetcherFunc is an adapter to allow use of ordinary functions as Fetchers.

func (FetcherFunc) Close

func (f FetcherFunc) Close() error

Close is a no-op to satisfy the Fetcher interface.

func (FetcherFunc) Fetch

func (f FetcherFunc) Fetch(ctx context.Context, job *Job) (*Response, error)

Fetch calls f(ctx, job).

type FollowOption

type FollowOption func(*followConfig)

FollowOption configures how Follow generates jobs from discovered links.

func WithFollowCallback

func WithFollowCallback(callback string) FollowOption

WithFollowCallback sets a callback name in Meta["callback"] for generated jobs, allowing spider-style routing of responses to different handlers.
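
A sketch of a processor routing on the callback name (resp.Job.Callback carries the value set by this option):

proc := foxhound.ProcessorFunc(func(ctx context.Context, resp *foxhound.Response) (*foxhound.Result, error) {
    switch resp.Job.Callback {
    case "parseProduct":
        // product-page extraction
        return &foxhound.Result{}, nil
    default:
        // listing-page extraction
        return &foxhound.Result{}, nil
    }
})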

func WithFollowDontFilter

func WithFollowDontFilter(dontFilter bool) FollowOption

WithFollowDontFilter marks generated jobs to skip deduplication filtering. Useful for pages that need to be re-fetched (e.g. pagination, monitoring).

func WithFollowMeta

func WithFollowMeta(meta map[string]any) FollowOption

WithFollowMeta sets metadata on generated follow-up jobs.

func WithFollowMode

func WithFollowMode(mode FetchMode) FollowOption

WithFollowMode sets the FetchMode for generated follow-up jobs.

func WithFollowPriority

func WithFollowPriority(p Priority) FollowOption

WithFollowPriority sets the Priority for generated follow-up jobs.

func WithFollowReferer

func WithFollowReferer(referer bool) FollowOption

WithFollowReferer sets the current response URL as referer in the generated job's Meta["referer"]. This maintains referer chain for natural browsing simulation.

type HuntConfig

type HuntConfig struct {
	Domain         string `yaml:"domain"`
	Walkers        int    `yaml:"walkers"`
	MaxConcurrency int    `yaml:"max_concurrency"` // global max concurrent requests (0 = walkers count)
}

HuntConfig configures the scraping campaign.

type IdentityConfig

type IdentityConfig struct {
	Browser       string   `yaml:"browser"`
	OS            []string `yaml:"os"`
	FingerprintDB string   `yaml:"fingerprint_db"`
}

IdentityConfig configures identity generation.

type Item

type Item struct {
	// Fields holds the extracted data as key-value pairs.
	Fields map[string]any
	// Meta carries metadata from the originating job.
	Meta map[string]any
	// URL is the source URL.
	URL string
	// Timestamp is when the item was created.
	Timestamp time.Time
}

Item represents a scraped data item passing through the pipeline.

func NewItem

func NewItem() *Item

NewItem creates a new Item with initialized fields.

func (*Item) Get

func (it *Item) Get(key string) (any, bool)

Get retrieves a field from the item.

func (*Item) GetFloat

func (it *Item) GetFloat(key string) float64

GetFloat returns the field value as float64. Accepts float64 and int/int64 stored in the Fields map. Returns 0 if the field is absent or non-numeric.

func (*Item) GetInt

func (it *Item) GetInt(key string) int

GetInt returns the field value as int. Accepts int, int64, and float64 stored in the Fields map (float64 is truncated). Returns 0 if the field is absent or non-numeric.

func (*Item) GetString

func (it *Item) GetString(key string) string

GetString returns the field value as a string. Returns "" if the field is absent or its underlying type is not string.

func (*Item) Has

func (it *Item) Has(key string) bool

Has reports whether the field exists and has a non-empty string representation. A field set to nil or "" is treated as absent.

func (*Item) Keys

func (it *Item) Keys() []string

Keys returns the item's field names in sorted (ascending) order.

func (*Item) Set

func (it *Item) Set(key string, value any)

Set sets a field on the item.

func (*Item) String

func (it *Item) String() string

String implements fmt.Stringer. It returns a compact JSON representation of the item fields, falling back to a key=value format on marshal error.

func (*Item) ToCSVRow

func (it *Item) ToCSVRow(columns []string) []string

ToCSVRow returns field values as a string slice following the given column order. Missing fields are returned as empty strings.

func (*Item) ToJSON

func (it *Item) ToJSON() ([]byte, error)

ToJSON returns item.Fields serialised as compact JSON bytes.

func (*Item) ToJSONPretty

func (it *Item) ToJSONPretty() ([]byte, error)

ToJSONPretty returns item.Fields serialised as indented JSON bytes.

func (*Item) ToMap

func (it *Item) ToMap() map[string]any

ToMap returns a shallow copy of item.Fields. Mutations to the returned map do not affect the Item.

func (*Item) ToMarkdown

func (it *Item) ToMarkdown() string

ToMarkdown returns a compact Markdown representation of the item as a bullet list: the first key (sorted) is bolded; the rest are appended.

func (*Item) ToText

func (it *Item) ToText() string

ToText returns a plain-text representation with one "key: value" line per field, fields in sorted order.

type Job

type Job struct {
	// ID is a unique identifier for this job.
	ID string
	// URL is the target URL to fetch.
	URL string
	// Method is the HTTP method (default GET).
	Method string
	// Headers are additional HTTP headers to include.
	Headers http.Header
	// Body is the request body for POST/PUT requests.
	Body []byte
	// FetchMode determines which fetcher to use.
	FetchMode FetchMode
	// Priority determines processing order.
	Priority Priority
	// MaxRetries overrides the default retry count.
	MaxRetries int
	// Meta is arbitrary metadata passed through the pipeline.
	Meta map[string]any
	// Depth is the crawl depth from the seed URL.
	Depth int
	// Domain is the target domain extracted from URL.
	Domain string
	// CreatedAt is when the job was created.
	CreatedAt time.Time
	// Steps are browser-side actions to execute after page load (optional).
	// When non-empty, the job requires a browser fetcher. The omitempty tag
	// ensures backward compatibility with existing queue serialization.
	Steps []JobStep `json:"steps,omitempty"`
	// NavigationTimeout overrides the fetcher's default navigation timeout
	// for this specific job. Useful for pages known to be slow (e.g. later
	// pagination pages on Google SERP). Zero means use the fetcher default.
	NavigationTimeout time.Duration `json:"navigation_timeout,omitempty"`
	// DontFilter when true skips deduplication for this specific job.
	// Useful for pages that need to be re-fetched (e.g. pagination, monitoring).
	DontFilter bool `json:"dont_filter,omitempty"`
	// Callback is an optional handler name that the spider routes to a
	// specific Parse method. When empty, the default processor is used.
	Callback string `json:"callback,omitempty"`
	// SessionID names a session previously registered with Hunt.AddSession.
	// When set, the walker routes this job through the named session's
	// fetcher (with its own cookie jar, identity, and proxy) instead of the
	// hunt's default fetcher. Empty (default) preserves backward-compatible
	// behaviour: the hunt's default fetcher is used.
	SessionID string `json:"session_id,omitempty"`
}

Job represents a unit of work to be processed by the engine.

type JobStep

type JobStep struct {
	// Action identifies the step type (JobStepClick, JobStepWait, etc.).
	// Zero value (JobStepNavigate) is intentionally NOT omitempty so it
	// always serializes.
	Action int `json:"action"`
	// Selector is the CSS selector for Click, Wait, and Extract steps.
	Selector string `json:"selector,omitempty"`
	// Duration is the timeout for Wait steps.
	Duration time.Duration `json:"duration,omitempty"`
	// ScrollAxis is 0 for vertical, 1 for horizontal (only for Scroll steps).
	ScrollAxis int `json:"scroll_axis,omitempty"`
	// ScrollExtent is the target scroll distance in pixels. Defaults to 3000
	// when zero.
	ScrollExtent int `json:"scroll_extent,omitempty"`
	// ScrollMode is 0 for ScrollReading, 1 for ScrollScan. Zero value
	// (omitted in JSON) defaults to ScrollReading.
	ScrollMode int `json:"scroll_mode,omitempty"`
	// MaxScrolls is the maximum number of scroll-to-bottom iterations for
	// InfiniteScroll steps. Defaults to 50 when zero.
	MaxScrolls int `json:"max_scrolls,omitempty"`
	// MaxClicks is the maximum number of "load more" button clicks for
	// LoadMore steps. Defaults to 20 when zero.
	MaxClicks int `json:"max_clicks,omitempty"`
	// MaxPages is the maximum number of pagination pages to follow for
	// Paginate steps. Defaults to 10 when zero.
	MaxPages int `json:"max_pages,omitempty"`
	// Script is the JavaScript code to execute for Evaluate steps.
	Script string `json:"script,omitempty"`
	// WaitState specifies what state to wait for in Wait steps:
	// "attached" (default), "detached", "visible", or "hidden".
	// Maps to playwright's WaitForSelectorState.
	WaitState string `json:"wait_state,omitempty"`
	// Optional marks this step as non-fatal: if it fails, execution continues
	// instead of aborting the fetch. Useful for steps that may not always be
	// present on the page (e.g. a cookie banner dismiss button).
	Optional bool `json:"optional,omitempty"`
	// StopSelector is a CSS selector that signals InfiniteScroll to stop
	// when the target element count is reached. Used with StopCount to scroll
	// until N items exist (e.g. "div.result" + StopCount=20).
	StopSelector string `json:"stop_selector,omitempty"`
	// StopCount is the target element count for StopSelector. InfiniteScroll
	// stops when document.querySelectorAll(StopSelector).length >= StopCount.
	// Only used when StopSelector is set. Defaults to 1 when zero.
	StopCount int `json:"stop_count,omitempty"`
	// ScrollWait is the duration to wait after each scroll iteration before
	// checking for new content. Defaults to 2s when zero. Increase for slow
	// sites like Google Maps (3-5s recommended).
	ScrollWait time.Duration `json:"scroll_wait,omitempty"`
	// Value is the text to type into an input field for Fill steps.
	Value string `json:"value,omitempty"`
}

JobStep is a single browser-side action that should be executed after the page loads. Steps are attached to a Job by Trail.ToJobs() and executed by the CamoufoxFetcher before content extraction.

type LoggingConfig

type LoggingConfig struct {
	Level  string `yaml:"level"`
	Format string `yaml:"format"`
	Output string `yaml:"output"`
}

LoggingConfig configures structured logging.

type MetricsExportConfig

type MetricsExportConfig struct {
	Enabled bool `yaml:"enabled"`
	Port    int  `yaml:"port"`
}

MetricsExportConfig configures Prometheus metrics.

type Middleware

type Middleware interface {
	// Wrap takes a Fetcher and returns a wrapped Fetcher.
	Wrap(next Fetcher) Fetcher
}

Middleware wraps a Fetcher to add cross-cutting behavior.

type MiddlewareConfig

type MiddlewareConfig struct {
	RateLimit     RateLimitConfig              `yaml:"ratelimit"`
	AutoThrottle  AutoThrottleMiddlewareConfig `yaml:"autothrottle"`
	Dedup         DedupConfig                  `yaml:"dedup"`
	DeltaFetch    DeltaFetchConfig             `yaml:"deltafetch"`
	RobotsTxt     RobotsTxtConfig              `yaml:"robots_txt"`
	DepthLimit    DepthLimitConfig             `yaml:"depth_limit"`
	Concurrency   ConcurrencyConfig            `yaml:"concurrency"`
	DownloadDelay DownloadDelayConfig          `yaml:"download_delay"`
}

MiddlewareConfig configures request/response processing middleware.

type MiddlewareFunc

type MiddlewareFunc func(next Fetcher) Fetcher

MiddlewareFunc is an adapter for using functions as Middleware.

func (MiddlewareFunc) Wrap

func (f MiddlewareFunc) Wrap(next Fetcher) Fetcher

Wrap calls f(next).

type MonitorConfig

type MonitorConfig struct {
	Metrics  MetricsExportConfig  `yaml:"metrics"`
	Alerting AlertingExportConfig `yaml:"alerting"`
}

MonitorConfig configures observability.

type PageActionsConfig

type PageActionsConfig struct {
	Scripts []string `yaml:"scripts"` // JS snippets to run after page load
}

PageActionsConfig configures JavaScript execution after page load.

type Pipeline

type Pipeline interface {
	// Process takes an item and returns the (possibly modified) item.
	// Return nil to drop the item. Return an error to log and continue.
	Process(ctx context.Context, item *Item) (*Item, error)
}

Pipeline processes items after extraction.

type PipelineEntry

type PipelineEntry struct {
	Validate *ValidateConfig `yaml:"validate,omitempty"`
	Clean    *CleanConfig    `yaml:"clean,omitempty"`
	Dedup    *DedupConfig    `yaml:"dedup,omitempty"`
	Export   []ExportConfig  `yaml:"export,omitempty"`
}

PipelineEntry is a polymorphic pipeline stage definition.

type PipelineFunc

type PipelineFunc func(ctx context.Context, item *Item) (*Item, error)

PipelineFunc is an adapter for using functions as Pipeline stages.

func (PipelineFunc) Process

func (f PipelineFunc) Process(ctx context.Context, item *Item) (*Item, error)

Process calls f(ctx, item).
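
A sketch of a drop-filter stage built with this adapter (the field name is illustrative):

dropCheap := foxhound.PipelineFunc(func(ctx context.Context, item *foxhound.Item) (*foxhound.Item, error) {
    if item.GetFloat("price") < 1.0 {
        return nil, nil // returning nil drops the item
    }
    return item, nil
})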

type Priority

type Priority int

Priority represents job priority in the queue.

const (
	PriorityLow    Priority = 0
	PriorityNormal Priority = 5
	PriorityHigh   Priority = 10
)

type Processor

type Processor interface {
	// Process takes a response and returns extracted items and new jobs.
	Process(ctx context.Context, resp *Response) (*Result, error)
}

Processor defines the user-provided logic for handling responses. This is the main extension point: users implement this to extract data.

type ProcessorFunc

type ProcessorFunc func(ctx context.Context, resp *Response) (*Result, error)

ProcessorFunc is an adapter to allow use of ordinary functions as Processors.

func (ProcessorFunc) Process

func (f ProcessorFunc) Process(ctx context.Context, resp *Response) (*Result, error)

Process calls f(ctx, resp).

type ProviderEntry

type ProviderEntry struct {
	Type     string   `yaml:"type"`
	List     []string `yaml:"list,omitempty"`
	APIKey   string   `yaml:"api_key,omitempty"`
	Username string   `yaml:"username,omitempty"`
	Password string   `yaml:"password,omitempty"`
	Product  string   `yaml:"product,omitempty"`
	Country  string   `yaml:"country,omitempty"`
}

ProviderEntry defines a proxy provider in configuration.

type ProxyConfig

type ProxyConfig struct {
	Providers           []ProviderEntry `yaml:"providers"`
	Rotation            string          `yaml:"rotation"`
	Cooldown            Duration        `yaml:"cooldown"`
	MaxRequestsPerProxy int             `yaml:"max_requests_per_proxy"`
	HealthCheckInterval Duration        `yaml:"health_check_interval"`
}

ProxyConfig configures proxy management.

type Queue

type Queue interface {
	// Push adds a job to the queue.
	Push(ctx context.Context, job *Job) error
	// Pop removes and returns the highest priority job. Blocks until available
	// or context is cancelled.
	Pop(ctx context.Context) (*Job, error)
	// Len returns the number of jobs in the queue.
	Len() int
	// Close releases queue resources.
	Close() error
}

Queue defines the interface for job storage and retrieval.

type QueueConfig

type QueueConfig struct {
	Backend string `yaml:"backend"`
}

QueueConfig configures the job queue backend.

type RateLimitConfig

type RateLimitConfig struct {
	Enabled        bool    `yaml:"enabled"`
	RequestsPerSec float64 `yaml:"requests_per_sec"`
	BurstSize      int     `yaml:"burst_size"`
}

RateLimitConfig configures per-domain rate limiting.

type Response

type Response struct {
	// StatusCode is the HTTP status code.
	StatusCode int
	// Headers are the response headers.
	Headers http.Header
	// Body is the response body bytes.
	Body []byte
	// URL is the final URL after redirects.
	URL string
	// FetchMode indicates which fetcher was used.
	FetchMode FetchMode
	// Duration is how long the fetch took.
	Duration time.Duration
	// Job is the original job that produced this response.
	Job *Job
	// StepResults holds return values from JobStepEvaluate steps, keyed by
	// step index (e.g. "step_0", "step_2"). Only populated when steps
	// produce output.
	StepResults map[string]any
	// CapturedXHR holds captured XHR/fetch responses when capture patterns are configured.
	// Each entry is a map with keys: request_url, request_method, status, headers, body.
	CapturedXHR []map[string]any
	// Cookies contains cookies set by the response (Set-Cookie headers for
	// static fetches, browser context cookies for browser fetches).
	Cookies []*http.Cookie `json:"cookies,omitempty"`
	// CloudflareSolved is true when a Cloudflare Turnstile / JS challenge was
	// detected and verified as solved before the response was returned. The
	// verification checks for the cf_clearance cookie, absence of Turnstile
	// DOM markers, and a non-empty cf-turnstile-response token. Only set when
	// the browser fetcher was launched with WithSolveCloudflare.
	CloudflareSolved bool `json:"cloudflare_solved,omitempty"`
	// contains filtered or unexported fields
}

Response wraps an HTTP response with additional metadata.
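Inside a Processor, the Trail-related fields can be inspected directly; the step key and map keys below follow the field comments above:

if v, ok := resp.StepResults["step_0"]; ok {
	fmt.Println("evaluate result:", v)
}
for _, x := range resp.CapturedXHR {
	fmt.Println("captured:", x["request_url"], x["status"])
}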

func (*Response) Adaptive

func (r *Response) Adaptive(name string) string

Adaptive returns the text of a registered adaptive selector by name. Falls back to similarity matching if the primary CSS selector finds nothing on the current page. Returns an empty string when no extractor is attached or no element is matched.

Requires Hunt.WithAdaptive(...) to have been called.

func (*Response) AdaptiveExtractor

func (r *Response) AdaptiveExtractor() any

AdaptiveExtractor returns the attached extractor as an opaque value. Callers in the parse package can type-assert it to *parse.AdaptiveExtractor. Returns nil when no extractor is configured for this response.

func (*Response) CSS

func (r *Response) CSS(selector string) *Selection

CSS returns a Selection bound to this Response. Subsequent calls share the same parsed document, making it efficient to chain multiple selectors:

title := resp.CSS("h1").Text()
links := resp.CSS("a[href]").Attrs("href")

func (*Response) CSSAdaptive

func (r *Response) CSSAdaptive(selector, name string) *Selection

CSSAdaptive runs a CSS selector against this response, registering it as an adaptive selector under the given name. On future runs, if the selector no longer matches, similarity matching falls back to the saved signature.

Requires Hunt.WithAdaptive(...) to have been called. Returns a Selection supporting .Text(), .Attr(), .Texts(), and .Attrs(); when no extractor is configured, returns a Selection backed by a plain CSS query (degraded behaviour).

func (*Response) CSSAdaptiveAll

func (r *Response) CSSAdaptiveAll(selector, name string) *Selection

CSSAdaptiveAll is like CSSAdaptive but registers the selector for multi-element extraction. Returns a Selection that, when queried via .Texts() / .Attrs(), yields all matches.
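Assuming Hunt.WithAdaptive(...) is configured, the register-then-recover flow looks roughly like this (selector strings and names are illustrative):

// First run: register selectors and record element signatures.
price := resp.CSSAdaptive("span.price", "price").Text()
names := resp.CSSAdaptiveAll("div.card h3", "names").Texts()

// Later runs: similarity matching recovers the elements even if
// the original selectors no longer match.
price = resp.Adaptive("price")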

func (*Response) Follow

func (r *Response) Follow(selector string, opts ...FollowOption) []*Job

Follow generates follow-up Jobs from all links matching the CSS selector in the response body. Links are resolved relative to the response URL, deduplicated, and filtered to HTTP(S) schemes. The generated jobs inherit Depth+1 from the originating job.

Example:

jobs := resp.Follow("a.product-link[href]", foxhound.WithFollowCallback("parseProduct"))

func (*Response) FollowAll

func (r *Response) FollowAll(opts ...FollowOption) []*Job

FollowAll generates follow-up Jobs for all anchor links (a[href]) in the response body. It is shorthand for Follow("a[href]", opts...).

func (*Response) FollowURL

func (r *Response) FollowURL(rawURL string, opts ...FollowOption) *Job

FollowURL creates a single follow-up Job for the given URL. The URL is resolved relative to the response URL. The generated job inherits Depth+1 from the originating job.

Unlike Follow, which extracts links from HTML via CSS selectors, FollowURL is for programmatically following a known URL (e.g. an API endpoint or a URL extracted from JSON data).

Example:

nextPage := resp.FollowURL("/api/products?page=2", foxhound.WithFollowReferer(true))

func (*Response) IsSuccess

func (r *Response) IsSuccess() bool

IsSuccess returns true when the HTTP status code indicates success (2xx).

func (*Response) SetAdaptiveExtractor

func (r *Response) SetAdaptiveExtractor(ae any)

SetAdaptiveExtractor attaches a *parse.AdaptiveExtractor to this response. Walker calls this before invoking the user processor when Hunt.WithAdaptive(...) is configured. The argument is typed as any to avoid an import cycle; pass a *parse.AdaptiveExtractor.

func (*Response) TextBody

func (r *Response) TextBody() string

TextBody returns the response body as a string.

func (*Response) XPath

func (r *Response) XPath(expr string) string

XPath evaluates a simplified XPath expression against the response body and returns the first matching element's text. See parse.XPath for supported syntax.

func (*Response) XPathAll

func (r *Response) XPathAll(expr string) []string

XPathAll evaluates a simplified XPath expression and returns text content of all matching elements.
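For example (expressions illustrative; see parse.XPath for the supported subset):

heading := resp.XPath("//h1")
cells := resp.XPathAll("//table//td")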

type Result

type Result struct {
	// Items are the extracted data items.
	Items []*Item
	// Jobs are new jobs to enqueue (discovered links, pagination, etc.).
	Jobs []*Job
}

Result is the output of processing a job. It contains scraped items and optionally new jobs to enqueue (for crawling).

type RobotsTxtConfig

type RobotsTxtConfig struct {
	Enabled bool `yaml:"enabled"`
}

RobotsTxtConfig configures robots.txt compliance.

type Selection

type Selection struct {
	// contains filtered or unexported fields
}

Selection represents a CSS selector applied to an HTML body. It provides convenience methods for extracting text, attributes, and HTML from matched elements without requiring the user to import the parse package directly.

func (*Selection) Attr

func (s *Selection) Attr(attr string) string

Attr returns an attribute value from the first matching element.

func (*Selection) Attrs

func (s *Selection) Attrs(attr string) []string

Attrs returns attribute values from all matching elements.

func (*Selection) Len

func (s *Selection) Len() int

Len returns the number of elements matching the selector.

func (*Selection) Text

func (s *Selection) Text() string

Text returns the trimmed text content of the first element matching the selector.

func (*Selection) Texts

func (s *Selection) Texts() []string

Texts returns the trimmed text content of all elements matching the selector.
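Because all calls share one parsed document (see Response.CSS above), running several selectors against the same response stays cheap; the selectors below are illustrative:

items := resp.CSS("ul.products li")
fmt.Println(items.Len(), "products found")
names := resp.CSS("ul.products li h3").Texts()
links := resp.CSS("ul.products li a").Attrs("href")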

type Selector

type Selector struct {
	// contains filtered or unexported fields
}

Selector provides CSS-selector-based querying on the Response body. It wraps a lazily-parsed HTML document so multiple CSS/XPath calls share the same parse result.

func (*Selector) CSS

func (s *Selector) CSS(selector string) *Selection

CSS returns a Selection for the given CSS selector.

func (*Selector) XPath

func (s *Selector) XPath(expr string) string

XPath evaluates a simplified XPath expression and returns the first match text.

func (*Selector) XPathAll

func (s *Selector) XPathAll(expr string) []string

XPathAll evaluates a simplified XPath expression and returns all match texts.

type Session

type Session struct {
	// contains filtered or unexported fields
}

Session is a stateful client whose state persists across calls. Cookies are persisted in an internal CookieJar; the identity, proxy, and fetcher are reused for every Get/Fetch.

Session is safe for concurrent use by multiple goroutines.

func NewSession

func NewSession(opts ...SessionOption) *Session

NewSession constructs a Session with the supplied options. A fresh in-memory cookie jar is created when no WithSessionCookieJar option is given.

func (*Session) Close

func (s *Session) Close() error

Close releases any resources held by the underlying fetcher. The cookie jar is in-memory and needs no cleanup. Safe to call multiple times.

func (*Session) Cookies

func (s *Session) Cookies() []*http.Cookie

Cookies returns all cookies currently held in the jar across every host. The returned slice is a fresh copy; mutating it does not affect the jar.

func (*Session) CookiesFor

func (s *Session) CookiesFor(rawURL string) []*http.Cookie

CookiesFor returns the cookies the jar would send for the given URL. This is the standard http.CookieJar query — use it when you need to inspect what the session has accumulated for a particular host.

func (*Session) Fetch

func (s *Session) Fetch(ctx context.Context, job *Job) (*Response, error)

Fetch executes a Job through the session's fetcher. Before the call, any cookies the jar holds for the target URL are merged into the job's headers (so static fetchers without their own jar still see them). After a successful fetch, any cookies returned in Response.Cookies are stored in the jar so the next call observes them.

func (*Session) Fetcher

func (s *Session) Fetcher() Fetcher

Fetcher returns the underlying fetcher. Returns nil if none was configured.

func (*Session) Get

func (s *Session) Get(ctx context.Context, rawURL string) (*Response, error)

Get is the simple fetch shorthand. It builds a Job with method GET, the session's identity, and the URL, then delegates to Fetch.
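Putting the pieces together — the fetcher wiring matches the WithSessionFetcher example below, and the URL is a placeholder:

s := foxhound.NewSession(foxhound.WithSessionFetcher(fetch.NewStealth()))
defer s.Close()

resp, err := s.Get(ctx, "https://example.com/account")
if err != nil {
	return err
}
_ = resp
// Cookies set by the response are now in the jar and will be sent
// automatically on the next Get/Fetch.
fmt.Println(len(s.Cookies()), "cookies accumulated")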

func (*Session) Identity

func (s *Session) Identity() any

Identity returns the configured identity profile (as `any`). Callers type-assert to *identity.Profile.

func (*Session) Name

func (s *Session) Name() string

Name returns the session's optional name. Returns empty string for stand-alone sessions; populated for sessions registered via Hunt.AddSession.

func (*Session) ProxyURL

func (s *Session) ProxyURL() string

ProxyURL returns the session's recorded proxy URL.

func (*Session) SetFetcher

func (s *Session) SetFetcher(f Fetcher)

SetFetcher updates the underlying fetcher post-construction. Useful when the fetcher needs to reference the Session's cookie jar (a chicken-and-egg problem solved by constructing the Session first, then the fetcher).
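A sketch of that ordering; any fetcher option that consumes session state is hypothetical here — use whatever option your fetcher actually exposes:

s := foxhound.NewSession()
// Construct the fetcher second so it can reference session state,
// e.g. via a (hypothetical) jar-sharing option on the fetcher.
f := fetch.NewStealth()
s.SetFetcher(f)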

func (*Session) SetName

func (s *Session) SetName(name string)

SetName updates the session's name. Used by Hunt.AddSession.

type SessionOption

type SessionOption func(*Session)

SessionOption configures a Session at construction time.

func WithSessionCookieJar

func WithSessionCookieJar(jar http.CookieJar) SessionOption

WithSessionCookieJar replaces the default in-memory jar with a caller-supplied implementation. Use this when persisting cookies across processes (e.g. via a custom file-backed jar) or when sharing a jar across sessions.
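For example, sharing a standard-library jar between two sessions (net/http/cookiejar is stdlib; the fetcher wiring mirrors the WithSessionFetcher example below):

jar, err := cookiejar.New(nil)
if err != nil {
	return err
}
shared := foxhound.WithSessionCookieJar(jar)
a := foxhound.NewSession(shared, foxhound.WithSessionFetcher(fetch.NewStealth()))
b := foxhound.NewSession(shared, foxhound.WithSessionFetcher(fetch.NewStealth()))
// a and b now observe each other's cookies.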

func WithSessionFetcher

func WithSessionFetcher(f Fetcher) SessionOption

WithSessionFetcher sets the session's fetcher. When omitted, the caller must register one explicitly via SetFetcher before the first Get / Fetch call; calling Get on a Session without a fetcher returns an error.

The default Session does NOT auto-create a stealth fetcher to avoid an import cycle from foxhound → fetch. Wire one up at the application layer:

s := foxhound.NewSession(foxhound.WithSessionFetcher(fetch.NewStealth()))

func WithSessionIdentity

func WithSessionIdentity(p any) SessionOption

WithSessionIdentity attaches an identity profile to the session. The value is stored as `any` to avoid an import cycle with the identity package; the caller passes a *identity.Profile and is responsible for using it on the fetcher (most fetchers accept it via their own option at construction).

func WithSessionProxy

func WithSessionProxy(rawURL string) SessionOption

WithSessionProxy records the session's proxy URL. The value is stored for inspection but is NOT auto-applied to the fetcher; configure the fetcher's own proxy option at construction time. This is intentional: a Session is a thin wrapper, not a fetcher factory.

type StaticFetchConfig

type StaticFetchConfig struct {
	Timeout        Duration `yaml:"timeout"`
	MaxIdleConns   int      `yaml:"max_idle_conns"`
	TLSImpersonate bool     `yaml:"tls_impersonate"`
}

StaticFetchConfig configures the TLS-impersonating HTTP client.

type ValidateConfig

type ValidateConfig struct {
	Required []string `yaml:"required"`
}

ValidateConfig configures the validation pipeline stage.

type Writer

type Writer interface {
	// Write outputs an item to the destination.
	Write(ctx context.Context, item *Item) error
	// Flush ensures all buffered items are written.
	Flush(ctx context.Context) error
	// Close releases writer resources.
	Close() error
}

Writer defines the interface for exporting scraped items.
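Ready-made writers live in pipeline/export. A minimal custom writer only needs the three methods — here, a sketch that counts items and does no buffering:

type countWriter struct{ n int }

func (w *countWriter) Write(ctx context.Context, item *foxhound.Item) error {
	w.n++ // replace with real serialisation
	return nil
}

func (w *countWriter) Flush(ctx context.Context) error { return nil }

func (w *countWriter) Close() error { return nil }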

Directories

Path	Synopsis
behavior	Package behavior provides human-like behavioral simulation for the Foxhound scraping engine.
cache	Package cache provides caching backends for Foxhound responses.
captcha	Package captcha provides CAPTCHA detection and solving for the Foxhound scraping framework.
cmd/foxhound	Command foxhound is the CLI entry point for the Foxhound scraping framework.
engine	Package engine implements the core orchestration layer for the Foxhound scraping framework: Hunt (campaign coordinator), Walker (virtual user), Trail (navigation path), Scheduler (goroutine pool), RetryPolicy, and Stats.
examples/adaptive	Example: adaptive selectors that survive DOM changes.
examples/ecommerce	Example: E-commerce product scraper
examples/realtime	Example: Real-time price monitor with webhook notifications
examples/travel	Example: Travel site scraper — hotel listings with JavaScript rendering
fetch	Package fetch provides the dual-mode fetching layer for Foxhound.
fetch/presets	Package presets ships a single curated Firefox JA3 fingerprint that the stealth fetcher applies automatically when an identity profile is configured.
identity	Package identity provides consistent anti-detection identity profiles.
middleware	Package middleware provides composable foxhound.Middleware implementations for rate limiting, deduplication, depth limiting, and retry logic.
monitor	Package monitor provides runtime statistics, Prometheus metrics, and alerting for Foxhound scraping hunts.
parse	Package parse provides HTML, JSON, and other response parsing utilities for the Foxhound scraping framework.
pipeline	Package pipeline provides composable data processing stages and export writers for the Foxhound scraping framework.
pipeline/export	Package export provides Writer implementations for exporting scraped items to various formats and destinations.
proxy	Package proxy manages HTTP/SOCKS proxy pools, health checking, and rotation strategies for the Foxhound scraping framework.
proxy/providers	Package providers contains third-party proxy provider adapters that implement the proxy.Provider interface.
queue	Package queue provides foxhound.Queue implementations.
tests/scrape_targets/alibaba	Scrape Target 3: Alibaba — 10 yoga mat products
tests/scrape_targets/google_maps	Scrape Target 2: Google Maps — "villa di bali"
tests/scrape_targets/google_serp	Scrape Target 1: Google SERP — "wisata alam jawa timur"
tests/scrape_targets/yoga_alliance	Scrape Target 4: Yoga Alliance School Directory
