foxhound

package module v0.0.23

Published: May 5, 2026 License: MIT Imports: 15 Imported by: 0

README

Foxhound - Go Scraping Framework

Go Scraping Framework with Native Camoufox Anti-Detection

High-performance Go scraping framework with native Camoufox anti-detection, dual-mode fetching, and 13-layer middleware.

Highlights

  • Dual-mode fetching: TLS-impersonating HTTP client (~5-50ms) + Camoufox browser (~500ms-5s), with automatic escalation on block detection
  • Consistent identity profiles: UA + TLS fingerprint + header order + OS + hardware + screen + locale all match — randomness without consistency causes instant blocks
  • 13-layer middleware chain: concurrency, metrics, rate limit, robots.txt, delta-fetch, dedup, autothrottle, cookies, referer, blocked detector, redirect, depth limit, retry
  • Trail API: fluent navigation builder with Fill, InfiniteScroll, Evaluate (custom JS), XHR/fetch capture, and optional steps
  • Structured data extraction: JSON-LD, OpenGraph, NextData, NuxtData extractors + contact deobfuscation (CloudFlare cfemail)
  • NopeCHA auto-download: CAPTCHA-solving extension fetched and configured automatically at runtime
  • 9 export formats: JSON, JSONL, CSV, Markdown, Text, XML, SQLite, PostgreSQL, Webhook
  • Parsing engine: HTML table extraction (colspan/rowspan), JS preloaded data (Next.js/Nuxt/Redux), directory listings (JSON-LD/Microdata/DOM), pagination detection, and auto-detection with Readability-style article scoring
  • Adaptive parsing: CSS pseudo-selectors (::text, ::attr), similarity matching, auto-selector generation + sitemap/RSS/Atom parsing
  • Streaming API: Hunt.Stream(ctx) for real-time item processing via Go channels
  • Checkpoint/resume: auto-save hunt state every N items
  • Stateful Session: foxhound.NewSession(...) wraps fetcher + cookie jar + identity + proxy for single-call ad-hoc scraping, with cookies persisted across calls
  • Multi-session campaigns: Hunt.AddSession(name, cfg) + Job.SessionID route individual jobs through distinct fetchers / identities / proxies inside one Hunt
  • Development mode: Hunt.WithDevelopmentMode(dir) caches responses on disk after the first run and replays them on subsequent runs for zero-network iteration
  • Verified Cloudflare solve: fetch.WithSolveCloudflare(timeout) polls cookie + DOM + token signals before declaring success and exposes Response.CloudflareSolved
  • Domain & resource blocking: Hunt.WithBlockedDomains(...) / Hunt.WithDisableResources(...) abort ad, tracker, image, and font requests at the browser layer
  • Trail XHR capture: Trail.CaptureXHR(pattern) attaches URL regexps to every produced job so matching XHR/fetch response bodies land in Response.CapturedXHR
  • TLS fingerprint customisation (build tag tls): fetch.WithIdentity auto-applies the curated Firefox JA3 from fetch/presets; fetch.WithJA3, fetch.WithJA3Pool, fetch.WithHTTP2Fingerprint, fetch.WithHTTP3Fingerprint available for advanced overrides
  • Build-mode safety: StealthFetcher.IsImpersonating() + startup log so consumers fail-fast when built without -tags tls
  • 19 packages, 1200+ tests

Key Capabilities

| Area | What you get |
|---|---|
| Performance | CSS parsing in ~8ms for 5K elements. Multi-core goroutines with per-domain concurrency control |
| Anti-detection | Real Camoufox binary (C++ fingerprint spoofing), human behavior simulation (log-normal timing, Bezier mouse, scroll rhythm), NopeCHA auto-download |
| Block avoidance | 9 vendor patterns (Cloudflare, Akamai, DataDome, PerimeterX) with auto-retry + reCAPTCHA checkbox click + Turnstile handler |
| Identity | 60+ device profiles with consistent UA + TLS + headers + OS + GPU + screen + locale + geo matching |
| Trail API | Fill forms (JobStepFill), infinite scroll with container + stop condition, Evaluate custom JS, XHR/fetch capture, optional steps, persistent cookies |
| Parsing | CSS + XPath + regex + JSON + structured schema + adaptive selectors + similarity matching + pseudo-selectors + sitemap/RSS/Atom |
| Structured data | JSON-LD, OpenGraph, NextData, NuxtData extractors + CloudFlare cfemail deobfuscation |
| Export | 9 formats: JSON, JSONL, CSV, Markdown (table/list/cards), Text, XML, SQLite, PostgreSQL, Webhook + field-level pipeline transforms |
| Proxy | Pool rotation, health checking, cooldown, geo-targeted selection matching identity locale |
| Queue | Memory, Redis (distributed), SQLite (persistent) — checkpoint/resume across restarts |
| Monitoring | Prometheus metrics + webhook alerting with error/block rate thresholds |
| Scaling | docker compose --scale foxhound=4 with shared Redis queue |

Quick Start

git clone https://github.com/sadewadee/foxhound.git
cd foxhound
go build -tags playwright -o foxhound ./cmd/foxhound/
foxhound init myproject && cd myproject
go mod tidy
foxhound run --config config.yaml

Google Maps — Scroll feed, collect businesses, extract contacts

// Generate a consistent identity (UA + TLS + headers + OS all match)
id := identity.Generate(identity.WithBrowser(identity.BrowserFirefox))
profile := behavior.CarefulProfile().Jitter() // ±15% per-session parameter variance

browser, _ := fetch.NewCamoufox(
    fetch.WithBrowserIdentity(id),
    fetch.WithBehaviorProfile(profile),
    fetch.WithStorageState("session.json"), // persist session across runs
)
defer browser.Close()

// SmartFetcher with Bayesian domain learning — auto-escalates to browser when blocked
static := fetch.NewStealth(fetch.WithIdentity(id)) // fast static client sharing the same identity
scorer := fetch.NewDomainScorer(fetch.SocialMediaScoreConfig())
smart := fetch.NewSmart(static, browser, fetch.WithDomainScorer(scorer))

// Trail: search → scroll feed → collect all business URLs
trail := engine.NewTrail("maps-search").
    Navigate("https://www.google.com/maps").
    Fill("input#searchboxinput", "restaurant in bali").
    Click("button#searchbox-searchbutton").
    WaitOptional("div[role='feed']", 10*time.Second).
    InfiniteScrollInUntil("div[role='feed']", "div.Nv2PK", 50, 200).
    Evaluate(`() => document.querySelectorAll('.Nv2PK').length`)

jsonlWriter := export.NewJSON("maps.jsonl", export.JSONLines) // JSONL writer used below (path illustrative)

h := engine.NewHunt(engine.HuntConfig{
    Name:            "maps",
    Walkers:         3,
    Seeds:           trail.ToJobs(),
    Fetcher:         middleware.Chain(
        middleware.NewCircuitBreaker(middleware.DefaultCircuitBreakerConfig()),
        middleware.NewAutoThrottle(middleware.AutoThrottleConfig{
            TargetConcurrency: 1, MinDelay: 2 * time.Second, MaxDelay: 15 * time.Second,
        }),
    ).Wrap(smart),
    Queue:           queue.NewReliable(queue.NewMemory(1000), queue.DefaultReliableConfig()),
    BehaviorProfile: profile,
    Processor: foxhound.ProcessorFunc(func(ctx context.Context, resp *foxhound.Response) (*foxhound.Result, error) {
        // Auto-detect page type and extract accordingly
        result, _ := parse.AutoExtract(resp)
        if result.Type == parse.ContentListing {
            var items []*foxhound.Item
            for _, l := range result.Listings {
                items = append(items, l.AsItem())
            }
            return &foxhound.Result{Items: items}, nil
        }
        // Fallback: extract contacts from business website
        item := foxhound.NewItem()
        item.Set("url", resp.URL)
        item.Set("emails", parse.ExtractEmails(resp))
        item.Set("phones", parse.ExtractPhones(resp))
        return &foxhound.Result{Items: []*foxhound.Item{item}}, nil
    }),
    Writers: []foxhound.Writer{jsonlWriter},
})
h.Run(context.Background())

Trail API — Login + Search + Infinite Scroll + JS Extract

// Login trail (reusable across sessions with WithStorageState)
login := engine.Login("ig-login",
    "https://www.instagram.com/accounts/login/",
    "input[name='username']", "input[name='password']", "button[type='submit']",
    os.Getenv("IG_USER"), os.Getenv("IG_PASS"),
)

// Feed scraping trail
feed := engine.NewTrail("ig-feed").
    Navigate("https://www.instagram.com/explore/").
    WaitOptional("article", 10*time.Second).
    InfiniteScrollUntil("article", 100, 500).
    Evaluate(`() => {
        const posts = document.querySelectorAll('a[href*="/p/"]');
        return Array.from(posts).map(a => a.href);
    }`)

Auto-Detection — Let foxhound figure out the page type

result, _ := parse.AutoExtract(resp)
switch result.Type {
case parse.ContentArticle:
    fmt.Println(result.Article.Title, result.Article.WordCount, "words")
case parse.ContentListing:
    for _, listing := range result.Listings {
        fmt.Println(listing.Name, listing.Phone, listing.Rating)
    }
case parse.ContentProduct:
    fmt.Println("Product page detected")
}

// Extract preloaded JS data (Next.js, Nuxt, Redux, Apollo)
data, _ := parse.ExtractPreloadedData(resp)
fmt.Println("Framework:", data.Framework) // "nextjs", "nuxt", "react"...

// Detect pagination and follow next pages
links := parse.DetectPagination(resp) // multi-signal scoring (50pt threshold)
for _, link := range links {
    fmt.Println(link.Direction, link.URL, "score:", link.Score)
}

Anti-fragility / Adaptive Selectors

Most scrapers break the moment a target site renames a CSS class. Foxhound's adaptive selectors learn an element signature (tag, classes, text prefix, parent, depth, position) on the first successful match, then fall back to similarity matching when the primary CSS selector stops working — so a class rename, a wrapper-div change, or a sibling reordering does not break extraction.

Enable adaptive mode on a Hunt with WithAdaptive(savePath) (pass an empty string for in-memory only, or a JSON path to persist learned signatures across runs), then use the adaptive helpers on Response:

hunt := engine.NewHunt(engine.HuntConfig{
    Name:      "shop",
    Fetcher:   fetcher,
    Queue:     q,
    Processor: foxhound.ProcessorFunc(func(ctx context.Context, resp *foxhound.Response) (*foxhound.Result, error) {
        // Inline: register and extract in one call. The signature is
        // learned automatically and persisted by the Hunt.
        title := resp.CSSAdaptive("h1.product-title", "title").Text()
        price := resp.CSSAdaptive(".price", "price").Text()

        // On future runs, even if .product-title gets renamed to
        // .item-name, similarity matching will recover the element.
        // Use Adaptive(name) for selectors registered earlier (e.g.
        // via Trail.Adaptive or a previous CSSAdaptive call).
        _ = resp.Adaptive("title")

        item := foxhound.NewItem()
        item.Set("title", title)
        item.Set("price", price)
        return &foxhound.Result{Items: []*foxhound.Item{item}}, nil
    }),
}).WithAdaptive("./adaptive_signatures.json")

You can also declare adaptive selectors at the Trail level:

trail := engine.NewTrail("books").
    Navigate("https://books.toscrape.com/").
    Adaptive("book_title", ".product_pod h3 a").
    Adaptive("book_price", ".product_pod .price_color")

See examples/adaptive/ for a complete runnable example demonstrating an adaptive selector surviving a CSS class rename.

TLS Fingerprint Customisation

fetch.NewStealth ships in two flavours selected by build tag:

  • default build (no tag): Go crypto/tls ClientHello — well-known JA3, trivially detected. Use for tests, CI, or non-bot-protected targets only.
  • -tags tls build: full JA3 / Akamai HTTP/2 / HTTP/3 impersonation via azuretls-client. Use for production scraping.

The same fetch.NewStealth API exists in both, but the underlying TLS layer is completely different. Confirm at startup:

f := fetch.NewStealth(fetch.WithIdentity(profile))
if !f.IsImpersonating() {
    log.Fatal("built without -tags tls; refusing to start in production")
}

Or check the binary directly:

go tool nm /path/to/binary | grep -q azuretls && echo "✅ TLS impersonation active" || echo "❌ Built without -tags tls"

TLS fingerprint comes from the identity

WithIdentity is the only thing you need for fingerprint consistency. It sets the azuretls browser family to match the profile ("firefox" for a Firefox profile — foxhound's primary target since Camoufox is Firefox-based) and lets azuretls's built-in GetLastFirefoxVersion produce the ClientHello at request time:

import "github.com/sadewadee/foxhound/fetch"

f := fetch.NewStealth(fetch.WithIdentity(profile))

The HTTP/2 layer is left to azuretls's browser-aware initHTTP2(browser) so TLS, headers, and HTTP/2 all agree on Firefox. Manual WithHTTP2Fingerprint is supported for power users but logs a startup warning when paired with WithJA3 (see issue #41).

Verified against https://www.bing.com/search and https://duckduckgo.com/ through a datacenter proxy: both return 200 with WithIdentity alone.

TLS certificate verification (v0.0.20)

NewStealth now sets InsecureSkipVerify=true by default. This disables azuretls's built-in DefaultPinManager, which performs an extra TLS handshake per new host to capture SPKI fingerprints and then fails on subsequent requests if a different CDN edge serves a different certificate. Multi-edge targets (Bing, Google, Cloudflare) rotate certificates continuously, making the default PinManager behaviour incompatible with sustained scraping.

foxhound's threat model is bot detection avoidance, not MITM prevention. The default is safe for scraping public sites over a controlled proxy path.

To re-enable full certificate chain, hostname, and pin verification:

f := fetch.NewStealth(
    fetch.WithIdentity(profile),
    fetch.WithStrictTLSVerify(),   // re-enables chain + hostname + pin checks
)

The startup log reports tls_verify=true when strict mode is active and tls_verify=false otherwise (the default).

Pin or rotate JA3 (advanced)

Capture your own Firefox JA3 from tls.peet.ws when the curated preset lags real Firefox:

f := fetch.NewStealth(
    fetch.WithIdentity(profile),
    fetch.WithJA3(myCapturedJA3),       // overrides the auto-applied preset
)

For per-recycle rotation, supply a pool of multiple Firefox captures:

pool := []string{ja3FromYesterday, ja3FromLastWeek, presets.FirefoxLatest().JA3}
f := fetch.NewStealth(
    fetch.WithIdentity(profile),
    fetch.WithJA3Pool(pool),
)

Without -tags tls these options compile but log an error at startup — the underlying net/http transport cannot customise the TLS ClientHello.

Real Scraping Results

| Target | Mode | Items | Block Avoidance | Notes |
|---|---|---|---|---|
| Google Maps (10 queries) | Camoufox + proxy | 100 places | 100% | 1,297 items/hour, 0 CAPTCHAs |
| Alibaba (yoga mat) | Camoufox + proxy | 10 products | 100% | Prices + suppliers extracted |
| bot.sannysoft.com | Camoufox | 29/30 | PASS | webdriver NOT detected |
| CreepJS | Camoufox | | Trust: HIGH | Fingerprint consistent |

Benchmarks

Measured on hachibi (AMD Ryzen 7 5700G, Docker container, 2 cores / 4GB RAM, Ubuntu 24.04).

CSS Selection — 5,000 elements

| Library | Language | Time | vs Foxhound |
|---|---|---|---|
| Foxhound CSS | Go | 13.6ms | 1.0x |
| Raw goquery | Go | 13.0ms | 0.96x |
| stdlib html | Go | 17.7ms | 1.3x slower |
| Raw lxml | Python/C | 195.8ms | 14.4x slower |
| BeautifulSoup | Python | 245.6ms | 18.1x slower |

Foxhound Internal Benchmarks (5,000 elements)

| Method | Time | Memory | Allocs | Notes |
|---|---|---|---|---|
| Foxhound CSS | 13.6ms | 6.5 MB | 100K | <1% overhead vs raw goquery |
| Foxhound Adaptive | 17.3ms | 6.2 MB | 95K | Zero overhead when selector works |
| Foxhound Schema | 31.3ms | 13.3 MB | 320K | 3 fields per item |
| Foxhound TextExtract | 22.5ms | 10.0 MB | 270K | 3 fields per item |
| FindByText | 24.6ms | 12.1 MB | 165K | Full DOM text search |
| Regex extract | 6.7ms | 1.1 MB | 15K | Pattern matching on body |
| Similarity score | 96ns | 0 B | 0 | Zero allocation |
| Item.ToJSON | 1.2µs | 432 B | 10 | |
| Item.ToMarkdown | 716ns | 376 B | 8 | |

Scaling by Document Size

| Benchmark | 1K elements | 5K elements | 10K elements | Scaling |
|---|---|---|---|---|
| Foxhound CSS | 2.3ms | 13.6ms | 29.6ms | ~linear |
| Regex extract | 1.5ms | 6.7ms | 15.7ms | ~linear |
| stdlib html | 3.1ms | 17.7ms | 31.4ms | ~linear |

# Run yourself
go test -bench=. -benchmem ./benchmarks/

# Run in Docker with resource limits
docker run --cpus=2 --memory=4g foxhound-benchmark:latest \
  go test -bench=. -benchmem ./benchmarks/

Documentation

| File | Contents |
|---|---|
| docs/getting-started.md | Install, first scrape, running modes |
| docs/configuration.md | Full config.yaml reference |
| docs/cli.md | All CLI commands and flags |
| docs/api.md | Go types, interfaces, Hunt/Stream API |
| docs/anti-detection.md | Identity system, TLS, behavior simulation |
| docs/parsing.md | Table, preload, directory, pagination, auto-detection parsers |
| docs/middleware.md | All 13 middleware, chain order |
| docs/pipeline.md | Pipeline stages and all 9 export formats |
| docs/proxy.md | Proxy pool, rotation, providers, geo matching |
| docs/browser.md | Camoufox setup, options, human simulation |
| docs/examples.md | E-commerce, Maps, adaptive parsing, streaming |
| docs/deployment.md | Docker, scaling, environment variables |

Export Formats

| Format | Constructor | Notes |
|---|---|---|
| JSON array | export.NewJSON(path, export.JSONArray) | Single file, full array |
| JSON Lines | export.NewJSON(path, export.JSONLines) | One object per line, streaming-friendly |
| CSV | export.NewCSV(path, cols...) | Fixed or auto-inferred columns |
| Markdown table | export.NewMarkdown(path, export.MarkdownTable) | GFM pipe table |
| Markdown list | export.NewMarkdown(path, export.MarkdownList) | Bullet list, first field bolded |
| Markdown cards | export.NewMarkdown(path, export.MarkdownCards) | H2 heading + bullet fields |
| Plain text lines | export.NewText(path, export.TextLines) | key=value per line |
| Plain text pretty | export.NewText(path, export.TextPretty) | Labelled blocks with separators |
| XML | export.NewXML(path, root, item) | Configurable root/item element names |
| SQLite | export.NewSQLite(dbPath, table) | Auto-creates and extends schema |
| PostgreSQL | export.NewPostgres(dsn, table) | Upsert support, batch inserts |
| Webhook | export.NewWebhook(url) | HTTP POST, optional batch size |
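
A sketch wiring two of the writers above into a Hunt; file names are illustrative, and the Writers field is the same one used in the Quick Start:

jsonl := export.NewJSON("items.jsonl", export.JSONLines)
csv := export.NewCSV("items.csv", "title", "price")

h := engine.NewHunt(engine.HuntConfig{
    Name:    "export-demo",
    Writers: []foxhound.Writer{jsonl, csv},
    // ... Fetcher, Queue, Processor as in the Quick Start
})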

Architecture

Job → rate limit → dedup → behavior timing → header enrichment
  → Smart Fetcher (static TLS or Camoufox browser)
    → Block detection (9 vendor patterns) → retry with backoff
  → Parser (CSS / XPath / JSON / Regex / Adaptive / Similarity)
  → User Process() → Result{Items, NextJobs}
  → Pipeline (validate, clean, dedup) → Writers (9 formats)
  → Queue (memory / Redis / SQLite)

License

MIT

Documentation

Overview

Package foxhound is a high-performance Go web scraping framework with native anti-detection built on Camoufox, a Firefox fork designed to evade bot fingerprinting.

Foxhound is a scraping framework for Go — it handles the full lifecycle of web data extraction: fetching pages (with or without a real browser), navigating JavaScript-heavy sites, solving CAPTCHAs, rotating identities and proxies, extracting structured data, and exporting results. Think of it as Scrapy for Go, but with first-class browser automation and anti-detection built in from day one.

Why Foxhound

Modern websites deploy increasingly sophisticated bot detection: TLS fingerprinting, JavaScript challenges (Cloudflare, DataDome, PerimeterX), canvas/WebGL fingerprint checks, and behavioral analysis. Traditional HTTP-only scrapers fail silently against these defenses. Headless Chrome is widely fingerprinted. Foxhound solves this by combining two fetching strategies behind a single API:

  • A TLS-impersonating HTTP client for static pages (~5-50ms per request)
  • A Camoufox browser (Firefox fork) via playwright-go for JS-heavy and protected pages (~500ms-5s per request)

The smart router starts with the fast static client and automatically escalates to the full browser when it detects blocks (403, 429, 503, CAPTCHA pages). This means you get HTTP-client speed on easy targets and browser-level evasion on hard ones, without changing your code.

Architecture Overview

Foxhound is organized around five core concepts:

Hunt is the top-level campaign orchestrator. It owns the queue, spawns Walker goroutines, collects stats, and coordinates shutdown. You configure a Hunt with seed URLs, a Processor (your extraction logic), middleware, pipelines, and writers.

Trail is a fluent navigation path builder. It chains browser actions — Navigate, Click, Fill, Wait, Scroll, InfiniteScroll, Evaluate (custom JS), and CaptureXHR — into a reusable sequence that gets compiled into Jobs. Trails describe what a human would do on the page.

Walker is a goroutine that acts as a virtual user. Each Walker pops Jobs from the queue, fetches pages through the middleware chain, runs your Processor, writes extracted Items through the pipeline, and enqueues discovered follow-up Jobs. A Hunt runs multiple Walkers concurrently.

Job is the unit of work: a URL plus fetch mode, priority, browser steps, metadata, and optional session routing. Jobs flow through the queue and are consumed by Walkers.

Session is a stateful client that wraps a fetcher, cookie jar, identity profile, and proxy into a reusable unit. Use it standalone for ad-hoc scraping, or register multiple Sessions with a Hunt via Hunt.AddSession to route different Jobs through different identities and proxies.
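
For reference, a hand-built Job (fields documented under the Job type below; the values are illustrative):

job := &foxhound.Job{
    URL:       "https://example.com/products",
    Method:    "GET",
    FetchMode: foxhound.FetchBrowser, // force the Camoufox fetcher
    Priority:  foxhound.PriorityHigh,
    Meta:      map[string]any{"category": "seed"},
    SessionID: "detail", // optional: route through a session registered via Hunt.AddSession
}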

Dual-Mode Fetching

Every request flows through a middleware chain before reaching the fetcher:

Job → middleware (rate limit → dedup → autothrottle → cookies → referer → retry)
  → Smart Fetcher (static or browser) → Browser Steps → Parser → Processor
  → Result{Items, Jobs} → Pipeline (validate → clean → dedup → transform)
  → Writers (CSV/JSON/SQLite/Webhook) + Queue (new jobs)

The static fetcher (fetch.NewStealth) uses Go's HTTP client with precise header ordering and TLS fingerprints matched to the identity profile. The browser fetcher (fetch.NewCamoufox) launches a real Camoufox browser instance via the Juggler protocol (Firefox's native remote protocol, less targeted by anti-bot than CDP). The smart fetcher (fetch.NewSmart) wraps both and auto-escalates based on response signals and Bayesian domain risk scoring.
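
Composed with the same calls as the Quick Start (a sketch; the domain scorer config shown is the only one named in this document):

id := identity.Generate(identity.WithBrowser(identity.BrowserFirefox))
static := fetch.NewStealth(fetch.WithIdentity(id))             // fast path
browser, _ := fetch.NewCamoufox(fetch.WithBrowserIdentity(id)) // evasion path
defer browser.Close()

scorer := fetch.NewDomainScorer(fetch.SocialMediaScoreConfig())
smart := fetch.NewSmart(static, browser, fetch.WithDomainScorer(scorer))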

Identity System

Every request uses a complete, internally consistent identity profile: user agent, TLS fingerprint, header order, OS, hardware specs, screen dimensions, locale, timezone, and geolocation all match. Randomness without consistency is the number one cause of bot detection — a Windows UA with a macOS font list, or a US locale with a Tokyo timezone, triggers instant blocks.

Foxhound ships 60 embedded device profiles. The identity package generates profiles with functional options:

id := identity.Generate(
    identity.WithBrowser(identity.BrowserFirefox),
    identity.WithOS(identity.OSWindows),
)

When using Camoufox, the identity is serialized to a JSON config that sets navigator properties, WebGL vendor/renderer, canvas noise, OS-specific fonts, screen dimensions, and timezone at the C++ level inside the browser — not via JavaScript injection that anti-bot scripts can detect.

Human Behavior Simulation

Foxhound models human behavior using statistical distributions observed from real user sessions:

  • Timing uses Weibull and Gamma distributions (right-skewed, matching human reaction times), not uniform random
  • Mouse movements follow Bezier curves with natural acceleration/deceleration
  • Scroll patterns simulate reading speed with variable pause durations
  • Keyboard input uses per-key timing with realistic inter-keystroke intervals
  • Session fatigue: warmup slowdown at start, cruise speed mid-session, gradual fatigue buildup — with per-call noise to prevent smooth-curve detection
  • Per-session jitter: all behavior parameters are perturbed ±15% to prevent anti-bot ML from clustering sessions into discrete archetypes

Three built-in profiles ("careful", "moderate", "aggressive") control the overall pacing. Configure via BehaviorConfig or Hunt options.
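
A sketch selecting a preset and attaching it to a Hunt, using the CarefulProfile and Jitter calls from the Quick Start:

profile := behavior.CarefulProfile().Jitter() // preset pacing with ±15% per-session variance
h := engine.NewHunt(engine.HuntConfig{
    Name:            "paced",
    BehaviorProfile: profile,
    // ... Fetcher, Queue, Processor
})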

NopeCHA CAPTCHA Solving

The NopeCHA browser extension is automatically downloaded from GitHub and loaded into Camoufox on first launch. It solves reCAPTCHA, hCAPTCHA, and Cloudflare Turnstile challenges without API keys. The extension is cached at ~/.cache/foxhound/extensions/nopecha/ and updated automatically.

The design philosophy: the goal is to never trigger a CAPTCHA. If one appears, earlier layers (identity, timing, proxy rotation) failed. NopeCHA is the safety net, not the primary strategy.

Disable with extension_path: "none" in config or WithExtensionPath("none").

Middleware Chain

Foxhound provides 13 middleware layers that wrap the fetcher:

  • Rate limiting (token bucket per domain)
  • Request deduplication (URL + method fingerprint)
  • Autothrottle (adaptive delay based on response times)
  • Cookie persistence (file-backed or in-memory jar)
  • Referer chain (natural browsing simulation)
  • Blocked response detection (403/429/503/CAPTCHA triggers)
  • Redirect following (with loop detection)
  • Depth limiting (max crawl depth from seed)
  • Retry with exponential backoff
  • Delta-fetch (skip unchanged pages via ETag/Last-Modified)
  • Circuit breaker (3-state FSM: closed → open → half-open)
  • Metrics collection (Prometheus counters)
  • Robots.txt compliance

Middleware is composable: each layer wraps a Fetcher and returns a Fetcher, so you can stack them in any order or add custom middleware.
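
A minimal custom layer, sketched with the MiddlewareFunc and FetcherFunc adapters documented below (log fields are illustrative; note that FetcherFunc.Close is a no-op, so Close does not propagate to the wrapped fetcher):

timing := foxhound.MiddlewareFunc(func(next foxhound.Fetcher) foxhound.Fetcher {
    return foxhound.FetcherFunc(func(ctx context.Context, job *foxhound.Job) (*foxhound.Response, error) {
        start := time.Now()
        resp, err := next.Fetch(ctx, job)
        slog.Info("fetch done", "url", job.URL, "elapsed", time.Since(start))
        return resp, err
    })
})
fetcher := timing.Wrap(base) // base is any Fetcher, e.g. fetch.NewStealth()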

Adaptive Selectors

Websites frequently change their DOM structure — class names rotate, IDs are randomized, layouts shift. Foxhound's adaptive selector system survives these rewrites by building element signatures (tag, position, text patterns, ancestor structure) alongside CSS selectors. When a selector stops matching, the system falls back to similarity matching against saved signatures.

Enable with Hunt.WithAdaptive and use via Response.Adaptive, Response.CSSAdaptive, Response.CSSAdaptiveAll, or Trail.Adaptive. Signatures can be stored in JSON files or SQLite.

Example: Hunt Campaign

A Hunt is the standard way to scrape at scale. Define a Processor, configure middleware and writers, add seed URLs, and run:

hunt := engine.NewHunt("bookstore",
    engine.WithDomain("books.toscrape.com"),
    engine.WithWalkers(4),
    engine.WithProcessor(foxhound.ProcessorFunc(func(ctx context.Context, resp *foxhound.Response) (*foxhound.Result, error) {
        result := &foxhound.Result{}
        titles := resp.CSS("h3 a").Texts()
        prices := resp.CSS(".price_color").Texts()
        for i, title := range titles {
            item := foxhound.NewItem()
            item.Set("title", title)
            if i < len(prices) {
                item.Set("price", prices[i])
            }
            result.Items = append(result.Items, item)
        }
        // Follow pagination links
        result.Jobs = resp.Follow("li.next a[href]")
        return result, nil
    })),
)
hunt.AddSeed("https://books.toscrape.com/")
huntResult, err := hunt.Run(ctx)

Example: Trail Navigation

Trails describe multi-step browser interactions for JS-heavy pages. This example searches Google Maps and scrolls through results:

trail := engine.NewTrail("maps-search").
    Navigate("https://www.google.com/maps").
    Fill("input#searchboxinput", "cafe in canggu").
    Click("button#searchbox-searchbutton").
    WaitOptional("div[role='feed']", 10*time.Second).
    InfiniteScrollInUntil("div[role='feed']", "div.Nv2PK", 20, 100).
    Evaluate("() => document.querySelectorAll('.Nv2PK').length")

jobs := trail.ToJobs()

Example: Session (Ad-Hoc Scraping)

Session is the lightweight alternative to Hunt for quick, stateful fetches. Cookies persist across calls, and the identity stays consistent:

sess := foxhound.NewSession(
    foxhound.WithSessionFetcher(fetch.NewStealth()),
    foxhound.WithSessionIdentity(identity.Generate()),
    foxhound.WithSessionProxy("http://user:pass@proxy.example:8080"),
)
defer sess.Close()

resp, err := sess.Get(ctx, "https://example.com/login")
// cookies from login response are automatically persisted
resp2, err := sess.Get(ctx, "https://example.com/dashboard")

Example: CSS and XPath Selectors

Response provides built-in CSS and XPath querying without importing the parse package:

// Single element
title := resp.CSS("h1.title").Text()
price := resp.CSS("span.price").Text()
image := resp.CSS("img.product").Attr("src")

// Multiple elements
allTitles := resp.CSS("h3 a").Texts()
allLinks  := resp.CSS("a.product[href]").Attrs("href")
count     := resp.CSS("div.result").Len()

// XPath (subset converted to CSS internally)
author := resp.XPath("//span[@class='author']")

Response.Follow extracts links from the page and generates follow-up Jobs:

// Follow all product links, route to a different handler
jobs := resp.Follow("a.product-link[href]",
    foxhound.WithFollowCallback("parseProduct"),
    foxhound.WithFollowReferer(true),
)

// Follow a single known URL
nextPage := resp.FollowURL("/api/products?page=2")

// Follow all anchor links on the page
allJobs := resp.FollowAll()

Example: XHR/Fetch Capture

Capture background API calls that JavaScript makes after page load. This is essential for SPAs where data loads via XHR/fetch, not in the initial HTML:

trail := engine.NewTrail("api-capture").
    Navigate("https://example.com/app").
    CaptureXHR("*/api/v2/products*").
    Click("button.load-data").
    Wait("div.results", 5*time.Second)

The captured responses are available in Response.CapturedXHR as a slice of maps with keys: request_url, request_method, status, headers, body.
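
Consuming the captures (a sketch; the concrete Go type of the body value is an assumption, so type-assert defensively):

for _, call := range resp.CapturedXHR {
    fmt.Println(call["request_url"], call["request_method"], call["status"])
    if body, ok := call["body"].(string); ok { // string body is an assumption
        fmt.Println(len(body), "bytes")
    }
}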

Example: Cloudflare Solve

For sites behind Cloudflare's JavaScript challenge, Foxhound can detect and wait for the challenge to complete:

fetcher, _ := fetch.NewCamoufox(
    fetch.WithSolveCloudflare(30 * time.Second),
)
// resp.CloudflareSolved is true when the challenge was detected and solved.
// Verification checks: cf_clearance cookie, absence of Turnstile DOM markers,
// and a non-empty cf-turnstile-response token.

Example: Multi-Session Campaigns

Route different jobs through different identities and proxies within a single Hunt:

indexSession := foxhound.NewSession(
    foxhound.WithSessionFetcher(fetch.NewStealth()),
    foxhound.WithSessionProxy("http://proxy-a:8080"),
)
browserFetcher, _ := fetch.NewCamoufox() // two-value return, as in the Quick Start
detailSession := foxhound.NewSession(
    foxhound.WithSessionFetcher(browserFetcher),
    foxhound.WithSessionProxy("http://proxy-b:8080"),
)

hunt := engine.NewHunt("multi-session", /* ... */)
hunt.AddSession("index", indexSession)
hunt.AddSession("detail", detailSession)

// Jobs with SessionID "index" use indexSession's fetcher and proxy;
// jobs with SessionID "detail" use detailSession's browser.

Example: Development Mode

Cache responses on disk for zero-network iteration during development:

hunt := engine.NewHunt("dev",
    engine.WithDevelopmentMode("./dev-cache"),
    // ... other options
)
// First run: fetches from network, saves responses to ./dev-cache/
// Subsequent runs: replays cached responses instantly

Sub-Packages

The foxhound module is organized into focused sub-packages:

  • [engine] — Hunt, Trail, Walker, scheduler, retry logic, stats collection, and ItemList for thread-safe item accumulation with CSV/JSON/JSONL export.

  • [fetch] — Stealth HTTP client (TLS fingerprinting + header ordering), Camoufox browser automation (Juggler protocol), Smart router (auto-escalation), XHR capture, page pool management, domain risk scoring, and SOCKS5 auth relay.

  • [identity] — Profile generation with 60 embedded device profiles. Produces consistent identity bundles (UA, TLS, headers, OS, hardware, screen, locale, geo) and Camoufox fingerprint configs.

  • [behavior] — Human behavior simulation: timing (Weibull/Gamma distributions), mouse (Bezier curves), scroll patterns, keyboard input, navigation profiles, and session fatigue modeling.

  • [middleware] — 13 composable middleware layers: rate limiting, dedup, retry, autothrottle, cookies, referer, redirect, depth, delta-fetch, circuit breaker, metrics, blocked detection, and robots.txt.

  • [parse] — Content extraction: CSS (goquery), JSON (dot-path), XPath (subset), regex, structured schema, Markdown/text conversion, metadata (JSON-LD, OpenGraph, NextData, NuxtData), contact deobfuscation, sitemap/feed parsing, adaptive selectors, HTML table extraction, JS preload detection, directory listings, pagination detection, and auto-detection with Readability-style scoring.

  • [pipeline] — Item processing stages: validation, cleaning, deduplication, field transformation (regex, rename, type coercion), and chain composition.

  • pipeline/export — Output writers: JSON, JSONL, CSV, XML, SQLite, PostgreSQL, Markdown, Text, and Webhook.

  • [proxy] — Proxy pool management with geo-aware selection, health checking, cooldown tracking, and provider adapters (BrightData, Oxylabs, Smartproxy).

  • [queue] — Job queue implementations: in-memory (heap-based priority queue), Redis (sorted sets), and SQLite (persistent).

  • [cache] — Response caching: in-memory (LRU + TTL), file-based (SHA256 keys), Redis, and SQLite.

  • [captcha] — CAPTCHA detection (Cloudflare, reCAPTCHA, hCAPTCHA, GeeTest) and solving via NopeCHA, CapSolver, 2Captcha, and Turnstile.

  • [monitor] — Observability: atomic stat counters, Prometheus metrics (isolated registry), and webhook-based alerting rules.

  • cmd/foxhound — CLI tool: init, run, check, proxy-test, shell, browser-shell, resume, curl2fox, and preview commands.

Index

Constants

const (
	JobStepNavigate       = 0
	JobStepClick          = 1
	JobStepWait           = 2
	JobStepExtract        = 3
	JobStepScroll         = 4
	JobStepInfiniteScroll = 5  // scroll to bottom until no new content loads
	JobStepLoadMore       = 6  // click "load more" button repeatedly until gone
	JobStepPaginate       = 7  // detect and follow pagination links
	JobStepEvaluate       = 8  // execute custom JavaScript on the page
	JobStepFill           = 9  // type text into input field with human-like keystrokes
	JobStepCollect        = 10 // collect URLs from page into pool
)

Step action constants for JobStep. These are package-level int constants (not engine.StepAction) to avoid an import cycle between foxhound ↔ engine.
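
For example, a hand-built fill step (Trail.ToJobs normally generates these; job here is an existing *foxhound.Job):

step := foxhound.JobStep{
    Action:   foxhound.JobStepFill, // type text with human-like keystrokes
    Selector: "input#searchboxinput",
    Value:    "restaurant in bali",
}
job.Steps = append(job.Steps, step)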

Variables

This section is empty.

Functions

func RegisterAdaptiveHooks

func RegisterAdaptiveHooks(
	extractText func(extractor any, body []byte, name string) string,
	register func(extractor any, body []byte, name, selector string, all bool),
)

RegisterAdaptiveHooks is called by the parse package to wire its AdaptiveExtractor implementation into Response.Adaptive / CSSAdaptive.

func RegisterHTMLSelectors

func RegisterHTMLSelectors(
	textsFunc func(body []byte, selector string) []string,
	attrsFunc func(body []byte, selector, attr string) []string,
	countFunc func(body []byte, selector string) int,
	xpathFunc func(expr string) string,
)

RegisterHTMLSelectors is called by the parse package to provide the HTML selection implementations used by Response.CSS() and Response.XPath().

func SetupLogging

func SetupLogging(cfg LoggingConfig, verbose int)

SetupLogging configures the global slog logger from a LoggingConfig. The verbose parameter overrides the config level:

0 = use config level (default info)
1 = debug  (-v)
2 = debug with source location (-vv)
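
For example, honouring a -v flag after loading configuration (a sketch using LoadConfig from this package):

cfg, err := foxhound.LoadConfig("config.yaml")
if err != nil {
    log.Fatal(err)
}
foxhound.SetupLogging(cfg.Logging, 1) // 1 = debug (-v); 0 keeps the config level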

Types

type AlertingExportConfig

type AlertingExportConfig struct {
	WebhookURL         string   `yaml:"webhook_url"`
	ErrorRateThreshold float64  `yaml:"error_rate_threshold"`
	BlockRateThreshold float64  `yaml:"block_rate_threshold"`
	Cooldown           Duration `yaml:"cooldown"`
}

AlertingExportConfig configures webhook alerting.

type AutoThrottleMiddlewareConfig

type AutoThrottleMiddlewareConfig struct {
	Enabled           bool     `yaml:"enabled"`
	TargetConcurrency float64  `yaml:"target_concurrency"`
	InitialDelay      Duration `yaml:"initial_delay"`
	MinDelay          Duration `yaml:"min_delay"`
	MaxDelay          Duration `yaml:"max_delay"`
}

AutoThrottleMiddlewareConfig configures the adaptive per-domain throttle.

type BehaviorConfig

type BehaviorConfig struct {
	// Profile selects the preset behavior profile: "careful", "moderate", or
	// "aggressive". Defaults to "moderate" when unset.
	Profile string `yaml:"profile"`
}

BehaviorConfig configures the human-simulation behavior profile for walkers.

type BrowserFetchConfig

type BrowserFetchConfig struct {
	Timeout       Duration `yaml:"timeout"`
	BlockImages   bool     `yaml:"block_images"`
	BlockWebRTC   bool     `yaml:"block_webrtc"`
	Headless      string   `yaml:"headless"`
	Instances     int      `yaml:"instances"`
	ExtensionPath string   `yaml:"extension_path"` // path to extension dir/xpi, or "none" to disable NopeCHA auto-load
}

BrowserFetchConfig configures the Camoufox browser.

type CacheConfig

type CacheConfig struct {
	Backend string   `yaml:"backend"` // "memory" | "file" | "sqlite" | "redis" | "" (disabled)
	TTL     Duration `yaml:"ttl"`
	MaxSize int      `yaml:"max_size"` // max entries for memory cache
}

CacheConfig configures response caching.

type CaptchaConfig

type CaptchaConfig struct {
	Enabled  bool   `yaml:"enabled"`
	Provider string `yaml:"provider"` // "capsolver" | "twocaptcha" | "nopecha"
	APIKey   string `yaml:"api_key"`
}

CaptchaConfig configures CAPTCHA detection and solving.

type CleanConfig

type CleanConfig struct {
	TrimWhitespace bool `yaml:"trim_whitespace"`
	NormalizePrice bool `yaml:"normalize_price"`
	NormalizeDate  bool `yaml:"normalize_date"`
}

CleanConfig configures the cleaning pipeline stage.

type ConcurrencyConfig

type ConcurrencyConfig struct {
	PerDomain int `yaml:"per_domain"` // max concurrent requests per domain (default 2)
}

ConcurrencyConfig limits concurrent in-flight requests per domain.

type Config

type Config struct {
	Hunt        HuntConfig        `yaml:"hunt"`
	Identity    IdentityConfig    `yaml:"identity"`
	Proxy       ProxyConfig       `yaml:"proxy"`
	Fetch       FetchConfig       `yaml:"fetch"`
	Middleware  MiddlewareConfig  `yaml:"middleware"`
	Pipeline    []PipelineEntry   `yaml:"pipeline"`
	Queue       QueueConfig       `yaml:"queue"`
	Cache       CacheConfig       `yaml:"cache"`
	Monitor     MonitorConfig     `yaml:"monitor"`
	Captcha     CaptchaConfig     `yaml:"captcha"`
	Logging     LoggingConfig     `yaml:"logging"`
	Behavior    BehaviorConfig    `yaml:"behavior"`
	PageActions PageActionsConfig `yaml:"page_actions"`
}

Config is the top-level configuration for a Foxhound instance.

func LoadConfig

func LoadConfig(path string) (*Config, error)

LoadConfig reads and parses a YAML configuration file.

type DedupConfig

type DedupConfig struct {
	Strategy string `yaml:"strategy"`
	Store    string `yaml:"store"`
}

DedupConfig configures URL deduplication.

type DeltaFetchConfig

type DeltaFetchConfig struct {
	Enabled  bool     `yaml:"enabled"`
	Strategy string   `yaml:"strategy"` // "skip_seen" | "skip_recent"
	TTL      Duration `yaml:"ttl"`
	Store    string   `yaml:"store"` // "memory" | "redis" | "sqlite"
}

DeltaFetchConfig configures cross-run URL deduplication.

type DepthLimitConfig

type DepthLimitConfig struct {
	Max int `yaml:"max"`
}

DepthLimitConfig configures crawl depth limiting.

type DownloadDelayConfig

type DownloadDelayConfig struct {
	Enabled   bool              `yaml:"enabled"`
	Default   Duration          `yaml:"default"`   // base delay between same-domain requests
	Domains   map[string]string `yaml:"domains"`   // per-domain delay overrides (domain -> duration string)
	Randomize bool              `yaml:"randomize"` // add ±25% jitter
}

DownloadDelayConfig configures per-domain download delays.

type Duration

type Duration struct {
	time.Duration
}

Duration is a time.Duration that supports YAML marshaling.

func (Duration) MarshalYAML

func (d Duration) MarshalYAML() (any, error)

MarshalYAML serializes the duration as a string.

func (*Duration) UnmarshalYAML

func (d *Duration) UnmarshalYAML(value *yaml.Node) error

UnmarshalYAML parses a duration string like "30s", "5m", "1h".

type ExportConfig

type ExportConfig struct {
	Type      string `yaml:"type"`
	Path      string `yaml:"path,omitempty"`
	Table     string `yaml:"table,omitempty"`
	UpsertKey string `yaml:"upsert_key,omitempty"`
	BatchSize int    `yaml:"batch_size,omitempty"`
}

ExportConfig defines an export destination.

type FetchConfig

type FetchConfig struct {
	Static  StaticFetchConfig  `yaml:"static"`
	Browser BrowserFetchConfig `yaml:"browser"`
}

FetchConfig configures the fetch layer.

type FetchMode

type FetchMode int

FetchMode indicates which fetcher to use for a request.

const (
	// FetchAuto lets the smart router decide between static and browser.
	FetchAuto FetchMode = iota
	// FetchStatic forces the TLS-impersonating HTTP client.
	FetchStatic
	// FetchBrowser forces the Camoufox browser.
	FetchBrowser
)

func (FetchMode) String

func (m FetchMode) String() string

String returns the string representation of a FetchMode.

type Fetcher

type Fetcher interface {
	// Fetch performs an HTTP request and returns the response.
	Fetch(ctx context.Context, job *Job) (*Response, error)
	// Close releases any resources held by the fetcher.
	Close() error
}

Fetcher defines the interface for making HTTP requests.

type FetcherFunc

type FetcherFunc func(ctx context.Context, job *Job) (*Response, error)

FetcherFunc is an adapter to allow use of ordinary functions as Fetchers.

func (FetcherFunc) Close

func (f FetcherFunc) Close() error

Close is a no-op to satisfy the Fetcher interface.

func (FetcherFunc) Fetch

func (f FetcherFunc) Fetch(ctx context.Context, job *Job) (*Response, error)

Fetch calls f(ctx, job).

type FollowOption

type FollowOption func(*followConfig)

FollowOption configures how Follow generates jobs from discovered links.

func WithFollowCallback

func WithFollowCallback(callback string) FollowOption

WithFollowCallback sets a callback name in Meta["callback"] for generated jobs, allowing spider-style routing of responses to different handlers.
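
A sketch of a processor routing on the callback name (resp.Job.Callback carries the value set by this option):

proc := foxhound.ProcessorFunc(func(ctx context.Context, resp *foxhound.Response) (*foxhound.Result, error) {
    switch resp.Job.Callback {
    case "parseProduct":
        // product-page extraction
        return &foxhound.Result{}, nil
    default:
        // listing-page extraction
        return &foxhound.Result{}, nil
    }
})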

func WithFollowDontFilter

func WithFollowDontFilter(dontFilter bool) FollowOption

WithFollowDontFilter marks generated jobs to skip deduplication filtering. Useful for pages that need to be re-fetched (e.g. pagination, monitoring).

func WithFollowMeta

func WithFollowMeta(meta map[string]any) FollowOption

WithFollowMeta sets metadata on generated follow-up jobs.

func WithFollowMode

func WithFollowMode(mode FetchMode) FollowOption

WithFollowMode sets the FetchMode for generated follow-up jobs.

func WithFollowPriority

func WithFollowPriority(p Priority) FollowOption

WithFollowPriority sets the Priority for generated follow-up jobs.

func WithFollowReferer

func WithFollowReferer(referer bool) FollowOption

WithFollowReferer sets the current response URL as referer in the generated job's Meta["referer"]. This maintains referer chain for natural browsing simulation.

type HuntConfig

type HuntConfig struct {
	Domain         string `yaml:"domain"`
	Walkers        int    `yaml:"walkers"`
	MaxConcurrency int    `yaml:"max_concurrency"` // global max concurrent requests (0 = walkers count)
}

HuntConfig configures the scraping campaign.

type IdentityConfig

type IdentityConfig struct {
	Browser       string   `yaml:"browser"`
	OS            []string `yaml:"os"`
	FingerprintDB string   `yaml:"fingerprint_db"`
}

IdentityConfig configures identity generation.

type Item

type Item struct {
	// Fields holds the extracted data as key-value pairs.
	Fields map[string]any
	// Meta carries metadata from the originating job.
	Meta map[string]any
	// URL is the source URL.
	URL string
	// Timestamp is when the item was created.
	Timestamp time.Time
}

Item represents a scraped data item passing through the pipeline.

func NewItem

func NewItem() *Item

NewItem creates a new Item with initialized fields.

func (*Item) Get

func (it *Item) Get(key string) (any, bool)

Get retrieves a field from the item.

func (*Item) GetFloat

func (it *Item) GetFloat(key string) float64

GetFloat returns the field value as float64. Accepts float64 and int/int64 stored in the Fields map. Returns 0 if the field is absent or non-numeric.

func (*Item) GetInt

func (it *Item) GetInt(key string) int

GetInt returns the field value as int. Accepts int, int64, and float64 stored in the Fields map (float64 is truncated). Returns 0 if the field is absent or non-numeric.

func (*Item) GetString

func (it *Item) GetString(key string) string

GetString returns the field value as a string. Returns "" if the field is absent or its underlying type is not string.

func (*Item) Has

func (it *Item) Has(key string) bool

Has reports whether the field exists and has a non-empty string representation. A field set to nil or "" is treated as absent.

func (*Item) Keys

func (it *Item) Keys() []string

Keys returns the item's field names in sorted (ascending) order.

func (*Item) Set

func (it *Item) Set(key string, value any)

Set sets a field on the item.

func (*Item) String

func (it *Item) String() string

String implements fmt.Stringer. It returns a compact JSON representation of the item fields, falling back to a key=value format on marshal error.

func (*Item) ToCSVRow

func (it *Item) ToCSVRow(columns []string) []string

ToCSVRow returns field values as a string slice following the given column order. Missing fields are returned as empty strings.

func (*Item) ToJSON

func (it *Item) ToJSON() ([]byte, error)

ToJSON returns item.Fields serialised as compact JSON bytes.

func (*Item) ToJSONPretty

func (it *Item) ToJSONPretty() ([]byte, error)

ToJSONPretty returns item.Fields serialised as indented JSON bytes.

func (*Item) ToMap

func (it *Item) ToMap() map[string]any

ToMap returns a shallow copy of item.Fields. Mutations to the returned map do not affect the Item.

func (*Item) ToMarkdown

func (it *Item) ToMarkdown() string

ToMarkdown returns a compact Markdown representation of the item as a bullet list: the first key (sorted) is bolded; the rest are appended.

func (*Item) ToText

func (it *Item) ToText() string

ToText returns a plain-text representation with one "key: value" line per field, fields in sorted order.

type Job

type Job struct {
	// ID is a unique identifier for this job.
	ID string
	// URL is the target URL to fetch.
	URL string
	// Method is the HTTP method (default GET).
	Method string
	// Headers are additional HTTP headers to include.
	Headers http.Header
	// Body is the request body for POST/PUT requests.
	Body []byte
	// FetchMode determines which fetcher to use.
	FetchMode FetchMode
	// Priority determines processing order.
	Priority Priority
	// MaxRetries overrides the default retry count.
	MaxRetries int
	// Meta is arbitrary metadata passed through the pipeline.
	Meta map[string]any
	// Depth is the crawl depth from the seed URL.
	Depth int
	// Domain is the target domain extracted from URL.
	Domain string
	// CreatedAt is when the job was created.
	CreatedAt time.Time
	// Steps are browser-side actions to execute after page load (optional).
	// When non-empty, the job requires a browser fetcher. The omitempty tag
	// ensures backward compatibility with existing queue serialization.
	Steps []JobStep `json:"steps,omitempty"`
	// NavigationTimeout overrides the fetcher's default navigation timeout
	// for this specific job. Useful for pages known to be slow (e.g. later
	// pagination pages on Google SERP). Zero means use the fetcher default.
	NavigationTimeout time.Duration `json:"navigation_timeout,omitempty"`
	// DontFilter when true skips deduplication for this specific job.
	// Useful for pages that need to be re-fetched (e.g. pagination, monitoring).
	DontFilter bool `json:"dont_filter,omitempty"`
	// Callback is an optional handler name that the spider routes to a
	// specific Parse method. When empty, the default processor is used.
	Callback string `json:"callback,omitempty"`
	// SessionID names a session previously registered with Hunt.AddSession.
	// When set, the walker routes this job through the named session's
	// fetcher (with its own cookie jar, identity, and proxy) instead of the
	// hunt's default fetcher. Empty (default) preserves backward-compatible
	// behaviour: the hunt's default fetcher is used.
	SessionID string `json:"session_id,omitempty"`
}

Job represents a unit of work to be processed by the engine.

type JobStep

type JobStep struct {
	// Action identifies the step type (JobStepClick, JobStepWait, etc.).
	// Zero value (JobStepNavigate) is intentionally NOT omitempty so it
	// always serializes.
	Action int `json:"action"`
	// Selector is the CSS selector for Click, Wait, and Extract steps.
	Selector string `json:"selector,omitempty"`
	// Duration is the timeout for Wait steps.
	Duration time.Duration `json:"duration,omitempty"`
	// ScrollAxis is 0 for vertical, 1 for horizontal (only for Scroll steps).
	ScrollAxis int `json:"scroll_axis,omitempty"`
	// ScrollExtent is the target scroll distance in pixels. Defaults to 3000
	// when zero.
	ScrollExtent int `json:"scroll_extent,omitempty"`
	// ScrollMode is 0 for ScrollReading, 1 for ScrollScan. Zero value
	// (omitted in JSON) defaults to ScrollReading.
	ScrollMode int `json:"scroll_mode,omitempty"`
	// MaxScrolls is the maximum number of scroll-to-bottom iterations for
	// InfiniteScroll steps. Defaults to 50 when zero.
	MaxScrolls int `json:"max_scrolls,omitempty"`
	// MaxClicks is the maximum number of "load more" button clicks for
	// LoadMore steps. Defaults to 20 when zero.
	MaxClicks int `json:"max_clicks,omitempty"`
	// MaxPages is the maximum number of pagination pages to follow for
	// Paginate steps. Defaults to 10 when zero.
	MaxPages int `json:"max_pages,omitempty"`
	// Script is the JavaScript code to execute for Evaluate steps.
	Script string `json:"script,omitempty"`
	// WaitState specifies what state to wait for in Wait steps:
	// "attached" (default), "detached", "visible", or "hidden".
	// Maps to playwright's WaitForSelectorState.
	WaitState string `json:"wait_state,omitempty"`
	// Optional marks this step as non-fatal: if it fails, execution continues
	// instead of aborting the fetch. Useful for steps that may not always be
	// present on the page (e.g. a cookie banner dismiss button).
	Optional bool `json:"optional,omitempty"`
	// StopSelector is a CSS selector that signals InfiniteScroll to stop
	// when the target element count is reached. Used with StopCount to scroll
	// until N items exist (e.g. "div.result" + StopCount=20).
	StopSelector string `json:"stop_selector,omitempty"`
	// StopCount is the target element count for StopSelector. InfiniteScroll
	// stops when document.querySelectorAll(StopSelector).length >= StopCount.
	// Only used when StopSelector is set. Defaults to 1 when zero.
	StopCount int `json:"stop_count,omitempty"`
	// ScrollWait is the duration to wait after each scroll iteration before
	// checking for new content. Defaults to 2s when zero. Increase for slow
	// sites like Google Maps (3-5s recommended).
	ScrollWait time.Duration `json:"scroll_wait,omitempty"`
	// Value is the text to type into an input field for Fill steps.
	Value string `json:"value,omitempty"`
}

JobStep is a single browser-side action that should be executed after the page loads. Steps are attached to a Job by Trail.ToJobs() and executed by the CamoufoxFetcher before content extraction.

type LoggingConfig

type LoggingConfig struct {
	Level  string `yaml:"level"`
	Format string `yaml:"format"`
	Output string `yaml:"output"`
}

LoggingConfig configures structured logging.

type MetricsExportConfig

type MetricsExportConfig struct {
	Enabled bool `yaml:"enabled"`
	Port    int  `yaml:"port"`
}

MetricsExportConfig configures Prometheus metrics.

type Middleware

type Middleware interface {
	// Wrap takes a Fetcher and returns a wrapped Fetcher.
	Wrap(next Fetcher) Fetcher
}

Middleware wraps a Fetcher to add cross-cutting behavior.

type MiddlewareConfig

type MiddlewareConfig struct {
	RateLimit     RateLimitConfig              `yaml:"ratelimit"`
	AutoThrottle  AutoThrottleMiddlewareConfig `yaml:"autothrottle"`
	Dedup         DedupConfig                  `yaml:"dedup"`
	DeltaFetch    DeltaFetchConfig             `yaml:"deltafetch"`
	RobotsTxt     RobotsTxtConfig              `yaml:"robots_txt"`
	DepthLimit    DepthLimitConfig             `yaml:"depth_limit"`
	Concurrency   ConcurrencyConfig            `yaml:"concurrency"`
	DownloadDelay DownloadDelayConfig          `yaml:"download_delay"`
}

MiddlewareConfig configures request/response processing middleware.

type MiddlewareFunc

type MiddlewareFunc func(next Fetcher) Fetcher

MiddlewareFunc is an adapter for using functions as Middleware.

func (MiddlewareFunc) Wrap

func (f MiddlewareFunc) Wrap(next Fetcher) Fetcher

Wrap calls f(next).

type MonitorConfig

type MonitorConfig struct {
	Metrics  MetricsExportConfig  `yaml:"metrics"`
	Alerting AlertingExportConfig `yaml:"alerting"`
}

MonitorConfig configures observability.

type PageActionsConfig

type PageActionsConfig struct {
	Scripts []string `yaml:"scripts"` // JS snippets to run after page load
}

PageActionsConfig configures JavaScript execution after page load.

type Pipeline

type Pipeline interface {
	// Process takes an item and returns the (possibly modified) item.
	// Return nil to drop the item. Return an error to log and continue.
	Process(ctx context.Context, item *Item) (*Item, error)
}

Pipeline processes items after extraction.

type PipelineEntry

type PipelineEntry struct {
	Validate *ValidateConfig `yaml:"validate,omitempty"`
	Clean    *CleanConfig    `yaml:"clean,omitempty"`
	Dedup    *DedupConfig    `yaml:"dedup,omitempty"`
	Export   []ExportConfig  `yaml:"export,omitempty"`
}

PipelineEntry is a polymorphic pipeline stage definition.

type PipelineFunc

type PipelineFunc func(ctx context.Context, item *Item) (*Item, error)

PipelineFunc is an adapter for using functions as Pipeline stages.

func (PipelineFunc) Process

func (f PipelineFunc) Process(ctx context.Context, item *Item) (*Item, error)

Process calls f(ctx, item).
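
A sketch of a drop-filter stage built with this adapter (the field name is illustrative):

dropCheap := foxhound.PipelineFunc(func(ctx context.Context, item *foxhound.Item) (*foxhound.Item, error) {
    if item.GetFloat("price") < 1.0 {
        return nil, nil // returning nil drops the item
    }
    return item, nil
})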

type Priority

type Priority int

Priority represents job priority in the queue.

const (
	PriorityLow    Priority = 0
	PriorityNormal Priority = 5
	PriorityHigh   Priority = 10
)

type Processor

type Processor interface {
	// Process takes a response and returns extracted items and new jobs.
	Process(ctx context.Context, resp *Response) (*Result, error)
}

Processor defines the user-provided logic for handling responses. This is the main extension point: users implement this to extract data.

type ProcessorFunc

type ProcessorFunc func(ctx context.Context, resp *Response) (*Result, error)

ProcessorFunc is an adapter to allow use of ordinary functions as Processors.

func (ProcessorFunc) Process

func (f ProcessorFunc) Process(ctx context.Context, resp *Response) (*Result, error)

Process calls f(ctx, resp).

type ProviderEntry

type ProviderEntry struct {
	Type     string   `yaml:"type"`
	List     []string `yaml:"list,omitempty"`
	APIKey   string   `yaml:"api_key,omitempty"`
	Username string   `yaml:"username,omitempty"`
	Password string   `yaml:"password,omitempty"`
	Product  string   `yaml:"product,omitempty"`
	Country  string   `yaml:"country,omitempty"`
}

ProviderEntry defines a proxy provider in configuration.

type ProxyConfig

type ProxyConfig struct {
	Providers           []ProviderEntry `yaml:"providers"`
	Rotation            string          `yaml:"rotation"`
	Cooldown            Duration        `yaml:"cooldown"`
	MaxRequestsPerProxy int             `yaml:"max_requests_per_proxy"`
	HealthCheckInterval Duration        `yaml:"health_check_interval"`
}

ProxyConfig configures proxy management.

type Queue

type Queue interface {
	// Push adds a job to the queue.
	Push(ctx context.Context, job *Job) error
	// Pop removes and returns the highest priority job. Blocks until available
	// or context is cancelled.
	Pop(ctx context.Context) (*Job, error)
	// Len returns the number of jobs in the queue.
	Len() int
	// Close releases queue resources.
	Close() error
}

Queue defines the interface for job storage and retrieval.

type QueueConfig

type QueueConfig struct {
	Backend string `yaml:"backend"`
}

QueueConfig configures the job queue backend.

type RateLimitConfig

type RateLimitConfig struct {
	Enabled        bool    `yaml:"enabled"`
	RequestsPerSec float64 `yaml:"requests_per_sec"`
	BurstSize      int     `yaml:"burst_size"`
}

RateLimitConfig configures per-domain rate limiting.

type Response

type Response struct {
	// StatusCode is the HTTP status code.
	StatusCode int
	// Headers are the response headers.
	Headers http.Header
	// Body is the response body bytes.
	Body []byte
	// URL is the final URL after redirects.
	URL string
	// FetchMode indicates which fetcher was used.
	FetchMode FetchMode
	// Duration is how long the fetch took.
	Duration time.Duration
	// Job is the original job that produced this response.
	Job *Job
	// StepResults holds return values from JobStepEvaluate steps, keyed by
	// step index (e.g. "step_0", "step_2"). Only populated when steps
	// produce output.
	StepResults map[string]any
	// CapturedXHR holds captured XHR/fetch responses when capture patterns are configured.
	// Each entry is a map with keys: request_url, request_method, status, headers, body.
	CapturedXHR []map[string]any
	// Cookies contains cookies set by the response (Set-Cookie headers for
	// static fetches, browser context cookies for browser fetches).
	Cookies []*http.Cookie `json:"cookies,omitempty"`
	// CloudflareSolved is true when a Cloudflare Turnstile / JS challenge was
	// detected and verified as solved before the response was returned. The
	// verification checks for the cf_clearance cookie, absence of Turnstile
	// DOM markers, and a non-empty cf-turnstile-response token. Only set when
	// the browser fetcher was launched with WithSolveCloudflare.
	CloudflareSolved bool `json:"cloudflare_solved,omitempty"`
	// contains filtered or unexported fields
}

Response wraps an HTTP response with additional metadata.
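Inside a Processor, the Trail-related fields can be inspected directly; the step key and map keys below follow the field comments above:

if v, ok := resp.StepResults["step_0"]; ok {
	fmt.Println("evaluate result:", v)
}
for _, x := range resp.CapturedXHR {
	fmt.Println("captured:", x["request_url"], x["status"])
}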

func (*Response) Adaptive

func (r *Response) Adaptive(name string) string

Adaptive returns the text of a registered adaptive selector by name. Falls back to similarity matching if the primary CSS selector finds nothing on the current page. Returns an empty string when no extractor is attached or no element is matched.

Requires Hunt.WithAdaptive(...) to have been called.

func (*Response) AdaptiveExtractor

func (r *Response) AdaptiveExtractor() any

AdaptiveExtractor returns the attached extractor as an opaque value. Callers in the parse package can type-assert it to *parse.AdaptiveExtractor. Returns nil when no extractor is configured for this response.

func (*Response) CSS

func (r *Response) CSS(selector string) *Selection

CSS returns a Selection bound to this Response. Subsequent calls share the same parsed document, making it efficient to chain multiple selectors:

title := resp.CSS("h1").Text()
links := resp.CSS("a[href]").Attrs("href")

func (*Response) CSSAdaptive

func (r *Response) CSSAdaptive(selector, name string) *Selection

CSSAdaptive runs a CSS selector against this response, registering it as an adaptive selector under the given name. On future runs, if the selector no longer matches, similarity matching falls back to the saved signature.

Requires Hunt.WithAdaptive(...) to have been called. Returns a Selection supporting .Text(), .Attr(), .Texts(), and .Attrs(); when no extractor is configured, returns a Selection backed by a plain CSS query (degraded behaviour).

func (*Response) CSSAdaptiveAll

func (r *Response) CSSAdaptiveAll(selector, name string) *Selection

CSSAdaptiveAll is like CSSAdaptive but registers the selector for multi-element extraction. Returns a Selection that, when queried via .Texts() / .Attrs(), yields all matches.
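Assuming Hunt.WithAdaptive(...) is configured, the register-then-recover flow looks roughly like this (selector strings and names are illustrative):

// First run: register selectors and record element signatures.
price := resp.CSSAdaptive("span.price", "price").Text()
names := resp.CSSAdaptiveAll("div.card h3", "names").Texts()

// Later runs: similarity matching recovers the elements even if
// the original selectors no longer match.
price = resp.Adaptive("price")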

func (*Response) Follow

func (r *Response) Follow(selector string, opts ...FollowOption) []*Job

Follow generates follow-up Jobs from all links matching the CSS selector in the response body. Links are resolved relative to the response URL, deduplicated, and filtered to HTTP(S) schemes. The generated jobs inherit Depth+1 from the originating job.

Example:

jobs := resp.Follow("a.product-link[href]", foxhound.WithFollowCallback("parseProduct"))

func (*Response) FollowAll

func (r *Response) FollowAll(opts ...FollowOption) []*Job

FollowAll generates follow-up Jobs for all anchor links (a[href]) in the response body. It is shorthand for Follow("a[href]", opts...).

func (*Response) FollowURL

func (r *Response) FollowURL(rawURL string, opts ...FollowOption) *Job

FollowURL creates a single follow-up Job for the given URL. The URL is resolved relative to the response URL. The generated job inherits Depth+1 from the originating job.

Unlike Follow, which extracts links from HTML via CSS selectors, FollowURL is for programmatically following a known URL (e.g. an API endpoint or a URL extracted from JSON data).

Example:

nextPage := resp.FollowURL("/api/products?page=2", foxhound.WithFollowReferer(true))

func (*Response) IsSuccess

func (r *Response) IsSuccess() bool

IsSuccess returns true when the HTTP status code indicates success (2xx).

func (*Response) SetAdaptiveExtractor

func (r *Response) SetAdaptiveExtractor(ae any)

SetAdaptiveExtractor attaches a *parse.AdaptiveExtractor to this response. Walker calls this before invoking the user processor when Hunt.WithAdaptive(...) is configured. The argument is typed as any to avoid an import cycle; pass a *parse.AdaptiveExtractor.

func (*Response) TextBody

func (r *Response) TextBody() string

TextBody returns the response body as a string.

func (*Response) XPath

func (r *Response) XPath(expr string) string

XPath evaluates a simplified XPath expression against the response body and returns the first matching element's text. See parse.XPath for supported syntax.

func (*Response) XPathAll

func (r *Response) XPathAll(expr string) []string

XPathAll evaluates a simplified XPath expression and returns text content of all matching elements.
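For example (expressions illustrative; see parse.XPath for the supported subset):

heading := resp.XPath("//h1")
cells := resp.XPathAll("//table//td")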

type Result

type Result struct {
	// Items are the extracted data items.
	Items []*Item
	// Jobs are new jobs to enqueue (discovered links, pagination, etc.).
	Jobs []*Job
}

Result is the output of processing a job. It contains scraped items and optionally new jobs to enqueue (for crawling).

type RobotsTxtConfig

type RobotsTxtConfig struct {
	Enabled bool `yaml:"enabled"`
}

RobotsTxtConfig configures robots.txt compliance.

type Selection

type Selection struct {
	// contains filtered or unexported fields
}

Selection represents a CSS selector applied to an HTML body. It provides convenience methods for extracting text, attributes, and HTML from matched elements without requiring the user to import the parse package directly.

func (*Selection) Attr

func (s *Selection) Attr(attr string) string

Attr returns an attribute value from the first matching element.

func (*Selection) Attrs

func (s *Selection) Attrs(attr string) []string

Attrs returns attribute values from all matching elements.

func (*Selection) Len

func (s *Selection) Len() int

Len returns the number of elements matching the selector.

func (*Selection) Text

func (s *Selection) Text() string

Text returns the trimmed text content of the first element matching the selector.

func (*Selection) Texts

func (s *Selection) Texts() []string

Texts returns the trimmed text content of all elements matching the selector.
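Because all calls share one parsed document (see Response.CSS above), running several selectors against the same response stays cheap; the selectors below are illustrative:

items := resp.CSS("ul.products li")
fmt.Println(items.Len(), "products found")
names := resp.CSS("ul.products li h3").Texts()
links := resp.CSS("ul.products li a").Attrs("href")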

type Selector

type Selector struct {
	// contains filtered or unexported fields
}

Selector provides CSS-selector-based querying on the Response body. It wraps a lazily-parsed HTML document so multiple CSS/XPath calls share the same parse result.

func (*Selector) CSS

func (s *Selector) CSS(selector string) *Selection

CSS returns a Selection for the given CSS selector.

func (*Selector) XPath

func (s *Selector) XPath(expr string) string

XPath evaluates a simplified XPath expression and returns the first match text.

func (*Selector) XPathAll

func (s *Selector) XPathAll(expr string) []string

XPathAll evaluates a simplified XPath expression and returns all match texts.

type Session

type Session struct {
	// contains filtered or unexported fields
}

Session is a stateful client whose state persists across calls. Cookies are persisted in an internal CookieJar; the identity, proxy, and fetcher are reused for every Get/Fetch.

Session is safe for concurrent use by multiple goroutines.

func NewSession

func NewSession(opts ...SessionOption) *Session

NewSession constructs a Session with the supplied options. A fresh in-memory cookie jar is created when no WithSessionCookieJar option is given.

func (*Session) Close

func (s *Session) Close() error

Close releases any resources held by the underlying fetcher. The cookie jar is in-memory and needs no cleanup. Safe to call multiple times.

func (*Session) Cookies

func (s *Session) Cookies() []*http.Cookie

Cookies returns all cookies currently held in the jar across every host. The returned slice is a fresh copy; mutating it does not affect the jar.

func (*Session) CookiesFor

func (s *Session) CookiesFor(rawURL string) []*http.Cookie

CookiesFor returns the cookies the jar would send for the given URL. This is the standard http.CookieJar query — use it when you need to inspect what the session has accumulated for a particular host.

func (*Session) Fetch

func (s *Session) Fetch(ctx context.Context, job *Job) (*Response, error)

Fetch executes a Job through the session's fetcher. Before the call, any cookies the jar holds for the target URL are merged into the job's headers (so static fetchers without their own jar still see them). After a successful fetch, any cookies returned in Response.Cookies are stored in the jar so the next call observes them.

func (*Session) Fetcher

func (s *Session) Fetcher() Fetcher

Fetcher returns the underlying fetcher. Returns nil if none was configured.

func (*Session) Get

func (s *Session) Get(ctx context.Context, rawURL string) (*Response, error)

Get is the simple fetch shorthand. It builds a Job with method GET, the session's identity, and the URL, then delegates to Fetch.
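Putting the pieces together — the fetcher wiring matches the WithSessionFetcher example below, and the URL is a placeholder:

s := foxhound.NewSession(foxhound.WithSessionFetcher(fetch.NewStealth()))
defer s.Close()

resp, err := s.Get(ctx, "https://example.com/account")
if err != nil {
	return err
}
_ = resp
// Cookies set by the response are now in the jar and will be sent
// automatically on the next Get/Fetch.
fmt.Println(len(s.Cookies()), "cookies accumulated")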

func (*Session) Identity

func (s *Session) Identity() any

Identity returns the configured identity profile (as `any`). Callers type-assert to *identity.Profile.

func (*Session) Name

func (s *Session) Name() string

Name returns the session's optional name. Returns empty string for stand-alone sessions; populated for sessions registered via Hunt.AddSession.

func (*Session) ProxyURL

func (s *Session) ProxyURL() string

ProxyURL returns the session's recorded proxy URL.

func (*Session) SetFetcher

func (s *Session) SetFetcher(f Fetcher)

SetFetcher updates the underlying fetcher post-construction. Useful when the fetcher needs to reference the Session's cookie jar (a chicken-and-egg problem solved by constructing the Session first, then the fetcher).
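A sketch of that ordering; any fetcher option that consumes session state is hypothetical here — use whatever option your fetcher actually exposes:

s := foxhound.NewSession()
// Construct the fetcher second so it can reference session state,
// e.g. via a (hypothetical) jar-sharing option on the fetcher.
f := fetch.NewStealth()
s.SetFetcher(f)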

func (*Session) SetName

func (s *Session) SetName(name string)

SetName updates the session's name. Used by Hunt.AddSession.

type SessionOption

type SessionOption func(*Session)

SessionOption configures a Session at construction time.

func WithSessionCookieJar

func WithSessionCookieJar(jar http.CookieJar) SessionOption

WithSessionCookieJar replaces the default in-memory jar with a caller-supplied implementation. Use this when persisting cookies across processes (e.g. via a custom file-backed jar) or when sharing a jar across sessions.
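For example, sharing a standard-library jar between two sessions (net/http/cookiejar is stdlib; the fetcher wiring mirrors the WithSessionFetcher example below):

jar, err := cookiejar.New(nil)
if err != nil {
	return err
}
shared := foxhound.WithSessionCookieJar(jar)
a := foxhound.NewSession(shared, foxhound.WithSessionFetcher(fetch.NewStealth()))
b := foxhound.NewSession(shared, foxhound.WithSessionFetcher(fetch.NewStealth()))
// a and b now observe each other's cookies.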

func WithSessionFetcher

func WithSessionFetcher(f Fetcher) SessionOption

WithSessionFetcher sets the session's fetcher. When omitted, the caller must register one explicitly via SetFetcher before the first Get / Fetch call; calling Get on a Session without a fetcher returns an error.

The default Session does NOT auto-create a stealth fetcher to avoid an import cycle from foxhound → fetch. Wire one up at the application layer:

s := foxhound.NewSession(foxhound.WithSessionFetcher(fetch.NewStealth()))

func WithSessionIdentity

func WithSessionIdentity(p any) SessionOption

WithSessionIdentity attaches an identity profile to the session. The value is stored as `any` to avoid an import cycle with the identity package; the caller passes a *identity.Profile and is responsible for using it on the fetcher (most fetchers accept it via their own option at construction).

func WithSessionProxy

func WithSessionProxy(rawURL string) SessionOption

WithSessionProxy records the session's proxy URL. The value is stored for inspection but is NOT auto-applied to the fetcher; configure the fetcher's own proxy option at construction time. This is intentional: a Session is a thin wrapper, not a fetcher factory.

type StaticFetchConfig

type StaticFetchConfig struct {
	Timeout        Duration `yaml:"timeout"`
	MaxIdleConns   int      `yaml:"max_idle_conns"`
	TLSImpersonate bool     `yaml:"tls_impersonate"`
}

StaticFetchConfig configures the TLS-impersonating HTTP client.

type ValidateConfig

type ValidateConfig struct {
	Required []string `yaml:"required"`
}

ValidateConfig configures the validation pipeline stage.

type Writer

type Writer interface {
	// Write outputs an item to the destination.
	Write(ctx context.Context, item *Item) error
	// Flush ensures all buffered items are written.
	Flush(ctx context.Context) error
	// Close releases writer resources.
	Close() error
}

Writer defines the interface for exporting scraped items.
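Ready-made writers live in pipeline/export. A minimal custom writer only needs the three methods — here, a sketch that counts items and does no buffering:

type countWriter struct{ n int }

func (w *countWriter) Write(ctx context.Context, item *foxhound.Item) error {
	w.n++ // replace with real serialisation
	return nil
}

func (w *countWriter) Flush(ctx context.Context) error { return nil }

func (w *countWriter) Close() error { return nil }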

Directories

Path	Synopsis
behavior	Package behavior provides human-like behavioral simulation for the Foxhound scraping engine.
cache	Package cache provides caching backends for Foxhound responses.
captcha	Package captcha provides CAPTCHA detection and solving for the Foxhound scraping framework.
cmd/foxhound	Command foxhound is the CLI entry point for the Foxhound scraping framework.
engine	Package engine implements the core orchestration layer for the Foxhound scraping framework: Hunt (campaign coordinator), Walker (virtual user), Trail (navigation path), Scheduler (goroutine pool), RetryPolicy, and Stats.
examples/adaptive	Example: adaptive selectors that survive DOM changes.
examples/ecommerce	Example: E-commerce product scraper
examples/realtime	Example: Real-time price monitor with webhook notifications
examples/travel	Example: Travel site scraper — hotel listings with JavaScript rendering
fetch	Package fetch provides the dual-mode fetching layer for Foxhound.
fetch/presets	Package presets ships a single curated Firefox JA3 fingerprint that the stealth fetcher applies automatically when an identity profile is configured.
identity	Package identity provides consistent anti-detection identity profiles.
middleware	Package middleware provides composable foxhound.Middleware implementations for rate limiting, deduplication, depth limiting, and retry logic.
monitor	Package monitor provides runtime statistics, Prometheus metrics, and alerting for Foxhound scraping hunts.
parse	Package parse provides HTML, JSON, and other response parsing utilities for the Foxhound scraping framework.
pipeline	Package pipeline provides composable data processing stages and export writers for the Foxhound scraping framework.
pipeline/export	Package export provides Writer implementations for exporting scraped items to various formats and destinations.
proxy	Package proxy manages HTTP/SOCKS proxy pools, health checking, and rotation strategies for the Foxhound scraping framework.
proxy/providers	Package providers contains third-party proxy provider adapters that implement the proxy.Provider interface.
queue	Package queue provides foxhound.Queue implementations.
tests/scrape_targets/alibaba	Scrape Target 3: Alibaba — 10 yoga mat products
tests/scrape_targets/google_maps	Scrape Target 2: Google Maps — "villa di bali"
tests/scrape_targets/google_serp	Scrape Target 1: Google SERP — "wisata alam jawa timur"
tests/scrape_targets/yoga_alliance	Scrape Target 4: Yoga Alliance School Directory
