spider

package module
v0.0.0-...-9e6d8de Latest
Warning

This package is not in the latest version of its module.

Published: Jun 9, 2026 License: MIT Imports: 21 Imported by: 0

README

spider

Documentation

Constants

This section is empty.

Variables

This section is empty.

Functions

This section is empty.

Types

type BasicChecker

type BasicChecker struct{}

BasicChecker merges the former BasicChecker and AdvancedChecker into a single, unified heuristic pass. It runs inline (in the same goroutine as Fetch) and uses both the raw HTML body and the readability-extracted body that the fetcher provides.

Signal groups:

  • Content volume (readable text length, raw body length)
  • Text quality (lexical diversity, visible-text ratio)
  • DOM structure (semantic tags, heading hierarchy, main/article)
  • Metadata (title quality, Open Graph, canonical)
  • Readable body (word count from readability output, content density)
  • Hard penalties (anti-bot, error page, login wall, SPA shell, JS-heavy)
  • Soft penalties (link density, noindex, noscript JS-wall signal)
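
The package does not document how each signal is computed. As a rough illustration of the "lexical diversity" signal only, the sketch below uses a type-token ratio (unique words divided by total words). The helper name lexicalDiversity and the formula are assumptions, not the package's implementation.

```go
package main

import (
	"fmt"
	"strings"
)

// lexicalDiversity returns unique words / total words (a type-token ratio).
// Hypothetical helper: the real checker's formula is not documented.
func lexicalDiversity(text string) float64 {
	words := strings.Fields(strings.ToLower(text))
	if len(words) == 0 {
		return 0
	}
	seen := make(map[string]struct{}, len(words))
	for _, w := range words {
		seen[w] = struct{}{}
	}
	return float64(len(seen)) / float64(len(words))
}

func main() {
	// Repetitive boilerplate scores low; varied prose scores high.
	fmt.Printf("%.2f\n", lexicalDiversity("buy now buy now buy now"))
	fmt.Printf("%.2f\n", lexicalDiversity("the quick brown fox jumps"))
}
```

A low ratio on a long page would hint at templated or spam-like text, which is one plausible reason to fold such a value into a quality score.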

func NewBasicChecker

func NewBasicChecker() *BasicChecker

func (*BasicChecker) Check

func (*BasicChecker) Tier

func (h *BasicChecker) Tier() Tier

Tier returns TierBasic so the pipeline will escalate to LLM when uncertain.

type Checker

type Checker interface {
	// Check analyses the fetch result and returns a quality verdict.
	Check(ctx context.Context, rs *FetchResult) (QualityResult, error)
	// Tier returns the sophistication level of this checker.
	Tier() Tier
}

Checker is the universal interface every quality strategy must satisfy.
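
A minimal sketch of a custom strategy satisfying this interface. The module's import path is not shown on this page, so the types below are local stand-ins that mirror the documented declarations (with fewer fields); minLengthChecker and scoreReadable are hypothetical names used only for illustration.

```go
package main

import (
	"context"
	"fmt"
)

// Local stand-ins mirroring the documented declarations so the sketch
// compiles on its own; real code would use the package's own types.
type Tier int

const (
	TierBasic Tier = iota
	TierLLM
)

type FetchResult struct {
	RawBody      []byte
	ReadableBody []byte
}

type QualityResult struct {
	Score  float64
	Reason string
	Tier   Tier
}

type Checker interface {
	Check(ctx context.Context, rs *FetchResult) (QualityResult, error)
	Tier() Tier
}

// minLengthChecker is hypothetical: it scores a page by the amount of
// readable text extracted, saturating at wantBytes.
type minLengthChecker struct{ wantBytes int }

func (m minLengthChecker) Check(ctx context.Context, rs *FetchResult) (QualityResult, error) {
	if err := ctx.Err(); err != nil {
		return QualityResult{}, err // respect cancellation before doing work
	}
	score := float64(len(rs.ReadableBody)) / float64(m.wantBytes)
	if score > 1 {
		score = 1
	}
	return QualityResult{Score: score, Reason: "readable length", Tier: m.Tier()}, nil
}

func (minLengthChecker) Tier() Tier { return TierBasic }

// scoreReadable runs any Checker against a readable body of n bytes.
func scoreReadable(c Checker, n int) (QualityResult, error) {
	return c.Check(context.Background(), &FetchResult{ReadableBody: make([]byte, n)})
}

func main() {
	res, err := scoreReadable(minLengthChecker{wantBytes: 200}, 50)
	fmt.Println(res.Score, res.Tier, err)
}
```

With the real package, a value like this would presumably be passed to WithChecker to replace the inline checker.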

type Client

type Client struct {
	// contains filtered or unexported fields
}

Client is the main spider entry point. It automatically selects HTTP or headless-browser fetch strategies based on per-host quality scores that are updated lazily in the background.

func New

func New(options ...Option) (*Client, error)

New creates a Client with sensible production defaults. Pass option functions to customise behaviour.

func (*Client) CheckQuality

func (c *Client) CheckQuality(ctx context.Context, rs *FetchResult) (QualityResult, error)

CheckQuality scores a pre-fetched result using the configured inline checker.

func (*Client) Close

func (c *Client) Close() error

Close drains the background pipeline, releases the browser, and waits for all background goroutines to finish.

func (*Client) Fetch

func (c *Client) Fetch(ctx context.Context, endpoint string, fetchMethod ...FetchMethod) (*FetchResult, error)

Fetch is the smart entry point. It:

  1. Reads the cached quality score for this host.
  2. Chooses HTTP or browser accordingly (defaults to HTTP on first visit).
  3. Fetches and extracts the readable body.
  4. Runs the configured inline (TierBasic) checker to produce an immediate score.
  5. If the result is uncertain, enqueues a non-blocking background upgrade.
  6. Returns both bodies and the pre-fetch cached score.
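
Steps 1 and 2 above can be sketched as a small decision function. This is an illustration under assumptions, not the package's code: the types are local stand-ins, chooseMethod is a hypothetical name, and the rule that an explicit non-auto fetchMethod argument takes precedence over the cache is a guess based on the variadic parameter in the signature.

```go
package main

import "fmt"

// Local stand-ins for the documented types (fields reduced).
type FetchMethod int

const (
	MethodAuto FetchMethod = iota
	MethodHTTP
	MethodBrowser
)

type QualityResult struct {
	Score       float64
	Recommended FetchMethod
}

// chooseMethod picks a fetch strategy. Assumed precedence: an explicit
// caller override, then the cached recommendation, then HTTP (first visit).
func chooseMethod(cached *QualityResult, override ...FetchMethod) FetchMethod {
	if len(override) > 0 && override[0] != MethodAuto {
		return override[0]
	}
	if cached == nil || cached.Recommended == MethodAuto {
		return MethodHTTP
	}
	return cached.Recommended
}

func main() {
	fmt.Println(chooseMethod(nil) == MethodHTTP)
	fmt.Println(chooseMethod(&QualityResult{Recommended: MethodBrowser}) == MethodBrowser)
}
```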

func (*Client) FetchBrowser

func (c *Client) FetchBrowser(ctx context.Context, endpoint string) (*FetchResult, error)

FetchBrowser fetches via headless browser regardless of the cached recommendation.

func (*Client) FetchHTTP

func (c *Client) FetchHTTP(ctx context.Context, endpoint string) (*FetchResult, error)

FetchHTTP performs a plain HTTP GET regardless of the cached recommendation. Useful when you explicitly want raw HTML without browser overhead.

func (*Client) FetchJSON

func (c *Client) FetchJSON(ctx context.Context, endpoint string) ([]byte, error)

FetchJSON performs a plain HTTP GET and returns the raw body bytes. Convenience wrapper for JSON API endpoints.

func (*Client) FetchRaw

func (c *Client) FetchRaw(ctx context.Context, endpoint string, method FetchMethod) (*FetchResult, error)

FetchRaw fetches the endpoint with the given method and populates a FetchResult with both the raw and readable bodies.

func (*Client) PipelineLen

func (c *Client) PipelineLen() int

PipelineLen returns the number of jobs currently waiting in the upgrade queue. Useful for monitoring.

func (*Client) RenderReadableHTML

func (c *Client) RenderReadableHTML(w io.Writer, body []byte) error

RenderReadableHTML extracts the main article content from raw HTML and writes it as clean HTML to w.

func (*Client) RenderReadableText

func (c *Client) RenderReadableText(w io.Writer, body []byte) error

RenderReadableText extracts the main article content and writes it as plain text to w.

func (*Client) ScoreFor

func (c *Client) ScoreFor(rawURL string) (QualityResult, bool)

ScoreFor returns the best known cached quality result for a URL's host.

type Confidence

type Confidence int

Confidence expresses how certain the scorer is about its recommendation.

const (
	ConfidenceLow    Confidence = iota // unsure – trigger background re-score
	ConfidenceMedium                   // probably right
	ConfidenceHigh                     // very certain
)

func (Confidence) String

func (c Confidence) String() string

type FetchMethod

type FetchMethod int

FetchMethod is the recommended crawl strategy for a host.

const (
	MethodAuto    FetchMethod = iota
	MethodHTTP                // plain HTTP is sufficient
	MethodBrowser             // headless browser required
)

func (FetchMethod) String

func (m FetchMethod) String() string

type FetchResult

type FetchResult struct {
	// RawBody is the unmodified response body as received from the server or browser.
	RawBody []byte
	// ReadableBody is the readability-extracted clean HTML for the main article content.
	// It is always populated (may be empty if readability found nothing meaningful).
	ReadableBody []byte
	// Method is the strategy that was actually used for this fetch.
	Method FetchMethod
	// Score is the best known cached quality score at the time of the fetch.
	// Nil on the first-ever visit to a host.
	Score *QualityResult
	// Endpoint is the URL that was fetched.
	Endpoint string
	// StatusCode is the HTTP status code (0 for browser fetches).
	StatusCode int
	// FetchedAt is when the fetch completed.
	FetchedAt time.Time
}

FetchResult bundles fetched content with metadata about how it was obtained.

type LLMChecker

type LLMChecker struct {
	// contains filtered or unexported fields
}

LLMChecker sends a compact, dual-view snippet (raw HTML head + readable body text) to Claude and asks for a structured quality verdict. It is the most accurate but also the slowest and most expensive checker, so it always runs in a background pipeline worker.

func NewLLMChecker

func NewLLMChecker(apiKey string) *LLMChecker

func (*LLMChecker) Check

func (l *LLMChecker) Check(ctx context.Context, rs *FetchResult) (QualityResult, error)

func (*LLMChecker) Tier

func (l *LLMChecker) Tier() Tier

type OllamaChecker

type OllamaChecker struct {
	// contains filtered or unexported fields
}

OllamaChecker is a TierLLM quality checker that runs a local model via Ollama's /api/chat endpoint. It uses the same dual-view prompt and result schema as LLMChecker (Claude) so scores are directly comparable.

Prefer this when you want zero API cost and don't mind slightly lower accuracy than a hosted frontier model.

func NewOllamaChecker

func NewOllamaChecker(opts ...OllamaOption) *OllamaChecker

func (*OllamaChecker) Check

func (*OllamaChecker) Ping

func (o *OllamaChecker) Ping(ctx context.Context) error

Ping checks whether the Ollama server is reachable and the chosen model is available. Call this at startup to fail fast rather than discovering the problem during a crawl.

func (*OllamaChecker) Tier

func (o *OllamaChecker) Tier() Tier

type OllamaOption

type OllamaOption func(*OllamaChecker)

func WithOllamaBaseURL

func WithOllamaBaseURL(url string) OllamaOption

WithOllamaBaseURL overrides the default http://localhost:11434 endpoint.

func WithOllamaHTTPClient

func WithOllamaHTTPClient(c *http.Client) OllamaOption

WithOllamaHTTPClient overrides the default HTTP client.

func WithOllamaModel

func WithOllamaModel(model string) OllamaOption

WithOllamaModel overrides the default model (llama3). Good alternatives: mistral, phi3, gemma2, qwen2, deepseek-r1.

type Option

type Option func(*Client)

func WithBrowserTimeout

func WithBrowserTimeout(d time.Duration) Option

WithBrowserTimeout sets a separate (usually longer) deadline for browser fetches. Defaults to 2× the HTTP timeout.

func WithChecker

func WithChecker(checker Checker) Option

WithChecker replaces the inline (TierBasic) checker.

func WithHTTPClient

func WithHTTPClient(httpClient *http.Client) Option

WithHTTPClient replaces the default HTTP client. The client's Timeout is overridden by WithTimeout unless you also call WithTimeout(0).

func WithLocation

func WithLocation(loc *time.Location) Option

WithLocation sets the time.Location used for any time-stamped output.

func WithLogger

func WithLogger(log *slog.Logger) Option

WithLogger sets the structured logger.

func WithPipeline

func WithPipeline(pipeline *Pipeline) Option

WithPipeline replaces the background upgrade pipeline.

func WithScoreStore

func WithScoreStore(store Store) Option

WithScoreStore replaces the default ScoreStore.

func WithTimeout

func WithTimeout(d time.Duration) Option

WithTimeout sets the per-request deadline for both HTTP and browser fetches.

func WithUserAgent

func WithUserAgent(ua string) Option

WithUserAgent overrides the User-Agent header sent on HTTP requests.
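
The Option functions above follow the functional-options pattern. The sketch below shows the general shape with lower-case stand-in names (client, option, newClient); the 30 s timeout and the user-agent string are placeholder defaults chosen for the example, while the 2× browser-timeout fallback follows the WithBrowserTimeout description.

```go
package main

import (
	"fmt"
	"time"
)

// client and option are stand-ins for the documented Client and Option.
type client struct {
	timeout        time.Duration
	browserTimeout time.Duration
	userAgent      string
}

type option func(*client)

func withTimeout(d time.Duration) option        { return func(c *client) { c.timeout = d } }
func withBrowserTimeout(d time.Duration) option { return func(c *client) { c.browserTimeout = d } }
func withUserAgent(ua string) option            { return func(c *client) { c.userAgent = ua } }

// newClient applies defaults first, then each option in order.
func newClient(opts ...option) *client {
	c := &client{
		timeout:   30 * time.Second,  // placeholder default, not documented
		userAgent: "example-bot/0.1", // placeholder default, not documented
	}
	for _, opt := range opts {
		opt(c)
	}
	if c.browserTimeout == 0 {
		c.browserTimeout = 2 * c.timeout // documented: defaults to 2x the HTTP timeout
	}
	return c
}

func main() {
	c := newClient(withTimeout(10*time.Second), withUserAgent("my-crawler/1.0"))
	fmt.Println(c.timeout, c.browserTimeout, c.userAgent)
}
```

Resolving derived defaults after all options have run means the browser timeout tracks whatever HTTP timeout the caller chose, regardless of option order.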

type Pipeline

type Pipeline struct {
	// contains filtered or unexported fields
}

Pipeline manages a pool of background workers that run higher-tier checkers and update the ScoreStore. Callers are never blocked — if the queue is full the job is dropped and a warning is logged.

Key production improvements over the original:

  • Per-host deduplication: only one pending job per host at a time, so a burst of fetches for the same domain doesn't flood the queue.
  • Job timeout via context: a stuck LLM call cannot stall a worker forever.
  • Queue-depth metric exposed via Len() for monitoring.
  • Graceful drain on Close() with a configurable deadline.

func NewPipeline

func NewPipeline(store Store, checkers []Checker, workers int, log *slog.Logger, opts ...PipelineOption) *Pipeline

func (*Pipeline) Close

func (p *Pipeline) Close()

Close drains pending jobs and waits for all workers to finish.

func (*Pipeline) Enqueue

func (p *Pipeline) Enqueue(host string, rs *FetchResult, fromTier Tier) bool

Enqueue submits a non-blocking background upgrade for the given host. Returns false if the job was deduplicated or the queue was full.
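
One common way to get this behaviour is a buffered channel guarded by a per-host pending set. The sketch below is an assumption about the general technique, not the Pipeline's actual code; queue, enqueue, done and depth are hypothetical names.

```go
package main

import (
	"fmt"
	"sync"
)

// queue is a hypothetical stand-in for the Pipeline's job queue.
type queue struct {
	mu      sync.Mutex
	pending map[string]bool // hosts with a job waiting or in flight
	jobs    chan string
}

func newQueue(size int) *queue {
	return &queue{pending: make(map[string]bool), jobs: make(chan string, size)}
}

// enqueue never blocks. It returns false when the host already has a
// pending job (deduplicated) or when the buffer is full (dropped).
func (q *queue) enqueue(host string) bool {
	q.mu.Lock()
	defer q.mu.Unlock()
	if q.pending[host] {
		return false
	}
	select {
	case q.jobs <- host:
		q.pending[host] = true
		return true
	default:
		return false
	}
}

// done is called by a worker once the job for host has finished.
func (q *queue) done(host string) {
	q.mu.Lock()
	delete(q.pending, host)
	q.mu.Unlock()
}

// depth reports how many jobs are buffered, in the spirit of Len.
func (q *queue) depth() int { return len(q.jobs) }

func main() {
	q := newQueue(1)
	fmt.Println(q.enqueue("example.com")) // accepted
	fmt.Println(q.enqueue("example.com")) // deduplicated
	fmt.Println(q.enqueue("example.org")) // queue full
	fmt.Println(q.depth())
}
```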

func (*Pipeline) Len

func (p *Pipeline) Len() int

Len returns the number of jobs currently buffered in the queue.

type PipelineOption

type PipelineOption func(*Pipeline)

PipelineOption configures a Pipeline.

func WithPipelineJobTimeout

func WithPipelineJobTimeout(d time.Duration) PipelineOption

WithPipelineJobTimeout sets the per-job context deadline. Default: 60 s. Set to 0 to disable.
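
A per-job deadline of this kind is typically built with context.WithTimeout. The sketch below assumes that approach; runJob and slowJob are hypothetical helpers, and slowJob merely simulates a checker call that honours cancellation.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// runJob runs job under a deadline. A timeout of 0 disables the deadline,
// matching the documented meaning of WithPipelineJobTimeout(0).
func runJob(timeout time.Duration, job func(context.Context) error) error {
	ctx := context.Background()
	if timeout > 0 {
		var cancel context.CancelFunc
		ctx, cancel = context.WithTimeout(ctx, timeout)
		defer cancel()
	}
	return job(ctx)
}

// slowJob simulates a checker call that takes d unless it is cancelled.
func slowJob(d time.Duration) func(context.Context) error {
	return func(ctx context.Context) error {
		select {
		case <-time.After(d):
			return nil
		case <-ctx.Done():
			return ctx.Err()
		}
	}
}

func main() {
	fmt.Println(runJob(10*time.Millisecond, slowJob(200*time.Millisecond)))
	fmt.Println(runJob(0, slowJob(time.Millisecond)))
}
```

The first call should report a deadline error because the simulated job outlasts its budget; the second has no deadline and completes normally.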

func WithPipelineQueueSize

func WithPipelineQueueSize(n int) PipelineOption

WithPipelineQueueSize sets the channel buffer. Default: 512.

type ProbeCandidate

type ProbeCandidate struct {
	Host          string
	LastURL       string
	CurrentMethod FetchMethod
}

ProbeCandidate describes a host due for a de-escalation probe.

type Prober

type Prober struct {
	// contains filtered or unexported fields
}

Prober runs in the background and periodically attempts to de-escalate hosts from Browser back to HTTP. It never touches live Fetch() calls.

Production improvements:

  • Configurable concurrency: probes run in parallel, up to ProberConfig.Concurrency at a time.
  • Pending confirmation state is protected by a mutex (safe across goroutines).
  • Per-probe timeout via context.
  • Graceful stop waits for the current scan batch to complete.

func NewProber

func NewProber(c *Client, opts ...ProberOption) *Prober

NewProber creates a Prober wired to the given Client. With no options it uses the default config; pass ProberOption functions (for example WithScanInterval or WithConfirmCount) for custom configuration.

func (*Prober) Close

func (p *Prober) Close()

Close signals the scan loop to stop and waits for the current scan to finish.

func (*Prober) Start

func (p *Prober) Start()

Start launches the background scan loop. Non-blocking.

type ProberConfig

type ProberConfig struct {
	// ScanInterval is how often the prober scans for hosts due a probe.
	// Default: 5 min.
	ScanInterval time.Duration
	// InitialProbeInterval is how long after browser is chosen before the
	// first de-escalation probe. Default: 30 min.
	InitialProbeInterval time.Duration
	// DemoteThreshold is the minimum quality score for HTTP to be considered
	// "good enough" to demote back from browser. Default: 0.65.
	DemoteThreshold float64
	// ConfirmCount is how many consecutive successful HTTP probes are required
	// before demoting. Guards against a single lucky probe on a flaky site.
	// Default: 2.
	ConfirmCount int
	// ProbeTimeout is the per-probe request deadline. Default: 30 s.
	ProbeTimeout time.Duration
	// Concurrency is the maximum number of hosts probed in parallel.
	// Default: 4.
	Concurrency int
}

ProberConfig controls de-escalation behaviour.
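
The interaction between DemoteThreshold and ConfirmCount can be sketched as a per-host streak counter. This is an assumed reading of the field comments, not the Prober's code; confirmTracker and observe are hypothetical names.

```go
package main

import (
	"fmt"
	"sync"
)

// confirmTracker counts consecutive good HTTP probes per host.
type confirmTracker struct {
	mu        sync.Mutex
	streak    map[string]int
	threshold float64 // plays the role of DemoteThreshold
	need      int     // plays the role of ConfirmCount
}

func newConfirmTracker(threshold float64, need int) *confirmTracker {
	return &confirmTracker{streak: make(map[string]int), threshold: threshold, need: need}
}

// observe records one probe score and reports whether the host has now
// passed enough consecutive probes to be demoted back to HTTP.
func (c *confirmTracker) observe(host string, score float64) bool {
	c.mu.Lock()
	defer c.mu.Unlock()
	if score < c.threshold {
		delete(c.streak, host) // a bad probe breaks the streak
		return false
	}
	c.streak[host]++
	if c.streak[host] >= c.need {
		delete(c.streak, host)
		return true
	}
	return false
}

func main() {
	tr := newConfirmTracker(0.65, 2) // documented defaults
	fmt.Println(tr.observe("example.com", 0.70))
	fmt.Println(tr.observe("example.com", 0.80))
}
```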

type ProberOption

type ProberOption func(*ProberConfig)

ProberOption configures a Prober.

func WithConfirmCount

func WithConfirmCount(n int) ProberOption

func WithDemoteThreshold

func WithDemoteThreshold(t float64) ProberOption

func WithInitialProbeInterval

func WithInitialProbeInterval(d time.Duration) ProberOption

func WithProbeConcurrency

func WithProbeConcurrency(n int) ProberOption

func WithProbeTimeout

func WithProbeTimeout(d time.Duration) ProberOption

func WithScanInterval

func WithScanInterval(d time.Duration) ProberOption

type QualityResult

type QualityResult struct {
	Score       float64            // 0.0 (unusable) – 1.0 (perfect)
	Confidence  Confidence         // how certain the checker is
	Signals     map[string]float64 // named signal contributions for debugging
	Recommended FetchMethod        // which fetch strategy to use next time
	Reason      string             // human-readable summary
	Tier        Tier               // which checker produced this
}

QualityResult is the unified output of every quality checker.

func (QualityResult) NeedsUpgrade

func (q QualityResult) NeedsUpgrade() bool

NeedsUpgrade returns true when the result is uncertain enough to warrant a background upgrade to a higher-tier checker.
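
The exact rule is not documented. Going by the ConfidenceLow comment ("unsure – trigger background re-score"), a plausible sketch is shown below; the extra condition that the result is not already from the highest tier is an assumption, and the types are local stand-ins with reduced fields.

```go
package main

import "fmt"

type Confidence int

const (
	ConfidenceLow Confidence = iota
	ConfidenceMedium
	ConfidenceHigh
)

type Tier int

const (
	TierBasic Tier = iota
	TierLLM
)

type QualityResult struct {
	Score      float64
	Confidence Confidence
	Tier       Tier
}

// needsUpgrade is a hypothetical rule: re-score when confidence is low
// and a higher tier than the one that produced the result still exists.
func needsUpgrade(q QualityResult) bool {
	return q.Confidence == ConfidenceLow && q.Tier < TierLLM
}

func main() {
	fmt.Println(needsUpgrade(QualityResult{Score: 0.5, Confidence: ConfidenceLow, Tier: TierBasic}))
	fmt.Println(needsUpgrade(QualityResult{Score: 0.9, Confidence: ConfidenceHigh, Tier: TierBasic}))
}
```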

type ScoreStore

type ScoreStore struct {
	// contains filtered or unexported fields
}

ScoreStore is a thread-safe, per-host store of quality observations.

func NewScoreStore

func NewScoreStore(opts ...StoreOption) *ScoreStore

func (*ScoreStore) Delete

func (s *ScoreStore) Delete(host string)

Delete removes all observations for a host (manual cache invalidation).

func (*ScoreStore) Evict

func (s *ScoreStore) Evict()

Evict removes observations older than the window TTL (see WithWindowTTL) from all entries. Call periodically (e.g. in a housekeeping goroutine) to bound memory use.

func (*ScoreStore) Get

func (s *ScoreStore) Get(host string) (QualityResult, bool)

Get returns the merged (time-decayed) result for a host. Returns (zero, false) if no observations exist or all have expired.

func (*ScoreStore) Len

func (s *ScoreStore) Len() int

Len returns the number of hosts currently tracked.

func (*ScoreStore) ProbeCandidates

func (s *ScoreStore) ProbeCandidates() []ProbeCandidate

ProbeCandidates returns hosts that are on Browser and whose next probe time has elapsed.

func (*ScoreStore) RecordProbeResult

func (s *ScoreStore) RecordProbeResult(host string, success bool, baseInterval time.Duration)

RecordProbeResult updates probe scheduling after a de-escalation attempt.

  • success=true → clear observations so new HTTP results dominate quickly.
  • success=false → exponential backoff, capped at 24 h.
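
The failure branch can be sketched as follows. Only the 24 h cap comes from the description above; the doubling factor and the names nextProbeInterval and maxProbeInterval are assumptions made for the example.

```go
package main

import (
	"fmt"
	"time"
)

const maxProbeInterval = 24 * time.Hour // documented cap

// nextProbeInterval doubles base once per consecutive failure and never
// exceeds maxProbeInterval. The factor of 2 is an assumption.
func nextProbeInterval(base time.Duration, failures int) time.Duration {
	d := base
	for i := 0; i < failures; i++ {
		d *= 2
		if d >= maxProbeInterval {
			return maxProbeInterval
		}
	}
	if d > maxProbeInterval {
		return maxProbeInterval
	}
	return d
}

func main() {
	// Starting from the documented 30 min InitialProbeInterval default.
	for failures := 0; failures <= 6; failures++ {
		fmt.Println(failures, nextProbeInterval(30*time.Minute, failures))
	}
}
```

Returning as soon as the cap is reached also keeps the multiplication from overflowing when the failure count is large.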

func (*ScoreStore) ScheduleProbe

func (s *ScoreStore) ScheduleProbe(host string, interval time.Duration)

ScheduleProbe sets the initial nextProbeAt for a host that just got promoted to Browser. A no-op if a probe is already scheduled.

func (*ScoreStore) SetLastURL

func (s *ScoreStore) SetLastURL(host, rawURL string)

SetLastURL records the most recently fetched URL for a host so the Prober knows what URL to re-probe.

func (*ScoreStore) Update

func (s *ScoreStore) Update(host string, result QualityResult)

Update adds a new observation for the host. Every observation contributes to the weighted average regardless of tier, so even basic heuristic readings add signal to the window.

type Store

type Store interface {
	Delete(host string)
	Evict()
	Get(host string) (QualityResult, bool)
	Len() int
	ProbeCandidates() []ProbeCandidate
	RecordProbeResult(host string, success bool, baseInterval time.Duration)
	ScheduleProbe(host string, interval time.Duration)
	SetLastURL(host string, rawURL string)
	Update(host string, result QualityResult)
}

type StoreOption

type StoreOption func(*ScoreStore)

StoreOption configures the ScoreStore.

func WithHalfLife

func WithHalfLife(d time.Duration) StoreOption

WithHalfLife sets the exponential decay half-life. A shorter half-life makes the store react faster to site changes. Default: 4 h.
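
To make the half-life concrete: with exponential decay, an observation of age t is weighted by 0.5^(t / halfLife), so a reading one half-life old counts half as much as a fresh one. The sketch below applies that weighting to a plain weighted mean; observation and decayedScore are hypothetical names, and the store's real merge may weigh other factors as well.

```go
package main

import (
	"fmt"
	"math"
	"time"
)

// observation is a stand-in for one stored quality reading.
type observation struct {
	score float64
	age   time.Duration
}

// decayedScore returns the decay-weighted mean of the observations.
// It reports false when there is nothing usable to average.
func decayedScore(obs []observation, halfLife time.Duration) (float64, bool) {
	if halfLife <= 0 {
		return 0, false
	}
	var sum, weights float64
	for _, o := range obs {
		w := math.Pow(0.5, float64(o.age)/float64(halfLife))
		sum += w * o.score
		weights += w
	}
	if weights == 0 {
		return 0, false
	}
	return sum / weights, true
}

func main() {
	obs := []observation{
		{score: 0.9, age: 0},             // fresh reading, weight 1
		{score: 0.3, age: 8 * time.Hour}, // two half-lives old, weight 0.25
	}
	s, ok := decayedScore(obs, 4*time.Hour) // documented default half-life
	fmt.Printf("%.3f %v\n", s, ok)
}
```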

func WithMaxWindow

func WithMaxWindow(n int) StoreOption

WithMaxWindow sets the maximum observations kept per host. Default: 20.

func WithWindowTTL

func WithWindowTTL(d time.Duration) StoreOption

WithWindowTTL sets how long an individual observation is retained. Default: 24 h.

type Tier

type Tier int

Tier represents the sophistication level of a quality checker. Higher tiers are more accurate but slower and more expensive.

const (
	TierBasic Tier = iota // fast heuristic, runs inline with every Fetch
	TierLLM               // LLM-based, runs in a background pipeline worker
)
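
Because the constants are declared in ascending order of sophistication, escalation can be expressed as an ordinary comparison. The sketch below picks the lowest-tier checker strictly above a given tier; it is an assumed reading of how a pipeline might use the fromTier argument of Enqueue, and tiered, namedChecker and nextChecker are hypothetical names.

```go
package main

import "fmt"

type Tier int

const (
	TierBasic Tier = iota
	TierLLM
)

// tiered is a cut-down stand-in for Checker; only Tier matters here.
type tiered interface{ Tier() Tier }

type namedChecker struct {
	name string
	tier Tier
}

func (n namedChecker) Tier() Tier { return n.tier }

// nextChecker returns the lowest-tier checker strictly above from, if any.
func nextChecker(checkers []tiered, from Tier) (tiered, bool) {
	var best tiered
	for _, c := range checkers {
		if c.Tier() <= from {
			continue
		}
		if best == nil || c.Tier() < best.Tier() {
			best = c
		}
	}
	return best, best != nil
}

func main() {
	pool := []tiered{
		namedChecker{name: "heuristic", tier: TierBasic},
		namedChecker{name: "llm", tier: TierLLM},
	}
	if c, ok := nextChecker(pool, TierBasic); ok {
		fmt.Println(c.(namedChecker).name)
	}
}
```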

func (Tier) String

func (t Tier) String() string

Directories

Path          Synopsis
cmd/spider    command
