Documentation ¶
Index ¶
- type BasicChecker
- type Checker
- type Client
- func (c *Client) CheckQuality(ctx context.Context, rs *FetchResult) (QualityResult, error)
- func (c *Client) Close() error
- func (c *Client) Fetch(ctx context.Context, endpoint string, fetchMethod ...FetchMethod) (*FetchResult, error)
- func (c *Client) FetchBrowser(ctx context.Context, endpoint string) (*FetchResult, error)
- func (c *Client) FetchHTTP(ctx context.Context, endpoint string) (*FetchResult, error)
- func (c *Client) FetchJSON(ctx context.Context, endpoint string) ([]byte, error)
- func (c *Client) FetchRaw(ctx context.Context, endpoint string, method FetchMethod) (*FetchResult, error)
- func (c *Client) PipelineLen() int
- func (c *Client) RenderReadableHTML(w io.Writer, body []byte) error
- func (c *Client) RenderReadableText(w io.Writer, body []byte) error
- func (c *Client) ScoreFor(rawURL string) (QualityResult, bool)
- type Confidence
- type FetchMethod
- type FetchResult
- type LLMChecker
- type OllamaChecker
- type OllamaOption
- type Option
- func WithBrowserTimeout(d time.Duration) Option
- func WithChecker(checker Checker) Option
- func WithHTTPClient(httpClient *http.Client) Option
- func WithLocation(loc *time.Location) Option
- func WithLogger(log *slog.Logger) Option
- func WithPipeline(pipeline *Pipeline) Option
- func WithScoreStore(store Store) Option
- func WithTimeout(d time.Duration) Option
- func WithUserAgent(ua string) Option
- type Pipeline
- type PipelineOption
- type ProbeCandidate
- type Prober
- type ProberConfig
- type ProberOption
- func WithConfirmCount(n int) ProberOption
- func WithDemoteThreshold(t float64) ProberOption
- func WithInitialProbeInterval(d time.Duration) ProberOption
- func WithProbeConcurrency(n int) ProberOption
- func WithProbeTimeout(d time.Duration) ProberOption
- func WithScanInterval(d time.Duration) ProberOption
- type QualityResult
- type ScoreStore
- func (s *ScoreStore) Delete(host string)
- func (s *ScoreStore) Evict()
- func (s *ScoreStore) Get(host string) (QualityResult, bool)
- func (s *ScoreStore) Len() int
- func (s *ScoreStore) ProbeCandidates() []ProbeCandidate
- func (s *ScoreStore) RecordProbeResult(host string, success bool, baseInterval time.Duration)
- func (s *ScoreStore) ScheduleProbe(host string, interval time.Duration)
- func (s *ScoreStore) SetLastURL(host, rawURL string)
- func (s *ScoreStore) Update(host string, result QualityResult)
- type Store
- type StoreOption
- type Tier
Constants ¶
This section is empty.
Variables ¶
This section is empty.
Functions ¶
This section is empty.
Types ¶
type BasicChecker ¶
type BasicChecker struct{}
BasicChecker merges the former BasicChecker and AdvancedChecker into a single heuristic pass. It runs inline, in the same goroutine as Fetch, and uses both the raw HTML body and the readability-extracted body that the fetcher provides.
Signal groups:
- Content volume (readable text length, raw body length)
- Text quality (lexical diversity, visible-text ratio)
- DOM structure (semantic tags, heading hierarchy, main/article)
- Metadata (title quality, Open Graph, canonical)
- Readable body (word count from readability output, content density)
- Hard penalties (anti-bot, error page, login wall, SPA shell, JS-heavy)
- Soft penalties (link density, noindex, noscript JS-wall signal)
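Two of the text-quality signals listed above can be approximated in a few lines. This is a minimal sketch, not the package's actual scoring code: lexicalDiversity and visibleTextRatio are hypothetical helper names, and the real checker may weight or compute these signals differently.

```go
package main

import (
	"fmt"
	"strings"
)

// lexicalDiversity is a hypothetical helper: the ratio of distinct
// lower-cased words to total words (a type-token ratio).
// It returns 0 for empty input.
func lexicalDiversity(text string) float64 {
	words := strings.Fields(strings.ToLower(text))
	if len(words) == 0 {
		return 0
	}
	seen := make(map[string]struct{}, len(words))
	for _, w := range words {
		seen[w] = struct{}{}
	}
	return float64(len(seen)) / float64(len(words))
}

// visibleTextRatio is a hypothetical helper: readable text length divided
// by raw body length, capped at 1. It returns 0 when the raw body is empty.
func visibleTextRatio(readableLen, rawLen int) float64 {
	if rawLen <= 0 {
		return 0
	}
	r := float64(readableLen) / float64(rawLen)
	if r > 1 {
		return 1
	}
	return r
}

func main() {
	fmt.Printf("diversity=%.2f\n", lexicalDiversity("the cat sat on the mat"))
	fmt.Printf("ratio=%.2f\n", visibleTextRatio(250, 1000))
}
```

A low diversity value suggests repeated boilerplate, while a low visible-text ratio suggests a page dominated by markup or scripts.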
func NewBasicChecker ¶
func NewBasicChecker() *BasicChecker
func (*BasicChecker) Check ¶
func (h *BasicChecker) Check(_ context.Context, rs *FetchResult) (QualityResult, error)
func (*BasicChecker) Tier ¶
func (h *BasicChecker) Tier() Tier
Tier returns TierBasic so the pipeline will escalate to LLM when uncertain.
type Checker ¶
type Checker interface {
// Check analyses the fetch result and returns a quality verdict.
Check(ctx context.Context, rs *FetchResult) (QualityResult, error)
// Tier returns the sophistication level of this checker.
Tier() Tier
}
Checker is the universal interface every quality strategy must satisfy.
type Client ¶
type Client struct {
// contains filtered or unexported fields
}
Client is the main spider entry point. It automatically selects HTTP or headless-browser fetch strategies based on per-host quality scores that are updated lazily in the background.
func New ¶
New creates a Client with sensible production defaults. Pass option functions to customise behaviour.
func (*Client) CheckQuality ¶
func (c *Client) CheckQuality(ctx context.Context, rs *FetchResult) (QualityResult, error)
CheckQuality scores a pre-fetched result using the configured inline checker.
func (*Client) Close ¶
func (c *Client) Close() error
Close drains the background pipeline, releases the browser, and waits for all background goroutines to finish.
func (*Client) Fetch ¶
func (c *Client) Fetch(ctx context.Context, endpoint string, fetchMethod ...FetchMethod) (*FetchResult, error)
Fetch is the smart entry point. It:
- Reads the cached quality score for this host.
- Chooses HTTP or browser accordingly (defaults to HTTP on first visit).
- Fetches and extracts the readable body.
- Runs the configured inline checker (a TierBasic checker such as BasicChecker) to produce an immediate score.
- If the result is uncertain, enqueues a non-blocking background upgrade.
- Returns both bodies and the pre-fetch cached score.
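The first two steps above can be sketched as a small selection function. This is an assumption-laden illustration rather than the package's code: chooseMethod is a hypothetical name, the types are local stand-ins, and treating an explicit non-Auto fetchMethod argument as an override is a guess based on the variadic parameter in the signature.

```go
package main

import "fmt"

// Local stand-ins for the documented types.
type FetchMethod int

const (
	MethodAuto FetchMethod = iota
	MethodHTTP
	MethodBrowser
)

type QualityResult struct{ Recommended FetchMethod }

// chooseMethod is hypothetical. Precedence in this sketch:
//  1. an explicit, non-Auto override from the caller (assumed behaviour),
//  2. the cached recommendation for the host,
//  3. plain HTTP on a first visit (cached == nil).
func chooseMethod(cached *QualityResult, override ...FetchMethod) FetchMethod {
	if len(override) > 0 && override[0] != MethodAuto {
		return override[0]
	}
	if cached != nil && cached.Recommended != MethodAuto {
		return cached.Recommended
	}
	return MethodHTTP
}

func main() {
	fmt.Println(chooseMethod(nil) == MethodHTTP)
	fmt.Println(chooseMethod(&QualityResult{Recommended: MethodBrowser}) == MethodBrowser)
	fmt.Println(chooseMethod(&QualityResult{Recommended: MethodBrowser}, MethodHTTP) == MethodHTTP)
}
```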
func (*Client) FetchBrowser ¶
func (c *Client) FetchBrowser(ctx context.Context, endpoint string) (*FetchResult, error)
FetchBrowser fetches via headless browser regardless of the cached recommendation.
func (*Client) FetchHTTP ¶
func (c *Client) FetchHTTP(ctx context.Context, endpoint string) (*FetchResult, error)
FetchHTTP performs a plain HTTP GET regardless of the cached recommendation. Useful when you explicitly want raw HTML without browser overhead.
func (*Client) FetchJSON ¶
func (c *Client) FetchJSON(ctx context.Context, endpoint string) ([]byte, error)
FetchJSON performs a plain HTTP GET and returns the raw body bytes. Convenience wrapper for JSON API endpoints.
func (*Client) FetchRaw ¶
func (c *Client) FetchRaw(ctx context.Context, endpoint string, method FetchMethod) (*FetchResult, error)
FetchRaw fetches the endpoint with the given method and populates a FetchResult with both the raw and readable bodies.
func (*Client) PipelineLen ¶
func (c *Client) PipelineLen() int
PipelineLen returns the number of jobs currently waiting in the upgrade queue. Useful for monitoring.
func (*Client) RenderReadableHTML ¶
func (c *Client) RenderReadableHTML(w io.Writer, body []byte) error
RenderReadableHTML extracts the main article content from raw HTML and writes it as clean HTML to w.
func (*Client) RenderReadableText ¶
func (c *Client) RenderReadableText(w io.Writer, body []byte) error
RenderReadableText extracts the main article content and writes it as plain text to w.
type Confidence ¶
type Confidence int
Confidence expresses how certain the scorer is about its recommendation.
const (
ConfidenceLow Confidence = iota // unsure – trigger background re-score
ConfidenceMedium // probably right
ConfidenceHigh // very certain
)
func (Confidence) String ¶
func (c Confidence) String() string
type FetchMethod ¶
type FetchMethod int
FetchMethod is the recommended crawl strategy for a host.
const (
MethodAuto FetchMethod = iota
MethodHTTP // plain HTTP is sufficient
MethodBrowser // headless browser required
)
func (FetchMethod) String ¶
func (m FetchMethod) String() string
type FetchResult ¶
type FetchResult struct {
// RawBody is the unmodified response body as received from the server or browser.
RawBody []byte
// ReadableBody is the readability-extracted clean HTML for the main article content.
// It is always populated (may be empty if readability found nothing meaningful).
ReadableBody []byte
// Method is the strategy that was actually used for this fetch.
Method FetchMethod
// Score is the best known cached quality score at the time of the fetch.
// Nil on the first-ever visit to a host.
Score *QualityResult
// Endpoint is the URL that was fetched.
Endpoint string
// StatusCode is the HTTP status code (0 for browser fetches).
StatusCode int
// FetchedAt is when the fetch completed.
FetchedAt time.Time
}
FetchResult bundles fetched content with metadata about how it was obtained.
type LLMChecker ¶
type LLMChecker struct {
// contains filtered or unexported fields
}
LLMChecker sends a compact, dual-view snippet (raw HTML head + readable body text) to Claude and asks for a structured quality verdict. It is the most accurate but also the slowest and most expensive checker, so it always runs in a background pipeline worker.
func NewLLMChecker ¶
func NewLLMChecker(apiKey string) *LLMChecker
func (*LLMChecker) Check ¶
func (l *LLMChecker) Check(ctx context.Context, rs *FetchResult) (QualityResult, error)
func (*LLMChecker) Tier ¶
func (l *LLMChecker) Tier() Tier
type OllamaChecker ¶
type OllamaChecker struct {
// contains filtered or unexported fields
}
OllamaChecker is a TierLLM quality checker that runs a local model via Ollama's /api/chat endpoint. It uses the same dual-view prompt and result schema as LLMChecker (Claude) so scores are directly comparable.
Prefer this when you want zero API cost and don't mind slightly lower accuracy than a hosted frontier model.
func NewOllamaChecker ¶
func NewOllamaChecker(opts ...OllamaOption) *OllamaChecker
func (*OllamaChecker) Check ¶
func (o *OllamaChecker) Check(ctx context.Context, rs *FetchResult) (QualityResult, error)
func (*OllamaChecker) Ping ¶
func (o *OllamaChecker) Ping(ctx context.Context) error
Ping checks whether the Ollama server is reachable and the chosen model is available. Call this at startup to fail fast rather than discovering the problem during a crawl.
func (*OllamaChecker) Tier ¶
func (o *OllamaChecker) Tier() Tier
type OllamaOption ¶
type OllamaOption func(*OllamaChecker)
func WithOllamaBaseURL ¶
func WithOllamaBaseURL(url string) OllamaOption
WithOllamaBaseURL overrides the default http://localhost:11434 endpoint.
func WithOllamaHTTPClient ¶
func WithOllamaHTTPClient(c *http.Client) OllamaOption
WithOllamaHTTPClient overrides the default HTTP client.
func WithOllamaModel ¶
func WithOllamaModel(model string) OllamaOption
WithOllamaModel overrides the default model (llama3). Good alternatives: mistral, phi3, gemma2, qwen2, deepseek-r1.
type Option ¶
type Option func(*Client)
func WithBrowserTimeout ¶
func WithBrowserTimeout(d time.Duration) Option
WithBrowserTimeout sets a separate (usually longer) deadline for browser fetches. Defaults to 2× the HTTP timeout.
func WithChecker ¶
func WithChecker(checker Checker) Option
WithChecker replaces the inline (TierBasic) checker.
func WithHTTPClient ¶
func WithHTTPClient(httpClient *http.Client) Option
WithHTTPClient replaces the default HTTP client. The supplied client's Timeout field is overridden by the timeout configured through WithTimeout; pass WithTimeout(0) as well to keep the client's own Timeout.
func WithLocation ¶
func WithLocation(loc *time.Location) Option
WithLocation sets the time.Location used for any time-stamped output.
func WithPipeline ¶
func WithPipeline(pipeline *Pipeline) Option
WithPipeline replaces the background upgrade pipeline.
func WithScoreStore ¶
func WithScoreStore(store Store) Option
WithScoreStore replaces the default ScoreStore.
func WithTimeout ¶
func WithTimeout(d time.Duration) Option
WithTimeout sets the per-request deadline for HTTP fetches. It also determines the browser deadline, which defaults to 2× this value unless WithBrowserTimeout is set.
func WithUserAgent ¶
func WithUserAgent(ua string) Option
WithUserAgent overrides the User-Agent header sent on HTTP requests.
type Pipeline ¶
type Pipeline struct {
// contains filtered or unexported fields
}
Pipeline manages a pool of background workers that run higher-tier checkers and update the ScoreStore. Callers are never blocked: if the queue is full, the job is dropped and a warning is logged.
Key production improvements over the original:
- Per-host deduplication: only one pending job per host at a time, so a burst of fetches for the same domain doesn't flood the queue.
- Job timeout via context: a stuck LLM call cannot stall a worker forever.
- Queue-depth metric exposed via Len() for monitoring.
- Graceful drain on Close() with a configurable deadline.
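The non-blocking enqueue and per-host deduplication described above can be sketched with a buffered channel and a mutex-guarded set. This is an illustrative assumption about how such a queue might work, not the Pipeline's real implementation; queue, tryEnqueue, and done are hypothetical names.

```go
package main

import (
	"fmt"
	"sync"
)

// queue is a hypothetical stand-in for the pipeline's job queue.
type queue struct {
	mu      sync.Mutex
	pending map[string]bool // hosts with a job waiting or in flight
	jobs    chan string
}

func newQueue(size int) *queue {
	return &queue{pending: make(map[string]bool), jobs: make(chan string, size)}
}

// tryEnqueue never blocks. It returns false when the host already has a
// pending job or when the buffer is full (the job is dropped).
func (q *queue) tryEnqueue(host string) bool {
	q.mu.Lock()
	defer q.mu.Unlock()
	if q.pending[host] {
		return false
	}
	select {
	case q.jobs <- host:
		q.pending[host] = true
		return true
	default:
		return false
	}
}

// done marks a host's job as finished so the host can be queued again.
func (q *queue) done(host string) {
	q.mu.Lock()
	delete(q.pending, host)
	q.mu.Unlock()
}

// Len reports how many jobs are waiting in the buffer.
func (q *queue) Len() int { return len(q.jobs) }

func main() {
	q := newQueue(2)
	fmt.Println(q.tryEnqueue("a.example")) // queued
	fmt.Println(q.tryEnqueue("a.example")) // duplicate host
	fmt.Println(q.tryEnqueue("b.example")) // queued
	fmt.Println(q.tryEnqueue("c.example")) // buffer full, dropped
	fmt.Println(q.Len())
}
```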
func NewPipeline ¶
func (*Pipeline) Close ¶
func (p *Pipeline) Close()
Close drains pending jobs and waits for all workers to finish.
type PipelineOption ¶
type PipelineOption func(*Pipeline)
PipelineOption configures a Pipeline.
func WithPipelineJobTimeout ¶
func WithPipelineJobTimeout(d time.Duration) PipelineOption
WithPipelineJobTimeout sets the per-job context deadline. Default: 60 s. Set to 0 to disable.
func WithPipelineQueueSize ¶
func WithPipelineQueueSize(n int) PipelineOption
WithPipelineQueueSize sets the channel buffer. Default: 512.
type ProbeCandidate ¶
type ProbeCandidate struct {
Host string
LastURL string
CurrentMethod FetchMethod
}
ProbeCandidate describes a host due for a de-escalation probe.
type Prober ¶
type Prober struct {
// contains filtered or unexported fields
}
Prober runs in the background and periodically attempts to de-escalate hosts from Browser back to HTTP. It never touches live Fetch() calls.
Production improvements:
- Configurable concurrency: probes run in parallel up to cfg.Concurrency.
- Pending confirmation state is protected by a mutex (safe across goroutines).
- Per-probe timeout via context.
- Graceful stop waits for the current scan batch to complete.
func NewProber ¶
func NewProber(c *Client, opts ...ProberOption) *Prober
NewProber creates a Prober wired to the given Client. With no options it uses the default config; pass ProberOption values to customise it.
type ProberConfig ¶
type ProberConfig struct {
// ScanInterval is how often the prober scans for hosts due a probe.
// Default: 5 min.
ScanInterval time.Duration
// InitialProbeInterval is how long after browser is chosen before the
// first de-escalation probe. Default: 30 min.
InitialProbeInterval time.Duration
// DemoteThreshold is the minimum quality score for HTTP to be considered
// "good enough" to demote back from browser. Default: 0.65.
DemoteThreshold float64
// ConfirmCount is how many consecutive successful HTTP probes are required
// before demoting. Guards against a single lucky probe on a flaky site.
// Default: 2.
ConfirmCount int
// ProbeTimeout is the per-probe request deadline. Default: 30 s.
ProbeTimeout time.Duration
// Concurrency is the maximum number of hosts probed in parallel.
// Default: 4.
Concurrency int
}
ProberConfig controls de-escalation behaviour.
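The interaction between DemoteThreshold and ConfirmCount can be sketched as a per-host streak counter. This is an assumption about the logic rather than the Prober's actual code: confirmer and observe are hypothetical names, and the mutex that the real Prober uses to protect this state is omitted for brevity.

```go
package main

import "fmt"

// confirmer is hypothetical: it tracks consecutive passing HTTP probes per host.
// It is not safe for concurrent use.
type confirmer struct {
	threshold float64        // plays the role of DemoteThreshold
	need      int            // plays the role of ConfirmCount
	streak    map[string]int // consecutive passes per host
}

func newConfirmer(threshold float64, need int) *confirmer {
	return &confirmer{threshold: threshold, need: need, streak: make(map[string]int)}
}

// observe records one probe score and reports whether the host has now
// passed `need` probes in a row. A failing probe resets the streak.
func (c *confirmer) observe(host string, score float64) bool {
	if score < c.threshold {
		delete(c.streak, host)
		return false
	}
	c.streak[host]++
	if c.streak[host] >= c.need {
		delete(c.streak, host) // start fresh after a demotion
		return true
	}
	return false
}

func main() {
	c := newConfirmer(0.65, 2)
	fmt.Println(c.observe("a.example", 0.70)) // first pass, not yet confirmed
	fmt.Println(c.observe("a.example", 0.40)) // failure resets the streak
	fmt.Println(c.observe("a.example", 0.80)) // first pass again
	fmt.Println(c.observe("a.example", 0.90)) // second consecutive pass
}
```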
type ProberOption ¶
type ProberOption func(*ProberConfig)
ProberOption configures a Prober.
func WithConfirmCount ¶
func WithConfirmCount(n int) ProberOption
func WithDemoteThreshold ¶
func WithDemoteThreshold(t float64) ProberOption
func WithInitialProbeInterval ¶
func WithInitialProbeInterval(d time.Duration) ProberOption
func WithProbeConcurrency ¶
func WithProbeConcurrency(n int) ProberOption
func WithProbeTimeout ¶
func WithProbeTimeout(d time.Duration) ProberOption
func WithScanInterval ¶
func WithScanInterval(d time.Duration) ProberOption
type QualityResult ¶
type QualityResult struct {
Score float64 // 0.0 (unusable) – 1.0 (perfect)
Confidence Confidence // how certain the checker is
Signals map[string]float64 // named signal contributions for debugging
Recommended FetchMethod // which fetch strategy to use next time
Reason string // human-readable summary
Tier Tier // which checker produced this
}
QualityResult is the unified output of every quality checker.
func (QualityResult) NeedsUpgrade ¶
func (q QualityResult) NeedsUpgrade() bool
NeedsUpgrade returns true when the result is uncertain enough to warrant a background upgrade to a higher-tier checker.
type ScoreStore ¶
type ScoreStore struct {
// contains filtered or unexported fields
}
ScoreStore is a thread-safe, per-host store of quality observations.
func NewScoreStore ¶
func NewScoreStore(opts ...StoreOption) *ScoreStore
func (*ScoreStore) Delete ¶
func (s *ScoreStore) Delete(host string)
Delete removes all observations for a host (manual cache invalidation).
func (*ScoreStore) Evict ¶
func (s *ScoreStore) Evict()
Evict removes observations older than windowTTL from all entries. Call periodically (e.g. in a housekeeping goroutine) to bound memory use.
func (*ScoreStore) Get ¶
func (s *ScoreStore) Get(host string) (QualityResult, bool)
Get returns the merged (time-decayed) result for a host. Returns (zero, false) if no observations exist or all have expired.
func (*ScoreStore) Len ¶
func (s *ScoreStore) Len() int
Len returns the number of hosts currently tracked.
func (*ScoreStore) ProbeCandidates ¶
func (s *ScoreStore) ProbeCandidates() []ProbeCandidate
ProbeCandidates returns hosts that are on Browser and whose next probe time has elapsed.
func (*ScoreStore) RecordProbeResult ¶
func (s *ScoreStore) RecordProbeResult(host string, success bool, baseInterval time.Duration)
RecordProbeResult updates probe scheduling after a de-escalation attempt.
- success=true → clear observations so new HTTP results dominate quickly.
- success=false → exponential backoff, capped at 24 h.
func (*ScoreStore) ScheduleProbe ¶
func (s *ScoreStore) ScheduleProbe(host string, interval time.Duration)
ScheduleProbe sets the initial nextProbeAt for a host that just got promoted to Browser. A no-op if a probe is already scheduled.
func (*ScoreStore) SetLastURL ¶
func (s *ScoreStore) SetLastURL(host, rawURL string)
SetLastURL records the most recently fetched URL for a host so the Prober knows what URL to re-probe.
func (*ScoreStore) Update ¶
func (s *ScoreStore) Update(host string, result QualityResult)
Update adds a new observation for the host. Every observation contributes to the weighted average regardless of tier, so even basic heuristic readings add signal to the window.
type Store ¶
type Store interface {
Delete(host string)
Evict()
Get(host string) (QualityResult, bool)
Len() int
ProbeCandidates() []ProbeCandidate
RecordProbeResult(host string, success bool, baseInterval time.Duration)
ScheduleProbe(host string, interval time.Duration)
SetLastURL(host string, rawURL string)
Update(host string, result QualityResult)
}
type StoreOption ¶
type StoreOption func(*ScoreStore)
StoreOption configures the ScoreStore.
func WithHalfLife ¶
func WithHalfLife(d time.Duration) StoreOption
WithHalfLife sets the exponential decay half-life. A shorter half-life makes the store react faster to site changes. Default: 4 h.
func WithMaxWindow ¶
func WithMaxWindow(n int) StoreOption
WithMaxWindow sets the maximum observations kept per host. Default: 20.
func WithWindowTTL ¶
func WithWindowTTL(d time.Duration) StoreOption
WithWindowTTL sets how long an individual observation is retained. Default: 24 h.