Documentation
Overview
Package web is the Phase 7 shared engine behind the web_search and web_fetch tools. It owns the full request path so the two thin tool adapters in internal/agent/tools stay free of cross-cutting concerns:
- searxng client — query build, JSON parse, domain post-filter (D-07..D-14)
- ssrf classifier — net/netip blocklist, fail-closed public-web-only (D-24..D-28)
- dns pin cache — per-conversation TTL pin, anti-rebinding (D-25)
- pinned transport — dial-only-the-pinned-IP + manual redirect revalidate (D-29)
- html extraction — readability → markdown + link dedup (D-17..D-22)
- cache — in-memory TTL, disk opt-in (D-31..D-33)
- errors — the D-38 stable enum + sanitized, non-leaky shapes
Security posture is fail-closed: local/internal targets stay blocked, no allowlist escape hatch lands in Phase 7 (D-30), and model-visible errors carry no IP / host / header / body / redirect-chain detail (D-27).
The two extraction dependencies (D-20 / roadmap Amendment #3) are consumed by html.go: codeberg.org/readeck/go-readability/v2 (FromReader → Article) feeds its node tree straight into github.com/JohannesKaufmann/html-to-markdown/v2 (ConvertNode), with golang.org/x/net/html as the shared *html.Node bridge type.
Constants
const (
	CodeBlockedURL         = "blocked_url"
	CodeUnsupportedScheme  = "unsupported_scheme"
	CodeUnsupportedContent = "unsupported_content_type"
	CodeResponseTooLarge   = "response_too_large"
	CodeTimeout            = "timeout"
	CodeHTTPError          = "http_error"
	CodeExtractionFailed   = "extraction_failed"
)
The D-38 stable error enum. These strings are the model-visible `error` field values; they are a contract — never rename without a PRD amendment.
const (
	ReasonSearxngNotConfigured = "searxng_not_configured"
	ReasonSearxngUnreachable   = "searxng_unreachable"
	ReasonPrivateOrMetadata    = "private_or_metadata_target"
	ReasonRedirectToBlocked    = "redirect_to_blocked_target"
	ReasonInvalidTarget        = "invalid_target"
)
Stable, non-sensitive reason strings. Reasons name a CLASS of block, never a concrete IP/host/CIDR (D-27): "private_or_metadata_target" not "169.254.169.254".
Variables
This section is empty.
Functions
func ExtractMarkdown
func ExtractMarkdown(body []byte, pageURL *url.URL) (title, markdown string, links []string, warning string, err error)
ExtractMarkdown turns already-fetched, size-gated HTML bytes into clean markdown plus deduped absolute links. It NEVER fetches — readability.FromReader runs on the bytes the hardened transport already pulled, so the SSRF gate is never bypassed (the self-fetching readability entry point is forbidden, T-07-22). pageURL anchors relative link resolution (D-19). title is art.Title() only (no byline/excerpt/site/time, D-18). Readable text shorter than lowContentRunes returns markdown with warning="low_content" and a nil error (D-22).
Types
type Client
type Client struct {
	// contains filtered or unexported fields
}
Client is the Phase 7 shared engine behind web_search and web_fetch. It owns the single SSRF-hardened transport (dial-only-the-pinned-IP), the per-conversation DNS pin cache, the response cache, and the web config. The two thin tool adapters (Wave 4) hold one *Client and call Search / Fetch — no tool re-implements the security boundary (D-01).
func NewClient
NewClient wires the hardened transport (real net.Dialer with the Control recheck), the DNS pin, and the response cache from config. The pin TTL and the User-Agent come from config; the transport owns the egress identity so every search and fetch request carries the same Aura UA (D-34/D-35).
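The "Control recheck" mentioned above refers to the net.Dialer.Control hook, which runs after the socket is created and before it connects, with the literal IP:port about to be dialed. The following is a hedged sketch of that wiring. The names controlRecheck and errBlocked are hypothetical, and the address checks are a reduced stand-in for the package's classifier.

```go
package main

import (
	"errors"
	"fmt"
	"net"
	"net/netip"
	"syscall"
	"time"
)

var errBlocked = errors.New("blocked_url")

// controlRecheck matches the net.Dialer.Control signature. address is the
// literal IP:port the dialer is about to connect to, so this is a last check
// that holds even if earlier validation was skipped or raced.
func controlRecheck(network, address string, _ syscall.RawConn) error {
	ap, err := netip.ParseAddrPort(address)
	if err != nil {
		return errBlocked // fail closed on anything that is not IP:port
	}
	ip := ap.Addr().Unmap()
	if ip.IsLoopback() || ip.IsPrivate() || ip.IsLinkLocalUnicast() ||
		ip.IsUnspecified() || ip.IsMulticast() {
		return errBlocked
	}
	return nil
}

func main() {
	d := &net.Dialer{Timeout: 5 * time.Second, Control: controlRecheck}
	// The hook rejects the loopback target before any connect is attempted.
	_, err := d.Dial("tcp", "127.0.0.1:80")
	fmt.Println(err)
}
```

Because the hook sees the resolved address rather than the hostname, it complements the DNS pin instead of replacing it.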
func (*Client) Fetch
Fetch runs the full security state machine: parse + scheme allowlist (D-15) → per-hop SSRF validate+pin via the hardened transport + manual redirect revalidation (D-29) → Content-Type allowlist (D-16) → size cap (D-16) → readability→markdown (D-17/D-19/D-22). It runs under the configured fetch deadline, retrying at most once on a transient failure or a 408/429/5xx response and never on an SSRF block, any other 4xx, or a config error (D-23/D-42). A per-host concurrency cap bounds the number of in-flight fetches to any single origin (D-36).
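The retry rule for Fetch can be sketched as a status classifier plus a loop that allows one extra attempt. This is an illustration under stated assumptions: retryable and withOneRetry are hypothetical names, and for brevity the sketch treats every returned error as terminal (standing in for SSRF and config errors), leaving out the transient network errors that would also qualify for a retry.

```go
package main

import (
	"fmt"
	"net/http"
)

// retryable reports whether an HTTP status is worth one more attempt:
// 408, 429, and the 5xx range. Every other status is final.
func retryable(status int) bool {
	switch {
	case status == http.StatusRequestTimeout, status == http.StatusTooManyRequests:
		return true
	case status >= 500 && status <= 599:
		return true
	}
	return false
}

// withOneRetry calls attempt at most twice. It stops early on any error
// (treated as non-retryable in this sketch) or on a non-retryable status.
func withOneRetry(attempt func() (int, error)) (status, calls int, err error) {
	for calls < 2 {
		calls++
		status, err = attempt()
		if err != nil || !retryable(status) {
			break
		}
	}
	return status, calls, err
}

func main() {
	n := 0
	status, calls, err := withOneRetry(func() (int, error) {
		n++
		if n == 1 {
			return http.StatusServiceUnavailable, nil
		}
		return http.StatusOK, nil
	})
	fmt.Println(status, calls, err)
}
```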
func (*Client) FetchImage added in v1.0.0
FetchImage fetches an external image (a web_result thumbnail/favicon) through the SAME SSRF defense web_fetch uses — NOT a fresh http.Get, which would re-open the hole D-09 forbids. It runs: parse + scheme allowlist (D-15) → the hardened transport's dialContext (validateAndPin: hostname blocklist → DNS resolve → classify-every → pin → dial only the pinned IP, transport.go) → reject any 3xx (CheckRedirect never auto-follows, so a redirect is surfaced and rejected here, never silently chased to a private target) → image content-type allowlist (svg excluded) → size cap via io.LimitReader. It returns the raw bytes + the matched media type. Every failure is a sanitized *WebError carrying NO IP/host/redirect detail (D-26/27/28), mirroring Fetch.
func (*Client) Search
Search builds a SearXNG /search JSON query, parses the response, and post-filters by the requested domain set. It runs under the configured search deadline with at most one retry on a transient/408/429/5xx failure (D-14/D-42). Missing SEARXNG_URL → web_search_unavailable{searxng_not_configured}; an unreachable or non-2xx backend → searxng_unreachable. The backend URL/body is never leaked.
type Page
type Page struct {
	Title     string   `json:"title"`
	URL       string   `json:"url"`
	ContentMD string   `json:"content_md"`
	Links     []string `json:"links"`
	Warning   string   `json:"warning,omitempty"`
}
Page is the model-visible web_fetch result (D-17/D-19/D-22). Warning is non-empty only for a soft downgrade (low_content) — it is never an error channel. Links are deduped normalized absolute URL strings (D-19), never {text,url} objects.
type Result
type Result struct {
	Title    string          `json:"title"`
	URL      string          `json:"url"`
	Snippet  string          `json:"snippet"`
	Metadata *ResultMetadata `json:"metadata,omitempty"`
}
Result is the model-visible search hit. Metadata is nil unless IncludeMetadata was set, so the default wire shape stays {title,url,snippet} (D-07/D-08).
type ResultMetadata
type ResultMetadata struct {
	Engine      string  `json:"engine,omitempty"`
	Score       float64 `json:"score,omitempty"`
	Category    string  `json:"category,omitempty"`
	PublishedAt *string `json:"published_at,omitempty"`
	Thumbnail   string  `json:"thumbnail,omitempty"`
}
ResultMetadata is the NORMALIZED second tier (D-08/D-10). PublishedAt is a pointer so a null/absent publishedDate omits the key rather than emitting "".
type SearchParams
type SearchParams struct {
	Query           string
	MaxResults      int
	Category        string
	Language        string
	TimeRange       string
	Domains         []string
	IncludeMetadata bool
}
SearchParams is the validated, model-supplied search request (D-09). Domains are hostnames only (D-12); Category is an enum (general|news, D-11); TimeRange is day|week|month|year. IncludeMetadata toggles the second result tier (D-08).
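Since Domains holds hostnames only, the domain post-filter can be sketched as a hostname comparison on each result URL. The helpers hostMatches and filterByDomain are hypothetical, and matching subdomains of a listed domain is an assumption of this sketch; the package may require an exact host match instead.

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// hostMatches reports whether host equals one of domains or is a subdomain
// of one. Subdomain matching is an assumption of this sketch.
func hostMatches(host string, domains []string) bool {
	host = strings.ToLower(strings.TrimSuffix(host, "."))
	for _, d := range domains {
		d = strings.ToLower(d)
		if host == d || strings.HasSuffix(host, "."+d) {
			return true
		}
	}
	return false
}

// filterByDomain keeps the URLs whose hostname matches the domain set.
// An empty domain set means "no filter".
func filterByDomain(urls, domains []string) []string {
	if len(domains) == 0 {
		return urls
	}
	kept := []string{}
	for _, raw := range urls {
		u, err := url.Parse(raw)
		if err != nil {
			continue
		}
		if hostMatches(u.Hostname(), domains) {
			kept = append(kept, raw)
		}
	}
	return kept
}

func main() {
	urls := []string{
		"https://go.dev/doc", "https://pkg.go.dev/net",
		"https://notgo.dev/", "https://example.com/",
	}
	fmt.Println(filterByDomain(urls, []string{"go.dev"}))
}
```

The leading dot in the suffix check is what keeps notgo.dev from matching go.dev.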
type WebError
WebError is the ONLY error shape the model ever sees. It is deliberately flat and carries no IP, CIDR, internal hostname, response header, body snippet, or redirect-chain detail (D-26/27/28). It serializes to {error, reason, message, status_code?} (D-39/D-40).
func AsWebError
AsWebError unwraps err to a *WebError when one is present in the chain. An *internalError is sanitized on the fly so callers and tests can assert the model-visible code/reason without ever touching the sensitive fields. The bool is false when no web error is in the chain.
func (*WebError) Error
Error renders the headline without any sensitive field (it mirrors only the already-sanitized Code/Reason/StatusCode). Safe to log.
func (*WebError) JSON
JSON is the adapter-facing serializer; the tool adapter feeds the bytes to tools.NewResult so the model self-corrects (D-41 — errors stay inline, only successful large content spills).
func (*WebError) MarshalJSON
MarshalJSON emits the D-39/D-40 wire shape: `error` (not `code`), optional `reason`/`message`, and `status_code` only when set. Hand-rolled rather than struct tags so an unset reason/status omits its key entirely (no leaky zero values, no `"status_code":0` on an SSRF block).