web

package
v1.0.1 Latest
Warning

This package is not in the latest version of its module.

Published: Jun 20, 2026 | License: MIT | Imports: 25 | Imported by: 0

Documentation

Overview

Package web is the Phase 7 shared engine behind the web_search and web_fetch tools. It owns the full request path so the two thin tool adapters in internal/agent/tools stay free of cross-cutting concerns:

  • searxng client — query build, JSON parse, domain post-filter (D-07..D-14)
  • ssrf classifier — net/netip blocklist, fail-closed public-web-only (D-24..D-28)
  • dns pin cache — per-conversation TTL pin, anti-rebinding (D-25)
  • pinned transport — dial-only-the-pinned-IP + manual redirect revalidate (D-29)
  • html extraction — readability → markdown + link dedup (D-17..D-22)
  • cache — in-memory TTL, disk opt-in (D-31..D-33)
  • errors — the D-38 stable enum + sanitized, non-leaky shapes

Security posture is fail-closed: local/internal targets stay blocked, no allowlist escape hatch lands in Phase 7 (D-30), and model-visible errors carry no IP / host / header / body / redirect-chain detail (D-27).
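To make the fail-closed classifier idea concrete, here is a minimal sketch built on net/netip. The helper name isPublicAddr and the extra prefix list are assumptions for illustration only; the package's real blocklist is not shown on this page and is likely more complete.

```go
package main

import (
	"fmt"
	"net/netip"
)

// blockedPrefixes holds a few special-purpose ranges that the netip predicate
// methods do not cover. Illustrative only, not the package's actual list.
var blockedPrefixes = []netip.Prefix{
	netip.MustParsePrefix("100.64.0.0/10"), // carrier-grade NAT
	netip.MustParsePrefix("192.0.0.0/24"),  // IETF protocol assignments
	netip.MustParsePrefix("198.18.0.0/15"), // benchmarking
	netip.MustParsePrefix("240.0.0.0/4"),   // reserved
}

// isPublicAddr (hypothetical) reports whether s is a public unicast IP.
// Unparseable input and every special-purpose class are rejected, so the
// default answer is "blocked".
func isPublicAddr(s string) bool {
	addr, err := netip.ParseAddr(s)
	if err != nil {
		return false
	}
	addr = addr.Unmap() // treat ::ffff:10.0.0.1 like 10.0.0.1
	// IsGlobalUnicast is false for loopback, link-local, multicast, and
	// unspecified addresses; IsPrivate covers RFC 1918 and IPv6 ULA.
	if !addr.IsGlobalUnicast() || addr.IsPrivate() {
		return false
	}
	for _, p := range blockedPrefixes {
		if p.Contains(addr) {
			return false
		}
	}
	return true
}

func main() {
	for _, s := range []string{
		"93.184.216.34", "169.254.169.254", "10.0.0.5",
		"::1", "::ffff:127.0.0.1", "not-an-ip",
	} {
		fmt.Printf("%-18s public=%v\n", s, isPublicAddr(s))
	}
}
```

Only the first address in the loop is classified as public; the metadata address, the private range, both loopback forms, and the unparseable string all fall through to the blocked default.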

The two extraction dependencies (D-20 / roadmap Amendment #3) are consumed by html.go: codeberg.org/readeck/go-readability/v2 (FromReader → Article) feeds its node tree straight into github.com/JohannesKaufmann/html-to-markdown/v2 (ConvertNode), with golang.org/x/net/html as the shared *html.Node bridge type.

Constants

const (
	CodeSearchUnavailable  = "web_search_unavailable"
	CodeBlockedURL         = "blocked_url"
	CodeUnsupportedScheme  = "unsupported_scheme"
	CodeUnsupportedContent = "unsupported_content_type"
	CodeResponseTooLarge   = "response_too_large"
	CodeTimeout            = "timeout"
	CodeHTTPError          = "http_error"
	CodeExtractionFailed   = "extraction_failed"
)

The D-38 stable error enum. These strings are the model-visible `error` field values; they are a contract — never rename without a PRD amendment.

const (
	ReasonSearxngNotConfigured = "searxng_not_configured"
	ReasonSearxngUnreachable   = "searxng_unreachable"
	ReasonPrivateOrMetadata    = "private_or_metadata_target"
	ReasonRedirectToBlocked    = "redirect_to_blocked_target"
	ReasonInvalidTarget        = "invalid_target"
)

Stable, non-sensitive reason strings. Reasons name a CLASS of block, never a concrete IP/host/CIDR (D-27): "private_or_metadata_target" not "169.254.169.254".

Variables

This section is empty.

Functions

func ExtractMarkdown

func ExtractMarkdown(body []byte, pageURL *url.URL) (title, markdown string, links []string, warning string, err error)

ExtractMarkdown turns already-fetched, size-gated HTML bytes into clean markdown plus deduped absolute links. It NEVER fetches — readability.FromReader runs on the bytes the hardened transport already pulled, so the SSRF gate is never bypassed (the self-fetching readability entry point is forbidden, T-07-22). pageURL anchors relative link resolution (D-19). title is art.Title() only (no byline/excerpt/site/time, D-18). Readable text shorter than lowContentRunes returns markdown with warning="low_content" and a nil error (D-22).

Types

type Client

type Client struct {
	// contains filtered or unexported fields
}

Client is the Phase 7 shared engine behind web_search and web_fetch. It owns the single SSRF-hardened transport (dial-only-the-pinned-IP), the per-conversation DNS pin cache, the response cache, and the web config. The two thin tool adapters (Wave 4) hold one *Client and call Search / Fetch — no tool re-implements the security boundary (D-01).

func NewClient

func NewClient(cfg *config.Config) *Client

NewClient wires the hardened transport (real net.Dialer with the Control recheck), the DNS pin, and the response cache from config. The pin TTL and the User-Agent come from config; the transport owns the egress identity so every search and fetch request carries the same Aura UA (D-34/D-35).

func (*Client) Fetch

func (c *Client) Fetch(ctx context.Context, convID, rawURL string) (Page, error)

Fetch runs the full security state machine: parse + scheme allowlist (D-15) → per-hop SSRF validate+pin via the hardened transport + manual redirect revalidation (D-29) → Content-Type allowlist (D-16) → size cap (D-16) → readability→markdown (D-17/D-19/D-22). It runs under the configured fetch deadline with at most one retry on a transient failure or a 408/429/5xx response, and no retry on an SSRF block, any other 4xx response, or a config error (D-23/D-42). A per-host concurrency cap bounds in-flight fetches to one origin (D-36).

func (*Client) FetchImage added in v1.0.0

func (c *Client) FetchImage(ctx context.Context, convID, rawURL string) ([]byte, string, error)

FetchImage fetches an external image (a web_result thumbnail/favicon) through the SAME SSRF defense web_fetch uses — NOT a fresh http.Get, which would re-open the hole D-09 forbids. It runs: parse + scheme allowlist (D-15) → the hardened transport's dialContext (validateAndPin: hostname blocklist → DNS resolve → classify-every → pin → dial only the pinned IP, transport.go) → reject any 3xx (CheckRedirect never auto-follows, so a redirect is surfaced and rejected here, never silently chased to a private target) → image content-type allowlist (svg excluded) → size cap via io.LimitReader. It returns the raw bytes + the matched media type. Every failure is a sanitized *WebError carrying NO IP/host/redirect detail (D-26/27/28), mirroring Fetch.

func (*Client) Search

func (c *Client) Search(ctx context.Context, params SearchParams) ([]Result, error)

Search builds a SearXNG /search JSON query, parses the response, and post-filters by the requested domain set. It runs under the configured search deadline with at most one retry on a transient/408/429/5xx failure (D-14/D-42). Missing SEARXNG_URL → web_search_unavailable{searxng_not_configured}; an unreachable or non-2xx backend → web_search_unavailable{searxng_unreachable}. The backend URL and response body are never leaked.
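A minimal sketch of the query build and the domain post-filter follows. The parameter names q, format, categories, language, and time_range are SearXNG's search API parameters; the helpers buildSearchURL and hostAllowed are hypothetical, and matching subdomains of a requested domain is an assumption, since the page does not say how strict the filter is.

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// buildSearchURL (hypothetical) assembles a SearXNG /search request URL that
// asks for JSON output. Empty optional fields are left out of the query.
func buildSearchURL(base, query, category, language, timeRange string) (string, error) {
	u, err := url.Parse(base)
	if err != nil {
		return "", err
	}
	u.Path = strings.TrimRight(u.Path, "/") + "/search"
	q := url.Values{}
	q.Set("q", query)
	q.Set("format", "json")
	if category != "" {
		q.Set("categories", category)
	}
	if language != "" {
		q.Set("language", language)
	}
	if timeRange != "" {
		q.Set("time_range", timeRange)
	}
	u.RawQuery = q.Encode() // Encode sorts keys, so the output is stable
	return u.String(), nil
}

// hostAllowed (hypothetical) reports whether rawURL's host equals one of the
// requested domains or is a subdomain of one. An empty set allows everything.
func hostAllowed(rawURL string, domains []string) bool {
	if len(domains) == 0 {
		return true
	}
	u, err := url.Parse(rawURL)
	if err != nil {
		return false
	}
	host := strings.ToLower(u.Hostname())
	for _, d := range domains {
		d = strings.ToLower(d)
		if host == d || strings.HasSuffix(host, "."+d) {
			return true
		}
	}
	return false
}

func main() {
	s, err := buildSearchURL("http://searx.example:8080", "go netip", "news", "", "week")
	if err != nil {
		panic(err)
	}
	fmt.Println(s)
	fmt.Println(hostAllowed("https://pkg.go.dev/net/netip", []string{"go.dev"}))
	fmt.Println(hostAllowed("https://notgo.dev/", []string{"go.dev"}))
}
```

Filtering after the response is parsed keeps the behavior independent of whatever site-restriction syntax the upstream engines support.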

type Page

type Page struct {
	Title     string   `json:"title"`
	URL       string   `json:"url"`
	ContentMD string   `json:"content_md"`
	Links     []string `json:"links"`
	Warning   string   `json:"warning,omitempty"`
}

Page is the model-visible web_fetch result (D-17/D-19/D-22). Warning is non-empty only for a soft downgrade (low_content) — it is never an error channel. Links are deduped normalized absolute URL strings (D-19), never {text,url} objects.

type Result

type Result struct {
	Title    string          `json:"title"`
	URL      string          `json:"url"`
	Snippet  string          `json:"snippet"`
	Metadata *ResultMetadata `json:"metadata,omitempty"`
}

Result is the model-visible search hit. Metadata is nil unless IncludeMetadata was set, so the default wire shape stays {title,url,snippet} (D-07/D-08).

type ResultMetadata

type ResultMetadata struct {
	Engine      string  `json:"engine,omitempty"`
	Score       float64 `json:"score,omitempty"`
	Category    string  `json:"category,omitempty"`
	PublishedAt *string `json:"published_at,omitempty"`
	Thumbnail   string  `json:"thumbnail,omitempty"`
}

ResultMetadata is the NORMALIZED second tier (D-08/D-10). PublishedAt is a pointer so a null/absent publishedDate omits the key rather than emitting "".
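The effect of the pointer field can be seen in a short round trip through encoding/json. The types upstreamHit and metadata and the helper normalize are hypothetical stand-ins; only the publishedDate and published_at key names come from the text above.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// upstreamHit models just the two upstream fields this sketch needs.
type upstreamHit struct {
	Engine        string  `json:"engine"`
	PublishedDate *string `json:"publishedDate"`
}

// metadata is a cut-down stand-in for ResultMetadata.
type metadata struct {
	Engine      string  `json:"engine,omitempty"`
	PublishedAt *string `json:"published_at,omitempty"`
}

// normalize (hypothetical) decodes one upstream hit and re-encodes it in the
// normalized shape. A null or absent publishedDate decodes to a nil pointer,
// and omitempty then drops the key entirely.
func normalize(raw []byte) (string, error) {
	var hit upstreamHit
	if err := json.Unmarshal(raw, &hit); err != nil {
		return "", err
	}
	out, err := json.Marshal(metadata{Engine: hit.Engine, PublishedAt: hit.PublishedDate})
	return string(out), err
}

func main() {
	for _, raw := range []string{
		`{"engine":"bing","publishedDate":null}`,
		`{"engine":"bing"}`,
		`{"engine":"bing","publishedDate":"2026-06-01T00:00:00"}`,
	} {
		s, err := normalize([]byte(raw))
		if err != nil {
			panic(err)
		}
		fmt.Println(s)
	}
}
```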

type SearchParams

type SearchParams struct {
	Query           string
	MaxResults      int
	Category        string
	Language        string
	TimeRange       string
	Domains         []string
	IncludeMetadata bool
}

SearchParams is the validated, model-supplied search request (D-09). Domains are hostnames only (D-12); Category is an enum (general|news, D-11); TimeRange is day|week|month|year. IncludeMetadata toggles the second result tier (D-08).

type WebError

type WebError struct {
	Code       string
	Reason     string
	Message    string
	StatusCode int
}

WebError is the ONLY error shape the model ever sees. It is deliberately flat and carries no IP, CIDR, internal hostname, response header, body snippet, or redirect-chain detail (D-26/27/28). It serializes to {error, reason, message, status_code?} (D-39/D-40).

func AsWebError

func AsWebError(err error) (*WebError, bool)

AsWebError unwraps err to a *WebError when one is present in the chain. An *internalError is sanitized on the fly so callers and tests can assert the model-visible code/reason without ever touching the sensitive fields. The bool is false when no web error is in the chain.

func (*WebError) Error

func (e *WebError) Error() string

Error renders the headline without any sensitive field (it mirrors only the already-sanitized Code/Reason/StatusCode). Safe to log.

func (*WebError) JSON

func (e *WebError) JSON() ([]byte, error)

JSON is the adapter-facing serializer. The tool adapter feeds the bytes to tools.NewResult so the model can self-correct. Per D-41, errors stay inline; only large successful content spills.

func (*WebError) MarshalJSON

func (e *WebError) MarshalJSON() ([]byte, error)

MarshalJSON emits the D-39/D-40 wire shape: `error` (not `code`), optional `reason`/`message`, and `status_code` only when set. Hand-rolled rather than struct tags so an unset reason/status omits its key entirely (no leaky zero values, no `"status_code":0` on an SSRF block).
