Documentation ¶
Overview ¶
Package urlx is the URL ⇄ filesystem contract at the heart of kage.
Every reference kage meets — a page link, a stylesheet, an image, a font — is funnelled through Normalize so that two different-looking URLs that point at the same resource collapse to one canonical key. LocalPath then maps that canonical URL to a deterministic path on disk, and Rel turns two such paths into the relative link that goes back into the rewritten HTML or CSS.
The package is pure: no network, no filesystem, no clock. That is what makes the rest of kage easy to reason about — a page worker can rewrite a link to an asset long before the asset has been downloaded, because both sides agree on where the bytes will live.
Index ¶
- Constants
- func Dir(file string) string
- func Ext(u *url.URL) string
- func InScope(seed, u *url.URL, cfg ScopeConfig) bool
- func Key(u *url.URL) string
- func LikelyPage(u *url.URL) bool
- func LocalPath(seedHost string, u *url.URL, kind Kind, reserved string) string
- func Normalize(base *url.URL, ref string) (*url.URL, error)
- func ParseSeed(arg string) (*url.URL, error)
- func Rel(fromDir, toFile string) string
- func SameRegistrableDomain(seed, u *url.URL) bool
- func SameSite(seed, u *url.URL, allowSub bool) bool
- type Kind
- type ScopeConfig
Constants ¶
const DefaultReserved = "_kage"
DefaultReserved is the directory under the mirror root where every asset and kage's own state live. It is deliberately unlikely to collide with a real URL path segment.
Functions ¶
func Ext ¶ added in v0.3.0
func Ext(u *url.URL) string
Ext returns the lowercased file extension of a URL's last path segment, including the leading dot (".pdf", ".mp4"), or "" when there is none. It ignores the query string, so "/a/clip.mp4?v=2" reports ".mp4".
func InScope ¶
func InScope(seed, u *url.URL, cfg ScopeConfig) bool
InScope reports whether u should be crawled as a page given the seed and cfg.
func LikelyPage ¶
func LikelyPage(u *url.URL) bool
LikelyPage reports whether an <a href> target should be rendered as a page rather than downloaded as a file. Links ending in a known binary/document extension are treated as assets.
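The page-versus-asset decision can be sketched as an extension lookup. The extension set below is illustrative only, since this page does not list the extensions urlx actually recognises, and likelyPage is a hypothetical name.

```go
package main

import (
	"fmt"
	"net/url"
	"path"
	"strings"
)

// assetExts is an illustrative set; the real list is not documented here.
var assetExts = map[string]bool{
	".pdf": true, ".zip": true, ".mp4": true, ".docx": true,
}

// likelyPage is a hypothetical stand-in for LikelyPage: a link is treated
// as a page unless its path ends in one of the extensions above.
func likelyPage(u *url.URL) bool {
	return !assetExts[strings.ToLower(path.Ext(u.Path))]
}

// mustParse keeps the example short; it panics on malformed input.
func mustParse(raw string) *url.URL {
	u, err := url.Parse(raw)
	if err != nil {
		panic(err)
	}
	return u
}

func main() {
	for _, raw := range []string{
		"https://example.com/docs/intro",
		"https://example.com/files/report.pdf",
	} {
		fmt.Printf("%s page=%v\n", raw, likelyPage(mustParse(raw)))
	}
}
```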
func LocalPath ¶
func LocalPath(seedHost string, u *url.URL, kind Kind, reserved string) string
LocalPath maps a canonical URL to a slash-separated path relative to the mirror root (out/<seedHost>/). Pages mirror the URL path as a directory index; assets live under reserved/<host>/<path>. See spec §4.3.
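A rough sketch of the page/asset split described above, under several assumptions: the directory index file is called index.html, query strings are ignored, and the seedHost parameter is left out. The constant names kindPage and kindAsset are hypothetical, and the rules in spec §4.3 are not reproduced here.

```go
package main

import (
	"fmt"
	"net/url"
	"path"
	"strings"
)

// kind mirrors the documented Kind type; the constant names are hypothetical.
type kind int

const (
	kindPage kind = iota
	kindAsset
)

// localPath is a simplified, hypothetical stand-in for LocalPath.
func localPath(u *url.URL, k kind, reserved string) string {
	if k == kindAsset {
		// Assets: reserved/<host>/<path>.
		return path.Join(reserved, u.Hostname(), u.Path)
	}
	// Pages: the URL path becomes a directory holding an index file
	// (the name "index.html" is an assumption).
	return path.Join(strings.Trim(u.Path, "/"), "index.html")
}

// mustParse keeps the example short; it panics on malformed input.
func mustParse(raw string) *url.URL {
	u, err := url.Parse(raw)
	if err != nil {
		panic(err)
	}
	return u
}

func main() {
	fmt.Println(localPath(mustParse("https://example.com/docs/intro/"), kindPage, "_kage"))
	fmt.Println(localPath(mustParse("https://cdn.example.com/css/site.css"), kindAsset, "_kage"))
}
```

Because the mapping depends only on its arguments, a page worker and an asset worker computing the path for the same URL arrive at the same answer without coordinating.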
func Normalize ¶
func Normalize(base *url.URL, ref string) (*url.URL, error)
Normalize resolves ref against base and canonicalises the result. It returns an error for references kage cannot crawl or download: empty, fragment-only, or a non-http(s) scheme (mailto:, tel:, data:, javascript:, blob:, …).
func ParseSeed ¶
func ParseSeed(arg string) (*url.URL, error)
ParseSeed turns a command-line argument like "example.com", "https://example.com/docs" or "http://ex.com" into a canonical absolute URL. A bare host (no scheme) is assumed to use https.
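A minimal sketch of the bare-host rule, assuming "no scheme" can be detected by the absence of "://". The name parseSeed is hypothetical, and lowercasing the host is an assumed canonicalisation step.

```go
package main

import (
	"errors"
	"fmt"
	"net/url"
	"strings"
)

// parseSeed is a hypothetical stand-in for ParseSeed.
func parseSeed(arg string) (*url.URL, error) {
	arg = strings.TrimSpace(arg)
	if arg == "" {
		return nil, errors.New("empty seed")
	}
	// A bare host such as "example.com" is assumed to use https.
	if !strings.Contains(arg, "://") {
		arg = "https://" + arg
	}
	u, err := url.Parse(arg)
	if err != nil {
		return nil, err
	}
	if (u.Scheme != "http" && u.Scheme != "https") || u.Host == "" {
		return nil, fmt.Errorf("not an http(s) seed: %q", arg)
	}
	u.Host = strings.ToLower(u.Host) // assumed canonicalisation
	return u, nil
}

func main() {
	for _, arg := range []string{"example.com", "https://example.com/docs", "http://ex.com"} {
		u, err := parseSeed(arg)
		if err != nil {
			panic(err)
		}
		fmt.Printf("%-26s -> %s\n", arg, u)
	}
}
```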
func Rel ¶
func Rel(fromDir, toFile string) string
Rel returns the relative link from fromDir, the directory of the linking file, to toFile. Both are slash-separated paths relative to the mirror root. The result always uses '/'.
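One way to compute such a link is to drop the leading segments the two paths share, climb out of what remains of the source directory with "..", and then descend into the target. The sketch below does this on slash paths only; rel and segments are hypothetical names, and the real Rel may differ in edge cases.

```go
package main

import (
	"fmt"
	"path"
	"strings"
)

// segments splits a slash path into cleaned segments; "" and "." yield none.
func segments(p string) []string {
	p = strings.Trim(path.Clean("/"+p), "/")
	if p == "" {
		return nil
	}
	return strings.Split(p, "/")
}

// rel is a hypothetical stand-in for Rel.
func rel(fromDir, toFile string) string {
	from, to := segments(fromDir), segments(toFile)
	// Count shared leading directories. The last segment of "to" is the
	// file name, so it is never treated as a shared directory.
	i := 0
	for i < len(from) && i < len(to)-1 && from[i] == to[i] {
		i++
	}
	var parts []string
	for range from[i:] {
		parts = append(parts, "..") // climb out of the source directory
	}
	parts = append(parts, to[i:]...) // descend to the target
	return strings.Join(parts, "/")
}

func main() {
	fmt.Println(rel("docs/guide", "_kage/example.com/css/site.css"))
	fmt.Println(rel("docs/guide", "docs/api/index.html"))
	fmt.Println(rel("", "docs/index.html"))
}
```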
func SameRegistrableDomain ¶ added in v0.3.0
func SameRegistrableDomain(seed, u *url.URL) bool
SameRegistrableDomain reports whether u shares the seed's registrable domain (its eTLD+1). It is looser than SameSite: developer.apple.com, www.apple.com, and images.apple.com all fold to apple.com and count as same-domain, while a separate brand like cdn-apple.com or an unrelated third party like ec.europa.eu does not. It is how kage decides whether an asset host is "the site's own" without listing every subdomain a CDN might use. When either host has no registrable domain (an IP, an oddball TLD), it falls back to an exact host match so the decision stays conservative.
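The sketch below illustrates the comparison and the conservative fallback using only the standard library. It approximates eTLD+1 as the last two host labels, which is a deliberate simplification and is wrong for multi-label suffixes such as co.uk; a real implementation would more likely consult the Public Suffix List, for example through golang.org/x/net/publicsuffix. All function names here are hypothetical.

```go
package main

import (
	"fmt"
	"net"
	"net/url"
	"strings"
)

// naiveRegistrable approximates eTLD+1 as the last two labels of host.
// It returns "" for IP addresses and single-label hosts.
func naiveRegistrable(host string) string {
	host = strings.ToLower(strings.TrimSuffix(host, "."))
	if net.ParseIP(host) != nil {
		return ""
	}
	labels := strings.Split(host, ".")
	if len(labels) < 2 {
		return ""
	}
	return strings.Join(labels[len(labels)-2:], ".")
}

// sameRegistrableDomain is a hypothetical stand-in for SameRegistrableDomain.
func sameRegistrableDomain(seed, u *url.URL) bool {
	a, b := naiveRegistrable(seed.Hostname()), naiveRegistrable(u.Hostname())
	if a == "" || b == "" {
		// No registrable domain on one side: require an exact host match.
		return strings.EqualFold(seed.Hostname(), u.Hostname())
	}
	return a == b
}

// mustParse keeps the example short; it panics on malformed input.
func mustParse(raw string) *url.URL {
	u, err := url.Parse(raw)
	if err != nil {
		panic(err)
	}
	return u
}

func main() {
	seed := mustParse("https://developer.apple.com/")
	for _, raw := range []string{
		"https://images.apple.com/a.png",
		"https://www.cdn-apple.com/a.png",
		"https://ec.europa.eu/",
	} {
		fmt.Printf("%s same=%v\n", raw, sameRegistrableDomain(seed, mustParse(raw)))
	}
}
```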
Types ¶
type Kind ¶
type Kind int
Kind distinguishes a crawlable page from a downloadable asset; the two map to different places on disk (pages mirror the URL path, assets live under the reserved prefix).
type ScopeConfig ¶
type ScopeConfig struct {
	IncludeSubdomains bool
	ScopePrefix       string   // only crawl paths under this prefix, e.g. "/docs/"
	ExcludePaths      []string // skip any path containing one of these substrings
}
ScopeConfig controls which page URLs are in scope for crawling.
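To illustrate how the three fields might combine, here is a simplified scope check. It is a sketch under assumptions: host matching is done with a plain suffix test rather than SameSite, an empty ScopePrefix is taken to mean "no restriction", and inScope is a hypothetical name. The real InScope may apply further rules.

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// ScopeConfig is copied from the declaration above.
type ScopeConfig struct {
	IncludeSubdomains bool
	ScopePrefix       string
	ExcludePaths      []string
}

// inScope is a simplified, hypothetical stand-in for InScope.
func inScope(seed, u *url.URL, cfg ScopeConfig) bool {
	host := strings.ToLower(u.Hostname())
	seedHost := strings.ToLower(seed.Hostname())
	sameHost := host == seedHost
	if !sameHost && cfg.IncludeSubdomains {
		sameHost = strings.HasSuffix(host, "."+seedHost)
	}
	if !sameHost {
		return false
	}
	if cfg.ScopePrefix != "" && !strings.HasPrefix(u.Path, cfg.ScopePrefix) {
		return false
	}
	for _, sub := range cfg.ExcludePaths {
		if sub != "" && strings.Contains(u.Path, sub) {
			return false
		}
	}
	return true
}

// mustParse keeps the example short; it panics on malformed input.
func mustParse(raw string) *url.URL {
	u, err := url.Parse(raw)
	if err != nil {
		panic(err)
	}
	return u
}

func main() {
	seed := mustParse("https://example.com/docs/")
	cfg := ScopeConfig{ScopePrefix: "/docs/", ExcludePaths: []string{"/archive/"}}
	for _, raw := range []string{
		"https://example.com/docs/intro",
		"https://example.com/blog/post",
		"https://example.com/docs/archive/2019",
	} {
		fmt.Printf("%s inScope=%v\n", raw, inScope(seed, mustParse(raw), cfg))
	}
}
```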