urlx

package
v0.3.3 Latest
Published: Jun 16, 2026 License: MIT Imports: 7 Imported by: 0

Documentation

Overview

Package urlx is the URL ⇄ filesystem contract at the heart of kage.

Every reference kage meets — a page link, a stylesheet, an image, a font — is funnelled through Normalize so that two different-looking URLs that point at the same resource collapse to one canonical key. LocalPath then maps that canonical URL to a deterministic path on disk, and Rel turns two such paths into the relative link that goes back into the rewritten HTML or CSS.

The package is pure: no network, no filesystem, no clock. That is what makes the rest of kage easy to reason about — a page worker can rewrite a link to an asset long before the asset has been downloaded, because both sides agree on where the bytes will live.

Constants

const DefaultReserved = "_kage"

DefaultReserved is the directory under the mirror root where every asset and kage's own state live. It is deliberately unlikely to collide with a real URL path segment.

Variables

This section is empty.

Functions

func Dir

func Dir(file string) string

Dir returns the directory portion of a slash file path.

func Ext added in v0.3.0

func Ext(u *url.URL) string

Ext returns the lowercased file extension of a URL's last path segment, including the leading dot (".pdf", ".mp4"), or "" when there is none. It ignores the query string, so "/a/clip.mp4?v=2" reports ".mp4".
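The behaviour described for Ext can be approximated with the standard library alone. The sketch below is not the package's implementation; extOf is a hypothetical helper that assumes the extension is taken from the URL's Path field, which is why the query string has no effect.

```go
package main

import (
	"fmt"
	"net/url"
	"path"
	"strings"
)

// extOf is a hypothetical stand-in for Ext. It looks only at u.Path,
// so anything after '?' never reaches path.Ext.
func extOf(u *url.URL) string {
	return strings.ToLower(path.Ext(u.Path))
}

// mustParse is a small helper for the examples; it panics on a bad URL.
func mustParse(raw string) *url.URL {
	u, err := url.Parse(raw)
	if err != nil {
		panic(err)
	}
	return u
}

func main() {
	fmt.Println(extOf(mustParse("https://example.com/a/Clip.MP4?v=2"))) // .mp4
	fmt.Println(extOf(mustParse("https://example.com/docs/")) == "")    // true
}
```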

func InScope

func InScope(seed, u *url.URL, cfg ScopeConfig) bool

InScope reports whether u should be crawled as a page given the seed and cfg.

func Key

func Key(u *url.URL) string

Key is the canonical string form used to dedup pages and assets.
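To illustrate how a canonical key supports deduplication, here is a minimal sketch of a visited set. keyOf is a hypothetical stand-in for Key; it assumes the canonical form is the URL string with a lowercased host and no fragment, which the documentation above does not spell out.

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// keyOf is a hypothetical stand-in for Key (assumed canonical form:
// lowercased host, fragment dropped, everything else unchanged).
func keyOf(u *url.URL) string {
	c := *u // copy so the caller's URL is not modified
	c.Host = strings.ToLower(c.Host)
	c.Fragment = ""
	c.RawFragment = ""
	return c.String()
}

// mustParse is a small helper for the examples; it panics on a bad URL.
func mustParse(raw string) *url.URL {
	u, err := url.Parse(raw)
	if err != nil {
		panic(err)
	}
	return u
}

func main() {
	seen := make(map[string]struct{})
	for _, raw := range []string{
		"https://Example.com/docs/",
		"https://example.com/docs/#install",
		"https://example.com/blog/",
	} {
		k := keyOf(mustParse(raw))
		if _, dup := seen[k]; dup {
			fmt.Println("skip duplicate:", raw)
			continue
		}
		seen[k] = struct{}{}
		fmt.Println("enqueue:", k)
	}
}
```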

func LikelyPage

func LikelyPage(u *url.URL) bool

LikelyPage reports whether an <a href> target should be rendered as a page rather than downloaded as a file. Links ending in a known binary/document extension are treated as assets.
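The extension test can be sketched as a lookup in a set of known asset extensions. The list below is illustrative only, since the extensions kage actually recognises are not listed here, and likelyPage is a hypothetical name.

```go
package main

import (
	"fmt"
	"net/url"
	"path"
	"strings"
)

// assetExts is an illustrative sample, not kage's real list.
var assetExts = map[string]bool{
	".pdf": true, ".zip": true, ".mp4": true, ".png": true,
}

// likelyPage is a hypothetical stand-in for LikelyPage: a link is treated
// as a page unless its last path segment ends in a known asset extension.
func likelyPage(u *url.URL) bool {
	return !assetExts[strings.ToLower(path.Ext(u.Path))]
}

// mustParse is a small helper for the examples; it panics on a bad URL.
func mustParse(raw string) *url.URL {
	u, err := url.Parse(raw)
	if err != nil {
		panic(err)
	}
	return u
}

func main() {
	fmt.Println(likelyPage(mustParse("https://example.com/docs/intro")))       // true
	fmt.Println(likelyPage(mustParse("https://example.com/files/Report.PDF"))) // false
}
```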

func LocalPath

func LocalPath(seedHost string, u *url.URL, kind Kind, reserved string) string

LocalPath maps a canonical URL to a slash-separated path relative to the mirror root (out/<seedHost>/). Pages mirror the URL path as a directory index; assets live under reserved/<host>/<path>. See spec §4.3.
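The two layouts can be sketched as follows. This is a simplified illustration, not the package's code: it assumes the directory index file is named index.html, ignores query strings, and omits the seedHost parameter. The names localPath, kind, page, and asset are hypothetical.

```go
package main

import (
	"fmt"
	"net/url"
	"path"
	"strings"
)

// kind mirrors the Page/Asset distinction for this sketch only.
type kind int

const (
	page kind = iota
	asset
)

// localPath is a hypothetical stand-in for LocalPath.
// Pages become <url path>/index.html (the file name is an assumption);
// assets become <reserved>/<host>/<url path>.
func localPath(u *url.URL, k kind, reserved string) string {
	p := strings.Trim(u.Path, "/")
	if k == page {
		return path.Join(p, "index.html")
	}
	return path.Join(reserved, u.Hostname(), p)
}

// mustParse is a small helper for the examples; it panics on a bad URL.
func mustParse(raw string) *url.URL {
	u, err := url.Parse(raw)
	if err != nil {
		panic(err)
	}
	return u
}

func main() {
	fmt.Println(localPath(mustParse("https://example.com/docs/intro"), page, "_kage"))
	fmt.Println(localPath(mustParse("https://cdn.example.com/img/logo.png"), asset, "_kage"))
}
```

Under these assumptions the page maps to docs/intro/index.html and the asset to _kage/cdn.example.com/img/logo.png, both relative to the mirror root.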

func Normalize

func Normalize(base *url.URL, ref string) (*url.URL, error)

Normalize resolves ref against base and canonicalises the result. It returns an error for references kage cannot crawl or download: empty, fragment-only, or a non-http(s) scheme (mailto:, tel:, data:, javascript:, blob:, …).
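The rejection rules above can be sketched with net/url. The canonicalisation step shown here (lowercasing the host and dropping the fragment) is an assumption made for illustration; the real Normalize may apply further rules. normalize is a hypothetical name.

```go
package main

import (
	"errors"
	"fmt"
	"net/url"
	"strings"
)

// normalize is a hypothetical stand-in for Normalize.
func normalize(base *url.URL, ref string) (*url.URL, error) {
	ref = strings.TrimSpace(ref)
	if ref == "" {
		return nil, errors.New("empty reference")
	}
	if strings.HasPrefix(ref, "#") {
		return nil, errors.New("fragment-only reference")
	}
	r, err := url.Parse(ref)
	if err != nil {
		return nil, err
	}
	u := base.ResolveReference(r)
	if u.Scheme != "http" && u.Scheme != "https" {
		return nil, fmt.Errorf("unsupported scheme %q", u.Scheme)
	}
	// Assumed canonical form: lowercase host, no fragment.
	u.Host = strings.ToLower(u.Host)
	u.Fragment = ""
	u.RawFragment = ""
	return u, nil
}

// mustParse is a small helper for the examples; it panics on a bad URL.
func mustParse(raw string) *url.URL {
	u, err := url.Parse(raw)
	if err != nil {
		panic(err)
	}
	return u
}

func main() {
	base := mustParse("https://Example.com/a/d/page")
	for _, ref := range []string{"../b/c.css#top", "mailto:x@example.com", "#top", ""} {
		u, err := normalize(base, ref)
		if err != nil {
			fmt.Printf("%q rejected: %v\n", ref, err)
			continue
		}
		fmt.Printf("%q -> %s\n", ref, u)
	}
}
```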

func ParseSeed

func ParseSeed(arg string) (*url.URL, error)

ParseSeed turns a command-line argument like "example.com", "https://example.com/docs" or "http://ex.com" into a canonical absolute URL. A bare host (no scheme) is assumed to be https.
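A minimal sketch of the bare-host rule, assuming that "no scheme" can be detected by the absence of "://" and that canonicalisation only lowercases the host. parseSeed is a hypothetical name, and the real ParseSeed may validate more.

```go
package main

import (
	"errors"
	"fmt"
	"net/url"
	"strings"
)

// parseSeed is a hypothetical stand-in for ParseSeed.
func parseSeed(arg string) (*url.URL, error) {
	arg = strings.TrimSpace(arg)
	if arg == "" {
		return nil, errors.New("empty seed")
	}
	// A bare host such as "example.com" is assumed to be https.
	if !strings.Contains(arg, "://") {
		arg = "https://" + arg
	}
	u, err := url.Parse(arg)
	if err != nil {
		return nil, err
	}
	if u.Scheme != "http" && u.Scheme != "https" {
		return nil, fmt.Errorf("unsupported scheme %q", u.Scheme)
	}
	if u.Host == "" {
		return nil, fmt.Errorf("seed %q has no host", arg)
	}
	u.Host = strings.ToLower(u.Host)
	return u, nil
}

func main() {
	for _, arg := range []string{"example.com", "https://example.com/docs", "http://ex.com"} {
		u, err := parseSeed(arg)
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Printf("%s -> %s\n", arg, u)
	}
}
```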

func Rel

func Rel(fromDir, toFile string) string

Rel returns the relative link from fromDir, the directory of the referring file, to toFile. Both are slash paths relative to the mirror root. The result always uses '/'.
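One way to compute such a link is to strip the directory prefix the two paths share, climb out of what remains of the source directory with "..", then descend to the target. The sketch below treats its first argument as a directory, as the parameter name fromDir suggests; rel and segments are hypothetical helpers, not the package's code.

```go
package main

import (
	"fmt"
	"path"
	"strings"
)

// segments splits a slash path into its components; "" and "." yield none.
func segments(p string) []string {
	p = strings.Trim(path.Clean(p), "/")
	if p == "" || p == "." {
		return nil
	}
	return strings.Split(p, "/")
}

// rel is a hypothetical stand-in for Rel.
func rel(fromDir, toFile string) string {
	from, to := segments(fromDir), segments(toFile)
	// Skip the shared directory prefix, never consuming the file name itself.
	i := 0
	for i < len(from) && i < len(to)-1 && from[i] == to[i] {
		i++
	}
	var parts []string
	for range from[i:] {
		parts = append(parts, "..") // climb out of the remaining source dirs
	}
	parts = append(parts, to[i:]...)
	return strings.Join(parts, "/")
}

func main() {
	fmt.Println(rel("docs/intro", "_kage/example.com/css/site.css"))
	fmt.Println(rel("docs", "docs/intro/index.html"))
	fmt.Println(rel("", "index.html"))
}
```

Because the computation is purely textual, the result is the same on every platform and never contains a backslash.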

func SameRegistrableDomain added in v0.3.0

func SameRegistrableDomain(seed, u *url.URL) bool

SameRegistrableDomain reports whether u shares the seed's registrable domain (its eTLD+1). It is looser than SameSite: developer.apple.com, www.apple.com, and images.apple.com all fold to apple.com and count as same-domain, while a separate brand like cdn-apple.com or an unrelated third party like ec.europa.eu does not. It is how kage decides whether an asset host is "the site's own" without listing every subdomain a CDN might use. When either host has no registrable domain (an IP, an oddball TLD), it falls back to an exact host match so the decision stays conservative.

func SameSite

func SameSite(seed, u *url.URL, allowSub bool) bool

SameSite reports whether u belongs to the seed's site: the same host, or a subdomain of it when allowSub is set.
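The host comparison can be sketched as an exact match plus an optional dot-suffix check. sameSite is a hypothetical name, and case-insensitive comparison is an assumption. Prefixing the seed host with "." keeps a lookalike such as notexample.com from matching example.com.

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// sameSite is a hypothetical stand-in for SameSite.
func sameSite(seed, u *url.URL, allowSub bool) bool {
	s := strings.ToLower(seed.Hostname())
	h := strings.ToLower(u.Hostname())
	if h == s {
		return true
	}
	// A subdomain must end in "." + seed host, not merely in the seed host.
	return allowSub && strings.HasSuffix(h, "."+s)
}

// mustParse is a small helper for the examples; it panics on a bad URL.
func mustParse(raw string) *url.URL {
	u, err := url.Parse(raw)
	if err != nil {
		panic(err)
	}
	return u
}

func main() {
	seed := mustParse("https://example.com/")
	fmt.Println(sameSite(seed, mustParse("https://docs.example.com/a"), true))  // true
	fmt.Println(sameSite(seed, mustParse("https://docs.example.com/a"), false)) // false
	fmt.Println(sameSite(seed, mustParse("https://notexample.com/"), true))     // false
}
```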

Types

type Kind

type Kind int

Kind distinguishes a crawlable page from a downloadable asset; the two map to different places on disk (pages mirror the URL path, assets live under the reserved prefix).

const (
	// Page is an HTML document kage renders and rewrites.
	Page Kind = iota
	// Asset is a stylesheet, image, font, or media file kage downloads verbatim.
	Asset
)

type ScopeConfig

type ScopeConfig struct {
	IncludeSubdomains bool
	ScopePrefix       string   // only crawl paths under this prefix, e.g. "/docs/"
	ExcludePaths      []string // skip any path containing one of these substrings
}

ScopeConfig controls which page URLs are in scope for crawling.
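To show how the three fields might combine, here is a minimal sketch of a scope check. How InScope actually orders or combines these tests is not documented here, so the logic is an assumption; scopeConfig and inScope are hypothetical local names that mirror the exported ones.

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// scopeConfig mirrors the fields of ScopeConfig for this sketch only.
type scopeConfig struct {
	IncludeSubdomains bool
	ScopePrefix       string
	ExcludePaths      []string
}

// inScope is a hypothetical stand-in for InScope: the host must belong to
// the seed's site, the path must sit under ScopePrefix (when set), and the
// path must not contain any ExcludePaths substring.
func inScope(seed, u *url.URL, cfg scopeConfig) bool {
	s := strings.ToLower(seed.Hostname())
	h := strings.ToLower(u.Hostname())
	if h != s && !(cfg.IncludeSubdomains && strings.HasSuffix(h, "."+s)) {
		return false
	}
	if cfg.ScopePrefix != "" && !strings.HasPrefix(u.Path, cfg.ScopePrefix) {
		return false
	}
	for _, ex := range cfg.ExcludePaths {
		if strings.Contains(u.Path, ex) {
			return false
		}
	}
	return true
}

// mustParse is a small helper for the examples; it panics on a bad URL.
func mustParse(raw string) *url.URL {
	u, err := url.Parse(raw)
	if err != nil {
		panic(err)
	}
	return u
}

func main() {
	seed := mustParse("https://example.com/docs/")
	cfg := scopeConfig{ScopePrefix: "/docs/", ExcludePaths: []string{"/archive/"}}
	for _, raw := range []string{
		"https://example.com/docs/intro",
		"https://example.com/blog/post",
		"https://example.com/docs/archive/2019",
	} {
		fmt.Println(raw, inScope(seed, mustParse(raw), cfg))
	}
}
```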
