Documentation
Overview
Package clone is kage's engine: it ties the Chrome pool, the JavaScript stripper, the asset localiser, and the URL↔path mapper into one resumable, polite crawl that turns a live site into a browsable offline folder.
Constants
const DefaultUserAgent = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) " +
	"AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36"
DefaultUserAgent is a desktop Chrome User-Agent string (Chrome 124 on macOS), used by the asset fetcher and the robots fetch so that a site treats kage like the browser it drives.
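As a rough illustration of how a fetcher might apply this constant, the sketch below sets the User-Agent header on an outgoing request. The helper name newAssetRequest and the fallback to DefaultUserAgent when the caller passes an empty string are assumptions made for the example; the package's real fetcher is not shown in this documentation.

```go
package main

import (
	"fmt"
	"net/http"
)

// DefaultUserAgent is copied from the constant documented above.
const DefaultUserAgent = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) " +
	"AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36"

// newAssetRequest is a hypothetical helper, not part of the package API.
// It builds a GET request and sets the User-Agent header, falling back to
// DefaultUserAgent when ua is empty (an assumed behaviour).
func newAssetRequest(rawURL, ua string) (*http.Request, error) {
	req, err := http.NewRequest(http.MethodGet, rawURL, nil)
	if err != nil {
		return nil, err
	}
	if ua == "" {
		ua = DefaultUserAgent
	}
	req.Header.Set("User-Agent", ua)
	return req, nil
}

func main() {
	req, err := newAssetRequest("https://example.com/robots.txt", "")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(req.Header.Get("User-Agent"))
}
```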
Variables
This section is empty.
Functions
func DefaultOutDir
func DefaultOutDir() string
DefaultOutDir returns the directory where mirrors land unless --out overrides it: a per-user data directory ($HOME/data/kage), so that clones started from any working directory collect in one place. It falls back to a local kage-out directory when the home directory cannot be resolved.
Types
type Cloner
type Cloner struct {
	// contains filtered or unexported fields
}
Cloner runs one clone. Build it with New, then call Run.
func New
New builds a Cloner for seed under cfg. It does not touch the network until Run is called.
type Config
type Config struct {
	OutDir            string        // output root; the mirror lands in <OutDir>/<host>/
	Reserved          string        // reserved dir name for assets and state (default "_kage")
	Workers           int           // page render workers
	AssetWorkers      int           // HTTP asset download workers
	BrowserPages      int           // Chrome page-pool size
	MaxPages          int           // stop after N pages (0 = unlimited)
	MaxDepth          int           // BFS/DFS depth cap (0 = unlimited)
	Traversal         string
	MaxAssetBytes     int64
	Timeout           time.Duration // per HTTP request
	Settle            time.Duration // network-idle quiet period
	RenderTimeout     time.Duration // hard cap per page render
	Scroll            bool
	UserAgent         string
	IncludeSubdomains bool
	ScopePrefix       string
	ExcludePaths      []string
	RespectRobots     bool
	FollowSitemap     bool
	Headless          bool
	KeepNoscript      bool
	ChromeBin         string
	ControlURL        string
	// Resume loads the prior run's visited set and skips pages already written,
	// so an interrupted or repeated clone picks up where it left off instead of
	// refetching. Refresh forces every page to be re-rendered in place (the
	// mirror is kept, files are overwritten) to pull in changed content. Force
	// deletes the mirror first for a clean-slate clone. Persist writes the
	// visited set back to state.json when the run ends.
	Resume  bool
	Refresh bool
	Force   bool
	Persist bool
}
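The interaction of Resume, Refresh, and Force described in the field comment can be sketched as a small planning function. Everything here is hypothetical: runModes, planRun, and the precedence rule (Refresh or Force disables skipping even when Resume is set) are assumptions for illustration, since the documentation does not say how conflicting flags are resolved.

```go
package main

import "fmt"

// runModes is a trimmed, hypothetical stand-in for the four Config flags.
type runModes struct {
	Resume, Refresh, Force, Persist bool
}

// planRun splits a crawl queue into pages to render and pages to skip.
// prior is the visited set loaded from an earlier run. Assumed precedence:
// pages are skipped only when Resume is set and neither Refresh (re-render
// in place) nor Force (mirror deleted first) is requested.
func planRun(m runModes, prior map[string]bool, queue []string) (render, skip []string) {
	usePrior := m.Resume && !m.Refresh && !m.Force
	seen := make(map[string]bool) // de-duplicates URLs within this run
	for _, u := range queue {
		if seen[u] {
			continue
		}
		seen[u] = true
		if usePrior && prior[u] {
			skip = append(skip, u)
			continue
		}
		render = append(render, u)
	}
	return render, skip
}

func main() {
	prior := map[string]bool{"/a": true}
	queue := []string{"/a", "/b", "/a"}
	render, skip := planRun(runModes{Resume: true}, prior, queue)
	fmt.Println("render:", render, "skip:", skip)
}
```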
Config is the full set of knobs for a clone run. DefaultConfig fills the baseline; the CLI overlays flags on top.
type Failure (added in v0.1.2)
type Failure struct {
	Kind    string // "page" or "asset"
	URL     string
	Referer string // the page that referenced it, when known
	Reason  string // e.g. "HTTP 403 Forbidden"
}
Failure is one thing that went wrong, kept for the end-of-run report so the errors are visible as a list rather than only as a count.