clone

package
v0.3.3 Latest
Warning

This package is not in the latest version of its module.

Published: Jun 16, 2026 License: MIT Imports: 22 Imported by: 0

Documentation

Overview

Package clone is kage's engine: it ties the Chrome pool, the JavaScript stripper, the asset localiser, and the URL↔path mapper into one resumable, polite crawl that turns a live site into a browsable offline folder.


Constants

const DefaultUserAgent = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) " +
	"AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36"

DefaultUserAgent is a current desktop Chrome UA, used by the asset fetcher and the robots fetch so a site treats kage like the browser it drives.

Variables

This section is empty.

Functions

func DefaultOutDir

func DefaultOutDir() string

DefaultOutDir is where mirrors land unless --out overrides it: a per-user data directory ($HOME/data/kage), so clones started from anywhere collect in one place. It falls back to a local kage-out directory when the home directory cannot be resolved.
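The resolution order described above could look roughly like the following sketch. It is not kage's source: outDirFor and defaultOutDir are hypothetical names, and the use of os.UserHomeDir to find the home directory is an assumption.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// outDirFor is a hypothetical helper: it returns <home>/data/kage, or the
// relative fallback "kage-out" when home is empty (could not be resolved).
func outDirFor(home string) string {
	if home == "" {
		return "kage-out"
	}
	return filepath.Join(home, "data", "kage")
}

// defaultOutDir resolves the home directory with os.UserHomeDir (an
// assumption about how the lookup is done) and delegates to outDirFor.
func defaultOutDir() string {
	home, err := os.UserHomeDir()
	if err != nil {
		home = ""
	}
	return outDirFor(home)
}

func main() {
	fmt.Println(defaultOutDir())
}
```

Splitting the pure path logic from the environment lookup keeps the fallback easy to test without touching $HOME.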

func DefaultSkipAssetExts added in v0.3.0

func DefaultSkipAssetExts() map[string]bool

DefaultSkipAssetExts returns the asset extensions kage leaves on the live web by default: bulk media, installers, and archives that rarely matter for reading a site offline but dominate its download size (a docs site's WWDC videos, .dmg/.pkg installers, and PDF manuals can be most of the bytes). Page-rendering assets (images, fonts, CSS) are deliberately absent, so the offline pages still look right.
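As an illustration of how a map[string]bool of extensions can gate a download, here is a minimal sketch. skipAsset is a hypothetical helper, the three extensions are only the ones named in the Config comments (not the full default set), and lower-casing the extension before the lookup is an assumption.

```go
package main

import (
	"fmt"
	"net/url"
	"path"
	"strings"
)

// skipAsset reports whether rawURL's path ends in an extension present in
// skip. Keys are assumed to be lower-case with a leading dot (".mp4").
func skipAsset(rawURL string, skip map[string]bool) bool {
	u, err := url.Parse(rawURL)
	if err != nil {
		return false // unparseable URLs are not skipped by this check
	}
	ext := strings.ToLower(path.Ext(u.Path))
	return skip[ext]
}

func main() {
	// Illustrative subset only; the real defaults come from DefaultSkipAssetExts.
	skip := map[string]bool{".mp4": true, ".pdf": true, ".dmg": true}

	fmt.Println(skipAsset("https://example.com/media/talk.MP4?x=1", skip)) // true
	fmt.Println(skipAsset("https://example.com/css/site.css", skip))       // false
}
```

Parsing the URL first means a query string such as ?x=1 does not hide the extension.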

Types

type Cloner

type Cloner struct {
	// contains filtered or unexported fields
}

Cloner runs one clone. Build it with New, then call Run.

func New

func New(seed *url.URL, cfg Config, logf Logf) *Cloner

New builds a Cloner for seed under cfg. It does not touch the network until Run is called.

func (*Cloner) Run

func (c *Cloner) Run(ctx context.Context) (Result, error)

Run executes the clone until the frontier drains, MaxPages is hit, or ctx is cancelled (which flushes the resume state). It returns the final Result.

func (*Cloner) Snapshot

func (c *Cloner) Snapshot() Progress

Snapshot returns the current progress, for a CLI ticker.

type Config

type Config struct {
	OutDir   string // output root; the mirror lands in <OutDir>/<host>/
	Reserved string // reserved dir name for assets and state (default "_kage")

	Workers       int // page render workers
	AssetWorkers  int // HTTP asset download workers
	BrowserPages  int // Chrome page-pool size
	MaxPages      int // stop after N pages (0 = unlimited)
	MaxDepth      int // BFS/DFS depth cap (0 = unlimited)
	Traversal     string
	MaxAssetBytes int64

	// AssetSameDomain, when set, localizes only assets whose host shares the
	// seed's registrable domain (apple.com covers developer.apple.com and
	// www.apple.com but not cdn-apple.com or an unrelated third party). An
	// off-domain asset is left pointing at its live URL instead of downloaded.
	AssetSameDomain bool
	// SkipAssetExts lists asset extensions (".mp4", ".pdf", ".dmg", …) that are
	// left on the live web rather than downloaded, so bulk media, installers, and
	// archives do not bloat the mirror. The reference keeps its remote URL.
	SkipAssetExts map[string]bool

	Timeout       time.Duration // per HTTP request
	Settle        time.Duration // network-idle quiet period
	RenderTimeout time.Duration // hard cap per page render
	Scroll        bool

	UserAgent         string
	IncludeSubdomains bool
	ScopePrefix       string
	ExcludePaths      []string

	RespectRobots bool
	FollowSitemap bool
	Headless      bool
	KeepNoscript  bool
	ChromeBin     string
	ControlURL    string

	// Resume loads the prior run's visited set and skips pages already written,
	// so an interrupted or repeated clone picks up where it left off instead of
	// refetching. Refresh forces every page to be re-rendered in place (the
	// mirror is kept, files are overwritten) to pull in changed content. Force
	// deletes the mirror first for a clean-slate clone. Persist writes the
	// visited set back to state.json when the run ends.
	Resume  bool
	Refresh bool
	Force   bool
	Persist bool
}

Config is the full set of knobs for a clone run. DefaultConfig fills the baseline; the CLI overlays flags on top.
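The AssetSameDomain rule can be sketched as a host comparison. This is a simplified illustration, not kage's implementation: sameRegistrableDomain is a hypothetical helper that takes the registrable domain as an input. Deriving that domain from a seed host generally needs a public suffix list (for example golang.org/x/net/publicsuffix), which is left out here.

```go
package main

import (
	"fmt"
	"strings"
)

// sameRegistrableDomain reports whether host is the registrable domain
// itself or one of its subdomains. The leading dot in the suffix check is
// what keeps "cdn-apple.com" from matching "apple.com".
func sameRegistrableDomain(host, domain string) bool {
	host = strings.ToLower(strings.TrimSuffix(host, "."))
	domain = strings.ToLower(domain)
	return host == domain || strings.HasSuffix(host, "."+domain)
}

func main() {
	for _, h := range []string{"developer.apple.com", "www.apple.com", "cdn-apple.com"} {
		fmt.Println(h, sameRegistrableDomain(h, "apple.com"))
	}
}
```

With the hosts from the field comment, the first two match and cdn-apple.com does not.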

func DefaultConfig

func DefaultConfig() Config

DefaultConfig returns the baseline configuration.

func (Config) HostDir

func (c Config) HostDir(host string) string

HostDir returns the mirror root for a seed host: <OutDir>/<host>.
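A minimal sketch of the <OutDir>/<host> layout, using a local stand-in type rather than the real Config. The type name config and the use of filepath.Join are assumptions made for illustration.

```go
package main

import (
	"fmt"
	"path/filepath"
)

// config is a local stand-in with only the field this sketch needs.
type config struct {
	OutDir string
}

// hostDir joins the output root and the seed host into the mirror root.
func (c config) hostDir(host string) string {
	return filepath.Join(c.OutDir, host)
}

func main() {
	c := config{OutDir: "kage-out"}
	fmt.Println(c.hostDir("example.com"))
}
```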

type Failure added in v0.1.2

type Failure struct {
	Kind    string // "page" or "asset"
	URL     string
	Referer string // the page that referenced it, when known
	Reason  string // e.g. "HTTP 403 Forbidden"
}

Failure is one thing that went wrong, kept for the end-of-run report so the errors are visible as a list rather than only as a count.
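One way a caller might turn a list of failures into an end-of-run report is sketched below. The Failure type here is a local copy of the documented struct so the example runs on its own; countByKind and the sample entries are hypothetical.

```go
package main

import "fmt"

// Failure mirrors the documented struct so the sketch is self-contained.
type Failure struct {
	Kind    string // "page" or "asset"
	URL     string
	Referer string
	Reason  string
}

// countByKind tallies failures per Kind.
func countByKind(fs []Failure) map[string]int {
	counts := make(map[string]int)
	for _, f := range fs {
		counts[f.Kind]++
	}
	return counts
}

func main() {
	// Made-up sample data for illustration.
	failures := []Failure{
		{Kind: "page", URL: "https://example.com/a", Reason: "HTTP 403 Forbidden"},
		{Kind: "asset", URL: "https://example.com/x.png", Referer: "https://example.com/a", Reason: "timeout"},
		{Kind: "asset", URL: "https://example.com/y.css", Referer: "https://example.com/b", Reason: "HTTP 404 Not Found"},
	}
	for _, f := range failures {
		fmt.Printf("%-5s %s (%s)\n", f.Kind, f.URL, f.Reason)
	}
	counts := countByKind(failures)
	fmt.Printf("pages=%d assets=%d\n", counts["page"], counts["asset"])
}
```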

type Logf

type Logf func(format string, args ...any)

Logf is an optional sink for human-readable progress lines.

type Progress

type Progress struct {
	Pages        int64
	PagePaths    int64
	PagesLinked  int64
	Assets       int64
	AssetSkipped int64
	PageErrors   int64
	AssetErrors  int64
	Skipped      int64
}

Progress is a snapshot of a run for display. Pages is every page document written (it equals the count of HTML files on disk); PagePaths is how many distinct URL paths those represent once query strings are ignored. The difference, Pages-PagePaths, is the number of query-string variants.
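The relationship between Pages and PagePaths can be shown with a small worked example. The struct below is a local copy holding only the two fields involved, queryVariants is a hypothetical helper, and the numbers are made up.

```go
package main

import "fmt"

// Progress is a local copy with only the fields this sketch uses.
type Progress struct {
	Pages     int64 // page documents written
	PagePaths int64 // distinct URL paths, query strings ignored
}

// queryVariants returns Pages-PagePaths, the number of pages that differ
// from another written page only by query string.
func queryVariants(p Progress) int64 {
	return p.Pages - p.PagePaths
}

func main() {
	p := Progress{Pages: 120, PagePaths: 100}
	fmt.Println(queryVariants(p)) // 20
}
```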

type Result

type Result struct {
	Progress
	OutDir string
	// Failures is a capped sample of what went wrong, for the final report.
	Failures []Failure
}

Result is the final outcome returned by Run.
