clone

package
v0.1.2 Latest Latest
Warning

This package is not in the latest version of its module.

Go to latest
Published: Jun 15, 2026 License: MIT Imports: 21 Imported by: 0

Documentation

Overview

Package clone is kage's engine: it ties the Chrome pool, the JavaScript stripper, the asset localiser, and the URL↔path mapper into one resumable, polite crawl that turns a live site into a browsable offline folder.

Index

Constants

View Source
const DefaultUserAgent = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) " +
	"AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36"

DefaultUserAgent is a current desktop Chrome UA, used by the asset fetcher and the robots fetch so a site treats kage like the browser it drives.

Variables

This section is empty.

Functions

func DefaultOutDir

func DefaultOutDir() string

DefaultOutDir is where mirrors land unless --out overrides it: a per-user data directory ($HOME/data/kage) so clones from anywhere collect in one place, falling back to a local kage-out when the home directory cannot be resolved.

Types

type Cloner

type Cloner struct {
	// contains filtered or unexported fields
}

Cloner runs one clone. Build it with New, then call Run.

func New

func New(seed *url.URL, cfg Config, logf Logf) *Cloner

New builds a Cloner for seed under cfg. It does not touch the network until Run is called.

func (*Cloner) Run

func (c *Cloner) Run(ctx context.Context) (Result, error)

Run executes the clone until the frontier drains, MaxPages is hit, or ctx is cancelled (which flushes the resume state). It returns the final Result.

func (*Cloner) Snapshot

func (c *Cloner) Snapshot() Progress

Snapshot returns the current progress, for a CLI ticker.

type Config

type Config struct {
	OutDir   string // output root; the mirror lands in <OutDir>/<host>/
	Reserved string // reserved dir name for assets and state (default "_kage")

	Workers       int // page render workers
	AssetWorkers  int // HTTP asset download workers
	BrowserPages  int // Chrome page-pool size
	MaxPages      int // stop after N pages (0 = unlimited)
	MaxDepth      int // BFS/DFS depth cap (0 = unlimited)
	Traversal     string
	MaxAssetBytes int64

	Timeout       time.Duration // per HTTP request
	Settle        time.Duration // network-idle quiet period
	RenderTimeout time.Duration // hard cap per page render
	Scroll        bool

	UserAgent         string
	IncludeSubdomains bool
	ScopePrefix       string
	ExcludePaths      []string

	RespectRobots bool
	FollowSitemap bool
	Headless      bool
	KeepNoscript  bool
	ChromeBin     string
	ControlURL    string

	// Resume loads the prior run's visited set and skips pages already written,
	// so an interrupted or repeated clone picks up where it left off instead of
	// refetching. Refresh forces every page to be re-rendered in place (the
	// mirror is kept, files are overwritten) to pull in changed content. Force
	// deletes the mirror first for a clean-slate clone. Persist writes the
	// visited set back to state.json when the run ends.
	Resume  bool
	Refresh bool
	Force   bool
	Persist bool
}

Config is the full set of knobs for a clone run. DefaultConfig fills the baseline; the CLI overlays flags on top.

func DefaultConfig

func DefaultConfig() Config

DefaultConfig returns the baseline configuration.

func (Config) HostDir

func (c Config) HostDir(host string) string

HostDir returns the mirror root for a seed host: <OutDir>/<host>.

type Failure added in v0.1.2

type Failure struct {
	Kind    string // "page" or "asset"
	URL     string
	Referer string // the page that referenced it, when known
	Reason  string // e.g. "HTTP 403 Forbidden"
}

Failure is one thing that went wrong, kept for the end-of-run report so the errors are visible as a list rather than only as a count.

type Logf

type Logf func(format string, args ...any)

Logf is an optional sink for human-readable progress lines.

type Progress

type Progress struct {
	Pages       int64
	Assets      int64
	PageErrors  int64
	AssetErrors int64
	Skipped     int64
}

Progress is a snapshot of a run for display.

type Result

type Result struct {
	Progress
	OutDir string
	// Failures is a capped sample of what went wrong, for the final report.
	Failures []Failure
}

Result is the final outcome returned by Run.

Jump to

Keyboard shortcuts

? : This menu
/ : Search site
f or F : Jump to
y or Y : Canonical URL