wayback

package
v0.3.1
Published: Feb 21, 2026 License: MIT Imports: 24 Imported by: 0

Documentation

Constants

This section is empty.

Variables

This section is empty.

Functions

func DownloadAll

func DownloadAll(cfg *Config) error

DownloadAll fetches the CDX index and downloads every snapshot concurrently.

func RelativeLink

func RelativeLink(fromDir, toFile string) string

RelativeLink returns the relative path from fromDir to toFile.
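
As an illustration of the idea, a relative link can be computed with the standard library. This sketch assumes the behaviour is close to filepath.Rel followed by slash normalisation; relativeLink is a hypothetical stand-in and its error fallback is an assumption.

```go
package main

import (
	"fmt"
	"path/filepath"
)

// relativeLink is a hypothetical stand-in for RelativeLink: it computes the
// path from fromDir to toFile and normalises OS separators to forward slashes.
func relativeLink(fromDir, toFile string) string {
	rel, err := filepath.Rel(fromDir, toFile)
	if err != nil {
		// Assumption: fall back to the target itself when no relative path exists.
		return filepath.ToSlash(toFile)
	}
	return filepath.ToSlash(rel)
}

func main() {
	fmt.Println(relativeLink("example.com/blog", "example.com/css/site.css"))
	fmt.Println(relativeLink("example.com", "example.com/index.html"))
}
```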

func RewriteCSSContent

func RewriteCSSContent(css, pageURL string, cfg *Config, idx *SnapshotIndex) string

RewriteCSSContent rewrites url() and @import references in CSS text.
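
A rough sketch of this kind of rewriting, using regular expressions. The patterns, the rewriteCSSRefs name and the mapRef callback are all assumptions made for illustration; the real function additionally takes the page URL, Config and SnapshotIndex to decide what each reference becomes.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

var (
	// url(...) with optional quotes around the target.
	cssURLRe = regexp.MustCompile(`url\(\s*(['"]?)([^'")]+)['"]?\s*\)`)
	// @import "file.css" or @import 'file.css' (the url() form is handled above).
	cssImportRe = regexp.MustCompile(`@import\s+(['"])([^'"]+)['"]`)
)

// rewriteCSSRefs passes every url() and quoted @import target through mapRef.
func rewriteCSSRefs(css string, mapRef func(string) string) string {
	css = cssURLRe.ReplaceAllStringFunc(css, func(m string) string {
		sub := cssURLRe.FindStringSubmatch(m)
		quote, ref := sub[1], strings.TrimSpace(sub[2])
		if strings.HasPrefix(ref, "data:") {
			return m // inline data URIs have nothing to rewrite
		}
		return "url(" + quote + mapRef(ref) + quote + ")"
	})
	return cssImportRe.ReplaceAllStringFunc(css, func(m string) string {
		sub := cssImportRe.FindStringSubmatch(m)
		return "@import " + sub[1] + mapRef(sub[2]) + sub[1]
	})
}

func main() {
	in := `body{background:url("/img/bg.png")} @import 'a.css';`
	out := rewriteCSSRefs(in, func(ref string) string { return "local/" + strings.TrimPrefix(ref, "/") })
	fmt.Println(out)
}
```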

func ToPosix

func ToPosix(p string) string

ToPosix converts backslashes to forward slashes.
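
A one-line equivalent, shown because it differs from filepath.ToSlash, which only converts the current operating system's separator. toPosix is a hypothetical stand-in for ToPosix.

```go
package main

import (
	"fmt"
	"strings"
)

// toPosix replaces every backslash with a forward slash on any OS.
func toPosix(p string) string {
	return strings.ReplaceAll(p, `\`, "/")
}

func main() {
	fmt.Println(toPosix(`example.com\css\site.css`))
}
```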

func URLToLocalPath

func URLToLocalPath(rawURL string, pretty bool) string

URLToLocalPath converts an absolute URL to a relative filesystem path fragment (no leading slash) suitable for joining with the output directory. The URL fragment (#…) is always stripped.

When pretty is true (--prettyPath flag), an extension-less last segment is treated as an implicit directory and resolved to index.html, query parameters are embedded before the file extension using "_" separators, and characters are normalised with sanitize.PathName (which keeps only [a-zA-Z0-9_-]).

When pretty is false (default), the original URL structure is preserved:

  • Path percent-encodings from the source URL are kept as-is.
  • Only characters forbidden in Windows file names (\ : * ? " < > |) and ASCII control characters are percent-encoded; everything else is kept.
  • The query string is appended to the filename with "?" encoded as %3F so the original file extension is never obscured.
  • Extension-less segments remain plain files (not turned into directories).
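
The non-pretty rules above can be approximated as follows. This is a sketch under assumptions, not the package's code: localPath and escapeForbidden are hypothetical names, the host is assumed to be the first path element, and edge cases such as a trailing "/" are not handled.

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// Characters that Windows forbids in file names.
const forbidden = `\:*?"<>|`

// escapeForbidden percent-encodes forbidden characters and ASCII control
// characters; every other byte, including existing %XX escapes, is kept.
func escapeForbidden(s string) string {
	var b strings.Builder
	for i := 0; i < len(s); i++ {
		c := s[i]
		if c < 0x20 || c == 0x7f || strings.IndexByte(forbidden, c) >= 0 {
			fmt.Fprintf(&b, "%%%02X", c)
		} else {
			b.WriteByte(c)
		}
	}
	return b.String()
}

// localPath builds host + path + "?" + query and then escapes it, so "?"
// becomes %3F. The fragment is dropped because url.Parse stores it separately.
func localPath(rawURL string) (string, error) {
	u, err := url.Parse(rawURL)
	if err != nil {
		return "", err
	}
	p := u.Host + u.EscapedPath()
	if u.RawQuery != "" {
		p += "?" + u.RawQuery
	}
	return escapeForbidden(p), nil
}

func main() {
	p, err := localPath("https://example.com/css/site.css?v=1.2#top")
	fmt.Println(p, err)
}
```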

func WaybackAssetURL

func WaybackAssetURL(assetURL, fallbackTS string, idx *SnapshotIndex) string

WaybackAssetURL builds a Wayback raw-content URL for an asset, resolving the best available timestamp via the snapshot index.
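
For orientation, the Wayback Machine serves unmodified archived content when the timestamp in the URL carries the id_ suffix. The sketch below assumes that URL shape; waybackRawURL and its resolve callback are hypothetical, standing in for the function and for SnapshotIndex.Resolve.

```go
package main

import "fmt"

// waybackRawURL picks a timestamp via resolve (when provided) and builds a
// raw-content URL of the form https://web.archive.org/web/<ts>id_/<url>.
func waybackRawURL(assetURL, fallbackTS string, resolve func(assetURL, fallback string) string) string {
	ts := fallbackTS
	if resolve != nil {
		ts = resolve(assetURL, fallbackTS)
	}
	return "https://web.archive.org/web/" + ts + "id_/" + assetURL
}

func main() {
	fmt.Println(waybackRawURL("https://example.com/a.css", "20200101000000", nil))
}
```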

Types

type CDXEntry

type CDXEntry struct {
	Timestamp   string
	OriginalURL string
}

CDXEntry holds one CDX result row.

type CSSRewriter

type CSSRewriter struct{}

CSSRewriter implements Rewriter for CSS resources.

func (CSSRewriter) Match

func (CSSRewriter) Match(logicalPath, contentType string, firstBytes []byte) bool

Match reports whether this resource should be treated as CSS. It checks the Content-Type and the file extension (.css).

func (CSSRewriter) Rewrite

func (CSSRewriter) Rewrite(store Storage, logicalPath, pageURL string, cfg *Config, idx *SnapshotIndex) error

type Config

type Config struct {
	BaseURL                string
	Variants               []string
	BareHost               string
	UnicodeHost            string
	ExactURL               bool
	Directory              string
	FromTimestamp          string
	ToTimestamp            string
	Threads                int
	RewriteLinks           bool
	PrettyPath             bool
	CanonicalAction        string
	DownloadExternalAssets bool
	Debug                  bool
	StopOnError            bool
	CDXRatePerMin          int     // CDX API requests per minute (default 60)
	CDXMaxRetries          int     // max retry attempts on throttle/5xx (default 5)
	Storage                Storage // if nil, NewLocalStorage(Directory) is used
}

Config holds all runtime configuration for the downloader.
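
The field comments mention defaults of 60 CDX requests per minute and 5 retries. The sketch below shows how such defaults could be applied and what the rate implies for request spacing. It uses a trimmed, hypothetical mirror of Config, and treating a zero value as "use the default" is an assumption.

```go
package main

import (
	"fmt"
	"time"
)

// config mirrors only the two rate-related fields of Config.
type config struct {
	CDXRatePerMin int
	CDXMaxRetries int
}

// withDefaults fills in the documented defaults for unset (zero) fields.
func withDefaults(c config) config {
	if c.CDXRatePerMin <= 0 {
		c.CDXRatePerMin = 60
	}
	if c.CDXMaxRetries <= 0 {
		c.CDXMaxRetries = 5
	}
	return c
}

// requestInterval is the minimum spacing between CDX requests for a given rate.
func requestInterval(c config) time.Duration {
	return time.Minute / time.Duration(c.CDXRatePerMin)
}

func main() {
	c := withDefaults(config{})
	fmt.Println(c.CDXRatePerMin, c.CDXMaxRetries, requestInterval(c))
}
```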

type HTMLRewriter

type HTMLRewriter struct{}

HTMLRewriter implements Rewriter for HTML resources.

func (HTMLRewriter) Match

func (HTMLRewriter) Match(logicalPath, contentType string, firstBytes []byte) bool

Match reports whether this resource should be treated as HTML. It checks the Content-Type, then the file extension (.html/.htm), then magic bytes.
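
A three-stage check of this kind might look like the sketch below. looksLikeHTML is a hypothetical name, and using net/http's DetectContentType for the magic-byte step is an assumption about the technique, not a statement about what the package does.

```go
package main

import (
	"fmt"
	"net/http"
	"path"
	"strings"
)

// looksLikeHTML checks the declared Content-Type, then the file extension,
// then sniffs the leading bytes of the body.
func looksLikeHTML(logicalPath, contentType string, firstBytes []byte) bool {
	if strings.HasPrefix(strings.ToLower(contentType), "text/html") {
		return true
	}
	switch strings.ToLower(path.Ext(logicalPath)) {
	case ".html", ".htm":
		return true
	}
	// DetectContentType implements MIME sniffing on at most the first 512 bytes.
	return strings.HasPrefix(http.DetectContentType(firstBytes), "text/html")
}

func main() {
	fmt.Println(looksLikeHTML("example.com/page", "", []byte("<!DOCTYPE html><html></html>")))
	fmt.Println(looksLikeHTML("example.com/site.css", "text/css", []byte("body{}")))
}
```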

func (HTMLRewriter) Rewrite

func (HTMLRewriter) Rewrite(store Storage, logicalPath, pageURL string, cfg *Config, idx *SnapshotIndex) error

type LocalStorage

type LocalStorage struct {
	// contains filtered or unexported fields
}

LocalStorage is the default Storage implementation that mirrors the logical layout into a root directory on the OS filesystem.

func NewLocalStorage

func NewLocalStorage(dir string) *LocalStorage

NewLocalStorage returns a LocalStorage rooted at dir. The root directory is created lazily by Put/PutBytes.

func (*LocalStorage) Exists

func (s *LocalStorage) Exists(path string) bool

Exists reports whether path already exists in storage.

func (*LocalStorage) Get

func (s *LocalStorage) Get(path string) ([]byte, error)

Get returns the full content of path.

func (*LocalStorage) Put

func (s *LocalStorage) Put(path string, r io.Reader) error

Put streams r into path atomically via a temp file + rename.

func (*LocalStorage) PutBytes

func (s *LocalStorage) PutBytes(path string, data []byte) error

PutBytes writes data to path, creating parent directories as needed.

type NormalizedBase

type NormalizedBase struct {
	CanonicalURL string
	Variants     []string // all http/https + www combinations
	BareHost     string   // hostname without www.
	UnicodeHost  string   // IDN-decoded hostname
}

NormalizedBase holds the canonical form and all URL variants for a base URL.

func NormalizeBaseURL

func NormalizeBaseURL(input string) (*NormalizedBase, error)

NormalizeBaseURL parses and normalises the user-supplied URL/domain input.
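
To make the Variants and BareHost fields concrete, here is a sketch that derives them from user input. baseVariants is a hypothetical name; the default scheme, the ordering of the variants and the dropping of the path are assumptions, and IDN decoding is left out.

```go
package main

import (
	"errors"
	"fmt"
	"net/url"
	"strings"
)

// baseVariants returns the http/https and www/non-www combinations for the
// host in input, together with the bare host name.
func baseVariants(input string) (variants []string, bareHost string, err error) {
	if !strings.Contains(input, "://") {
		input = "https://" + input // allow a plain domain such as "example.com"
	}
	u, err := url.Parse(input)
	if err != nil {
		return nil, "", err
	}
	bareHost = strings.TrimPrefix(strings.ToLower(u.Hostname()), "www.")
	if bareHost == "" {
		return nil, "", errors.New("input has no host")
	}
	for _, scheme := range []string{"http", "https"} {
		for _, prefix := range []string{"", "www."} {
			variants = append(variants, scheme+"://"+prefix+bareHost)
		}
	}
	return variants, bareHost, nil
}

func main() {
	v, host, err := baseVariants("www.Example.com/path")
	fmt.Println(v, host, err)
}
```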

type Progress

type Progress struct {
	// contains filtered or unexported fields
}

Progress is a nil-safe wrapper around progressbar.ProgressBar. A nil *Progress is valid; all methods are no-ops, making it trivial to disable output in tests or non-interactive pipelines.
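
Nil safety of this kind relies on Go allowing method calls on nil pointer receivers. The sketch below uses a plain counter in place of the wrapped progress bar; the progress type and its fields are hypothetical.

```go
package main

import "fmt"

// progress stands in for Progress; the real type wraps a progress bar.
type progress struct {
	current, max int
	finished     bool
}

// Each method returns early on a nil receiver, so a nil *progress is a no-op.
func (p *progress) Inc() {
	if p == nil {
		return
	}
	p.current++
}

func (p *progress) SetMax(n int) {
	if p == nil {
		return
	}
	p.max = n
}

func (p *progress) Finish() {
	if p == nil {
		return
	}
	p.finished = true
}

func main() {
	var silent *progress // nil: output disabled
	silent.SetMax(10)
	silent.Inc()
	silent.Finish()

	bar := &progress{}
	bar.SetMax(2)
	bar.Inc()
	fmt.Println(bar.current, bar.max, silent == nil)
}
```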

func NewCDXProgress

func NewCDXProgress() *Progress

NewCDXProgress creates an indeterminate spinner for the CDX index-fetch phase. Each call to Inc() advances the spinner and adds one to the page counter.

func NewDownloadProgress

func NewDownloadProgress(total int) *Progress

NewDownloadProgress creates a determinate bar for the file-download phase.

func (*Progress) Finish

func (p *Progress) Finish()

Finish marks the bar as complete and moves to a new line.

func (*Progress) Inc

func (p *Progress) Inc()

Inc increments the progress bar by one step.

func (*Progress) SetMax

func (p *Progress) SetMax(num int)

type Rewriter

type Rewriter interface {
	// Match reports whether this rewriter handles the given resource.
	Match(logicalPath, contentType string, firstBytes []byte) bool
	// Rewrite rewrites the resource in storage.
	Rewrite(store Storage, logicalPath, pageURL string, cfg *Config, idx *SnapshotIndex) error
}

Rewriter detects and rewrites a stored resource in-place.

func DetectRewriter

func DetectRewriter(logicalPath, contentType string, firstBytes []byte) Rewriter

DetectRewriter returns the Rewriter appropriate for the given resource, or nil when no rewriting is needed.
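
Dispatch of this sort is typically a first-match scan over the known rewriters. The sketch below reduces the interface to its Match method; the matcher types, their simplified rules and the HTML-before-CSS order are assumptions.

```go
package main

import (
	"fmt"
	"path"
	"strings"
)

// matcher is a cut-down version of Rewriter with only the Match method.
type matcher interface {
	Match(logicalPath, contentType string, firstBytes []byte) bool
}

type htmlMatcher struct{}

func (htmlMatcher) Match(logicalPath, contentType string, _ []byte) bool {
	ext := strings.ToLower(path.Ext(logicalPath))
	return strings.HasPrefix(strings.ToLower(contentType), "text/html") ||
		ext == ".html" || ext == ".htm"
}

type cssMatcher struct{}

func (cssMatcher) Match(logicalPath, contentType string, _ []byte) bool {
	return strings.HasPrefix(strings.ToLower(contentType), "text/css") ||
		strings.ToLower(path.Ext(logicalPath)) == ".css"
}

// detect returns the first matcher that accepts the resource, or nil.
func detect(logicalPath, contentType string, firstBytes []byte) matcher {
	for _, m := range []matcher{htmlMatcher{}, cssMatcher{}} {
		if m.Match(logicalPath, contentType, firstBytes) {
			return m
		}
	}
	return nil
}

func main() {
	fmt.Printf("%T\n", detect("example.com/site.css", "", nil))
	fmt.Println(detect("example.com/logo.png", "image/png", nil) == nil)
}
```

Callers are expected to check for nil before calling Rewrite, since binary assets such as images need no rewriting.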

type Snapshot

type Snapshot struct {
	FileURL   string // original URL
	Timestamp string // CDX timestamp string
	FileID    string // decoded URL path (deduplication key)
}

Snapshot represents a single archived file to download.

type SnapshotIndex

type SnapshotIndex struct {
	// contains filtered or unexported fields
}

SnapshotIndex deduplicates CDX entries and builds lookup maps.

func NewSnapshotIndex

func NewSnapshotIndex() *SnapshotIndex

NewSnapshotIndex creates an empty index.

func (*SnapshotIndex) GetManifest

func (idx *SnapshotIndex) GetManifest() []Snapshot

GetManifest builds and returns the full snapshot list, sorted newest first. It also initialises the lookup maps used by Resolve.

func (*SnapshotIndex) Register

func (idx *SnapshotIndex) Register(rawURL, timestamp string)

Register adds a CDX entry to the index, keeping the lexicographically greatest timestamp.
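
Because CDX timestamps are fixed-width digit strings (YYYYMMDDhhmmss), comparing them as strings orders them chronologically. The sketch below shows the keep-greatest rule together with a newest-first manifest. The index and snapshot types are hypothetical, and keying by the raw URL rather than by the decoded path is a simplification.

```go
package main

import (
	"fmt"
	"sort"
)

type snapshot struct {
	FileURL   string
	Timestamp string
}

// index keeps only the greatest timestamp seen for each URL.
type index struct {
	latest map[string]string
}

func newIndex() *index { return &index{latest: make(map[string]string)} }

// register replaces the stored timestamp only when the new one sorts later.
func (ix *index) register(rawURL, timestamp string) {
	if timestamp > ix.latest[rawURL] {
		ix.latest[rawURL] = timestamp
	}
}

// manifest returns one snapshot per URL, newest first (ties broken by URL).
func (ix *index) manifest() []snapshot {
	out := make([]snapshot, 0, len(ix.latest))
	for u, ts := range ix.latest {
		out = append(out, snapshot{FileURL: u, Timestamp: ts})
	}
	sort.Slice(out, func(i, j int) bool {
		if out[i].Timestamp != out[j].Timestamp {
			return out[i].Timestamp > out[j].Timestamp
		}
		return out[i].FileURL < out[j].FileURL
	})
	return out
}

func main() {
	ix := newIndex()
	ix.register("https://example.com/a", "20190101000000")
	ix.register("https://example.com/a", "20210101000000")
	ix.register("https://example.com/b", "20220101000000")
	fmt.Println(ix.manifest())
}
```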

func (*SnapshotIndex) Resolve

func (idx *SnapshotIndex) Resolve(assetURL, fallback string) string

Resolve finds the best timestamp for an asset URL. It checks path+query first, then path only, falling back to the provided default.
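
The lookup order can be sketched with two maps. The resolver type, its map names and the key format (path, optionally followed by "?" and the query) are assumptions made for illustration.

```go
package main

import (
	"fmt"
	"net/url"
)

// resolver holds timestamps keyed by path+query and by path alone.
type resolver struct {
	byPathQuery map[string]string
	byPath      map[string]string
}

// resolve tries the most specific key first and falls back step by step.
func (r *resolver) resolve(assetURL, fallback string) string {
	u, err := url.Parse(assetURL)
	if err != nil {
		return fallback
	}
	key := u.Path
	if u.RawQuery != "" {
		key += "?" + u.RawQuery
	}
	if ts, ok := r.byPathQuery[key]; ok {
		return ts
	}
	if ts, ok := r.byPath[u.Path]; ok {
		return ts
	}
	return fallback
}

func main() {
	r := &resolver{
		byPathQuery: map[string]string{"/a.css?v=2": "20210101000000"},
		byPath:      map[string]string{"/a.css": "20200101000000"},
	}
	fmt.Println(r.resolve("https://example.com/a.css?v=2", "20000101000000"))
	fmt.Println(r.resolve("https://example.com/a.css?v=9", "20000101000000"))
	fmt.Println(r.resolve("https://example.com/b.css", "20000101000000"))
}
```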

type Storage

type Storage interface {
	// Exists reports whether the logical path already has content.
	Exists(path string) bool
	// Put writes the content of r to path. The write is atomic —
	// no partial file is visible to concurrent readers.
	Put(path string, r io.Reader) error
	// Get returns the full content of path.
	Get(path string) ([]byte, error)
	// PutBytes writes data to path (convenience wrapper around Put).
	PutBytes(path string, data []byte) error
}

Storage abstracts reading and writing downloaded snapshot files. Logical paths are forward-slash relative paths as returned by URLToLocalPath (e.g. "example.com/page/index.html"). Implementations map them to wherever files actually live (OS directory, zip archive, memory map, …).
