Documentation ¶
Index ¶
- func DownloadAll(cfg *Config) error
- func RelativeLink(fromDir, toFile string) string
- func RewriteCSSContent(css, pageURL string, cfg *Config, idx *SnapshotIndex) string
- func ToPosix(p string) string
- func URLToLocalPath(rawURL string, pretty bool) string
- func WaybackAssetURL(assetURL, fallbackTS string, idx *SnapshotIndex) string
- type CDXEntry
- type CSSRewriter
- type Config
- type HTMLRewriter
- type LocalStorage
- type NormalizedBase
- type Progress
- type Rewriter
- type Snapshot
- type SnapshotIndex
- type Storage
Constants ¶
This section is empty.
Variables ¶
This section is empty.
Functions ¶
func DownloadAll ¶
func DownloadAll(cfg *Config) error
DownloadAll fetches the CDX index and downloads every snapshot concurrently.
func RelativeLink ¶
func RelativeLink(fromDir, toFile string) string
RelativeLink returns the relative path from fromDir to toFile.
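As an illustration of what a relative link between two logical paths looks like, here is a minimal sketch built on the standard library. The helper name relLink is hypothetical and this is not the package's implementation; it only assumes that both arguments are relative paths that share a root.

```go
package main

import (
	"fmt"
	"path/filepath"
)

// relLink is a hypothetical stand-in that computes the path from fromDir
// to toFile and normalises separators to forward slashes.
func relLink(fromDir, toFile string) string {
	rel, err := filepath.Rel(fromDir, toFile)
	if err != nil {
		// Assumption: fall back to the target itself when no relative path exists.
		return toFile
	}
	return filepath.ToSlash(rel)
}

func main() {
	fmt.Println(relLink("example.com/blog", "example.com/css/site.css")) // ../css/site.css
	fmt.Println(relLink("example.com", "example.com/index.html"))        // index.html
}
```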
func RewriteCSSContent ¶
func RewriteCSSContent(css, pageURL string, cfg *Config, idx *SnapshotIndex) string
RewriteCSSContent rewrites url() and @import references in CSS text.
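The kind of rewriting described above can be sketched with two regular expressions. This is an illustrative approximation only: the names rewriteCSS and mapRef are hypothetical, the patterns are simplified (they do not handle escaped quotes or parentheses inside URLs), and the real function additionally takes the page URL, the Config and the SnapshotIndex into account.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

var (
	// url(...) with optional single or double quotes.
	cssURL = regexp.MustCompile(`url\(\s*(['"]?)([^'")]+)(['"]?)\s*\)`)
	// @import "..." without a url() wrapper.
	cssImport = regexp.MustCompile(`@import\s+(['"])([^'"]+)(['"])`)
)

// rewriteCSS passes every referenced URL through mapRef and leaves the
// rest of the stylesheet untouched. data: URIs are kept as-is.
func rewriteCSS(css string, mapRef func(string) string) string {
	css = cssURL.ReplaceAllStringFunc(css, func(m string) string {
		sub := cssURL.FindStringSubmatch(m)
		ref := strings.TrimSpace(sub[2])
		if strings.HasPrefix(ref, "data:") {
			return m
		}
		return "url(" + sub[1] + mapRef(ref) + sub[3] + ")"
	})
	return cssImport.ReplaceAllStringFunc(css, func(m string) string {
		sub := cssImport.FindStringSubmatch(m)
		return "@import " + sub[1] + mapRef(sub[2]) + sub[3]
	})
}

func main() {
	in := `@import "base.css"; a{background:url(img/bg.png)}`
	out := rewriteCSS(in, func(ref string) string { return "local/" + ref })
	fmt.Println(out)
}
```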
func URLToLocalPath ¶
func URLToLocalPath(rawURL string, pretty bool) string
URLToLocalPath converts an absolute URL to a relative filesystem path fragment (no leading slash) suitable for joining with the output directory. The URL fragment (#…) is always stripped.
When pretty is true (--prettyPath flag):
- Extension-less last segments are treated as implicit directories and resolved to index.html.
- Query parameters are embedded before the file extension using "_" separators.
- Characters are normalised with sanitize.PathName (keeps [a-zA-Z0-9_-] only).
When pretty is false (default), the original URL structure is preserved:
- Path percent-encodings from the source URL are kept as-is.
- Only characters forbidden in Windows file names (\ : * ? " < > |) and ASCII control characters are percent-encoded; everything else is kept.
- The query string is appended to the filename with "?" encoded as %3F so the original file extension is never obscured.
- Extension-less segments remain plain files (not turned into directories).
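The non-pretty rules listed above can be approximated in a few lines. The following is a sketch under stated assumptions, not the package's code: escapeForbidden and sketchLocalPath are hypothetical names, the host is assumed to become the first path segment, and edge cases such as an empty path are not handled.

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// escapeForbidden percent-encodes the characters Windows forbids in file
// names (\ : * ? " < > |) and ASCII control characters. Everything else,
// including existing percent-encodings, passes through unchanged.
func escapeForbidden(s string) string {
	var b strings.Builder
	for i := 0; i < len(s); i++ {
		c := s[i]
		if c < 0x20 || c == 0x7f || strings.IndexByte(`\:*?"<>|`, c) >= 0 {
			fmt.Fprintf(&b, "%%%02X", c)
		} else {
			b.WriteByte(c)
		}
	}
	return b.String()
}

// sketchLocalPath mirrors the documented rules for pretty == false.
func sketchLocalPath(rawURL string) (string, error) {
	u, err := url.Parse(rawURL) // the #fragment is split off by Parse
	if err != nil {
		return "", err
	}
	// EscapedPath keeps the percent-encodings of the source URL.
	segs := strings.Split(strings.TrimPrefix(u.EscapedPath(), "/"), "/")
	for i, s := range segs {
		segs[i] = escapeForbidden(s)
	}
	out := escapeForbidden(u.Host) + "/" + strings.Join(segs, "/")
	if u.RawQuery != "" {
		// "?" is written as %3F so the query becomes part of the file name.
		out += "%3F" + escapeForbidden(u.RawQuery)
	}
	return out, nil
}

func main() {
	p, err := sketchLocalPath("https://example.com/a%20b/style.css?v=1#top")
	if err != nil {
		panic(err)
	}
	fmt.Println(p) // example.com/a%20b/style.css%3Fv=1
}
```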
func WaybackAssetURL ¶
func WaybackAssetURL(assetURL, fallbackTS string, idx *SnapshotIndex) string
WaybackAssetURL builds a Wayback raw-content URL for an asset, resolving the best available timestamp via the snapshot index.
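For orientation, the Wayback Machine serves the unmodified original bytes of a capture when the timestamp in the URL carries the id_ suffix. The sketch below builds such a URL; rawSnapshotURL is a hypothetical helper, and the exact format the package emits is an assumption. In the package itself the timestamp would first be resolved through the SnapshotIndex.

```go
package main

import "fmt"

const waybackBase = "https://web.archive.org/web/"

// rawSnapshotURL builds a raw-content URL of the form
// https://web.archive.org/web/<timestamp>id_/<original URL>.
func rawSnapshotURL(timestamp, assetURL string) string {
	return waybackBase + timestamp + "id_/" + assetURL
}

func main() {
	fmt.Println(rawSnapshotURL("20200101000000", "http://example.com/css/site.css"))
}
```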
Types ¶
type CSSRewriter ¶
type CSSRewriter struct{}
CSSRewriter implements Rewriter for CSS resources.
func (CSSRewriter) Match ¶
func (CSSRewriter) Match(logicalPath, contentType string, firstBytes []byte) bool
Match reports whether this resource should be treated as CSS. It checks the Content-Type and the file extension (.css).
func (CSSRewriter) Rewrite ¶
func (CSSRewriter) Rewrite(store Storage, logicalPath, pageURL string, cfg *Config, idx *SnapshotIndex) error
type Config ¶
type Config struct {
	BaseURL string
	Variants []string
	BareHost string
	UnicodeHost string
	ExactURL bool
	Directory string
	FromTimestamp string
	ToTimestamp string
	Threads int
	RewriteLinks bool
	PrettyPath bool
	CanonicalAction string
	DownloadExternalAssets bool
	Debug bool
	StopOnError bool
	CDXRatePerMin int // CDX API requests per minute (default 60)
	CDXMaxRetries int // max retry attempts on throttle/5xx (default 5)
	Storage Storage // if nil, NewLocalStorage(Directory) is used
}
Config holds all runtime configuration for the downloader.
type HTMLRewriter ¶
type HTMLRewriter struct{}
HTMLRewriter implements Rewriter for HTML resources.
func (HTMLRewriter) Match ¶
func (HTMLRewriter) Match(logicalPath, contentType string, firstBytes []byte) bool
Match reports whether this resource should be treated as HTML. It checks the Content-Type, then the file extension (.html/.htm), then magic bytes.
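The three-step detection described above might look roughly like the sketch below. The function name looksLikeHTML and the specific magic-byte prefixes are assumptions for illustration; the package may test different byte patterns or short-circuit differently.

```go
package main

import (
	"bytes"
	"fmt"
	"path"
	"strings"
)

// looksLikeHTML checks the Content-Type first, then the file extension,
// and finally sniffs the leading bytes of the body.
func looksLikeHTML(logicalPath, contentType string, firstBytes []byte) bool {
	if strings.HasPrefix(strings.ToLower(strings.TrimSpace(contentType)), "text/html") {
		return true
	}
	switch strings.ToLower(path.Ext(logicalPath)) {
	case ".html", ".htm":
		return true
	}
	// Assumed magic bytes: a doctype or an opening <html tag.
	head := bytes.ToLower(bytes.TrimSpace(firstBytes))
	return bytes.HasPrefix(head, []byte("<!doctype html")) ||
		bytes.HasPrefix(head, []byte("<html"))
}

func main() {
	fmt.Println(looksLikeHTML("example.com/page", "", []byte("<!DOCTYPE html><html>")))
	fmt.Println(looksLikeHTML("example.com/site.css", "text/css", []byte("body{}")))
}
```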
func (HTMLRewriter) Rewrite ¶
func (HTMLRewriter) Rewrite(store Storage, logicalPath, pageURL string, cfg *Config, idx *SnapshotIndex) error
type LocalStorage ¶
type LocalStorage struct {
	// contains filtered or unexported fields
}
LocalStorage is the default Storage implementation that mirrors the logical layout into a root directory on the OS filesystem.
func NewLocalStorage ¶
func NewLocalStorage(dir string) *LocalStorage
NewLocalStorage returns a LocalStorage rooted at dir. The root directory is created lazily by Put/PutBytes.
func (*LocalStorage) Exists ¶
func (s *LocalStorage) Exists(path string) bool
Exists reports whether path already exists in storage.
func (*LocalStorage) Get ¶
func (s *LocalStorage) Get(path string) ([]byte, error)
Get returns the full content of path.
type NormalizedBase ¶
type NormalizedBase struct {
	CanonicalURL string
	Variants []string // all http/https + www combinations
	BareHost string // hostname without www.
	UnicodeHost string // IDN-decoded hostname
}
NormalizedBase holds the canonical form and all URL variants for a base URL.
func NormalizeBaseURL ¶
func NormalizeBaseURL(input string) (*NormalizedBase, error)
NormalizeBaseURL parses and normalises the user-supplied URL/domain input.
type Progress ¶
type Progress struct {
	// contains filtered or unexported fields
}
Progress is a nil-safe wrapper around progressbar.ProgressBar. A nil *Progress is valid; all methods are no-ops, making it trivial to disable output in tests or non-interactive pipelines.
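The nil-safe behaviour relies on a Go idiom: methods with pointer receivers may be called on a nil pointer as long as they check for nil before touching any field. The tally type below is a hypothetical miniature of that pattern, not the Progress type itself.

```go
package main

import "fmt"

// tally counts events; a nil *tally is valid and silently does nothing.
type tally struct{ n int }

// Inc adds one to the counter, or does nothing on a nil receiver.
func (t *tally) Inc() {
	if t == nil {
		return
	}
	t.n++
}

// Count returns the current value, or zero on a nil receiver.
func (t *tally) Count() int {
	if t == nil {
		return 0
	}
	return t.n
}

func main() {
	var off *tally // disabled: every call is a no-op
	off.Inc()
	on := &tally{}
	on.Inc()
	fmt.Println(off.Count(), on.Count()) // 0 1
}
```

Callers can therefore pass nil wherever output is unwanted without guarding each call site.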
func NewCDXProgress ¶
func NewCDXProgress() *Progress
NewCDXProgress creates an indeterminate spinner for the CDX index-fetch phase. Each call to Inc() advances the spinner and adds one to the page counter.
func NewDownloadProgress ¶
NewDownloadProgress creates a determinate bar for the file-download phase.
type Rewriter ¶
type Rewriter interface {
	// Match reports whether this rewriter handles the given resource.
	Match(logicalPath, contentType string, firstBytes []byte) bool
	// Rewrite rewrites the resource in storage.
	Rewrite(store Storage, logicalPath, pageURL string, cfg *Config, idx *SnapshotIndex) error
}
Rewriter detects and rewrites a stored resource in-place.
func DetectRewriter ¶
DetectRewriter returns the Rewriter appropriate for the given resource, or nil when no rewriting is needed.
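One plausible shape for this kind of dispatch is to ask each candidate in turn and return the first that matches. The sketch below uses a cut-down hypothetical interface (matcher) with only the Match method; the candidate order and the matching rules are assumptions.

```go
package main

import (
	"fmt"
	"path"
	"strings"
)

// matcher is a reduced, hypothetical version of the Rewriter interface.
type matcher interface {
	Match(logicalPath, contentType string, firstBytes []byte) bool
}

type cssMatcher struct{}

func (cssMatcher) Match(logicalPath, contentType string, _ []byte) bool {
	return strings.HasPrefix(strings.ToLower(contentType), "text/css") ||
		strings.EqualFold(path.Ext(logicalPath), ".css")
}

type htmlMatcher struct{}

func (htmlMatcher) Match(logicalPath, contentType string, _ []byte) bool {
	ext := strings.ToLower(path.Ext(logicalPath))
	return strings.HasPrefix(strings.ToLower(contentType), "text/html") ||
		ext == ".html" || ext == ".htm"
}

// detect returns the first candidate that accepts the resource, or nil
// when none does, meaning the resource needs no rewriting.
func detect(candidates []matcher, logicalPath, contentType string, firstBytes []byte) matcher {
	for _, c := range candidates {
		if c.Match(logicalPath, contentType, firstBytes) {
			return c
		}
	}
	return nil
}

func main() {
	all := []matcher{htmlMatcher{}, cssMatcher{}}
	fmt.Printf("%T\n", detect(all, "example.com/site.css", "", nil))
	fmt.Println(detect(all, "example.com/logo.png", "image/png", nil) == nil)
}
```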
type Snapshot ¶
type Snapshot struct {
	FileURL string // original URL
	Timestamp string // CDX timestamp string
	FileID string // decoded URL path (deduplication key)
}
Snapshot represents a single archived file to download.
type SnapshotIndex ¶
type SnapshotIndex struct {
	// contains filtered or unexported fields
}
SnapshotIndex deduplicates CDX entries and builds lookup maps.
func NewSnapshotIndex ¶
func NewSnapshotIndex() *SnapshotIndex
NewSnapshotIndex creates an empty index.
func (*SnapshotIndex) GetManifest ¶
func (idx *SnapshotIndex) GetManifest() []Snapshot
GetManifest builds and returns the full snapshot list, sorted newest first. It also initialises the lookup maps used by Resolve.
func (*SnapshotIndex) Register ¶
func (idx *SnapshotIndex) Register(rawURL, timestamp string)
Register adds a CDX entry to the index, keeping the lexicographically greatest timestamp.
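CDX timestamps are fixed-width digit strings (YYYYMMDDhhmmss), so comparing them lexicographically is the same as comparing them chronologically. The sketch below illustrates the keep-the-greatest deduplication and the newest-first manifest ordering described above. All names are hypothetical, and it keys on the raw URL, whereas the package deduplicates on a decoded URL path.

```go
package main

import (
	"fmt"
	"sort"
)

type snap struct{ URL, Timestamp string }

type index struct{ latest map[string]string }

func newIndex() *index { return &index{latest: map[string]string{}} }

// register keeps only the greatest timestamp seen for each URL.
func (ix *index) register(rawURL, ts string) {
	if cur, ok := ix.latest[rawURL]; !ok || ts > cur {
		ix.latest[rawURL] = ts
	}
}

// manifest returns one entry per URL, newest first. Ties are broken by
// URL so the order is deterministic.
func (ix *index) manifest() []snap {
	out := make([]snap, 0, len(ix.latest))
	for u, ts := range ix.latest {
		out = append(out, snap{URL: u, Timestamp: ts})
	}
	sort.Slice(out, func(i, j int) bool {
		if out[i].Timestamp != out[j].Timestamp {
			return out[i].Timestamp > out[j].Timestamp
		}
		return out[i].URL < out[j].URL
	})
	return out
}

func main() {
	ix := newIndex()
	ix.register("http://example.com/a", "20190101000000")
	ix.register("http://example.com/a", "20210101000000")
	ix.register("http://example.com/b", "20180101000000")
	for _, s := range ix.manifest() {
		fmt.Println(s.Timestamp, s.URL)
	}
}
```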
func (*SnapshotIndex) Resolve ¶
func (idx *SnapshotIndex) Resolve(assetURL, fallback string) string
Resolve finds the best timestamp for an asset URL. It checks path+query first, then path only, falling back to the provided default.
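The lookup order described above amounts to a two-level map lookup with a default. In the sketch below the resolver type, its map names and the choice to include the host in the key are all assumptions made for illustration.

```go
package main

import (
	"fmt"
	"net/url"
)

// resolver holds two hypothetical lookup maps: one keyed by
// host+path+query and one keyed by host+path only.
type resolver struct {
	byPathQuery map[string]string
	byPath      map[string]string
}

// resolve tries the most specific key first and widens step by step.
func (r resolver) resolve(assetURL, fallback string) string {
	u, err := url.Parse(assetURL)
	if err != nil {
		return fallback
	}
	key := u.Host + u.Path
	if u.RawQuery != "" {
		if ts, ok := r.byPathQuery[key+"?"+u.RawQuery]; ok {
			return ts
		}
	}
	if ts, ok := r.byPath[key]; ok {
		return ts
	}
	return fallback
}

func main() {
	r := resolver{
		byPathQuery: map[string]string{"example.com/a.css?v=2": "20210101000000"},
		byPath:      map[string]string{"example.com/a.css": "20190101000000"},
	}
	fmt.Println(r.resolve("http://example.com/a.css?v=2", "20000101000000"))
	fmt.Println(r.resolve("http://example.com/a.css?v=9", "20000101000000"))
	fmt.Println(r.resolve("http://example.com/missing.js", "20000101000000"))
}
```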
type Storage ¶
type Storage interface {
	// Exists reports whether the logical path already has content.
	Exists(path string) bool
	// Put writes the content of r to path. The write is atomic:
	// no partial file is visible to concurrent readers.
	Put(path string, r io.Reader) error
	// Get returns the full content of path.
	Get(path string) ([]byte, error)
	// PutBytes writes data to path (convenience wrapper around Put).
	PutBytes(path string, data []byte) error
}
Storage abstracts reading and writing downloaded snapshot files. Logical paths are forward-slash relative paths as returned by URLToLocalPath (e.g. "example.com/page/index.html"). Implementations map them to wherever files actually live (OS directory, zip archive, memory map, …).
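To illustrate the memory-map case mentioned above, here is a sketch of an in-memory implementation. The storage interface is a local copy of the documented method set so the example compiles on its own, and memStorage is a hypothetical type that is not part of the package.

```go
package main

import (
	"bytes"
	"fmt"
	"io"
	"io/fs"
	"sync"
)

// storage is a local copy of the documented Storage method set.
type storage interface {
	Exists(path string) bool
	Put(path string, r io.Reader) error
	Get(path string) ([]byte, error)
	PutBytes(path string, data []byte) error
}

// memStorage keeps every file in a map guarded by a mutex.
type memStorage struct {
	mu    sync.RWMutex
	files map[string][]byte
}

var _ storage = (*memStorage)(nil) // compile-time interface check

func newMemStorage() *memStorage { return &memStorage{files: map[string][]byte{}} }

func (m *memStorage) Exists(path string) bool {
	m.mu.RLock()
	defer m.mu.RUnlock()
	_, ok := m.files[path]
	return ok
}

// Put reads r completely before taking the lock, so concurrent readers
// see either the old content or the new content, never a partial write.
func (m *memStorage) Put(path string, r io.Reader) error {
	data, err := io.ReadAll(r)
	if err != nil {
		return err
	}
	m.mu.Lock()
	defer m.mu.Unlock()
	m.files[path] = data
	return nil
}

func (m *memStorage) Get(path string) ([]byte, error) {
	m.mu.RLock()
	defer m.mu.RUnlock()
	data, ok := m.files[path]
	if !ok {
		return nil, fmt.Errorf("%s: %w", path, fs.ErrNotExist)
	}
	return append([]byte(nil), data...), nil // copy so callers cannot mutate the store
}

func (m *memStorage) PutBytes(path string, data []byte) error {
	return m.Put(path, bytes.NewReader(data))
}

func main() {
	s := newMemStorage()
	if err := s.PutBytes("example.com/page/index.html", []byte("<html></html>")); err != nil {
		panic(err)
	}
	data, _ := s.Get("example.com/page/index.html")
	fmt.Println(s.Exists("example.com/page/index.html"), string(data))
}
```

A value like this could be assigned to the Storage field of Config in tests, so that a download run never touches the filesystem.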