ia

package
v0.1.0 Latest
Warning

This package is not in the latest version of its module.

Published: Jun 13, 2026 License: Apache-2.0 Imports: 22 Imported by: 0

Documentation

Overview

Package ia is the library behind the archive command line: everything it knows about archive.org and the Wayback Machine lives here, with no dependency on the command framework.

Constants

const (
	BaseURL        = "https://archive.org/"
	MetadataURL    = "https://archive.org/metadata/"
	DownloadURL    = "https://archive.org/download/"
	DetailsURL     = "https://archive.org/details/"
	AdvancedSearch = "https://archive.org/advancedsearch.php"
	ScrapeURL      = "https://archive.org/services/search/v1/scrape"
	ViewsURL       = "https://be-api.us.archive.org/views/v1/short/"
	TasksURL       = "https://archive.org/services/tasks.php"
	XAuthnURL      = "https://archive.org/services/xauthn/"
	S3URL          = "https://s3.us.archive.org/"

	// Wayback Machine.
	WaybackAvailURL = "https://archive.org/wayback/available"
	WaybackCDXURL   = "http://web.archive.org/cdx/search/cdx"
	WaybackReplay   = "https://web.archive.org/web/"
	WaybackSaveURL  = "https://web.archive.org/save/"

	// UserAgent identifies the client politely to the Archive's edge.
	UserAgent = "archive-cli/1.0 (+https://github.com/tamnd/archive-cli)"
)

Internet Archive endpoints.

const (
	DefaultTimeout = 120 * time.Second
	DefaultRetries = 5
	DefaultDelay   = 250 * time.Millisecond
)

Defaults for the client and downloader.

Variables

var DefaultFields = []string{"identifier", "title", "mediatype", "downloads", "date"}

DefaultFields are returned when the caller asks for none.

var ErrNoCredentials = errors.New("no credentials configured; run 'archive configure'")

ErrNoCredentials is returned when an authenticated operation runs without configured credentials. The CLI maps it to exit code 4.

Functions

func CDX

func CDX(ctx context.Context, h *HTTPClient, q CDXQuery, fn func(CDXRecord) error) (int, error)

CDX fetches the capture history for a URL and calls fn for each record. The CDX server returns a header row first, which is consumed to map columns.
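
The header-row handling described above can be sketched as follows. This is an illustration only, assuming the CDX server is asked for JSON output, which arrives as an array of rows whose first row names the columns. The names `cdxRecord` and `parseCDXRows` are hypothetical stand-ins, not part of this package.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// cdxRecord mirrors a few CDXRecord fields; it is a local stand-in for this sketch.
type cdxRecord struct {
	Timestamp  string
	Original   string
	StatusCode string
}

// parseCDXRows decodes a JSON array of rows, consumes the first row as the
// header, and maps every following row to a record by column name.
func parseCDXRows(body []byte) ([]cdxRecord, error) {
	var rows [][]string
	if err := json.Unmarshal(body, &rows); err != nil {
		return nil, err
	}
	if len(rows) == 0 {
		return nil, nil
	}
	col := make(map[string]int, len(rows[0]))
	for i, name := range rows[0] {
		col[name] = i
	}
	// get returns "" when the column is absent from the header or the row is short.
	get := func(row []string, name string) string {
		if i, ok := col[name]; ok && i < len(row) {
			return row[i]
		}
		return ""
	}
	var out []cdxRecord
	for _, row := range rows[1:] {
		out = append(out, cdxRecord{
			Timestamp:  get(row, "timestamp"),
			Original:   get(row, "original"),
			StatusCode: get(row, "statuscode"),
		})
	}
	return out, nil
}

func main() {
	body := []byte(`[["urlkey","timestamp","original","statuscode"],
["org,example)/","20200101000000","http://example.org/","200"]]`)
	recs, err := parseCDXRows(body)
	if err != nil {
		panic(err)
	}
	for _, r := range recs {
		fmt.Println(r.Timestamp, r.Original, r.StatusCode)
	}
}
```

Looking columns up by name rather than by position keeps the decoding correct if the server returns a different column order or a subset of fields.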

func ConfigDir

func ConfigDir() string

ConfigDir returns the directory holding the credentials file.

func CredentialsPath

func CredentialsPath() string

CredentialsPath returns the path of the file that 'archive configure' writes.

func DeleteFile

func DeleteFile(ctx context.Context, h *HTTPClient, identifier, remoteName string) error

DeleteFile removes a file from an item over IAS3.

func DetailsURLFor

func DetailsURLFor(identifier string) string

DetailsURLFor returns the human details page for an item.

func DownloadURLFor

func DownloadURLFor(identifier, name string) string

DownloadURLFor returns the canonical download URL for a file in an item.

func ExtractText

func ExtractText(body []byte) string

ExtractText returns the visible text of an HTML document, with script and style content stripped and whitespace collapsed so that text blocks are separated by a single blank line.

func ExtractTitle

func ExtractTitle(body []byte) string

ExtractTitle returns the document <title>.

func GetMetadataSub

func GetMetadataSub(ctx context.Context, h *HTTPClient, identifier, subpath string) ([]byte, error)

GetMetadataSub fetches a sub-resource of the Metadata API (e.g. "files", "server", "metadata", or "files/<name>") and returns the raw JSON bytes.

func GetViews

func GetViews(ctx context.Context, h *HTTPClient, cache *Cache, identifiers []string) (map[string]Views, error)

GetViews fetches short view statistics for one or more items in a single call.

func QuoteQuery

func QuoteQuery(s string) string

QuoteQuery wraps a Lucene phrase in quotes when it contains spaces and is not already a field:value expression.
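
One way such a quoting rule might look is sketched below. The exact heuristics of QuoteQuery are not documented here; this version assumes that a colon signals a field:value expression and that an already quoted phrase is left alone. The function name `quoteQuery` is hypothetical.

```go
package main

import (
	"fmt"
	"strings"
)

// quoteQuery wraps s in double quotes when it contains a space, is not
// already quoted, and does not look like a field:value expression.
// The colon test is an assumption made for this sketch.
func quoteQuery(s string) string {
	s = strings.TrimSpace(s)
	if !strings.Contains(s, " ") {
		return s
	}
	if strings.HasPrefix(s, `"`) && strings.HasSuffix(s, `"`) {
		return s
	}
	if strings.Contains(s, ":") {
		return s
	}
	return `"` + s + `"`
}

func main() {
	for _, q := range []string{"nasa", "grateful dead", "title:apollo AND mediatype:movies"} {
		fmt.Println(quoteQuery(q))
	}
}
```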

func ReplayURL

func ReplayURL(timestamp, target string, raw bool) string

ReplayURL builds a Wayback replay URL. When raw is set it requests the original archived bytes (the "id_" modifier) instead of the rewritten page.
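
A minimal sketch of the URL shape, assuming the usual Wayback layout of replay prefix, timestamp, optional "id_" modifier, and target. The helper `replayURL` is a hypothetical stand-in, and the prefix is copied from the WaybackReplay constant above.

```go
package main

import "fmt"

// waybackReplay repeats the WaybackReplay constant so the sketch is self-contained.
const waybackReplay = "https://web.archive.org/web/"

// replayURL joins prefix, timestamp and target. When raw is true the "id_"
// modifier is appended to the timestamp to request the original bytes.
func replayURL(timestamp, target string, raw bool) string {
	if raw {
		timestamp += "id_"
	}
	return waybackReplay + timestamp + "/" + target
}

func main() {
	fmt.Println(replayURL("20200101000000", "http://example.org/", false))
	fmt.Println(replayURL("20200101000000", "http://example.org/", true))
}
```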

func SaveAnonymous

func SaveAnonymous(ctx context.Context, h *HTTPClient, target string) (string, error)

SaveAnonymous triggers an anonymous Save Page Now capture (fire and forget). It returns the replay URL the capture will live at.

func Scrape

func Scrape(ctx context.Context, h *HTTPClient, q SearchQuery, limit int, fn func(SearchDoc) error) (int, error)

Scrape exports every matching item via the cursor-based Scraping API, calling fn for each hit, up to limit (0 = all). This is the unbounded path used by --all; it does not support sort.
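
The cursor-driven loop behind an export like this can be sketched independently of HTTP. The example assumes each response carries a batch of items plus a cursor that is empty on the last page. `page`, `fetchFunc`, `scrapeAll` and `fakeFetch` are hypothetical names, and the fake fetcher stands in for the real API call.

```go
package main

import "fmt"

// page is one batch of results plus the cursor for the next batch.
type page struct {
	Items  []string
	Cursor string // empty when there are no more pages
}

// fetchFunc returns the page that follows cursor ("" requests the first page).
type fetchFunc func(cursor string) (page, error)

// scrapeAll follows cursors until they run out or limit items were delivered
// (limit 0 means no limit). It returns the number of items passed to fn.
func scrapeAll(fetch fetchFunc, limit int, fn func(string) error) (int, error) {
	n, cursor := 0, ""
	for {
		p, err := fetch(cursor)
		if err != nil {
			return n, err
		}
		for _, item := range p.Items {
			if limit > 0 && n >= limit {
				return n, nil
			}
			if err := fn(item); err != nil {
				return n, err
			}
			n++
		}
		if p.Cursor == "" || (limit > 0 && n >= limit) {
			return n, nil
		}
		cursor = p.Cursor
	}
}

// fakeFetch serves two canned pages so the loop can run without a network.
func fakeFetch(cursor string) (page, error) {
	switch cursor {
	case "":
		return page{Items: []string{"a", "b"}, Cursor: "c1"}, nil
	case "c1":
		return page{Items: []string{"c"}}, nil
	}
	return page{}, fmt.Errorf("unknown cursor %q", cursor)
}

func main() {
	n, err := scrapeAll(fakeFetch, 0, func(id string) error {
		fmt.Println(id)
		return nil
	})
	fmt.Println(n, err)
}
```

Because the cursor only moves forward, a loop of this shape cannot reorder results, which is consistent with the note that the scrape path does not support sort.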

func SearchEach

func SearchEach(ctx context.Context, h *HTTPClient, cache *Cache, q SearchQuery, limit int, fn func(SearchDoc) error) (int, error)

SearchEach pages through Advanced Search calling fn for each hit, up to limit (0 = the Solr deep-paging ceiling). It is the bounded, sortable path.

func Upload

func Upload(ctx context.Context, h *HTTPClient, identifier, localPath string, opts UploadOptions) (string, error)

Upload puts one local file into an item over IAS3.

func UploadHeaders

func UploadHeaders(opts UploadOptions, size int64) map[string]string

UploadHeaders builds the IAS3 request headers for an upload (exposed so the CLI can show them under --dry-run).

func UploadURL

func UploadURL(identifier, remoteName string) string

UploadURL returns the IAS3 target for a file in an item.

func ValidIdentifier

func ValidIdentifier(id string) bool

ValidIdentifier reports whether id is a syntactically valid item identifier.

Types

type APIError

type APIError struct {
	Status int
	URL    string
}

APIError carries a non-2xx HTTP status so the CLI can map it to an exit code (401/403 -> auth required).

func (*APIError) Error

func (e *APIError) Error() string

type AdvancedSearchResult

type AdvancedSearchResult struct {
	NumFound int
	Start    int
	Docs     []SearchDoc
}

AdvancedSearchResult is the decoded Advanced Search response.

Search runs one page of Advanced Search.

type CDXQuery

type CDXQuery struct {
	URL       string
	From      string
	To        string
	MatchType string   // exact | prefix | host | domain
	Filters   []string // e.g. "statuscode:200", "!mimetype:text/html"
	Collapse  string   // e.g. "digest" or "timestamp:8"
	Limit     int      // negative = newest N
}

CDXQuery describes a Wayback CDX history request.
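
To make the field meanings concrete, the sketch below turns such a query into CDX server parameters (url, from, to, matchType, filter, collapse, limit, output). How the package itself encodes the request is not shown in this documentation; `cdxQuery` and `cdxParams` are hypothetical local stand-ins.

```go
package main

import (
	"fmt"
	"net/url"
	"strconv"
)

// cdxQuery repeats the CDXQuery fields so the sketch is self-contained.
type cdxQuery struct {
	URL, From, To, MatchType, Collapse string
	Filters                            []string
	Limit                              int
}

// cdxParams maps the query onto CDX server parameters, skipping unset fields.
// Requesting JSON output is an assumption of this sketch.
func cdxParams(q cdxQuery) url.Values {
	v := url.Values{}
	v.Set("url", q.URL)
	v.Set("output", "json")
	if q.From != "" {
		v.Set("from", q.From)
	}
	if q.To != "" {
		v.Set("to", q.To)
	}
	if q.MatchType != "" {
		v.Set("matchType", q.MatchType)
	}
	for _, f := range q.Filters {
		v.Add("filter", f) // the parameter repeats once per filter
	}
	if q.Collapse != "" {
		v.Set("collapse", q.Collapse)
	}
	if q.Limit != 0 {
		v.Set("limit", strconv.Itoa(q.Limit)) // negative asks for the newest N
	}
	return v
}

func main() {
	q := cdxQuery{
		URL:       "example.org",
		MatchType: "prefix",
		Filters:   []string{"statuscode:200", "!mimetype:text/html"},
		Collapse:  "digest",
		Limit:     -5,
	}
	fmt.Println(cdxParams(q).Encode())
}
```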

type CDXRecord

type CDXRecord struct {
	URLKey     string `json:"urlkey"`
	Timestamp  string `json:"timestamp"`
	Original   string `json:"original"`
	MimeType   string `json:"mimetype"`
	StatusCode string `json:"statuscode"`
	Digest     string `json:"digest"`
	Length     string `json:"length"`
}

CDXRecord is one Wayback capture from the CDX server.

type Cache

type Cache struct {
	// contains filtered or unexported fields
}

Cache is a tiny on-disk blob cache keyed by an arbitrary string, with a TTL per entry. It is safe for the simple single-process use the CLI makes of it.

func NewCache

func NewCache(dir string, enabled bool) *Cache

NewCache returns a cache rooted under dir. If dir is empty or enabled is false, all operations are no-ops (cache miss on every Get).

func (*Cache) Clear

func (c *Cache) Clear() (int, error)

Clear removes every cached entry and returns the number of files removed.

func (*Cache) Dir

func (c *Cache) Dir() string

Dir returns the cache directory.

func (*Cache) Get

func (c *Cache) Get(key string, ttl time.Duration) ([]byte, bool)

Get returns cached bytes for key if present and younger than ttl.

func (*Cache) Put

func (c *Cache) Put(key string, data []byte)

Put stores data under key.

type Config

type Config struct {
	DataDir   string
	CacheDir  string
	Workers   int
	Timeout   time.Duration
	Delay     time.Duration
	Retries   int
	UserAgent string
}

Config controls library behaviour. The zero value is not usable; call DefaultConfig and adjust.

func DefaultConfig

func DefaultConfig() Config

DefaultConfig returns a Config rooted at the data directory with polite client defaults.

func (Config) DownloadDir

func (c Config) DownloadDir() string

DownloadDir is where downloaded item files land by default.

type Credentials

type Credentials struct {
	Access string
	Secret string
}

Credentials are the IAS3 access/secret key pair used for authenticated reads and all writes.

func LoadCredentials

func LoadCredentials() (*Credentials, error)

LoadCredentials reads the credentials file (simple "key = value" lines).

func Login

func Login(ctx context.Context, h *HTTPClient, email, password string) (*Credentials, error)

Login exchanges an email and password for IAS3 keys via the xauthn endpoint.

func ResolveCredentials

func ResolveCredentials(access, secret string) *Credentials

ResolveCredentials finds credentials in priority order: explicit values, environment, then the config file. It always returns a non-nil pointer; call Valid to test whether it is usable.

func (*Credentials) AuthHeader

func (c *Credentials) AuthHeader() string

AuthHeader builds the IAS3 authorization header value.

func (*Credentials) MaskedSecret

func (c *Credentials) MaskedSecret() string

MaskedSecret returns the secret with all but the last four characters hidden.
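
The two helpers above are small enough to sketch together. The header format follows the "LOW access:secret" form mentioned in the HTTPClient description; the masking character and the behaviour for secrets of four characters or fewer are assumptions. `credentials`, `authHeader` and `maskedSecret` are local stand-ins.

```go
package main

import (
	"fmt"
	"strings"
)

// credentials repeats the Credentials fields so the sketch is self-contained.
type credentials struct{ Access, Secret string }

// authHeader builds the IAS3 authorization value in "LOW access:secret" form.
func (c credentials) authHeader() string {
	return "LOW " + c.Access + ":" + c.Secret
}

// maskedSecret hides everything except the last four bytes. Secrets of four
// bytes or fewer are masked completely (an assumption of this sketch).
func (c credentials) maskedSecret() string {
	const keep = 4
	if len(c.Secret) <= keep {
		return strings.Repeat("*", len(c.Secret))
	}
	return strings.Repeat("*", len(c.Secret)-keep) + c.Secret[len(c.Secret)-keep:]
}

func main() {
	c := credentials{Access: "EXAMPLEKEY", Secret: "abcdefgh"}
	fmt.Println(c.authHeader())
	fmt.Println(c.maskedSecret())
}
```

The slice is taken by bytes, which is adequate for ASCII keys; a secret containing multi-byte characters would need rune-aware handling.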

func (*Credentials) Save

func (c *Credentials) Save() error

Save writes the credentials to the config file with mode 0600.

func (*Credentials) Valid

func (c *Credentials) Valid() bool

Valid reports whether both keys are present.

type DownloadResult

type DownloadResult struct {
	File     FileInfo
	Path     string
	Bytes    int64
	Skipped  bool // already present with a matching md5
	Verified bool // md5 checked and matched
}

DownloadResult reports the outcome of fetching one file.

func DownloadFile

func DownloadFile(ctx context.Context, h *HTTPClient, m Metadata, f FileInfo, destDir string, verify, flat bool) (DownloadResult, error)

DownloadFile fetches one file of an item into destDir. When verify is set the download is checked against the manifest md5; a pre-existing file with a matching md5 is skipped. flat omits any sub-directory in the file name.

type FileInfo

type FileInfo struct {
	Name   string `json:"name"`
	Source string `json:"source"` // original | derivative | metadata
	Format string `json:"format"`
	MD5    string `json:"md5"`
	CRC32  string `json:"crc32"`
	SHA1   string `json:"sha1"`
	SizeS  string `json:"size"`
	MtimeS string `json:"mtime"`

	Raw json.RawMessage `json:"-"`
}

FileInfo is one file record from an item's manifest. The Archive encodes the numeric fields as strings; Size/Mtime expose them coerced to int64. The full raw record is kept so media-specific fields (length, bitrate, track, album, width, height, rotation, ...) survive into json/jsonl/template output.

func (FileInfo) Fields

func (f FileInfo) Fields() map[string]any

Fields decodes the complete file record into a map so every field the API returned is available to json/jsonl/template output, not just the typed ones.

func (FileInfo) Mtime

func (f FileInfo) Mtime() int64

Mtime returns the file modification time as Unix seconds.

func (FileInfo) Size

func (f FileInfo) Size() int64

Size returns the file size in bytes.

func (*FileInfo) UnmarshalJSON

func (f *FileInfo) UnmarshalJSON(b []byte) error

UnmarshalJSON fills the typed fields and retains the complete raw record so no per-file field is ever dropped.

type HTTPClient

type HTTPClient struct {
	// contains filtered or unexported fields
}

HTTPClient is a polite, retrying HTTP client for archive.org. It rate-limits requests, retries on 429/5xx with linear backoff (honouring Retry-After), supports byte-range requests for resumable downloads, and attaches the IAS3 "LOW access:secret" authorization header when credentials are present.

func NewHTTPClient

func NewHTTPClient(cfg Config) *HTTPClient

NewHTTPClient builds an HTTPClient from cfg.

func (*HTTPClient) Delete

func (h *HTTPClient) Delete(ctx context.Context, url string, headers map[string]string) (*http.Response, error)

Delete removes url over IAS3. It requires credentials.

func (*HTTPClient) FetchBytes

func (h *HTTPClient) FetchBytes(ctx context.Context, url string) ([]byte, error)

FetchBytes fetches url and returns the whole body, erroring on non-2xx.

func (*HTTPClient) Get

func (h *HTTPClient) Get(ctx context.Context, url string) (*http.Response, error)

Get fetches url with retries.

func (*HTTPClient) GetDownload

func (h *HTTPClient) GetDownload(ctx context.Context, url string) (*http.Response, error)

GetDownload fetches url with no client timeout (relies on ctx cancellation), for large file bodies.

func (*HTTPClient) GetJSON

func (h *HTTPClient) GetJSON(ctx context.Context, url string, v any) error

GetJSON fetches url and decodes the JSON body into v.

func (*HTTPClient) GetRange

func (h *HTTPClient) GetRange(ctx context.Context, url string, offset, length int64) (*http.Response, error)

GetRange fetches the [offset, offset+length) byte span of url. A length <= 0 requests from offset to the end (used to resume a partial download).
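
The half-open span [offset, offset+length) maps onto the HTTP Range header, whose end position is inclusive. A minimal sketch of that conversion, with `rangeHeader` as a hypothetical helper:

```go
package main

import "fmt"

// rangeHeader converts an offset and length into a Range header value.
// HTTP byte ranges are inclusive, so the last byte is offset+length-1.
// A length <= 0 leaves the end open, requesting everything from offset on.
func rangeHeader(offset, length int64) string {
	if length <= 0 {
		return fmt.Sprintf("bytes=%d-", offset)
	}
	return fmt.Sprintf("bytes=%d-%d", offset, offset+length-1)
}

func main() {
	fmt.Println(rangeHeader(0, 1024)) // first KiB
	fmt.Println(rangeHeader(4096, 0)) // resume from byte 4096
}
```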

func (*HTTPClient) PostForm

func (h *HTTPClient) PostForm(ctx context.Context, rawURL string, form url.Values) ([]byte, error)

PostForm posts form-encoded values to rawURL (used by xauthn and SPN2) and returns the response body. Credentials, when present, are attached.

func (*HTTPClient) Put

func (h *HTTPClient) Put(ctx context.Context, url string, body io.Reader, size int64, headers map[string]string) (*http.Response, error)

Put uploads body to url with the given extra headers. It requires credentials and returns the response for the caller to inspect.

func (*HTTPClient) WithCredentials

func (h *HTTPClient) WithCredentials(c *Credentials) *HTTPClient

WithCredentials attaches credentials used by authenticated requests. Passing nil leaves the client anonymous.

type Link

type Link struct {
	URL  string `json:"url"`
	Text string `json:"text"`
}

Link is an extracted hyperlink.

func ExtractLinks

func ExtractLinks(body []byte) []Link

ExtractLinks returns every hyperlink in the document.

type MetaDict

type MetaDict map[string]json.RawMessage

MetaDict is an item's metadata dictionary. Values may be a string or an array of strings; Strings flattens either into a slice.

func (MetaDict) Get

func (d MetaDict) Get(key string) string

Get returns the first value for key as a string.

func (MetaDict) Strings

func (d MetaDict) Strings(key string) []string

Strings returns all values for key as a slice (handling single or array).

type Metadata

type Metadata struct {
	Identifier         string          `json:"-"`
	Meta               MetaDict        `json:"metadata"`
	Files              []FileInfo      `json:"files"`
	Server             string          `json:"server"`
	D1                 string          `json:"d1"`
	D2                 string          `json:"d2"`
	Dir                string          `json:"dir"`
	FilesCount         int             `json:"files_count"`
	ItemSize           int64           `json:"item_size"`
	Created            int64           `json:"created"`
	ItemLastUpdated    int64           `json:"item_last_updated"`
	Uniq               int64           `json:"uniq"`
	IsDark             bool            `json:"is_dark"`
	WorkableServers    []string        `json:"workable_servers"`
	AlternateLocations json.RawMessage `json:"alternate_locations"`
	Raw                json.RawMessage `json:"-"`
}

Metadata is the decoded Metadata API document for an item. Every top-level key the API returns is captured; the full document is also kept in Raw so the metadata command emits it byte-for-byte.

func GetMetadata

func GetMetadata(ctx context.Context, h *HTTPClient, cache *Cache, identifier string) (Metadata, error)

GetMetadata fetches and decodes the Metadata API document for an item.

func (Metadata) Exists

func (m Metadata) Exists() bool

Exists reports whether the metadata describes a real item (the API returns an empty object for an unknown identifier).

func (Metadata) FilterFiles

func (m Metadata) FilterFiles(glob, format string) []FileInfo

FilterFiles returns the files matching an optional case-insensitive name glob and/or an exact-or-substring format match. Empty filters match everything.
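
The matching rule can be sketched per file as below. This assumes shell-style globs as implemented by `path.Match` (where `*` does not cross `/`) and a case-insensitive substring test for the format; whether the package compares formats case-insensitively is not stated. `matchFile` is a hypothetical helper.

```go
package main

import (
	"fmt"
	"path"
	"strings"
)

// matchFile reports whether a file passes both optional filters.
// An empty glob or wantFormat matches everything.
func matchFile(name, format, glob, wantFormat string) bool {
	if glob != "" {
		// Lower-casing both sides makes the glob case-insensitive.
		ok, err := path.Match(strings.ToLower(glob), strings.ToLower(name))
		if err != nil || !ok {
			return false
		}
	}
	if wantFormat != "" &&
		!strings.Contains(strings.ToLower(format), strings.ToLower(wantFormat)) {
		return false
	}
	return true
}

func main() {
	fmt.Println(matchFile("Track01.MP3", "VBR MP3", "*.mp3", "mp3"))
	fmt.Println(matchFile("notes.txt", "Text", "*.mp3", ""))
	fmt.Println(matchFile("notes.txt", "Text", "", ""))
}
```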

func (Metadata) NodeURLFor

func (m Metadata) NodeURLFor(name string) string

NodeURLFor returns the direct datanode URL for a file, given the item's metadata (server+dir). It falls back to the canonical download URL when the node is unknown.

func (Metadata) Title

func (m Metadata) Title() string

Title returns the item's display title, falling back to the identifier.

type NotFoundError

type NotFoundError struct{ URL string }

NotFoundError marks a 404 so the CLI can map it to exit code 5.

func (*NotFoundError) Error

func (e *NotFoundError) Error() string

type SPNJob

type SPNJob struct {
	JobID     string `json:"job_id"`
	URL       string `json:"url"`
	Timestamp string `json:"timestamp"`
	Status    string `json:"status"`
	Message   string `json:"message"`
}

SPNJob is the result of a Save Page Now request.

func Save

func Save(ctx context.Context, h *HTTPClient, target string, outlinks, screenshot bool) (SPNJob, error)

Save submits an authenticated SPN2 capture and returns the job. It requires credentials. The outlinks and screenshot flags follow the SPN2 contract (capture_outlinks and capture_screenshot).

func SaveStatus

func SaveStatus(ctx context.Context, h *HTTPClient, jobID string) (SPNJob, error)

SaveStatus polls an SPN2 job once.

type SearchDoc

type SearchDoc map[string]json.RawMessage

SearchDoc is one hit from Advanced Search or the Scraping API. Fields are kept as raw JSON so any requested column survives; convenience accessors cover the common ones.

func (SearchDoc) Identifier

func (d SearchDoc) Identifier() string

Identifier is the item id of a hit.

func (SearchDoc) String

func (d SearchDoc) String(key string) string

String returns field key as a single string (first element if it is an array).

func (SearchDoc) TemplateValue

func (d SearchDoc) TemplateValue() any

TemplateValue returns a decoded view of the document so Go templates see real strings, numbers, and slices instead of raw JSON bytes. Single-element arrays collapse to their element, which is what most metadata fields want.

type SearchQuery

type SearchQuery struct {
	Query  string   // Lucene query
	Fields []string // fields to return (fl[])
	Sorts  []string // sort keys (sort[]), e.g. "downloads desc"
	Rows   int      // page size for Advanced Search
	Page   int      // 1-based page for Advanced Search
}

SearchQuery describes a search over the item store.

type Snapshot

type Snapshot struct {
	Available bool   `json:"available"`
	URL       string `json:"url"`
	Timestamp string `json:"timestamp"`
	Status    string `json:"status"`
}

Snapshot is the closest capture from the Availability API.

func Available

func Available(ctx context.Context, h *HTTPClient, cache *Cache, target, timestamp string) (Snapshot, bool, error)

Available returns the closest Wayback snapshot for a URL, optionally anchored to a timestamp (YYYYMMDDhhmmss, partial allowed).

type Task

type Task struct {
	TaskID    int64  `json:"task_id"`
	Server    string `json:"server"`
	Cmd       string `json:"cmd"`
	Status    string `json:"status"`
	Args      any    `json:"args"`
	Category  string `json:"category"`
	Priority  int    `json:"priority"`
	Submitter string `json:"submitter"`
	DateSub   string `json:"submittime"`
	Finished  int64  `json:"finished"`
	Notes     string `json:"notes"`

	Raw json.RawMessage `json:"-"`
}

Task is one entry from an item's catalog/derive history. The complete raw record is retained so no task field is dropped in json/jsonl output.

func GetTasks

func GetTasks(ctx context.Context, h *HTTPClient, identifier string) ([]Task, error)

GetTasks fetches the catalog task history for an item. The tasks endpoint requires credentials for items you do not own; the client attaches them when present.

func (Task) Fields

func (t Task) Fields() map[string]any

Fields decodes the complete task record so every field survives into output.

func (*Task) UnmarshalJSON

func (t *Task) UnmarshalJSON(b []byte) error

UnmarshalJSON fills the typed fields and retains the complete raw record.

type UploadOptions

type UploadOptions struct {
	Metadata    map[string]string // x-archive-meta-<k>:<v> set on item creation
	RemoteName  string            // override the name in the item (default: base of local path)
	MakeBucket  bool              // create the item if it does not exist
	NoDerive    bool              // skip derivation after upload
	ContentType string            // override Content-Type
}

UploadOptions tune an IAS3 upload.
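
As an illustration of how such options could translate into IAS3 request headers, the sketch below uses the header names from the public IAS3 documentation (x-archive-auto-make-bucket, x-archive-queue-derive, x-archive-size-hint, x-archive-meta-*). Which headers this package actually emits is not stated here, so treat the mapping as an assumption; `uploadOptions` and `uploadHeaders` are local stand-ins.

```go
package main

import (
	"fmt"
	"strconv"
)

// uploadOptions repeats the relevant UploadOptions fields.
type uploadOptions struct {
	Metadata    map[string]string
	MakeBucket  bool
	NoDerive    bool
	ContentType string
}

// uploadHeaders maps the options onto IAS3 headers. The size hint is always
// sent in this sketch; the other headers appear only when requested.
func uploadHeaders(o uploadOptions, size int64) map[string]string {
	h := map[string]string{"x-archive-size-hint": strconv.FormatInt(size, 10)}
	if o.MakeBucket {
		h["x-archive-auto-make-bucket"] = "1"
	}
	if o.NoDerive {
		h["x-archive-queue-derive"] = "0"
	}
	if o.ContentType != "" {
		h["Content-Type"] = o.ContentType
	}
	for k, v := range o.Metadata {
		h["x-archive-meta-"+k] = v
	}
	return h
}

func main() {
	h := uploadHeaders(uploadOptions{
		Metadata:   map[string]string{"title": "Example", "mediatype": "texts"},
		MakeBucket: true,
		NoDerive:   true,
	}, 2048)
	for k, v := range h {
		fmt.Printf("%s: %s\n", k, v)
	}
}
```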

type Views

type Views struct {
	AllTime  int64 `json:"all_time"`
	Last30   int64 `json:"last_30day"`
	Last7    int64 `json:"last_7day"`
	HaveData bool  `json:"have_data"`
}

Views holds the short view statistics for an item.
