Documentation
Overview ¶
Package ia is the library behind the archive command line: everything it knows about archive.org and the Wayback Machine lives here, with no dependency on the command framework.
Index ¶
- Constants
- Variables
- func CDX(ctx context.Context, h *HTTPClient, q CDXQuery, fn func(CDXRecord) error) (int, error)
- func ConfigDir() string
- func CredentialsPath() string
- func DeleteFile(ctx context.Context, h *HTTPClient, identifier, remoteName string) error
- func DetailsURLFor(identifier string) string
- func DownloadURLFor(identifier, name string) string
- func ExtractText(body []byte) string
- func ExtractTitle(body []byte) string
- func GetMetadataSub(ctx context.Context, h *HTTPClient, identifier, subpath string) ([]byte, error)
- func GetViews(ctx context.Context, h *HTTPClient, cache *Cache, identifiers []string) (map[string]Views, error)
- func QuoteQuery(s string) string
- func ReplayURL(timestamp, target string, raw bool) string
- func SaveAnonymous(ctx context.Context, h *HTTPClient, target string) (string, error)
- func Scrape(ctx context.Context, h *HTTPClient, q SearchQuery, limit int, ...) (int, error)
- func SearchEach(ctx context.Context, h *HTTPClient, cache *Cache, q SearchQuery, limit int, ...) (int, error)
- func Upload(ctx context.Context, h *HTTPClient, identifier, localPath string, ...) (string, error)
- func UploadHeaders(opts UploadOptions, size int64) map[string]string
- func UploadURL(identifier, remoteName string) string
- func ValidIdentifier(id string) bool
- type APIError
- type AdvancedSearchResult
- type CDXQuery
- type CDXRecord
- type Cache
- type Config
- type Credentials
- type DownloadResult
- type FileInfo
- type HTTPClient
- func (h *HTTPClient) Delete(ctx context.Context, url string, headers map[string]string) (*http.Response, error)
- func (h *HTTPClient) FetchBytes(ctx context.Context, url string) ([]byte, error)
- func (h *HTTPClient) Get(ctx context.Context, url string) (*http.Response, error)
- func (h *HTTPClient) GetDownload(ctx context.Context, url string) (*http.Response, error)
- func (h *HTTPClient) GetJSON(ctx context.Context, url string, v any) error
- func (h *HTTPClient) GetRange(ctx context.Context, url string, offset, length int64) (*http.Response, error)
- func (h *HTTPClient) PostForm(ctx context.Context, rawURL string, form url.Values) ([]byte, error)
- func (h *HTTPClient) Put(ctx context.Context, url string, body io.Reader, size int64, ...) (*http.Response, error)
- func (h *HTTPClient) WithCredentials(c *Credentials) *HTTPClient
- type Link
- type MetaDict
- type Metadata
- type NotFoundError
- type SPNJob
- type SearchDoc
- type SearchQuery
- type Snapshot
- type Task
- type UploadOptions
- type Views
Constants ¶
const (
	BaseURL        = "https://archive.org/"
	MetadataURL    = "https://archive.org/metadata/"
	DownloadURL    = "https://archive.org/download/"
	DetailsURL     = "https://archive.org/details/"
	AdvancedSearch = "https://archive.org/advancedsearch.php"
	ScrapeURL      = "https://archive.org/services/search/v1/scrape"
	ViewsURL       = "https://be-api.us.archive.org/views/v1/short/"
	TasksURL       = "https://archive.org/services/tasks.php"
	XAuthnURL      = "https://archive.org/services/xauthn/"
	S3URL          = "https://s3.us.archive.org/"

	// Wayback Machine.
	WaybackAvailURL = "https://archive.org/wayback/available"
	WaybackCDXURL   = "http://web.archive.org/cdx/search/cdx"
	WaybackReplay   = "https://web.archive.org/web/"
	WaybackSaveURL  = "https://web.archive.org/save/"

	// UserAgent identifies the client politely to the Archive's edge.
	UserAgent = "archive-cli/1.0 (+https://github.com/tamnd/archive-cli)"
)
Internet Archive endpoints.
const (
	DefaultTimeout = 120 * time.Second
	DefaultRetries = 5
	DefaultDelay   = 250 * time.Millisecond
)
Defaults for the client and downloader.
Variables ¶
var DefaultFields = []string{"identifier", "title", "mediatype", "downloads", "date"}
DefaultFields are returned when the caller asks for none.
var ErrNoCredentials = errors.New("no credentials configured; run 'archive configure'")
ErrNoCredentials is returned when an authenticated operation runs without configured credentials. The CLI maps it to exit code 4.
Functions ¶
func CDX ¶
CDX fetches the capture history for a URL and calls fn for each record. The CDX server returns a header row first, which is consumed to map columns.
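The header-row handling described above can be sketched as follows. This is a minimal illustration, not the package's code: it assumes the CDX server was asked for JSON output (an array of string arrays), and record and parseCDX are hypothetical names.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// record is a hypothetical, trimmed-down stand-in for CDXRecord.
type record struct {
	Timestamp  string
	Original   string
	StatusCode string
}

// parseCDX decodes an array of string arrays, treats the first row as the
// column header, and looks every later row up by column name.
func parseCDX(body []byte) ([]record, error) {
	var rows [][]string
	if err := json.Unmarshal(body, &rows); err != nil {
		return nil, err
	}
	if len(rows) == 0 {
		return nil, nil
	}
	col := make(map[string]int, len(rows[0]))
	for i, name := range rows[0] {
		col[name] = i
	}
	get := func(row []string, name string) string {
		if i, ok := col[name]; ok && i < len(row) {
			return row[i]
		}
		return "" // column absent from this response
	}
	out := make([]record, 0, len(rows)-1)
	for _, row := range rows[1:] {
		out = append(out, record{
			Timestamp:  get(row, "timestamp"),
			Original:   get(row, "original"),
			StatusCode: get(row, "statuscode"),
		})
	}
	return out, nil
}

func main() {
	body := []byte(`[["urlkey","timestamp","original","mimetype","statuscode","digest","length"],
["com,example)/","20200101000000","http://example.com/","text/html","200","ABC","1234"]]`)
	recs, err := parseCDX(body)
	if err != nil {
		panic(err)
	}
	for _, r := range recs {
		fmt.Println(r.Timestamp, r.Original, r.StatusCode)
	}
}
```

Mapping by name rather than by position keeps the decoder working when a query asks for a different column set.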
func ConfigDir ¶
func ConfigDir() string
ConfigDir returns the directory holding the credentials file.
func CredentialsPath ¶
func CredentialsPath() string
CredentialsPath returns the path of the credentials file that 'archive configure' writes.
func DeleteFile ¶
func DeleteFile(ctx context.Context, h *HTTPClient, identifier, remoteName string) error
DeleteFile removes a file from an item over IAS3.
func DetailsURLFor ¶
DetailsURLFor returns the human details page for an item.
func DownloadURLFor ¶
DownloadURLFor returns the canonical download URL for a file in an item.
func ExtractText ¶
ExtractText returns the visible text of an HTML document, with script and style content stripped and whitespace collapsed so that blocks of text are separated by a single blank line.
func ExtractTitle ¶
ExtractTitle returns the document <title>.
func GetMetadataSub ¶
GetMetadataSub fetches a sub-resource of the Metadata API (e.g. "files", "server", "metadata", or "files/<name>") and returns the raw JSON bytes.
func GetViews ¶
func GetViews(ctx context.Context, h *HTTPClient, cache *Cache, identifiers []string) (map[string]Views, error)
GetViews fetches short view statistics for one or more items in a single call.
func QuoteQuery ¶
QuoteQuery wraps a Lucene phrase in quotes when it contains spaces and is not already a field:value expression.
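A rough sketch of that quoting rule, for illustration only. How the package recognises a field:value expression is not documented here; the sketch assumes that any colon marks one, and quoteQuery is a hypothetical name.

```go
package main

import (
	"fmt"
	"strings"
)

// quoteQuery wraps s in double quotes when it contains a space, unless it
// already looks quoted or (by the assumption above) like field:value.
func quoteQuery(s string) string {
	s = strings.TrimSpace(s)
	if !strings.Contains(s, " ") {
		return s // single term, nothing to quote
	}
	if strings.Contains(s, ":") {
		return s // assumed to be a field:value expression
	}
	if strings.HasPrefix(s, `"`) && strings.HasSuffix(s, `"`) {
		return s // already a quoted phrase
	}
	return `"` + s + `"`
}

func main() {
	for _, q := range []string{"grateful dead", "nasa", "collection:etree live"} {
		fmt.Println(quoteQuery(q))
	}
}
```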
func ReplayURL ¶
ReplayURL builds a Wayback replay URL. When raw is set it requests the original archived bytes (the "id_" modifier) instead of the rewritten page.
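As an illustration of the URL shape, here is a minimal sketch built on the WaybackReplay prefix from the constants. replayURL is a hypothetical stand-in, and the real function may normalise its inputs differently.

```go
package main

import "fmt"

// replayURL joins the replay prefix, the capture timestamp, an optional
// "id_" modifier, and the target URL.
func replayURL(timestamp, target string, raw bool) string {
	modifier := ""
	if raw {
		modifier = "id_" // ask for the original bytes, not the rewritten page
	}
	return "https://web.archive.org/web/" + timestamp + modifier + "/" + target
}

func main() {
	fmt.Println(replayURL("20200101000000", "http://example.com/", false))
	fmt.Println(replayURL("20200101000000", "http://example.com/", true))
}
```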
func SaveAnonymous ¶
SaveAnonymous triggers an anonymous Save Page Now capture (fire and forget). It returns the replay URL the capture will live at.
func Scrape ¶
func Scrape(ctx context.Context, h *HTTPClient, q SearchQuery, limit int, fn func(SearchDoc) error) (int, error)
Scrape exports every matching item via the cursor-based Scraping API, calling fn for each hit, up to limit (0 = all). This is the unbounded path used by --all; it does not support sort.
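The cursor loop behind this kind of export can be sketched without touching the network. Everything here is hypothetical (page, fetchFunc, scrapeAll, fakeFetch); it only shows the control flow of following a cursor until it runs out or limit is reached.

```go
package main

import "fmt"

// page is a hypothetical decoded response: some items plus the cursor for
// the next request (empty when the result set is exhausted).
type page struct {
	Items  []string
	Cursor string
}

// fetchFunc stands in for one Scraping API request.
type fetchFunc func(cursor string) (page, error)

// scrapeAll follows cursors, calling fn per item, and stops at limit
// (0 means no limit). It returns the number of items delivered.
func scrapeAll(fetch fetchFunc, limit int, fn func(string) error) (int, error) {
	n, cursor := 0, ""
	for {
		p, err := fetch(cursor)
		if err != nil {
			return n, err
		}
		for _, item := range p.Items {
			if limit > 0 && n >= limit {
				return n, nil
			}
			if err := fn(item); err != nil {
				return n, err
			}
			n++
		}
		if p.Cursor == "" || (limit > 0 && n >= limit) {
			return n, nil
		}
		cursor = p.Cursor
	}
}

// fakeFetch serves three items across two pages.
func fakeFetch(cursor string) (page, error) {
	if cursor == "" {
		return page{Items: []string{"a", "b"}, Cursor: "c1"}, nil
	}
	return page{Items: []string{"c"}}, nil
}

func main() {
	n, err := scrapeAll(fakeFetch, 0, func(id string) error {
		fmt.Println(id)
		return nil
	})
	fmt.Println(n, err)
}
```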
func SearchEach ¶
func SearchEach(ctx context.Context, h *HTTPClient, cache *Cache, q SearchQuery, limit int, fn func(SearchDoc) error) (int, error)
SearchEach pages through Advanced Search calling fn for each hit, up to limit (0 = the Solr deep-paging ceiling). It is the bounded, sortable path.
func Upload ¶
func Upload(ctx context.Context, h *HTTPClient, identifier, localPath string, opts UploadOptions) (string, error)
Upload puts one local file into an item over IAS3.
func UploadHeaders ¶
func UploadHeaders(opts UploadOptions, size int64) map[string]string
UploadHeaders builds the IAS3 request headers for an upload (exposed so the CLI can show them under --dry-run).
func ValidIdentifier ¶
ValidIdentifier reports whether id is a syntactically valid item identifier.
Types ¶
type APIError ¶
APIError carries a non-2xx HTTP status so the CLI can map it to an exit code (401/403 -> auth required).
type AdvancedSearchResult ¶
AdvancedSearchResult is the decoded Advanced Search response.
func Search ¶
func Search(ctx context.Context, h *HTTPClient, cache *Cache, q SearchQuery) (AdvancedSearchResult, error)
Search runs one page of Advanced Search.
type CDXQuery ¶
type CDXQuery struct {
URL string
From string
To string
MatchType string // exact | prefix | host | domain
Filters []string // e.g. "statuscode:200", "!mimetype:text/html"
Collapse string // e.g. "digest" or "timestamp:8"
Limit int // negative = newest N
}
CDXQuery describes a Wayback CDX history request.
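One plausible way to turn such a query into request parameters is sketched below. The parameter names (url, from, to, matchType, filter, collapse, limit, output) are those of the Wayback CDX server; how the package itself maps the struct is an assumption, and cdxQuery and cdxValues are hypothetical names.

```go
package main

import (
	"fmt"
	"net/url"
	"strconv"
)

// cdxQuery mirrors the fields of CDXQuery for this sketch.
type cdxQuery struct {
	URL       string
	From, To  string
	MatchType string
	Filters   []string
	Collapse  string
	Limit     int
}

// cdxValues builds the query string values, skipping unset fields.
func cdxValues(q cdxQuery) url.Values {
	v := url.Values{}
	v.Set("url", q.URL)
	v.Set("output", "json") // assumed: JSON output with a header row
	if q.From != "" {
		v.Set("from", q.From)
	}
	if q.To != "" {
		v.Set("to", q.To)
	}
	if q.MatchType != "" {
		v.Set("matchType", q.MatchType)
	}
	for _, f := range q.Filters {
		v.Add("filter", f) // the parameter may repeat
	}
	if q.Collapse != "" {
		v.Set("collapse", q.Collapse)
	}
	if q.Limit != 0 {
		v.Set("limit", strconv.Itoa(q.Limit)) // negative = newest N
	}
	return v
}

func main() {
	q := cdxQuery{
		URL:       "example.com",
		MatchType: "prefix",
		Filters:   []string{"statuscode:200", "!mimetype:text/html"},
		Limit:     -5,
	}
	fmt.Println(cdxValues(q).Encode())
}
```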
type CDXRecord ¶
type CDXRecord struct {
URLKey string `json:"urlkey"`
Timestamp string `json:"timestamp"`
Original string `json:"original"`
MimeType string `json:"mimetype"`
StatusCode string `json:"statuscode"`
Digest string `json:"digest"`
Length string `json:"length"`
}
CDXRecord is one Wayback capture from the CDX server.
type Cache ¶
type Cache struct {
// contains filtered or unexported fields
}
Cache is a tiny on-disk blob cache keyed by an arbitrary string, with a TTL per entry. It is safe for the simple single-process use the CLI makes of it.
func NewCache ¶
NewCache returns a cache rooted under dir. If dir is empty or enabled is false, all operations are no-ops (cache miss on every Get).
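To make the behaviour concrete, here is a small sketch of an on-disk cache with a per-entry TTL and a no-op mode. The layout (a hashed file name holding a JSON envelope with an expiry time) is purely an assumption, and cache, entry, and demo are hypothetical names; the package's actual storage format is not documented here.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"encoding/json"
	"fmt"
	"os"
	"path/filepath"
	"time"
)

// entry is the on-disk envelope: payload plus its expiry time.
type entry struct {
	Expires time.Time `json:"expires"`
	Data    []byte    `json:"data"`
}

// cache with an empty dir is disabled: Put does nothing, Get always misses.
type cache struct{ dir string }

func (c *cache) path(key string) string {
	sum := sha256.Sum256([]byte(key)) // arbitrary keys become safe file names
	return filepath.Join(c.dir, hex.EncodeToString(sum[:]))
}

func (c *cache) Put(key string, data []byte, ttl time.Duration) error {
	if c.dir == "" {
		return nil
	}
	if err := os.MkdirAll(c.dir, 0o755); err != nil {
		return err
	}
	b, err := json.Marshal(entry{Expires: time.Now().Add(ttl), Data: data})
	if err != nil {
		return err
	}
	return os.WriteFile(c.path(key), b, 0o644)
}

func (c *cache) Get(key string) ([]byte, bool) {
	if c.dir == "" {
		return nil, false
	}
	b, err := os.ReadFile(c.path(key))
	if err != nil {
		return nil, false
	}
	var e entry
	if json.Unmarshal(b, &e) != nil || time.Now().After(e.Expires) {
		return nil, false // corrupt or expired entries count as misses
	}
	return e.Data, true
}

// demo reports whether a fresh entry hits, an expired one hits, and a
// disabled cache hits.
func demo() (fresh, expired, disabled bool) {
	dir, err := os.MkdirTemp("", "cache-sketch")
	if err != nil {
		panic(err)
	}
	defer os.RemoveAll(dir)
	c := &cache{dir: dir}
	_ = c.Put("k", []byte("v"), time.Minute)
	_, fresh = c.Get("k")
	_ = c.Put("old", []byte("v"), -time.Second)
	_, expired = c.Get("old")
	off := &cache{}
	_ = off.Put("k", []byte("v"), time.Minute)
	_, disabled = off.Get("k")
	return fresh, expired, disabled
}

func main() {
	fresh, expired, disabled := demo()
	fmt.Println(fresh, expired, disabled)
}
```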
type Config ¶
type Config struct {
DataDir string
CacheDir string
Workers int
Timeout time.Duration
Delay time.Duration
Retries int
UserAgent string
}
Config controls library behaviour. The zero value is not usable; call DefaultConfig and adjust.
func DefaultConfig ¶
func DefaultConfig() Config
DefaultConfig returns a Config rooted at the data directory with polite client defaults.
func (Config) DownloadDir ¶
DownloadDir is where downloaded item files land by default.
type Credentials ¶
Credentials are the IAS3 access/secret key pair used for authenticated reads and all writes.
func LoadCredentials ¶
func LoadCredentials() (*Credentials, error)
LoadCredentials reads the credentials file (simple "key = value" lines).
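Parsing "key = value" lines needs very little code. The sketch below is an assumption about the format's details: it skips blank lines and lines starting with "#", and the key names in the example are hypothetical, since the file's actual keys are not listed here.

```go
package main

import (
	"bufio"
	"fmt"
	"io"
	"strings"
)

// parseKV reads "key = value" lines into a map, trimming whitespace around
// both sides. Lines without "=" are ignored.
func parseKV(r io.Reader) (map[string]string, error) {
	out := map[string]string{}
	sc := bufio.NewScanner(r)
	for sc.Scan() {
		line := strings.TrimSpace(sc.Text())
		if line == "" || strings.HasPrefix(line, "#") {
			continue // assumed comment convention
		}
		k, v, ok := strings.Cut(line, "=")
		if !ok {
			continue
		}
		out[strings.TrimSpace(k)] = strings.TrimSpace(v)
	}
	return out, sc.Err()
}

// parseKVString is a convenience wrapper for in-memory input.
func parseKVString(s string) (map[string]string, error) {
	return parseKV(strings.NewReader(s))
}

func main() {
	// "access" and "secret" are hypothetical key names.
	kv, err := parseKVString("# credentials\naccess = AK\nsecret = SK\n")
	fmt.Println(kv, err)
}
```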
func Login ¶
func Login(ctx context.Context, h *HTTPClient, email, password string) (*Credentials, error)
Login exchanges an email and password for IAS3 keys via the xauthn endpoint.
func ResolveCredentials ¶
func ResolveCredentials(access, secret string) *Credentials
ResolveCredentials finds credentials in priority order: explicit values, environment, then the config file. It always returns a non-nil pointer; call Valid to test whether it is usable.
func (*Credentials) AuthHeader ¶
func (c *Credentials) AuthHeader() string
AuthHeader builds the IAS3 authorization header value.
func (*Credentials) MaskedSecret ¶
func (c *Credentials) MaskedSecret() string
MaskedSecret returns the secret with all but the last four characters hidden.
func (*Credentials) Save ¶
func (c *Credentials) Save() error
Save writes the credentials to the config file with mode 0600.
func (*Credentials) Valid ¶
func (c *Credentials) Valid() bool
Valid reports whether both keys are present.
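The three small helpers above (AuthHeader, MaskedSecret, Valid) can be pictured roughly like this. The "LOW access:secret" header form comes from the HTTPClient description; the field names and the handling of secrets of four characters or fewer are assumptions, and the lower-case names are hypothetical.

```go
package main

import (
	"fmt"
	"strings"
)

// credentials is a hypothetical stand-in for Credentials.
type credentials struct {
	Access string
	Secret string
}

// valid reports whether both keys are present.
func (c *credentials) valid() bool {
	return c != nil && c.Access != "" && c.Secret != ""
}

// authHeader builds the IAS3 authorization header value.
func (c *credentials) authHeader() string {
	return "LOW " + c.Access + ":" + c.Secret
}

// maskedSecret hides all but the last four characters. Secrets of four
// characters or fewer are fully masked (an assumption of this sketch).
func (c *credentials) maskedSecret() string {
	n := len(c.Secret)
	if n <= 4 {
		return strings.Repeat("*", n)
	}
	return strings.Repeat("*", n-4) + c.Secret[n-4:]
}

func main() {
	c := &credentials{Access: "AK", Secret: "s3cr3tKEY9"}
	fmt.Println(c.valid(), c.authHeader(), c.maskedSecret())
}
```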
type DownloadResult ¶
type DownloadResult struct {
File FileInfo
Path string
Bytes int64
Skipped bool // already present with a matching md5
Verified bool // md5 checked and matched
}
DownloadResult reports the outcome of fetching one file.
func DownloadFile ¶
func DownloadFile(ctx context.Context, h *HTTPClient, m Metadata, f FileInfo, destDir string, verify, flat bool) (DownloadResult, error)
DownloadFile fetches one file of an item into destDir. When verify is set the download is checked against the manifest md5; a pre-existing file with a matching md5 is skipped. flat omits any sub-directory in the file name.
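The md5 check that drives both the skip and the verification can be sketched with the standard library alone. md5Matches and checkTemp are hypothetical helpers, not part of the package.

```go
package main

import (
	"crypto/md5"
	"encoding/hex"
	"fmt"
	"io"
	"os"
	"strings"
)

// md5Matches streams the file at path through MD5 and compares the hex
// digest with want, ignoring case.
func md5Matches(path, want string) (bool, error) {
	f, err := os.Open(path)
	if err != nil {
		return false, err
	}
	defer f.Close()
	h := md5.New()
	if _, err := io.Copy(h, f); err != nil {
		return false, err
	}
	return strings.EqualFold(hex.EncodeToString(h.Sum(nil)), want), nil
}

// checkTemp writes content to a temporary file and checks it against want.
func checkTemp(content, want string) (bool, error) {
	f, err := os.CreateTemp("", "md5-sketch")
	if err != nil {
		return false, err
	}
	defer os.Remove(f.Name())
	if _, err := f.WriteString(content); err != nil {
		f.Close()
		return false, err
	}
	if err := f.Close(); err != nil {
		return false, err
	}
	return md5Matches(f.Name(), want)
}

func main() {
	ok, err := checkTemp("hello", "5d41402abc4b2a76b9719d911017c592")
	fmt.Println(ok, err)
}
```

Running the same check before a download starts is what lets a file already present with a matching digest be skipped.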
type FileInfo ¶
type FileInfo struct {
Name string `json:"name"`
Source string `json:"source"` // original | derivative | metadata
Format string `json:"format"`
MD5 string `json:"md5"`
CRC32 string `json:"crc32"`
SHA1 string `json:"sha1"`
SizeS string `json:"size"`
MtimeS string `json:"mtime"`
Raw json.RawMessage `json:"-"`
}
FileInfo is one file record from an item's manifest. The Archive encodes the numeric fields as strings; Size/Mtime expose them coerced to int64. The full raw record is kept so media-specific fields (length, bitrate, track, album, width, height, rotation, ...) survive into json/jsonl/template output.
func (FileInfo) Fields ¶
Fields decodes the complete file record into a map so every field the API returned is available to json/jsonl/template output, not just the typed ones.
func (*FileInfo) UnmarshalJSON ¶
UnmarshalJSON fills the typed fields and retains the complete raw record so no per-file field is ever dropped.
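A common way to get this "typed fields plus full raw record" behaviour is the alias-type pattern sketched below. It is an illustration under assumptions, not the package's source: fileInfo is a cut-down hypothetical type, and returning 0 for an unparsable size is a choice made for the sketch.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strconv"
)

// fileInfo is a cut-down, hypothetical version of FileInfo.
type fileInfo struct {
	Name  string          `json:"name"`
	SizeS string          `json:"size"`
	Raw   json.RawMessage `json:"-"`
}

// UnmarshalJSON decodes the typed fields through a method-less alias (to
// avoid recursing into itself) and keeps a copy of the full record.
func (f *fileInfo) UnmarshalJSON(b []byte) error {
	type plain fileInfo
	var p plain
	if err := json.Unmarshal(b, &p); err != nil {
		return err
	}
	*f = fileInfo(p)
	f.Raw = append(json.RawMessage(nil), b...) // copy: b is reused by the decoder
	return nil
}

// Size coerces the string-encoded size to int64, yielding 0 when it is
// missing or malformed.
func (f fileInfo) Size() int64 {
	n, err := strconv.ParseInt(f.SizeS, 10, 64)
	if err != nil {
		return 0
	}
	return n
}

// Fields decodes the complete record, including untyped keys.
func (f fileInfo) Fields() map[string]any {
	m := map[string]any{}
	_ = json.Unmarshal(f.Raw, &m)
	return m
}

// decodeFile is a convenience wrapper for in-memory input.
func decodeFile(s string) (fileInfo, error) {
	var f fileInfo
	err := json.Unmarshal([]byte(s), &f)
	return f, err
}

func main() {
	f, err := decodeFile(`{"name":"a.mp3","size":"1024","bitrate":"128"}`)
	fmt.Println(f.Name, f.Size(), f.Fields()["bitrate"], err)
}
```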
type HTTPClient ¶
type HTTPClient struct {
// contains filtered or unexported fields
}
HTTPClient is a polite, retrying HTTP client for archive.org. It rate-limits requests, retries on 429/5xx with linear backoff (honouring Retry-After), supports byte-range requests for resumable downloads, and attaches the IAS3 "LOW access:secret" authorization header when credentials are present.
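The retry policy described above boils down to two small decisions, sketched here in isolation. shouldRetry and retryDelay are hypothetical helpers; the sketch only understands the delay-in-seconds form of Retry-After, and it assumes the linear backoff is attempt times a base delay.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
	"time"
)

// shouldRetry reports whether a status code is worth another attempt.
func shouldRetry(status int) bool {
	return status == 429 || (status >= 500 && status <= 599)
}

// retryDelay returns how long to wait before the given attempt (1-based).
// A numeric Retry-After header wins; otherwise the delay grows linearly.
func retryDelay(attempt int, base time.Duration, retryAfter string) time.Duration {
	if secs, err := strconv.Atoi(strings.TrimSpace(retryAfter)); err == nil && secs >= 0 {
		return time.Duration(secs) * time.Second
	}
	return time.Duration(attempt) * base
}

func main() {
	base := 250 * time.Millisecond
	for attempt := 1; attempt <= 3; attempt++ {
		fmt.Println(attempt, retryDelay(attempt, base, ""))
	}
	fmt.Println(retryDelay(1, base, "2"))
}
```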
func NewHTTPClient ¶
func NewHTTPClient(cfg Config) *HTTPClient
NewHTTPClient builds an HTTPClient from cfg.
func (*HTTPClient) Delete ¶
func (h *HTTPClient) Delete(ctx context.Context, url string, headers map[string]string) (*http.Response, error)
Delete removes url over IAS3. It requires credentials.
func (*HTTPClient) FetchBytes ¶
FetchBytes fetches url and returns the whole body, erroring on non-2xx.
func (*HTTPClient) GetDownload ¶
GetDownload fetches url with no client timeout (relies on ctx cancellation), for large file bodies.
func (*HTTPClient) GetRange ¶
func (h *HTTPClient) GetRange(ctx context.Context, url string, offset, length int64) (*http.Response, error)
GetRange fetches the [offset, offset+length) byte span of url. A length <= 0 requests from offset to the end (used to resume a partial download).
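Because the span is half-open while HTTP Range headers use inclusive byte positions, the header's last byte is offset+length-1. A minimal sketch (rangeHeader is a hypothetical helper):

```go
package main

import "fmt"

// rangeHeader renders the Range header value for [offset, offset+length).
// A length <= 0 asks for everything from offset to the end.
func rangeHeader(offset, length int64) string {
	if length <= 0 {
		return fmt.Sprintf("bytes=%d-", offset)
	}
	return fmt.Sprintf("bytes=%d-%d", offset, offset+length-1)
}

func main() {
	fmt.Println(rangeHeader(100, 50))
	fmt.Println(rangeHeader(100, 0))
}
```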
func (*HTTPClient) PostForm ¶
PostForm posts form-encoded values to url (used by xauthn and SPN2) and returns the response body. Credentials, when present, are attached.
func (*HTTPClient) Put ¶
func (h *HTTPClient) Put(ctx context.Context, url string, body io.Reader, size int64, headers map[string]string) (*http.Response, error)
Put uploads body to url with the given extra headers. It requires credentials. Returns the response for the caller to inspect.
func (*HTTPClient) WithCredentials ¶
func (h *HTTPClient) WithCredentials(c *Credentials) *HTTPClient
WithCredentials attaches credentials used by authenticated requests. Passing nil leaves the client anonymous.
type Link ¶
Link is an extracted hyperlink.
func ExtractLinks ¶
ExtractLinks returns every hyperlink in the document.
type MetaDict ¶
type MetaDict map[string]json.RawMessage
MetaDict is an item's metadata dictionary. Values may be a string or an array of strings; Strings flattens either into a slice.
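Flattening a value that may be either a string or an array of strings could look like the sketch below. metaDict and parseMeta are hypothetical names, and dropping non-string array elements is an assumption of the sketch.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// metaDict mirrors MetaDict for this sketch.
type metaDict map[string]json.RawMessage

// Strings returns the value at key as a slice, whether the API sent a
// single string or an array. Missing keys and other shapes yield nil.
func (m metaDict) Strings(key string) []string {
	raw, ok := m[key]
	if !ok {
		return nil
	}
	var v any
	if err := json.Unmarshal(raw, &v); err != nil {
		return nil
	}
	switch t := v.(type) {
	case string:
		return []string{t}
	case []any:
		out := make([]string, 0, len(t))
		for _, e := range t {
			if s, ok := e.(string); ok {
				out = append(out, s)
			}
		}
		return out
	}
	return nil
}

// parseMeta is a convenience wrapper for in-memory input.
func parseMeta(s string) (metaDict, error) {
	var m metaDict
	err := json.Unmarshal([]byte(s), &m)
	return m, err
}

func main() {
	m, err := parseMeta(`{"title":"One","subject":["a","b"]}`)
	fmt.Println(m.Strings("title"), m.Strings("subject"), err)
}
```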
type Metadata ¶
type Metadata struct {
Identifier string `json:"-"`
Meta MetaDict `json:"metadata"`
Files []FileInfo `json:"files"`
Server string `json:"server"`
D1 string `json:"d1"`
D2 string `json:"d2"`
Dir string `json:"dir"`
FilesCount int `json:"files_count"`
ItemSize int64 `json:"item_size"`
Created int64 `json:"created"`
ItemLastUpdated int64 `json:"item_last_updated"`
Uniq int64 `json:"uniq"`
IsDark bool `json:"is_dark"`
WorkableServers []string `json:"workable_servers"`
AlternateLocations json.RawMessage `json:"alternate_locations"`
Raw json.RawMessage `json:"-"`
}
Metadata is the decoded Metadata API document for an item. Every top-level key the API returns is captured; the full document is also kept in Raw so the metadata command emits it byte-for-byte.
func GetMetadata ¶
func GetMetadata(ctx context.Context, h *HTTPClient, cache *Cache, identifier string) (Metadata, error)
GetMetadata fetches and decodes the Metadata API document for an item.
func (Metadata) Exists ¶
Exists reports whether the metadata describes a real item (the API returns an empty object for an unknown identifier).
func (Metadata) FilterFiles ¶
FilterFiles returns the files matching an optional case-insensitive name glob and/or an exact-or-substring format match. Empty filters match everything.
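A rough sketch of that filter, assuming shell-style globs as implemented by path.Match and a case-insensitive substring test for the format. file and filterFiles are hypothetical names. Note that path.Match does not let "*" cross a "/", and lower-casing the pattern changes the meaning of character classes such as [A-Z]; the real matcher may differ on both points.

```go
package main

import (
	"fmt"
	"path"
	"strings"
)

// file is a hypothetical two-field stand-in for FileInfo.
type file struct {
	Name   string
	Format string
}

// filterFiles keeps files whose name matches glob and whose format
// contains format, both compared case-insensitively. Empty filters match.
func filterFiles(files []file, glob, format string) []file {
	var out []file
	for _, f := range files {
		if glob != "" {
			ok, err := path.Match(strings.ToLower(glob), strings.ToLower(f.Name))
			if err != nil || !ok {
				continue // a malformed pattern matches nothing
			}
		}
		if format != "" && !strings.Contains(strings.ToLower(f.Format), strings.ToLower(format)) {
			continue
		}
		out = append(out, f)
	}
	return out
}

func main() {
	files := []file{{"Track01.MP3", "VBR MP3"}, {"notes.txt", "Text"}}
	fmt.Println(filterFiles(files, "*.mp3", ""))
	fmt.Println(filterFiles(files, "", "text"))
}
```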
func (Metadata) NodeURLFor ¶
NodeURLFor returns the direct datanode URL for a file, given the item's metadata (server+dir). It falls back to the canonical download URL when the node is unknown.
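The fallback logic can be sketched as follows. The datanode URL shape used here (https://server + dir + "/" + name) is an assumption based on the server and dir fields of Metadata, and nodeURL and escapePath are hypothetical helpers; the fallback uses the DownloadURL prefix from the constants.

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// escapePath escapes each segment of a file name but keeps "/" separators,
// so files inside sub-directories stay addressable.
func escapePath(name string) string {
	parts := strings.Split(name, "/")
	for i, p := range parts {
		parts[i] = url.PathEscape(p)
	}
	return strings.Join(parts, "/")
}

// nodeURL prefers the direct datanode URL and falls back to the canonical
// download URL when server or dir is unknown.
func nodeURL(server, dir, identifier, name string) string {
	if server == "" || dir == "" {
		return "https://archive.org/download/" + url.PathEscape(identifier) + "/" + escapePath(name)
	}
	return "https://" + server + dir + "/" + escapePath(name)
}

func main() {
	fmt.Println(nodeURL("ia800.us.archive.org", "/1/items/foo", "foo", "a b.txt"))
	fmt.Println(nodeURL("", "", "foo", "sub/a b.txt"))
}
```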
type NotFoundError ¶
type NotFoundError struct{ URL string }
NotFoundError marks a 404 so the CLI can map it to exit code 5.
func (*NotFoundError) Error ¶
func (e *NotFoundError) Error() string
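A CLI might map these error types to exit codes along the following lines. Codes 4 (no credentials) and 5 (not found) come from the descriptions of ErrNoCredentials and NotFoundError; treating 401/403 as code 4 and everything else as 1 is an assumption, as are the lower-case type names and the Status field.

```go
package main

import (
	"errors"
	"fmt"
)

var errNoCredentials = errors.New("no credentials configured")

// notFoundError and apiError are hypothetical stand-ins for the package's
// NotFoundError and APIError.
type notFoundError struct{ URL string }

func (e *notFoundError) Error() string { return "not found: " + e.URL }

type apiError struct{ Status int }

func (e *apiError) Error() string { return fmt.Sprintf("HTTP %d", e.Status) }

// exitCode inspects the error chain, so wrapped errors map correctly too.
func exitCode(err error) int {
	var nf *notFoundError
	var api *apiError
	switch {
	case err == nil:
		return 0
	case errors.Is(err, errNoCredentials):
		return 4
	case errors.As(err, &nf):
		return 5
	case errors.As(err, &api) && (api.Status == 401 || api.Status == 403):
		return 4 // assumed: auth failures share the credentials code
	default:
		return 1 // assumed generic failure code
	}
}

// wrap adds context the way a caller typically would.
func wrap(err error) error { return fmt.Errorf("fetching item: %w", err) }

func main() {
	fmt.Println(exitCode(wrap(&notFoundError{URL: "https://archive.org/metadata/x"})))
	fmt.Println(exitCode(wrap(errNoCredentials)))
}
```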
type SPNJob ¶
type SPNJob struct {
JobID string `json:"job_id"`
URL string `json:"url"`
Timestamp string `json:"timestamp"`
Status string `json:"status"`
Message string `json:"message"`
}
SPNJob is the result of a Save Page Now request.
func Save ¶
func Save(ctx context.Context, h *HTTPClient, target string, outlinks, screenshot bool) (SPNJob, error)
Save submits an authenticated SPN2 capture and returns the job. Requires credentials. capture_outlinks/screenshot follow the SPN2 contract.
func SaveStatus ¶
SaveStatus polls an SPN2 job once.
type SearchDoc ¶
type SearchDoc map[string]json.RawMessage
SearchDoc is one hit from Advanced Search or the Scraping API. Fields are kept as raw JSON so any requested column survives; convenience accessors cover the common ones.
func (SearchDoc) Identifier ¶
Identifier is the item id of a hit.
func (SearchDoc) String ¶
String returns field key as a single string (first element if it is an array).
func (SearchDoc) TemplateValue ¶
TemplateValue returns a decoded view of the document so Go templates see real strings, numbers, and slices instead of raw JSON bytes. Single-element arrays collapse to their element, which is what most metadata fields want.
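The decoding and single-element collapse can be sketched like this. templateValue and decodeDoc are hypothetical helpers; keeping undecodable values as their raw text is an assumption of the sketch.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// templateValue decodes every raw field and collapses one-element arrays
// to their single element.
func templateValue(doc map[string]json.RawMessage) map[string]any {
	out := make(map[string]any, len(doc))
	for k, raw := range doc {
		var v any
		if err := json.Unmarshal(raw, &v); err != nil {
			out[k] = string(raw) // keep undecodable values as text
			continue
		}
		if arr, ok := v.([]any); ok && len(arr) == 1 {
			v = arr[0]
		}
		out[k] = v
	}
	return out
}

// decodeDoc is a convenience wrapper for in-memory input.
func decodeDoc(s string) (map[string]json.RawMessage, error) {
	var d map[string]json.RawMessage
	err := json.Unmarshal([]byte(s), &d)
	return d, err
}

func main() {
	d, err := decodeDoc(`{"identifier":"foo","title":["Only"],"downloads":42,"subject":["a","b"]}`)
	if err != nil {
		panic(err)
	}
	fmt.Println(templateValue(d))
}
```

Note that encoding/json decodes JSON numbers into float64 when the target is an empty interface.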
type SearchQuery ¶
type SearchQuery struct {
Query string // Lucene query
Fields []string // fields to return (fl[])
Sorts []string // sort keys (sort[]), e.g. "downloads desc"
Rows int // page size for Advanced Search
Page int // 1-based page for Advanced Search
}
SearchQuery describes a search over the item store.
type Snapshot ¶
type Snapshot struct {
Available bool `json:"available"`
URL string `json:"url"`
Timestamp string `json:"timestamp"`
Status string `json:"status"`
}
Snapshot is the closest capture from the Availability API.
type Task ¶
type Task struct {
TaskID int64 `json:"task_id"`
Server string `json:"server"`
Cmd string `json:"cmd"`
Status string `json:"status"`
Args any `json:"args"`
Category string `json:"category"`
Priority int `json:"priority"`
Submitter string `json:"submitter"`
DateSub string `json:"submittime"`
Finished int64 `json:"finished"`
Notes string `json:"notes"`
Raw json.RawMessage `json:"-"`
}
Task is one entry from an item's catalog/derive history. The complete raw record is retained so no task field is dropped in json/jsonl output.
func GetTasks ¶
GetTasks fetches the catalog task history for an item. The tasks endpoint requires credentials for items you do not own; the client attaches them when present.
func (*Task) UnmarshalJSON ¶
UnmarshalJSON fills the typed fields and retains the complete raw record.
type UploadOptions ¶
type UploadOptions struct {
Metadata map[string]string // x-archive-meta-<k>:<v> set on item creation
RemoteName string // override the name in the item (default: base of local path)
MakeBucket bool // create the item if it does not exist
NoDerive bool // skip derivation after upload
ContentType string // override Content-Type
}
UploadOptions tune an IAS3 upload.
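For orientation, here is a sketch of how such options might translate into IAS3 request headers. The header names (x-archive-meta-<k>, x-archive-auto-make-bucket, x-archive-queue-derive, x-archive-size-hint) are standard IAS3 headers, but which of them UploadHeaders actually emits is an assumption; uploadOptions and uploadHeaders are hypothetical names.

```go
package main

import (
	"fmt"
	"strconv"
)

// uploadOptions mirrors the header-relevant fields of UploadOptions.
type uploadOptions struct {
	Metadata    map[string]string
	MakeBucket  bool
	NoDerive    bool
	ContentType string
}

// uploadHeaders builds the extra request headers for one upload.
func uploadHeaders(o uploadOptions, size int64) map[string]string {
	h := map[string]string{
		"x-archive-size-hint": strconv.FormatInt(size, 10),
	}
	for k, v := range o.Metadata {
		h["x-archive-meta-"+k] = v
	}
	if o.MakeBucket {
		h["x-archive-auto-make-bucket"] = "1"
	}
	if o.NoDerive {
		h["x-archive-queue-derive"] = "0"
	}
	if o.ContentType != "" {
		h["Content-Type"] = o.ContentType
	}
	return h
}

func main() {
	h := uploadHeaders(uploadOptions{
		Metadata:   map[string]string{"title": "Demo"},
		MakeBucket: true,
		NoDerive:   true,
	}, 1024)
	fmt.Println(h)
}
```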