archiver

Published: Jan 22, 2026 License: AGPL-3.0 Imports: 40 Imported by: 0

README

Readeck Archiver

This package is a fork of Obelisk by Radhi Fadlillah.

What started as a soft fork with a few changes is now an independent package that retains most of Obelisk's logic for finding resources but introduces a modular and less memory-consuming way to store resources after downloading them.

The Archiver

The Archiver is a structure that provides public methods to archive a document. It visits all of the document's resources: images, stylesheets, scripts, etc.

The archiver doesn't store any content itself; it only provides private utilities:

  • fetchInfo: fetches a resource but only sniffs its content type and, when it's an image, its dimensions
  • fetch: retrieves an io.ReadCloser (the response's body)
  • saveResource: saves an io.ReadCloser into the Collector

The Collector

Upon each visit, the archiver may call a Collector that takes care of the following:

  • Give the resource a name (defaulting to a UUID) that is used as the new attribute value (or CSS URL)
  • Provide an io.Writer in which the resource's content can be saved

In the simplest cases, this design provides a direct connection between an http.Response.Body and the provided io.Writer, without any intermediate storage.

DownloadCollector

DownloadCollector is a partial Collector that keeps a resource inventory and takes care of retrieving remote documents.

FileCollector

FileCollector is a full Collector that renames resources to a UUID (URL namespace) and saves them into an os.Root filesystem.

ZipCollector

ZipCollector is a full Collector that saves resources inside a zip.Writer.

SingleFileCollector

SingleFileCollector is a Collector that doesn't save files but replaces all URLs with data: URIs. As one can imagine, it is memory intensive.
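
Inlining a resource as a data: URI is a one-liner with the standard library; a sketch (dataURI is an illustrative helper, not part of this package):

```go
package main

import (
	"encoding/base64"
	"fmt"
)

// dataURI inlines raw bytes as a data: URI with the given MIME type, the kind
// of URL replacement a single-file archive performs for every resource.
func dataURI(mimeType string, data []byte) string {
	return "data:" + mimeType + ";base64," + base64.StdEncoding.EncodeToString(data)
}

func main() {
	// The first bytes of a PNG file.
	fmt.Println(dataURI("image/png", []byte{0x89, 'P', 'N', 'G'}))
	// data:image/png;base64,iVBORw==
}
```

Because every resource's full content must be held this way inside the final document, memory usage grows with the total size of all resources, which is why this collector is memory intensive.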

Documentation

Overview

Package archiver provides functions to archive a full HTML page with its assets.

Index

Constants

This section is empty.

Variables

var (
	// ErrSkippedURL is an error returned so the current URL is not processed.
	ErrSkippedURL = errors.New("skip processing url")

	// ErrRemoveSrc joins [ErrSkippedURL] and instructs the archiver to
	// skip the URL and remove the related node.
	ErrRemoveSrc = errors.Join(ErrSkippedURL, errors.New("remove source"))
)

Functions

func GetExtension

func GetExtension(mimeType string) string

GetExtension returns an extension for a given mime type. It defaults to .bin when none was found.
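
This behavior can be approximated with the standard mime package; a sketch under the assumption that the package behaves similarly (getExtension below is illustrative, not the package's implementation):

```go
package main

import (
	"fmt"
	"mime"
)

// getExtension picks a file extension for a MIME type, falling back to ".bin"
// when the type is unknown.
func getExtension(mimeType string) string {
	exts, err := mime.ExtensionsByType(mimeType)
	if err != nil || len(exts) == 0 {
		return ".bin"
	}
	return exts[0]
}

func main() {
	fmt.Println(getExtension("application/x-totally-unknown")) // .bin
	fmt.Println(getExtension("image/png"))
}
```

Note that mime.ExtensionsByType may consult system MIME tables on some platforms, so the exact extension returned for a known type can vary between hosts.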

func GetNodeContext

func GetNodeContext(ctx context.Context) *html.Node

GetNodeContext returns an html.Node stored in context.

func IsArchiverRequest

func IsArchiverRequest(req *http.Request) bool

IsArchiverRequest returns true when an http.Request was made using the archiver.

func Logger

func Logger(c Collector) *slog.Logger

Logger returns the Collector's logger when it's a LoggerCollector. It returns a null logger otherwise.

func MultiReadCloser

func MultiReadCloser(readers ...io.Reader) io.ReadCloser

MultiReadCloser returns an io.ReadCloser that's the concatenation of multiple io.Reader. It stores a reader resulting from io.MultiReader and a list of io.Closer for the provided readers that implement io.Closer.

func NodeLogValue

func NodeLogValue(n *html.Node) slog.LogValuer

NodeLogValue is an slog.LogValuer for an *html.Node. Its LogValue method renders and truncates the node as HTML.

func WithTimeout

func WithTimeout(timeout time.Duration) func(c *http.Client)

WithTimeout sets the HTTP client's timeout for downloading resources.

Types

type ArchiveFlag

type ArchiveFlag uint8

ArchiveFlag is an archiver feature to enable.

const (
	// EnableCSS enables extraction of CSS files and tags.
	EnableCSS ArchiveFlag = 1 << iota

	// EnableEmbeds enables extraction of embedded contents.
	EnableEmbeds

	// EnableJS enables extraction of JavaScript contents.
	EnableJS

	// EnableMedia enables extraction of media contents
	// other than image.
	EnableMedia

	// EnableImages enables extraction of images.
	EnableImages

	// EnableFonts enables font extraction.
	EnableFonts

	// EnableDataAttributes enables data attributes in HTML elements.
	EnableDataAttributes

	// EnableBestImage enables an image sorting process to find the
	// best suitable image from srcset and picture>source elements.
	EnableBestImage
)
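
The flags are standard bit flags: combine them with bitwise OR and test them with bitwise AND. A sketch that redeclares them locally for illustration (the values are assumed to follow the iota-based declaration above):

```go
package main

import "fmt"

// ArchiveFlag mirrors the documented iota-based bit flags.
type ArchiveFlag uint8

const (
	EnableCSS ArchiveFlag = 1 << iota
	EnableEmbeds
	EnableJS
	EnableMedia
	EnableImages
	EnableFonts
	EnableDataAttributes
	EnableBestImage
)

func main() {
	// Combine features with OR; check a single feature with AND.
	flags := EnableCSS | EnableImages | EnableFonts
	fmt.Println(flags&EnableImages != 0) // true: images enabled
	fmt.Println(flags&EnableJS != 0)     // false: JS not enabled
}
```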

type Archiver

type Archiver struct {
	// contains filtered or unexported fields
}

Archiver is the core of the archiver process. It holds the ArchiveFlag flags and a Collector that caches collected content.

func New

func New(options ...Option) *Archiver

New creates a new Archiver.

func (*Archiver) ArchiveDocument

func (arc *Archiver) ArchiveDocument(ctx context.Context, doc *html.Node, uri *url.URL, name string) (err error)

ArchiveDocument runs the archiver on a document *html.Node for a given URL. If name is empty, the Collector will generate one.

func (*Archiver) ArchiveReader

func (arc *Archiver) ArchiveReader(ctx context.Context, r io.Reader, uri *url.URL, name string) error

ArchiveReader runs the archiver on an io.Reader for a given URL. If name is empty, the Collector will generate one.

type ClientOptions

type ClientOptions func(c *http.Client)

ClientOptions is a function that sets an HTTP client's properties.

type Collector

type Collector interface {
	sync.Locker
	Get(uri string) (*Resource, bool)
	Set(uri string, res *Resource)
	Name(uri string) string
	Fetch(req *http.Request) (*http.Response, error)
	Create(res *Resource) (io.Writer, error)
	Resources() iter.Seq[*Resource]
}

Collector describes a resource collector. Its role is to provide some methods to retrieve and keep track of remote resources. A collector is orchestrated by [Archiver.fetch] and [Archiver.saveResource].

type ConvertCollector

type ConvertCollector interface {
	Convert(ctx context.Context, res *Resource, r io.ReadCloser) (io.ReadCloser, error)
}

ConvertCollector describes a collector providing a method to transform a response's body and/or the associated resource.

type DownloadCollector

type DownloadCollector struct {
	sync.RWMutex
	// contains filtered or unexported fields
}

DownloadCollector is a Collector that takes care of keeping track of fetched resources and their cached state.

func NewDownloadCollector

func NewDownloadCollector(client *http.Client, options ...ClientOptions) *DownloadCollector

NewDownloadCollector returns a DownloadCollector.

func (*DownloadCollector) Fetch

func (c *DownloadCollector) Fetch(req *http.Request) (*http.Response, error)

Fetch calls the collector's HTTP client and returns an *http.Response.

func (*DownloadCollector) Get

func (c *DownloadCollector) Get(uri string) (res *Resource, ok bool)

Get returns the *Resource associated with a given URL.

func (*DownloadCollector) Resources

func (c *DownloadCollector) Resources() iter.Seq[*Resource]

Resources returns an iter.Seq of all the collected resources.

func (*DownloadCollector) Set

func (c *DownloadCollector) Set(uri string, res *Resource)

Set sets a *Resource for a given URL.

type FileCollector

type FileCollector struct {
	*DownloadCollector
	// contains filtered or unexported fields
}

FileCollector is a Collector that saves resources on a filesystem.

func NewFileCollector

func NewFileCollector(root string, client *http.Client, options ...ClientOptions) *FileCollector

NewFileCollector returns a new *FileCollector.

func (*FileCollector) Create

func (c *FileCollector) Create(res *Resource) (io.Writer, error)

Create implements Collector. It creates and returns a new io.Writer for the resource's content. At this point, the *Resource properties can change, including its name, and the change will be reflected in the final document.

func (FileCollector) Name

func (c FileCollector) Name(uri string) string

Name returns a name for a URL, using UUID's URL namespace.
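
A version-5 UUID in the URL namespace is a SHA-1 hash of the namespace UUID followed by the name, with the version and variant bits patched in. A dependency-free sketch of that derivation (the package presumably relies on a UUID library for this):

```go
package main

import (
	"crypto/sha1"
	"fmt"
)

// namespaceURL is the standard RFC 4122 URL namespace UUID,
// 6ba7b811-9dad-11d1-80b4-00c04fd430c8.
var namespaceURL = [16]byte{
	0x6b, 0xa7, 0xb8, 0x11, 0x9d, 0xad, 0x11, 0xd1,
	0x80, 0xb4, 0x00, 0xc0, 0x4f, 0xd4, 0x30, 0xc8,
}

// uuidV5 derives a name-based (version 5) UUID from a namespace and a name.
// The same URL always yields the same UUID, so a resource keeps a stable name.
func uuidV5(ns [16]byte, name string) string {
	h := sha1.New()
	h.Write(ns[:])
	h.Write([]byte(name))
	sum := h.Sum(nil)

	var u [16]byte
	copy(u[:], sum[:16])
	u[6] = (u[6] & 0x0f) | 0x50 // set version 5
	u[8] = (u[8] & 0x3f) | 0x80 // set RFC 4122 variant

	return fmt.Sprintf("%x-%x-%x-%x-%x", u[0:4], u[4:6], u[6:8], u[8:10], u[10:16])
}

func main() {
	fmt.Println(uuidV5(namespaceURL, "https://example.com/style.css"))
}
```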

type LoggerCollector

type LoggerCollector interface {
	Log() *slog.Logger
}

LoggerCollector describes a logger provider.

type Option

type Option func(arc *Archiver)

Option is a function that sets an Archiver's options.

func WithCollector

func WithCollector(collector Collector) Option

WithCollector sets a Collector to an Archiver.

func WithConcurrency

func WithConcurrency(v int64) Option

WithConcurrency sets the maximum concurrent downloads that can take place during archiving.

func WithFlags

func WithFlags(flags ArchiveFlag) Option

WithFlags sets ArchiveFlag to an Archiver.

type PostWriteCollector

type PostWriteCollector interface {
	PostWrite(res *Resource, w io.Writer)
}

PostWriteCollector describes a collector providing a method to perform an action just after writing a resource's content.

type Resource

type Resource struct {
	Name        string
	ContentType string
	Width       int
	Height      int
	Size        int64
	Contents    *bytes.Buffer
	// contains filtered or unexported fields
}

Resource is a remote resource.

func (*Resource) Saved

func (c *Resource) Saved() bool

Saved returns the resource's saved state.

func (*Resource) URL

func (c *Resource) URL() string

URL returns the resource's URL.

func (*Resource) Value

func (c *Resource) Value() string

Value returns the resource value. It's usually [Resource.Name] but it can be [Resource.Contents] when the latter is not nil.

type SingleFileCollector

type SingleFileCollector struct {
	*DownloadCollector
	// contains filtered or unexported fields
}

SingleFileCollector is a Collector that produces a single HTML file with every resource URL base64 encoded. Note that it is very memory-inefficient and should only be used for testing purposes.

func NewSingleFileCollector

func NewSingleFileCollector(w io.Writer, client *http.Client, options ...ClientOptions) *SingleFileCollector

NewSingleFileCollector returns a new SingleFileCollector.

func (*SingleFileCollector) Create

func (c *SingleFileCollector) Create(res *Resource) (io.Writer, error)

Create implements Collector. For resources other than index.html, it returns a bytes.Buffer that will be filled with the resource's content.

func (SingleFileCollector) Name

func (c SingleFileCollector) Name(uri string) string

Name returns a name for a URL, using UUID's URL namespace.

func (*SingleFileCollector) PostWrite

func (c *SingleFileCollector) PostWrite(res *Resource, w io.Writer)

PostWrite implements PostWriteCollector. For any resource that's not index.html, it renames the resource to a data: URL using the previously created buffer.

type URLLogValue

type URLLogValue string

URLLogValue is a slog.LogValuer for URLs. It truncates the string when it's too long (e.g. data: URLs).

func (URLLogValue) LogValue

func (s URLLogValue) LogValue() slog.Value

LogValue implements slog.LogValuer.

type ZipCollector

type ZipCollector struct {
	*DownloadCollector
	// contains filtered or unexported fields
}

ZipCollector is a Collector that saves resources in a zip file.

func NewZipCollector

func NewZipCollector(zw *zip.Writer, client *http.Client, options ...ClientOptions) *ZipCollector

NewZipCollector returns a ZipCollector instance. The zip.Writer must be open and it's the caller's responsibility to close it when done adding files.

func (*ZipCollector) Create

func (c *ZipCollector) Create(res *Resource) (io.Writer, error)

Create implements Collector. The returned io.Writer is a zip fileWriter. It creates the necessary directory entries. See FileCollector.Create for more information.

func (ZipCollector) Name

func (c ZipCollector) Name(uri string) string

Name returns a name for a URL, using UUID's URL namespace.
