Documentation ¶
Overview ¶
Package archiver provides functions to archive a full HTML page with its assets.
Index ¶
- Variables
- func GetExtension(mimeType string) string
- func GetNodeContext(ctx context.Context) *html.Node
- func IsArchiverRequest(req *http.Request) bool
- func Logger(c Collector) *slog.Logger
- func MultiReadCloser(readers ...io.Reader) io.ReadCloser
- func NodeLogValue(n *html.Node) slog.LogValuer
- func WithTimeout(timeout time.Duration) func(c *http.Client)
- type ArchiveFlag
- type Archiver
- type ClientOptions
- type Collector
- type ConvertCollector
- type DownloadCollector
- type FileCollector
- type LoggerCollector
- type Option
- type PostWriteCollector
- type Resource
- type SingleFileCollector
- type URLLogValue
- type ZipCollector
Constants ¶
This section is empty.
Variables ¶
var (
	// ErrSkippedURL is an error returned so the current URL is not processed.
	ErrSkippedURL = errors.New("skip processing url")

	// ErrRemoveSrc joins [ErrSkippedURL] and instructs the archiver to
	// skip the URL and remove the related node.
	ErrRemoveSrc = errors.Join(ErrSkippedURL, errors.New("remove source"))
)
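Because ErrRemoveSrc is built with errors.Join, it also matches ErrSkippedURL under errors.Is. The sketch below redeclares both errors locally so it compiles on its own; the classify helper is hypothetical and only illustrates why the more specific error should be tested first.

```go
package main

import (
	"errors"
	"fmt"
)

// Local copies of the package's sentinel errors, redeclared here so the
// sketch is self-contained.
var (
	ErrSkippedURL = errors.New("skip processing url")
	ErrRemoveSrc  = errors.Join(ErrSkippedURL, errors.New("remove source"))
)

// classify is a hypothetical helper (not part of the package). It checks
// ErrRemoveSrc before ErrSkippedURL because the former also matches the
// latter.
func classify(err error) string {
	switch {
	case err == nil:
		return "process"
	case errors.Is(err, ErrRemoveSrc):
		return "skip and remove node"
	case errors.Is(err, ErrSkippedURL):
		return "skip"
	default:
		return "fail"
	}
}

func main() {
	fmt.Println(classify(ErrSkippedURL))
	fmt.Println(classify(fmt.Errorf("blocked host: %w", ErrRemoveSrc)))
	fmt.Println(errors.Is(ErrRemoveSrc, ErrSkippedURL))
}
```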
Functions ¶
func GetExtension ¶
GetExtension returns a file extension for a given MIME type. It defaults to .bin when none is found.
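The fallback behavior can be approximated with the standard mime package. This is a sketch under assumptions, not the package's implementation: extensionFor is a hypothetical stand-in, and the extension chosen for a known type depends on the MIME tables available on the host.

```go
package main

import (
	"fmt"
	"mime"
)

// extensionFor is a hypothetical stand-in for GetExtension. It asks the
// standard library for the extensions registered for mimeType and falls
// back to ".bin" when the lookup fails or returns nothing.
func extensionFor(mimeType string) string {
	exts, err := mime.ExtensionsByType(mimeType)
	if err != nil || len(exts) == 0 {
		return ".bin"
	}
	// ExtensionsByType returns a sorted list; taking the first entry is an
	// arbitrary choice made for this sketch.
	return exts[0]
}

func main() {
	fmt.Println(extensionFor("application/x-no-such-type"))
	fmt.Println(extensionFor("image/png"))
}
```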
func GetNodeContext ¶
GetNodeContext returns an html.Node stored in context.
func IsArchiverRequest ¶
IsArchiverRequest returns true when an http.Request was made using the archiver.
func Logger ¶
Logger returns the Collector's logger when it's a LoggerCollector. It returns a null logger otherwise.
func MultiReadCloser ¶
func MultiReadCloser(readers ...io.Reader) io.ReadCloser
MultiReadCloser returns an io.ReadCloser that's the concatenation of multiple io.Reader. It stores a reader resulting from io.MultiReader and a list of io.Closer for the provided readers that implement io.Closer.
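The description above maps onto a small struct that embeds the result of io.MultiReader and keeps the closers found among the inputs. The following is a minimal sketch of that idea; the type and function names are hypothetical, and joining Close errors with errors.Join is an assumption about error handling, not something the documentation states.

```go
package main

import (
	"errors"
	"fmt"
	"io"
	"strings"
)

// multiReadCloser pairs a concatenated reader with the closers collected
// from the original readers.
type multiReadCloser struct {
	io.Reader
	closers []io.Closer
}

// Close closes every collected closer and joins the errors, if any.
func (m *multiReadCloser) Close() error {
	var errs []error
	for _, c := range m.closers {
		errs = append(errs, c.Close())
	}
	return errors.Join(errs...) // nil when every Close succeeded
}

// newMultiReadCloser is a hypothetical equivalent of MultiReadCloser.
func newMultiReadCloser(readers ...io.Reader) io.ReadCloser {
	m := &multiReadCloser{Reader: io.MultiReader(readers...)}
	for _, r := range readers {
		if c, ok := r.(io.Closer); ok {
			m.closers = append(m.closers, c)
		}
	}
	return m
}

// trackedReader records whether Close was called.
type trackedReader struct {
	io.Reader
	closed bool
}

func (t *trackedReader) Close() error { t.closed = true; return nil }

// demo reads the concatenation of a plain reader and a closable one, then
// reports the content and whether the closable reader was closed.
func demo() (string, bool) {
	body := &trackedReader{Reader: strings.NewReader("<body>")}
	rc := newMultiReadCloser(strings.NewReader("<html>"), body)
	data, _ := io.ReadAll(rc)
	_ = rc.Close()
	return string(data), body.closed
}

func main() {
	content, closed := demo()
	fmt.Println(content, closed)
}
```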
func NodeLogValue ¶
NodeLogValue is an slog.LogValuer for an *html.Node. Its LogValue method renders the node as HTML and truncates the result.
Types ¶
type ArchiveFlag ¶
type ArchiveFlag uint8
ArchiveFlag is an archiver feature to enable.
const (
	// EnableCSS enables extraction of CSS files and tags.
	EnableCSS ArchiveFlag = 1 << iota

	// EnableEmbeds enables extraction of embedded contents.
	EnableEmbeds

	// EnableJS enables extraction of JavaScript contents.
	EnableJS

	// EnableMedia enables extraction of media contents
	// other than images.
	EnableMedia

	// EnableImages enables extraction of images.
	EnableImages

	// EnableFonts enables font extraction.
	EnableFonts

	// EnableDataAttributes enables data attributes in HTML elements.
	EnableDataAttributes

	// EnableBestImage enables an image sorting process to find the
	// best suitable image from srcset and picture>source elements.
	EnableBestImage
)
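Each flag occupies one bit of the uint8, so flags combine with the bitwise OR operator and are tested with AND. The sketch below redeclares the type and constants locally to stay self-contained; the has helper is hypothetical and not part of the package.

```go
package main

import "fmt"

// Local copy of the flag type and constants from the declaration above.
type ArchiveFlag uint8

const (
	EnableCSS ArchiveFlag = 1 << iota
	EnableEmbeds
	EnableJS
	EnableMedia
	EnableImages
	EnableFonts
	EnableDataAttributes
	EnableBestImage
)

// has is a hypothetical helper reporting whether flag is set in flags.
func has(flags, flag ArchiveFlag) bool {
	return flags&flag == flag
}

func main() {
	// Combine features with bitwise OR.
	flags := EnableCSS | EnableImages | EnableBestImage
	fmt.Println(has(flags, EnableImages))
	fmt.Println(has(flags, EnableJS))

	// Clear a single feature with AND NOT.
	flags &^= EnableCSS
	fmt.Println(has(flags, EnableCSS))
}
```

In the package itself, a combined value like this would be passed to WithFlags.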
type Archiver ¶
type Archiver struct {
// contains filtered or unexported fields
}
Archiver is the core of the archiving process. It holds the ArchiveFlag flags and a Collector that caches collected content.
type ClientOptions ¶
ClientOptions is a function to set HTTP client's properties.
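This is the functional-options pattern applied to an *http.Client. The sketch below mirrors the shape of WithTimeout from the index (a function returning func(c *http.Client)); the clientOption type, withTimeout, and newClient are local stand-ins, and how the real constructors apply their options is an assumption.

```go
package main

import (
	"fmt"
	"net/http"
	"time"
)

// clientOption is a local stand-in with the same shape as the value
// returned by WithTimeout.
type clientOption func(c *http.Client)

// withTimeout returns an option that sets the client's overall timeout.
func withTimeout(timeout time.Duration) clientOption {
	return func(c *http.Client) {
		c.Timeout = timeout
	}
}

// newClient is a hypothetical constructor that applies options in order.
func newClient(options ...clientOption) *http.Client {
	c := &http.Client{}
	for _, opt := range options {
		opt(c)
	}
	return c
}

// demoTimeoutSeconds builds a client with a 10 second timeout and returns
// the configured value in seconds.
func demoTimeoutSeconds() float64 {
	return newClient(withTimeout(10 * time.Second)).Timeout.Seconds()
}

func main() {
	fmt.Println(demoTimeoutSeconds())
}
```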
type Collector ¶
type Collector interface {
sync.Locker
Get(uri string) (*Resource, bool)
Set(uri string, res *Resource)
Name(uri string) string
Fetch(req *http.Request) (*http.Response, error)
Create(res *Resource) (io.Writer, error)
Resources() iter.Seq[*Resource]
}
Collector describes a resource collector. Its role is to provide methods to retrieve and keep track of remote resources. A collector is orchestrated by [Archiver.fetch] and [Archiver.saveResource].
type ConvertCollector ¶
type ConvertCollector interface {
Convert(ctx context.Context, res *Resource, r io.ReadCloser) (io.ReadCloser, error)
}
ConvertCollector describes a collector providing a method to transform a response's body and/or the associated resource.
type DownloadCollector ¶
DownloadCollector is a Collector that takes care of keeping track of fetched resources and their cached state.
func NewDownloadCollector ¶
func NewDownloadCollector(client *http.Client, options ...ClientOptions) *DownloadCollector
NewDownloadCollector returns a DownloadCollector.
func (*DownloadCollector) Fetch ¶
Fetch calls the collector's HTTP client and returns an *http.Response.
func (*DownloadCollector) Get ¶
func (c *DownloadCollector) Get(uri string) (res *Resource, ok bool)
Get returns the *Resource associated with a given URL.
type FileCollector ¶
type FileCollector struct {
*DownloadCollector
// contains filtered or unexported fields
}
FileCollector is a Collector that saves resources on a filesystem.
func NewFileCollector ¶
func NewFileCollector(root string, client *http.Client, options ...ClientOptions) *FileCollector
NewFileCollector returns a new *FileCollector.
func (*FileCollector) Create ¶
func (c *FileCollector) Create(res *Resource) (io.Writer, error)
Create implements Collector. It creates a new io.Writer for the resource and returns a reader. The new reader can simply be the original one or a buffer created after a custom transformation. At this point, the *Resource properties can change, including its name, and the changes are reflected in the final document.
type LoggerCollector ¶
LoggerCollector describes a logger provider.
type Option ¶
type Option func(arc *Archiver)
Option is a function that sets an Archiver's options.
func WithCollector ¶
func WithConcurrency ¶
WithConcurrency sets the maximum number of concurrent downloads that can take place during archiving.
func WithFlags ¶
func WithFlags(flags ArchiveFlag) Option
WithFlags sets the ArchiveFlag flags on an Archiver.
type PostWriteCollector ¶
PostWriteCollector describes a collector providing a method to perform an action just after writing a resource's content.
type Resource ¶
type Resource struct {
Name string
ContentType string
Width int
Height int
Size int64
Contents *bytes.Buffer
// contains filtered or unexported fields
}
Resource is a remote resource.
type SingleFileCollector ¶
type SingleFileCollector struct {
*DownloadCollector
// contains filtered or unexported fields
}
SingleFileCollector is a Collector that produces a single HTML file in which every resource URL is base64 encoded. Note that it is very memory inefficient and should only be used for testing purposes.
func NewSingleFileCollector ¶
func NewSingleFileCollector(w io.Writer, client *http.Client, options ...ClientOptions) *SingleFileCollector
NewSingleFileCollector returns a new SingleFileCollector.
func (*SingleFileCollector) Create ¶
func (c *SingleFileCollector) Create(res *Resource) (io.Writer, error)
Create implements Collector. For any resource other than index.html, it returns a bytes.Buffer that will be filled with the resource's content.
func (*SingleFileCollector) PostWrite ¶
func (c *SingleFileCollector) PostWrite(res *Resource, w io.Writer)
PostWrite implements PostWriteCollector. For any resource other than index.html, it renames the resource to a data URL built from the previously created buffer.
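Turning a buffered resource into a data URL amounts to base64-encoding the bytes and prefixing the content type. The helper below is a hypothetical sketch of that step using only the standard library; it is not the collector's actual code.

```go
package main

import (
	"bytes"
	"encoding/base64"
	"fmt"
)

// dataURL is a hypothetical helper that encodes the buffer's bytes as a
// base64 data URL for the given content type.
func dataURL(contentType string, buf *bytes.Buffer) string {
	encoded := base64.StdEncoding.EncodeToString(buf.Bytes())
	return "data:" + contentType + ";base64," + encoded
}

// demo encodes a two-byte text resource.
func demo() string {
	return dataURL("text/plain", bytes.NewBufferString("hi"))
}

func main() {
	fmt.Println(demo()) // prints "data:text/plain;base64,aGk="
}
```

Base64 output is roughly a third larger than its input, and the whole resource must sit in memory while it is encoded, which is consistent with the memory warning on SingleFileCollector.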
type URLLogValue ¶
type URLLogValue string
URLLogValue is an slog.LogValuer for URLs. It truncates strings that are too long (e.g. data: URLs).
func (URLLogValue) LogValue ¶
func (s URLLogValue) LogValue() slog.Value
LogValue implements slog.LogValuer.
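A string type can implement slog.LogValuer so that truncation only happens when a record is actually logged. The sketch below uses a local urlValue type; the 40-byte limit and the "..." suffix are assumptions chosen for illustration, since the documentation does not state the real threshold.

```go
package main

import (
	"fmt"
	"log/slog"
)

// maxURLLen is an assumed limit; the package's real threshold is not
// documented here.
const maxURLLen = 40

// urlValue is a local stand-in for URLLogValue.
type urlValue string

// LogValue implements slog.LogValuer. It cuts on a byte boundary, which is
// fine for ASCII URLs but could split a multi-byte character.
func (s urlValue) LogValue() slog.Value {
	if len(s) > maxURLLen {
		return slog.StringValue(string(s[:maxURLLen]) + "...")
	}
	return slog.StringValue(string(s))
}

// logged returns the string that a handler would receive for s.
func logged(s string) string {
	return urlValue(s).LogValue().String()
}

func main() {
	fmt.Println(logged("https://example.org/"))
	fmt.Println(logged("data:image/png;base64,AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA"))
}
```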
type ZipCollector ¶
type ZipCollector struct {
*DownloadCollector
// contains filtered or unexported fields
}
ZipCollector is a Collector that saves resources in a zip file.
func NewZipCollector ¶
func NewZipCollector(zw *zip.Writer, client *http.Client, options ...ClientOptions) *ZipCollector
NewZipCollector returns a ZipCollector instance. The zip.Writer must be open and it's the caller's responsibility to close it when done adding files.
func (*ZipCollector) Create ¶
func (c *ZipCollector) Create(res *Resource) (io.Writer, error)
Create implements Collector. The returned io.Writer is a zip fileWriter. It creates the necessary directory entries. See FileCollector.Create for more information.
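The caller's responsibility for closing the zip.Writer matters because Close writes the archive's central directory. The sketch below uses only archive/zip from the standard library, not ZipCollector itself; roundTrip is a hypothetical helper that writes one entry, closes the writer, and reads the entry back.

```go
package main

import (
	"archive/zip"
	"bytes"
	"fmt"
	"io"
)

// roundTrip writes a single entry to an in-memory zip archive, closes the
// writer, and returns the entry's content read back from the archive.
func roundTrip(name, content string) (string, error) {
	var buf bytes.Buffer
	zw := zip.NewWriter(&buf)

	// Create returns an io.Writer for the new entry.
	w, err := zw.Create(name)
	if err != nil {
		return "", err
	}
	if _, err := io.WriteString(w, content); err != nil {
		return "", err
	}

	// Close flushes the central directory. Skipping it leaves an archive
	// that readers cannot open.
	if err := zw.Close(); err != nil {
		return "", err
	}

	zr, err := zip.NewReader(bytes.NewReader(buf.Bytes()), int64(buf.Len()))
	if err != nil {
		return "", err
	}
	rc, err := zr.File[0].Open()
	if err != nil {
		return "", err
	}
	defer rc.Close()

	data, err := io.ReadAll(rc)
	return string(data), err
}

func main() {
	got, err := roundTrip("assets/style.css", "body{}")
	fmt.Println(got, err)
}
```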