extract

package
v0.0.0-...-11da2c6 Latest
Warning

This package is not in the latest version of its module.

Published: Dec 10, 2025 License: AGPL-3.0 Imports: 29 Imported by: 0

Documentation

Overview

Package extract is a content extractor for HTML pages. It works by using processors that are triggered at one or several steps of the extraction process.

Index

Constants

const (
	// ImageSizeThumbnail is the width of a regular thumbnail image.
	ImageSizeThumbnail = 800

	// ImageSizeWide is the width of a bigger image.
	// Its dimension matches 48rem (on a 16px basis)
	// on an HDPI screen: 48 × 16px × 2.
	ImageSizeWide = 1536
)
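The comment on ImageSizeWide states its arithmetic in prose. As a quick check, the sketch below recomputes it; the constant names (widthRem, remPx, hdpiScale) are hypothetical and only illustrate the formula from the comment.

```go
package main

import "fmt"

// Hypothetical names for the three factors named in the ImageSizeWide comment.
const (
	widthRem  = 48 // layout width in rem
	remPx     = 16 // pixels per rem on a 16px basis
	hdpiScale = 2  // device pixel ratio of an HDPI screen
)

// wideWidth returns the pixel width derived from the factors above.
func wideWidth() int {
	return widthRem * remPx * hdpiScale
}

func main() {
	fmt.Println(wideWidth()) // prints 1536
}
```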

Variables

var (
	// WithRequestType returns a new context that contains the given [FetchType].
	WithRequestType = ctxr.Setter[FetchType](ctxRequestTypeKey{})
	// CheckRequestType returns the [FetchType] of a given context.
	CheckRequestType = ctxr.Checker[FetchType](ctxRequestTypeKey{})

	// WithRequestHeader returns a new context that contains the given [http.Header].
	WithRequestHeader = ctxr.Setter[http.Header](ctxRequestHeaderKey{})
	// CheckRequestHeader returns the [http.Header] of a given context.
	CheckRequestHeader = ctxr.Checker[http.Header](ctxRequestHeaderKey{})
)
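The ctxr helpers used above are not documented on this page. The following is a minimal sketch of how a generic setter/checker pair over context.Context could be built, assuming the setter has the shape func(ctx, value) context.Context and the checker returns the value with a presence flag. The names setter, checker, fetchType and requestTypeKey are stand-ins, not the package's own code.

```go
package main

import (
	"context"
	"fmt"
)

// setter is a hypothetical stand-in for ctxr.Setter: it returns a function
// that stores a value of type T in a context under the given key.
func setter[T any](key any) func(context.Context, T) context.Context {
	return func(ctx context.Context, v T) context.Context {
		return context.WithValue(ctx, key, v)
	}
}

// checker is a hypothetical stand-in for ctxr.Checker: it returns a function
// that reads a value of type T and reports whether it was present.
func checker[T any](key any) func(context.Context) (T, bool) {
	return func(ctx context.Context) (T, bool) {
		v, ok := ctx.Value(key).(T)
		return v, ok
	}
}

type fetchType uint8

// requestTypeKey is an unexported key type, which avoids collisions
// with keys defined by other packages.
type requestTypeKey struct{}

var (
	withRequestType  = setter[fetchType](requestTypeKey{})
	checkRequestType = checker[fetchType](requestTypeKey{})
)

// roundTrip stores t in a fresh context and reads it back.
func roundTrip(t fetchType) (fetchType, bool) {
	ctx := withRequestType(context.Background(), t)
	return checkRequestType(ctx)
}

func main() {
	fmt.Println(roundTrip(2))

	// A context that never went through the setter reports absence.
	_, ok := checkRequestType(context.Background())
	fmt.Println(ok)
}
```

Keying the value on an empty struct type keeps the stored data private to the package that declares the key.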

Functions

func Fetch

func Fetch(ctx context.Context, client *http.Client, url string) (*http.Response, error)

Fetch builds and performs a GET request to a given URL. It uses [FetchOptions] to add the request type to the request's context and headers, if any.

func NewRemoteImage

func NewRemoteImage(ctx context.Context, client *http.Client, src string) (img.Image, error)

NewRemoteImage loads an image and returns a new img.Image instance. If the image is a GIF, it returns its first frame only.
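The first-frame behavior can be illustrated with the standard library alone. This sketch does not use img.Image or any network call; it builds a two-frame GIF in memory and decodes it with image/gif, whose Decode function returns only the first embedded image. The helper names twoFrameGIF and firstFrame are hypothetical.

```go
package main

import (
	"bytes"
	"fmt"
	"image"
	"image/color"
	"image/gif"
)

// twoFrameGIF encodes a 2x2 animated GIF: a black frame followed by a white one.
func twoFrameGIF() ([]byte, error) {
	pal := color.Palette{color.Black, color.White}
	g := &gif.GIF{}
	for _, idx := range []uint8{0, 1} {
		frame := image.NewPaletted(image.Rect(0, 0, 2, 2), pal)
		for i := range frame.Pix {
			frame.Pix[i] = idx
		}
		g.Image = append(g.Image, frame)
		g.Delay = append(g.Delay, 10)
	}
	var buf bytes.Buffer
	if err := gif.EncodeAll(&buf, g); err != nil {
		return nil, err
	}
	return buf.Bytes(), nil
}

// firstFrame decodes only the first frame of a GIF.
func firstFrame(data []byte) (image.Image, error) {
	return gif.Decode(bytes.NewReader(data))
}

func main() {
	data, err := twoFrameGIF()
	if err != nil {
		panic(err)
	}
	img, err := firstFrame(data)
	if err != nil {
		panic(err)
	}
	r, _, _, _ := img.At(0, 0).RGBA()
	fmt.Println(img.Bounds().Dx(), r == 0) // the black first frame, not the white second one
}
```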

func WithClient

func WithClient(client *http.Client) func(e *Extractor)

WithClient sets the extractor's HTTP client.

func WithLogger

func WithLogger(logger *slog.Logger, level slog.Level, args ...any) func(e *Extractor)

WithLogger sets the extractor's logger. This logger copies everything to the extractor's internal log and error list. The extra arguments are slog.With arguments and are shared between the parent logger and the log recorder.

func WithReferrer

func WithReferrer(ctx context.Context, u *url.URL) context.Context

WithReferrer sets a Referer value in the context's http.Header. The value is reduced to "{scheme}://{host}/".
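To make the "{scheme}://{host}/" format concrete, here is a small sketch that reduces a URL to that form with net/url. It only shows the string transformation; referrerFor and mustReferrer are hypothetical helpers, and the context and header handling of WithReferrer is left out.

```go
package main

import (
	"fmt"
	"net/url"
)

// referrerFor reduces a URL to "{scheme}://{host}/",
// dropping user info, path, query and fragment.
func referrerFor(u *url.URL) string {
	ref := url.URL{Scheme: u.Scheme, Host: u.Host, Path: "/"}
	return ref.String()
}

// mustReferrer parses raw and returns its reduced referrer value.
// It panics on an invalid URL, which is acceptable for a demo only.
func mustReferrer(raw string) string {
	u, err := url.Parse(raw)
	if err != nil {
		panic(err)
	}
	return referrerFor(u)
}

func main() {
	fmt.Println(mustReferrer("https://example.org/articles/42?page=2#top"))
	// prints https://example.org/
}
```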

Types

type Drop

type Drop struct {
	URL          *url.URL
	Domain       string
	ContentType  string
	Charset      string
	DocumentType string

	Title         string
	Description   string
	Authors       []string
	Site          string
	Lang          string
	TextDirection string
	Date          time.Time

	Header     http.Header
	Meta       DropMeta
	Properties DropProperties
	Body       []byte `json:"-"`

	Pictures map[string]*Picture
}

Drop is the result of a content extraction of one resource.

func NewDrop

func NewDrop(src *url.URL) *Drop

NewDrop returns a Drop instance.

func (*Drop) AddAuthors

func (d *Drop) AddAuthors(values ...string)

AddAuthors adds authors to the author list, ignoring potential duplicates.
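A duplicate-ignoring append could look like the sketch below. It assumes duplicates are detected by exact string comparison; the package may normalize names differently. The drop type and addAuthors method are local stand-ins.

```go
package main

import (
	"fmt"
	"slices"
)

// drop is a local stand-in that carries only the author list.
type drop struct {
	Authors []string
}

// addAuthors appends each value that is not already in the list.
// Assumption: two authors are duplicates when their strings are equal.
func (d *drop) addAuthors(values ...string) {
	for _, v := range values {
		if !slices.Contains(d.Authors, v) {
			d.Authors = append(d.Authors, v)
		}
	}
}

func main() {
	d := &drop{}
	d.addAuthors("Ada Lovelace", "Grace Hopper", "Ada Lovelace")
	fmt.Println(d.Authors) // prints [Ada Lovelace Grace Hopper]
}
```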

func (*Drop) IsHTML

func (d *Drop) IsHTML() bool

IsHTML returns true when the resource is of type HTML.

func (*Drop) IsMedia

func (d *Drop) IsMedia() bool

IsMedia returns true when the document type is a media type.

func (*Drop) Load

func (d *Drop) Load(client *http.Client) error

Load loads the remote URL and retrieves its data.

func (*Drop) SetURL

func (d *Drop) SetURL(src *url.URL)

SetURL sets the Drop's URL and Domain properties in their Unicode versions.

func (*Drop) UnescapedURL

func (d *Drop) UnescapedURL() string

UnescapedURL returns the Drop's URL unescaped, for storage.

type DropMeta

type DropMeta map[string][]string

DropMeta is a map of string lists that contains the collected metadata.

func (DropMeta) Add

func (m DropMeta) Add(name, value string)

Add adds a value to the raw metadata list.

func (DropMeta) Lookup

func (m DropMeta) Lookup(names ...string) []string

Lookup returns all the found values for the provided metadata names.

func (DropMeta) LookupGet

func (m DropMeta) LookupGet(names ...string) string

LookupGet returns the first value found for the provided metadata names.
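The three DropMeta methods can be sketched on a local map type. The lookup below follows the literal wording above and gathers the values of every name that is present, in the order the names are given. That is an assumption: an implementation could equally stop at the first name that matches. The lower-case names are stand-ins.

```go
package main

import "fmt"

// dropMeta is a local stand-in for DropMeta.
type dropMeta map[string][]string

// add appends a value to the list stored under name.
func (m dropMeta) add(name, value string) {
	m[name] = append(m[name], value)
}

// lookup collects the values found for each name, in argument order.
// Assumption: values of all matching names are concatenated.
func (m dropMeta) lookup(names ...string) []string {
	var res []string
	for _, n := range names {
		res = append(res, m[n]...)
	}
	return res
}

// lookupGet returns the first value found, or "" when there is none.
func (m dropMeta) lookupGet(names ...string) string {
	if v := m.lookup(names...); len(v) > 0 {
		return v[0]
	}
	return ""
}

func main() {
	m := dropMeta{}
	m.add("og:title", "Open Graph title")
	m.add("title", "Plain title")

	// Names act as a priority list: the first one with a value wins.
	fmt.Println(m.lookupGet("twitter:title", "og:title", "title"))
	// prints Open Graph title
}
```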

type DropProperties

type DropProperties map[string]any

DropProperties contains the raw properties of an extracted page.

type Error

type Error []error

Error holds all the non-fatal errors that were caught during extraction.

func (Error) Error

func (e Error) Error() string
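A slice of errors satisfies the error interface once it has an Error method. The sketch below joins the messages with "; ", which is an assumed separator since the actual format is not documented here. extractError and collect are hypothetical names.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// extractError is a local stand-in for the Error type.
type extractError []error

// Error joins the messages of all collected errors.
// Assumption: the separator is "; ".
func (e extractError) Error() string {
	msgs := make([]string, len(e))
	for i, err := range e {
		msgs[i] = err.Error()
	}
	return strings.Join(msgs, "; ")
}

// collect builds an extractError from plain messages.
func collect(msgs ...string) extractError {
	var e extractError
	for _, m := range msgs {
		e = append(e, errors.New(m))
	}
	return e
}

func main() {
	// The slice can be used wherever an error is expected.
	var err error = collect("no title found", "image too small")
	fmt.Println(err) // prints no title found; image too small
}
```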

type Extractor

type Extractor struct {
	URL     *url.URL
	HTML    []byte
	Text    string
	Visited URLList
	Logs    []string
	Context context.Context
	// contains filtered or unexported fields
}

Extractor is a page extractor.

func New

func New(src string, options ...func(e *Extractor)) (*Extractor, error)

New returns an Extractor instance for a given URL, with a default HTTP client.
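New and the With... functions follow the functional options pattern. The sketch below reproduces that pattern on a local extractor type. The default client used here (a client with a 10 second timeout) is an assumption and not necessarily what the package does; extractor, newExtractor and withClient are stand-ins.

```go
package main

import (
	"fmt"
	"net/http"
	"net/url"
	"time"
)

// extractor is a local stand-in with only the fields the example needs.
type extractor struct {
	url    *url.URL
	client *http.Client
}

// withClient returns an option that replaces the HTTP client.
func withClient(c *http.Client) func(*extractor) {
	return func(e *extractor) { e.client = c }
}

// newExtractor parses src, installs defaults, then applies options in order.
func newExtractor(src string, options ...func(*extractor)) (*extractor, error) {
	u, err := url.Parse(src)
	if err != nil {
		return nil, err
	}
	e := &extractor{
		url:    u,
		client: &http.Client{Timeout: 10 * time.Second}, // assumed default
	}
	for _, opt := range options {
		opt(e)
	}
	return e, nil
}

func main() {
	custom := &http.Client{Timeout: 30 * time.Second}
	e, err := newExtractor("https://example.org/post", withClient(custom))
	if err != nil {
		panic(err)
	}
	fmt.Println(e.url.Host, e.client.Timeout) // prints example.org 30s
}
```

Because options run after the defaults are set, a later option overrides an earlier one and callers only name what they want to change.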

func (*Extractor) AddDrop

func (e *Extractor) AddDrop(src *url.URL)

AddDrop adds a new Drop to the drop list.

func (*Extractor) AddError

func (e *Extractor) AddError(err error)

AddError adds a new error to the extractor's error list.

func (*Extractor) AddProcessors

func (e *Extractor) AddProcessors(p ...Processor)

AddProcessors adds extract processor(s) to the list.

func (*Extractor) Client

func (e *Extractor) Client() *http.Client

Client returns the extractor's HTTP client.

func (*Extractor) Drop

func (e *Extractor) Drop() *Drop

Drop returns the extractor's first drop, when there is one.

func (*Extractor) Drops

func (e *Extractor) Drops() []*Drop

Drops returns the extractor's drop list.

func (*Extractor) Errors

func (e *Extractor) Errors() Error

Errors returns the extractor's error list.

func (*Extractor) Log

func (e *Extractor) Log() *slog.Logger

Log returns the extractor's logger.

func (*Extractor) NewProcessMessage

func (e *Extractor) NewProcessMessage(step ProcessStep) *ProcessMessage

NewProcessMessage returns a new ProcessMessage for a given step.

func (*Extractor) ReplaceDrop

func (e *Extractor) ReplaceDrop(src *url.URL) error

ReplaceDrop replaces the main Drop with a new one.

func (*Extractor) Run

func (e *Extractor) Run()

Run starts the extraction process.

type FetchType

type FetchType uint8

FetchType is the type of request the extractor and related tools can make.

const (
	// PageRequest is a page request type.
	PageRequest FetchType = iota + 1
	// ImageRequest is an image request type.
	ImageRequest
	// ResourceRequest is a resource request type.
	ResourceRequest
	// ContentScriptRequest identifies a request made from a content-script.
	ContentScriptRequest
)

type Picture

type Picture struct {
	Href string
	Type string
	Size [2]int
	// contains filtered or unexported fields
}

Picture is a remote picture.

func NewPicture

func NewPicture(src string, base *url.URL) (*Picture, error)

NewPicture returns a new Picture instance from a given URL and its base.

func (*Picture) Bytes

func (p *Picture) Bytes() []byte

Bytes returns the image data.

func (*Picture) Copy

func (p *Picture) Copy(size uint, toFormat string) (*Picture, error)

Copy returns a resized copy of the image, as a new Picture instance.

func (*Picture) Encoded

func (p *Picture) Encoded() string

Encoded returns a base64 encoded string of the image.

func (*Picture) Load

func (p *Picture) Load(ctx context.Context, client *http.Client, size uint, toFormat string) error

Load loads the image remotely and fits it within the given size.

func (*Picture) Name

func (p *Picture) Name(name string) string

Name returns the given name of the picture with the correct extension.

type ProcessList

type ProcessList []Processor

ProcessList holds the processes that will be applied.

type ProcessMessage

type ProcessMessage struct {
	Extractor *Extractor
	Dom       *html.Node
	// contains filtered or unexported fields
}

ProcessMessage holds the message that is passed to (and changed by) each successive processor.

func (*ProcessMessage) Cancel

func (m *ProcessMessage) Cancel(reason string, args ...interface{})

Cancel fully cancels the extraction process.

func (*ProcessMessage) Log

func (m *ProcessMessage) Log() *slog.Logger

Log returns the message's slog.Logger.

func (*ProcessMessage) Position

func (m *ProcessMessage) Position() int

Position returns the current process position.

func (*ProcessMessage) ResetContent

func (m *ProcessMessage) ResetContent()

ResetContent empties the message's Dom and the body of every drop.

func (*ProcessMessage) ResetPosition

func (m *ProcessMessage) ResetPosition()

ResetPosition lets the process start over (normally with a new URL). It holds a counter and cancels everything after too many resets (defined by maxReset).
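The reset counter described above guards against endless restarts, for instance a redirect loop. Here is a minimal sketch of that guard, assuming a limit of 10 (the real value of maxReset is not shown on this page) and a local processMessage stand-in.

```go
package main

import "fmt"

// maxReset is the number of allowed restarts. The value is an assumption.
const maxReset = 10

// processMessage is a local stand-in that tracks position and resets.
type processMessage struct {
	position   int
	resetCount int
	canceled   bool
}

// resetPosition restarts processing from the beginning, or cancels
// the whole run once the reset budget is exhausted.
func (m *processMessage) resetPosition() {
	m.resetCount++
	if m.resetCount > maxReset {
		m.canceled = true
		return
	}
	m.position = 0
}

func main() {
	m := &processMessage{position: 3}
	for i := 0; i <= maxReset; i++ {
		m.resetPosition()
	}
	fmt.Println(m.resetCount, m.canceled) // prints 11 true
}
```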

func (*ProcessMessage) Step

func (m *ProcessMessage) Step() ProcessStep

Step returns the current process step.

type ProcessStep

type ProcessStep int

ProcessStep defines a type of process applied during extraction.

const (
	// StepStart happens before the connection is made.
	StepStart ProcessStep = iota + 1

	// StepBody happens after receiving the resource body.
	StepBody

	// StepDom happens after parsing the resource DOM tree.
	StepDom

	// StepFinish happens at the very end of the extraction.
	StepFinish

	// StepPostProcess happens after looping over each Drop.
	StepPostProcess

	// StepDone is always called at the very end of the extraction.
	StepDone
)

func (ProcessStep) String

func (s ProcessStep) String() string
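Starting the enumeration at iota + 1 leaves the zero value free to mean "no step". The sketch below mirrors the constants on a local type and adds a String method; the returned labels are assumptions, since the actual strings are not listed on this page.

```go
package main

import "fmt"

// processStep is a local stand-in for ProcessStep.
type processStep int

const (
	stepStart processStep = iota + 1
	stepBody
	stepDom
	stepFinish
	stepPostProcess
	stepDone
)

// String returns a label for the step. The labels are assumed names.
func (s processStep) String() string {
	switch s {
	case stepStart:
		return "start"
	case stepBody:
		return "body"
	case stepDom:
		return "dom"
	case stepFinish:
		return "finish"
	case stepPostProcess:
		return "post-process"
	case stepDone:
		return "done"
	}
	return "unknown"
}

func main() {
	var unset processStep // zero value, not a valid step
	fmt.Println(int(stepStart), stepDom, unset)
	// prints 1 dom unknown
}
```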

type Processor

type Processor func(*ProcessMessage, Processor) Processor

Processor is the process function.
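The signature suggests a chain in which each processor receives the one that follows it and returns the one to run next. The sketch below is one simplified reading of that contract, assuming that returning next continues the chain and returning nil stops it. The runner, the message type and the record and stop processors are all hypothetical.

```go
package main

import "fmt"

// message is a local stand-in that records which processors ran.
type message struct {
	trail []string
}

// processor mirrors the shape of the Processor type.
type processor func(*message, processor) processor

// record returns a processor that logs its name and continues.
func record(name string) processor {
	return func(m *message, next processor) processor {
		m.trail = append(m.trail, name)
		return next
	}
}

// stop logs itself and ends the chain by returning nil.
func stop(m *message, _ processor) processor {
	m.trail = append(m.trail, "stop")
	return nil
}

// run calls each processor with its successor. Assumption: a nil
// return value means no further processor should run.
func run(m *message, list []processor) {
	for i, p := range list {
		var next processor
		if i+1 < len(list) {
			next = list[i+1]
		}
		if p(m, next) == nil {
			return
		}
	}
}

func main() {
	m := &message{}
	run(m, []processor{record("meta"), stop, record("content")})
	fmt.Println(m.trail) // prints [meta stop]
}
```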

type ProxyMatcher

type ProxyMatcher interface {
	// Returns the matching host
	Host() string
	// Returns the proxy URL
	URL() *url.URL
}

ProxyMatcher describes a mapping of host/url for proxy dispatch.

type URLList

type URLList map[string]bool

URLList holds a list of URLs.

func (URLList) Add

func (l URLList) Add(v *url.URL)

Add adds a new URL to the list.

func (URLList) IsPresent

func (l URLList) IsPresent(v *url.URL) bool

IsPresent reports whether the given URL is in the list.
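A map[string]bool works as a set of visited URLs. The sketch below keys the set on the URL's String() form, which is an assumption; the package may normalize URLs before storing them. urlList and mustParse are stand-ins.

```go
package main

import (
	"fmt"
	"net/url"
)

// urlList is a local stand-in for URLList.
type urlList map[string]bool

// add marks a URL as present. Assumption: the key is v.String().
func (l urlList) add(v *url.URL) {
	l[v.String()] = true
}

// isPresent reports whether the URL was added before.
// A missing key yields the zero value false.
func (l urlList) isPresent(v *url.URL) bool {
	return l[v.String()]
}

// mustParse parses raw and panics on error, for demo purposes only.
func mustParse(raw string) *url.URL {
	u, err := url.Parse(raw)
	if err != nil {
		panic(err)
	}
	return u
}

func main() {
	visited := urlList{}
	visited.add(mustParse("https://example.org/a"))

	fmt.Println(visited.isPresent(mustParse("https://example.org/a"))) // prints true
	fmt.Println(visited.isPresent(mustParse("https://example.org/b"))) // prints false
}
```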

Directories

Path Synopsis
Package contents provides extraction processes for content processing (readability) and plain text conversion.
Package contentscripts provides a JavaScript engine that runs builtin, or user defined, scripts during the extraction process.
Package meta provides extract processors to retrieve several meta information from a page (meta tags, favicon, pictures...).
Package microdata provides a JSON-LD and HTML microdata parser and resolver.
Package srcset is an srcset value parser.
Package testing provides some tools for fixture loading as HTTP mock responses.
