Documentation
Overview
Package extract is a content extractor for HTML pages. It works by using processors that are triggered at one or several steps of the extraction process.
Index
- Constants
- Variables
- func Fetch(ctx context.Context, client *http.Client, url string) (*http.Response, error)
- func NewRemoteImage(ctx context.Context, client *http.Client, src string) (img.Image, error)
- func WithClient(client *http.Client) func(e *Extractor)
- func WithLogger(logger *slog.Logger, level slog.Level, args ...any) func(e *Extractor)
- func WithReferrer(ctx context.Context, u *url.URL) context.Context
- type Drop
- type DropMeta
- type DropProperties
- type Error
- type Extractor
- func (e *Extractor) AddDrop(src *url.URL)
- func (e *Extractor) AddError(err error)
- func (e *Extractor) AddProcessors(p ...Processor)
- func (e *Extractor) Client() *http.Client
- func (e *Extractor) Drop() *Drop
- func (e *Extractor) Drops() []*Drop
- func (e *Extractor) Errors() Error
- func (e *Extractor) Log() *slog.Logger
- func (e *Extractor) NewProcessMessage(step ProcessStep) *ProcessMessage
- func (e *Extractor) ReplaceDrop(src *url.URL) error
- func (e *Extractor) Run()
- type FetchType
- type Picture
- type ProcessList
- type ProcessMessage
- type ProcessStep
- type Processor
- type ProxyMatcher
- type URLList
Constants
const (
    // ImageSizeThumbnail is the width of a regular thumbnail image.
    ImageSizeThumbnail = 800
    // ImageSizeWide is the width of a bigger image.
    // Its dimension matches 48rem (on a 16px basis)
    // on an HDPI screen: 48 × 16px × 2.
    ImageSizeWide = 1536
)
Variables
var (
    // WithRequestType returns a new context that contains the given [FetchType].
    WithRequestType = ctxr.Setter[FetchType](ctxRequestTypeKey{})
    // CheckRequestType returns the [FetchType] of a given context.
    CheckRequestType = ctxr.Checker[FetchType](ctxRequestTypeKey{})
    // WithRequestHeader returns a new context that contains the given [http.Header].
    WithRequestHeader = ctxr.Setter[http.Header](ctxRequestHeaderKey{})
    // CheckRequestHeader returns the [http.Header] of a given context.
    CheckRequestHeader = ctxr.Checker[http.Header](ctxRequestHeaderKey{})
)
Functions
func Fetch
Fetch builds and performs a GET request to a given URL. It uses [FetchOptions] to add the request type to the request's context and headers, if any.
func NewRemoteImage
NewRemoteImage loads an image and returns a new img.Image instance. If the image is a GIF, it returns its first frame only.
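The "first frame only" behaviour can be reproduced with the standard library alone, since `gif.Decode` returns the first image of an animated GIF. The sketch below assumes nothing about the img package used by NewRemoteImage; `firstFrame` and `twoFrameGIF` are hypothetical helpers, and the GIF is built in memory so the example is self-contained.

```go
package main

import (
	"bytes"
	"fmt"
	"image"
	"image/color"
	"image/gif"
)

// firstFrame decodes GIF data and returns only its first frame.
// gif.Decode from the standard library already behaves that way.
func firstFrame(data []byte) (image.Image, error) {
	return gif.Decode(bytes.NewReader(data))
}

// twoFrameGIF builds a 2x2 animated GIF in memory: a black frame, then a white one.
func twoFrameGIF() ([]byte, error) {
	pal := color.Palette{color.Black, color.White}
	frame := func(idx uint8) *image.Paletted {
		p := image.NewPaletted(image.Rect(0, 0, 2, 2), pal)
		for i := range p.Pix {
			p.Pix[i] = idx
		}
		return p
	}
	anim := &gif.GIF{
		Image: []*image.Paletted{frame(0), frame(1)},
		Delay: []int{0, 0},
	}
	var buf bytes.Buffer
	if err := gif.EncodeAll(&buf, anim); err != nil {
		return nil, err
	}
	return buf.Bytes(), nil
}

func main() {
	data, err := twoFrameGIF()
	if err != nil {
		fmt.Println("encode error:", err)
		return
	}
	img, err := firstFrame(data)
	if err != nil {
		fmt.Println("decode error:", err)
		return
	}
	r, g, b, _ := img.At(0, 0).RGBA()
	fmt.Println(img.Bounds().Dx(), img.Bounds().Dy(), r, g, b)
}
```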
func WithClient
WithClient sets the extractor HTTP client.
func WithLogger
WithLogger sets the extractor logger. This logger will copy everything to the extractor internal log and error list. Arguments are slog.With arguments and are shared between the parent logger and the log recorder.
func WithReferrer
WithReferrer sets a Referer value to the context's http.Header. The value is only "{scheme}://{host}/".
Types
type Drop
type Drop struct {
    URL           *url.URL
    Domain        string
    ContentType   string
    Charset       string
    DocumentType  string
    Title         string
    Description   string
    Authors       []string
    Site          string
    Lang          string
    TextDirection string
    Date          time.Time
    Header        http.Header
    Meta          DropMeta
    Properties    DropProperties
    Body          []byte `json:"-"`
    Pictures      map[string]*Picture
}
Drop is the result of a content extraction of one resource.
func (*Drop) AddAuthors
AddAuthors adds authors to the author list, ignoring potential duplicates.
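A duplicate-ignoring append can be sketched as follows. This is not the method's actual code: `addAuthors` is a hypothetical free function, and exact string comparison is an assumption (the real method may normalize names differently).

```go
package main

import "fmt"

// addAuthors appends names to list, skipping any name that is already
// present. Names are compared as exact strings (an assumption).
func addAuthors(list []string, names ...string) []string {
	seen := make(map[string]struct{}, len(list))
	for _, a := range list {
		seen[a] = struct{}{}
	}
	for _, n := range names {
		if _, ok := seen[n]; ok {
			continue
		}
		seen[n] = struct{}{}
		list = append(list, n)
	}
	return list
}

func main() {
	authors := []string{"Ada Lovelace"}
	authors = addAuthors(authors, "Grace Hopper", "Ada Lovelace", "Grace Hopper")
	fmt.Println(authors) // [Ada Lovelace Grace Hopper]
}
```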
func (*Drop) UnescapedURL
UnescapedURL returns the Drop's URL unescaped, for storage.
type DropMeta
DropMeta is a map of lists of strings that contains the collected metadata.
type DropProperties
DropProperties contains the raw properties of an extracted page.
type Error
type Error []error
Error holds all the non-fatal errors that were caught during extraction.
type Extractor
type Extractor struct {
    URL     *url.URL
    HTML    []byte
    Text    string
    Visited URLList
    Logs    []string
    Context context.Context
    // contains filtered or unexported fields
}
Extractor is a page extractor.
func (*Extractor) AddProcessors
AddProcessors adds extract processor(s) to the list.
func (*Extractor) NewProcessMessage
func (e *Extractor) NewProcessMessage(step ProcessStep) *ProcessMessage
NewProcessMessage returns a new ProcessMessage for a given step.
func (*Extractor) ReplaceDrop
ReplaceDrop replaces the main Drop with a new one.
type FetchType
type FetchType uint8
FetchType is the type of request the extractor and related tools can make.
type Picture
type Picture struct {
    Href string
    Type string
    Size [2]int
    // contains filtered or unexported fields
}
Picture is a remote picture.
func NewPicture
NewPicture returns a new Picture instance from a given URL and its base.
type ProcessList
type ProcessList []Processor
ProcessList holds the processes that will be applied.
type ProcessMessage
type ProcessMessage struct {
    Extractor *Extractor
    Dom       *html.Node
    // contains filtered or unexported fields
}
ProcessMessage holds the process message that is passed (and changed) by the subsequent processes.
func (*ProcessMessage) Cancel
func (m *ProcessMessage) Cancel(reason string, args ...interface{})
Cancel fully cancels the extraction process.
func (*ProcessMessage) Log
func (m *ProcessMessage) Log() *slog.Logger
Log returns the message's slog.Logger.
func (*ProcessMessage) Position
func (m *ProcessMessage) Position() int
Position returns the current process position.
func (*ProcessMessage) ResetContent
func (m *ProcessMessage) ResetContent()
ResetContent empties the message Dom and the body of every drop.
func (*ProcessMessage) ResetPosition
func (m *ProcessMessage) ResetPosition()
ResetPosition lets the process start over (normally with a new URL). It holds a counter and cancels everything after too many resets (defined by maxReset).
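The reset-with-a-limit behaviour can be sketched like this. Everything here is an assumption made for illustration: the `message` type, the value of `maxReset`, and the choice of resetting the position to 0 are hypothetical, since the unexported fields and the real limit are not shown on this page.

```go
package main

import "fmt"

// maxReset is a hypothetical limit; the real value is not documented here.
const maxReset = 3

// message is a local stand-in for the unexported state of ProcessMessage.
type message struct {
	position int
	resets   int
	canceled bool
}

// resetPosition counts each reset and cancels once the limit is exceeded;
// otherwise it sends the process back to its first position.
func (m *message) resetPosition() {
	m.resets++
	if m.resets > maxReset {
		m.canceled = true
		return
	}
	m.position = 0
}

func main() {
	m := &message{position: 5}
	for i := 0; i < 4; i++ {
		m.resetPosition()
	}
	fmt.Println(m.resets, m.canceled) // 4 true
}
```

The counter guards against loops such as two pages redirecting to each other forever.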
func (*ProcessMessage) Step
func (m *ProcessMessage) Step() ProcessStep
Step returns the current process step.
type ProcessStep
type ProcessStep int
ProcessStep defines a type of process applied during extraction.
const (
    // StepStart happens before the connection is made.
    StepStart ProcessStep = iota + 1
    // StepBody happens after receiving the resource body.
    StepBody
    // StepDom happens after parsing the resource DOM tree.
    StepDom
    // StepFinish happens at the very end of the extraction.
    StepFinish
    // StepPostProcess happens after looping over each Drop.
    StepPostProcess
    // StepDone is always called at the very end of the extraction.
    StepDone
)
func (ProcessStep) String
func (s ProcessStep) String() string
type Processor
type Processor func(*ProcessMessage, Processor) Processor
Processor is the process function.
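The signature `func(*ProcessMessage, Processor) Processor` suggests a chain in which each processor receives the message and the next processor, then returns the processor to run afterwards. The page does not document how the chain is driven, so the following is only one plausible sketch: the `message` type, the `run` loop, and the convention that returning nil stops the chain are all assumptions.

```go
package main

import "fmt"

// message is a local stand-in for ProcessMessage; it records which processors ran.
type message struct {
	trail []string
}

// processor has the same shape as Processor.
type processor func(m *message, next processor) processor

// named returns a processor that records its name, then either continues
// with next or stops the chain by returning nil.
func named(name string, stop bool) processor {
	return func(m *message, next processor) processor {
		m.trail = append(m.trail, name)
		if stop {
			return nil
		}
		return next
	}
}

// run drives the chain: it calls the current processor with the one that
// follows it in the list, then runs whatever processor comes back.
func run(m *message, list []processor) {
	i := 0
	var p processor
	if len(list) > 0 {
		p = list[0]
	}
	for p != nil {
		i++
		var next processor
		if i < len(list) {
			next = list[i]
		}
		p = p(m, next)
	}
}

func main() {
	m := &message{}
	run(m, []processor{named("fetch", false), named("parse", true), named("clean", false)})
	fmt.Println(m.trail) // [fetch parse]
}
```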
type ProxyMatcher
type ProxyMatcher interface {
    // Returns the matching host
    Host() string
    // Returns the proxy URL
    URL() *url.URL
}
ProxyMatcher describes a mapping of host/url for proxy dispatch.
Directories
| Path | Synopsis |
|---|---|
| contents | Package contents provides extraction processes for content processing (readability) and plain text conversion. |
| contentscripts | Package contentscripts provides a JavaScript engine that runs builtin or user-defined scripts during the extraction process. |
| meta | Package meta provides extract processors to retrieve various meta information from a page (meta tags, favicon, pictures...). |
| microdata | Package microdata provides a JSON-LD and HTML microdata parser and resolver. |
| srcset | Package srcset is an srcset value parser. |
| testing | Package testing provides some tools for fixture loading as HTTP mock responses. |