extract

package

v0.0.0-...-60192f8 Published: Apr 26, 2024 License: AGPL-3.0 Imports: 31 Imported by: 0

Repository

codeberg.org/readeck/readeck

Documentation ¶

Overview ¶

Package extract is a content extractor for HTML pages. It works by using processors that are triggered at one or several steps of the extraction process.

Index ¶

func NewClient() *http.Client
func NewRemoteImage(src string, client *http.Client) (img.Image, error)
func SetDeniedIPs(netList []*net.IPNet) func(e *Extractor)
func SetHeader(client *http.Client, name, value string)
func SetLogFields(f *log.Fields) func(e *Extractor)
func SetProxyList(list []ProxyMatcher) func(e *Extractor)
type Drop
- func NewDrop(src *url.URL) *Drop
- func (d *Drop) AddAuthors(values ...string)
- func (d *Drop) IsHTML() bool
- func (d *Drop) IsMedia() bool
- func (d *Drop) Load(client *http.Client) error
- func (d *Drop) SetURL(src *url.URL)
- func (d *Drop) UnescapedURL() string
type DropMeta
- func (m DropMeta) Add(name, value string)
- func (m DropMeta) Lookup(names ...string) []string
- func (m DropMeta) LookupGet(names ...string) string
type DropProperties
type Error
- func (e Error) Error() string
type Extractor
- func New(src string, options ...func(e *Extractor)) (*Extractor, error)
- func (e *Extractor) AddDrop(src *url.URL)
- func (e *Extractor) AddError(err error)
- func (e *Extractor) AddProcessors(p ...Processor)
- func (e *Extractor) AddToCache(url string, headers map[string]string, body []byte)
- func (e *Extractor) Client() *http.Client
- func (e *Extractor) Drop() *Drop
- func (e *Extractor) Drops() []*Drop
- func (e *Extractor) Errors() Error
- func (e *Extractor) GetLogger() *log.Logger
- func (e *Extractor) IsInCache(url string) bool
- func (e *Extractor) NewProcessMessage(step ProcessStep) *ProcessMessage
- func (e *Extractor) ReplaceDrop(src *url.URL) error
- func (e *Extractor) Run()
type Picture
- func NewPicture(src string, base *url.URL) (*Picture, error)
- func (p *Picture) Bytes() []byte
- func (p *Picture) Copy(size uint, toFormat string) (*Picture, error)
- func (p *Picture) Encoded() string
- func (p *Picture) Load(client *http.Client, size uint, toFormat string) error
- func (p *Picture) Name(name string) string
type ProcessList
type ProcessMessage
- func (m *ProcessMessage) Cancel(reason string, args ...interface{})
- func (m *ProcessMessage) Position() int
- func (m *ProcessMessage) ResetContent()
- func (m *ProcessMessage) ResetPosition()
- func (m *ProcessMessage) Step() ProcessStep
type ProcessStep
type Processor
type ProxyMatcher
type Transport
- func (t *Transport) RoundTrip(req *http.Request) (*http.Response, error)
- func (t *Transport) SetRoundTripper(f transportCache)
type URLList
- func (l URLList) Add(v *url.URL)
- func (l URLList) IsPresent(v *url.URL) bool

Constants ¶

This section is empty.

Variables ¶

This section is empty.

Functions ¶

func NewClient ¶

func NewClient() *http.Client

NewClient returns a new http.Client with our custom transport.

func NewRemoteImage ¶

func NewRemoteImage(src string, client *http.Client) (img.Image, error)

NewRemoteImage loads an image and returns a new img.Image instance.

func SetDeniedIPs ¶

func SetDeniedIPs(netList []*net.IPNet) func(e *Extractor)

SetDeniedIPs sets a list of IP addresses or CIDR ranges that cannot be reached by the extraction client.
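The netList argument is a slice of *net.IPNet values. The following is a minimal sketch of how such a list might be built with the standard library. The helpers parseDenied and isDenied and the example ranges are illustrative assumptions and are not part of package extract; the call to SetDeniedIPs appears only in a comment.

```go
package main

import (
	"fmt"
	"net"
)

// parseDenied is a hypothetical helper (not part of package extract) that
// turns CIDR strings into the []*net.IPNet shape SetDeniedIPs accepts.
// A single address can be written as a /32 (IPv4) or /128 (IPv6) range.
func parseDenied(cidrs ...string) ([]*net.IPNet, error) {
	list := make([]*net.IPNet, 0, len(cidrs))
	for _, c := range cidrs {
		_, ipnet, err := net.ParseCIDR(c)
		if err != nil {
			return nil, fmt.Errorf("invalid CIDR %q: %w", c, err)
		}
		list = append(list, ipnet)
	}
	return list, nil
}

// isDenied is a hypothetical check that reports whether addr parses as an
// IP address contained in any network of the list.
func isDenied(list []*net.IPNet, addr string) bool {
	ip := net.ParseIP(addr)
	if ip == nil {
		return false
	}
	for _, n := range list {
		if n.Contains(ip) {
			return true
		}
	}
	return false
}

func main() {
	denied, err := parseDenied("127.0.0.0/8", "10.0.0.0/8", "::1/128")
	if err != nil {
		panic(err)
	}
	// The list could then be passed as an option, for example:
	//   extract.New(src, extract.SetDeniedIPs(denied))
	fmt.Println(isDenied(denied, "10.1.2.3"))
	fmt.Println(isDenied(denied, "93.184.216.34"))
}
```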

func SetHeader ¶

func SetHeader(client *http.Client, name, value string)

SetHeader sets a header on a given client.

func SetLogFields ¶

func SetLogFields(f *log.Fields) func(e *Extractor)

SetLogFields sets the default log fields for the extractor.

func SetProxyList ¶

func SetProxyList(list []ProxyMatcher) func(e *Extractor)

SetProxyList adds a new proxy dispatcher function to the HTTP transport.

Types ¶

type Drop ¶

type Drop struct {
	URL          *url.URL
	Domain       string
	ContentType  string
	Charset      string
	DocumentType string

	Title         string
	Description   string
	Authors       []string
	Site          string
	Lang          string
	TextDirection string
	Date          time.Time

	Header     http.Header
	Meta       DropMeta
	Properties DropProperties
	Body       []byte `json:"-"`

	Pictures map[string]*Picture
}

Drop is the result of a content extraction of one resource.

func NewDrop ¶

func NewDrop(src *url.URL) *Drop

NewDrop returns a Drop instance.

func (*Drop) AddAuthors ¶

func (d *Drop) AddAuthors(values ...string)

AddAuthors adds authors to the author list, ignoring potential duplicates.

func (*Drop) IsHTML ¶

func (d *Drop) IsHTML() bool

IsHTML returns true when the resource is of type HTML.

func (*Drop) IsMedia ¶

func (d *Drop) IsMedia() bool

IsMedia returns true when the document type is a media type.

func (*Drop) Load ¶

func (d *Drop) Load(client *http.Client) error

Load loads the remote URL and retrieves its data.

func (*Drop) SetURL ¶

func (d *Drop) SetURL(src *url.URL)

SetURL sets the Drop's URL and Domain properties in their Unicode versions.

func (*Drop) UnescapedURL ¶

func (d *Drop) UnescapedURL() string

UnescapedURL returns the Drop's URL unescaped, for storage.

type DropMeta ¶

type DropMeta map[string][]string

DropMeta is a map of lists of strings that contains the collected metadata.

func (DropMeta) Add ¶

func (m DropMeta) Add(name, value string)

Add adds a value to the raw metadata list.

func (DropMeta) Lookup ¶

func (m DropMeta) Lookup(names ...string) []string

Lookup returns all the found values for the provided metadata names.

func (DropMeta) LookupGet ¶

func (m DropMeta) LookupGet(names ...string) string

LookupGet returns the first value found for the provided metadata names.
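To illustrate the map-of-lists shape, here is a minimal sketch using a local stand-in type with the same underlying type as DropMeta. The method bodies are assumptions written for this example, not the package's code; in particular, this sketch assumes Lookup returns the values of the first name that has any. The metadata keys are hypothetical.

```go
package main

import "fmt"

// meta is a local stand-in with the same underlying type as DropMeta.
// Its method bodies are assumptions, not the package's implementation.
type meta map[string][]string

// Add appends value to the list stored under name.
func (m meta) Add(name, value string) {
	m[name] = append(m[name], value)
}

// Lookup returns the values of the first name that has any
// (assumed behavior), or nil when no name matches.
func (m meta) Lookup(names ...string) []string {
	for _, n := range names {
		if v, ok := m[n]; ok && len(v) > 0 {
			return v
		}
	}
	return nil
}

// LookupGet returns the first value found, or "" when nothing matches.
func (m meta) LookupGet(names ...string) string {
	if v := m.Lookup(names...); len(v) > 0 {
		return v[0]
	}
	return ""
}

func main() {
	m := meta{}
	// The keys below are hypothetical metadata names.
	m.Add("og.title", "An Article")
	m.Add("html.title", "An Article | Example")

	// Names are tried in order, so the call expresses a preference list.
	fmt.Println(m.LookupGet("graph.title", "og.title", "html.title"))
}
```

Passing several names lets a caller express a fallback order in a single call.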

type DropProperties ¶

type DropProperties map[string]any

DropProperties contains the raw properties of an extracted page.

type Error ¶

type Error []error

Error holds all the non-fatal errors that were caught during extraction.

func (Error) Error ¶

func (e Error) Error() string
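Because Error is a slice of errors with an Error method, a value of that type can itself be used as an error. The sketch below shows the pattern with a local stand-in type; the "; " separator and the helper newMultiError are assumptions made for this example, not the package's behavior.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// multiError is a local stand-in for a slice-of-errors type such as Error.
type multiError []error

// Error joins the individual messages. The "; " separator is an assumption.
func (e multiError) Error() string {
	parts := make([]string, len(e))
	for i, err := range e {
		parts[i] = err.Error()
	}
	return strings.Join(parts, "; ")
}

// newMultiError is a hypothetical helper that builds a list from messages.
func newMultiError(msgs ...string) multiError {
	var errs multiError
	for _, m := range msgs {
		errs = append(errs, errors.New(m))
	}
	return errs
}

func main() {
	errs := newMultiError("picture not found", "invalid date")

	// A slice type with an Error method satisfies the error interface.
	var err error = errs
	fmt.Println(len(errs), err)
}
```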

type Extractor ¶

type Extractor struct {
	URL       *url.URL
	HTML      []byte
	Text      string
	Visited   URLList
	Logs      []string
	Context   context.Context
	LogFields *log.Fields
	// contains filtered or unexported fields
}

Extractor is a page extractor.

func New ¶

func New(src string, options ...func(e *Extractor)) (*Extractor, error)

New returns an Extractor instance for a given URL, with a default HTTP client.
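The options parameter follows the functional options pattern: each option is a function that receives the extractor and adjusts it. The sketch below shows the pattern on a simplified local stand-in. The type extractor, its Timeout field, and the option withTimeout are hypothetical and exist only for this illustration.

```go
package main

import (
	"fmt"
	"net/url"
)

// extractor is a simplified local stand-in, not the real Extractor.
type extractor struct {
	URL     *url.URL
	Timeout int // hypothetical setting, in seconds
}

// withTimeout is a hypothetical option in the same shape as
// SetDeniedIPs or SetLogFields: it returns a func(e *extractor).
func withTimeout(seconds int) func(e *extractor) {
	return func(e *extractor) { e.Timeout = seconds }
}

// newExtractor parses src, sets defaults, then applies each option in order.
func newExtractor(src string, options ...func(e *extractor)) (*extractor, error) {
	u, err := url.Parse(src)
	if err != nil {
		return nil, err
	}
	e := &extractor{URL: u, Timeout: 10}
	for _, opt := range options {
		opt(e)
	}
	return e, nil
}

func main() {
	e, err := newExtractor("https://example.org/article", withTimeout(30))
	if err != nil {
		panic(err)
	}
	fmt.Println(e.URL.Host, e.Timeout)
}
```

Because options are applied after the defaults, a later option overrides an earlier one that touches the same field.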

func (*Extractor) AddDrop ¶

func (e *Extractor) AddDrop(src *url.URL)

AddDrop adds a new Drop to the drop list.

func (*Extractor) AddError ¶

func (e *Extractor) AddError(err error)

AddError adds a new error to the extractor's error list.

func (*Extractor) AddProcessors ¶

func (e *Extractor) AddProcessors(p ...Processor)

AddProcessors adds extract processor(s) to the list.

func (*Extractor) AddToCache ¶

func (e *Extractor) AddToCache(url string, headers map[string]string, body []byte)

AddToCache adds a resource to the extractor's resource cache. The cache will be used by the HTTP client during its round trip.

func (*Extractor) Client ¶

func (e *Extractor) Client() *http.Client

Client returns the extractor's HTTP client.

func (*Extractor) Drop ¶

func (e *Extractor) Drop() *Drop

Drop returns the extractor's first drop, when there is one.

func (*Extractor) Drops ¶

func (e *Extractor) Drops() []*Drop

Drops returns the extractor's drop list.

func (*Extractor) Errors ¶

func (e *Extractor) Errors() Error

Errors returns the extractor's error list.

func (*Extractor) GetLogger ¶

func (e *Extractor) GetLogger() *log.Logger

GetLogger returns a logger for the extractor. This standard logger copies everything to the extractor's Logs slice.

func (*Extractor) IsInCache ¶

func (e *Extractor) IsInCache(url string) bool

IsInCache returns true if a given URL is present in the resource cache mapping.

func (*Extractor) NewProcessMessage ¶

func (e *Extractor) NewProcessMessage(step ProcessStep) *ProcessMessage

NewProcessMessage returns a new ProcessMessage for a given step.

func (*Extractor) ReplaceDrop ¶

func (e *Extractor) ReplaceDrop(src *url.URL) error

ReplaceDrop replaces the main Drop with a new one.

func (*Extractor) Run ¶

func (e *Extractor) Run()

Run starts the extraction process.

type Picture ¶

type Picture struct {
	Href string
	Type string
	Size [2]int
	// contains filtered or unexported fields
}

Picture is a remote picture.

func NewPicture ¶

func NewPicture(src string, base *url.URL) (*Picture, error)

NewPicture returns a new Picture instance from a given URL and its base.

func (*Picture) Bytes ¶

func (p *Picture) Bytes() []byte

Bytes returns the image data.

func (*Picture) Copy ¶

func (p *Picture) Copy(size uint, toFormat string) (*Picture, error)

Copy returns a resized copy of the image, as a new Picture instance.

func (*Picture) Encoded ¶

func (p *Picture) Encoded() string

Encoded returns a base64 encoded string of the image.

func (*Picture) Load ¶

func (p *Picture) Load(client *http.Client, size uint, toFormat string) error

Load loads the remote image and fits it within the given size boundaries.

func (*Picture) Name ¶

func (p *Picture) Name(name string) string

Name returns the given name of the picture with the correct extension.

type ProcessList ¶

type ProcessList []Processor

ProcessList holds the processes that will be applied.

type ProcessMessage ¶

type ProcessMessage struct {
	Extractor *Extractor
	Log       *log.Entry
	Dom       *html.Node
	// contains filtered or unexported fields
}

ProcessMessage holds the process message that is passed (and changed) by the subsequent processes.

func (*ProcessMessage) Cancel ¶

func (m *ProcessMessage) Cancel(reason string, args ...interface{})

Cancel fully cancels the extraction process.

func (*ProcessMessage) Position ¶

func (m *ProcessMessage) Position() int

Position returns the current process position.

func (*ProcessMessage) ResetContent ¶

func (m *ProcessMessage) ResetContent()

ResetContent empties the message Dom and the body of every drop.

func (*ProcessMessage) ResetPosition ¶

func (m *ProcessMessage) ResetPosition()

ResetPosition lets the process start over (normally with a new URL). It holds a counter and cancels everything after too many resets (defined by maxReset).

func (*ProcessMessage) Step ¶

func (m *ProcessMessage) Step() ProcessStep

Step returns the current process step.

type ProcessStep ¶

type ProcessStep int

ProcessStep defines a type of process applied during extraction.

const (
	// StepStart happens before the connection is made.
	StepStart ProcessStep = iota + 1

	// StepBody happens after receiving the resource body.
	StepBody

	// StepDom happens after parsing the resource DOM tree.
	StepDom

	// StepFinish happens at the very end of the extraction.
	StepFinish

	// StepPostProcess happens after looping over each Drop.
	StepPostProcess

	// StepDone is always called at the very end of the extraction.
	StepDone
)
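Since the constants start at iota + 1, the steps take the values 1 through 6 and the zero value of ProcessStep matches no step. The sketch below reproduces the declaration so that it compiles on its own; the helper stepName is hypothetical and not part of the package.

```go
package main

import "fmt"

// ProcessStep and its constants are reproduced from the declaration above.
type ProcessStep int

const (
	StepStart ProcessStep = iota + 1
	StepBody
	StepDom
	StepFinish
	StepPostProcess
	StepDone
)

// stepName is a hypothetical helper that maps a step to a readable label.
func stepName(s ProcessStep) string {
	switch s {
	case StepStart:
		return "start"
	case StepBody:
		return "body"
	case StepDom:
		return "dom"
	case StepFinish:
		return "finish"
	case StepPostProcess:
		return "postprocess"
	case StepDone:
		return "done"
	}
	return "unknown"
}

func main() {
	// Walk the steps in declaration order.
	for s := StepStart; s <= StepDone; s++ {
		fmt.Println(int(s), stepName(s))
	}
	// The zero value is not a declared step.
	var zero ProcessStep
	fmt.Println(int(zero), stepName(zero))
}
```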

type Processor ¶

type Processor func(*ProcessMessage, Processor) Processor

Processor is the process function.
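A function type that receives the next function and returns a function lends itself to chaining. The sketch below shows one way such a signature can be driven, using simplified local types. The runner, the message fields, and the convention that returning nil stops the chain are assumptions made for this example; they are not taken from the package's implementation.

```go
package main

import (
	"fmt"
	"strings"
)

// message is a simplified stand-in for ProcessMessage.
type message struct {
	Trace []string // names of the processors that ran
	Stop  bool     // hypothetical flag used by gate below
}

// processor mirrors the shape of Processor on the local message type.
type processor func(*message, processor) processor

// record returns a processor that logs its name and hands over to next.
func record(name string) processor {
	return func(m *message, next processor) processor {
		m.Trace = append(m.Trace, name)
		return next
	}
}

// gate ends the chain by returning nil when m.Stop is set.
func gate(m *message, next processor) processor {
	m.Trace = append(m.Trace, "gate")
	if m.Stop {
		return nil
	}
	return next
}

// run calls each processor with its successor and stops as soon as a
// processor returns nil (assumed convention for this sketch).
func run(m *message, list ...processor) {
	for i, p := range list {
		var next processor
		if i+1 < len(list) {
			next = list[i+1]
		}
		if p(m, next) == nil {
			return
		}
	}
}

// trace runs a fixed three-processor chain and returns what executed.
func trace(stop bool) string {
	m := &message{Stop: stop}
	run(m, record("first"), gate, record("last"))
	return strings.Join(m.Trace, ",")
}

func main() {
	fmt.Println(trace(false))
	fmt.Println(trace(true))
}
```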

type ProxyMatcher ¶

type ProxyMatcher interface {
	// Returns the matching host
	Host() string
	// Returns the proxy URL
	URL() *url.URL
}

ProxyMatcher describes a mapping of host/url for proxy dispatch.
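Any type with Host and URL methods satisfies this interface. The sketch below shows a small implementation and a lookup over a list. The interface is reproduced so that the block compiles on its own; staticProxy, mustProxy, pickProxy, and the exact-match comparison on the host are assumptions made for this example, and the package may match hosts differently.

```go
package main

import (
	"fmt"
	"net/url"
)

// ProxyMatcher is reproduced from the declaration above.
type ProxyMatcher interface {
	Host() string
	URL() *url.URL
}

// staticProxy is a hypothetical implementation backed by two fields.
type staticProxy struct {
	host  string
	proxy *url.URL
}

func (p staticProxy) Host() string  { return p.host }
func (p staticProxy) URL() *url.URL { return p.proxy }

// mustProxy is a hypothetical constructor that panics on an invalid URL.
func mustProxy(host, rawURL string) ProxyMatcher {
	u, err := url.Parse(rawURL)
	if err != nil {
		panic(err)
	}
	return staticProxy{host: host, proxy: u}
}

// pickProxy returns the proxy URL of the first entry whose host equals
// the requested host, or nil. Exact matching is an assumption.
func pickProxy(list []ProxyMatcher, host string) *url.URL {
	for _, m := range list {
		if m.Host() == host {
			return m.URL()
		}
	}
	return nil
}

func main() {
	list := []ProxyMatcher{
		mustProxy("example.org", "http://127.0.0.1:8080"),
	}
	// A list like this is the shape SetProxyList accepts.
	fmt.Println(pickProxy(list, "example.org"))
	fmt.Println(pickProxy(list, "other.net") == nil)
}
```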

type Transport ¶

type Transport struct {
	// contains filtered or unexported fields
}

Transport is a wrapper around http.RoundTripper that lets you set default headers sent with every request.
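The general technique of wrapping an http.RoundTripper to add default headers can be sketched with the standard library alone. The types headerTransport and roundTripFunc and the helper sentUserAgent below are hypothetical and are not the package's Transport; the base round tripper is a stub, so no network access takes place.

```go
package main

import (
	"fmt"
	"net/http"
)

// headerTransport is a hypothetical wrapper that adds default headers.
type headerTransport struct {
	base    http.RoundTripper
	headers http.Header
}

// RoundTrip sets each default header that the request does not already
// carry, then delegates to the wrapped round tripper.
func (t *headerTransport) RoundTrip(req *http.Request) (*http.Response, error) {
	// A RoundTripper should not modify the caller's request: use a clone.
	r := req.Clone(req.Context())
	for name, values := range t.headers {
		if r.Header.Get(name) == "" {
			r.Header[name] = append([]string(nil), values...)
		}
	}
	return t.base.RoundTrip(r)
}

// roundTripFunc adapts a function to http.RoundTripper.
type roundTripFunc func(*http.Request) (*http.Response, error)

func (f roundTripFunc) RoundTrip(r *http.Request) (*http.Response, error) { return f(r) }

// sentUserAgent sends one request through the wrapper and returns the
// User-Agent seen by the stub base transport.
func sentUserAgent(defaultUA, explicitUA string) (string, error) {
	var seen string
	base := roundTripFunc(func(r *http.Request) (*http.Response, error) {
		seen = r.Header.Get("User-Agent")
		return &http.Response{StatusCode: http.StatusOK, Body: http.NoBody, Request: r}, nil
	})
	client := &http.Client{Transport: &headerTransport{
		base:    base,
		headers: http.Header{"User-Agent": {defaultUA}},
	}}
	req, err := http.NewRequest(http.MethodGet, "http://example.invalid/", nil)
	if err != nil {
		return "", err
	}
	if explicitUA != "" {
		req.Header.Set("User-Agent", explicitUA)
	}
	resp, err := client.Do(req)
	if err != nil {
		return "", err
	}
	resp.Body.Close()
	return seen, nil
}

func main() {
	ua, err := sentUserAgent("sketch-agent/1.0", "")
	if err != nil {
		panic(err)
	}
	fmt.Println(ua)
}
```

In this sketch a header set explicitly on the request takes precedence over the default; whether the package's Transport behaves the same way is not stated in this page.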

func (*Transport) RoundTrip ¶

func (t *Transport) RoundTrip(req *http.Request) (*http.Response, error)

RoundTrip is the transport interceptor.

func (*Transport) SetRoundTripper ¶

func (t *Transport) SetRoundTripper(f transportCache)

SetRoundTripper sets an extra transport's round trip function.

type URLList ¶

type URLList map[string]bool

URLList holds a list of URLs.

func (URLList) Add ¶

func (l URLList) Add(v *url.URL)

Add adds a new URL to the list.

func (URLList) IsPresent ¶

func (l URLList) IsPresent(v *url.URL) bool

IsPresent returns true when the given URL is present in the list.
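A map[string]bool works as a set of visited URLs. The sketch below uses a local stand-in type keyed by the URL's string form; the choice of key and the helper mustParse are assumptions made for this example, and the package may normalize URLs differently before storing them.

```go
package main

import (
	"fmt"
	"net/url"
)

// urlList is a local stand-in with the same underlying type as URLList.
type urlList map[string]bool

// Add records v in the set, keyed by its string form (assumed key).
func (l urlList) Add(v *url.URL) {
	l[v.String()] = true
}

// IsPresent reports whether v was added before. A missing key yields
// the zero value false, so no explicit existence check is needed.
func (l urlList) IsPresent(v *url.URL) bool {
	return l[v.String()]
}

// mustParse is a hypothetical helper that panics on an invalid URL.
func mustParse(raw string) *url.URL {
	u, err := url.Parse(raw)
	if err != nil {
		panic(err)
	}
	return u
}

func main() {
	visited := urlList{}
	visited.Add(mustParse("https://example.org/a"))

	fmt.Println(visited.IsPresent(mustParse("https://example.org/a")))
	fmt.Println(visited.IsPresent(mustParse("https://example.org/b")))
}
```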

Directories ¶

Path	Synopsis
contents	Package contents provides extraction processes for content processing (readability) and plain text conversion.
contentscripts	Package contentscripts provides a JavaScript engine that runs builtin, or user defined, scripts during the extraction process.
meta	Package meta provides extract processors to retrieve several meta information from a page (meta tags, favicon, pictures...).
srcset	Package srcset is an srcset value parser.