extract

package
v0.0.0-...-60192f8
Warning

This package is not in the latest version of its module.

Published: Apr 26, 2024 License: AGPL-3.0 Imports: 31 Imported by: 0

Documentation

Overview

Package extract is a content extractor for HTML pages. It works by running processors that are triggered at one or several steps of the extraction process.

Constants

This section is empty.

Variables

This section is empty.

Functions

func NewClient

func NewClient() *http.Client

NewClient returns a new http.Client with our custom transport.

func NewRemoteImage

func NewRemoteImage(src string, client *http.Client) (img.Image, error)

NewRemoteImage loads an image and returns a new img.Image instance.

func SetDeniedIPs

func SetDeniedIPs(netList []*net.IPNet) func(e *Extractor)

SetDeniedIPs sets a list of IP addresses or CIDR ranges that cannot be reached by the extraction client.
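
As a minimal sketch of how such a list can be prepared, the standard library's net.ParseCIDR produces the []*net.IPNet values that SetDeniedIPs takes. The helpers parseDenied and isDenied are hypothetical names introduced here for illustration; isDenied only stands in for whatever check the extraction client performs and is not the package's code.

```go
package main

import (
	"fmt"
	"net"
)

// parseDenied turns CIDR strings into the []*net.IPNet shape that
// SetDeniedIPs takes. The helper and its name are illustrative only.
func parseDenied(cidrs ...string) ([]*net.IPNet, error) {
	list := make([]*net.IPNet, 0, len(cidrs))
	for _, c := range cidrs {
		_, ipnet, err := net.ParseCIDR(c)
		if err != nil {
			return nil, fmt.Errorf("invalid CIDR %q: %w", c, err)
		}
		list = append(list, ipnet)
	}
	return list, nil
}

// isDenied reports whether addr parses as an IP inside one of the networks.
// Assumption: this stands in for the client's own check.
func isDenied(list []*net.IPNet, addr string) bool {
	ip := net.ParseIP(addr)
	if ip == nil {
		return false
	}
	for _, n := range list {
		if n.Contains(ip) {
			return true
		}
	}
	return false
}

func main() {
	// A single address is written as a /32 (IPv4) or /128 (IPv6) network.
	denied, err := parseDenied("127.0.0.0/8", "10.0.0.0/8", "192.168.1.10/32", "::1/128")
	if err != nil {
		panic(err)
	}
	for _, addr := range []string{"10.1.2.3", "192.168.1.10", "192.168.1.11", "::1"} {
		fmt.Printf("%-14s denied=%v\n", addr, isDenied(denied, addr))
	}
}
```

The resulting slice could then be passed as an option, for example extract.New(src, extract.SetDeniedIPs(denied)).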

func SetHeader

func SetHeader(client *http.Client, name, value string)

SetHeader sets a header on a given client.

func SetLogFields

func SetLogFields(f *log.Fields) func(e *Extractor)

SetLogFields sets the default log fields for the extractor.

func SetProxyList

func SetProxyList(list []ProxyMatcher) func(e *Extractor)

SetProxyList adds a new proxy dispatcher function to the HTTP transport.

Types

type Drop

type Drop struct {
	URL          *url.URL
	Domain       string
	ContentType  string
	Charset      string
	DocumentType string

	Title         string
	Description   string
	Authors       []string
	Site          string
	Lang          string
	TextDirection string
	Date          time.Time

	Header     http.Header
	Meta       DropMeta
	Properties DropProperties
	Body       []byte `json:"-"`

	Pictures map[string]*Picture
}

Drop is the result of a content extraction of one resource.

func NewDrop

func NewDrop(src *url.URL) *Drop

NewDrop returns a Drop instance.

func (*Drop) AddAuthors

func (d *Drop) AddAuthors(values ...string)

AddAuthors adds authors to the author list, ignoring potential duplicates.
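
The deduplicating append described above can be sketched as follows. This is an illustration under assumptions, not the package's implementation: addAuthors is a hypothetical free function, and it compares names by exact string equality, whereas the real method may normalize them first.

```go
package main

import "fmt"

// addAuthors appends values to authors, skipping names already present.
// Assumption: duplicates are detected by exact string equality.
func addAuthors(authors []string, values ...string) []string {
	seen := make(map[string]struct{}, len(authors))
	for _, a := range authors {
		seen[a] = struct{}{}
	}
	for _, v := range values {
		if _, ok := seen[v]; ok {
			continue
		}
		seen[v] = struct{}{}
		authors = append(authors, v)
	}
	return authors
}

func main() {
	authors := []string{"Ada"}
	authors = addAuthors(authors, "Ada", "Grace", "Grace")
	fmt.Println(authors)
}
```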

func (*Drop) IsHTML

func (d *Drop) IsHTML() bool

IsHTML returns true when the resource is of type HTML.

func (*Drop) IsMedia

func (d *Drop) IsMedia() bool

IsMedia returns true when the document type is a media type.

func (*Drop) Load

func (d *Drop) Load(client *http.Client) error

Load loads the remote URL and retrieves its data.

func (*Drop) SetURL

func (d *Drop) SetURL(src *url.URL)

SetURL sets the Drop's URL and Domain properties in their Unicode forms.

func (*Drop) UnescapedURL

func (d *Drop) UnescapedURL() string

UnescapedURL returns the Drop's URL unescaped, for storage.

type DropMeta

type DropMeta map[string][]string

DropMeta is a map of string lists that contains the collected metadata.

func (DropMeta) Add

func (m DropMeta) Add(name, value string)

Add adds a value to the raw metadata list.

func (DropMeta) Lookup

func (m DropMeta) Lookup(names ...string) []string

Lookup returns all the found values for the provided metadata names.

func (DropMeta) LookupGet

func (m DropMeta) LookupGet(names ...string) string

LookupGet returns the first value found for the provided metadata names.
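
A small sketch of how the three methods could fit together on a map[string][]string. The local type meta is a stand-in, the metadata names are made up, and the lookup order is an assumption: names are tried in the order given and the first one that has values wins.

```go
package main

import "fmt"

// meta is a local stand-in with the same underlying type as DropMeta.
type meta map[string][]string

// Add appends value to the list stored under name.
func (m meta) Add(name, value string) { m[name] = append(m[name], value) }

// Lookup returns the values of the first name that has any.
// Assumption: names act as an ordered list of fallbacks.
func (m meta) Lookup(names ...string) []string {
	for _, n := range names {
		if values := m[n]; len(values) > 0 {
			return values
		}
	}
	return nil
}

// LookupGet returns the first value found, or "" when nothing matches.
func (m meta) LookupGet(names ...string) string {
	if values := m.Lookup(names...); len(values) > 0 {
		return values[0]
	}
	return ""
}

func main() {
	m := meta{}
	// The key names below are hypothetical examples.
	m.Add("html.title", "Fallback title")
	m.Add("graph.title", "Preferred title")
	m.Add("html.author", "Ada")
	m.Add("html.author", "Grace")

	fmt.Println(m.LookupGet("graph.title", "html.title"))
	fmt.Println(m.Lookup("html.author"))
	fmt.Printf("%q\n", m.LookupGet("missing"))
}
```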

type DropProperties

type DropProperties map[string]any

DropProperties contains the raw properties of an extracted page.

type Error

type Error []error

Error holds all the non-fatal errors that were caught during extraction.

func (Error) Error

func (e Error) Error() string
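
A slice of errors satisfies the error interface once it has an Error method, which is what lets a list of non-fatal errors be passed around as one value. The sketch below uses a local stand-in type, errorList; the "; " separator is an assumption, since the page does not say how the messages are joined.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// errorList is a local stand-in for a []error type with an Error method.
type errorList []error

// Error joins every message. Assumption: the separator is "; ".
func (e errorList) Error() string {
	parts := make([]string, len(e))
	for i, err := range e {
		parts[i] = err.Error()
	}
	return strings.Join(parts, "; ")
}

// collect builds a list from plain messages, for demonstration.
func collect(msgs ...string) errorList {
	var list errorList
	for _, m := range msgs {
		list = append(list, errors.New(m))
	}
	return list
}

func main() {
	errs := collect("no title found", "picture too small")
	// Check the length before converting: an empty errorList stored in an
	// error interface is still a non-nil interface value.
	if len(errs) > 0 {
		var err error = errs
		fmt.Println(err)
	}
}
```

With a slice-based error type, testing len(errs) > 0 is safer than comparing a converted error value against nil.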

type Extractor

type Extractor struct {
	URL       *url.URL
	HTML      []byte
	Text      string
	Visited   URLList
	Logs      []string
	Context   context.Context
	LogFields *log.Fields
	// contains filtered or unexported fields
}

Extractor is a page extractor.

func New

func New(src string, options ...func(e *Extractor)) (*Extractor, error)

New returns an Extractor instance for a given URL, with a default HTTP client.
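
The options parameter follows Go's functional-options pattern: each option is a function that mutates the extractor during construction, which is the shape SetDeniedIPs, SetLogFields and SetProxyList return. The sketch below shows the pattern with local stand-ins only; extractor, option, setLogField and newExtractor are hypothetical names, and the real constructor does more than this.

```go
package main

import (
	"fmt"
	"net/url"
)

// extractor is a reduced, hypothetical stand-in for the real type.
type extractor struct {
	url       *url.URL
	logFields map[string]string
}

// option mirrors the func(e *Extractor) shape used by the package.
type option func(*extractor)

// setLogField is an illustrative option that records one log field.
func setLogField(key, value string) option {
	return func(e *extractor) { e.logFields[key] = value }
}

// newExtractor parses src, then applies every option in order.
func newExtractor(src string, options ...option) (*extractor, error) {
	u, err := url.Parse(src)
	if err != nil {
		return nil, err
	}
	e := &extractor{url: u, logFields: map[string]string{}}
	for _, opt := range options {
		opt(e)
	}
	return e, nil
}

func main() {
	e, err := newExtractor("https://example.org/article", setLogField("job", "demo"))
	if err != nil {
		panic(err)
	}
	fmt.Println(e.url.Host, e.logFields["job"])
}
```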

func (*Extractor) AddDrop

func (e *Extractor) AddDrop(src *url.URL)

AddDrop adds a new Drop to the drop list.

func (*Extractor) AddError

func (e *Extractor) AddError(err error)

AddError adds a new error to the extractor's error list.

func (*Extractor) AddProcessors

func (e *Extractor) AddProcessors(p ...Processor)

AddProcessors adds extract processor(s) to the list.

func (*Extractor) AddToCache

func (e *Extractor) AddToCache(url string, headers map[string]string, body []byte)

AddToCache adds a resource to the extractor's resource cache. The cache will be used by the HTTP client during its round trip.

func (*Extractor) Client

func (e *Extractor) Client() *http.Client

Client returns the extractor's HTTP client.

func (*Extractor) Drop

func (e *Extractor) Drop() *Drop

Drop returns the extractor's first drop, when there is one.

func (*Extractor) Drops

func (e *Extractor) Drops() []*Drop

Drops returns the extractor's drop list.

func (*Extractor) Errors

func (e *Extractor) Errors() Error

Errors returns the extractor's error list.

func (*Extractor) GetLogger

func (e *Extractor) GetLogger() *log.Logger

GetLogger returns a logger for the extractor. This standard logger copies everything to the extractor's Logs slice.

func (*Extractor) IsInCache

func (e *Extractor) IsInCache(url string) bool

IsInCache returns true if a given URL is present in the resource cache mapping.

func (*Extractor) NewProcessMessage

func (e *Extractor) NewProcessMessage(step ProcessStep) *ProcessMessage

NewProcessMessage returns a new ProcessMessage for a given step.

func (*Extractor) ReplaceDrop

func (e *Extractor) ReplaceDrop(src *url.URL) error

ReplaceDrop replaces the main Drop with a new one.

func (*Extractor) Run

func (e *Extractor) Run()

Run starts the extraction process.

type Picture

type Picture struct {
	Href string
	Type string
	Size [2]int
	// contains filtered or unexported fields
}

Picture is a remote picture.

func NewPicture

func NewPicture(src string, base *url.URL) (*Picture, error)

NewPicture returns a new Picture instance from a given URL and its base.

func (*Picture) Bytes

func (p *Picture) Bytes() []byte

Bytes returns the image data.

func (*Picture) Copy

func (p *Picture) Copy(size uint, toFormat string) (*Picture, error)

Copy returns a resized copy of the image, as a new Picture instance.

func (*Picture) Encoded

func (p *Picture) Encoded() string

Encoded returns a base64 encoded string of the image.

func (*Picture) Load

func (p *Picture) Load(client *http.Client, size uint, toFormat string) error

Load loads the remote image and fits it within the given size bounds.

func (*Picture) Name

func (p *Picture) Name(name string) string

Name returns the given name of the picture with the correct extension.

type ProcessList

type ProcessList []Processor

ProcessList holds the processes that will be applied.

type ProcessMessage

type ProcessMessage struct {
	Extractor *Extractor
	Log       *log.Entry
	Dom       *html.Node
	// contains filtered or unexported fields
}

ProcessMessage holds the process message that is passed (and changed) by the subsequent processes.

func (*ProcessMessage) Cancel

func (m *ProcessMessage) Cancel(reason string, args ...interface{})

Cancel fully cancels the extraction process.

func (*ProcessMessage) Position

func (m *ProcessMessage) Position() int

Position returns the current process position.

func (*ProcessMessage) ResetContent

func (m *ProcessMessage) ResetContent()

ResetContent empties the message's Dom and the body of every drop.

func (*ProcessMessage) ResetPosition

func (m *ProcessMessage) ResetPosition()

ResetPosition lets the process start over (normally with a new URL). It holds a counter and cancels everything after too many resets (defined by maxReset).
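
The reset counter described above can be sketched like this. Everything here is an assumption made for illustration: the message type is a local stand-in, the limit is stored in a field rather than a package-level maxReset, and the page does not document the actual limit or the position a reset returns to.

```go
package main

import "fmt"

// message is a hypothetical stand-in for the relevant parts of ProcessMessage.
type message struct {
	position int
	resets   int
	maxReset int // assumed to be configurable here; the real value is not documented
	canceled bool
	reason   string
}

// Cancel marks the whole process as canceled.
func (m *message) Cancel(reason string) {
	m.canceled = true
	m.reason = reason
}

// ResetPosition rewinds the process, or cancels after too many resets.
func (m *message) ResetPosition() {
	if m.resets >= m.maxReset {
		m.Cancel("too many resets")
		return
	}
	m.resets++
	m.position = 0
}

// resetsBeforeCancel counts how many resets succeed before cancellation.
func resetsBeforeCancel(limit int) int {
	m := &message{maxReset: limit}
	n := 0
	for !m.canceled {
		m.position = 3 // pretend a few processors already ran
		m.ResetPosition()
		if !m.canceled {
			n++
		}
	}
	return n
}

func main() {
	fmt.Println(resetsBeforeCancel(3))
}
```

A cap of this kind keeps a redirect loop between processors from running forever.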

func (*ProcessMessage) Step

func (m *ProcessMessage) Step() ProcessStep

Step returns the current process step.

type ProcessStep

type ProcessStep int

ProcessStep defines a type of process applied during extraction.

const (
	// StepStart happens before the connection is made.
	StepStart ProcessStep = iota + 1

	// StepBody happens after receiving the resource body.
	StepBody

	// StepDom happens after parsing the resource DOM tree.
	StepDom

	// StepFinish happens at the very end of the extraction.
	StepFinish

	// StepPostProcess happens after looping over each Drop.
	StepPostProcess

	// StepDone is always called at the very end of the extraction.
	StepDone
)
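
Because the constants start at iota + 1, StepStart is 1 and StepDone is 6, so the zero value of the type is not a valid step. The sketch below redeclares the constants on a local stand-in type to show the numbering; the String method is an illustrative addition, not something the page documents.

```go
package main

import "fmt"

// step is a local stand-in for ProcessStep.
type step int

const (
	stepStart step = iota + 1 // 1
	stepBody                  // 2
	stepDom                   // 3
	stepFinish                // 4
	stepPostProcess           // 5
	stepDone                  // 6
)

// stepNames maps each value to the documented constant name.
var stepNames = map[step]string{
	stepStart:       "StepStart",
	stepBody:        "StepBody",
	stepDom:         "StepDom",
	stepFinish:      "StepFinish",
	stepPostProcess: "StepPostProcess",
	stepDone:        "StepDone",
}

// String is a hypothetical helper for readable log output.
func (s step) String() string {
	if name, ok := stepNames[s]; ok {
		return name
	}
	return fmt.Sprintf("step(%d)", int(s))
}

func main() {
	for s := stepStart; s <= stepDone; s++ {
		fmt.Printf("%d %s\n", int(s), s)
	}
	var zero step
	fmt.Println(zero) // the zero value matches no declared step
}
```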

type Processor

type Processor func(*ProcessMessage, Processor) Processor

Processor is the process function.
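
The page gives only the signature, so the following is one plausible reading of it rather than the package's actual runner: each processor receives the message and the next processor in the list, and returns the processor to run afterwards, with nil ending the chain. All types and helpers (step, message, processor, note, haltAt, run) are local stand-ins invented for this sketch.

```go
package main

import (
	"fmt"
	"strings"
)

type step int

const (
	stepStart step = iota + 1
	stepBody
	stepDom
)

// message carries the current step and a trace of what ran.
type message struct {
	step  step
	trace []string
}

// processor mirrors the documented shape func(*ProcessMessage, Processor) Processor.
type processor func(*message, processor) processor

// note records name when the message is at the given step, then continues.
func note(name string, only step) processor {
	return func(m *message, next processor) processor {
		if m.step == only {
			m.trace = append(m.trace, name)
		}
		return next
	}
}

// haltAt ends the chain by returning nil when the message is at step s.
func haltAt(s step) processor {
	return func(m *message, next processor) processor {
		if m.step == s {
			return nil
		}
		return next
	}
}

// run walks the list until a processor returns nil (assumed semantics).
func run(m *message, list []processor) {
	if len(list) == 0 {
		return
	}
	p := list[0]
	for i := 1; p != nil; i++ {
		var next processor
		if i < len(list) {
			next = list[i]
		}
		p = p(m, next)
	}
}

// traceFor runs a fixed demo chain for one step and returns what was noted.
func traceFor(s step) string {
	m := &message{step: s}
	run(m, []processor{
		note("fetch", stepStart),
		haltAt(stepBody),
		note("title", stepDom),
		note("links", stepDom),
	})
	return strings.Join(m.trace, ",")
}

func main() {
	fmt.Printf("start=%q body=%q dom=%q\n",
		traceFor(stepStart), traceFor(stepBody), traceFor(stepDom))
}
```

Under this reading, a processor that wants to act at a single step checks the message's step and otherwise hands control straight to next.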

type ProxyMatcher

type ProxyMatcher interface {
	// Returns the matching host
	Host() string
	// Returns the proxy URL
	URL() *url.URL
}

ProxyMatcher describes a mapping of host/url for proxy dispatch.
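
A sketch of how a host-to-proxy mapping can drive proxy selection through the standard http.Transport.Proxy hook. The interface is redeclared locally so the example is self-contained; hostProxy, proxyFunc and proxyFor are hypothetical names, the proxy address is made up, and matching by exact hostname is an assumption (the package may match hosts differently).

```go
package main

import (
	"fmt"
	"net/http"
	"net/url"
)

// proxyMatcher repeats the documented two-method interface.
type proxyMatcher interface {
	Host() string
	URL() *url.URL
}

// hostProxy is an illustrative implementation pairing a host with a proxy.
type hostProxy struct {
	host  string
	proxy *url.URL
}

func (p hostProxy) Host() string  { return p.host }
func (p hostProxy) URL() *url.URL { return p.proxy }

// proxyFunc builds a function usable as http.Transport.Proxy.
// Assumption: a matcher applies when its host equals the request hostname.
func proxyFunc(list []proxyMatcher) func(*http.Request) (*url.URL, error) {
	return func(r *http.Request) (*url.URL, error) {
		for _, m := range list {
			if m.Host() == r.URL.Hostname() {
				return m.URL(), nil
			}
		}
		return nil, nil // nil URL means a direct connection
	}
}

// proxyFor returns the proxy chosen for target, or "" for none.
func proxyFor(target string) string {
	proxy, err := url.Parse("http://proxy.internal:3128") // made-up address
	if err != nil {
		return ""
	}
	pick := proxyFunc([]proxyMatcher{hostProxy{host: "example.org", proxy: proxy}})
	req, err := http.NewRequest(http.MethodGet, target, nil)
	if err != nil {
		return ""
	}
	u, err := pick(req)
	if err != nil || u == nil {
		return ""
	}
	return u.String()
}

func main() {
	fmt.Printf("%q\n", proxyFor("https://example.org/page"))
	fmt.Printf("%q\n", proxyFor("https://example.com/"))
}
```

In a plain net/http setup the same function would be assigned as &http.Transport{Proxy: proxyFunc(list)}.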

type Transport

type Transport struct {
	// contains filtered or unexported fields
}

Transport is a wrapper around http.RoundTripper that lets you set default headers sent with every request.
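
The general technique of wrapping an http.RoundTripper to send default headers can be sketched with the standard library alone. This is not the package's Transport: headerTransport, roundTripFunc and sentUserAgent are hypothetical names, the fake base transport avoids any network access, and the choice to let headers already set on the request take precedence is an assumption.

```go
package main

import (
	"fmt"
	"net/http"
)

// roundTripFunc adapts a function to http.RoundTripper (used as a fake).
type roundTripFunc func(*http.Request) (*http.Response, error)

func (f roundTripFunc) RoundTrip(r *http.Request) (*http.Response, error) { return f(r) }

// headerTransport is an illustrative stand-in, not the package's Transport.
type headerTransport struct {
	base     http.RoundTripper
	defaults http.Header
}

// RoundTrip adds each default header the request does not already carry.
func (t *headerTransport) RoundTrip(req *http.Request) (*http.Response, error) {
	// A RoundTripper must not modify the caller's request, so clone it.
	r := req.Clone(req.Context())
	for name, values := range t.defaults {
		if _, ok := r.Header[name]; !ok {
			r.Header[name] = append([]string(nil), values...)
		}
	}
	return t.base.RoundTrip(r)
}

// sentUserAgent returns the User-Agent that reaches the base transport.
func sentUserAgent(defaultUA, explicitUA string) (string, error) {
	var seen string
	fake := roundTripFunc(func(r *http.Request) (*http.Response, error) {
		seen = r.Header.Get("User-Agent")
		return &http.Response{
			StatusCode: http.StatusOK,
			Header:     http.Header{},
			Body:       http.NoBody,
			Request:    r,
		}, nil
	})
	defaults := http.Header{}
	defaults.Set("User-Agent", defaultUA)
	client := &http.Client{Transport: &headerTransport{base: fake, defaults: defaults}}

	req, err := http.NewRequest(http.MethodGet, "http://example.invalid/", nil)
	if err != nil {
		return "", err
	}
	if explicitUA != "" {
		req.Header.Set("User-Agent", explicitUA)
	}
	resp, err := client.Do(req)
	if err != nil {
		return "", err
	}
	resp.Body.Close()
	return seen, nil
}

func main() {
	ua, err := sentUserAgent("extract-sketch/1.0", "")
	if err != nil {
		panic(err)
	}
	fmt.Println(ua)
}
```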

func (*Transport) RoundTrip

func (t *Transport) RoundTrip(req *http.Request) (*http.Response, error)

RoundTrip is the transport interceptor.

func (*Transport) SetRoundTripper

func (t *Transport) SetRoundTripper(f transportCache)

SetRoundTripper sets an extra transport's round trip function.

type URLList

type URLList map[string]bool

URLList holds a list of URLs.

func (URLList) Add

func (l URLList) Add(v *url.URL)

Add adds a new URL to the list.

func (URLList) IsPresent

func (l URLList) IsPresent(v *url.URL) bool

IsPresent returns true if the given URL is in the list.
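
A map[string]bool works as a simple set of visited URLs. In the sketch below, urlSet is a local stand-in and keying entries by the URL's String() form is an assumption; the package may normalize URLs (for example by dropping the fragment) before storing them.

```go
package main

import (
	"fmt"
	"net/url"
)

// urlSet is a local stand-in with the same underlying type as URLList.
type urlSet map[string]bool

// Add records v. Assumption: the key is the URL's string form.
func (l urlSet) Add(v *url.URL) { l[v.String()] = true }

// IsPresent reports whether v was added before.
func (l urlSet) IsPresent(v *url.URL) bool { return l[v.String()] }

// seen adds one URL, then probes for another; it is a demo helper.
func seen(added, probe string) bool {
	a, err := url.Parse(added)
	if err != nil {
		return false
	}
	p, err := url.Parse(probe)
	if err != nil {
		return false
	}
	visited := urlSet{}
	visited.Add(a)
	return visited.IsPresent(p)
}

func main() {
	fmt.Println(seen("https://example.org/a", "https://example.org/a"))
	fmt.Println(seen("https://example.org/a", "https://example.org/b"))
}
```

A set like this is what lets an extractor notice that a redirect or a reset has led back to a URL it already processed.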

Directories

Path Synopsis
Package contents provides extraction processes for content processing (readability) and plain text conversion.
Package contentscripts provides a JavaScript engine that runs built-in or user-defined scripts during the extraction process.
Package meta provides extract processors to retrieve meta information from a page (meta tags, favicon, pictures...).
Package srcset is a srcset value parser.
