Documentation ¶
Overview ¶
Package extract is a content extractor for HTML pages. It works by running processors that are triggered at one or several steps of the extraction process.
Index ¶
- func NewClient() *http.Client
- func NewRemoteImage(src string, client *http.Client) (img.Image, error)
- func SetDeniedIPs(netList []*net.IPNet) func(e *Extractor)
- func SetHeader(client *http.Client, name, value string)
- func SetLogFields(f *log.Fields) func(e *Extractor)
- func SetProxyList(list []ProxyMatcher) func(e *Extractor)
- type Drop
- type DropMeta
- type DropProperties
- type Error
- type Extractor
- func (e *Extractor) AddDrop(src *url.URL)
- func (e *Extractor) AddError(err error)
- func (e *Extractor) AddProcessors(p ...Processor)
- func (e *Extractor) AddToCache(url string, headers map[string]string, body []byte)
- func (e *Extractor) Client() *http.Client
- func (e *Extractor) Drop() *Drop
- func (e *Extractor) Drops() []*Drop
- func (e *Extractor) Errors() Error
- func (e *Extractor) GetLogger() *log.Logger
- func (e *Extractor) IsInCache(url string) bool
- func (e *Extractor) NewProcessMessage(step ProcessStep) *ProcessMessage
- func (e *Extractor) ReplaceDrop(src *url.URL) error
- func (e *Extractor) Run()
- type Picture
- type ProcessList
- type ProcessMessage
- type ProcessStep
- type Processor
- type ProxyMatcher
- type Transport
- type URLList
Constants ¶
This section is empty.
Variables ¶
This section is empty.
Functions ¶
func NewRemoteImage ¶
NewRemoteImage loads an image and returns a new img.Image instance.
func SetDeniedIPs ¶
SetDeniedIPs sets a list of IP addresses or CIDR ranges that the extraction client is not allowed to reach.
func SetLogFields ¶
SetLogFields sets the default log fields for the extractor.
func SetProxyList ¶
func SetProxyList(list []ProxyMatcher) func(e *Extractor)
SetProxyList adds a new proxy dispatcher function to the HTTP transport.
Types ¶
type Drop ¶
type Drop struct {
	URL           *url.URL
	Domain        string
	ContentType   string
	Charset       string
	DocumentType  string
	Title         string
	Description   string
	Authors       []string
	Site          string
	Lang          string
	TextDirection string
	Date          time.Time
	Header        http.Header
	Meta          DropMeta
	Properties    DropProperties
	Body          []byte `json:"-"`
	Pictures      map[string]*Picture
}
Drop is the result of a content extraction of one resource.
func (*Drop) AddAuthors ¶
AddAuthors adds authors to the author list, ignoring potential duplicates.
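The duplicate handling can be pictured with the following minimal sketch. The function `addAuthors` is a hypothetical stand-in working on a plain slice, and it assumes duplicates are detected by exact string comparison; the package may normalize names differently.

```go
package main

import "fmt"

// addAuthors appends each name to list unless an identical name is
// already present. Exact string equality is an assumption of this sketch.
func addAuthors(list []string, names ...string) []string {
	for _, name := range names {
		found := false
		for _, existing := range list {
			if existing == name {
				found = true
				break
			}
		}
		if !found {
			list = append(list, name)
		}
	}
	return list
}

func main() {
	authors := []string{"Ada Lovelace"}
	authors = addAuthors(authors, "Ada Lovelace", "Grace Hopper", "Grace Hopper")
	fmt.Println(len(authors), authors)
}
```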
func (*Drop) UnescapedURL ¶
UnescapedURL returns the Drop's URL unescaped, for storage.
type DropMeta ¶
DropMeta is a map of lists of strings that contains the collected metadata.
type DropProperties ¶
DropProperties contains the raw properties of an extracted page.
type Error ¶
type Error []error
Error holds all the non-fatal errors that were caught during extraction.
type Extractor ¶
type Extractor struct {
	URL       *url.URL
	HTML      []byte
	Text      string
	Visited   URLList
	Logs      []string
	Context   context.Context
	LogFields *log.Fields
	// contains filtered or unexported fields
}
Extractor is a page extractor.
func (*Extractor) AddProcessors ¶
AddProcessors adds extract processor(s) to the list.
func (*Extractor) AddToCache ¶
AddToCache adds a resource to the extractor's resource cache. The cache will be used by the HTTP client during its round trip.
func (*Extractor) GetLogger ¶
GetLogger returns a logger for the extractor. This standard logger copies everything it logs to the extractor's Logs slice.
func (*Extractor) IsInCache ¶
IsInCache returns true if a given URL is present in the resource cache mapping.
func (*Extractor) NewProcessMessage ¶
func (e *Extractor) NewProcessMessage(step ProcessStep) *ProcessMessage
NewProcessMessage returns a new ProcessMessage for a given step.
func (*Extractor) ReplaceDrop ¶
ReplaceDrop replaces the main Drop with a new one.
type Picture ¶
type Picture struct {
	Href string
	Type string
	Size [2]int
	// contains filtered or unexported fields
}
Picture is a remote picture.
func NewPicture ¶
NewPicture returns a new Picture instance from a given URL and its base.
type ProcessList ¶
type ProcessList []Processor
ProcessList holds the processes that will be applied.
type ProcessMessage ¶
type ProcessMessage struct {
	Extractor *Extractor
	Log       *log.Entry
	Dom       *html.Node
	// contains filtered or unexported fields
}
ProcessMessage holds the process message that is passed (and changed) by the subsequent processes.
func (*ProcessMessage) Cancel ¶
func (m *ProcessMessage) Cancel(reason string, args ...interface{})
Cancel fully cancels the extraction process.
func (*ProcessMessage) Position ¶
func (m *ProcessMessage) Position() int
Position returns the current process position.
func (*ProcessMessage) ResetContent ¶
func (m *ProcessMessage) ResetContent()
ResetContent empties the message Dom and the body of every Drop.
func (*ProcessMessage) ResetPosition ¶
func (m *ProcessMessage) ResetPosition()
ResetPosition lets the process start over (normally with a new URL). It holds a counter and cancels everything after too many resets (defined by maxReset).
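The reset counter described above can be sketched as follows. The `message` type is a simplified stand-in for ProcessMessage, and the value given to `maxReset` is an assumption: the documentation names the constant but does not state its value.

```go
package main

import "fmt"

// maxReset is the number of resets tolerated before giving up.
// The value 10 is an assumption made for this sketch.
const maxReset = 10

// message is a simplified, hypothetical stand-in for ProcessMessage.
type message struct {
	position   int
	resetCount int
	canceled   bool
}

// resetPosition rewinds the process to its first position and cancels
// the whole process once the reset limit is exceeded.
func (m *message) resetPosition() {
	m.resetCount++
	if m.resetCount > maxReset {
		m.canceled = true
		return
	}
	m.position = 0
}

func main() {
	m := &message{position: 4}
	for i := 0; i <= maxReset; i++ {
		m.resetPosition()
	}
	fmt.Println(m.resetCount, m.canceled)
}
```

A counter like this protects against redirect-style loops, where each processor run triggers yet another restart with a new URL.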
func (*ProcessMessage) Step ¶
func (m *ProcessMessage) Step() ProcessStep
Step returns the current process step.
type ProcessStep ¶
type ProcessStep int
ProcessStep defines a type of process applied during extraction.
const (
	// StepStart happens before the connection is made.
	StepStart ProcessStep = iota + 1
	// StepBody happens after receiving the resource body.
	StepBody
	// StepDom happens after parsing the resource DOM tree.
	StepDom
	// StepFinish happens at the very end of the extraction.
	StepFinish
	// StepPostProcess happens after looping over each Drop.
	StepPostProcess
	// StepDone is always called at the very end of the extraction.
	StepDone
)
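Because the block starts at `iota + 1`, the steps take the values 1 through 6 and the zero value of ProcessStep matches no step. The sketch below mirrors the declaration in a standalone program to show this; `stepName` is a hypothetical helper added for printing and is not part of the package.

```go
package main

import "fmt"

// ProcessStep and its constants are redeclared here, in the same order
// as the package, so the sketch is self-contained.
type ProcessStep int

const (
	StepStart ProcessStep = iota + 1
	StepBody
	StepDom
	StepFinish
	StepPostProcess
	StepDone
)

// stepName returns a short label for a step. Hypothetical helper.
func stepName(s ProcessStep) string {
	switch s {
	case StepStart:
		return "start"
	case StepBody:
		return "body"
	case StepDom:
		return "dom"
	case StepFinish:
		return "finish"
	case StepPostProcess:
		return "postprocess"
	case StepDone:
		return "done"
	default:
		return "unknown"
	}
}

func main() {
	for s := StepStart; s <= StepDone; s++ {
		fmt.Printf("%d %s\n", s, stepName(s))
	}
}
```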
type Processor ¶
type Processor func(*ProcessMessage, Processor) Processor
Processor is the process function.
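The signature suggests a chain in which each processor receives the message plus the processor that follows it, and returns the processor to run next (or nil to stop). The sketch below illustrates that reading with local stand-in types; `message`, `processor`, `runChain`, `record`, and `halt` are all hypothetical, and the package's actual run loop may differ.

```go
package main

import "fmt"

type step int

const (
	stepStart step = iota + 1
	stepBody
	stepDom
)

// message is a simplified stand-in for ProcessMessage.
type message struct {
	step step
	log  []string
}

// processor mirrors the shape of the Processor type.
type processor func(*message, processor) processor

// runChain calls processors in order, handing each one its successor,
// and stops as soon as a processor returns nil.
func runChain(m *message, list []processor) {
	if len(list) == 0 {
		return
	}
	p := list[0]
	for i := 1; p != nil; i++ {
		var next processor
		if i < len(list) {
			next = list[i]
		}
		p = p(m, next)
	}
}

// record builds a processor that only acts on one step and then
// passes control to the next processor.
func record(name string, on step) processor {
	return func(m *message, next processor) processor {
		if m.step == on {
			m.log = append(m.log, name)
		}
		return next
	}
}

// halt stops the chain by returning nil instead of next.
func halt(m *message, next processor) processor {
	return nil
}

func main() {
	m := &message{step: stepDom}
	runChain(m, []processor{
		record("meta", stepDom),
		record("fetch", stepStart), // skipped: wrong step
		record("text", stepDom),
	})
	fmt.Println(m.log)
}
```

Under this reading, a processor checks the current step first and returns `next` untouched when the step is not one it handles, which is how one processor can be triggered at several steps.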
type ProxyMatcher ¶
type ProxyMatcher interface {
	// Returns the matching host
	Host() string
	// Returns the proxy URL
	URL() *url.URL
}
ProxyMatcher describes a mapping of host/url for proxy dispatch.
type Transport ¶
type Transport struct {
// contains filtered or unexported fields
}
Transport is a wrapper around http.RoundTripper that lets you set default headers sent with every request.
func (*Transport) SetRoundTripper ¶
func (t *Transport) SetRoundTripper(f transportCache)
SetRoundTripper sets an extra transport's round trip function.
Directories ¶
Path | Synopsis |
---|---|
contents | Package contents provides extraction processes for content processing (readability) and plain text conversion. |
contentscripts | Package contentscripts provides a JavaScript engine that runs built-in or user-defined scripts during the extraction process. |
meta | Package meta provides extract processors to retrieve meta information from a page (meta tags, favicon, pictures...). |
srcset | Package srcset is an srcset value parser. |