arts

package
v0.0.0-...-e72d39b
Warning

This package is not in the latest version of its module.

Published: Aug 14, 2018 License: AGPL-3.0 Imports: 24 Imported by: 4

Documentation

Constants

This section is empty.

Variables

var Debug = struct {
	// HeadlineLogger is where debug output from the headline extraction will be sent
	HeadlineLogger *log.Logger
	// AuthorsLogger is where debug output from the author extraction will be sent
	AuthorsLogger *log.Logger
	// ContentLogger is where debug output from the content extraction will be sent
	ContentLogger *log.Logger
	// DatesLogger is where debug output from the pubdate/lastupdated extraction will be sent
	DatesLogger *log.Logger

	// URLLogger is where debug output from URL extraction will be sent (rel-canonical etc)
	URLLogger *log.Logger

	// CruftLogger is where debug output from cruft classification will be sent (adverts/social/sidebars etc)
	CruftLogger *log.Logger
}{
	nullLogger,
	nullLogger,
	nullLogger,
	nullLogger,
	nullLogger,
	nullLogger,
}

Debug is the global debug control for the scraper. Set up any loggers you want before calling Extract(). By default, all logging is suppressed.

Functions

func ContainedCandidates

func ContainedCandidates(container *html.Node, candidates candidateList) candidateList

ContainedCandidates returns any candidates within container (including container itself).

func ParseHTML

func ParseHTML(rawHTML []byte) (*html.Node, error)

Types

type Article

type Article struct {
	CanonicalURL string `json:"canonical_url,omitempty"`
	// all known URLs for article (including canonical)
	// TODO: first url should be considered "preferred" if no canonical?
	URLs     []string `json:"urls,omitempty"`
	Headline string   `json:"headline,omitempty"`
	Authors  []Author `json:"authors,omitempty"`
	Content  string   `json:"content,omitempty"`
	// Published contains date of publication.
	// An ISO8601 string is used instead of time.Time, so that
	// less-precise representations can be held (eg YYYY-MM)
	Published   string      `json:"published,omitempty"`
	Updated     string      `json:"updated,omitempty"`
	Publication Publication `json:"publication,omitempty"`
	Keywords    []Keyword   `json:"keywords,omitempty"`
	Section     string      `json:"section,omitempty"`
}
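To show what the struct tags above produce, here is a minimal sketch that copies the documented types locally and marshals a sample Article to JSON. The sample values and the helper encode are made up for illustration. Note that Published holds a year-month string, which is the less-precise form the comment describes, and that encoding/json's omitempty never omits a struct value, so an empty Publication is still emitted as {}.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Local copies of the documented types, so the sketch runs on its own.
type Author struct {
	Name    string `json:"name"`
	RelLink string `json:"rellink,omitempty"`
	Email   string `json:"email,omitempty"`
	Twitter string `json:"twitter,omitempty"`
}

type Keyword struct {
	Name string `json:"name"`
	URL  string `json:"url,omitempty"`
}

type Publication struct {
	Name   string `json:"name,omitempty"`
	Domain string `json:"domain,omitempty"`
}

type Article struct {
	CanonicalURL string      `json:"canonical_url,omitempty"`
	URLs         []string    `json:"urls,omitempty"`
	Headline     string      `json:"headline,omitempty"`
	Authors      []Author    `json:"authors,omitempty"`
	Content      string      `json:"content,omitempty"`
	Published    string      `json:"published,omitempty"`
	Updated      string      `json:"updated,omitempty"`
	Publication  Publication `json:"publication,omitempty"`
	Keywords     []Keyword   `json:"keywords,omitempty"`
	Section      string      `json:"section,omitempty"`
}

// encode (hypothetical helper) marshals an Article to a JSON string.
func encode(art Article) (string, error) {
	b, err := json.Marshal(art)
	if err != nil {
		return "", err
	}
	return string(b), nil
}

func main() {
	art := Article{
		Headline:  "Example headline",
		Authors:   []Author{{Name: "A. Writer"}},
		Published: "2018-08", // year and month only
	}
	s, err := encode(art)
	if err != nil {
		fmt.Println("encode failed:", err)
		return
	}
	fmt.Println(s)
	// {"headline":"Example headline","authors":[{"name":"A. Writer"}],"published":"2018-08","publication":{}}
}
```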

func Extract

func Extract(client *http.Client, srcURL string) (*Article, error)

TODO (from the source): delete this and leave it up to the user?

func ExtractFromHTML

func ExtractFromHTML(rawHTML []byte, artURL string) (*Article, error)

func ExtractFromTree

func ExtractFromTree(root *html.Node, artURL string) (*Article, error)

func (*Article) BestURL

func (art *Article) BestURL() string
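BestURL is not documented here. One plausible reading, suggested by the TODO comment on the URLs field, is "canonical URL if present, otherwise the first known URL". The sketch below implements that reading as a hypothetical helper named preferredURL on a trimmed-down local Article type. It is an assumption about the behaviour, not the package's actual implementation.

```go
package main

import "fmt"

// Article is a trimmed local copy holding only the URL fields.
type Article struct {
	CanonicalURL string
	URLs         []string
}

// preferredURL (hypothetical) returns the canonical URL when set,
// otherwise the first entry in URLs, otherwise the empty string.
func preferredURL(art *Article) string {
	if art.CanonicalURL != "" {
		return art.CanonicalURL
	}
	if len(art.URLs) > 0 {
		return art.URLs[0]
	}
	return ""
}

func main() {
	withCanonical := &Article{
		CanonicalURL: "https://example.com/story",
		URLs:         []string{"https://example.com/story?ref=rss", "https://example.com/story"},
	}
	withoutCanonical := &Article{
		URLs: []string{"https://example.com/other?ref=rss"},
	}
	fmt.Println(preferredURL(withCanonical))
	fmt.Println(preferredURL(withoutCanonical))
}
```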

type Author

type Author struct {
	Name    string `json:"name"`
	RelLink string `json:"rellink,omitempty"`
	Email   string `json:"email,omitempty"`
	Twitter string `json:"twitter,omitempty"`
}

type Keyword

type Keyword struct {
	Name string `json:"name"`
	URL  string `json:"url,omitempty"`
}

type Publication

type Publication struct {
	Name   string `json:"name,omitempty"`
	Domain string `json:"domain,omitempty"`
}

type Reverse

type Reverse struct {
	sort.Interface
}

Reverse is a wrapper for reversing the order of any sort.Interface.

func (Reverse) Less

func (r Reverse) Less(i, j int) bool
