Documentation ¶
Overview ¶
Package extract turns a full HTML page into its article: the main-content node with the chrome removed, plus the page metadata (title, byline, site name, excerpt, language, publish date) and the outbound links.
It runs go-readability for the content node and harvests metadata from the document's own tags first, falling back to what readability recovers. The content node is sanitised with kage's CleanTree so no script or handler survives into the Markdown.
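The precedence described above (the document's own tags first, readability's result second) can be sketched as a small helper. This is only an illustration under assumptions: firstNonEmpty and the sample values are hypothetical names that do not come from the package, and the real implementation may resolve metadata differently.

```go
package main

import (
	"fmt"
	"strings"
)

// firstNonEmpty returns the first candidate that is non-empty after trimming
// whitespace, or "" when every candidate is blank. It is a hypothetical helper,
// not part of package extract.
func firstNonEmpty(candidates ...string) string {
	for _, c := range candidates {
		if s := strings.TrimSpace(c); s != "" {
			return s
		}
	}
	return ""
}

func main() {
	// Assumed inputs: one value read from the document's own tags and one
	// recovered by readability. Here the tag value is missing.
	fromTags := ""
	fromReadability := "A Recovered Title"

	// The tag value is listed first, so it wins whenever it is present.
	fmt.Println(firstNonEmpty(fromTags, fromReadability))
}
```

Listing the candidates in priority order keeps the rule in one place, so the same helper could serve each metadata field.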
Index ¶
Constants ¶
This section is empty.
Variables ¶
This section is empty.
Functions ¶
This section is empty.
Types ¶
type Article ¶
type Article struct {
	Title     string
	Byline    string
	SiteName  string
	Excerpt   string
	Lang      string
	Published string
	// Node is the main-content subtree, sanitised and ready for conversion. It is
	// nil when readability found no article.
	Node *html.Node
	// Links are every outbound hyperlink in the whole document.
	Links []Link
	// LowConfidence is true when readability could not isolate a clear article and
	// yomi fell back to a coarse selection.
	LowConfidence bool
}
Article is the extracted form of one HTML page.
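As a rough sketch of how a caller might consume these fields, the example below renders the non-empty metadata as "key: value" lines and flags a low-confidence extraction. To stay self-contained it uses articleMeta, a hypothetical stand-in that copies only the scalar fields of Article; the header function and the key names are likewise assumptions, not part of the package.

```go
package main

import (
	"fmt"
	"strings"
)

// articleMeta is a hypothetical stand-in for the scalar fields of Article, so
// the sketch compiles without the real package. Node and Links are omitted.
type articleMeta struct {
	Title, Byline, SiteName, Excerpt, Lang, Published string
	LowConfidence                                     bool
}

// header renders each non-empty metadata field as a "key: value" line, in a
// fixed order, and appends a marker line when the extraction is low confidence.
func header(m articleMeta) string {
	fields := []struct{ key, val string }{
		{"title", m.Title},
		{"byline", m.Byline},
		{"site_name", m.SiteName},
		{"excerpt", m.Excerpt},
		{"lang", m.Lang},
		{"published", m.Published},
	}
	var b strings.Builder
	for _, f := range fields {
		if f.val != "" {
			fmt.Fprintf(&b, "%s: %s\n", f.key, f.val)
		}
	}
	if m.LowConfidence {
		b.WriteString("low_confidence: true\n")
	}
	return b.String()
}

func main() {
	fmt.Print(header(articleMeta{Title: "Hello", Lang: "en", LowConfidence: true}))
}
```

With the real type, a caller would also check Node for nil before converting it, since the field comment states it is nil when readability found no article.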