extract

package
v0.2.1
Published: Jun 17, 2026 | License: MIT | Imports: 6 | Imported by: 0

Documentation

Overview

Package extract turns a full HTML page into its article: the main-content node with the chrome removed, plus the page metadata (title, byline, site name, excerpt, language, publish date) and the outbound links.

It runs go-readability for the content node and harvests metadata from the document's own tags first, falling back to what readability recovers. The content node is sanitised with kage's CleanTree so no script or handler survives into the Markdown.
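The "document's own tags first, then readability" order described above can be pictured as a first-non-empty selection over candidate values. The following is only a sketch of that idea, not the package's code: `firstNonEmpty` and the sample values are hypothetical names introduced for illustration.

```go
package main

import (
	"fmt"
	"strings"
)

// firstNonEmpty returns the first candidate that is non-empty after trimming
// whitespace, or "" when every candidate is blank. (Hypothetical helper; the
// package's real fallback logic is not shown in its documentation.)
func firstNonEmpty(candidates ...string) string {
	for _, c := range candidates {
		if t := strings.TrimSpace(c); t != "" {
			return t
		}
	}
	return ""
}

func main() {
	// Assumed inputs: one value read from the page's own tags, one recovered
	// by readability. The tag value is blank here, so the fallback wins.
	fromTags := ""
	fromReadability := "Jane Doe"
	fmt.Println(firstNonEmpty(fromTags, fromReadability))
}
```

Listing candidates in priority order keeps the precedence rule in one place, so each metadata field can reuse the same helper.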

Types

type Article

type Article struct {
	Title     string
	Byline    string
	SiteName  string
	Excerpt   string
	Lang      string
	Published string
	// Node is the main-content subtree, sanitised and ready for conversion. It is
	// nil when readability found no article.
	Node *html.Node
	// Links are every outbound hyperlink in the whole document.
	Links []Link
	// LowConfidence is true when readability could not isolate a clear article and
	// yomi fell back to a coarse selection.
	LowConfidence bool
}

Article is the extracted form of one HTML page.
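Because Node may be nil and LowConfidence may be set, a caller will probably want to branch on both before converting the content. The sketch below shows one way to do that. It uses local stand-in types (`node`, `article`) rather than the real Article and html.Node so that it compiles without external modules, and `extractionStatus` is a hypothetical helper, not part of this package.

```go
package main

import "fmt"

// node is a placeholder for html.Node so this sketch needs no external module.
type node struct{}

// article mirrors only the two Article fields this sketch inspects.
type article struct {
	Node          *node
	LowConfidence bool
}

// extractionStatus classifies an extraction result (hypothetical helper).
func extractionStatus(a *article) string {
	switch {
	case a == nil || a.Node == nil:
		return "no article" // readability found no main-content node
	case a.LowConfidence:
		return "low confidence" // a coarse fallback selection was used
	default:
		return "ok"
	}
}

func main() {
	fmt.Println(extractionStatus(&article{}))
	fmt.Println(extractionStatus(&article{Node: &node{}, LowConfidence: true}))
	fmt.Println(extractionStatus(&article{Node: &node{}}))
}
```

Checking Node before LowConfidence means a missing article is never reported as merely uncertain.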

func FromHTML

func FromHTML(body []byte, pageURL string) (*Article, error)

FromHTML parses an HTML body and extracts its Article. pageURL is the absolute URL of the page, used to resolve relative links and to guide readability.

type Link

type Link struct {
	Text string
	URL  string
}

Link is one outbound hyperlink discovered on a page, resolved to an absolute URL.
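Resolving a relative href against the page URL can be done with the standard library's net/url package. The sketch below illustrates that general technique and is not taken from this package's source: `resolveLink` is a hypothetical helper, and the Link type is redeclared locally so the example is self-contained.

```go
package main

import (
	"fmt"
	"net/url"
)

// Link is redeclared here only to keep the sketch self-contained.
type Link struct {
	Text string
	URL  string
}

// resolveLink resolves href against pageURL and returns an absolute Link.
// (Hypothetical helper; the package may resolve links differently.)
func resolveLink(pageURL, text, href string) (Link, error) {
	base, err := url.Parse(pageURL)
	if err != nil {
		return Link{}, fmt.Errorf("parse page URL: %w", err)
	}
	ref, err := url.Parse(href)
	if err != nil {
		return Link{}, fmt.Errorf("parse href: %w", err)
	}
	// ResolveReference applies RFC 3986 reference resolution.
	return Link{Text: text, URL: base.ResolveReference(ref).String()}, nil
}

func main() {
	l, err := resolveLink("https://example.com/blog/post.html", "About", "../about")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(l.Text, l.URL)
}
```

An href that is already absolute passes through unchanged, so the same helper covers both on-site and off-site links.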
