readability

package v0.2.0
Published: Mar 27, 2025 License: MIT Imports: 13 Imported by: 0

Documentation

Overview

Package readability provides a pure Go implementation of Mozilla's Readability.js algorithm for extracting article content from web pages.

This implementation follows the same content extraction logic as the original JavaScript implementation, including scoring elements based on content quality, handling special cases, and cleaning up the final article content.

Key features:

- No JavaScript dependencies (100% Go)
- Compatible with Mozilla's Readability algorithm
- Proper handling of important links, headings, and navigation elements
- Built-in adapters for integration with the main extractor package


Index

Constants

const (
	FlagStripUnlikelys     = 0x1
	FlagWeightClasses      = 0x2
	FlagCleanConditionally = 0x4
)

Flags for controlling the content extraction process
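These are standard bit flags. A standalone sketch of combining and testing them (the constants are reproduced locally so the snippet runs on its own; the `flags` variable is illustrative, not part of the package API):

```go
package main

import "fmt"

// Same values as the package's extraction flags, reproduced locally.
const (
	FlagStripUnlikelys     = 0x1
	FlagWeightClasses      = 0x2
	FlagCleanConditionally = 0x4
)

func main() {
	// Enable all three passes by OR-ing the flags together.
	flags := FlagStripUnlikelys | FlagWeightClasses | FlagCleanConditionally

	// Test a single flag with bitwise AND.
	fmt.Println(flags&FlagWeightClasses != 0) // true

	// Clear one flag with Go's AND NOT operator.
	flags &^= FlagCleanConditionally
	fmt.Println(flags&FlagCleanConditionally != 0) // false
}
```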

const (
	ElementNode = 1
	TextNode    = 3
	CommentNode = 8
	DoctypeNode = 10
)

Node type constants matching the DOM Node.nodeType values

const (
	// DefaultMaxElemsToParse is the maximum number of elements to parse (0 = no limit)
	DefaultMaxElemsToParse = 0

	// DefaultNTopCandidates is the number of top candidates to consider
	DefaultNTopCandidates = 5

	// DefaultCharThreshold is the minimum number of characters required for content
	DefaultCharThreshold = 500
)

Default settings

Variables

var (
	// Unlikely candidates for content
	RegexpUnlikelyCandidates = regexp.MustCompile(`-ad-|ai2html|banner|breadcrumbs|combx|comment|community|cover-wrap|disqus|extra|footer|gdpr|header|legends|menu|related|remark|replies|rss|shoutbox|sidebar|skyscraper|social|sponsor|supplemental|ad-break|agegate|pagination|pager|popup|yom-remote`)

	// Candidates that might be content despite matching the unlikelyCandidates pattern
	RegexpMaybeCandidate = regexp.MustCompile(`and|article|body|column|content|main|shadow`)

	// Positive indicators of content
	RegexpPositive = regexp.MustCompile(`article|body|content|entry|hentry|h-entry|main|page|pagination|post|text|blog|story`)

	// Negative indicators of content
	RegexpNegative = regexp.MustCompile(`-ad-|hidden|^hid$| hid$| hid |^hid |banner|combx|comment|com-|contact|foot|footer|footnote|gdpr|masthead|media|meta|outbrain|promo|related|scroll|share|shoutbox|sidebar|skyscraper|sponsor|shopping|tags|tool|widget`)

	// Extraneous content areas
	RegexpExtraneous = regexp.MustCompile(`print|archive|comment|discuss|e[\-]?mail|share|reply|all|login|sign|single|utility`)

	// Byline indicators
	RegexpByline = regexp.MustCompile(`byline|author|dateline|writtenby|p-author`)

	// Font elements to replace
	RegexpReplaceFonts = regexp.MustCompile(`<(/?)font[^>]*>`)

	// Normalize whitespace
	RegexpNormalize = regexp.MustCompile(`\s{2,}`)

	// Video services to preserve
	RegexpVideos = regexp.MustCompile(`//(www\.)?((dailymotion|youtube|youtube-nocookie|player\.vimeo|v\.qq)\.com|(archive|upload\.wikimedia)\.org|player\.twitch\.tv)`)

	// Share elements
	RegexpShareElements = regexp.MustCompile(`(\b|_)(share|sharedaddy)(\b|_)`)

	// Next page links
	RegexpNextLink = regexp.MustCompile(`(next|weiter|continue|>([^\|]|$)|»([^\|]|$))`)

	// Previous page links
	RegexpPrevLink = regexp.MustCompile(`(prev|earl|old|new|<|«)`)

	// Tokenize text
	RegexpTokenize = regexp.MustCompile(`\W+`)

	// Whitespace
	RegexpWhitespace = regexp.MustCompile(`^\s*$`)

	// Has content
	RegexpHasContent = regexp.MustCompile(`\S$`)

	// Hash URL
	RegexpHashUrl = regexp.MustCompile(`^#.+`)

	// Srcset URL
	RegexpSrcsetUrl = regexp.MustCompile(`(\S+)(\s+[\d.]+[xw])?(\s*(?:,|$))`)

	// Base64 data URL
	RegexpB64DataUrl = regexp.MustCompile(`^data:\s*([^\s;,]+)\s*;\s*base64\s*,`)

	// JSON-LD article types
	RegexpJsonLdArticleTypes = regexp.MustCompile(`^Article|AdvertiserContentArticle|NewsArticle|AnalysisNewsArticle|AskPublicNewsArticle|BackgroundNewsArticle|OpinionNewsArticle|ReportageNewsArticle|ReviewNewsArticle|Report|SatiricalArticle|ScholarlyArticle|MedicalScholarlyArticle|SocialMediaPosting|BlogPosting|LiveBlogPosting|DiscussionForumPosting|TechArticle|APIReference$`)
)

Regular expressions used in the Readability algorithm
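These patterns are matched against strings such as class and id attribute values. A standalone sketch using the byline pattern (recompiled locally so the snippet runs on its own):

```go
package main

import (
	"fmt"
	"regexp"
)

// Same pattern as RegexpByline above, compiled locally for a runnable example.
var byline = regexp.MustCompile(`byline|author|dateline|writtenby|p-author`)

func main() {
	// A class name containing "author" is flagged as a likely byline.
	fmt.Println(byline.MatchString("article-author-name")) // true

	// Ordinary content containers do not match.
	fmt.Println(byline.MatchString("main-content")) // false
}
```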

var AlterToDivExceptions = []string{"DIV", "ARTICLE", "SECTION", "P"}

AlterToDivExceptions defines elements that should not be converted to <div>

var ClassesToPreserve = []string{"page"}

ClassesToPreserve defines CSS classes that should be preserved in the output

var DefaultTagsToScore = []string{"SECTION", "H2", "H3", "H4", "H5", "H6", "P", "TD", "PRE"}

DefaultTagsToScore defines the element tags that should be scored

var DeprecatedSizeAttributeElems = []string{"TABLE", "TH", "TD", "HR", "PRE"}

DeprecatedSizeAttributeElems defines elements with deprecated size attributes

var DivToPElems = []string{"BLOCKQUOTE", "DL", "DIV", "IMG", "OL", "P", "PRE", "TABLE", "UL"}

DivToPElems defines block-level elements whose presence inside a <div> prevents converting that <div> into a <p>

var HTMLEscapeMap = map[string]string{
	"lt":   "<",
	"gt":   ">",
	"amp":  "&",
	"quot": "\"",
	"apos": "'",
}

HTMLEscapeMap defines HTML entities that need to be escaped

var PhrasingElems = []string{
	"ABBR", "AUDIO", "B", "BDO", "BR", "BUTTON", "CITE", "CODE", "DATA",
	"DATALIST", "DFN", "EM", "EMBED", "I", "IMG", "INPUT", "KBD", "LABEL",
	"MARK", "MATH", "METER", "NOSCRIPT", "OBJECT", "OUTPUT", "PROGRESS", "Q",
	"RUBY", "SAMP", "SCRIPT", "SELECT", "SMALL", "SPAN", "STRONG", "SUB",
	"SUP", "TEXTAREA", "TIME", "VAR", "WBR",
}

PhrasingElems defines elements that qualify as phrasing content
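Tag lists like this are typically consulted with a simple membership test. An illustrative standalone sketch (abridged list; the `isPhrasing` helper is hypothetical, not part of the package API):

```go
package main

import "fmt"

// Abridged copy of the phrasing-content tag list above.
var phrasingElems = []string{
	"ABBR", "AUDIO", "B", "BDO", "BR", "CODE", "EM", "I", "IMG",
	"SPAN", "STRONG", "SUB", "SUP", "TIME", "VAR", "WBR",
}

// isPhrasing reports whether the upper-cased tag name is phrasing content.
func isPhrasing(tag string) bool {
	for _, t := range phrasingElems {
		if t == tag {
			return true
		}
	}
	return false
}

func main() {
	fmt.Println(isPhrasing("SPAN")) // true
	fmt.Println(isPhrasing("DIV"))  // false
}
```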

var PresentationalAttributes = []string{"align", "background", "bgcolor", "border", "cellpadding", "cellspacing", "frame", "hspace", "rules", "style", "valign", "vspace"}

PresentationalAttributes defines presentational attributes to remove

var UnlikelyRoles = []string{"menu", "menubar", "complementary", "navigation", "alert", "alertdialog", "dialog"}

UnlikelyRoles defines ARIA roles that suggest a node is not content

Functions

func ExtractFromHTML

func ExtractFromHTML(html string, options *types.ExtractionOptions) (*types.Article, error)

ExtractFromHTML extracts readable content from HTML using the pure Go Readability implementation. It adapts that implementation to the interface expected by the main extractor package.

Types

type Article

type Article struct {
	Title        string    `json:"title"`
	Byline       string    `json:"byline"`
	Date         time.Time `json:"date"`
	Content      string    `json:"content"`
	PlainContent string    `json:"plain_content"`
	TextContent  string    `json:"text_content"`
	Excerpt      string    `json:"excerpt"`
	SiteName     string    `json:"site_name"`
	Length       int       `json:"length"`
}

Article is the standard article format produced by ReadabilityArticle.ToArticle, for compatibility with existing code that expects this type.

type NodeInfo

type NodeInfo struct {
	// contains filtered or unexported fields
}

NodeInfo holds information about a node

type Readability

type Readability struct {
	// contains filtered or unexported fields
}

Readability implements the Readability algorithm

func NewFromDocument

func NewFromDocument(doc *goquery.Document, opts *ReadabilityOptions) *Readability

NewFromDocument creates a new Readability parser from a goquery document

func NewFromHTML

func NewFromHTML(html string, opts *ReadabilityOptions) (*Readability, error)

NewFromHTML creates a new Readability parser from HTML string

func (*Readability) Parse

func (r *Readability) Parse() (*ReadabilityArticle, error)

Parse runs the Readability algorithm

type ReadabilityArticle

type ReadabilityArticle struct {
	Title       string    // Article title
	Byline      string    // Article byline (author)
	Content     string    // Article content (HTML)
	TextContent string    // Article text content (plain text)
	Length      int       // Length of the text content
	Excerpt     string    // Short excerpt
	SiteName    string    // Site name
	Date        time.Time // Publication date
}

ReadabilityArticle represents the extracted article

func Parse

func Parse(html string) (*ReadabilityArticle, error)

Parse extracts article content from HTML using default options
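Typical usage is a sketch along these lines (the module's import path is not shown on this page, so it is omitted; error handling abbreviated):

```go
// Sketch only: substitute the module's actual import path.
article, err := readability.Parse(htmlString)
if err != nil {
	log.Fatal(err)
}
fmt.Println(article.Title)
fmt.Println(article.Length) // length of the extracted plain text
```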

func ParseHTML

func ParseHTML(html string, opts *ReadabilityOptions) (*ReadabilityArticle, error)

ParseHTML parses HTML content using the Readability algorithm

func ParseWithOptions

func ParseWithOptions(html string, debug bool, maxElems int, charThreshold int) (*ReadabilityArticle, error)

ParseWithOptions extracts article content from HTML using custom options

func (*ReadabilityArticle) ToArticle

func (r *ReadabilityArticle) ToArticle() *Article

ToArticle converts a ReadabilityArticle to a standard Article format

type ReadabilityOptions

type ReadabilityOptions struct {
	Debug             bool           // Debug mode
	MaxElemsToParse   int            // Maximum elements to parse (0 = no limit)
	NbTopCandidates   int            // Number of top candidates to consider
	CharThreshold     int            // Minimum character threshold
	ClassesToPreserve []string       // Classes to preserve
	KeepClasses       bool           // Whether to keep classes
	DisableJSONLD     bool           // Whether to disable JSON-LD processing
	AllowedVideoRegex *regexp.Regexp // Regex for allowed videos
}

ReadabilityOptions defines configuration options for the Readability parser
