Documentation ¶
Overview ¶
Package readability provides a pure Go implementation of Mozilla's Readability.js algorithm for extracting article content from web pages.
This implementation follows the same content-extraction logic as the original JavaScript version, including scoring elements by content quality, handling special cases, and cleaning up the final article content.
Key features:

  - No JavaScript dependencies (100% Go)
  - Compatible with Mozilla's Readability algorithm
  - Proper handling of important links, headings, and navigation elements
  - Built-in adapters for integration with the main extractor package
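A minimal end-to-end sketch using the package-level Parse function documented below (the import path is an assumption; substitute this module's actual path):

```go
package main

import (
	"fmt"
	"log"
	"os"

	// Hypothetical import path; adjust to this module's real path.
	"example.com/extractor/readability"
)

func main() {
	raw, err := os.ReadFile("article.html")
	if err != nil {
		log.Fatal(err)
	}

	// Parse uses the default options (DefaultCharThreshold etc.).
	article, err := readability.Parse(string(raw))
	if err != nil {
		log.Fatal(err)
	}

	fmt.Println("Title:", article.Title)
	fmt.Println("Byline:", article.Byline)
	fmt.Println("Length:", article.Length)
}
```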
Index ¶
Constants ¶
const (
	FlagStripUnlikelys     = 0x1
	FlagWeightClasses      = 0x2
	FlagCleanConditionally = 0x4
)
Flags for controlling the content extraction process
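The flags combine as a bitmask. A self-contained sketch of how such flags are typically tested and cleared (the constant values are copied from the declaration above; the helper names are illustrative, not this package's exported API):

```go
package main

import "fmt"

// Values copied from the package constants above.
const (
	FlagStripUnlikelys     = 0x1
	FlagWeightClasses      = 0x2
	FlagCleanConditionally = 0x4
)

// flagIsActive reports whether flag is set in flags.
func flagIsActive(flags, flag int) bool {
	return flags&flag != 0
}

// removeFlag clears flag; Readability retries extraction with
// progressively fewer restrictions by dropping flags one at a time.
func removeFlag(flags, flag int) int {
	return flags &^ flag
}

func main() {
	flags := FlagStripUnlikelys | FlagWeightClasses | FlagCleanConditionally
	fmt.Println(flagIsActive(flags, FlagWeightClasses)) // true

	flags = removeFlag(flags, FlagWeightClasses)
	fmt.Println(flagIsActive(flags, FlagWeightClasses)) // false
}
```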
const (
	ElementNode = 1
	TextNode    = 3
	CommentNode = 8
	DoctypeNode = 10
)
Node types from the HTML package
const (
	// DefaultMaxElemsToParse is the maximum number of elements to parse (0 = no limit)
	DefaultMaxElemsToParse = 0

	// DefaultNTopCandidates is the number of top candidates to consider
	DefaultNTopCandidates = 5

	// DefaultCharThreshold is the minimum number of characters required for content
	DefaultCharThreshold = 500
)
Default settings
Variables ¶
var (
	// Unlikely candidates for content
	RegexpUnlikelyCandidates = regexp.MustCompile(`-ad-|ai2html|banner|breadcrumbs|combx|comment|community|cover-wrap|disqus|extra|footer|gdpr|header|legends|menu|related|remark|replies|rss|shoutbox|sidebar|skyscraper|social|sponsor|supplemental|ad-break|agegate|pagination|pager|popup|yom-remote`)

	// Candidates that might be content despite matching the unlikelyCandidates pattern
	RegexpMaybeCandidate = regexp.MustCompile(`and|article|body|column|content|main|shadow`)

	// Positive indicators of content
	RegexpPositive = regexp.MustCompile(`article|body|content|entry|hentry|h-entry|main|page|pagination|post|text|blog|story`)

	// Negative indicators of content
	RegexpNegative = regexp.MustCompile(`-ad-|hidden|^hid$| hid$| hid |^hid |banner|combx|comment|com-|contact|foot|footer|footnote|gdpr|masthead|media|meta|outbrain|promo|related|scroll|share|shoutbox|sidebar|skyscraper|sponsor|shopping|tags|tool|widget`)

	// Extraneous content areas
	RegexpExtraneous = regexp.MustCompile(`print|archive|comment|discuss|e[\-]?mail|share|reply|all|login|sign|single|utility`)

	// Byline indicators
	RegexpByline = regexp.MustCompile(`byline|author|dateline|writtenby|p-author`)

	// Font elements to replace
	RegexpReplaceFonts = regexp.MustCompile(`<(/?)font[^>]*>`)

	// Normalize whitespace
	RegexpNormalize = regexp.MustCompile(`\s{2,}`)

	// Video services to preserve
	RegexpVideos = regexp.MustCompile(`//(www\.)?((dailymotion|youtube|youtube-nocookie|player\.vimeo|v\.qq)\.com|(archive|upload\.wikimedia)\.org|player\.twitch\.tv)`)

	RegexpShareElements = regexp.MustCompile(`(\b|_)(share|sharedaddy)(\b|_)`)

	// Next page links
	RegexpNextLink = regexp.MustCompile(`(next|weiter|continue|>([^\|]|$)|»([^\|]|$))`)

	// Previous page links
	RegexpPrevLink = regexp.MustCompile(`(prev|earl|old|new|<|«)`)

	// Tokenize text
	RegexpTokenize = regexp.MustCompile(`\W+`)

	// Whitespace
	RegexpWhitespace = regexp.MustCompile(`^\s*$`)

	// Has content
	RegexpHasContent = regexp.MustCompile(`\S$`)

	// Hash URL
	RegexpHashUrl = regexp.MustCompile(`^#.+`)

	// Srcset URL
	RegexpSrcsetUrl = regexp.MustCompile(`(\S+)(\s+[\d.]+[xw])?(\s*(?:,|$))`)

	// Base64 data URL
	RegexpB64DataUrl = regexp.MustCompile(`^data:\s*([^\s;,]+)\s*;\s*base64\s*,`)

	// JSON-LD article types
	RegexpJsonLdArticleTypes = regexp.MustCompile(`^Article|AdvertiserContentArticle|NewsArticle|AnalysisNewsArticle|AskPublicNewsArticle|BackgroundNewsArticle|OpinionNewsArticle|ReportageNewsArticle|ReviewNewsArticle|Report|SatiricalArticle|ScholarlyArticle|MedicalScholarlyArticle|SocialMediaPosting|BlogPosting|LiveBlogPosting|DiscussionForumPosting|TechArticle|APIReference$`)
)
Regular expressions used in the Readability algorithm
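RegexpPositive and RegexpNegative drive class/id weighting. A self-contained sketch of how such a weight is computed (the patterns are copied from the variables above; the helper and its +/-25 weights follow Readability.js but are illustrative, not this package's exported API):

```go
package main

import (
	"fmt"
	"regexp"
)

// Patterns copied from the variables documented above.
var (
	regexpPositive = regexp.MustCompile(`article|body|content|entry|hentry|h-entry|main|page|pagination|post|text|blog|story`)
	regexpNegative = regexp.MustCompile(`-ad-|hidden|^hid$| hid$| hid |^hid |banner|combx|comment|com-|contact|foot|footer|footnote|gdpr|masthead|media|meta|outbrain|promo|related|scroll|share|shoutbox|sidebar|skyscraper|sponsor|shopping|tags|tool|widget`)
)

// classWeight scores a class or id string: +25 for a positive match,
// -25 for a negative one, as in Readability.js's getClassWeight.
func classWeight(s string) int {
	weight := 0
	if regexpPositive.MatchString(s) {
		weight += 25
	}
	if regexpNegative.MatchString(s) {
		weight -= 25
	}
	return weight
}

func main() {
	fmt.Println(classWeight("article-content")) // 25
	fmt.Println(classWeight("sidebar-widget"))  // -25
	fmt.Println(classWeight("comment-body"))    // 0: matches both patterns
}
```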
var AlterToDivExceptions = []string{"DIV", "ARTICLE", "SECTION", "P"}
AlterToDivExceptions defines elements that should not be converted to <div>
var ClassesToPreserve = []string{"page"}
ClassesToPreserve defines CSS classes that should be preserved in the output
var DefaultTagsToScore = []string{"SECTION", "H2", "H3", "H4", "H5", "H6", "P", "TD", "PRE"}
DefaultTagsToScore defines the element tags that should be scored
var DeprecatedSizeAttributeElems = []string{"TABLE", "TH", "TD", "HR", "PRE"}
DeprecatedSizeAttributeElems defines elements with deprecated size attributes
var DivToPElems = []string{"BLOCKQUOTE", "DL", "DIV", "IMG", "OL", "P", "PRE", "TABLE", "UL"}
DivToPElems defines block-level elements whose presence inside a <div> prevents the <div> from being converted to a <p>
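In Readability.js, this list decides whether a <div> can be collapsed into a <p>: only a <div> containing none of these block-level tags is converted. A self-contained sketch (the helper is illustrative; tag names are compared uppercase as in the list above):

```go
package main

import "fmt"

// Copied from the DivToPElems variable above.
var divToPElems = []string{"BLOCKQUOTE", "DL", "DIV", "IMG", "OL", "P", "PRE", "TABLE", "UL"}

// hasBlockChild reports whether any child tag is a block-level element
// from divToPElems; a <div> with none of these may become a <p>.
func hasBlockChild(childTags []string) bool {
	for _, tag := range childTags {
		for _, block := range divToPElems {
			if tag == block {
				return true
			}
		}
	}
	return false
}

func main() {
	fmt.Println(hasBlockChild([]string{"SPAN", "A", "B"})) // false: convertible to <p>
	fmt.Println(hasBlockChild([]string{"SPAN", "TABLE"}))  // true: stays a <div>
}
```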
var HTMLEscapeMap = map[string]string{
"lt": "<",
"gt": ">",
"amp": "&",
"quot": "\"",
"apos": "'",
}
HTMLEscapeMap maps basic named HTML entities to their literal characters
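A self-contained sketch of how such a map is typically applied to unescape the basic entities (the map is copied from the variable above; the helper is a naive illustration, and the stdlib html.UnescapeString covers the full entity set):

```go
package main

import (
	"fmt"
	"strings"
)

// Copied from the HTMLEscapeMap variable above.
var htmlEscapeMap = map[string]string{
	"lt": "<", "gt": ">", "amp": "&", "quot": "\"", "apos": "'",
}

// unescapeEntities replaces the basic named entities with their
// literal characters. Naive sketch: it does not guard against
// double-unescaping inputs like "&amp;lt;".
func unescapeEntities(s string) string {
	for name, ch := range htmlEscapeMap {
		s = strings.ReplaceAll(s, "&"+name+";", ch)
	}
	return s
}

func main() {
	fmt.Println(unescapeEntities("&lt;b&gt;AT&amp;T&lt;/b&gt;")) // <b>AT&T</b>
}
```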
var PhrasingElems = []string{
"ABBR", "AUDIO", "B", "BDO", "BR", "BUTTON", "CITE", "CODE", "DATA",
"DATALIST", "DFN", "EM", "EMBED", "I", "IMG", "INPUT", "KBD", "LABEL",
"MARK", "MATH", "METER", "NOSCRIPT", "OBJECT", "OUTPUT", "PROGRESS", "Q",
"RUBY", "SAMP", "SCRIPT", "SELECT", "SMALL", "SPAN", "STRONG", "SUB",
"SUP", "TEXTAREA", "TIME", "VAR", "WBR",
}
PhrasingElems defines elements that qualify as phrasing content
var PresentationalAttributes = []string{"align", "background", "bgcolor", "border", "cellpadding", "cellspacing", "frame", "hspace", "rules", "style", "valign", "vspace"}
PresentationalAttributes defines presentational attributes to remove
var UnlikelyRoles = []string{"menu", "menubar", "complementary", "navigation", "alert", "alertdialog", "dialog"}
UnlikelyRoles defines ARIA roles that suggest a node is not content
Functions ¶
func ExtractFromHTML ¶
ExtractFromHTML extracts readable content from HTML using the pure Go Readability implementation. It adapts this implementation to match the interface expected by the main extractor package.
Types ¶
type Article ¶
type Article struct {
	Title        string    `json:"title"`
	Byline       string    `json:"byline"`
	Date         time.Time `json:"date"`
	Content      string    `json:"content"`
	PlainContent string    `json:"plain_content"`
	TextContent  string    `json:"text_content"`
	Excerpt      string    `json:"excerpt"`
	SiteName     string    `json:"site_name"`
	Length       int       `json:"length"`
}
Article is the standard article format produced by ReadabilityArticle.ToArticle. It allows compatibility with existing code that expects the Article type.
type NodeInfo ¶
type NodeInfo struct {
// contains filtered or unexported fields
}
NodeInfo holds information about a node
type Readability ¶
type Readability struct {
// contains filtered or unexported fields
}
Readability implements the Readability algorithm
func NewFromDocument ¶
func NewFromDocument(doc *goquery.Document, opts *ReadabilityOptions) *Readability
NewFromDocument creates a new Readability parser from a goquery document
func NewFromHTML ¶
func NewFromHTML(html string, opts *ReadabilityOptions) (*Readability, error)
NewFromHTML creates a new Readability parser from HTML string
func (*Readability) Parse ¶
func (r *Readability) Parse() (*ReadabilityArticle, error)
Parse runs the Readability algorithm
type ReadabilityArticle ¶
type ReadabilityArticle struct {
	Title       string    // Article title
	Byline      string    // Article byline (author)
	Content     string    // Article content (HTML)
	TextContent string    // Article text content (plain text)
	Length      int       // Length of the text content
	Excerpt     string    // Short excerpt
	SiteName    string    // Site name
	Date        time.Time // Publication date
}
ReadabilityArticle represents the extracted article
func Parse ¶
func Parse(html string) (*ReadabilityArticle, error)
Parse extracts article content from HTML using default options
func ParseHTML ¶
func ParseHTML(html string, opts *ReadabilityOptions) (*ReadabilityArticle, error)
ParseHTML parses HTML content using the Readability algorithm
func ParseWithOptions ¶
func ParseWithOptions(html string, debug bool, maxElems int, charThreshold int) (*ReadabilityArticle, error)
ParseWithOptions extracts article content from HTML using custom options
func (*ReadabilityArticle) ToArticle ¶
func (r *ReadabilityArticle) ToArticle() *Article
ToArticle converts a ReadabilityArticle to a standard Article format
type ReadabilityOptions ¶
type ReadabilityOptions struct {
	Debug             bool           // Debug mode
	MaxElemsToParse   int            // Maximum elements to parse (0 = no limit)
	NbTopCandidates   int            // Number of top candidates to consider
	CharThreshold     int            // Minimum character threshold
	ClassesToPreserve []string       // Classes to preserve
	KeepClasses       bool           // Whether to keep classes
	DisableJSONLD     bool           // Whether to disable JSON-LD processing
	AllowedVideoRegex *regexp.Regexp // Regex for allowed videos
}
ReadabilityOptions defines configuration options for the Readability parser