readability

package v0.2.0
Published: Mar 27, 2025 License: MIT Imports: 13 Imported by: 0

Documentation

Overview

Package readability provides a pure Go implementation of Mozilla's Readability.js algorithm for extracting article content from web pages.

This implementation follows the same content extraction logic as the original JavaScript implementation, including scoring elements based on content quality, handling special cases, and cleaning up the final article content.

Key features:

- No JavaScript dependencies (100% Go)
- Compatible with Mozilla's Readability algorithm
- Proper handling of important links, headings, and navigation elements
- Built-in adapters for integration with the main extractor package


Index

Constants

const (
	FlagStripUnlikelys     = 0x1
	FlagWeightClasses      = 0x2
	FlagCleanConditionally = 0x4
)

Flags for controlling the content extraction process
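These are standard bit flags. A standalone sketch of combining and testing them (the constants are reproduced locally so the snippet runs on its own; the `flags` variable is illustrative, not part of the package API):

```go
package main

import "fmt"

// Same values as the package's extraction flags, reproduced locally.
const (
	FlagStripUnlikelys     = 0x1
	FlagWeightClasses      = 0x2
	FlagCleanConditionally = 0x4
)

func main() {
	// Enable all three passes by OR-ing the flags together.
	flags := FlagStripUnlikelys | FlagWeightClasses | FlagCleanConditionally

	// Test a single flag with bitwise AND.
	fmt.Println(flags&FlagWeightClasses != 0) // true

	// Clear one flag with Go's AND NOT operator.
	flags &^= FlagCleanConditionally
	fmt.Println(flags&FlagCleanConditionally != 0) // false
}
```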

const (
	ElementNode = 1
	TextNode    = 3
	CommentNode = 8
	DoctypeNode = 10
)

Node type constants matching the DOM Node.nodeType values

const (
	// DefaultMaxElemsToParse is the maximum number of elements to parse (0 = no limit)
	DefaultMaxElemsToParse = 0

	// DefaultNTopCandidates is the number of top candidates to consider
	DefaultNTopCandidates = 5

	// DefaultCharThreshold is the minimum number of characters required for content
	DefaultCharThreshold = 500
)

Default settings

Variables

var (
	// Unlikely candidates for content
	RegexpUnlikelyCandidates = regexp.MustCompile(`-ad-|ai2html|banner|breadcrumbs|combx|comment|community|cover-wrap|disqus|extra|footer|gdpr|header|legends|menu|related|remark|replies|rss|shoutbox|sidebar|skyscraper|social|sponsor|supplemental|ad-break|agegate|pagination|pager|popup|yom-remote`)

	// Candidates that might be content despite matching the unlikelyCandidates pattern
	RegexpMaybeCandidate = regexp.MustCompile(`and|article|body|column|content|main|shadow`)

	// Positive indicators of content
	RegexpPositive = regexp.MustCompile(`article|body|content|entry|hentry|h-entry|main|page|pagination|post|text|blog|story`)

	// Negative indicators of content
	RegexpNegative = regexp.MustCompile(`-ad-|hidden|^hid$| hid$| hid |^hid |banner|combx|comment|com-|contact|foot|footer|footnote|gdpr|masthead|media|meta|outbrain|promo|related|scroll|share|shoutbox|sidebar|skyscraper|sponsor|shopping|tags|tool|widget`)

	// Extraneous content areas
	RegexpExtraneous = regexp.MustCompile(`print|archive|comment|discuss|e[\-]?mail|share|reply|all|login|sign|single|utility`)

	// Byline indicators
	RegexpByline = regexp.MustCompile(`byline|author|dateline|writtenby|p-author`)

	// Font elements to replace
	RegexpReplaceFonts = regexp.MustCompile(`<(/?)font[^>]*>`)

	// Normalize whitespace
	RegexpNormalize = regexp.MustCompile(`\s{2,}`)

	// Video services to preserve
	RegexpVideos = regexp.MustCompile(`//(www\.)?((dailymotion|youtube|youtube-nocookie|player\.vimeo|v\.qq)\.com|(archive|upload\.wikimedia)\.org|player\.twitch\.tv)`)

	// Share elements
	RegexpShareElements = regexp.MustCompile(`(\b|_)(share|sharedaddy)(\b|_)`)

	// Next page links
	RegexpNextLink = regexp.MustCompile(`(next|weiter|continue|>([^\|]|$)|»([^\|]|$))`)

	// Previous page links
	RegexpPrevLink = regexp.MustCompile(`(prev|earl|old|new|<|«)`)

	// Tokenize text
	RegexpTokenize = regexp.MustCompile(`\W+`)

	// Whitespace
	RegexpWhitespace = regexp.MustCompile(`^\s*$`)

	// Has content
	RegexpHasContent = regexp.MustCompile(`\S$`)

	// Hash URL
	RegexpHashUrl = regexp.MustCompile(`^#.+`)

	// Srcset URL
	RegexpSrcsetUrl = regexp.MustCompile(`(\S+)(\s+[\d.]+[xw])?(\s*(?:,|$))`)

	// Base64 data URL
	RegexpB64DataUrl = regexp.MustCompile(`^data:\s*([^\s;,]+)\s*;\s*base64\s*,`)

	// JSON-LD article types
	RegexpJsonLdArticleTypes = regexp.MustCompile(`^Article|AdvertiserContentArticle|NewsArticle|AnalysisNewsArticle|AskPublicNewsArticle|BackgroundNewsArticle|OpinionNewsArticle|ReportageNewsArticle|ReviewNewsArticle|Report|SatiricalArticle|ScholarlyArticle|MedicalScholarlyArticle|SocialMediaPosting|BlogPosting|LiveBlogPosting|DiscussionForumPosting|TechArticle|APIReference$`)
)

Regular expressions used in the Readability algorithm
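These patterns are matched against strings such as class and id attribute values. A standalone sketch using the byline pattern (recompiled locally so the snippet runs on its own):

```go
package main

import (
	"fmt"
	"regexp"
)

// Same pattern as RegexpByline above, compiled locally for a runnable example.
var byline = regexp.MustCompile(`byline|author|dateline|writtenby|p-author`)

func main() {
	// A class name containing "author" is flagged as a likely byline.
	fmt.Println(byline.MatchString("article-author-name")) // true

	// Ordinary content containers do not match.
	fmt.Println(byline.MatchString("main-content")) // false
}
```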

var AlterToDivExceptions = []string{"DIV", "ARTICLE", "SECTION", "P"}

AlterToDivExceptions defines elements that should not be converted to <div>

var ClassesToPreserve = []string{"page"}

ClassesToPreserve defines CSS classes that should be preserved in the output

var DefaultTagsToScore = []string{"SECTION", "H2", "H3", "H4", "H5", "H6", "P", "TD", "PRE"}

DefaultTagsToScore defines the element tags that should be scored

var DeprecatedSizeAttributeElems = []string{"TABLE", "TH", "TD", "HR", "PRE"}

DeprecatedSizeAttributeElems defines elements with deprecated size attributes

var DivToPElems = []string{"BLOCKQUOTE", "DL", "DIV", "IMG", "OL", "P", "PRE", "TABLE", "UL"}

DivToPElems defines block-level elements whose presence inside a <div> prevents converting that <div> into a <p>

var HTMLEscapeMap = map[string]string{
	"lt":   "<",
	"gt":   ">",
	"amp":  "&",
	"quot": "\"",
	"apos": "'",
}

HTMLEscapeMap defines HTML entities that need to be escaped

var PhrasingElems = []string{
	"ABBR", "AUDIO", "B", "BDO", "BR", "BUTTON", "CITE", "CODE", "DATA",
	"DATALIST", "DFN", "EM", "EMBED", "I", "IMG", "INPUT", "KBD", "LABEL",
	"MARK", "MATH", "METER", "NOSCRIPT", "OBJECT", "OUTPUT", "PROGRESS", "Q",
	"RUBY", "SAMP", "SCRIPT", "SELECT", "SMALL", "SPAN", "STRONG", "SUB",
	"SUP", "TEXTAREA", "TIME", "VAR", "WBR",
}

PhrasingElems defines elements that qualify as phrasing content
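Tag lists like this are typically consulted with a simple membership test. An illustrative standalone sketch (abridged list; the `isPhrasing` helper is hypothetical, not part of the package API):

```go
package main

import "fmt"

// Abridged copy of the phrasing-content tag list above.
var phrasingElems = []string{
	"ABBR", "AUDIO", "B", "BDO", "BR", "CODE", "EM", "I", "IMG",
	"SPAN", "STRONG", "SUB", "SUP", "TIME", "VAR", "WBR",
}

// isPhrasing reports whether the upper-cased tag name is phrasing content.
func isPhrasing(tag string) bool {
	for _, t := range phrasingElems {
		if t == tag {
			return true
		}
	}
	return false
}

func main() {
	fmt.Println(isPhrasing("SPAN")) // true
	fmt.Println(isPhrasing("DIV"))  // false
}
```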

var PresentationalAttributes = []string{"align", "background", "bgcolor", "border", "cellpadding", "cellspacing", "frame", "hspace", "rules", "style", "valign", "vspace"}

PresentationalAttributes defines presentational attributes to remove

var UnlikelyRoles = []string{"menu", "menubar", "complementary", "navigation", "alert", "alertdialog", "dialog"}

UnlikelyRoles defines ARIA roles that suggest a node is not content

Functions

func ExtractFromHTML

func ExtractFromHTML(html string, options *types.ExtractionOptions) (*types.Article, error)

ExtractFromHTML extracts readable content from HTML using the pure Go Readability implementation. It adapts that implementation to the interface expected by the main extractor package.

Types

type Article

type Article struct {
	Title        string    `json:"title"`
	Byline       string    `json:"byline"`
	Date         time.Time `json:"date"`
	Content      string    `json:"content"`
	PlainContent string    `json:"plain_content"`
	TextContent  string    `json:"text_content"`
	Excerpt      string    `json:"excerpt"`
	SiteName     string    `json:"site_name"`
	Length       int       `json:"length"`
}

Article is the standard article format produced by ReadabilityArticle.ToArticle, for compatibility with existing code that expects this type.

type NodeInfo

type NodeInfo struct {
	// contains filtered or unexported fields
}

NodeInfo holds information about a node

type Readability

type Readability struct {
	// contains filtered or unexported fields
}

Readability implements the Readability algorithm

func NewFromDocument

func NewFromDocument(doc *goquery.Document, opts *ReadabilityOptions) *Readability

NewFromDocument creates a new Readability parser from a goquery document

func NewFromHTML

func NewFromHTML(html string, opts *ReadabilityOptions) (*Readability, error)

NewFromHTML creates a new Readability parser from HTML string

func (*Readability) Parse

func (r *Readability) Parse() (*ReadabilityArticle, error)

Parse runs the Readability algorithm

type ReadabilityArticle

type ReadabilityArticle struct {
	Title       string    // Article title
	Byline      string    // Article byline (author)
	Content     string    // Article content (HTML)
	TextContent string    // Article text content (plain text)
	Length      int       // Length of the text content
	Excerpt     string    // Short excerpt
	SiteName    string    // Site name
	Date        time.Time // Publication date
}

ReadabilityArticle represents the extracted article

func Parse

func Parse(html string) (*ReadabilityArticle, error)

Parse extracts article content from HTML using default options
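Typical usage is a sketch along these lines (the module's import path is not shown on this page, so it is omitted; error handling abbreviated):

```go
// Sketch only: substitute the module's actual import path.
article, err := readability.Parse(htmlString)
if err != nil {
	log.Fatal(err)
}
fmt.Println(article.Title)
fmt.Println(article.Length) // length of the extracted plain text
```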

func ParseHTML

func ParseHTML(html string, opts *ReadabilityOptions) (*ReadabilityArticle, error)

ParseHTML parses HTML content using the Readability algorithm

func ParseWithOptions

func ParseWithOptions(html string, debug bool, maxElems int, charThreshold int) (*ReadabilityArticle, error)

ParseWithOptions extracts article content from HTML using custom options

func (*ReadabilityArticle) ToArticle

func (r *ReadabilityArticle) ToArticle() *Article

ToArticle converts a ReadabilityArticle to a standard Article format

type ReadabilityOptions

type ReadabilityOptions struct {
	Debug             bool           // Debug mode
	MaxElemsToParse   int            // Maximum elements to parse (0 = no limit)
	NbTopCandidates   int            // Number of top candidates to consider
	CharThreshold     int            // Minimum character threshold
	ClassesToPreserve []string       // Classes to preserve
	KeepClasses       bool           // Whether to keep classes
	DisableJSONLD     bool           // Whether to disable JSON-LD processing
	AllowedVideoRegex *regexp.Regexp // Regex for allowed videos
}

ReadabilityOptions defines configuration options for the Readability parser
