Documentation ¶
Overview ¶
Package extractor provides the main functionality for extracting readable content from HTML. It offers two extraction paths: a JavaScript-based one that uses Readability.js, and a pure Go implementation.
Constants ¶
This section is empty.
Variables ¶
This section is empty.
Functions ¶
This section is empty.
Types ¶
type Extractor ¶
type Extractor interface {
	// ExtractFromHTML extracts article content from an HTML string
	ExtractFromHTML(html string, options *types.ExtractionOptions) (*types.Article, error)

	// ExtractFromReader extracts article content from an io.Reader
	ExtractFromReader(r io.Reader, options *types.ExtractionOptions) (*types.Article, error)
}
Extractor defines the interface for article extraction. It provides methods to extract article content from HTML strings or io.Readers.
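To illustrate how the two methods of the interface relate, here is a minimal sketch of a type that satisfies it. ExtractionOptions and Article are local stand-ins for types.ExtractionOptions and types.Article (their fields are assumptions), and passthroughExtractor is a hypothetical implementation that is not part of the package.

```go
package main

import (
	"fmt"
	"io"
	"strings"
)

// ExtractionOptions and Article are local stand-ins for the package's
// types.ExtractionOptions and types.Article. Their fields are assumptions.
type ExtractionOptions struct{}

type Article struct {
	Content string
}

// Extractor mirrors the documented interface, using the stand-in types.
type Extractor interface {
	ExtractFromHTML(html string, options *ExtractionOptions) (*Article, error)
	ExtractFromReader(r io.Reader, options *ExtractionOptions) (*Article, error)
}

// passthroughExtractor is a hypothetical implementation that returns its
// input unchanged instead of performing real extraction.
type passthroughExtractor struct{}

func (passthroughExtractor) ExtractFromHTML(html string, _ *ExtractionOptions) (*Article, error) {
	return &Article{Content: html}, nil
}

// ExtractFromReader reads the whole input and delegates to ExtractFromHTML.
func (e passthroughExtractor) ExtractFromReader(r io.Reader, options *ExtractionOptions) (*Article, error) {
	data, err := io.ReadAll(r)
	if err != nil {
		return nil, fmt.Errorf("reading input: %w", err)
	}
	return e.ExtractFromHTML(string(data), options)
}

func main() {
	var e Extractor = passthroughExtractor{}
	article, err := e.ExtractFromReader(strings.NewReader("<p>hello</p>"), nil)
	if err != nil {
		panic(err)
	}
	fmt.Println(article.Content)
}
```

Callers that depend only on the interface can swap in a stub like this one in tests.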
type Option ¶
type Option func(*types.ExtractionOptions)
Option represents a function that modifies ExtractionOptions. This follows the functional options pattern for configuring the extractor.
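The functional options pattern can be sketched as follows. This is an illustration, not the package's code: ExtractionOptions is a local stand-in with assumed field names, the lowercase constructors withTimeout and withReadability are hypothetical, and the 30-second default timeout is an assumption (only the Readability.js default of "enabled" comes from this page).

```go
package main

import (
	"fmt"
	"time"
)

// ExtractionOptions is a local stand-in for types.ExtractionOptions.
// The field names are assumptions made for illustration.
type ExtractionOptions struct {
	UseReadability bool
	Timeout        time.Duration
}

// Option has the documented shape, but targets the stand-in struct.
type Option func(*ExtractionOptions)

// withTimeout is a hypothetical option constructor that sets the timeout.
func withTimeout(d time.Duration) Option {
	return func(o *ExtractionOptions) { o.Timeout = d }
}

// withReadability is a hypothetical option constructor that toggles Readability.js.
func withReadability(enabled bool) Option {
	return func(o *ExtractionOptions) { o.UseReadability = enabled }
}

// applyOptions starts from assumed defaults and applies each option in order,
// so later options override earlier ones.
func applyOptions(opts ...Option) *ExtractionOptions {
	o := &ExtractionOptions{UseReadability: true, Timeout: 30 * time.Second}
	for _, opt := range opts {
		opt(o)
	}
	return o
}

func main() {
	o := applyOptions(withReadability(false), withTimeout(5*time.Second))
	fmt.Println(o.UseReadability, o.Timeout)
}
```

The pattern lets callers set only the options they care about while every other field keeps its default.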
func WithContentDigests ¶
WithContentDigests enables or disables content digest attributes. Content digests are SHA-256 hashes of the content, which can be used to identify and track content across different versions of the same document.
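A minimal sketch of what a content digest is, using only the standard library. Which text the package hashes and how it encodes the result are not stated here; hex encoding and the helper name contentDigest are assumptions.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// contentDigest returns the hex-encoded SHA-256 hash of text.
// Hex encoding is an assumption; the package may encode digests differently.
func contentDigest(text string) string {
	sum := sha256.Sum256([]byte(text))
	return hex.EncodeToString(sum[:])
}

func main() {
	a := contentDigest("The quick brown fox.")
	b := contentDigest("The quick brown fox.")
	c := contentDigest("The quick brown fox!")
	fmt.Println(a == b) // true: identical content yields an identical digest
	fmt.Println(a == c) // false: the content differs, so the digest differs
}
```

Because equal content always produces the same digest, a paragraph that survives unchanged between two versions of a document can be matched by comparing digests alone.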
func WithMaxBufferSize ¶
WithMaxBufferSize sets the maximum buffer size for content processing. This limits the amount of memory used during extraction for very large documents.
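One common way to bound memory when reading input is io.LimitReader. The sketch below rejects oversized input; whether the package rejects or truncates such input is not documented here, so that behavior, along with the names readLimited, readLimitedString, and errTooLarge, is an assumption.

```go
package main

import (
	"errors"
	"fmt"
	"io"
	"strings"
)

// errTooLarge is a hypothetical error for input that exceeds the limit.
var errTooLarge = errors.New("input exceeds maximum buffer size")

// readLimited reads at most maxBytes from r and fails if more data is available.
func readLimited(r io.Reader, maxBytes int64) ([]byte, error) {
	// Read one byte past the limit so oversized input can be detected
	// without buffering the whole stream.
	data, err := io.ReadAll(io.LimitReader(r, maxBytes+1))
	if err != nil {
		return nil, err
	}
	if int64(len(data)) > maxBytes {
		return nil, errTooLarge
	}
	return data, nil
}

// readLimitedString is a convenience wrapper for string input.
func readLimitedString(s string, maxBytes int64) ([]byte, error) {
	return readLimited(strings.NewReader(s), maxBytes)
}

func main() {
	_, err := readLimitedString("<html><body>too long</body></html>", 8)
	fmt.Println(err)
}
```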
func WithNodeIndexes ¶
WithNodeIndexes enables or disables node index attributes. Node indexes track the position of elements in the original HTML document, which can be useful for mapping extracted content back to the source.
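As a rough illustration of the idea, the sketch below numbers the nodes of a small tree in document order. The node type is a hypothetical stand-in for a parsed HTML element, and pre-order numbering from zero is an assumption; the package's actual indexing scheme is not described here.

```go
package main

import "fmt"

// node is a hypothetical, minimal stand-in for a parsed HTML element.
type node struct {
	Tag      string
	Index    int
	Children []*node
}

// assignIndexes numbers n and its descendants in document (pre-order) order,
// starting at next, and returns the next unused index.
func assignIndexes(n *node, next int) int {
	n.Index = next
	next++
	for _, child := range n.Children {
		next = assignIndexes(child, next)
	}
	return next
}

func main() {
	p := &node{Tag: "p"}
	article := &node{Tag: "article", Children: []*node{{Tag: "h1"}, p}}
	body := &node{Tag: "body", Children: []*node{{Tag: "nav"}, article}}

	assignIndexes(body, 0)
	fmt.Println(p.Tag, p.Index) // p 4
}
```

If the index assigned to each element survives extraction, an extracted paragraph can be traced back to the element it came from in the source document.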
func WithReadability ¶
WithReadability enables or disables the use of Readability.js. When enabled (the default), the extractor attempts to use Readability.js for extraction if Node.js is available. If it is disabled, or if Node.js is not available, the extractor falls back to the pure Go implementation.
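The fallback rule can be sketched as a small decision function. How the package actually detects Node.js is not stated here; looking up a node binary on PATH with exec.LookPath is an assumption, and chooseBackend, fakeFound, and fakeMissing are hypothetical names.

```go
package main

import (
	"fmt"
	"os/exec"
)

// chooseBackend returns "readability" when Readability.js is requested and a
// node binary can be found, and "go" otherwise. lookPath is injectable so the
// decision can be exercised without Node.js installed.
func chooseBackend(useReadability bool, lookPath func(string) (string, error)) string {
	if !useReadability {
		return "go"
	}
	if _, err := lookPath("node"); err != nil {
		return "go"
	}
	return "readability"
}

// fakeFound simulates a machine where Node.js is installed.
func fakeFound(string) (string, error) { return "/usr/bin/node", nil }

// fakeMissing simulates a machine where Node.js is not installed.
func fakeMissing(string) (string, error) { return "", exec.ErrNotFound }

func main() {
	fmt.Println(chooseBackend(true, fakeFound))   // readability
	fmt.Println(chooseBackend(true, fakeMissing)) // go
	fmt.Println(chooseBackend(false, fakeFound))  // go

	// On a real machine, pass exec.LookPath to check the actual PATH.
	fmt.Println(chooseBackend(true, exec.LookPath))
}
```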
func WithTimeout ¶
WithTimeout sets the timeout duration for extraction. This prevents extraction from hanging indefinitely on problematic documents.
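A common way to enforce such a limit in Go is context.WithTimeout combined with a select. The sketch below is one possible approach and not the package's implementation; extractWithTimeout, fastWork, and slowWork are hypothetical names.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// extractWithTimeout runs work and returns its result, or an error if work
// does not finish within d.
func extractWithTimeout(d time.Duration, work func(context.Context) (string, error)) (string, error) {
	ctx, cancel := context.WithTimeout(context.Background(), d)
	defer cancel()

	type result struct {
		text string
		err  error
	}
	// Buffered so the goroutine can always send, even after a timeout.
	done := make(chan result, 1)
	go func() {
		text, err := work(ctx)
		done <- result{text, err}
	}()

	select {
	case r := <-done:
		return r.text, r.err
	case <-ctx.Done():
		return "", fmt.Errorf("extraction timed out after %v: %w", d, ctx.Err())
	}
}

// fastWork finishes immediately.
func fastWork(context.Context) (string, error) { return "done", nil }

// slowWork simulates a problematic document: it returns only once ctx is cancelled.
func slowWork(ctx context.Context) (string, error) {
	<-ctx.Done()
	return "", ctx.Err()
}

func main() {
	fmt.Println(extractWithTimeout(time.Second, fastWork))
	fmt.Println(extractWithTimeout(50*time.Millisecond, slowWork))
}
```

Passing the context into the work function matters: it gives long-running code a way to notice the deadline and stop, instead of continuing in the background after the caller has given up.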