extractor

package

Version: v0.2.0 (latest) | Published: Mar 27, 2025 | License: MIT | Imports: 8 | Imported by: 0

Documentation ¶

Overview ¶

Package extractor extracts readable article content from HTML. It provides two extraction backends: a JavaScript-based one that uses Readability.js, and a pure Go implementation.

Constants ¶

This section is empty.

Variables ¶

This section is empty.

Functions ¶

This section is empty.

Types ¶

type Extractor ¶

type Extractor interface {
	// ExtractFromHTML extracts article content from an HTML string
	ExtractFromHTML(html string, options *types.ExtractionOptions) (*types.Article, error)

	// ExtractFromReader extracts article content from an io.Reader
	ExtractFromReader(r io.Reader, options *types.ExtractionOptions) (*types.Article, error)
}

Extractor defines the interface for article extraction. It provides methods to extract article content from HTML strings or io.Readers.

func New ¶

func New(opts ...Option) Extractor

New creates a new Extractor instance with the provided options. It returns an implementation of the Extractor interface that can be used to extract article content from HTML.

type Option ¶

type Option func(*types.ExtractionOptions)

Option represents a function that modifies ExtractionOptions. This follows the functional options pattern for configuring the extractor.

func WithContentDigests ¶

func WithContentDigests(enable bool) Option

WithContentDigests enables or disables content digest attributes. Content digests are SHA-256 hashes of the content, which can be used to identify and track content across different versions of the same document.
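As a rough illustration of what such a digest looks like, the sketch below hashes a piece of text with the standard library. The helper name contentDigest and the choice of lowercase hex encoding over the raw text bytes are assumptions; the page does not say how the package normalizes content or encodes the hash.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// contentDigest is a hypothetical helper: it returns the SHA-256 hash of
// text as a lowercase hex string. The encoding is an assumption.
func contentDigest(text string) string {
	sum := sha256.Sum256([]byte(text))
	return hex.EncodeToString(sum[:])
}

func main() {
	v1 := contentDigest("The quick brown fox.")
	v2 := contentDigest("The quick brown fox.")
	v3 := contentDigest("The quick red fox.")

	// Identical content yields identical digests; any edit changes it.
	fmt.Println(v1 == v2)
	fmt.Println(v1 == v3)
}
```

Because equal content always hashes to the same value, comparing digests between two versions of a document is a cheap way to find the elements that did not change.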

func WithMaxBufferSize ¶

func WithMaxBufferSize(size int) Option

WithMaxBufferSize sets the maximum buffer size for content processing. This limits the amount of memory used during extraction for very large documents.

func WithNodeIndexes ¶

func WithNodeIndexes(enable bool) Option

WithNodeIndexes enables or disables node index attributes. Node indexes track the position of elements in the original HTML document, which can be useful for mapping extracted content back to the source.
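The idea of a node index can be sketched by numbering elements in document order. This is only an illustration: indexedNode and indexElements are hypothetical names, the numbering scheme is an assumption, and encoding/xml is used here solely to keep the example within the standard library, so it only handles well-formed markup, unlike a real HTML parser.

```go
package main

import (
	"encoding/xml"
	"fmt"
	"io"
	"strings"
)

// indexedNode pairs an element name with its position in document order.
type indexedNode struct {
	Index int
	Tag   string
}

// indexElements walks the markup token by token and assigns each start
// element the next index, beginning at 0.
func indexElements(markup string) ([]indexedNode, error) {
	dec := xml.NewDecoder(strings.NewReader(markup))
	var nodes []indexedNode
	for {
		tok, err := dec.Token()
		if err == io.EOF {
			return nodes, nil
		}
		if err != nil {
			return nil, err
		}
		if start, ok := tok.(xml.StartElement); ok {
			nodes = append(nodes, indexedNode{Index: len(nodes), Tag: start.Name.Local})
		}
	}
}

func main() {
	nodes, err := indexElements("<article><h1>Title</h1><p>one</p><p>two</p></article>")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	for _, n := range nodes {
		fmt.Println(n.Index, n.Tag)
	}
}
```

With indexes recorded during parsing, an element that survives extraction can be traced back to the element with the same index in the source document.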

func WithReadability ¶

func WithReadability(use bool) Option

WithReadability enables or disables Readability.js usage. When enabled (default), the extractor will attempt to use Readability.js for extraction if Node.js is available. If disabled or if Node.js is not available, it will fall back to the pure Go implementation.

func WithTimeout ¶

func WithTimeout(timeout time.Duration) Option

WithTimeout sets the timeout duration for extraction. This prevents extraction from hanging indefinitely on problematic documents.

Source Files ¶

extractor.go
