extractor

package
v0.2.0
Warning

This package is not in the latest version of its module.

Published: Mar 27, 2025 License: MIT Imports: 8 Imported by: 0

Documentation

Overview

Package extractor provides the main functionality for extracting readable content from HTML. It offers two extraction paths: a JavaScript-based one that uses Readability.js, and a pure Go implementation.

Index

Constants

This section is empty.

Variables

This section is empty.

Functions

This section is empty.

Types

type Extractor

type Extractor interface {
	// ExtractFromHTML extracts article content from an HTML string
	ExtractFromHTML(html string, options *types.ExtractionOptions) (*types.Article, error)

	// ExtractFromReader extracts article content from an io.Reader
	ExtractFromReader(r io.Reader, options *types.ExtractionOptions) (*types.Article, error)
}

Extractor defines the interface for article extraction. It provides methods to extract article content from HTML strings or io.Readers.

func New

func New(opts ...Option) Extractor

New creates a new Extractor instance with the provided options. It returns an implementation of the Extractor interface that can be used to extract article content from HTML.

type Option

type Option func(*types.ExtractionOptions)

Option represents a function that modifies ExtractionOptions. This follows the functional options pattern for configuring the extractor.

func WithContentDigests

func WithContentDigests(enable bool) Option

WithContentDigests enables or disables content digest attributes. Content digests are SHA256 hashes of the content, which can be used to identify and track content across different versions of the same document.
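To show why a SHA256 digest is useful for tracking content, the sketch below hashes a fragment with the standard library. The contentDigest helper and the hex encoding are assumptions; this documentation does not state how the package encodes the digest or which attribute carries it.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// contentDigest (hypothetical helper) returns the hex-encoded SHA-256 hash
// of the given content.
func contentDigest(content string) string {
	sum := sha256.Sum256([]byte(content))
	return hex.EncodeToString(sum[:])
}

func main() {
	v1 := contentDigest("<p>Hello</p>")
	v2 := contentDigest("<p>Hello</p>")  // unchanged content
	v3 := contentDigest("<p>Hello!</p>") // edited content

	// Identical content yields the same digest; any edit changes it.
	fmt.Println(len(v1), v1 == v2, v1 == v3) // prints "64 true false"
}
```

Comparing digests between two versions of a document is a cheap way to detect which blocks are unchanged.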

func WithMaxBufferSize

func WithMaxBufferSize(size int) Option

WithMaxBufferSize sets the maximum buffer size for content processing. This limits the amount of memory used during extraction for very large documents.

func WithNodeIndexes

func WithNodeIndexes(enable bool) Option

WithNodeIndexes enables or disables node index attributes. Node indexes track the position of elements in the original HTML document, which can be useful for mapping extracted content back to the source.
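The idea of position tracking can be sketched with a small tree walk. Everything here is an assumption made for illustration: the Node type is a simplified stand-in for a parsed HTML element, and the pre-order, zero-based numbering is one plausible scheme, not necessarily the one this package uses.

```go
package main

import "fmt"

// Node is a simplified stand-in for a parsed HTML element.
type Node struct {
	Tag      string
	Index    int
	Children []*Node
}

// assignIndexes numbers nodes in document (pre-order) order starting at
// next, and returns the next unused index.
func assignIndexes(n *Node, next int) int {
	n.Index = next
	next++
	for _, c := range n.Children {
		next = assignIndexes(c, next)
	}
	return next
}

// printTree prints each tag with its index, indented by depth.
func printTree(n *Node, depth int) {
	fmt.Printf("%*s%s [%d]\n", depth*2, "", n.Tag, n.Index)
	for _, c := range n.Children {
		printTree(c, depth+1)
	}
}

func main() {
	doc := &Node{Tag: "html", Children: []*Node{
		{Tag: "head"},
		{Tag: "body", Children: []*Node{{Tag: "p"}, {Tag: "p"}}},
	}}
	total := assignIndexes(doc, 0)
	printTree(doc, 0)
	fmt.Println("nodes:", total)
}
```

Because the index is assigned before any content is removed, an extracted paragraph can still be matched to its original position in the source tree.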

func WithReadability

func WithReadability(use bool) Option

WithReadability enables or disables Readability.js usage. When enabled (default), the extractor will attempt to use Readability.js for extraction if Node.js is available. If disabled or if Node.js is not available, it will fall back to the pure Go implementation.

func WithTimeout

func WithTimeout(timeout time.Duration) Option

WithTimeout sets the timeout duration for extraction. This prevents extraction from hanging indefinitely on problematic documents.
