htmldoc

package
v1.5.3
Warning

This package is not in the latest version of its module.
Published: Jan 20, 2026 | License: MIT | Imports: 9 | Imported by: 0

Documentation

Overview

Package htmldoc provides HTML document parsing.

Index

Constants

This section is empty.

Variables

This section is empty.

Functions

This section is empty.

Types

type ElementType

type ElementType int

ElementType represents the type of an HTML element.

const (
	ElementParagraph ElementType = iota
	ElementHeading
	ElementList
	ElementTable
	ElementCode
	ElementBlockquote
	ElementLink
)
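
The page lists no String method for ElementType, so values print as bare integers. The sketch below shows one way a caller might map them to readable names. It redeclares the constants locally so that it compiles on its own; elementTypeName is a hypothetical helper, not part of this package.

```go
package main

import "fmt"

// ElementType and its constants are local copies of the declarations above,
// so this sketch compiles without importing the package.
type ElementType int

const (
	ElementParagraph ElementType = iota
	ElementHeading
	ElementList
	ElementTable
	ElementCode
	ElementBlockquote
	ElementLink
)

// elementTypeName is a hypothetical helper (not part of htmldoc) that maps
// each constant to a readable name and falls back to the numeric value.
func elementTypeName(t ElementType) string {
	switch t {
	case ElementParagraph:
		return "paragraph"
	case ElementHeading:
		return "heading"
	case ElementList:
		return "list"
	case ElementTable:
		return "table"
	case ElementCode:
		return "code"
	case ElementBlockquote:
		return "blockquote"
	case ElementLink:
		return "link"
	default:
		return fmt.Sprintf("ElementType(%d)", int(t))
	}
}

func main() {
	for t := ElementParagraph; t <= ElementLink; t++ {
		fmt.Printf("%d %s\n", int(t), elementTypeName(t))
	}
}
```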

type ExtractOptions

type ExtractOptions struct {
	IncludeLinks    bool // Preserve link URLs in output
	IncludeMetadata bool // Include meta tags
	ExcludeHeaders  bool // Exclude headers (not applicable for HTML)
	ExcludeFooters  bool // Exclude footers (not applicable for HTML)

	// NavigationExclusion controls filtering of navigation, headers, footers, and sidebars.
	// Default: NavigationExclusionStandard (filters semantic elements and common class/id patterns)
	NavigationExclusion NavigationExclusionMode
}

ExtractOptions holds options for text extraction.

func DefaultExtractOptions added in v1.4.1

func DefaultExtractOptions() ExtractOptions

DefaultExtractOptions returns an ExtractOptions populated with the package defaults; NavigationExclusion is set to NavigationExclusionStandard.
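
One consequence of the declaration order of NavigationExclusionMode in this package is that a zero-value ExtractOptions selects NavigationExclusionNone rather than the documented default. The sketch below illustrates this with local copies of the types. defaultsSketch is a hypothetical stand-in for DefaultExtractOptions; only the NavigationExclusion default is documented here, so the other fields are left at their zero values as an assumption.

```go
package main

import "fmt"

// Local copies of the package declarations, so the sketch is self-contained.
type NavigationExclusionMode int

const (
	NavigationExclusionNone NavigationExclusionMode = iota
	NavigationExclusionExplicit
	NavigationExclusionStandard
	NavigationExclusionAggressive
)

type ExtractOptions struct {
	IncludeLinks        bool
	IncludeMetadata     bool
	ExcludeHeaders      bool
	ExcludeFooters      bool
	NavigationExclusion NavigationExclusionMode
}

// defaultsSketch is a hypothetical stand-in for DefaultExtractOptions.
// Assumption: only the NavigationExclusion default is known from the docs.
func defaultsSketch() ExtractOptions {
	return ExtractOptions{NavigationExclusion: NavigationExclusionStandard}
}

func main() {
	// A struct literal that omits NavigationExclusion gets the zero value,
	// which is NavigationExclusionNone (no filtering at all).
	var zero ExtractOptions
	fmt.Println(zero.NavigationExclusion == NavigationExclusionNone)

	// Starting from the defaults and overriding single fields keeps the
	// documented Standard filtering in place.
	opts := defaultsSketch()
	opts.IncludeLinks = true
	fmt.Println(opts.NavigationExclusion == NavigationExclusionStandard)
}
```

Starting from the defaults and then overriding individual fields is therefore likely the safer pattern when navigation filtering should stay enabled.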

type NavigationExclusionMode int

NavigationExclusionMode controls how navigation, headers, and footers are filtered.

const (
	// NavigationExclusionNone includes all content without filtering.
	NavigationExclusionNone NavigationExclusionMode = iota

	// NavigationExclusionExplicit skips only explicit semantic HTML5 elements:
	// <nav>, <aside>, and ARIA roles (role="navigation", role="complementary").
	// <header> and <footer> are only skipped when they are direct children of <body>
	// or a single top-level wrapper element.
	NavigationExclusionExplicit

	// NavigationExclusionStandard (default) combines explicit element detection with
	// common class/id pattern matching. This catches navigation and boilerplate content
	// even when sites don't use semantic HTML5 elements.
	// Patterns matched include: nav, navbar, navigation, menu, footer, sidebar, etc.
	NavigationExclusionStandard

	// NavigationExclusionAggressive adds link-density heuristics to standard detection.
	// Sections with very high link-to-text ratios are excluded. This may occasionally
	// exclude legitimate content like link-heavy documentation or "related articles" sections.
	NavigationExclusionAggressive
)
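
The four modes build on one another, which can be sketched as a single decision function. This is an illustration of the behaviour described in the comments above, not the package's implementation: the element struct, the token-based class/id matching, and the 0.6 link-density threshold are all assumptions made for the example.

```go
package main

import (
	"fmt"
	"strings"
)

// Local copy of the mode constants so the sketch compiles on its own.
type NavigationExclusionMode int

const (
	NavigationExclusionNone NavigationExclusionMode = iota
	NavigationExclusionExplicit
	NavigationExclusionStandard
	NavigationExclusionAggressive
)

// element is a hypothetical, flattened view of an HTML element.
type element struct {
	Tag, Role, Class, ID string
	TopLevel             bool // direct child of <body> or of a single wrapper
	TextLen, LinkTextLen int  // total text length and text length inside links
}

// Patterns named in the documentation; the real list may be longer.
var boilerplateTokens = map[string]bool{
	"nav": true, "navbar": true, "navigation": true,
	"menu": true, "footer": true, "sidebar": true,
}

// hasBoilerplateToken splits a class or id value on spaces, '-' and '_' and
// checks each token. Assumption: whole-token matching, so "canvas" is not "nav".
func hasBoilerplateToken(attr string) bool {
	tokens := strings.FieldsFunc(strings.ToLower(attr), func(r rune) bool {
		return r == ' ' || r == '-' || r == '_'
	})
	for _, tok := range tokens {
		if boilerplateTokens[tok] {
			return true
		}
	}
	return false
}

// shouldSkip reports whether e would be filtered under the given mode.
func shouldSkip(mode NavigationExclusionMode, e element) bool {
	if mode == NavigationExclusionNone {
		return false
	}
	// Explicit: semantic elements and ARIA roles.
	switch e.Tag {
	case "nav", "aside":
		return true
	case "header", "footer":
		if e.TopLevel {
			return true
		}
	}
	if e.Role == "navigation" || e.Role == "complementary" {
		return true
	}
	if mode == NavigationExclusionExplicit {
		return false
	}
	// Standard: add class/id pattern matching.
	if hasBoilerplateToken(e.Class) || hasBoilerplateToken(e.ID) {
		return true
	}
	if mode == NavigationExclusionStandard {
		return false
	}
	// Aggressive: add a link-density check (threshold is an assumed value).
	const maxLinkDensity = 0.6
	return e.TextLen > 0 && float64(e.LinkTextLen)/float64(e.TextLen) > maxLinkDensity
}

func main() {
	sidebar := element{Tag: "div", Class: "site-sidebar", TextLen: 100, LinkTextLen: 20}
	for mode := NavigationExclusionNone; mode <= NavigationExclusionAggressive; mode++ {
		fmt.Println(int(mode), shouldSkip(mode, sidebar))
	}
}
```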

type ParsedTable

type ParsedTable struct {
	Rows      [][]TableCell
	HasHeader bool
}

ParsedTable represents a table extracted from HTML.

func (*ParsedTable) ToMarkdown

func (t *ParsedTable) ToMarkdown() string

ToMarkdown converts the table to Markdown format.
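
The exact output of ToMarkdown is not documented here. As a rough illustration of what such a conversion involves, the sketch below renders a table in the pipe-table style, using local copies of ParsedTable and TableCell. tableToMarkdown is a hypothetical function; ignoring RowSpan and ColSpan, emitting an empty header row when HasHeader is false, and escaping pipes are all assumptions of the example.

```go
package main

import (
	"fmt"
	"strings"
)

// Local copies of the package types so the sketch compiles on its own.
type TableCell struct {
	Text     string
	IsHeader bool
	RowSpan  int
	ColSpan  int
}

type ParsedTable struct {
	Rows      [][]TableCell
	HasHeader bool
}

// tableToMarkdown is a hypothetical renderer, not the package's ToMarkdown.
// It ignores spans and pads short rows with empty cells.
func tableToMarkdown(t ParsedTable) string {
	cols := 0
	for _, row := range t.Rows {
		if len(row) > cols {
			cols = len(row)
		}
	}
	if cols == 0 {
		return ""
	}

	var b strings.Builder
	writeRow := func(row []TableCell) {
		b.WriteString("|")
		for i := 0; i < cols; i++ {
			text := ""
			if i < len(row) {
				// Escape pipes and flatten newlines so the row stays on one line.
				text = strings.ReplaceAll(row[i].Text, "|", `\|`)
				text = strings.ReplaceAll(text, "\n", " ")
			}
			b.WriteString(" " + text + " |")
		}
		b.WriteString("\n")
	}

	body := t.Rows
	if t.HasHeader {
		writeRow(t.Rows[0])
		body = t.Rows[1:]
	} else {
		writeRow(nil) // pipe tables need a header row, so emit an empty one
	}
	b.WriteString("|" + strings.Repeat(" --- |", cols) + "\n")
	for _, row := range body {
		writeRow(row)
	}
	return b.String()
}

func main() {
	t := ParsedTable{
		HasHeader: true,
		Rows: [][]TableCell{
			{{Text: "Name", IsHeader: true}, {Text: "Size", IsHeader: true}},
			{{Text: "a|b"}, {Text: "1"}},
		},
	}
	fmt.Print(tableToMarkdown(t))
}
```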

type Reader

type Reader struct {
	// contains filtered or unexported fields
}

Reader provides access to HTML document content.

func Open

func Open(filename string) (*Reader, error)

Open opens an HTML file for reading.

func OpenReader

func OpenReader(r io.Reader) (*Reader, error)

OpenReader parses HTML from an io.Reader.

func (*Reader) Close

func (r *Reader) Close() error

Close releases resources associated with the Reader.
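
Because Close returns an error, callers may want to surface it rather than drop it in a bare defer. The sketch below shows one common pattern against a small interface that mirrors two Reader methods. textSource, stubSource, errStubClose, and extractText are hypothetical names introduced for the example; with the real package, the value passed in would be the *Reader returned by Open or OpenReader.

```go
package main

import (
	"errors"
	"fmt"
)

// textSource lists the subset of Reader methods this sketch relies on.
// It is a hypothetical interface, declared here so the example is standalone.
type textSource interface {
	Text() (string, error)
	Close() error
}

// stubSource is a test double standing in for a real *Reader.
type stubSource struct {
	text     string
	closeErr error
	closed   bool
}

func (s *stubSource) Text() (string, error) { return s.text, nil }

func (s *stubSource) Close() error {
	s.closed = true
	return s.closeErr
}

// errStubClose is a sample error used to exercise the Close failure path.
var errStubClose = errors.New("stub: close failed")

// extractText reads the text and always closes the source. A Close error is
// reported only when extraction itself succeeded, so the first error wins.
func extractText(src textSource) (text string, err error) {
	defer func() {
		if cerr := src.Close(); err == nil {
			err = cerr
		}
	}()
	return src.Text()
}

func main() {
	src := &stubSource{text: "Hello, world"}
	text, err := extractText(src)
	fmt.Println(text, err, src.closed)
}
```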

func (*Reader) Document

func (r *Reader) Document() (*model.Document, error)

Document returns a model.Document representation of the HTML content. It uses the default options, which include NavigationExclusionStandard.

func (*Reader) DocumentWithOptions added in v1.4.1

func (r *Reader) DocumentWithOptions(opts ExtractOptions) (*model.Document, error)

DocumentWithOptions returns a model.Document with the specified extraction options.

func (*Reader) Markdown

func (r *Reader) Markdown() (string, error)

Markdown returns the HTML content as Markdown. It uses the default options, which include NavigationExclusionStandard.

func (*Reader) MarkdownWithOptions

func (r *Reader) MarkdownWithOptions(opts ExtractOptions) (string, error)

MarkdownWithOptions returns the HTML content as Markdown, using the specified extraction options.

func (*Reader) MarkdownWithRAGOptions

func (r *Reader) MarkdownWithRAGOptions(extractOpts ExtractOptions, mdOpts rag.MarkdownOptions) (string, error)

MarkdownWithRAGOptions returns the HTML content as Markdown, using the specified extraction options and rag.MarkdownOptions.

func (*Reader) Metadata

func (r *Reader) Metadata() model.Metadata

Metadata returns document metadata.

func (*Reader) PageCount

func (r *Reader) PageCount() (int, error)

PageCount returns 1 (HTML documents are single-page).

func (*Reader) Text

func (r *Reader) Text() (string, error)

Text extracts and returns all text content from the HTML document. It uses the default options, which include NavigationExclusionStandard.

func (*Reader) TextWithOptions

func (r *Reader) TextWithOptions(opts ExtractOptions) (string, error)

TextWithOptions extracts text content with the specified options.

type TableCell

type TableCell struct {
	Text     string
	IsHeader bool
	RowSpan  int
	ColSpan  int
}

TableCell represents a cell in an HTML table.
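
Since cells carry RowSpan and ColSpan, a consumer that needs a rectangular grid has to expand them. The sketch below shows one way to do that with a local copy of TableCell. expandSpans and atLeastOne are hypothetical helpers; repeating a cell's text in every slot it covers, treating spans below 1 as 1, and clipping row spans at the last row are assumptions of the example.

```go
package main

import "fmt"

// Local copy of the package type so the sketch compiles on its own.
type TableCell struct {
	Text     string
	IsHeader bool
	RowSpan  int
	ColSpan  int
}

// atLeastOne treats missing or invalid spans (values below 1) as 1.
func atLeastOne(n int) int {
	if n < 1 {
		return 1
	}
	return n
}

// expandSpans is a hypothetical helper that lays cells out on a rectangular
// grid, copying each cell's text into every slot its spans cover.
func expandSpans(rows [][]TableCell) [][]string {
	type slot struct {
		text string
		used bool
	}
	grid := make([][]slot, len(rows))
	// grow makes sure grid[r] has at least c+1 columns.
	grow := func(r, c int) {
		for len(grid[r]) <= c {
			grid[r] = append(grid[r], slot{})
		}
	}
	width := 0
	for r, row := range rows {
		c := 0
		for _, cell := range row {
			// Skip slots already claimed by a row span from an earlier row.
			for c < len(grid[r]) && grid[r][c].used {
				c++
			}
			rs, cs := atLeastOne(cell.RowSpan), atLeastOne(cell.ColSpan)
			for dr := 0; dr < rs && r+dr < len(rows); dr++ {
				grow(r+dr, c+cs-1)
				for dc := 0; dc < cs; dc++ {
					grid[r+dr][c+dc] = slot{text: cell.Text, used: true}
				}
			}
			c += cs
			if c > width {
				width = c
			}
		}
	}
	// Copy into a plain string grid, padding short rows with empty strings.
	out := make([][]string, len(rows))
	for r := range grid {
		out[r] = make([]string, width)
		for c, s := range grid[r] {
			out[r][c] = s.text
		}
	}
	return out
}

func main() {
	rows := [][]TableCell{
		{{Text: "A", RowSpan: 2}, {Text: "B", ColSpan: 2}},
		{{Text: "C"}, {Text: "D"}},
	}
	for _, row := range expandSpans(rows) {
		fmt.Println(row)
	}
}
```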