htmldoc

package

v1.5.3 (latest) | Published: Jan 20, 2026 | License: MIT | Imports: 9 | Imported by: 0

Details

Valid go.mod file
Redistributable license
Tagged version
Stable version

Repository

github.com/tsawler/tabula

Documentation ¶

Overview ¶

Package htmldoc provides HTML document parsing.

Index ¶

type ElementType
type ExtractOptions
- func DefaultExtractOptions() ExtractOptions
type NavigationExclusionMode
type ParsedTable
- func (t *ParsedTable) ToMarkdown() string
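type Reader
- func Open(filename string) (*Reader, error)
- func OpenReader(r io.Reader) (*Reader, error)
- func (r *Reader) Close() error
- func (r *Reader) Document() (*model.Document, error)
- func (r *Reader) DocumentWithOptions(opts ExtractOptions) (*model.Document, error)
- func (r *Reader) Markdown() (string, error)
- func (r *Reader) MarkdownWithOptions(opts ExtractOptions) (string, error)
- func (r *Reader) MarkdownWithRAGOptions(extractOpts ExtractOptions, mdOpts rag.MarkdownOptions) (string, error)
- func (r *Reader) Metadata() model.Metadata
- func (r *Reader) PageCount() (int, error)
- func (r *Reader) Text() (string, error)
- func (r *Reader) TextWithOptions(opts ExtractOptions) (string, error)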
type TableCell

Types ¶

type ElementType ¶

type ElementType int

ElementType represents the type of HTML element.

const (
	ElementParagraph ElementType = iota
	ElementHeading
	ElementList
	ElementTable
	ElementCode
	ElementBlockquote
	ElementLink
)
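
The constants above are plain iota values, so they print as integers unless something maps them to names. The following is a minimal sketch of such a mapping. The type and constants are redeclared locally so the example compiles on its own, and elementName is a hypothetical helper, not part of htmldoc.

```go
package main

import "fmt"

// ElementType and its constants mirror the documented declarations so that
// this sketch is self-contained.
type ElementType int

const (
	ElementParagraph ElementType = iota
	ElementHeading
	ElementList
	ElementTable
	ElementCode
	ElementBlockquote
	ElementLink
)

// elementName is a hypothetical helper (not provided by htmldoc) that maps
// each constant to a readable label and falls back to the numeric value.
func elementName(t ElementType) string {
	switch t {
	case ElementParagraph:
		return "paragraph"
	case ElementHeading:
		return "heading"
	case ElementList:
		return "list"
	case ElementTable:
		return "table"
	case ElementCode:
		return "code"
	case ElementBlockquote:
		return "blockquote"
	case ElementLink:
		return "link"
	default:
		return fmt.Sprintf("ElementType(%d)", int(t))
	}
}

func main() {
	fmt.Println(elementName(ElementTable), int(ElementTable))
}
```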

type ExtractOptions ¶

type ExtractOptions struct {
	IncludeLinks    bool // Preserve link URLs in output
	IncludeMetadata bool // Include meta tags
	ExcludeHeaders  bool // Exclude headers (not applicable for HTML)
	ExcludeFooters  bool // Exclude footers (not applicable for HTML)

	// NavigationExclusion controls filtering of navigation, headers, footers, and sidebars.
	// Default: NavigationExclusionStandard (filters semantic elements and common class/id patterns)
	NavigationExclusion NavigationExclusionMode
}

ExtractOptions holds options for text extraction.

func DefaultExtractOptions ¶ added in v1.4.1

func DefaultExtractOptions() ExtractOptions

DefaultExtractOptions returns the default extraction options, with NavigationExclusion set to NavigationExclusionStandard.
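
One consequence of the declarations is worth spelling out: NavigationExclusionNone is the zero value of NavigationExclusionMode, so an ExtractOptions{} literal disables navigation filtering, while the documented default is NavigationExclusionStandard. The sketch below illustrates the difference with local copies of the types. defaultExtractOptions is a stand-in for the real function, and leaving its boolean fields false is an assumption, since only the navigation default is documented.

```go
package main

import "fmt"

// Local copies of the documented types, so the sketch compiles on its own.
type NavigationExclusionMode int

const (
	NavigationExclusionNone NavigationExclusionMode = iota
	NavigationExclusionExplicit
	NavigationExclusionStandard
	NavigationExclusionAggressive
)

type ExtractOptions struct {
	IncludeLinks        bool
	IncludeMetadata     bool
	ExcludeHeaders      bool
	ExcludeFooters      bool
	NavigationExclusion NavigationExclusionMode
}

// defaultExtractOptions stands in for htmldoc.DefaultExtractOptions.
// Assumption: only the navigation mode differs from the zero value.
func defaultExtractOptions() ExtractOptions {
	return ExtractOptions{NavigationExclusion: NavigationExclusionStandard}
}

func main() {
	var zero ExtractOptions // zero value: navigation filtering is off

	// Start from the defaults and override a single field.
	opts := defaultExtractOptions()
	opts.IncludeLinks = true

	fmt.Println(zero.NavigationExclusion == NavigationExclusionNone)
	fmt.Println(opts.NavigationExclusion == NavigationExclusionStandard)
}
```

Starting from the defaults and overriding individual fields keeps the standard filtering in place, which a bare struct literal would silently drop.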

type NavigationExclusionMode ¶ added in v1.4.1

type NavigationExclusionMode int

NavigationExclusionMode controls how navigation, headers, and footers are filtered.

const (
	// NavigationExclusionNone includes all content without filtering.
	NavigationExclusionNone NavigationExclusionMode = iota

	// NavigationExclusionExplicit skips only explicit semantic HTML5 elements:
	// <nav>, <aside>, and ARIA roles (role="navigation", role="complementary").
	// <header> and <footer> are only skipped when they are direct children of <body>
	// or a single top-level wrapper element.
	NavigationExclusionExplicit

	// NavigationExclusionStandard (default) combines explicit element detection with
	// common class/id pattern matching. This catches navigation and boilerplate content
	// even when sites don't use semantic HTML5 elements.
	// Patterns matched include: nav, navbar, navigation, menu, footer, sidebar, etc.
	NavigationExclusionStandard

	// NavigationExclusionAggressive adds link-density heuristics to standard detection.
	// Sections with very high link-to-text ratios are excluded. This may occasionally
	// exclude legitimate content like link-heavy documentation or "related articles" sections.
	NavigationExclusionAggressive
)
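
To make the standard and aggressive modes more concrete, here is a rough sketch of the two heuristics the comments describe: class/id pattern matching and a link-density check. Everything in it is an assumption for illustration. The token-based matching, the helper names, and the 0.6 threshold are not taken from the package, whose actual pattern list and cutoff are not documented here.

```go
package main

import (
	"fmt"
	"strings"
)

// navTokens holds the patterns named in the documentation. The real list is
// longer ("etc." in the comment) and may be matched differently.
var navTokens = map[string]bool{
	"nav": true, "navbar": true, "navigation": true,
	"menu": true, "footer": true, "sidebar": true,
}

// matchesNavPattern reports whether a class or id attribute value contains a
// navigation-like token. Splitting on spaces, '-' and '_' is an assumption; it
// avoids false hits such as "unavailable", which contains "nav".
func matchesNavPattern(attr string) bool {
	tokens := strings.FieldsFunc(strings.ToLower(attr), func(r rune) bool {
		return r == ' ' || r == '-' || r == '_'
	})
	for _, tok := range tokens {
		if navTokens[tok] {
			return true
		}
	}
	return false
}

// linkDensityThreshold is a hypothetical cutoff, not the package's value.
const linkDensityThreshold = 0.6

// isLinkHeavy reports whether the share of a section's text that sits inside
// links exceeds the threshold.
func isLinkHeavy(linkChars, totalChars int) bool {
	if totalChars <= 0 {
		return false
	}
	return float64(linkChars)/float64(totalChars) > linkDensityThreshold
}

func main() {
	fmt.Println(matchesNavPattern("main-nav"))
	fmt.Println(matchesNavPattern("article-body"))
	fmt.Println(isLinkHeavy(80, 100))
}
```

The link-density check also shows why the aggressive mode can drop legitimate content: a "related articles" block is mostly link text, so it looks the same as a menu to this kind of test.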

type ParsedTable ¶

type ParsedTable struct {
	Rows      [][]TableCell
	HasHeader bool
}

ParsedTable represents a table extracted from HTML.

func (*ParsedTable) ToMarkdown ¶

func (t *ParsedTable) ToMarkdown() string

ToMarkdown converts the table to markdown format.
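
The exact output format of ToMarkdown is not documented here. As a rough illustration of what such a conversion involves, the sketch below renders a table as pipe-delimited rows with a separator line after the header. The types are redeclared locally, tableToMarkdown is a hypothetical stand-in rather than the package's implementation, and it ignores RowSpan and ColSpan.

```go
package main

import (
	"fmt"
	"strings"
)

// Local copies of the documented types, so the sketch compiles on its own.
type TableCell struct {
	Text     string
	IsHeader bool
	RowSpan  int
	ColSpan  int
}

type ParsedTable struct {
	Rows      [][]TableCell
	HasHeader bool
}

// tableToMarkdown is a hypothetical stand-in for ParsedTable.ToMarkdown.
// Assumptions: the first row is the header when HasHeader is set, spans are
// ignored, and literal pipes in cell text are escaped. Without a header row
// the output is plain pipe rows, which most renderers will not treat as a table.
func tableToMarkdown(t ParsedTable) string {
	var b strings.Builder
	for i, row := range t.Rows {
		b.WriteString("|")
		for _, c := range row {
			b.WriteString(" " + strings.ReplaceAll(c.Text, "|", `\|`) + " |")
		}
		b.WriteString("\n")
		if i == 0 && t.HasHeader {
			// Separator line between the header row and the body.
			b.WriteString("|" + strings.Repeat(" --- |", len(row)) + "\n")
		}
	}
	return b.String()
}

func main() {
	t := ParsedTable{
		HasHeader: true,
		Rows: [][]TableCell{
			{{Text: "Name", IsHeader: true}, {Text: "Qty", IsHeader: true}},
			{{Text: "Bolt"}, {Text: "4"}},
		},
	}
	fmt.Print(tableToMarkdown(t))
}
```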

type Reader ¶

type Reader struct {
	// contains filtered or unexported fields
}

Reader provides access to HTML document content.

func Open ¶

func Open(filename string) (*Reader, error)

Open opens an HTML file for reading.

func OpenReader ¶

func OpenReader(r io.Reader) (*Reader, error)

OpenReader parses HTML from an io.Reader.

func (*Reader) Close ¶

func (r *Reader) Close() error

Close releases resources associated with the Reader.
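
Because Close returns an error, callers that defer it may want to avoid discarding that error. The sketch below shows one common Go pattern for this, written against a small interface that lists only the Markdown and Close signatures documented for Reader. The interface, the convertAndClose helper, and the fakeReader type are all hypothetical and exist only so the example runs without the package.

```go
package main

import (
	"errors"
	"fmt"
)

// markdownSource lists the two methods this sketch relies on, using the
// signatures documented for *Reader.
type markdownSource interface {
	Markdown() (string, error)
	Close() error
}

// convertAndClose is a hypothetical helper. It converts the document, always
// closes the source, and reports the Close error only when the conversion
// itself succeeded.
func convertAndClose(src markdownSource) (md string, err error) {
	defer func() {
		if cerr := src.Close(); err == nil {
			err = cerr
		}
	}()
	return src.Markdown()
}

// fakeReader stands in for *Reader so the example is self-contained.
type fakeReader struct {
	md       string
	closeErr error
	closed   bool
}

func (f *fakeReader) Markdown() (string, error) { return f.md, nil }
func (f *fakeReader) Close() error              { f.closed = true; return f.closeErr }

// errClose is a sample error used to exercise the Close failure path.
var errClose = errors.New("close failed")

func main() {
	ok := &fakeReader{md: "# Title"}
	md, err := convertAndClose(ok)
	fmt.Println(md, err, ok.closed)

	bad := &fakeReader{md: "# Title", closeErr: errClose}
	_, err = convertAndClose(bad)
	fmt.Println(err)
}
```

With the real package, a value returned by Open or OpenReader would take the place of fakeReader.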

func (*Reader) Document ¶

func (r *Reader) Document() (*model.Document, error)

Document returns a model.Document representation of the HTML content. It uses the default options, which include NavigationExclusionStandard.

func (*Reader) DocumentWithOptions ¶ added in v1.4.1

func (r *Reader) DocumentWithOptions(opts ExtractOptions) (*model.Document, error)

DocumentWithOptions returns a model.Document with the specified extraction options.

func (*Reader) Markdown ¶

func (r *Reader) Markdown() (string, error)

Markdown returns the HTML content as Markdown. It uses the default options, which include NavigationExclusionStandard.

func (*Reader) MarkdownWithOptions ¶

func (r *Reader) MarkdownWithOptions(opts ExtractOptions) (string, error)

MarkdownWithOptions returns the HTML content as Markdown, using the specified extraction options.

func (*Reader) MarkdownWithRAGOptions ¶

func (r *Reader) MarkdownWithRAGOptions(extractOpts ExtractOptions, mdOpts rag.MarkdownOptions) (string, error)

MarkdownWithRAGOptions returns the HTML content as Markdown, using the specified extraction options (extractOpts) and RAG Markdown options (mdOpts).

func (*Reader) Metadata ¶

func (r *Reader) Metadata() model.Metadata

Metadata returns document metadata.

func (*Reader) PageCount ¶

func (r *Reader) PageCount() (int, error)

PageCount returns 1 (HTML documents are single-page).

func (*Reader) Text ¶

func (r *Reader) Text() (string, error)

Text extracts and returns all text content from the HTML document. It uses the default options, which include NavigationExclusionStandard.

func (*Reader) TextWithOptions ¶

func (r *Reader) TextWithOptions(opts ExtractOptions) (string, error)

TextWithOptions extracts text content with the specified options.

type TableCell ¶

type TableCell struct {
	Text     string
	IsHeader bool
	RowSpan  int
	ColSpan  int
}

TableCell represents a cell in an HTML table.
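
RowSpan and ColSpan mean that a row's cells do not map one-to-one onto grid columns. The following sketch shows one way a consumer might expand cells into a rectangular-ish grid of strings. It is an illustration under stated assumptions: the type is redeclared locally, expandGrid and span are hypothetical helpers, span values below 1 are treated as 1, and positions covered by a span are filled with empty strings.

```go
package main

import "fmt"

// TableCell mirrors the documented struct so the sketch compiles on its own.
type TableCell struct {
	Text     string
	IsHeader bool
	RowSpan  int
	ColSpan  int
}

// span treats unset or invalid span values as 1 (an assumption).
func span(n int) int {
	if n < 1 {
		return 1
	}
	return n
}

// expandGrid is a hypothetical helper that places each cell's text at its
// top-left grid position and fills the rest of its span with "". Row spans
// that run past the last row are clipped, and rows may end up ragged.
func expandGrid(rows [][]TableCell) [][]string {
	occupied := map[[2]int]bool{}
	grid := make([][]string, len(rows))
	place := func(r, c int, text string) {
		if r >= len(grid) {
			return
		}
		for len(grid[r]) <= c {
			grid[r] = append(grid[r], "")
		}
		grid[r][c] = text
	}
	for r, row := range rows {
		col := 0
		for _, cell := range row {
			// Skip columns already claimed by a row span from above.
			for occupied[[2]int{r, col}] {
				col++
			}
			rs, cs := span(cell.RowSpan), span(cell.ColSpan)
			for dr := 0; dr < rs; dr++ {
				for dc := 0; dc < cs; dc++ {
					occupied[[2]int{r + dr, col + dc}] = true
					text := ""
					if dr == 0 && dc == 0 {
						text = cell.Text
					}
					place(r+dr, col+dc, text)
				}
			}
			col += cs
		}
	}
	return grid
}

func main() {
	rows := [][]TableCell{
		{{Text: "A", RowSpan: 2}, {Text: "B"}},
		{{Text: "C"}},
	}
	fmt.Printf("%q\n", expandGrid(rows))
}
```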
