Documentation ¶
Overview ¶
Package htmldoc provides HTML document parsing.
Index ¶
- type ElementType
- type ExtractOptions
- func DefaultExtractOptions() ExtractOptions
- type NavigationExclusionMode
- type ParsedTable
- func (t *ParsedTable) ToMarkdown() string
- type Reader
- func (r *Reader) Close() error
- func (r *Reader) Document() (*model.Document, error)
- func (r *Reader) DocumentWithOptions(opts ExtractOptions) (*model.Document, error)
- func (r *Reader) Markdown() (string, error)
- func (r *Reader) MarkdownWithOptions(opts ExtractOptions) (string, error)
- func (r *Reader) MarkdownWithRAGOptions(extractOpts ExtractOptions, mdOpts rag.MarkdownOptions) (string, error)
- func (r *Reader) Metadata() model.Metadata
- func (r *Reader) PageCount() (int, error)
- func (r *Reader) Text() (string, error)
- func (r *Reader) TextWithOptions(opts ExtractOptions) (string, error)
- type TableCell
Constants ¶
This section is empty.
Variables ¶
This section is empty.
Functions ¶
This section is empty.
Types ¶
type ElementType ¶
type ElementType int
ElementType represents the type of HTML element.
const (
	ElementParagraph ElementType = iota
	ElementHeading
	ElementList
	ElementTable
	ElementCode
	ElementBlockquote
	ElementLink
)
type ExtractOptions ¶
type ExtractOptions struct {
	IncludeLinks    bool // Preserve link URLs in output
	IncludeMetadata bool // Include meta tags
	ExcludeHeaders  bool // Exclude headers (not applicable for HTML)
	// Default: NavigationExclusionStandard (filters semantic elements and common class/id patterns)
	NavigationExclusion NavigationExclusionMode
}
ExtractOptions holds options for text extraction.
func DefaultExtractOptions ¶ added in v1.4.1
func DefaultExtractOptions() ExtractOptions
DefaultExtractOptions returns extract options with sensible defaults.
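One consequence of the declarations on this page is worth spelling out: NavigationExclusionNone is the first constant in its iota block, so a zero-valued ExtractOptions{} turns navigation filtering off, while the documented default is NavigationExclusionStandard. The sketch below mirrors the two types locally to show the difference. The import path of htmldoc is not given here, so nothing is imported from it; defaultExtractOptions is a hypothetical stand-in, and leaving its boolean fields false is an assumption, since only the NavigationExclusion default is documented.

```go
package main

import "fmt"

// Local mirrors of the documented types, for illustration only.
type NavigationExclusionMode int

const (
	NavigationExclusionNone NavigationExclusionMode = iota
	NavigationExclusionExplicit
	NavigationExclusionStandard
	NavigationExclusionAggressive
)

type ExtractOptions struct {
	IncludeLinks        bool
	IncludeMetadata     bool
	ExcludeHeaders      bool
	NavigationExclusion NavigationExclusionMode
}

// defaultExtractOptions is a hypothetical stand-in for DefaultExtractOptions.
// Only the NavigationExclusion default is documented; the boolean fields are
// left at their zero values as an assumption.
func defaultExtractOptions() ExtractOptions {
	return ExtractOptions{NavigationExclusion: NavigationExclusionStandard}
}

func main() {
	// A zero-valued struct selects NavigationExclusionNone (iota == 0).
	var zero ExtractOptions
	fmt.Println(zero.NavigationExclusion == NavigationExclusionNone) // true

	// Starting from the defaults and overriding one field keeps the
	// Standard navigation filtering in place.
	opts := defaultExtractOptions()
	opts.IncludeLinks = true
	fmt.Println(opts.NavigationExclusion == NavigationExclusionStandard) // true
}
```

With the real package, the same pattern would presumably be to call DefaultExtractOptions, adjust individual fields, and pass the result to one of the WithOptions methods.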
type NavigationExclusionMode ¶ added in v1.4.1
type NavigationExclusionMode int
NavigationExclusionMode controls how navigation, headers, and footers are filtered.
const (
	NavigationExclusionNone NavigationExclusionMode = iota
	// <nav>, <aside>, and ARIA roles (role="navigation", role="complementary").
	// <header> and <footer> are only skipped when they are direct children of <body>
	// or a single top-level wrapper element.
	NavigationExclusionExplicit
	// common class/id pattern matching. This catches navigation and boilerplate content
	// even when sites don't use semantic HTML5 elements.
	// Patterns matched include: nav, navbar, navigation, menu, footer, sidebar, etc.
	NavigationExclusionStandard
	// Sections with very high link-to-text ratios are excluded. This may occasionally
	// exclude legitimate content like link-heavy documentation or "related articles" sections.
	NavigationExclusionAggressive
)
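The four modes read as cumulative levels: each one appears to add a check on top of the previous one. The following is a minimal sketch of that layering, not the package's implementation. The element struct, the exact-token matching of class and id values, and the 0.8 link-density threshold are all assumptions made for illustration; the documentation specifies none of them.

```go
package main

import (
	"fmt"
	"strings"
)

// Mode is a local stand-in for NavigationExclusionMode.
type Mode int

const (
	None Mode = iota
	Explicit
	Standard
	Aggressive
)

// element is a hypothetical, flattened view of one HTML element.
type element struct {
	Tag       string // lower-case tag name
	Role      string // value of the role attribute
	Class, ID string
	TopLevel  bool // direct child of <body> or of a single top-level wrapper
	TextLen   int  // length of all visible text
	LinkLen   int  // length of the text inside <a> elements
}

// patterns are the class/id names listed in the documentation. Matching them
// as whole tokens is an assumption; the real matching rule is not documented.
var patterns = []string{"nav", "navbar", "navigation", "menu", "footer", "sidebar"}

// linkDensityLimit is an assumed threshold for "very high" link-to-text ratio.
const linkDensityLimit = 0.8

func shouldExclude(e element, m Mode) bool {
	if m == None {
		return false
	}
	// Explicit and above: semantic elements and ARIA roles.
	switch e.Tag {
	case "nav", "aside":
		return true
	case "header", "footer":
		if e.TopLevel {
			return true
		}
	}
	if e.Role == "navigation" || e.Role == "complementary" {
		return true
	}
	if m == Explicit {
		return false
	}
	// Standard and above: common class/id names.
	for _, tok := range strings.Fields(strings.ToLower(e.Class + " " + e.ID)) {
		for _, p := range patterns {
			if tok == p {
				return true
			}
		}
	}
	if m == Standard {
		return false
	}
	// Aggressive: sections that consist mostly of link text.
	return e.TextLen > 0 && float64(e.LinkLen)/float64(e.TextLen) > linkDensityLimit
}

func main() {
	related := element{Tag: "div", Class: "related", TextLen: 100, LinkLen: 95}
	fmt.Println(shouldExclude(related, Standard))   // false
	fmt.Println(shouldExclude(related, Aggressive)) // true
}
```

The "related" block in main illustrates the trade-off named in the comments: it survives the Standard level but is dropped at the Aggressive level because almost all of its text is link text.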
type ParsedTable ¶
ParsedTable represents a table extracted from HTML.
func (*ParsedTable) ToMarkdown ¶
func (t *ParsedTable) ToMarkdown() string
ToMarkdown converts the table to markdown format.
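The fields of ParsedTable are not shown on this page, so the conversion can only be illustrated with a stand-in. The sketch below assumes a hypothetical table type with a header row and data rows, and renders it as a pipe table with a "---" separator line. Padding short rows and escaping "|" inside cells are choices made for this example, not documented behavior of ToMarkdown.

```go
package main

import (
	"fmt"
	"strings"
)

// table is a hypothetical stand-in for ParsedTable.
type table struct {
	Header []string
	Rows   [][]string
}

// writeRow emits one pipe-delimited line, padding missing cells with
// empty strings and escaping literal pipes inside cell text.
func writeRow(b *strings.Builder, cells []string, width int) {
	b.WriteString("|")
	for i := 0; i < width; i++ {
		cell := ""
		if i < len(cells) {
			cell = strings.ReplaceAll(strings.TrimSpace(cells[i]), "|", `\|`)
		}
		b.WriteString(" " + cell + " |")
	}
	b.WriteString("\n")
}

// toMarkdown renders the header, a separator line, and then each data row.
func (t table) toMarkdown() string {
	width := len(t.Header)
	for _, r := range t.Rows {
		if len(r) > width {
			width = len(r)
		}
	}
	if width == 0 {
		return ""
	}
	var b strings.Builder
	writeRow(&b, t.Header, width)
	sep := make([]string, width)
	for i := range sep {
		sep[i] = "---"
	}
	writeRow(&b, sep, width)
	for _, r := range t.Rows {
		writeRow(&b, r, width)
	}
	return b.String()
}

func main() {
	t := table{
		Header: []string{"Name", "Qty"},
		Rows:   [][]string{{"bolt", "12"}, {"nut", "40"}},
	}
	fmt.Print(t.toMarkdown())
}
```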
type Reader ¶
type Reader struct {
// contains filtered or unexported fields
}
Reader provides access to HTML document content.
func OpenReader ¶
OpenReader parses HTML from an io.Reader.
func (*Reader) Document ¶
func (r *Reader) Document() (*model.Document, error)
Document returns a model.Document representation of the HTML content. It uses the default options, which include NavigationExclusionStandard.
func (*Reader) DocumentWithOptions ¶ added in v1.4.1
func (r *Reader) DocumentWithOptions(opts ExtractOptions) (*model.Document, error)
DocumentWithOptions returns a model.Document with the specified extraction options.
func (*Reader) Markdown ¶
func (r *Reader) Markdown() (string, error)
Markdown returns the HTML content as Markdown. It uses the default options, which include NavigationExclusionStandard.
func (*Reader) MarkdownWithOptions ¶
func (r *Reader) MarkdownWithOptions(opts ExtractOptions) (string, error)
MarkdownWithOptions returns HTML content as Markdown with options.
func (*Reader) MarkdownWithRAGOptions ¶
func (r *Reader) MarkdownWithRAGOptions(extractOpts ExtractOptions, mdOpts rag.MarkdownOptions) (string, error)
MarkdownWithRAGOptions returns HTML content as Markdown with RAG options.
func (*Reader) Text ¶
func (r *Reader) Text() (string, error)
Text extracts and returns all text content from the HTML document. It uses the default options, which include NavigationExclusionStandard.
func (*Reader) TextWithOptions ¶
func (r *Reader) TextWithOptions(opts ExtractOptions) (string, error)
TextWithOptions extracts text content with the specified options.