layout

package

v1.6.6 (latest) · Published: Feb 4, 2026 · License: MIT · Imports: 6 · Imported by: 0

Repository

github.com/tsawler/tabula

Documentation ¶

Overview ¶

Package layout provides document layout analysis for extracting semantic structure from PDF pages. It includes the unified Layout Analyzer that orchestrates all detection components including column, line, block, paragraph, heading, list, and reading order detection.

This package analyzes text fragments to detect document structure including lines, paragraphs, headings, lists, columns, and reading order.

Layout Analysis ¶

The Analyzer orchestrates all detection components:

analyzer := layout.NewAnalyzer()
result := analyzer.Analyze(fragments, pageWidth, pageHeight)

For faster analysis without heading/list detection:

result := analyzer.QuickAnalyze(fragments, pageWidth, pageHeight)

Analysis Results ¶

The AnalysisResult contains:

Elements - all detected elements in reading order
Columns - column layout information
ReadingOrder - proper reading sequence for multi-column layouts
Headings, Lists, Paragraphs - detected semantic elements
Blocks, Lines - lower-level text structure

Detectors ¶

The package includes specialized detectors:

LineDetector - groups fragments into text lines
ParagraphDetector - groups lines into paragraphs
HeadingDetector - identifies headings by font size and position
ListDetector - detects bulleted and numbered lists
BlockDetector - detects spatial text blocks
ColumnDetector - detects multi-column layouts
ReadingOrderDetector - determines proper reading sequence
HeaderFooterDetector - identifies repeated headers/footers
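To illustrate the font-size part of heading detection, the following is a minimal sketch that assigns a heading level from the ratio of a line's font size to the body font size. The `classify` function and the `levelRatios` table are hypothetical names, not part of this package. The thresholds are the defaults documented for HeadingConfig.FontSizeRatios, and the actual detector presumably weighs further signals (position, bold, spacing) that this sketch ignores.

```go
package main

import "fmt"

// levelRatios lists minimum font-size ratios (heading size / body size),
// ordered from H1 to H6. The values mirror the documented defaults.
var levelRatios = []float64{1.8, 1.5, 1.3, 1.15, 1.1, 1.05}

// classify returns a heading level from 1 to 6, or 0 when the text is not
// large enough to count as a heading. Hypothetical helper for illustration.
func classify(fontSize, bodySize float64) int {
	if bodySize <= 0 {
		return 0
	}
	ratio := fontSize / bodySize
	for i, threshold := range levelRatios {
		if ratio >= threshold {
			return i + 1
		}
	}
	return 0
}

func main() {
	fmt.Println(classify(24, 12)) // ratio 2.0
	fmt.Println(classify(16, 12)) // ratio of about 1.33
	fmt.Println(classify(12, 12)) // ratio 1.0, body text
}
```

Checking the largest threshold first means a line is assigned the highest level it qualifies for.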

Configuration ¶

Each detector can be configured independently:

config := layout.DefaultAnalyzerConfig()
config.DetectHeadings = true
config.DetectLists = true
config.HeadingConfig.MinFontSizeRatio = 1.2
analyzer := layout.NewAnalyzerWithConfig(config)

Header/Footer Filtering ¶

For multi-page documents, headers and footers can be detected and filtered:

result := analyzer.AnalyzeWithHeaderFooterFiltering(pageFragments, pageIndex)

Heading detection identifies and classifies document headings by level (H1-H6). List detection identifies and structures bulleted and numbered lists. Reading order detection determines the correct sequence for reading text in complex layouts.

Index ¶

func IsListItemText(text string) bool
func ReorderForReading(fragments []text.TextFragment, pageWidth, pageHeight float64) []text.TextFragment
type AnalysisResult
- func (r *AnalysisResult) GetElements() []model.Element
- func (r *AnalysisResult) GetMarkdown() string
- func (r *AnalysisResult) GetText() string
type AnalysisStats
type Analyzer
- func NewAnalyzer() *Analyzer
- func NewAnalyzerWithConfig(config AnalyzerConfig) *Analyzer
- func (a *Analyzer) Analyze(fragments []text.TextFragment, pageWidth, pageHeight float64) *AnalysisResult
- func (a *Analyzer) AnalyzeWithHeaderFooterFiltering(pageFragments []PageFragments, pageIndex int) *AnalysisResult
- func (a *Analyzer) QuickAnalyze(fragments []text.TextFragment, pageWidth, pageHeight float64) *AnalysisResult
type AnalyzerConfig
- func DefaultAnalyzerConfig() AnalyzerConfig
type Block
- func (b *Block) AverageFontSize() float64
- func (b *Block) ContainsPoint(x, y float64) bool
- func (b *Block) FragmentCount() int
- func (b *Block) GetText() string
- func (b *Block) LineCount() int
type BlockConfig
- func DefaultBlockConfig() BlockConfig
type BlockDetector
- func NewBlockDetector() *BlockDetector
- func NewBlockDetectorWithConfig(config BlockConfig) *BlockDetector
- func (d *BlockDetector) Detect(fragments []text.TextFragment, pageWidth, pageHeight float64) *BlockLayout
type BlockLayout
- func (l *BlockLayout) BlockCount() int
- func (l *BlockLayout) GetAllFragments() []text.TextFragment
- func (l *BlockLayout) GetBlock(index int) *Block
- func (l *BlockLayout) GetText() string
type BulletStyle
- func (s BulletStyle) String() string
type Column
type ColumnConfig
- func DefaultColumnConfig() ColumnConfig
type ColumnDetector
- func NewColumnDetector() *ColumnDetector
- func NewColumnDetectorWithConfig(config ColumnConfig) *ColumnDetector
- func (d *ColumnDetector) Detect(fragments []text.TextFragment, pageWidth, pageHeight float64) *ColumnLayout
type ColumnLayout
- func (l *ColumnLayout) ColumnCount() int
- func (l *ColumnLayout) GetColumn(index int) *Column
- func (l *ColumnLayout) GetFragmentsInReadingOrder() []text.TextFragment
- func (l *ColumnLayout) GetText() string
- func (l *ColumnLayout) IsMultiColumn() bool
- func (l *ColumnLayout) IsSingleColumn() bool
type Gap
- func (g Gap) Center() float64
- func (g Gap) Height() float64
- func (g Gap) Width() float64
type HeaderFooterConfig
- func DefaultHeaderFooterConfig() HeaderFooterConfig
type HeaderFooterDetector
- func NewHeaderFooterDetector() *HeaderFooterDetector
- func NewHeaderFooterDetectorWithConfig(config HeaderFooterConfig) *HeaderFooterDetector
- func (d *HeaderFooterDetector) Detect(pages []PageFragments) *HeaderFooterResult
type HeaderFooterRegion
type HeaderFooterResult
- func (r *HeaderFooterResult) FilterFragments(pageIndex int, fragments []text.TextFragment, pageHeight float64) []text.TextFragment
- func (r *HeaderFooterResult) GetFooterTexts() []string
- func (r *HeaderFooterResult) GetHeaderTexts() []string
- func (r *HeaderFooterResult) HasFooters() bool
- func (r *HeaderFooterResult) HasHeaders() bool
- func (r *HeaderFooterResult) HasHeadersOrFooters() bool
- func (r *HeaderFooterResult) Summary() string
type Heading
- func (h *Heading) ContainsPoint(x, y float64) bool
- func (h *Heading) GetAnchorID() string
- func (h *Heading) GetCleanText() string
- func (h *Heading) IsTopLevel() bool
- func (h *Heading) ToMarkdown() string
- func (h *Heading) WordCount() int
type HeadingConfig
- func DefaultHeadingConfig() HeadingConfig
type HeadingDetector
- func NewHeadingDetector() *HeadingDetector
- func NewHeadingDetectorWithConfig(config HeadingConfig) *HeadingDetector
- func (d *HeadingDetector) DetectFromFragments(fragments []text.TextFragment, pageWidth, pageHeight float64) *HeadingLayout
- func (d *HeadingDetector) DetectFromLines(lines []Line, pageWidth, pageHeight float64) *HeadingLayout
- func (d *HeadingDetector) DetectFromParagraphs(paragraphs []Paragraph, pageWidth, pageHeight float64) *HeadingLayout
type HeadingLayout
- func (l *HeadingLayout) FindHeadingBefore(y float64) *Heading
- func (l *HeadingLayout) FindHeadingsInRegion(bbox model.BBox) []Heading
- func (l *HeadingLayout) GetH1() []Heading
- func (l *HeadingLayout) GetH2() []Heading
- func (l *HeadingLayout) GetH3() []Heading
- func (l *HeadingLayout) GetHeading(index int) *Heading
- func (l *HeadingLayout) GetHeadingsAtLevel(level HeadingLevel) []Heading
- func (l *HeadingLayout) GetHeadingsInRange(minLevel, maxLevel HeadingLevel) []Heading
- func (l *HeadingLayout) GetMarkdownTOC() string
- func (l *HeadingLayout) GetOutline() []OutlineEntry
- func (l *HeadingLayout) GetTableOfContents() string
- func (l *HeadingLayout) HeadingCount() int
type HeadingLevel
- func (l HeadingLevel) HTMLTag() string
- func (l HeadingLevel) String() string
type LayoutElement
- func (le *LayoutElement) ToModelElement() model.Element
type Line
- func ReorderLinesForReading(lines []Line, pageWidth, pageHeight float64) []Line
- func (line *Line) ContainsPoint(x, y float64) bool
- func (line *Line) HasLargerFont(size float64) bool
- func (line *Line) IsEmpty() bool
- func (line *Line) IsIndented(margin, tolerance float64) bool
- func (line *Line) WordCount() int
type LineAlignment
- func (a LineAlignment) String() string
type LineConfig
- func DefaultLineConfig() LineConfig
type LineDetector
- func NewLineDetector() *LineDetector
- func NewLineDetectorWithConfig(config LineConfig) *LineDetector
- func (d *LineDetector) Detect(fragments []text.TextFragment, pageWidth, pageHeight float64) *LineLayout
type LineLayout
- func (l *LineLayout) FindLinesInRegion(bbox model.BBox) []Line
- func (l *LineLayout) GetAllFragments() []text.TextFragment
- func (l *LineLayout) GetLine(index int) *Line
- func (l *LineLayout) GetLinesByAlignment(alignment LineAlignment) []Line
- func (l *LineLayout) GetText() string
- func (l *LineLayout) IsParagraphBreak(lineIndex int) bool
- func (l *LineLayout) LineCount() int
type List
- func (list *List) GetAllItems() []ListItem
- func (list *List) GetText() string
- func (list *List) HasNesting() bool
- func (list *List) MaxDepth() int
- func (list *List) ToMarkdown() string
type ListConfig
- func DefaultListConfig() ListConfig
type ListDetector
- func NewListDetector() *ListDetector
- func NewListDetectorWithConfig(config ListConfig) *ListDetector
- func (d *ListDetector) DetectFromFragments(fragments []text.TextFragment, pageWidth, pageHeight float64) *ListLayout
- func (d *ListDetector) DetectFromLines(lines []Line, pageWidth, pageHeight float64) *ListLayout
- func (d *ListDetector) DetectFromParagraphs(paragraphs []Paragraph, pageWidth, pageHeight float64) *ListLayout
type ListItem
- func (item *ListItem) ChildCount() int
- func (item *ListItem) ContainsPoint(x, y float64) bool
- func (item *ListItem) GetFullText() string
- func (item *ListItem) HasChildren() bool
- func (item *ListItem) IsCheckbox() bool
- func (item *ListItem) IsChecked() bool
- func (item *ListItem) IsFirstInList() bool
- func (item *ListItem) WordCount() int
type ListLayout
- func (l *ListLayout) FindListsInRegion(bbox model.BBox) []List
- func (l *ListLayout) GetBulletLists() []List
- func (l *ListLayout) GetList(index int) *List
- func (l *ListLayout) GetListsByType(listType ListType) []List
- func (l *ListLayout) GetNumberedLists() []List
- func (l *ListLayout) ListCount() int
- func (l *ListLayout) TotalItemCount() int
type ListType
- func (t ListType) String() string
type OutlineEntry
type PageFragments
type Paragraph
- func (p *Paragraph) ContainsPoint(x, y float64) bool
- func (p *Paragraph) GetFirstLine() *Line
- func (p *Paragraph) GetLastLine() *Line
- func (p *Paragraph) HasFirstLineIndent() bool
- func (p *Paragraph) IsBlockQuote() bool
- func (p *Paragraph) IsHeading() bool
- func (p *Paragraph) IsListItem() bool
- func (p *Paragraph) LineCount() int
- func (p *Paragraph) WordCount() int
type ParagraphConfig
- func DefaultParagraphConfig() ParagraphConfig
type ParagraphDetector
- func NewParagraphDetector() *ParagraphDetector
- func NewParagraphDetectorWithConfig(config ParagraphConfig) *ParagraphDetector
- func (d *ParagraphDetector) Detect(lines []Line, pageWidth, pageHeight float64) *ParagraphLayout
- func (d *ParagraphDetector) DetectFromFragments(fragments []text.TextFragment, pageWidth, pageHeight float64) *ParagraphLayout
type ParagraphLayout
- func (l *ParagraphLayout) FindParagraphsInRegion(bbox model.BBox) []Paragraph
- func (l *ParagraphLayout) GetHeadings() []Paragraph
- func (l *ParagraphLayout) GetListItems() []Paragraph
- func (l *ParagraphLayout) GetParagraph(index int) *Paragraph
- func (l *ParagraphLayout) GetParagraphsByStyle(style ParagraphStyle) []Paragraph
- func (l *ParagraphLayout) GetText() string
- func (l *ParagraphLayout) ParagraphCount() int
type ParagraphStyle
- func (s ParagraphStyle) String() string
type ReadingDirection
- func (d ReadingDirection) String() string
type ReadingOrderConfig
- func DefaultReadingOrderConfig() ReadingOrderConfig
type ReadingOrderDetector
- func NewReadingOrderDetector() *ReadingOrderDetector
- func NewReadingOrderDetectorWithConfig(config ReadingOrderConfig) *ReadingOrderDetector
- func (d *ReadingOrderDetector) Detect(fragments []text.TextFragment, pageWidth, pageHeight float64) *ReadingOrderResult
- func (d *ReadingOrderDetector) DetectFromLines(lines []Line, pageWidth, pageHeight float64) *ReadingOrderResult
type ReadingOrderResult
- func (r *ReadingOrderResult) GetParagraphs() *ParagraphLayout
- func (r *ReadingOrderResult) GetSectionCount() int
- func (r *ReadingOrderResult) GetText() string
- func (r *ReadingOrderResult) IsMultiColumn() bool
type ReadingSection
type RegionType
- func (r RegionType) String() string
type SectionType
- func (t SectionType) String() string

Functions ¶

func IsListItemText ¶

func IsListItemText(text string) bool

IsListItemText checks if text appears to be a list item
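The exact rules IsListItemText applies are not documented here. As a rough illustration of this kind of check, the sketch below tests a string for a leading bullet character or a numbering prefix. The function `looksLikeListItem` and both patterns are assumptions made for this example, not the package's implementation.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

var (
	// bulletPrefix matches a single bullet character followed by whitespace and text.
	bulletPrefix = regexp.MustCompile(`^[•○■▪▸→*-]\s+\S`)
	// numberPrefix matches prefixes such as "1.", "12)", "a)", "(iv)" followed by text.
	numberPrefix = regexp.MustCompile(`^\(?([0-9]{1,3}|[a-zA-Z]|[ivxIVX]+)[.)]\s+\S`)
)

// looksLikeListItem reports whether s starts with a bullet or a numbering
// prefix. Hypothetical helper for illustration only.
func looksLikeListItem(s string) bool {
	s = strings.TrimSpace(s)
	return bulletPrefix.MatchString(s) || numberPrefix.MatchString(s)
}

func main() {
	samples := []string{"• First item", "1. Introduction", "a) option", "3.14 is pi", "Hello world"}
	for _, s := range samples {
		fmt.Printf("%q -> %v\n", s, looksLikeListItem(s))
	}
}
```

Requiring whitespace after the marker keeps decimals such as "3.14" and negative numbers such as "-5" from being treated as list items.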

func ReorderForReading ¶

func ReorderForReading(fragments []text.TextFragment, pageWidth, pageHeight float64) []text.TextFragment

ReorderForReading takes fragments and returns them in proper reading order. This is a convenience function for simple use cases.

Types ¶

type AnalysisResult ¶

type AnalysisResult struct {
	// Elements are all detected elements in reading order
	Elements []LayoutElement

	// Columns is the column layout analysis
	Columns *ColumnLayout

	// ReadingOrder is the reading order analysis
	ReadingOrder *ReadingOrderResult

	// Headings are all detected headings
	Headings *HeadingLayout

	// Lists are all detected lists
	Lists *ListLayout

	// Paragraphs are all detected paragraphs
	Paragraphs *ParagraphLayout

	// Blocks are all detected text blocks
	Blocks *BlockLayout

	// Lines are all detected lines
	Lines *LineLayout

	// PageWidth and PageHeight
	PageWidth  float64
	PageHeight float64

	// Statistics
	Stats AnalysisStats
}

AnalysisResult holds the complete results from layout analysis, including detected elements, intermediate analysis structures (columns, lines, blocks, paragraphs, headings, lists), and statistics about the analysis.

func (*AnalysisResult) GetElements ¶

func (r *AnalysisResult) GetElements() []model.Element

GetElements converts all layout elements to model.Element interfaces, returning them in reading order.

func (*AnalysisResult) GetMarkdown ¶

func (r *AnalysisResult) GetMarkdown() string

GetMarkdown returns a Markdown representation of the document with headings, lists, and paragraphs formatted appropriately.

func (*AnalysisResult) GetText ¶

func (r *AnalysisResult) GetText() string

GetText returns all extracted text concatenated in reading order. It prefers reading order text if available, falling back to paragraph text.

type AnalysisStats ¶

type AnalysisStats struct {
	FragmentCount  int
	LineCount      int
	BlockCount     int
	ParagraphCount int
	HeadingCount   int
	ListCount      int
	ColumnCount    int
	ElementCount   int
}

AnalysisStats contains counts of detected elements from the layout analysis.

type Analyzer ¶

type Analyzer struct {
	// contains filtered or unexported fields
}

Analyzer orchestrates all layout detection components to extract semantic structure from PDF pages. It combines column, line, block, paragraph, heading, list, and reading order detection into a unified analysis pipeline.

func NewAnalyzer ¶

func NewAnalyzer() *Analyzer

NewAnalyzer creates a new layout analyzer with default configuration.

func NewAnalyzerWithConfig ¶

func NewAnalyzerWithConfig(config AnalyzerConfig) *Analyzer

NewAnalyzerWithConfig creates a new layout analyzer with the specified configuration.

func (*Analyzer) Analyze ¶

func (a *Analyzer) Analyze(fragments []text.TextFragment, pageWidth, pageHeight float64) *AnalysisResult

Analyze performs complete layout analysis on the given text fragments. It runs through all detection phases: column detection, reading order analysis, line detection, block detection, paragraph detection, heading detection (if enabled), list detection (if enabled), and finally builds a unified element tree.

func (*Analyzer) AnalyzeWithHeaderFooterFiltering ¶

func (a *Analyzer) AnalyzeWithHeaderFooterFiltering(
	pageFragments []PageFragments,
	pageIndex int,
) *AnalysisResult

AnalyzeWithHeaderFooterFiltering performs layout analysis with automatic header and footer detection and removal. This requires multiple pages to identify repeated content at the top and bottom of pages.

func (*Analyzer) QuickAnalyze ¶

func (a *Analyzer) QuickAnalyze(fragments []text.TextFragment, pageWidth, pageHeight float64) *AnalysisResult

QuickAnalyze performs a fast analysis focusing on text structure without detailed heading or list detection. It only runs reading order and paragraph detection for better performance when fine-grained structure is not needed.

type AnalyzerConfig ¶

type AnalyzerConfig struct {
	// Column detection configuration
	ColumnConfig ColumnConfig

	// Line detection configuration
	LineConfig LineConfig

	// Paragraph detection configuration
	ParagraphConfig ParagraphConfig

	// Block detection configuration
	BlockConfig BlockConfig

	// Heading detection configuration
	HeadingConfig HeadingConfig

	// List detection configuration
	ListConfig ListConfig

	// Reading order configuration
	ReadingOrderConfig ReadingOrderConfig

	// DetectHeadings enables heading detection
	DetectHeadings bool

	// DetectLists enables list detection
	DetectLists bool

	// UseReadingOrder uses reading order for element ordering
	UseReadingOrder bool
}

AnalyzerConfig holds configuration options for the layout analyzer. Each detection component has its own sub-configuration, and there are flags to enable or disable optional analysis features.

func DefaultAnalyzerConfig ¶

func DefaultAnalyzerConfig() AnalyzerConfig

DefaultAnalyzerConfig returns a configuration with sensible defaults for typical document layout analysis, with all detection features enabled.

type Block ¶

type Block struct {
	// BBox is the bounding box of the block
	BBox model.BBox

	// Fragments are the text fragments contained in this block (in reading order)
	Fragments []text.TextFragment

	// Lines are the fragments grouped into horizontal lines
	Lines [][]text.TextFragment

	// Index is the block's position in reading order (0-based)
	Index int

	// Level indicates nesting depth (0 = top level)
	Level int
}

Block represents a contiguous rectangular region of text on a page. Blocks are spatially coherent groups of fragments separated by whitespace.

func (*Block) AverageFontSize ¶

func (b *Block) AverageFontSize() float64

AverageFontSize returns the average font size of fragments in this block

func (*Block) ContainsPoint ¶

func (b *Block) ContainsPoint(x, y float64) bool

ContainsPoint returns true if the given point is within this block's bounding box

func (*Block) FragmentCount ¶

func (b *Block) FragmentCount() int

FragmentCount returns the number of fragments in this block

func (*Block) GetText ¶

func (b *Block) GetText() string

GetText returns the text content of this block

func (*Block) LineCount ¶

func (b *Block) LineCount() int

LineCount returns the number of lines in this block

type BlockConfig ¶

type BlockConfig struct {
	// LineHeightTolerance is the Y-distance tolerance for grouping fragments into lines
	// as a fraction of fragment height (default: 0.5)
	LineHeightTolerance float64

	// HorizontalGapThreshold is the minimum horizontal gap to consider fragments separate,
	// as a multiple of average font size (default: 3.0)
	HorizontalGapThreshold float64

	// VerticalGapThreshold is the minimum vertical gap to start a new block,
	// as a multiple of average line height (default: 1.5)
	VerticalGapThreshold float64

	// MinBlockWidth is the minimum width for a valid block (default: 10 points)
	MinBlockWidth float64

	// MinBlockHeight is the minimum height for a valid block (default: 5 points)
	MinBlockHeight float64

	// MergeOverlappingBlocks controls whether overlapping blocks should be merged
	MergeOverlappingBlocks bool
}

BlockConfig holds configuration for block detection
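To show how a vertical gap threshold can separate blocks, here is a minimal sketch that splits a top-to-bottom list of lines wherever the gap reaches a multiple of the average line height. The `line` type and `splitBlocks` function are hypothetical, and the sketch assumes Y values measured downward from the top of the page, which may differ from the coordinate system this package uses.

```go
package main

import "fmt"

// line is a hypothetical text line; Top and Bottom are distances from the
// top of the page, so Top < Bottom.
type line struct {
	Top, Bottom float64
}

// splitBlocks groups lines (already sorted top to bottom) into blocks,
// starting a new block when the gap to the previous line is at least
// gapThreshold times the average line height.
func splitBlocks(lines []line, gapThreshold float64) [][]line {
	if len(lines) == 0 {
		return nil
	}
	var total float64
	for _, l := range lines {
		total += l.Bottom - l.Top
	}
	avgHeight := total / float64(len(lines))

	blocks := [][]line{{lines[0]}}
	for i := 1; i < len(lines); i++ {
		gap := lines[i].Top - lines[i-1].Bottom
		if gap >= gapThreshold*avgHeight {
			blocks = append(blocks, nil) // start a new block
		}
		last := len(blocks) - 1
		blocks[last] = append(blocks[last], lines[i])
	}
	return blocks
}

func main() {
	lines := []line{{0, 10}, {12, 22}, {40, 50}, {52, 62}}
	blocks := splitBlocks(lines, 1.5)
	fmt.Println(len(blocks)) // the 18-point gap exceeds 1.5 x 10
}
```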

func DefaultBlockConfig ¶

func DefaultBlockConfig() BlockConfig

DefaultBlockConfig returns sensible default configuration

type BlockDetector ¶

type BlockDetector struct {
	// contains filtered or unexported fields
}

BlockDetector detects text blocks on a page

func NewBlockDetector ¶

func NewBlockDetector() *BlockDetector

NewBlockDetector creates a new block detector with default configuration

func NewBlockDetectorWithConfig ¶

func NewBlockDetectorWithConfig(config BlockConfig) *BlockDetector

NewBlockDetectorWithConfig creates a block detector with custom configuration

func (*BlockDetector) Detect ¶

func (d *BlockDetector) Detect(fragments []text.TextFragment, pageWidth, pageHeight float64) *BlockLayout

Detect analyzes text fragments and detects block layout

type BlockLayout ¶

type BlockLayout struct {
	// Blocks are the detected text blocks (in reading order)
	Blocks []Block

	// PageWidth is the width of the page
	PageWidth float64

	// PageHeight is the height of the page
	PageHeight float64

	// Config is the configuration used for detection
	Config BlockConfig
}

BlockLayout represents the detected block structure of a page

func (*BlockLayout) BlockCount ¶

func (l *BlockLayout) BlockCount() int

BlockCount returns the number of detected blocks

func (*BlockLayout) GetAllFragments ¶

func (l *BlockLayout) GetAllFragments() []text.TextFragment

GetAllFragments returns all fragments in reading order

func (*BlockLayout) GetBlock ¶

func (l *BlockLayout) GetBlock(index int) *Block

GetBlock returns a specific block by index

func (*BlockLayout) GetText ¶

func (l *BlockLayout) GetText() string

GetText returns all text in reading order (block by block)

type BulletStyle ¶

type BulletStyle int

BulletStyle represents the specific bullet character used

const (
	BulletStyleUnknown     BulletStyle = iota
	BulletStyleDisc                    // • (filled circle)
	BulletStyleCircle                  // ○ (empty circle)
	BulletStyleSquare                  // ■ (filled square)
	BulletStyleDash                    // - (dash)
	BulletStyleAsterisk                // * (asterisk)
	BulletStyleArrow                   // → or ▶ (arrow)
	BulletStyleTriangle                // ▪ or ▸ (triangle)
	BulletStyleCheckEmpty              // ☐ (empty checkbox)
	BulletStyleCheckFilled             // ☑ or ✓ (checked)
)

func (BulletStyle) String ¶

func (s BulletStyle) String() string

String returns a string representation of the bullet style

type Column ¶

type Column struct {
	// Bounding box of the column
	BBox model.BBox

	// Fragments contained in this column (sorted top to bottom)
	Fragments []text.TextFragment

	// Index of the column (0-based, left to right)
	Index int
}

Column represents a detected text column on a page

type ColumnConfig ¶

type ColumnConfig struct {
	// MinColumnWidth is the minimum width for a region to be considered a column
	// Default: 50 points
	MinColumnWidth float64

	// MinGapWidth is the minimum whitespace gap to consider as column separator
	// Default: 20 points
	MinGapWidth float64

	// MinGapHeightRatio is the minimum vertical extent of a gap to be significant,
	// as a ratio of page height (0.0 to 1.0)
	// Default: 0.5 (50% of page height)
	MinGapHeightRatio float64

	// MaxColumns is the maximum number of columns to detect
	// Default: 6
	MaxColumns int

	// MergeThreshold is the maximum X distance between fragments to consider them same column
	// Default: 10 points
	MergeThreshold float64

	// SpanningThreshold is the minimum line width ratio (vs content width) for content
	// to be considered spanning. Lines with content in gaps but less than this width
	// are treated as column content (avoids false positives from stray fragments).
	// Default: 0.35 (line must span 35% of content width - allows centered titles)
	SpanningThreshold float64
}

ColumnConfig holds configuration for column detection
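As an illustration of how a minimum gap width can separate columns, the sketch below projects fragment extents onto the X axis and reports uncovered stretches at least `minGapWidth` wide. The `span` type and `findGaps` function are hypothetical, and the sketch ignores the vertical-extent, merge, and spanning checks that the other ColumnConfig fields describe.

```go
package main

import (
	"fmt"
	"sort"
)

// span is a hypothetical horizontal extent of a fragment or a gap.
type span struct {
	Left, Right float64
}

// findGaps returns the horizontal stretches not covered by any span that
// are at least minGapWidth wide.
func findGaps(spans []span, minGapWidth float64) []span {
	if len(spans) == 0 {
		return nil
	}
	sorted := append([]span(nil), spans...) // copy so the caller's slice is untouched
	sort.Slice(sorted, func(i, j int) bool { return sorted[i].Left < sorted[j].Left })

	var gaps []span
	covered := sorted[0].Right // rightmost X covered so far
	for _, s := range sorted[1:] {
		if s.Left-covered >= minGapWidth {
			gaps = append(gaps, span{Left: covered, Right: s.Left})
		}
		if s.Right > covered {
			covered = s.Right
		}
	}
	return gaps
}

func main() {
	spans := []span{{50, 280}, {320, 550}, {60, 270}, {330, 540}}
	fmt.Println(findGaps(spans, 20))
}
```

Each reported gap is a candidate column separator; a fuller detector would also require the gap to persist over enough of the page height.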

func DefaultColumnConfig ¶

func DefaultColumnConfig() ColumnConfig

DefaultColumnConfig returns sensible default configuration

type ColumnDetector ¶

type ColumnDetector struct {
	// contains filtered or unexported fields
}

ColumnDetector detects multi-column layouts in text

func NewColumnDetector ¶

func NewColumnDetector() *ColumnDetector

NewColumnDetector creates a new column detector with default configuration

func NewColumnDetectorWithConfig ¶

func NewColumnDetectorWithConfig(config ColumnConfig) *ColumnDetector

NewColumnDetectorWithConfig creates a column detector with custom configuration

func (*ColumnDetector) Detect ¶

func (d *ColumnDetector) Detect(fragments []text.TextFragment, pageWidth, pageHeight float64) *ColumnLayout

Detect analyzes text fragments and detects column layout

type ColumnLayout ¶

type ColumnLayout struct {
	// Detected columns (sorted left to right)
	Columns []Column

	// SpanningFragments are fragments that span across column gaps
	// (e.g., centered titles, full-width headers)
	// These are excluded from column content and stored separately
	SpanningFragments []text.TextFragment

	// Page dimensions
	PageWidth  float64
	PageHeight float64

	// Configuration used for detection
	Config ColumnConfig
}

ColumnLayout represents the detected column structure of a page

func (*ColumnLayout) ColumnCount ¶

func (l *ColumnLayout) ColumnCount() int

ColumnCount returns the number of detected columns

func (*ColumnLayout) GetColumn ¶

func (l *ColumnLayout) GetColumn(index int) *Column

GetColumn returns a specific column by index

func (*ColumnLayout) GetFragmentsInReadingOrder ¶

func (l *ColumnLayout) GetFragmentsInReadingOrder() []text.TextFragment

GetFragmentsInReadingOrder returns all fragments ordered for reading (left column first, then right column, each column top-to-bottom)
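The ordering described above can be sketched as follows: columns are visited left to right by index, and fragments inside each column are sorted top to bottom. The `frag` and `column` types and the `readingOrder` function are hypothetical stand-ins, and `Top` is assumed to be a distance from the top of the page.

```go
package main

import (
	"fmt"
	"sort"
)

// frag is a hypothetical fragment; Top is the distance from the top of the page.
type frag struct {
	Text string
	Top  float64
}

// column is a hypothetical column; Index counts from the left, starting at 0.
type column struct {
	Index     int
	Fragments []frag
}

// readingOrder returns fragment texts column by column (left to right),
// each column top to bottom. Inputs are copied, not modified.
func readingOrder(cols []column) []string {
	ordered := append([]column(nil), cols...)
	sort.SliceStable(ordered, func(i, j int) bool { return ordered[i].Index < ordered[j].Index })

	var out []string
	for _, c := range ordered {
		frags := append([]frag(nil), c.Fragments...)
		sort.SliceStable(frags, func(i, j int) bool { return frags[i].Top < frags[j].Top })
		for _, f := range frags {
			out = append(out, f.Text)
		}
	}
	return out
}

func main() {
	cols := []column{
		{Index: 1, Fragments: []frag{{"D", 200}, {"C", 100}}},
		{Index: 0, Fragments: []frag{{"B", 150}, {"A", 50}}},
	}
	fmt.Println(readingOrder(cols))
}
```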

func (*ColumnLayout) GetText ¶

func (l *ColumnLayout) GetText() string

GetText returns the text content in reading order (column by column, top to bottom)

func (*ColumnLayout) IsMultiColumn ¶

func (l *ColumnLayout) IsMultiColumn() bool

IsMultiColumn returns true if multiple columns were detected

func (*ColumnLayout) IsSingleColumn ¶

func (l *ColumnLayout) IsSingleColumn() bool

IsSingleColumn returns true if only one column was detected

type Gap ¶

type Gap struct {
	Left   float64 // Left edge of gap
	Right  float64 // Right edge of gap
	Top    float64 // Top of gap region
	Bottom float64 // Bottom of gap region
}

Gap represents a vertical whitespace gap

func (Gap) Center ¶

func (g Gap) Center() float64

Center returns the X center of the gap

func (Gap) Height ¶

func (g Gap) Height() float64

Height returns the height of the gap

func (Gap) Width ¶

func (g Gap) Width() float64

Width returns the width of the gap
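The three Gap accessors are simple geometry. A minimal sketch, using a hypothetical lowercase `gap` type with the same four fields, might look like this. Taking the absolute value for the height is an assumption made because the direction of the Y axis is not stated here. The `significant` helper is also hypothetical and shows how a ratio such as ColumnConfig.MinGapHeightRatio could be applied.

```go
package main

import (
	"fmt"
	"math"
)

// gap is a hypothetical stand-in with the same fields as Gap.
type gap struct {
	Left, Right, Top, Bottom float64
}

// width is the horizontal extent of the gap.
func (g gap) width() float64 { return g.Right - g.Left }

// height is the vertical extent; Abs makes it independent of Y direction.
func (g gap) height() float64 { return math.Abs(g.Top - g.Bottom) }

// center is the X coordinate midway between the two edges.
func (g gap) center() float64 { return (g.Left + g.Right) / 2 }

// significant reports whether the gap covers at least minRatio of the page height.
func significant(g gap, pageHeight, minRatio float64) bool {
	return pageHeight > 0 && g.height()/pageHeight >= minRatio
}

func main() {
	g := gap{Left: 280, Right: 320, Top: 700, Bottom: 100}
	fmt.Println(g.width(), g.height(), g.center())
	fmt.Println(significant(g, 792, 0.5))
}
```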

type HeaderFooterConfig ¶

type HeaderFooterConfig struct {
	// HeaderRegionHeight is the height from top of page to consider as header zone
	// Default: 72 points (1 inch)
	HeaderRegionHeight float64

	// FooterRegionHeight is the height from bottom of page to consider as footer zone
	// Default: 72 points (1 inch)
	FooterRegionHeight float64

	// MinOccurrenceRatio is the minimum fraction of pages a text must appear on
	// to be considered a header/footer (0.0 to 1.0)
	// Default: 0.5 (50% of pages)
	MinOccurrenceRatio float64

	// PositionTolerance is the maximum Y difference for text to be considered same position
	// Default: 5 points
	PositionTolerance float64

	// XPositionTolerance is the maximum X difference for text to be considered same position
	// Default: 10 points
	XPositionTolerance float64

	// MinPages is the minimum number of pages required for header/footer detection
	// Default: 2
	MinPages int
}

HeaderFooterConfig holds configuration for header/footer detection
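To make the occurrence-ratio idea concrete, here is a minimal sketch that collects text from the header zone of each page and keeps strings that appear on a sufficient fraction of pages. The `pageLine` type and `repeatedHeaders` function are hypothetical; the parameters loosely mirror HeaderRegionHeight, MinOccurrenceRatio, and MinPages. The sketch assumes `Top` is measured from the top of the page and does not handle position tolerances or changing page numbers.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// pageLine is a hypothetical line of text; Top is the distance from the top of the page.
type pageLine struct {
	Text string
	Top  float64
}

// repeatedHeaders returns texts found in the top regionHeight points on at
// least minRatio of the pages. It returns nil when fewer than minPages are given.
func repeatedHeaders(pages [][]pageLine, regionHeight, minRatio float64, minPages int) []string {
	if len(pages) < minPages {
		return nil
	}
	counts := map[string]int{}
	for _, page := range pages {
		seen := map[string]bool{} // count each text once per page
		for _, l := range page {
			key := strings.TrimSpace(l.Text)
			if l.Top > regionHeight || key == "" || seen[key] {
				continue
			}
			seen[key] = true
			counts[key]++
		}
	}
	var out []string
	for text, n := range counts {
		if float64(n)/float64(len(pages)) >= minRatio {
			out = append(out, text)
		}
	}
	sort.Strings(out) // deterministic order
	return out
}

func main() {
	pages := [][]pageLine{
		{{"Annual Report", 30}, {"Body text one", 300}},
		{{"Annual Report", 30}, {"Intro", 40}},
		{{"Annual Report", 31}, {"Body text three", 400}},
	}
	fmt.Println(repeatedHeaders(pages, 72, 0.5, 2))
}
```

Footers can be handled the same way by measuring from the bottom of the page instead.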

func DefaultHeaderFooterConfig ¶

func DefaultHeaderFooterConfig() HeaderFooterConfig

DefaultHeaderFooterConfig returns sensible default configuration

type HeaderFooterDetector ¶

type HeaderFooterDetector struct {
	// contains filtered or unexported fields
}

HeaderFooterDetector detects headers and footers across pages

func NewHeaderFooterDetector ¶

func NewHeaderFooterDetector() *HeaderFooterDetector

NewHeaderFooterDetector creates a new detector with default configuration

func NewHeaderFooterDetectorWithConfig ¶

func NewHeaderFooterDetectorWithConfig(config HeaderFooterConfig) *HeaderFooterDetector

NewHeaderFooterDetectorWithConfig creates a detector with custom configuration

func (*HeaderFooterDetector) Detect ¶

func (d *HeaderFooterDetector) Detect(pages []PageFragments) *HeaderFooterResult

Detect analyzes fragments from multiple pages to find headers and footers

type HeaderFooterRegion ¶

type HeaderFooterRegion struct {
	// Type indicates if this is a header or footer
	Type RegionType

	// BBox is the bounding box of the region
	BBox model.BBox

	// Text is the typical text content (may include page number placeholder)
	Text string

	// IsPageNumber indicates if this region contains page numbers
	IsPageNumber bool

	// Confidence is the detection confidence (0.0 to 1.0)
	Confidence float64

	// PageIndices lists which pages have this header/footer
	PageIndices []int
}

HeaderFooterRegion represents a detected header or footer region

type HeaderFooterResult ¶

type HeaderFooterResult struct {
	// Headers contains detected header regions
	Headers []HeaderFooterRegion

	// Footers contains detected footer regions
	Footers []HeaderFooterRegion

	// Config used for detection
	Config HeaderFooterConfig
}

HeaderFooterResult contains the detection results

func (*HeaderFooterResult) FilterFragments ¶

func (r *HeaderFooterResult) FilterFragments(pageIndex int, fragments []text.TextFragment, pageHeight float64) []text.TextFragment

FilterFragments removes header/footer fragments from a page

func (*HeaderFooterResult) GetFooterTexts ¶

func (r *HeaderFooterResult) GetFooterTexts() []string

GetFooterTexts returns all detected footer texts

func (*HeaderFooterResult) GetHeaderTexts ¶

func (r *HeaderFooterResult) GetHeaderTexts() []string

GetHeaderTexts returns all detected header texts

func (*HeaderFooterResult) HasFooters ¶

func (r *HeaderFooterResult) HasFooters() bool

HasFooters returns true if any footers were detected

func (*HeaderFooterResult) HasHeaders ¶

func (r *HeaderFooterResult) HasHeaders() bool

HasHeaders returns true if any headers were detected

func (*HeaderFooterResult) HasHeadersOrFooters ¶

func (r *HeaderFooterResult) HasHeadersOrFooters() bool

HasHeadersOrFooters returns true if any headers or footers were detected

func (*HeaderFooterResult) Summary ¶

func (r *HeaderFooterResult) Summary() string

Summary returns a human-readable summary of detection results

type Heading ¶

type Heading struct {
	// Level is the heading level (H1-H6)
	Level HeadingLevel

	// Text is the heading text content
	Text string

	// BBox is the bounding box of the heading
	BBox model.BBox

	// Lines are the lines that make up this heading
	Lines []Line

	// Fragments are the text fragments in this heading
	Fragments []text.TextFragment

	// Index is the heading's position in document order (0-based)
	Index int

	// PageIndex is the page number where this heading appears (0-based)
	PageIndex int

	// FontSize is the average font size of the heading
	FontSize float64

	// IsBold indicates if the heading appears to be bold
	IsBold bool

	// IsItalic indicates if the heading appears to be italic
	IsItalic bool

	// IsAllCaps indicates if the heading is in all capital letters
	IsAllCaps bool

	// IsNumbered indicates if the heading has a number prefix (e.g., "1.2.3")
	IsNumbered bool

	// NumberPrefix is the number prefix if present (e.g., "1.2.3")
	NumberPrefix string

	// Confidence is a score from 0-1 indicating detection confidence
	Confidence float64

	// Alignment is the text alignment of the heading
	Alignment LineAlignment

	// SpacingBefore is the vertical space before this heading
	SpacingBefore float64

	// SpacingAfter is the vertical space after this heading
	SpacingAfter float64
}

Heading represents a detected heading in a document

func (*Heading) ContainsPoint ¶

func (h *Heading) ContainsPoint(x, y float64) bool

ContainsPoint returns true if the point is within the heading's bounding box

func (*Heading) GetAnchorID ¶

func (h *Heading) GetAnchorID() string

GetAnchorID returns a URL-safe anchor ID for this heading

func (*Heading) GetCleanText ¶

func (h *Heading) GetCleanText() string

GetCleanText returns the heading text without number prefix

func (*Heading) IsTopLevel ¶

func (h *Heading) IsTopLevel() bool

IsTopLevel returns true if this is an H1 heading

func (*Heading) ToMarkdown ¶

func (h *Heading) ToMarkdown() string

ToMarkdown returns the heading as a markdown heading
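
As a rough illustration of what GetAnchorID and ToMarkdown describe, the sketch below renders a heading level as a run of "#" characters and derives a lowercase, hyphen-separated anchor. The helper names toMarkdown and anchorID are hypothetical, and the exact rules the package applies (for example, how it treats non-ASCII letters) are not stated in this documentation, so treat this as an assumption rather than the package's implementation.

```go
package main

import "strings"

// toMarkdown renders a heading as "#"-prefixed markdown.
// Levels outside 1..6 are clamped (an assumption made for this sketch).
func toMarkdown(level int, text string) string {
	if level < 1 {
		level = 1
	}
	if level > 6 {
		level = 6
	}
	return strings.Repeat("#", level) + " " + text
}

// anchorID lowercases text, keeps ASCII letters and digits, and collapses
// every other run of characters into a single hyphen between kept runs.
func anchorID(text string) string {
	var b strings.Builder
	pendingHyphen := false
	for _, r := range strings.ToLower(text) {
		isAlnum := (r >= 'a' && r <= 'z') || (r >= '0' && r <= '9')
		if !isAlnum {
			pendingHyphen = true
			continue
		}
		// Emit a hyphen only between two kept runs, never at the start.
		if pendingHyphen && b.Len() > 0 {
			b.WriteByte('-')
		}
		pendingHyphen = false
		b.WriteRune(r)
	}
	return b.String()
}
```

With these helpers, a level-2 heading "Overview" becomes "## Overview", and "1.2 Getting Started!" becomes the anchor "1-2-getting-started".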

func (*Heading) WordCount ¶

func (h *Heading) WordCount() int

WordCount returns the word count of the heading text

type HeadingConfig ¶

type HeadingConfig struct {
	// FontSizeRatios maps heading levels to minimum font size ratios relative to body text
	// Default: H1=1.8, H2=1.5, H3=1.3, H4=1.15, H5=1.1, H6=1.05
	FontSizeRatios map[HeadingLevel]float64

	// MaxHeadingLines is the maximum number of lines for a heading
	// Default: 3
	MaxHeadingLines int

	// MinSpacingRatio is the minimum spacing ratio (vs avg) to consider a heading
	// Default: 1.2 (20% more spacing before heading)
	MinSpacingRatio float64

	// BoldIndicatesHeading when true, bold text is more likely a heading
	// Default: true
	BoldIndicatesHeading bool

	// AllCapsIndicatesHeading when true, ALL CAPS text is more likely a heading
	// Default: true
	AllCapsIndicatesHeading bool

	// CenterAlignedBoost is the confidence boost for centered headings
	// Default: 0.1
	CenterAlignedBoost float64

	// NumberedPatterns are regex patterns for numbered headings
	// Default: "1.", "1.1", "1.1.1", "Chapter 1", etc.
	NumberedPatterns []*regexp.Regexp

	// MinConfidence is the minimum confidence to consider something a heading
	// Default: 0.5
	MinConfidence float64
}

HeadingConfig holds configuration for heading detection
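
The FontSizeRatios defaults can be read as a ladder of minimum ratios against the body font size. The following is a minimal sketch of that reading, using stand-in types rather than the package's own; levelForFontSize is a hypothetical helper, and the real detector presumably combines font size with the other signals in HeadingConfig (spacing, bold, all caps, numbering, confidence).

```go
package main

// HeadingLevel is a local stand-in for the package's type; 0 means "not a heading".
type HeadingLevel int

// defaultRatios copies the documented default FontSizeRatios for H1..H6.
var defaultRatios = map[HeadingLevel]float64{
	1: 1.8, 2: 1.5, 3: 1.3, 4: 1.15, 5: 1.1, 6: 1.05,
}

// levelForFontSize returns the first level, checking H1 through H6 in order,
// whose minimum ratio is met by fontSize/bodySize. It returns 0 if none is met.
func levelForFontSize(fontSize, bodySize float64, ratios map[HeadingLevel]float64) HeadingLevel {
	if bodySize <= 0 {
		return 0
	}
	ratio := fontSize / bodySize
	for level := HeadingLevel(1); level <= 6; level++ {
		if minRatio, ok := ratios[level]; ok && ratio >= minRatio {
			return level
		}
	}
	return 0
}
```

For 12-point body text, a 24-point line has ratio 2.0 and maps to H1, a 16-point line has ratio 1.33 and maps to H3, and a 12-point line maps to no heading level.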

func DefaultHeadingConfig ¶

func DefaultHeadingConfig() HeadingConfig

DefaultHeadingConfig returns sensible default configuration

type HeadingDetector ¶

type HeadingDetector struct {
	// contains filtered or unexported fields
}

HeadingDetector detects and classifies headings in document content

func NewHeadingDetector ¶

func NewHeadingDetector() *HeadingDetector

NewHeadingDetector creates a new heading detector with default configuration

func NewHeadingDetectorWithConfig ¶

func NewHeadingDetectorWithConfig(config HeadingConfig) *HeadingDetector

NewHeadingDetectorWithConfig creates a heading detector with custom configuration

func (*HeadingDetector) DetectFromFragments ¶

func (d *HeadingDetector) DetectFromFragments(fragments []text.TextFragment, pageWidth, pageHeight float64) *HeadingLayout

DetectFromFragments analyzes fragments directly and detects headings

func (*HeadingDetector) DetectFromLines ¶

func (d *HeadingDetector) DetectFromLines(lines []Line, pageWidth, pageHeight float64) *HeadingLayout

DetectFromLines analyzes lines directly and detects headings

func (*HeadingDetector) DetectFromParagraphs ¶

func (d *HeadingDetector) DetectFromParagraphs(paragraphs []Paragraph, pageWidth, pageHeight float64) *HeadingLayout

DetectFromParagraphs analyzes paragraphs and detects headings

type HeadingLayout ¶

type HeadingLayout struct {
	// Headings are all detected headings in document order
	Headings []Heading

	// PageWidth and PageHeight of the analyzed page/document
	PageWidth  float64
	PageHeight float64

	// BodyFontSize is the detected average body text font size
	BodyFontSize float64

	// Config is the configuration used for detection
	Config HeadingConfig
}

HeadingLayout represents all detected headings in a document or page

func (*HeadingLayout) FindHeadingBefore ¶

func (l *HeadingLayout) FindHeadingBefore(y float64) *Heading

FindHeadingBefore returns the most recent heading that appears before the given Y position in reading order. In standard PDF coordinates (Y increases upward), this returns the last heading whose Y coordinate is greater than the given Y. For example, if querying Y=450 with headings at Y=700, 500, 300, it returns the heading at Y=500 (the closest heading above Y=450).
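
The worked example above can be sketched as follows. The heading type and findHeadingBefore function are stand-ins written for illustration, assuming a single page in standard PDF coordinates; the sketch picks the lowest heading that is still above the query position, which matches "the last heading whose Y is greater" when headings are ordered top to bottom.

```go
package main

// heading is a stand-in that keeps only the fields this sketch needs.
type heading struct {
	Text string
	Y    float64 // vertical position; larger values are higher on the page
}

// findHeadingBefore returns the heading closest above y, or nil if no
// heading lies above y.
func findHeadingBefore(headings []heading, y float64) *heading {
	var best *heading
	for i := range headings {
		h := &headings[i]
		if h.Y > y && (best == nil || h.Y < best.Y) {
			best = h
		}
	}
	return best
}
```

Given headings at Y=700, 500, and 300, a query at Y=450 yields the heading at Y=500, and a query at Y=800 yields nil because nothing precedes that position.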

func (*HeadingLayout) FindHeadingsInRegion ¶

func (l *HeadingLayout) FindHeadingsInRegion(bbox model.BBox) []Heading

FindHeadingsInRegion returns headings within a bounding box

func (*HeadingLayout) GetH1 ¶

func (l *HeadingLayout) GetH1() []Heading

GetH1 returns all H1 (top-level) headings

func (*HeadingLayout) GetH2 ¶

func (l *HeadingLayout) GetH2() []Heading

GetH2 returns all H2 headings

func (*HeadingLayout) GetH3 ¶

func (l *HeadingLayout) GetH3() []Heading

GetH3 returns all H3 headings

func (*HeadingLayout) GetHeading ¶

func (l *HeadingLayout) GetHeading(index int) *Heading

GetHeading returns a specific heading by index

func (*HeadingLayout) GetHeadingsAtLevel ¶

func (l *HeadingLayout) GetHeadingsAtLevel(level HeadingLevel) []Heading

GetHeadingsAtLevel returns all headings at a specific level

func (*HeadingLayout) GetHeadingsInRange ¶

func (l *HeadingLayout) GetHeadingsInRange(minLevel, maxLevel HeadingLevel) []Heading

GetHeadingsInRange returns headings within a specific level range (inclusive)

func (*HeadingLayout) GetMarkdownTOC ¶

func (l *HeadingLayout) GetMarkdownTOC() string

GetMarkdownTOC returns a markdown-formatted table of contents

func (*HeadingLayout) GetOutline ¶

func (l *HeadingLayout) GetOutline() []OutlineEntry

GetOutline returns a hierarchical outline of the document

func (*HeadingLayout) GetTableOfContents ¶

func (l *HeadingLayout) GetTableOfContents() string

GetTableOfContents returns a formatted table of contents string

func (*HeadingLayout) HeadingCount ¶

func (l *HeadingLayout) HeadingCount() int

HeadingCount returns the number of detected headings

type HeadingLevel ¶

type HeadingLevel int

HeadingLevel represents the hierarchical level of a heading (H1-H6)

const (
	HeadingLevelUnknown HeadingLevel = iota
	HeadingLevel1                    // H1 - Main title/chapter
	HeadingLevel2                    // H2 - Major section
	HeadingLevel3                    // H3 - Subsection
	HeadingLevel4                    // H4 - Sub-subsection
	HeadingLevel5                    // H5 - Minor heading
	HeadingLevel6                    // H6 - Lowest level heading
)

func (HeadingLevel) HTMLTag ¶

func (l HeadingLevel) HTMLTag() string

HTMLTag returns the HTML tag for this heading level

func (HeadingLevel) String ¶

func (l HeadingLevel) String() string

String returns a string representation of the heading level

type LayoutElement ¶

type LayoutElement struct {
	// Type is the element type (paragraph, heading, list, etc.)
	Type model.ElementType

	// BBox is the bounding box of the element
	BBox model.BBox

	// Text is the text content of the element
	Text string

	// Index is the element's position in reading order
	Index int

	// ZOrder is the visual stacking order (for overlapping elements)
	ZOrder int

	// Heading contains heading-specific data (if Type == ElementTypeHeading)
	Heading *Heading

	// List contains list-specific data (if Type == ElementTypeList)
	List *List

	// Paragraph contains paragraph-specific data (if Type == ElementTypeParagraph)
	Paragraph *Paragraph

	// Lines are the lines that make up this element
	Lines []Line

	// Children contains nested elements (for compound structures)
	Children []LayoutElement
}

LayoutElement represents a detected layout element such as a paragraph, heading, or list. It includes the element type, bounding box, text content, and type-specific metadata.

func (*LayoutElement) ToModelElement ¶

func (le *LayoutElement) ToModelElement() model.Element

ToModelElement converts the layout element to the appropriate model.Element implementation (Heading, List, or Paragraph) based on the element type.

type Line ¶

type Line struct {
	// BBox is the bounding box of the line
	BBox model.BBox

	// Fragments are the text fragments that make up this line (sorted left to right)
	Fragments []text.TextFragment

	// Text is the assembled text content of the line
	Text string

	// Index is the line's position on the page (0-based, top to bottom)
	Index int

	// Baseline is the Y coordinate of the text baseline
	Baseline float64

	// Height is the line height (max fragment height)
	Height float64

	// SpacingBefore is the vertical space from the previous line (0 for first line)
	SpacingBefore float64

	// SpacingAfter is the vertical space to the next line (0 for last line)
	SpacingAfter float64

	// Alignment is the detected horizontal alignment
	Alignment LineAlignment

	// Indentation is the left indentation relative to the page/column margin
	Indentation float64

	// AverageFontSize is the average font size of fragments in this line
	AverageFontSize float64

	// Direction is the dominant text direction (LTR/RTL)
	Direction text.Direction
}

Line represents a single line of text on a page

func ReorderLinesForReading ¶

func ReorderLinesForReading(lines []Line, pageWidth, pageHeight float64) []Line

ReorderLinesForReading takes lines and returns them in proper reading order

func (*Line) ContainsPoint ¶

func (line *Line) ContainsPoint(x, y float64) bool

ContainsPoint returns true if the point is within the line's bounding box

func (*Line) HasLargerFont ¶

func (line *Line) HasLargerFont(size float64) bool

HasLargerFont returns true if this line's font is larger than the given size

func (*Line) IsEmpty ¶

func (line *Line) IsEmpty() bool

IsEmpty returns true if the line has no text content

func (*Line) IsIndented ¶

func (line *Line) IsIndented(margin, tolerance float64) bool

IsIndented returns true if the line is indented relative to the margin

func (*Line) WordCount ¶

func (line *Line) WordCount() int

WordCount returns an approximate word count for the line

type LineAlignment ¶

type LineAlignment int

LineAlignment represents the horizontal alignment of a line

const (
	AlignUnknown LineAlignment = iota
	AlignLeft
	AlignCenter
	AlignRight
	AlignJustified
)
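
One plausible way to arrive at these alignment values is to compare a line's left and right edges with the surrounding margins. The sketch below is an assumption about the general technique, not the package's algorithm: detectAlignment and its parameters are hypothetical, and the tolerance plays the role that AlignmentTolerance plays in LineConfig.

```go
package main

import "math"

type alignment int

const (
	alignUnknown alignment = iota
	alignLeft
	alignCenter
	alignRight
	alignJustified
)

// detectAlignment classifies a line from its left/right edges relative to
// the left/right margins, all in points, using tol as the allowed slack.
func detectAlignment(lineLeft, lineRight, marginLeft, marginRight, tol float64) alignment {
	nearLeft := math.Abs(lineLeft-marginLeft) <= tol
	nearRight := math.Abs(lineRight-marginRight) <= tol
	leftGap := lineLeft - marginLeft
	rightGap := marginRight - lineRight
	switch {
	case nearLeft && nearRight:
		return alignJustified // touches both margins
	case nearLeft:
		return alignLeft
	case nearRight:
		return alignRight
	case math.Abs(leftGap-rightGap) <= tol:
		return alignCenter // equal gaps on both sides
	}
	return alignUnknown
}
```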

func (LineAlignment) String ¶

func (a LineAlignment) String() string

String returns a string representation of the alignment

type LineConfig ¶

type LineConfig struct {
	// LineHeightTolerance is the Y-distance tolerance for grouping fragments into lines
	// as a fraction of fragment height (default: 0.5)
	LineHeightTolerance float64

	// MinLineWidth is the minimum width for a valid line (default: 5 points)
	MinLineWidth float64

	// AlignmentTolerance is the tolerance for alignment detection (default: 10 points)
	AlignmentTolerance float64

	// JustificationThreshold is the minimum line width ratio to consider justified
	// (default: 0.9 = line must be 90% of max width)
	JustificationThreshold float64
}

LineConfig holds configuration for line detection
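
To make LineHeightTolerance concrete, here is a minimal sketch of grouping fragments into lines when their Y positions differ by no more than a fraction of the fragment height. The fragment type and groupIntoLines function are stand-ins written for this example, assuming standard PDF coordinates (larger Y is higher on the page); the package's LineDetector may use a different strategy.

```go
package main

import (
	"math"
	"sort"
)

// fragment is a simplified stand-in for text.TextFragment.
type fragment struct {
	Text         string
	X, Y, Height float64
}

// groupIntoLines sorts fragments top to bottom, then starts a new line
// whenever a fragment's Y differs from the current line's first fragment
// by more than tolerance * height. Each line is ordered left to right.
func groupIntoLines(frags []fragment, tolerance float64) [][]fragment {
	sorted := append([]fragment(nil), frags...) // leave the input untouched
	sort.SliceStable(sorted, func(i, j int) bool {
		if sorted[i].Y != sorted[j].Y {
			return sorted[i].Y > sorted[j].Y
		}
		return sorted[i].X < sorted[j].X
	})

	var lines [][]fragment
	for _, f := range sorted {
		if n := len(lines); n > 0 {
			ref := lines[n-1][0]
			if math.Abs(f.Y-ref.Y) <= tolerance*ref.Height {
				lines[n-1] = append(lines[n-1], f)
				continue
			}
		}
		lines = append(lines, []fragment{f})
	}
	for _, line := range lines {
		sort.SliceStable(line, func(i, j int) bool { return line[i].X < line[j].X })
	}
	return lines
}
```

With the documented default of 0.5 and 10-point fragments, two fragments at Y=700 and Y=700.5 share a line, while a fragment at Y=680 starts a new one.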

func DefaultLineConfig ¶

func DefaultLineConfig() LineConfig

DefaultLineConfig returns sensible default configuration

type LineDetector ¶

type LineDetector struct {
	// contains filtered or unexported fields
}

LineDetector detects text lines on a page

func NewLineDetector ¶

func NewLineDetector() *LineDetector

NewLineDetector creates a new line detector with default configuration

func NewLineDetectorWithConfig ¶

func NewLineDetectorWithConfig(config LineConfig) *LineDetector

NewLineDetectorWithConfig creates a line detector with custom configuration

func (*LineDetector) Detect ¶

func (d *LineDetector) Detect(fragments []text.TextFragment, pageWidth, pageHeight float64) *LineLayout

Detect analyzes text fragments and detects lines

type LineLayout ¶

type LineLayout struct {
	// Lines are the detected text lines (sorted top to bottom)
	Lines []Line

	// PageWidth is the width of the page/region
	PageWidth float64

	// PageHeight is the height of the page/region
	PageHeight float64

	// AverageLineSpacing is the average spacing between lines
	AverageLineSpacing float64

	// AverageLineHeight is the average line height
	AverageLineHeight float64

	// Config is the configuration used for detection
	Config LineConfig
}

LineLayout represents the detected line structure of a page or region

func (*LineLayout) FindLinesInRegion ¶

func (l *LineLayout) FindLinesInRegion(bbox model.BBox) []Line

FindLinesInRegion returns lines that fall within a bounding box

func (*LineLayout) GetAllFragments ¶

func (l *LineLayout) GetAllFragments() []text.TextFragment

GetAllFragments returns all fragments in reading order

func (*LineLayout) GetLine ¶

func (l *LineLayout) GetLine(index int) *Line

GetLine returns a specific line by index

func (*LineLayout) GetLinesByAlignment ¶

func (l *LineLayout) GetLinesByAlignment(alignment LineAlignment) []Line

GetLinesByAlignment returns lines with a specific alignment

func (*LineLayout) GetText ¶

func (l *LineLayout) GetText() string

GetText returns all text in line order

func (*LineLayout) IsParagraphBreak ¶

func (l *LineLayout) IsParagraphBreak(lineIndex int) bool

IsParagraphBreak returns true if there's a paragraph break after the given line index

func (*LineLayout) LineCount ¶

func (l *LineLayout) LineCount() int

LineCount returns the number of detected lines

type List ¶

type List struct {
	// Items are the list items (top level only; nested items are in Children)
	Items []ListItem

	// BBox is the bounding box of the entire list
	BBox model.BBox

	// Type is the primary list type
	Type ListType

	// BulletStyle is the bullet style (for bullet lists)
	BulletStyle BulletStyle

	// Index is the list's position in document order
	Index int

	// Level is the nesting level of this list
	Level int

	// IsMixed indicates if the list contains mixed types (bullet + numbered)
	IsMixed bool

	// ItemCount is the total number of items (including nested)
	ItemCount int
}

List represents a complete list structure

func (*List) GetAllItems ¶

func (list *List) GetAllItems() []ListItem

GetAllItems returns all items including nested (flattened)

func (*List) GetText ¶

func (list *List) GetText() string

GetText returns all list text as a formatted string

func (*List) HasNesting ¶

func (list *List) HasNesting() bool

HasNesting returns true if the list contains nested items

func (*List) MaxDepth ¶

func (list *List) MaxDepth() int

MaxDepth returns the maximum nesting depth
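
The relationship between Items, Children, GetAllItems, and MaxDepth can be sketched with a small recursive walk. The listItem type below is a stand-in, and the depth convention (a list with no nested items has depth 0) is an assumption; the documentation does not say whether the package counts from 0 or 1.

```go
package main

// listItem is a stand-in keeping only the fields this sketch needs.
type listItem struct {
	Text     string
	Children []listItem
}

// flatten returns items and all nested children in depth-first order.
func flatten(items []listItem) []listItem {
	var out []listItem
	for _, it := range items {
		out = append(out, it)
		out = append(out, flatten(it.Children)...)
	}
	return out
}

// maxDepth returns the deepest nesting level, counting top-level items as 0.
func maxDepth(items []listItem) int {
	deepest := 0
	for _, it := range items {
		if len(it.Children) == 0 {
			continue
		}
		if d := 1 + maxDepth(it.Children); d > deepest {
			deepest = d
		}
	}
	return deepest
}
```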

func (*List) ToMarkdown ¶

func (list *List) ToMarkdown() string

ToMarkdown returns the list as markdown

type ListConfig ¶

type ListConfig struct {
	// BulletCharacters are characters recognized as bullets
	BulletCharacters []rune

	// IndentThreshold is the minimum indentation increase to consider nested
	// Default: 15 points
	IndentThreshold float64

	// MaxListGap is the maximum vertical gap between items to consider same list
	// as a ratio of line height (default: 2.0)
	MaxListGap float64

	// NumberedPatterns are regex patterns for numbered list items
	NumberedPatterns []*regexp.Regexp

	// LetterPatterns are regex patterns for lettered list items
	LetterPatterns []*regexp.Regexp

	// RomanPatterns are regex patterns for roman numeral list items
	RomanPatterns []*regexp.Regexp

	// MinConsecutiveItems is minimum items to consider a list (default: 2)
	MinConsecutiveItems int
}

ListConfig holds configuration for list detection

func DefaultListConfig ¶

func DefaultListConfig() ListConfig

DefaultListConfig returns sensible default configuration

type ListDetector ¶

type ListDetector struct {
	// contains filtered or unexported fields
}

ListDetector detects and structures lists in document content

func NewListDetector ¶

func NewListDetector() *ListDetector

NewListDetector creates a new list detector with default configuration

func NewListDetectorWithConfig ¶

func NewListDetectorWithConfig(config ListConfig) *ListDetector

NewListDetectorWithConfig creates a list detector with custom configuration

func (*ListDetector) DetectFromFragments ¶

func (d *ListDetector) DetectFromFragments(fragments []text.TextFragment, pageWidth, pageHeight float64) *ListLayout

DetectFromFragments analyzes fragments directly and detects lists

func (*ListDetector) DetectFromLines ¶

func (d *ListDetector) DetectFromLines(lines []Line, pageWidth, pageHeight float64) *ListLayout

DetectFromLines analyzes lines directly and detects lists

func (*ListDetector) DetectFromParagraphs ¶

func (d *ListDetector) DetectFromParagraphs(paragraphs []Paragraph, pageWidth, pageHeight float64) *ListLayout

DetectFromParagraphs analyzes paragraphs and detects lists

type ListItem ¶

type ListItem struct {
	// Text is the item text (without the bullet/number prefix)
	Text string

	// RawText is the original text including prefix
	RawText string

	// Prefix is the bullet or number prefix (e.g., "•", "1.", "a)")
	Prefix string

	// BBox is the bounding box of this item
	BBox model.BBox

	// Lines are the lines that make up this item (may span multiple lines)
	Lines []Line

	// Index is the item's position within its parent list (0-based)
	Index int

	// Level is the nesting level (0 = top level, 1 = first nested, etc.)
	Level int

	// ListType is the type of list this item belongs to
	ListType ListType

	// BulletStyle is the bullet style (for bullet lists)
	BulletStyle BulletStyle

	// Number is the numeric value for numbered/lettered lists
	Number int

	// Children contains nested list items
	Children []ListItem
}

ListItem represents a single item in a list

func (*ListItem) ChildCount ¶

func (item *ListItem) ChildCount() int

ChildCount returns the number of direct children

func (*ListItem) ContainsPoint ¶

func (item *ListItem) ContainsPoint(x, y float64) bool

ContainsPoint returns true if the point is within the item's bounding box

func (*ListItem) GetFullText ¶

func (item *ListItem) GetFullText() string

GetFullText returns the raw text including prefix

func (*ListItem) HasChildren ¶

func (item *ListItem) HasChildren() bool

HasChildren returns true if this item has nested items

func (*ListItem) IsCheckbox ¶

func (item *ListItem) IsCheckbox() bool

IsCheckbox returns true if this is a checkbox item

func (*ListItem) IsChecked ¶

func (item *ListItem) IsChecked() bool

IsChecked returns true if this is a checked checkbox

func (*ListItem) IsFirstInList ¶

func (item *ListItem) IsFirstInList() bool

IsFirstInList returns true if this item has number/index 1 or 0

func (*ListItem) WordCount ¶

func (item *ListItem) WordCount() int

WordCount returns the word count of the item text

type ListLayout ¶

type ListLayout struct {
	// Lists are all detected lists in document order
	Lists []List

	// AllItems are all list items in reading order (flattened)
	AllItems []ListItem

	// PageWidth and PageHeight
	PageWidth  float64
	PageHeight float64

	// Config is the configuration used for detection
	Config ListConfig
}

ListLayout represents all detected lists on a page

func (*ListLayout) FindListsInRegion ¶

func (l *ListLayout) FindListsInRegion(bbox model.BBox) []List

FindListsInRegion returns lists within a bounding box

func (*ListLayout) GetBulletLists ¶

func (l *ListLayout) GetBulletLists() []List

GetBulletLists returns all bullet lists

func (*ListLayout) GetList ¶

func (l *ListLayout) GetList(index int) *List

GetList returns a specific list by index

func (*ListLayout) GetListsByType ¶

func (l *ListLayout) GetListsByType(listType ListType) []List

GetListsByType returns lists of a specific type

func (*ListLayout) GetNumberedLists ¶

func (l *ListLayout) GetNumberedLists() []List

GetNumberedLists returns all numbered lists

func (*ListLayout) ListCount ¶

func (l *ListLayout) ListCount() int

ListCount returns the number of detected lists

func (*ListLayout) TotalItemCount ¶

func (l *ListLayout) TotalItemCount() int

TotalItemCount returns the total number of list items

type ListType ¶

type ListType int

ListType represents the type of list

const (
	ListTypeUnknown  ListType = iota
	ListTypeBullet            // Bullet points (•, -, *, etc.)
	ListTypeNumbered          // Numbered (1., 2., 3.)
	ListTypeLettered          // Lettered (a., b., c. or A., B., C.)
	ListTypeRoman             // Roman numerals (i., ii., iii. or I., II., III.)
	ListTypeCheckbox          // Checkbox lists (☐, ☑, ✓)
)

func (ListType) String ¶

func (t ListType) String() string

String returns a string representation of the list type

type OutlineEntry ¶

type OutlineEntry struct {
	// Heading is the heading for this entry
	Heading Heading

	// Children are nested outline entries
	Children []OutlineEntry

	// Depth is the nesting depth (0 = top level)
	Depth int
}

OutlineEntry represents an entry in a document outline
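
A hierarchical outline of this shape can be derived from a flat, ordered heading list by nesting each heading under the nearest preceding heading with a smaller level. The sketch below shows one way to do that with stand-in types; buildOutline and outline are hypothetical helpers, and how GetOutline handles skipped levels (for example, an H3 directly under an H1) is not documented, so the behavior here is an assumption.

```go
package main

// heading and outlineEntry are stand-ins for Heading and OutlineEntry.
type heading struct {
	Level int // 1 for H1 through 6 for H6
	Text  string
}

type outlineEntry struct {
	Heading  heading
	Children []outlineEntry
	Depth    int // 0 = top level
}

// buildOutline consumes headings from *pos for as long as their level is
// deeper than parentLevel, recursing to collect each entry's children.
func buildOutline(hs []heading, pos *int, parentLevel, depth int) []outlineEntry {
	var entries []outlineEntry
	for *pos < len(hs) && hs[*pos].Level > parentLevel {
		h := hs[*pos]
		*pos++
		entry := outlineEntry{Heading: h, Depth: depth}
		entry.Children = buildOutline(hs, pos, h.Level, depth+1)
		entries = append(entries, entry)
	}
	return entries
}

// outline builds the full tree for a flat heading list in document order.
func outline(hs []heading) []outlineEntry {
	pos := 0
	return buildOutline(hs, &pos, 0, 0)
}
```

For the sequence H1 "A", H2 "B", H3 "C", H2 "D", H1 "E", this produces two top-level entries: "A" with children "B" (which contains "C") and "D", followed by "E".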

type PageFragments ¶

type PageFragments struct {
	PageIndex  int
	PageHeight float64
	PageWidth  float64
	Fragments  []text.TextFragment
}

PageFragments represents text fragments from a single page

type Paragraph ¶

type Paragraph struct {
	// BBox is the bounding box of the paragraph
	BBox model.BBox

	// Lines are the text lines in this paragraph (in reading order)
	Lines []Line

	// Text is the assembled text content
	Text string

	// Index is the paragraph's position in reading order (0-based)
	Index int

	// Style is the detected paragraph style
	Style ParagraphStyle

	// Alignment is the dominant alignment of the paragraph
	Alignment LineAlignment

	// FirstLineIndent is the indentation of the first line relative to subsequent lines
	FirstLineIndent float64

	// LeftMargin is the left margin of the paragraph body
	LeftMargin float64

	// AverageFontSize is the average font size across all lines
	AverageFontSize float64

	// LineSpacing is the average spacing between lines within this paragraph
	LineSpacing float64

	// SpacingBefore is the space before this paragraph
	SpacingBefore float64

	// SpacingAfter is the space after this paragraph
	SpacingAfter float64
}

Paragraph represents a logical paragraph of text

func (*Paragraph) ContainsPoint ¶

func (p *Paragraph) ContainsPoint(x, y float64) bool

ContainsPoint returns true if the point is within the paragraph's bounding box

func (*Paragraph) GetFirstLine ¶

func (p *Paragraph) GetFirstLine() *Line

GetFirstLine returns the first line of the paragraph

func (*Paragraph) GetLastLine ¶

func (p *Paragraph) GetLastLine() *Line

GetLastLine returns the last line of the paragraph

func (*Paragraph) HasFirstLineIndent ¶

func (p *Paragraph) HasFirstLineIndent() bool

HasFirstLineIndent returns true if the paragraph has first-line indentation

func (*Paragraph) IsBlockQuote ¶

func (p *Paragraph) IsBlockQuote() bool

IsBlockQuote returns true if this paragraph is styled as a block quote

func (*Paragraph) IsHeading ¶

func (p *Paragraph) IsHeading() bool

IsHeading returns true if this paragraph is styled as a heading

func (*Paragraph) IsListItem ¶

func (p *Paragraph) IsListItem() bool

IsListItem returns true if this paragraph is styled as a list item

func (*Paragraph) LineCount ¶

func (p *Paragraph) LineCount() int

LineCount returns the number of lines in this paragraph

func (*Paragraph) WordCount ¶

func (p *Paragraph) WordCount() int

WordCount returns an approximate word count for the paragraph

type ParagraphConfig ¶

type ParagraphConfig struct {
	// SpacingThreshold is the multiplier for line spacing to detect paragraph breaks
	// If spacing > avgLineSpacing * SpacingThreshold, it's a paragraph break
	// Default: 1.5
	SpacingThreshold float64

	// IndentThreshold is the minimum indentation to consider as first-line indent
	// Default: 15 points
	IndentThreshold float64

	// HeadingFontSizeRatio is the font size ratio to consider as heading
	// If fontSize > avgFontSize * HeadingFontSizeRatio, it's a heading
	// Default: 1.2 (20% larger)
	HeadingFontSizeRatio float64

	// MinParagraphLines is the minimum number of lines for a paragraph
	// Default: 1
	MinParagraphLines int

	// BlockQuoteIndent is the minimum indentation to consider as block quote
	// Default: 30 points
	BlockQuoteIndent float64

	// ListItemPatterns are regex patterns that indicate list items
	// Default: bullet points, numbers, letters
	ListItemPatterns []string
}

ParagraphConfig holds configuration for paragraph detection
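
The SpacingThreshold rule (a break occurs when spacing exceeds avgLineSpacing * SpacingThreshold) can be sketched as below. splitParagraphs is a hypothetical helper that works on the gaps between consecutive lines; it computes the average over all gaps, which is an assumption, since the documentation does not say how the package derives avgLineSpacing.

```go
package main

// splitParagraphs groups line indices into paragraphs. gaps[i] is the
// vertical space between line i and line i+1, so there are len(gaps)+1
// lines. A gap larger than average*threshold starts a new paragraph.
func splitParagraphs(gaps []float64, threshold float64) [][]int {
	var sum float64
	for _, g := range gaps {
		sum += g
	}
	avg := 0.0
	if len(gaps) > 0 {
		avg = sum / float64(len(gaps))
	}

	paragraphs := [][]int{{0}} // line 0 always opens the first paragraph
	for i, g := range gaps {
		if g > avg*threshold {
			paragraphs = append(paragraphs, nil)
		}
		last := len(paragraphs) - 1
		paragraphs[last] = append(paragraphs[last], i+1)
	}
	return paragraphs
}
```

With gaps of 2, 2, 10, and 2 points the average is 4, so the default threshold of 1.5 puts the cutoff at 6 points. Only the 10-point gap exceeds it, giving the paragraphs [0 1 2] and [3 4].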

func DefaultParagraphConfig ¶

func DefaultParagraphConfig() ParagraphConfig

DefaultParagraphConfig returns sensible default configuration

type ParagraphDetector ¶

type ParagraphDetector struct {
	// contains filtered or unexported fields
}

ParagraphDetector detects paragraphs from lines

func NewParagraphDetector ¶

func NewParagraphDetector() *ParagraphDetector

NewParagraphDetector creates a new paragraph detector with default configuration

func NewParagraphDetectorWithConfig ¶

func NewParagraphDetectorWithConfig(config ParagraphConfig) *ParagraphDetector

NewParagraphDetectorWithConfig creates a paragraph detector with custom configuration

func (*ParagraphDetector) Detect ¶

func (d *ParagraphDetector) Detect(lines []Line, pageWidth, pageHeight float64) *ParagraphLayout

Detect analyzes lines and groups them into paragraphs

func (*ParagraphDetector) DetectFromFragments ¶

func (d *ParagraphDetector) DetectFromFragments(fragments []text.TextFragment, pageWidth, pageHeight float64) *ParagraphLayout

DetectFromFragments is a convenience method that first detects lines, then paragraphs

type ParagraphLayout ¶

type ParagraphLayout struct {
	// Paragraphs are the detected paragraphs (in reading order)
	Paragraphs []Paragraph

	// PageWidth is the width of the page
	PageWidth float64

	// PageHeight is the height of the page
	PageHeight float64

	// AverageParagraphSpacing is the average spacing between paragraphs
	AverageParagraphSpacing float64

	// Config is the configuration used for detection
	Config ParagraphConfig
}

ParagraphLayout represents the detected paragraph structure of a page

func (*ParagraphLayout) FindParagraphsInRegion ¶

func (l *ParagraphLayout) FindParagraphsInRegion(bbox model.BBox) []Paragraph

FindParagraphsInRegion returns paragraphs that fall within a bounding box

func (*ParagraphLayout) GetHeadings ¶

func (l *ParagraphLayout) GetHeadings() []Paragraph

GetHeadings returns all paragraphs detected as headings

func (*ParagraphLayout) GetListItems ¶

func (l *ParagraphLayout) GetListItems() []Paragraph

GetListItems returns all paragraphs detected as list items

func (*ParagraphLayout) GetParagraph ¶

func (l *ParagraphLayout) GetParagraph(index int) *Paragraph

GetParagraph returns a specific paragraph by index

func (*ParagraphLayout) GetParagraphsByStyle ¶

func (l *ParagraphLayout) GetParagraphsByStyle(style ParagraphStyle) []Paragraph

GetParagraphsByStyle returns paragraphs with a specific style

func (*ParagraphLayout) GetText ¶

func (l *ParagraphLayout) GetText() string

GetText returns all text with paragraph breaks

func (*ParagraphLayout) ParagraphCount ¶

func (l *ParagraphLayout) ParagraphCount() int

ParagraphCount returns the number of detected paragraphs

type ParagraphStyle ¶

type ParagraphStyle int

ParagraphStyle represents the detected style of a paragraph

const (
	StyleNormal ParagraphStyle = iota
	StyleHeading
	StyleBlockQuote
	StyleListItem
	StyleCode
	StyleCaption
)

func (ParagraphStyle) String ¶

func (s ParagraphStyle) String() string

String returns a string representation of the paragraph style

type ReadingDirection ¶

type ReadingDirection int

ReadingDirection indicates the primary reading direction of a document

const (
	// LeftToRight is the default for most Western languages
	LeftToRight ReadingDirection = iota
	// RightToLeft is used for Arabic, Hebrew, etc.
	RightToLeft
	// TopToBottom is used for traditional Chinese/Japanese
	TopToBottom
)

func (ReadingDirection) String ¶

func (d ReadingDirection) String() string

String returns a string representation of the reading direction

type ReadingOrderConfig ¶

type ReadingOrderConfig struct {
	// Direction is the primary reading direction
	Direction ReadingDirection

	// ColumnConfig is the configuration for column detection
	ColumnConfig ColumnConfig

	// LineConfig is the configuration for line detection
	LineConfig LineConfig

	// PreferColumnOrder when true, reads entire columns before moving to next
	// When false, may interleave if content appears to flow across columns
	PreferColumnOrder bool

	// SpanningThreshold is the minimum width ratio for content to be considered spanning
	// Default: 0.7 (content spanning 70%+ of page width is considered spanning)
	SpanningThreshold float64

	// InvertedY indicates that Y coordinates increase downward (Y=0 at top)
	// rather than the standard PDF convention where Y increases upward (Y=0 at bottom)
	// When true, lower Y values are at the top of the page
	// Default: false (auto-detect based on content)
	InvertedY *bool
}

ReadingOrderConfig holds configuration for reading order detection
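
Two of these settings lend themselves to a short sketch: the width-ratio test behind SpanningThreshold and the optional InvertedY flag. The types and helpers below are stand-ins written for illustration. In particular, treating a nil InvertedY as "auto-detect" is an assumption drawn from the field being a *bool with an auto-detect note, not something the documentation states outright.

```go
package main

// readingOrderConfig is a stand-in with only the two fields discussed here.
type readingOrderConfig struct {
	SpanningThreshold float64
	InvertedY         *bool
}

// boolPtr is a small helper for setting the optional InvertedY field.
func boolPtr(b bool) *bool { return &b }

// isSpanning reports whether content of the given width covers at least
// SpanningThreshold of the page width.
func isSpanning(width, pageWidth float64, cfg readingOrderConfig) bool {
	return pageWidth > 0 && width/pageWidth >= cfg.SpanningThreshold
}

// resolveInvertedY prefers an explicit setting and otherwise falls back to
// a value detected from the content (assumed nil-means-auto behavior).
func resolveInvertedY(cfg readingOrderConfig, detected bool) bool {
	if cfg.InvertedY != nil {
		return *cfg.InvertedY
	}
	return detected
}

// isAbove reports whether y1 is higher on the page than y2.
func isAbove(y1, y2 float64, invertedY bool) bool {
	if invertedY {
		return y1 < y2 // Y=0 at the top: smaller values are higher
	}
	return y1 > y2 // standard PDF: larger values are higher
}
```

On a 612-point-wide page with the default threshold of 0.7, a 500-point-wide title counts as spanning while a 250-point-wide column block does not.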

func DefaultReadingOrderConfig ¶

func DefaultReadingOrderConfig() ReadingOrderConfig

DefaultReadingOrderConfig returns sensible default configuration

type ReadingOrderDetector ¶

type ReadingOrderDetector struct {
	// contains filtered or unexported fields
}

ReadingOrderDetector determines the correct reading order for page content

func NewReadingOrderDetector ¶

func NewReadingOrderDetector() *ReadingOrderDetector

NewReadingOrderDetector creates a new reading order detector with default configuration

func NewReadingOrderDetectorWithConfig ¶

func NewReadingOrderDetectorWithConfig(config ReadingOrderConfig) *ReadingOrderDetector

NewReadingOrderDetectorWithConfig creates a reading order detector with custom configuration

func (*ReadingOrderDetector) Detect ¶

func (d *ReadingOrderDetector) Detect(fragments []text.TextFragment, pageWidth, pageHeight float64) *ReadingOrderResult

Detect analyzes fragments and returns them in proper reading order

func (*ReadingOrderDetector) DetectFromLines ¶

func (d *ReadingOrderDetector) DetectFromLines(lines []Line, pageWidth, pageHeight float64) *ReadingOrderResult

DetectFromLines is a convenience method when you already have lines

type ReadingOrderResult ¶

type ReadingOrderResult struct {
	// Fragments in reading order
	Fragments []text.TextFragment

	// Lines in reading order
	Lines []Line

	// Sections represent logical sections of content (spanning + column content)
	Sections []ReadingSection

	// Direction is the detected or configured reading direction
	Direction ReadingDirection

	// ColumnCount is the number of columns detected
	ColumnCount int

	// PageWidth and PageHeight
	PageWidth  float64
	PageHeight float64
}

ReadingOrderResult holds the result of reading order analysis

func (*ReadingOrderResult) GetParagraphs ¶

func (r *ReadingOrderResult) GetParagraphs() *ParagraphLayout

GetParagraphs detects paragraphs from the ordered lines. For multi-column documents, paragraphs are detected within each section separately to maintain proper spacing context, then combined in reading order.

func (*ReadingOrderResult) GetSectionCount ¶

func (r *ReadingOrderResult) GetSectionCount() int

GetSectionCount returns the number of reading sections

func (*ReadingOrderResult) GetText ¶

func (r *ReadingOrderResult) GetText() string

GetText returns all text in reading order

func (*ReadingOrderResult) IsMultiColumn ¶

func (r *ReadingOrderResult) IsMultiColumn() bool

IsMultiColumn returns true if multiple columns were detected

type ReadingSection ¶

type ReadingSection struct {
	// Type indicates what kind of section this is
	Type SectionType

	// Lines in this section (in reading order)
	Lines []Line

	// Fragments in this section (in reading order)
	Fragments []text.TextFragment

	// ColumnIndex for column sections (-1 for spanning)
	ColumnIndex int

	// BBox is the bounding box of this section
	BBox struct {
		X, Y, Width, Height float64
	}
}

ReadingSection represents a section of content in reading order

type RegionType ¶

type RegionType int

RegionType indicates whether a region is a header or footer

const (
	Header RegionType = iota
	Footer
)

func (RegionType) String ¶

func (r RegionType) String() string

type SectionType ¶

type SectionType int

SectionType indicates the type of reading section

const (
	SectionSpanning SectionType = iota // Full-width content (titles, headers)
	SectionColumn                      // Column content
)

func (SectionType) String ¶

func (t SectionType) String() string

String returns a string representation of the section type

