Documentation ¶
Overview ¶
Package layout provides document layout analysis for extracting semantic structure from PDF pages. It analyzes text fragments to detect document structure, including lines, blocks, paragraphs, headings, lists, columns, and reading order, and it includes the unified Analyzer that orchestrates all of these detection components.
Layout Analysis ¶
The Analyzer orchestrates all detection components:
analyzer := layout.NewAnalyzer()
result := analyzer.Analyze(fragments, pageWidth, pageHeight)
For faster analysis without heading/list detection:
result := analyzer.QuickAnalyze(fragments, pageWidth, pageHeight)
Analysis Results ¶
The AnalysisResult contains:
- Elements - all detected elements in reading order
- Columns - column layout information
- ReadingOrder - proper reading sequence for multi-column layouts
- Headings, Lists, Paragraphs - detected semantic elements
- Blocks, Lines - lower-level text structure
Detectors ¶
The package includes specialized detectors:
- LineDetector - groups fragments into text lines
- ParagraphDetector - groups lines into paragraphs
- HeadingDetector - identifies headings by font size and position
- ListDetector - detects bulleted and numbered lists
- BlockDetector - detects spatial text blocks
- ColumnDetector - detects multi-column layouts
- ReadingOrderDetector - determines proper reading sequence
- HeaderFooterDetector - identifies repeated headers/footers
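As an illustration of the first step in that list, the following is a minimal sketch of grouping fragments into text lines by vertical proximity. It is not the package's LineDetector: the `fragment` type, the `groupIntoLines` and `lineText` helpers, and the choice of a tolerance expressed as a fraction of fragment height are all assumptions made for this example, and it assumes PDF-style coordinates in which Y increases upward.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// fragment is a hypothetical stand-in for text.TextFragment.
type fragment struct {
	X, Y, Height float64
	Text         string
}

// groupIntoLines sorts fragments top to bottom (descending Y) and starts a
// new line whenever a fragment sits further below the first fragment of the
// current line than tolerance * fragment height.
func groupIntoLines(frags []fragment, tolerance float64) [][]fragment {
	sorted := append([]fragment(nil), frags...)
	sort.SliceStable(sorted, func(i, j int) bool { return sorted[i].Y > sorted[j].Y })

	var lines [][]fragment
	for _, f := range sorted {
		n := len(lines)
		if n > 0 {
			ref := lines[n-1][0]
			if ref.Y-f.Y <= tolerance*f.Height {
				lines[n-1] = append(lines[n-1], f)
				continue
			}
		}
		lines = append(lines, []fragment{f})
	}

	// Order the fragments of each line left to right.
	for _, line := range lines {
		sort.SliceStable(line, func(i, j int) bool { return line[i].X < line[j].X })
	}
	return lines
}

// lineText joins the fragment texts of one line with single spaces.
func lineText(line []fragment) string {
	parts := make([]string, len(line))
	for i, f := range line {
		parts[i] = f.Text
	}
	return strings.Join(parts, " ")
}

func main() {
	frags := []fragment{
		{X: 120, Y: 700, Height: 12, Text: "world"},
		{X: 72, Y: 701, Height: 12, Text: "Hello"},
		{X: 72, Y: 680, Height: 12, Text: "Second line"},
	}
	for _, line := range groupIntoLines(frags, 0.5) {
		fmt.Println(lineText(line))
	}
}
```

A real detector would likely also account for baselines, font sizes, and text direction; the sketch only shows the grouping idea.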
Configuration ¶
Each detector can be configured independently:
config := layout.DefaultAnalyzerConfig()
config.DetectHeadings = true
config.DetectLists = true
config.HeadingConfig.MinSpacingRatio = 1.2
analyzer := layout.NewAnalyzerWithConfig(config)
Header/Footer Filtering ¶
For multi-page documents, headers and footers can be detected and filtered:
result := analyzer.AnalyzeWithHeaderFooterFiltering(pageFragments, pageIndex)
The package also provides heading detection, which identifies and classifies document headings by level (H1-H6); list detection, which identifies and structures bulleted and numbered lists; and reading order detection, which determines the correct sequence for reading text in complex layouts.
Index ¶
- func IsListItemText(text string) bool
- func ReorderForReading(fragments []text.TextFragment, pageWidth, pageHeight float64) []text.TextFragment
- type AnalysisResult
- type AnalysisStats
- type Analyzer
- func (a *Analyzer) Analyze(fragments []text.TextFragment, pageWidth, pageHeight float64) *AnalysisResult
- func (a *Analyzer) AnalyzeWithHeaderFooterFiltering(pageFragments []PageFragments, pageIndex int) *AnalysisResult
- func (a *Analyzer) QuickAnalyze(fragments []text.TextFragment, pageWidth, pageHeight float64) *AnalysisResult
- type AnalyzerConfig
- type Block
- type BlockConfig
- type BlockDetector
- type BlockLayout
- type BulletStyle
- type Column
- type ColumnConfig
- type ColumnDetector
- type ColumnLayout
- type Gap
- type HeaderFooterConfig
- type HeaderFooterDetector
- type HeaderFooterRegion
- type HeaderFooterResult
- func (r *HeaderFooterResult) FilterFragments(pageIndex int, fragments []text.TextFragment, pageHeight float64) []text.TextFragment
- func (r *HeaderFooterResult) GetFooterTexts() []string
- func (r *HeaderFooterResult) GetHeaderTexts() []string
- func (r *HeaderFooterResult) HasFooters() bool
- func (r *HeaderFooterResult) HasHeaders() bool
- func (r *HeaderFooterResult) HasHeadersOrFooters() bool
- func (r *HeaderFooterResult) Summary() string
- type Heading
- type HeadingConfig
- type HeadingDetector
- func (d *HeadingDetector) DetectFromFragments(fragments []text.TextFragment, pageWidth, pageHeight float64) *HeadingLayout
- func (d *HeadingDetector) DetectFromLines(lines []Line, pageWidth, pageHeight float64) *HeadingLayout
- func (d *HeadingDetector) DetectFromParagraphs(paragraphs []Paragraph, pageWidth, pageHeight float64) *HeadingLayout
- type HeadingLayout
- func (l *HeadingLayout) FindHeadingBefore(y float64) *Heading
- func (l *HeadingLayout) FindHeadingsInRegion(bbox model.BBox) []Heading
- func (l *HeadingLayout) GetH1() []Heading
- func (l *HeadingLayout) GetH2() []Heading
- func (l *HeadingLayout) GetH3() []Heading
- func (l *HeadingLayout) GetHeading(index int) *Heading
- func (l *HeadingLayout) GetHeadingsAtLevel(level HeadingLevel) []Heading
- func (l *HeadingLayout) GetHeadingsInRange(minLevel, maxLevel HeadingLevel) []Heading
- func (l *HeadingLayout) GetMarkdownTOC() string
- func (l *HeadingLayout) GetOutline() []OutlineEntry
- func (l *HeadingLayout) GetTableOfContents() string
- func (l *HeadingLayout) HeadingCount() int
- type HeadingLevel
- type LayoutElement
- type Line
- type LineAlignment
- type LineConfig
- type LineDetector
- type LineLayout
- func (l *LineLayout) FindLinesInRegion(bbox model.BBox) []Line
- func (l *LineLayout) GetAllFragments() []text.TextFragment
- func (l *LineLayout) GetLine(index int) *Line
- func (l *LineLayout) GetLinesByAlignment(alignment LineAlignment) []Line
- func (l *LineLayout) GetText() string
- func (l *LineLayout) IsParagraphBreak(lineIndex int) bool
- func (l *LineLayout) LineCount() int
- type List
- type ListConfig
- type ListDetector
- func (d *ListDetector) DetectFromFragments(fragments []text.TextFragment, pageWidth, pageHeight float64) *ListLayout
- func (d *ListDetector) DetectFromLines(lines []Line, pageWidth, pageHeight float64) *ListLayout
- func (d *ListDetector) DetectFromParagraphs(paragraphs []Paragraph, pageWidth, pageHeight float64) *ListLayout
- type ListItem
- func (item *ListItem) ChildCount() int
- func (item *ListItem) ContainsPoint(x, y float64) bool
- func (item *ListItem) GetFullText() string
- func (item *ListItem) HasChildren() bool
- func (item *ListItem) IsCheckbox() bool
- func (item *ListItem) IsChecked() bool
- func (item *ListItem) IsFirstInList() bool
- func (item *ListItem) WordCount() int
- type ListLayout
- func (l *ListLayout) FindListsInRegion(bbox model.BBox) []List
- func (l *ListLayout) GetBulletLists() []List
- func (l *ListLayout) GetList(index int) *List
- func (l *ListLayout) GetListsByType(listType ListType) []List
- func (l *ListLayout) GetNumberedLists() []List
- func (l *ListLayout) ListCount() int
- func (l *ListLayout) TotalItemCount() int
- type ListType
- type OutlineEntry
- type PageFragments
- type Paragraph
- func (p *Paragraph) ContainsPoint(x, y float64) bool
- func (p *Paragraph) GetFirstLine() *Line
- func (p *Paragraph) GetLastLine() *Line
- func (p *Paragraph) HasFirstLineIndent() bool
- func (p *Paragraph) IsBlockQuote() bool
- func (p *Paragraph) IsHeading() bool
- func (p *Paragraph) IsListItem() bool
- func (p *Paragraph) LineCount() int
- func (p *Paragraph) WordCount() int
- type ParagraphConfig
- type ParagraphDetector
- type ParagraphLayout
- func (l *ParagraphLayout) FindParagraphsInRegion(bbox model.BBox) []Paragraph
- func (l *ParagraphLayout) GetHeadings() []Paragraph
- func (l *ParagraphLayout) GetListItems() []Paragraph
- func (l *ParagraphLayout) GetParagraph(index int) *Paragraph
- func (l *ParagraphLayout) GetParagraphsByStyle(style ParagraphStyle) []Paragraph
- func (l *ParagraphLayout) GetText() string
- func (l *ParagraphLayout) ParagraphCount() int
- type ParagraphStyle
- type ReadingDirection
- type ReadingOrderConfig
- type ReadingOrderDetector
- type ReadingOrderResult
- type ReadingSection
- type RegionType
- type SectionType
Constants ¶
This section is empty.
Variables ¶
This section is empty.
Functions ¶
func IsListItemText ¶
func IsListItemText(text string) bool
IsListItemText checks if text appears to be a list item
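The documentation does not say which patterns IsListItemText recognizes. As a rough sketch of what such a check could look like, the example below tests for a leading bullet character (taken from the BulletStyle constants) or a numbered/lettered marker. The `looksLikeListItem` function, the `bulletMarkers` constant, and the regular expression are hypothetical and are not the package's implementation.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// bulletMarkers lists the bullet characters named in the BulletStyle constants.
const bulletMarkers = "•○■-*→▶▪▸☐☑✓"

// numberedPrefix matches markers such as "1.", "12)", "a." or "B)"
// followed by whitespace and at least one more character.
var numberedPrefix = regexp.MustCompile(`^(\d+|[A-Za-z])[.)]\s+\S`)

// looksLikeListItem is a hypothetical heuristic: the trimmed text must start
// with a bullet followed by a space, or with a numbered/lettered marker.
func looksLikeListItem(text string) bool {
	t := strings.TrimSpace(text)
	for _, marker := range bulletMarkers {
		if strings.HasPrefix(t, string(marker)+" ") {
			return true
		}
	}
	return numberedPrefix.MatchString(t)
}

func main() {
	for _, s := range []string{"• First point", "2) Second step", "Plain sentence."} {
		fmt.Printf("%q -> %v\n", s, looksLikeListItem(s))
	}
}
```

A heuristic like this will produce false positives, for example on a sentence that begins with an initial such as "A. Smith", so a real detector presumably combines it with layout cues such as indentation.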
func ReorderForReading ¶
func ReorderForReading(fragments []text.TextFragment, pageWidth, pageHeight float64) []text.TextFragment
ReorderForReading takes fragments and returns them in proper reading order. This is a convenience function for simple use cases.
Types ¶
type AnalysisResult ¶
type AnalysisResult struct {
// Elements are all detected elements in reading order
Elements []LayoutElement
// Columns is the column layout analysis
Columns *ColumnLayout
// ReadingOrder is the reading order analysis
ReadingOrder *ReadingOrderResult
// Headings are all detected headings
Headings *HeadingLayout
// Lists are all detected lists
Lists *ListLayout
// Paragraphs are all detected paragraphs
Paragraphs *ParagraphLayout
// Blocks are all detected text blocks
Blocks *BlockLayout
// Lines are all detected lines
Lines *LineLayout
// PageWidth and PageHeight
PageWidth float64
PageHeight float64
// Statistics
Stats AnalysisStats
}
AnalysisResult holds the complete results from layout analysis, including detected elements, intermediate analysis structures (columns, lines, blocks, paragraphs, headings, lists), and statistics about the analysis.
func (*AnalysisResult) GetElements ¶
func (r *AnalysisResult) GetElements() []model.Element
GetElements converts all layout elements to model.Element interfaces, returning them in reading order.
func (*AnalysisResult) GetMarkdown ¶
func (r *AnalysisResult) GetMarkdown() string
GetMarkdown returns a Markdown representation of the document with headings, lists, and paragraphs formatted appropriately.
func (*AnalysisResult) GetText ¶
func (r *AnalysisResult) GetText() string
GetText returns all extracted text concatenated in reading order. It prefers reading order text if available, falling back to paragraph text.
type AnalysisStats ¶
type AnalysisStats struct {
FragmentCount int
LineCount int
BlockCount int
ParagraphCount int
HeadingCount int
ListCount int
ColumnCount int
ElementCount int
}
AnalysisStats contains counts of detected elements from the layout analysis.
type Analyzer ¶
type Analyzer struct {
// contains filtered or unexported fields
}
Analyzer orchestrates all layout detection components to extract semantic structure from PDF pages. It combines column, line, block, paragraph, heading, list, and reading order detection into a unified analysis pipeline.
func NewAnalyzer ¶
func NewAnalyzer() *Analyzer
NewAnalyzer creates a new layout analyzer with default configuration.
func NewAnalyzerWithConfig ¶
func NewAnalyzerWithConfig(config AnalyzerConfig) *Analyzer
NewAnalyzerWithConfig creates a new layout analyzer with the specified configuration.
func (*Analyzer) Analyze ¶
func (a *Analyzer) Analyze(fragments []text.TextFragment, pageWidth, pageHeight float64) *AnalysisResult
Analyze performs complete layout analysis on the given text fragments. It runs through all detection phases: column detection, reading order analysis, line detection, block detection, paragraph detection, heading detection (if enabled), list detection (if enabled), and finally builds a unified element tree.
func (*Analyzer) AnalyzeWithHeaderFooterFiltering ¶
func (a *Analyzer) AnalyzeWithHeaderFooterFiltering(pageFragments []PageFragments, pageIndex int) *AnalysisResult
AnalyzeWithHeaderFooterFiltering performs layout analysis with automatic header and footer detection and removal. This requires multiple pages to identify repeated content at the top and bottom of pages.
func (*Analyzer) QuickAnalyze ¶
func (a *Analyzer) QuickAnalyze(fragments []text.TextFragment, pageWidth, pageHeight float64) *AnalysisResult
QuickAnalyze performs a fast analysis focusing on text structure without detailed heading or list detection. It only runs reading order and paragraph detection for better performance when fine-grained structure is not needed.
type AnalyzerConfig ¶
type AnalyzerConfig struct {
// Column detection configuration
ColumnConfig ColumnConfig
// Line detection configuration
LineConfig LineConfig
// Paragraph detection configuration
ParagraphConfig ParagraphConfig
// Block detection configuration
BlockConfig BlockConfig
// Heading detection configuration
HeadingConfig HeadingConfig
// List detection configuration
ListConfig ListConfig
// Reading order configuration
ReadingOrderConfig ReadingOrderConfig
// DetectHeadings enables heading detection
DetectHeadings bool
// DetectLists enables list detection
DetectLists bool
// UseReadingOrder uses reading order for element ordering
UseReadingOrder bool
}
AnalyzerConfig holds configuration options for the layout analyzer. Each detection component has its own sub-configuration, and there are flags to enable or disable optional analysis features.
func DefaultAnalyzerConfig ¶
func DefaultAnalyzerConfig() AnalyzerConfig
DefaultAnalyzerConfig returns a configuration with sensible defaults for typical document layout analysis, with all detection features enabled.
type Block ¶
type Block struct {
// BBox is the bounding box of the block
BBox model.BBox
// Fragments are the text fragments contained in this block (in reading order)
Fragments []text.TextFragment
// Lines are the fragments grouped into horizontal lines
Lines [][]text.TextFragment
// Index is the block's position in reading order (0-based)
Index int
// Level indicates nesting depth (0 = top level)
Level int
}
Block represents a contiguous rectangular region of text on a page. Blocks are spatially coherent groups of fragments separated by whitespace.
func (*Block) AverageFontSize ¶
AverageFontSize returns the average font size of fragments in this block
func (*Block) ContainsPoint ¶
ContainsPoint returns true if the given point is within this block's bounding box
func (*Block) FragmentCount ¶
FragmentCount returns the number of fragments in this block
type BlockConfig ¶
type BlockConfig struct {
// LineHeightTolerance is the Y-distance tolerance for grouping fragments into lines
// as a fraction of fragment height (default: 0.5)
LineHeightTolerance float64
// HorizontalGapThreshold is the minimum horizontal gap to consider fragments separate
// as a fraction of average font size (default: 3.0)
HorizontalGapThreshold float64
// VerticalGapThreshold is the minimum vertical gap to start a new block
// as a fraction of average line height (default: 1.5)
VerticalGapThreshold float64
// MinBlockWidth is the minimum width for a valid block (default: 10 points)
MinBlockWidth float64
// MinBlockHeight is the minimum height for a valid block (default: 5 points)
MinBlockHeight float64
// MergeOverlappingBlocks controls whether overlapping blocks should be merged
MergeOverlappingBlocks bool
}
BlockConfig holds configuration for block detection
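To make the role of VerticalGapThreshold concrete, here is a small sketch that splits already-detected lines into blocks whenever the vertical gap between consecutive lines reaches a multiple of the average line height. The `line` type and `splitIntoBlocks` function are hypothetical, the input is assumed to be sorted top to bottom in PDF coordinates (Y increases upward), and the package's BlockDetector may work differently.

```go
package main

import "fmt"

// line is a hypothetical stand-in for a detected text line.
// In PDF coordinates Top is greater than Bottom.
type line struct {
	Top, Bottom float64
	Text        string
}

// splitIntoBlocks starts a new block whenever the gap between the bottom of
// the previous line and the top of the next one is at least
// gapThreshold * average line height. Lines must be sorted top to bottom.
func splitIntoBlocks(lines []line, gapThreshold float64) [][]line {
	if len(lines) == 0 {
		return nil
	}
	var total float64
	for _, l := range lines {
		total += l.Top - l.Bottom
	}
	avgHeight := total / float64(len(lines))

	blocks := [][]line{{lines[0]}}
	for i := 1; i < len(lines); i++ {
		gap := lines[i-1].Bottom - lines[i].Top
		if gap >= gapThreshold*avgHeight {
			blocks = append(blocks, nil) // open a new, empty block
		}
		last := len(blocks) - 1
		blocks[last] = append(blocks[last], lines[i])
	}
	return blocks
}

func main() {
	lines := []line{
		{Top: 712, Bottom: 700, Text: "First paragraph, line 1"},
		{Top: 698, Bottom: 686, Text: "First paragraph, line 2"},
		{Top: 660, Bottom: 648, Text: "Second paragraph, line 1"},
		{Top: 646, Bottom: 634, Text: "Second paragraph, line 2"},
	}
	for i, b := range splitIntoBlocks(lines, 1.5) {
		fmt.Printf("block %d: %d lines\n", i, len(b))
	}
}
```

With 12-point lines and a threshold of 1.5, only the 26-point gap between the second and third line is large enough to start a new block.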
func DefaultBlockConfig ¶
func DefaultBlockConfig() BlockConfig
DefaultBlockConfig returns sensible default configuration
type BlockDetector ¶
type BlockDetector struct {
// contains filtered or unexported fields
}
BlockDetector detects text blocks on a page
func NewBlockDetector ¶
func NewBlockDetector() *BlockDetector
NewBlockDetector creates a new block detector with default configuration
func NewBlockDetectorWithConfig ¶
func NewBlockDetectorWithConfig(config BlockConfig) *BlockDetector
NewBlockDetectorWithConfig creates a block detector with custom configuration
func (*BlockDetector) Detect ¶
func (d *BlockDetector) Detect(fragments []text.TextFragment, pageWidth, pageHeight float64) *BlockLayout
Detect analyzes text fragments and detects block layout
type BlockLayout ¶
type BlockLayout struct {
// Blocks are the detected text blocks (in reading order)
Blocks []Block
// PageWidth is the width of the page
PageWidth float64
// PageHeight is the height of the page
PageHeight float64
// Config is the configuration used for detection
Config BlockConfig
}
BlockLayout represents the detected block structure of a page
func (*BlockLayout) BlockCount ¶
func (l *BlockLayout) BlockCount() int
BlockCount returns the number of detected blocks
func (*BlockLayout) GetAllFragments ¶
func (l *BlockLayout) GetAllFragments() []text.TextFragment
GetAllFragments returns all fragments in reading order
func (*BlockLayout) GetBlock ¶
func (l *BlockLayout) GetBlock(index int) *Block
GetBlock returns a specific block by index
func (*BlockLayout) GetText ¶
func (l *BlockLayout) GetText() string
GetText returns all text in reading order (block by block)
type BulletStyle ¶
type BulletStyle int
BulletStyle represents the specific bullet character used
const (
BulletStyleUnknown BulletStyle = iota
BulletStyleDisc // • (filled circle)
BulletStyleCircle // ○ (empty circle)
BulletStyleSquare // ■ (filled square)
BulletStyleDash // - (dash)
BulletStyleAsterisk // * (asterisk)
BulletStyleArrow // → or ▶ (arrow)
BulletStyleTriangle // ▪ or ▸ (triangle)
BulletStyleCheckEmpty // ☐ (empty checkbox)
BulletStyleCheckFilled // ☑ or ✓ (checked)
)
func (BulletStyle) String ¶
func (s BulletStyle) String() string
String returns a string representation of the bullet style
type Column ¶
type Column struct {
// Bounding box of the column
BBox model.BBox
// Fragments contained in this column (sorted top to bottom)
Fragments []text.TextFragment
// Index of the column (0-based, left to right)
Index int
}
Column represents a detected text column on a page
type ColumnConfig ¶
type ColumnConfig struct {
// MinColumnWidth is the minimum width for a region to be considered a column
// Default: 50 points
MinColumnWidth float64
// MinGapWidth is the minimum whitespace gap to consider as column separator
// Default: 20 points
MinGapWidth float64
// MinGapHeightRatio is the minimum vertical extent of a gap to be significant
// As a ratio of page height (0.0 to 1.0)
// Default: 0.5 (50% of page height)
MinGapHeightRatio float64
// MaxColumns is the maximum number of columns to detect
// Default: 6
MaxColumns int
// MergeThreshold is the maximum X distance between fragments to consider them same column
// Default: 10 points
MergeThreshold float64
// SpanningThreshold is the minimum line width ratio (vs content width) for content
// to be considered spanning. Lines with content in gaps but less than this width
// are treated as column content (avoids false positives from stray fragments).
// Default: 0.35 (line must span 35% of content width - allows centered titles)
SpanningThreshold float64
}
ColumnConfig holds configuration for column detection
func DefaultColumnConfig ¶
func DefaultColumnConfig() ColumnConfig
DefaultColumnConfig returns sensible default configuration
type ColumnDetector ¶
type ColumnDetector struct {
// contains filtered or unexported fields
}
ColumnDetector detects multi-column layouts in text
func NewColumnDetector ¶
func NewColumnDetector() *ColumnDetector
NewColumnDetector creates a new column detector with default configuration
func NewColumnDetectorWithConfig ¶
func NewColumnDetectorWithConfig(config ColumnConfig) *ColumnDetector
NewColumnDetectorWithConfig creates a column detector with custom configuration
func (*ColumnDetector) Detect ¶
func (d *ColumnDetector) Detect(fragments []text.TextFragment, pageWidth, pageHeight float64) *ColumnLayout
Detect analyzes text fragments and detects column layout
type ColumnLayout ¶
type ColumnLayout struct {
// Detected columns (sorted left to right)
Columns []Column
// SpanningFragments are fragments that span across column gaps
// (e.g., centered titles, full-width headers)
// These are excluded from column content and stored separately
SpanningFragments []text.TextFragment
// Page dimensions
PageWidth float64
PageHeight float64
// Configuration used for detection
Config ColumnConfig
}
ColumnLayout represents the detected column structure of a page
func (*ColumnLayout) ColumnCount ¶
func (l *ColumnLayout) ColumnCount() int
ColumnCount returns the number of detected columns
func (*ColumnLayout) GetColumn ¶
func (l *ColumnLayout) GetColumn(index int) *Column
GetColumn returns a specific column by index
func (*ColumnLayout) GetFragmentsInReadingOrder ¶
func (l *ColumnLayout) GetFragmentsInReadingOrder() []text.TextFragment
GetFragmentsInReadingOrder returns all fragments ordered for reading (left column first, then right column, each column top-to-bottom)
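The ordering described here (columns left to right, each column top to bottom) can be sketched as follows. The `fragment` and `column` types and the `readingOrder` function are hypothetical simplifications of the package's Column and text.TextFragment types; the sketch assumes PDF coordinates in which Y increases upward and does not handle spanning fragments.

```go
package main

import (
	"fmt"
	"sort"
)

// fragment and column are hypothetical, simplified stand-ins.
type fragment struct {
	X, Y float64
	Text string
}

type column struct {
	Left      float64
	Fragments []fragment
}

// readingOrder visits columns from left to right and, inside each column,
// fragments from top to bottom (descending Y).
func readingOrder(cols []column) []fragment {
	ordered := append([]column(nil), cols...)
	sort.SliceStable(ordered, func(i, j int) bool { return ordered[i].Left < ordered[j].Left })

	var out []fragment
	for _, c := range ordered {
		frags := append([]fragment(nil), c.Fragments...)
		sort.SliceStable(frags, func(i, j int) bool { return frags[i].Y > frags[j].Y })
		out = append(out, frags...)
	}
	return out
}

func main() {
	cols := []column{
		{Left: 320, Fragments: []fragment{{X: 320, Y: 650, Text: "R2"}, {X: 320, Y: 700, Text: "R1"}}},
		{Left: 72, Fragments: []fragment{{X: 72, Y: 700, Text: "L1"}, {X: 72, Y: 650, Text: "L2"}}},
	}
	for _, f := range readingOrder(cols) {
		fmt.Print(f.Text, " ")
	}
	fmt.Println()
}
```

The inputs are copied before sorting so the caller's slices keep their original order.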
func (*ColumnLayout) GetText ¶
func (l *ColumnLayout) GetText() string
GetText returns the text content in reading order (column by column, top to bottom)
func (*ColumnLayout) IsMultiColumn ¶
func (l *ColumnLayout) IsMultiColumn() bool
IsMultiColumn returns true if multiple columns were detected
func (*ColumnLayout) IsSingleColumn ¶
func (l *ColumnLayout) IsSingleColumn() bool
IsSingleColumn returns true if only one column was detected
type Gap ¶
type Gap struct {
Left float64 // Left edge of gap
Right float64 // Right edge of gap
Top float64 // Top of gap region
Bottom float64 // Bottom of gap region
}
Gap represents a vertical whitespace gap
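One common way to find such gaps is to project fragment extents onto the X axis and look for uncovered intervals of a minimum width. The sketch below shows only that idea: the `span` type and `findGaps` function are hypothetical, the Top/Bottom extent of a gap and the MinGapHeightRatio check are left out, and the package's ColumnDetector may use a different method.

```go
package main

import (
	"fmt"
	"sort"
)

// span is a hypothetical horizontal interval [Left, Right].
type span struct {
	Left, Right float64
}

// findGaps returns the uncovered horizontal intervals between the given
// fragment spans that are at least minGapWidth wide.
func findGaps(spans []span, minGapWidth float64) []span {
	if len(spans) == 0 {
		return nil
	}
	sorted := append([]span(nil), spans...)
	sort.Slice(sorted, func(i, j int) bool { return sorted[i].Left < sorted[j].Left })

	var gaps []span
	covered := sorted[0].Right // rightmost X covered so far
	for _, s := range sorted[1:] {
		if s.Left-covered >= minGapWidth {
			gaps = append(gaps, span{Left: covered, Right: s.Left})
		}
		if s.Right > covered {
			covered = s.Right
		}
	}
	return gaps
}

func main() {
	spans := []span{
		{Left: 72, Right: 280},
		{Left: 330, Right: 500},
		{Left: 90, Right: 290},
		{Left: 320, Right: 540},
	}
	for _, g := range findGaps(spans, 20) {
		fmt.Printf("gap from %.0f to %.0f\n", g.Left, g.Right)
	}
}
```

With the 20-point minimum from ColumnConfig, the example reports a single gap between X=290 and X=320.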
type HeaderFooterConfig ¶
type HeaderFooterConfig struct {
// Default: 72 points (1 inch)
HeaderRegionHeight float64
// Default: 72 points (1 inch)
FooterRegionHeight float64
// MinOccurrenceRatio is the minimum ratio of pages on which content must appear
// to be considered a header/footer (0.0 to 1.0)
// Default: 0.5 (50% of pages)
MinOccurrenceRatio float64
// Default: 5 points
PositionTolerance float64
// Default: 10 points
XPositionTolerance float64
// Default: 2
MinPages int
}
HeaderFooterConfig holds configuration for header/footer detection
func DefaultHeaderFooterConfig ¶
func DefaultHeaderFooterConfig() HeaderFooterConfig
DefaultHeaderFooterConfig returns sensible default configuration
type HeaderFooterDetector ¶
type HeaderFooterDetector struct {
// contains filtered or unexported fields
}
HeaderFooterDetector detects headers and footers across pages
func NewHeaderFooterDetector ¶
func NewHeaderFooterDetector() *HeaderFooterDetector
NewHeaderFooterDetector creates a new detector with default configuration
func NewHeaderFooterDetectorWithConfig ¶
func NewHeaderFooterDetectorWithConfig(config HeaderFooterConfig) *HeaderFooterDetector
NewHeaderFooterDetectorWithConfig creates a detector with custom configuration
func (*HeaderFooterDetector) Detect ¶
func (d *HeaderFooterDetector) Detect(pages []PageFragments) *HeaderFooterResult
Detect analyzes fragments from multiple pages to find headers and footers
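The idea of finding content repeated across pages can be illustrated with a small sketch. It counts, per text, the pages on which that text appears in the top region and keeps the texts whose page ratio reaches a minimum, mirroring the HeaderRegionHeight and MinOccurrenceRatio settings. The `fragment` and `page` types and the `repeatedHeaders` function are hypothetical; the sketch matches text exactly, ignores position tolerances and page numbers, and assumes PDF coordinates (Y increases upward).

```go
package main

import (
	"fmt"
	"sort"
)

// fragment and page are hypothetical, simplified stand-ins.
type fragment struct {
	Y    float64
	Text string
}

type page struct {
	Height    float64
	Fragments []fragment
}

// repeatedHeaders returns the texts found in the top regionHeight points of
// at least minRatio of the pages, sorted alphabetically.
func repeatedHeaders(pages []page, regionHeight, minRatio float64) []string {
	if len(pages) == 0 {
		return nil
	}
	counts := map[string]int{}
	for _, p := range pages {
		seen := map[string]bool{} // count each text once per page
		for _, f := range p.Fragments {
			if f.Y >= p.Height-regionHeight && !seen[f.Text] {
				seen[f.Text] = true
				counts[f.Text]++
			}
		}
	}
	var out []string
	for text, n := range counts {
		if float64(n)/float64(len(pages)) >= minRatio {
			out = append(out, text)
		}
	}
	sort.Strings(out)
	return out
}

func main() {
	pages := []page{
		{Height: 792, Fragments: []fragment{{Y: 770, Text: "ACME Report"}, {Y: 400, Text: "Body A"}}},
		{Height: 792, Fragments: []fragment{{Y: 770, Text: "ACME Report"}, {Y: 760, Text: "Draft"}}},
		{Height: 792, Fragments: []fragment{{Y: 770, Text: "ACME Report"}, {Y: 400, Text: "Body C"}}},
	}
	fmt.Println(repeatedHeaders(pages, 72, 0.5))
}
```

Here "ACME Report" appears on all three pages and is kept, while "Draft" appears on one page in three and falls below the 0.5 ratio.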
type HeaderFooterRegion ¶
type HeaderFooterRegion struct {
Type RegionType
BBox model.BBox
Text string
IsPageNumber bool
Confidence float64
PageIndices []int
}
HeaderFooterRegion represents a detected header or footer region
type HeaderFooterResult ¶
type HeaderFooterResult struct {
Headers []HeaderFooterRegion
Footers []HeaderFooterRegion
Config HeaderFooterConfig
}
HeaderFooterResult contains the detection results
func (*HeaderFooterResult) FilterFragments ¶
func (r *HeaderFooterResult) FilterFragments(pageIndex int, fragments []text.TextFragment, pageHeight float64) []text.TextFragment
FilterFragments removes header/footer fragments from a page
func (*HeaderFooterResult) GetFooterTexts ¶
func (r *HeaderFooterResult) GetFooterTexts() []string
GetFooterTexts returns all detected footer texts
func (*HeaderFooterResult) GetHeaderTexts ¶
func (r *HeaderFooterResult) GetHeaderTexts() []string
GetHeaderTexts returns all detected header texts
func (*HeaderFooterResult) HasFooters ¶
func (r *HeaderFooterResult) HasFooters() bool
HasFooters returns true if any footers were detected
func (*HeaderFooterResult) HasHeaders ¶
func (r *HeaderFooterResult) HasHeaders() bool
HasHeaders returns true if any headers were detected
func (*HeaderFooterResult) HasHeadersOrFooters ¶
func (r *HeaderFooterResult) HasHeadersOrFooters() bool
HasHeadersOrFooters returns true if any headers or footers were detected
func (*HeaderFooterResult) Summary ¶
func (r *HeaderFooterResult) Summary() string
Summary returns a human-readable summary of detection results
type Heading ¶
type Heading struct {
// Level is the heading level (H1-H6)
Level HeadingLevel
// Text is the heading text content
Text string
// BBox is the bounding box of the heading
BBox model.BBox
// Lines are the lines that make up this heading
Lines []Line
// Fragments are the text fragments in this heading
Fragments []text.TextFragment
// Index is the heading's position in document order (0-based)
Index int
// PageIndex is the page number where this heading appears (0-based)
PageIndex int
// FontSize is the average font size of the heading
FontSize float64
// IsBold indicates if the heading appears to be bold
IsBold bool
// IsItalic indicates if the heading appears to be italic
IsItalic bool
// IsAllCaps indicates if the heading is in all capital letters
IsAllCaps bool
// IsNumbered indicates if the heading has a number prefix (e.g., "1.2.3")
IsNumbered bool
// NumberPrefix is the number prefix if present (e.g., "1.2.3")
NumberPrefix string
// Confidence is a score from 0-1 indicating detection confidence
Confidence float64
// Alignment is the text alignment of the heading
Alignment LineAlignment
// SpacingBefore is the vertical space before this heading
SpacingBefore float64
// SpacingAfter is the vertical space after this heading
SpacingAfter float64
}
Heading represents a detected heading in a document
func (*Heading) ContainsPoint ¶
ContainsPoint returns true if the point is within the heading's bounding box
func (*Heading) GetAnchorID ¶
GetAnchorID returns a URL-safe anchor ID for this heading
func (*Heading) GetCleanText ¶
GetCleanText returns the heading text without number prefix
func (*Heading) IsTopLevel ¶
IsTopLevel returns true if this is an H1 heading
func (*Heading) ToMarkdown ¶
ToMarkdown returns the heading as a markdown heading
type HeadingConfig ¶
type HeadingConfig struct {
// FontSizeRatios maps heading levels to minimum font size ratios relative to body text
// Default: H1=1.8, H2=1.5, H3=1.3, H4=1.15, H5=1.1, H6=1.05
FontSizeRatios map[HeadingLevel]float64
// MaxHeadingLines is the maximum number of lines for a heading
// Default: 3
MaxHeadingLines int
// MinSpacingRatio is the minimum spacing ratio (vs avg) to consider a heading
// Default: 1.2 (20% more spacing before heading)
MinSpacingRatio float64
// BoldIndicatesHeading when true, bold text is more likely a heading
// Default: true
BoldIndicatesHeading bool
// AllCapsIndicatesHeading when true, ALL CAPS text is more likely a heading
// Default: true
AllCapsIndicatesHeading bool
// CenterAlignedBoost is the confidence boost for centered headings
// Default: 0.1
CenterAlignedBoost float64
// NumberedPatterns are regex patterns for numbered headings
// Default: "1.", "1.1", "1.1.1", "Chapter 1", etc.
NumberedPatterns []*regexp.Regexp
// MinConfidence is the minimum confidence to consider something a heading
// Default: 0.5
MinConfidence float64
}
HeadingConfig holds configuration for heading detection
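As a sketch of how the FontSizeRatios thresholds could be applied, the example below maps a font size to a heading level using the default ratios listed above. The `headingLevel` type, `defaultRatios` table, and `levelForFontSize` function are hypothetical; the package's HeadingDetector also weighs bold text, capitalization, spacing, alignment, and numbering, none of which is modeled here.

```go
package main

import "fmt"

// headingLevel is a hypothetical stand-in for HeadingLevel (0 = not a heading).
type headingLevel int

// defaultRatios mirrors the documented defaults, ordered from H1 to H6.
var defaultRatios = []struct {
	level    headingLevel
	minRatio float64
}{
	{1, 1.8}, {2, 1.5}, {3, 1.3}, {4, 1.15}, {5, 1.1}, {6, 1.05},
}

// levelForFontSize returns the highest heading level whose minimum ratio is
// met by fontSize relative to bodyFontSize, or 0 if none is met.
func levelForFontSize(fontSize, bodyFontSize float64) headingLevel {
	if bodyFontSize <= 0 {
		return 0
	}
	ratio := fontSize / bodyFontSize
	for _, r := range defaultRatios {
		if ratio >= r.minRatio {
			return r.level
		}
	}
	return 0
}

func main() {
	for _, size := range []float64{24, 16, 14, 12} {
		fmt.Printf("%.0fpt on 12pt body -> level %d\n", size, levelForFontSize(size, 12))
	}
}
```

Because the table is ordered from the largest ratio down, the first threshold that is met gives the most prominent level that applies.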
func DefaultHeadingConfig ¶
func DefaultHeadingConfig() HeadingConfig
DefaultHeadingConfig returns sensible default configuration
type HeadingDetector ¶
type HeadingDetector struct {
// contains filtered or unexported fields
}
HeadingDetector detects and classifies headings in document content
func NewHeadingDetector ¶
func NewHeadingDetector() *HeadingDetector
NewHeadingDetector creates a new heading detector with default configuration
func NewHeadingDetectorWithConfig ¶
func NewHeadingDetectorWithConfig(config HeadingConfig) *HeadingDetector
NewHeadingDetectorWithConfig creates a heading detector with custom configuration
func (*HeadingDetector) DetectFromFragments ¶
func (d *HeadingDetector) DetectFromFragments(fragments []text.TextFragment, pageWidth, pageHeight float64) *HeadingLayout
DetectFromFragments analyzes fragments directly and detects headings
func (*HeadingDetector) DetectFromLines ¶
func (d *HeadingDetector) DetectFromLines(lines []Line, pageWidth, pageHeight float64) *HeadingLayout
DetectFromLines analyzes lines directly and detects headings
func (*HeadingDetector) DetectFromParagraphs ¶
func (d *HeadingDetector) DetectFromParagraphs(paragraphs []Paragraph, pageWidth, pageHeight float64) *HeadingLayout
DetectFromParagraphs analyzes paragraphs and detects headings
type HeadingLayout ¶
type HeadingLayout struct {
// Headings are all detected headings in document order
Headings []Heading
// PageWidth and PageHeight of the analyzed page/document
PageWidth float64
PageHeight float64
// BodyFontSize is the detected average body text font size
BodyFontSize float64
// Config is the configuration used for detection
Config HeadingConfig
}
HeadingLayout represents all detected headings in a document or page
func (*HeadingLayout) FindHeadingBefore ¶
func (l *HeadingLayout) FindHeadingBefore(y float64) *Heading
FindHeadingBefore returns the most recent heading that appears before the given Y position in reading order. In standard PDF coordinates (Y increases upward), this returns the last heading whose Y coordinate is greater than the given Y. For example, if querying Y=450 with headings at Y=700, 500, 300, it returns the heading at Y=500 (the closest heading above Y=450).
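The lookup described above can be sketched in a few lines. The `heading` type and `findHeadingBefore` function are hypothetical simplifications: the sketch assumes the headings are stored in document order (descending Y on a single page) and returns nil when no heading lies above the query position.

```go
package main

import "fmt"

// heading is a hypothetical, simplified stand-in for Heading.
type heading struct {
	Text string
	Y    float64
}

// findHeadingBefore returns the last heading in document order whose Y
// coordinate is greater than y, or nil if there is none.
func findHeadingBefore(headings []heading, y float64) *heading {
	var found *heading
	for i := range headings {
		if headings[i].Y > y {
			found = &headings[i]
		}
	}
	return found
}

func main() {
	headings := []heading{
		{Text: "Introduction", Y: 700},
		{Text: "Background", Y: 500},
		{Text: "Method", Y: 300},
	}
	if h := findHeadingBefore(headings, 450); h != nil {
		fmt.Println(h.Text, h.Y)
	}
}
```

For the query Y=450 this selects the heading at Y=500, matching the example in the description.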
func (*HeadingLayout) FindHeadingsInRegion ¶
func (l *HeadingLayout) FindHeadingsInRegion(bbox model.BBox) []Heading
FindHeadingsInRegion returns headings within a bounding box
func (*HeadingLayout) GetH1 ¶
func (l *HeadingLayout) GetH1() []Heading
GetH1 returns all H1 (top-level) headings
func (*HeadingLayout) GetH2 ¶
func (l *HeadingLayout) GetH2() []Heading
GetH2 returns all H2 headings
func (*HeadingLayout) GetH3 ¶
func (l *HeadingLayout) GetH3() []Heading
GetH3 returns all H3 headings
func (*HeadingLayout) GetHeading ¶
func (l *HeadingLayout) GetHeading(index int) *Heading
GetHeading returns a specific heading by index
func (*HeadingLayout) GetHeadingsAtLevel ¶
func (l *HeadingLayout) GetHeadingsAtLevel(level HeadingLevel) []Heading
GetHeadingsAtLevel returns all headings at a specific level
func (*HeadingLayout) GetHeadingsInRange ¶
func (l *HeadingLayout) GetHeadingsInRange(minLevel, maxLevel HeadingLevel) []Heading
GetHeadingsInRange returns headings within a specific level range (inclusive)
func (*HeadingLayout) GetMarkdownTOC ¶
func (l *HeadingLayout) GetMarkdownTOC() string
GetMarkdownTOC returns a markdown-formatted table of contents
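The exact output format of GetMarkdownTOC is not documented here. As one plausible shape, the sketch below renders a nested Markdown list with links to anchor IDs. The `heading` type, the `anchorID` slug rule, and the two-space indentation per level are assumptions for illustration, not the package's behavior.

```go
package main

import (
	"fmt"
	"strings"
)

// heading is a hypothetical, simplified stand-in for Heading.
type heading struct {
	Level int
	Text  string
}

// anchorID lowercases the text, keeps ASCII letters and digits, and
// collapses every other run of characters into a single dash.
func anchorID(text string) string {
	var b strings.Builder
	lastDash := true // suppresses a leading dash
	for _, r := range strings.ToLower(text) {
		switch {
		case r >= 'a' && r <= 'z', r >= '0' && r <= '9':
			b.WriteRune(r)
			lastDash = false
		case !lastDash:
			b.WriteByte('-')
			lastDash = true
		}
	}
	return strings.TrimSuffix(b.String(), "-")
}

// markdownTOC emits one list item per heading, indented by level.
func markdownTOC(headings []heading) string {
	var b strings.Builder
	for _, h := range headings {
		if h.Level < 1 {
			continue // skip entries without a known level
		}
		indent := strings.Repeat("  ", h.Level-1)
		fmt.Fprintf(&b, "%s- [%s](#%s)\n", indent, h.Text, anchorID(h.Text))
	}
	return b.String()
}

func main() {
	fmt.Print(markdownTOC([]heading{
		{Level: 1, Text: "Intro"},
		{Level: 2, Text: "Getting Started"},
		{Level: 2, Text: "1.2 Setup!"},
	}))
}
```

The slug rule drops non-ASCII letters, which is a limitation of this sketch; a production version would need a policy for them.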
func (*HeadingLayout) GetOutline ¶
func (l *HeadingLayout) GetOutline() []OutlineEntry
GetOutline returns a hierarchical outline of the document
func (*HeadingLayout) GetTableOfContents ¶
func (l *HeadingLayout) GetTableOfContents() string
GetTableOfContents returns a formatted table of contents string
func (*HeadingLayout) HeadingCount ¶
func (l *HeadingLayout) HeadingCount() int
HeadingCount returns the number of detected headings
type HeadingLevel ¶
type HeadingLevel int
HeadingLevel represents the hierarchical level of a heading (H1-H6)
const (
HeadingLevelUnknown HeadingLevel = iota
HeadingLevel1 // H1 - Main title/chapter
HeadingLevel2 // H2 - Major section
HeadingLevel3 // H3 - Subsection
HeadingLevel4 // H4 - Sub-subsection
HeadingLevel5 // H5 - Minor heading
HeadingLevel6 // H6 - Lowest level heading
)
func (HeadingLevel) HTMLTag ¶
func (l HeadingLevel) HTMLTag() string
HTMLTag returns the HTML tag for this heading level
func (HeadingLevel) String ¶
func (l HeadingLevel) String() string
String returns a string representation of the heading level
type LayoutElement ¶
type LayoutElement struct {
// Type is the element type (paragraph, heading, list, etc.)
Type model.ElementType
// BBox is the bounding box of the element
BBox model.BBox
// Text is the text content of the element
Text string
// Index is the element's position in reading order
Index int
// ZOrder is the visual stacking order (for overlapping elements)
ZOrder int
// Heading contains heading-specific data (if Type == ElementTypeHeading)
Heading *Heading
// List contains list-specific data (if Type == ElementTypeList)
List *List
// Paragraph contains paragraph-specific data (if Type == ElementTypeParagraph)
Paragraph *Paragraph
// Lines are the lines that make up this element
Lines []Line
// Children contains nested elements (for compound structures)
Children []LayoutElement
}
LayoutElement represents a detected layout element such as a paragraph, heading, or list. It includes the element type, bounding box, text content, and type-specific metadata.
func (*LayoutElement) ToModelElement ¶
func (le *LayoutElement) ToModelElement() model.Element
ToModelElement converts the layout element to the appropriate model.Element implementation (Heading, List, or Paragraph) based on the element type.
type Line ¶
type Line struct {
// BBox is the bounding box of the line
BBox model.BBox
// Fragments are the text fragments that make up this line (sorted left to right)
Fragments []text.TextFragment
// Text is the assembled text content of the line
Text string
// Index is the line's position on the page (0-based, top to bottom)
Index int
// Baseline is the Y coordinate of the text baseline
Baseline float64
// Height is the line height (max fragment height)
Height float64
// SpacingBefore is the vertical space from the previous line (0 for first line)
SpacingBefore float64
// SpacingAfter is the vertical space to the next line (0 for last line)
SpacingAfter float64
// Alignment is the detected horizontal alignment
Alignment LineAlignment
// Indentation is the left indentation relative to the page/column margin
Indentation float64
// AverageFontSize is the average font size of fragments in this line
AverageFontSize float64
// Direction is the dominant text direction (LTR/RTL)
Direction text.Direction
}
Line represents a single line of text on a page
func ReorderLinesForReading ¶
ReorderLinesForReading takes lines and returns them in proper reading order
func (*Line) ContainsPoint ¶
ContainsPoint returns true if the point is within the line's bounding box
func (*Line) HasLargerFont ¶
HasLargerFont returns true if this line's font is larger than the given size
func (*Line) IsIndented ¶
IsIndented returns true if the line is indented relative to the margin
type LineAlignment ¶
type LineAlignment int
LineAlignment represents the horizontal alignment of a line
const (
AlignUnknown LineAlignment = iota
AlignLeft
AlignCenter
AlignRight
AlignJustified
)
func (LineAlignment) String ¶
func (a LineAlignment) String() string
String returns a string representation of the alignment
type LineConfig ¶
type LineConfig struct {
// LineHeightTolerance is the Y-distance tolerance for grouping fragments into lines
// as a fraction of fragment height (default: 0.5)
LineHeightTolerance float64
// MinLineWidth is the minimum width for a valid line (default: 5 points)
MinLineWidth float64
// AlignmentTolerance is the tolerance for alignment detection (default: 10 points)
AlignmentTolerance float64
// JustificationThreshold is the minimum line width ratio to consider justified
// (default: 0.9 = line must be 90% of max width)
JustificationThreshold float64
}
LineConfig holds configuration for line detection
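To make the roles of AlignmentTolerance and JustificationThreshold concrete, here is a minimal sketch of one way a single line could be classified against the margins of its column. The classify function, the lineConfig struct, and the order of the checks are all assumptions for illustration; the package's actual algorithm is not shown in this documentation and may well look at neighboring lines too.

```go
package main

import (
	"fmt"
	"math"
)

// LineAlignment mirrors the documented enum; redeclared so the sketch compiles alone.
type LineAlignment int

const (
	AlignUnknown LineAlignment = iota
	AlignLeft
	AlignCenter
	AlignRight
	AlignJustified
)

// lineConfig is a hypothetical subset of LineConfig.
type lineConfig struct {
	AlignmentTolerance     float64 // points
	JustificationThreshold float64 // fraction of the maximum width
}

// classify is a hypothetical helper. left/right are the line's horizontal
// extent; marginLeft/marginRight bound the widest possible line.
func classify(left, right, marginLeft, marginRight float64, cfg lineConfig) LineAlignment {
	maxWidth := marginRight - marginLeft
	if maxWidth <= 0 {
		return AlignUnknown
	}
	// A line filling at least JustificationThreshold of the width counts as justified.
	if (right-left)/maxWidth >= cfg.JustificationThreshold {
		return AlignJustified
	}
	tol := cfg.AlignmentTolerance
	switch {
	case math.Abs(left-marginLeft) <= tol:
		return AlignLeft
	case math.Abs(marginRight-right) <= tol:
		return AlignRight
	case math.Abs((left+right)/2-(marginLeft+marginRight)/2) <= tol:
		return AlignCenter
	}
	return AlignUnknown
}

func main() {
	cfg := lineConfig{AlignmentTolerance: 10, JustificationThreshold: 0.9}
	fmt.Println(classify(50, 540, 50, 550, cfg) == AlignJustified)
	fmt.Println(classify(50, 300, 50, 550, cfg) == AlignLeft)
	fmt.Println(classify(200, 400, 50, 550, cfg) == AlignCenter)
}
```

Judging a single line in isolation is a simplification: a full-width line in a left-aligned paragraph would be labeled justified here, so a real detector would likely need more context.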
func DefaultLineConfig ¶
func DefaultLineConfig() LineConfig
DefaultLineConfig returns a sensible default configuration
type LineDetector ¶
type LineDetector struct {
// contains filtered or unexported fields
}
LineDetector detects text lines on a page
func NewLineDetector ¶
func NewLineDetector() *LineDetector
NewLineDetector creates a new line detector with default configuration
func NewLineDetectorWithConfig ¶
func NewLineDetectorWithConfig(config LineConfig) *LineDetector
NewLineDetectorWithConfig creates a line detector with custom configuration
func (*LineDetector) Detect ¶
func (d *LineDetector) Detect(fragments []text.TextFragment, pageWidth, pageHeight float64) *LineLayout
Detect analyzes text fragments and detects lines
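The following is a rough sketch of how fragments might be grouped into lines using a tolerance expressed as a fraction of fragment height, in the spirit of LineHeightTolerance. The fragment type and groupIntoLines function are hypothetical stand-ins (the real text.TextFragment has its own fields), and the sketch assumes the standard PDF convention where Y increases upward.

```go
package main

import (
	"fmt"
	"math"
	"sort"
)

// fragment is a simplified, hypothetical stand-in for text.TextFragment.
type fragment struct {
	X, Y, Height float64
	Text         string
}

// groupIntoLines groups fragments whose Y positions differ by at most
// tolerance*Height from the first fragment of the current line.
// Assumption: Y increases upward, so larger Y means higher on the page.
func groupIntoLines(frags []fragment, tolerance float64) [][]fragment {
	sorted := append([]fragment(nil), frags...)
	sort.SliceStable(sorted, func(i, j int) bool { return sorted[i].Y > sorted[j].Y })

	var lines [][]fragment
	for _, f := range sorted {
		if n := len(lines); n > 0 {
			ref := lines[n-1][0]
			if math.Abs(f.Y-ref.Y) <= tolerance*f.Height {
				lines[n-1] = append(lines[n-1], f)
				continue
			}
		}
		lines = append(lines, []fragment{f})
	}
	// Within each line, order fragments left to right.
	for _, line := range lines {
		sort.SliceStable(line, func(i, j int) bool { return line[i].X < line[j].X })
	}
	return lines
}

func main() {
	frags := []fragment{
		{X: 90, Y: 702, Height: 12, Text: "world"},
		{X: 50, Y: 700, Height: 12, Text: "Hello"},
		{X: 50, Y: 680, Height: 12, Text: "Next"},
	}
	for i, line := range groupIntoLines(frags, 0.5) {
		for _, f := range line {
			fmt.Printf("line %d: %s\n", i, f.Text)
		}
	}
}
```

With a tolerance of 0.5 and 12-point fragments, anything within 6 points vertically lands on the same line, which absorbs small baseline jitter between fragments.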
type LineLayout ¶
type LineLayout struct {
// Lines are the detected text lines (sorted top to bottom)
Lines []Line
// PageWidth is the width of the page/region
PageWidth float64
// PageHeight is the height of the page/region
PageHeight float64
// AverageLineSpacing is the average spacing between lines
AverageLineSpacing float64
// AverageLineHeight is the average line height
AverageLineHeight float64
// Config is the configuration used for detection
Config LineConfig
}
LineLayout represents the detected line structure of a page or region
func (*LineLayout) FindLinesInRegion ¶
func (l *LineLayout) FindLinesInRegion(bbox model.BBox) []Line
FindLinesInRegion returns lines that fall within a bounding box
func (*LineLayout) GetAllFragments ¶
func (l *LineLayout) GetAllFragments() []text.TextFragment
GetAllFragments returns all fragments in reading order
func (*LineLayout) GetLine ¶
func (l *LineLayout) GetLine(index int) *Line
GetLine returns a specific line by index
func (*LineLayout) GetLinesByAlignment ¶
func (l *LineLayout) GetLinesByAlignment(alignment LineAlignment) []Line
GetLinesByAlignment returns lines with a specific alignment
func (*LineLayout) GetText ¶
func (l *LineLayout) GetText() string
GetText returns all text in line order
func (*LineLayout) IsParagraphBreak ¶
func (l *LineLayout) IsParagraphBreak(lineIndex int) bool
IsParagraphBreak returns true if there's a paragraph break after the given line index
func (*LineLayout) LineCount ¶
func (l *LineLayout) LineCount() int
LineCount returns the number of detected lines
type List ¶
type List struct {
// Items are the list items (top level only; nested items are in Children)
Items []ListItem
// BBox is the bounding box of the entire list
BBox model.BBox
// Type is the primary list type
Type ListType
// BulletStyle is the bullet style (for bullet lists)
BulletStyle BulletStyle
// Index is the list's position in document order
Index int
// Level is the nesting level of this list
Level int
// IsMixed indicates if the list contains mixed types (bullet + numbered)
IsMixed bool
// ItemCount is the total number of items (including nested)
ItemCount int
}
List represents a complete list structure
func (*List) GetAllItems ¶
GetAllItems returns all items including nested (flattened)
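A flattened view of nested items can be produced with a depth-first walk over Children. The sketch below uses a simplified, hypothetical item type rather than the real ListItem, and the pre-order traversal (parent before its children) is an assumption about ordering, not something the documentation states.

```go
package main

import "fmt"

// item is a simplified, hypothetical stand-in for ListItem.
type item struct {
	Text     string
	Children []item
}

// flatten returns every item in depth-first pre-order: each parent is
// followed immediately by its descendants.
func flatten(items []item) []item {
	var out []item
	for _, it := range items {
		out = append(out, it)
		out = append(out, flatten(it.Children)...)
	}
	return out
}

func main() {
	list := []item{
		{Text: "A", Children: []item{
			{Text: "A1"},
			{Text: "A2", Children: []item{{Text: "A2a"}}},
		}},
		{Text: "B"},
	}
	for _, it := range flatten(list) {
		fmt.Println(it.Text)
	}
}
```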
func (*List) HasNesting ¶
HasNesting returns true if the list contains nested items
func (*List) ToMarkdown ¶
ToMarkdown returns the list as markdown
type ListConfig ¶
type ListConfig struct {
// BulletCharacters are characters recognized as bullets
BulletCharacters []rune
// IndentThreshold is the minimum indentation increase to consider nested
// Default: 15 points
IndentThreshold float64
// MaxListGap is the maximum vertical gap between items to consider same list
// as a ratio of line height (default: 2.0)
MaxListGap float64
// NumberedPatterns are regex patterns for numbered list items
NumberedPatterns []*regexp.Regexp
// LetterPatterns are regex patterns for lettered list items
LetterPatterns []*regexp.Regexp
// RomanPatterns are regex patterns for roman numeral list items
RomanPatterns []*regexp.Regexp
// MinConsecutiveItems is minimum items to consider a list (default: 2)
MinConsecutiveItems int
}
ListConfig holds configuration for list detection
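For a sense of what the pattern fields might hold, here is a small sketch that splits a raw line into a prefix and item text. The regular expressions, the bullet set, and the splitPrefix helper are hypothetical examples; the defaults returned by DefaultListConfig are not listed in this documentation and may differ.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// Hypothetical patterns in the spirit of NumberedPatterns, LetterPatterns,
// and BulletCharacters. They are not the package's real defaults.
var (
	numberedPattern = regexp.MustCompile(`^(\d+[.)])\s+`)
	letterPattern   = regexp.MustCompile(`^([A-Za-z][.)])\s+`)
	bulletChars     = []rune{'•', '-', '*'}
)

// splitPrefix reports the kind of list prefix found, the prefix itself,
// and the remaining item text.
func splitPrefix(raw string) (kind, prefix, text string) {
	raw = strings.TrimSpace(raw)
	if m := numberedPattern.FindStringSubmatch(raw); m != nil {
		return "numbered", m[1], raw[len(m[0]):]
	}
	if m := letterPattern.FindStringSubmatch(raw); m != nil {
		return "lettered", m[1], raw[len(m[0]):]
	}
	runes := []rune(raw)
	if len(runes) >= 2 && runes[1] == ' ' {
		for _, b := range bulletChars {
			if runes[0] == b {
				return "bullet", string(b), strings.TrimSpace(string(runes[2:]))
			}
		}
	}
	return "none", "", raw
}

func main() {
	for _, s := range []string{"1. First item", "a) alpha", "• bullet text", "Plain sentence."} {
		kind, prefix, text := splitPrefix(s)
		fmt.Printf("%q -> kind=%s prefix=%q text=%q\n", s, kind, prefix, text)
	}
}
```

The three return values correspond loosely to ListItem.ListType, ListItem.Prefix, and ListItem.Text. Patterns this simple will misfire on text such as an initial followed by a period, which is presumably one reason MinConsecutiveItems exists.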
func DefaultListConfig ¶
func DefaultListConfig() ListConfig
DefaultListConfig returns a sensible default configuration
type ListDetector ¶
type ListDetector struct {
// contains filtered or unexported fields
}
ListDetector detects and structures lists in document content
func NewListDetector ¶
func NewListDetector() *ListDetector
NewListDetector creates a new list detector with default configuration
func NewListDetectorWithConfig ¶
func NewListDetectorWithConfig(config ListConfig) *ListDetector
NewListDetectorWithConfig creates a list detector with custom configuration
func (*ListDetector) DetectFromFragments ¶
func (d *ListDetector) DetectFromFragments(fragments []text.TextFragment, pageWidth, pageHeight float64) *ListLayout
DetectFromFragments analyzes fragments directly and detects lists
func (*ListDetector) DetectFromLines ¶
func (d *ListDetector) DetectFromLines(lines []Line, pageWidth, pageHeight float64) *ListLayout
DetectFromLines analyzes lines directly and detects lists
func (*ListDetector) DetectFromParagraphs ¶
func (d *ListDetector) DetectFromParagraphs(paragraphs []Paragraph, pageWidth, pageHeight float64) *ListLayout
DetectFromParagraphs analyzes paragraphs and detects lists
type ListItem ¶
type ListItem struct {
// Text is the item text (without the bullet/number prefix)
Text string
// RawText is the original text including prefix
RawText string
// Prefix is the bullet or number prefix (e.g., "•", "1.", "a)")
Prefix string
// BBox is the bounding box of this item
BBox model.BBox
// Lines are the lines that make up this item (may span multiple lines)
Lines []Line
// Index is the item's position within its parent list (0-based)
Index int
// Level is the nesting level (0 = top level, 1 = first nested, etc.)
Level int
// ListType is the type of list this item belongs to
ListType ListType
// BulletStyle is the bullet style (for bullet lists)
BulletStyle BulletStyle
// Number is the numeric value for numbered/lettered lists
Number int
// Children contains nested list items
Children []ListItem
}
ListItem represents a single item in a list
func (*ListItem) ChildCount ¶
ChildCount returns the number of direct children
func (*ListItem) ContainsPoint ¶
ContainsPoint returns true if the point is within the item's bounding box
func (*ListItem) GetFullText ¶
GetFullText returns the raw text including prefix
func (*ListItem) HasChildren ¶
HasChildren returns true if this item has nested items
func (*ListItem) IsCheckbox ¶
IsCheckbox returns true if this is a checkbox item
func (*ListItem) IsFirstInList ¶
IsFirstInList returns true if this item has number/index 1 or 0
type ListLayout ¶
type ListLayout struct {
// Lists are all detected lists in document order
Lists []List
// AllItems are all list items in reading order (flattened)
AllItems []ListItem
// PageWidth and PageHeight
PageWidth float64
PageHeight float64
// Config is the configuration used for detection
Config ListConfig
}
ListLayout represents all detected lists on a page
func (*ListLayout) FindListsInRegion ¶
func (l *ListLayout) FindListsInRegion(bbox model.BBox) []List
FindListsInRegion returns lists within a bounding box
func (*ListLayout) GetBulletLists ¶
func (l *ListLayout) GetBulletLists() []List
GetBulletLists returns all bullet lists
func (*ListLayout) GetList ¶
func (l *ListLayout) GetList(index int) *List
GetList returns a specific list by index
func (*ListLayout) GetListsByType ¶
func (l *ListLayout) GetListsByType(listType ListType) []List
GetListsByType returns lists of a specific type
func (*ListLayout) GetNumberedLists ¶
func (l *ListLayout) GetNumberedLists() []List
GetNumberedLists returns all numbered lists
func (*ListLayout) ListCount ¶
func (l *ListLayout) ListCount() int
ListCount returns the number of detected lists
func (*ListLayout) TotalItemCount ¶
func (l *ListLayout) TotalItemCount() int
TotalItemCount returns the total number of list items
type OutlineEntry ¶
type OutlineEntry struct {
// Heading is the heading for this entry
Heading Heading
// Children are nested outline entries
Children []OutlineEntry
// Depth is the nesting depth (0 = top level)
Depth int
}
OutlineEntry represents an entry in a document outline
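One way to turn a flat, ordered list of headings into nested entries of this shape is a recursive descent keyed on heading level. The sketch below uses simplified, hypothetical heading and outlineEntry types and is only an illustration of the data structure; it does not claim to reproduce how GetOutline is implemented.

```go
package main

import "fmt"

// heading and outlineEntry are simplified, hypothetical stand-ins for
// Heading and OutlineEntry.
type heading struct {
	Level int
	Text  string
}

type outlineEntry struct {
	Heading  heading
	Children []outlineEntry
	Depth    int
}

// buildOutline nests each heading under the closest preceding heading
// with a smaller level.
func buildOutline(hs []heading) []outlineEntry {
	entries, _ := build(hs, 0, 0, 0)
	return entries
}

// build consumes headings starting at index i for as long as they are
// deeper than parentLevel, and returns the entries plus the next index.
func build(hs []heading, i, parentLevel, depth int) ([]outlineEntry, int) {
	var out []outlineEntry
	for i < len(hs) && hs[i].Level > parentLevel {
		e := outlineEntry{Heading: hs[i], Depth: depth}
		e.Children, i = build(hs, i+1, hs[i].Level, depth+1)
		out = append(out, e)
	}
	return out, i
}

// printOutline writes the outline with two spaces of indent per depth.
func printOutline(entries []outlineEntry) {
	for _, e := range entries {
		fmt.Printf("%*s%s\n", e.Depth*2, "", e.Heading.Text)
		printOutline(e.Children)
	}
}

func main() {
	printOutline(buildOutline([]heading{
		{1, "Intro"}, {2, "Background"}, {3, "Details"}, {2, "Scope"}, {1, "Methods"},
	}))
}
```

Note that Depth tracks nesting rather than heading level, so an H3 that directly follows an H1 ends up at depth 1 in this sketch.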
type PageFragments ¶
type PageFragments struct {
PageIndex int
PageHeight float64
PageWidth float64
Fragments []text.TextFragment
}
PageFragments represents text fragments from a single page
type Paragraph ¶
type Paragraph struct {
// BBox is the bounding box of the paragraph
BBox model.BBox
// Lines are the text lines in this paragraph (in reading order)
Lines []Line
// Text is the assembled text content
Text string
// Index is the paragraph's position in reading order (0-based)
Index int
// Style is the detected paragraph style
Style ParagraphStyle
// Alignment is the dominant alignment of the paragraph
Alignment LineAlignment
// FirstLineIndent is the indentation of the first line relative to subsequent lines
FirstLineIndent float64
// LeftMargin is the left margin of the paragraph body
LeftMargin float64
// AverageFontSize is the average font size across all lines
AverageFontSize float64
// LineSpacing is the average spacing between lines within this paragraph
LineSpacing float64
// SpacingBefore is the space before this paragraph
SpacingBefore float64
// SpacingAfter is the space after this paragraph
SpacingAfter float64
}
Paragraph represents a logical paragraph of text
func (*Paragraph) ContainsPoint ¶
ContainsPoint returns true if the point is within the paragraph's bounding box
func (*Paragraph) GetFirstLine ¶
GetFirstLine returns the first line of the paragraph
func (*Paragraph) GetLastLine ¶
GetLastLine returns the last line of the paragraph
func (*Paragraph) HasFirstLineIndent ¶
HasFirstLineIndent returns true if the paragraph has first-line indentation
func (*Paragraph) IsBlockQuote ¶
IsBlockQuote returns true if this paragraph is styled as a block quote
func (*Paragraph) IsListItem ¶
IsListItem returns true if this paragraph is styled as a list item
type ParagraphConfig ¶
type ParagraphConfig struct {
// SpacingThreshold is the multiplier for line spacing to detect paragraph breaks
// If spacing > avgLineSpacing * SpacingThreshold, it's a paragraph break
// Default: 1.5
SpacingThreshold float64
// IndentThreshold is the minimum indentation to consider as first-line indent
// Default: 15 points
IndentThreshold float64
// HeadingFontSizeRatio is the font size ratio to consider as heading
// If fontSize > avgFontSize * HeadingFontSizeRatio, it's a heading
// Default: 1.2 (20% larger)
HeadingFontSizeRatio float64
// MinParagraphLines is the minimum number of lines for a paragraph
// Default: 1
MinParagraphLines int
// BlockQuoteIndent is the minimum indentation to consider as block quote
// Default: 30 points
BlockQuoteIndent float64
// ListItemPatterns are regex patterns that indicate list items
// Default: bullet points, numbers, letters
ListItemPatterns []string
}
ParagraphConfig holds configuration for paragraph detection
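The two threshold rules described in the field comments are simple multiplications against a page-wide average. The sketch below restates them as standalone functions; the function names are hypothetical, and only the inequalities themselves come from the comments above.

```go
package main

import "fmt"

// isParagraphBreak applies the SpacingThreshold rule:
// spacing > avgLineSpacing * SpacingThreshold.
func isParagraphBreak(spacing, avgLineSpacing, spacingThreshold float64) bool {
	return spacing > avgLineSpacing*spacingThreshold
}

// looksLikeHeading applies the HeadingFontSizeRatio rule:
// fontSize > avgFontSize * HeadingFontSizeRatio.
func looksLikeHeading(fontSize, avgFontSize, ratio float64) bool {
	return fontSize > avgFontSize*ratio
}

func main() {
	// Documented defaults: SpacingThreshold 1.5, HeadingFontSizeRatio 1.2.
	fmt.Println(isParagraphBreak(20, 12, 1.5)) // 20 > 18
	fmt.Println(isParagraphBreak(14, 12, 1.5)) // 14 > 18 is false
	fmt.Println(looksLikeHeading(14, 10, 1.2)) // 14 > 12
	fmt.Println(looksLikeHeading(11, 10, 1.2)) // 11 > 12 is false
}
```

With the defaults and an average line spacing of 12 points, any gap above 18 points starts a new paragraph; on a page averaging 10-point text, anything clearly above 12 points is treated as a heading candidate.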
func DefaultParagraphConfig ¶
func DefaultParagraphConfig() ParagraphConfig
DefaultParagraphConfig returns a sensible default configuration
type ParagraphDetector ¶
type ParagraphDetector struct {
// contains filtered or unexported fields
}
ParagraphDetector detects paragraphs from lines
func NewParagraphDetector ¶
func NewParagraphDetector() *ParagraphDetector
NewParagraphDetector creates a new paragraph detector with default configuration
func NewParagraphDetectorWithConfig ¶
func NewParagraphDetectorWithConfig(config ParagraphConfig) *ParagraphDetector
NewParagraphDetectorWithConfig creates a paragraph detector with custom configuration
func (*ParagraphDetector) Detect ¶
func (d *ParagraphDetector) Detect(lines []Line, pageWidth, pageHeight float64) *ParagraphLayout
Detect analyzes lines and groups them into paragraphs
func (*ParagraphDetector) DetectFromFragments ¶
func (d *ParagraphDetector) DetectFromFragments(fragments []text.TextFragment, pageWidth, pageHeight float64) *ParagraphLayout
DetectFromFragments is a convenience method that first detects lines, then paragraphs
type ParagraphLayout ¶
type ParagraphLayout struct {
// Paragraphs are the detected paragraphs (in reading order)
Paragraphs []Paragraph
// PageWidth is the width of the page
PageWidth float64
// PageHeight is the height of the page
PageHeight float64
// AverageParagraphSpacing is the average spacing between paragraphs
AverageParagraphSpacing float64
// Config is the configuration used for detection
Config ParagraphConfig
}
ParagraphLayout represents the detected paragraph structure of a page
func (*ParagraphLayout) FindParagraphsInRegion ¶
func (l *ParagraphLayout) FindParagraphsInRegion(bbox model.BBox) []Paragraph
FindParagraphsInRegion returns paragraphs that fall within a bounding box
func (*ParagraphLayout) GetHeadings ¶
func (l *ParagraphLayout) GetHeadings() []Paragraph
GetHeadings returns all paragraphs detected as headings
func (*ParagraphLayout) GetListItems ¶
func (l *ParagraphLayout) GetListItems() []Paragraph
GetListItems returns all paragraphs detected as list items
func (*ParagraphLayout) GetParagraph ¶
func (l *ParagraphLayout) GetParagraph(index int) *Paragraph
GetParagraph returns a specific paragraph by index
func (*ParagraphLayout) GetParagraphsByStyle ¶
func (l *ParagraphLayout) GetParagraphsByStyle(style ParagraphStyle) []Paragraph
GetParagraphsByStyle returns paragraphs with a specific style
func (*ParagraphLayout) GetText ¶
func (l *ParagraphLayout) GetText() string
GetText returns all text with paragraph breaks
func (*ParagraphLayout) ParagraphCount ¶
func (l *ParagraphLayout) ParagraphCount() int
ParagraphCount returns the number of detected paragraphs
type ParagraphStyle ¶
type ParagraphStyle int
ParagraphStyle represents the detected style of a paragraph
const (
StyleNormal ParagraphStyle = iota
StyleHeading
StyleBlockQuote
StyleListItem
StyleCode
StyleCaption
)
func (ParagraphStyle) String ¶
func (s ParagraphStyle) String() string
String returns a string representation of the paragraph style
type ReadingDirection ¶
type ReadingDirection int
ReadingDirection indicates the primary reading direction of a document
const (
// LeftToRight is the default for most Western languages
LeftToRight ReadingDirection = iota
// RightToLeft is used for Arabic, Hebrew, etc.
RightToLeft
// TopToBottom is used for traditional Chinese/Japanese
TopToBottom
)
func (ReadingDirection) String ¶
func (d ReadingDirection) String() string
String returns a string representation of the reading direction
type ReadingOrderConfig ¶
type ReadingOrderConfig struct {
// Direction is the primary reading direction
Direction ReadingDirection
// ColumnConfig is the configuration for column detection
ColumnConfig ColumnConfig
// LineConfig is the configuration for line detection
LineConfig LineConfig
// PreferColumnOrder when true, reads entire columns before moving to next
// When false, may interleave if content appears to flow across columns
PreferColumnOrder bool
// SpanningThreshold is the minimum width ratio for content to be considered spanning
// Default: 0.7 (content spanning 70%+ of page width is considered spanning)
SpanningThreshold float64
// InvertedY indicates that Y coordinates increase downward (Y=0 at top)
// rather than the standard PDF convention where Y increases upward (Y=0 at bottom)
// When true, lower Y values are at the top of the page
// Default: nil (auto-detect based on content)
InvertedY *bool
}
ReadingOrderConfig holds configuration for reading order detection
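Two of these fields lend themselves to a short illustration: SpanningThreshold is a width ratio, and InvertedY is a pointer so that it can express three states. The sketch below is hypothetical throughout: isSpanning, boolPtr, and describeInvertedY are illustrative names, and reading a nil pointer as "auto-detect" is an assumption based on the field comment.

```go
package main

import "fmt"

// isSpanning reports whether content of the given width covers at least
// threshold (for example 0.7) of the page width.
func isSpanning(width, pageWidth, threshold float64) bool {
	if pageWidth <= 0 {
		return false
	}
	return width/pageWidth >= threshold
}

// boolPtr is a small helper for populating a *bool field.
func boolPtr(b bool) *bool { return &b }

// describeInvertedY spells out the three states a *bool can carry.
// Assumption: nil means the detector decides from the content.
func describeInvertedY(v *bool) string {
	switch {
	case v == nil:
		return "auto-detect"
	case *v:
		return "Y increases downward"
	default:
		return "Y increases upward"
	}
}

func main() {
	// A 450pt-wide title on a 612pt-wide (US Letter) page.
	fmt.Println(isSpanning(450, 612, 0.7))
	// A 280pt-wide column on the same page.
	fmt.Println(isSpanning(280, 612, 0.7))

	fmt.Println(describeInvertedY(nil))
	fmt.Println(describeInvertedY(boolPtr(true)))
	fmt.Println(describeInvertedY(boolPtr(false)))
}
```

A plain bool could not distinguish "explicitly upward" from "not set", which is presumably why the field is a pointer.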
func DefaultReadingOrderConfig ¶
func DefaultReadingOrderConfig() ReadingOrderConfig
DefaultReadingOrderConfig returns a sensible default configuration
type ReadingOrderDetector ¶
type ReadingOrderDetector struct {
// contains filtered or unexported fields
}
ReadingOrderDetector determines the correct reading order for page content
func NewReadingOrderDetector ¶
func NewReadingOrderDetector() *ReadingOrderDetector
NewReadingOrderDetector creates a new reading order detector with default configuration
func NewReadingOrderDetectorWithConfig ¶
func NewReadingOrderDetectorWithConfig(config ReadingOrderConfig) *ReadingOrderDetector
NewReadingOrderDetectorWithConfig creates a reading order detector with custom configuration
func (*ReadingOrderDetector) Detect ¶
func (d *ReadingOrderDetector) Detect(fragments []text.TextFragment, pageWidth, pageHeight float64) *ReadingOrderResult
Detect analyzes fragments and returns them in proper reading order
func (*ReadingOrderDetector) DetectFromLines ¶
func (d *ReadingOrderDetector) DetectFromLines(lines []Line, pageWidth, pageHeight float64) *ReadingOrderResult
DetectFromLines is a convenience method when you already have lines
type ReadingOrderResult ¶
type ReadingOrderResult struct {
// Fragments in reading order
Fragments []text.TextFragment
// Lines in reading order
Lines []Line
// Sections represent logical sections of content (spanning + column content)
Sections []ReadingSection
// Direction is the detected or configured reading direction
Direction ReadingDirection
// ColumnCount is the number of columns detected
ColumnCount int
// PageWidth and PageHeight
PageWidth float64
PageHeight float64
}
ReadingOrderResult holds the result of reading order analysis
func (*ReadingOrderResult) GetParagraphs ¶
func (r *ReadingOrderResult) GetParagraphs() *ParagraphLayout
GetParagraphs detects paragraphs from the ordered lines. For multi-column documents, paragraphs are detected within each section separately to maintain proper spacing context, then combined in reading order.
func (*ReadingOrderResult) GetSectionCount ¶
func (r *ReadingOrderResult) GetSectionCount() int
GetSectionCount returns the number of reading sections
func (*ReadingOrderResult) GetText ¶
func (r *ReadingOrderResult) GetText() string
GetText returns all text in reading order
func (*ReadingOrderResult) IsMultiColumn ¶
func (r *ReadingOrderResult) IsMultiColumn() bool
IsMultiColumn returns true if multiple columns were detected
type ReadingSection ¶
type ReadingSection struct {
// Type indicates what kind of section this is
Type SectionType
// Lines in this section (in reading order)
Lines []Line
// Fragments in this section (in reading order)
Fragments []text.TextFragment
// ColumnIndex for column sections (-1 for spanning)
ColumnIndex int
// BBox is the bounding box of this section
BBox struct {
X, Y, Width, Height float64
}
}
ReadingSection represents a section of content in reading order
type RegionType ¶
type RegionType int
RegionType indicates whether a region is a header or footer
const (
Header RegionType = iota
)
func (RegionType) String ¶
func (r RegionType) String() string
type SectionType ¶
type SectionType int
SectionType indicates the type of reading section
const (
SectionSpanning SectionType = iota // Full-width content (titles, headers)
SectionColumn // Column content
)
func (SectionType) String ¶
func (t SectionType) String() string
String returns a string representation of the section type