model

package
v1.6.6 Latest
Warning

This package is not in the latest version of its module.

Published: Feb 4, 2026 License: MIT Imports: 4 Imported by: 0

Documentation

Overview

Package model provides the intermediate representation (IR) for extracted document content.

This package defines the user-facing data structures that represent the semantic structure of documents. All parsing and extraction operations ultimately produce these types, making them the primary API for consuming extracted content.

Document Structure

The Document type represents a complete document with metadata and pages:

doc := model.NewDocument()
doc.Metadata.Title = "My Document"
page := model.NewPage(612, 792) // width and height in points
doc.AddPage(page)

Each Page contains dimensions, rotation, and a list of Element objects representing the page content.

Elements

All page content implements the Element interface. The concrete types are:

  • Paragraph - text paragraphs
  • Heading - headings (levels 1-6)
  • List - ordered or unordered lists
  • Table - tables with cells, row/column spans
  • Image - embedded images

Tables

The Table type provides a complete table representation with:

  • Rows and columns of Cell values
  • Row and column spanning
  • Export methods: ToMarkdown() and ToCSV()

Geometry

Geometric primitives support position and layout calculations:

  • BBox - bounding box with intersection, union, and overlap calculations
  • Point - 2D point with distance calculation
  • Matrix - 2D affine transformation matrix

Layout Information

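When layout analysis has been performed, each page's Layout field holds a PageLayout with the detected columns, text blocks, paragraphs, lines, headings, lists, and reading order, plus header/footer detection and statistics. Layout is nil for pages that have not been analyzed.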

Types

type Alignment

type Alignment int

Alignment represents text alignment within a block.

const (
	AlignmentUnknown Alignment = iota
	AlignmentLeft
	AlignmentCenter
	AlignmentRight
	AlignmentJustified
)

func (Alignment) String

func (a Alignment) String() string
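
The String method has no documented output, so the following is only a minimal sketch of how such a method is commonly written. The type is re-declared locally to keep the example self-contained, and the lowercase labels ("left", "center", and so on) are assumptions, not the package's actual strings.

```go
package main

import "fmt"

// Alignment mirrors the documented enum; the returned names are assumptions.
type Alignment int

const (
	AlignmentUnknown Alignment = iota
	AlignmentLeft
	AlignmentCenter
	AlignmentRight
	AlignmentJustified
)

// String maps each value to a lowercase label and falls back to "unknown".
func (a Alignment) String() string {
	switch a {
	case AlignmentLeft:
		return "left"
	case AlignmentCenter:
		return "center"
	case AlignmentRight:
		return "right"
	case AlignmentJustified:
		return "justified"
	default:
		return "unknown"
	}
}

func main() {
	// fmt calls String automatically for values that implement fmt.Stringer.
	fmt.Println(AlignmentCenter, Alignment(42)) // center unknown
}
```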

type BBox

type BBox struct {
	X      float64 // Left
	Y      float64 // Bottom (PDF coordinate system)
	Width  float64
	Height float64
}

BBox represents an axis-aligned bounding box (rectangle). In PDF coordinates, Y increases upward, so Y is the bottom edge.

func NewBBox

func NewBBox(x, y, width, height float64) BBox

NewBBox creates a bounding box from the given coordinates.

func NewBBoxFromPoints

func NewBBoxFromPoints(p1, p2 Point) BBox

NewBBoxFromPoints creates a bounding box that encloses the two given points.

func (BBox) Area

func (b BBox) Area() float64

Area returns the area of the bounding box (Width * Height).

func (BBox) Bottom

func (b BBox) Bottom() float64

Bottom returns the Y coordinate of the bottom edge.

func (BBox) Center

func (b BBox) Center() Point

Center returns the center point of the bounding box.

func (BBox) Contains

func (b BBox) Contains(p Point) bool

Contains reports whether the point p is inside the bounding box.

func (BBox) Expand

func (b BBox) Expand(margin float64) BBox

Expand returns a new bounding box expanded by margin on all four sides.

func (BBox) Intersection

func (b BBox) Intersection(other BBox) BBox

Intersection returns the bounding box of the overlapping region, or an empty BBox if the boxes do not intersect.

func (BBox) Intersects

func (b BBox) Intersects(other BBox) bool

Intersects reports whether b and other have any overlapping area.

func (BBox) IsEmpty

func (b BBox) IsEmpty() bool

IsEmpty reports whether the bounding box has zero or negative dimensions.

func (BBox) IsValid

func (b BBox) IsValid() bool

IsValid reports whether the bounding box has positive width and height.

func (BBox) Left

func (b BBox) Left() float64

Left returns the X coordinate of the left edge.

func (BBox) OverlapRatio

func (b BBox) OverlapRatio(other BBox) float64

OverlapRatio returns the ratio of intersection area to the smaller box's area. Returns a value between 0 (no overlap) and 1 (complete overlap).

func (BBox) Right

func (b BBox) Right() float64

Right returns the X coordinate of the right edge.

func (BBox) Top

func (b BBox) Top() float64

Top returns the Y coordinate of the top edge.

func (BBox) Union

func (b BBox) Union(other BBox) BBox

Union returns the smallest bounding box that contains both b and other.
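
The Intersection, Union, and OverlapRatio descriptions above can be illustrated with a small self-contained sketch. BBox is re-declared here with the documented fields; the method bodies are assumptions written from the doc comments, not the package's source.

```go
package main

import (
	"fmt"
	"math"
)

// BBox mirrors the documented fields; Y is the bottom edge (PDF coordinates).
type BBox struct {
	X, Y, Width, Height float64
}

func (b BBox) Right() float64 { return b.X + b.Width }
func (b BBox) Top() float64   { return b.Y + b.Height }
func (b BBox) Area() float64  { return b.Width * b.Height }

// Intersection returns the overlapping region, or a zero BBox when there is none.
func (b BBox) Intersection(o BBox) BBox {
	left, bottom := math.Max(b.X, o.X), math.Max(b.Y, o.Y)
	right, top := math.Min(b.Right(), o.Right()), math.Min(b.Top(), o.Top())
	if right <= left || top <= bottom {
		return BBox{}
	}
	return BBox{X: left, Y: bottom, Width: right - left, Height: top - bottom}
}

// Union returns the smallest box enclosing both inputs.
func (b BBox) Union(o BBox) BBox {
	left, bottom := math.Min(b.X, o.X), math.Min(b.Y, o.Y)
	right, top := math.Max(b.Right(), o.Right()), math.Max(b.Top(), o.Top())
	return BBox{X: left, Y: bottom, Width: right - left, Height: top - bottom}
}

// OverlapRatio divides the intersection area by the smaller box's area.
func (b BBox) OverlapRatio(o BBox) float64 {
	smaller := math.Min(b.Area(), o.Area())
	if smaller <= 0 {
		return 0
	}
	return b.Intersection(o).Area() / smaller
}

func main() {
	a := BBox{X: 0, Y: 0, Width: 4, Height: 4}
	c := BBox{X: 2, Y: 2, Width: 4, Height: 4}
	fmt.Println(a.Intersection(c)) // {2 2 2 2}
	fmt.Println(a.Union(c))        // {0 0 6 6}
	fmt.Println(a.OverlapRatio(c)) // 0.25
}
```

Dividing by the smaller area means a small box fully inside a large one scores 1, which is useful when deciding whether a text fragment belongs to a table cell or a column.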

type BlockInfo

type BlockInfo struct {
	Index     int       // Block index
	BBox      BBox      // Bounding box
	LineCount int       // Number of lines in block
	Text      string    // Combined text content
	Column    int       // Column index this block belongs to (-1 if unknown)
	FontSize  float64   // Average font size
	Alignment Alignment // Text alignment
}

BlockInfo contains information about a detected text block.

type Cell

type Cell struct {
	Text     string
	BBox     BBox
	RowSpan  int
	ColSpan  int
	IsHeader bool
	// Cell styling
	Style CellStyle
}

Cell represents a single table cell: its text, position, row and column spans, header flag, and styling.

type CellStyle

type CellStyle struct {
	BackgroundColor Color
	BorderColor     Color
	BorderWidth     float64
	TextStyle       TextStyle
	Alignment       TextAlignment
	VerticalAlign   VerticalAlignment
}

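CellStyle describes the visual styling of a table cell: background and border colors, border width, text style, and horizontal and vertical alignment.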

type Color

type Color struct {
	R, G, B uint8
}

Color represents an RGB color with 8-bit components.

type ColumnInfo

type ColumnInfo struct {
	Index int     // Column index (0-based, left to right)
	Left  float64 // Left edge X coordinate
	Right float64 // Right edge X coordinate
	Width float64 // Column width
	BBox  BBox    // Bounding box of column content
}

ColumnInfo contains information about a detected column.

type Document

type Document struct {
	Metadata Metadata
	Pages    []*Page
}

Document represents a complete PDF document with extracted semantic structure. It contains document-level metadata and an ordered list of pages.

func NewDocument

func NewDocument() *Document

NewDocument creates a new empty document with initialized fields.

func (*Document) AddPage

func (d *Document) AddPage(page *Page)

AddPage appends a page to the document and assigns its page number (1-indexed).

func (*Document) AllHeadings

func (d *Document) AllHeadings() []HeadingInfo

AllHeadings returns all detected headings across all pages. Requires layout analysis to have been performed.

func (*Document) AllLists

func (d *Document) AllLists() []ListInfo

AllLists returns all detected lists across all pages. Requires layout analysis to have been performed.

func (*Document) AllParagraphs

func (d *Document) AllParagraphs() []ParagraphInfo

AllParagraphs returns all detected paragraphs across all pages. Requires layout analysis to have been performed.

func (*Document) ExtractTables

func (d *Document) ExtractTables() []*Table

ExtractTables returns all tables extracted from all pages of the document.

func (*Document) ExtractText

func (d *Document) ExtractText() string

ExtractText returns all text content from all pages, concatenated with double newlines between pages.

func (*Document) GetPage

func (d *Document) GetPage(number int) *Page

GetPage returns a page by its 1-indexed page number, or nil if out of range.
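
Because AddPage assigns 1-indexed page numbers and GetPage looks pages up by that number, the pair can be sketched as below. The types are minimal local stand-ins and the bodies are assumptions based on the doc comments.

```go
package main

import "fmt"

// Page and Document are trimmed stand-ins for the documented types.
type Page struct{ Number int }

type Document struct{ Pages []*Page }

// AddPage appends the page and numbers it by its 1-indexed position.
func (d *Document) AddPage(p *Page) {
	p.Number = len(d.Pages) + 1
	d.Pages = append(d.Pages, p)
}

// GetPage converts the 1-indexed number to a slice index, returning nil when out of range.
func (d *Document) GetPage(number int) *Page {
	if number < 1 || number > len(d.Pages) {
		return nil
	}
	return d.Pages[number-1]
}

func main() {
	doc := &Document{}
	doc.AddPage(&Page{})
	doc.AddPage(&Page{})
	fmt.Println(doc.GetPage(2).Number) // 2
	fmt.Println(doc.GetPage(0) == nil) // true
}
```

Callers should check the result for nil before dereferencing it, since page 0 and any number past the last page are both out of range.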

func (*Document) HasLayout

func (d *Document) HasLayout() bool

HasLayout reports whether layout analysis has been performed on any page.

func (*Document) LayoutStats

func (d *Document) LayoutStats() LayoutStats

LayoutStats returns aggregated layout statistics across all pages.

func (*Document) PageCount

func (d *Document) PageCount() int

PageCount returns the total number of pages in the document.

func (*Document) TableOfContents

func (d *Document) TableOfContents() []TOCEntry

TableOfContents returns headings organized as a document outline with page references.

type Element

type Element interface {
	Type() ElementType
	BoundingBox() BBox
	ZIndex() int
}

Element is the interface implemented by all page elements such as paragraphs, headings, lists, tables, and images.

type ElementType

type ElementType int

ElementType identifies the type of a page element.

const (
	ElementTypeUnknown ElementType = iota
	ElementTypeParagraph
	ElementTypeHeading
	ElementTypeList
	ElementTypeTable
	ElementTypeImage
	ElementTypeFigure
	ElementTypeCaption
)

func (ElementType) String

func (et ElementType) String() string

String returns the name of the element type.

type Heading

type Heading struct {
	Text     string
	Level    int // 1-6
	BBox     BBox
	FontSize float64
	FontName string
	Style    TextStyle
	ZOrder   int
}

Heading represents a heading with level (1-6), position, and style information.

func (*Heading) BoundingBox

func (h *Heading) BoundingBox() BBox

func (*Heading) GetText

func (h *Heading) GetText() string

func (*Heading) Type

func (h *Heading) Type() ElementType

Type returns ElementTypeHeading.

func (*Heading) ZIndex

func (h *Heading) ZIndex() int

type HeadingInfo

type HeadingInfo struct {
	Level      int     // Heading level (1-6)
	Text       string  // Heading text
	BBox       BBox    // Bounding box
	FontSize   float64 // Font size
	FontName   string  // Font name
	Confidence float64 // Detection confidence (0-1)
}

HeadingInfo contains information about a detected heading.

type Image

type Image struct {
	Data   []byte
	Format ImageFormat
	BBox   BBox
	DPI    float64
	ZOrder int
	// Alt text if available
	AltText string
}

Image represents an embedded image with its binary data and format.

func (*Image) BoundingBox

func (i *Image) BoundingBox() BBox

func (*Image) Type

func (i *Image) Type() ElementType

Type returns ElementTypeImage.

func (*Image) ZIndex

func (i *Image) ZIndex() int

type ImageFormat

type ImageFormat int

ImageFormat identifies the format of an embedded image.

const (
	ImageFormatUnknown ImageFormat = iota
	ImageFormatJPEG
	ImageFormatPNG
	ImageFormatTIFF
	ImageFormatJPEG2000
	ImageFormatJBIG2
)

type LayoutStats

type LayoutStats struct {
	FragmentCount  int // Number of text fragments processed
	LineCount      int // Number of text lines detected
	BlockCount     int // Number of text blocks detected
	ParagraphCount int // Number of paragraphs detected
	HeadingCount   int // Number of headings detected
	ListCount      int // Number of lists detected
}

LayoutStats contains statistics about the layout analysis.

type Line

type Line struct {
	Start    Point
	End      Point
	Width    float64
	Color    Color
	IsRect   bool
	RectFill bool
}

Line represents a geometric line or rectangle from PDF graphics operations.

type LineInfo

type LineInfo struct {
	Index     int       // Line index
	BBox      BBox      // Bounding box
	Text      string    // Text content
	FontSize  float64   // Average font size
	Alignment Alignment // Detected alignment
	IsIndent  bool      // Whether line appears indented
}

LineInfo contains information about a detected text line.

type List

type List struct {
	Items   []ListItem
	Ordered bool
	BBox    BBox
	ZOrder  int
}

List represents an ordered or unordered list with items.

func (*List) BoundingBox

func (l *List) BoundingBox() BBox

func (*List) GetText

func (l *List) GetText() string

func (*List) Type

func (l *List) Type() ElementType

Type returns ElementTypeList.

func (*List) ZIndex

func (l *List) ZIndex() int

type ListInfo

type ListInfo struct {
	Type       ListType   // Type of list
	Items      []ListItem // List items
	BBox       BBox       // Bounding box
	Nested     bool       // Whether list contains nested items
	StartValue int        // Starting value for numbered lists
}

ListInfo contains information about a detected list.

type ListItem

type ListItem struct {
	Text   string
	BBox   BBox
	Bullet string
	Level  int
}

ListItem represents a single item within a list.

type ListType

type ListType int

ListType identifies the marker style of a detected list (bullet, numbered, lettered, Roman numeral, or checkbox).

const (
	ListTypeUnknown  ListType = iota
	ListTypeBullet            // Bullet points (•, -, *, etc.)
	ListTypeNumbered          // Numbered (1, 2, 3)
	ListTypeLettered          // Lettered (a, b, c or A, B, C)
	ListTypeRoman             // Roman numerals (i, ii, iii or I, II, III)
	ListTypeCheckbox          // Checkboxes (☐, ☑, ✓)
)

func (ListType) String

func (lt ListType) String() string

type Matrix

type Matrix [6]float64

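Matrix represents a 2D affine transformation as six coefficients [a, b, c, d, e, f], stored in the order used by PDF. In PDF notation these are the first two columns of the 3x3 matrix with rows (a b 0), (c d 0), (e f 1); the third column is implied.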

func Identity

func Identity() Matrix

Identity returns the identity transformation matrix.

func Rotate

func Rotate(angle float64) Matrix

Rotate returns a rotation matrix for the given angle in radians.

func Scale

func Scale(sx, sy float64) Matrix

Scale returns a scaling matrix with scale factors (sx, sy).

func Translate

func Translate(tx, ty float64) Matrix

Translate returns a translation matrix that moves by (tx, ty).

func (Matrix) IsIdentity

func (m Matrix) IsIdentity() bool

IsIdentity reports whether m is the identity matrix.

func (Matrix) Multiply

func (m Matrix) Multiply(other Matrix) Matrix

Multiply returns the product of m and other (m * other).

func (Matrix) Transform

func (m Matrix) Transform(p Point) Point

Transform applies the affine transformation to a point and returns the result.
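
A sketch of how the constructors, Multiply, and Transform fit together under the PDF convention, where a point maps to x' = a*x + c*y + e and y' = b*x + d*y + f. The types are re-declared locally, and the multiplication order shown (m applied first, then other) follows PDF's row-vector convention; whether the package applies the same order is an assumption.

```go
package main

import (
	"fmt"
	"math"
)

type Point struct{ X, Y float64 }

// Matrix holds [a b c d e f] in PDF order.
type Matrix [6]float64

func Translate(tx, ty float64) Matrix { return Matrix{1, 0, 0, 1, tx, ty} }
func Scale(sx, sy float64) Matrix     { return Matrix{sx, 0, 0, sy, 0, 0} }

// Rotate builds a counter-clockwise rotation for an angle in radians.
func Rotate(angle float64) Matrix {
	s, c := math.Sin(angle), math.Cos(angle)
	return Matrix{c, s, -s, c, 0, 0}
}

// Transform applies x' = a*x + c*y + e, y' = b*x + d*y + f.
func (m Matrix) Transform(p Point) Point {
	return Point{
		X: m[0]*p.X + m[2]*p.Y + m[4],
		Y: m[1]*p.X + m[3]*p.Y + m[5],
	}
}

// Multiply computes m * o for 3x3 matrices whose last column is (0, 0, 1).
func (m Matrix) Multiply(o Matrix) Matrix {
	return Matrix{
		m[0]*o[0] + m[1]*o[2],
		m[0]*o[1] + m[1]*o[3],
		m[2]*o[0] + m[3]*o[2],
		m[2]*o[1] + m[3]*o[3],
		m[4]*o[0] + m[5]*o[2] + o[4],
		m[4]*o[1] + m[5]*o[3] + o[5],
	}
}

func main() {
	// Scale by 2 first, then translate by (10, 5).
	m := Scale(2, 2).Multiply(Translate(10, 5))
	fmt.Println(m.Transform(Point{X: 1, Y: 1})) // {12 7}
}
```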

type Metadata

type Metadata struct {
	Title        string
	Author       string
	Subject      string
	Keywords     []string
	Creator      string
	Producer     string
	CreationDate time.Time
	ModDate      time.Time
	// Custom metadata
	Custom map[string]string
}

Metadata contains document-level metadata extracted from the PDF's document information dictionary and XMP metadata streams.

type Page

type Page struct {
	Number   int       // 1-indexed page number
	Width    float64   // Page width in points
	Height   float64   // Page height in points
	Rotation int       // Rotation angle (0, 90, 180, 270)
	Elements []Element // Ordered list of page elements

	// Raw data for debugging/advanced use
	RawText  []TextFragment // All text fragments with positions
	RawLines []Line         // All detected lines/rectangles

	// Layout analysis results (populated by AnalyzeLayout)
	Layout *PageLayout // Layout analysis results, nil if not analyzed
}

Page represents a single page in a PDF document.

func NewPage

func NewPage(width, height float64) *Page

NewPage creates a new page with the given width and height in points.

func (*Page) AddElement

func (p *Page) AddElement(elem Element)

AddElement adds an element to the page.

func (*Page) ColumnCount

func (p *Page) ColumnCount() int

ColumnCount returns the number of columns detected on this page.

func (*Page) ContentBBox

func (p *Page) ContentBBox() BBox

ContentBBox returns the bounding box of all content on the page, excluding headers and footers if detected.

func (*Page) ElementsInReadingOrder

func (p *Page) ElementsInReadingOrder() []Element

ElementsInReadingOrder returns the page elements sorted by reading order. If layout analysis has not been performed, it returns the elements in their original order.
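
PageLayout.ReadingOrder is documented as a slice of indices into Elements, so reordering with a fallback can be sketched as follows. The helper inReadingOrder is hypothetical, and plain strings stand in for Element values to keep the example short; skipping out-of-range indices is also an assumption.

```go
package main

import "fmt"

// inReadingOrder returns elems rearranged by order, or elems unchanged
// when no reading order is available (for example, no layout analysis).
func inReadingOrder(elems []string, order []int) []string {
	if len(order) == 0 {
		return elems
	}
	out := make([]string, 0, len(order))
	for _, idx := range order {
		// Ignore indices that do not refer to an existing element.
		if idx >= 0 && idx < len(elems) {
			out = append(out, elems[idx])
		}
	}
	return out
}

func main() {
	elems := []string{"sidebar", "title", "body"}
	fmt.Println(inReadingOrder(elems, []int{1, 2, 0})) // [title body sidebar]
	fmt.Println(inReadingOrder(elems, nil))            // [sidebar title body]
}
```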

func (*Page) ExtractTables

func (p *Page) ExtractTables() []*Table

ExtractTables returns all table elements on the page.

func (*Page) ExtractText

func (p *Page) ExtractText() string

ExtractText returns the concatenated text of all text elements on the page.

func (*Page) GetBlocks

func (p *Page) GetBlocks() []BlockInfo

GetBlocks returns all text blocks on this page (requires layout analysis).

func (*Page) GetElementsInRegion

func (p *Page) GetElementsInRegion(bbox BBox) []Element

GetElementsInRegion returns the elements within the given bounding box.

func (*Page) GetHeadings

func (p *Page) GetHeadings() []HeadingInfo

GetHeadings returns all headings on this page (requires layout analysis).

func (*Page) GetLists

func (p *Page) GetLists() []ListInfo

GetLists returns all lists on this page (requires layout analysis).

func (*Page) GetParagraphs

func (p *Page) GetParagraphs() []ParagraphInfo

GetParagraphs returns all paragraphs on this page (requires layout analysis).

func (*Page) HasLayout

func (p *Page) HasLayout() bool

HasLayout reports whether layout analysis has been performed on this page.

func (*Page) IsMultiColumn

func (p *Page) IsMultiColumn() bool

IsMultiColumn reports whether the page has more than one detected column.

type PageLayout

type PageLayout struct {
	// Column structure
	Columns     []ColumnInfo // Detected columns
	ColumnCount int          // Number of columns detected

	// Text structure
	TextBlocks []BlockInfo     // Detected text blocks
	Paragraphs []ParagraphInfo // Detected paragraphs
	Lines      []LineInfo      // Detected text lines

	// Semantic elements
	Headings []HeadingInfo // Detected headings (H1-H6)
	Lists    []ListInfo    // Detected lists

	// Reading order
	ReadingOrder []int // Indices into Elements in reading order

	// Header/footer detection
	HasHeader    bool    // Whether this page has a detected header
	HasFooter    bool    // Whether this page has a detected footer
	HeaderHeight float64 // Height of header region
	FooterHeight float64 // Height of footer region

	// Statistics
	Stats LayoutStats
}

PageLayout contains the results of layout analysis for a page.

type Paragraph

type Paragraph struct {
	Text      string
	BBox      BBox
	FontSize  float64
	FontName  string
	Style     TextStyle
	Alignment TextAlignment
	ZOrder    int
}

Paragraph represents a paragraph of text with position, style, and alignment.

func (*Paragraph) BoundingBox

func (p *Paragraph) BoundingBox() BBox

func (*Paragraph) GetText

func (p *Paragraph) GetText() string

func (*Paragraph) Type

func (p *Paragraph) Type() ElementType

Type returns ElementTypeParagraph.

func (*Paragraph) ZIndex

func (p *Paragraph) ZIndex() int

type ParagraphInfo

type ParagraphInfo struct {
	Index      int       // Paragraph index
	BBox       BBox      // Bounding box
	Text       string    // Text content
	FontSize   float64   // Average font size
	FontName   string    // Primary font name
	LineCount  int       // Number of lines
	Alignment  Alignment // Text alignment
	FirstLine  float64   // First line indent (positive = indented)
	LineHeight float64   // Average line height
}

ParagraphInfo contains information about a detected paragraph.

type Point

type Point struct {
	X, Y float64
}

Point represents a 2D point with X and Y coordinates.

func (Point) Distance

func (p Point) Distance(other Point) float64

Distance returns the Euclidean distance between p and another point.
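
Point.Distance pairs naturally with NewBBoxFromPoints and BBox.Contains. The sketch below re-declares the two types locally; the bodies are assumptions derived from the doc comments, and treating points on the edge as inside is a choice made for this example only.

```go
package main

import (
	"fmt"
	"math"
)

type Point struct{ X, Y float64 }

// Distance is the Euclidean distance; math.Hypot avoids intermediate overflow.
func (p Point) Distance(o Point) float64 { return math.Hypot(o.X-p.X, o.Y-p.Y) }

type BBox struct{ X, Y, Width, Height float64 }

// NewBBoxFromPoints accepts the two corners in any order.
func NewBBoxFromPoints(p1, p2 Point) BBox {
	x, y := math.Min(p1.X, p2.X), math.Min(p1.Y, p2.Y)
	return BBox{X: x, Y: y, Width: math.Abs(p2.X - p1.X), Height: math.Abs(p2.Y - p1.Y)}
}

// Contains treats points on the boundary as inside (an assumption).
func (b BBox) Contains(p Point) bool {
	return p.X >= b.X && p.X <= b.X+b.Width && p.Y >= b.Y && p.Y <= b.Y+b.Height
}

func main() {
	a, c := Point{X: 4, Y: 1}, Point{X: 1, Y: 5}
	fmt.Println(a.Distance(c)) // 5
	box := NewBBoxFromPoints(a, c)
	fmt.Println(box)                             // {1 1 3 4}
	fmt.Println(box.Contains(Point{X: 2, Y: 3})) // true
}
```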

type TOCEntry

type TOCEntry struct {
	Level    int     // Heading level (1-6)
	Text     string  // Heading text
	Page     int     // Page number (1-indexed)
	BBox     BBox    // Position on page
	FontSize float64 // Font size of heading
}

TOCEntry represents an entry in the generated table of contents.
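
One way to consume the entries returned by TableOfContents is to render them as an indented outline. The sketch below uses a trimmed local TOCEntry and a hypothetical renderOutline helper; the two-space indent per level and the "(p. N)" suffix are arbitrary formatting choices.

```go
package main

import (
	"fmt"
	"strings"
)

// TOCEntry keeps only the fields needed for the outline.
type TOCEntry struct {
	Level int    // heading level (1-6)
	Text  string // heading text
	Page  int    // 1-indexed page number
}

// renderOutline indents each entry by two spaces per level below the first.
func renderOutline(entries []TOCEntry) string {
	var sb strings.Builder
	for _, e := range entries {
		depth := e.Level - 1
		if depth < 0 {
			depth = 0 // guard against unset levels; strings.Repeat panics on negatives
		}
		fmt.Fprintf(&sb, "%s%s (p. %d)\n", strings.Repeat("  ", depth), e.Text, e.Page)
	}
	return sb.String()
}

func main() {
	toc := []TOCEntry{
		{Level: 1, Text: "Introduction", Page: 1},
		{Level: 2, Text: "Scope", Page: 2},
	}
	fmt.Print(renderOutline(toc))
	// Introduction (p. 1)
	//   Scope (p. 2)
}
```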

type Table

type Table struct {
	Rows       [][]Cell
	BBox       BBox
	HasGrid    bool    // Whether table has visible gridlines
	Confidence float64 // Detection confidence (0-1)
	ZOrder     int
}

Table represents a table with cells organized in rows and columns.

func NewTable

func NewTable(rows, cols int) *Table

NewTable creates a new table with the given number of rows and columns.

func (*Table) BoundingBox

func (t *Table) BoundingBox() BBox

func (*Table) ColCount

func (t *Table) ColCount() int

ColCount returns the number of columns in the first row.

func (*Table) GetCell

func (t *Table) GetCell(row, col int) *Cell

GetCell returns the cell at the given row and column (0-indexed).

func (*Table) GetText

func (t *Table) GetText() string

func (*Table) RowCount

func (t *Table) RowCount() int

RowCount returns the number of rows.

func (*Table) SetCell

func (t *Table) SetCell(row, col int, cell Cell) error

SetCell sets the cell at the given row and column and returns an error on failure.

func (*Table) ToCSV

func (t *Table) ToCSV() string

ToCSV converts the table to CSV format.

func (*Table) ToMarkdown

func (t *Table) ToMarkdown() string

ToMarkdown converts the table to Markdown format.
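
The exact output of ToMarkdown and ToCSV is not documented, so the following is only a sketch of what such exports typically look like. It uses trimmed local Cell and Table types and hypothetical free functions toMarkdown and toCSV; treating the first row as the header and escaping pipe characters are assumptions.

```go
package main

import (
	"encoding/csv"
	"fmt"
	"strings"
)

type Cell struct{ Text string }

type Table struct{ Rows [][]Cell }

// toMarkdown renders a pipe table, using the first row as the header.
func toMarkdown(t *Table) string {
	var sb strings.Builder
	for i, row := range t.Rows {
		sb.WriteString("|")
		for _, c := range row {
			// Escape pipes so cell text cannot break the column structure.
			sb.WriteString(" " + strings.ReplaceAll(c.Text, "|", `\|`) + " |")
		}
		sb.WriteString("\n")
		if i == 0 {
			sb.WriteString("|" + strings.Repeat(" --- |", len(row)) + "\n")
		}
	}
	return sb.String()
}

// toCSV relies on encoding/csv so quoting of commas and quotes is handled correctly.
func toCSV(t *Table) (string, error) {
	var sb strings.Builder
	w := csv.NewWriter(&sb)
	for _, row := range t.Rows {
		record := make([]string, len(row))
		for i, c := range row {
			record[i] = c.Text
		}
		if err := w.Write(record); err != nil {
			return "", err
		}
	}
	w.Flush()
	return sb.String(), w.Error()
}

func main() {
	tbl := &Table{Rows: [][]Cell{
		{{Text: "Name"}, {Text: "Qty"}},
		{{Text: "Bolt, M4"}, {Text: "10"}},
	}}
	fmt.Print(toMarkdown(tbl))
	out, err := toCSV(tbl)
	if err != nil {
		fmt.Println("csv error:", err)
		return
	}
	fmt.Print(out)
}
```

The CSV writer quotes the "Bolt, M4" field because it contains a comma, which is the main reason to prefer encoding/csv over joining strings by hand.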

func (*Table) Type

func (t *Table) Type() ElementType

func (*Table) ZIndex

func (t *Table) ZIndex() int

type TableGrid

type TableGrid struct {
	Rows      []float64 // Y-coordinates of row boundaries
	Cols      []float64 // X-coordinates of column boundaries
	HasHLines []bool    // Horizontal line presence
	HasVLines []bool    // Vertical line presence
}

TableGrid represents the detected grid structure of a table: the row and column boundary coordinates and which grid lines are present.

func NewTableGrid

func NewTableGrid() *TableGrid

NewTableGrid creates a new empty grid.

func (*TableGrid) ColCount

func (g *TableGrid) ColCount() int

ColCount returns the number of columns.

func (*TableGrid) GetCellBBox

func (g *TableGrid) GetCellBBox(row, col int) BBox

GetCellBBox returns the bounding box for the cell at the given row and column.
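
Since Rows and Cols hold boundary coordinates, a cell's box can be derived from two adjacent boundaries on each axis. The sketch below assumes that N boundaries delimit N-1 cells and that an out-of-range index yields an empty BBox; neither is confirmed by the documentation. Using min and absolute difference makes it work whether the row boundaries run top to bottom or bottom to top.

```go
package main

import (
	"fmt"
	"math"
)

type BBox struct{ X, Y, Width, Height float64 }

// TableGrid keeps only the boundary coordinates from the documented type.
type TableGrid struct {
	Rows []float64 // Y-coordinates of row boundaries
	Cols []float64 // X-coordinates of column boundaries
}

// cellsBetween converts a boundary count to a cell count.
func cellsBetween(boundaries int) int {
	if boundaries < 2 {
		return 0
	}
	return boundaries - 1
}

func (g *TableGrid) RowCount() int { return cellsBetween(len(g.Rows)) }
func (g *TableGrid) ColCount() int { return cellsBetween(len(g.Cols)) }

// GetCellBBox spans the boundaries on either side of the cell.
func (g *TableGrid) GetCellBBox(row, col int) BBox {
	if row < 0 || row >= g.RowCount() || col < 0 || col >= g.ColCount() {
		return BBox{}
	}
	x0, x1 := g.Cols[col], g.Cols[col+1]
	y0, y1 := g.Rows[row], g.Rows[row+1]
	return BBox{
		X:      math.Min(x0, x1),
		Y:      math.Min(y0, y1),
		Width:  math.Abs(x1 - x0),
		Height: math.Abs(y1 - y0),
	}
}

func main() {
	g := &TableGrid{Rows: []float64{100, 80, 60}, Cols: []float64{0, 50, 120}}
	fmt.Println(g.RowCount(), g.ColCount()) // 2 2
	fmt.Println(g.GetCellBBox(1, 1))        // {50 60 70 20}
}
```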

func (*TableGrid) RowCount

func (g *TableGrid) RowCount() int

RowCount returns the number of rows.

type TextAlignment

type TextAlignment int

TextAlignment represents horizontal text alignment.

const (
	AlignLeft TextAlignment = iota
	AlignCenter
	AlignRight
	AlignJustify
)

type TextElement

type TextElement interface {
	Element
	GetText() string
}

TextElement is the interface for elements that contain text content.
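
Because TextElement embeds Element, callers can use a type assertion to pick out the elements that carry text and skip the rest, such as images. The sketch below declares cut-down local versions of the interfaces and two element types; collectText is a hypothetical helper, and BoundingBox and ZIndex are omitted for brevity.

```go
package main

import (
	"fmt"
	"strings"
)

type ElementType int

const (
	ElementTypeUnknown ElementType = iota
	ElementTypeParagraph
	ElementTypeImage
)

// Element is reduced to Type for this example.
type Element interface {
	Type() ElementType
}

// TextElement adds text access on top of Element.
type TextElement interface {
	Element
	GetText() string
}

type Paragraph struct{ Text string }

func (p *Paragraph) Type() ElementType { return ElementTypeParagraph }
func (p *Paragraph) GetText() string   { return p.Text }

type Image struct{ Data []byte }

func (i *Image) Type() ElementType { return ElementTypeImage }

// collectText joins the text of every element that implements TextElement.
func collectText(elems []Element) string {
	var parts []string
	for _, e := range elems {
		if te, ok := e.(TextElement); ok {
			parts = append(parts, te.GetText())
		}
	}
	return strings.Join(parts, "\n")
}

func main() {
	elems := []Element{&Paragraph{Text: "Intro"}, &Image{}, &Paragraph{Text: "Body"}}
	fmt.Printf("%q\n", collectText(elems)) // "Intro\nBody"
}
```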

type TextFragment

type TextFragment struct {
	Text     string
	BBox     BBox
	FontSize float64
	FontName string
	Style    TextStyle
	Matrix   [6]float64 // Text transformation matrix
}

TextFragment represents a positioned piece of text extracted from a PDF page, including its position, font, and transformation matrix.

type TextStyle

type TextStyle struct {
	Bold      bool
	Italic    bool
	Underline bool
	Color     Color
}

TextStyle represents text styling attributes.

type VerticalAlignment

type VerticalAlignment int

VerticalAlignment represents the vertical alignment of content within a table cell.

const (
	VAlignTop VerticalAlignment = iota
	VAlignMiddle
	VAlignBottom
)
