tabula

v1.0.0
Published: Nov 27, 2025 License: MIT Imports: 10 Imported by: 0

README

Tabula

A pure-Go PDF text extraction library with a fluent API, designed for RAG (Retrieval-Augmented Generation) workflows.

Features

  • Fluent API - Chain methods for clean, readable code
  • Layout Analysis - Detect headings, paragraphs, lists, and columns
  • Header/Footer Detection - Automatically identify and exclude repeating content
  • RAG-Ready Chunking - Semantic document chunking with metadata
  • Markdown Export - Convert extracted content to markdown
  • PDF 1.0-1.7 Support - Including modern XRef streams (PDF 1.5+)
  • Pure Go - No CGO dependencies

Installation

go get github.com/tsawler/tabula

Quick Start

Extract Text
package main

import (
    "fmt"
    "log"

    "github.com/tsawler/tabula"
)

func main() {
    text, warnings, err := tabula.Open("document.pdf").Text()
    if err != nil {
        log.Fatal(err)
    }

    fmt.Println(text)

    for _, w := range warnings {
        fmt.Println("Warning:", w.Message)
    }
}
Extract with Options
text, warnings, err := tabula.Open("document.pdf").
    Pages(1, 2, 3).              // Specific pages
    ExcludeHeadersAndFooters().  // Remove repeating headers/footers
    JoinParagraphs().            // Join text into paragraphs
    Text()
Extract as Markdown
markdown, warnings, err := tabula.Open("document.pdf").
    ExcludeHeadersAndFooters().
    ToMarkdown()
RAG Chunking
package main

import (
    "fmt"
    "log"

    "github.com/tsawler/tabula"
)

func main() {
    chunks, warnings, err := tabula.Open("document.pdf").
        ExcludeHeadersAndFooters().
        Chunks()
    if err != nil {
        log.Fatal(err)
    }

    for i, chunk := range chunks.Chunks {
        fmt.Printf("Chunk %d: %s (p.%d-%d, ~%d tokens)\n",
            i+1,
            chunk.Metadata.SectionTitle,
            chunk.Metadata.PageStart,
            chunk.Metadata.PageEnd,
            chunk.Metadata.EstimatedTokens)
        fmt.Println(chunk.Text)
        fmt.Println("---")
    }
}
Chunks as Markdown (for Vector DBs)
chunks, _, err := tabula.Open("document.pdf").
    ExcludeHeadersAndFooters().
    Chunks()
if err != nil {
    log.Fatal(err)
}

// Get each chunk as separate markdown strings
mdChunks := chunks.ToMarkdownChunks()

for i, md := range mdChunks {
    // Store each chunk in your vector database.
    // embedModel and vectorDB are placeholders for your own clients.
    embedding := embedModel.Embed(md)
    vectorDB.Store(chunks.Chunks[i].ID, embedding, md)
}
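The loop above relies on an embedding client and a vector store that tabula does not provide. As a rough sketch of what those two pieces could look like, the following uses hypothetical `Embedder` and `VectorStore` interfaces, a toy embedder, and an in-memory store. All names are assumptions made for illustration, chunk IDs are assumed to be strings, and plain markdown strings stand in for the output of `ToMarkdownChunks()`.

```go
package main

import "fmt"

// Embedder and VectorStore are hypothetical interfaces, not part of tabula.
type Embedder interface {
	Embed(text string) []float32
}

type VectorStore interface {
	Store(id string, embedding []float32, text string)
}

// lengthEmbedder is a toy embedder: it returns a one-dimensional vector
// holding the byte length of the text. A real model would go here.
type lengthEmbedder struct{}

func (lengthEmbedder) Embed(text string) []float32 {
	return []float32{float32(len(text))}
}

// record is one stored entry in the in-memory store.
type record struct {
	Embedding []float32
	Text      string
}

// memoryStore keeps records in a map keyed by chunk ID.
type memoryStore struct {
	records map[string]record
}

func newMemoryStore() *memoryStore {
	return &memoryStore{records: make(map[string]record)}
}

func (s *memoryStore) Store(id string, embedding []float32, text string) {
	s.records[id] = record{Embedding: embedding, Text: text}
}

// indexChunks embeds each markdown chunk and stores it under its ID.
// ids and mdChunks must have the same length.
func indexChunks(e Embedder, s VectorStore, ids, mdChunks []string) error {
	if len(ids) != len(mdChunks) {
		return fmt.Errorf("got %d ids for %d chunks", len(ids), len(mdChunks))
	}
	for i, md := range mdChunks {
		s.Store(ids[i], e.Embed(md), md)
	}
	return nil
}

func main() {
	store := newMemoryStore()
	ids := []string{"chunk-1", "chunk-2"}
	md := []string{"# Intro\n\nHello.", "## Methods\n\nDetails."}
	if err := indexChunks(lengthEmbedder{}, store, ids, md); err != nil {
		fmt.Println("index error:", err)
		return
	}
	fmt.Println("stored:", len(store.records))
}
```

Keeping the embedder and the store behind small interfaces makes it easy to swap the toy versions for real clients without touching the indexing loop.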

API Reference

Opening a PDF
// From file path
ext := tabula.Open("document.pdf")

// From existing reader (the caller remains responsible for closing r)
r, err := reader.Open("document.pdf")
if err != nil {
    log.Fatal(err)
}
defer r.Close()
ext = tabula.FromReader(r)
Fluent Options
Method                      Description
Pages(1, 2, 3)              Extract specific pages (1-indexed)
PageRange(1, 10)            Extract page range (inclusive)
ExcludeHeaders()            Exclude detected headers
ExcludeFooters()            Exclude detected footers
ExcludeHeadersAndFooters()  Exclude both
JoinParagraphs()            Join text fragments into paragraphs
ByColumn()                  Process multi-column layouts column by column
PreserveLayout()            Maintain spatial positioning
Terminal Operations
Method                                Returns                 Description
Text()                                string                  Plain text content
ToMarkdown()                          string                  Markdown-formatted content
ToMarkdownWithOptions(opts)           string                  Markdown with custom options
Fragments()                           []text.TextFragment     Raw text fragments with positions
Lines()                               []layout.Line           Detected text lines
Paragraphs()                          []layout.Paragraph      Detected paragraphs
Headings()                            []layout.Heading        Detected headings (H1-H6)
Lists()                               []layout.List           Detected lists
Blocks()                              []layout.Block          Text blocks
Elements()                            []layout.LayoutElement  All elements in reading order
Document()                            *model.Document         Full document structure
Chunks()                              *rag.ChunkCollection    Semantic chunks for RAG
ChunksWithConfig(config, sizeConfig)  *rag.ChunkCollection    Chunks with custom sizing
Analyze()                             *layout.AnalysisResult  Complete layout analysis
PageCount()                           int                     Number of pages (does not close the reader)
Inspection Methods (non-terminal)
ext := tabula.Open("document.pdf")
defer ext.Close()

isCharLevel, _ := ext.IsCharacterLevel()  // Detect character-level PDFs
isMultiCol, _ := ext.IsMultiColumn()      // Detect multi-column layouts
pageCount, _ := ext.PageCount()           // Get page count

RAG Integration

Chunk Filtering
chunks, _, _ := tabula.Open("doc.pdf").Chunks()

// Filter by content type
tablesOnly := chunks.FilterWithTables()
listsOnly := chunks.FilterWithLists()

// Filter by location
section := chunks.FilterBySection("Introduction")
page5 := chunks.FilterByPage(5)
pages1to10 := chunks.FilterByPageRange(1, 10)

// Filter by size
smallChunks := chunks.FilterByMaxTokens(500)
largeChunks := chunks.FilterByMinTokens(100)

// Search
matches := chunks.Search("keyword")

// Chain filters
result := chunks.
    FilterBySection("Methods").
    FilterByMinTokens(100).
    Search("algorithm")
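Chaining works when every filter returns a new collection of the same type. The sketch below shows that pattern with simplified stand-in types. `Chunk`, `Collection`, their fields, and the case-insensitive search are assumptions for illustration and do not reproduce the `rag` package.

```go
package main

import (
	"fmt"
	"strings"
)

// Chunk and Collection are simplified stand-ins for the rag types;
// the field names here are assumptions chosen for the example.
type Chunk struct {
	Section string
	Tokens  int
	Text    string
}

type Collection struct {
	Chunks []Chunk
}

// filter returns a new Collection holding the chunks that satisfy keep.
// The receiver is left untouched, which is what makes chaining safe.
func (c *Collection) filter(keep func(Chunk) bool) *Collection {
	out := &Collection{}
	for _, ch := range c.Chunks {
		if keep(ch) {
			out.Chunks = append(out.Chunks, ch)
		}
	}
	return out
}

func (c *Collection) FilterBySection(name string) *Collection {
	return c.filter(func(ch Chunk) bool { return ch.Section == name })
}

func (c *Collection) FilterByMinTokens(n int) *Collection {
	return c.filter(func(ch Chunk) bool { return ch.Tokens >= n })
}

// Search matches case-insensitively on the chunk text.
func (c *Collection) Search(q string) *Collection {
	q = strings.ToLower(q)
	return c.filter(func(ch Chunk) bool {
		return strings.Contains(strings.ToLower(ch.Text), q)
	})
}

func main() {
	all := &Collection{Chunks: []Chunk{
		{Section: "Methods", Tokens: 150, Text: "The Algorithm uses dynamic programming."},
		{Section: "Methods", Tokens: 40, Text: "A short algorithm note."},
		{Section: "Results", Tokens: 200, Text: "The algorithm converged."},
	}}
	result := all.FilterBySection("Methods").FilterByMinTokens(100).Search("algorithm")
	fmt.Println(len(result.Chunks), "match;", len(all.Chunks), "chunks in the original")
}
```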
Markdown Options
import "github.com/tsawler/tabula/rag"

opts := rag.MarkdownOptions{
    IncludeMetadata:        true,   // YAML front matter
    IncludeTableOfContents: true,   // Generated TOC
    IncludeChunkSeparators: true,   // --- between chunks
    IncludePageNumbers:     true,   // Page references
    IncludeChunkIDs:        true,   // HTML comments with chunk IDs
}

markdown, _, _ := tabula.Open("doc.pdf").ToMarkdownWithOptions(opts)

// Or use the RAG preset instead of building the struct by hand
opts = rag.RAGOptimizedMarkdownOptions()
Custom Chunk Sizing
import "github.com/tsawler/tabula/rag"

config := rag.ChunkerConfig{
    TargetChunkSize: 500,   // Target characters per chunk
    MaxChunkSize:    1000,  // Maximum characters
    MinChunkSize:    100,   // Minimum characters
    OverlapSize:     50,    // Overlap between chunks
}
sizeConfig := rag.DefaultSizeConfig()

chunks, _, _ := tabula.Open("doc.pdf").ChunksWithConfig(config, sizeConfig)
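To build intuition for how a chunk size and an overlap interact, here is a minimal sliding-window sketch. It is not the library's chunker, which is described as semantic. The function `splitWithOverlap` and its behaviour are assumptions for illustration only, and sizes are counted in characters (runes).

```go
package main

import "fmt"

// splitWithOverlap cuts text into windows of at most size runes, where each
// window starts (size - overlap) runes after the previous one.
func splitWithOverlap(text string, size, overlap int) ([]string, error) {
	if size <= 0 || overlap < 0 || overlap >= size {
		return nil, fmt.Errorf("need size > 0 and 0 <= overlap < size, got size=%d overlap=%d", size, overlap)
	}
	runes := []rune(text)
	step := size - overlap
	var chunks []string
	for start := 0; start < len(runes); start += step {
		end := start + size
		if end > len(runes) {
			end = len(runes)
		}
		chunks = append(chunks, string(runes[start:end]))
		if end == len(runes) {
			break // the last window reached the end of the text
		}
	}
	return chunks, nil
}

func main() {
	chunks, err := splitWithOverlap("abcdefghij", 4, 1)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	for i, c := range chunks {
		fmt.Printf("chunk %d: %q\n", i+1, c)
	}
}
```

A larger overlap repeats more context between neighbouring chunks at the cost of more chunks overall.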

Working with Results

Chunk Metadata
for _, chunk := range chunks.Chunks {
    fmt.Println("ID:", chunk.ID)
    fmt.Println("Section:", chunk.Metadata.SectionTitle)
    fmt.Println("Pages:", chunk.Metadata.PageStart, "-", chunk.Metadata.PageEnd)
    fmt.Println("Words:", chunk.Metadata.WordCount)
    fmt.Println("Tokens:", chunk.Metadata.EstimatedTokens)
    fmt.Println("Has Table:", chunk.Metadata.HasTable)
    fmt.Println("Has List:", chunk.Metadata.HasList)
}
Collection Statistics
stats := chunks.Statistics()
fmt.Println("Total chunks:", stats.TotalChunks)
fmt.Println("Total words:", stats.TotalWords)
fmt.Println("Average tokens:", stats.AvgTokens)
fmt.Println("Chunks with tables:", stats.ChunksWithTables)

Warnings

The library returns warnings for non-fatal issues:

text, warnings, err := tabula.Open("document.pdf").Text()
if err != nil {
    log.Fatal(err)  // Fatal error
}

for _, w := range warnings {
    log.Println("Warning:", w.Message)  // Non-fatal issues
}

// Format all warnings
formatted := tabula.FormatWarnings(warnings)

Common warnings:

  • "Detected messy/display-oriented PDF traits" - PDF may have unusual text layout
  • High fragmentation warnings - Text is split into many small fragments

Error Handling Helpers

// Panic on error (for scripts/tests)
text := tabula.MustText(tabula.Open("doc.pdf").Text())
count := tabula.Must(tabula.Open("doc.pdf").PageCount())

Testing

go test ./...

License

MIT License

Documentation

Overview

Package tabula provides a fluent API for extracting text, tables, and other content from PDF files.

The file integration.go provides functions that integrate layout analysis with the Document and Page models.

Basic usage:

text, warnings, err := tabula.Open("document.pdf").Text()
if err != nil {
    // handle error
}
if len(warnings) > 0 {
    log.Println("Warnings:", tabula.FormatWarnings(warnings))
}

With options:

text, _, err := tabula.Open("report.pdf").
    Pages(1, 2, 3).
    ExcludeHeaders().
    ExcludeFooters().
    Text()

For advanced use cases, the lower-level reader package is also available.

Index

Constants

This section is empty.

Variables

This section is empty.

Functions

func AnalyzeDocument

func AnalyzeDocument(path string) (*model.Document, error)

AnalyzeDocument performs layout analysis on all pages of a document and populates the Layout field of each page. This enables access to detected headings, lists, paragraphs, columns, and other structural elements.

Example:

doc, err := tabula.AnalyzeDocument("document.pdf")
if err != nil {
    log.Fatal(err)
}
for _, page := range doc.Pages {
    fmt.Printf("Page %d: %d headings, %d paragraphs\n",
        page.Number, len(page.GetHeadings()), len(page.GetParagraphs()))
}

func AnalyzeDocumentWithConfig

func AnalyzeDocumentWithConfig(path string, config layout.AnalyzerConfig) (*model.Document, error)

AnalyzeDocumentWithConfig performs layout analysis on a document using a custom layout.AnalyzerConfig.

func FormatWarnings

func FormatWarnings(warnings []Warning) string

FormatWarnings returns a human-readable string of all warnings. Returns empty string if there are no warnings.
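The exact output format of FormatWarnings is not documented here. As an illustration of the stated contract (a readable string, empty when there are no warnings), a formatter could look like the sketch below. The local `Warning` type mirrors only the `Message` field, and `formatWarnings` with its numbered layout is a hypothetical stand-in.

```go
package main

import (
	"fmt"
	"strings"
)

// Warning mirrors only the Message field of tabula.Warning for this sketch.
type Warning struct {
	Message string
}

// formatWarnings is a hypothetical formatter: one numbered line per warning,
// and an empty string when there is nothing to report.
func formatWarnings(ws []Warning) string {
	if len(ws) == 0 {
		return ""
	}
	var b strings.Builder
	for i, w := range ws {
		if i > 0 {
			b.WriteByte('\n')
		}
		fmt.Fprintf(&b, "%d. %s", i+1, w.Message)
	}
	return b.String()
}

func main() {
	ws := []Warning{
		{Message: "Detected messy/display-oriented PDF traits"},
		{Message: "High fragmentation"},
	}
	if s := formatWarnings(ws); s != "" {
		fmt.Println(s)
	}
}
```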

func Must

func Must[T any](val T, err error) T

Must is a helper that wraps a call to a function returning (T, error) and panics if the error is non-nil. It is intended for use in scripts or tests where error handling would be cumbersome.

Example:

count := tabula.Must(tabula.Open("document.pdf").PageCount())

func MustText

func MustText[T any](val T, _ []Warning, err error) T

MustText is a helper that wraps a call to Text() or Fragments() and panics if the error is non-nil. It discards warnings and returns just the value. It is intended for use in scripts or tests where error handling would be cumbersome.

Example:

text := tabula.MustText(tabula.Open("document.pdf").Text())
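Helpers of this shape are short to write with generics. The sketch below reimplements both signatures locally to show the mechanism. The lowercase `must` and `mustText`, the local `Warning` type, and the three demo functions are illustrative stand-ins, not the library's source.

```go
package main

import (
	"errors"
	"fmt"
)

// Warning is a local stand-in for tabula.Warning.
type Warning struct {
	Message string
}

// must returns val, or panics if err is non-nil.
func must[T any](val T, err error) T {
	if err != nil {
		panic(err)
	}
	return val
}

// mustText does the same for three-value results and drops the warnings.
func mustText[T any](val T, _ []Warning, err error) T {
	if err != nil {
		panic(err)
	}
	return val
}

// The functions below stand in for PageCount() and Text() calls.
func pageCount() (int, error) { return 3, nil }

func text() (string, []Warning, error) {
	return "hello", []Warning{{Message: "messy"}}, nil
}

func broken() (string, []Warning, error) {
	return "", nil, errors.New("cannot open file")
}

func main() {
	fmt.Println(must(pageCount()))
	fmt.Println(mustText(text()))

	// A failing call panics; recover only to show the effect.
	defer func() {
		if r := recover(); r != nil {
			fmt.Println("recovered:", r)
		}
	}()
	mustText(broken())
}
```

Because the warnings are discarded, helpers like these fit scripts and tests better than production code, where warnings usually deserve at least a log line.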

func PopulatePageLayout

func PopulatePageLayout(page *model.Page, fragments []text.TextFragment)

PopulatePageLayout performs layout analysis on a single page and populates its Layout field.

func PopulatePageLayoutWithConfig

func PopulatePageLayoutWithConfig(page *model.Page, fragments []text.TextFragment, config layout.AnalyzerConfig)

PopulatePageLayoutWithConfig performs layout analysis on a single page using a custom layout.AnalyzerConfig.

Types

type ExtractOptions

type ExtractOptions struct {
	// contains filtered or unexported fields
}

ExtractOptions holds configuration for text extraction.

type Extractor

type Extractor struct {
	// contains filtered or unexported fields
}

Extractor provides a fluent interface for extracting content from PDFs. Each configuration method returns a new Extractor instance, making it safe for concurrent use and allowing method chaining.
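The "returns a new instance" design can be illustrated with a small copy-on-configure builder. This is a sketch of the general pattern, not tabula's implementation. `Builder`, `options`, and `clone` are hypothetical names.

```go
package main

import "fmt"

// options is a hypothetical configuration struct for this sketch.
type options struct {
	pages          []int
	excludeHeaders bool
}

// Builder is a stand-in for a fluent extractor; it is not tabula's Extractor.
type Builder struct {
	opts options
}

// clone copies the builder, including the pages slice, so that the copy
// shares no mutable state with the original.
func (b *Builder) clone() *Builder {
	c := *b
	c.opts.pages = append([]int(nil), b.opts.pages...)
	return &c
}

// Pages appends to the page list of a copy and returns the copy.
func (b *Builder) Pages(p ...int) *Builder {
	c := b.clone()
	c.opts.pages = append(c.opts.pages, p...)
	return c
}

// ExcludeHeaders sets the flag on a copy and returns the copy.
func (b *Builder) ExcludeHeaders() *Builder {
	c := b.clone()
	c.opts.excludeHeaders = true
	return c
}

func main() {
	base := &Builder{}
	first := base.Pages(1, 2)
	second := first.ExcludeHeaders().Pages(3)

	fmt.Println("base:  ", base.opts.pages, base.opts.excludeHeaders)
	fmt.Println("first: ", first.opts.pages, first.opts.excludeHeaders)
	fmt.Println("second:", second.opts.pages, second.opts.excludeHeaders)
}
```

Since no call mutates its receiver, a partially configured builder can be shared between goroutines and specialised independently in each.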

func FromReader

func FromReader(r *reader.Reader) *Extractor

FromReader creates an Extractor from an already-opened reader.Reader. This is useful when you need more control over the reader lifecycle. Note: The caller is responsible for closing the reader.

Example:

r, err := reader.Open("document.pdf")
if err != nil {
    // handle error
}
defer r.Close()
text, warnings, err := tabula.FromReader(r).Text()

func Open

func Open(filename string) *Extractor

Open opens a PDF file and returns an Extractor for fluent configuration. The returned Extractor must be closed when done, either explicitly via Close() or implicitly when calling a terminal operation like Text().

Example:

text, warnings, err := tabula.Open("document.pdf").Text()

func (*Extractor) Analyze

func (e *Extractor) Analyze() (*layout.AnalysisResult, error)

Analyze performs complete layout analysis and returns all detected elements. This is the most comprehensive extraction method, combining columns, lines, paragraphs, headings, lists, and reading order into a unified result. This is a terminal operation that closes the underlying reader.

Example:

result, err := tabula.Open("document.pdf").Pages(1).Analyze()
if err != nil {
    log.Fatal(err)
}
for _, elem := range result.Elements {
    fmt.Printf("[%s] %s\n", elem.Type, elem.Text)
}

func (*Extractor) Blocks

func (e *Extractor) Blocks() ([]layout.Block, error)

Blocks extracts and returns detected text blocks from the document. Blocks are spatially grouped regions of text, useful for understanding document layout structure. This is a terminal operation that closes the underlying reader.

Example:

blocks, err := tabula.Open("document.pdf").Blocks()
if err != nil {
    log.Fatal(err)
}
for _, block := range blocks {
    fmt.Printf("Block at (%.1f, %.1f): %s\n", block.BBox.X, block.BBox.Y, block.GetText())
}

func (*Extractor) ByColumn

func (e *Extractor) ByColumn() *Extractor

ByColumn configures the extractor to process text column by column in reading order, rather than line by line across the full page width. This is useful for multi-column documents like newspapers or academic papers.

Example:

text, _, err := tabula.Open("newspaper.pdf").ByColumn().Text()
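The difference between the two orderings can be shown with a toy model. In the sketch below the column edges are supplied by hand rather than detected, Y is assumed to grow downward, and `frag`, `columnOf`, and `readByColumn` are hypothetical names. The library's own column detection is not reproduced here.

```go
package main

import (
	"fmt"
	"sort"
)

// frag is a simplified text fragment. Y is assumed to grow downward.
type frag struct {
	X, Y float64
	Text string
}

// columnOf returns the index of the column containing x, given the left
// edge of each column in ascending order.
func columnOf(x float64, leftEdges []float64) int {
	col := 0
	for i, edge := range leftEdges {
		if x >= edge {
			col = i
		}
	}
	return col
}

// readByColumn orders fragments column first, then top to bottom.
func readByColumn(frags []frag, leftEdges []float64) []string {
	sorted := append([]frag(nil), frags...)
	sort.SliceStable(sorted, func(i, j int) bool {
		ci := columnOf(sorted[i].X, leftEdges)
		cj := columnOf(sorted[j].X, leftEdges)
		if ci != cj {
			return ci < cj
		}
		return sorted[i].Y < sorted[j].Y
	})
	out := make([]string, len(sorted))
	for i, f := range sorted {
		out[i] = f.Text
	}
	return out
}

func main() {
	// Fragments listed line by line across the page: L1 R1 L2 R2.
	frags := []frag{
		{X: 50, Y: 100, Text: "L1"},
		{X: 320, Y: 100, Text: "R1"},
		{X: 50, Y: 120, Text: "L2"},
		{X: 320, Y: 120, Text: "R2"},
	}
	fmt.Println(readByColumn(frags, []float64{0, 300}))
}
```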

func (*Extractor) Chunks

func (e *Extractor) Chunks() (*rag.ChunkCollection, []Warning, error)

Chunks extracts content and returns semantic chunks for RAG workflows. This method combines document extraction with RAG chunking in a single call. This is a terminal operation that closes the underlying reader.

Example:

chunks, warnings, err := tabula.Open("document.pdf").
    ExcludeHeadersAndFooters().
    Chunks()
if err != nil {
    log.Fatal(err)
}
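for _, chunk := range chunks.Chunks {
    // Truncate the preview safely: slicing [:50] directly panics on shorter text.
    preview := chunk.Text
    if len(preview) > 50 {
        preview = preview[:50]
    }
    fmt.Printf("[%s] %s\n", chunk.Metadata.SectionTitle, preview)
}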
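When printing a short preview of chunk text, note that slicing a Go string by byte index panics if the string is shorter than the index and can cut a multi-byte character in half. A small rune-aware helper avoids both problems. The function name `preview` is an illustrative choice.

```go
package main

import "fmt"

// preview returns at most n runes of s, appending "..." when s was cut.
func preview(s string, n int) string {
	if n <= 0 {
		return ""
	}
	runes := []rune(s)
	if len(runes) <= n {
		return s
	}
	return string(runes[:n]) + "..."
}

func main() {
	fmt.Println(preview("héllo wörld", 5))
	fmt.Println(preview("short", 50))
}
```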

func (*Extractor) ChunksWithConfig

func (e *Extractor) ChunksWithConfig(config rag.ChunkerConfig, sizeConfig rag.SizeConfig) (*rag.ChunkCollection, []Warning, error)

ChunksWithConfig extracts content and returns semantic chunks using custom configuration. This allows fine-tuning of chunk sizes, overlap, and other parameters. This is a terminal operation that closes the underlying reader.

Example:

config := rag.ChunkerConfig{
    TargetChunkSize: 500,
    MaxChunkSize:    1000,
    OverlapSize:     50,
}
sizeConfig := rag.DefaultSizeConfig()
chunks, warnings, err := tabula.Open("document.pdf").
    ExcludeHeadersAndFooters().
    ChunksWithConfig(config, sizeConfig)

func (*Extractor) Close

func (e *Extractor) Close() error

Close releases resources associated with the Extractor. It is safe to call Close multiple times.

func (*Extractor) Document

func (e *Extractor) Document() (*model.Document, []Warning, error)

Document extracts content and returns a model.Document structure suitable for RAG chunking and other document processing workflows. This is a terminal operation that closes the underlying reader.

Example:

doc, warnings, err := tabula.Open("document.pdf").
    ExcludeHeadersAndFooters().
    Document()
if err != nil {
    log.Fatal(err)
}
// Use doc for chunking or other processing

func (*Extractor) Elements

func (e *Extractor) Elements() ([]layout.LayoutElement, error)

Elements extracts and returns all detected elements in reading order. Elements include paragraphs, headings, and lists, unified into a single ordered list. This is useful for document reconstruction or RAG workflows. This is a terminal operation that closes the underlying reader.

Example:

elements, err := tabula.Open("document.pdf").Elements()
if err != nil {
    log.Fatal(err)
}
for _, elem := range elements {
    fmt.Printf("[%s] %s\n", elem.Type, elem.Text)
}

func (*Extractor) ExcludeFooters

func (e *Extractor) ExcludeFooters() *Extractor

ExcludeFooters configures the extractor to exclude detected footers.

Example:

text, _, err := tabula.Open("doc.pdf").ExcludeFooters().Text()

func (*Extractor) ExcludeHeaders

func (e *Extractor) ExcludeHeaders() *Extractor

ExcludeHeaders configures the extractor to exclude detected headers.

Example:

text, _, err := tabula.Open("doc.pdf").ExcludeHeaders().Text()

func (*Extractor) ExcludeHeadersAndFooters

func (e *Extractor) ExcludeHeadersAndFooters() *Extractor

ExcludeHeadersAndFooters configures the extractor to exclude both detected headers and footers. This is a convenience method equivalent to calling ExcludeHeaders().ExcludeFooters().

Example:

text, _, err := tabula.Open("doc.pdf").ExcludeHeadersAndFooters().Text()
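How the library detects headers and footers is not described here. One common heuristic, shown below purely as an assumption, is to look for lines that repeat at the top or bottom of several pages after masking digits so that "Page 1" and "Page 2" compare equal. `normalize` and `stripRepeated` are hypothetical helpers.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// normalize trims the line and replaces each run of digits with "#",
// so page numbers do not prevent a match.
func normalize(line string) string {
	var b strings.Builder
	prevDigit := false
	for _, r := range strings.TrimSpace(line) {
		if unicode.IsDigit(r) {
			if !prevDigit {
				b.WriteByte('#')
			}
			prevDigit = true
			continue
		}
		prevDigit = false
		b.WriteRune(r)
	}
	return b.String()
}

// stripRepeated removes first and last lines whose normalized form occurs
// as a first or last line on at least minPages pages.
func stripRepeated(pages [][]string, minPages int) [][]string {
	counts := make(map[string]int)
	for _, lines := range pages {
		if len(lines) == 0 {
			continue
		}
		// Use a set so a page counts each distinct edge line once.
		keys := map[string]bool{
			normalize(lines[0]):            true,
			normalize(lines[len(lines)-1]): true,
		}
		for k := range keys {
			counts[k]++
		}
	}
	out := make([][]string, len(pages))
	for i, lines := range pages {
		for j, line := range lines {
			edge := j == 0 || j == len(lines)-1
			if edge && counts[normalize(line)] >= minPages {
				continue
			}
			out[i] = append(out[i], line)
		}
	}
	return out
}

func main() {
	pages := [][]string{
		{"ACME Report", "Intro text", "Page 1"},
		{"ACME Report", "More text", "Page 2"},
		{"ACME Report", "Final text", "Page 3"},
	}
	for i, lines := range stripRepeated(pages, 2) {
		fmt.Printf("page %d: %v\n", i+1, lines)
	}
}
```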

func (*Extractor) Fragments

func (e *Extractor) Fragments() ([]text.TextFragment, []Warning, error)

Fragments extracts and returns text fragments with position information. This is a terminal operation that closes the underlying reader.

Returns the fragments, any warnings encountered during processing, and an error if extraction failed.

Example:

fragments, warnings, err := tabula.Open("document.pdf").Pages(1).Fragments()

func (*Extractor) Headings

func (e *Extractor) Headings() ([]layout.Heading, error)

Headings extracts and returns detected headings (H1-H6) from the document. This is a terminal operation that closes the underlying reader.

Example:

headings, err := tabula.Open("document.pdf").Headings()
if err != nil {
    log.Fatal(err)
}
for _, h := range headings {
    fmt.Printf("[%v] %s\n", h.Level, h.Text)
}

func (*Extractor) IsCharacterLevel

func (e *Extractor) IsCharacterLevel() (bool, error)

IsCharacterLevel checks whether the first page of the PDF uses character-level text fragments (one character per fragment). Such PDFs require special handling for proper text extraction. Note: This reads page 1 to make the determination. The reader remains open.

Example:

ext := tabula.Open("document.pdf")
defer ext.Close()
isCharLevel, err := ext.IsCharacterLevel()
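The detection rule itself is not documented here. A plausible heuristic, offered only as an assumption, is to measure the share of fragments that contain exactly one character. The function `looksCharacterLevel` and the 0.8 threshold used in the example are hypothetical.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// looksCharacterLevel reports whether at least threshold (0..1) of the
// fragments consist of a single rune. Empty input yields false.
func looksCharacterLevel(frags []string, threshold float64) bool {
	if len(frags) == 0 {
		return false
	}
	single := 0
	for _, f := range frags {
		if utf8.RuneCountInString(f) == 1 {
			single++
		}
	}
	return float64(single)/float64(len(frags)) >= threshold
}

func main() {
	fmt.Println(looksCharacterLevel([]string{"H", "e", "l", "l", "o"}, 0.8))
	fmt.Println(looksCharacterLevel([]string{"Hello", "world"}, 0.8))
}
```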

func (*Extractor) IsMultiColumn

func (e *Extractor) IsMultiColumn() (bool, error)

IsMultiColumn checks if the first page of the PDF appears to have a multi-column layout. Note: This reads page 1 to make the determination. The reader remains open.

Example:

ext := tabula.Open("newspaper.pdf")
defer ext.Close()
multiCol, err := ext.IsMultiColumn()

func (*Extractor) JoinParagraphs

func (e *Extractor) JoinParagraphs() *Extractor

JoinParagraphs configures the extractor to join lines within paragraphs using spaces instead of newlines. This produces cleaner text output where paragraph breaks are preserved but soft line breaks within paragraphs are removed.

Example:

text, _, err := tabula.Open("doc.pdf").JoinParagraphs().Text()

// Combined with header and footer removal:
text, _, err = tabula.Open("doc.pdf").ExcludeHeadersAndFooters().JoinParagraphs().Text()
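The effect of joining soft line breaks can be reproduced on plain text. The sketch below treats blank lines as paragraph breaks, which is a simplifying assumption: the library works from layout analysis rather than from blank lines. `joinParagraphs` is a hypothetical helper.

```go
package main

import (
	"fmt"
	"strings"
)

// joinParagraphs treats blank lines as paragraph breaks and joins the lines
// inside each paragraph with single spaces.
func joinParagraphs(text string) string {
	var paragraphs []string
	var current []string
	flush := func() {
		if len(current) > 0 {
			paragraphs = append(paragraphs, strings.Join(current, " "))
			current = nil
		}
	}
	for _, line := range strings.Split(text, "\n") {
		line = strings.TrimSpace(line)
		if line == "" {
			flush() // a blank line ends the current paragraph
			continue
		}
		current = append(current, line)
	}
	flush()
	return strings.Join(paragraphs, "\n\n")
}

func main() {
	in := "The quick\nbrown fox.\n\nSecond\nparagraph."
	fmt.Println(joinParagraphs(in))
}
```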

func (*Extractor) Lines

func (e *Extractor) Lines() ([]layout.Line, error)

Lines extracts and returns detected text lines with position and alignment info. This is a terminal operation that closes the underlying reader.

Example:

lines, err := tabula.Open("document.pdf").Lines()
if err != nil {
    log.Fatal(err)
}
for _, line := range lines {
    fmt.Printf("%s (align: %s)\n", line.Text, line.Alignment)
}

func (*Extractor) Lists

func (e *Extractor) Lists() ([]layout.List, error)

Lists extracts and returns detected lists (bulleted, numbered, etc.) from the document. This is a terminal operation that closes the underlying reader.

Example:

lists, err := tabula.Open("document.pdf").Lists()
if err != nil {
    log.Fatal(err)
}
for _, list := range lists {
    fmt.Printf("List type: %s, items: %d\n", list.Type, len(list.Items))
}

func (*Extractor) PageCount

func (e *Extractor) PageCount() (int, error)

PageCount returns the total number of pages in the PDF. Note: This does NOT close the reader, allowing further operations.

Example:

ext := tabula.Open("document.pdf")
defer ext.Close()
count, err := ext.PageCount()

func (*Extractor) PageRange

func (e *Extractor) PageRange(start, end int) *Extractor

PageRange specifies a range of pages to extract (1-indexed, inclusive).

Example:

text, _, err := tabula.Open("doc.pdf").PageRange(5, 10).Text()

func (*Extractor) Pages

func (e *Extractor) Pages(pages ...int) *Extractor

Pages specifies which pages to extract from (1-indexed). Multiple calls are cumulative.

Example:

text, _, err := tabula.Open("doc.pdf").Pages(1, 3, 5).Text()
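Since both PageRange and Pages use 1-indexed page numbers, it helps to be explicit about the conversion to 0-indexed positions. The sketch below expands an inclusive range and validates a cumulative page list. How the library treats duplicates or out-of-range pages is not stated, so the choices here (drop duplicates, return an error) are assumptions, and `pageRange` and `toIndexes` are hypothetical helpers.

```go
package main

import "fmt"

// pageRange expands an inclusive, 1-indexed range into a page list.
func pageRange(start, end int) ([]int, error) {
	if start < 1 || end < start {
		return nil, fmt.Errorf("invalid page range %d..%d", start, end)
	}
	pages := make([]int, 0, end-start+1)
	for p := start; p <= end; p++ {
		pages = append(pages, p)
	}
	return pages, nil
}

// toIndexes validates 1-indexed page numbers against pageCount and converts
// them to 0-indexed positions, dropping duplicates but keeping order.
func toIndexes(pages []int, pageCount int) ([]int, error) {
	seen := make(map[int]bool)
	var idx []int
	for _, p := range pages {
		if p < 1 || p > pageCount {
			return nil, fmt.Errorf("page %d out of range 1..%d", p, pageCount)
		}
		if !seen[p] {
			seen[p] = true
			idx = append(idx, p-1)
		}
	}
	return idx, nil
}

func main() {
	// Two cumulative selections, as with Pages(1, 3) followed by Pages(5).
	selected := append([]int{1, 3}, 5)
	idx, err := toIndexes(selected, 10)
	fmt.Println(idx, err)

	r, err := pageRange(5, 10)
	fmt.Println(r, err)
}
```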

func (*Extractor) Paragraphs

func (e *Extractor) Paragraphs() ([]layout.Paragraph, error)

Paragraphs extracts and returns detected paragraphs with style information. This uses reading order detection to handle multi-column layouts correctly. This is a terminal operation that closes the underlying reader.

Example:

paragraphs, err := tabula.Open("document.pdf").
    ExcludeHeaders().
    ExcludeFooters().
    Paragraphs()
if err != nil {
    log.Fatal(err)
}
for _, para := range paragraphs {
    fmt.Printf("[%s] %s\n", para.Style, para.Text)
}

func (*Extractor) PreserveLayout

func (e *Extractor) PreserveLayout() *Extractor

PreserveLayout maintains spatial positioning by inserting spaces to approximate the visual layout of the original document.

Example:

text, _, err := tabula.Open("form.pdf").PreserveLayout().Text()
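Approximating a visual layout with spaces comes down to mapping horizontal positions onto character columns. The sketch below does this for one line of fragments using a fixed average character width. The `fragment` type, the `layoutLine` function, and the fixed-width assumption are illustrative and not taken from the library.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
	"unicode/utf8"
)

// fragment is a piece of text with the X position of its left edge.
type fragment struct {
	X    float64
	Text string
}

// layoutLine places fragments on one text line, padding with spaces so that
// each fragment starts near column X/charWidth. When a fragment would
// overlap the text already written, a single space separates them instead.
func layoutLine(frags []fragment, charWidth float64) string {
	if charWidth <= 0 {
		charWidth = 1 // avoid division by zero; treat X as a column index
	}
	sorted := append([]fragment(nil), frags...)
	sort.SliceStable(sorted, func(i, j int) bool { return sorted[i].X < sorted[j].X })

	var b strings.Builder
	written := 0 // runes written so far
	for _, f := range sorted {
		col := int(f.X / charWidth)
		switch {
		case col > written:
			b.WriteString(strings.Repeat(" ", col-written))
			written = col
		case written > 0:
			b.WriteByte(' ')
			written++
		}
		b.WriteString(f.Text)
		written += utf8.RuneCountInString(f.Text)
	}
	return b.String()
}

func main() {
	line := layoutLine([]fragment{{X: 60, Text: "Total"}, {X: 0, Text: "Name"}}, 6)
	fmt.Printf("%q\n", line)
}
```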

func (*Extractor) ReadingOrder

func (e *Extractor) ReadingOrder() (*layout.ReadingOrderResult, error)

ReadingOrder extracts and returns detailed reading order analysis. This includes column detection, section boundaries, and proper text ordering for multi-column documents. This is a terminal operation that closes the underlying reader.

Example:

ro, err := tabula.Open("newspaper.pdf").Pages(1).ReadingOrder()
if err != nil {
    log.Fatal(err)
}
fmt.Printf("Columns: %d\n", ro.ColumnCount)
for _, section := range ro.Sections {
    fmt.Printf("Section: %s\n", section.Type)
}

func (*Extractor) Text

func (e *Extractor) Text() (string, []Warning, error)

Text extracts and returns the text content from the configured pages. This is a terminal operation that closes the underlying reader.

Returns the extracted text, any warnings encountered during processing, and an error if extraction failed. Warnings indicate non-fatal issues (e.g., messy PDF detected) where extraction succeeded but results may be imperfect.

Example:

text, warnings, err := tabula.Open("document.pdf").Text()
if err != nil {
    log.Fatal(err)
}
if len(warnings) > 0 {
    log.Println("Warnings:", tabula.FormatWarnings(warnings))
}
fmt.Println(text)

func (*Extractor) ToMarkdown

func (e *Extractor) ToMarkdown() (string, []Warning, error)

ToMarkdown extracts content and returns it as a markdown-formatted string. This preserves document structure including headings, paragraphs, and lists. This is a terminal operation that closes the underlying reader.

Returns the markdown text, any warnings encountered during processing, and an error if extraction failed.

Example:

md, warnings, err := tabula.Open("document.pdf").
    ExcludeHeadersAndFooters().
    ToMarkdown()

func (*Extractor) ToMarkdownWithOptions

func (e *Extractor) ToMarkdownWithOptions(opts rag.MarkdownOptions) (string, []Warning, error)

ToMarkdownWithOptions extracts content and returns it as markdown with custom options. This is a terminal operation that closes the underlying reader.

Example:

opts := rag.MarkdownOptions{
    IncludeTableOfContents: true,
    IncludePageNumbers:     true,
}
md, warnings, err := tabula.Open("document.pdf").ToMarkdownWithOptions(opts)

type Warning

type Warning struct {
	Code    WarningCode
	Message string
}

Warning represents a non-fatal issue encountered during PDF processing. Unlike errors, warnings indicate that extraction succeeded but the results may be imperfect or require attention.

func (Warning) String

func (w Warning) String() string

String returns the warning message.

type WarningCode

type WarningCode int

WarningCode identifies the type of warning encountered during PDF processing.

const (
	// WarningMessyPDF indicates the PDF exhibits traits of being "messy" or
	// display-oriented (e.g., generated by Word, Quartz, or highly fragmented).
	// Text extraction may still succeed but results might have ordering issues.
	WarningMessyPDF WarningCode = iota
)
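Because each warning carries a code as well as a message, callers can branch on the code instead of matching message text. The sketch below mirrors the two types locally so that it runs on its own. `hasWarning` is a hypothetical helper, and what to do when it returns true (for example, flagging the document for manual review) is left to the caller.

```go
package main

import "fmt"

// WarningCode and Warning mirror the declarations above for this sketch.
type WarningCode int

const (
	WarningMessyPDF WarningCode = iota
)

type Warning struct {
	Code    WarningCode
	Message string
}

// hasWarning reports whether any warning carries the given code.
func hasWarning(ws []Warning, code WarningCode) bool {
	for _, w := range ws {
		if w.Code == code {
			return true
		}
	}
	return false
}

func main() {
	warnings := []Warning{
		{Code: WarningMessyPDF, Message: "Detected messy/display-oriented PDF traits"},
	}
	if hasWarning(warnings, WarningMessyPDF) {
		fmt.Println("extraction succeeded, but text ordering may need review")
	}
}
```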

Directories

Path Synopsis
internal
Package layout provides document layout analysis including the unified Layout Analyzer that orchestrates all detection components.
Package rag provides semantic chunking for RAG (Retrieval-Augmented Generation) workflows.
