tabula

package module
v1.3.0 Latest Latest
Warning

This package is not in the latest version of its module.

Go to latest
Published: Nov 29, 2025 License: MIT Imports: 15 Imported by: 0

README

Built with GoLang Version License Go Report Card

Tabula

A pure-Go text extraction library with a fluent API, designed for RAG (Retrieval-Augmented Generation) workflows. Supports PDF, DOCX, ODT, XLSX, and PPTX files.

Features

  • Fluent API - Chain methods for clean, readable code
  • Multi-Format Support - PDF (.pdf), Word (.docx), OpenDocument (.odt), Excel (.xlsx), and PowerPoint (.pptx) files
  • Layout Analysis - Detect headings, paragraphs, lists, and tables
  • Header/Footer Detection - Automatically identify and exclude repeating content
  • RAG-Ready Chunking - Semantic document chunking with metadata
  • Markdown Export - Convert extracted content to markdown
  • PDF 1.0-1.7 Support - Including modern XRef streams (PDF 1.5+)
  • Pure Go - No CGO dependencies

Installation

go get github.com/tsawler/tabula

Quick Start

Extract Text
package main

import (
    "fmt"
    "log"

    "github.com/tsawler/tabula"
)

func main() {
    // Works with PDF, DOCX, ODT, XLSX, and PPTX files
    text, warnings, err := tabula.Open("document.pdf").Text()
    // text, warnings, err := tabula.Open("document.docx").Text()
    // text, warnings, err := tabula.Open("document.odt").Text()
    // text, warnings, err := tabula.Open("spreadsheet.xlsx").Text()
    // text, warnings, err := tabula.Open("presentation.pptx").Text()
    if err != nil {
        log.Fatal(err)
    }

    fmt.Println(text)

    for _, w := range warnings {
        fmt.Println("Warning:", w.Message)
    }
}
Extract with Options
// PDF with all options
text, warnings, err := tabula.Open("document.pdf").
    Pages(1, 2, 3).              // Specific pages (PDF only)
    ExcludeHeadersAndFooters().  // Remove headers/footers
    JoinParagraphs().            // Join text into paragraphs (PDF only)
    Text()

// DOCX/ODT/XLSX/PPTX with header/footer exclusion
text, warnings, err := tabula.Open("document.docx").
    ExcludeHeadersAndFooters().  // Remove headers/footers
    Text()

// XLSX (each sheet extracted as tab-separated values)
text, warnings, err := tabula.Open("spreadsheet.xlsx").Text()

// PPTX (each slide extracted with title and content)
text, warnings, err := tabula.Open("presentation.pptx").
    ExcludeHeadersAndFooters().  // Remove slide footers and numbers
    Text()
Extract as Markdown
// PDF with header/footer exclusion
markdown, warnings, err := tabula.Open("document.pdf").
    ExcludeHeadersAndFooters().
    ToMarkdown()

// DOCX with header/footer exclusion (preserves headings, lists, tables)
markdown, warnings, err := tabula.Open("document.docx").
    ExcludeHeadersAndFooters().
    ToMarkdown()

// ODT with header/footer exclusion (preserves headings, lists, tables)
markdown, warnings, err := tabula.Open("document.odt").
    ExcludeHeadersAndFooters().
    ToMarkdown()

// XLSX (each sheet as a markdown table)
markdown, warnings, err := tabula.Open("spreadsheet.xlsx").ToMarkdown()

// PPTX (each slide with title as heading, content, and tables)
markdown, warnings, err := tabula.Open("presentation.pptx").
    ExcludeHeadersAndFooters().
    ToMarkdown()
RAG Chunking
package main

import (
    "fmt"
    "log"

    "github.com/tsawler/tabula"
)

func main() {
    // Works with PDF, DOCX, ODT, XLSX, and PPTX
    chunks, warnings, err := tabula.Open("document.pdf").Chunks()
    // chunks, warnings, err := tabula.Open("document.docx").Chunks()
    // chunks, warnings, err := tabula.Open("document.odt").Chunks()
    // chunks, warnings, err := tabula.Open("spreadsheet.xlsx").Chunks()
    // chunks, warnings, err := tabula.Open("presentation.pptx").Chunks()
    if err != nil {
        log.Fatal(err)
    }

    for i, chunk := range chunks.Chunks {
        fmt.Printf("Chunk %d: %s (p.%d-%d, ~%d tokens)\n",
            i+1,
            chunk.Metadata.SectionTitle,
            chunk.Metadata.PageStart,
            chunk.Metadata.PageEnd,
            chunk.Metadata.EstimatedTokens)
        fmt.Println(chunk.Text)
        fmt.Println("---")
    }

    // Warnings are non-fatal issues
    for _, w := range warnings {
        fmt.Println("Warning:", w.Message)
    }
}
Chunks as Markdown (for Vector DBs)
chunks, _, err := tabula.Open("document.pdf").
    ExcludeHeadersAndFooters().
    Chunks()
if err != nil {
    log.Fatal(err)
}

// Get each chunk as separate markdown strings
mdChunks := chunks.ToMarkdownChunks()

for i, md := range mdChunks {
    // Store each chunk in your vector database
    embedding := embedModel.Embed(md)
    vectorDB.Store(chunks.Chunks[i].ID, embedding, md)
}

API Reference

Opening Documents
// From file path (format auto-detected by extension)
ext := tabula.Open("document.pdf")
ext := tabula.Open("document.docx")
ext := tabula.Open("document.odt")
ext := tabula.Open("spreadsheet.xlsx")
ext := tabula.Open("presentation.pptx")

// From existing PDF reader (PDF only)
r, _ := reader.Open("document.pdf")
ext := tabula.FromReader(r)
Fluent Options
Method Description Formats
Pages(1, 2, 3) Extract specific pages (1-indexed) PDF
PageRange(1, 10) Extract page range (inclusive) PDF
ExcludeHeaders() Exclude detected headers PDF, DOCX, ODT, XLSX, PPTX
ExcludeFooters() Exclude detected footers PDF, DOCX, ODT, XLSX, PPTX
ExcludeHeadersAndFooters() Exclude both PDF, DOCX, ODT, XLSX, PPTX
JoinParagraphs() Join text fragments into paragraphs PDF
ByColumn() Process multi-column layouts column by column PDF
PreserveLayout() Maintain spatial positioning PDF
Terminal Operations
Method Returns Description Formats
Text() string Plain text content PDF, DOCX, ODT, XLSX, PPTX
ToMarkdown() string Markdown-formatted content PDF, DOCX, ODT, XLSX, PPTX
ToMarkdownWithOptions(opts) string Markdown with custom options PDF, DOCX, ODT, XLSX, PPTX
Document() *model.Document Full document structure PDF, DOCX, ODT, XLSX, PPTX
Chunks() *rag.ChunkCollection Semantic chunks for RAG PDF, DOCX, ODT, XLSX, PPTX
ChunksWithConfig(config, sizeConfig) *rag.ChunkCollection Chunks with custom sizing PDF, DOCX, ODT, XLSX, PPTX
PageCount() int Number of pages/sheets/slides PDF, DOCX, ODT, XLSX, PPTX
Fragments() []text.TextFragment Raw text fragments with positions PDF
Lines() []layout.Line Detected text lines PDF
Paragraphs() []layout.Paragraph Detected paragraphs PDF
Headings() []layout.Heading Detected headings (H1-H6) PDF
Lists() []layout.List Detected lists PDF
Blocks() []layout.Block Text blocks PDF
Elements() []layout.LayoutElement All elements in reading order PDF
Analyze() *layout.AnalysisResult Complete layout analysis PDF

Note on PDF-only methods: The methods marked "PDF" in the tables above (Pages, PageRange, JoinParagraphs, ByColumn, PreserveLayout, Fragments, Lines, Paragraphs, Headings, Lists, Blocks, Elements, Analyze) exist because PDFs lack semantic structure - they store raw text fragments at arbitrary positions, requiring layout analysis to reconstruct document structure. DOCX, ODT, XLSX, and PPTX files already contain explicit semantic markup in their XML, so these detection methods aren't needed. Use Document() to access the semantic structure for all formats.

Note on XLSX: For Excel files, each sheet becomes a page, and the sheet data is represented as a table element. PageCount() returns the number of sheets. Text() returns tab-separated values, while ToMarkdown() formats each sheet as a markdown table.

Note on PPTX: For PowerPoint files, each slide becomes a page. PageCount() returns the number of slides. Slide titles are extracted as headings, bullet points as lists, and tables are preserved. Use ExcludeHeadersAndFooters() to remove slide footers, dates, and slide numbers.

Inspection Methods (non-terminal, PDF only)
ext := tabula.Open("document.pdf")
defer ext.Close()

isCharLevel, _ := ext.IsCharacterLevel()  // Detect character-level PDFs
isMultiCol, _ := ext.IsMultiColumn()      // Detect multi-column layouts
pageCount, _ := ext.PageCount()           // Get page count (works with DOCX and ODT too)

RAG Integration

Chunk Filtering
chunks, _, _ := tabula.Open("doc.pdf").Chunks()

// Filter by content type
tablesOnly := chunks.FilterWithTables()
listsOnly := chunks.FilterWithLists()

// Filter by location
section := chunks.FilterBySection("Introduction")
page5 := chunks.FilterByPage(5)
pages1to10 := chunks.FilterByPageRange(1, 10)

// Filter by size
smallChunks := chunks.FilterByMaxTokens(500)
largeChunks := chunks.FilterByMinTokens(100)

// Search
matches := chunks.Search("keyword")

// Chain filters
result := chunks.
    FilterBySection("Methods").
    FilterByMinTokens(100).
    Search("algorithm")
Markdown Options
import "github.com/tsawler/tabula/rag"

// Options supported by all formats (PDF, DOCX, ODT, XLSX, PPTX)
opts := rag.MarkdownOptions{
    IncludeMetadata:        true,   // YAML front matter with document metadata
    IncludeTableOfContents: true,   // Generated TOC from headings
    HeadingLevelOffset:     0,      // Adjust heading levels (1 makes H1 -> H2)
    MaxHeadingLevel:        6,      // Cap heading depth
}

// Works with all formats
markdown, _, _ := tabula.Open("doc.pdf").ToMarkdownWithOptions(opts)
markdown, _, _ := tabula.Open("doc.docx").ToMarkdownWithOptions(opts)
markdown, _, _ := tabula.Open("doc.odt").ToMarkdownWithOptions(opts)
markdown, _, _ := tabula.Open("spreadsheet.xlsx").ToMarkdownWithOptions(opts)
markdown, _, _ := tabula.Open("presentation.pptx").ToMarkdownWithOptions(opts)

// PDF-only options (used via RAG chunking pipeline)
pdfOpts := rag.MarkdownOptions{
    IncludeMetadata:        true,
    IncludeChunkSeparators: true,   // --- between chunks (PDF only)
    IncludePageNumbers:     true,   // Page references (PDF only)
    IncludeChunkIDs:        true,   // HTML comments with chunk IDs (PDF only)
}

// Or use preset for RAG
opts := rag.RAGOptimizedMarkdownOptions()
Custom Chunk Sizing
import "github.com/tsawler/tabula/rag"

config := rag.ChunkerConfig{
    TargetChunkSize: 500,   // Target characters per chunk
    MaxChunkSize:    1000,  // Maximum characters
    MinChunkSize:    100,   // Minimum characters
    OverlapSize:     50,    // Overlap between chunks
}
sizeConfig := rag.DefaultSizeConfig()

chunks, _, _ := tabula.Open("doc.pdf").ChunksWithConfig(config, sizeConfig)

Working with Results

Chunk Metadata
for _, chunk := range chunks.Chunks {
    fmt.Println("ID:", chunk.ID)
    fmt.Println("Section:", chunk.Metadata.SectionTitle)
    fmt.Println("Pages:", chunk.Metadata.PageStart, "-", chunk.Metadata.PageEnd)
    fmt.Println("Words:", chunk.Metadata.WordCount)
    fmt.Println("Tokens:", chunk.Metadata.EstimatedTokens)
    fmt.Println("Has Table:", chunk.Metadata.HasTable)
    fmt.Println("Has List:", chunk.Metadata.HasList)
}
Collection Statistics
stats := chunks.Statistics()
fmt.Println("Total chunks:", stats.TotalChunks)
fmt.Println("Total words:", stats.TotalWords)
fmt.Println("Average tokens:", stats.AvgTokens)
fmt.Println("Chunks with tables:", stats.ChunksWithTables)

Warnings

The library returns warnings for non-fatal issues:

text, warnings, err := tabula.Open("document.pdf").Text()
if err != nil {
    log.Fatal(err)  // Fatal error
}

for _, w := range warnings {
    log.Println("Warning:", w.Message)  // Non-fatal issues
}

// Format all warnings
formatted := tabula.FormatWarnings(warnings)

Common warnings:

  • "Detected messy/display-oriented PDF traits" - PDF may have unusual text layout
  • High fragmentation warnings - Text is split into many small fragments

Error Handling Helpers

// Panic on error (for scripts/tests)
text := tabula.MustText(tabula.Open("doc.pdf").Text())
count := tabula.Must(tabula.Open("doc.pdf").PageCount())

Testing

go test ./...

License

MIT License

Documentation

Overview

integration.go provides methods to integrate layout analysis with the Document/Page models

Package tabula provides a fluent API for extracting text, tables, and other content from PDF and DOCX files.

Basic usage:

text, warnings, err := tabula.Open("document.pdf").Text()
if err != nil {
    // handle error
}
if len(warnings) > 0 {
    log.Println("Warnings:", tabula.FormatWarnings(warnings))
}

DOCX files work the same way:

text, warnings, err := tabula.Open("document.docx").Text()

With options:

text, _, err := tabula.Open("report.pdf").
    Pages(1, 2, 3).
    ExcludeHeaders().
    ExcludeFooters().
    Text()

For advanced use cases, the lower-level reader package is also available.

Example (ChunkFiltering)
package main

import (
	"github.com/tsawler/tabula"
)

func main() {
	chunks, _, _ := tabula.Open("doc.pdf").Chunks()

	// Filter by content type
	tablesOnly := chunks.FilterWithTables()
	listsOnly := chunks.FilterWithLists()
	_ = tablesOnly
	_ = listsOnly

	// Filter by location
	section := chunks.FilterBySection("Introduction")
	page5 := chunks.FilterByPage(5)
	pages1to10 := chunks.FilterByPageRange(1, 10)
	_ = section
	_ = page5
	_ = pages1to10

	// Filter by size
	smallChunks := chunks.FilterByMaxTokens(500)
	largeChunks := chunks.FilterByMinTokens(100)
	_ = smallChunks
	_ = largeChunks

	// Search
	matches := chunks.Search("keyword")
	_ = matches

	// Chain filters
	result := chunks.
		FilterBySection("Methods").
		FilterByMinTokens(100).
		Search("algorithm")
	_ = result
}
Example (ChunkMetadata)
package main

import (
	"fmt"

	"github.com/tsawler/tabula"
)

func main() {
	chunks, _, _ := tabula.Open("doc.pdf").Chunks()

	for _, chunk := range chunks.Chunks {
		fmt.Println("ID:", chunk.ID)
		fmt.Println("Section:", chunk.Metadata.SectionTitle)
		fmt.Println("Pages:", chunk.Metadata.PageStart, "-", chunk.Metadata.PageEnd)
		fmt.Println("Words:", chunk.Metadata.WordCount)
		fmt.Println("Tokens:", chunk.Metadata.EstimatedTokens)
		fmt.Println("Has Table:", chunk.Metadata.HasTable)
		fmt.Println("Has List:", chunk.Metadata.HasList)
	}
}
Example (ChunksAsMarkdown)
package main

import (
	"log"

	"github.com/tsawler/tabula"
)

func main() {
	chunks, _, err := tabula.Open("document.pdf").
		ExcludeHeadersAndFooters().
		Chunks()
	if err != nil {
		log.Fatal(err)
	}

	// Get each chunk as separate markdown strings
	mdChunks := chunks.ToMarkdownChunks()

	for i, md := range mdChunks {
		// Example: store each chunk in your vector database
		_ = chunks.Chunks[i].ID
		_ = md
	}
}
Example (CollectionStatistics)
package main

import (
	"fmt"

	"github.com/tsawler/tabula"
)

func main() {
	chunks, _, _ := tabula.Open("doc.pdf").Chunks()

	stats := chunks.Statistics()
	fmt.Println("Total chunks:", stats.TotalChunks)
	fmt.Println("Total words:", stats.TotalWords)
	fmt.Println("Average tokens:", stats.AvgTokens)
	fmt.Println("Chunks with tables:", stats.ChunksWithTables)
}
Example (CustomChunkSizing)
package main

import (
	"github.com/tsawler/tabula"
	"github.com/tsawler/tabula/rag"
)

func main() {
	config := rag.ChunkerConfig{
		TargetChunkSize: 500,  // Target characters per chunk
		MaxChunkSize:    1000, // Maximum characters
		MinChunkSize:    100,  // Minimum characters
		OverlapSize:     50,   // Overlap between chunks
	}
	sizeConfig := rag.DefaultSizeConfig()

	chunks, _, _ := tabula.Open("doc.pdf").ChunksWithConfig(config, sizeConfig)
	_ = chunks
}
Example (ErrorHandling)
package main

import (
	"github.com/tsawler/tabula"
)

func main() {
	// Panic on error (for scripts/tests)
	text := tabula.MustText(tabula.Open("doc.pdf").Text())
	count := tabula.Must(tabula.Open("doc.pdf").PageCount())
	_ = text
	_ = count
}
Example (ExtractMarkdown)
package main

import (
	"github.com/tsawler/tabula"
)

func main() {
	// PDF with header/footer exclusion
	markdown, warnings, err := tabula.Open("document.pdf").
		ExcludeHeadersAndFooters().
		ToMarkdown()
	_ = markdown
	_ = warnings
	_ = err

	// DOCX (preserves headings, lists, tables)
	markdown, warnings, err = tabula.Open("document.docx").ToMarkdown()
	_ = markdown
	_ = warnings
	_ = err
}
Example (ExtractText)
package main

import (
	"fmt"
	"log"

	"github.com/tsawler/tabula"
)

func main() {
	// Works with both PDF and DOCX files
	text, warnings, err := tabula.Open("document.pdf").Text()
	// text, warnings, err := tabula.Open("document.docx").Text()
	if err != nil {
		log.Fatal(err)
	}

	fmt.Println(text)

	for _, w := range warnings {
		fmt.Println("Warning:", w.Message)
	}
}
Example (ExtractWithOptions)
package main

import (
	"github.com/tsawler/tabula"
)

func main() {
	text, warnings, err := tabula.Open("document.pdf").
		Pages(1, 2, 3).             // Specific pages (PDF only)
		ExcludeHeadersAndFooters(). // Remove repeating headers/footers (PDF only)
		JoinParagraphs().           // Join text into paragraphs (PDF only)
		Text()
	_ = text
	_ = warnings
	_ = err
}
Example (InspectionMethods)
package main

import (
	"github.com/tsawler/tabula"
)

func main() {
	ext := tabula.Open("document.pdf")
	defer ext.Close()

	isCharLevel, _ := ext.IsCharacterLevel() // Detect character-level PDFs
	isMultiCol, _ := ext.IsMultiColumn()     // Detect multi-column layouts
	pageCount, _ := ext.PageCount()          // Get page count (works with DOCX too)
	_ = isCharLevel
	_ = isMultiCol
	_ = pageCount
}
Example (MarkdownOptions)
package main

import (
	"github.com/tsawler/tabula"
	"github.com/tsawler/tabula/rag"
)

func main() {
	opts := rag.MarkdownOptions{
		IncludeMetadata:        true, // YAML front matter
		IncludeTableOfContents: true, // Generated TOC
		IncludeChunkSeparators: true, // --- between chunks
		IncludePageNumbers:     true, // Page references
		IncludeChunkIDs:        true, // HTML comments with chunk IDs
	}

	markdown, _, _ := tabula.Open("doc.pdf").ToMarkdownWithOptions(opts)
	_ = markdown

	// Or use preset for RAG
	opts = rag.RAGOptimizedMarkdownOptions()
	_ = opts
}
Example (OpenDocuments)
package main

import (
	"github.com/tsawler/tabula"
	"github.com/tsawler/tabula/reader"
)

func main() {
	// From file path (format auto-detected by extension)
	ext := tabula.Open("document.pdf")
	_ = ext
	ext = tabula.Open("document.docx")
	_ = ext

	// From existing PDF reader (PDF only)
	r, _ := reader.Open("document.pdf")
	ext = tabula.FromReader(r)
	_ = ext
}
Example (RagChunking)
package main

import (
	"fmt"
	"log"

	"github.com/tsawler/tabula"
)

func main() {
	// Works with both PDF and DOCX
	chunks, warnings, err := tabula.Open("document.pdf").Chunks()
	// chunks, warnings, err := tabula.Open("document.docx").Chunks()
	if err != nil {
		log.Fatal(err)
	}

	for i, chunk := range chunks.Chunks {
		fmt.Printf("Chunk %d: %s (p.%d-%d, ~%d tokens)\n",
			i+1,
			chunk.Metadata.SectionTitle,
			chunk.Metadata.PageStart,
			chunk.Metadata.PageEnd,
			chunk.Metadata.EstimatedTokens)
		fmt.Println(chunk.Text)
		fmt.Println("---")
	}

	// Warnings are non-fatal issues
	for _, w := range warnings {
		fmt.Println("Warning:", w.Message)
	}
}
Example (Warnings)
package main

import (
	"log"

	"github.com/tsawler/tabula"
)

func main() {
	text, warnings, err := tabula.Open("document.pdf").Text()
	if err != nil {
		log.Fatal(err) // Fatal error
	}
	_ = text

	for _, w := range warnings {
		log.Println("Warning:", w.Message) // Non-fatal issues
	}

	// Format all warnings
	formatted := tabula.FormatWarnings(warnings)
	_ = formatted
}

Index

Examples

Constants

This section is empty.

Variables

This section is empty.

Functions

func AnalyzeDocument

func AnalyzeDocument(path string) (*model.Document, error)

AnalyzeDocument performs layout analysis on all pages of a document and populates the Layout field of each page. This enables access to detected headings, lists, paragraphs, columns, and other structural elements.

Example:

doc, err := tabula.AnalyzeDocument("document.pdf")
if err != nil {
    log.Fatal(err)
}
for _, page := range doc.Pages {
    fmt.Printf("Page %d: %d headings, %d paragraphs\n",
        page.Number, len(page.GetHeadings()), len(page.GetParagraphs()))
}

func AnalyzeDocumentWithConfig

func AnalyzeDocumentWithConfig(path string, config layout.AnalyzerConfig) (*model.Document, error)

AnalyzeDocumentWithConfig performs layout analysis with custom configuration

func FormatWarnings

func FormatWarnings(warnings []Warning) string

FormatWarnings returns a human-readable string of all warnings. Returns empty string if there are no warnings.

func Must

func Must[T any](val T, err error) T

Must is a helper that wraps a call to a function returning (T, error) and panics if the error is non-nil. It is intended for use in scripts or tests where error handling would be cumbersome.

Example:

count := tabula.Must(tabula.Open("document.pdf").PageCount())

func MustText

func MustText[T any](val T, _ []Warning, err error) T

MustText is a helper that wraps a call to Text() or Fragments() and panics if the error is non-nil. It discards warnings and returns just the value. It is intended for use in scripts or tests where error handling would be cumbersome.

Example:

text := tabula.MustText(tabula.Open("document.pdf").Text())

func PopulatePageLayout

func PopulatePageLayout(page *model.Page, fragments []text.TextFragment)

PopulatePageLayout performs layout analysis on a single page and populates its Layout field

func PopulatePageLayoutWithConfig

func PopulatePageLayoutWithConfig(page *model.Page, fragments []text.TextFragment, config layout.AnalyzerConfig)

PopulatePageLayoutWithConfig performs layout analysis with custom configuration

Types

type ExtractOptions

type ExtractOptions struct {
	// contains filtered or unexported fields
}

ExtractOptions holds configuration for text extraction.

type Extractor

type Extractor struct {
	// contains filtered or unexported fields
}

Extractor provides a fluent interface for extracting content from PDFs, DOCX, ODT, XLSX, and PPTX files. Each configuration method returns a new Extractor instance, making it safe for concurrent use and allowing method chaining.

func FromReader

func FromReader(r *reader.Reader) *Extractor

FromReader creates an Extractor from an already-opened PDF reader.Reader. This is useful when you need more control over the PDF reader lifecycle. Note: The caller is responsible for closing the reader. For DOCX files, use Open() instead which handles format detection automatically.

Example:

r, err := reader.Open("document.pdf")
if err != nil {
    // handle error
}
defer r.Close()
text, warnings, err := tabula.FromReader(r).Text()

func Open

func Open(filename string) *Extractor

Open opens a PDF or DOCX file and returns an Extractor for fluent configuration. The file format is automatically detected based on the file extension. The returned Extractor must be closed when done, either explicitly via Close() or implicitly when calling a terminal operation like Text().

Supported formats:

  • PDF (.pdf)
  • DOCX (.docx)

Example:

text, warnings, err := tabula.Open("document.pdf").Text()
text, warnings, err := tabula.Open("document.docx").Text()

func (*Extractor) Analyze

func (e *Extractor) Analyze() (*layout.AnalysisResult, error)

Analyze performs complete layout analysis and returns all detected elements. This is the most comprehensive extraction method, combining columns, lines, paragraphs, headings, lists, and reading order into a unified result. This is a terminal operation that closes the underlying reader.

Example:

result, err := tabula.Open("document.pdf").Pages(1).Analyze()
for _, elem := range result.Elements {
    fmt.Printf("[%s] %s\n", elem.Type, elem.Text)
}

func (*Extractor) Blocks

func (e *Extractor) Blocks() ([]layout.Block, error)

Blocks extracts and returns detected text blocks from the document. Blocks are spatially grouped regions of text, useful for understanding document layout structure. This is a terminal operation that closes the underlying reader.

Example:

blocks, err := tabula.Open("document.pdf").Blocks()
for _, block := range blocks {
    fmt.Printf("Block at (%.1f, %.1f): %s\n", block.BBox.X, block.BBox.Y, block.GetText())
}

func (*Extractor) ByColumn

func (e *Extractor) ByColumn() *Extractor

ByColumn configures the extractor to process text column by column in reading order, rather than line by line across the full page width. This is useful for multi-column documents like newspapers or academic papers.

Example:

text, _, err := tabula.Open("newspaper.pdf").ByColumn().Text()

func (*Extractor) Chunks

func (e *Extractor) Chunks() (*rag.ChunkCollection, []Warning, error)

Chunks extracts content and returns semantic chunks for RAG workflows. This method combines document extraction with RAG chunking in a single call. This is a terminal operation that closes the underlying reader.

Example:

chunks, warnings, err := tabula.Open("document.pdf").
    ExcludeHeadersAndFooters().
    Chunks()
if err != nil {
    log.Fatal(err)
}
for _, chunk := range chunks.Chunks {
    fmt.Printf("[%s] %s\n", chunk.Metadata.SectionTitle, chunk.Text[:50])
}

func (*Extractor) ChunksWithConfig

func (e *Extractor) ChunksWithConfig(config rag.ChunkerConfig, sizeConfig rag.SizeConfig) (*rag.ChunkCollection, []Warning, error)

ChunksWithConfig extracts content and returns semantic chunks using custom configuration. This allows fine-tuning of chunk sizes, overlap, and other parameters. This is a terminal operation that closes the underlying reader.

Example:

config := rag.ChunkerConfig{
    TargetChunkSize: 500,
    MaxChunkSize:    1000,
    OverlapSize:     50,
}
sizeConfig := rag.DefaultSizeConfig()
chunks, warnings, err := tabula.Open("document.pdf").
    ExcludeHeadersAndFooters().
    ChunksWithConfig(config, sizeConfig)

func (*Extractor) Close

func (e *Extractor) Close() error

Close releases resources associated with the Extractor. It is safe to call Close multiple times.

func (*Extractor) Document

func (e *Extractor) Document() (*model.Document, []Warning, error)

Document extracts content and returns a model.Document structure suitable for RAG chunking and other document processing workflows. This is a terminal operation that closes the underlying reader.

Example:

doc, warnings, err := tabula.Open("document.pdf").
    ExcludeHeadersAndFooters().
    Document()
doc, warnings, err := tabula.Open("document.docx").Document()
if err != nil {
    log.Fatal(err)
}
// Use doc for chunking or other processing

func (*Extractor) Elements

func (e *Extractor) Elements() ([]layout.LayoutElement, error)

Elements extracts and returns all detected elements in reading order. Elements include paragraphs, headings, and lists, unified into a single ordered list. This is useful for document reconstruction or RAG workflows. This is a terminal operation that closes the underlying reader.

Example:

elements, err := tabula.Open("document.pdf").Elements()
for _, elem := range elements {
    fmt.Printf("[%s] %s\n", elem.Type, elem.Text)
}

func (*Extractor) ExcludeFooters

func (e *Extractor) ExcludeFooters() *Extractor

ExcludeFooters configures the extractor to exclude detected footers.

Example:

text, _, err := tabula.Open("doc.pdf").ExcludeFooters().Text()

func (*Extractor) ExcludeHeaders

func (e *Extractor) ExcludeHeaders() *Extractor

ExcludeHeaders configures the extractor to exclude detected headers.

Example:

text, _, err := tabula.Open("doc.pdf").ExcludeHeaders().Text()

func (*Extractor) ExcludeHeadersAndFooters

func (e *Extractor) ExcludeHeadersAndFooters() *Extractor

ExcludeHeadersAndFooters configures the extractor to exclude both detected headers and footers. This is a convenience method equivalent to calling ExcludeHeaders().ExcludeFooters().

Example:

text, _, err := tabula.Open("doc.pdf").ExcludeHeadersAndFooters().Text()

func (*Extractor) Fragments

func (e *Extractor) Fragments() ([]text.TextFragment, []Warning, error)

Fragments extracts and returns text fragments with position information. This is a terminal operation that closes the underlying reader.

Returns the fragments, any warnings encountered during processing, and an error if extraction failed.

Example:

fragments, warnings, err := tabula.Open("document.pdf").Pages(1).Fragments()

func (*Extractor) Headings

func (e *Extractor) Headings() ([]layout.Heading, error)

Headings extracts and returns detected headings (H1-H6) from the document. This is a terminal operation that closes the underlying reader.

Example:

headings, err := tabula.Open("document.pdf").Headings()
for _, h := range headings {
    fmt.Printf("[%s] %s\n", h.Level, h.Text)
}

func (*Extractor) IsCharacterLevel

func (e *Extractor) IsCharacterLevel() (bool, error)

IsCharacterLevel checks if the first page of the PDF uses character-level text fragments (one character per fragment). This requires special handling for proper text extraction. Note: This reads page 1 to make the determination. The reader remains open.

Example:

ext := tabula.Open("document.pdf")
defer ext.Close()
isCharLevel, err := ext.IsCharacterLevel()

func (*Extractor) IsMultiColumn

func (e *Extractor) IsMultiColumn() (bool, error)

IsMultiColumn checks if the first page of the PDF appears to have a multi-column layout. Note: This reads page 1 to make the determination. The reader remains open.

Example:

ext := tabula.Open("newspaper.pdf")
defer ext.Close()
multiCol, err := ext.IsMultiColumn()

func (*Extractor) JoinParagraphs

func (e *Extractor) JoinParagraphs() *Extractor

JoinParagraphs configures the extractor to join lines within paragraphs using spaces instead of newlines. This produces cleaner text output where paragraph breaks are preserved but soft line breaks within paragraphs are removed.

Example:

text, _, err := tabula.Open("doc.pdf").JoinParagraphs().Text()
text, _, err := tabula.Open("doc.pdf").ExcludeHeadersAndFooters().JoinParagraphs().Text()

func (*Extractor) Lines

func (e *Extractor) Lines() ([]layout.Line, error)

Lines extracts and returns detected text lines with position and alignment info. This is a terminal operation that closes the underlying reader.

Example:

lines, err := tabula.Open("document.pdf").Lines()
for _, line := range lines {
    fmt.Printf("%s (align: %s)\n", line.Text, line.Alignment)
}

func (*Extractor) Lists

func (e *Extractor) Lists() ([]layout.List, error)

Lists extracts and returns detected lists (bulleted, numbered, etc.) from the document. This is a terminal operation that closes the underlying reader.

Example:

lists, err := tabula.Open("document.pdf").Lists()
for _, list := range lists {
    fmt.Printf("List type: %s, items: %d\n", list.Type, len(list.Items))
}

func (*Extractor) PageCount

func (e *Extractor) PageCount() (int, error)

PageCount returns the total number of pages in the document. For DOCX files, this returns 1 (the entire document is treated as a single page). Note: This does NOT close the reader, allowing further operations.

Example:

ext := tabula.Open("document.pdf")
defer ext.Close()
count, err := ext.PageCount()

func (*Extractor) PageRange

func (e *Extractor) PageRange(start, end int) *Extractor

PageRange specifies a range of pages to extract (1-indexed, inclusive).

Example:

text, _, err := tabula.Open("doc.pdf").PageRange(5, 10).Text()

func (*Extractor) Pages

func (e *Extractor) Pages(pages ...int) *Extractor

Pages specifies which pages to extract from (1-indexed). Multiple calls are cumulative.

Example:

text, _, err := tabula.Open("doc.pdf").Pages(1, 3, 5).Text()

func (*Extractor) Paragraphs

func (e *Extractor) Paragraphs() ([]layout.Paragraph, error)

Paragraphs extracts and returns detected paragraphs with style information. This uses reading order detection to handle multi-column layouts correctly. This is a terminal operation that closes the underlying reader.

Example:

paragraphs, err := tabula.Open("document.pdf").
    ExcludeHeaders().
    ExcludeFooters().
    Paragraphs()
for _, para := range paragraphs {
    fmt.Printf("[%s] %s\n", para.Style, para.Text)
}

func (*Extractor) PreserveLayout

func (e *Extractor) PreserveLayout() *Extractor

PreserveLayout maintains spatial positioning by inserting spaces to approximate the visual layout of the original document.

Example:

text, _, err := tabula.Open("form.pdf").PreserveLayout().Text()

func (*Extractor) ReadingOrder

func (e *Extractor) ReadingOrder() (*layout.ReadingOrderResult, error)

ReadingOrder extracts and returns detailed reading order analysis. This includes column detection, section boundaries, and proper text ordering for multi-column documents. This is a terminal operation that closes the underlying reader.

Example:

ro, err := tabula.Open("newspaper.pdf").Pages(1).ReadingOrder()
fmt.Printf("Columns: %d\n", ro.ColumnCount)
for _, section := range ro.Sections {
    fmt.Printf("Section: %s\n", section.Type)
}

func (*Extractor) Text

func (e *Extractor) Text() (string, []Warning, error)

Text extracts and returns the text content from the configured pages. This is a terminal operation that closes the underlying reader.

Returns the extracted text, any warnings encountered during processing, and an error if extraction failed. Warnings indicate non-fatal issues (e.g., messy PDF detected) where extraction succeeded but results may be imperfect.

Example:

text, warnings, err := tabula.Open("document.pdf").Text()
text, warnings, err := tabula.Open("document.docx").Text()
if len(warnings) > 0 {
    log.Println("Warnings:", tabula.FormatWarnings(warnings))
}

func (*Extractor) ToMarkdown

func (e *Extractor) ToMarkdown() (string, []Warning, error)

ToMarkdown extracts content and returns it as a markdown-formatted string. This preserves document structure including headings, paragraphs, and lists. This is a terminal operation that closes the underlying reader.

Returns the markdown text, any warnings encountered during processing, and an error if extraction failed.

Example:

md, warnings, err := tabula.Open("document.pdf").
    ExcludeHeadersAndFooters().
    ToMarkdown()

func (*Extractor) ToMarkdownWithOptions

func (e *Extractor) ToMarkdownWithOptions(opts rag.MarkdownOptions) (string, []Warning, error)

ToMarkdownWithOptions extracts content and returns it as markdown with custom options. This is a terminal operation that closes the underlying reader.

Supported options for all formats:

  • IncludeMetadata: adds YAML front matter with document metadata
  • IncludeTableOfContents: generates a table of contents from headings
  • HeadingLevelOffset: adjusts heading levels (e.g., 1 makes H1 -> H2)
  • MaxHeadingLevel: caps heading depth (default: 6)

PDF-only options (used via RAG chunking):

  • IncludeChunkSeparators: adds horizontal rules between chunks
  • IncludePageNumbers: adds page references
  • IncludeChunkIDs: adds chunk IDs as HTML comments

Example:

opts := rag.MarkdownOptions{
    IncludeTableOfContents: true,
    IncludeMetadata:        true,
}
md, warnings, err := tabula.Open("document.pdf").ToMarkdownWithOptions(opts)
md, warnings, err := tabula.Open("document.docx").ToMarkdownWithOptions(opts)

type Warning

type Warning struct {
	Code    WarningCode
	Message string
}

Warning represents a non-fatal issue encountered during PDF processing. Unlike errors, warnings indicate that extraction succeeded but the results may be imperfect or require attention.

func (Warning) String

func (w Warning) String() string

String returns the warning message.

type WarningCode

type WarningCode int

WarningCode identifies the type of warning encountered during PDF processing.

const (
	// WarningMessyPDF indicates the PDF exhibits traits of being "messy" or
	// display-oriented (e.g., generated by Word, Quartz, or highly fragmented).
	// Text extraction may still succeed but results might have ordering issues.
	WarningMessyPDF WarningCode = iota
)

Directories

Path Synopsis
Package docx provides DOCX (Office Open XML) document parsing.
Package docx provides DOCX (Office Open XML) document parsing.
Package format provides file format detection for the tabula library.
Package format provides file format detection for the tabula library.
internal
Package layout provides document layout analysis including the unified Layout Analyzer that orchestrates all detection components.
Package layout provides document layout analysis including the unified Layout Analyzer that orchestrates all detection components.
Package odt provides ODT (OpenDocument Text) document parsing.
Package odt provides ODT (OpenDocument Text) document parsing.
Package pptx provides PPTX (Office Open XML Presentation) document parsing.
Package pptx provides PPTX (Office Open XML Presentation) document parsing.
Package rag provides semantic chunking for RAG (Retrieval-Augmented Generation) workflows.
Package rag provides semantic chunking for RAG (Retrieval-Augmented Generation) workflows.
Package xlsx provides XLSX (Office Open XML Spreadsheet) document parsing.
Package xlsx provides XLSX (Office Open XML Spreadsheet) document parsing.

Jump to

Keyboard shortcuts

? : This menu
/ : Search site
f or F : Jump to
y or Y : Canonical URL