tabula

Version: v1.0.0 (latest) | Published: Nov 27, 2025 | License: MIT | Imports: 10 | Imported by: 0

Repository

github.com/tsawler/tabula

README ¶

Tabula

A pure-Go PDF text extraction library with a fluent API, designed for RAG (Retrieval-Augmented Generation) workflows.

Features

Fluent API - Chain methods for clean, readable code
Layout Analysis - Detect headings, paragraphs, lists, and columns
Header/Footer Detection - Automatically identify and exclude repeating content
RAG-Ready Chunking - Semantic document chunking with metadata
Markdown Export - Convert extracted content to markdown
PDF 1.0-1.7 Support - Including modern XRef streams (PDF 1.5+)
Pure Go - No CGO dependencies

Installation

go get github.com/tsawler/tabula

Quick Start

Extract Text

package main

import (
    "fmt"
    "log"

    "github.com/tsawler/tabula"
)

func main() {
    text, warnings, err := tabula.Open("document.pdf").Text()
    if err != nil {
        log.Fatal(err)
    }

    fmt.Println(text)

    for _, w := range warnings {
        fmt.Println("Warning:", w.Message)
    }
}

Extract with Options

text, warnings, err := tabula.Open("document.pdf").
    Pages(1, 2, 3).              // Specific pages
    ExcludeHeadersAndFooters().  // Remove repeating headers/footers
    JoinParagraphs().            // Join text into paragraphs
    Text()

Extract as Markdown

markdown, warnings, err := tabula.Open("document.pdf").
    ExcludeHeadersAndFooters().
    ToMarkdown()

RAG Chunking

package main

import (
    "fmt"
    "log"

    "github.com/tsawler/tabula"
)

func main() {
    chunks, warnings, err := tabula.Open("document.pdf").
        ExcludeHeadersAndFooters().
        Chunks()
    if err != nil {
        log.Fatal(err)
    }

    for i, chunk := range chunks.Chunks {
        fmt.Printf("Chunk %d: %s (p.%d-%d, ~%d tokens)\n",
            i+1,
            chunk.Metadata.SectionTitle,
            chunk.Metadata.PageStart,
            chunk.Metadata.PageEnd,
            chunk.Metadata.EstimatedTokens)
        fmt.Println(chunk.Text)
        fmt.Println("---")
    }
}

Chunks as Markdown (for Vector DBs)

chunks, _, err := tabula.Open("document.pdf").
    ExcludeHeadersAndFooters().
    Chunks()
if err != nil {
    log.Fatal(err)
}

// Get each chunk as separate markdown strings
mdChunks := chunks.ToMarkdownChunks()

for i, md := range mdChunks {
    // embedModel and vectorDB are placeholders for your own embedding
    // client and vector store; they are not part of tabula.
    embedding := embedModel.Embed(md)
    vectorDB.Store(chunks.Chunks[i].ID, embedding, md)
}

API Reference

Opening a PDF

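// From file path
ext := tabula.Open("document.pdf")

// From an existing reader (the caller is responsible for closing it)
r, err := reader.Open("document.pdf")
if err != nil {
    log.Fatal(err)
}
defer r.Close()
fromReader := tabula.FromReader(r)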

Fluent Options

Method	Description
`Pages(1, 2, 3)`	Extract specific pages (1-indexed)
`PageRange(1, 10)`	Extract page range (inclusive)
`ExcludeHeaders()`	Exclude detected headers
`ExcludeFooters()`	Exclude detected footers
`ExcludeHeadersAndFooters()`	Exclude both
`JoinParagraphs()`	Join text fragments into paragraphs
`ByColumn()`	Process multi-column layouts column by column
`PreserveLayout()`	Maintain spatial positioning

Terminal Operations

Method	Returns	Description
`Text()`	`string`	Plain text content
`ToMarkdown()`	`string`	Markdown-formatted content
`ToMarkdownWithOptions(opts)`	`string`	Markdown with custom options
`Fragments()`	`[]text.TextFragment`	Raw text fragments with positions
`Lines()`	`[]layout.Line`	Detected text lines
`Paragraphs()`	`[]layout.Paragraph`	Detected paragraphs
`Headings()`	`[]layout.Heading`	Detected headings (H1-H6)
`Lists()`	`[]layout.List`	Detected lists
`Blocks()`	`[]layout.Block`	Text blocks
`Elements()`	`[]layout.LayoutElement`	All elements in reading order
`Document()`	`*model.Document`	Full document structure
`Chunks()`	`*rag.ChunkCollection`	Semantic chunks for RAG
`ChunksWithConfig(config, sizeConfig)`	`*rag.ChunkCollection`	Chunks with custom sizing
`Analyze()`	`*layout.AnalysisResult`	Complete layout analysis
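`ReadingOrder()`	`*layout.ReadingOrderResult`	Reading order analysis for multi-column documents
`PageCount()`	`int`	Number of pages (does not close the reader; see Inspection Methods)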

Inspection Methods (non-terminal)

ext := tabula.Open("document.pdf")
defer ext.Close()

isCharLevel, _ := ext.IsCharacterLevel()  // Detect character-level PDFs
isMultiCol, _ := ext.IsMultiColumn()      // Detect multi-column layouts
pageCount, _ := ext.PageCount()           // Get page count

RAG Integration

Chunk Filtering

chunks, _, _ := tabula.Open("doc.pdf").Chunks()

// Filter by content type
tablesOnly := chunks.FilterWithTables()
listsOnly := chunks.FilterWithLists()

// Filter by location
section := chunks.FilterBySection("Introduction")
page5 := chunks.FilterByPage(5)
pages1to10 := chunks.FilterByPageRange(1, 10)

// Filter by size
smallChunks := chunks.FilterByMaxTokens(500)
largeChunks := chunks.FilterByMinTokens(100)

// Search
matches := chunks.Search("keyword")

// Chain filters
result := chunks.
    FilterBySection("Methods").
    FilterByMinTokens(100).
    Search("algorithm")
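Chaining works because each filter returns a new collection that exposes the same methods. The sketch below illustrates that pattern only. The `chunk` and `collection` types, their fields, and the case-insensitive substring match in `Search` are assumptions made for illustration; they are not tabula's `rag` types.

```go
package main

import (
	"fmt"
	"strings"
)

// chunk is a hypothetical stand-in for a RAG chunk.
type chunk struct {
	Section string
	Text    string
	Tokens  int
}

// collection is a hypothetical stand-in for a chunk collection.
type collection struct {
	Chunks []chunk
}

// filter returns a new collection holding the chunks for which keep is true.
// The receiver is never modified, so filters can be chained freely.
func (c collection) filter(keep func(chunk) bool) collection {
	var out collection
	for _, ch := range c.Chunks {
		if keep(ch) {
			out.Chunks = append(out.Chunks, ch)
		}
	}
	return out
}

func (c collection) FilterBySection(title string) collection {
	return c.filter(func(ch chunk) bool { return ch.Section == title })
}

func (c collection) FilterByMinTokens(n int) collection {
	return c.filter(func(ch chunk) bool { return ch.Tokens >= n })
}

// Search keeps chunks whose text contains keyword, ignoring case (assumed).
func (c collection) Search(keyword string) collection {
	needle := strings.ToLower(keyword)
	return c.filter(func(ch chunk) bool {
		return strings.Contains(strings.ToLower(ch.Text), needle)
	})
}

// sample builds a small collection of made-up chunks.
func sample() collection {
	return collection{Chunks: []chunk{
		{Section: "Introduction", Text: "An algorithm overview", Tokens: 150},
		{Section: "Methods", Text: "The Algorithm is iterative", Tokens: 120},
		{Section: "Methods", Text: "Short algorithm note", Tokens: 40},
		{Section: "Methods", Text: "Data collection details", Tokens: 200},
	}}
}

func main() {
	result := sample().
		FilterBySection("Methods").
		FilterByMinTokens(100).
		Search("algorithm")
	for _, ch := range result.Chunks {
		fmt.Println(ch.Section, ch.Tokens, ch.Text)
	}
}
```

Only the second sample chunk passes all three filters in this sketch.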

Markdown Options

import "github.com/tsawler/tabula/rag"

opts := rag.MarkdownOptions{
    IncludeMetadata:        true,   // YAML front matter
    IncludeTableOfContents: true,   // Generated TOC
    IncludeChunkSeparators: true,   // --- between chunks
    IncludePageNumbers:     true,   // Page references
    IncludeChunkIDs:        true,   // HTML comments with chunk IDs
}

markdown, _, _ := tabula.Open("doc.pdf").ToMarkdownWithOptions(opts)

// Or use the preset for RAG (reassigns the opts variable declared above)
opts = rag.RAGOptimizedMarkdownOptions()

Custom Chunk Sizing

import "github.com/tsawler/tabula/rag"

config := rag.ChunkerConfig{
    TargetChunkSize: 500,   // Target characters per chunk
    MaxChunkSize:    1000,  // Maximum characters
    MinChunkSize:    100,   // Minimum characters
    OverlapSize:     50,    // Overlap between chunks
}
sizeConfig := rag.DefaultSizeConfig()

chunks, _, _ := tabula.Open("doc.pdf").ChunksWithConfig(config, sizeConfig)
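To make the size and overlap parameters concrete, here is a rough sketch of fixed-size chunking with overlap. It is not how tabula chunks documents (the library describes its chunking as semantic). `chunkWords` is a hypothetical helper, and it counts words rather than characters purely to keep the output readable.

```go
package main

import (
	"fmt"
	"strings"
)

// chunkWords splits words into chunks of at most size words, where each
// chunk starts overlap words before the previous chunk ended.
// It returns nil for parameters that cannot make progress.
func chunkWords(words []string, size, overlap int) [][]string {
	if size <= 0 || overlap < 0 || overlap >= size {
		return nil
	}
	var chunks [][]string
	step := size - overlap
	for start := 0; start < len(words); start += step {
		end := start + size
		if end > len(words) {
			end = len(words)
		}
		chunks = append(chunks, words[start:end])
		if end == len(words) {
			break // the last chunk reached the end of the input
		}
	}
	return chunks
}

func main() {
	words := strings.Fields("w0 w1 w2 w3 w4 w5 w6 w7 w8 w9")
	for i, c := range chunkWords(words, 4, 1) {
		fmt.Printf("chunk %d: %s\n", i+1, strings.Join(c, " "))
	}
}
```

With a size of 4 and an overlap of 1, the last word of each chunk reappears as the first word of the next one, which is the effect an overlap setting is meant to produce.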

Working with Results

Chunk Metadata

for _, chunk := range chunks.Chunks {
    fmt.Println("ID:", chunk.ID)
    fmt.Println("Section:", chunk.Metadata.SectionTitle)
    fmt.Println("Pages:", chunk.Metadata.PageStart, "-", chunk.Metadata.PageEnd)
    fmt.Println("Words:", chunk.Metadata.WordCount)
    fmt.Println("Tokens:", chunk.Metadata.EstimatedTokens)
    fmt.Println("Has Table:", chunk.Metadata.HasTable)
    fmt.Println("Has List:", chunk.Metadata.HasList)
}

Collection Statistics

stats := chunks.Statistics()
fmt.Println("Total chunks:", stats.TotalChunks)
fmt.Println("Total words:", stats.TotalWords)
fmt.Println("Average tokens:", stats.AvgTokens)
fmt.Println("Chunks with tables:", stats.ChunksWithTables)

Warnings

The library returns warnings for non-fatal issues:

text, warnings, err := tabula.Open("document.pdf").Text()
if err != nil {
    log.Fatal(err)  // Fatal error
}

for _, w := range warnings {
    log.Println("Warning:", w.Message)  // Non-fatal issues
}

// Format all warnings
formatted := tabula.FormatWarnings(warnings)

Common warnings:

"Detected messy/display-oriented PDF traits" - PDF may have unusual text layout
High fragmentation warnings - Text is split into many small fragments

Error Handling Helpers

// Panic on error (for scripts/tests)
text := tabula.MustText(tabula.Open("doc.pdf").Text())
count := tabula.Must(tabula.Open("doc.pdf").PageCount())

Testing

go test ./...

License

MIT License

Related Documentation

ARCHITECTURE.md - System architecture
PDF_PARSING_GUIDE.md - PDF internals
RAG_INTEGRATION.md - RAG pipeline details

Documentation ¶

Overview ¶

Package tabula provides a fluent API for extracting text, tables, and other content from PDF files.

The package also provides functions (in integration.go) that integrate layout analysis with the Document and Page models.

Basic usage:

text, warnings, err := tabula.Open("document.pdf").Text()
if err != nil {
    // handle error
}
if len(warnings) > 0 {
    log.Println("Warnings:", tabula.FormatWarnings(warnings))
}

With options:

text, _, err := tabula.Open("report.pdf").
    Pages(1, 2, 3).
    ExcludeHeaders().
    ExcludeFooters().
    Text()

For advanced use cases, the lower-level reader package is also available.

Index ¶

func AnalyzeDocument(path string) (*model.Document, error)
func AnalyzeDocumentWithConfig(path string, config layout.AnalyzerConfig) (*model.Document, error)
func FormatWarnings(warnings []Warning) string
func Must[T any](val T, err error) T
func MustText[T any](val T, _ []Warning, err error) T
func PopulatePageLayout(page *model.Page, fragments []text.TextFragment)
func PopulatePageLayoutWithConfig(page *model.Page, fragments []text.TextFragment, config layout.AnalyzerConfig)
type ExtractOptions
type Extractor
- func FromReader(r *reader.Reader) *Extractor
- func Open(filename string) *Extractor
type Warning
- func (w Warning) String() string
type WarningCode

Constants ¶

This section is empty.

Variables ¶

This section is empty.

Functions ¶

func AnalyzeDocument ¶

func AnalyzeDocument(path string) (*model.Document, error)

AnalyzeDocument performs layout analysis on all pages of a document and populates the Layout field of each page. This enables access to detected headings, lists, paragraphs, columns, and other structural elements.

Example:

doc, err := tabula.AnalyzeDocument("document.pdf")
if err != nil {
    log.Fatal(err)
}
for _, page := range doc.Pages {
    fmt.Printf("Page %d: %d headings, %d paragraphs\n",
        page.Number, len(page.GetHeadings()), len(page.GetParagraphs()))
}

func AnalyzeDocumentWithConfig ¶

func AnalyzeDocumentWithConfig(path string, config layout.AnalyzerConfig) (*model.Document, error)

AnalyzeDocumentWithConfig performs layout analysis with custom configuration

func FormatWarnings ¶

func FormatWarnings(warnings []Warning) string

FormatWarnings returns a human-readable string of all warnings. Returns empty string if there are no warnings.

func Must ¶

func Must[T any](val T, err error) T

Must is a helper that wraps a call to a function returning (T, error) and panics if the error is non-nil. It is intended for use in scripts or tests where error handling would be cumbersome.

Example:

count := tabula.Must(tabula.Open("document.pdf").PageCount())

func MustText ¶

func MustText[T any](val T, _ []Warning, err error) T

MustText is a helper that wraps a call to Text() or Fragments() and panics if the error is non-nil. It discards warnings and returns just the value. It is intended for use in scripts or tests where error handling would be cumbersome.

Example:

text := tabula.MustText(tabula.Open("document.pdf").Text())
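The two helpers follow a common Go generics pattern. The sketch below re-declares functions with the same signatures as `Must` and `MustText` to show how such helpers behave. The bodies are an assumption about the implementation, `Warning` is a local copy of the documented struct shape, and `pageCount` and `extractText` are hypothetical stand-ins for extractor calls.

```go
package main

import (
	"errors"
	"fmt"
)

// Warning is a local stand-in with the same Message field as tabula.Warning.
type Warning struct {
	Message string
}

// Must returns val, or panics if err is non-nil.
func Must[T any](val T, err error) T {
	if err != nil {
		panic(err)
	}
	return val
}

// MustText returns val and discards the warnings, or panics if err is non-nil.
func MustText[T any](val T, _ []Warning, err error) T {
	if err != nil {
		panic(err)
	}
	return val
}

// pageCount is a hypothetical function returning (T, error).
func pageCount(ok bool) (int, error) {
	if !ok {
		return 0, errors.New("cannot open file")
	}
	return 12, nil
}

// extractText is a hypothetical function returning (T, []Warning, error).
func extractText() (string, []Warning, error) {
	return "hello", []Warning{{Message: "messy PDF"}}, nil
}

func main() {
	// A multi-value call can be passed directly when its results match
	// the helper's parameters exactly.
	fmt.Println(Must(pageCount(true)))
	fmt.Println(MustText(extractText()))
}
```

Because `MustText` drops the warnings, it is best kept to scripts and tests, as the documentation above suggests.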

func PopulatePageLayout ¶

func PopulatePageLayout(page *model.Page, fragments []text.TextFragment)

PopulatePageLayout performs layout analysis on a single page and populates its Layout field

func PopulatePageLayoutWithConfig ¶

func PopulatePageLayoutWithConfig(page *model.Page, fragments []text.TextFragment, config layout.AnalyzerConfig)

PopulatePageLayoutWithConfig performs layout analysis with custom configuration

Types ¶

type ExtractOptions ¶

type ExtractOptions struct {
	// contains filtered or unexported fields
}

ExtractOptions holds configuration for text extraction.

type Extractor ¶

type Extractor struct {
	// contains filtered or unexported fields
}

Extractor provides a fluent interface for extracting content from PDFs. Each configuration method returns a new Extractor instance, making it safe for concurrent use and allowing method chaining.
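The "returns a new instance" behaviour can be pictured with a small copy-on-write builder. This is a sketch of the general pattern, not tabula's implementation: the `extractor` type and its fields are hypothetical, and it uses a value type where the real API returns `*Extractor`.

```go
package main

import "fmt"

// extractor is a hypothetical configuration holder.
type extractor struct {
	pages          []int
	excludeHeaders bool
}

// Pages returns a copy of e with the given pages appended. The page slice
// is copied first so that the receiver and the result never share storage
// that a later append could overwrite.
func (e extractor) Pages(pages ...int) extractor {
	next := e
	next.pages = append(append([]int(nil), e.pages...), pages...)
	return next
}

// ExcludeHeaders returns a copy of e with header exclusion enabled.
func (e extractor) ExcludeHeaders() extractor {
	next := e
	next.excludeHeaders = true
	return next
}

func main() {
	base := extractor{}.Pages(1)
	a := base.Pages(2).ExcludeHeaders()
	b := base.Pages(3)

	// base is unchanged by the calls that derived a and b from it.
	fmt.Println(base.pages, base.excludeHeaders)
	fmt.Println(a.pages, a.excludeHeaders)
	fmt.Println(b.pages, b.excludeHeaders)
}
```

Since no method mutates its receiver, a partially configured value can be shared and extended in different directions without the branches affecting each other.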

func FromReader ¶

func FromReader(r *reader.Reader) *Extractor

FromReader creates an Extractor from an already-opened reader.Reader. This is useful when you need more control over the reader lifecycle. Note: The caller is responsible for closing the reader.

Example:

r, err := reader.Open("document.pdf")
if err != nil {
    // handle error
}
defer r.Close()
text, warnings, err := tabula.FromReader(r).Text()

func Open ¶

func Open(filename string) *Extractor

Open opens a PDF file and returns an Extractor for fluent configuration. The returned Extractor must be closed when done, either explicitly via Close() or implicitly when calling a terminal operation like Text().

Example:

text, warnings, err := tabula.Open("document.pdf").Text()

func (*Extractor) Analyze ¶

func (e *Extractor) Analyze() (*layout.AnalysisResult, error)

Analyze performs complete layout analysis and returns all detected elements. This is the most comprehensive extraction method, combining columns, lines, paragraphs, headings, lists, and reading order into a unified result. This is a terminal operation that closes the underlying reader.

Example:

result, err := tabula.Open("document.pdf").Pages(1).Analyze()
if err != nil {
    log.Fatal(err)
}
for _, elem := range result.Elements {
    fmt.Printf("[%s] %s\n", elem.Type, elem.Text)
}

func (*Extractor) Blocks ¶

func (e *Extractor) Blocks() ([]layout.Block, error)

Blocks extracts and returns detected text blocks from the document. Blocks are spatially grouped regions of text, useful for understanding document layout structure. This is a terminal operation that closes the underlying reader.

Example:

blocks, err := tabula.Open("document.pdf").Blocks()
if err != nil {
    log.Fatal(err)
}
for _, block := range blocks {
    fmt.Printf("Block at (%.1f, %.1f): %s\n", block.BBox.X, block.BBox.Y, block.GetText())
}

func (*Extractor) ByColumn ¶

func (e *Extractor) ByColumn() *Extractor

ByColumn configures the extractor to process text column by column in reading order, rather than line by line across the full page width. This is useful for multi-column documents like newspapers or academic papers.

Example:

text, _, err := tabula.Open("newspaper.pdf").ByColumn().Text()
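The difference between line-by-line and column-by-column order can be shown with a toy example. Everything in it is an assumption made for illustration: the `fragment` type, a single known column boundary `splitX`, and a Y axis that grows downward from the top of the page. The library's column detection is not described here and is certainly more involved.

```go
package main

import (
	"fmt"
	"sort"
)

// fragment is a hypothetical positioned piece of text.
// Y grows downward in this sketch (0 is the top of the page).
type fragment struct {
	X, Y float64
	Text string
}

// readByColumn returns the text left of splitX from top to bottom,
// followed by the text at or right of splitX from top to bottom.
func readByColumn(frags []fragment, splitX float64) []string {
	var left, right []fragment
	for _, f := range frags {
		if f.X < splitX {
			left = append(left, f)
		} else {
			right = append(right, f)
		}
	}
	topToBottom := func(col []fragment) {
		sort.SliceStable(col, func(i, j int) bool { return col[i].Y < col[j].Y })
	}
	topToBottom(left)
	topToBottom(right)

	out := make([]string, 0, len(frags))
	for _, f := range left {
		out = append(out, f.Text)
	}
	for _, f := range right {
		out = append(out, f.Text)
	}
	return out
}

func main() {
	// Fragments listed in the order a line-by-line pass would visit them.
	frags := []fragment{
		{X: 50, Y: 100, Text: "L1"},
		{X: 300, Y: 100, Text: "R1"},
		{X: 50, Y: 120, Text: "L2"},
		{X: 300, Y: 120, Text: "R2"},
	}
	fmt.Println(readByColumn(frags, 200))
}
```

A line-by-line pass would interleave the two columns (L1, R1, L2, R2), while the column-aware pass keeps each column's text together.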

func (*Extractor) Chunks ¶

func (e *Extractor) Chunks() (*rag.ChunkCollection, []Warning, error)

Chunks extracts content and returns semantic chunks for RAG workflows. This method combines document extraction with RAG chunking in a single call. This is a terminal operation that closes the underlying reader.

Example:

chunks, warnings, err := tabula.Open("document.pdf").
    ExcludeHeadersAndFooters().
    Chunks()
if err != nil {
    log.Fatal(err)
}
for _, chunk := range chunks.Chunks {
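    // %.50s prints at most 50 characters and is safe for shorter chunks,
    // unlike slicing with chunk.Text[:50].
    fmt.Printf("[%s] %.50s\n", chunk.Metadata.SectionTitle, chunk.Text)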
}

func (*Extractor) ChunksWithConfig ¶

func (e *Extractor) ChunksWithConfig(config rag.ChunkerConfig, sizeConfig rag.SizeConfig) (*rag.ChunkCollection, []Warning, error)

ChunksWithConfig extracts content and returns semantic chunks using custom configuration. This allows fine-tuning of chunk sizes, overlap, and other parameters. This is a terminal operation that closes the underlying reader.

Example:

config := rag.ChunkerConfig{
    TargetChunkSize: 500,
    MaxChunkSize:    1000,
    OverlapSize:     50,
}
sizeConfig := rag.DefaultSizeConfig()
chunks, warnings, err := tabula.Open("document.pdf").
    ExcludeHeadersAndFooters().
    ChunksWithConfig(config, sizeConfig)

func (*Extractor) Close ¶

func (e *Extractor) Close() error

Close releases resources associated with the Extractor. It is safe to call Close multiple times.

func (*Extractor) Document ¶

func (e *Extractor) Document() (*model.Document, []Warning, error)

Document extracts content and returns a model.Document structure suitable for RAG chunking and other document processing workflows. This is a terminal operation that closes the underlying reader.

Example:

doc, warnings, err := tabula.Open("document.pdf").
    ExcludeHeadersAndFooters().
    Document()
if err != nil {
    log.Fatal(err)
}
// Use doc for chunking or other processing

func (*Extractor) Elements ¶

func (e *Extractor) Elements() ([]layout.LayoutElement, error)

Elements extracts and returns all detected elements in reading order. Elements include paragraphs, headings, and lists, unified into a single ordered list. This is useful for document reconstruction or RAG workflows. This is a terminal operation that closes the underlying reader.

Example:

elements, err := tabula.Open("document.pdf").Elements()
if err != nil {
    log.Fatal(err)
}
for _, elem := range elements {
    fmt.Printf("[%s] %s\n", elem.Type, elem.Text)
}

func (*Extractor) ExcludeFooters ¶

func (e *Extractor) ExcludeFooters() *Extractor

ExcludeFooters configures the extractor to exclude detected footers.

Example:

text, _, err := tabula.Open("doc.pdf").ExcludeFooters().Text()

func (*Extractor) ExcludeHeaders ¶

func (e *Extractor) ExcludeHeaders() *Extractor

ExcludeHeaders configures the extractor to exclude detected headers.

Example:

text, _, err := tabula.Open("doc.pdf").ExcludeHeaders().Text()

func (*Extractor) ExcludeHeadersAndFooters ¶

func (e *Extractor) ExcludeHeadersAndFooters() *Extractor

ExcludeHeadersAndFooters configures the extractor to exclude both detected headers and footers. This is a convenience method equivalent to calling ExcludeHeaders().ExcludeFooters().

Example:

text, _, err := tabula.Open("doc.pdf").ExcludeHeadersAndFooters().Text()
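The README describes headers and footers as repeating content. One simple way to picture that idea is to count how often the same first line appears across pages. The sketch below does only that. It is an illustrative heuristic, not the library's detection logic, and `repeatedFirstLines` and `stripHeaders` are hypothetical helpers.

```go
package main

import "fmt"

// repeatedFirstLines returns the first lines that appear at the top of at
// least minPages pages. pages holds each page's lines from top to bottom.
func repeatedFirstLines(pages [][]string, minPages int) map[string]bool {
	counts := map[string]int{}
	for _, lines := range pages {
		if len(lines) > 0 {
			counts[lines[0]]++
		}
	}
	headers := map[string]bool{}
	for line, n := range counts {
		if n >= minPages {
			headers[line] = true
		}
	}
	return headers
}

// stripHeaders returns the pages with a detected header removed from the top.
// The input slices are not modified.
func stripHeaders(pages [][]string, headers map[string]bool) [][]string {
	out := make([][]string, len(pages))
	for i, lines := range pages {
		if len(lines) > 0 && headers[lines[0]] {
			lines = lines[1:]
		}
		out[i] = lines
	}
	return out
}

func main() {
	pages := [][]string{
		{"ACME Annual Report", "Intro text"},
		{"ACME Annual Report", "More text"},
		{"Appendix", "Tables"},
	}
	headers := repeatedFirstLines(pages, 2)
	for i, lines := range stripHeaders(pages, headers) {
		fmt.Printf("page %d: %q\n", i+1, lines)
	}
}
```

Real documents complicate this, for example with page numbers that change on every page, so exact string matching like this is only a starting point.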

func (*Extractor) Fragments ¶

func (e *Extractor) Fragments() ([]text.TextFragment, []Warning, error)

Fragments extracts and returns text fragments with position information. This is a terminal operation that closes the underlying reader.

Returns the fragments, any warnings encountered during processing, and an error if extraction failed.

Example:

fragments, warnings, err := tabula.Open("document.pdf").Pages(1).Fragments()

func (*Extractor) Headings ¶

func (e *Extractor) Headings() ([]layout.Heading, error)

Headings extracts and returns detected headings (H1-H6) from the document. This is a terminal operation that closes the underlying reader.

Example:

headings, err := tabula.Open("document.pdf").Headings()
if err != nil {
    log.Fatal(err)
}
for _, h := range headings {
    fmt.Printf("[%v] %s\n", h.Level, h.Text)
}

func (*Extractor) IsCharacterLevel ¶

func (e *Extractor) IsCharacterLevel() (bool, error)

IsCharacterLevel checks if the first page of the PDF uses character-level text fragments (one character per fragment). This requires special handling for proper text extraction. Note: This reads page 1 to make the determination. The reader remains open.

Example:

ext := tabula.Open("document.pdf")
defer ext.Close()
isCharLevel, err := ext.IsCharacterLevel()

func (*Extractor) IsMultiColumn ¶

func (e *Extractor) IsMultiColumn() (bool, error)

IsMultiColumn checks if the first page of the PDF appears to have a multi-column layout. Note: This reads page 1 to make the determination. The reader remains open.

Example:

ext := tabula.Open("newspaper.pdf")
defer ext.Close()
multiCol, err := ext.IsMultiColumn()

func (*Extractor) JoinParagraphs ¶

func (e *Extractor) JoinParagraphs() *Extractor

JoinParagraphs configures the extractor to join lines within paragraphs using spaces instead of newlines. This produces cleaner text output where paragraph breaks are preserved but soft line breaks within paragraphs are removed.

Example:

text, _, err := tabula.Open("doc.pdf").JoinParagraphs().Text()

// Or combined with header and footer removal:
text, _, err = tabula.Open("doc.pdf").ExcludeHeadersAndFooters().JoinParagraphs().Text()
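The effect described above (soft line breaks replaced by spaces, paragraph breaks kept) can be reproduced on plain strings. The following sketch assumes paragraphs are separated by a blank line. `joinParagraphs` is a hypothetical helper, not the library's implementation, which works from layout information rather than raw text.

```go
package main

import (
	"fmt"
	"strings"
)

// joinParagraphs replaces line breaks inside each paragraph with single
// spaces and keeps the blank line between paragraphs.
func joinParagraphs(text string) string {
	paragraphs := strings.Split(text, "\n\n")
	for i, p := range paragraphs {
		// Fields splits on any whitespace, so newlines and repeated
		// spaces inside the paragraph collapse to one space each.
		paragraphs[i] = strings.Join(strings.Fields(p), " ")
	}
	return strings.Join(paragraphs, "\n\n")
}

func main() {
	in := "The quick\nbrown fox.\n\nSecond\nparagraph."
	fmt.Println(joinParagraphs(in))
}
```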

func (*Extractor) Lines ¶

func (e *Extractor) Lines() ([]layout.Line, error)

Lines extracts and returns detected text lines with position and alignment info. This is a terminal operation that closes the underlying reader.

Example:

lines, err := tabula.Open("document.pdf").Lines()
if err != nil {
    log.Fatal(err)
}
for _, line := range lines {
    fmt.Printf("%s (align: %s)\n", line.Text, line.Alignment)
}

func (*Extractor) Lists ¶

func (e *Extractor) Lists() ([]layout.List, error)

Lists extracts and returns detected lists (bulleted, numbered, etc.) from the document. This is a terminal operation that closes the underlying reader.

Example:

lists, err := tabula.Open("document.pdf").Lists()
if err != nil {
    log.Fatal(err)
}
for _, list := range lists {
    fmt.Printf("List type: %s, items: %d\n", list.Type, len(list.Items))
}

func (*Extractor) PageCount ¶

func (e *Extractor) PageCount() (int, error)

PageCount returns the total number of pages in the PDF. Note: This does NOT close the reader, allowing further operations.

Example:

ext := tabula.Open("document.pdf")
defer ext.Close()
count, err := ext.PageCount()

func (*Extractor) PageRange ¶

func (e *Extractor) PageRange(start, end int) *Extractor

PageRange specifies a range of pages to extract (1-indexed, inclusive).

Example:

text, _, err := tabula.Open("doc.pdf").PageRange(5, 10).Text()
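"1-indexed, inclusive" means that `PageRange(5, 10)` covers six pages, 5 through 10. The sketch below spells that out with a hypothetical `expandRange` helper. Returning an error for an invalid range is an assumption of this sketch; the documentation does not say how the library treats such input.

```go
package main

import "fmt"

// expandRange lists every page from start to end, both included.
// Pages are numbered from 1.
func expandRange(start, end int) ([]int, error) {
	if start < 1 || end < start {
		return nil, fmt.Errorf("invalid page range %d-%d", start, end)
	}
	pages := make([]int, 0, end-start+1)
	for p := start; p <= end; p++ {
		pages = append(pages, p)
	}
	return pages, nil
}

func main() {
	pages, err := expandRange(5, 10)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(len(pages), pages)
}
```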

func (*Extractor) Pages ¶

func (e *Extractor) Pages(pages ...int) *Extractor

Pages specifies which pages to extract from (1-indexed). Multiple calls are cumulative.

Example:

text, _, err := tabula.Open("doc.pdf").Pages(1, 3, 5).Text()

func (*Extractor) Paragraphs ¶

func (e *Extractor) Paragraphs() ([]layout.Paragraph, error)

Paragraphs extracts and returns detected paragraphs with style information. This uses reading order detection to handle multi-column layouts correctly. This is a terminal operation that closes the underlying reader.

Example:

paragraphs, err := tabula.Open("document.pdf").
    ExcludeHeaders().
    ExcludeFooters().
    Paragraphs()
if err != nil {
    log.Fatal(err)
}
for _, para := range paragraphs {
    fmt.Printf("[%s] %s\n", para.Style, para.Text)
}

func (*Extractor) PreserveLayout ¶

func (e *Extractor) PreserveLayout() *Extractor

PreserveLayout maintains spatial positioning by inserting spaces to approximate the visual layout of the original document.

Example:

text, _, err := tabula.Open("form.pdf").PreserveLayout().Text()
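Approximating a visual layout with spaces usually means mapping horizontal positions to character columns. The sketch below shows one way to do that for a single line. The `fragment` type, the fixed `charWidth`, and the `layoutLine` helper are all assumptions made for illustration and do not describe how the library computes spacing.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
	"unicode/utf8"
)

// fragment is a hypothetical piece of text with a horizontal position.
type fragment struct {
	X    float64
	Text string
}

// layoutLine places each fragment at the character column X/charWidth,
// padding with spaces. Fragments that would overlap earlier text are
// separated from it by a single space instead.
func layoutLine(frags []fragment, charWidth float64) string {
	sorted := append([]fragment(nil), frags...) // leave the input untouched
	sort.SliceStable(sorted, func(i, j int) bool { return sorted[i].X < sorted[j].X })

	var b strings.Builder
	col := 0 // number of characters written so far
	for _, f := range sorted {
		target := col
		if charWidth > 0 {
			target = int(f.X / charWidth)
		}
		switch {
		case target > col:
			b.WriteString(strings.Repeat(" ", target-col))
			col = target
		case col > 0:
			b.WriteByte(' ')
			col++
		}
		b.WriteString(f.Text)
		col += utf8.RuneCountInString(f.Text)
	}
	return b.String()
}

func main() {
	line := []fragment{
		{X: 0, Text: "Name"},
		{X: 120, Text: "Qty"},
		{X: 60, Text: "Price"},
	}
	fmt.Printf("%q\n", layoutLine(line, 6))
}
```

With a character width of 6 units, the three fragments land at columns 0, 10, and 20, so the gaps in the output mirror the gaps on the page.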

func (*Extractor) ReadingOrder ¶

func (e *Extractor) ReadingOrder() (*layout.ReadingOrderResult, error)

ReadingOrder extracts and returns detailed reading order analysis. This includes column detection, section boundaries, and proper text ordering for multi-column documents. This is a terminal operation that closes the underlying reader.

Example:

ro, err := tabula.Open("newspaper.pdf").Pages(1).ReadingOrder()
if err != nil {
    log.Fatal(err)
}
fmt.Printf("Columns: %d\n", ro.ColumnCount)
for _, section := range ro.Sections {
    fmt.Printf("Section: %s\n", section.Type)
}

func (*Extractor) Text ¶

func (e *Extractor) Text() (string, []Warning, error)

Text extracts and returns the text content from the configured pages. This is a terminal operation that closes the underlying reader.

Returns the extracted text, any warnings encountered during processing, and an error if extraction failed. Warnings indicate non-fatal issues (e.g., messy PDF detected) where extraction succeeded but results may be imperfect.

Example:

text, warnings, err := tabula.Open("document.pdf").Text()
if len(warnings) > 0 {
    log.Println("Warnings:", tabula.FormatWarnings(warnings))
}

func (*Extractor) ToMarkdown ¶

func (e *Extractor) ToMarkdown() (string, []Warning, error)

ToMarkdown extracts content and returns it as a markdown-formatted string. This preserves document structure including headings, paragraphs, and lists. This is a terminal operation that closes the underlying reader.

Returns the markdown text, any warnings encountered during processing, and an error if extraction failed.

Example:

md, warnings, err := tabula.Open("document.pdf").
    ExcludeHeadersAndFooters().
    ToMarkdown()

func (*Extractor) ToMarkdownWithOptions ¶

func (e *Extractor) ToMarkdownWithOptions(opts rag.MarkdownOptions) (string, []Warning, error)

ToMarkdownWithOptions extracts content and returns it as markdown with custom options. This is a terminal operation that closes the underlying reader.

Example:

opts := rag.MarkdownOptions{
    IncludeTableOfContents: true,
    IncludePageNumbers:     true,
}
md, warnings, err := tabula.Open("document.pdf").ToMarkdownWithOptions(opts)

type Warning ¶

type Warning struct {
	Code    WarningCode
	Message string
}

Warning represents a non-fatal issue encountered during PDF processing. Unlike errors, warnings indicate that extraction succeeded but the results may be imperfect or require attention.

func (Warning) String ¶

func (w Warning) String() string

String returns the warning message.

type WarningCode ¶

type WarningCode int

WarningCode identifies the type of warning encountered during PDF processing.

const (
	// WarningMessyPDF indicates the PDF exhibits traits of being "messy" or
	// display-oriented (e.g., generated by Word, Quartz, or highly fragmented).
	// Text extraction may still succeed but results might have ordering issues.
	WarningMessyPDF WarningCode = iota
)
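Because each `Warning` carries a `Code`, callers can branch on the kind of warning instead of matching message text. The sketch below re-declares the documented types locally so that it runs on its own. `hasWarning` is a hypothetical helper, and the sample warning is constructed by hand rather than produced by an extraction.

```go
package main

import "fmt"

// Local copies of the documented declarations, for illustration only.
type WarningCode int

const (
	WarningMessyPDF WarningCode = iota
)

type Warning struct {
	Code    WarningCode
	Message string
}

// String returns the warning message, as documented for tabula.Warning.
func (w Warning) String() string { return w.Message }

// hasWarning reports whether any warning in the slice has the given code.
func hasWarning(warnings []Warning, code WarningCode) bool {
	for _, w := range warnings {
		if w.Code == code {
			return true
		}
	}
	return false
}

func main() {
	warnings := []Warning{
		{Code: WarningMessyPDF, Message: "Detected messy/display-oriented PDF traits"},
	}
	if hasWarning(warnings, WarningMessyPDF) {
		fmt.Println("extraction succeeded, but the text order should be reviewed")
	}
	for _, w := range warnings {
		fmt.Println("Warning:", w) // Println uses the String method
	}
}
```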

Directories ¶

Path	Synopsis
contentstream
core
font
graphicsstate
internal/filters
layout	Package layout provides document layout analysis including the unified Layout Analyzer that orchestrates all detection components.
model
pages
rag	Package rag provides semantic chunking for RAG (Retrieval-Augmented Generation) workflows.
reader
resolver
tables
text
