package text

Version: v1.6.6 (latest). Published: Feb 4, 2026. License: MIT. Imports: 11. Imported by: 0.

Repository

github.com/tsawler/tabula

Documentation ¶

Overview ¶

Package text provides text extraction from PDF content streams.

This package handles the extraction of text fragments from PDF pages, including position calculation, font handling, and text direction detection.

Text Extraction ¶

The Extractor type processes PDF content stream operations to extract positioned text:

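extractor := text.NewExtractor()
// RegisterFontsFromPage returns an error; check it before extracting.
if err := extractor.RegisterFontsFromPage(page, resolver); err != nil {
	return err
}
fragments, err := extractor.ExtractFromBytes(contentData)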

Each TextFragment contains the text along with position (X, Y), dimensions (Width, Height), font information, and text direction.

Font Registration ¶

For accurate text extraction, fonts should be registered before extraction:

RegisterFont - register a font by name
RegisterParsedFont - register a pre-parsed font with ToUnicode CMap
RegisterFontsFromPage - automatically register all fonts from a page
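Registration matters because a content stream stores character codes, not Unicode text, and the font supplies the mapping. The toy sketch below illustrates that idea only; it does not use or reproduce the package's font handling. The names codeMap and decode are hypothetical, codes are assumed to be single bytes, and the U+FFFD fallback for unmapped codes is an assumption.

```go
package main

import (
	"fmt"
	"strings"
)

// codeMap is a toy stand-in for a ToUnicode CMap: it maps single-byte
// character codes to Unicode text. Real CMaps also cover multi-byte codes.
type codeMap map[byte]string

// decode translates content-stream character codes through m. Codes with no
// mapping become U+FFFD; that fallback is an assumption of this sketch.
func decode(codes []byte, m codeMap) string {
	var b strings.Builder
	for _, c := range codes {
		if s, ok := m[c]; ok {
			b.WriteString(s)
			continue
		}
		b.WriteRune('\uFFFD')
	}
	return b.String()
}

func main() {
	// A subset font may assign arbitrary codes to the glyphs it embeds.
	m := codeMap{0x01: "P", 0x02: "D", 0x03: "F"}
	fmt.Println(decode([]byte{0x01, 0x02, 0x03}, m)) // prints PDF

	// Without a mapping for 0x09 the output contains a replacement character.
	fmt.Printf("%q\n", decode([]byte{0x01, 0x09}, m))
}
```

Without the mapping, the same bytes cannot be turned into readable text, which is why the fonts need to be registered before extraction.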

Text Direction ¶

The package supports bidirectional text with the Direction type:

LTR - left-to-right (Latin, CJK, etc.)
RTL - right-to-left (Arabic, Hebrew, etc.)
Neutral - direction-neutral characters (numbers, punctuation)

The DetectDirection function analyzes text to determine its direction.

Smart Spacing ¶

The extractor chooses how to space adjacent fragments based on how the PDF emits its text:

Word-level PDFs: Uses font space width metrics
Character-level PDFs: Uses adaptive gap detection
Explicit spaces: Respects space characters in the stream
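A gap-based spacing rule of the kind listed above might look like the following sketch. It is an illustration under stated assumptions, not the package's algorithm: frag, joinLine, and the gapFactor threshold of 0.5 are hypothetical, and the fragments are assumed to be on one line and already sorted by X.

```go
package main

import (
	"fmt"
	"strings"
)

// frag is a local stand-in for a positioned piece of text on one line.
type frag struct {
	Text     string
	X, Width float64
}

// gapFactor is a hypothetical threshold: a gap wider than this fraction of
// the font's space width is treated as a word break.
const gapFactor = 0.5

// joinLine concatenates fragments (sorted by X), inserting a space where the
// horizontal gap suggests a word break and no explicit space is present.
func joinLine(frags []frag, spaceWidth float64) string {
	var b strings.Builder
	for i, f := range frags {
		if i > 0 {
			prev := frags[i-1]
			gap := f.X - (prev.X + prev.Width)
			explicit := strings.HasSuffix(prev.Text, " ") || strings.HasPrefix(f.Text, " ")
			if !explicit && gap > gapFactor*spaceWidth {
				b.WriteByte(' ')
			}
		}
		b.WriteString(f.Text)
	}
	return b.String()
}

func main() {
	line := []frag{
		{Text: "Hel", X: 0, Width: 15},
		{Text: "lo", X: 15, Width: 10},    // touches the previous fragment: no space
		{Text: "world", X: 28, Width: 25}, // 3-unit gap: treated as a word break
	}
	fmt.Println(joinLine(line, 3)) // prints Hello world
}
```

A fixed fraction of the space width suits word-level PDFs; for character-level PDFs the threshold would need to adapt to the gaps actually observed, which this sketch does not attempt.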

Index ¶

type Direction
- func DetectDirection(text string) Direction
- func GetCharDirection(r rune) Direction
- func (d Direction) String() string
type Extractor
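- func NewExtractor() *Extractor
- func (e *Extractor) Extract(operations []contentstream.Operation) ([]TextFragment, error)
- func (e *Extractor) ExtractFromBytes(data []byte) ([]TextFragment, error)
- func (e *Extractor) GetFonts() map[string]*font.Font
- func (e *Extractor) GetFragments() []TextFragment
- func (e *Extractor) GetFragmentsRaw() []TextFragment
- func (e *Extractor) GetText() string
- func (e *Extractor) RegisterFont(name, baseFont, subtype string)
- func (e *Extractor) RegisterFontsFromPage(page *pages.Page, resolver func(core.IndirectRef) (core.Object, error)) error
- func (e *Extractor) RegisterFontsFromResources(resources core.Dict, resolver func(core.IndirectRef) (core.Object, error)) error
- func (e *Extractor) RegisterParsedFont(name string, f *font.Font)
- func (e *Extractor) SetResourceContext(resources core.Dict, resolver func(core.IndirectRef) (core.Object, error))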
type TextFragment

Constants ¶

This section is empty.

Variables ¶

This section is empty.

Functions ¶

This section is empty.

Types ¶

type Direction ¶

type Direction int

Direction represents the writing direction of text. It is used to detect and handle bidirectional text (bidi) in documents.

const (
	// LTR (Left-to-Right) for Latin, Cyrillic, etc.
	LTR Direction = iota
	// RTL (Right-to-Left) for Arabic, Hebrew, etc.
	RTL
	// Neutral for numbers, punctuation, etc.
	Neutral
)

func DetectDirection ¶

func DetectDirection(text string) Direction

DetectDirection analyzes a string and returns its dominant text direction based on Unicode character properties. It counts strong directional characters and returns the direction with the higher count, or Neutral if no strong directional characters are present.
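The counting rule described above can be approximated with the standard library alone. This is a sketch, not the package's implementation: isStrongRTL and dominantDirection are hypothetical names, every letter outside the listed RTL scripts is counted as strong LTR, and a tie resolves to LTR, which the documentation does not confirm.

```go
package main

import (
	"fmt"
	"unicode"
)

// isStrongRTL reports whether r belongs to a right-to-left script.
// Hypothetical helper; the script list is an assumption of this sketch.
func isStrongRTL(r rune) bool {
	return unicode.In(r, unicode.Arabic, unicode.Hebrew, unicode.Syriac, unicode.Thaana, unicode.Nko)
}

// dominantDirection counts strong LTR and RTL characters and returns "LTR",
// "RTL", or "Neutral" when no strong characters are present.
// Ties resolve to "LTR" here (assumption).
func dominantDirection(s string) string {
	ltr, rtl := 0, 0
	for _, r := range s {
		switch {
		case isStrongRTL(r):
			rtl++
		case unicode.IsLetter(r):
			ltr++
		}
	}
	switch {
	case ltr == 0 && rtl == 0:
		return "Neutral"
	case rtl > ltr:
		return "RTL"
	default:
		return "LTR"
	}
}

func main() {
	fmt.Println(dominantDirection("Hello, world"))             // prints LTR
	fmt.Println(dominantDirection("\u05e9\u05dc\u05d5\u05dd")) // Hebrew word: prints RTL
	fmt.Println(dominantDirection("123 - 456"))                // prints Neutral
}
```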

func GetCharDirection ¶

func GetCharDirection(r rune) Direction

GetCharDirection returns the inherent direction of a single Unicode character. Digits, punctuation, whitespace, and symbols are Neutral; RTL scripts (Arabic, Hebrew, Syriac, Thaana, N'Ko) return RTL; all other scripts return LTR.
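The per-character rule can be sketched as a three-way classification. The code below is a hedged approximation using Go's unicode tables, not the package's source: direction and charDirection are local stand-ins that mirror the documented constants, and mapping "digits, punctuation, whitespace, and symbols" onto unicode.IsDigit, IsPunct, IsSpace, and IsSymbol is an assumption.

```go
package main

import (
	"fmt"
	"unicode"
)

// direction is a local stand-in mirroring the documented Direction constants,
// so the example compiles without importing the package.
type direction int

const (
	ltr direction = iota
	rtl
	neutral
)

// String returns "LTR", "RTL", or "Neutral".
func (d direction) String() string {
	switch d {
	case ltr:
		return "LTR"
	case rtl:
		return "RTL"
	default:
		return "Neutral"
	}
}

// charDirection approximates the documented rule: neutral classes first,
// then the RTL scripts, and LTR for everything else.
func charDirection(r rune) direction {
	switch {
	case unicode.IsDigit(r), unicode.IsPunct(r), unicode.IsSpace(r), unicode.IsSymbol(r):
		return neutral
	case unicode.In(r, unicode.Arabic, unicode.Hebrew, unicode.Syriac, unicode.Thaana, unicode.Nko):
		return rtl
	default:
		return ltr
	}
}

func main() {
	// Latin letter, digit, punctuation, Hebrew letter alef.
	for _, r := range "A7,\u05d0" {
		fmt.Printf("%U -> %v\n", r, charDirection(r))
	}
}
```

Checking the neutral classes before the script tables keeps punctuation that lives inside an RTL script block from being counted as a strong RTL character.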

func (Direction) String ¶

func (d Direction) String() string

String returns a string representation of the direction ("LTR", "RTL", or "Neutral").

type Extractor ¶

type Extractor struct {
	// contains filtered or unexported fields
}

Extractor extracts text fragments from PDF content streams. It maintains graphics state and registered fonts to properly decode and position text.

func NewExtractor ¶

func NewExtractor() *Extractor

NewExtractor creates a new text extractor with initialized graphics state.

func (*Extractor) Extract ¶

func (e *Extractor) Extract(operations []contentstream.Operation) ([]TextFragment, error)

Extract extracts text fragments from parsed content stream operations.

func (*Extractor) ExtractFromBytes ¶

func (e *Extractor) ExtractFromBytes(data []byte) ([]TextFragment, error)

ExtractFromBytes parses raw content stream data and extracts text fragments.

func (*Extractor) GetFonts ¶

func (e *Extractor) GetFonts() map[string]*font.Font

GetFonts returns the fonts registered in this extractor. It is useful for debugging font loading and ToUnicode CMap issues.

func (*Extractor) GetFragments ¶

func (e *Extractor) GetFragments() []TextFragment

GetFragments returns all extracted text fragments with duplicates removed. Duplicate fragments are those at the same position with the same text, which can occur in PDFs with multiple content layers or tagged structure.
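Deduplication by position and text could be sketched as follows. This is an illustration only: fragment, key, and dedupe are hypothetical names, and rounding positions to 0.01 units before comparing is an assumption, since the documentation does not say how "same position" is decided.

```go
package main

import (
	"fmt"
	"math"
)

// fragment is a local stand-in carrying only the fields used for comparison.
type fragment struct {
	Text string
	X, Y float64
}

// key identifies a fragment by its text and rounded position.
type key struct {
	text string
	x, y int64
}

// dedupe keeps the first fragment seen for each (text, position) pair and
// preserves input order. Positions are rounded to 0.01 units (assumption).
func dedupe(in []fragment) []fragment {
	seen := make(map[key]struct{}, len(in))
	out := make([]fragment, 0, len(in))
	for _, f := range in {
		k := key{
			text: f.Text,
			x:    int64(math.Round(f.X * 100)),
			y:    int64(math.Round(f.Y * 100)),
		}
		if _, ok := seen[k]; ok {
			continue
		}
		seen[k] = struct{}{}
		out = append(out, f)
	}
	return out
}

func main() {
	frags := []fragment{
		{Text: "Total", X: 72, Y: 700},
		{Text: "Total", X: 72, Y: 700}, // duplicate, e.g. from a second content layer
		{Text: "Total", X: 72, Y: 680}, // same text at a different position: kept
	}
	fmt.Println(len(dedupe(frags))) // prints 2
}
```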

func (*Extractor) GetFragmentsRaw ¶ added in v1.5.4

func (e *Extractor) GetFragmentsRaw() []TextFragment

GetFragmentsRaw returns all extracted text fragments without deduplication. Use this when you need to see all fragments including duplicates.

func (*Extractor) GetText ¶

func (e *Extractor) GetText() string

GetText returns all extracted text as a string with smart spacing. Handles both LTR and RTL text, grouping fragments into lines and adding appropriate word and line breaks. Duplicate fragments at the same position are automatically removed.
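The line-grouping step can be illustrated with a small standalone sketch. It is not the package's implementation: fragment, groupLines, and the lineTolerance value are hypothetical, Y is assumed to grow upward as in default PDF user space, only left-to-right ordering is handled, and fragments on a line are joined with a single space rather than with smart spacing.

```go
package main

import (
	"fmt"
	"math"
	"sort"
	"strings"
)

// fragment is a local stand-in with only the fields this sketch needs.
type fragment struct {
	Text string
	X, Y float64
}

// lineTolerance is a hypothetical bound on how far Y values may differ
// while still being treated as the same line.
const lineTolerance = 2.0

// groupLines orders fragments top to bottom, groups those with similar Y
// into lines, and joins each line left to right.
func groupLines(frags []fragment) []string {
	sorted := append([]fragment(nil), frags...)
	// Larger Y first: assumes the origin is at the bottom of the page.
	sort.SliceStable(sorted, func(i, j int) bool { return sorted[i].Y > sorted[j].Y })

	var lines [][]fragment
	for _, f := range sorted {
		n := len(lines)
		if n > 0 && math.Abs(lines[n-1][0].Y-f.Y) <= lineTolerance {
			lines[n-1] = append(lines[n-1], f)
			continue
		}
		lines = append(lines, []fragment{f})
	}

	out := make([]string, 0, len(lines))
	for _, l := range lines {
		sort.SliceStable(l, func(i, j int) bool { return l[i].X < l[j].X })
		parts := make([]string, len(l))
		for i, f := range l {
			parts[i] = f.Text
		}
		out = append(out, strings.Join(parts, " "))
	}
	return out
}

func main() {
	frags := []fragment{
		{Text: "world", X: 110, Y: 700},
		{Text: "Second", X: 72, Y: 686},
		{Text: "Hello", X: 72, Y: 700.5},
		{Text: "line", X: 115, Y: 686},
	}
	for _, line := range groupLines(frags) {
		fmt.Println(line)
	}
}
```

For an RTL line, the within-line sort would run from the largest X to the smallest instead.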

func (*Extractor) RegisterFont ¶

func (e *Extractor) RegisterFont(name, baseFont, subtype string)

RegisterFont registers a font by name for use during extraction. The baseFont and subtype are used to create a basic font with default metrics.

func (*Extractor) RegisterFontsFromPage ¶

func (e *Extractor) RegisterFontsFromPage(page *pages.Page, resolver func(core.IndirectRef) (core.Object, error)) error

RegisterFontsFromPage parses and registers all fonts from a page's resources. This is the recommended way to prepare the extractor before extracting text from a page.

func (*Extractor) RegisterFontsFromResources ¶

func (e *Extractor) RegisterFontsFromResources(resources core.Dict, resolver func(core.IndirectRef) (core.Object, error)) error

RegisterFontsFromResources parses and registers all fonts from a resources dictionary. Use this when working with page resources directly rather than through a Page object.

func (*Extractor) RegisterParsedFont ¶

func (e *Extractor) RegisterParsedFont(name string, f *font.Font)

RegisterParsedFont registers a pre-parsed font for use during extraction. Use this when you have already parsed the font with its ToUnicode CMap and widths.

func (*Extractor) SetResourceContext ¶ added in v1.6.4

func (e *Extractor) SetResourceContext(resources core.Dict, resolver func(core.IndirectRef) (core.Object, error))

SetResourceContext configures the extractor with resources and a resolver for XObject processing. This enables extraction of text from Form XObjects.

type TextFragment ¶

type TextFragment struct {
	Text      string    // Decoded text content
	X, Y      float64   // Position in page coordinates
	Width     float64   // Width of the text in page units
	Height    float64   // Height (typically font size)
	FontName  string    // Name of the font used
	FontSize  float64   // Font size in page units
	Direction Direction // Text direction (LTR, RTL, Neutral)
}

TextFragment represents a piece of extracted text with position and font information.
