package text

v1.5.3 Latest
Warning

This package is not in the latest version of its module.

Published: Jan 20, 2026 License: MIT Imports: 11 Imported by: 0

Documentation

Overview

Package text provides text extraction from PDF content streams.

This package handles the extraction of text fragments from PDF pages, including position calculation, font handling, and text direction detection.

Text Extraction

The Extractor type processes PDF content stream operations to extract positioned text:

extractor := text.NewExtractor()
if err := extractor.RegisterFontsFromPage(page, resolver); err != nil {
	log.Fatal(err)
}
fragments, err := extractor.ExtractFromBytes(contentData)

Each TextFragment contains the text along with position (X, Y), dimensions (Width, Height), font information, and text direction.

Font Registration

For accurate text extraction, fonts should be registered before extraction:

  • RegisterFont - register a font by name
  • RegisterParsedFont - register a pre-parsed font with ToUnicode CMap
  • RegisterFontsFromPage - automatically register all fonts from a page

Text Direction

The package supports bidirectional text with the Direction type:

  • LTR - left-to-right (Latin, CJK, etc.)
  • RTL - right-to-left (Arabic, Hebrew, etc.)
  • Neutral - direction-neutral characters (numbers, punctuation)

The DetectDirection function analyzes text to determine its direction.

Smart Spacing

The extractor picks a spacing strategy between fragments based on how the PDF encodes its text:

  • Word-level PDFs: Uses font space width metrics
  • Character-level PDFs: Uses adaptive gap detection
  • Explicit spaces: Respects space characters in the stream
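
The gap-based idea behind these strategies can be sketched in a few lines. This is a minimal illustration, not the package's implementation: the fragment type is a hypothetical stand-in for TextFragment, joinLine is an invented helper, and the 0.25 * FontSize threshold is an assumed value chosen only to make the example concrete.

```go
package main

import (
	"fmt"
	"strings"
)

// fragment is a hypothetical stand-in for the package's TextFragment,
// reduced to the fields this sketch needs.
type fragment struct {
	Text     string
	X, Width float64
	FontSize float64
}

// joinLine concatenates fragments that are already ordered left to right on
// a single line. It inserts a space when the horizontal gap between two
// neighbours exceeds a threshold, unless one of them already carries an
// explicit space. The 0.25*FontSize threshold is an assumption.
func joinLine(frags []fragment) string {
	var b strings.Builder
	for i, f := range frags {
		if i > 0 {
			prev := frags[i-1]
			gap := f.X - (prev.X + prev.Width)
			explicit := strings.HasSuffix(prev.Text, " ") || strings.HasPrefix(f.Text, " ")
			if !explicit && gap > 0.25*f.FontSize {
				b.WriteByte(' ')
			}
		}
		b.WriteString(f.Text)
	}
	return b.String()
}

func main() {
	line := []fragment{
		{Text: "Hel", X: 0, Width: 15, FontSize: 10},
		{Text: "lo", X: 15, Width: 10, FontSize: 10},  // touches the previous fragment
		{Text: "world", X: 30, Width: 25, FontSize: 10}, // 5-unit gap before it
	}
	fmt.Println(joinLine(line)) // prints "Hello world"
}
```

A fixed fraction of the font size is the simplest possible threshold. An adaptive detector, as described for character-level PDFs, would presumably derive the threshold from the gaps actually observed on the line.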


Types

type Direction

type Direction int

Direction represents the writing direction of text. It is used to detect and handle bidirectional text (bidi) in documents.

const (
	// LTR (Left-to-Right) for Latin, Cyrillic, etc.
	LTR Direction = iota
	// RTL (Right-to-Left) for Arabic, Hebrew, etc.
	RTL
	// Neutral for numbers, punctuation, etc.
	Neutral
)

func DetectDirection

func DetectDirection(text string) Direction

DetectDirection analyzes a string and returns its dominant text direction based on Unicode character properties. It counts strong directional characters and returns the direction with the higher count, or Neutral if no strong directional characters are present.

func GetCharDirection

func GetCharDirection(r rune) Direction

GetCharDirection returns the inherent direction of a single Unicode character. Digits, punctuation, whitespace, and symbols are Neutral; RTL scripts (Arabic, Hebrew, Syriac, Thaana, N'Ko) return RTL; all other scripts return LTR.
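
The classification and counting rules described for GetCharDirection and DetectDirection can be approximated with the standard unicode package. The sketch below is an assumption about how such rules might be written, not the package's source: direction, charDirection, and detectDirection are local, hypothetical names, and resolving ties in favour of LTR is a choice made here because the documentation does not say how ties are handled.

```go
package main

import (
	"fmt"
	"unicode"
)

// direction mirrors the documented Direction values with local names.
type direction int

const (
	ltr direction = iota
	rtl
	neutral
)

// charDirection classifies one rune: digits, punctuation, whitespace, and
// symbols are neutral; the listed RTL scripts are rtl; everything else is ltr.
func charDirection(r rune) direction {
	switch {
	case unicode.IsDigit(r), unicode.IsPunct(r), unicode.IsSpace(r), unicode.IsSymbol(r):
		return neutral
	case unicode.In(r, unicode.Arabic, unicode.Hebrew, unicode.Syriac, unicode.Thaana, unicode.Nko):
		return rtl
	default:
		return ltr
	}
}

// detectDirection counts strong characters and returns the dominant
// direction, or neutral when there are none. Ties go to ltr (an assumption).
func detectDirection(s string) direction {
	var l, r int
	for _, c := range s {
		switch charDirection(c) {
		case ltr:
			l++
		case rtl:
			r++
		}
	}
	switch {
	case l == 0 && r == 0:
		return neutral
	case r > l:
		return rtl
	default:
		return ltr
	}
}

func main() {
	for _, s := range []string{"hello", "שלום", "123, 456"} {
		fmt.Printf("%q -> %d\n", s, detectDirection(s))
	}
}
```

Checking the neutral categories first keeps digits and punctuation neutral even when they belong to an RTL script block.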

func (Direction) String

func (d Direction) String() string

String returns a string representation of the direction ("LTR", "RTL", or "Neutral").

type Extractor

type Extractor struct {
	// contains filtered or unexported fields
}

Extractor extracts text fragments from PDF content streams. It maintains graphics state and registered fonts to properly decode and position text.

func NewExtractor

func NewExtractor() *Extractor

NewExtractor creates a new text extractor with initialized graphics state.

func (*Extractor) Extract

func (e *Extractor) Extract(operations []contentstream.Operation) ([]TextFragment, error)

Extract extracts text fragments from parsed content stream operations.

func (*Extractor) ExtractFromBytes

func (e *Extractor) ExtractFromBytes(data []byte) ([]TextFragment, error)

ExtractFromBytes parses raw content stream data and extracts text fragments.

func (*Extractor) GetFonts

func (e *Extractor) GetFonts() map[string]*font.Font

GetFonts returns the fonts registered in this extractor. It is useful for debugging font loading and ToUnicode CMap issues.

func (*Extractor) GetFragments

func (e *Extractor) GetFragments() []TextFragment

GetFragments returns all extracted text fragments.

func (*Extractor) GetText

func (e *Extractor) GetText() string

GetText returns all extracted text as a string with smart spacing. It handles both LTR and RTL text, grouping fragments into lines and adding appropriate word and line breaks.
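
Grouping fragments into lines can be illustrated with a small, self-contained sketch. Everything here is an assumption made for illustration: the fragment type is a hypothetical reduction of TextFragment, groupLines and render are invented helpers, and the rule that two fragments share a line when their Y values differ by at most half the fragment height is a guess, not the documented behaviour. The sketch covers LTR ordering only; an RTL line would presumably be read from the largest X downward.

```go
package main

import (
	"fmt"
	"math"
	"sort"
	"strings"
)

// fragment is a hypothetical stand-in for TextFragment.
type fragment struct {
	Text   string
	X, Y   float64
	Height float64
}

// groupLines orders fragments top to bottom (larger Y first, since PDF user
// space normally has Y growing upward) and starts a new line whenever a
// fragment's Y differs from the line's first fragment by more than half the
// fragment height. Each line is then sorted left to right.
func groupLines(frags []fragment) [][]fragment {
	sorted := append([]fragment(nil), frags...) // do not reorder the caller's slice
	sort.SliceStable(sorted, func(i, j int) bool { return sorted[i].Y > sorted[j].Y })

	var lines [][]fragment
	for _, f := range sorted {
		n := len(lines)
		if n > 0 && math.Abs(lines[n-1][0].Y-f.Y) <= 0.5*f.Height {
			lines[n-1] = append(lines[n-1], f)
			continue
		}
		lines = append(lines, []fragment{f})
	}
	for _, line := range lines {
		line := line
		sort.SliceStable(line, func(i, j int) bool { return line[i].X < line[j].X })
	}
	return lines
}

// render joins fragments with single spaces and lines with newlines.
func render(lines [][]fragment) string {
	out := make([]string, 0, len(lines))
	for _, line := range lines {
		words := make([]string, 0, len(line))
		for _, f := range line {
			words = append(words, f.Text)
		}
		out = append(out, strings.Join(words, " "))
	}
	return strings.Join(out, "\n")
}

func main() {
	frags := []fragment{
		{Text: "world", X: 40, Y: 700, Height: 10},
		{Text: "Hello", X: 10, Y: 700.5, Height: 10},
		{Text: "Next", X: 10, Y: 680, Height: 10},
	}
	fmt.Println(render(groupLines(frags)))
}
```

Comparing against the first fragment of the current line keeps the sketch short. A more careful grouping might track a running baseline so that slowly drifting lines stay together.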

func (*Extractor) RegisterFont

func (e *Extractor) RegisterFont(name, baseFont, subtype string)

RegisterFont registers a font by name for use during extraction. The baseFont and subtype are used to create a basic font with default metrics.

func (*Extractor) RegisterFontsFromPage

func (e *Extractor) RegisterFontsFromPage(page *pages.Page, resolver func(core.IndirectRef) (core.Object, error)) error

RegisterFontsFromPage parses and registers all fonts from a page's resources. This is the recommended way to prepare the extractor before extracting text from a page.

func (*Extractor) RegisterFontsFromResources

func (e *Extractor) RegisterFontsFromResources(resources core.Dict, resolver func(core.IndirectRef) (core.Object, error)) error

RegisterFontsFromResources parses and registers all fonts from a resources dictionary. Use this when working with page resources directly rather than through a Page object.

func (*Extractor) RegisterParsedFont

func (e *Extractor) RegisterParsedFont(name string, f *font.Font)

RegisterParsedFont registers a pre-parsed font for use during extraction. Use this when you have already parsed the font with its ToUnicode CMap and widths.

type TextFragment

type TextFragment struct {
	Text      string    // Decoded text content
	X, Y      float64   // Position in page coordinates
	Width     float64   // Width of the text in page units
	Height    float64   // Height (typically font size)
	FontName  string    // Name of the font used
	FontSize  float64   // Font size in page units
	Direction Direction // Text direction (LTR, RTL, Neutral)
}

TextFragment represents a piece of extracted text with position and font information.
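
Because each fragment carries its own position and size, callers can derive layout information without touching the content stream again. The following sketch computes the bounding box of a set of fragments. It declares a local copy of the struct so that it compiles on its own; the rect type and bounds helper are hypothetical, and treating (X, Y) as the lower-left corner of a box that extends by Width and Height is an assumption, since the documentation does not state the anchor point.

```go
package main

import (
	"fmt"
	"math"
)

// Direction and TextFragment are local copies of the documented types so
// that this example is self-contained.
type Direction int

type TextFragment struct {
	Text      string
	X, Y      float64
	Width     float64
	Height    float64
	FontName  string
	FontSize  float64
	Direction Direction
}

// rect is a hypothetical axis-aligned box in page coordinates.
type rect struct{ MinX, MinY, MaxX, MaxY float64 }

// bounds returns the smallest rect containing every fragment, assuming each
// fragment occupies [X, X+Width] x [Y, Y+Height]. The boolean is false when
// there are no fragments.
func bounds(frags []TextFragment) (rect, bool) {
	if len(frags) == 0 {
		return rect{}, false
	}
	r := rect{MinX: math.Inf(1), MinY: math.Inf(1), MaxX: math.Inf(-1), MaxY: math.Inf(-1)}
	for _, f := range frags {
		r.MinX = math.Min(r.MinX, f.X)
		r.MinY = math.Min(r.MinY, f.Y)
		r.MaxX = math.Max(r.MaxX, f.X+f.Width)
		r.MaxY = math.Max(r.MaxY, f.Y+f.Height)
	}
	return r, true
}

func main() {
	frags := []TextFragment{
		{Text: "Title", X: 10, Y: 700, Width: 50, Height: 12, FontSize: 12},
		{Text: "body", X: 70, Y: 680, Width: 30, Height: 10, FontSize: 10},
	}
	if r, ok := bounds(frags); ok {
		fmt.Printf("%+v\n", r)
	}
}
```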
