Documentation
Overview ¶
Package text provides text extraction from PDF content streams.
This package handles the extraction of text fragments from PDF pages, including position calculation, font handling, and text direction detection.
Text Extraction ¶
The Extractor type processes PDF content stream operations to extract positioned text:
	extractor := text.NewExtractor()
	extractor.RegisterFontsFromPage(page, resolver)
	fragments, err := extractor.ExtractFromBytes(contentData)
Each TextFragment contains the text along with position (X, Y), dimensions (Width, Height), font information, and text direction.
Font Registration ¶
For accurate text extraction, fonts should be registered before extraction:
- RegisterFont - register a font by name
- RegisterParsedFont - register a pre-parsed font with ToUnicode CMap
- RegisterFontsFromPage - automatically register all fonts from a page
Text Direction ¶
The package supports bidirectional text with the Direction type:
- LTR - left-to-right (Latin, CJK, etc.)
- RTL - right-to-left (Arabic, Hebrew, etc.)
- Neutral - direction-neutral characters (numbers, punctuation)
The DetectDirection function analyzes text to determine its direction.
Smart Spacing ¶
The extractor picks a spacing strategy based on how the PDF encodes its text:
- Word-level PDFs: Uses font space width metrics
- Character-level PDFs: Uses adaptive gap detection
- Explicit spaces: Respects space characters in the stream
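The gap-based idea behind these rules can be sketched as follows. This is a simplified illustration, not the package's implementation: the `fragment` type, the `joinLine` helper, and the fixed 0.25 threshold are all assumptions, and a fixed threshold stands in for the adaptive detection described above.

```go
package main

import (
	"fmt"
	"strings"
)

// fragment is a hypothetical, trimmed-down stand-in for TextFragment.
type fragment struct {
	Text     string
	X, Width float64
	FontSize float64
}

// joinLine concatenates fragments that are already sorted left to right.
// It inserts a space when the horizontal gap between two fragments exceeds
// a fraction of the font size, unless an explicit space is already present.
// The 0.25 factor is an assumed threshold, not a value taken from the package.
func joinLine(frags []fragment) string {
	const gapFactor = 0.25
	var b strings.Builder
	for i, f := range frags {
		if i > 0 {
			prev := frags[i-1]
			gap := f.X - (prev.X + prev.Width)
			explicit := strings.HasSuffix(prev.Text, " ") || strings.HasPrefix(f.Text, " ")
			if !explicit && gap > gapFactor*f.FontSize {
				b.WriteByte(' ')
			}
		}
		b.WriteString(f.Text)
	}
	return b.String()
}

func main() {
	line := []fragment{
		{Text: "H", X: 0, Width: 7, FontSize: 12},
		{Text: "i", X: 7.2, Width: 3, FontSize: 12},
		{Text: "there", X: 14, Width: 28, FontSize: 12},
	}
	fmt.Println(joinLine(line)) // prints "Hi there"
}
```

A real implementation would likely prefer the font's own space width when it is available and fall back to a gap heuristic only when it is not.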
Index ¶
- type Direction
- type Extractor
- func (e *Extractor) Extract(operations []contentstream.Operation) ([]TextFragment, error)
- func (e *Extractor) ExtractFromBytes(data []byte) ([]TextFragment, error)
- func (e *Extractor) GetFonts() map[string]*font.Font
- func (e *Extractor) GetFragments() []TextFragment
- func (e *Extractor) GetFragmentsRaw() []TextFragment
- func (e *Extractor) GetText() string
- func (e *Extractor) RegisterFont(name, baseFont, subtype string)
- func (e *Extractor) RegisterFontsFromPage(page *pages.Page, resolver func(core.IndirectRef) (core.Object, error)) error
- func (e *Extractor) RegisterFontsFromResources(resources core.Dict, resolver func(core.IndirectRef) (core.Object, error)) error
- func (e *Extractor) RegisterParsedFont(name string, f *font.Font)
- func (e *Extractor) SetResourceContext(resources core.Dict, resolver func(core.IndirectRef) (core.Object, error))
- type TextFragment
Types ¶
type Direction ¶
type Direction int
Direction represents the writing direction of text. It is used to detect and handle bidirectional text (bidi) in documents.
func DetectDirection ¶
DetectDirection analyzes a string and returns its dominant text direction based on Unicode character properties. It counts strong directional characters and returns the direction with the higher count, or Neutral if no strong directional characters are present.
func GetCharDirection ¶
GetCharDirection returns the inherent direction of a single Unicode character. Digits, punctuation, whitespace, and symbols are Neutral; RTL scripts (Arabic, Hebrew, Syriac, Thaana, N'Ko) return RTL; all other scripts return LTR.
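The classification and counting rules described for GetCharDirection and DetectDirection can be approximated with the standard `unicode` package. The sketch below is an illustration under assumptions, not the package's code: the lowercase function names, the constant values, and the tie-breaking rule (ties go to LTR) are all hypothetical.

```go
package main

import (
	"fmt"
	"unicode"
)

// Direction mirrors the documented type; the constant values are an assumption.
type Direction int

const (
	LTR Direction = iota
	RTL
	Neutral
)

// charDirection classifies a single rune: digits, punctuation, whitespace,
// and symbols are Neutral; the listed RTL scripts are RTL; the rest are LTR.
func charDirection(r rune) Direction {
	switch {
	case unicode.IsDigit(r), unicode.IsPunct(r), unicode.IsSpace(r), unicode.IsSymbol(r):
		return Neutral
	case unicode.In(r, unicode.Arabic, unicode.Hebrew, unicode.Syriac, unicode.Thaana, unicode.Nko):
		return RTL
	default:
		return LTR
	}
}

// detectDirection counts strong LTR and RTL runes and returns the majority,
// or Neutral when no strong runes are present.
// Ties resolve to LTR here, which is an assumption.
func detectDirection(s string) Direction {
	var ltr, rtl int
	for _, r := range s {
		switch charDirection(r) {
		case LTR:
			ltr++
		case RTL:
			rtl++
		}
	}
	switch {
	case ltr == 0 && rtl == 0:
		return Neutral
	case rtl > ltr:
		return RTL
	default:
		return LTR
	}
}

func main() {
	fmt.Println(detectDirection("Hello, world") == LTR) // true
	fmt.Println(detectDirection("שלום עולם") == RTL)     // true
	fmt.Println(detectDirection("123, 456!") == Neutral) // true
}
```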
type Extractor ¶
type Extractor struct {
	// contains filtered or unexported fields
}
Extractor extracts text fragments from PDF content streams. It maintains graphics state and registered fonts to properly decode and position text.
func NewExtractor ¶
func NewExtractor() *Extractor
NewExtractor creates a new text extractor with initialized graphics state.
func (*Extractor) Extract ¶
func (e *Extractor) Extract(operations []contentstream.Operation) ([]TextFragment, error)
Extract extracts text fragments from parsed content stream operations.
func (*Extractor) ExtractFromBytes ¶
func (e *Extractor) ExtractFromBytes(data []byte) ([]TextFragment, error)
ExtractFromBytes parses raw content stream data and extracts text fragments.
func (*Extractor) GetFonts ¶
func (e *Extractor) GetFonts() map[string]*font.Font
GetFonts returns the fonts registered in this extractor. It is useful for debugging font loading and ToUnicode CMap issues.
func (*Extractor) GetFragments ¶
func (e *Extractor) GetFragments() []TextFragment
GetFragments returns all extracted text fragments with duplicates removed. Duplicate fragments are those at the same position with the same text, which can occur in PDFs with multiple content layers or tagged structure.
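Deduplication by position and text can be sketched as below. The `fragment` type, the `dedupe` helper, and the 0.1-unit rounding tolerance are assumptions made for illustration; the documentation does not say how the package decides that two positions are "the same".

```go
package main

import (
	"fmt"
	"math"
)

// fragment is a hypothetical, trimmed-down stand-in for TextFragment.
type fragment struct {
	Text string
	X, Y float64
}

// key identifies a fragment by its text and its position rounded to 0.1 units.
type key struct {
	text string
	x, y int64
}

// dedupe keeps the first fragment seen for each (text, rounded position) pair
// and preserves the original order of the survivors.
// Rounding to 0.1 page units is an assumed tolerance.
func dedupe(frags []fragment) []fragment {
	seen := make(map[key]bool, len(frags))
	out := make([]fragment, 0, len(frags))
	for _, f := range frags {
		k := key{
			text: f.Text,
			x:    int64(math.Round(f.X * 10)),
			y:    int64(math.Round(f.Y * 10)),
		}
		if seen[k] {
			continue
		}
		seen[k] = true
		out = append(out, f)
	}
	return out
}

func main() {
	frags := []fragment{
		{Text: "Total", X: 72, Y: 700},
		{Text: "Total", X: 72.01, Y: 700}, // same text, same place: dropped
		{Text: "Total", X: 72, Y: 680},    // same text, different line: kept
	}
	fmt.Println(len(dedupe(frags))) // prints 2
}
```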
func (*Extractor) GetFragmentsRaw ¶ added in v1.5.4
func (e *Extractor) GetFragmentsRaw() []TextFragment
GetFragmentsRaw returns all extracted text fragments without deduplication. Use this when you need to see all fragments including duplicates.
func (*Extractor) GetText ¶
func (e *Extractor) GetText() string
GetText returns all extracted text as a string with smart spacing. It handles both LTR and RTL text, groups fragments into lines, and adds appropriate word and line breaks. Duplicate fragments at the same position are removed automatically.
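Grouping fragments into lines can be illustrated with a short sketch. It assumes PDF-style coordinates (Y grows upward), handles LTR ordering only, and joins fragments with a single space; the `fragment` type, the `groupLines` helper, and the tolerance parameter are hypothetical and do not come from the package.

```go
package main

import (
	"fmt"
	"math"
	"sort"
	"strings"
)

// fragment is a hypothetical, trimmed-down stand-in for TextFragment.
type fragment struct {
	Text string
	X, Y float64
}

// groupLines sorts fragments from the top of the page downward, starts a new
// line whenever Y differs from the current line's first fragment by more than
// tol, and then orders each line left to right.
func groupLines(frags []fragment, tol float64) []string {
	sorted := append([]fragment(nil), frags...) // do not mutate the input
	sort.SliceStable(sorted, func(i, j int) bool { return sorted[i].Y > sorted[j].Y })

	var lines [][]fragment
	for _, f := range sorted {
		n := len(lines)
		if n > 0 && math.Abs(lines[n-1][0].Y-f.Y) <= tol {
			lines[n-1] = append(lines[n-1], f)
			continue
		}
		lines = append(lines, []fragment{f})
	}

	out := make([]string, 0, len(lines))
	for _, line := range lines {
		sort.SliceStable(line, func(i, j int) bool { return line[i].X < line[j].X })
		words := make([]string, len(line))
		for i, f := range line {
			words[i] = f.Text
		}
		out = append(out, strings.Join(words, " "))
	}
	return out
}

func main() {
	frags := []fragment{
		{Text: "world", X: 120, Y: 700.4},
		{Text: "Second", X: 72, Y: 686},
		{Text: "Hello", X: 72, Y: 700},
		{Text: "line", X: 130, Y: 686},
	}
	fmt.Println(strings.Join(groupLines(frags, 2), "\n"))
	// prints:
	// Hello world
	// Second line
}
```

For RTL lines, one plausible adaptation is to sort each line by descending X instead; whether the package does this is not stated here.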
func (*Extractor) RegisterFont ¶
func (e *Extractor) RegisterFont(name, baseFont, subtype string)
RegisterFont registers a font by name for use during extraction. The baseFont and subtype are used to create a basic font with default metrics.
func (*Extractor) RegisterFontsFromPage ¶
func (e *Extractor) RegisterFontsFromPage(page *pages.Page, resolver func(core.IndirectRef) (core.Object, error)) error
RegisterFontsFromPage parses and registers all fonts from a page's resources. This is the recommended way to prepare the extractor before extracting text from a page.
func (*Extractor) RegisterFontsFromResources ¶
func (e *Extractor) RegisterFontsFromResources(resources core.Dict, resolver func(core.IndirectRef) (core.Object, error)) error
RegisterFontsFromResources parses and registers all fonts from a resources dictionary. Use this when working with page resources directly rather than through a Page object.
func (*Extractor) RegisterParsedFont ¶
func (e *Extractor) RegisterParsedFont(name string, f *font.Font)
RegisterParsedFont registers a pre-parsed font for use during extraction. Use this when you have already parsed the font with its ToUnicode CMap and widths.
func (*Extractor) SetResourceContext ¶ added in v1.6.4
func (e *Extractor) SetResourceContext(resources core.Dict, resolver func(core.IndirectRef) (core.Object, error))
SetResourceContext configures the extractor with resources and a resolver for XObject processing. This enables extraction of text from Form XObjects.
type TextFragment ¶
type TextFragment struct {
	Text      string    // Decoded text content
	X, Y      float64   // Position in page coordinates
	Width     float64   // Width of the text in page units
	Height    float64   // Height (typically font size)
	FontName  string    // Name of the font used
	FontSize  float64   // Font size in page units
	Direction Direction // Text direction (LTR, RTL, Neutral)
}
TextFragment represents a piece of extracted text with position and font information.
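Because each fragment carries its own position and size, callers can filter fragments by page region, for example to isolate a header. The sketch below redeclares the documented types locally so that it compiles on its own. The `inRegion` helper is hypothetical, and it assumes (X, Y) is the lower-left corner of the fragment's box, which the documentation does not state.

```go
package main

import "fmt"

// Direction and TextFragment are local copies of the documented types,
// declared here only so the example is self-contained.
type Direction int

type TextFragment struct {
	Text      string
	X, Y      float64
	Width     float64
	Height    float64
	FontName  string
	FontSize  float64
	Direction Direction
}

// inRegion returns the fragments whose box lies fully inside the rectangle
// with corners (x0, y0) and (x1, y1).
// Assumption: (X, Y) is the lower-left corner of the fragment.
func inRegion(frags []TextFragment, x0, y0, x1, y1 float64) []TextFragment {
	var out []TextFragment
	for _, f := range frags {
		if f.X >= x0 && f.X+f.Width <= x1 && f.Y >= y0 && f.Y+f.Height <= y1 {
			out = append(out, f)
		}
	}
	return out
}

func main() {
	frags := []TextFragment{
		{Text: "Annual Report", X: 72, Y: 760, Width: 100, Height: 14, FontSize: 14},
		{Text: "Body text", X: 72, Y: 400, Width: 200, Height: 11, FontSize: 11},
	}
	// Keep only fragments in the top band of a US Letter page (612 x 792 units).
	for _, f := range inRegion(frags, 0, 700, 612, 792) {
		fmt.Println(f.Text) // prints "Annual Report"
	}
}
```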