font

package
v1.5.2 Latest
Warning

This package is not in the latest version of its module.

Published: Jan 19, 2026 License: MIT Imports: 9 Imported by: 0

Documentation

Overview

Package font provides PDF font handling including Type1, TrueType, and CID fonts.

This package handles font parsing, character encoding, and text width calculation for accurate text extraction from PDFs.

Font Types

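The package supports multiple PDF font types:

  • Type1: the original PostScript fonts (Type1Font)
  • TrueType: fonts with glyph outlines as quadratic Bézier curves (TrueTypeFont)
  • Type0: composite fonts for large character sets, especially CJK, backed by a descendant CIDFont (Type0Font, CIDFont)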

Font Creation

Fonts are created from PDF font dictionaries:

t1, err := font.NewType1Font(fontDict, resolver)    // *font.Type1Font
tt, err := font.NewTrueTypeFont(fontDict, resolver) // *font.TrueTypeFont
t0, err := font.NewType0Font(fontDict, resolver)    // *font.Type0Font

Text Decoding

The Font type provides text decoding using ToUnicode CMaps:

text := f.DecodeString(rawBytes) // f is a *font.Font, rawBytes is a []byte

Character Widths

Width information is used for text positioning:

w := f.GetWidth(r)              // Single character (rune), in 1000ths of em
total := f.GetStringWidth(text) // String width in font units
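
Widths are reported in font units, which GetWidth documents as 1000ths of an em. To position text, a caller usually scales that value by the font size. The following is a minimal sketch of that scaling only; toTextSpace is a hypothetical helper that is not part of this package, and it ignores character spacing, word spacing, and horizontal scaling.

```go
package main

import "fmt"

// toTextSpace converts a width given in 1000ths of an em into text-space
// units for the given font size. Hypothetical helper, not part of package font.
func toTextSpace(widthUnits, fontSize float64) float64 {
	return widthUnits / 1000 * fontSize
}

func main() {
	// A 500-unit glyph set at size 12 advances half of 12 units.
	fmt.Println(toTextSpace(500, 12))
}
```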

Encodings

Character encodings map character codes to glyph names:

  • Standard PDF encodings (WinAnsiEncoding, MacRomanEncoding, etc.)
  • Custom encodings from /Encoding dictionary
  • ToUnicode CMaps for Unicode conversion

CMap Support

CMaps (Character Maps) handle character code to Unicode mapping:

  • Embedded ToUnicode CMaps
  • Predefined CJK CMaps
  • CID-to-Unicode mapping

Index

Constants

This section is empty.

Variables

var MacRomanEncoding = &standardEncoding{
	name:  "MacRomanEncoding",
	table: macRomanTable,
}

MacRomanEncoding - Classic Mac OS encoding for Western European languages

var PDFDocEncoding = &standardEncoding{
	name:  "PDFDocEncoding",
	table: pdfDocTable,
}

PDFDocEncoding - PDF's default encoding for text strings

var StandardEncodingTable = &standardEncoding{
	name:  "StandardEncoding",
	table: standardEncodingTableData,
}

StandardEncodingTable - Adobe StandardEncoding for Type1 fonts

var SymbolEncoding = &standardEncoding{
	name:  "SymbolEncoding",
	table: symbolEncodingTable,
}

SymbolEncoding - Adobe Symbol font encoding. Maps character codes to Greek letters, mathematical symbols, etc.

var WinAnsiEncoding = &standardEncoding{
	name:  "WinAnsiEncoding",
	table: winAnsiTable,
}

WinAnsiEncoding (Windows Code Page 1252) - Western European encoding. This is the most common encoding in PDFs created on Windows.

var ZapfDingbatsEncoding = &standardEncoding{
	name:  "ZapfDingbatsEncoding",
	table: zapfDingbatsEncodingTable,
}

ZapfDingbatsEncoding - Adobe ZapfDingbats font encoding. Maps character codes to decorative symbols, arrows, etc.

Functions

func DecodeUTF16BE

func DecodeUTF16BE(data []byte) string

DecodeUTF16BE decodes UTF-16 big-endian bytes to a string. Note: the input should NOT include the BOM (FEFF); strip it before calling.
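
The BOM handling described above can be sketched with the standard library alone. This is an illustration under assumptions, not the package's implementation: decodeUTF16BE and stripBOM are hypothetical names, and dropping a trailing odd byte is a choice made here for simplicity.

```go
package main

import (
	"fmt"
	"unicode/utf16"
)

// stripBOM removes a leading FE FF byte order mark if present.
func stripBOM(data []byte) []byte {
	if len(data) >= 2 && data[0] == 0xFE && data[1] == 0xFF {
		return data[2:]
	}
	return data
}

// decodeUTF16BE pairs bytes big-endian into 16-bit units and lets
// utf16.Decode combine surrogate pairs. A trailing odd byte is ignored.
func decodeUTF16BE(data []byte) string {
	units := make([]uint16, 0, len(data)/2)
	for i := 0; i+1 < len(data); i += 2 {
		units = append(units, uint16(data[i])<<8|uint16(data[i+1]))
	}
	return string(utf16.Decode(units))
}

func main() {
	raw := []byte{0xFE, 0xFF, 0x00, 0x48, 0x00, 0x69} // BOM followed by "Hi"
	fmt.Println(decodeUTF16BE(stripBOM(raw)))
}
```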

func DecodeUTF16LE

func DecodeUTF16LE(data []byte) string

DecodeUTF16LE decodes UTF-16 little-endian bytes to a string. Note: the input should NOT include the BOM (FFFE); strip it before calling.

func DecodeWithEncoding

func DecodeWithEncoding(data []byte, encodingName string) string

DecodeWithEncoding decodes data using the specified encoding and applies Unicode normalization

func IsEmojiSequence

func IsEmojiSequence(s string) bool

IsEmojiSequence checks if a string contains emoji sequences. Emoji can be multi-codepoint: base + modifiers (skin tone) + ZWJ sequences.

func IsValidUTF8

func IsValidUTF8(s string) bool

IsValidUTF8 checks if a string is valid UTF-8. This is useful for detecting UTF-16BE strings (which often fail UTF-8 validation).

func IsVerticalEncoding

func IsVerticalEncoding(encoding string) bool

IsVerticalEncoding checks if an encoding name indicates vertical writing mode. Identity-V is used for vertical text in CJK fonts; Identity-H (or any other encoding) is horizontal.

func NormalizeEmojiSequence

func NormalizeEmojiSequence(s string) string

NormalizeEmojiSequence normalizes emoji sequences for consistent storage. This handles skin tone modifiers and ZWJ sequences.

func NormalizeUnicode

func NormalizeUnicode(s string) string

NormalizeUnicode normalizes a string to NFC (Canonical Decomposition followed by Canonical Composition). This ensures that characters like é are always represented as U+00E9 (precomposed) rather than U+0065 U+0301 (e + combining acute accent). This is critical for RAG applications to ensure consistent embeddings.
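
To see why the two representations need unifying, the sketch below compares them using only the standard library. It does not perform normalization itself; in Go that is commonly done with golang.org/x/text/unicode/norm, though this page does not state how the package implements it. The helper name sizes is hypothetical.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// sizes reports how many code points and UTF-8 bytes a string occupies.
func sizes(s string) (runes, bytes int) {
	return utf8.RuneCountInString(s), len(s)
}

func main() {
	precomposed := "\u00e9" // é as a single code point
	decomposed := "e\u0301" // e followed by a combining acute accent

	// The strings render identically but compare as different values.
	fmt.Println(precomposed == decomposed)
	fmt.Println(sizes(precomposed))
	fmt.Println(sizes(decomposed))
}
```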

Types

type CIDFont

type CIDFont struct {
	BaseFont       string
	Subtype        string // CIDFontType0 or CIDFontType2
	CIDSystemInfo  *CIDSystemInfo
	FontDescriptor *FontDescriptor
	DW             float64           // Default width
	W              []WidthRange      // Width specifications
	DW2            [2]float64        // Default width for vertical writing [w1y w1]
	W2             []VerticalMetrics // Vertical metrics
	CIDToGIDMap    *core.Stream      // CID to GID mapping (for CIDFontType2)
}

CIDFont represents a CIDFont (Character ID keyed font). Used as the descendant font in Type0 fonts.

func NewCIDFont

func NewCIDFont(fontDict core.Dict, resolver func(core.IndirectRef) (core.Object, error)) (*CIDFont, error)

NewCIDFont creates a CIDFont from a PDF font dictionary

func (*CIDFont) GetCharacterCollection

func (cid *CIDFont) GetCharacterCollection() string

GetCharacterCollection returns a string identifying the character collection

func (*CIDFont) GetWidthForCID

func (cid *CIDFont) GetWidthForCID(cidValue int) float64

GetWidthForCID returns the width for a specific CID

func (*CIDFont) IsCJK

func (cid *CIDFont) IsCJK() bool

IsCJK returns true if this is a CJK (Chinese, Japanese, Korean) font

func (*CIDFont) IsChinese

func (cid *CIDFont) IsChinese() bool

IsChinese returns true if this is a Chinese font

func (*CIDFont) IsJapanese

func (cid *CIDFont) IsJapanese() bool

IsJapanese returns true if this is a Japanese font

func (*CIDFont) IsKorean

func (cid *CIDFont) IsKorean() bool

IsKorean returns true if this is a Korean font

type CIDSystemInfo

type CIDSystemInfo struct {
	Registry   string // e.g., "Adobe"
	Ordering   string // e.g., "Japan1", "GB1", "CNS1", "Korea1"
	Supplement int    // Version of the character collection
}

CIDSystemInfo identifies a character collection

type CMap

type CMap struct {
	// contains filtered or unexported fields
}

CMap represents a character map that maps character codes to Unicode

func NewCMap

func NewCMap() *CMap

NewCMap creates a new empty CMap

func ParseToUnicodeCMap

func ParseToUnicodeCMap(stream *core.Stream) (*CMap, error)

ParseToUnicodeCMap parses a ToUnicode CMap stream

func (*CMap) Lookup

func (cm *CMap) Lookup(charCode uint32) string

Lookup looks up a character code and returns the Unicode string. Returns an empty string if no mapping is found (the caller should handle fallback).

func (*CMap) LookupString

func (cm *CMap) LookupString(data []byte) string

LookupString decodes a string of character codes to Unicode

type CMapRange

type CMapRange struct {
	StartCode    uint32
	EndCode      uint32
	StartUnicode uint32
}

CMapRange represents a range of character code to Unicode mappings
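
A range entry of this shape is typically resolved by offsetting from the start of the range, as in a ToUnicode bfrange with a single destination start. The sketch below assumes that interpretation; cmapRange and lookupRange are hypothetical stand-ins, and the package's own lookup may differ.

```go
package main

import "fmt"

// cmapRange mirrors the exported CMapRange fields for illustration.
type cmapRange struct {
	StartCode, EndCode, StartUnicode uint32
}

// lookupRange returns the Unicode string for code, or "" if no range
// covers it, leaving fallback handling to the caller.
func lookupRange(ranges []cmapRange, code uint32) string {
	for _, r := range ranges {
		if code >= r.StartCode && code <= r.EndCode {
			return string(rune(r.StartUnicode + (code - r.StartCode)))
		}
	}
	return ""
}

func main() {
	ranges := []cmapRange{{StartCode: 0x20, EndCode: 0x7E, StartUnicode: 0x0020}}
	fmt.Printf("%q\n", lookupRange(ranges, 0x41)) // covered by the range
	fmt.Printf("%q\n", lookupRange(ranges, 0x80)) // not covered
}
```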

type CMapTable

type CMapTable struct {
	// contains filtered or unexported fields
}

CMapTable represents a TrueType cmap table

type CustomEncoding

type CustomEncoding struct {
	// contains filtered or unexported fields
}

CustomEncoding represents an encoding with custom differences applied to a base encoding. This implements the PDF Differences array mechanism, where specific character codes are overridden to map to different glyphs.

func NewCustomEncoding

func NewCustomEncoding(base Encoding, differences map[byte]rune) *CustomEncoding

NewCustomEncoding creates a custom encoding by applying differences to a base encoding. The differences map specifies byte values that should map to different runes than the base encoding.

func NewCustomEncodingFromGlyphs

func NewCustomEncodingFromGlyphs(base Encoding, differences map[byte]string) *CustomEncoding

NewCustomEncodingFromGlyphs creates a custom encoding using glyph names instead of runes. This matches PDF's Differences array syntax, which uses glyph names.

func (*CustomEncoding) Decode

func (e *CustomEncoding) Decode(b byte) rune

Decode converts a byte to a rune, using the difference if present, otherwise the base encoding
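
The override-then-fallback behavior can be sketched as a small overlay over a base decoder. This is an illustration only: the overlay type is hypothetical, and the base decoder used here (byte value equals code point) is an assumption chosen to keep the example short, not one of the package's encodings.

```go
package main

import "fmt"

// overlay is a hypothetical stand-in for a Differences-style encoding:
// a base decoder plus per-byte overrides.
type overlay struct {
	base  func(b byte) rune
	diffs map[byte]rune
}

// Decode prefers an override and falls back to the base decoder.
func (o overlay) Decode(b byte) rune {
	if r, ok := o.diffs[b]; ok {
		return r
	}
	return o.base(b)
}

// DecodeString decodes each byte independently.
func (o overlay) DecodeString(data []byte) string {
	out := make([]rune, 0, len(data))
	for _, b := range data {
		out = append(out, o.Decode(b))
	}
	return string(out)
}

func main() {
	identity := func(b byte) rune { return rune(b) } // assumed base decoder
	enc := overlay{base: identity, diffs: map[byte]rune{0x80: '€'}}
	fmt.Println(enc.DecodeString([]byte{'A', 0x80}))
}
```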

func (*CustomEncoding) DecodeString

func (e *CustomEncoding) DecodeString(data []byte) string

DecodeString converts a byte sequence to a Unicode string using custom mappings

func (*CustomEncoding) Name

func (e *CustomEncoding) Name() string

Name returns the encoding name

type Encoding

type Encoding interface {
	// Decode converts a byte value to a Unicode rune
	Decode(b byte) rune

	// DecodeString converts a byte sequence to a Unicode string
	DecodeString(data []byte) string

	// Name returns the encoding name
	Name() string
}

Encoding represents a character encoding that maps byte values to Unicode code points

func GetEncoding

func GetEncoding(name string) Encoding

GetEncoding returns the encoding by name

func InferEncodingFromFontName

func InferEncodingFromFontName(fontName string) Encoding

InferEncodingFromFontName attempts to infer the appropriate encoding from a font name. This is a fallback strategy when the PDF doesn't specify an encoding or ToUnicode CMap.

type Font

type Font struct {
	Name     string
	BaseFont string
	Subtype  string
	Encoding string

	// ToUnicode CMap for character code to Unicode mapping
	ToUnicodeCMap *CMap
	// contains filtered or unexported fields
}

Font represents a PDF font

func NewFont

func NewFont(name, baseFont, subtype string) *Font

NewFont creates a new font

func (*Font) DecodeString

func (f *Font) DecodeString(data []byte) string

DecodeString decodes a string of character codes to Unicode. Priority order:

  1. Use the ToUnicode CMap if present (most accurate)
  2. Check for a UTF-16 Byte Order Mark (BOM): FEFF or FFFE
  3. Use the font's Encoding property (standard encodings)
  4. Fall back to raw bytes as a string

All decoded strings are normalized to NFC for consistent embeddings.
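
The priority order can be sketched as a simple fallback chain. Everything named here is hypothetical: the decoder type and its function fields stand in for the ToUnicode CMap and the named encoding, and the final NFC step is left out because it needs a package outside the standard library.

```go
package main

import (
	"fmt"
	"unicode/utf16"
)

// decoder holds the optional inputs; a nil func means "not available".
type decoder struct {
	toUnicode func(data []byte) string // stands in for a ToUnicode CMap
	encoding  func(data []byte) string // stands in for a named encoding
}

// utf16Decode decodes BOM-less UTF-16 in the given byte order.
func utf16Decode(data []byte, bigEndian bool) string {
	units := make([]uint16, 0, len(data)/2)
	for i := 0; i+1 < len(data); i += 2 {
		hi, lo := data[i], data[i+1]
		if !bigEndian {
			hi, lo = lo, hi
		}
		units = append(units, uint16(hi)<<8|uint16(lo))
	}
	return string(utf16.Decode(units))
}

// decode tries each strategy in the documented priority order.
func (d decoder) decode(data []byte) string {
	switch {
	case d.toUnicode != nil:
		return d.toUnicode(data)
	case len(data) >= 2 && data[0] == 0xFE && data[1] == 0xFF:
		return utf16Decode(data[2:], true)
	case len(data) >= 2 && data[0] == 0xFF && data[1] == 0xFE:
		return utf16Decode(data[2:], false)
	case d.encoding != nil:
		return d.encoding(data)
	default:
		return string(data)
	}
}

func main() {
	d := decoder{} // no CMap and no encoding configured
	fmt.Println(d.decode([]byte{0xFE, 0xFF, 0x00, 'O', 0x00, 'K'}))
	fmt.Println(d.decode([]byte("raw bytes")))
}
```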

func (*Font) GetStringWidth

func (f *Font) GetStringWidth(s string) float64

GetStringWidth calculates the total width of a string

func (*Font) GetWidth

func (f *Font) GetWidth(r rune) float64

GetWidth returns the width of a character (in 1000ths of em)

func (*Font) IsStandardFont

func (f *Font) IsStandardFont() bool

IsStandardFont returns true if this is one of the Standard 14 fonts

func (*Font) IsVertical

func (f *Font) IsVertical() bool

IsVertical returns true if this font uses vertical writing mode. Vertical writing is indicated by the Identity-V encoding, commonly used for East Asian languages (Chinese, Japanese, Korean) where text flows top-to-bottom.

type FontDescriptor

type FontDescriptor struct {
	FontName     string
	Flags        int
	FontBBox     [4]float64 // [llx lly urx ury]
	ItalicAngle  float64
	Ascent       float64
	Descent      float64
	CapHeight    float64
	StemV        float64
	StemH        float64
	AvgWidth     float64
	MaxWidth     float64
	MissingWidth float64
	FontFile     *core.Stream // Type1 font program
	FontFile2    *core.Stream // TrueType font program
	FontFile3    *core.Stream // Type1C or CIDFont program
}

FontDescriptor contains font metrics and properties

type Metric

type Metric struct {
	W1Y float64
	W1  float64
}

Metric represents a single vertical metric

type TrueTypeFont

type TrueTypeFont struct {
	*Font // Embed basic font

	// TrueType-specific fields
	FirstChar      int
	LastChar       int
	Widths         []float64
	FontDescriptor *FontDescriptor
	ToUnicode      *core.Stream // CMap for character code to Unicode mapping

	// TrueType font program data
	FontProgram []byte            // Raw font program from FontFile2
	Tables      map[string][]byte // Parsed TrueType tables
	// contains filtered or unexported fields
}

TrueTypeFont represents a TrueType font in a PDF. TrueType fonts contain glyph outlines as quadratic Bézier curves.

func NewTrueTypeFont

func NewTrueTypeFont(fontDict core.Dict, resolver func(core.IndirectRef) (core.Object, error)) (*TrueTypeFont, error)

NewTrueTypeFont creates a TrueType font from a PDF font dictionary

func (*TrueTypeFont) GetGlyphID

func (tt *TrueTypeFont) GetGlyphID(r rune) uint16

GetGlyphID returns the glyph ID for a character

func (*TrueTypeFont) GetWidthFromGlyph

func (tt *TrueTypeFont) GetWidthFromGlyph(glyphID uint16) float64

GetWidthFromGlyph gets the width for a glyph ID

type Type0Font

type Type0Font struct {
	*Font // Embed basic font

	// Type0-specific fields
	Encoding       string
	DescendantFont *CIDFont     // The actual CIDFont
	ToUnicode      *core.Stream // CMap for CID to Unicode mapping
	IsVertical     bool         // true for Identity-V, false for Identity-H
}

Type0Font represents a Type0 (composite) font in a PDF. Type0 fonts are used for fonts with large character sets, especially CJK fonts.

func NewType0Font

func NewType0Font(fontDict core.Dict, resolver func(core.IndirectRef) (core.Object, error)) (*Type0Font, error)

NewType0Font creates a Type0 font from a PDF font dictionary

func (*Type0Font) GetWidth

func (t0 *Type0Font) GetWidth(r rune) float64

GetWidth returns the width for a character ID (CID)

type Type1Font

type Type1Font struct {
	*Font // Embed basic font

	// Type1-specific fields
	FirstChar      int
	LastChar       int
	Widths         []float64
	FontDescriptor *FontDescriptor
	ToUnicode      *core.Stream // CMap for character code to Unicode mapping
}

Type1Font represents a Type1 font in a PDF. Type1 fonts are the original PostScript fonts and one of the most common font types in PDFs.

func NewType1Font

func NewType1Font(fontDict core.Dict, resolver func(core.IndirectRef) (core.Object, error)) (*Type1Font, error)

NewType1Font creates a Type1 font from a PDF font dictionary

type VerticalMetrics

type VerticalMetrics struct {
	StartCID int
	EndCID   int
	W1Y      float64  // Position vector y component
	W1       float64  // Vertical width
	Metrics  []Metric // Individual metrics (if W1Y == 0 && W1 == 0)
}

VerticalMetrics represents vertical writing metrics in the W2 array

type WidthRange

type WidthRange struct {
	StartCID int
	EndCID   int
	Width    float64   // Single width for range
	Widths   []float64 // Individual widths (if Width == 0)
}

WidthRange represents a width specification in the W array
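
A width lookup over entries of this shape might look like the sketch below. It assumes a range carries either a per-CID Widths slice or a single Width, and that CIDs outside every range use the default width (DW). The names widthRange and widthForCID are hypothetical, and the package's GetWidthForCID may behave differently.

```go
package main

import "fmt"

// widthRange mirrors the exported WidthRange fields for illustration.
type widthRange struct {
	StartCID, EndCID int
	Width            float64   // single width for the whole range
	Widths           []float64 // individual widths, indexed from StartCID
}

// widthForCID returns the width for cid, falling back to dw when no
// range covers it.
func widthForCID(ranges []widthRange, dw float64, cid int) float64 {
	for _, r := range ranges {
		if cid < r.StartCID || cid > r.EndCID {
			continue
		}
		if i := cid - r.StartCID; i < len(r.Widths) {
			return r.Widths[i]
		}
		return r.Width
	}
	return dw
}

func main() {
	ranges := []widthRange{
		{StartCID: 1, EndCID: 3, Widths: []float64{500, 600, 700}},
		{StartCID: 10, EndCID: 20, Width: 1000},
	}
	fmt.Println(widthForCID(ranges, 750, 2))  // per-CID entry
	fmt.Println(widthForCID(ranges, 750, 15)) // shared range width
	fmt.Println(widthForCID(ranges, 750, 99)) // default width
}
```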
