pdf

package
v0.3.0 Latest
Warning

This package is not in the latest version of its module.

Published: May 27, 2026 License: MIT Imports: 9 Imported by: 0

Documentation

Overview

Package pdf is the internal content-stream interpreter for pdftable.

It is intentionally NOT a public package: the data model here is the raw output of walking a PDF content stream (glyph events, path segments, graphics-state pushes), and it would be premature to lock that shape behind a stable API while the layout-analysis layers on top are still being built out.

The interpreter mirrors the structure of pdfminer.six's PDFPageInterpreter (in pdfinterp.py), with the same operator dispatch table, the same graphics/text state stacks, and the same coordinate-transform math, but rewritten in idiomatic Go: explicit state structs instead of dynamic attributes, a switch-on-operator dispatch instead of getattr, and value types for matrices instead of tuples.

What this package does NOT do:

  • Compose glyphs into words or lines. That is layout analysis and lives one layer up (Page.Words, in a later sub-phase).
  • Solve table detection. Also layout analysis.
  • Handle PDF encryption. The reader assumes a decrypted document; callers that need decryption should pre-process with pdfcpu.
  • Render Type 3 fonts. Type 3 fonts are themselves content streams; they are extremely rare in tabular documents and we explicitly punt — the glyph's bbox is still emitted, just with empty Text.

Constants

This section is empty.

Variables

var IdentityMatrix = Matrix{1, 0, 0, 1, 0, 0}

IdentityMatrix is the affine identity (no transform). Used as the starting CTM and text matrix until an op overrides them.

Functions

func AdobeGlyphToUnicode

func AdobeGlyphToUnicode(name string) string

AdobeGlyphToUnicode resolves Adobe glyph names (e.g. "A", "comma", "fi", "Adieresis", "uni0041") to Unicode strings.

Lookup order:

  • Exact match in adobeGlyphTable (~250 entries; the full set of glyphs referenced by any of the four PDF base encodings, plus common additions like fractions and arrows that appear in real-world /Differences arrays).
  • Compound names with "_" separators are split and each part is resolved recursively (per AGL spec §2 — "f_i" → "fi").
  • Variant suffixes (".alt", ".sc", ...) are stripped before lookup.
  • "uniXXXX"/"uniXXXXXXXX" → one or more UTF-16 hex codepoints.
  • "uXXXX".."uXXXXXX" → a single hex codepoint.

Anything else returns "" — the caller falls back to a (cid:NNN) placeholder.
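The lookup order above can be sketched roughly as follows. This is an illustration only, not the package's implementation: `resolveGlyph` is a hypothetical name, and the four-entry `table` stands in for adobeGlyphTable.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
	"unicode/utf16"
)

// table is a tiny stand-in for the package's glyph table (assumption).
var table = map[string]string{"A": "A", "comma": ",", "f": "f", "i": "i"}

// resolveGlyph is a hypothetical helper that follows the documented order.
func resolveGlyph(name string) string {
	if s, ok := table[name]; ok { // 1. exact match
		return s
	}
	if strings.Contains(name, "_") { // 2. compound name: resolve each part
		var sb strings.Builder
		for _, part := range strings.Split(name, "_") {
			sb.WriteString(resolveGlyph(part))
		}
		return sb.String()
	}
	if i := strings.Index(name, "."); i > 0 { // 3. strip variant suffix
		return resolveGlyph(name[:i])
	}
	if hex := strings.TrimPrefix(name, "uni"); hex != name && len(hex) >= 4 && len(hex)%4 == 0 {
		// 4. uniXXXX...: each group of four hex digits is one UTF-16 code unit
		units := make([]uint16, 0, len(hex)/4)
		for j := 0; j < len(hex); j += 4 {
			v, err := strconv.ParseUint(hex[j:j+4], 16, 16)
			if err != nil {
				return ""
			}
			units = append(units, uint16(v))
		}
		return string(utf16.Decode(units))
	}
	if hex := strings.TrimPrefix(name, "u"); hex != name && len(hex) >= 4 && len(hex) <= 6 {
		// 5. uXXXX..uXXXXXX: a single code point
		if v, err := strconv.ParseUint(hex, 16, 32); err == nil {
			return string(rune(v))
		}
	}
	return "" // caller falls back to a (cid:NNN) placeholder
}

func main() {
	for _, n := range []string{"comma", "f_i", "A.sc", "uni0041", "u1F600", "nosuch"} {
		fmt.Printf("%q -> %q\n", n, resolveGlyph(n))
	}
}
```

Names that merely start with "u" (such as "union") fall through to the empty string here because their remainder is not valid hex.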

func ApplyDifferences

func ApplyDifferences(base [256]string, entries []Difference) [256]string

ApplyDifferences overlays a /Differences array on a base encoding. The array is a flat sequence in which each integer start code is followed by one or more glyph names (see PDF 1.7 §9.6.6.1):

[ 39 /quotesingle 96 /grave /quoteleft ]

means "glyph 39 is /quotesingle, glyph 96 is /grave, glyph 97 is /quoteleft". The integer resets the running CID; each subsequent name occupies CID++.

entries is the (cid, name) sequence as decoded by the caller (the content interpreter does the array walking). The return value is the resulting table, handed back to the font; because Go arrays are values, base itself is not modified.
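As a rough sketch of the two steps described above (walking the flat array, then overlaying it), assuming hypothetical helpers `flatten` and `overlay` and a locally redeclared `Difference` type. Whether the real table stores glyph names or resolved Unicode strings is not stated here; the sketch stores names.

```go
package main

import "fmt"

// Difference mirrors the documented (cid, glyph-name) pair.
type Difference struct {
	CID       int
	GlyphName string
}

// flatten is a hypothetical helper: integers reset the running code,
// and each following name occupies the next consecutive code.
func flatten(raw []interface{}) []Difference {
	var out []Difference
	code := 0
	for _, v := range raw {
		switch x := v.(type) {
		case int:
			code = x
		case string:
			out = append(out, Difference{CID: code, GlyphName: x})
			code++
		}
	}
	return out
}

// overlay sketches the overlay step. Arrays are values in Go, so base
// is a copy and the caller's table is left untouched.
func overlay(base [256]string, entries []Difference) [256]string {
	for _, d := range entries {
		if d.CID >= 0 && d.CID < 256 { // ignore out-of-range codes
			base[d.CID] = d.GlyphName
		}
	}
	return base
}

func main() {
	var base [256]string
	base[96] = "quoteleft"
	raw := []interface{}{39, "quotesingle", 96, "grave", "quoteleft"}
	out := overlay(base, flatten(raw))
	fmt.Println(out[39], out[96], out[97])
}
```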

func ApplyPoint

func ApplyPoint(m Matrix, x, y float64) (float64, float64)

ApplyPoint maps (x, y) through m. PDF user space points → device space, or text space → user space, depending on which matrix m is.

func ApplyRect

func ApplyRect(m Matrix, x0, y0, x1, y1 float64) (float64, float64, float64, float64)

ApplyRect maps a rectangle through m. The result is a NEW rectangle that tightly encloses the (possibly rotated) image of the input — it is NOT a rotated rectangle. After the transform we normalise so x0 <= x1 and y0 <= y1; pdfplumber relies on this invariant.
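A minimal sketch of the point and rectangle mapping described above, using the [a b c d e f] layout from the Matrix type. The lowercase function names are hypothetical and the code is not taken from the package.

```go
package main

import (
	"fmt"
	"math"
)

// Matrix holds the six PDF parameters [a b c d e f].
type Matrix [6]float64

// applyPoint computes x' = a*x + c*y + e and y' = b*x + d*y + f.
func applyPoint(m Matrix, x, y float64) (float64, float64) {
	return m[0]*x + m[2]*y + m[4], m[1]*x + m[3]*y + m[5]
}

// applyRect transforms all four corners and returns the axis-aligned
// box that encloses them, so x0 <= x1 and y0 <= y1 always hold.
func applyRect(m Matrix, x0, y0, x1, y1 float64) (float64, float64, float64, float64) {
	xs := [4]float64{x0, x1, x0, x1}
	ys := [4]float64{y0, y0, y1, y1}
	minX, minY := math.Inf(1), math.Inf(1)
	maxX, maxY := math.Inf(-1), math.Inf(-1)
	for i := range xs {
		x, y := applyPoint(m, xs[i], ys[i])
		minX, maxX = math.Min(minX, x), math.Max(maxX, x)
		minY, maxY = math.Min(minY, y), math.Max(maxY, y)
	}
	return minX, minY, maxX, maxY
}

func main() {
	rot90 := Matrix{0, 1, -1, 0, 0, 0} // maps (x, y) to (-y, x)
	fmt.Println(applyRect(rot90, 0, 0, 2, 1))
}
```

Transforming all four corners, rather than only two opposite ones, is what keeps the result correct under rotation and skew.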

func EncodingByName

func EncodingByName(name string) [256]string

EncodingByName returns the 256-entry cid→Unicode table for a base encoding name. Returns the StandardEncoding (the PDF spec's default) if the name is unrecognised.

Types

type CMap

type CMap struct {
	// contains filtered or unexported fields
}

CMap is a parsed ToUnicode CMap. For each glyph CID (a uint16 because PDF composite fonts use up to 2-byte CIDs), it returns the Unicode string that glyph represents — for almost all glyphs this is one rune, but ligature glyphs like `fi` and `ffi` legitimately produce multi-rune strings via the `bfchar` and `bfrange` CMap directives.

We only parse the `bfchar` and `bfrange` directives plus the basic codespace-range bookkeeping. The other CMap features (`cidchar`, `cidrange`, `notdefrange`) describe how to MAP bytes to CIDs (the inverse direction), and for that we don't need a CMap at all — we treat the input bytes themselves as the CID stream for composite fonts. This is the same shortcut pdfminer.six takes for Identity-H/V-encoded CIDFonts, and it's correct for ~all real-world modern PDFs.

func NewCMap

func NewCMap() *CMap

NewCMap returns an empty CMap. Use ParseCMap to build a populated CMap from a ToUnicode stream.

func ParseCMap

func ParseCMap(data []byte) (*CMap, error)

ParseCMap reads a ToUnicode stream and populates a fresh CMap.

The grammar is a tiny subset of PostScript — we tokenize the stream into (operator, operand) tuples and dispatch on operator keywords:

  • `N beginbfchar ... endbfchar`: N pairs of <src> <dst>.
  • `N beginbfrange ... endbfrange`: N triples of <srcLo> <srcHi> <dst>. `dst` may be a hex string or an array of hex strings.

All other CMap directives (`begincmap`, `begincodespacerange`, `beginnotdefrange`, `def`, `dict`, etc.) are recognised but their payloads are simply skipped — we don't need any of them to resolve CID → Unicode.

Returns a nil error on a successful parse; returns a non-nil error only for I/O-level corruption (mismatched bf-block markers, unparseable hex). Truncated or malformed bf entries are silently dropped. The pdfminer reference does the same, and it's the right call: real PDFs have lots of weird CMaps and a strict parser breaks too easily.

func (*CMap) Lookup

func (c *CMap) Lookup(cid uint16) (string, bool)

Lookup returns the Unicode string for cid and ok=true if present. Returns "", false otherwise — the caller decides whether to fall through to an Encoding-based mapping or to emit a `(cid:NNN)` placeholder.

func (*CMap) Size

func (c *CMap) Size() int

Size returns the number of entries in the CMap; useful for tests.

type CharEvent

type CharEvent struct {
	Text           string
	X0, Y0, X1, Y1 float64
	FontName       string
	FontSize       float64
	Upright        bool
	Advance        float64
}

CharEvent is the per-glyph data passed to EmitChar.

type Difference

type Difference struct {
	CID       int
	GlyphName string
}

Difference is one (cid, glyph-name) pair from a /Differences array.

type Font

type Font struct {
	// BaseFont is the PostScript name from the font dictionary's
	// /BaseFont entry, e.g. "Helvetica-Bold" or "ABCDEF+Times". Surfaced
	// verbatim to the caller as Char.FontName.
	BaseFont string

	// IsSimple is true for Type1 and TrueType fonts (single-byte CIDs,
	// /Encoding name + optional /Differences array). Composite fonts
	// (CIDFontType0/2) have IsSimple = false and use a Type0 cmap to
	// segment the byte stream into multi-byte CIDs.
	IsSimple bool

	// ToUnicode is the optional parsed /ToUnicode CMap. When present
	// it is consulted FIRST, in front of the encoding table — the PDF
	// spec is unambiguous about this (PDF 1.7 §9.10.2). Many PDFs ship
	// a ToUnicode map even for fonts that already have a usable
	// encoding, because it's the only way to map ligature glyphs back
	// to "fi"/"ffi"/etc.
	ToUnicode *CMap

	// Widths maps CID → advance width in 1/1000ths of a font unit. For
	// simple fonts the keys are bytes 0..255; for composite fonts they
	// are 2-byte CIDs. DefaultWidth is used for CIDs not in the map.
	Widths       map[uint16]float64
	DefaultWidth float64

	// Ascent and Descent are the font's typographic extrema in
	// 1/1000ths of a font unit, read from /FontDescriptor. Descent is
	// always stored negative (PDF spec) — we normalise on read.
	Ascent  float64
	Descent float64
	// contains filtered or unexported fields
}

Font is the interpreter's view of a single PDF font resource. Each font on a page (named under `/Font` in the page resources) becomes one of these. The interpreter resolves Tf operators by looking up the font name in the page's font map and stashing the *Font on the text state. Every subsequent text-showing op uses Font.Decode to turn the input byte string into a sequence of CIDs, then DecodeUnicode and CharWidth to resolve each CID's Unicode text and width.

A Font is constructed once per page (or once and reused across pages, when the same font dict is reachable from multiple pages — pdfcpu dereferences indirect references for us so we always get the same *Font pointer back).

func (*Font) CharWidth

func (f *Font) CharWidth(cid uint16) float64

CharWidth returns the advance width for cid in 1/1000ths of the font's design unit. Multiply by FontSize/1000 to get the advance in text-space units (before applying the text matrix).
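For orientation, the PDF specification's horizontal displacement formula combines this width with the text-state parameters as ((w/1000) × Tfs + Tc + Tw) × Th. The sketch below applies that formula; `glyphAdvance` is a hypothetical helper, and whether the package computes advances in exactly this way is an assumption.

```go
package main

import "fmt"

// glyphAdvance returns the text-space advance for one glyph.
// width1000 is the value CharWidth would return (1/1000ths);
// scale is Tz/100; wordSpace applies only to the space character.
func glyphAdvance(width1000, fontSize, charSpace, wordSpace, scale float64, isSpace bool) float64 {
	adv := width1000/1000*fontSize + charSpace
	if isSpace {
		adv += wordSpace
	}
	return adv * scale
}

func main() {
	// A 500-unit-wide glyph at 12pt with no extra spacing.
	fmt.Println(glyphAdvance(500, 12, 0, 0, 1, false))
}
```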

func (*Font) Decode

func (f *Font) Decode(b []byte) []uint16

Decode walks a PDF text-showing operand (a byte string) and yields the sequence of CIDs it represents. For simple fonts that's just the bytes; for composite fonts (Identity-H is by far the most common composite encoding) bytes are paired into 2-byte CIDs.

The returned slice is fresh — callers may retain it. Per-CID resolution (Unicode lookup, width) happens in DecodeUnicode and CharWidth, separately, so callers that only want one of those can avoid the cost of the other.

func (*Font) DecodeUnicode

func (f *Font) DecodeUnicode(cid uint16) string

DecodeUnicode returns the Unicode text for a single CID. Lookup order:

  1. ToUnicode CMap (if present).
  2. Encoding table (simple fonts only).
  3. The literal placeholder "(cid:NNN)" — same convention as pdfminer.six. Layout code can detect this prefix and treat such chars as "positioned but unreadable", which is still useful (the bbox carries the table grid even when the text doesn't come back).
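The decode step and the three-stage lookup can be sketched together as below. The function names, the map standing in for a parsed CMap, and the choice to drop a trailing odd byte are all assumptions made for the illustration.

```go
package main

import "fmt"

// decode passes bytes through for simple fonts and pairs them
// big-endian into 2-byte CIDs otherwise. A trailing odd byte is
// dropped in this sketch (assumption).
func decode(b []byte, simple bool) []uint16 {
	if simple {
		out := make([]uint16, len(b))
		for i, c := range b {
			out[i] = uint16(c)
		}
		return out
	}
	out := make([]uint16, 0, len(b)/2)
	for i := 0; i+1 < len(b); i += 2 {
		out = append(out, uint16(b[i])<<8|uint16(b[i+1]))
	}
	return out
}

// toUnicode tries the ToUnicode map, then the encoding table (nil for
// composite fonts), then falls back to the (cid:NNN) placeholder.
func toUnicode(cid uint16, cmap map[uint16]string, enc *[256]string) string {
	if s, ok := cmap[cid]; ok {
		return s
	}
	if enc != nil && cid < 256 && enc[cid] != "" {
		return enc[cid]
	}
	return fmt.Sprintf("(cid:%d)", cid)
}

func main() {
	var enc [256]string
	enc['A'] = "A"
	cmap := map[uint16]string{0x00A9: "fi"}
	for _, cid := range decode([]byte{0x00, 0x41, 0x00, 0xA9, 0x01, 0x02}, false) {
		fmt.Println(toUnicode(cid, cmap, &enc))
	}
}
```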

type GraphicsState

type GraphicsState struct {
	CTM       Matrix  // Current transformation matrix.
	LineWidth float64 // From `w`; user-space units.

	// Text state is part of the graphics state per the PDF spec, so it
	// participates in the same q/Q stack. We keep it inline rather than
	// as a separate field-of-fields because the q/Q hot path copies the
	// whole struct; one flat struct is one memcpy.
	Text TextState
}

GraphicsState is the snapshot pushed by `q` and popped by `Q`.

We track only the fields the public API actually exposes: the CTM, line width (so emitted Lines / Rects carry their drawing width), and the text state. Whether a path was stroked or filled comes from the painting operator itself and is recorded on PathEvent. Color, clipping path, dash pattern etc. are parsed by the operator dispatcher so the stack stays balanced, but we don't retain them. Adding those fields later is a non-breaking change.
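Because the state is a flat value type, `q` and `Q` reduce to copying a struct onto and off a slice. A minimal sketch, with hypothetical `state` and `stack` types and the assumption that an unbalanced `Q` is simply ignored:

```go
package main

import "fmt"

type Matrix [6]float64

// state is a trimmed stand-in for GraphicsState.
type state struct {
	CTM       Matrix
	LineWidth float64
}

type stack struct {
	cur   state
	saved []state
}

// push implements q: snapshot the current state by value.
func (s *stack) push() { s.saved = append(s.saved, s.cur) }

// pop implements Q: restore the most recent snapshot, if any.
func (s *stack) pop() {
	if n := len(s.saved); n > 0 {
		s.cur = s.saved[n-1]
		s.saved = s.saved[:n-1]
	}
}

func main() {
	s := stack{cur: state{CTM: Matrix{1, 0, 0, 1, 0, 0}, LineWidth: 1}}
	s.push()
	s.cur.LineWidth = 5 // e.g. a `w` operator inside the q/Q pair
	s.pop()
	fmt.Println(s.cur.LineWidth)
}
```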

type Interpreter

type Interpreter struct {

	// Fonts is the page's font map (/Font subtree of resources),
	// indexed by the resource name used in Tf operands. Set by the
	// caller before Run.
	Fonts map[string]*Font

	// XObjects is the page's XObject map (/XObject subtree). Used
	// only for `Do`-invoked Form XObjects whose content streams are
	// inlined into the page; image XObjects are recognised and
	// dropped.
	XObjects map[string]XObject

	// Sink receives emitted events. The caller installs whatever
	// implementation it likes — pdftable's Page uses a struct that
	// accumulates Chars and paints into separate slices.
	Sink Sink
	// contains filtered or unexported fields
}

Interpreter walks a PDF content stream, maintaining the graphics and text state, and emits typed events (glyphs and path paints) to a Sink callback supplied by the caller. One Interpreter is used per page; resetting requires constructing a new one.

The design follows pdfminer.six's PDFPageInterpreter / PDFDevice split, with two changes:

  1. The "device" is just a Sink interface that receives Char and paint-path events. No subclassing, no method override mess — we don't need pdfminer's layered architecture because we're not implementing a separate visual renderer.
  2. The interpreter is purely synchronous and single-threaded. PDF content streams don't have any constructs that benefit from concurrency, and a single-threaded loop is much easier to reason about when porting glyph-position math from Python.

func NewInterpreter

func NewInterpreter(initialCTM Matrix, sink Sink) *Interpreter

NewInterpreter returns a fresh interpreter ready to walk a content stream. The initial CTM is the device transform supplied by the caller (typically identity composed with the page rotation and mediabox-origin translation).

func (*Interpreter) Run

func (it *Interpreter) Run(stream []byte) error

Run lexes and dispatches a single content stream. Multiple content streams on the same page are run sequentially against the same interpreter instance — this matches how the PDF spec defines them: the array `/Contents [a b c]` is semantically `a ++ b ++ c`, with q/Q stack state preserved across the splits.

type Matrix

type Matrix [6]float64

Matrix is a 3x3 affine transform stored as the six PDF parameters [a b c d e f], representing the matrix

[ a b 0 ]
[ c d 0 ]
[ e f 1 ]

Multiplication and point application follow the PDF spec (PDF 1.7, section 8.3.3). We use a fixed-size value type rather than a pointer or slice — the matrix is six floats and copying it is cheaper than reaching through a heap-allocated wrapper, and we save the GC bookkeeping in a hot path that runs once per glyph.

func Mult

func Mult(m2, m1 Matrix) Matrix

Mult is matrix concatenation, following pdfminer's mult_matrix convention exactly: `Mult(m2, m1)` returns the matrix M such that applying M to a point produces the same result as applying m2 first and then applying m1 to the intermediate result.

This convention matters because of how nested transforms compose: in a `cm ... Tm ...` sequence, glyph positions come from `Translate(textMatrix, ...)` applied through `Mult(textMatrix, ctm)`, i.e. text matrix first, then CTM. The most recently set (innermost) transform is applied to the point first.

func Translate

func Translate(m Matrix, tx, ty float64) Matrix

Translate returns m with a translation by (tx, ty) applied in m's own local coordinate system, i.e. before m itself. Equivalent to Mult(Matrix{1, 0, 0, 1, tx, ty}, m). Used to position each glyph as the text matrix advances.
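The argument order is easy to get backwards, so here is a sketch of both functions written out from the definitions above (first argument applied first). The lowercase names are hypothetical, and the bodies are derived from the stated convention rather than copied from the package.

```go
package main

import "fmt"

// Matrix holds the six PDF parameters [a b c d e f].
type Matrix [6]float64

// mult returns the matrix that applies first, then second.
func mult(first, second Matrix) Matrix {
	a1, b1, c1, d1, e1, f1 := first[0], first[1], first[2], first[3], first[4], first[5]
	a0, b0, c0, d0, e0, f0 := second[0], second[1], second[2], second[3], second[4], second[5]
	return Matrix{
		a0*a1 + c0*b1, b0*a1 + d0*b1,
		a0*c1 + c0*d1, b0*c1 + d0*d1,
		a0*e1 + c0*f1 + e0, b0*e1 + d0*f1 + f0,
	}
}

// translate shifts by (tx, ty) in m's local coordinates, then applies m.
func translate(m Matrix, tx, ty float64) Matrix {
	return Matrix{m[0], m[1], m[2], m[3],
		tx*m[0] + ty*m[2] + m[4],
		tx*m[1] + ty*m[3] + m[5]}
}

func main() {
	shift := Matrix{1, 0, 0, 1, 10, 0}
	scale2 := Matrix{2, 0, 0, 2, 0, 0}
	fmt.Println(mult(shift, scale2)) // shift, then scale: offset is doubled
	fmt.Println(mult(scale2, shift)) // scale, then shift: offset is not
	fmt.Println(translate(scale2, 10, 0) == mult(shift, scale2))
}
```

With a text matrix in the role of `shift` and the CTM in the role of `scale2`, the first call corresponds to the `Mult(textMatrix, ctm)` composition described above.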

type Operand

type Operand struct {
	Kind   OperandKind
	Number float64
	Name   string    // for name and keyword operands
	String []byte    // for literal/hex strings
	Array  []Operand // for [ ... ] arrays (TJ uses this)
}

Operand is one parsed value from the content-stream operand stack. The interpreter handlers cast it to the type they expect (e.g. Tf pops a name then a number). We use a single sum type rather than individual Push methods because the operand grammar is uniform — every operator just pops N typed values — and Operand keeps the handler signatures tidy.

type OperandKind

type OperandKind int
const (
	OpNumber OperandKind = iota
	OpName
	OpString
	OpArray
)

type PageInfo

type PageInfo struct {
	Content  []byte
	MediaBox [4]float64
	Rotate   int
	Fonts    map[string]*Font
	XObjects map[string]XObject
}

PageInfo bundles everything the interpreter needs to walk one page: the concatenated content-stream bytes, the page's mediabox, rotation, and the resolved font / xobject maps.

type PathEvent

type PathEvent struct {
	Segments  []PathSeg
	Stroke    bool
	Fill      bool
	EvenOdd   bool
	LineWidth float64
}

PathEvent is the data passed to EmitPath when a path-painting op (S/s/f/F/f*/B/B*/b/b*/n) drains the current path. The sink owns the Segments slice after the call returns; the interpreter will not modify it.

type PathSeg

type PathSeg struct {
	Op string

	// X, Y are populated for m, l, and re's first corner.
	X, Y float64

	// For curves (c, v, y) we record up to three control/endpoint pairs.
	// pdfplumber's flattened representation only keeps the endpoint
	// (X3, Y3); we keep all of them so callers that want true Bezier
	// reconstruction can do it.
	X1, Y1, X2, Y2, X3, Y3 float64

	// For re: width and height (already absolute, in user space).
	W, H float64
}

PathSeg is one segment of a path. Op is "m", "l", "c", "h", or "re".

type Reader

type Reader struct {
	Ctx *model.Context
}

Reader is the bridge from pdfcpu's parsed-object model to the interpreter. One Reader wraps one *model.Context (one PDF document) and exposes per-page accessors: ReadPage returns the content stream bytes plus the resolved font / xobject maps the interpreter needs.

We deliberately separate pdfcpu-specific code into this one file. Everything in state.go, cmap.go, font.go, ops.go, and content.go is stdlib-only — that way if we ever want to swap pdfcpu for our own PDF object parser (a real possibility for a v1.0 release: pdfcpu is heavy and pulls in image-codec dependencies we don't need), the blast radius is limited to this file.

func NewReader

func NewReader(data []byte) (*Reader, error)

NewReader takes the complete bytes of a PDF file, already read into memory, and returns a Reader. pdfcpu does all the heavy lifting: xref parsing, FlateDecode of compressed streams, object resolution. If the file is encrypted we surface ErrEncrypted to the caller; full crypto support is a later phase.

func (*Reader) NumPages

func (r *Reader) NumPages() int

NumPages returns the page count.

func (*Reader) ReadPage

func (r *Reader) ReadPage(n int) (*PageInfo, error)

ReadPage loads page n (1-indexed) and resolves all of its resources into the simple maps the interpreter expects. Per the PDF spec the /Resources dictionary is inherited up the page-tree branch; pdfcpu resolves that for us when we pass consolidateRes=true.

type Sink

type Sink interface {
	// EmitChar is called once per glyph drawn by Tj/TJ/'/" operators.
	// All coordinates are already in user space (post text matrix and
	// CTM), with y0 <= y1 and x0 <= x1. Text is the glyph's Unicode
	// (may be empty).
	EmitChar(ev CharEvent)

	// EmitPath is called once per path-painting operator. The path
	// has already been classified into a stroke / fill flag pair and
	// a slice of segments; the sink converts segments into Lines,
	// Rects, or Curves at its discretion.
	EmitPath(ev PathEvent)
}

Sink receives high-level events from the content interpreter. It is implemented by the page-builder in the parent package; per pdftable's package boundary, this is the contract between the parser and the public-API layer.
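A sink can be as small as a struct with two slices. The sketch below redeclares trimmed versions of the event types so it compiles on its own; `collector` and `Rect` are hypothetical, and turning only `re` segments into rectangles is one possible policy, not necessarily what pdftable's page builder does.

```go
package main

import "fmt"

// Trimmed copies of the documented types, for a self-contained example.
type CharEvent struct {
	Text           string
	X0, Y0, X1, Y1 float64
}
type PathSeg struct {
	Op         string
	X, Y, W, H float64
}
type PathEvent struct {
	Segments     []PathSeg
	Stroke, Fill bool
	LineWidth    float64
}
type Sink interface {
	EmitChar(ev CharEvent)
	EmitPath(ev PathEvent)
}

// Rect is a normalised rectangle (X0 <= X1, Y0 <= Y1).
type Rect struct{ X0, Y0, X1, Y1 float64 }

// collector accumulates chars and rectangles into separate slices.
type collector struct {
	Chars []CharEvent
	Rects []Rect
}

func (c *collector) EmitChar(ev CharEvent) { c.Chars = append(c.Chars, ev) }

func (c *collector) EmitPath(ev PathEvent) {
	for _, s := range ev.Segments {
		if s.Op != "re" {
			continue // lines and curves are ignored in this sketch
		}
		r := Rect{s.X, s.Y, s.X + s.W, s.Y + s.H}
		if r.X0 > r.X1 { // re may carry a negative width
			r.X0, r.X1 = r.X1, r.X0
		}
		if r.Y0 > r.Y1 { // or a negative height
			r.Y0, r.Y1 = r.Y1, r.Y0
		}
		c.Rects = append(c.Rects, r)
	}
}

// Compile-time check that collector satisfies Sink.
var _ Sink = (*collector)(nil)

func main() {
	c := &collector{}
	c.EmitChar(CharEvent{Text: "A", X0: 0, Y0: 0, X1: 6, Y1: 12})
	c.EmitPath(PathEvent{Stroke: true, Segments: []PathSeg{{Op: "re", X: 10, Y: 20, W: -4, H: 5}}})
	fmt.Println(len(c.Chars), c.Rects)
}
```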

type TextState

type TextState struct {
	Font      *Font   // Selected by Tf; nil before any Tf is seen.
	FontSize  float64 // From Tf (second operand).
	CharSpace float64 // Tc; added to every glyph advance.
	WordSpace float64 // Tw; added after every space (cid 32) in simple fonts.
	Scale     float64 // Tz / 100; horizontal text scaling factor (1.0 = normal).
	Leading   float64 // TL; negated so it represents the y-delta on T*.
	Rise      float64 // Ts; vertical offset for super/subscript.
	Render    int     // Tr; 0 = fill (default), 3 = invisible.

	// Text matrix: maps text-space coordinates to user space. Set by Tm,
	// translated by Td/TD/T*, mutated by every glyph emission. BT resets
	// this back to identity.
	Matrix Matrix

	// Text line matrix: snapshotted at the start of each line. Td/TD/T*
	// reset Matrix from LineMatrix; glyph emission only mutates Matrix.
	LineMatrix Matrix
}

TextState is the subset of PDF text-state parameters we need to position glyphs and resolve their font. It is mutated only by text-state and text-positioning operators (Tc Tw Tz TL Tf Tr Ts Tm Td TD T*) and consulted by every text-showing op (Tj TJ ' ").

The text matrix and text line matrix are part of the text OBJECT (BT/ET), not the text state. We keep them on this struct anyway because operator dispatch is more uniform that way, and BT/ET just reset them to identity as if they were any other text-state field.

type XObject

type XObject struct {
	Subtype string // "Form" or "Image"

	// For Form XObjects:
	Content []byte
	BBox    [4]float64
	Matrix  Matrix

	// Resources from the XObject's /Resources dict, if any. Form
	// XObjects can carry their own resource scope; if absent, the
	// PDF spec says we inherit the enclosing page's resources.
	Fonts    map[string]*Font
	XObjects map[string]XObject
}

XObject is a Form/Image XObject reachable from the page. When the interpreter encounters `name Do`, it looks the name up here; if the subtype is "Form", the contained content stream is interpreted recursively under the current CTM (which is multiplied by the XObject's /Matrix).
