Documentation ¶
Overview ¶
Package pdf is the internal content-stream interpreter for pdftable.
It is intentionally NOT a public package: the data model here is the raw output of walking a PDF content stream (glyph events, path segments, graphics-state pushes), and it would be premature to lock that shape behind a stable API while the layout-analysis layers on top are still being built out.
The interpreter mirrors the structure of pdfminer.six's PDFPageInterpreter (in pdfinterp.py): same operator dispatch table, same graphics/text state stacks, same coordinate-transform math. It is rewritten in idiomatic Go, with explicit state structs instead of dynamic attributes, a switch-on-operator dispatch instead of getattr, and value types for matrices instead of tuples.
What this package does NOT do:
- Compose glyphs into words or lines. That is layout analysis and lives one layer up (Page.Words, in a later sub-phase).
- Solve table detection. Also layout analysis.
- Handle PDF encryption. The reader assumes a decrypted document; callers that need decryption should pre-process with pdfcpu.
- Render Type 3 fonts. Type 3 fonts are themselves content streams; they are extremely rare in tabular documents and we explicitly punt — the glyph's bbox is still emitted, just with empty Text.
Index ¶
- Variables
- func AdobeGlyphToUnicode(name string) string
- func ApplyDifferences(base [256]string, entries []Difference) [256]string
- func ApplyPoint(m Matrix, x, y float64) (float64, float64)
- func ApplyRect(m Matrix, x0, y0, x1, y1 float64) (float64, float64, float64, float64)
- func EncodingByName(name string) [256]string
- type CMap
- type CharEvent
- type Difference
- type Font
- type GraphicsState
- type Interpreter
- type Matrix
- type Operand
- type OperandKind
- type PageInfo
- type PathEvent
- type PathSeg
- type Reader
- type Sink
- type TextState
- type XObject
Constants ¶
This section is empty.
Variables ¶
var IdentityMatrix = Matrix{1, 0, 0, 1, 0, 0}
IdentityMatrix is the affine identity (no transform). Used as the starting CTM and text matrix until an op overrides them.
Functions ¶
func AdobeGlyphToUnicode ¶
AdobeGlyphToUnicode resolves Adobe glyph names (e.g. "A", "comma", "fi", "Adieresis", "uni0041") to Unicode strings.
Lookup order:
- Exact match in adobeGlyphTable (~250 entries; the full set of glyphs referenced by any of the four PDF base encodings, plus common additions like fractions and arrows that appear in real-world /Differences arrays).
- Compound names with "_" separators are split and each part is resolved recursively (per AGL spec §2 — "f_i" → "fi").
- Variant suffixes (".alt", ".sc", ...) are stripped before lookup.
- "uniXXXX"/"uniXXXXXXXX" → one or more UTF-16 hex codepoints.
- "uXXXX".."uXXXXXX" → a single hex codepoint.
Anything else returns "" — the caller falls back to a (cid:NNN) placeholder.
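The lookup order above can be sketched as follows. This is a minimal illustration, not the package's implementation: resolveGlyph and glyphTable are hypothetical names, the table holds only four entries, and "uni" groups are treated as UTF-16 code units as described in the list.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
	"unicode/utf16"
)

// glyphTable is a tiny stand-in for the package's exact-match table.
var glyphTable = map[string]string{"A": "A", "comma": ",", "f": "f", "i": "i"}

// resolveGlyph is a hypothetical helper that follows the documented order:
// exact match, "_" compounds, suffix stripping, "uniXXXX", then "uXXXX".
func resolveGlyph(name string) string {
	if s, ok := glyphTable[name]; ok {
		return s
	}
	if strings.Contains(name, "_") {
		var b strings.Builder
		for _, part := range strings.Split(name, "_") {
			b.WriteString(resolveGlyph(part))
		}
		return b.String()
	}
	if dot := strings.Index(name, "."); dot > 0 {
		return resolveGlyph(name[:dot]) // drop ".alt", ".sc", ...
	}
	if strings.HasPrefix(name, "uni") {
		hex := name[3:]
		if len(hex) >= 4 && len(hex)%4 == 0 {
			units := make([]uint16, 0, len(hex)/4)
			for i := 0; i < len(hex); i += 4 {
				v, err := strconv.ParseUint(hex[i:i+4], 16, 16)
				if err != nil {
					return ""
				}
				units = append(units, uint16(v))
			}
			return string(utf16.Decode(units))
		}
	}
	if strings.HasPrefix(name, "u") {
		hex := name[1:]
		if len(hex) >= 4 && len(hex) <= 6 {
			if v, err := strconv.ParseUint(hex, 16, 32); err == nil {
				return string(rune(v))
			}
		}
	}
	return "" // caller falls back to a (cid:NNN) placeholder
}

func main() {
	for _, n := range []string{"A", "f_i", "comma.alt", "uni00410042", "u1F600", "bogus"} {
		fmt.Printf("%q -> %q\n", n, resolveGlyph(n))
	}
}
```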
func ApplyDifferences ¶
func ApplyDifferences(base [256]string, entries []Difference) [256]string
ApplyDifferences overlays a /Differences array on a base encoding. The array is a flat sequence alternating integer-start values with glyph-name entries — see PDF 1.7 §9.6.5.5:
[ 39 /quotesingle 96 /grave /quoteleft ]
means "glyph 39 is /quotesingle, glyph 96 is /grave, glyph 97 is /quoteleft". The integer resets the running CID; each subsequent name occupies CID++.
entries is a (cid, name) sequence as decoded by the caller (the content interpreter does the array walking). The return value is the table handed to the font.
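As a rough sketch of the two steps described above (expanding the flat array, then overlaying it on a base table): the type difference and its field names Code and Name are assumptions, since the fields of Difference are not shown here, and expand and overlay are hypothetical helpers.

```go
package main

import "fmt"

// difference is a hypothetical stand-in for Difference; the field
// names are assumptions made for this sketch.
type difference struct {
	Code int
	Name string
}

// expand turns a flat /Differences array (ints and names) into pairs.
// An integer resets the running code; each name then takes code++.
func expand(arr []interface{}) []difference {
	var out []difference
	code := 0
	for _, item := range arr {
		switch v := item.(type) {
		case int:
			code = v
		case string:
			out = append(out, difference{Code: code, Name: v})
			code++
		}
	}
	return out
}

// overlay copies base (arrays are values in Go) and applies the entries.
// Out-of-range codes are ignored.
func overlay(base [256]string, entries []difference) [256]string {
	out := base
	for _, e := range entries {
		if e.Code >= 0 && e.Code < len(out) {
			out[e.Code] = e.Name
		}
	}
	return out
}

func main() {
	var base [256]string
	entries := expand([]interface{}{39, "quotesingle", 96, "grave", "quoteleft"})
	out := overlay(base, entries)
	fmt.Println(out[39], out[96], out[97])
}
```

Because the table is a fixed-size array passed by value, the base encoding is left untouched and can be shared between fonts.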
func ApplyPoint ¶
ApplyPoint maps (x, y) through m. PDF user space points → device space, or text space → user space, depending on which matrix m is.
func ApplyRect ¶
ApplyRect maps a rectangle through m. The result is a NEW rectangle that tightly encloses the (possibly rotated) image of the input — it is NOT a rotated rectangle. After the transform we normalise so x0 <= x1 and y0 <= y1; pdfplumber relies on this invariant.
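A minimal sketch of both operations, assuming the rectangle is mapped by transforming all four corners and taking the min/max (the source states only the result, not the method). The names matrix, applyPoint and applyRect are local to this example.

```go
package main

import (
	"fmt"
	"math"
)

// matrix holds the six PDF parameters [a b c d e f].
type matrix [6]float64

// applyPoint maps (x, y) through m using the PDF row-vector convention.
func applyPoint(m matrix, x, y float64) (float64, float64) {
	return m[0]*x + m[2]*y + m[4], m[1]*x + m[3]*y + m[5]
}

// applyRect maps every corner and returns the axis-aligned box that
// encloses them, so x0 <= x1 and y0 <= y1 hold by construction.
func applyRect(m matrix, x0, y0, x1, y1 float64) (float64, float64, float64, float64) {
	corners := [4][2]float64{{x0, y0}, {x1, y0}, {x1, y1}, {x0, y1}}
	minX, minY := math.Inf(1), math.Inf(1)
	maxX, maxY := math.Inf(-1), math.Inf(-1)
	for _, c := range corners {
		x, y := applyPoint(m, c[0], c[1])
		minX, maxX = math.Min(minX, x), math.Max(maxX, x)
		minY, maxY = math.Min(minY, y), math.Max(maxY, y)
	}
	return minX, minY, maxX, maxY
}

func main() {
	rot90 := matrix{0, 1, -1, 0, 0, 0} // 90 degree counter-clockwise rotation
	fmt.Println(applyRect(rot90, 0, 0, 2, 1))
}
```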
func EncodingByName ¶
EncodingByName returns the 256-entry cid→Unicode table for a base encoding name. Returns the StandardEncoding (the PDF spec's default) if the name is unrecognised.
Types ¶
type CMap ¶
type CMap struct {
// contains filtered or unexported fields
}
CMap is a parsed ToUnicode CMap. For each glyph CID (a uint16 because PDF composite fonts use up to 2-byte CIDs), it returns the Unicode string that glyph represents — for almost all glyphs this is one rune, but ligature glyphs like `fi` and `ffi` legitimately produce multi-rune strings via the `bfchar` and `bfrange` CMap directives.
We only parse the `bfchar` and `bfrange` directives plus the basic codespace-range bookkeeping. The other CMap features (`cidchar`, `cidrange`, `notdefrange`) describe how to MAP bytes to CIDs (the inverse direction), and for that we don't need a CMap at all — we treat the input bytes themselves as the CID stream for composite fonts. This is the same shortcut pdfminer.six takes for Identity-H/V-encoded CIDFonts, and it's correct for ~all real-world modern PDFs.
func NewCMap ¶
func NewCMap() *CMap
NewCMap returns an empty CMap. Use ParseCMap to build a populated one from a ToUnicode stream.
func ParseCMap ¶
ParseCMap reads a ToUnicode stream and populates a fresh CMap.
The grammar is a tiny subset of PostScript — we tokenize the stream into (operator, operand) tuples and dispatch on operator keywords:
- `N beginbfchar ... endbfchar`: N pairs of <src> <dst>.
- `N beginbfrange ... endbfrange`: N triples of <srcLo> <srcHi> <dst>. `dst` may be a hex string or an array of hex strings.
All other CMap directives (`begincmap`, `begincodespacerange`, `beginnotdefrange`, `def`, `dict`, etc.) are recognised but their payloads are simply skipped — we don't need any of them to resolve CID → Unicode.
Returns nil on a successful parse; returns an error only for I/O-level corruption (mismatched bf-block markers, unparseable hex). Truncated or malformed bf entries are silently dropped. The pdfminer reference does the same, and it's the right call: real PDFs have lots of weird CMaps and a strict parser breaks too easily.
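To illustrate the bfchar case, here is a small sketch that parses the body of one bfchar block. It assumes whitespace-separated hex tokens and UTF-16BE destinations, and it drops malformed pairs instead of failing. The helpers hexCode, hexToUnits and parseBfChar are hypothetical and are not part of the package.

```go
package main

import (
	"encoding/hex"
	"fmt"
	"strconv"
	"strings"
	"unicode/utf16"
)

// hexCode parses a source code such as "<0041>" into a CID.
func hexCode(tok string) (uint16, bool) {
	if len(tok) < 3 || tok[0] != '<' || tok[len(tok)-1] != '>' {
		return 0, false
	}
	v, err := strconv.ParseUint(tok[1:len(tok)-1], 16, 16)
	return uint16(v), err == nil
}

// hexToUnits parses a destination such as "<00660069>" into UTF-16 units.
func hexToUnits(tok string) ([]uint16, bool) {
	if len(tok) < 2 || tok[0] != '<' || tok[len(tok)-1] != '>' {
		return nil, false
	}
	raw, err := hex.DecodeString(tok[1 : len(tok)-1])
	if err != nil || len(raw) == 0 || len(raw)%2 != 0 {
		return nil, false
	}
	units := make([]uint16, len(raw)/2)
	for i := range units {
		units[i] = uint16(raw[2*i])<<8 | uint16(raw[2*i+1])
	}
	return units, true
}

// parseBfChar reads "<src> <dst>" pairs; bad pairs are skipped silently.
func parseBfChar(body string) map[uint16]string {
	out := make(map[uint16]string)
	toks := strings.Fields(body)
	for i := 0; i+1 < len(toks); i += 2 {
		src, okSrc := hexCode(toks[i])
		dst, okDst := hexToUnits(toks[i+1])
		if !okSrc || !okDst {
			continue
		}
		out[src] = string(utf16.Decode(dst))
	}
	return out
}

func main() {
	m := parseBfChar("<0001> <0041>\n<0002> <00660069>\n<0003> <zz>")
	fmt.Printf("%q %q %d\n", m[1], m[2], len(m))
}
```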
type CharEvent ¶
type CharEvent struct {
Text string
X0, Y0, X1, Y1 float64
FontName string
FontSize float64
Upright bool
Advance float64
}
CharEvent is the per-glyph data emitted by EmitChar.
type Difference ¶
Difference is one (cid, glyph-name) pair from a /Differences array.
type Font ¶
type Font struct {
// BaseFont is the PostScript name from the font dictionary's
// /BaseFont entry, e.g. "Helvetica-Bold" or "ABCDEF+Times". Surfaced
// verbatim to the caller as Char.FontName.
BaseFont string
// IsSimple is true for Type1 and TrueType fonts (single-byte CIDs,
// /Encoding name + optional /Differences array). Composite fonts
// (CIDFontType0/2) have IsSimple = false and use a Type0 cmap to
// segment the byte stream into multi-byte CIDs.
IsSimple bool
// ToUnicode is the optional parsed /ToUnicode CMap. When present
// it is consulted FIRST, in front of the encoding table — the PDF
// spec is unambiguous about this (PDF 1.7 §9.10.2). Many PDFs ship
// a ToUnicode map even for fonts that already have a usable
// encoding, because it's the only way to map ligature glyphs back
// to "fi"/"ffi"/etc.
ToUnicode *CMap
// Widths maps CID → advance width in /1000ths of a font unit. For
// simple fonts the keys are bytes 0..255; for composite fonts they
// are 2-byte CIDs. DefaultWidth is used for CIDs not in the map.
Widths map[uint16]float64
DefaultWidth float64
// Ascent and Descent are the font's typographic extrema in
// /1000ths of a font unit, read from /FontDescriptor. Descent is
// always stored negative (PDF spec) — we normalise on read.
Ascent float64
Descent float64
// contains filtered or unexported fields
}
Font is the interpreter's view of a single PDF font resource. Each font on a page (named under `/Font` in the page resources) becomes one of these. The interpreter resolves Tf operators by looking up the font name in the page's font map and stashing the *Font on the text state. Every subsequent text-showing op uses font.Decode to turn the input byte string into a sequence of CIDs, then DecodeUnicode and CharWidth to resolve each CID's text and width.
A Font is constructed once per page (or once and reused across pages, when the same font dict is reachable from multiple pages — pdfcpu dereferences indirect references for us so we always get the same *Font pointer back).
func (*Font) CharWidth ¶
CharWidth returns the advance width for cid in 1/1000ths of the font's design unit. Multiply by FontSize/1000 to get the advance in text-space units (before applying the text matrix).
func (*Font) Decode ¶
Decode walks a PDF text-showing operand (a byte string) and yields the sequence of CIDs it represents. For simple fonts that's just the bytes; for composite fonts (Identity-H is by far the most common composite encoding) bytes are paired into 2-byte CIDs.
The returned slice is fresh — callers may retain it. Per-CID resolution (Unicode lookup, width) happens in DecodeUnicode and CharWidth, separately, so callers that only want one of those can avoid the cost of the other.
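The byte-to-CID step might look like the sketch below. It assumes big-endian 2-byte pairing for composite fonts and drops a trailing odd byte; the source specifies neither detail, and decodeCIDs is a hypothetical free function, not the method itself.

```go
package main

import "fmt"

// decodeCIDs returns one CID per byte for simple fonts and one CID per
// big-endian byte pair for composite fonts. A trailing odd byte is
// dropped, which is an assumption of this sketch.
func decodeCIDs(data []byte, simple bool) []uint16 {
	if simple {
		out := make([]uint16, len(data))
		for i, b := range data {
			out[i] = uint16(b)
		}
		return out
	}
	out := make([]uint16, 0, len(data)/2)
	for i := 0; i+1 < len(data); i += 2 {
		out = append(out, uint16(data[i])<<8|uint16(data[i+1]))
	}
	return out
}

func main() {
	fmt.Println(decodeCIDs([]byte("AB"), true))
	fmt.Println(decodeCIDs([]byte{0x00, 0x41, 0x01, 0x02}, false))
}
```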
func (*Font) DecodeUnicode ¶
DecodeUnicode returns the Unicode text for a single CID. Lookup order:
- ToUnicode CMap (if present).
- Encoding table (simple fonts only).
- The literal placeholder "(cid:NNN)" — same convention as pdfminer.six. Layout code can detect this prefix and treat such chars as "positioned but unreadable", which is still useful (the bbox carries the table grid even when the text doesn't come back).
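The fallback chain above can be sketched like this. The font type here is a simplified stand-in (a plain map instead of a parsed CMap, and a nil encoding pointer for composite fonts), and isPlaceholder is a hypothetical helper showing how layout code might detect unreadable glyphs.

```go
package main

import (
	"fmt"
	"strings"
)

// font is a simplified stand-in for the package's Font type.
type font struct {
	toUnicode map[uint16]string // stands in for the parsed ToUnicode CMap
	encoding  *[256]string      // nil for composite fonts
}

// decodeUnicode tries ToUnicode, then the encoding table, then falls
// back to the "(cid:NNN)" placeholder.
func (f *font) decodeUnicode(cid uint16) string {
	if s, ok := f.toUnicode[cid]; ok {
		return s
	}
	if f.encoding != nil && cid < 256 {
		if s := f.encoding[cid]; s != "" {
			return s
		}
	}
	return fmt.Sprintf("(cid:%d)", cid)
}

// isPlaceholder reports whether text is a positioned-but-unreadable glyph.
func isPlaceholder(text string) bool {
	return strings.HasPrefix(text, "(cid:")
}

func main() {
	var enc [256]string
	enc[65] = "A"
	f := &font{toUnicode: map[uint16]string{12: "fi"}, encoding: &enc}
	fmt.Println(f.decodeUnicode(12), f.decodeUnicode(65), f.decodeUnicode(200))
}
```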
type GraphicsState ¶
type GraphicsState struct {
CTM Matrix // Current transformation matrix.
LineWidth float64 // From `w`; user-space units.
// Text state is part of the graphics state per the PDF spec, so it
// participates in the same q/Q stack. We keep it inline rather than
// as a separate field-of-fields because the q/Q hot path copies the
// whole struct; one flat struct is one memcpy.
Text TextState
}
GraphicsState is the snapshot pushed by `q` and popped by `Q`.
We track only the fields the public API actually exposes: the CTM, the line width (so emitted Lines / Rects carry their drawing width), and the text state. Color, clipping path, dash pattern etc. are parsed by the operator dispatcher so the stack stays balanced, but we don't retain them; adding those fields later is a non-breaking change.
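A minimal sketch of the q/Q mechanics, assuming a slice-backed stack of value copies and that an unbalanced Q is ignored (the source does not say how underflow is handled). The names graphicsState, gsStack, save and restore are local to this example.

```go
package main

import "fmt"

// graphicsState is a trimmed stand-in for GraphicsState.
type graphicsState struct {
	CTM       [6]float64
	LineWidth float64
}

// gsStack holds the current state plus the snapshots saved by q.
type gsStack struct {
	cur   graphicsState
	saved []graphicsState
}

// save implements q: push a copy of the current state.
func (s *gsStack) save() { s.saved = append(s.saved, s.cur) }

// restore implements Q: pop the latest snapshot. It returns false and
// leaves the state alone when the stack is empty.
func (s *gsStack) restore() bool {
	if len(s.saved) == 0 {
		return false
	}
	s.cur = s.saved[len(s.saved)-1]
	s.saved = s.saved[:len(s.saved)-1]
	return true
}

func main() {
	s := &gsStack{cur: graphicsState{CTM: [6]float64{1, 0, 0, 1, 0, 0}, LineWidth: 1}}
	s.save()
	s.cur.LineWidth = 5
	s.restore()
	fmt.Println(s.cur.LineWidth)
}
```

Because the struct contains only value fields, the push is a plain copy and later edits to the current state cannot leak into saved snapshots.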
type Interpreter ¶
type Interpreter struct {
// Fonts is the page's font map (/Font subtree of resources),
// indexed by the resource name used in Tf operands. Set by the
// caller before Run.
Fonts map[string]*Font
// XObjects is the page's XObject map (/XObject subtree). Used
// only for `Do`-invoked Form XObjects whose content streams are
// inlined into the page; image XObjects are recognised and
// dropped.
XObjects map[string]XObject
// Sink receives emitted events. The caller installs whatever
// implementation it likes — pdftable's Page uses a struct that
// accumulates Chars and paints into separate slices.
Sink Sink
// contains filtered or unexported fields
}
Interpreter walks a PDF content stream, maintaining the graphics and text state, and emits typed events (glyphs and path paints) to a Sink callback supplied by the caller. One Interpreter is used per page; resetting requires constructing a new one.
The design follows pdfminer.six's PDFPageInterpreter / PDFTextDevice split, with two changes:
- The "device" is just a Sink interface that receives Char and paint-path events. No subclassing, no method override mess — we don't need pdfminer's layered architecture because we're not implementing a separate visual renderer.
- The interpreter is purely synchronous and single-threaded. PDF content streams don't have any constructs that benefit from concurrency, and a single-threaded loop is much easier to reason about when porting glyph-position math from Python.
func NewInterpreter ¶
func NewInterpreter(initialCTM Matrix, sink Sink) *Interpreter
NewInterpreter returns a fresh interpreter ready to walk a content stream. The initial CTM is the device transform supplied by the caller (typically identity composed with the page rotation and mediabox-origin translation).
func (*Interpreter) Run ¶
func (it *Interpreter) Run(stream []byte) error
Run lexes and dispatches a single content stream. Multiple content streams on the same page are run sequentially against the same interpreter instance — this matches how the PDF spec defines them: the array `/Contents [a b c]` is semantically `a ++ b ++ c`, with q/Q stack state preserved across the splits.
type Matrix ¶
type Matrix [6]float64
Matrix is a 3x3 affine transform stored as the six PDF parameters [a b c d e f], representing the matrix
[ a b 0 ]
[ c d 0 ]
[ e f 1 ]
Multiplication and point application follow the PDF spec (PDF 1.7, section 8.3.3). We use a fixed-size value type rather than a pointer or slice — the matrix is six floats and copying it is cheaper than reaching through a heap-allocated wrapper, and we save the GC bookkeeping in a hot path that runs once per glyph.
func Mult ¶
Mult is matrix concatenation, following pdfminer's mult_matrix convention exactly: `Mult(m2, m1)` returns the matrix M such that applying M to a point produces the same result as applying m2 first and then applying m1 to the intermediate result.
This convention matters because PDF generators emit operators in the order they want them composed: a `cm ... Tm ...` sequence produces glyph positions via `Translate(textMatrix, ...)` applied through `Mult(textMatrix, ctm)` (text matrix first, then CTM), which is exactly the order they're written in the stream.
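The convention can be checked with a small sketch. The function mult below is a local reimplementation written for illustration (the package's own Mult body is not shown here); its first argument is the transform applied first.

```go
package main

import "fmt"

// matrix holds the six PDF parameters [a b c d e f].
type matrix [6]float64

// mult returns the transform equivalent to applying first, then second.
func mult(first, second matrix) matrix {
	return matrix{
		second[0]*first[0] + second[2]*first[1],
		second[1]*first[0] + second[3]*first[1],
		second[0]*first[2] + second[2]*first[3],
		second[1]*first[2] + second[3]*first[3],
		second[0]*first[4] + second[2]*first[5] + second[4],
		second[1]*first[4] + second[3]*first[5] + second[5],
	}
}

// apply maps (x, y) through m.
func apply(m matrix, x, y float64) (float64, float64) {
	return m[0]*x + m[2]*y + m[4], m[1]*x + m[3]*y + m[5]
}

func main() {
	translate := matrix{1, 0, 0, 1, 10, 0}
	scale := matrix{2, 0, 0, 2, 0, 0}
	// Translate first, then scale: (1,1) -> (11,1) -> (22,2).
	fmt.Println(apply(mult(translate, scale), 1, 1))
	// Scale first, then translate: (1,1) -> (2,2) -> (12,2).
	fmt.Println(apply(mult(scale, translate), 1, 1))
}
```

The two calls give different results, which is why argument order has to match the order in which the transforms should act on a point.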
type Operand ¶
type Operand struct {
Kind OperandKind
Number float64
Name string // for name and keyword operands
String []byte // for literal/hex strings
Array []Operand // for [ ... ] arrays (TJ uses this)
}
Operand is one parsed value from the content-stream operand stack. The interpreter handlers cast it to the type they expect (e.g. Tf pops a name then a number). We use a single sum type rather than individual Push methods because the operand grammar is uniform — every operator just pops N typed values — and Operand keeps the handler signatures tidy.
type OperandKind ¶
type OperandKind int
const (
	OpNumber OperandKind = iota
	OpName
	OpString
	OpArray
)
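As an illustration of how a handler might consume operands, the sketch below validates the two operands of Tf (a name, then a number). The lowercase types mirror Operand and OperandKind for this example only, and tfOperands is a hypothetical helper; the real handler signatures are not shown in this documentation.

```go
package main

import "fmt"

type operandKind int

const (
	opNumber operandKind = iota
	opName
	opString
	opArray
)

// operand mirrors the shape of Operand for this sketch.
type operand struct {
	Kind   operandKind
	Number float64
	Name   string
	String []byte
	Array  []operand
}

// tfOperands extracts the font resource name and size for a Tf
// operator, reporting ok=false when the operands have the wrong shape.
func tfOperands(args []operand) (name string, size float64, ok bool) {
	if len(args) != 2 || args[0].Kind != opName || args[1].Kind != opNumber {
		return "", 0, false
	}
	return args[0].Name, args[1].Number, true
}

func main() {
	args := []operand{{Kind: opName, Name: "F1"}, {Kind: opNumber, Number: 12}}
	fmt.Println(tfOperands(args))
}
```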
type PageInfo ¶
type PageInfo struct {
Content []byte
MediaBox [4]float64
Rotate int
Fonts map[string]*Font
XObjects map[string]XObject
}
PageInfo bundles everything the interpreter needs to walk one page: the concatenated content-stream bytes, the page's mediabox, rotation, and the resolved font / xobject maps.
type PathEvent ¶
PathEvent is the data emitted by EmitPath when a path-painting op (S/s/f/F/f*/B/B*/b/b*/n) drains the current path. The Segments slice owns its memory after the call returns — the interpreter will not modify it.
type PathSeg ¶
type PathSeg struct {
Op string
// X, Y are populated for m, l, and re's first corner.
X, Y float64
// For curves (c, v, y) we record up to three control/endpoint pairs.
// pdfplumber's flattened representation only keeps the endpoint
// (X3, Y3); we keep all of them so callers that want true Bezier
// reconstruction can do it.
X1, Y1, X2, Y2, X3, Y3 float64
// For re: width and height (already absolute, in user space).
W, H float64
}
PathSeg is one segment of a path. Op is "m", "l", "c", "h", or "re".
type Reader ¶
Reader is the bridge from pdfcpu's parsed-object model to the interpreter. One Reader wraps one *model.Context (one PDF document) and exposes per-page accessors: ReadPage returns the content stream bytes plus the resolved font / xobject maps the interpreter needs.
We deliberately separate pdfcpu-specific code into this one file. Everything in state.go, cmap.go, font.go, ops.go, and content.go is stdlib-only — that way if we ever want to swap pdfcpu for our own PDF object parser (a real possibility for a v1.0 release: pdfcpu is heavy and pulls in image-codec dependencies we don't need), the blast radius is limited to this file.
func NewReader ¶
NewReader takes a fully-decoded byte slice and returns a Reader. pdfcpu does all the heavy lifting: xref parsing, FlateDecode of compressed streams, object resolution. If the file is encrypted we surface ErrEncrypted to the caller — full crypto support is a later phase.
type Sink ¶
type Sink interface {
// EmitChar is called once per glyph drawn by Tj/TJ/'/" operators.
// All coordinates are already in user space (post text matrix and
// CTM), with y0 <= y1 and x0 <= x1. Text is the glyph's Unicode
// (may be empty).
EmitChar(ev CharEvent)
// EmitPath is called once per path-painting operator. The path
// has already been classified into a stroke / fill flag pair and
// a slice of segments; the sink converts segments into Lines,
// Rects, or Curves at its discretion.
EmitPath(ev PathEvent)
}
Sink receives high-level events from the content interpreter. It is implemented by the page-builder in the parent package; per pdftable's package boundary, this is the contract between the parser and the public-API layer.
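A caller-side implementation could be as small as the sketch below. The types are trimmed local copies so the example is self-contained; the fields of pathEvent in particular are assumptions, since PathEvent's fields are not listed in this documentation.

```go
package main

import "fmt"

// charEvent is a trimmed copy of CharEvent.
type charEvent struct {
	Text           string
	X0, Y0, X1, Y1 float64
}

// pathEvent's fields are assumptions made for this sketch.
type pathEvent struct {
	Stroke, Fill bool
}

// sink mirrors the two-method Sink contract.
type sink interface {
	EmitChar(ev charEvent)
	EmitPath(ev pathEvent)
}

// collector accumulates events into separate slices.
type collector struct {
	chars []charEvent
	paths []pathEvent
}

func (c *collector) EmitChar(ev charEvent) { c.chars = append(c.chars, ev) }
func (c *collector) EmitPath(ev pathEvent) { c.paths = append(c.paths, ev) }

// text joins glyph text in emission order (no word or line logic).
func (c *collector) text() string {
	s := ""
	for _, ch := range c.chars {
		s += ch.Text
	}
	return s
}

// Compile-time check that collector satisfies sink.
var _ sink = (*collector)(nil)

func main() {
	c := &collector{}
	var s sink = c
	s.EmitChar(charEvent{Text: "H", X1: 7, Y1: 10})
	s.EmitChar(charEvent{Text: "i", X0: 7, X1: 10, Y1: 10})
	s.EmitPath(pathEvent{Stroke: true})
	fmt.Println(c.text(), len(c.paths))
}
```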
type TextState ¶
type TextState struct {
Font *Font // Selected by Tf; nil before any Tf is seen.
FontSize float64 // From Tf (second operand).
CharSpace float64 // Tc; added to every glyph advance.
WordSpace float64 // Tw; added after every space (cid 32) in simple fonts.
Scale float64 // Tz / 100; horizontal text scaling factor (1.0 = normal).
Leading float64 // TL; negated so it represents the y-delta on T*.
Rise float64 // Ts; vertical offset for super/subscript.
Render int // Tr; 0 = fill (default), 3 = invisible.
// Text matrix: maps text-space coordinates to user space. Set by Tm,
// translated by Td/TD/T*, mutated by every glyph emission. BT resets
// this back to identity.
Matrix Matrix
// Text line matrix: snapshotted at the start of each line. Td/TD/T*
// reset Matrix from LineMatrix; glyph emission only mutates Matrix.
LineMatrix Matrix
}
TextState is the subset of PDF text-state parameters we need to position glyphs and resolve their font. It is mutated only by text-state operators (Tc Tw Tz TL Tf Tr Ts Tm Td TD T*) and consulted by every text-showing op (Tj TJ ' ").
The text matrix and text line matrix are part of the text OBJECT (BT/ET), not the text state. We keep them on this struct anyway because operator dispatch is more uniform that way, and BT/ET just reset them to identity as if they were any other text-state field.
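For orientation, the way FontSize, CharSpace, WordSpace and Scale combine into a horizontal glyph advance can be sketched as below. It follows the PDF specification's horizontal displacement formula without the TJ adjustment term. The function glyphAdvance and its parameter names are hypothetical, not the package's API.

```go
package main

import "fmt"

// glyphAdvance returns the horizontal advance in text-space units.
// width1000 is the glyph width in 1/1000ths; wordSpace applies only
// when the glyph is a space (code 32) in a simple font.
func glyphAdvance(width1000, fontSize, charSpace, wordSpace, scale float64, isSpace bool) float64 {
	adv := width1000/1000*fontSize + charSpace
	if isSpace {
		adv += wordSpace
	}
	return adv * scale
}

func main() {
	// A 500-unit glyph at 12pt with no extra spacing advances 6 units.
	fmt.Println(glyphAdvance(500, 12, 0, 0, 1, false))
	// A 250-unit space with Tc=1, Tw=2 and 50% scaling: (3+1+2)*0.5.
	fmt.Println(glyphAdvance(250, 12, 1, 2, 0.5, true))
}
```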
type XObject ¶
type XObject struct {
Subtype string // "Form" or "Image"
// For Form XObjects:
Content []byte
BBox [4]float64
Matrix Matrix
// Resources from the XObject's /Resources dict, if any. Form
// XObjects can carry their own resource scope; if absent, the
// PDF spec says we inherit the enclosing page's resources.
Fonts map[string]*Font
XObjects map[string]XObject
}
XObject is a Form/Image XObject reachable from the page. When the interpreter encounters `name Do`, it looks the name up here; if the subtype is "Form", the contained content stream is interpreted recursively under the current CTM (which is multiplied by the XObject's /Matrix).