Overview ¶
Package contentstream tokenises PDF content streams into a sequence of operations. Content streams are the postfix-notation graphics instructions that paint each page (text-showing operators, path operators, graphics-state ops, marked-content tags, …).
The scanner is operand-aware: operands are collected up to each operator keyword and surfaced together as one Op. Inline images (BI/ID/EI) are folded into a single synthetic EI op so the binary image bytes between ID and EI do not derail tokenisation.
The scanner does NOT interpret the operations: it does not track graphics state, does not render glyphs, does not resolve XObjects. Higher-level consumers (e.g. tagged-PDF validators) layer that logic on top.
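To make the operand-aware grouping concrete, here is a minimal sketch of the idea, not the package's implementation. It assumes a toy input made only of whitespace-separated numbers, names, and operator keywords; the `op` type and the `group` and `isOperand` helpers are hypothetical names introduced for illustration.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// op is a hypothetical stand-in for the package's Op type: the operands
// seen since the previous operator, plus the operator keyword itself.
type op struct {
	operator string
	operands []string
}

// isOperand reports whether tok is a number or a /Name. This toy version
// ignores strings, arrays, dictionaries, and inline images.
func isOperand(tok string) bool {
	if strings.HasPrefix(tok, "/") {
		return true
	}
	_, err := strconv.ParseFloat(tok, 64)
	return err == nil
}

// group collects operands until an operator keyword appears, then emits
// the keyword and its operands together as one op.
func group(src string) []op {
	var ops []op
	var pending []string
	for _, tok := range strings.Fields(src) {
		if isOperand(tok) {
			pending = append(pending, tok)
			continue
		}
		ops = append(ops, op{operator: tok, operands: pending})
		pending = nil
	}
	return ops
}

func main() {
	for _, o := range group("BT /F1 12 Tf 72 700 Td ET") {
		fmt.Println(o.operator, o.operands)
	}
}
```

A real scanner also has to handle strings, arrays, dictionaries, and inline images, none of which can be split on whitespace.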
Usage ¶
for op, err := range contentstream.New(decoded).All() {
	if err != nil { ... }
	switch op.Operator {
	case "Tf":
		// op.Operands[0].Name is the font resource key
	case "BDC":
		// op.Operands[0].Name is the structure tag
		// op.Operands[1] is either a Name (ref into /Properties)
		// or a Dict (inline properties)
	}
}
Scope ¶
The scanner accepts the subset of PDF object syntax that can appear in content streams: numbers, names, strings (literal and hex), arrays, dictionaries, and operator keywords. Indirect references and stream objects do not occur in content streams and are not handled.
Index ¶
Constants ¶
This section is empty.
Variables ¶
var ErrUnexpectedEOF = errors.New("pdfdisassembler/contentstream: unexpected EOF")
ErrUnexpectedEOF indicates that the scanner ran out of bytes mid-operation (e.g. inside a dictionary, or while looking for EI).
Functions ¶
This section is empty.
Types ¶
type Dict ¶
Dict is a small key→Operand map for inline content-stream dictionaries. Nested dictionaries are supported.
type Kind ¶
type Kind int
Kind identifies the type of an operand value.
const (
	// KindUnknown is the zero value; not produced by the scanner.
	KindUnknown Kind = iota
	// KindNumber covers both PDF integers and reals. Use Operand.Int()
	// to recover an int64 when the producer wrote an integer literal.
	KindNumber
	// KindName is a PDF name without the leading slash.
	KindName
	// KindString is a literal or hex string. The raw decoded bytes are
	// in Operand.Bytes; the scanner does not apply text-string decoding
	// (UTF-16BE BOM, PDFDocEncoding, …) because content-stream strings
	// are text shown to the reader and their semantic encoding depends
	// on the active font, not on the PDF text-string convention.
	KindString
	// KindArray holds operands of a PDF array, in source order. The
	// most common occurrence is the operand of TJ: a mix of strings
	// and number adjustments.
	KindArray
	// KindDict holds the entries of an inline dictionary. The most
	// common occurrence is the property dictionary that follows BDC.
	KindDict
	// KindBool is rare in content streams but appears in BDC property
	// dictionaries occasionally.
	KindBool
	// KindNull is rare in content streams but appears in BDC property
	// dictionaries occasionally.
	KindNull
)
type Op ¶
type Op struct {
	Operator string
	Operands []Operand
	Image    []byte
	// Offset is the byte position of the operator keyword in the
	// source slice. Useful for error messages and source ranges.
	Offset int64
}
Op is one content-stream operation: zero or more operands followed by an operator keyword (e.g. "Tf", "Tj", "BDC").
For inline-image runs, Operator is "EI" and Image carries the raw bytes between ID and EI; the BI dictionary is in Operands[0] as a KindDict (or empty if BI carried no entries).
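The reason inline images need special handling is that the bytes after ID are binary and can contain anything, including the letters "EI". The sketch below shows one common heuristic for finding the real terminator: accept "EI" only when it is surrounded by white space (or ends the input). This is an illustration under that assumption, not a description of how this package locates EI; `splitInlineImage` and `isSpace` are hypothetical helpers.

```go
package main

import (
	"bytes"
	"fmt"
)

// isSpace reports whether b is one of the PDF white-space characters.
func isSpace(b byte) bool {
	switch b {
	case 0x00, '\t', '\n', '\f', '\r', ' ':
		return true
	}
	return false
}

// splitInlineImage takes the bytes that follow the ID keyword and returns
// the image payload and the remainder after EI. ok is false when no
// white-space-delimited EI is found. The sketch assumes a single
// white-space byte separates ID from the image data.
func splitInlineImage(afterID []byte) (image, rest []byte, ok bool) {
	if len(afterID) == 0 || !isSpace(afterID[0]) {
		return nil, nil, false
	}
	data := afterID[1:]
	for from := 0; ; {
		i := bytes.Index(data[from:], []byte("EI"))
		if i < 0 {
			return nil, nil, false
		}
		i += from
		before := i > 0 && isSpace(data[i-1])
		after := i+2 == len(data) || isSpace(data[i+2])
		if before && after {
			// Drop the white-space byte that precedes EI.
			return data[:i-1], data[i+2:], true
		}
		from = i + 1 // a false match inside the image data; keep looking
	}
}

func main() {
	image, rest, ok := splitInlineImage([]byte(" \x00EIx\xff EI Q"))
	fmt.Printf("%q %q %v\n", image, rest, ok)
}
```

Even this heuristic can be fooled by image data that happens to contain white space, "EI", and white space in sequence, which is why stricter scanners also consult the image dimensions or filter from the BI dictionary.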
type Operand ¶
type Operand struct {
	Kind Kind
	// Number carries the parsed numeric value when Kind == KindNumber.
	// numStr preserves the original literal so Int() can decide whether
	// the producer wrote an integer.
	Number float64
	// Name is the name body (no leading slash) when Kind == KindName.
	Name string
	// Bytes is the decoded string payload when Kind == KindString.
	Bytes []byte
	// Array is the element list when Kind == KindArray.
	Array []Operand
	// Dict is the entry map when Kind == KindDict. Iteration order is
	// not preserved; use Dict.Keys / parse separately if order matters.
	Dict Dict
	// Bool is the boolean value when Kind == KindBool.
	Bool bool
	// contains filtered or unexported fields
}
Operand is a single value pushed onto the operand stack before an operator keyword. It is a tagged union: which field is meaningful depends on Kind.
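A tagged union like this is usually consumed with a switch on Kind that reads only the matching field. The sketch below uses local stand-in types that mirror a subset of the documented fields so that it compiles on its own; `describe` is a hypothetical helper, and real code would use the package's own Kind and Operand.

```go
package main

import (
	"fmt"
	"strings"
)

// Kind and Operand are local stand-ins that copy only the documented
// fields this example needs. They are not the package's definitions.
type Kind int

const (
	KindUnknown Kind = iota
	KindNumber
	KindName
	KindString
	KindArray
)

type Operand struct {
	Kind   Kind
	Number float64
	Name   string
	Bytes  []byte
	Array  []Operand
}

// describe reads only the field that is meaningful for o.Kind and
// recurses into array elements.
func describe(o Operand) string {
	switch o.Kind {
	case KindNumber:
		return fmt.Sprintf("number %g", o.Number)
	case KindName:
		return "name /" + o.Name
	case KindString:
		return fmt.Sprintf("string %q", o.Bytes)
	case KindArray:
		parts := make([]string, len(o.Array))
		for i, el := range o.Array {
			parts[i] = describe(el)
		}
		return "array [" + strings.Join(parts, ", ") + "]"
	default:
		return "unknown"
	}
}

func main() {
	// Shaped like the operand of TJ: strings mixed with number adjustments.
	tj := Operand{Kind: KindArray, Array: []Operand{
		{Kind: KindString, Bytes: []byte("Hel")},
		{Kind: KindNumber, Number: -20},
		{Kind: KindString, Bytes: []byte("lo")},
	}}
	fmt.Println(describe(tj))
}
```

Fields that do not correspond to Kind are left at their zero values, so reading them is harmless but carries no information.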
type Scanner ¶
type Scanner struct {
	// contains filtered or unexported fields
}
Scanner walks a decoded content stream and yields one Op at a time. It is not safe for concurrent use.
func New ¶
New returns a Scanner over the decoded content-stream bytes src. src is not copied. For pages whose /Contents is an array of streams, concatenate the decoded payloads with a single whitespace byte (per PDF 32000-1:2008 §7.8.2) before passing them in.
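The concatenation step can be done with the standard library alone. This is a minimal sketch that assumes each element of `parts` has already been decoded through its stream filters; `joinContents` is a hypothetical helper, and the choice of a newline as the separator is arbitrary (any single white-space byte would do).

```go
package main

import (
	"bytes"
	"fmt"
)

// joinContents concatenates already-decoded content streams, separated by
// a single newline so that the last token of one stream cannot fuse with
// the first token of the next.
func joinContents(parts [][]byte) []byte {
	return bytes.Join(parts, []byte{'\n'})
}

func main() {
	parts := [][]byte{[]byte("BT /F1 12 Tf"), []byte("(Hi) Tj ET")}
	fmt.Printf("%q\n", joinContents(parts))
}
```

The joined slice is a fresh allocation, so it can be handed to New without worrying about later changes to the individual stream buffers.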