contentstream

package
v0.0.3
Published: Jun 19, 2026 License: MIT Imports: 6 Imported by: 0

Documentation

Overview

Package contentstream tokenises PDF content streams into a sequence of operations. Content streams are the postfix-notation graphics instructions that paint each page (text-showing operators, path operators, graphics-state ops, marked-content tags, …).

The scanner is operand-aware: operands are collected up to each operator keyword and surfaced together as one Op. Inline images (BI/ID/EI) are folded into a single synthetic EI op so the binary image bytes between ID and EI do not derail tokenisation.

The scanner does NOT interpret the operations: it does not track graphics state, does not render glyphs, does not resolve XObjects. Higher-level consumers (e.g. tagged-PDF validators) layer that logic on top.
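The operand grouping described above can be illustrated with a deliberately tiny sketch. It is not the package's scanner: the names op, isOperand and group are hypothetical, tokens are split on whitespace only, and only numbers and names are recognised as operands (strings, arrays, dictionaries and inline images are ignored).

```go
package main

import (
	"fmt"
	"strings"
)

// op pairs an operator keyword with the operand tokens that preceded it.
// Hypothetical type for illustration; the real package exposes Op.
type op struct {
	operator string
	operands []string
}

// isOperand reports whether a token looks like a number or a name.
// Strings, arrays and dictionaries are out of scope for this toy.
func isOperand(tok string) bool {
	return strings.ContainsRune("+-.0123456789/", rune(tok[0]))
}

// group splits src on whitespace and emits one op per operator keyword,
// attaching every operand token seen since the previous operator.
func group(src string) []op {
	var ops []op
	var pending []string
	for _, tok := range strings.Fields(src) {
		if isOperand(tok) {
			pending = append(pending, tok)
			continue
		}
		ops = append(ops, op{operator: tok, operands: pending})
		pending = nil
	}
	return ops
}

func main() {
	for _, o := range group("q 1 0 0 1 50 700 cm /F1 12 Tf Q") {
		fmt.Println(o.operator, o.operands)
	}
}
```

Operands left over at the end of the input with no operator after them are silently dropped in this sketch; how the real scanner treats that case is not stated here.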

Usage

for op, err := range contentstream.New(decoded).All() {
    if err != nil {
        return err // malformed input; All stops after the first error
    }
    switch op.Operator {
    case "Tf":
        // op.Operands[0].Name is the font resource key
    case "BDC":
        // op.Operands[0].Name is the structure tag
        // op.Operands[1] is either a Name (ref into /Properties)
        // or a Dict (inline properties)
    }
}

Scope

The scanner accepts the subset of PDF object syntax that can appear in content streams: numbers, names, strings (literal and hex), arrays, dictionaries, and operator keywords. Indirect references and stream objects do not occur in content streams and are not handled.

Constants

This section is empty.

Variables

var ErrUnexpectedEOF = errors.New("pdfdisassembler/contentstream: unexpected EOF")

ErrUnexpectedEOF indicates that the scanner ran out of bytes mid-operation (e.g. inside a dictionary, or while looking for EI).

Functions

This section is empty.

Types

type Dict

type Dict map[string]Operand

Dict is a small key→Operand map for inline content-stream dictionaries. Nested dictionaries are supported.

type Kind

type Kind int

Kind identifies the type of an operand value.

const (
	// KindUnknown is the zero value; not produced by the scanner.
	KindUnknown Kind = iota
	// KindNumber covers both PDF integers and reals. Use Operand.Int()
	// to recover an int64 when the producer wrote an integer literal.
	KindNumber
	// KindName is a PDF name without the leading slash.
	KindName
	// KindString is a literal or hex string. The raw decoded bytes are
	// in Operand.Bytes; the scanner does not apply text-string decoding
	// (UTF-16BE BOM, PDFDocEncoding, …) because content-stream strings
	// are text shown to the reader and their semantic encoding depends
	// on the active font, not on the PDF text-string convention.
	KindString
	// KindArray holds operands of a PDF array, in source order. The
	// most common occurrence is the operand of TJ: a mix of strings
	// and number adjustments.
	KindArray
	// KindDict holds the entries of an inline dictionary. The most
	// common occurrence is the property dictionary that follows BDC.
	KindDict
	// KindBool is rare in content streams but appears in BDC property
	// dictionaries occasionally.
	KindBool
	// KindNull is rare in content streams but appears in BDC property
	// dictionaries occasionally.
	KindNull
)

type Op

type Op struct {
	Operator string
	Operands []Operand
	Image    []byte
	// Offset is the byte position of the operator keyword in the
	// source slice. Useful for error messages and source ranges.
	Offset int64
}

Op is one content-stream operation: zero or more operands followed by an operator keyword (e.g. "Tf", "Tj", "BDC").

For inline-image runs, Operator is "EI" and Image carries the raw bytes between ID and EI; the BI dictionary is in Operands[0] as a KindDict (or empty if BI carried no entries).
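As a rough illustration of consuming an inline-image op, the sketch below pulls out the BI dictionary and the image bytes. The types are local stand-ins that copy only the documented fields so the snippet compiles on its own; inlineImage is a hypothetical helper, and the dictionary key "W" (no leading slash) is an assumption about how keys are stored.

```go
package main

import "fmt"

// Local stand-ins mirroring the documented declarations. In real code
// these come from the contentstream package.
type Kind int

const (
	KindUnknown Kind = iota
	KindNumber
	KindName
	KindString
	KindArray
	KindDict
)

type Dict map[string]Operand

type Operand struct {
	Kind Kind
	Dict Dict
}

type Op struct {
	Operator string
	Operands []Operand
	Image    []byte
}

// inlineImage returns the BI dictionary and raw image bytes of an "EI"
// op. ok is false for any other operator. The dictionary is nil when
// the op carries no KindDict operand, so both readings of "empty" in
// the documentation are handled.
func inlineImage(op Op) (dict Dict, data []byte, ok bool) {
	if op.Operator != "EI" {
		return nil, nil, false
	}
	if len(op.Operands) > 0 && op.Operands[0].Kind == KindDict {
		dict = op.Operands[0].Dict
	}
	return dict, op.Image, true
}

func main() {
	op := Op{
		Operator: "EI",
		// Key spelling "W" is assumed for illustration only.
		Operands: []Operand{{Kind: KindDict, Dict: Dict{"W": {Kind: KindNumber}}}},
		Image:    []byte{0xFF, 0x00},
	}
	dict, data, ok := inlineImage(op)
	fmt.Println(len(dict), len(data), ok)
}
```

Checking len(op.Operands) before indexing keeps the helper safe whether the scanner reports an entry-less BI as a missing operand or as an empty dictionary.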

type Operand

type Operand struct {
	Kind Kind
	// Number carries the parsed numeric value when Kind == KindNumber.
	// numStr preserves the original literal so Int() can decide whether
	// the producer wrote an integer.
	Number float64

	// Name is the name body (no leading slash) when Kind == KindName.
	Name string
	// Bytes is the decoded string payload when Kind == KindString.
	Bytes []byte
	// Array is the element list when Kind == KindArray.
	Array []Operand
	// Dict is the entry map when Kind == KindDict. Iteration order is
	// not preserved; use Dict.Keys / parse separately if order matters.
	Dict Dict
	// Bool is the boolean value when Kind == KindBool.
	Bool bool
	// contains filtered or unexported fields
}

Operand is a single value pushed onto the operand stack before an operator keyword. It is a tagged union: which field is meaningful depends on Kind.
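A minimal sketch of reading the tagged union, assuming the documented field layout: it walks a TJ array operand and keeps only the string elements. The types are local stand-ins so the snippet is self-contained, and tjBytes is a hypothetical helper, not part of the package.

```go
package main

import "fmt"

// Local stand-ins mirroring the documented declarations. In real code
// these come from the contentstream package.
type Kind int

const (
	KindUnknown Kind = iota
	KindNumber
	KindName
	KindString
	KindArray
)

type Operand struct {
	Kind   Kind
	Number float64
	Bytes  []byte
	Array  []Operand
}

// tjBytes concatenates the string elements of a TJ array operand and
// skips the numeric position adjustments. It returns nil when the
// operand is not an array.
func tjBytes(arr Operand) []byte {
	if arr.Kind != KindArray {
		return nil
	}
	var out []byte
	for _, el := range arr.Array {
		if el.Kind == KindString {
			out = append(out, el.Bytes...)
		}
	}
	return out
}

func main() {
	arr := Operand{Kind: KindArray, Array: []Operand{
		{Kind: KindString, Bytes: []byte("Hel")},
		{Kind: KindNumber, Number: -20},
		{Kind: KindString, Bytes: []byte("lo")},
	}}
	fmt.Printf("%q\n", tjBytes(arr))
}
```

The result is a run of font-encoded character codes, not Unicode text; mapping it to readable characters needs the active font's encoding, which the scanner does not track.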

func (Operand) Int

func (o Operand) Int() (int64, bool)

Int reports the operand as an int64 if the producer wrote an integer literal (no decimal point, no exponent). The ok flag is false for real-number literals and for non-number operands.
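The integer-literal rule can be approximated as follows. This is a sketch of the stated rule only, not the package's implementation: intLiteral is a hypothetical helper that works on the raw literal text and relies on strconv.ParseInt rejecting decimal points and exponents in base 10.

```go
package main

import (
	"fmt"
	"strconv"
)

// intLiteral reports lit as an int64 when it is written as a plain
// integer (optional sign, digits only). Literals with a decimal point
// or exponent, and values outside the int64 range, yield ok == false.
func intLiteral(lit string) (n int64, ok bool) {
	n, err := strconv.ParseInt(lit, 10, 64)
	if err != nil {
		return 0, false
	}
	return n, true
}

func main() {
	for _, lit := range []string{"12", "-3", "12.0", "1e3"} {
		n, ok := intLiteral(lit)
		fmt.Println(lit, n, ok)
	}
}
```

Working from the literal text rather than the parsed float64 is what lets "12" and "12.0" be told apart, since both parse to the same Number value.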

type Scanner

type Scanner struct {
	// contains filtered or unexported fields
}

Scanner walks a decoded content stream and yields one Op at a time. It is not safe for concurrent use.

func New

func New(src []byte) *Scanner

New returns a Scanner over the decoded content-stream bytes src. src is not copied. For pages whose /Contents is an array of streams, join the decoded payloads, separated by a single whitespace byte (per PDF 32000-1:2008 §7.8.2), before passing the result in.
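One way to prepare a multi-stream page is sketched below. joinContents is a hypothetical helper, and the payloads are assumed to be already decoded (filters removed) by the caller.

```go
package main

import (
	"bytes"
	"fmt"
)

// joinContents concatenates decoded content-stream payloads, placing
// one space byte between neighbours so that a token ending one stream
// cannot fuse with the token starting the next.
func joinContents(parts [][]byte) []byte {
	return bytes.Join(parts, []byte{' '})
}

func main() {
	parts := [][]byte{[]byte("BT /F1 12 Tf"), []byte("(Hi) Tj ET")}
	fmt.Printf("%s\n", joinContents(parts))
}
```

bytes.Join allocates a fresh slice, so the joined buffer can be handed to New without aliasing the individual stream payloads.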

func (*Scanner) All

func (s *Scanner) All() iter.Seq2[Op, error]

All returns a range-over-func iterator that yields each Op until EOF or the first error. The error, if any, is delivered through the second loop variable on the final iteration.

func (*Scanner) Next

func (s *Scanner) Next() (Op, error)

Next returns the next operation. At end of stream it returns io.EOF. Any other error indicates malformed input; the scanner is not safe to keep using after an error.
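A caller driving Next by hand has three outcomes to tell apart. The sketch below shows one way to do that with errors.Is; ErrUnexpectedEOF is redeclared locally as a stand-in so the snippet compiles on its own, and classify is a hypothetical helper.

```go
package main

import (
	"errors"
	"fmt"
	"io"
)

// ErrUnexpectedEOF is a local stand-in for the package's exported
// sentinel of the same name.
var ErrUnexpectedEOF = errors.New("pdfdisassembler/contentstream: unexpected EOF")

// classify maps an error returned by a Next-style call to a coarse
// outcome. errors.Is is used so wrapped errors are matched too.
func classify(err error) string {
	switch {
	case err == nil:
		return "ok"
	case errors.Is(err, io.EOF):
		return "done"
	case errors.Is(err, ErrUnexpectedEOF):
		return "truncated"
	default:
		return "malformed"
	}
}

func main() {
	wrapped := fmt.Errorf("inline image: %w", ErrUnexpectedEOF)
	fmt.Println(classify(nil), classify(io.EOF), classify(wrapped), classify(errors.New("x")))
}
```

Whether the real scanner wraps its sentinel is not documented here; errors.Is works in both cases, which is why it is preferred over a direct comparison.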
