pdfdisassembler

v0.0.3
Published: Jun 19, 2026 | License: MIT | Imports: 17 | Imported by: 0

README

pdfdisassembler

A focused, read-only PDF parser for Go. Built for tooling that inspects PDFs — accessibility checkers, validators, debuggers — without dragging in the writing, optimisation, signing and image-rendering machinery that general-purpose PDF libraries carry.

Full API documentation: https://pkg.go.dev/github.com/speedata/pdfdisassembler

Status

Pre-1.0. The API may break between minor releases.

Why

The Go PDF ecosystem has a real gap for read-only structural inspection. Existing libraries are too large (pdfcpu: ~50 kLOC, multi-MB WASM overhead), restrictively licensed (unipdf: AGPL/commercial), dependent on CGo (go-fitz), or too thin (rsc/pdf, ledongthuc/pdf). pdfdisassembler targets PDF 1.x and 2.0 reading in pure Go and is WASM-friendly by construction.

Scope

In scope: PDF 1.x and 2.0 reading, classical xref and xref streams, indirect-object resolution, stream filters (FlateDecode, ASCII85, ASCIIHex, LZW, RunLength), text-string decoding (PDFDocEncoding, UTF-16BE BOM, UTF-8 BOM), catalog + page tree, DocumentInfo, XMP metadata access, structure tree traversal, /Standard security handler (V2, V4, V5), defensive parsing.

Out of scope: writing PDFs, image filters (DCTDecode/JBIG2/JPX/CCITTFax), image rendering, font internals, XFA, public-key encryption, signature verification, content-stream graphics-state interpretation, LTV.

Usage

Open a file and read top-level metadata:

import (
    "fmt"

    "github.com/speedata/pdfdisassembler"
)

r, err := pdfdisassembler.OpenFile("doc.pdf")
if err != nil {
    return err
}
defer r.Close()

fmt.Println("PDF version:", r.Version())
info := r.DocumentInfo()
fmt.Println("Title:", info.Title)

Walk every live indirect object and decode any streams that carry one of the supported filters:

r, err := pdfdisassembler.OpenFile("doc.pdf")
if err != nil {
    return err
}
defer r.Close()

for entry := range r.Objects() {
    s, ok := entry.Object.(*pdfdisassembler.Stream)
    if !ok {
        continue
    }
    ref := entry.Reference
    data, err := r.DecodeStream(ref)
    if err != nil {
        fmt.Printf("%d %d R: %v\n", ref.Number, ref.Generation, err)
        continue
    }
    fmt.Printf("%d %d R: %d bytes raw, %d bytes decoded\n",
        ref.Number, ref.Generation, s.RawLength(), len(data))
}

More complete examples live under examples/: inspect prints a summary of a PDF, structtree walks the /StructTreeRoot as a starting point for accessibility tooling.

Testing

Snapshot tests live under testdata/fixtures/<name>/. Each fixture has an input.pdf and a committed golden.json. TestFixtures opens every fixture, runs Dump, and compares against the golden — a byte-stable JSON snapshot of the object graph.

Adding a fixture:

  1. Drop input.pdf into testdata/fixtures/<name>/
  2. go test -update -run TestFixtures/<name> — generates golden.json
  3. Inspect the golden manually: does it match what the PDF spec says should happen? The golden is the expected behaviour, not "what the parser currently does".
  4. Commit the PDF, the golden, and an optional README.md explaining what the fixture proves

For synthetic fixtures, see testdata/fixtures/generate.go. Run it from the repo root to (re)create the in-code fixture PDFs.

The same dump format is exposed as a CLI:

go install github.com/speedata/pdfdisassembler/cmd/pdfdump@latest
pdfdump doc.pdf > doc.json
diff <(pdfdump a.pdf) <(pdfdump b.pdf)

License

MIT. See LICENSE.

Documentation

Overview

Package pdfdisassembler is a focused, read-only PDF parser for Go.

It targets tooling that *inspects* PDFs — accessibility checkers, validators, debuggers — without dragging in the writing, optimisation, signing and image-rendering machinery that general-purpose PDF libraries carry.

Scope

In scope: PDF 1.x and 2.0 reading, classical xref and xref streams, indirect-object resolution, stream filters (FlateDecode, ASCII85, ASCIIHex, LZW, RunLength), text-string decoding (PDFDocEncoding, UTF-16BE BOM, UTF-8 BOM), catalog + page tree, DocumentInfo, XMP metadata access, structure-tree traversal, the /Standard security handler (V2, V4, V5), defensive xref recovery.

Out of scope: writing PDFs, image filters (DCTDecode/JBIG2/JPX/CCITTFax), image rendering, font internals, XFA, public-key encryption, signature verification, content-stream graphics-state interpretation, LTV.

Usage

r, err := pdfdisassembler.OpenFile("doc.pdf")
if err != nil {
    return err
}
defer r.Close()

fmt.Println("PDF version:", r.Version())
info := r.DocumentInfo()
fmt.Println("Title:", info.Title)

for entry := range r.Objects() {
    // inspect every live indirect object
    _ = entry
}

API stability

Pre-1.0. The API may break between minor releases but never within a patch release.

Index

Constants

const DefaultMaxStreamSize int64 = 16 << 20

DefaultMaxStreamSize is the per-stream decoded-size cap Open uses by default.

Variables

This section is empty.

Functions

func Dump

func Dump(r *Reader, opts DumpOptions) ([]byte, error)

Dump returns a deterministic JSON snapshot of r suitable for golden-file snapshot tests and human inspection.

The output is tagged so every PDF value kind is unambiguous (Name vs. Text-String vs. Byte-String, Integer vs. Real, etc.). Indirect references are preserved as references, so the object graph is acyclic and diffs stay reviewable in PRs. Streams contribute metadata (raw_length, filter chain, decoded length, SHA-256, optional preview) but never their full content — see DumpOptions.InlineStreamContent if you need it.

The output is intended to be byte-stable across runs; dictionary keys are emitted in PDF insertion order, objects in (Number, Generation) order.

Types

type Array

type Array []Object

Array is a PDF array of objects.

type Bool

type Bool bool

Bool is a PDF boolean object.

type Dict

type Dict struct {
	// contains filtered or unexported fields
}

Dict is a PDF dictionary that preserves insertion order during iteration.

func (*Dict) Array

func (d *Dict) Array(key string) (Array, bool)

Array returns the Array value at key. If the value is a Reference, it is resolved first.

func (*Dict) Bool

func (d *Dict) Bool(key string) (bool, bool)

Bool returns the Bool value at key. If the value is a Reference, it is resolved first.

func (*Dict) Bytes

func (d *Dict) Bytes(key string) ([]byte, bool)

Bytes returns the raw bytes of a String value at key, without text-string decoding. Useful for byte strings (file identifiers, hashes). If the value is a Reference, it is resolved first.

func (*Dict) Dict

func (d *Dict) Dict(key string) (*Dict, bool)

Dict returns the *Dict value at key. If the value is a Reference, it is resolved first.

func (*Dict) Get

func (d *Dict) Get(key string) (Object, bool)

Get returns the raw object for key. The returned object may be a Reference; use Dict.Dict / Reader.Resolve if you need the resolved value.

func (*Dict) Has

func (d *Dict) Has(key string) bool

Has reports whether key is present.

func (*Dict) Int

func (d *Dict) Int(key string) (int64, bool)

Int returns the Integer value at key. If the value is a Reference, it is resolved first.

func (*Dict) Iter

func (d *Dict) Iter() iter.Seq2[string, Object]

Iter returns an iterator over key/value pairs in insertion order.

func (*Dict) Keys

func (d *Dict) Keys() []string

Keys returns the dictionary keys in insertion order.

func (*Dict) Len

func (d *Dict) Len() int

Len returns the number of entries in the dictionary.

func (*Dict) Name

func (d *Dict) Name(key string) (Name, bool)

Name returns the Name value at key. If the value is a Reference, it is resolved first.

func (*Dict) Stream

func (d *Dict) Stream(key string) (*Stream, bool)

Stream returns the *Stream value at key. If the value is a Reference, it is resolved first.

func (*Dict) String

func (d *Dict) String(key string) (string, bool)

String returns the value at key as a Go string, decoded according to the PDF text-string rules (UTF-16BE BOM, UTF-8 BOM, otherwise PDFDocEncoding). If the value is a Reference, it is resolved first.
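
The BOM-based dispatch described above can be sketched as follows. This is an illustration under a stated simplification, not the package's decoder: the fallback maps each byte to the code point of the same value (Latin-1), which matches PDFDocEncoding for printable ASCII and most accented letters but not for every code point. The function name decodeTextString is hypothetical.

```go
package main

import (
	"bytes"
	"fmt"
	"unicode/utf16"
)

// decodeTextString picks a decoding based on the leading byte order mark.
func decodeTextString(b []byte) string {
	switch {
	case bytes.HasPrefix(b, []byte{0xFE, 0xFF}): // UTF-16BE BOM
		b = b[2:]
		units := make([]uint16, 0, len(b)/2)
		for i := 0; i+1 < len(b); i += 2 {
			units = append(units, uint16(b[i])<<8|uint16(b[i+1]))
		}
		return string(utf16.Decode(units))
	case bytes.HasPrefix(b, []byte{0xEF, 0xBB, 0xBF}): // UTF-8 BOM
		return string(b[3:])
	default:
		// Simplified fallback: treat bytes as Latin-1. A full implementation
		// would use the PDFDocEncoding table instead.
		runes := make([]rune, len(b))
		for i, c := range b {
			runes[i] = rune(c)
		}
		return string(runes)
	}
}

func main() {
	fmt.Println(decodeTextString([]byte{0xFE, 0xFF, 0x00, 'H', 0x00, 'i'}))
	fmt.Println(decodeTextString([]byte{0xEF, 0xBB, 0xBF, 0xC3, 0xA9}))
	fmt.Println(decodeTextString([]byte("Title")))
}
```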

type DocInfo

type DocInfo struct {
	Title        string
	Author       string
	Subject      string
	Keywords     string
	Creator      string
	Producer     string
	CreationDate time.Time
	ModDate      time.Time
	Custom       map[string]string
}

DocInfo is a value snapshot of the standard /Info dictionary entries. Missing entries are zero values. Custom carries any non-standard keys as raw decoded strings.

type DumpOptions

type DumpOptions struct {
	// PreviewMaxBytes is the maximum number of decoded stream bytes shown
	// as the preview_utf8 field. Default 80. Set to -1 to disable.
	PreviewMaxBytes int
	// InlineStreamContent embeds the decoded stream as hex under the
	// "decoded.hex" field. Off by default — real PDFs produce huge
	// fixtures otherwise.
	InlineStreamContent bool
}

DumpOptions controls Dump's behaviour.

type EmbeddedFile added in v0.0.3

type EmbeddedFile struct {
	Name string
	Spec *Dict
}

EmbeddedFile is one entry from the catalog's EmbeddedFiles name tree (a PDF attachment). Spec is the /Filespec dictionary; its /EF dictionary references the embedded file stream that holds the bytes.

type Integer

type Integer int64

Integer is a PDF integer object.

type Name

type Name string

Name is a PDF name object, e.g. /Length, /Type. The leading slash is not stored.

type Null

type Null struct{}

Null is the PDF null object.

type Object

type Object interface {
	// contains filtered or unexported methods
}

Object is the sealed PDF object type. Concrete variants:

Name, Integer, Real, Bool, String, *Dict, Array, *Stream, Reference, Null

Callers should type-switch or type-assert on the concrete type when they need a specific value. The Resolve* helpers on Reader perform the common dereference-and-assert pattern.
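
The sealed-interface pattern and the type switch it enables can be sketched with local stand-in types. The names object, name, integer, array and describe are hypothetical and only mirror the idea; they are not the package's types.

```go
package main

import "fmt"

// object is sealed: the unexported marker method prevents other packages from
// adding variants, so a type switch can cover every case.
type object interface{ isObject() }

type (
	name    string
	integer int64
	array   []object
)

func (name) isObject()    {}
func (integer) isObject() {}
func (array) isObject()   {}

// describe switches on the concrete variant.
func describe(o object) string {
	switch v := o.(type) {
	case name:
		return "name /" + string(v)
	case integer:
		return fmt.Sprintf("integer %d", int64(v))
	case array:
		return fmt.Sprintf("array of %d", len(v))
	default:
		return "unknown"
	}
}

func main() {
	fmt.Println(describe(name("Type")))
	fmt.Println(describe(integer(42)))
	fmt.Println(describe(array{integer(1), name("Page")}))
}
```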

type ObjectEntry

type ObjectEntry struct {
	Reference Reference
	Object    Object
}

ObjectEntry is yielded by Reader.Objects: an in-use indirect object plus its resolved value.

type Option added in v0.0.3

type Option func(*Reader)

Option configures a Reader at Open time.

func WithMaxStreamSize added in v0.0.3

func WithMaxStreamSize(n int64) Option

WithMaxStreamSize sets the per-stream decoded-size cap; n <= 0 disables it. Applied before parsing, so it also bounds streams decoded during Open.

type Reader

type Reader struct {

	// MaxStreamSize caps each stream's decoded size; <= 0 disables it. Setting
	// it after Open misses Open-time (xref/object) streams; use WithMaxStreamSize.
	MaxStreamSize int64
	// contains filtered or unexported fields
}

Reader is a parsed PDF document. It is not safe for concurrent use.

func Open

func Open(rs io.ReadSeeker, opts ...Option) (*Reader, error)

Open parses a PDF from rs. rs must remain valid for the lifetime of the returned Reader.

func OpenFile

func OpenFile(path string, opts ...Option) (*Reader, error)

OpenFile opens path and parses it as a PDF. The file stays open until Reader.Close is called.

func (*Reader) Catalog

func (r *Reader) Catalog() (*Dict, error)

Catalog returns the document catalog dictionary.

func (*Reader) Close

func (r *Reader) Close() error

Close releases the underlying resource. For Reader instances created via Open with a non-file ReadSeeker, Close is a no-op.

func (*Reader) DecodeStream

func (r *Reader) DecodeStream(obj Object) ([]byte, error)

DecodeStream resolves obj to a stream and returns its decoded content.

func (*Reader) DocumentInfo

func (r *Reader) DocumentInfo() DocInfo

DocumentInfo returns the standard /Info dictionary entries as a value snapshot. Missing entries return zero values.

func (*Reader) EmbeddedFiles added in v0.0.3

func (r *Reader) EmbeddedFiles() []EmbeddedFile

EmbeddedFiles returns the document's embedded files (PDF attachments) from the catalog's EmbeddedFiles name tree, in tree order. Returns nil when there are none.

func (*Reader) Objects

func (r *Reader) Objects() iter.Seq[ObjectEntry]

Objects iterates every live indirect object in the xref table.

func (*Reader) Resolve

func (r *Reader) Resolve(obj Object) (Object, error)

Resolve follows an indirect reference. If obj is not a Reference, returns obj unchanged. Resolution is cached.
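
A cached resolver in the spirit of this description might look like the sketch below. The resolver type and its load callback are hypothetical; in particular, the callback stands in for whatever parsing the real Reader performs, and values are modelled as any for brevity.

```go
package main

import "fmt"

type reference struct{ Number, Generation int }

// resolver caches loaded objects by reference.
type resolver struct {
	load  func(reference) (any, error) // hypothetical loader, e.g. parse at an xref offset
	cache map[reference]any
}

func newResolver(load func(reference) (any, error)) *resolver {
	return &resolver{load: load, cache: make(map[reference]any)}
}

// resolve returns obj unchanged unless it is a reference; references are
// loaded once and served from the cache afterwards.
func (r *resolver) resolve(obj any) (any, error) {
	ref, ok := obj.(reference)
	if !ok {
		return obj, nil
	}
	if v, ok := r.cache[ref]; ok {
		return v, nil
	}
	v, err := r.load(ref)
	if err != nil {
		return nil, err
	}
	r.cache[ref] = v
	return v, nil
}

func main() {
	loads := 0
	r := newResolver(func(ref reference) (any, error) {
		loads++
		return fmt.Sprintf("object %d %d", ref.Number, ref.Generation), nil
	})
	for i := 0; i < 3; i++ {
		v, _ := r.resolve(reference{12, 0})
		fmt.Println(v)
	}
	fmt.Println("loads:", loads)
}
```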

func (*Reader) ResolveArray

func (r *Reader) ResolveArray(obj Object) (Array, error)

ResolveArray resolves obj to an Array; errors otherwise.

func (*Reader) ResolveBool

func (r *Reader) ResolveBool(obj Object) (bool, error)

ResolveBool resolves obj to a bool; errors otherwise.

func (*Reader) ResolveDict

func (r *Reader) ResolveDict(obj Object) (*Dict, error)

ResolveDict resolves obj to a *Dict; errors when obj is missing or is not a dictionary.

func (*Reader) ResolveInt

func (r *Reader) ResolveInt(obj Object) (int64, error)

ResolveInt resolves obj to an int64; errors otherwise.

func (*Reader) Trailer

func (r *Reader) Trailer() *Dict

Trailer returns the trailer dictionary.

func (*Reader) Version

func (r *Reader) Version() string

Version returns the PDF version declared in the file header (e.g. "1.7" or "2.0"). If the catalog declares a /Version entry that exceeds the header version, the catalog value wins (per spec).
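
The header-versus-catalog rule can be sketched as a numeric comparison of the two version strings. The helpers parseVersion and effectiveVersion are hypothetical, and the handling of malformed values (falling back to the header) is an assumption of this sketch, not documented behaviour.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseVersion splits "major.minor" into its numeric parts.
func parseVersion(s string) (int, int, bool) {
	majStr, minStr, found := strings.Cut(s, ".")
	if !found {
		return 0, 0, false
	}
	major, err := strconv.Atoi(majStr)
	if err != nil {
		return 0, 0, false
	}
	minor, err := strconv.Atoi(minStr)
	if err != nil {
		return 0, 0, false
	}
	return major, minor, true
}

// effectiveVersion returns the catalog version only when it is later than the
// header version; otherwise the header version stands.
func effectiveVersion(header, catalog string) string {
	hMaj, hMin, hOK := parseVersion(header)
	cMaj, cMin, cOK := parseVersion(catalog)
	if !hOK || !cOK {
		return header // assumption: ignore a missing or malformed catalog entry
	}
	if cMaj > hMaj || (cMaj == hMaj && cMin > hMin) {
		return catalog
	}
	return header
}

func main() {
	fmt.Println(effectiveVersion("1.4", "1.7"))
	fmt.Println(effectiveVersion("1.7", "1.4"))
	fmt.Println(effectiveVersion("1.7", ""))
}
```

Comparing the parts as integers avoids the pitfalls of string or floating-point comparison on version numbers.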

type Real

type Real float64

Real is a PDF real-number object.

type Reference

type Reference struct {
	Number     int
	Generation int
}

Reference is an indirect object reference (e.g. "12 0 R").

type Stream

type Stream struct {
	// Dict is the stream's parameter dictionary, e.g. /Length, /Filter.
	Dict *Dict
	// contains filtered or unexported fields
}

Stream is a stream object. The decoded content is produced by Content, which applies the declared filter chain (FlateDecode, ASCII85, …) and any document-level decryption, and caches the result.

func (*Stream) Content

func (s *Stream) Content() ([]byte, error)

Content returns the decoded stream bytes. Filters and decryption are applied on the first call and the result is cached for subsequent calls.

func (*Stream) RawLength

func (s *Stream) RawLength() int64

RawLength returns the declared raw byte length of the stream.

type String

type String []byte

String is a PDF string object after parsing. The parser strips the literal-string parentheses or hex-string angle brackets and decodes escape sequences and hexadecimal pairs, but does not re-encode the bytes — they are whatever the producer wrote.

For PDF text strings ("Title", "Subject", "Producer", and so on) use Dict.String or DocumentInfo, which apply the text-string decoding rules (PDFDocEncoding, UTF-16BE BOM, UTF-8 BOM).
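
As an illustration of the hex-string half of that parsing step, the sketch below decodes a hex string token into raw bytes. It follows the PDF rules that white space between digits is ignored and that an odd final digit is treated as if followed by 0. The function name decodeHexString is hypothetical and this is not the package's lexer.

```go
package main

import (
	"errors"
	"fmt"
)

// decodeHexString turns a token such as "<48656C6C6F>" into raw bytes.
func decodeHexString(src string) ([]byte, error) {
	if len(src) < 2 || src[0] != '<' || src[len(src)-1] != '>' {
		return nil, errors.New("hex string must be enclosed in <>")
	}
	var digits []byte
	for i := 1; i < len(src)-1; i++ {
		c := src[i]
		switch {
		case c >= '0' && c <= '9':
			digits = append(digits, c-'0')
		case c >= 'a' && c <= 'f':
			digits = append(digits, c-'a'+10)
		case c >= 'A' && c <= 'F':
			digits = append(digits, c-'A'+10)
		case c == ' ' || c == '\t' || c == '\n' || c == '\r' || c == '\f' || c == 0:
			// PDF white space between digits is ignored.
		default:
			return nil, fmt.Errorf("invalid hex digit %q", c)
		}
	}
	if len(digits)%2 == 1 {
		digits = append(digits, 0) // odd final digit is read as if followed by 0
	}
	out := make([]byte, len(digits)/2)
	for i := range out {
		out[i] = digits[2*i]<<4 | digits[2*i+1]
	}
	return out, nil
}

func main() {
	b, err := decodeHexString("<48656C6C6F>")
	fmt.Printf("%s %v\n", b, err)
	b, err = decodeHexString("<901FA>")
	fmt.Printf("% X %v\n", b, err)
}
```

The result is a plain byte slice; whether those bytes are text is a separate question answered by the text-string decoding rules.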

Directories

Path: Synopsis
cmd/pdfdump (command): Command pdfdump emits a JSON snapshot of a PDF in the same format used by pdfdisassembler's snapshot-test harness.
contentstream: Package contentstream tokenises PDF content streams into a sequence of operations.
examples/inspect (command): Command inspect prints a summary of a PDF: version, document info, catalog top-level keys, page count.
examples/structtree (command): Command structtree dumps the /StructTreeRoot of a PDF in indented form.
internal/crypt: Package crypt implements the PDF /Standard security handler for versions V2 (RC4), V4 (RC4 or AES-128) and V5 (AES-256, PDF 1.7 Extension 3 and PDF 2.0).
internal/filter: Package filter implements the PDF stream filters needed for read-only inspection of document structure: FlateDecode (with predictors), LZW, ASCII85, ASCIIHex, RunLength.
internal/lex: Package lex tokenises PDF input.
