pdfdisassembler

package module

Version: v0.0.3 (latest) | Published: Jun 19, 2026 | License: MIT | Imports: 17 | Imported by: 0

Repository

github.com/speedata/pdfdisassembler

README ¶

pdfdisassembler

A focused, read-only PDF parser for Go. Built for tooling that inspects PDFs — accessibility checkers, validators, debuggers — without dragging in the writing, optimisation, signing and image-rendering machinery that general-purpose PDF libraries carry.

Full API documentation: https://pkg.go.dev/github.com/speedata/pdfdisassembler

Status

Pre-1.0. The API may break between minor releases.

Why

The Go PDF ecosystem has a real gap for read-only structural inspection. Existing libraries are too large (pdfcpu: ~50 kLOC, multi-MB WASM overhead), restrictively licensed (unipdf: AGPL/commercial), CGo-dependent (go-fitz), or too thin (rsc/pdf, ledongthuc/pdf). pdfdisassembler targets PDF 1.x and 2.0 reading in pure Go, WASM-friendly by construction.

Scope

In scope: PDF 1.x and 2.0 reading, classical xref and xref streams, indirect-object resolution, stream filters (FlateDecode, ASCII85, ASCIIHex, LZW, RunLength), text-string decoding (PDFDocEncoding, UTF-16BE BOM, UTF-8 BOM), catalog + page tree, DocumentInfo, XMP metadata access, structure tree traversal, /Standard security handler (V2, V4, V5), defensive parsing.

Out of scope: writing PDFs, image filters (DCTDecode/JBIG2/JPX/CCITTFax), image rendering, font internals, XFA, public-key encryption, signature verification, content-stream graphics-state interpretation, LTV.
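To make one of the in-scope stream filters concrete, here is a minimal, standard-library-only sketch of RunLength decoding as the PDF specification describes it. It is not the package's internal/filter code; runLengthDecode and errTruncated are hypothetical names chosen for this illustration.

```go
package main

import (
	"bytes"
	"errors"
	"fmt"
)

// errTruncated is a hypothetical error used only in this sketch.
var errTruncated = errors.New("runlength: truncated input")

// runLengthDecode decodes RunLengthDecode data: a length byte n < 128 means
// "copy the next n+1 bytes", n > 128 means "repeat the next byte 257-n times",
// and n == 128 marks end of data.
func runLengthDecode(src []byte) ([]byte, error) {
	var out []byte
	for i := 0; i < len(src); {
		n := int(src[i])
		i++
		switch {
		case n == 128:
			return out, nil // end-of-data marker
		case n < 128:
			end := i + n + 1
			if end > len(src) {
				return nil, errTruncated
			}
			out = append(out, src[i:end]...)
			i = end
		default:
			if i >= len(src) {
				return nil, errTruncated
			}
			out = append(out, bytes.Repeat(src[i:i+1], 257-n)...)
			i++
		}
	}
	return out, nil // this sketch tolerates a missing end-of-data marker
}

func main() {
	out, err := runLengthDecode([]byte{2, 'a', 'b', 'c', 254, 'x', 128})
	fmt.Printf("%q %v\n", out, err)
}
```

Returning an error on truncated input rather than panicking follows the defensive-parsing goal stated in the scope list.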

Usage

Open a file and read top-level metadata:

import "github.com/speedata/pdfdisassembler"

r, err := pdfdisassembler.OpenFile("doc.pdf")
if err != nil {
    return err
}
defer r.Close()

fmt.Println("PDF version:", r.Version())
info := r.DocumentInfo()
fmt.Println("Title:", info.Title)

Walk every live indirect object and decode any streams that carry one of the supported filters:

r, err := pdfdisassembler.OpenFile("doc.pdf")
if err != nil {
    return err
}
defer r.Close()

for entry := range r.Objects() {
    s, ok := entry.Object.(*pdfdisassembler.Stream)
    if !ok {
        continue
    }
    ref := entry.Reference
    data, err := r.DecodeStream(ref)
    if err != nil {
        fmt.Printf("%d %d R: %v\n", ref.Number, ref.Generation, err)
        continue
    }
    fmt.Printf("%d %d R: %d bytes raw, %d bytes decoded\n",
        ref.Number, ref.Generation, s.RawLength(), len(data))
}

More complete examples live under examples/: inspect prints a summary of a PDF, and structtree walks the /StructTreeRoot as a starting point for accessibility tooling.

Testing

Snapshot tests live under testdata/fixtures/<name>/. Each fixture has an input.pdf and a committed golden.json. TestFixtures opens every fixture, runs Dump, and compares against the golden — a byte-stable JSON snapshot of the object graph.

Adding a fixture:

1. Drop input.pdf into testdata/fixtures/<name>/.
2. Run go test -update -run TestFixtures/<name> to generate golden.json.
3. Inspect the golden manually: does it match what the PDF spec says should happen? The golden is the expected behaviour, not "what the parser currently does".
4. Commit the PDF, the golden, and an optional README.md explaining what the fixture proves.

For synthetic fixtures, see testdata/fixtures/generate.go. Run it from the repo root to (re)create the in-code fixture PDFs.
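The golden-file workflow described above follows a common pattern: compare fresh output against a committed snapshot, and rewrite the snapshot only when explicitly asked. The sketch below shows that pattern with the standard library only. checkGolden and its update parameter are hypothetical; the package's actual TestFixtures harness may be organised differently.

```go
package main

import (
	"bytes"
	"fmt"
	"os"
	"path/filepath"
)

// checkGolden compares got with the golden file at path. When update is true
// it rewrites the golden instead of comparing.
func checkGolden(path string, got []byte, update bool) error {
	if update {
		return os.WriteFile(path, got, 0o644)
	}
	want, err := os.ReadFile(path)
	if err != nil {
		return err
	}
	if !bytes.Equal(got, want) {
		return fmt.Errorf("%s: snapshot differs from golden (%d vs %d bytes)",
			path, len(got), len(want))
	}
	return nil
}

func main() {
	dir, err := os.MkdirTemp("", "fixture")
	if err != nil {
		panic(err)
	}
	defer os.RemoveAll(dir)
	golden := filepath.Join(dir, "golden.json")

	// First run with update=true creates the golden.
	fmt.Println(checkGolden(golden, []byte(`{"version":"1.7"}`), true))
	// An identical snapshot passes.
	fmt.Println(checkGolden(golden, []byte(`{"version":"1.7"}`), false))
	// A changed snapshot is reported as an error.
	fmt.Println(checkGolden(golden, []byte(`{"version":"2.0"}`), false) != nil)
}
```

A byte-for-byte comparison only works because the dump is intended to be byte-stable; any nondeterminism in the output would surface as spurious failures.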

The same dump format is exposed as a CLI:

go install github.com/speedata/pdfdisassembler/cmd/pdfdump@latest
pdfdump doc.pdf > doc.json
diff <(pdfdump a.pdf) <(pdfdump b.pdf)

License

MIT. See LICENSE.

Documentation ¶

Overview ¶

Package pdfdisassembler is a focused, read-only PDF parser for Go.

It targets tooling that *inspects* PDFs — accessibility checkers, validators, debuggers — without dragging in the writing, optimisation, signing and image-rendering machinery that general-purpose PDF libraries carry.

Scope ¶

In scope: PDF 1.x and 2.0 reading, classical xref and xref streams, indirect-object resolution, stream filters (FlateDecode, ASCII85, ASCIIHex, LZW, RunLength), text-string decoding (PDFDocEncoding, UTF-16BE BOM, UTF-8 BOM), catalog + page tree, DocumentInfo, XMP metadata access, structure-tree traversal, the /Standard security handler (V2, V4, V5), defensive xref recovery.

Out of scope: writing PDFs, image filters (DCTDecode/JBIG2/JPX/CCITTFax), image rendering, font internals, XFA, public-key encryption, signature verification, content-stream graphics-state interpretation, LTV.

Usage ¶

r, err := pdfdisassembler.OpenFile("doc.pdf")
if err != nil {
    return err
}
defer r.Close()

fmt.Println("PDF version:", r.Version())
info := r.DocumentInfo()
fmt.Println("Title:", info.Title)

for entry := range r.Objects() {
    // inspect every live indirect object
    _ = entry
}

API stability ¶

Pre-1.0. The API may break between minor releases but never within a patch release.

Index ¶

Constants
func Dump(r *Reader, opts DumpOptions) ([]byte, error)
type Array
type Bool
type Dict
- func (d *Dict) Array(key string) (Array, bool)
- func (d *Dict) Bool(key string) (bool, bool)
- func (d *Dict) Bytes(key string) ([]byte, bool)
- func (d *Dict) Dict(key string) (*Dict, bool)
- func (d *Dict) Get(key string) (Object, bool)
- func (d *Dict) Has(key string) bool
- func (d *Dict) Int(key string) (int64, bool)
- func (d *Dict) Iter() iter.Seq2[string, Object]
- func (d *Dict) Keys() []string
- func (d *Dict) Len() int
- func (d *Dict) Name(key string) (Name, bool)
- func (d *Dict) Stream(key string) (*Stream, bool)
- func (d *Dict) String(key string) (string, bool)
type DocInfo
type DumpOptions
type EmbeddedFile
type Integer
type Name
type Null
type Object
type ObjectEntry
type Option
- func WithMaxStreamSize(n int64) Option
type Reader
- func Open(rs io.ReadSeeker, opts ...Option) (*Reader, error)
- func OpenFile(path string, opts ...Option) (*Reader, error)
- func (r *Reader) Catalog() (*Dict, error)
- func (r *Reader) Close() error
- func (r *Reader) DecodeStream(obj Object) ([]byte, error)
- func (r *Reader) DocumentInfo() DocInfo
- func (r *Reader) EmbeddedFiles() []EmbeddedFile
- func (r *Reader) Objects() iter.Seq[ObjectEntry]
- func (r *Reader) Resolve(obj Object) (Object, error)
- func (r *Reader) ResolveArray(obj Object) (Array, error)
- func (r *Reader) ResolveBool(obj Object) (bool, error)
- func (r *Reader) ResolveDict(obj Object) (*Dict, error)
- func (r *Reader) ResolveInt(obj Object) (int64, error)
- func (r *Reader) Trailer() *Dict
- func (r *Reader) Version() string
type Real
type Reference
type Stream
- func (s *Stream) Content() ([]byte, error)
- func (s *Stream) RawLength() int64
type String

Constants ¶

const DefaultMaxStreamSize int64 = 16 << 20

DefaultMaxStreamSize is the per-stream decoded-size cap Open uses by default.
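The constant evaluates to 16 MiB (16 << 20 = 16,777,216 bytes). As an illustration of why a decoded-size cap matters, the following sketch enforces such a cap while inflating zlib data, so that a small compressed stream cannot expand without bound. It uses only the standard library; inflateCapped, deflate and errTooLarge are hypothetical names, and the package's real enforcement and error values may differ.

```go
package main

import (
	"bytes"
	"compress/zlib"
	"errors"
	"fmt"
	"io"
)

// errTooLarge is a hypothetical sentinel for this sketch.
var errTooLarge = errors.New("decoded stream exceeds size cap")

// inflateCapped decompresses zlib data but refuses to return more than max
// bytes. max <= 0 disables the cap.
func inflateCapped(raw []byte, max int64) ([]byte, error) {
	zr, err := zlib.NewReader(bytes.NewReader(raw))
	if err != nil {
		return nil, err
	}
	defer zr.Close()
	if max <= 0 {
		return io.ReadAll(zr)
	}
	// Read one byte past the cap so an oversized stream is detectable.
	out, err := io.ReadAll(io.LimitReader(zr, max+1))
	if err != nil {
		return nil, err
	}
	if int64(len(out)) > max {
		return nil, errTooLarge
	}
	return out, nil
}

// deflate compresses p; it exists only to build demo input.
func deflate(p []byte) []byte {
	var buf bytes.Buffer
	zw := zlib.NewWriter(&buf)
	zw.Write(p)
	zw.Close()
	return buf.Bytes()
}

func main() {
	raw := deflate(bytes.Repeat([]byte{'A'}, 1<<20)) // 1 MiB of 'A'
	_, err := inflateCapped(raw, 64<<10)
	fmt.Println(err)
	out, err := inflateCapped(raw, 16<<20)
	fmt.Println(len(out), err)
}
```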

Variables ¶

This section is empty.

Functions ¶

func Dump ¶

func Dump(r *Reader, opts DumpOptions) ([]byte, error)

Dump returns a deterministic JSON snapshot of r suitable for golden-file snapshot tests and human inspection.

The output is tagged so every PDF value kind is unambiguous (Name vs. Text-String vs. Byte-String, Integer vs. Real, etc.). Indirect references are preserved as references, so the object graph is acyclic and diffs stay reviewable in PRs. Streams contribute metadata (raw_length, filter chain, decoded length, SHA-256, optional preview) but never their full content — see DumpOptions.InlineStreamContent if you need it.

The output is intended to be byte-stable across runs; dictionary keys are emitted in PDF insertion order, objects in (Number, Generation) order.
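The (Number, Generation) ordering can be pictured as a two-key sort. The sketch below is one way to express it with the standard library (Go 1.22 or later for cmp.Or); the ref type and sortRefs are stand-ins for illustration, not the code Dump uses.

```go
package main

import (
	"cmp"
	"fmt"
	"slices"
)

// ref mirrors the documented shape of Reference for this sketch.
type ref struct{ Number, Generation int }

// sortRefs orders references by object number, then by generation.
func sortRefs(refs []ref) {
	slices.SortFunc(refs, func(a, b ref) int {
		return cmp.Or(
			cmp.Compare(a.Number, b.Number),
			cmp.Compare(a.Generation, b.Generation),
		)
	})
}

func main() {
	refs := []ref{{3, 0}, {1, 1}, {1, 0}}
	sortRefs(refs)
	fmt.Println(refs) // [{1 0} {1 1} {3 0}]
}
```

Sorting before emitting removes any dependence on map iteration order, which Go deliberately randomises.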

Types ¶

type Array ¶

type Array []Object

Array is a PDF array of objects.

type Bool ¶

type Bool bool

Bool is a PDF boolean object.

type Dict ¶

type Dict struct {
	// contains filtered or unexported fields
}

Dict is a PDF dictionary that preserves insertion order during iteration.

func (*Dict) Array ¶

func (d *Dict) Array(key string) (Array, bool)

Array returns the Array value at key. If the value is a Reference, it is resolved first.

func (*Dict) Bool ¶

func (d *Dict) Bool(key string) (bool, bool)

Bool returns the Bool value at key. If the value is a Reference, it is resolved first.

func (*Dict) Bytes ¶

func (d *Dict) Bytes(key string) ([]byte, bool)

Bytes returns the raw bytes of a String value at key, without text-string decoding. Useful for byte strings (file identifiers, hashes). If the value is a Reference, it is resolved first.

func (*Dict) Dict ¶

func (d *Dict) Dict(key string) (*Dict, bool)

Dict returns the *Dict value at key. If the value is a Reference, it is resolved first.

func (*Dict) Get ¶

func (d *Dict) Get(key string) (Object, bool)

Get returns the raw object for key. The returned object may be a Reference; use Dict.Dict / Reader.Resolve if you need the resolved value.

func (*Dict) Has ¶

func (d *Dict) Has(key string) bool

Has reports whether key is present.

func (*Dict) Int ¶

func (d *Dict) Int(key string) (int64, bool)

Int returns the Integer value at key. If the value is a Reference, it is resolved first.

func (*Dict) Iter ¶

func (d *Dict) Iter() iter.Seq2[string, Object]

Iter returns an iterator over key/value pairs in insertion order.

func (*Dict) Keys ¶

func (d *Dict) Keys() []string

Keys returns the dictionary keys in insertion order.

func (*Dict) Len ¶

func (d *Dict) Len() int

Len returns the number of entries in the dictionary.

func (*Dict) Name ¶

func (d *Dict) Name(key string) (Name, bool)

Name returns the Name value at key. If the value is a Reference, it is resolved first.

func (*Dict) Stream ¶

func (d *Dict) Stream(key string) (*Stream, bool)

Stream returns the *Stream value at key. If the value is a Reference, it is resolved first.

func (*Dict) String ¶

func (d *Dict) String(key string) (string, bool)

String returns the value at key as a Go string, decoded according to the PDF text-string rules (UTF-16BE BOM, UTF-8 BOM, otherwise PDFDocEncoding). If the value is a Reference, it is resolved first.
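The decoding order named above (UTF-16BE BOM, then UTF-8 BOM, then PDFDocEncoding) can be sketched with the standard library. One loudly stated simplification: the fallback branch maps bytes as Latin-1, which agrees with PDFDocEncoding for ASCII and most of 0xA1-0xFF but not for parts of the 0x18-0x1F and 0x80-0xA0 ranges. decodeTextString is a hypothetical helper, not the package's implementation.

```go
package main

import (
	"fmt"
	"unicode/utf16"
)

// decodeTextString applies the PDF text-string rules in order of precedence.
func decodeTextString(b []byte) string {
	switch {
	case len(b) >= 2 && b[0] == 0xFE && b[1] == 0xFF:
		// UTF-16BE with byte order mark; a trailing odd byte is dropped.
		body := b[2:]
		units := make([]uint16, 0, len(body)/2)
		for i := 0; i+1 < len(body); i += 2 {
			units = append(units, uint16(body[i])<<8|uint16(body[i+1]))
		}
		return string(utf16.Decode(units))
	case len(b) >= 3 && b[0] == 0xEF && b[1] == 0xBB && b[2] == 0xBF:
		// UTF-8 with byte order mark (PDF 2.0).
		return string(b[3:])
	default:
		// Simplification: Latin-1 instead of the full PDFDocEncoding table.
		runes := make([]rune, len(b))
		for i, c := range b {
			runes[i] = rune(c)
		}
		return string(runes)
	}
}

func main() {
	fmt.Println(decodeTextString([]byte{0xFE, 0xFF, 0x00, 'H', 0x00, 'i'}))
	fmt.Println(decodeTextString([]byte{0xEF, 0xBB, 0xBF, 'o', 'k'}))
	fmt.Println(decodeTextString([]byte("plain")))
}
```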

type DocInfo ¶

type DocInfo struct {
	Title        string
	Author       string
	Subject      string
	Keywords     string
	Creator      string
	Producer     string
	CreationDate time.Time
	ModDate      time.Time
	Custom       map[string]string
}

DocInfo is a value snapshot of the standard /Info dictionary entries. Missing entries are zero values. Custom carries any non-standard keys as raw decoded strings.

type DumpOptions ¶

type DumpOptions struct {
	// PreviewMaxBytes is the maximum number of decoded stream bytes shown
	// as the preview_utf8 field. Default 80. Set to -1 to disable.
	PreviewMaxBytes int
	// InlineStreamContent embeds the decoded stream as hex under the
	// "decoded.hex" field. Off by default — real PDFs produce huge
	// fixtures otherwise.
	InlineStreamContent bool
}

DumpOptions controls Dump's behaviour.

type EmbeddedFile ¶ added in v0.0.3

type EmbeddedFile struct {
	Name string
	Spec *Dict
}

EmbeddedFile is one entry from the catalog's EmbeddedFiles name tree (a PDF attachment). Spec is the /Filespec dictionary; the embedded file stream referenced from its /EF dictionary holds the bytes.

type Integer ¶

type Integer int64

Integer is a PDF integer object.

type Name ¶

type Name string

Name is a PDF name object, e.g. /Length, /Type. The leading slash is not stored.

type Null ¶

type Null struct{}

Null is the PDF null object.

type Object ¶

type Object interface {
	// contains filtered or unexported methods
}

Object is the sealed PDF object type. Concrete variants:

Name, Integer, Real, Bool, String, *Dict, Array, *Stream, Reference, Null

Callers should type-switch or type-assert on the concrete type when they need a specific value. The Resolve* helpers on Reader perform the common dereference-and-assert pattern.
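A sealed interface in Go is usually built with an unexported marker method, so only the defining package can add variants. The following sketch shows the idea and the type switch a caller would write. All names here (object, name, integer, reference, array, describe) are local stand-ins, and only four of the ten variants are modelled.

```go
package main

import "fmt"

// object is sealed: the unexported method keeps other packages from
// implementing it.
type object interface{ isObject() }

type (
	name      string
	integer   int64
	reference struct{ Number, Generation int }
	array     []object
)

func (name) isObject()      {}
func (integer) isObject()   {}
func (reference) isObject() {}
func (array) isObject()     {}

// describe type-switches on the concrete variant.
func describe(o object) string {
	switch v := o.(type) {
	case name:
		return "name /" + string(v)
	case integer:
		return fmt.Sprintf("integer %d", int64(v))
	case reference:
		return fmt.Sprintf("reference %d %d R", v.Number, v.Generation)
	case array:
		return fmt.Sprintf("array of %d", len(v))
	default:
		return "unknown"
	}
}

func main() {
	for _, o := range []object{name("Type"), integer(7), reference{12, 0}} {
		fmt.Println(describe(o))
	}
}
```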

type ObjectEntry ¶

type ObjectEntry struct {
	Reference Reference
	Object    Object
}

ObjectEntry is yielded by Reader.Objects: an in-use indirect object plus its resolved value.

type Option ¶ added in v0.0.3

type Option func(*Reader)

Option configures a Reader at Open time.

func WithMaxStreamSize ¶ added in v0.0.3

func WithMaxStreamSize(n int64) Option

WithMaxStreamSize sets the per-stream decoded-size cap; n <= 0 disables it. Applied before parsing, so it also bounds streams decoded during Open.
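Option follows the functional-options pattern: each option is a function that mutates the Reader before parsing begins. The self-contained sketch below reproduces the pattern with lower-case stand-in types (reader, option, withMaxStreamSize, open) so it runs without the package; it illustrates the mechanism and makes no claim about Open's internals.

```go
package main

import "fmt"

const defaultMaxStreamSize int64 = 16 << 20

// reader is a stand-in for Reader, reduced to the one configurable field.
type reader struct {
	maxStreamSize int64
}

// option mirrors the documented shape: a function that mutates the reader.
type option func(*reader)

func withMaxStreamSize(n int64) option {
	return func(r *reader) { r.maxStreamSize = n }
}

// open applies every option before any (simulated) parsing takes place.
func open(opts ...option) *reader {
	r := &reader{maxStreamSize: defaultMaxStreamSize}
	for _, o := range opts {
		o(r)
	}
	// Parsing would start here, already bounded by r.maxStreamSize.
	return r
}

func main() {
	fmt.Println(open().maxStreamSize)                           // 16777216
	fmt.Println(open(withMaxStreamSize(1 << 20)).maxStreamSize) // 1048576
}
```

Going by the documented signatures, the equivalent call against the real package would be pdfdisassembler.OpenFile("doc.pdf", pdfdisassembler.WithMaxStreamSize(64<<20)).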

type Reader ¶

type Reader struct {

	// MaxStreamSize caps each stream's decoded size; <= 0 disables it. Setting
	// it after Open misses Open-time (xref/object) streams; use WithMaxStreamSize.
	MaxStreamSize int64
	// contains filtered or unexported fields
}

Reader is a parsed PDF document. It is not safe for concurrent use.

func Open ¶

func Open(rs io.ReadSeeker, opts ...Option) (*Reader, error)

Open parses a PDF from rs. rs must remain valid for the lifetime of the returned Reader.

func OpenFile ¶

func OpenFile(path string, opts ...Option) (*Reader, error)

OpenFile opens path and parses it as a PDF. The file stays open until Reader.Close is called.

func (*Reader) Catalog ¶

func (r *Reader) Catalog() (*Dict, error)

Catalog returns the document catalog dictionary.

func (*Reader) Close ¶

func (r *Reader) Close() error

Close releases the underlying resource. For Reader instances created via Open with a non-file ReadSeeker, Close is a no-op.

func (*Reader) DecodeStream ¶

func (r *Reader) DecodeStream(obj Object) ([]byte, error)

DecodeStream resolves obj to a stream and returns its decoded content.

func (*Reader) DocumentInfo ¶

func (r *Reader) DocumentInfo() DocInfo

DocumentInfo returns the standard /Info dictionary entries as a value snapshot. Missing entries return zero values.

func (*Reader) EmbeddedFiles ¶ added in v0.0.3

func (r *Reader) EmbeddedFiles() []EmbeddedFile

EmbeddedFiles returns the document's embedded files (PDF attachments) from the catalog's EmbeddedFiles name tree, in tree order. Returns nil when there are none.

func (*Reader) Objects ¶

func (r *Reader) Objects() iter.Seq[ObjectEntry]

Objects iterates every live indirect object in the xref table.

func (*Reader) Resolve ¶

func (r *Reader) Resolve(obj Object) (Object, error)

Resolve follows an indirect reference. If obj is not a Reference, returns obj unchanged. Resolution is cached.
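The pass-through and caching behaviour can be pictured with a small model: direct objects are returned as-is, and each reference is parsed at most once. Everything below (ref, store, the parse callback) is a hypothetical stand-in written for this illustration, not the Reader's actual data structures.

```go
package main

import (
	"errors"
	"fmt"
)

type ref struct{ Number, Generation int }

// store models a document: parse stands in for seeking to the object's
// offset and parsing it; cache remembers results.
type store struct {
	parse  func(ref) (any, error)
	cache  map[ref]any
	parses int // counts real parse calls, for demonstration
}

// resolve returns obj unchanged unless it is a ref, in which case the
// referenced value is loaded once and cached.
func (s *store) resolve(obj any) (any, error) {
	r, ok := obj.(ref)
	if !ok {
		return obj, nil
	}
	if v, hit := s.cache[r]; hit {
		return v, nil
	}
	s.parses++
	v, err := s.parse(r)
	if err != nil {
		return nil, err
	}
	s.cache[r] = v
	return v, nil
}

func main() {
	s := &store{
		cache: map[ref]any{},
		parse: func(r ref) (any, error) {
			if r.Number != 12 {
				return nil, errors.New("object not found")
			}
			return "catalog dictionary", nil
		},
	}
	for i := 0; i < 3; i++ {
		v, _ := s.resolve(ref{12, 0})
		fmt.Println(v)
	}
	fmt.Println("parses:", s.parses) // parses: 1
}
```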

func (*Reader) ResolveArray ¶

func (r *Reader) ResolveArray(obj Object) (Array, error)

ResolveArray resolves obj to an Array; errors otherwise.

func (*Reader) ResolveBool ¶

func (r *Reader) ResolveBool(obj Object) (bool, error)

ResolveBool resolves obj to a bool; errors otherwise.

func (*Reader) ResolveDict ¶

func (r *Reader) ResolveDict(obj Object) (*Dict, error)

ResolveDict resolves obj to a *Dict; errors when obj is missing or is not a dictionary.

func (*Reader) ResolveInt ¶

func (r *Reader) ResolveInt(obj Object) (int64, error)

ResolveInt resolves obj to an int64; errors otherwise.

func (*Reader) Trailer ¶

func (r *Reader) Trailer() *Dict

Trailer returns the trailer dictionary.

func (*Reader) Version ¶

func (r *Reader) Version() string

Version returns the effective PDF version (e.g. "1.7" or "2.0"). This is normally the version declared in the file header; if the catalog declares a /Version entry that exceeds the header version, the catalog value wins (per spec).
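The header-versus-catalog rule amounts to a numeric comparison of "major.minor" pairs. A minimal sketch follows. parseVersion and effectiveVersion are hypothetical helpers, and the handling of malformed values (ignore a bad catalog entry, prefer a valid catalog entry over a bad header) is an assumption of this sketch rather than documented behaviour.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseVersion splits "M.m" into integers; ok is false for malformed input.
func parseVersion(s string) (major, minor int, ok bool) {
	a, b, found := strings.Cut(s, ".")
	if !found {
		return 0, 0, false
	}
	var err1, err2 error
	major, err1 = strconv.Atoi(a)
	minor, err2 = strconv.Atoi(b)
	return major, minor, err1 == nil && err2 == nil
}

// effectiveVersion returns catalog only when it is valid and later than header.
func effectiveVersion(header, catalog string) string {
	hMaj, hMin, hok := parseVersion(header)
	cMaj, cMin, cok := parseVersion(catalog)
	if !cok {
		return header
	}
	if !hok || cMaj > hMaj || (cMaj == hMaj && cMin > hMin) {
		return catalog
	}
	return header
}

func main() {
	fmt.Println(effectiveVersion("1.4", "1.7")) // 1.7
	fmt.Println(effectiveVersion("1.7", "1.4")) // 1.7
	fmt.Println(effectiveVersion("1.7", "2.0")) // 2.0
}
```

Comparing the parsed integers rather than the raw strings avoids lexical-ordering surprises if a minor version ever reaches two digits.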

type Real ¶

type Real float64

Real is a PDF real-number object.

type Reference ¶

type Reference struct {
	Number     int
	Generation int
}

Reference is an indirect object reference (e.g. "12 0 R").

type Stream ¶

type Stream struct {
	// Dict is the stream's parameter dictionary, e.g. /Length, /Filter.
	Dict *Dict
	// contains filtered or unexported fields
}

Stream is a stream object. The decoded content is produced by Content, which applies the declared filter chain (FlateDecode, ASCII85, …) and any document-level decryption, and caches the result.

func (*Stream) Content ¶

func (s *Stream) Content() ([]byte, error)

Content returns the decoded stream bytes. Filters and decryption are applied on the first call and the result is cached for subsequent calls.

func (*Stream) RawLength ¶

func (s *Stream) RawLength() int64

RawLength returns the declared raw byte length of the stream.

type String ¶

type String []byte

String is a PDF string object after parsing. The parser strips the literal-string parentheses or hex-string angle brackets and decodes escape sequences and hexadecimal pairs, but does not re-encode the bytes — they are whatever the producer wrote.

For PDF text strings ("Title", "Subject", "Producer", and so on) use Dict.String or DocumentInfo, which apply the text-string decoding rules (PDFDocEncoding, UTF-16BE BOM, UTF-8 BOM).
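As an illustration of the hexadecimal-pair decoding mentioned above, here is a short sketch of how a hex string body such as 48656C6C6F becomes bytes. Per the PDF specification, white space inside the string is ignored and a missing final digit is treated as 0. decodeHexString and hexVal are hypothetical helpers, not the package's lexer.

```go
package main

import (
	"errors"
	"fmt"
)

// hexVal converts one hex digit to its value.
func hexVal(c byte) (byte, bool) {
	switch {
	case c >= '0' && c <= '9':
		return c - '0', true
	case c >= 'a' && c <= 'f':
		return c - 'a' + 10, true
	case c >= 'A' && c <= 'F':
		return c - 'A' + 10, true
	}
	return 0, false
}

// decodeHexString decodes the body of a hex string (angle brackets removed).
func decodeHexString(body []byte) ([]byte, error) {
	out := make([]byte, 0, (len(body)+1)/2)
	var hi byte
	haveHi := false
	for _, c := range body {
		switch c {
		case ' ', '\t', '\r', '\n', '\f', 0:
			continue // PDF white space is ignored
		}
		v, ok := hexVal(c)
		if !ok {
			return nil, errors.New("invalid hex digit")
		}
		if haveHi {
			out = append(out, hi<<4|v)
			haveHi = false
		} else {
			hi, haveHi = v, true
		}
	}
	if haveHi {
		out = append(out, hi<<4) // odd digit count: final digit assumed 0
	}
	return out, nil
}

func main() {
	b, err := decodeHexString([]byte("48656C6C6F"))
	fmt.Printf("%s %v\n", b, err) // Hello <nil>
}
```

The result is raw bytes, which matches the documented behaviour of String: no re-encoding happens at this stage.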

Directories ¶

Path	Synopsis
cmd/pdfdump (command)	Command pdfdump emits a JSON snapshot of a PDF in the same format used by pdfdisassembler's snapshot-test harness.
contentstream	Package contentstream tokenises PDF content streams into a sequence of operations.
examples/inspect (command)	Command inspect prints a summary of a PDF: version, document info, catalog top-level keys, page count.
examples/structtree (command)	Command structtree dumps the /StructTreeRoot of a PDF in indented form.
internal/crypt	Package crypt implements the PDF /Standard security handler for versions V2 (RC4), V4 (RC4 or AES-128) and V5 (AES-256, PDF 1.7 Extension 3 and PDF 2.0).
internal/filter	Package filter implements the PDF stream filters needed for read-only inspection of document structure: FlateDecode (with predictors), LZW, ASCII85, ASCIIHex, RunLength.
internal/lex	Package lex tokenises PDF input.
