Documentation ¶
Overview ¶
Package pdfdisassembler is a focused, read-only PDF parser for Go.
It targets tooling that *inspects* PDFs — accessibility checkers, validators, debuggers — without dragging in the writing, optimisation, signing and image-rendering machinery that general-purpose PDF libraries carry.
Scope ¶
In scope: PDF 1.x and 2.0 reading, classical xref and xref streams, indirect-object resolution, stream filters (FlateDecode, ASCII85, ASCIIHex, LZW, RunLength), text-string decoding (PDFDocEncoding, UTF-16BE BOM, UTF-8 BOM), catalog + page tree, DocumentInfo, XMP metadata access, structure-tree traversal, the /Standard security handler (V2, V4, V5), defensive xref recovery.
Out of scope: writing PDFs, image filters (DCTDecode/JBIG2/JPX/CCITTFax), image rendering, font internals, XFA, public-key encryption, signature verification, content-stream graphics-state interpretation, LTV.
Usage ¶
r, err := pdfdisassembler.OpenFile("doc.pdf")
if err != nil {
	return err
}
defer r.Close()
fmt.Println("PDF version:", r.Version())
info := r.DocumentInfo()
fmt.Println("Title:", info.Title)
for entry := range r.Objects() {
	// inspect every live indirect object
	_ = entry
}
API stability ¶
Pre-1.0. The API may break between minor releases but never within a patch release.
Index ¶
- Constants
- func Dump(r *Reader, opts DumpOptions) ([]byte, error)
- type Array
- type Bool
- type Dict
- func (d *Dict) Array(key string) (Array, bool)
- func (d *Dict) Bool(key string) (bool, bool)
- func (d *Dict) Bytes(key string) ([]byte, bool)
- func (d *Dict) Dict(key string) (*Dict, bool)
- func (d *Dict) Get(key string) (Object, bool)
- func (d *Dict) Has(key string) bool
- func (d *Dict) Int(key string) (int64, bool)
- func (d *Dict) Iter() iter.Seq2[string, Object]
- func (d *Dict) Keys() []string
- func (d *Dict) Len() int
- func (d *Dict) Name(key string) (Name, bool)
- func (d *Dict) Stream(key string) (*Stream, bool)
- func (d *Dict) String(key string) (string, bool)
- type DocInfo
- type DumpOptions
- type EmbeddedFile
- type Integer
- type Name
- type Null
- type Object
- type ObjectEntry
- type Option
- type Reader
- func (r *Reader) Catalog() (*Dict, error)
- func (r *Reader) Close() error
- func (r *Reader) DecodeStream(obj Object) ([]byte, error)
- func (r *Reader) DocumentInfo() DocInfo
- func (r *Reader) EmbeddedFiles() []EmbeddedFile
- func (r *Reader) Objects() iter.Seq[ObjectEntry]
- func (r *Reader) Resolve(obj Object) (Object, error)
- func (r *Reader) ResolveArray(obj Object) (Array, error)
- func (r *Reader) ResolveBool(obj Object) (bool, error)
- func (r *Reader) ResolveDict(obj Object) (*Dict, error)
- func (r *Reader) ResolveInt(obj Object) (int64, error)
- func (r *Reader) Trailer() *Dict
- func (r *Reader) Version() string
- type Real
- type Reference
- type Stream
- type String
Constants ¶
const DefaultMaxStreamSize int64 = 16 << 20
DefaultMaxStreamSize is the per-stream decoded-size cap Open uses by default.
Functions ¶
func Dump ¶
func Dump(r *Reader, opts DumpOptions) ([]byte, error)
Dump returns a deterministic JSON snapshot of r suitable for golden-file snapshot tests and human inspection.
The output is tagged so every PDF value kind is unambiguous (Name vs. Text-String vs. Byte-String, Integer vs. Real, etc.). Indirect references are preserved as references, so the object graph is acyclic and diffs stay reviewable in PRs. Streams contribute metadata (raw_length, filter chain, decoded length, SHA-256, optional preview) but never their full content — see DumpOptions.InlineStreamContent if you need it.
The output is intended to be byte-stable across runs; dictionary keys are emitted in PDF insertion order, objects in (Number, Generation) order.
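As a rough illustration of why insertion-order output needs some care in Go: encoding/json sorts map keys when it marshals a map, so a dictionary that must keep PDF insertion order needs its own marshalling. The sketch below is not Dump's implementation. The orderedDict and tagged types and the "kind"/"value" field names are hypothetical stand-ins chosen for the example; the real snapshot schema is not described here.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
)

// orderedDict is a hypothetical stand-in for a PDF dictionary. It remembers
// the order in which keys were first set.
type orderedDict struct {
	keys   []string
	values map[string]any
}

// Set stores v under key, recording the key's position on first insertion.
func (d *orderedDict) Set(key string, v any) {
	if d.values == nil {
		d.values = make(map[string]any)
	}
	if _, seen := d.values[key]; !seen {
		d.keys = append(d.keys, key)
	}
	d.values[key] = v
}

// MarshalJSON emits the keys in insertion order instead of the sorted order
// that encoding/json would use for a plain map.
func (d *orderedDict) MarshalJSON() ([]byte, error) {
	var buf bytes.Buffer
	buf.WriteByte('{')
	for i, k := range d.keys {
		if i > 0 {
			buf.WriteByte(',')
		}
		kb, err := json.Marshal(k)
		if err != nil {
			return nil, err
		}
		vb, err := json.Marshal(d.values[k])
		if err != nil {
			return nil, err
		}
		buf.Write(kb)
		buf.WriteByte(':')
		buf.Write(vb)
	}
	buf.WriteByte('}')
	return buf.Bytes(), nil
}

// tagged pairs a value with its PDF kind so that, for example, a name and a
// string with the same text stay distinguishable. Field names are assumptions.
type tagged struct {
	Kind  string `json:"kind"`
	Value any    `json:"value"`
}

// snapshot builds a small dictionary and returns its JSON form.
func snapshot() string {
	d := &orderedDict{}
	d.Set("Type", tagged{"name", "Catalog"})
	d.Set("Pages", tagged{"ref", "2 0"})
	d.Set("Count", tagged{"integer", 3})
	out, err := json.Marshal(d)
	if err != nil {
		return "error: " + err.Error()
	}
	return string(out)
}

func main() {
	fmt.Println(snapshot())
}
```

Because the key order comes from the slice rather than from map iteration, repeated calls produce identical bytes, which is the property a golden-file test relies on.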
Types ¶
type Dict ¶
type Dict struct {
// contains filtered or unexported fields
}
Dict is a PDF dictionary that preserves insertion order during iteration.
func (*Dict) Array ¶
Array returns the Array value at key. If the value is a Reference, it is resolved first.
func (*Dict) Bool ¶
Bool returns the Bool value at key. If the value is a Reference, it is resolved first.
func (*Dict) Bytes ¶
Bytes returns the raw bytes of a String value at key, without text-string decoding. Useful for byte strings (file identifiers, hashes). If the value is a Reference, it is resolved first.
func (*Dict) Dict ¶
Dict returns the *Dict value at key. If the value is a Reference, it is resolved first.
func (*Dict) Get ¶
Get returns the raw object for key. The returned object may be a Reference; use Dict.Dict / Reader.Resolve if you need the resolved value.
func (*Dict) Int ¶
Int returns the Integer value at key. If the value is a Reference, it is resolved first.
func (*Dict) Name ¶
Name returns the Name value at key. If the value is a Reference, it is resolved first.
type DocInfo ¶
type DocInfo struct {
Title string
Author string
Subject string
Keywords string
Creator string
Producer string
CreationDate time.Time
ModDate time.Time
Custom map[string]string
}
DocInfo is a value snapshot of the standard /Info dictionary entries. Missing entries are zero values. Custom carries any non-standard keys as raw decoded strings.
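CreationDate and ModDate are time.Time values, while PDF files store dates as strings of the form D:YYYYMMDDHHmmSSOHH'mm'. The following is a minimal sketch of how such a string could be converted, assuming everything after the year is optional and that a missing timezone is treated as UTC. The helper names parsePDFDate and utcString are hypothetical, and this is not necessarily how the package parses dates.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
	"time"
)

// parsePDFDate converts a PDF date string such as "D:20240131120000+01'00'"
// into a time.Time. Missing month and day default to 1, missing time fields
// to 0, and a missing timezone is assumed to be UTC.
func parsePDFDate(s string) (time.Time, error) {
	s = strings.TrimPrefix(s, "D:")

	// Separate the digit run from the optional timezone suffix.
	digits, zone := s, ""
	if i := strings.IndexAny(s, "Z+-"); i >= 0 {
		digits, zone = s[:i], s[i:]
	}
	if len(digits) < 4 || len(digits) > 14 || len(digits)%2 != 0 {
		return time.Time{}, fmt.Errorf("pdf date %q: malformed digit run", s)
	}

	// year, month, day, hour, minute, second with their defaults.
	fields := []int{0, 1, 1, 0, 0, 0}
	widths := []int{4, 2, 2, 2, 2, 2}
	pos := 0
	for i, w := range widths {
		if pos >= len(digits) {
			break
		}
		n, err := strconv.Atoi(digits[pos : pos+w])
		if err != nil {
			return time.Time{}, fmt.Errorf("pdf date %q: %w", s, err)
		}
		fields[i] = n
		pos += w
	}

	loc := time.UTC
	if zone != "" && zone[0] != 'Z' {
		// zone looks like +HH'mm', +HH'mm or +HH.
		parts := strings.Split(strings.Trim(zone[1:], "'"), "'")
		hh, err := strconv.Atoi(parts[0])
		if err != nil {
			return time.Time{}, fmt.Errorf("pdf date %q: bad zone hours", s)
		}
		mm := 0
		if len(parts) > 1 {
			if mm, err = strconv.Atoi(parts[1]); err != nil {
				return time.Time{}, fmt.Errorf("pdf date %q: bad zone minutes", s)
			}
		}
		offset := hh*3600 + mm*60
		if zone[0] == '-' {
			offset = -offset
		}
		loc = time.FixedZone(zone, offset)
	}
	// time.Date normalises out-of-range values rather than rejecting them.
	return time.Date(fields[0], time.Month(fields[1]), fields[2],
		fields[3], fields[4], fields[5], 0, loc), nil
}

// utcString is a small helper for display: the parsed date in UTC, RFC 3339.
func utcString(s string) string {
	t, err := parsePDFDate(s)
	if err != nil {
		return "error: " + err.Error()
	}
	return t.UTC().Format(time.RFC3339)
}

func main() {
	fmt.Println(utcString("D:20240131120000+01'00'"))
	fmt.Println(utcString("D:2024"))
}
```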
type DumpOptions ¶
type DumpOptions struct {
// PreviewMaxBytes is the maximum number of decoded stream bytes shown
// as the preview_utf8 field. Default 80. Set to -1 to disable.
PreviewMaxBytes int
// InlineStreamContent embeds the decoded stream as hex under the
// "decoded.hex" field. Off by default — real PDFs produce huge
// fixtures otherwise.
InlineStreamContent bool
}
DumpOptions controls Dump's behaviour.
type EmbeddedFile ¶ added in v0.0.3
EmbeddedFile is one entry from the catalog's EmbeddedFiles name tree (a PDF attachment). Spec is the /Filespec dictionary; its /EF stream holds the bytes.
type Name ¶
type Name string
Name is a PDF name object, e.g. /Length, /Type. The leading slash is not stored.
type Object ¶
type Object interface {
// contains filtered or unexported methods
}
Object is the sealed PDF object type. Concrete variants:
Name, Integer, Real, Bool, String, *Dict, Array, *Stream, Reference, Null
Callers should type-switch or type-assert on the concrete type when they need a specific value. The Resolve* helpers on Reader perform the common dereference-and-assert pattern.
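The type-switch pattern mentioned above can be sketched with stand-in types. Everything in this block (object, name, integer, realNum, boolean, reference, null, describe) is a hypothetical miniature defined for the example, not the package's own types; it only shows how a sealed interface with an unexported marker method is consumed.

```go
package main

import "fmt"

// object is sealed: the unexported marker method means only types declared
// in this package can implement it.
type object interface{ isObject() }

type (
	name      string
	integer   int64
	realNum   float64
	boolean   bool
	reference struct{ Number, Generation int }
	null      struct{}
)

func (name) isObject()      {}
func (integer) isObject()   {}
func (realNum) isObject()   {}
func (boolean) isObject()   {}
func (reference) isObject() {}
func (null) isObject()      {}

// describe type-switches on the concrete variant and formats it.
func describe(o object) string {
	switch v := o.(type) {
	case name:
		return "name /" + string(v)
	case integer:
		return fmt.Sprintf("integer %d", int64(v))
	case realNum:
		return fmt.Sprintf("real %g", float64(v))
	case boolean:
		return fmt.Sprintf("bool %t", bool(v))
	case reference:
		return fmt.Sprintf("reference %d %d R", v.Number, v.Generation)
	case null:
		return "null"
	default:
		return "unknown"
	}
}

func main() {
	for _, o := range []object{name("Type"), integer(42), reference{3, 0}} {
		fmt.Println(describe(o))
	}
}
```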
type ObjectEntry ¶
ObjectEntry is yielded by Reader.Objects: an in-use indirect object plus its resolved value.
type Option ¶ added in v0.0.3
type Option func(*Reader)
Option configures a Reader at Open time.
func WithMaxStreamSize ¶ added in v0.0.3
WithMaxStreamSize sets the per-stream decoded-size cap; n <= 0 disables it. Applied before parsing, so it also bounds streams decoded during Open.
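Option follows the functional-options pattern. The sketch below shows the general shape of that pattern with lower-case stand-ins (reader, option, withMaxStreamSize, open) that are hypothetical and simplified; it assumes options are applied to a Reader pre-populated with defaults, before any parsing happens, as the description above suggests.

```go
package main

import "fmt"

// defaultMaxStreamSize mirrors the documented 16 MiB default.
const defaultMaxStreamSize int64 = 16 << 20

// reader is a stand-in holding only the setting relevant here.
type reader struct {
	maxStreamSize int64
}

// option mutates a reader while it is being constructed.
type option func(*reader)

// withMaxStreamSize returns an option that sets the decoded-size cap.
func withMaxStreamSize(n int64) option {
	return func(r *reader) { r.maxStreamSize = n }
}

// open starts from the defaults, applies each option in order, and only then
// would begin parsing, so the cap is already in force for Open-time streams.
func open(opts ...option) *reader {
	r := &reader{maxStreamSize: defaultMaxStreamSize}
	for _, opt := range opts {
		opt(r)
	}
	return r
}

// allows reports whether a decoded stream of n bytes fits under the cap.
// A cap <= 0 means "no limit".
func (r *reader) allows(n int64) bool {
	return r.maxStreamSize <= 0 || n <= r.maxStreamSize
}

func main() {
	fmt.Println(open().maxStreamSize)
	fmt.Println(open(withMaxStreamSize(1024)).allows(2048))
	fmt.Println(open(withMaxStreamSize(0)).allows(1 << 40))
}
```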
type Reader ¶
type Reader struct {
// MaxStreamSize caps each stream's decoded size; <= 0 disables it. Setting
// it after Open misses Open-time (xref/object) streams; use WithMaxStreamSize.
MaxStreamSize int64
// contains filtered or unexported fields
}
Reader is a parsed PDF document. It is not safe for concurrent use.
func Open ¶
func Open(rs io.ReadSeeker, opts ...Option) (*Reader, error)
Open parses a PDF from rs. rs must remain valid for the lifetime of the returned Reader.
func OpenFile ¶
OpenFile opens path and parses it as a PDF. The file stays open until Reader.Close is called.
func (*Reader) Close ¶
Close releases the underlying resource. For Reader instances created via Open with a non-file ReadSeeker, Close is a no-op.
func (*Reader) DecodeStream ¶
DecodeStream resolves obj to a stream and returns its decoded content.
func (*Reader) DocumentInfo ¶
DocumentInfo returns the standard /Info dictionary entries as a value snapshot. Missing entries return zero values.
func (*Reader) EmbeddedFiles ¶ added in v0.0.3
func (r *Reader) EmbeddedFiles() []EmbeddedFile
EmbeddedFiles returns the document's embedded files (PDF attachments) from the catalog's EmbeddedFiles name tree, in tree order. Returns nil when there are none.
func (*Reader) Objects ¶
func (r *Reader) Objects() iter.Seq[ObjectEntry]
Objects iterates every live indirect object in the xref table.
func (*Reader) Resolve ¶
Resolve follows an indirect reference. If obj is not a Reference, returns obj unchanged. Resolution is cached.
func (*Reader) ResolveArray ¶
ResolveArray resolves obj to an Array and returns an error if the resolved value is not an Array.
func (*Reader) ResolveBool ¶
ResolveBool resolves obj to a bool and returns an error if the resolved value is not a Bool.
func (*Reader) ResolveDict ¶
ResolveDict resolves obj to a *Dict and returns an error when obj is missing or is not a dictionary.
func (*Reader) ResolveInt ¶
ResolveInt resolves obj to an int64 and returns an error if the resolved value is not an Integer.
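The dereference-and-assert behaviour of Resolve and the Resolve* helpers can be illustrated with a small model. This is a sketch under assumptions: the reader, reference and integer types are stand-ins, the objects map stands in for parsing an object from the file, and returning an error for a missing object is a choice made for the example.

```go
package main

import "fmt"

type object interface{}

// reference identifies an indirect object by number and generation.
type reference struct{ Number, Generation int }

type integer int64

// reader models only what resolution needs: a backing store that stands in
// for the parsed file, a cache, and a counter of backing-store loads.
type reader struct {
	objects map[reference]object
	cache   map[reference]object
	loads   int
}

func newReader() *reader {
	return &reader{
		objects: map[reference]object{
			{7, 0}: integer(612),
			{8, 0}: "not an integer",
		},
		cache: make(map[reference]object),
	}
}

// resolve returns non-reference values unchanged. References are looked up
// once and then served from the cache.
func (r *reader) resolve(obj object) (object, error) {
	ref, ok := obj.(reference)
	if !ok {
		return obj, nil
	}
	if v, ok := r.cache[ref]; ok {
		return v, nil
	}
	v, ok := r.objects[ref]
	if !ok {
		return nil, fmt.Errorf("object %d %d not found", ref.Number, ref.Generation)
	}
	r.loads++
	r.cache[ref] = v
	return v, nil
}

// resolveInt resolves obj and asserts that the result is an integer.
func (r *reader) resolveInt(obj object) (int64, error) {
	v, err := r.resolve(obj)
	if err != nil {
		return 0, err
	}
	n, ok := v.(integer)
	if !ok {
		return 0, fmt.Errorf("want integer, got %T", v)
	}
	return int64(n), nil
}

func main() {
	r := newReader()
	n, err := r.resolveInt(reference{7, 0})
	fmt.Println(n, err)
	_, err = r.resolveInt(reference{8, 0})
	fmt.Println(err)
}
```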
type Stream ¶
type Stream struct {
// Dict is the stream's parameter dictionary, e.g. /Length, /Filter.
Dict *Dict
// contains filtered or unexported fields
}
Stream is a stream object. The decoded content is produced by Content, which applies the declared filter chain (FlateDecode, ASCII85, …) and any document-level decryption, and caches the result.
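To make the interaction between a filter and the decoded-size cap concrete, here is a minimal sketch for the FlateDecode case only, using the standard library's compress/zlib (FlateDecode data is zlib-wrapped deflate). The function names decodeFlate, compress and roundTrip and the error value errTooLarge are hypothetical; predictors, filter chains and decryption are left out, and the package's own enforcement may differ.

```go
package main

import (
	"bytes"
	"compress/zlib"
	"errors"
	"fmt"
	"io"
)

var errTooLarge = errors.New("decoded stream exceeds size cap")

// decodeFlate inflates raw and fails if the output exceeds limit bytes.
// limit <= 0 disables the cap. Reading at most limit+1 bytes is enough to
// detect an overflow without inflating an arbitrarily large payload.
func decodeFlate(raw []byte, limit int64) ([]byte, error) {
	zr, err := zlib.NewReader(bytes.NewReader(raw))
	if err != nil {
		return nil, err
	}
	defer zr.Close()

	var src io.Reader = zr
	if limit > 0 {
		src = io.LimitReader(zr, limit+1)
	}
	out, err := io.ReadAll(src)
	if err != nil {
		return nil, err
	}
	if limit > 0 && int64(len(out)) > limit {
		return nil, errTooLarge
	}
	return out, nil
}

// compress produces zlib data for the demonstration.
func compress(data []byte) ([]byte, error) {
	var buf bytes.Buffer
	zw := zlib.NewWriter(&buf)
	if _, err := zw.Write(data); err != nil {
		return nil, err
	}
	if err := zw.Close(); err != nil {
		return nil, err
	}
	return buf.Bytes(), nil
}

// roundTrip compresses n bytes, decodes them under limit, and returns the
// decoded length.
func roundTrip(n int, limit int64) (int, error) {
	raw, err := compress(bytes.Repeat([]byte{'a'}, n))
	if err != nil {
		return 0, err
	}
	out, err := decodeFlate(raw, limit)
	if err != nil {
		return 0, err
	}
	return len(out), nil
}

func main() {
	n, err := roundTrip(1000, 16<<20)
	fmt.Println(n, err)
	_, err = roundTrip(1000, 999)
	fmt.Println(err)
}
```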
type String ¶
type String []byte
String is a PDF string object after parsing. The parser strips the literal-string parentheses or hex-string angle brackets and decodes escape sequences and hexadecimal pairs, but does not re-encode the bytes — they are whatever the producer wrote.
For PDF text strings ("Title", "Subject", "Producer", and so on) use Dict.String or DocumentInfo, which apply the text-string decoding rules (PDFDocEncoding, UTF-16BE BOM, UTF-8 BOM).
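The three text-string encodings named above are distinguished by their leading bytes. The sketch below shows that dispatch with a hypothetical decodeTextString helper. One loud caveat: the fallback branch maps each byte to the code point of the same value, which matches PDFDocEncoding for printable ASCII but not for the positions where PDFDocEncoding diverges from Latin-1, so a real decoder needs the full table from the PDF specification.

```go
package main

import (
	"bytes"
	"fmt"
	"unicode/utf16"
)

// decodeTextString picks a decoding based on the byte-order mark:
// FE FF means UTF-16BE, EF BB BF means UTF-8, anything else is treated as
// PDFDocEncoding (approximated here, see the comment in the default branch).
func decodeTextString(b []byte) string {
	switch {
	case bytes.HasPrefix(b, []byte{0xFE, 0xFF}):
		b = b[2:]
		units := make([]uint16, 0, len(b)/2)
		// A trailing odd byte, if any, is ignored.
		for i := 0; i+1 < len(b); i += 2 {
			units = append(units, uint16(b[i])<<8|uint16(b[i+1]))
		}
		// utf16.Decode combines surrogate pairs into single runes.
		return string(utf16.Decode(units))
	case bytes.HasPrefix(b, []byte{0xEF, 0xBB, 0xBF}):
		return string(b[3:])
	default:
		// Approximation: byte value == code point. Correct for printable
		// ASCII only; not a full PDFDocEncoding table.
		runes := make([]rune, len(b))
		for i, c := range b {
			runes[i] = rune(c)
		}
		return string(runes)
	}
}

func main() {
	fmt.Println(decodeTextString([]byte{0xFE, 0xFF, 0x00, 'H', 0x00, 'i'}))
	fmt.Println(decodeTextString([]byte("Title")))
}
```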
Directories ¶
| Path | Synopsis |
|---|---|
| cmd/pdfdump (command) | Command pdfdump emits a JSON snapshot of a PDF in the same format used by pdfdisassembler's snapshot-test harness. |
| contentstream | Package contentstream tokenises PDF content streams into a sequence of operations. |
| examples/inspect (command) | Command inspect prints a summary of a PDF: version, document info, catalog top-level keys, page count. |
| examples/structtree (command) | Command structtree dumps the /StructTreeRoot of a PDF in indented form. |
| internal/crypt | Package crypt implements the PDF /Standard security handler for versions V2 (RC4), V4 (RC4 or AES-128) and V5 (AES-256, PDF 1.7 Extension 3 and PDF 2.0). |
| internal/filter | Package filter implements the PDF stream filters needed for read-only inspection of document structure: FlateDecode (with predictors), LZW, ASCII85, ASCIIHex, RunLength. |
| internal/lex | Package lex tokenises PDF input. |