Documentation ¶
Overview ¶
Package core provides low-level PDF parsing primitives and object types.
This package implements the fundamental building blocks for working with PDF files: the basic object types (null, boolean, integer, real, string, name, array, and dictionary), plus streams, indirect references, cross-reference tables, and object streams.
Object Types ¶
The package models eight basic object types, each implemented as a Go type satisfying the Object interface:
- Null - represents the PDF null object
- Bool - represents PDF boolean values (true/false)
- Int - represents PDF integers
- Real - represents PDF real numbers (floating point)
- String - represents PDF string objects (literal or hexadecimal)
- Name - represents PDF name objects (e.g., /Type, /Font)
- Array - represents PDF arrays
- Dict - represents PDF dictionaries
Additionally, Stream represents a PDF stream (dictionary + binary data), and IndirectRef represents a reference to an indirect object.
Parsing ¶
The Parser type handles parsing PDF syntax from an io.Reader. It can parse individual objects or complete indirect object definitions.
The Lexer type provides tokenization of PDF input, converting raw bytes into tokens that the parser consumes.
Cross-Reference Tables ¶
The XRefTable type represents a PDF cross-reference table, which maps object numbers to their locations in the file. The XRefParser type handles parsing both traditional xref tables (PDF 1.0-1.4) and xref streams (PDF 1.5+).
Object Streams ¶
The ObjectStream type (PDF 1.5+) handles object streams, which pack multiple objects into a single stream so that they can be compressed together.
Stream Decoding ¶
Stream data may be encoded with one or more filters. The Stream.Decode method reverses them, supporting FlateDecode, ASCIIHexDecode, ASCII85Decode, and filter chains.
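As a rough illustration of what reversing a FlateDecode filter involves, the sketch below inflates zlib-wrapped data using only the Go standard library. The helpers flateDecode and flateEncode are hypothetical names for this example and are not part of package core; the package's own Stream.Decode may differ in how it handles predictors, filter parameters, and malformed data.

```go
package main

import (
	"bytes"
	"compress/zlib"
	"fmt"
	"io"
)

// flateDecode inflates zlib-wrapped data, the format used by the
// FlateDecode filter. Hypothetical helper, not part of package core.
func flateDecode(data []byte) ([]byte, error) {
	zr, err := zlib.NewReader(bytes.NewReader(data))
	if err != nil {
		return nil, err
	}
	defer zr.Close()
	return io.ReadAll(zr)
}

// flateEncode exists only to produce sample input for the demo.
func flateEncode(data []byte) []byte {
	var buf bytes.Buffer
	zw := zlib.NewWriter(&buf)
	zw.Write(data) // writes to a bytes.Buffer do not fail
	zw.Close()
	return buf.Bytes()
}

func main() {
	raw := flateEncode([]byte("BT /F1 12 Tf (Hello) Tj ET"))
	out, err := flateDecode(raw)
	if err != nil {
		fmt.Println("decode error:", err)
		return
	}
	fmt.Printf("%d compressed bytes -> %q\n", len(raw), out)
}
```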
Index ¶
- type Array
- type Bool
- type Dict
- func (d Dict) Delete(key string)
- func (d Dict) Get(key string) Object
- func (d Dict) GetArray(key string) (Array, bool)
- func (d Dict) GetBool(key string) (Bool, bool)
- func (d Dict) GetDict(key string) (Dict, bool)
- func (d Dict) GetIndirectRef(key string) (IndirectRef, bool)
- func (d Dict) GetInt(key string) (Int, bool)
- func (d Dict) GetName(key string) (Name, bool)
- func (d Dict) GetReal(key string) (Real, bool)
- func (d Dict) GetStream(key string) (*Stream, bool)
- func (d Dict) GetString(key string) (String, bool)
- func (d Dict) Has(key string) bool
- func (d Dict) Keys() []string
- func (d Dict) Set(key string, value Object)
- func (d Dict) String() string
- func (d Dict) Type() ObjectType
- type IndirectObject
- type IndirectRef
- type Int
- type Lexer
- type Name
- type Null
- type Object
- type ObjectStream
- func (os *ObjectStream) ContainsObject(objNum int) (bool, error)
- func (os *ObjectStream) Extends() *IndirectRef
- func (os *ObjectStream) First() int
- func (os *ObjectStream) GetObjectByIndex(index int) (Object, int, error)
- func (os *ObjectStream) GetObjectByNumber(objNum int) (Object, int, error)
- func (os *ObjectStream) N() int
- func (os *ObjectStream) ObjectNumbers() ([]int, error)
- type ObjectType
- type Parser
- type Real
- type ReferenceResolver
- type Stream
- type String
- type Token
- type TokenType
- type XRefEntry
- type XRefEntryType
- type XRefParser
- func (x *XRefParser) FindXRef() (int64, error)
- func (x *XRefParser) ParseAllXRefs() ([]*XRefTable, error)
- func (x *XRefParser) ParsePrevXRef(table *XRefTable) (*XRefTable, error)
- func (x *XRefParser) ParseXRef(offset int64) (*XRefTable, error)
- func (x *XRefParser) ParseXRefFromEOF() (*XRefTable, error)
- type XRefTable
Constants ¶
This section is empty.
Variables ¶
This section is empty.
Functions ¶
This section is empty.
Types ¶
type Array ¶
type Array []Object
Array represents a PDF array, an ordered collection of PDF objects.
func (Array) GetInt ¶
GetInt returns the integer at the given index, with a boolean indicating success.
func (Array) GetName ¶
GetName returns the name at the given index, with a boolean indicating success.
func (Array) GetReal ¶
GetReal returns the real number at the given index, with a boolean indicating success.
func (Array) Type ¶
func (a Array) Type() ObjectType
type Bool ¶
type Bool bool
Bool represents a PDF boolean value (true or false).
func (Bool) Type ¶
func (b Bool) Type() ObjectType
type Dict ¶
Dict represents a PDF dictionary, a collection of key-value pairs where keys are names (strings) and values are arbitrary PDF objects.
func (Dict) GetArray ¶
GetArray returns the Array value for the key, with a boolean indicating success.
func (Dict) GetBool ¶
GetBool returns the Bool value for the key, with a boolean indicating success.
func (Dict) GetDict ¶
GetDict returns the Dict value for the key, with a boolean indicating success.
func (Dict) GetIndirectRef ¶
func (d Dict) GetIndirectRef(key string) (IndirectRef, bool)
GetIndirectRef returns the IndirectRef value for the key, with a boolean indicating success.
func (Dict) GetName ¶
GetName returns the Name value for the key, with a boolean indicating success.
func (Dict) GetReal ¶
GetReal returns the Real value for the key, with a boolean indicating success.
func (Dict) GetStream ¶
GetStream returns the Stream value for the key, with a boolean indicating success.
func (Dict) GetString ¶
GetString returns the String value for the key, with a boolean indicating success.
func (Dict) Type ¶
func (d Dict) Type() ObjectType
type IndirectObject ¶
type IndirectObject struct {
Ref IndirectRef // Reference identifying this object
Object Object // The actual object value
}
IndirectObject represents an indirect object definition (e.g., "5 0 obj ... endobj"). It pairs an IndirectRef with the actual object value.
type IndirectRef ¶
type IndirectRef struct {
Number int // Object number
Generation int // Generation number (usually 0)
}
IndirectRef represents an indirect object reference in PDF syntax (e.g., "5 0 R"). It references an object by its object number and generation number.
func (IndirectRef) String ¶
func (r IndirectRef) String() string
func (IndirectRef) Type ¶
func (r IndirectRef) Type() ObjectType
type Int ¶
type Int int64
Int represents a PDF integer object, stored as a 64-bit signed integer.
func (Int) Type ¶
func (i Int) Type() ObjectType
type Lexer ¶
type Lexer struct {
// contains filtered or unexported fields
}
Lexer performs lexical analysis of PDF content, breaking the input into tokens. It handles all PDF lexical elements including strings with escape sequences, hexadecimal strings, names with # escapes, and nested parentheses.
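To make the "# escapes" concrete: in PDF syntax a name may encode any byte as # followed by two hexadecimal digits, so /Lime#20Green denotes the name "Lime Green". The following is a minimal standalone sketch of that decoding step. decodeName is a hypothetical helper written for this example; it is not the Lexer's actual implementation.

```go
package main

import (
	"fmt"
	"strconv"
)

// decodeName expands #xx hex escapes in the body of a PDF name
// (the text after the leading slash). Hypothetical helper.
func decodeName(raw string) (string, error) {
	out := make([]byte, 0, len(raw))
	for i := 0; i < len(raw); i++ {
		if raw[i] != '#' {
			out = append(out, raw[i])
			continue
		}
		// A '#' must be followed by exactly two hex digits.
		if i+3 > len(raw) {
			return "", fmt.Errorf("truncated # escape at offset %d", i)
		}
		v, err := strconv.ParseUint(raw[i+1:i+3], 16, 8)
		if err != nil {
			return "", fmt.Errorf("invalid # escape at offset %d: %w", i, err)
		}
		out = append(out, byte(v))
		i += 2 // skip the two digits just consumed
	}
	return string(out), nil
}

func main() {
	name, err := decodeName("Lime#20Green")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("%q\n", name)
}
```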
func (*Lexer) NextToken ¶
NextToken returns the next token from the input. It skips whitespace and returns TokenEOF when the input is exhausted.
func (*Lexer) ReadBytes ¶
ReadBytes reads exactly n bytes from the underlying reader. Used for reading binary stream data where tokenization is not appropriate.
func (*Lexer) SkipStreamEOL ¶ added in v1.6.0
SkipStreamEOL skips the mandatory end-of-line marker after the 'stream' keyword. Per PDF spec, this is either a single LF (0x0A) or CR+LF (0x0D 0x0A).
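The rule above can be sketched on top of a bufio.Reader. This is an illustration of the rule only: skipStreamEOL and dataAfterEOL are hypothetical standalone functions, and the real method operates on the Lexer's internal state and may be more lenient with malformed files.

```go
package main

import (
	"bufio"
	"fmt"
	"io"
	"strings"
)

// skipStreamEOL consumes the end-of-line marker that follows the
// "stream" keyword: a single LF, or CR followed by LF.
func skipStreamEOL(r *bufio.Reader) error {
	b, err := r.ReadByte()
	if err != nil {
		return err
	}
	if b == '\r' {
		// A CR must be followed by LF; a lone CR is rejected below.
		if b, err = r.ReadByte(); err != nil {
			return err
		}
	}
	if b != '\n' {
		return fmt.Errorf("expected LF after stream keyword, got 0x%02X", b)
	}
	return nil
}

// dataAfterEOL skips the marker and returns everything that follows.
func dataAfterEOL(s string) (string, error) {
	r := bufio.NewReader(strings.NewReader(s))
	if err := skipStreamEOL(r); err != nil {
		return "", err
	}
	rest, err := io.ReadAll(r)
	return string(rest), err
}

func main() {
	for _, in := range []string{"\nDATA", "\r\nDATA", "\rDATA"} {
		rest, err := dataAfterEOL(in)
		fmt.Printf("%q -> %q, err=%v\n", in, rest, err)
	}
}
```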
type Name ¶
type Name string
Name represents a PDF name object, used as identifiers (e.g., /Type, /Font). The leading slash is not stored; it is added in String() output.
func (Name) Type ¶
func (n Name) Type() ObjectType
type Null ¶
type Null struct{}
Null represents the PDF null object, which denotes the absence of a value.
func (Null) Type ¶
func (n Null) Type() ObjectType
type Object ¶
type Object interface {
// Type returns the ObjectType identifying this object's type.
Type() ObjectType
// String returns a PDF-syntax string representation of the object.
String() string
}
Object is the interface implemented by all PDF object types. Every PDF object can report its type and provide a string representation.
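A common way to consume values behind such an interface is a type switch. The sketch below re-declares a small subset of the documented types as stand-ins so that it compiles on its own; the method bodies are assumptions (for example, the real Name.String may also escape special characters), and describe is a hypothetical helper.

```go
package main

import (
	"fmt"
	"strconv"
)

// Stand-in declarations mirroring a subset of the documented API so the
// example is self-contained; the real definitions live in package core.
type ObjectType int

const (
	ObjNull ObjectType = iota
	ObjBool
	ObjInt
	ObjReal
	ObjString
	ObjName
	ObjArray
	ObjDict
	ObjStream
	ObjIndirect
)

type Object interface {
	Type() ObjectType
	String() string
}

type Int int64

func (i Int) Type() ObjectType { return ObjInt }
func (i Int) String() string   { return strconv.FormatInt(int64(i), 10) }

type Name string

func (n Name) Type() ObjectType { return ObjName }
func (n Name) String() string   { return "/" + string(n) } // slash added on output

type IndirectRef struct {
	Number     int
	Generation int
}

func (r IndirectRef) Type() ObjectType { return ObjIndirect }
func (r IndirectRef) String() string {
	return fmt.Sprintf("%d %d R", r.Number, r.Generation)
}

// describe dispatches on the dynamic type of an Object.
func describe(o Object) string {
	switch v := o.(type) {
	case Int:
		return "integer " + v.String()
	case Name:
		return "name " + v.String()
	case IndirectRef:
		return fmt.Sprintf("reference to object %d", v.Number)
	default:
		return "other " + o.String()
	}
}

func main() {
	for _, o := range []Object{Int(12), Name("Type"), IndirectRef{Number: 5}} {
		fmt.Println(describe(o))
	}
}
```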
type ObjectStream ¶
type ObjectStream struct {
// contains filtered or unexported fields
}
ObjectStream represents a PDF Object Stream (Type /ObjStm), introduced in PDF 1.5. Object streams pack multiple objects into a single stream that is compressed as a whole, which typically compresses better than storing each object individually.
func NewObjectStream ¶
func NewObjectStream(stream *Stream) (*ObjectStream, error)
NewObjectStream creates an ObjectStream from a Stream object. The stream must have Type /ObjStm and required entries /N and /First. Returns an error if the stream is not a valid object stream.
func (*ObjectStream) ContainsObject ¶
func (os *ObjectStream) ContainsObject(objNum int) (bool, error)
ContainsObject reports whether the given object number is stored in this stream.
func (*ObjectStream) Extends ¶
func (os *ObjectStream) Extends() *IndirectRef
Extends returns the reference to another object stream this one extends, or nil.
func (*ObjectStream) First ¶
func (os *ObjectStream) First() int
First returns the byte offset to the first object's data in the decoded stream. The header (object number/offset pairs) precedes this offset.
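To show how /N and /First relate to the decoded data, here is a minimal sketch of reading an object stream header: N pairs of integers (object number, offset relative to First), followed by the object data. The names objStmEntry, parseObjStmHeader, and objectText are hypothetical and exist only for this example; the sketch also splits on Go's notion of whitespace, which is slightly narrower than PDF's.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// objStmEntry is one header pair: an object number and the byte offset
// of that object's data, relative to the First offset.
type objStmEntry struct {
	Number int
	Offset int
}

// parseObjStmHeader reads n (number, offset) pairs from decoded[:first].
func parseObjStmHeader(decoded []byte, n, first int) ([]objStmEntry, error) {
	if n < 0 || first < 0 || first > len(decoded) {
		return nil, fmt.Errorf("invalid N=%d or First=%d", n, first)
	}
	fields := strings.Fields(string(decoded[:first]))
	if len(fields) < 2*n {
		return nil, fmt.Errorf("header has %d integers, need %d", len(fields), 2*n)
	}
	entries := make([]objStmEntry, n)
	for i := range entries {
		num, err := strconv.Atoi(fields[2*i])
		if err != nil {
			return nil, err
		}
		off, err := strconv.Atoi(fields[2*i+1])
		if err != nil {
			return nil, err
		}
		entries[i] = objStmEntry{Number: num, Offset: off}
	}
	return entries, nil
}

// objectText returns the source text of the object at index i. Its data
// runs from its own offset up to the next object's offset (or the end).
func objectText(decoded []byte, entries []objStmEntry, first, i int) (string, error) {
	if i < 0 || i >= len(entries) {
		return "", fmt.Errorf("index %d out of range", i)
	}
	start := first + entries[i].Offset
	end := len(decoded)
	if i+1 < len(entries) {
		end = first + entries[i+1].Offset
	}
	if start < first || start > end || end > len(decoded) {
		return "", fmt.Errorf("bad offsets for index %d", i)
	}
	return strings.TrimSpace(string(decoded[start:end])), nil
}

func main() {
	// Two objects (11 and 12); the header occupies the first 10 bytes.
	decoded := []byte("11 0 12 9 <</A 1>> (hello)")
	entries, err := parseObjStmHeader(decoded, 2, 10)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	for i, e := range entries {
		text, _ := objectText(decoded, entries, 10, i)
		fmt.Printf("index %d: object %d = %s\n", i, e.Number, text)
	}
}
```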
func (*ObjectStream) GetObjectByIndex ¶
func (os *ObjectStream) GetObjectByIndex(index int) (Object, int, error)
GetObjectByIndex extracts an object by its index within the stream (0-based). Returns the object, its object number, and any error. The index corresponds to the position in the header, not the object number.
func (*ObjectStream) GetObjectByNumber ¶
func (os *ObjectStream) GetObjectByNumber(objNum int) (Object, int, error)
GetObjectByNumber finds and extracts an object by its object number. Returns the object, its index within the stream, and any error.
func (*ObjectStream) N ¶
func (os *ObjectStream) N() int
N returns the number of objects stored in the stream.
func (*ObjectStream) ObjectNumbers ¶
func (os *ObjectStream) ObjectNumbers() ([]int, error)
ObjectNumbers returns a slice of all object numbers stored in this stream.
type ObjectType ¶
type ObjectType int
ObjectType identifies the type of a PDF object.
const (
ObjNull ObjectType = iota // Null object
ObjBool // Boolean (true/false)
ObjInt // Integer
ObjReal // Real number (floating point)
ObjString // String (literal or hexadecimal)
ObjName // Name object (e.g., /Type)
ObjArray // Array
ObjDict // Dictionary
ObjStream // Stream (dictionary + data)
ObjIndirect // Indirect reference (e.g., "5 0 R")
)
PDF object type constants.
func (ObjectType) String ¶
func (t ObjectType) String() string
String returns a human-readable name for the object type.
type Parser ¶
type Parser struct {
// contains filtered or unexported fields
}
Parser parses PDF objects from an io.Reader using a Lexer for tokenization. It supports parsing all PDF object types including indirect objects and streams.
func NewParser ¶
NewParser creates a new PDF parser for the given reader. It initializes the lexer and loads the first two tokens for lookahead.
func (*Parser) ParseIndirectObject ¶
func (p *Parser) ParseIndirectObject() (*IndirectObject, error)
ParseIndirectObject parses an indirect object definition. Format: "num gen obj <object> endobj" or "num gen obj <dict> stream ... endstream endobj"
func (*Parser) ParseObject ¶
ParseObject parses and returns the next PDF object from the input. It handles all PDF object types: null, boolean, integer, real, string, name, array, dictionary, and indirect references.
func (*Parser) SetReferenceResolver ¶ added in v1.5.6
func (p *Parser) SetReferenceResolver(resolver ReferenceResolver)
SetReferenceResolver sets the reference resolver for the parser. This is needed to resolve indirect stream lengths.
type Real ¶
type Real float64
Real represents a PDF real number (floating-point), stored as float64.
func (Real) Type ¶
func (r Real) Type() ObjectType
type ReferenceResolver ¶ added in v1.5.6
type ReferenceResolver interface {
ResolveReference(ref IndirectRef) (Object, error)
}
ReferenceResolver is an interface for resolving indirect references. This allows the parser to resolve indirect stream lengths when needed.
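The typical case is a stream whose /Length is written as a reference such as "8 0 R" rather than a direct integer. The sketch below shows one way a map-backed resolver could satisfy this interface. It re-declares trimmed-down stand-ins for Object, Int, IndirectRef, and ReferenceResolver so that it compiles on its own; mapResolver and streamLength are hypothetical names, not part of the package.

```go
package main

import (
	"fmt"
	"strconv"
)

// Trimmed-down stand-ins for the documented types (the real Object
// interface also has a Type method).
type Object interface{ String() string }

type Int int64

func (i Int) String() string { return strconv.FormatInt(int64(i), 10) }

type IndirectRef struct{ Number, Generation int }

func (r IndirectRef) String() string {
	return fmt.Sprintf("%d %d R", r.Number, r.Generation)
}

type ReferenceResolver interface {
	ResolveReference(ref IndirectRef) (Object, error)
}

// mapResolver resolves references from an in-memory table.
type mapResolver map[IndirectRef]Object

func (m mapResolver) ResolveReference(ref IndirectRef) (Object, error) {
	obj, ok := m[ref]
	if !ok {
		return nil, fmt.Errorf("object %s not found", ref)
	}
	return obj, nil
}

// streamLength returns the integer value of a /Length entry, following
// one level of indirection when the entry is a reference.
func streamLength(v Object, r ReferenceResolver) (int64, error) {
	if ref, ok := v.(IndirectRef); ok {
		if r == nil {
			return 0, fmt.Errorf("no resolver available for %s", ref)
		}
		resolved, err := r.ResolveReference(ref)
		if err != nil {
			return 0, err
		}
		v = resolved
	}
	n, ok := v.(Int)
	if !ok {
		return 0, fmt.Errorf("/Length is not an integer: %v", v)
	}
	return int64(n), nil
}

func main() {
	resolver := mapResolver{{Number: 8}: Int(42)}
	n, err := streamLength(IndirectRef{Number: 8}, resolver)
	fmt.Println(n, err)
}
```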
type Stream ¶
type Stream struct {
Dict Dict // Stream dictionary containing metadata and filter information
Data []byte // Raw (possibly compressed) stream data
// contains filtered or unexported fields
}
Stream represents a PDF stream object, consisting of a dictionary and binary data. Streams are used for content that may be compressed or filtered, such as page content, images, and fonts.
func (*Stream) Decode ¶
Decode decodes the stream data according to the Filter(s) specified in the stream dictionary. It supports FlateDecode, ASCIIHexDecode, ASCII85Decode, and filter chains. Returns the decoded data or an error.
func (*Stream) Decoded ¶
Decoded returns the decoded (decompressed) stream data. Results are cached for subsequent calls. Use Stream.Decode for full decoding.
func (*Stream) Type ¶
func (s *Stream) Type() ObjectType
type String ¶
type String string
String represents a PDF string object (either literal or hexadecimal encoded).
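For the hexadecimal form, the PDF syntax ignores whitespace between digits and treats a missing final digit as 0, so <901FA> denotes the bytes 90 1F A0. The following is a minimal sketch of that rule using encoding/hex; decodeHexString is a hypothetical helper and not the package's own decoder.

```go
package main

import (
	"encoding/hex"
	"fmt"
	"strings"
)

// decodeHexString decodes the body of a PDF hexadecimal string (the
// text between < and >). Whitespace is dropped and an odd final digit
// is padded with 0.
func decodeHexString(body string) ([]byte, error) {
	digits := strings.Map(func(r rune) rune {
		switch r {
		case ' ', '\t', '\r', '\n', '\f', 0:
			return -1 // drop PDF whitespace
		}
		return r
	}, body)
	if len(digits)%2 == 1 {
		digits += "0"
	}
	return hex.DecodeString(digits)
}

func main() {
	for _, body := range []string{"48656C6C6F", "48 65 6c 6c 6f", "901FA"} {
		out, err := decodeHexString(body)
		fmt.Printf("<%s> -> % X (%q), err=%v\n", body, out, out, err)
	}
}
```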
func (String) Type ¶
func (s String) Type() ObjectType
type Token ¶
type Token struct {
Type TokenType // Token type
Value []byte // Raw token value (without delimiters for strings/names)
Pos int64 // Byte position in the input stream
SkippedBytes []byte // Bytes skipped as whitespace before this token (for stream data recovery)
}
Token represents a lexical token from PDF input.
type TokenType ¶
type TokenType int
TokenType identifies the type of a lexical token.
const (
TokenEOF TokenType = iota // End of input
TokenWhitespace // Whitespace (space, tab, newline, etc.)
TokenComment // Comment (% to end of line)
TokenKeyword // Keywords: true, false, null, obj, endobj, stream, endstream
TokenInteger // Integer literal (e.g., 123, -45)
TokenReal // Real number literal (e.g., 3.14, -0.5)
TokenString // Literal string: (hello)
TokenHexString // Hexadecimal string: <48656C6C6F>
TokenName // Name object: /Type
TokenArrayStart // Array start: [
TokenArrayEnd // Array end: ]
TokenDictStart // Dictionary start: <<
TokenDictEnd // Dictionary end: >>
TokenIndirectRef // Indirect reference marker: R
)
Token type constants for PDF lexical elements.
type XRefEntry ¶
type XRefEntry struct {
Type XRefEntryType // Entry type (free, uncompressed, or compressed)
Offset int64 // Byte offset (uncompressed) or object stream number (compressed)
Generation int // Generation number (uncompressed) or index within object stream (compressed)
InUse bool // True if object is in use (Type != XRefEntryFree)
}
XRefEntry represents a single entry in the cross-reference table, describing where an object is located in the PDF file.
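In a traditional xref table each entry is a fixed-width line such as "0000000017 00000 n": a 10-digit number, a 5-digit generation, and a keyword that is n (in use) or f (free). The sketch below parses one such line into a simplified record. The names xrefLine and parseXRefLine are hypothetical, and the struct is a reduced stand-in rather than the package's XRefEntry.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// xrefLine is a simplified stand-in for one traditional xref entry.
type xrefLine struct {
	Offset     int64 // byte offset (in use) or next free object number (free)
	Generation int
	InUse      bool
}

// parseXRefLine parses a line of the form "nnnnnnnnnn ggggg n".
func parseXRefLine(line string) (xrefLine, error) {
	fields := strings.Fields(line)
	if len(fields) != 3 {
		return xrefLine{}, fmt.Errorf("expected 3 fields, got %d", len(fields))
	}
	off, err := strconv.ParseInt(fields[0], 10, 64)
	if err != nil {
		return xrefLine{}, err
	}
	gen, err := strconv.Atoi(fields[1])
	if err != nil {
		return xrefLine{}, err
	}
	e := xrefLine{Offset: off, Generation: gen}
	switch fields[2] {
	case "n":
		e.InUse = true
	case "f":
		e.InUse = false
	default:
		return xrefLine{}, fmt.Errorf("unknown entry keyword %q", fields[2])
	}
	return e, nil
}

func main() {
	for _, line := range []string{"0000000017 00000 n \n", "0000000000 65535 f \n"} {
		e, err := parseXRefLine(line)
		fmt.Printf("%+v err=%v\n", e, err)
	}
}
```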
type XRefEntryType ¶
type XRefEntryType int
XRefEntryType identifies the type of a cross-reference table entry.
const (
// XRefEntryFree indicates a free (deleted) object entry.
XRefEntryFree XRefEntryType = 0
// XRefEntryUncompressed indicates an in-use object at a byte offset in the file.
XRefEntryUncompressed XRefEntryType = 1
// XRefEntryCompressed indicates an object stored in an object stream (PDF 1.5+).
XRefEntryCompressed XRefEntryType = 2
)
XRef entry type constants.
func (XRefEntryType) String ¶
func (t XRefEntryType) String() string
String returns a human-readable name for the entry type.
type XRefParser ¶
type XRefParser struct {
// contains filtered or unexported fields
}
XRefParser parses PDF cross-reference tables from a seekable reader. It supports both traditional xref tables (PDF 1.0-1.4) and xref streams (PDF 1.5+).
func NewXRefParser ¶
func NewXRefParser(r io.ReadSeeker) *XRefParser
NewXRefParser creates a new XRef parser for the given reader.
func (*XRefParser) FindXRef ¶
func (x *XRefParser) FindXRef() (int64, error)
FindXRef finds the byte offset of the xref table by scanning from EOF. PDF files end with "startxref\n<offset>\n%%EOF", where offset points to the xref.
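One plausible way to implement such a scan is to read a fixed window at the end of the file and search it backwards for the startxref keyword. The sketch below does that with an io.ReadSeeker; the 1024-byte window and the names findStartXRef and findStartXRefInString are assumptions made for this example, not details taken from the package.

```go
package main

import (
	"fmt"
	"io"
	"strconv"
	"strings"
)

// findStartXRef returns the offset that follows the last "startxref"
// keyword in the final bytes of r.
func findStartXRef(r io.ReadSeeker) (int64, error) {
	size, err := r.Seek(0, io.SeekEnd)
	if err != nil {
		return 0, err
	}
	const window = 1024 // assumed tail size for this sketch
	start := size - window
	if start < 0 {
		start = 0
	}
	if _, err := r.Seek(start, io.SeekStart); err != nil {
		return 0, err
	}
	tail, err := io.ReadAll(r)
	if err != nil {
		return 0, err
	}
	text := string(tail)
	idx := strings.LastIndex(text, "startxref")
	if idx < 0 {
		return 0, fmt.Errorf("startxref keyword not found")
	}
	fields := strings.Fields(text[idx+len("startxref"):])
	if len(fields) == 0 {
		return 0, fmt.Errorf("no offset after startxref")
	}
	return strconv.ParseInt(fields[0], 10, 64)
}

// findStartXRefInString is a convenience wrapper for in-memory input.
func findStartXRefInString(pdf string) (int64, error) {
	return findStartXRef(strings.NewReader(pdf))
}

func main() {
	off, err := findStartXRefInString("%PDF-1.4\n...\nstartxref\n1234\n%%EOF\n")
	fmt.Println(off, err)
}
```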
func (*XRefParser) ParseAllXRefs ¶
func (x *XRefParser) ParseAllXRefs() ([]*XRefTable, error)
ParseAllXRefs parses the main xref table and all previous ones from incremental updates, following /Prev links. Returns tables in chronological order (oldest first).
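The traversal described here can be sketched independently of the parsing itself: start at the newest section, follow /Prev until it is absent, then reverse the result. In the sketch below, section, chainOldestFirst, and offsetsOldestFirst are hypothetical names, and the loop detection is an addition for this example; the source does not say whether ParseAllXRefs guards against cyclic /Prev links.

```go
package main

import "fmt"

// section is a stand-in for a parsed xref section and the /Prev value
// from its trailer.
type section struct {
	Offset  int64
	Prev    int64
	HasPrev bool
}

// chainOldestFirst follows Prev links from start and returns the
// sections in chronological order (oldest first).
func chainOldestFirst(start int64, load func(int64) (section, error)) ([]section, error) {
	seen := map[int64]bool{}
	var chain []section
	off, more := start, true
	for more {
		if seen[off] {
			return nil, fmt.Errorf("xref loop at offset %d", off)
		}
		seen[off] = true
		s, err := load(off)
		if err != nil {
			return nil, err
		}
		chain = append(chain, s)
		off, more = s.Prev, s.HasPrev
	}
	// The walk collected newest first; reverse in place.
	for i, j := 0, len(chain)-1; i < j; i, j = i+1, j-1 {
		chain[i], chain[j] = chain[j], chain[i]
	}
	return chain, nil
}

// offsetsOldestFirst runs the walk over an in-memory set of sections.
func offsetsOldestFirst(start int64, sections map[int64]section) ([]int64, error) {
	load := func(off int64) (section, error) {
		s, ok := sections[off]
		if !ok {
			return section{}, fmt.Errorf("no xref section at offset %d", off)
		}
		return s, nil
	}
	chain, err := chainOldestFirst(start, load)
	if err != nil {
		return nil, err
	}
	offsets := make([]int64, len(chain))
	for i, s := range chain {
		offsets[i] = s.Offset
	}
	return offsets, nil
}

func main() {
	sections := map[int64]section{
		900: {Offset: 900, Prev: 400, HasPrev: true},
		400: {Offset: 400, Prev: 100, HasPrev: true},
		100: {Offset: 100},
	}
	fmt.Println(offsetsOldestFirst(900, sections))
}
```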
func (*XRefParser) ParsePrevXRef ¶
func (x *XRefParser) ParsePrevXRef(table *XRefTable) (*XRefTable, error)
ParsePrevXRef checks if the trailer has a /Prev entry and parses that xref table. This handles incremental updates in PDFs, where each update adds a new xref table that points to the previous one.
func (*XRefParser) ParseXRef ¶
func (x *XRefParser) ParseXRef(offset int64) (*XRefTable, error)
ParseXRef parses the xref table at the given byte offset. It auto-detects and handles both traditional xref tables (PDF 1.0-1.4) and xref streams (PDF 1.5+).
func (*XRefParser) ParseXRefFromEOF ¶
func (x *XRefParser) ParseXRefFromEOF() (*XRefTable, error)
ParseXRefFromEOF locates and parses the xref table by scanning from the end of the file to find the startxref offset.
type XRefTable ¶
type XRefTable struct {
Entries map[int]*XRefEntry // Map from object number to entry
Trailer Dict // Trailer dictionary with /Root, /Info, /Size, etc.
IsStream bool // True if this XRef came from a stream (PDF 1.5+)
}
XRefTable represents a PDF cross-reference table, which maps object numbers to their locations in the file. It includes the trailer dictionary containing document-level information.
func MergeXRefTables ¶
MergeXRefTables merges multiple xref tables from incremental updates. Tables should be provided in chronological order (oldest first); later entries override earlier ones for the same object number.
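The override rule amounts to applying the tables in order and letting each later entry overwrite the earlier one. Below is a minimal sketch on plain maps; entry and mergeEntries are hypothetical stand-ins, and the sketch leaves out how the merged trailer is chosen, which the source does not describe.

```go
package main

import "fmt"

// entry is a reduced stand-in for a cross-reference entry.
type entry struct {
	Offset     int64
	Generation int
	InUse      bool
}

// mergeEntries applies tables oldest first, so an entry from a later
// table replaces an earlier entry with the same object number.
func mergeEntries(tables ...map[int]entry) map[int]entry {
	merged := make(map[int]entry)
	for _, t := range tables {
		for num, e := range t {
			merged[num] = e
		}
	}
	return merged
}

func main() {
	original := map[int]entry{
		1: {Offset: 100, InUse: true},
		2: {Offset: 200, InUse: true},
	}
	update := map[int]entry{
		2: {Offset: 900, InUse: true}, // object 2 rewritten by the update
		3: {Offset: 950, InUse: true}, // object 3 added by the update
	}
	merged := mergeEntries(original, update)
	for num := 1; num <= 3; num++ {
		fmt.Printf("object %d at offset %d\n", num, merged[num].Offset)
	}
}
```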
func NewXRefTable ¶
func NewXRefTable() *XRefTable
NewXRefTable creates a new empty cross-reference table.
func (*XRefTable) Get ¶
Get returns the entry for the given object number and a boolean indicating whether the entry exists.