core

package
v1.6.6 Latest
Warning

This package is not in the latest version of its module.

Published: Feb 4, 2026 License: MIT Imports: 7 Imported by: 0

Documentation

Overview

Package core provides low-level PDF parsing primitives and object types.

This package implements the fundamental building blocks for working with PDF files: the object types null, boolean, integer, real, string, name, array, and dictionary, as well as streams, indirect references, cross-reference tables, and object streams.

Object Types

The package implements eight basic object types, each satisfying the Object interface:

  • Null - represents the PDF null object
  • Bool - represents PDF boolean values (true/false)
  • Int - represents PDF integers
  • Real - represents PDF real numbers (floating point)
  • String - represents PDF string objects (literal or hexadecimal)
  • Name - represents PDF name objects (e.g., /Type, /Font)
  • Array - represents PDF arrays
  • Dict - represents PDF dictionaries

Additionally, Stream represents a PDF stream (dictionary + binary data), and IndirectRef represents a reference to an indirect object.
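
The following is a minimal, self-contained sketch of how values behind the Object interface can be inspected with a type switch. The types are local stand-ins that mirror the declarations documented on this page, not the package's own code; the describe helper and the bracketed output of Array.String are assumptions made for illustration.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// Local stand-ins mirroring the documented declarations.
type ObjectType int

const (
	ObjNull ObjectType = iota
	ObjBool
	ObjInt
	ObjReal
	ObjString
	ObjName
	ObjArray
	ObjDict
	ObjStream
	ObjIndirect
)

type Object interface {
	Type() ObjectType
	String() string
}

type Int int64
type Name string
type Array []Object

func (i Int) Type() ObjectType { return ObjInt }
func (i Int) String() string   { return strconv.FormatInt(int64(i), 10) }

func (n Name) Type() ObjectType { return ObjName }

// The leading slash is not stored; it is added on output.
func (n Name) String() string { return "/" + string(n) }

func (a Array) Type() ObjectType { return ObjArray }

// Assumed rendering: elements separated by spaces inside brackets.
func (a Array) String() string {
	parts := make([]string, len(a))
	for i, o := range a {
		parts[i] = o.String()
	}
	return "[" + strings.Join(parts, " ") + "]"
}

// describe is a hypothetical helper that branches on the concrete type.
func describe(o Object) string {
	switch v := o.(type) {
	case Int:
		return fmt.Sprintf("int %d", int64(v))
	case Name:
		return "name " + v.String()
	case Array:
		return fmt.Sprintf("array of %d", len(v))
	default:
		return "other"
	}
}

func main() {
	fmt.Println(describe(Array{Int(612), Name("Font")}))
}
```

A type switch avoids comparing Type() values by hand when the caller needs the concrete Go value anyway.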

Parsing

The Parser type parses PDF syntax from an io.Reader, either as individual objects or as complete indirect object definitions.

The Lexer type provides tokenization of PDF input, converting raw bytes into tokens that the parser consumes.

Cross-Reference Tables

The XRefTable type represents a PDF cross-reference table, which maps object numbers to their locations in the file. The XRefParser type handles parsing both traditional xref tables (PDF 1.0-1.4) and xref streams (PDF 1.5+).

Object Streams

The ObjectStream type handles object streams (PDF 1.5+), which pack multiple objects into a single stream so that they can be compressed together.

Stream Decoding

Streams can be compressed using various filters. The Stream.Decode method handles decompression, supporting filters like FlateDecode, ASCIIHexDecode, and ASCII85Decode.

Types

type Array

type Array []Object

Array represents a PDF array, an ordered collection of PDF objects.

func (Array) Get

func (a Array) Get(index int) Object

Get returns the element at the given index, or nil if out of bounds.

func (Array) GetInt

func (a Array) GetInt(index int) (Int, bool)

GetInt returns the integer at the given index, with a boolean indicating success.

func (Array) GetName

func (a Array) GetName(index int) (Name, bool)

GetName returns the name at the given index, with a boolean indicating success.

func (Array) GetReal

func (a Array) GetReal(index int) (Real, bool)

GetReal returns the real number at the given index, with a boolean indicating success.

func (Array) Len

func (a Array) Len() int

Len returns the number of elements in the array.

func (Array) String

func (a Array) String() string

func (Array) Type

func (a Array) Type() ObjectType

type Bool

type Bool bool

Bool represents a PDF boolean value (true or false).

func (Bool) String

func (b Bool) String() string

func (Bool) Type

func (b Bool) Type() ObjectType

type Dict

type Dict map[string]Object

Dict represents a PDF dictionary, a collection of key-value pairs where keys are names (strings) and values are arbitrary PDF objects.

func (Dict) Delete

func (d Dict) Delete(key string)

Delete removes the key and its value from the dictionary.

func (Dict) Get

func (d Dict) Get(key string) Object

Get returns the value associated with the key, or nil if not present.

func (Dict) GetArray

func (d Dict) GetArray(key string) (Array, bool)

GetArray returns the Array value for the key, with a boolean indicating success.

func (Dict) GetBool

func (d Dict) GetBool(key string) (Bool, bool)

GetBool returns the Bool value for the key, with a boolean indicating success.

func (Dict) GetDict

func (d Dict) GetDict(key string) (Dict, bool)

GetDict returns the Dict value for the key, with a boolean indicating success.

func (Dict) GetIndirectRef

func (d Dict) GetIndirectRef(key string) (IndirectRef, bool)

GetIndirectRef returns the IndirectRef value for the key, with a boolean indicating success.

func (Dict) GetInt

func (d Dict) GetInt(key string) (Int, bool)

GetInt returns the Int value for the key, with a boolean indicating success.

func (Dict) GetName

func (d Dict) GetName(key string) (Name, bool)

GetName returns the Name value for the key, with a boolean indicating success.

func (Dict) GetReal

func (d Dict) GetReal(key string) (Real, bool)

GetReal returns the Real value for the key, with a boolean indicating success.

func (Dict) GetStream

func (d Dict) GetStream(key string) (*Stream, bool)

GetStream returns the Stream value for the key, with a boolean indicating success.

func (Dict) GetString

func (d Dict) GetString(key string) (String, bool)

GetString returns the String value for the key, with a boolean indicating success.

func (Dict) Has

func (d Dict) Has(key string) bool

Has reports whether the key exists in the dictionary.

func (Dict) Keys

func (d Dict) Keys() []string

Keys returns all keys in the dictionary in an arbitrary order.

func (Dict) Set

func (d Dict) Set(key string, value Object)

Set associates a value with the key in the dictionary.
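
The typed getters above all follow Go's comma-ok convention. A minimal sketch of that pattern, using local stand-in types rather than the package itself, might look like this. Object is reduced to an empty interface to keep the example short, and storing keys without the leading slash is an assumption.

```go
package main

import "fmt"

// Stand-ins for the documented types.
type Object interface{}
type Int int64
type Name string
type Dict map[string]Object

// GetInt returns the Int stored under key. The boolean is false when the
// key is missing or holds a value of another type.
func (d Dict) GetInt(key string) (Int, bool) {
	v, ok := d[key].(Int)
	return v, ok
}

// GetName follows the same comma-ok pattern for Name values.
func (d Dict) GetName(key string) (Name, bool) {
	v, ok := d[key].(Name)
	return v, ok
}

func main() {
	page := Dict{"Type": Name("Page"), "Rotate": Int(90)}
	if rot, ok := page.GetInt("Rotate"); ok {
		fmt.Println("rotate:", rot)
	}
	if _, ok := page.GetInt("Type"); !ok {
		fmt.Println("Type is not an integer")
	}
}
```

Because a missing key yields a nil interface value, one type assertion covers both the "absent" and the "wrong type" case.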

func (Dict) String

func (d Dict) String() string

func (Dict) Type

func (d Dict) Type() ObjectType

type IndirectObject

type IndirectObject struct {
	Ref    IndirectRef // Reference identifying this object
	Object Object      // The actual object value
}

IndirectObject represents an indirect object definition (e.g., "5 0 obj ... endobj"). It pairs an IndirectRef with the actual object value.

type IndirectRef

type IndirectRef struct {
	Number     int // Object number
	Generation int // Generation number (usually 0)
}

IndirectRef represents an indirect object reference in PDF syntax (e.g., "5 0 R"). It references an object by its object number and generation number.
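
As an illustration of the "5 0 R" syntax, the sketch below formats and re-reads a reference using a local copy of the struct. That the package's String method emits exactly this text is an assumption, and parseRef is a hypothetical helper that is not part of the documented API.

```go
package main

import "fmt"

// Stand-in with the same fields as the documented struct.
type IndirectRef struct {
	Number     int // Object number
	Generation int // Generation number (usually 0)
}

// String renders the reference as "num gen R".
func (r IndirectRef) String() string {
	return fmt.Sprintf("%d %d R", r.Number, r.Generation)
}

// parseRef reads "num gen R" back into a struct and rejects any other
// trailing keyword (for example "obj").
func parseRef(s string) (IndirectRef, error) {
	var r IndirectRef
	var marker string
	if _, err := fmt.Sscanf(s, "%d %d %s", &r.Number, &r.Generation, &marker); err != nil {
		return IndirectRef{}, err
	}
	if marker != "R" {
		return IndirectRef{}, fmt.Errorf("expected R, got %q", marker)
	}
	return r, nil
}

func main() {
	ref, err := parseRef("5 0 R")
	fmt.Println(ref, err)
}
```

Since the struct holds only two ints it is comparable, so values of this shape can serve directly as map keys in an object cache.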

func (IndirectRef) String

func (r IndirectRef) String() string

func (IndirectRef) Type

func (r IndirectRef) Type() ObjectType

type Int

type Int int64

Int represents a PDF integer object, stored as a 64-bit signed integer.

func (Int) String

func (i Int) String() string

func (Int) Type

func (i Int) Type() ObjectType

type Lexer

type Lexer struct {
	// contains filtered or unexported fields
}

Lexer performs lexical analysis of PDF content, breaking the input into tokens. It handles all PDF lexical elements including strings with escape sequences, hexadecimal strings, names with # escapes, and nested parentheses.
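
To make the "# escapes" in names concrete: in PDF syntax a name may encode any byte as # followed by two hexadecimal digits, so /Lime#20Green denotes the name "Lime Green". The sketch below shows one way to expand such escapes; decodeName is a hypothetical helper and does not reflect how the Lexer is actually implemented.

```go
package main

import (
	"fmt"
	"strconv"
)

// decodeName expands #xx escapes in the raw bytes of a name token.
// The leading slash is expected to have been removed already.
func decodeName(raw string) (string, error) {
	out := make([]byte, 0, len(raw))
	for i := 0; i < len(raw); i++ {
		if raw[i] != '#' {
			out = append(out, raw[i])
			continue
		}
		// An escape needs two more characters after the '#'.
		if i+2 >= len(raw) {
			return "", fmt.Errorf("truncated escape at offset %d", i)
		}
		b, err := strconv.ParseUint(raw[i+1:i+3], 16, 8)
		if err != nil {
			return "", fmt.Errorf("bad escape at offset %d: %v", i, err)
		}
		out = append(out, byte(b))
		i += 2 // skip the two hex digits just consumed
	}
	return string(out), nil
}

func main() {
	name, err := decodeName("Lime#20Green")
	fmt.Printf("%q %v\n", name, err)
}
```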

func NewLexer

func NewLexer(r io.Reader) *Lexer

NewLexer creates a new lexer for the given reader.

func (*Lexer) NextToken

func (l *Lexer) NextToken() (*Token, error)

NextToken returns the next token from the input. It skips whitespace and returns TokenEOF when the input is exhausted.

func (*Lexer) Peek

func (l *Lexer) Peek() (byte, error)

Peek returns the next byte without consuming it.

func (*Lexer) ReadByte

func (l *Lexer) ReadByte() (byte, error)

ReadByte reads and returns a single byte, advancing the position.

func (*Lexer) ReadBytes

func (l *Lexer) ReadBytes(n int) ([]byte, error)

ReadBytes reads exactly n bytes from the underlying reader. Used for reading binary stream data where tokenization is not appropriate.

func (*Lexer) SkipBytes

func (l *Lexer) SkipBytes(n int) error

SkipBytes discards exactly n bytes from the underlying reader.

func (*Lexer) SkipStreamEOL added in v1.6.0

func (l *Lexer) SkipStreamEOL() error

SkipStreamEOL skips the mandatory end-of-line marker after the 'stream' keyword. Per PDF spec, this is either a single LF (0x0A) or CR+LF (0x0D 0x0A).
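
The rule can be sketched on top of a bufio.Reader as follows. This is an illustration of the end-of-line rule only, not the Lexer's own code; skipStreamEOL and afterEOL are hypothetical names.

```go
package main

import (
	"bufio"
	"errors"
	"fmt"
	"io"
	"strings"
)

// skipStreamEOL consumes a single LF or a CR+LF pair. A lone CR, or any
// other byte, is reported as an error.
func skipStreamEOL(r *bufio.Reader) error {
	b, err := r.ReadByte()
	if err != nil {
		return err
	}
	if b == '\n' {
		return nil
	}
	if b == '\r' {
		b, err = r.ReadByte()
		if err != nil {
			return err
		}
		if b == '\n' {
			return nil
		}
	}
	return errors.New("stream keyword not followed by LF or CR+LF")
}

// afterEOL returns what remains of input once the marker is skipped.
func afterEOL(input string) (string, error) {
	r := bufio.NewReader(strings.NewReader(input))
	if err := skipStreamEOL(r); err != nil {
		return "", err
	}
	rest, err := io.ReadAll(r)
	return string(rest), err
}

func main() {
	rest, err := afterEOL("\r\nBT /F1 12 Tf ET")
	fmt.Printf("%q %v\n", rest, err)
}
```

Skipping exactly one marker matters because the bytes that follow are binary stream data, where a leading 0x0A or 0x0D may be part of the payload.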

type Name

type Name string

Name represents a PDF name object, used as identifiers (e.g., /Type, /Font). The leading slash is not stored; it is added in String() output.

func (Name) String

func (n Name) String() string

func (Name) Type

func (n Name) Type() ObjectType

type Null

type Null struct{}

Null represents the PDF null object, which denotes the absence of a value.

func (Null) String

func (n Null) String() string

func (Null) Type

func (n Null) Type() ObjectType

type Object

type Object interface {
	// Type returns the ObjectType identifying this object's type.
	Type() ObjectType
	// String returns a PDF-syntax string representation of the object.
	String() string
}

Object is the interface implemented by all PDF object types. Every PDF object can report its type and provide a string representation.

type ObjectStream

type ObjectStream struct {
	// contains filtered or unexported fields
}

ObjectStream represents a PDF Object Stream (Type /ObjStm), introduced in PDF 1.5. Object streams store multiple objects in a single compressed stream, providing better compression than storing objects individually.

func NewObjectStream

func NewObjectStream(stream *Stream) (*ObjectStream, error)

NewObjectStream creates an ObjectStream from a Stream object. The stream must have Type /ObjStm and required entries /N and /First. Returns an error if the stream is not a valid object stream.

func (*ObjectStream) ContainsObject

func (os *ObjectStream) ContainsObject(objNum int) (bool, error)

ContainsObject reports whether the given object number is stored in this stream.

func (*ObjectStream) Extends

func (os *ObjectStream) Extends() *IndirectRef

Extends returns the reference to another object stream this one extends, or nil.

func (*ObjectStream) First

func (os *ObjectStream) First() int

First returns the byte offset to the first object's data in the decoded stream. The header (object number/offset pairs) precedes this offset.

func (*ObjectStream) GetObjectByIndex

func (os *ObjectStream) GetObjectByIndex(index int) (Object, int, error)

GetObjectByIndex extracts an object by its index within the stream (0-based). Returns the object, its object number, and any error. The index corresponds to the position in the header, not the object number.
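
For orientation, the header of a decoded object stream consists of N pairs of integers: an object number followed by that object's byte offset relative to First. The sketch below parses such a header from raw bytes. The entry type and parseHeader function are hypothetical, and splitting on Go's notion of whitespace is a simplification of PDF's whitespace rules.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// entry is one header pair.
type entry struct {
	ObjNum int // object number
	Offset int // byte offset of the object, relative to First
}

// parseHeader reads n "objnum offset" pairs from the first `first` bytes
// of the decoded stream data.
func parseHeader(decoded []byte, n, first int) ([]entry, error) {
	if n < 0 {
		return nil, fmt.Errorf("negative object count %d", n)
	}
	if first < 0 || first > len(decoded) {
		return nil, fmt.Errorf("first offset %d out of range", first)
	}
	fields := strings.Fields(string(decoded[:first]))
	if len(fields) < 2*n {
		return nil, fmt.Errorf("header has %d integers, need %d", len(fields), 2*n)
	}
	entries := make([]entry, n)
	for i := range entries {
		num, err := strconv.Atoi(fields[2*i])
		if err != nil {
			return nil, err
		}
		off, err := strconv.Atoi(fields[2*i+1])
		if err != nil {
			return nil, err
		}
		entries[i] = entry{ObjNum: num, Offset: off}
	}
	return entries, nil
}

func main() {
	// Two objects: number 11 at offset 0 and number 12 at offset 5.
	data := []byte("11 0 12 5 <<>> null")
	entries, err := parseHeader(data, 2, 10)
	fmt.Println(entries, err)
}
```

With the header parsed, index i identifies the i-th pair, and the object's data starts at First plus that pair's offset.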

func (*ObjectStream) GetObjectByNumber

func (os *ObjectStream) GetObjectByNumber(objNum int) (Object, int, error)

GetObjectByNumber finds and extracts an object by its object number. Returns the object, its index within the stream, and any error.

func (*ObjectStream) N

func (os *ObjectStream) N() int

N returns the number of objects stored in the stream.

func (*ObjectStream) ObjectNumbers

func (os *ObjectStream) ObjectNumbers() ([]int, error)

ObjectNumbers returns a slice of all object numbers stored in this stream.

type ObjectType

type ObjectType int

ObjectType identifies the type of a PDF object.

const (
	ObjNull     ObjectType = iota // Null object
	ObjBool                       // Boolean (true/false)
	ObjInt                        // Integer
	ObjReal                       // Real number (floating point)
	ObjString                     // String (literal or hexadecimal)
	ObjName                       // Name object (e.g., /Type)
	ObjArray                      // Array
	ObjDict                       // Dictionary
	ObjStream                     // Stream (dictionary + data)
	ObjIndirect                   // Indirect reference (e.g., "5 0 R")
)

PDF object type constants.

func (ObjectType) String

func (t ObjectType) String() string

String returns a human-readable name for the object type.

type Parser

type Parser struct {
	// contains filtered or unexported fields
}

Parser parses PDF objects from an io.Reader using a Lexer for tokenization. It supports parsing all PDF object types including indirect objects and streams.

func NewParser

func NewParser(r io.Reader) *Parser

NewParser creates a new PDF parser for the given reader. It initializes the lexer and loads the first two tokens for lookahead.

func (*Parser) ParseIndirectObject

func (p *Parser) ParseIndirectObject() (*IndirectObject, error)

ParseIndirectObject parses an indirect object definition. Format: "num gen obj <object> endobj" or "num gen obj <dict> stream ... endstream endobj"

func (*Parser) ParseObject

func (p *Parser) ParseObject() (Object, error)

ParseObject parses and returns the next PDF object from the input. It handles all PDF object types: null, boolean, integer, real, string, name, array, dictionary, and indirect references.

func (*Parser) SetReferenceResolver added in v1.5.6

func (p *Parser) SetReferenceResolver(resolver ReferenceResolver)

SetReferenceResolver sets the reference resolver for the parser. This is needed to resolve indirect stream lengths.

type Real

type Real float64

Real represents a PDF real number (floating-point), stored as float64.

func (Real) String

func (r Real) String() string

func (Real) Type

func (r Real) Type() ObjectType

type ReferenceResolver added in v1.5.6

type ReferenceResolver interface {
	ResolveReference(ref IndirectRef) (Object, error)
}

ReferenceResolver is an interface for resolving indirect references. This allows the parser to resolve indirect stream lengths when needed.

type Stream

type Stream struct {
	Dict Dict   // Stream dictionary containing metadata and filter information
	Data []byte // Raw (possibly compressed) stream data
	// contains filtered or unexported fields
}

Stream represents a PDF stream object, consisting of a dictionary and binary data. Streams are used for content that may be compressed or filtered, such as page content, images, and fonts.

func (*Stream) Decode

func (s *Stream) Decode() ([]byte, error)

Decode decodes the stream data according to the Filter(s) specified in the stream dictionary. It supports FlateDecode, ASCIIHexDecode, ASCII85Decode, and filter chains. Returns the decoded data or an error.
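
As a rough illustration of what a filter chain involves, the sketch below decodes data with standard-library tools only. It assumes the two filter behaviours defined by the PDF specification: FlateDecode carries zlib-wrapped deflate data, and ASCIIHexDecode ignores whitespace, ends at ">", and pads an odd final digit with 0. Filters are applied in the order listed. All function names are hypothetical; this is not the Stream.Decode implementation.

```go
package main

import (
	"bytes"
	"compress/zlib"
	"encoding/hex"
	"fmt"
	"io"
)

// inflate undoes FlateDecode.
func inflate(data []byte) ([]byte, error) {
	zr, err := zlib.NewReader(bytes.NewReader(data))
	if err != nil {
		return nil, err
	}
	defer zr.Close()
	return io.ReadAll(zr)
}

// unhex undoes ASCIIHexDecode.
func unhex(data []byte) ([]byte, error) {
	digits := make([]byte, 0, len(data))
	for _, c := range data {
		if c == '>' { // end-of-data marker
			break
		}
		switch c {
		case ' ', '\t', '\r', '\n', '\f', 0:
			continue // whitespace is ignored
		}
		digits = append(digits, c)
	}
	if len(digits)%2 == 1 {
		digits = append(digits, '0')
	}
	out := make([]byte, hex.DecodedLen(len(digits)))
	_, err := hex.Decode(out, digits)
	return out, err
}

// applyFilters runs the named filters over data in the order given.
func applyFilters(data []byte, filters []string) ([]byte, error) {
	var err error
	for _, f := range filters {
		switch f {
		case "FlateDecode":
			data, err = inflate(data)
		case "ASCIIHexDecode":
			data, err = unhex(data)
		default:
			err = fmt.Errorf("unsupported filter %q", f)
		}
		if err != nil {
			return nil, err
		}
	}
	return data, nil
}

// roundTrip compresses and hex-encodes text, then decodes it again.
func roundTrip(text string) (string, error) {
	var buf bytes.Buffer
	zw := zlib.NewWriter(&buf)
	if _, err := zw.Write([]byte(text)); err != nil {
		return "", err
	}
	if err := zw.Close(); err != nil {
		return "", err
	}
	encoded := hex.EncodeToString(buf.Bytes()) + ">"
	out, err := applyFilters([]byte(encoded), []string{"ASCIIHexDecode", "FlateDecode"})
	return string(out), err
}

func main() {
	out, err := roundTrip("BT /F1 12 Tf ET")
	fmt.Printf("%q %v\n", out, err)
}
```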

func (*Stream) Decoded

func (s *Stream) Decoded() ([]byte, error)

Decoded returns the decoded (decompressed) stream data. Results are cached for subsequent calls. Use Stream.Decode for full decoding.

func (*Stream) String

func (s *Stream) String() string

func (*Stream) Type

func (s *Stream) Type() ObjectType

type String

type String string

String represents a PDF string object (either literal or hexadecimal encoded).

func (String) String

func (s String) String() string

func (String) Type

func (s String) Type() ObjectType

type Token

type Token struct {
	Type         TokenType // Token type
	Value        []byte    // Raw token value (without delimiters for strings/names)
	Pos          int64     // Byte position in the input stream
	SkippedBytes []byte    // Bytes skipped as whitespace before this token (for stream data recovery)
}

Token represents a lexical token from PDF input.

type TokenType

type TokenType int

TokenType identifies the type of a lexical token.

const (
	TokenEOF         TokenType = iota // End of input
	TokenWhitespace                   // Whitespace (space, tab, newline, etc.)
	TokenComment                      // Comment (% to end of line)
	TokenKeyword                      // Keywords: true, false, null, obj, endobj, stream, endstream
	TokenInteger                      // Integer literal (e.g., 123, -45)
	TokenReal                         // Real number literal (e.g., 3.14, -0.5)
	TokenString                       // Literal string: (hello)
	TokenHexString                    // Hexadecimal string: <48656C6C6F>
	TokenName                         // Name object: /Type
	TokenArrayStart                   // Array start: [
	TokenArrayEnd                     // Array end: ]
	TokenDictStart                    // Dictionary start: <<
	TokenDictEnd                      // Dictionary end: >>
	TokenIndirectRef                  // Indirect reference marker: R
)

Token type constants for PDF lexical elements.

type XRefEntry

type XRefEntry struct {
	Type       XRefEntryType // Entry type (free, uncompressed, or compressed)
	Offset     int64         // Byte offset (uncompressed) or object stream number (compressed)
	Generation int           // Generation number (uncompressed) or index within object stream (compressed)
	InUse      bool          // True if object is in use (Type != XRefEntryFree)
}

XRefEntry represents a single entry in the cross-reference table, describing where an object is located in the PDF file.
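
For context, an entry in a traditional xref table is a fixed-width line such as "0000000017 00000 n": a 10-digit number, a 5-digit generation number, and the keyword n (in use) or f (free). The sketch below turns such a line into a struct shaped like the one above. The xrefEntry type and parseEntry function are hypothetical stand-ins; how the package itself fills in Type and Offset for free entries is not shown here.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// xrefEntry is a reduced stand-in for the documented struct.
type xrefEntry struct {
	Offset     int64 // byte offset (for free entries: next free object number)
	Generation int
	InUse      bool
}

// parseEntry parses one line of a traditional xref subsection.
func parseEntry(line string) (xrefEntry, error) {
	fields := strings.Fields(line)
	if len(fields) != 3 || len(fields[0]) != 10 || len(fields[1]) != 5 {
		return xrefEntry{}, fmt.Errorf("malformed xref entry %q", line)
	}
	off, err := strconv.ParseInt(fields[0], 10, 64)
	if err != nil {
		return xrefEntry{}, err
	}
	gen, err := strconv.Atoi(fields[1])
	if err != nil {
		return xrefEntry{}, err
	}
	switch fields[2] {
	case "n":
		return xrefEntry{Offset: off, Generation: gen, InUse: true}, nil
	case "f":
		return xrefEntry{Offset: off, Generation: gen, InUse: false}, nil
	}
	return xrefEntry{}, fmt.Errorf("unknown entry keyword %q", fields[2])
}

func main() {
	e, err := parseEntry("0000000017 00000 n \n")
	fmt.Printf("%+v %v\n", e, err)
}
```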

type XRefEntryType

type XRefEntryType int

XRefEntryType identifies the type of a cross-reference table entry.

const (
	// XRefEntryFree indicates a free (deleted) object entry.
	XRefEntryFree XRefEntryType = 0
	// XRefEntryUncompressed indicates an in-use object at a byte offset in the file.
	XRefEntryUncompressed XRefEntryType = 1
	// XRefEntryCompressed indicates an object stored in an object stream (PDF 1.5+).
	XRefEntryCompressed XRefEntryType = 2
)

XRef entry type constants.

func (XRefEntryType) String

func (t XRefEntryType) String() string

String returns a human-readable name for the entry type.

type XRefParser

type XRefParser struct {
	// contains filtered or unexported fields
}

XRefParser parses PDF cross-reference tables from a seekable reader. It supports both traditional xref tables (PDF 1.0-1.4) and xref streams (PDF 1.5+).

func NewXRefParser

func NewXRefParser(r io.ReadSeeker) *XRefParser

NewXRefParser creates a new XRef parser for the given reader.

func (*XRefParser) FindXRef

func (x *XRefParser) FindXRef() (int64, error)

FindXRef finds the byte offset of the xref table by scanning from EOF. PDF files end with "startxref\n<offset>\n%%EOF", where offset points to the xref.
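
A simplified version of this lookup, operating on the last bytes of a file that have already been read into memory, could look like the following. findStartXRef is a hypothetical helper; how much of the file tail the real method reads is not documented here.

```go
package main

import (
	"bytes"
	"errors"
	"fmt"
	"strconv"
)

// findStartXRef locates the last "startxref" keyword in tail and parses
// the decimal offset that follows it.
func findStartXRef(tail []byte) (int64, error) {
	const keyword = "startxref"
	i := bytes.LastIndex(tail, []byte(keyword))
	if i < 0 {
		return 0, errors.New("startxref keyword not found")
	}
	fields := bytes.Fields(tail[i+len(keyword):])
	if len(fields) == 0 {
		return 0, errors.New("no offset after startxref")
	}
	return strconv.ParseInt(string(fields[0]), 10, 64)
}

func main() {
	tail := []byte("trailer\n<< /Size 6 >>\nstartxref\n1234\n%%EOF\n")
	off, err := findStartXRef(tail)
	fmt.Println(off, err)
}
```

Searching for the last occurrence matters for incrementally updated files, which contain one startxref section per update.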

func (*XRefParser) ParseAllXRefs

func (x *XRefParser) ParseAllXRefs() ([]*XRefTable, error)

ParseAllXRefs parses the main xref table and all previous ones from incremental updates, following /Prev links. Returns tables in chronological order (oldest first).
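
The traversal order can be illustrated without any parsing: start at the newest xref offset, follow /Prev until a trailer has none, then reverse the list. In the sketch below, prevOf is a hypothetical stand-in for "parse the xref at this offset and read /Prev from its trailer". The loop guard is an addition made for this example; whether the package performs such a check is not stated here.

```go
package main

import "fmt"

// chain returns xref offsets oldest first, starting from the newest one.
// prevOf maps an xref offset to the /Prev offset in its trailer; offsets
// without a /Prev entry are simply absent from the map.
func chain(newest int64, prevOf map[int64]int64) ([]int64, error) {
	seen := map[int64]bool{}
	var order []int64
	off := newest
	for {
		if seen[off] {
			return nil, fmt.Errorf("xref loop at offset %d", off)
		}
		seen[off] = true
		order = append(order, off)
		prev, ok := prevOf[off]
		if !ok {
			break
		}
		off = prev
	}
	// Collected newest first; reverse into chronological order.
	for i, j := 0, len(order)-1; i < j; i, j = i+1, j-1 {
		order[i], order[j] = order[j], order[i]
	}
	return order, nil
}

func main() {
	order, err := chain(900, map[int64]int64{900: 500, 500: 100})
	fmt.Println(order, err)
}
```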

func (*XRefParser) ParsePrevXRef

func (x *XRefParser) ParsePrevXRef(table *XRefTable) (*XRefTable, error)

ParsePrevXRef checks if the trailer has a /Prev entry and parses that xref table. This handles incremental updates in PDFs, where each update adds a new xref table that points to the previous one.

func (*XRefParser) ParseXRef

func (x *XRefParser) ParseXRef(offset int64) (*XRefTable, error)

ParseXRef parses the xref table at the given byte offset. It auto-detects and handles both traditional xref tables (PDF 1.0-1.4) and xref streams (PDF 1.5+).

func (*XRefParser) ParseXRefFromEOF

func (x *XRefParser) ParseXRefFromEOF() (*XRefTable, error)

ParseXRefFromEOF locates and parses the xref table by scanning from the end of the file to find the startxref offset.

type XRefTable

type XRefTable struct {
	Entries  map[int]*XRefEntry // Map from object number to entry
	Trailer  Dict               // Trailer dictionary with /Root, /Info, /Size, etc.
	IsStream bool               // True if this XRef came from a stream (PDF 1.5+)
}

XRefTable represents a PDF cross-reference table, which maps object numbers to their locations in the file. It includes the trailer dictionary containing document-level information.

func MergeXRefTables

func MergeXRefTables(tables ...*XRefTable) *XRefTable

MergeXRefTables merges multiple xref tables from incremental updates. Tables should be provided in chronological order (oldest first); later entries override earlier ones for the same object number.
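
The override rule amounts to copying entries table by table so that later writes replace earlier ones. A minimal sketch with reduced stand-in types follows. How the real function combines trailer dictionaries is not documented here, so trailers are left out; the merge function itself is hypothetical.

```go
package main

import "fmt"

// Reduced stand-ins for the documented types.
type XRefEntry struct {
	Offset     int64
	Generation int
	InUse      bool
}

type XRefTable struct {
	Entries map[int]*XRefEntry
}

// merge copies entries from each table in turn. Tables are expected
// oldest first, so an entry from a later table replaces an earlier one
// with the same object number.
func merge(tables ...*XRefTable) *XRefTable {
	out := &XRefTable{Entries: map[int]*XRefEntry{}}
	for _, t := range tables {
		if t == nil {
			continue
		}
		for num, e := range t.Entries {
			out.Entries[num] = e
		}
	}
	return out
}

func main() {
	original := &XRefTable{Entries: map[int]*XRefEntry{
		1: {Offset: 10, InUse: true},
		2: {Offset: 20, InUse: true},
	}}
	update := &XRefTable{Entries: map[int]*XRefEntry{
		2: {Offset: 300, InUse: true}, // object 2 rewritten by an update
	}}
	merged := merge(original, update)
	fmt.Println(merged.Entries[1].Offset, merged.Entries[2].Offset)
}
```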

func NewXRefTable

func NewXRefTable() *XRefTable

NewXRefTable creates a new empty cross-reference table.

func (*XRefTable) Get

func (x *XRefTable) Get(objNum int) (*XRefEntry, bool)

Get returns the entry for the given object number and a boolean indicating whether the entry exists.

func (*XRefTable) Set

func (x *XRefTable) Set(objNum int, entry *XRefEntry)

Set adds or updates an entry for the given object number.

func (*XRefTable) Size

func (x *XRefTable) Size() int

Size returns the number of entries in the table.
