core package

Version: v1.6.6 (latest) | Published: Feb 4, 2026 | License: MIT | Imports: 7 | Imported by: 0

Repository

github.com/tsawler/tabula

Documentation ¶

Overview ¶

Package core provides low-level PDF parsing primitives and object types.

This package implements the fundamental building blocks for working with PDF files: the basic object types (null, boolean, integer, real, string, name, array, and dictionary), plus streams, indirect references, cross-reference tables, and object streams.

Object Types ¶

The package implements eight basic object types, each satisfying the Object interface:

Null - represents the PDF null object
Bool - represents PDF boolean values (true/false)
Int - represents PDF integers
Real - represents PDF real numbers (floating point)
String - represents PDF string objects (literal or hexadecimal)
Name - represents PDF name objects (e.g., /Type, /Font)
Array - represents PDF arrays
Dict - represents PDF dictionaries

Additionally, Stream represents a PDF stream (dictionary + binary data), and IndirectRef represents a reference to an indirect object.
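
The typed accessors on Dict all share one shape: look up the key, then report whether the value has the expected type. The following is a minimal, self-contained sketch of that pattern. The Object, Int, Name, and Dict declarations here are local stand-ins that mirror the documented declarations; they are not imported from this package, and the String output formats are assumptions.

```go
package main

import "fmt"

// ObjectType is a local stand-in; the constant values below do not
// necessarily match the package's own iota ordering.
type ObjectType int

const (
	ObjInt ObjectType = iota
	ObjName
	ObjDict
)

// Object mirrors the documented two-method interface.
type Object interface {
	Type() ObjectType
	String() string
}

type Int int64

func (i Int) Type() ObjectType { return ObjInt }
func (i Int) String() string   { return fmt.Sprintf("%d", int64(i)) }

// Name stores the identifier without the leading slash.
type Name string

func (n Name) Type() ObjectType { return ObjName }
func (n Name) String() string   { return "/" + string(n) }

type Dict map[string]Object

func (d Dict) Type() ObjectType { return ObjDict }
func (d Dict) String() string   { return fmt.Sprintf("<< %d entries >>", len(d)) }

// GetName returns the value only if it is present and is a Name.
func (d Dict) GetName(key string) (Name, bool) {
	n, ok := d[key].(Name)
	return n, ok
}

// GetInt returns the value only if it is present and is an Int.
func (d Dict) GetInt(key string) (Int, bool) {
	i, ok := d[key].(Int)
	return i, ok
}

func main() {
	page := Dict{"Type": Name("Page"), "Rotate": Int(90)}
	if t, ok := page.GetName("Type"); ok {
		fmt.Println("Type is", t)
	}
	if _, ok := page.GetName("Rotate"); !ok {
		fmt.Println("Rotate is present but is not a Name")
	}
}
```

The comma-ok type assertion covers both failure cases at once: a missing key yields a nil interface value, and a value of another type fails the assertion.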

Parsing ¶

The Parser type handles parsing PDF syntax from an io.Reader. It can parse individual objects or complete indirect object definitions.

The Lexer type provides tokenization of PDF input, converting raw bytes into tokens that the parser consumes.

Cross-Reference Tables ¶

The XRefTable type represents a PDF cross-reference table, which maps object numbers to their locations in the file. The XRefParser type handles parsing both traditional xref tables (PDF 1.0-1.4) and xref streams (PDF 1.5+).

Object Streams ¶

The ObjectStream type handles object streams (PDF 1.5+), which pack multiple objects into a single stream so that they can be compressed together.

Stream Decoding ¶

Streams can be compressed using various filters. The Stream.Decode method handles decompression, supporting filters like FlateDecode, ASCIIHexDecode, and ASCII85Decode.
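
FlateDecode data is zlib-wrapped deflate, so the core of that filter can be illustrated with the standard library alone. This is a sketch of the technique, not this package's implementation; the function names flateEncode and flateDecode are hypothetical, and PNG/TIFF predictors (which /DecodeParms may request) are not handled.

```go
package main

import (
	"bytes"
	"compress/zlib"
	"fmt"
	"io"
)

// flateEncode compresses data in zlib format. It exists only to build
// sample input for flateDecode.
func flateEncode(data []byte) ([]byte, error) {
	var buf bytes.Buffer
	w := zlib.NewWriter(&buf)
	if _, err := w.Write(data); err != nil {
		return nil, err
	}
	if err := w.Close(); err != nil {
		return nil, err
	}
	return buf.Bytes(), nil
}

// flateDecode inflates zlib-wrapped bytes, as carried by a stream
// whose /Filter is /FlateDecode.
func flateDecode(data []byte) ([]byte, error) {
	r, err := zlib.NewReader(bytes.NewReader(data))
	if err != nil {
		return nil, err
	}
	defer r.Close()
	return io.ReadAll(r)
}

func main() {
	raw, err := flateEncode([]byte("BT /F1 12 Tf (Hello) Tj ET"))
	if err != nil {
		fmt.Println("encode error:", err)
		return
	}
	content, err := flateDecode(raw)
	if err != nil {
		fmt.Println("decode error:", err)
		return
	}
	fmt.Printf("%d compressed bytes -> %q\n", len(raw), content)
}
```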

Index ¶

type Array
- func (a Array) Get(index int) Object
- func (a Array) GetInt(index int) (Int, bool)
- func (a Array) GetName(index int) (Name, bool)
- func (a Array) GetReal(index int) (Real, bool)
- func (a Array) Len() int
- func (a Array) String() string
- func (a Array) Type() ObjectType
type Bool
- func (b Bool) String() string
- func (b Bool) Type() ObjectType
type Dict
- func (d Dict) Delete(key string)
- func (d Dict) Get(key string) Object
- func (d Dict) GetArray(key string) (Array, bool)
- func (d Dict) GetBool(key string) (Bool, bool)
- func (d Dict) GetDict(key string) (Dict, bool)
- func (d Dict) GetIndirectRef(key string) (IndirectRef, bool)
- func (d Dict) GetInt(key string) (Int, bool)
- func (d Dict) GetName(key string) (Name, bool)
- func (d Dict) GetReal(key string) (Real, bool)
- func (d Dict) GetStream(key string) (*Stream, bool)
- func (d Dict) GetString(key string) (String, bool)
- func (d Dict) Has(key string) bool
- func (d Dict) Keys() []string
- func (d Dict) Set(key string, value Object)
- func (d Dict) String() string
- func (d Dict) Type() ObjectType
type IndirectObject
type IndirectRef
- func (r IndirectRef) String() string
- func (r IndirectRef) Type() ObjectType
type Int
- func (i Int) String() string
- func (i Int) Type() ObjectType
type Lexer
- func NewLexer(r io.Reader) *Lexer
- func (l *Lexer) NextToken() (*Token, error)
- func (l *Lexer) Peek() (byte, error)
- func (l *Lexer) ReadByte() (byte, error)
- func (l *Lexer) ReadBytes(n int) ([]byte, error)
- func (l *Lexer) SkipBytes(n int) error
- func (l *Lexer) SkipStreamEOL() error
type Name
- func (n Name) String() string
- func (n Name) Type() ObjectType
type Null
- func (n Null) String() string
- func (n Null) Type() ObjectType
type Object
type ObjectStream
- func NewObjectStream(stream *Stream) (*ObjectStream, error)
- func (os *ObjectStream) ContainsObject(objNum int) (bool, error)
- func (os *ObjectStream) Extends() *IndirectRef
- func (os *ObjectStream) First() int
- func (os *ObjectStream) GetObjectByIndex(index int) (Object, int, error)
- func (os *ObjectStream) GetObjectByNumber(objNum int) (Object, int, error)
- func (os *ObjectStream) N() int
- func (os *ObjectStream) ObjectNumbers() ([]int, error)
type ObjectType
- func (t ObjectType) String() string
type Parser
- func NewParser(r io.Reader) *Parser
- func (p *Parser) ParseIndirectObject() (*IndirectObject, error)
- func (p *Parser) ParseObject() (Object, error)
- func (p *Parser) SetReferenceResolver(resolver ReferenceResolver)
type Real
- func (r Real) String() string
- func (r Real) Type() ObjectType
type ReferenceResolver
type Stream
- func (s *Stream) Decode() ([]byte, error)
- func (s *Stream) Decoded() ([]byte, error)
- func (s *Stream) String() string
- func (s *Stream) Type() ObjectType
type String
- func (s String) String() string
- func (s String) Type() ObjectType
type Token
type TokenType
type XRefEntry
type XRefEntryType
- func (t XRefEntryType) String() string
type XRefParser
- func NewXRefParser(r io.ReadSeeker) *XRefParser
- func (x *XRefParser) FindXRef() (int64, error)
- func (x *XRefParser) ParseAllXRefs() ([]*XRefTable, error)
- func (x *XRefParser) ParsePrevXRef(table *XRefTable) (*XRefTable, error)
- func (x *XRefParser) ParseXRef(offset int64) (*XRefTable, error)
- func (x *XRefParser) ParseXRefFromEOF() (*XRefTable, error)
type XRefTable
- func MergeXRefTables(tables ...*XRefTable) *XRefTable
- func NewXRefTable() *XRefTable
- func (x *XRefTable) Get(objNum int) (*XRefEntry, bool)
- func (x *XRefTable) Set(objNum int, entry *XRefEntry)
- func (x *XRefTable) Size() int

Constants ¶

This section is empty.

Variables ¶

This section is empty.

Functions ¶

This section is empty.

Types ¶

type Array ¶

type Array []Object

Array represents a PDF array, an ordered collection of PDF objects.

func (Array) Get ¶

func (a Array) Get(index int) Object

Get returns the element at the given index, or nil if out of bounds.

func (Array) GetInt ¶

func (a Array) GetInt(index int) (Int, bool)

GetInt returns the integer at the given index, with a boolean indicating success.

func (Array) GetName ¶

func (a Array) GetName(index int) (Name, bool)

GetName returns the name at the given index, with a boolean indicating success.

func (Array) GetReal ¶

func (a Array) GetReal(index int) (Real, bool)

GetReal returns the real number at the given index, with a boolean indicating success.

func (Array) Len ¶

func (a Array) Len() int

Len returns the number of elements in the array.

func (Array) String ¶

func (a Array) String() string

func (Array) Type ¶

func (a Array) Type() ObjectType

type Bool ¶

type Bool bool

Bool represents a PDF boolean value (true or false).

func (Bool) String ¶

func (b Bool) String() string

func (Bool) Type ¶

func (b Bool) Type() ObjectType

type Dict ¶

type Dict map[string]Object

Dict represents a PDF dictionary, a collection of key-value pairs where keys are names (strings) and values are arbitrary PDF objects.

func (Dict) Delete ¶

func (d Dict) Delete(key string)

Delete removes the key and its value from the dictionary.

func (Dict) Get ¶

func (d Dict) Get(key string) Object

Get returns the value associated with the key, or nil if not present.

func (Dict) GetArray ¶

func (d Dict) GetArray(key string) (Array, bool)

GetArray returns the Array value for the key, with a boolean indicating success.

func (Dict) GetBool ¶

func (d Dict) GetBool(key string) (Bool, bool)

GetBool returns the Bool value for the key, with a boolean indicating success.

func (Dict) GetDict ¶

func (d Dict) GetDict(key string) (Dict, bool)

GetDict returns the Dict value for the key, with a boolean indicating success.

func (Dict) GetIndirectRef ¶

func (d Dict) GetIndirectRef(key string) (IndirectRef, bool)

GetIndirectRef returns the IndirectRef value for the key, with a boolean indicating success.

func (Dict) GetInt ¶

func (d Dict) GetInt(key string) (Int, bool)

GetInt returns the Int value for the key, with a boolean indicating success.

func (Dict) GetName ¶

func (d Dict) GetName(key string) (Name, bool)

GetName returns the Name value for the key, with a boolean indicating success.

func (Dict) GetReal ¶

func (d Dict) GetReal(key string) (Real, bool)

GetReal returns the Real value for the key, with a boolean indicating success.

func (Dict) GetStream ¶

func (d Dict) GetStream(key string) (*Stream, bool)

GetStream returns the Stream value for the key, with a boolean indicating success.

func (Dict) GetString ¶

func (d Dict) GetString(key string) (String, bool)

GetString returns the String value for the key, with a boolean indicating success.

func (Dict) Has ¶

func (d Dict) Has(key string) bool

Has reports whether the key exists in the dictionary.

func (Dict) Keys ¶

func (d Dict) Keys() []string

Keys returns all keys in the dictionary in an arbitrary order.

func (Dict) Set ¶

func (d Dict) Set(key string, value Object)

Set associates a value with the key in the dictionary.

func (Dict) String ¶

func (d Dict) String() string

func (Dict) Type ¶

func (d Dict) Type() ObjectType

type IndirectObject ¶

type IndirectObject struct {
	Ref    IndirectRef // Reference identifying this object
	Object Object      // The actual object value
}

IndirectObject represents an indirect object definition (e.g., "5 0 obj ... endobj"). It pairs an IndirectRef with the actual object value.

type IndirectRef ¶

type IndirectRef struct {
	Number     int // Object number
	Generation int // Generation number (usually 0)
}

IndirectRef represents an indirect object reference in PDF syntax (e.g., "5 0 R"). It references an object by its object number and generation number.

func (IndirectRef) String ¶

func (r IndirectRef) String() string

func (IndirectRef) Type ¶

func (r IndirectRef) Type() ObjectType

type Int ¶

type Int int64

Int represents a PDF integer object, stored as a 64-bit signed integer.

func (Int) String ¶

func (i Int) String() string

func (Int) Type ¶

func (i Int) Type() ObjectType

type Lexer ¶

type Lexer struct {
	// contains filtered or unexported fields
}

Lexer performs lexical analysis of PDF content, breaking the input into tokens. It handles all PDF lexical elements including strings with escape sequences, hexadecimal strings, names with # escapes, and nested parentheses.
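
As an illustration of one of these lexical rules, a name token may contain #xx escapes, where xx is a two-digit hexadecimal byte value (so /Lime#20Green denotes the name "Lime Green"). The sketch below decodes such escapes using only the standard library. The function decodeName is hypothetical and is not part of this package's API.

```go
package main

import (
	"fmt"
	"strconv"
)

// decodeName expands #xx escapes in a raw name token. The input is the
// token text without its leading slash.
func decodeName(raw string) (string, error) {
	out := make([]byte, 0, len(raw))
	for i := 0; i < len(raw); i++ {
		if raw[i] != '#' {
			out = append(out, raw[i])
			continue
		}
		// A '#' must be followed by exactly two hex digits.
		if i+2 >= len(raw) {
			return "", fmt.Errorf("truncated # escape at byte %d", i)
		}
		v, err := strconv.ParseUint(raw[i+1:i+3], 16, 8)
		if err != nil {
			return "", fmt.Errorf("invalid # escape at byte %d: %w", i, err)
		}
		out = append(out, byte(v))
		i += 2 // skip the two digits just consumed
	}
	return string(out), nil
}

func main() {
	for _, raw := range []string{"Type", "Lime#20Green", "bad#2"} {
		name, err := decodeName(raw)
		fmt.Printf("%q -> %q, err=%v\n", raw, name, err)
	}
}
```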

func NewLexer ¶

func NewLexer(r io.Reader) *Lexer

NewLexer creates a new lexer for the given reader.

func (*Lexer) NextToken ¶

func (l *Lexer) NextToken() (*Token, error)

NextToken returns the next token from the input. It skips whitespace and returns TokenEOF when the input is exhausted.

func (*Lexer) Peek ¶

func (l *Lexer) Peek() (byte, error)

Peek returns the next byte without consuming it.

func (*Lexer) ReadByte ¶

func (l *Lexer) ReadByte() (byte, error)

ReadByte reads and returns a single byte, advancing the position.

func (*Lexer) ReadBytes ¶

func (l *Lexer) ReadBytes(n int) ([]byte, error)

ReadBytes reads exactly n bytes from the underlying reader. It is intended for reading binary stream data, where tokenization is not appropriate.

func (*Lexer) SkipBytes ¶

func (l *Lexer) SkipBytes(n int) error

SkipBytes discards exactly n bytes from the underlying reader.

func (*Lexer) SkipStreamEOL ¶ added in v1.6.0

func (l *Lexer) SkipStreamEOL() error

SkipStreamEOL skips the mandatory end-of-line marker after the 'stream' keyword. Per the PDF specification, this is either a single LF (0x0A) or CR+LF (0x0D 0x0A).
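
The rule can be sketched over a bufio.Reader as follows. This is a strict, standalone illustration rather than this package's code: the names skipStreamEOL and payloadAfterKeyword are hypothetical, and a real lexer may choose to be more lenient with malformed files.

```go
package main

import (
	"bufio"
	"fmt"
	"io"
	"strings"
)

// skipStreamEOL consumes the end-of-line marker that follows the
// 'stream' keyword: either LF or CR+LF. Anything else is an error.
func skipStreamEOL(r *bufio.Reader) error {
	b, err := r.ReadByte()
	if err != nil {
		return err
	}
	if b == '\n' {
		return nil
	}
	if b == '\r' {
		next, err := r.ReadByte()
		if err != nil {
			return err
		}
		if next == '\n' {
			return nil
		}
		return fmt.Errorf("expected LF after CR, got 0x%02X", next)
	}
	return fmt.Errorf("expected EOL after stream keyword, got 0x%02X", b)
}

// payloadAfterKeyword takes the text that follows the 'stream' keyword
// and returns everything after the end-of-line marker.
func payloadAfterKeyword(rest string) (string, error) {
	r := bufio.NewReader(strings.NewReader(rest))
	if err := skipStreamEOL(r); err != nil {
		return "", err
	}
	data, err := io.ReadAll(r)
	if err != nil {
		return "", err
	}
	return string(data), nil
}

func main() {
	for _, in := range []string{"\nDATA", "\r\nDATA", "\rDATA"} {
		out, err := payloadAfterKeyword(in)
		fmt.Printf("%q -> %q, err=%v\n", in, out, err)
	}
}
```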

type Name ¶

type Name string

Name represents a PDF name object, used as an identifier (e.g., /Type, /Font). The leading slash is not stored; it is added in String() output.

func (Name) String ¶

func (n Name) String() string

func (Name) Type ¶

func (n Name) Type() ObjectType

type Null ¶

type Null struct{}

Null represents the PDF null object, which denotes the absence of a value.

func (Null) String ¶

func (n Null) String() string

func (Null) Type ¶

func (n Null) Type() ObjectType

type Object ¶

type Object interface {
	// Type returns the ObjectType identifying this object's type.
	Type() ObjectType
	// String returns a PDF-syntax string representation of the object.
	String() string
}

Object is the interface implemented by all PDF object types. Every PDF object can report its type and provide a string representation.

type ObjectStream ¶

type ObjectStream struct {
	// contains filtered or unexported fields
}

ObjectStream represents a PDF Object Stream (Type /ObjStm), introduced in PDF 1.5. Object streams store multiple objects in a single compressed stream, providing better compression than storing objects individually.

func NewObjectStream ¶

func NewObjectStream(stream *Stream) (*ObjectStream, error)

NewObjectStream creates an ObjectStream from a Stream object. The stream must have Type /ObjStm and required entries /N and /First. Returns an error if the stream is not a valid object stream.

func (*ObjectStream) ContainsObject ¶

func (os *ObjectStream) ContainsObject(objNum int) (bool, error)

ContainsObject reports whether the given object number is stored in this stream.

func (*ObjectStream) Extends ¶

func (os *ObjectStream) Extends() *IndirectRef

Extends returns the reference to another object stream this one extends, or nil.

func (*ObjectStream) First ¶

func (os *ObjectStream) First() int

First returns the byte offset to the first object's data in the decoded stream. The header (object number/offset pairs) precedes this offset.

func (*ObjectStream) GetObjectByIndex ¶

func (os *ObjectStream) GetObjectByIndex(index int) (Object, int, error)

GetObjectByIndex extracts an object by its index within the stream (0-based). Returns the object, its object number, and any error. The index corresponds to the position in the header, not the object number.
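
To make the header layout concrete: the decoded stream begins with N pairs of integers (object number, then byte offset relative to First), and the object data starts at First. The sketch below parses such a header and slices out one object's bytes by index. It is a standalone illustration with hypothetical names (objEntry, parseHeader, objectBytes, objectAt), not this package's implementation, and it returns raw bytes rather than a parsed Object.

```go
package main

import (
	"bytes"
	"fmt"
	"strconv"
	"strings"
)

// objEntry is one header pair: an object number and the object's byte
// offset relative to the First offset.
type objEntry struct {
	Number int
	Offset int
}

// parseHeader reads n (number, offset) pairs from decoded[:first].
func parseHeader(decoded []byte, n, first int) ([]objEntry, error) {
	if n < 0 || first < 0 || first > len(decoded) {
		return nil, fmt.Errorf("invalid N=%d or First=%d", n, first)
	}
	fields := strings.Fields(string(decoded[:first]))
	if len(fields) < 2*n {
		return nil, fmt.Errorf("header has %d integers, need %d", len(fields), 2*n)
	}
	entries := make([]objEntry, n)
	for i := range entries {
		num, err := strconv.Atoi(fields[2*i])
		if err != nil {
			return nil, err
		}
		off, err := strconv.Atoi(fields[2*i+1])
		if err != nil {
			return nil, err
		}
		entries[i] = objEntry{Number: num, Offset: off}
	}
	return entries, nil
}

// objectBytes returns the raw bytes of the object at the given index.
// An object ends where the next one begins, or at the end of the data.
func objectBytes(decoded []byte, entries []objEntry, first, index int) ([]byte, error) {
	if index < 0 || index >= len(entries) {
		return nil, fmt.Errorf("index %d out of range", index)
	}
	start := first + entries[index].Offset
	end := len(decoded)
	if index+1 < len(entries) {
		end = first + entries[index+1].Offset
	}
	if start < first || start > end || end > len(decoded) {
		return nil, fmt.Errorf("offsets out of range for index %d", index)
	}
	return bytes.TrimSpace(decoded[start:end]), nil
}

// objectAt combines both steps and returns the object number and raw text.
func objectAt(decoded string, n, first, index int) (int, string, error) {
	data := []byte(decoded)
	entries, err := parseHeader(data, n, first)
	if err != nil {
		return 0, "", err
	}
	raw, err := objectBytes(data, entries, first, index)
	if err != nil {
		return 0, "", err
	}
	return entries[index].Number, string(raw), nil
}

func main() {
	// Header "10 0 11 11 " is 11 bytes long, so First is 11.
	decoded := "10 0 11 11 << /A 1 >> (hi)"
	for i := 0; i < 2; i++ {
		num, raw, err := objectAt(decoded, 2, 11, i)
		fmt.Println(i, num, raw, err)
	}
}
```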

func (*ObjectStream) GetObjectByNumber ¶

func (os *ObjectStream) GetObjectByNumber(objNum int) (Object, int, error)

GetObjectByNumber finds and extracts an object by its object number. Returns the object, its index within the stream, and any error.

func (*ObjectStream) N ¶

func (os *ObjectStream) N() int

N returns the number of objects stored in the stream.

func (*ObjectStream) ObjectNumbers ¶

func (os *ObjectStream) ObjectNumbers() ([]int, error)

ObjectNumbers returns a slice of all object numbers stored in this stream.

type ObjectType ¶

type ObjectType int

ObjectType identifies the type of a PDF object.

const (
	ObjNull     ObjectType = iota // Null object
	ObjBool                       // Boolean (true/false)
	ObjInt                        // Integer
	ObjReal                       // Real number (floating point)
	ObjString                     // String (literal or hexadecimal)
	ObjName                       // Name object (e.g., /Type)
	ObjArray                      // Array
	ObjDict                       // Dictionary
	ObjStream                     // Stream (dictionary + data)
	ObjIndirect                   // Indirect reference (e.g., "5 0 R")
)

PDF object type constants.

func (ObjectType) String ¶

func (t ObjectType) String() string

String returns a human-readable name for the object type.

type Parser ¶

type Parser struct {
	// contains filtered or unexported fields
}

Parser parses PDF objects from an io.Reader using a Lexer for tokenization. It supports parsing all PDF object types including indirect objects and streams.

func NewParser ¶

func NewParser(r io.Reader) *Parser

NewParser creates a new PDF parser for the given reader. It initializes the lexer and loads the first two tokens for lookahead.
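
One plausible reason for two tokens of lookahead (an assumption; the source does not state the motivation) is that an integer is ambiguous until the parser has seen what follows it: "5 0 R" is a single indirect reference, while "5 0" is two integers. The sketch below demonstrates that disambiguation over whitespace-separated tokens. The function classify and its output format are hypothetical and unrelated to this package's Token type.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// isUint reports whether s is a non-negative decimal integer.
func isUint(s string) bool {
	_, err := strconv.ParseUint(s, 10, 64)
	return err == nil
}

// classify walks the tokens and folds each "num gen R" triple into a
// single reference, looking two tokens ahead of the current one.
func classify(input string) []string {
	toks := strings.Fields(input)
	var out []string
	for i := 0; i < len(toks); i++ {
		if i+2 < len(toks) && isUint(toks[i]) && isUint(toks[i+1]) && toks[i+2] == "R" {
			out = append(out, "ref("+toks[i]+","+toks[i+1]+")")
			i += 2 // the generation number and the R marker are consumed
			continue
		}
		out = append(out, "tok("+toks[i]+")")
	}
	return out
}

func main() {
	fmt.Println(classify("5 0 R 7 0"))
}
```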

func (*Parser) ParseIndirectObject ¶

func (p *Parser) ParseIndirectObject() (*IndirectObject, error)

ParseIndirectObject parses an indirect object definition. Format: "num gen obj <object> endobj" or "num gen obj <dict> stream ... endstream endobj"

func (*Parser) ParseObject ¶

func (p *Parser) ParseObject() (Object, error)

ParseObject parses and returns the next PDF object from the input. It handles all PDF object types: null, boolean, integer, real, string, name, array, dictionary, and indirect references.

func (*Parser) SetReferenceResolver ¶ added in v1.5.6

func (p *Parser) SetReferenceResolver(resolver ReferenceResolver)

SetReferenceResolver sets the reference resolver for the parser. This is needed to resolve indirect stream lengths.

type Real ¶

type Real float64

Real represents a PDF real number (floating-point), stored as float64.

func (Real) String ¶

func (r Real) String() string

func (Real) Type ¶

func (r Real) Type() ObjectType

type ReferenceResolver ¶ added in v1.5.6

type ReferenceResolver interface {
	ResolveReference(ref IndirectRef) (Object, error)
}

ReferenceResolver is an interface for resolving indirect references. This allows the parser to resolve indirect stream lengths when needed.
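
A stream's /Length entry may itself be an indirect reference (for example "8 0 R"), in which case the byte count lives in another object. The sketch below shows how a resolver with this interface shape could supply that value. All types here are simplified local stand-ins rather than imports from this package, and mapResolver and streamLength are hypothetical names.

```go
package main

import (
	"errors"
	"fmt"
	"strconv"
)

// Simplified stand-ins for the documented types.
type Object interface{ String() string }

type Int int64

func (i Int) String() string { return strconv.FormatInt(int64(i), 10) }

type IndirectRef struct {
	Number     int
	Generation int
}

func (r IndirectRef) String() string {
	return fmt.Sprintf("%d %d R", r.Number, r.Generation)
}

// ReferenceResolver has the same method set as the documented interface.
type ReferenceResolver interface {
	ResolveReference(ref IndirectRef) (Object, error)
}

// mapResolver is a toy resolver backed by an in-memory map.
type mapResolver map[IndirectRef]Object

func (m mapResolver) ResolveReference(ref IndirectRef) (Object, error) {
	obj, ok := m[ref]
	if !ok {
		return nil, fmt.Errorf("object %s not found", ref)
	}
	return obj, nil
}

// streamLength returns the byte count for a /Length value that is
// either a direct Int or an IndirectRef pointing at an Int.
func streamLength(length Object, res ReferenceResolver) (int64, error) {
	switch v := length.(type) {
	case Int:
		return int64(v), nil
	case IndirectRef:
		if res == nil {
			return 0, errors.New("indirect /Length but no resolver set")
		}
		obj, err := res.ResolveReference(v)
		if err != nil {
			return 0, err
		}
		n, ok := obj.(Int)
		if !ok {
			return 0, fmt.Errorf("/Length %s does not resolve to an integer", v)
		}
		return int64(n), nil
	default:
		return 0, fmt.Errorf("unsupported /Length type %T", length)
	}
}

func main() {
	res := mapResolver{IndirectRef{Number: 8}: Int(42)}
	fmt.Println(streamLength(IndirectRef{Number: 8}, res))
	fmt.Println(streamLength(Int(7), nil))
}
```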

type Stream ¶

type Stream struct {
	Dict Dict   // Stream dictionary containing metadata and filter information
	Data []byte // Raw (possibly compressed) stream data
	// contains filtered or unexported fields
}

Stream represents a PDF stream object, consisting of a dictionary and binary data. Streams are used for content that may be compressed or filtered, such as page content, images, and fonts.

func (*Stream) Decode ¶

func (s *Stream) Decode() ([]byte, error)

Decode decodes the stream data according to the Filter(s) specified in the stream dictionary. It supports FlateDecode, ASCIIHexDecode, ASCII85Decode, and filter chains. Returns the decoded data or an error.
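
As an example of one of these filters, ASCIIHexDecode turns pairs of hexadecimal digits into bytes, ignores whitespace, stops at the '>' end-of-data marker, and treats a trailing odd digit as if it were followed by 0. The following standard-library sketch shows those rules; asciiHexDecode is a hypothetical name and this is not the package's own code.

```go
package main

import (
	"encoding/hex"
	"fmt"
)

// asciiHexDecode decodes ASCIIHexDecode data up to the '>' marker.
func asciiHexDecode(encoded string) (string, error) {
	digits := make([]byte, 0, len(encoded))
scan:
	for i := 0; i < len(encoded); i++ {
		switch c := encoded[i]; c {
		case '>':
			break scan // end-of-data marker
		case ' ', '\t', '\r', '\n', '\f', 0:
			// whitespace is ignored
		default:
			digits = append(digits, c)
		}
	}
	// A final odd digit is treated as if followed by '0'.
	if len(digits)%2 == 1 {
		digits = append(digits, '0')
	}
	out := make([]byte, hex.DecodedLen(len(digits)))
	if _, err := hex.Decode(out, digits); err != nil {
		return "", err
	}
	return string(out), nil
}

func main() {
	for _, in := range []string{"48656C6C6F>", "48 65 6C\n6C 6F>", "4G>"} {
		out, err := asciiHexDecode(in)
		fmt.Printf("%q -> %q, err=%v\n", in, out, err)
	}
}
```

In a filter chain, the output of one such function would become the input of the next, in the order the filters are listed in the stream dictionary.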

func (*Stream) Decoded ¶

func (s *Stream) Decoded() ([]byte, error)

Decoded returns the decoded (decompressed) stream data. Results are cached for subsequent calls. Use Stream.Decode for full decoding.

func (*Stream) String ¶

func (s *Stream) String() string

func (*Stream) Type ¶

func (s *Stream) Type() ObjectType

type String ¶

type String string

String represents a PDF string object (either literal or hexadecimal encoded).

func (String) String ¶

func (s String) String() string

func (String) Type ¶

func (s String) Type() ObjectType

type Token ¶

type Token struct {
	Type         TokenType // Token type
	Value        []byte    // Raw token value (without delimiters for strings/names)
	Pos          int64     // Byte position in the input stream
	SkippedBytes []byte    // Bytes skipped as whitespace before this token (for stream data recovery)
}

Token represents a lexical token from PDF input.

type TokenType ¶

type TokenType int

TokenType identifies the type of a lexical token.

const (
	TokenEOF         TokenType = iota // End of input
	TokenWhitespace                   // Whitespace (space, tab, newline, etc.)
	TokenComment                      // Comment (% to end of line)
	TokenKeyword                      // Keywords: true, false, null, obj, endobj, stream, endstream
	TokenInteger                      // Integer literal (e.g., 123, -45)
	TokenReal                         // Real number literal (e.g., 3.14, -0.5)
	TokenString                       // Literal string: (hello)
	TokenHexString                    // Hexadecimal string: <48656C6C6F>
	TokenName                         // Name object: /Type
	TokenArrayStart                   // Array start: [
	TokenArrayEnd                     // Array end: ]
	TokenDictStart                    // Dictionary start: <<
	TokenDictEnd                      // Dictionary end: >>
	TokenIndirectRef                  // Indirect reference marker: R
)

Token type constants for PDF lexical elements.

type XRefEntry ¶

type XRefEntry struct {
	Type       XRefEntryType // Entry type (free, uncompressed, or compressed)
	Offset     int64         // Byte offset (uncompressed) or object stream number (compressed)
	Generation int           // Generation number (uncompressed) or index within object stream (compressed)
	InUse      bool          // True if object is in use (Type != XRefEntryFree)
}

XRefEntry represents a single entry in the cross-reference table, describing where an object is located in the PDF file.
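
For a traditional xref table, each entry is a fixed-width line such as "0000000017 00000 n": a 10-digit byte offset, a 5-digit generation number, and a flag that is n (in use) or f (free). The sketch below parses one such line into a local struct. The names xrefEntry and parseXRefLine are hypothetical; the struct is a trimmed stand-in for the documented XRefEntry and does not cover compressed entries from xref streams.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// xrefEntry is a trimmed stand-in for the documented XRefEntry.
type xrefEntry struct {
	Offset     int64
	Generation int
	InUse      bool
}

// parseXRefLine parses one traditional xref entry line.
func parseXRefLine(line string) (xrefEntry, error) {
	fields := strings.Fields(line)
	if len(fields) != 3 {
		return xrefEntry{}, fmt.Errorf("expected 3 fields, got %d", len(fields))
	}
	off, err := strconv.ParseInt(fields[0], 10, 64)
	if err != nil {
		return xrefEntry{}, fmt.Errorf("bad offset: %w", err)
	}
	gen, err := strconv.Atoi(fields[1])
	if err != nil {
		return xrefEntry{}, fmt.Errorf("bad generation: %w", err)
	}
	switch fields[2] {
	case "n":
		return xrefEntry{Offset: off, Generation: gen, InUse: true}, nil
	case "f":
		return xrefEntry{Offset: off, Generation: gen, InUse: false}, nil
	}
	return xrefEntry{}, fmt.Errorf("unknown entry flag %q", fields[2])
}

func main() {
	for _, line := range []string{"0000000017 00000 n", "0000000000 65535 f"} {
		e, err := parseXRefLine(line)
		fmt.Printf("%+v err=%v\n", e, err)
	}
}
```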

type XRefEntryType ¶

type XRefEntryType int

XRefEntryType identifies the type of a cross-reference table entry.

const (
	// XRefEntryFree indicates a free (deleted) object entry.
	XRefEntryFree XRefEntryType = 0
	// XRefEntryUncompressed indicates an in-use object at a byte offset in the file.
	XRefEntryUncompressed XRefEntryType = 1
	// XRefEntryCompressed indicates an object stored in an object stream (PDF 1.5+).
	XRefEntryCompressed XRefEntryType = 2
)

XRef entry type constants.

func (XRefEntryType) String ¶

func (t XRefEntryType) String() string

String returns a human-readable name for the entry type.

type XRefParser ¶

type XRefParser struct {
	// contains filtered or unexported fields
}

XRefParser parses PDF cross-reference tables from a seekable reader. It supports both traditional xref tables (PDF 1.0-1.4) and xref streams (PDF 1.5+).

func NewXRefParser ¶

func NewXRefParser(r io.ReadSeeker) *XRefParser

NewXRefParser creates a new XRef parser for the given reader.

func (*XRefParser) FindXRef ¶

func (x *XRefParser) FindXRef() (int64, error)

FindXRef finds the byte offset of the xref table by scanning from EOF. PDF files end with "startxref\n<offset>\n%%EOF", where offset points to the xref.
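
The scan can be sketched with the standard library: read the tail of the file, find the last "startxref" keyword, and parse the integer after it. This is an illustration, not this package's code. The names findStartXRef and findStartXRefIn are hypothetical, and the 1024-byte tail window is an assumption chosen for the example.

```go
package main

import (
	"bytes"
	"errors"
	"fmt"
	"io"
	"strconv"
	"strings"
)

// findStartXRef returns the offset recorded after the last "startxref"
// keyword within the final tailSize bytes of r.
func findStartXRef(r io.ReadSeeker) (int64, error) {
	const tailSize = 1024 // assumed window; not taken from the package

	size, err := r.Seek(0, io.SeekEnd)
	if err != nil {
		return 0, err
	}
	start := size - tailSize
	if start < 0 {
		start = 0
	}
	if _, err := r.Seek(start, io.SeekStart); err != nil {
		return 0, err
	}
	tail, err := io.ReadAll(r)
	if err != nil {
		return 0, err
	}

	keyword := []byte("startxref")
	idx := bytes.LastIndex(tail, keyword)
	if idx < 0 {
		return 0, errors.New("startxref not found")
	}
	// The first whitespace-delimited field after the keyword is the offset.
	fields := bytes.Fields(tail[idx+len(keyword):])
	if len(fields) == 0 {
		return 0, errors.New("startxref has no offset")
	}
	return strconv.ParseInt(string(fields[0]), 10, 64)
}

// findStartXRefIn is a convenience wrapper for in-memory data.
func findStartXRefIn(pdf string) (int64, error) {
	return findStartXRef(strings.NewReader(pdf))
}

func main() {
	pdf := "%PDF-1.4\n1 0 obj\nendobj\nstartxref\n1234\n%%EOF\n"
	fmt.Println(findStartXRefIn(pdf))
}
```

Searching for the last occurrence matters for incrementally updated files, which contain one startxref per update.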

func (*XRefParser) ParseAllXRefs ¶

func (x *XRefParser) ParseAllXRefs() ([]*XRefTable, error)

ParseAllXRefs parses the main xref table and all previous ones from incremental updates, following /Prev links. Returns tables in chronological order (oldest first).

func (*XRefParser) ParsePrevXRef ¶

func (x *XRefParser) ParsePrevXRef(table *XRefTable) (*XRefTable, error)

ParsePrevXRef checks if the trailer has a /Prev entry and parses that xref table. This handles incremental updates in PDFs, where each update adds a new xref table that points to the previous one.

func (*XRefParser) ParseXRef ¶

func (x *XRefParser) ParseXRef(offset int64) (*XRefTable, error)

ParseXRef parses the xref table at the given byte offset. It auto-detects and handles both traditional xref tables (PDF 1.0-1.4) and xref streams (PDF 1.5+).

func (*XRefParser) ParseXRefFromEOF ¶

func (x *XRefParser) ParseXRefFromEOF() (*XRefTable, error)

ParseXRefFromEOF locates and parses the xref table by scanning from the end of the file to find the startxref offset.

type XRefTable ¶

type XRefTable struct {
	Entries  map[int]*XRefEntry // Map from object number to entry
	Trailer  Dict               // Trailer dictionary with /Root, /Info, /Size, etc.
	IsStream bool               // True if this XRef came from a stream (PDF 1.5+)
}

XRefTable represents a PDF cross-reference table, which maps object numbers to their locations in the file. It includes the trailer dictionary containing document-level information.

func MergeXRefTables ¶

func MergeXRefTables(tables ...*XRefTable) *XRefTable

MergeXRefTables merges multiple xref tables from incremental updates. Tables should be provided in chronological order (oldest first); later entries override earlier ones for the same object number.
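
The override rule amounts to applying the tables in order, so that each later table overwrites entries with the same object number. The sketch below shows that rule on plain maps. The entry type and mergeTables function are hypothetical stand-ins, and trailer handling, which the real function presumably also deals with, is not covered here.

```go
package main

import "fmt"

// entry is a trimmed stand-in for a cross-reference entry.
type entry struct {
	Offset int64
	InUse  bool
}

// mergeTables applies tables oldest first; a later table replaces any
// earlier entry that has the same object number.
func mergeTables(tables ...map[int]entry) map[int]entry {
	merged := make(map[int]entry)
	for _, t := range tables {
		for num, e := range t {
			merged[num] = e
		}
	}
	return merged
}

func main() {
	original := map[int]entry{
		1: {Offset: 10, InUse: true},
		2: {Offset: 50, InUse: true},
	}
	update := map[int]entry{
		2: {Offset: 300, InUse: true}, // object 2 was rewritten
		3: {Offset: 400, InUse: true}, // object 3 was added
	}
	merged := mergeTables(original, update)
	fmt.Println(len(merged), merged[2].Offset)
}
```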

func NewXRefTable ¶

func NewXRefTable() *XRefTable

NewXRefTable creates a new empty cross-reference table.

func (*XRefTable) Get ¶

func (x *XRefTable) Get(objNum int) (*XRefEntry, bool)

Get returns the entry for the given object number and a boolean indicating whether the entry exists.

func (*XRefTable) Set ¶

func (x *XRefTable) Set(objNum int, entry *XRefEntry)

Set adds or updates an entry for the given object number.

func (*XRefTable) Size ¶

func (x *XRefTable) Size() int

Size returns the number of entries in the table.
