lex

package
v0.0.3
Published: Jun 19, 2026 License: MIT Imports: 2 Imported by: 0

Documentation

Overview

Package lex tokenises PDF input. It handles the lexical layer of PDF objects: whitespace, comments, names, numbers, strings, array and dictionary delimiters, and the stream/endstream/obj/endobj/R/null/true/false keywords. It does not assemble higher-level structures; the parser layered above it turns token streams into Object trees.

Constants

This section is empty.

Variables

var ErrUnexpectedEOF = errors.New("pdfdisassembler/lex: unexpected EOF")

ErrUnexpectedEOF indicates that the lexer ran out of bytes mid-token.

Functions

func IsDelimiter

func IsDelimiter(c byte) bool

IsDelimiter reports whether c is a PDF delimiter character (§7.2.2).

func IsRegular

func IsRegular(c byte) bool

IsRegular reports whether c is a regular character (neither whitespace nor delimiter).

func IsWhitespace

func IsWhitespace(c byte) bool

IsWhitespace reports whether c is a PDF whitespace character (§7.2.2).
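
The three character classes can be illustrated with a standalone sketch. The functions below are local re-implementations written from the character sets in the PDF specification, not this package's source; the lower-case names are hypothetical and chosen so they do not clash with the exported ones.

```go
package main

import "fmt"

// isWhitespace reports whether c is one of the six PDF whitespace
// characters: NUL, HT, LF, FF, CR or SP.
func isWhitespace(c byte) bool {
	switch c {
	case 0x00, 0x09, 0x0A, 0x0C, 0x0D, 0x20:
		return true
	}
	return false
}

// isDelimiter reports whether c is one of the PDF delimiter
// characters: ( ) < > [ ] { } / %
func isDelimiter(c byte) bool {
	switch c {
	case '(', ')', '<', '>', '[', ']', '{', '}', '/', '%':
		return true
	}
	return false
}

// isRegular reports whether c is neither whitespace nor a delimiter.
func isRegular(c byte) bool {
	return !isWhitespace(c) && !isDelimiter(c)
}

func main() {
	// Classify every byte of a short PDF fragment.
	for _, c := range []byte("/Type 12") {
		fmt.Printf("%q whitespace=%t delimiter=%t regular=%t\n",
			c, isWhitespace(c), isDelimiter(c), isRegular(c))
	}
}
```

Every byte falls into exactly one of the three classes, which is why the regular class can be defined by exclusion.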

Types

type Kind

type Kind int

Kind identifies a token's lexical category.

const (
	// EOF marks end of input.
	EOF Kind = iota
	// Name is a PDF name without the leading slash.
	Name
	// Integer is a literal integer (no decimal point, optional sign).
	Integer
	// Real is a literal real number (has a decimal point or 'e' exponent —
	// PDF does not actually allow exponents but we accept them).
	Real
	// LitString is a parenthesised literal string with escapes already
	// resolved.
	LitString
	// HexString is an angle-bracketed hex string with hex pairs already
	// decoded to bytes.
	HexString
	// ArrayStart is the '[' token.
	ArrayStart
	// ArrayEnd is the ']' token.
	ArrayEnd
	// DictStart is the '<<' token.
	DictStart
	// DictEnd is the '>>' token.
	DictEnd
	// Keyword is any unquoted identifier: true, false, null, obj, endobj,
	// stream, endstream, R, xref, trailer, startxref, n, f.
	Keyword
)

func (Kind) String

func (k Kind) String() string
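
As an illustration of what "hex pairs already decoded to bytes" means for a HexString token, here is a minimal standalone sketch of hex-string decoding. It follows the PDF specification's rules (whitespace between digits is ignored, and a missing final digit is treated as 0). The helper names hexVal and decodeHexBody are hypothetical, and this page does not show how the package itself handles malformed input, so the error behaviour here is an assumption.

```go
package main

import (
	"errors"
	"fmt"
)

// hexVal returns the numeric value of a hex digit, or -1 if c is not one.
func hexVal(c byte) int {
	switch {
	case c >= '0' && c <= '9':
		return int(c - '0')
	case c >= 'a' && c <= 'f':
		return int(c-'a') + 10
	case c >= 'A' && c <= 'F':
		return int(c-'A') + 10
	}
	return -1
}

// decodeHexBody decodes the text found between '<' and '>'.
// Assumption: any byte that is neither whitespace nor a hex digit is an error.
func decodeHexBody(body []byte) ([]byte, error) {
	out := make([]byte, 0, (len(body)+1)/2)
	hi := -1 // pending high nibble, or -1 when none is pending
	for _, c := range body {
		switch c {
		case 0x00, '\t', '\n', '\f', '\r', ' ':
			continue // PDF whitespace inside a hex string is ignored
		}
		v := hexVal(c)
		if v < 0 {
			return nil, errors.New("invalid hex digit")
		}
		if hi < 0 {
			hi = v
			continue
		}
		out = append(out, byte(hi<<4|v))
		hi = -1
	}
	if hi >= 0 {
		// Odd number of digits: the final digit is assumed to be followed by 0.
		out = append(out, byte(hi<<4))
	}
	return out, nil
}

func main() {
	for _, in := range []string{"48 65 6C 6C 6F", "901FA"} {
		out, err := decodeHexBody([]byte(in))
		fmt.Printf("%-16q -> % X (err=%v)\n", in, out, err)
	}
}
```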

type Lexer

type Lexer struct {
	// contains filtered or unexported fields
}

Lexer converts a byte slice into a stream of Tokens. It is not safe for concurrent use.

func New

func New(src []byte) *Lexer

New creates a Lexer over src. The src slice is not copied.

func (*Lexer) Next

func (l *Lexer) Next() (Token, error)

Next returns the next token. At EOF it returns a Token with Kind=EOF.
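
A typical consumer calls Next in a loop until it sees a token of kind EOF. The sketch below shows that loop shape only. Because this page does not give the module's import path, Kind, Token, tokenSource and sliceSource are local stand-ins that mirror the documented signatures; they are not the package's types. Whether the real Next returns a nil error together with the EOF token is also an assumption.

```go
package main

import "fmt"

// Local stand-ins mirroring the documented Kind and Token shapes.
type Kind int

const (
	EOF Kind = iota
	Name
	Integer
)

type Token struct {
	Kind   Kind
	Bytes  []byte
	Offset int64
}

// tokenSource is a hypothetical interface with the same Next signature
// as the documented Lexer method.
type tokenSource interface {
	Next() (Token, error)
}

// sliceSource is a fake source that replays canned tokens and then
// reports EOF on every further call.
type sliceSource struct {
	toks []Token
	i    int
}

func (s *sliceSource) Next() (Token, error) {
	if s.i >= len(s.toks) {
		return Token{Kind: EOF}, nil
	}
	t := s.toks[s.i]
	s.i++
	return t, nil
}

// collect drains src until the EOF token or the first error.
func collect(src tokenSource) ([]Token, error) {
	var out []Token
	for {
		tok, err := src.Next()
		if err != nil {
			return out, err
		}
		if tok.Kind == EOF {
			return out, nil
		}
		out = append(out, tok)
	}
}

func main() {
	src := &sliceSource{toks: []Token{
		{Kind: Name, Bytes: []byte("Type"), Offset: 0},
		{Kind: Integer, Bytes: []byte("12"), Offset: 6},
	}}
	toks, err := collect(src)
	if err != nil {
		fmt.Println("lex error:", err)
		return
	}
	for _, t := range toks {
		fmt.Printf("kind=%d bytes=%q offset=%d\n", t.Kind, t.Bytes, t.Offset)
	}
}
```

Checking the error before the kind keeps the loop correct whichever way the real lexer reports a truncated token.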

func (*Lexer) Pos

func (l *Lexer) Pos() int

Pos returns the current byte offset.

func (*Lexer) ReadStreamData

func (l *Lexer) ReadStreamData(length int) ([]byte, error)

ReadStreamData consumes length raw stream bytes, starting at the current position. It honours the spec's EOL handling: a single LF or CRLF immediately after the "stream" keyword belongs to the keyword line, not to the stream content. Call it only after the "stream" keyword token has been consumed.
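
The EOL rule can be sketched as a standalone function over a byte slice. This is an illustration of the rule as described, not the package's implementation: readStreamData is a hypothetical helper that takes the position explicitly, and it uses io.ErrUnexpectedEOF from the standard library as a stand-in for the package's own sentinel error.

```go
package main

import (
	"fmt"
	"io"
)

// readStreamData returns length bytes of stream content from src.
// pos must be the offset just past the "stream" keyword. One LF or one
// CRLF at pos is skipped first. The second result is the offset just
// past the returned data.
func readStreamData(src []byte, pos, length int) ([]byte, int, error) {
	if pos < 0 || pos > len(src) || length < 0 {
		return nil, pos, fmt.Errorf("invalid position %d or length %d", pos, length)
	}
	switch {
	case pos+1 < len(src) && src[pos] == '\r' && src[pos+1] == '\n':
		pos += 2 // CRLF ends the keyword line
	case pos < len(src) && src[pos] == '\n':
		pos++ // a lone LF ends the keyword line
	}
	if length > len(src)-pos {
		// Stand-in for the package's ErrUnexpectedEOF.
		return nil, pos, io.ErrUnexpectedEOF
	}
	return src[pos : pos+length], pos + length, nil
}

func main() {
	src := []byte("stream\r\nHELLO\nendstream")
	data, next, err := readStreamData(src, len("stream"), 5)
	fmt.Printf("%q %d %v\n", data, next, err)
}
```

Only one end-of-line marker is skipped, so a stream whose content itself begins with a newline byte keeps that byte.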

func (*Lexer) Remaining

func (l *Lexer) Remaining() []byte

Remaining returns the unread portion of the source.

func (*Lexer) SetPos

func (l *Lexer) SetPos(p int)

SetPos moves the lexer to byte offset p, rewinding or fast-forwarding as needed.

func (*Lexer) SkipWhitespace

func (l *Lexer) SkipWhitespace()

SkipWhitespace advances over PDF whitespace and comments.

func (*Lexer) Source

func (l *Lexer) Source() []byte

Source returns the underlying source slice.

type Token

type Token struct {
	Kind   Kind
	Bytes  []byte
	Offset int64 // byte offset in the input where this token started
}

Token is a single lexical unit. Bytes carries the token payload; its meaning depends on Kind:

  • Name, Keyword: ASCII name body, no leading slash
  • Integer, Real: literal digits
  • LitString, HexString: decoded bytes
  • ArrayStart, ArrayEnd, DictStart, DictEnd, EOF: empty
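
Since Integer and Real tokens carry the literal text rather than a parsed value, a caller has to convert Bytes itself. The sketch below does so with strconv. Kind and Token are local stand-ins mirroring the documented declarations, numberValue is a hypothetical helper, and the sketch assumes that any sign character is included in Bytes, which this page does not state.

```go
package main

import (
	"fmt"
	"strconv"
)

// Local stand-ins mirroring the first few documented Kind values.
type Kind int

const (
	EOF Kind = iota
	Name
	Integer
	Real
)

type Token struct {
	Kind   Kind
	Bytes  []byte
	Offset int64
}

// numberValue converts an Integer or Real token to a float64.
// Any other kind yields an error that names the token's offset.
func numberValue(t Token) (float64, error) {
	switch t.Kind {
	case Integer:
		n, err := strconv.ParseInt(string(t.Bytes), 10, 64)
		return float64(n), err
	case Real:
		return strconv.ParseFloat(string(t.Bytes), 64)
	}
	return 0, fmt.Errorf("token at offset %d is not numeric", t.Offset)
}

func main() {
	toks := []Token{
		{Kind: Integer, Bytes: []byte("+17")},
		{Kind: Real, Bytes: []byte("-.002")},
		{Kind: Name, Bytes: []byte("Type"), Offset: 9},
	}
	for _, t := range toks {
		v, err := numberValue(t)
		fmt.Printf("%q -> %v (err=%v)\n", t.Bytes, v, err)
	}
}
```

strconv.ParseFloat accepts forms such as ".5" and "4.", which PDF permits, and it also accepts exponents, matching the lenient behaviour described for Real.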
