lex

package

Version: v0.0.3 (latest) | Published: Jun 19, 2026 | License: MIT | Imports: 2 | Imported by: 0

Repository

github.com/speedata/pdfdisassembler

Documentation ¶

Overview ¶

Package lex tokenises PDF input. It deals with the lexical layer of PDF objects — whitespace, comments, names, numbers, strings, arrays, dictionaries, the stream/endstream/obj/endobj/R/null/true/false keywords — but does not assemble higher-level structures. The parser layered above it turns token streams into Object trees.

Constants ¶

This section is empty.

Variables ¶

var ErrUnexpectedEOF = errors.New("pdfdisassembler/lex: unexpected EOF")

ErrUnexpectedEOF indicates that the lexer ran out of bytes mid-token.
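Because ErrUnexpectedEOF is a sentinel value, callers can test for it with errors.Is, which matches whether the error is returned directly or wrapped. The sketch below is self-contained and does not import the package: errUnexpectedEOF is a local stand-in with the same message, and readLiteral and isTruncated are hypothetical helpers written only for this illustration. readLiteral ignores escapes and nested parentheses.

```go
package main

import (
	"errors"
	"fmt"
)

// errUnexpectedEOF is a local stand-in for lex.ErrUnexpectedEOF (assumption:
// the real sentinel is compared the same way).
var errUnexpectedEOF = errors.New("pdfdisassembler/lex: unexpected EOF")

// readLiteral is a hypothetical helper: it returns the body of a
// parenthesised string and fails if the closing ')' is missing.
// It deliberately ignores escapes and nested parentheses.
func readLiteral(src []byte) ([]byte, error) {
	if len(src) == 0 || src[0] != '(' {
		return nil, errors.New("not a literal string")
	}
	for i := 1; i < len(src); i++ {
		if src[i] == ')' {
			return src[1:i], nil
		}
	}
	// Wrap the sentinel so the caller still sees it through errors.Is.
	return nil, fmt.Errorf("offset %d: %w", len(src), errUnexpectedEOF)
}

// isTruncated reports whether err means the input ended mid-token.
func isTruncated(err error) bool {
	return errors.Is(err, errUnexpectedEOF)
}

func main() {
	_, err := readLiteral([]byte("(truncated"))
	fmt.Println(isTruncated(err)) // prints "true"
}
```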

Functions ¶

func IsDelimiter ¶

func IsDelimiter(c byte) bool

IsDelimiter reports whether c is a PDF delimiter character (§7.2.2).

func IsRegular ¶

func IsRegular(c byte) bool

IsRegular reports whether c is a regular character (neither whitespace nor delimiter).

func IsWhitespace ¶

func IsWhitespace(c byte) bool

IsWhitespace reports whether c is a PDF whitespace character (§7.2.2).
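The three predicates partition the byte range: every byte is exactly one of whitespace, delimiter, or regular. The following is a minimal sketch of that partition using the character sets from the PDF specification (six whitespace bytes, ten delimiters). The lower-case function names are local to this example, and the package's own implementation may be written differently.

```go
package main

import "fmt"

// isWhitespace matches the six PDF whitespace bytes:
// NUL, TAB, LF, FF, CR and SPACE.
func isWhitespace(c byte) bool {
	switch c {
	case 0x00, '\t', '\n', '\f', '\r', ' ':
		return true
	}
	return false
}

// isDelimiter matches the ten PDF delimiter bytes.
func isDelimiter(c byte) bool {
	switch c {
	case '(', ')', '<', '>', '[', ']', '{', '}', '/', '%':
		return true
	}
	return false
}

// isRegular is defined by exclusion, as in the documentation above.
func isRegular(c byte) bool {
	return !isWhitespace(c) && !isDelimiter(c)
}

func main() {
	for _, c := range []byte("/ A%") {
		fmt.Printf("%q ws=%t delim=%t regular=%t\n",
			c, isWhitespace(c), isDelimiter(c), isRegular(c))
	}
}
```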

Types ¶

type Kind ¶

type Kind int

Kind identifies a token's lexical category.

const (
	// EOF marks end of input.
	EOF Kind = iota
	// Name is a PDF name without the leading slash.
	Name
	// Integer is a literal integer (no decimal point, optional sign).
	Integer
	// Real is a literal real number (has a decimal point or 'e' exponent —
	// PDF does not actually allow exponents but we accept them).
	Real
	// LitString is a parenthesised literal string with escapes already
	// resolved.
	LitString
	// HexString is an angle-bracketed hex string with hex pairs already
	// decoded to bytes.
	HexString
	// ArrayStart is the '[' token.
	ArrayStart
	// ArrayEnd is the ']' token.
	ArrayEnd
	// DictStart is the '<<' token.
	DictStart
	// DictEnd is the '>>' token.
	DictEnd
	// Keyword is any unquoted identifier: true, false, null, obj, endobj,
	// stream, endstream, R, xref, trailer, startxref, n, f.
	Keyword
)
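The split between Integer and Real described in the comments above can be sketched as a check on the literal's bytes. classifyNumber is a hypothetical helper, not part of the package. It assumes the input is already known to be a numeric literal, and treating an upper-case 'E' like 'e' is an assumption of this example.

```go
package main

import (
	"bytes"
	"fmt"
)

// classifyNumber returns "Real" if the literal contains a decimal point or
// an exponent marker, and "Integer" otherwise. It does not validate that
// lit is a well-formed number.
func classifyNumber(lit []byte) string {
	if bytes.ContainsAny(lit, ".eE") {
		return "Real"
	}
	return "Integer"
}

func main() {
	for _, s := range []string{"42", "-7", "3.14", "1e3"} {
		fmt.Println(s, classifyNumber([]byte(s)))
	}
}
```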

func (Kind) String ¶

func (k Kind) String() string

type Lexer ¶

type Lexer struct {
	// contains filtered or unexported fields
}

Lexer converts a byte slice into a stream of Tokens. It is not safe for concurrent use.

func New ¶

func New(src []byte) *Lexer

New creates a Lexer over src. The src slice is not copied.

func (*Lexer) Next ¶

func (l *Lexer) Next() (Token, error)

Next returns the next token. At EOF it returns a Token with Kind=EOF.
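Since end of input is reported as a token rather than as an error, a read loop checks the error first and then the Kind. The sketch below shows that loop shape without importing the package: Kind and Token are abbreviated local copies of the documented declarations, while tokenSource, sliceSource and drain are hypothetical names introduced for the example.

```go
package main

import "fmt"

// Local, abbreviated copies of the documented types.
type Kind int

const (
	EOF Kind = iota
	Name
	Integer
)

type Token struct {
	Kind   Kind
	Bytes  []byte
	Offset int64
}

// tokenSource is the single Lexer method this sketch relies on.
type tokenSource interface {
	Next() (Token, error)
}

// sliceSource is a stand-in lexer that replays canned tokens and then
// reports EOF as a token, as Next is documented to do.
type sliceSource struct {
	toks []Token
	i    int
}

func (s *sliceSource) Next() (Token, error) {
	if s.i >= len(s.toks) {
		return Token{Kind: EOF}, nil
	}
	t := s.toks[s.i]
	s.i++
	return t, nil
}

// drain collects tokens until the EOF token or the first error.
func drain(src tokenSource) ([]Token, error) {
	var out []Token
	for {
		tok, err := src.Next()
		if err != nil {
			return out, err
		}
		if tok.Kind == EOF {
			return out, nil
		}
		out = append(out, tok)
	}
}

func main() {
	src := &sliceSource{toks: []Token{
		{Kind: Name, Bytes: []byte("Type"), Offset: 0},
		{Kind: Integer, Bytes: []byte("42"), Offset: 6},
	}}
	toks, err := drain(src)
	fmt.Println(len(toks), err) // prints "2 <nil>"
}
```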

func (*Lexer) Pos ¶

func (l *Lexer) Pos() int

Pos returns the current byte offset.

func (*Lexer) ReadStreamData ¶

func (l *Lexer) ReadStreamData(length int) ([]byte, error)

ReadStreamData consumes raw stream bytes of the given length, starting at the current position. It honours the spec's EOL handling: a single LF or CRLF *immediately* after the "stream" keyword is part of the keyword line, not the stream content. Callers should call this after the "stream" keyword token has been consumed.
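The EOL rule can be illustrated on a plain byte slice. This is a sketch under stated assumptions, not the package's implementation: streamData is a hypothetical function, pos is taken to be the offset just past the "stream" keyword, and the error text is invented for the example. A lone CR is not skipped, matching the "single LF or CRLF" wording above.

```go
package main

import (
	"errors"
	"fmt"
)

// streamData skips one CRLF or one LF at pos, then returns the next length
// bytes together with the offset that follows them.
func streamData(src []byte, pos, length int) ([]byte, int, error) {
	if pos+1 < len(src) && src[pos] == '\r' && src[pos+1] == '\n' {
		pos += 2 // CRLF belongs to the keyword line
	} else if pos < len(src) && src[pos] == '\n' {
		pos++ // so does a single LF
	}
	if length < 0 || pos+length > len(src) {
		return nil, pos, errors.New("stream data runs past end of input")
	}
	return src[pos : pos+length], pos + length, nil
}

func main() {
	src := []byte("stream\r\nHello\nendstream")
	data, next, err := streamData(src, len("stream"), 5)
	fmt.Printf("%q %d %v\n", data, next, err) // prints "Hello" 13 <nil>
}
```

The length argument would normally come from the stream dictionary's /Length entry, which the layer above the lexer has to resolve before calling.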

func (*Lexer) Remaining ¶

func (l *Lexer) Remaining() []byte

Remaining returns the unread portion of the source.

func (*Lexer) SetPos ¶

func (l *Lexer) SetPos(p int)

SetPos moves the lexer to byte offset p, either rewinding or fast-forwarding.
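A common use of Pos together with SetPos is speculative reading: remember the offset, try to consume something, and restore the offset if the attempt fails. The sketch below shows the pattern on a stand-in type. cursor, readByte and tryKeyword are hypothetical and exist only in this example; only the Pos and SetPos method shapes mirror the documented API.

```go
package main

import "fmt"

// cursor is a stand-in for a lexer: a byte slice plus a read position.
type cursor struct {
	src []byte
	pos int
}

func (c *cursor) Pos() int     { return c.pos }
func (c *cursor) SetPos(p int) { c.pos = p }

// readByte returns the next byte and advances, or reports false at the end.
func (c *cursor) readByte() (byte, bool) {
	if c.pos >= len(c.src) {
		return 0, false
	}
	b := c.src[c.pos]
	c.pos++
	return b, true
}

// tryKeyword consumes kw if it is next in the input. On a mismatch it
// rewinds to where it started, so the caller can try something else.
func tryKeyword(c *cursor, kw string) bool {
	start := c.Pos()
	for i := 0; i < len(kw); i++ {
		b, ok := c.readByte()
		if !ok || b != kw[i] {
			c.SetPos(start)
			return false
		}
	}
	return true
}

func main() {
	c := &cursor{src: []byte("endobj")}
	fmt.Println(tryKeyword(c, "endstream"), c.Pos()) // prints "false 0"
	fmt.Println(tryKeyword(c, "endobj"), c.Pos())    // prints "true 6"
}
```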

func (*Lexer) SkipWhitespace ¶

func (l *Lexer) SkipWhitespace()

SkipWhitespace advances over PDF whitespace and comments.
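In PDF a comment starts at '%' and runs to the end of the line, so skipping "whitespace and comments" means alternating between the two until a significant byte appears. The following is a minimal sketch of that loop over a byte slice. skipWhitespace and isWS are local, hypothetical helpers that return a new offset instead of mutating a lexer.

```go
package main

import "fmt"

// isWS matches the six PDF whitespace bytes.
func isWS(c byte) bool {
	switch c {
	case 0x00, '\t', '\n', '\f', '\r', ' ':
		return true
	}
	return false
}

// skipWhitespace returns the offset of the first byte at or after pos that
// is neither whitespace nor part of a '%' comment.
func skipWhitespace(src []byte, pos int) int {
	for pos < len(src) {
		c := src[pos]
		switch {
		case isWS(c):
			pos++
		case c == '%':
			// A comment extends up to, but not including, the next CR or LF.
			for pos < len(src) && src[pos] != '\n' && src[pos] != '\r' {
				pos++
			}
		default:
			return pos
		}
	}
	return pos
}

func main() {
	src := []byte("  % note\n\t/Name")
	fmt.Println(skipWhitespace(src, 0)) // prints "10", the offset of '/'
}
```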

func (*Lexer) Source ¶

func (l *Lexer) Source() []byte

Source returns the underlying source slice.

type Token ¶

type Token struct {
	Kind   Kind
	Bytes  []byte
	Offset int64 // byte offset in the input where this token started
}

Token is a single lexical unit. Bytes carries the token payload; its meaning depends on Kind:

	Name, Keyword: ASCII name body, no leading slash
	Integer, Real: literal digits
	LitString, HexString: decoded bytes
	ArrayStart, ArrayEnd, DictStart, DictEnd, EOF: empty
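A consumer typically switches on Kind to decide how to interpret Bytes. The sketch below uses local copies of the documented Kind and Token declarations so that it runs on its own. value is a hypothetical helper, and parsing numeric payloads with strconv is a choice made for this example, not something the package prescribes.

```go
package main

import (
	"fmt"
	"strconv"
)

// Local copies of the documented declarations.
type Kind int

const (
	EOF Kind = iota
	Name
	Integer
	Real
	LitString
	HexString
	ArrayStart
	ArrayEnd
	DictStart
	DictEnd
	Keyword
)

type Token struct {
	Kind   Kind
	Bytes  []byte
	Offset int64
}

// value converts a token payload into a Go value according to its Kind.
// Structural tokens and EOF carry no payload and yield nil.
func value(t Token) (any, error) {
	switch t.Kind {
	case Integer:
		n, err := strconv.ParseInt(string(t.Bytes), 10, 64)
		if err != nil {
			return nil, err
		}
		return n, nil
	case Real:
		f, err := strconv.ParseFloat(string(t.Bytes), 64)
		if err != nil {
			return nil, err
		}
		return f, nil
	case Name, Keyword:
		return string(t.Bytes), nil
	case LitString, HexString:
		return t.Bytes, nil // already decoded by the lexer
	default:
		return nil, nil
	}
}

func main() {
	v, err := value(Token{Kind: Integer, Bytes: []byte("42")})
	fmt.Println(v, err) // prints "42 <nil>"
}
```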

Source Files ¶

lex.go