lexer

package
v0.11.0
Warning

This package is not in the latest version of its module.

Published: May 7, 2026 License: Apache-2.0 Imports: 7 Imported by: 0

Documentation

Overview

Package lexer ports cpython/Parser/lexer/ and cpython/Parser/tokenizer/ to Go. The lexer turns source bytes into tokens with positions; the driver layer feeds it from strings, byte slices, files, or readline callbacks.

Tokens emitted here use kinds from the tokenize package (which are pinned to Include/internal/pycore_token.h). The pegen runtime in parser/pegen consumes these tokens.

CPython: Parser/lexer/state.h, Parser/lexer/state.c

Constants

This section is empty.

Variables

This section is empty.

Functions

func CheckBOMCookieConflict

func CheckBOMCookieConflict(src []byte) string

CheckBOMCookieConflict reports the CPython error text when the source begins with a UTF-8 BOM but the PEP 263 cookie names a non-UTF-8 encoding. It returns the empty string when there is no conflict: no BOM, no cookie, or a cookie that names utf-8, utf8, or U8.

CPython: Parser/tokenizer/helpers.c:223 check_bom
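The conflict rule described above can be sketched in a few lines. This is an illustration, not the package's implementation: bomConflict, isUTF8Name, and utf8BOM are hypothetical names, the cookie name is passed in already extracted, the case-insensitive comparison is an assumption, and the message wording is a placeholder modelled on CPython's.

```go
package main

import (
	"bytes"
	"fmt"
	"strings"
)

// utf8BOM is the three-byte UTF-8 byte order mark.
var utf8BOM = []byte{0xEF, 0xBB, 0xBF}

// isUTF8Name reports whether a cookie name is one of the UTF-8 spellings
// listed in the documentation. Case-insensitive matching is an assumption.
func isUTF8Name(name string) bool {
	switch strings.ToLower(name) {
	case "utf-8", "utf8", "u8":
		return true
	}
	return false
}

// bomConflict (hypothetical) returns a non-empty message only when src
// starts with a UTF-8 BOM and cookie names a non-UTF-8 encoding.
func bomConflict(src []byte, cookie string) string {
	if !bytes.HasPrefix(src, utf8BOM) || cookie == "" || isUTF8Name(cookie) {
		return ""
	}
	// Placeholder wording; the real text is whatever the package returns.
	return "encoding problem: " + cookie + " with BOM"
}

func main() {
	src := []byte("\xef\xbb\xbf# coding: latin-1\n")
	fmt.Printf("%q\n", bomConflict(src, "latin-1"))
	fmt.Printf("%q\n", bomConflict(src, "utf-8"))
}
```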

func DetectEncodingCookie

func DetectEncodingCookie(src []byte) string

DetectEncodingCookie scans the first two physical lines of src for a PEP 263 `coding:` declaration and returns the encoding name, or "" when no cookie is present. The scan stops at byte codingCookieMax of each line. Lines may end with \n, \r\n, or \r; the function is newline-flavor agnostic.

CPython: Parser/tokenizer/helpers.c:165 check_coding_spec
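A minimal sketch of the two-line scan, assuming the regular expression published in PEP 263. The names cookieRE, firstLines, and sketchDetectCookie are hypothetical, and the sketch leaves out the per-line byte limit (codingCookieMax) and any other checks the real function performs.

```go
package main

import (
	"fmt"
	"regexp"
)

// cookieRE is the pattern PEP 263 gives for a coding declaration.
var cookieRE = regexp.MustCompile(`^[ \t\f]*#.*?coding[:=][ \t]*([-_.a-zA-Z0-9]+)`)

// firstLines returns up to n physical lines of src without their
// terminators. It accepts \n, \r\n, and bare \r as line endings.
func firstLines(src []byte, n int) [][]byte {
	var lines [][]byte
	start := 0
	for i := 0; i < len(src) && len(lines) < n; i++ {
		if src[i] != '\n' && src[i] != '\r' {
			continue
		}
		lines = append(lines, src[start:i])
		if src[i] == '\r' && i+1 < len(src) && src[i+1] == '\n' {
			i++ // treat \r\n as one terminator
		}
		start = i + 1
	}
	if len(lines) < n && start < len(src) {
		lines = append(lines, src[start:]) // final unterminated line
	}
	return lines
}

// sketchDetectCookie (hypothetical) returns the encoding named on either
// of the first two lines, or "" when neither carries a cookie.
func sketchDetectCookie(src []byte) string {
	for _, line := range firstLines(src, 2) {
		if m := cookieRE.FindSubmatch(line); m != nil {
			return string(m[1])
		}
	}
	return ""
}

func main() {
	fmt.Println(sketchDetectCookie([]byte("#!/usr/bin/python\r# -*- coding: latin-1 -*-\rx = 1\r")))
}
```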

func NormalizeNewlines

func NormalizeNewlines(src []byte) []byte

NormalizeNewlines folds \r\n and bare \r into \n so the FSM can treat newline as a single byte. CPython does the same fold in the file driver before handing lines to the scanner.

CPython: Parser/tokenizer/file_tokenizer.c:118 translate_newlines
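The fold itself is small enough to sketch. The function below is a stand-in named foldNewlines, not the package's code; it only illustrates the \r\n and bare \r handling described above and always allocates a new slice, which the real function may or may not do.

```go
package main

import "fmt"

// foldNewlines (hypothetical) rewrites \r\n and bare \r as \n.
func foldNewlines(src []byte) []byte {
	out := make([]byte, 0, len(src))
	for i := 0; i < len(src); i++ {
		c := src[i]
		if c == '\r' {
			// Swallow the \n of a \r\n pair so it yields one newline.
			if i+1 < len(src) && src[i+1] == '\n' {
				i++
			}
			c = '\n'
		}
		out = append(out, c)
	}
	return out
}

func main() {
	fmt.Printf("%q\n", foldNewlines([]byte("a\r\nb\rc\n"))) // prints "a\nb\nc\n"
}
```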

Types

type Mode

type Mode int

Mode is the tokenizer top-level mode. Mirrors the PyCompile_Mode used by the upstream tokenizer entry points.

CPython: Parser/lexer/state.h:14 start mode constants

const (
	ModeFile Mode = iota
	ModeSingle
	ModeEval
	ModeFunctionType
	ModeFString
)

Mode constants. ModeFile is the default for `python script.py`, ModeSingle drives the REPL, ModeEval handles `eval(...)`, ModeFunctionType backs `inspect.signature` style annotation parsing, and ModeFString is reserved for direct f-string parsing.

type Pos

type Pos struct {
	Line int
	Col  int
}

Pos is a token start/end coordinate. Line is 1-based and Col is 0-based, matching CPython's lineno / col_offset convention.
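To make the convention concrete, the sketch below maps a byte offset to a coordinate. Pos is redeclared locally so the example is self-contained, posAt is a hypothetical helper, and counting columns in bytes over newline-normalized input is an assumption about how the lexer measures Col.

```go
package main

import "fmt"

// Pos mirrors the documented struct; redeclared here for the sketch.
type Pos struct {
	Line int
	Col  int
}

// posAt (hypothetical) walks src up to off and returns the coordinate of
// that byte: lines start at 1, columns restart at 0 after each \n.
func posAt(src []byte, off int) Pos {
	p := Pos{Line: 1, Col: 0}
	for i := 0; i < off && i < len(src); i++ {
		if src[i] == '\n' {
			p.Line++
			p.Col = 0
		} else {
			p.Col++
		}
	}
	return p
}

func main() {
	src := []byte("a = 1\nbb = 2\n")
	fmt.Println(posAt(src, 0)) // prints {1 0}
	fmt.Println(posAt(src, 9)) // prints {2 3}
}
```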

type ReadlineFunc

type ReadlineFunc func() ([]byte, error)

ReadlineFunc is the shape of the readline callback. Returning io.EOF signals end of input; any other error stops tokenisation.

type State

type State struct {
	// contains filtered or unexported fields
}

State is the tokenizer's per-call state. One State drives one tokenization pass.

CPython: Parser/lexer/state.h:74 struct tok_state

func FromBytes

func FromBytes(src []byte, mode Mode) *State

FromBytes is the byte-slice variant of FromString. The caller hands ownership of the slice to the lexer. The lexer can still grow the buffer when the FSM needs more room, but for in-memory drivers that is a no-op because cur never reaches inp past the original length.

CPython: Parser/tokenizer/utf8_tokenizer.c:11 _PyTokenizer_FromUTF8

func FromReader

func FromReader(r io.Reader, mode Mode) *State

FromReader builds a State that reads source incrementally from r. Lines are pulled on demand via the underflow callback. Encoding detection runs on the first two physical lines (BOM and PEP 263 cookie); when a non-UTF-8 cookie is found the driver slurps the whole stream and decodes it via the codecs registry, then falls back to the in-memory pipeline.

CPython: Parser/tokenizer/file_tokenizer.c:31 _PyTokenizer_FromFile

func FromReadline

func FromReadline(rl ReadlineFunc, mode Mode) *State

FromReadline builds a State whose underflow callback pulls one line per refill from rl. The interactive prompt and history are out of scope for the lexer; the embedder owns those.

CPython: Parser/tokenizer/readline_tokenizer.c:24 _PyTokenizer_FromReadline

func FromString

func FromString(src string, mode Mode) *State

FromString builds a State that tokenises the given source. The driver loads the whole buffer up front; underflow returns false on the next refill request, matching the C source's _PyTokenizer_FromUTF8 / _PyTokenizer_FromString behavior after the final line lands.

CPython: Parser/tokenizer/utf8_tokenizer.c:11 _PyTokenizer_FromUTF8 (and Parser/tokenizer/string_tokenizer.c:106 _PyTokenizer_FromString)

func (*State) Encoding

func (s *State) Encoding() string

Encoding returns the source encoding detected from a BOM or PEP 263 cookie, or "" when no cookie was seen.

func (*State) Err

func (s *State) Err() *SyntaxError

Err returns the first SyntaxError recorded, or nil.
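Because Err returns a concrete *SyntaxError rather than the error interface, callers that want an error value should compare the pointer to nil before converting. The sketch below shows the general Go pitfall with a local stand-in type; the one-field SyntaxError and the asError helper are hypothetical and only model the shape of the documented API.

```go
package main

import "fmt"

// SyntaxError is a cut-down stand-in for the documented type.
type SyntaxError struct {
	Message string
}

func (e *SyntaxError) Error() string { return e.Message }

// asError (hypothetical) converts a possibly-nil *SyntaxError into an
// error that is truly nil when nothing was recorded.
func asError(se *SyntaxError) error {
	if se == nil {
		return nil
	}
	return se
}

func main() {
	var none *SyntaxError // what a pointer-returning Err yields with no error

	// Direct conversion wraps the nil pointer in a non-nil interface value.
	var direct error = none
	fmt.Println(direct != nil) // prints true

	// Checking the pointer first avoids the surprise.
	fmt.Println(asError(none) != nil) // prints false
}
```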

func (*State) Filename

func (s *State) Filename() string

Filename returns the configured filename. Used by error formatters.

func (*State) Get

func (s *State) Get() Tok

Get is the public entry point. Each call returns one token. At end of input or on error it sets s.done or s.err and returns an ENDMARKER or ERRORTOKEN, respectively.

CPython: Parser/lexer/lexer.c:1626 _PyTokenizer_Get
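A typical caller drives Get in a loop until it sees a terminal token. The sketch below shows that loop against local stand-ins so it runs on its own: Kind, Tok, the three constants, getter, fakeState, and drain are all hypothetical, and the real token kinds come from the package's token type rather than from this file.

```go
package main

import "fmt"

// Kind and the constants are stand-ins for the package's token kinds.
type Kind int

const (
	EndMarker  Kind = iota // stands in for ENDMARKER
	Name                   // stands in for an ordinary token kind
	ErrorToken             // stands in for ERRORTOKEN
)

// Tok is a cut-down stand-in for the documented Tok struct.
type Tok struct {
	Kind  Kind
	Bytes []byte
}

// getter is the single method the loop needs.
type getter interface{ Get() Tok }

// fakeState replays a fixed token list, then returns EndMarker forever.
type fakeState struct{ toks []Tok }

func (f *fakeState) Get() Tok {
	if len(f.toks) == 0 {
		return Tok{Kind: EndMarker}
	}
	t := f.toks[0]
	f.toks = f.toks[1:]
	return t
}

// drain collects tokens until a terminal token arrives. ok is false when
// the stream ended with an error token.
func drain(g getter) (toks []Tok, ok bool) {
	for {
		t := g.Get()
		switch t.Kind {
		case EndMarker:
			return toks, true
		case ErrorToken:
			return toks, false
		}
		toks = append(toks, t)
	}
}

func main() {
	s := &fakeState{toks: []Tok{{Name, []byte("x")}, {Name, []byte("y")}}}
	toks, ok := drain(s)
	fmt.Println(len(toks), ok) // prints 2 true
}
```

With the real package, a failed drain would be followed by a call to Err to retrieve the recorded SyntaxError.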

func (*State) SetExtraTokens

func (s *State) SetExtraTokens(v bool)

SetExtraTokens enables COMMENT, NL, and ENCODING token emission. Mirrors tokenize.tokenize()'s extra_tokens flag.

CPython: Parser/lexer/state.h:133 tok_extra_tokens

func (*State) SetFilename

func (s *State) SetFilename(name string)

SetFilename pins a name for error messages.

func (*State) SetTypeComments

func (s *State) SetTypeComments(v bool)

SetTypeComments enables type-comment emission (`# type: ...`).

CPython: Parser/lexer/state.h:122 type_comments

type SyntaxError

type SyntaxError struct {
	Pos     Pos
	EndPos  Pos
	Message string
	Text    string
}

SyntaxError is the lexer's error type. The pegen runtime lifts this into the parser-level *SyntaxError when needed.

CPython: Parser/pegen_errors.c:184 _PyPegen_raise_error_known_location

func (*SyntaxError) Error

func (e *SyntaxError) Error() string

Error renders the lexer error in CPython's "<msg>" form. The full "File ..., line N" envelope is added by the pegen layer.

type Tok

type Tok struct {
	Kind        token.Type
	Bytes       []byte
	Start       Pos
	End         Pos
	Level       int
	StartOffset int
	EndOffset   int
	// Metadata holds f-string/t-string interpolation expression
	// text captured during scanning. nil for ordinary tokens.
	Metadata []byte
}

Tok is the lexer's emitted token. It is distinct from tokenize.Token (the Python-facing surface in 1665), which adds the Bytes/Line strings.

CPython: Parser/lexer/state.h:29 struct token
