Documentation ¶
Overview ¶
Package lexer ports cpython/Parser/lexer/ and cpython/Parser/tokenizer/ to Go. The lexer turns source bytes into tokens with positions; the driver layer feeds it from strings, byte slices, files, or readline callbacks.
Tokens emitted here use kinds from the tokenize package (which are pinned to Include/internal/pycore_token.h). The pegen runtime in parser/pegen consumes these tokens.
CPython: Parser/lexer/state.h, Parser/lexer/state.c
Index ¶
Constants ¶
This section is empty.
Variables ¶
This section is empty.
Functions ¶
func CheckBOMCookieConflict ¶
CheckBOMCookieConflict reports the CPython error text when the source begins with a UTF-8 BOM but the PEP 263 cookie names a non-utf-8 encoding. Returns the empty string when there is no conflict (no BOM, no cookie, or cookie says utf-8 / utf8 / U8).
CPython: Parser/tokenizer/helpers.c:223 check_bom
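The conflict rule above can be sketched in a few lines. This is an illustrative stand-in, not the package's implementation: bomConflicts is a hypothetical helper, it takes the cookie name as an argument instead of scanning for it, it returns a bool rather than the CPython error text, and the case-insensitive name comparison is an assumption.

```go
package main

import (
	"bytes"
	"fmt"
	"strings"
)

// utf8BOM is the three-byte UTF-8 byte order mark.
var utf8BOM = []byte{0xEF, 0xBB, 0xBF}

// bomConflicts is a hypothetical helper (not part of the package). It reports
// whether src starts with a UTF-8 BOM while cookie names a non-UTF-8 encoding.
// An empty cookie means no PEP 263 declaration was found.
func bomConflicts(src []byte, cookie string) bool {
	if !bytes.HasPrefix(src, utf8BOM) || cookie == "" {
		return false
	}
	// Assumption: the accepted UTF-8 spellings are compared case-insensitively.
	switch strings.ToLower(cookie) {
	case "utf-8", "utf8", "u8":
		return false
	}
	return true
}

func main() {
	src := append([]byte{0xEF, 0xBB, 0xBF}, "# coding: latin-1\n"...)
	fmt.Println(bomConflicts(src, "latin-1"))     // BOM plus latin-1 cookie
	fmt.Println(bomConflicts(src, "utf-8"))       // BOM plus utf-8 cookie
	fmt.Println(bomConflicts(src[3:], "latin-1")) // no BOM
}
```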
func DetectEncodingCookie ¶
DetectEncodingCookie scans the first two physical lines of src for a PEP 263 `coding:` declaration and returns the encoding name, or "" when no cookie is present. The scan stops at byte codingCookieMax of each line. Lines may end with \n, \r\n, or \r; the function is newline-flavor agnostic.
CPython: Parser/tokenizer/helpers.c:165 check_coding_spec
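As a rough illustration of the scan described above, the sketch below splits off the first two physical lines on any newline flavor and applies the regular expression given in PEP 263. The names detectCookie and firstLines are hypothetical, the per-line codingCookieMax cap is omitted, and the package's actual matching rules may be stricter.

```go
package main

import (
	"bytes"
	"fmt"
	"regexp"
)

// cookieRE is the declaration pattern given in PEP 263.
var cookieRE = regexp.MustCompile(`^[ \t\f]*#.*?coding[:=][ \t]*([-_.a-zA-Z0-9]+)`)

// firstLines returns up to n leading physical lines of src without their
// terminators. It accepts \n, \r\n, and bare \r as line endings.
func firstLines(src []byte, n int) [][]byte {
	var lines [][]byte
	for len(lines) < n && len(src) > 0 {
		i := bytes.IndexAny(src, "\r\n")
		if i < 0 {
			lines = append(lines, src)
			break
		}
		lines = append(lines, src[:i])
		// Treat \r\n as a single terminator.
		if src[i] == '\r' && i+1 < len(src) && src[i+1] == '\n' {
			i++
		}
		src = src[i+1:]
	}
	return lines
}

// detectCookie is a hypothetical stand-in for the scan described above.
// It returns the declared encoding name, or "" when no cookie is found.
func detectCookie(src []byte) string {
	for _, line := range firstLines(src, 2) {
		if m := cookieRE.FindSubmatch(line); m != nil {
			return string(m[1])
		}
	}
	return ""
}

func main() {
	// Bare \r line endings, cookie on the second line.
	src := []byte("#!/usr/bin/env python\r# -*- coding: latin-1 -*-\rprint(1)\r")
	fmt.Printf("%q\n", detectCookie(src))
}
```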
func NormalizeNewlines ¶
NormalizeNewlines folds \r\n and bare \r into \n so the FSM can treat newline as a single byte. CPython does the same fold in the file driver before handing lines to the scanner.
CPython: Parser/tokenizer/file_tokenizer.c:118 translate_newlines
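The fold itself is small. The following is a minimal sketch of the idea, assuming a single pass over a byte slice; foldNewlines is a hypothetical name and the package's NormalizeNewlines may differ in signature and allocation behavior.

```go
package main

import "fmt"

// foldNewlines is a hypothetical stand-in for the fold described above.
// It rewrites \r\n and bare \r to \n in one pass and returns a new slice.
func foldNewlines(src []byte) []byte {
	out := make([]byte, 0, len(src))
	for i := 0; i < len(src); i++ {
		c := src[i]
		if c == '\r' {
			// Skip the \n of a \r\n pair so the pair yields one newline.
			if i+1 < len(src) && src[i+1] == '\n' {
				i++
			}
			c = '\n'
		}
		out = append(out, c)
	}
	return out
}

func main() {
	fmt.Printf("%q\n", foldNewlines([]byte("a\r\nb\rc\n"))) // prints "a\nb\nc\n"
}
```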
Types ¶
type Mode ¶
type Mode int
Mode is the tokenizer top-level mode. Mirrors the PyCompile_Mode used by the upstream tokenizer entry points.
CPython: Parser/lexer/state.h:14 start mode constants
Mode constants. ModeFile is the default for `python script.py`, ModeSingle drives the REPL, ModeEval handles `eval(...)`, ModeFunctionType backs `inspect.signature` style annotation parsing, and ModeFString is reserved for direct f-string parsing.
type Pos ¶
Pos is a token start/end coordinate. The line is 1-based and the column is 0-based, matching CPython's lineno / col_offset convention.
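To make the convention concrete, the sketch below derives a coordinate from a byte offset. The pos type and its field names are assumptions for illustration, not the package's Pos fields, and the column here counts bytes after the last newline.

```go
package main

import "fmt"

// pos mirrors the convention described above. The field names are
// assumptions, not the package's actual Pos fields.
type pos struct {
	Line int // 1-based
	Col  int // 0-based, counted in bytes
}

// posAt walks src up to offset and returns the coordinate of that byte.
// It assumes newlines have already been folded to \n.
func posAt(src []byte, offset int) pos {
	p := pos{Line: 1, Col: 0}
	for i := 0; i < offset && i < len(src); i++ {
		if src[i] == '\n' {
			p.Line++
			p.Col = 0
		} else {
			p.Col++
		}
	}
	return p
}

func main() {
	src := []byte("x = 1\ny = 2\n")
	fmt.Println(posAt(src, 0))  // prints {1 0}: first byte of the first line
	fmt.Println(posAt(src, 10)) // prints {2 4}: the "2" on the second line
}
```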
type ReadlineFunc ¶
ReadlineFunc is the shape of the readline callback. Returning io.EOF signals end of input; any other error stops tokenization.
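The error contract can be illustrated with a small stand-alone sketch. The callback signature below (func() ([]byte, error)) is an assumption, since the real ReadlineFunc definition is not shown here; linesFromString and drain are hypothetical helpers that play the roles of an embedder's callback and the consuming driver.

```go
package main

import (
	"bufio"
	"errors"
	"fmt"
	"io"
	"strings"
)

// readlineFunc is an assumed shape for the callback; the package's real
// ReadlineFunc signature may differ.
type readlineFunc func() ([]byte, error)

// linesFromString returns a callback that yields one line of src per call.
func linesFromString(src string) readlineFunc {
	br := bufio.NewReader(strings.NewReader(src))
	return func() ([]byte, error) {
		line, err := br.ReadBytes('\n')
		if len(line) > 0 {
			// A final line without a trailing newline is still delivered;
			// the following call then reports io.EOF.
			return line, nil
		}
		return nil, err
	}
}

// drain pulls lines until io.EOF and returns early on any other error,
// which is the contract described above.
func drain(rl readlineFunc) ([]string, error) {
	var lines []string
	for {
		line, err := rl()
		if errors.Is(err, io.EOF) {
			return lines, nil
		}
		if err != nil {
			return lines, err
		}
		lines = append(lines, string(line))
	}
}

func main() {
	lines, err := drain(linesFromString("a = 1\nb = 2"))
	fmt.Printf("%q %v\n", lines, err)
}
```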
type State ¶
type State struct {
	// contains filtered or unexported fields
}
State is the tokenizer's per-call state. One State drives one tokenization pass.
CPython: Parser/lexer/state.h:74 struct tok_state
func FromBytes ¶
FromBytes is the byte-slice variant of FromString. The caller hands ownership of the slice to the lexer. The buffer can still grow when the FSM needs more room, but for in-memory drivers that is a no-op, since cur never reaches inp past the original length.
CPython: Parser/tokenizer/utf8_tokenizer.c:11 _PyTokenizer_FromUTF8
func FromReader ¶
FromReader builds a State that reads source incrementally from r. Lines are pulled on demand via the underflow callback. Encoding detection runs on the first two physical lines (BOM and PEP 263 cookie); after that the driver assumes UTF-8.
CPython: Parser/tokenizer/file_tokenizer.c:31 _PyTokenizer_FromFile
func FromReadline ¶
func FromReadline(rl ReadlineFunc, mode Mode) *State
FromReadline builds a State whose underflow callback pulls one line per refill from rl. The interactive prompt and history are out of scope for the lexer; the embedder owns those.
CPython: Parser/tokenizer/readline_tokenizer.c:24 _PyTokenizer_FromReadline
func FromString ¶
FromString builds a State that tokenizes the given source. The driver loads the whole buffer up front; underflow returns false on the next refill request, matching the C source's _PyTokenizer_FromUTF8 / _PyTokenizer_FromString behavior after the final line lands.
CPython: Parser/tokenizer/utf8_tokenizer.c:11 _PyTokenizer_FromUTF8 (and Parser/tokenizer/string_tokenizer.c:106 _PyTokenizer_FromString)
func (*State) Encoding ¶
Encoding returns the source encoding detected from a BOM or PEP 263 cookie, or "" when neither was seen.
func (*State) Err ¶
func (s *State) Err() *SyntaxError
Err returns the first SyntaxError recorded, or nil.
func (*State) Get ¶
Get is the public entry point. One call returns one token, or sets s.done / s.err and returns an ERRORTOKEN / ENDMARKER.
CPython: Parser/lexer/lexer.c:1626 _PyTokenizer_Get
func (*State) SetExtraTokens ¶
SetExtraTokens enables COMMENT, NL, and ENCODING token emission. Mirrors tokenize.tokenize()'s extra_tokens flag.
CPython: Parser/lexer/state.h:133 tok_extra_tokens
func (*State) SetFilename ¶
SetFilename pins a name for error messages.
func (*State) SetTypeComments ¶
SetTypeComments enables type-comment emission (`# type: ...`).
CPython: Parser/lexer/state.h:122 type_comments
type SyntaxError ¶
SyntaxError is the lexer's error type. The pegen runtime lifts this into the parser-level *SyntaxError when needed.
CPython: Parser/pegen_errors.c:184 _PyPegen_raise_error_known_location
func (*SyntaxError) Error ¶
func (e *SyntaxError) Error() string
Error renders the lexer error in CPython's "<msg>" form. The full "File ..., line N" envelope is added by the pegen layer.
type Tok ¶
type Tok struct {
	Kind        tokenize.Type
	Bytes       []byte
	Start       Pos
	End         Pos
	Level       int
	StartOffset int
	EndOffset   int
	// Metadata holds f-string/t-string interpolation expression
	// text captured during scanning. nil for ordinary tokens.
	Metadata []byte
}
Tok is the lexer's emitted token. It is distinct from tokenize.Token (the Python-facing surface in 1665), which adds the Bytes/Line strings.
CPython: Parser/lexer/state.h:29 struct token