tokenize

package
v0.5.5
Published: May 5, 2026 License: Apache-2.0 Imports: 1 Imported by: 0

Documentation

Overview

Package tokenize is the Go port of cpython/Python/Python-tokenize.c. The C file is the Python-visible wrapper around the parser's lexer; it exposes the TokenizerIter class that `tokenize.tokenize()` in the stdlib delegates to.

This file is a skeleton. The lexer state machine (indent tracking, f-string handling, encoding detection) lives in cpython/Parser/tokenizer/* and lands with the parser port. The public surface here is the stable, Go-idiomatic shape that consumers can program against today; the implementation behind Iter will be filled in once the lexer port lands.

CPython: Python/Python-tokenize.c

Package tokenize declares the token kind constants and the public iterator surface.

Numeric values for the constants live in types_gen.go, generated from cpython/Grammar/Tokens via tools/tokens_go. Keeping the type itself in a hand-written file lets other files in the package depend on `Type` without requiring the generator to have run yet.

CPython: Include/internal/pycore_token.h Token enum

Index

Constants

This section is empty.

Variables

This section is empty.

Functions

This section is empty.

Types

type Iter

type Iter struct {
	// State will hold the parser-side lexer once the parser port
	// lands. Until then Next returns io.EOF immediately. Exported so
	// the unused-field linter doesn't flip on the placeholder.
	State any
}

Iter is the Go-side TokenizerIter equivalent. Next advances the underlying lexer state by one token; EOF is reported as io.EOF.

CPython: Python/Python-tokenize.c tokenizeriterobject

func New

func New(_ string, _ bool) *Iter

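New constructs an Iter over a source string. The first argument is the source text; the second, extraTokens, enables the COMMENT / NL / ENCODING / NEWLINE-at-EOF tokens that the stdlib filters out by default. Both parameters are unnamed in the skeleton signature because they are not used yet.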

CPython: Python/Python-tokenize.c tokenizeriter_new (source path)

func NewReadline

func NewReadline(_ func() (string, error), _ bool) *Iter

NewReadline constructs an Iter that pulls source lines from a readline-shaped callable, the same shape io.TextIO.readline has on the Python side.

CPython: Python/Python-tokenize.c tokenizeriter_new (readline path)

func (*Iter) Next

func (it *Iter) Next() (Token, error)

Next returns the next token. The skeleton always reports io.EOF; the parser port fills in real tokens.

CPython: Python/Python-tokenize.c tokenizeriter_next
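
A minimal sketch of the consuming loop this contract implies: call Next until it reports io.EOF and treat that as a normal end of stream. To keep the example self-contained, Token, tokenSource, skeletonIter, and drain are local stand-ins invented for illustration; real code would use tokenize.Token and *tokenize.Iter.

```go
package main

import (
	"errors"
	"fmt"
	"io"
)

// Token is a local stand-in for tokenize.Token (assumption: only the
// Value field is needed for this sketch).
type Token struct {
	Value string
}

// tokenSource is a hypothetical interface matching the Next method.
type tokenSource interface {
	Next() (Token, error)
}

// skeletonIter mimics the documented skeleton: Next reports io.EOF at once.
type skeletonIter struct{}

func (skeletonIter) Next() (Token, error) { return Token{}, io.EOF }

// drain collects tokens until io.EOF, which it treats as a clean end of
// stream; any other error is returned to the caller.
func drain(src tokenSource) ([]Token, error) {
	var toks []Token
	for {
		tok, err := src.Next()
		if errors.Is(err, io.EOF) {
			return toks, nil
		}
		if err != nil {
			return toks, err
		}
		toks = append(toks, tok)
	}
}

func main() {
	toks, err := drain(skeletonIter{})
	fmt.Println(len(toks), err) // the skeleton yields no tokens and no error
}
```

Writing the loop this way means the same caller should keep working unchanged once Next starts returning real tokens.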

type Pos

type Pos struct {
	Line int
	Col  int
}

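Pos is the (line, column) source position of a token boundary. Line is 1-based, as in CPython's tokenize.TokenInfo. Col is documented here as 1-based as well; CPython's TokenInfo itself reports 0-based column offsets, so comparisons against CPython fixtures may need to account for that difference.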

CPython: Python/Python-tokenize.c tokenizeriter_next

type Token

type Token struct {
	Type  Type
	Value string
	Start Pos
	End   Pos
	Line  string
}

Token is one record emitted by the iterator. Mirrors the 5-tuple (type, string, start, end, line) the C wrapper returns.

CPython: Python/Python-tokenize.c tokenizeriter_next

type Type

type Type int

Type is the token kind. Numeric values match CPython 3.14 Grammar/Tokens one for one. The full constant set is in types_gen.go.

const (
	ENDMARKER        Type = 0
	NAME             Type = 1
	NUMBER           Type = 2
	STRING           Type = 3
	NEWLINE          Type = 4
	INDENT           Type = 5
	DEDENT           Type = 6
	LPAR             Type = 7
	RPAR             Type = 8
	LSQB             Type = 9
	RSQB             Type = 10
	COLON            Type = 11
	COMMA            Type = 12
	SEMI             Type = 13
	PLUS             Type = 14
	MINUS            Type = 15
	STAR             Type = 16
	SLASH            Type = 17
	VBAR             Type = 18
	AMPER            Type = 19
	LESS             Type = 20
	GREATER          Type = 21
	EQUAL            Type = 22
	DOT              Type = 23
	PERCENT          Type = 24
	LBRACE           Type = 25
	RBRACE           Type = 26
	EQEQUAL          Type = 27
	NOTEQUAL         Type = 28
	LESSEQUAL        Type = 29
	GREATEREQUAL     Type = 30
	TILDE            Type = 31
	CIRCUMFLEX       Type = 32
	LEFTSHIFT        Type = 33
	RIGHTSHIFT       Type = 34
	DOUBLESTAR       Type = 35
	PLUSEQUAL        Type = 36
	MINEQUAL         Type = 37
	STAREQUAL        Type = 38
	SLASHEQUAL       Type = 39
	PERCENTEQUAL     Type = 40
	AMPEREQUAL       Type = 41
	VBAREQUAL        Type = 42
	CIRCUMFLEXEQUAL  Type = 43
	LEFTSHIFTEQUAL   Type = 44
	RIGHTSHIFTEQUAL  Type = 45
	DOUBLESTAREQUAL  Type = 46
	DOUBLESLASH      Type = 47
	DOUBLESLASHEQUAL Type = 48
	AT               Type = 49
	ATEQUAL          Type = 50
	RARROW           Type = 51
	ELLIPSIS         Type = 52
	COLONEQUAL       Type = 53
	EXCLAMATION      Type = 54
	OP               Type = 55
	TYPE_IGNORE      Type = 56
	TYPE_COMMENT     Type = 57
	SOFT_KEYWORD     Type = 58
	FSTRING_START    Type = 59
	FSTRING_MIDDLE   Type = 60
	FSTRING_END      Type = 61
	TSTRING_START    Type = 62
	TSTRING_MIDDLE   Type = 63
	TSTRING_END      Type = 64
	COMMENT          Type = 65
	NL               Type = 66
	ERRORTOKEN       Type = 67
	ENCODING         Type = 68
	NTokens          Type = 69
)

Token kinds, with numeric values pinned to CPython's Include/internal/pycore_token.h. The ALL_CAPS spellings preserve parity with `token.tok_name` so fixture comparisons line up byte-for-byte.

func (Type) String

func (t Type) String() string

String returns the CPython-compatible token name (e.g. "NAME", "NUMBER"). Unknown values render as "TYPE(n)".

CPython: Lib/token.py tok_name
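
The documented behavior can be sketched with a name table and a formatted fallback. This is an illustration under stated assumptions, not the package's actual implementation: the table below is deliberately truncated to three entries, and the variable name typeNames is hypothetical.

```go
package main

import "fmt"

// Type is a local copy of the token kind type so the sketch compiles alone.
type Type int

const (
	ENDMARKER Type = 0
	NAME      Type = 1
	NUMBER    Type = 2
)

// typeNames is a hypothetical, truncated lookup table; the real package
// would cover the full constant set.
var typeNames = map[Type]string{
	ENDMARKER: "ENDMARKER",
	NAME:      "NAME",
	NUMBER:    "NUMBER",
}

// String returns the CPython-style name, or "TYPE(n)" for unknown values.
func (t Type) String() string {
	if name, ok := typeNames[t]; ok {
		return name
	}
	// Convert to int so Sprintf does not call String recursively.
	return fmt.Sprintf("TYPE(%d)", int(t))
}

func main() {
	fmt.Println(NAME, Type(999)) // prints "NAME TYPE(999)"
}
```

Because Type implements fmt.Stringer, the fmt verbs %v and %s pick up these names automatically.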
