package tokenizer

v1.4.0
Published: Feb 21, 2026 License: Apache-2.0 Imports: 9 Imported by: 0

Documentation

Overview

Package tokenizer implements R7RS Scheme lexical analysis.

The tokenizer converts a Unicode rune stream into tokens with source positions:

Token Categories

  • Delimiters: (, ), (), .
  • Quotation: ', `, ,, ,@ and syntax variants
  • Numbers: integers, decimals, rationals, scientific, complex, polar
  • Special: +inf.0, -inf.0, +nan.0, imaginary variants
  • Big numbers: #z prefix for BigInteger, #m for BigFloat
  • Literals: symbols, strings, characters
  • Booleans: #t, #f, #true, #false
  • Comments: line (;), block (#|...|#), datum (#;)
  • Vectors: #(, #u8(
  • Labels: #n=, #n#

Usage

tok := tokenizer.NewTokenizer(reader, caseInsensitive)
for {
    token, err := tok.Next()
    if err == io.EOF {
        break
    }
    // process token
}

Each Token provides source position via Start() and End(), raw text via String(), and processed value (escapes resolved) via Value().

Token Types

Tokens are categorized by TokenizerState values:

  • Delimiters: OpenParen, CloseParen, EmptyList, Cons
  • Quotation: Quote, Quasiquote, Unquote, UnquoteSplicing, Syntax variants
  • Numbers: SignedInteger, UnsignedInteger, DecimalFraction, RationalFraction
  • Special numbers: SignedInf, SignedNan, SignedImaginary
  • Radix prefixes: MarkerBase2, MarkerBase8, MarkerBase10, MarkerBase16
  • Literals: Sym, String, Character (graphic, mnemonic, hex escape)
  • Booleans: MarkerBooleanTrue, MarkerBooleanFalse
  • Comments: LineComment, BlockComment, DatumComment
  • Vectors: OpenVector, OpenVectorUnsignedByteMarker
  • Labels: LabelReference, LabelAssignment

Index

Constants

View Source
const (
	MessageRuneError                             = "rune error"
	MessageExpectingNumber                       = "expecting number"
	MessageExpectingExponentMarker               = "expecting exponent marker"
	MessageExpectingExponentDigits               = "expecting exponent digits"
	MessageExpectingImaginary                    = "expecting imaginary"
	MessageExpectingDecimalFraction              = "expecting decimal fraction"
	MessageExpectingNan                          = "expecting NaN"
	MessageExpectingInf                          = "expecting Inf"
	MessageExpectingTrue                         = "expecting true"
	MessageExpectingFalse                        = "expecting false"
	MessageExpectingToken                        = "expecting token"
	MessageExpectingEscape                       = "expecting escape"
	MessageExpectingHexSequenceTerminator        = "expecting hex sequence terminator"
	MessageExpectingLineEnding                   = "expecting line ending"
	MessageExpectingHexDigit                     = "expecting hex digit"
	MessageExpectingCharacterMnemonicOrHexEscape = "expecting character mnemonic or hex escape"
	MessageExpectingDirective                    = "expecting directive"
	MessageCannotParseNumber                     = "cannot parse number"
	MessageCodePointExceedsUnicodeMaximum        = "character code point exceeds Unicode maximum (0x10FFFF)"
	MessageCodePointIsSurrogate                  = "character code point is a surrogate (0xD800-0xDFFF)"
	MessageInvalidHexEscape                      = "invalid hex escape"
	MessageInvalidCharacterHexEscape             = "invalid character hex escape"
	MessageInvalidCharacterMnemonic              = "invalid character mnemonic"
	MessageUnterminatedExtendedSymbol            = "unterminated extended symbol"
	MessageUnterminatedString                    = "unterminated string"
)

Error messages returned by the tokenizer.
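The two code-point messages reflect the standard Unicode validity checks for character hex escapes. A standalone sketch of that validation (validCodePoint is an illustrative name, not part of this package):

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// validCodePoint reports whether a hex-escape value denotes a legal
// Unicode scalar value: at most 0x10FFFF and not a UTF-16 surrogate.
func validCodePoint(cp int64) error {
	if cp > utf8.MaxRune { // utf8.MaxRune == 0x10FFFF
		return fmt.Errorf("character code point exceeds Unicode maximum (0x10FFFF)")
	}
	if cp >= 0xD800 && cp <= 0xDFFF {
		return fmt.Errorf("character code point is a surrogate (0xD800-0xDFFF)")
	}
	return nil
}

func main() {
	fmt.Println(validCodePoint(0x0A))     // accepted: prints <nil>
	fmt.Println(validCodePoint(0xD800))   // rejected: surrogate
	fmt.Println(validCodePoint(0x110000)) // rejected: exceeds maximum
}
```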

Variables

View Source
var (
	ErrNotAnUnsignedByteMarker = values.NewStaticError("not an unsigned byte marker")
	ErrNotALiteral             = values.NewStaticError("not a literal")
)

ErrNotAnUnsignedByteMarker is returned when parsing fails on an unsigned byte marker; ErrNotALiteral is returned when a token is not a literal.

Functions

func HasPrefixCI

func HasPrefixCI(s, prefix string) bool

HasPrefixCI reports whether s begins with prefix, using ASCII case-insensitive comparison. Only ASCII letters (A-Z, a-z) are treated as case-insensitive; all other bytes must match exactly.
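The documented behavior of HasPrefixCI and ToLowerASCII can be sketched in a few lines. The lowercase names below are illustrative re-implementations under the stated contract, not the package's own code:

```go
package main

import "fmt"

// toLowerASCII lowercases only ASCII letters A-Z; every other byte
// passes through unchanged.
func toLowerASCII(c byte) byte {
	if c >= 'A' && c <= 'Z' {
		return c + ('a' - 'A')
	}
	return c
}

// hasPrefixCI reports whether s begins with prefix, comparing ASCII
// letters case-insensitively and all other bytes exactly.
func hasPrefixCI(s, prefix string) bool {
	if len(s) < len(prefix) {
		return false
	}
	for i := 0; i < len(prefix); i++ {
		if toLowerASCII(s[i]) != toLowerASCII(prefix[i]) {
			return false
		}
	}
	return true
}

func main() {
	fmt.Println(hasPrefixCI("#TRUE", "#t"))  // true
	fmt.Println(hasPrefixCI("#false", "#T")) // false
}
```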

func HasSuffixCI

func HasSuffixCI(s, suffix string) bool

HasSuffixCI reports whether s ends with suffix, using ASCII case-insensitive comparison. Only ASCII letters (A-Z, a-z) are treated as case-insensitive; all other bytes must match exactly.

func ToLowerASCII

func ToLowerASCII(c byte) byte

ToLowerASCII converts an ASCII byte to lowercase. Non-ASCII bytes are returned unchanged.

func TrimPrefixCI

func TrimPrefixCI(s, prefix string) string

TrimPrefixCI returns s without the provided leading prefix string, using ASCII case-insensitive comparison. If s doesn't start with prefix (case-insensitively), s is returned unchanged.

func TrimSuffixCI

func TrimSuffixCI(s, suffix string) string

TrimSuffixCI returns s without the provided trailing suffix string, using ASCII case-insensitive comparison. If s doesn't end with suffix (case-insensitively), s is returned unchanged.

Types

type ErrorCode

type ErrorCode int

ErrorCode represents a tokenizer error classification.

type SimpleToken

type SimpleToken struct {
	// contains filtered or unexported fields
}

SimpleToken is the concrete implementation of Token used by the tokenizer.

func NewSimpleToken

func NewSimpleToken(typ TokenizerState, src, val string, sti, eni *syntax.SourceIndexes, signed bool, rad int, hash bool) *SimpleToken

NewSimpleToken creates a new SimpleToken with the given type, source, value, and position.

func (*SimpleToken) End

func (p *SimpleToken) End() syntax.SourceIndexes

End returns the source position where the token ends.

func (*SimpleToken) EqualTo

func (p *SimpleToken) EqualTo(v values.Value) bool

EqualTo returns true if this token equals the given value.

func (*SimpleToken) HasHashDigit

func (p *SimpleToken) HasHashDigit() bool

HasHashDigit returns true if the token contained # as an inexact digit placeholder. R7RS §7.1.1: # can appear in place of digits after at least one real digit, representing an unknown digit (treated as 0). Its presence forces the number to be inexact.
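The placeholder rule can be illustrated outside the tokenizer: read each # as 0 and mark the result inexact. parseHashDigits below is a hypothetical helper covering only the simple trailing-# case, not the package's implementation:

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseHashDigits interprets a digit string in which # stands for an
// unknown digit (R7RS §7.1.1). Each # is read as 0, and the result is
// inexact whenever any # was present.
func parseHashDigits(s string) (value float64, inexact bool, err error) {
	inexact = strings.ContainsRune(s, '#')
	value, err = strconv.ParseFloat(strings.ReplaceAll(s, "#", "0"), 64)
	return value, inexact, err
}

func main() {
	v, inexact, _ := parseHashDigits("15##")
	fmt.Println(v, inexact) // 1500 true
}
```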

func (*SimpleToken) IsVoid

func (p *SimpleToken) IsVoid() bool

IsVoid returns true if the token is nil.

func (*SimpleToken) SchemeString

func (p *SimpleToken) SchemeString() string

SchemeString returns the Scheme representation of the token.

func (*SimpleToken) Start

func (p *SimpleToken) Start() syntax.SourceIndexes

Start returns the source position where the token begins.

func (*SimpleToken) String

func (p *SimpleToken) String() string

func (*SimpleToken) Type

func (p *SimpleToken) Type() TokenizerState

Type returns the token type.

func (*SimpleToken) Value

func (p *SimpleToken) Value() string

Value returns the processed value of the token (e.g., with escape sequences converted).

type Token

type Token interface {
	Type() TokenizerState
	Start() syntax.SourceIndexes
	End() syntax.SourceIndexes
	String() string
	Value() string      // Returns processed value (e.g., with escape sequences converted)
	HasHashDigit() bool // R7RS §7.1.1: whether # appeared as inexact digit placeholder
}

Token is the interface for tokenizer output tokens.

func Tokenize

func Tokenize(s string, ci bool) ([]Token, error)

Tokenize is a convenience function that tokenizes a complete string. It returns all tokens together with the terminating error, which is io.EOF when the entire input was consumed successfully.

type Tokenizer

type Tokenizer struct {
	// contains filtered or unexported fields
}

Tokenizer reads Scheme source code and produces a stream of tokens.

func NewTokenizer

func NewTokenizer(rdr io.RuneReader, ci bool) *Tokenizer

NewTokenizer creates a new tokenizer that reads from the given RuneReader. The tokenizer is initialized with the first rune already read.

func NewTokenizerWithComments

func NewTokenizerWithComments(rdr io.RuneReader, ci bool) *Tokenizer

NewTokenizerWithComments creates a tokenizer that emits comment tokens: comments are returned as Begin/Body/End token sequences instead of being skipped.

func (*Tokenizer) Close

func (p *Tokenizer) Close() error

Close closes the underlying reader if it implements io.Closer.

func (*Tokenizer) Next

func (p *Tokenizer) Next() (Token, error)

Next returns the next token from the input stream. Returns io.EOF when the input is exhausted. Comment tokens are skipped unless the tokenizer was created with NewTokenizerWithComments.

func (*Tokenizer) Reader

func (p *Tokenizer) Reader() io.RuneReader

Reader returns the underlying RuneReader.

func (*Tokenizer) Text

func (p *Tokenizer) Text() string

Text returns the text of the current token.

type TokenizerError

type TokenizerError struct {
	// contains filtered or unexported fields
}

TokenizerError represents an error that occurred during tokenization, with source location information.

func NewTokenizerError

func NewTokenizerError(mess string, start, end syntax.SourceIndexes) *TokenizerError

NewTokenizerError creates a new tokenizer error with the given message and source location.

func NewTokenizerErrorWithWrap

func NewTokenizerErrorWithWrap(err error, mess string, start, end syntax.SourceIndexes) *TokenizerError

NewTokenizerErrorWithWrap creates a new tokenizer error that wraps another error.

func (*TokenizerError) Error

func (p *TokenizerError) Error() string

func (*TokenizerError) Is

func (p *TokenizerError) Is(err error) bool

Is implements errors.Is for TokenizerError.

func (*TokenizerError) Unwrap

func (p *TokenizerError) Unwrap() error

type TokenizerState

type TokenizerState int

TokenizerState represents the type of token recognized by the tokenizer. Each state corresponds to a distinct lexical element in Scheme syntax.

const (
	// TokenizerStateFailed indicates tokenization failed.
	TokenizerStateFailed TokenizerState = iota

	// TokenizerStateSyntax represents #'expr (syntax quote).
	TokenizerStateSyntax
	// TokenizerStateUnsyntax represents #,expr (unsyntax).
	TokenizerStateUnsyntax
	// TokenizerStateUnsyntaxSplicing represents #,@expr (unsyntax-splicing).
	TokenizerStateUnsyntaxSplicing
	// TokenizerStateQuasisyntax represents #`expr (quasisyntax).
	TokenizerStateQuasisyntax

	// TokenizerStateQuote represents 'expr (quote).
	TokenizerStateQuote
	// TokenizerStateUnquote represents ,expr (unquote).
	TokenizerStateUnquote
	// TokenizerStateUnquoteSplicing represents ,@expr (unquote-splicing).
	TokenizerStateUnquoteSplicing
	// TokenizerStateQuasiquote represents `expr (quasiquote).
	TokenizerStateQuasiquote

	// TokenizerStateSignedInf represents +inf.0 or -inf.0 (infinity).
	TokenizerStateSignedInf
	// TokenizerStateSignedNan represents +nan.0 or -nan.0 (not a number).
	TokenizerStateSignedNan
	// TokenizerStateSignedImaginaryInf represents +inf.0i or -inf.0i (imaginary infinity).
	TokenizerStateSignedImaginaryInf
	// TokenizerStateSignedImaginaryNan represents +nan.0i or -nan.0i (imaginary NaN).
	TokenizerStateSignedImaginaryNan
	// TokenizerStateSignedImaginary represents +i, -i, +3i, -3.5i (pure imaginary).
	TokenizerStateSignedImaginary
	// TokenizerStateSignedComplex represents +1+2i, 3.5-2.5i (rectangular complex).
	TokenizerStateSignedComplex
	// TokenizerStateSignedComplexPolar represents +1@1.5708 (polar complex: magnitude@angle).
	TokenizerStateSignedComplexPolar
	// TokenizerStateUnsignedImaginaryInf represents inf.0i (unsigned imaginary infinity).
	TokenizerStateUnsignedImaginaryInf
	// TokenizerStateUnsignedImaginaryNan represents nan.0i (unsigned imaginary NaN).
	TokenizerStateUnsignedImaginaryNan
	// TokenizerStateUnsignedImaginary represents 3i, 3.5i (unsigned pure imaginary).
	TokenizerStateUnsignedImaginary
	// TokenizerStateUnsignedComplex represents 1+2i (unsigned rectangular complex).
	TokenizerStateUnsignedComplex
	// TokenizerStateUnsignedComplexPolar represents 1@1.5708 (unsigned polar complex).
	TokenizerStateUnsignedComplexPolar

	// TokenizerStateMarker represents a generic # marker.
	TokenizerStateMarker
	// TokenizerStateMarkerBooleanFalse represents #f or #false.
	TokenizerStateMarkerBooleanFalse
	// TokenizerStateMarkerBooleanTrue represents #t or #true.
	TokenizerStateMarkerBooleanTrue
	// TokenizerStateMarkerNumberInexact represents #i prefix (inexact).
	TokenizerStateMarkerNumberInexact
	// TokenizerStateMarkerNumberExact represents #e prefix (exact).
	TokenizerStateMarkerNumberExact

	// TokenizerStateSignedInteger represents -123 or +456 (signed decimal).
	TokenizerStateSignedInteger
	// TokenizerStateUnsignedInteger represents 123 (unsigned decimal).
	TokenizerStateUnsignedInteger

	// TokenizerStateSignedIntegerBase2 represents signed binary integer after #b prefix.
	TokenizerStateSignedIntegerBase2
	// TokenizerStateUnsignedIntegerBase2 represents unsigned binary integer after #b prefix.
	TokenizerStateUnsignedIntegerBase2
	// TokenizerStateSignedIntegerBase8 represents signed octal integer after #o prefix.
	TokenizerStateSignedIntegerBase8
	// TokenizerStateUnsignedIntegerBase8 represents unsigned octal integer after #o prefix.
	TokenizerStateUnsignedIntegerBase8
	// TokenizerStateSignedIntegerBase10 represents signed decimal integer after #d prefix.
	TokenizerStateSignedIntegerBase10
	// TokenizerStateUnsignedIntegerBase10 represents unsigned decimal integer after #d prefix.
	TokenizerStateUnsignedIntegerBase10
	// TokenizerStateSignedIntegerBase16 represents signed hexadecimal integer after #x prefix.
	TokenizerStateSignedIntegerBase16
	// TokenizerStateUnsignedIntegerBase16 represents unsigned hexadecimal integer after #x prefix.
	TokenizerStateUnsignedIntegerBase16

	// TokenizerStateBigFloat represents #m arbitrary-precision decimal.
	TokenizerStateBigFloat
	// TokenizerStateBigIntegerDefaultBase represents #z arbitrary-precision integer (default base).
	TokenizerStateBigIntegerDefaultBase
	// TokenizerStateBigIntegerBase2 represents #b arbitrary-precision binary.
	TokenizerStateBigIntegerBase2
	// TokenizerStateBigIntegerBase8 represents #o arbitrary-precision octal.
	TokenizerStateBigIntegerBase8
	// TokenizerStateBigIntegerBase10 represents #d arbitrary-precision decimal.
	TokenizerStateBigIntegerBase10
	// TokenizerStateBigIntegerBase16 represents #x arbitrary-precision hexadecimal.
	TokenizerStateBigIntegerBase16

	// TokenizerStateMarkerBase2 represents #b prefix (binary).
	TokenizerStateMarkerBase2
	// TokenizerStateMarkerBase8 represents #o prefix (octal).
	TokenizerStateMarkerBase8
	// TokenizerStateMarkerBase10 represents #d prefix (decimal).
	TokenizerStateMarkerBase10
	// TokenizerStateMarkerBase16 represents #x prefix (hexadecimal).
	TokenizerStateMarkerBase16

	// TokenizerStateSignedDecimalFraction represents -1.23 or +4.56.
	TokenizerStateSignedDecimalFraction
	// TokenizerStateSignedRationalFraction represents -1/2 or +3/4.
	TokenizerStateSignedRationalFraction
	// TokenizerStateUnsignedRationalFraction represents 1/2 or 3/4.
	TokenizerStateUnsignedRationalFraction
	// TokenizerStateUnsignedDecimalFraction represents 1.23 or 4.56.
	TokenizerStateUnsignedDecimalFraction

	// TokenizerStateSignedScientificNotation represents integers with exponents like +1e10, -2e-5.
	// Parser determines if result is integer or float based on exponent sign and mantissa.
	TokenizerStateSignedScientificNotation
	// TokenizerStateUnsignedScientificNotation represents integers with exponents like 1e10, 2e-5.
	// Parser determines if result is integer or float based on exponent sign and mantissa.
	TokenizerStateUnsignedScientificNotation

	// TokenizerStateEmptyList represents () (empty list).
	TokenizerStateEmptyList
	// TokenizerStateOpenParen represents ( (open parenthesis).
	TokenizerStateOpenParen
	// TokenizerStateCloseParen represents ) (close parenthesis).
	TokenizerStateCloseParen
	// TokenizerStateOpenBracket represents [ (open square bracket).
	// R7RS §2.1: Square brackets are equivalent to parentheses but must match.
	TokenizerStateOpenBracket
	// TokenizerStateCloseBracket represents ] (close square bracket).
	// R7RS §2.1: Square brackets are equivalent to parentheses but must match.
	TokenizerStateCloseBracket
	// TokenizerStateCons represents . (dot for improper lists).
	TokenizerStateCons

	// TokenizerStateStringStart represents opening " (string start).
	TokenizerStateStringStart
	// TokenizerStateStringSpan represents string content.
	TokenizerStateStringSpan
	// TokenizerStateStringIntraEscape represents escape sequence within string.
	TokenizerStateStringIntraEscape
	// TokenizerStateString represents complete "string".
	TokenizerStateString

	// TokenizerStateCharMnemonicOrHexEscape represents intermediate character state.
	TokenizerStateCharMnemonicOrHexEscape
	// TokenizerStateCharMnemonic represents #\newline, #\space, etc.
	TokenizerStateCharMnemonic
	// TokenizerStateCharHexEscape represents #\x0A (hex escape).
	TokenizerStateCharHexEscape
	// TokenizerStateCharGraphic represents #\a (single graphic char).
	TokenizerStateCharGraphic

	// TokenizerStateLineCommentBody represents comment text (multi-token: body).
	TokenizerStateLineCommentBody
	// TokenizerStateBlockCommentBody represents block content (multi-token: body).
	TokenizerStateBlockCommentBody
	// TokenizerStateDatumCommentBegin represents #; (multi-token mode).
	TokenizerStateDatumCommentBegin

	// TokenizerStateSymbol represents an identifier or symbol.
	TokenizerStateSymbol

	// TokenizerStateOpenVector represents #( (vector).
	TokenizerStateOpenVector
	// TokenizerStateOpenVectorUnsignedByteMarker represents #u8( (bytevector).
	TokenizerStateOpenVectorUnsignedByteMarker

	// TokenizerStateDirective represents #!fold-case, etc.
	TokenizerStateDirective
	// TokenizerStateLabelReference represents #123# (datum label reference).
	TokenizerStateLabelReference
	// TokenizerStateLabelAssignment represents #123= (datum label assignment).
	TokenizerStateLabelAssignment
)

TokenizerState values for different token types.
