package tokenizer

v1.4.0
Published: Feb 21, 2026 License: Apache-2.0 Imports: 9 Imported by: 0

Documentation

Overview

Package tokenizer implements R7RS Scheme lexical analysis.

The tokenizer converts a Unicode rune stream into tokens with source positions:

Token Categories

  • Delimiters: (, ), (), .
  • Quotation: ', `, ,, ,@ and syntax variants
  • Numbers: integers, decimals, rationals, scientific, complex, polar
  • Special: +inf.0, -inf.0, +nan.0, imaginary variants
  • Big numbers: #z prefix for BigInteger, #m for BigFloat
  • Literals: symbols, strings, characters
  • Booleans: #t, #f, #true, #false
  • Comments: line (;), block (#|...|#), datum (#;)
  • Vectors: #(, #u8(
  • Labels: #n=, #n#

Usage

tok := tokenizer.NewTokenizer(reader, caseInsensitive)
for {
    token, err := tok.Next()
    if err == io.EOF {
        break
    }
    // process token
}

Each Token provides source position via Start() and End(), raw text via String(), and processed value (escapes resolved) via Value().

Token Types

Tokens are categorized by TokenizerState values:

  • Delimiters: OpenParen, CloseParen, EmptyList, Cons
  • Quotation: Quote, Quasiquote, Unquote, UnquoteSplicing, Syntax variants
  • Numbers: SignedInteger, UnsignedInteger, DecimalFraction, RationalFraction
  • Special numbers: SignedInf, SignedNan, SignedImaginary
  • Radix prefixes: MarkerBase2, MarkerBase8, MarkerBase10, MarkerBase16
  • Literals: Sym, String, Character (graphic, mnemonic, hex escape)
  • Booleans: MarkerBooleanTrue, MarkerBooleanFalse
  • Comments: LineComment, BlockComment, DatumComment
  • Vectors: OpenVector, OpenVectorUnsignedByteMarker
  • Labels: LabelReference, LabelAssignment

Index

Constants

View Source
const (
	MessageRuneError                             = "rune error"
	MessageExpectingNumber                       = "expecting number"
	MessageExpectingExponentMarker               = "expecting exponent marker"
	MessageExpectingExponentDigits               = "expecting exponent digits"
	MessageExpectingImaginary                    = "expecting imaginary"
	MessageExpectingDecimalFraction              = "expecting decimal fraction"
	MessageExpectingNan                          = "expecting NaN"
	MessageExpectingInf                          = "expecting Inf"
	MessageExpectingTrue                         = "expecting true"
	MessageExpectingFalse                        = "expecting false"
	MessageExpectingToken                        = "expecting token"
	MessageExpectingEscape                       = "expecting escape"
	MessageExpectingHexSequenceTerminator        = "expecting hex sequence terminator"
	MessageExpectingLineEnding                   = "expecting line ending"
	MessageExpectingHexDigit                     = "expecting hex digit"
	MessageExpectingCharacterMnemonicOrHexEscape = "expecting character mnemonic or hex escape"
	MessageExpectingDirective                    = "expecting directive"
	MessageCannotParseNumber                     = "cannot parse number"
	MessageCodePointExceedsUnicodeMaximum        = "character code point exceeds Unicode maximum (0x10FFFF)"
	MessageCodePointIsSurrogate                  = "character code point is a surrogate (0xD800-0xDFFF)"
	MessageInvalidHexEscape                      = "invalid hex escape"
	MessageInvalidCharacterHexEscape             = "invalid character hex escape"
	MessageInvalidCharacterMnemonic              = "invalid character mnemonic"
	MessageUnterminatedExtendedSymbol            = "unterminated extended symbol"
	MessageUnterminatedString                    = "unterminated string"
)

Error messages returned by the tokenizer.
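The two code-point messages reflect the standard Unicode validity checks for character hex escapes. A standalone sketch of that validation (validCodePoint is an illustrative name, not part of this package):

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// validCodePoint reports whether a hex-escape value denotes a legal
// Unicode scalar value: at most 0x10FFFF and not a UTF-16 surrogate.
func validCodePoint(cp int64) error {
	if cp > utf8.MaxRune { // utf8.MaxRune == 0x10FFFF
		return fmt.Errorf("character code point exceeds Unicode maximum (0x10FFFF)")
	}
	if cp >= 0xD800 && cp <= 0xDFFF {
		return fmt.Errorf("character code point is a surrogate (0xD800-0xDFFF)")
	}
	return nil
}

func main() {
	fmt.Println(validCodePoint(0x0A))     // accepted: prints <nil>
	fmt.Println(validCodePoint(0xD800))   // rejected: surrogate
	fmt.Println(validCodePoint(0x110000)) // rejected: exceeds maximum
}
```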

Variables

View Source
var (
	ErrNotAnUnsignedByteMarker = values.NewStaticError("not an unsigned byte marker")
	ErrNotALiteral             = values.NewStaticError("not a literal")
)

ErrNotAnUnsignedByteMarker is returned when parsing fails on an unsigned byte marker; ErrNotALiteral is returned when a token is not a literal.

Functions

func HasPrefixCI

func HasPrefixCI(s, prefix string) bool

HasPrefixCI reports whether s begins with prefix, using ASCII case-insensitive comparison. Only ASCII letters (A-Z, a-z) are treated as case-insensitive; all other bytes must match exactly.
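The documented behavior of HasPrefixCI and ToLowerASCII can be sketched in a few lines. The lowercase names below are illustrative re-implementations under the stated contract, not the package's own code:

```go
package main

import "fmt"

// toLowerASCII lowercases only ASCII letters A-Z; every other byte
// passes through unchanged.
func toLowerASCII(c byte) byte {
	if c >= 'A' && c <= 'Z' {
		return c + ('a' - 'A')
	}
	return c
}

// hasPrefixCI reports whether s begins with prefix, comparing ASCII
// letters case-insensitively and all other bytes exactly.
func hasPrefixCI(s, prefix string) bool {
	if len(s) < len(prefix) {
		return false
	}
	for i := 0; i < len(prefix); i++ {
		if toLowerASCII(s[i]) != toLowerASCII(prefix[i]) {
			return false
		}
	}
	return true
}

func main() {
	fmt.Println(hasPrefixCI("#TRUE", "#t"))  // true
	fmt.Println(hasPrefixCI("#false", "#T")) // false
}
```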

func HasSuffixCI

func HasSuffixCI(s, suffix string) bool

HasSuffixCI reports whether s ends with suffix, using ASCII case-insensitive comparison. Only ASCII letters (A-Z, a-z) are treated as case-insensitive; all other bytes must match exactly.

func ToLowerASCII

func ToLowerASCII(c byte) byte

ToLowerASCII converts an ASCII byte to lowercase. Non-ASCII bytes are returned unchanged.

func TrimPrefixCI

func TrimPrefixCI(s, prefix string) string

TrimPrefixCI returns s without the provided leading prefix string, using ASCII case-insensitive comparison. If s doesn't start with prefix (case-insensitively), s is returned unchanged.

func TrimSuffixCI

func TrimSuffixCI(s, suffix string) string

TrimSuffixCI returns s without the provided trailing suffix string, using ASCII case-insensitive comparison. If s doesn't end with suffix (case-insensitively), s is returned unchanged.

Types

type ErrorCode

type ErrorCode int

ErrorCode represents a tokenizer error classification.

type SimpleToken

type SimpleToken struct {
	// contains filtered or unexported fields
}

SimpleToken is the concrete implementation of Token used by the tokenizer.

func NewSimpleToken

func NewSimpleToken(typ TokenizerState, src, val string, sti, eni *syntax.SourceIndexes, signed bool, rad int, hash bool) *SimpleToken

NewSimpleToken creates a new SimpleToken with the given type, source, value, and position.

func (*SimpleToken) End

func (p *SimpleToken) End() syntax.SourceIndexes

End returns the source position where the token ends.

func (*SimpleToken) EqualTo

func (p *SimpleToken) EqualTo(v values.Value) bool

EqualTo returns true if this token equals the given value.

func (*SimpleToken) HasHashDigit

func (p *SimpleToken) HasHashDigit() bool

HasHashDigit returns true if the token contained # as an inexact digit placeholder. R7RS §7.1.1: # can appear in place of digits after at least one real digit, representing an unknown digit (treated as 0). Its presence forces the number to be inexact.
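The placeholder rule can be illustrated outside the tokenizer: read each # as 0 and mark the result inexact. parseHashDigits below is a hypothetical helper covering only the simple trailing-# case, not the package's implementation:

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseHashDigits interprets a digit string in which # stands for an
// unknown digit (R7RS §7.1.1). Each # is read as 0, and the result is
// inexact whenever any # was present.
func parseHashDigits(s string) (value float64, inexact bool, err error) {
	inexact = strings.ContainsRune(s, '#')
	value, err = strconv.ParseFloat(strings.ReplaceAll(s, "#", "0"), 64)
	return value, inexact, err
}

func main() {
	v, inexact, _ := parseHashDigits("15##")
	fmt.Println(v, inexact) // 1500 true
}
```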

func (*SimpleToken) IsVoid

func (p *SimpleToken) IsVoid() bool

IsVoid returns true if the token is nil.

func (*SimpleToken) SchemeString

func (p *SimpleToken) SchemeString() string

SchemeString returns the Scheme representation of the token.

func (*SimpleToken) Start

func (p *SimpleToken) Start() syntax.SourceIndexes

Start returns the source position where the token begins.

func (*SimpleToken) String

func (p *SimpleToken) String() string

func (*SimpleToken) Type

func (p *SimpleToken) Type() TokenizerState

Type returns the token type.

func (*SimpleToken) Value

func (p *SimpleToken) Value() string

Value returns the processed value of the token (e.g., with escape sequences converted).

type Token

type Token interface {
	Type() TokenizerState
	Start() syntax.SourceIndexes
	End() syntax.SourceIndexes
	String() string
	Value() string      // Returns processed value (e.g., with escape sequences converted)
	HasHashDigit() bool // R7RS §7.1.1: whether # appeared as inexact digit placeholder
}

Token is the interface for tokenizer output tokens.

func Tokenize

func Tokenize(s string, ci bool) ([]Token, error)

Tokenize is a convenience function that tokenizes a complete string. It returns all tokens together with the terminating error, which is io.EOF when the entire input was consumed successfully.

type Tokenizer

type Tokenizer struct {
	// contains filtered or unexported fields
}

Tokenizer reads Scheme source code and produces a stream of tokens.

func NewTokenizer

func NewTokenizer(rdr io.RuneReader, ci bool) *Tokenizer

NewTokenizer creates a new tokenizer that reads from the given RuneReader. The tokenizer is initialized with the first rune already read.

func NewTokenizerWithComments

func NewTokenizerWithComments(rdr io.RuneReader, ci bool) *Tokenizer

NewTokenizerWithComments creates a tokenizer that emits comment tokens: comments are returned as Begin/Body/End token sequences instead of being skipped.

func (*Tokenizer) Close

func (p *Tokenizer) Close() error

Close closes the underlying reader if it implements io.Closer.

func (*Tokenizer) Next

func (p *Tokenizer) Next() (Token, error)

Next returns the next token from the input stream. Returns io.EOF when the input is exhausted. Comment tokens are skipped unless the tokenizer was created with NewTokenizerWithComments.

func (*Tokenizer) Reader

func (p *Tokenizer) Reader() io.RuneReader

Reader returns the underlying RuneReader.

func (*Tokenizer) Text

func (p *Tokenizer) Text() string

Text returns the text of the current token.

type TokenizerError

type TokenizerError struct {
	// contains filtered or unexported fields
}

TokenizerError represents an error that occurred during tokenization, with source location information.

func NewTokenizerError

func NewTokenizerError(mess string, start, end syntax.SourceIndexes) *TokenizerError

NewTokenizerError creates a new tokenizer error with the given message and source location.

func NewTokenizerErrorWithWrap

func NewTokenizerErrorWithWrap(err error, mess string, start, end syntax.SourceIndexes) *TokenizerError

NewTokenizerErrorWithWrap creates a new tokenizer error that wraps another error.

func (*TokenizerError) Error

func (p *TokenizerError) Error() string

func (*TokenizerError) Is

func (p *TokenizerError) Is(err error) bool

Is implements errors.Is for TokenizerError.

func (*TokenizerError) Unwrap

func (p *TokenizerError) Unwrap() error

type TokenizerState

type TokenizerState int

TokenizerState represents the type of token recognized by the tokenizer. Each state corresponds to a distinct lexical element in Scheme syntax.

const (
	// TokenizerStateFailed indicates tokenization failed.
	TokenizerStateFailed TokenizerState = iota

	// TokenizerStateSyntax represents #'expr (syntax quote).
	TokenizerStateSyntax
	// TokenizerStateUnsyntax represents #,expr (unsyntax).
	TokenizerStateUnsyntax
	// TokenizerStateUnsyntaxSplicing represents #,@expr (unsyntax-splicing).
	TokenizerStateUnsyntaxSplicing
	// TokenizerStateQuasisyntax represents #`expr (quasisyntax).
	TokenizerStateQuasisyntax

	// TokenizerStateQuote represents 'expr (quote).
	TokenizerStateQuote
	// TokenizerStateUnquote represents ,expr (unquote).
	TokenizerStateUnquote
	// TokenizerStateUnquoteSplicing represents ,@expr (unquote-splicing).
	TokenizerStateUnquoteSplicing
	// TokenizerStateQuasiquote represents `expr (quasiquote).
	TokenizerStateQuasiquote

	// TokenizerStateSignedInf represents +inf.0 or -inf.0 (infinity).
	TokenizerStateSignedInf
	// TokenizerStateSignedNan represents +nan.0 or -nan.0 (not a number).
	TokenizerStateSignedNan
	// TokenizerStateSignedImaginaryInf represents +inf.0i or -inf.0i (imaginary infinity).
	TokenizerStateSignedImaginaryInf
	// TokenizerStateSignedImaginaryNan represents +nan.0i or -nan.0i (imaginary NaN).
	TokenizerStateSignedImaginaryNan
	// TokenizerStateSignedImaginary represents +i, -i, +3i, -3.5i (pure imaginary).
	TokenizerStateSignedImaginary
	// TokenizerStateSignedComplex represents +1+2i, 3.5-2.5i (rectangular complex).
	TokenizerStateSignedComplex
	// TokenizerStateSignedComplexPolar represents +1@1.5708 (polar complex: magnitude@angle).
	TokenizerStateSignedComplexPolar
	// TokenizerStateUnsignedImaginaryInf represents inf.0i (unsigned imaginary infinity).
	TokenizerStateUnsignedImaginaryInf
	// TokenizerStateUnsignedImaginaryNan represents nan.0i (unsigned imaginary NaN).
	TokenizerStateUnsignedImaginaryNan
	// TokenizerStateUnsignedImaginary represents 3i, 3.5i (unsigned pure imaginary).
	TokenizerStateUnsignedImaginary
	// TokenizerStateUnsignedComplex represents 1+2i (unsigned rectangular complex).
	TokenizerStateUnsignedComplex
	// TokenizerStateUnsignedComplexPolar represents 1@1.5708 (unsigned polar complex).
	TokenizerStateUnsignedComplexPolar

	// TokenizerStateMarker represents a generic # marker.
	TokenizerStateMarker
	// TokenizerStateMarkerBooleanFalse represents #f or #false.
	TokenizerStateMarkerBooleanFalse
	// TokenizerStateMarkerBooleanTrue represents #t or #true.
	TokenizerStateMarkerBooleanTrue
	// TokenizerStateMarkerNumberInexact represents #i prefix (inexact).
	TokenizerStateMarkerNumberInexact
	// TokenizerStateMarkerNumberExact represents #e prefix (exact).
	TokenizerStateMarkerNumberExact

	// TokenizerStateSignedInteger represents -123 or +456 (signed decimal).
	TokenizerStateSignedInteger
	// TokenizerStateUnsignedInteger represents 123 (unsigned decimal).
	TokenizerStateUnsignedInteger

	// TokenizerStateSignedIntegerBase2 represents signed binary integer after #b prefix.
	TokenizerStateSignedIntegerBase2
	// TokenizerStateUnsignedIntegerBase2 represents unsigned binary integer after #b prefix.
	TokenizerStateUnsignedIntegerBase2
	// TokenizerStateSignedIntegerBase8 represents signed octal integer after #o prefix.
	TokenizerStateSignedIntegerBase8
	// TokenizerStateUnsignedIntegerBase8 represents unsigned octal integer after #o prefix.
	TokenizerStateUnsignedIntegerBase8
	// TokenizerStateSignedIntegerBase10 represents signed decimal integer after #d prefix.
	TokenizerStateSignedIntegerBase10
	// TokenizerStateUnsignedIntegerBase10 represents unsigned decimal integer after #d prefix.
	TokenizerStateUnsignedIntegerBase10
	// TokenizerStateSignedIntegerBase16 represents signed hexadecimal integer after #x prefix.
	TokenizerStateSignedIntegerBase16
	// TokenizerStateUnsignedIntegerBase16 represents unsigned hexadecimal integer after #x prefix.
	TokenizerStateUnsignedIntegerBase16

	// TokenizerStateBigFloat represents #m arbitrary-precision decimal.
	TokenizerStateBigFloat
	// TokenizerStateBigIntegerDefaultBase represents #z arbitrary-precision integer (default base).
	TokenizerStateBigIntegerDefaultBase
	// TokenizerStateBigIntegerBase2 represents #b arbitrary-precision binary.
	TokenizerStateBigIntegerBase2
	// TokenizerStateBigIntegerBase8 represents #o arbitrary-precision octal.
	TokenizerStateBigIntegerBase8
	// TokenizerStateBigIntegerBase10 represents #d arbitrary-precision decimal.
	TokenizerStateBigIntegerBase10
	// TokenizerStateBigIntegerBase16 represents #x arbitrary-precision hexadecimal.
	TokenizerStateBigIntegerBase16

	// TokenizerStateMarkerBase2 represents #b prefix (binary).
	TokenizerStateMarkerBase2
	// TokenizerStateMarkerBase8 represents #o prefix (octal).
	TokenizerStateMarkerBase8
	// TokenizerStateMarkerBase10 represents #d prefix (decimal).
	TokenizerStateMarkerBase10
	// TokenizerStateMarkerBase16 represents #x prefix (hexadecimal).
	TokenizerStateMarkerBase16

	// TokenizerStateSignedDecimalFraction represents -1.23 or +4.56.
	TokenizerStateSignedDecimalFraction
	// TokenizerStateSignedRationalFraction represents -1/2 or +3/4.
	TokenizerStateSignedRationalFraction
	// TokenizerStateUnsignedRationalFraction represents 1/2 or 3/4.
	TokenizerStateUnsignedRationalFraction
	// TokenizerStateUnsignedDecimalFraction represents 1.23 or 4.56.
	TokenizerStateUnsignedDecimalFraction

	// TokenizerStateSignedScientificNotation represents integers with exponents like +1e10, -2e-5.
	// Parser determines if result is integer or float based on exponent sign and mantissa.
	TokenizerStateSignedScientificNotation
	// TokenizerStateUnsignedScientificNotation represents integers with exponents like 1e10, 2e-5.
	// Parser determines if result is integer or float based on exponent sign and mantissa.
	TokenizerStateUnsignedScientificNotation

	// TokenizerStateEmptyList represents () (empty list).
	TokenizerStateEmptyList
	// TokenizerStateOpenParen represents ( (open parenthesis).
	TokenizerStateOpenParen
	// TokenizerStateCloseParen represents ) (close parenthesis).
	TokenizerStateCloseParen
	// TokenizerStateOpenBracket represents [ (open square bracket).
	// R7RS §2.1: Square brackets are equivalent to parentheses but must match.
	TokenizerStateOpenBracket
	// TokenizerStateCloseBracket represents ] (close square bracket).
	// R7RS §2.1: Square brackets are equivalent to parentheses but must match.
	TokenizerStateCloseBracket
	// TokenizerStateCons represents . (dot for improper lists).
	TokenizerStateCons

	// TokenizerStateStringStart represents opening " (string start).
	TokenizerStateStringStart
	// TokenizerStateStringSpan represents string content.
	TokenizerStateStringSpan
	// TokenizerStateStringIntraEscape represents escape sequence within string.
	TokenizerStateStringIntraEscape
	// TokenizerStateString represents complete "string".
	TokenizerStateString

	// TokenizerStateCharMnemonicOrHexEscape represents intermediate character state.
	TokenizerStateCharMnemonicOrHexEscape
	// TokenizerStateCharMnemonic represents #\newline, #\space, etc.
	TokenizerStateCharMnemonic
	// TokenizerStateCharHexEscape represents #\x0A (hex escape).
	TokenizerStateCharHexEscape
	// TokenizerStateCharGraphic represents #\a (single graphic char).
	TokenizerStateCharGraphic

	// TokenizerStateLineCommentBody represents comment text (multi-token: body).
	TokenizerStateLineCommentBody
	// TokenizerStateBlockCommentBody represents block content (multi-token: body).
	TokenizerStateBlockCommentBody
	// TokenizerStateDatumCommentBegin represents #; (multi-token mode).
	TokenizerStateDatumCommentBegin

	// TokenizerStateSymbol represents an identifier or symbol.
	TokenizerStateSymbol

	// TokenizerStateOpenVector represents #( (vector).
	TokenizerStateOpenVector
	// TokenizerStateOpenVectorUnsignedByteMarker represents #u8( (bytevector).
	TokenizerStateOpenVectorUnsignedByteMarker

	// TokenizerStateDirective represents #!fold-case, etc.
	TokenizerStateDirective
	// TokenizerStateLabelReference represents #123# (datum label reference).
	TokenizerStateLabelReference
	// TokenizerStateLabelAssignment represents #123= (datum label assignment).
	TokenizerStateLabelAssignment
)

TokenizerState values for different token types.
