Documentation
Overview
Package tokenizer implements R7RS Scheme lexical analysis.
The tokenizer converts a Unicode rune stream into tokens that carry source positions.
Token Categories
- Delimiters: (, ), [, ], (), .
- Quotation: ', `, ,, ,@ and syntax variants
- Numbers: integers, decimals, rationals, scientific, complex, polar
- Special: +inf.0, -inf.0, +nan.0, imaginary variants
- Big numbers: #z prefix for BigInteger, #m for BigFloat
- Literals: symbols, strings, characters
- Booleans: #t, #f, #true, #false
- Comments: line (;), block (#|...|#), datum (#;)
- Vectors: #(, #u8(
- Labels: #n=, #n#
Usage
tok := tokenizer.NewTokenizer(reader, caseInsensitive)
for {
    token, err := tok.Next()
    if err == io.EOF {
        break
    }
    // process token
}
Each Token provides source position via Start() and End(), raw text via String(), and processed value (escapes resolved) via Value().
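The distinction between String() and Value() matters mostly for strings and characters: for the source text "a\nb", String() yields the raw text as written, while Value() yields a string containing a real newline. A minimal self-contained sketch of that escape resolution (illustrative only, not the package's implementation; it handles just the common mnemonic escapes):

```go
package main

import (
	"fmt"
	"strings"
)

// unescape resolves the common R7RS string escapes (\n, \t, \\, \")
// the way a Value() accessor might, turning raw source text into the
// processed string value. Illustrative only.
func unescape(raw string) string {
	r := strings.NewReplacer(`\n`, "\n", `\t`, "\t", `\\`, `\`, `\"`, `"`)
	return r.Replace(raw)
}

func main() {
	raw := `a\nb` // raw token text: four characters, backslash and 'n' literal
	fmt.Printf("%q\n", unescape(raw))
}
```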
Index
- Constants
- Variables
- func HasPrefixCI(s, prefix string) bool
- func HasSuffixCI(s, suffix string) bool
- func ToLowerASCII(c byte) byte
- func TrimPrefixCI(s, prefix string) string
- func TrimSuffixCI(s, suffix string) string
- type ErrorCode
- type SimpleToken
- func (p *SimpleToken) End() syntax.SourceIndexes
- func (p *SimpleToken) EqualTo(v values.Value) bool
- func (p *SimpleToken) HasHashDigit() bool
- func (p *SimpleToken) IsVoid() bool
- func (p *SimpleToken) SchemeString() string
- func (p *SimpleToken) Start() syntax.SourceIndexes
- func (p *SimpleToken) String() string
- func (p *SimpleToken) Type() TokenizerState
- func (p *SimpleToken) Value() string
- type Token
- type Tokenizer
- type TokenizerError
- type TokenizerState
Constants
const (
    MessageRuneError = "rune error"
    MessageExpectingNumber = "expecting number"
    MessageExpectingExponentMarker = "expecting exponent marker"
    MessageExpectingExponentDigits = "expecting exponent digits"
    MessageExpectingImaginary = "expecting imaginary"
    MessageExpectingDecimalFraction = "expecting decimal fraction"
    MessageExpectingNan = "expecting NaN"
    MessageExpectingInf = "expecting Inf"
    MessageExpectingTrue = "expecting true"
    MessageExpectingFalse = "expecting false"
    MessageExpectingToken = "expecting token"
    MessageExpectingEscape = "expecting escape"
    MessageExpectingHexSequenceTerminator = "expecting hex sequence terminator"
    MessageExpectingLineEnding = "expecting line ending"
    MessageExpectingHexDigit = "expecting hex digit"
    MessageExpectingCharacterMnemonicOrHexEscape = "expecting character mnemonic or hex escape"
    MessageExpectingDirective = "expecting directive"
    MessageCannotParseNumber = "cannot parse number"
    MessageCodePointExceedsUnicodeMaximum = "character code point exceeds Unicode maximum (0x10FFFF)"
    MessageCodePointIsSurrogate = "character code point is a surrogate (0xD800-0xDFFF)"
    MessageInvalidHexEscape = "invalid hex escape"
    MessageInvalidCharacterHexEscape = "invalid character hex escape"
    MessageInvalidCharacterMnemonic = "invalid character mnemonic"
    MessageUnterminatedExtendedSymbol = "unterminated extended symbol"
    MessageUnterminatedString = "unterminated string"
)
Error messages returned by the tokenizer.
Variables
var (
    ErrNotAnUnsignedByteMarker = values.NewStaticError("not an unsigned byte marker")
    ErrNotALiteral             = values.NewStaticError("not a literal")
)
ErrNotAnUnsignedByteMarker is returned when parsing fails on an unsigned byte marker.
Functions
func HasPrefixCI
HasPrefixCI reports whether s begins with prefix, using ASCII case-insensitive comparison. Only ASCII letters (A-Z, a-z) are treated as case-insensitive; all other bytes must match exactly.
func HasSuffixCI
HasSuffixCI reports whether s ends with suffix, using ASCII case-insensitive comparison. Only ASCII letters (A-Z, a-z) are treated as case-insensitive; all other bytes must match exactly.
func ToLowerASCII
ToLowerASCII converts an ASCII byte to lowercase. Non-ASCII bytes are returned unchanged.
func TrimPrefixCI
TrimPrefixCI returns s without the provided leading prefix string, using ASCII case-insensitive comparison. If s doesn't start with prefix (case-insensitively), s is returned unchanged.
func TrimSuffixCI
TrimSuffixCI returns s without the provided trailing suffix string, using ASCII case-insensitive comparison. If s doesn't end with suffix (case-insensitively), s is returned unchanged.
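These helpers let the tokenizer match case-insensitive lexemes such as #TRUE or exponent markers without allocating lowercased copies of the input. A rough self-contained equivalent of the prefix check (hypothetical lowercase names, not the package's code):

```go
package main

import "fmt"

// toLowerASCII lowercases a single ASCII letter; every other byte
// passes through unchanged.
func toLowerASCII(c byte) byte {
	if c >= 'A' && c <= 'Z' {
		return c + ('a' - 'A')
	}
	return c
}

// hasPrefixCI reports whether s begins with prefix, comparing byte by
// byte and folding only ASCII letters, with no intermediate allocations.
func hasPrefixCI(s, prefix string) bool {
	if len(s) < len(prefix) {
		return false
	}
	for i := 0; i < len(prefix); i++ {
		if toLowerASCII(s[i]) != toLowerASCII(prefix[i]) {
			return false
		}
	}
	return true
}

func main() {
	fmt.Println(hasPrefixCI("#TRUE", "#t")) // ASCII letters fold
	fmt.Println(hasPrefixCI("#λ", "#Λ"))    // non-ASCII bytes must match exactly
}
```

Because only ASCII letters fold, the comparison stays byte-oriented and never mis-handles multi-byte UTF-8 sequences.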
Types
type SimpleToken
type SimpleToken struct {
// contains filtered or unexported fields
}
SimpleToken is the concrete implementation of Token used by the tokenizer.
func NewSimpleToken
func NewSimpleToken(typ TokenizerState, src, val string, sti, eni *syntax.SourceIndexes, signed bool, rad int, hash bool) *SimpleToken
NewSimpleToken creates a new SimpleToken with the given type, source, value, and position.
func (*SimpleToken) End
func (p *SimpleToken) End() syntax.SourceIndexes
End returns the source position where the token ends.
func (*SimpleToken) EqualTo
func (p *SimpleToken) EqualTo(v values.Value) bool
EqualTo returns true if this token equals the given value.
func (*SimpleToken) HasHashDigit
func (p *SimpleToken) HasHashDigit() bool
HasHashDigit returns true if the token contained # as an inexact digit placeholder. R7RS §7.1.1: # can appear in place of digits after at least one real digit, representing an unknown digit (treated as 0). Its presence forces the number to be inexact.
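For example, 123# denotes the inexact number 1230.0. A sketch of the normalization a parser might apply once HasHashDigit reports true (hypothetical helper, not the package's code):

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// normalizeHashDigits replaces each # placeholder digit with 0 and
// reports whether any were present, which forces an inexact result.
func normalizeHashDigits(text string) (digits string, inexact bool) {
	inexact = strings.ContainsRune(text, '#')
	return strings.ReplaceAll(text, "#", "0"), inexact
}

func main() {
	digits, inexact := normalizeHashDigits("123#")
	n, _ := strconv.Atoi(digits)
	if inexact {
		fmt.Printf("%.1f\n", float64(n)) // 123# reads as the inexact number 1230.0
	}
}
```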
func (*SimpleToken) IsVoid
func (p *SimpleToken) IsVoid() bool
IsVoid returns true if the token is nil.
func (*SimpleToken) SchemeString
func (p *SimpleToken) SchemeString() string
SchemeString returns the Scheme representation of the token.
func (*SimpleToken) Start
func (p *SimpleToken) Start() syntax.SourceIndexes
Start returns the source position where the token begins.
func (*SimpleToken) String
func (p *SimpleToken) String() string
func (*SimpleToken) Value
func (p *SimpleToken) Value() string
Value returns the processed value of the token (e.g., with escape sequences converted).
type Token
type Token interface {
Type() TokenizerState
Start() syntax.SourceIndexes
End() syntax.SourceIndexes
String() string
Value() string // Returns processed value (e.g., with escape sequences converted)
HasHashDigit() bool // R7RS §7.1.1: whether # appeared as inexact digit placeholder
}
Token is the interface for tokenizer output tokens.
type Tokenizer
type Tokenizer struct {
// contains filtered or unexported fields
}
Tokenizer reads Scheme source code and produces a stream of tokens.
func NewTokenizer
func NewTokenizer(rdr io.RuneReader, ci bool) *Tokenizer
NewTokenizer creates a new tokenizer that reads from the given RuneReader. The tokenizer is initialized with the first rune already read.
func NewTokenizerWithComments
func NewTokenizerWithComments(rdr io.RuneReader, ci bool) *Tokenizer
NewTokenizerWithComments creates a tokenizer that emits comment tokens: comments are returned as Begin/Body/End token sequences instead of being skipped.
func (*Tokenizer) Next
func (p *Tokenizer) Next() (Token, error)
Next returns the next token from the input stream. It returns io.EOF when the input is exhausted. Comment tokens are skipped unless the tokenizer was created with NewTokenizerWithComments.
func (*Tokenizer) Reader
func (p *Tokenizer) Reader() io.RuneReader
Reader returns the underlying RuneReader.
type TokenizerError
type TokenizerError struct {
// contains filtered or unexported fields
}
TokenizerError represents an error that occurred during tokenization, with source location information.
func NewTokenizerError
func NewTokenizerError(mess string, start, end syntax.SourceIndexes) *TokenizerError
NewTokenizerError creates a new tokenizer error with the given message and source location.
func NewTokenizerErrorWithWrap
func NewTokenizerErrorWithWrap(err error, mess string, start, end syntax.SourceIndexes) *TokenizerError
NewTokenizerErrorWithWrap creates a new tokenizer error that wraps another error.
func (*TokenizerError) Error
func (p *TokenizerError) Error() string
func (*TokenizerError) Is
func (p *TokenizerError) Is(err error) bool
Is implements errors.Is for TokenizerError.
func (*TokenizerError) Unwrap
func (p *TokenizerError) Unwrap() error
type TokenizerState
type TokenizerState int
TokenizerState represents the type of token recognized by the tokenizer. Each state corresponds to a distinct lexical element in Scheme syntax.
const (
    // TokenizerStateFailed indicates tokenization failed.
    TokenizerStateFailed TokenizerState = iota
    // TokenizerStateSyntax represents #'expr (syntax quote).
    TokenizerStateSyntax
    // TokenizerStateUnsyntax represents #,expr (unsyntax).
    TokenizerStateUnsyntax
    // TokenizerStateUnsyntaxSplicing represents #,@expr (unsyntax-splicing).
    TokenizerStateUnsyntaxSplicing
    // TokenizerStateQuasisyntax represents #`expr (quasisyntax).
    TokenizerStateQuasisyntax
    // TokenizerStateQuote represents 'expr (quote).
    TokenizerStateQuote
    // TokenizerStateUnquote represents ,expr (unquote).
    TokenizerStateUnquote
    // TokenizerStateUnquoteSplicing represents ,@expr (unquote-splicing).
    TokenizerStateUnquoteSplicing
    // TokenizerStateQuasiquote represents `expr (quasiquote).
    TokenizerStateQuasiquote
    // TokenizerStateSignedInf represents +inf.0 or -inf.0 (infinity).
    TokenizerStateSignedInf
    // TokenizerStateSignedNan represents +nan.0 or -nan.0 (not a number).
    TokenizerStateSignedNan
    // TokenizerStateSignedImaginaryInf represents +inf.0i or -inf.0i (imaginary infinity).
    TokenizerStateSignedImaginaryInf
    // TokenizerStateSignedImaginaryNan represents +nan.0i or -nan.0i (imaginary NaN).
    TokenizerStateSignedImaginaryNan
    // TokenizerStateSignedImaginary represents +i, -i, +3i, -3.5i (pure imaginary).
    TokenizerStateSignedImaginary
    // TokenizerStateSignedComplex represents +1+2i, 3.5-2.5i (rectangular complex).
    TokenizerStateSignedComplex
    // TokenizerStateSignedComplexPolar represents +1@1.5708 (polar complex: magnitude@angle).
    TokenizerStateSignedComplexPolar
    // TokenizerStateUnsignedImaginaryInf represents inf.0i (unsigned imaginary infinity).
    TokenizerStateUnsignedImaginaryInf
    // TokenizerStateUnsignedImaginaryNan represents nan.0i (unsigned imaginary NaN).
    TokenizerStateUnsignedImaginaryNan
    // TokenizerStateUnsignedImaginary represents 3i, 3.5i (unsigned pure imaginary).
    TokenizerStateUnsignedImaginary
    // TokenizerStateUnsignedComplex represents 1+2i (unsigned rectangular complex).
    TokenizerStateUnsignedComplex
    // TokenizerStateUnsignedComplexPolar represents 1@1.5708 (unsigned polar complex).
    TokenizerStateUnsignedComplexPolar
    // TokenizerStateMarker represents a generic # marker.
    TokenizerStateMarker
    // TokenizerStateMarkerBooleanFalse represents #f or #false.
    TokenizerStateMarkerBooleanFalse
    // TokenizerStateMarkerBooleanTrue represents #t or #true.
    TokenizerStateMarkerBooleanTrue
    // TokenizerStateMarkerNumberInexact represents #i prefix (inexact).
    TokenizerStateMarkerNumberInexact
    // TokenizerStateMarkerNumberExact represents #e prefix (exact).
    TokenizerStateMarkerNumberExact
    // TokenizerStateSignedInteger represents -123 or +456 (signed decimal).
    TokenizerStateSignedInteger
    // TokenizerStateUnsignedInteger represents 123 (unsigned decimal).
    TokenizerStateUnsignedInteger
    // TokenizerStateSignedIntegerBase2 represents a signed binary integer after a #b prefix.
    TokenizerStateSignedIntegerBase2
    // TokenizerStateUnsignedIntegerBase2 represents an unsigned binary integer after a #b prefix.
    TokenizerStateUnsignedIntegerBase2
    // TokenizerStateSignedIntegerBase8 represents a signed octal integer after a #o prefix.
    TokenizerStateSignedIntegerBase8
    // TokenizerStateUnsignedIntegerBase8 represents an unsigned octal integer after a #o prefix.
    TokenizerStateUnsignedIntegerBase8
    // TokenizerStateSignedIntegerBase10 represents a signed decimal integer after a #d prefix.
    TokenizerStateSignedIntegerBase10
    // TokenizerStateUnsignedIntegerBase10 represents an unsigned decimal integer after a #d prefix.
    TokenizerStateUnsignedIntegerBase10
    // TokenizerStateSignedIntegerBase16 represents a signed hexadecimal integer after a #x prefix.
    TokenizerStateSignedIntegerBase16
    // TokenizerStateUnsignedIntegerBase16 represents an unsigned hexadecimal integer after a #x prefix.
    TokenizerStateUnsignedIntegerBase16
    // TokenizerStateBigFloat represents #m arbitrary-precision decimal.
    TokenizerStateBigFloat
    // TokenizerStateBigIntegerDefaultBase represents #z arbitrary-precision integer (default base).
    TokenizerStateBigIntegerDefaultBase
    // TokenizerStateBigIntegerBase2 represents #b arbitrary-precision binary.
    TokenizerStateBigIntegerBase2
    // TokenizerStateBigIntegerBase8 represents #o arbitrary-precision octal.
    TokenizerStateBigIntegerBase8
    // TokenizerStateBigIntegerBase10 represents #d arbitrary-precision decimal.
    TokenizerStateBigIntegerBase10
    // TokenizerStateBigIntegerBase16 represents #x arbitrary-precision hexadecimal.
    TokenizerStateBigIntegerBase16
    // TokenizerStateMarkerBase2 represents #b prefix (binary).
    TokenizerStateMarkerBase2
    // TokenizerStateMarkerBase8 represents #o prefix (octal).
    TokenizerStateMarkerBase8
    // TokenizerStateMarkerBase10 represents #d prefix (decimal).
    TokenizerStateMarkerBase10
    // TokenizerStateMarkerBase16 represents #x prefix (hexadecimal).
    TokenizerStateMarkerBase16
    // TokenizerStateSignedDecimalFraction represents -1.23 or +4.56.
    TokenizerStateSignedDecimalFraction
    // TokenizerStateSignedRationalFraction represents -1/2 or +3/4.
    TokenizerStateSignedRationalFraction
    // TokenizerStateUnsignedRationalFraction represents 1/2 or 3/4.
    TokenizerStateUnsignedRationalFraction
    // TokenizerStateUnsignedDecimalFraction represents 1.23 or 4.56.
    TokenizerStateUnsignedDecimalFraction
    // TokenizerStateSignedScientificNotation represents integers with exponents like +1e10, -2e-5.
    // The parser determines whether the result is an integer or a float based on exponent sign and mantissa.
    TokenizerStateSignedScientificNotation
    // TokenizerStateUnsignedScientificNotation represents integers with exponents like 1e10, 2e-5.
    // The parser determines whether the result is an integer or a float based on exponent sign and mantissa.
    TokenizerStateUnsignedScientificNotation
    // TokenizerStateEmptyList represents () (empty list).
    TokenizerStateEmptyList
    // TokenizerStateOpenParen represents ( (open parenthesis).
    TokenizerStateOpenParen
    // TokenizerStateCloseParen represents ) (close parenthesis).
    TokenizerStateCloseParen
    // TokenizerStateOpenBracket represents [ (open square bracket).
    // R7RS §2.1: Square brackets are equivalent to parentheses but must match.
    TokenizerStateOpenBracket
    // TokenizerStateCloseBracket represents ] (close square bracket).
    // R7RS §2.1: Square brackets are equivalent to parentheses but must match.
    TokenizerStateCloseBracket
    // TokenizerStateCons represents . (dot for improper lists).
    TokenizerStateCons
    // TokenizerStateStringStart represents opening " (string start).
    TokenizerStateStringStart
    // TokenizerStateStringSpan represents string content.
    TokenizerStateStringSpan
    // TokenizerStateStringIntraEscape represents an escape sequence within a string.
    TokenizerStateStringIntraEscape
    // TokenizerStateString represents a complete "string".
    TokenizerStateString
    // TokenizerStateCharMnemonicOrHexEscape represents an intermediate character state.
    TokenizerStateCharMnemonicOrHexEscape
    // TokenizerStateCharMnemonic represents #\newline, #\space, etc.
    TokenizerStateCharMnemonic
    // TokenizerStateCharHexEscape represents #\x0A (hex escape).
    TokenizerStateCharHexEscape
    // TokenizerStateCharGraphic represents #\a (single graphic char).
    TokenizerStateCharGraphic
    // TokenizerStateLineCommentBody represents comment text (multi-token: body).
    TokenizerStateLineCommentBody
    // TokenizerStateBlockCommentBody represents block content (multi-token: body).
    TokenizerStateBlockCommentBody
    // TokenizerStateDatumCommentBegin represents #; (multi-token mode).
    TokenizerStateDatumCommentBegin
    // TokenizerStateSymbol represents an identifier or symbol.
    TokenizerStateSymbol
    // TokenizerStateOpenVector represents #( (vector).
    TokenizerStateOpenVector
    // TokenizerStateOpenVectorUnsignedByteMarker represents #u8( (bytevector).
    TokenizerStateOpenVectorUnsignedByteMarker
    // TokenizerStateDirective represents #!fold-case, etc.
    TokenizerStateDirective
    // TokenizerStateLabelReference represents #123# (datum label reference).
    TokenizerStateLabelReference
    // TokenizerStateLabelAssignment represents #123= (datum label assignment).
    TokenizerStateLabelAssignment
)
TokenizerState values for different token types.