Documentation ¶
Overview ¶
Package lex provides the core of a lexer built as a Deterministic Finite State Automaton whose states and associated actions are implemented as functions.
Clients of the package only need to provide state functions specialized in lexing the target language. The package provides facilities to stream input from an io.Reader with up to 15 bytes of look-ahead, as well as utility functions commonly used in lexers.
State functions ¶
The implementation is similar to https://golang.org/src/text/template/parse/lex.go. See also Rob Pike's talk about combining states and actions into state functions: https://talks.golang.org/2011/lex.slide.
TL;DR: State machines are often implemented with a switch statement:
// One iteration:
switch state {
case state1:
	state = action1()
case state2:
	state = action2()
// etc.
}
States represent where we are and what to expect while actions represent what we do and result in a new state. By taking advantage of the fact that functions are values, we can aggregate state and action. First we define a state function type:
type StateFn func(*State) StateFn
A StateFn is both state and action. It takes a lexer argument (to allow it to read from the input stream and emit tokens) and returns another state function.
Then the state transition loop can be rewritten like this:
// One iteration (s is the lexer's *State):
state = state(s)
This package can also be used as a toolbox for a more traditional switch-based lexer (see implementation below).
Implementation details ¶
Unlike the Go text template package which uses Go channels as a means of asynchronous token emission, this package uses a FIFO queue instead. Benchmarks with an earlier implementation that used a channel showed that using a FIFO is about 5 times faster. There is also no need for the parser to take care of cancellation or draining the input in case of error.
The drawback of using a FIFO is that once a StateFn has called Emit, it should return as soon as possible so that the caller of Lex (usually a parser) can dequeue the lexed item. Since the item queue grows dynamically, it is also possible to write a state function that never returns until EOF is reached.
The initial state of the DFA is the state where we expect to read a new token. From that initial state, the lexer transitions to other states until a token is successfully matched or an error occurs. The state function that finds a match or error emits the corresponding token and finally returns nil to transition back to the initial state.
The "nil-return means initial state" convention enables building a library of state functions for common things like quoted strings or numbers where returning to the initial state is as simple as a nil return.
Implementing a lexer ¶
A lexer for any given language is simply a set of state functions referencing each other. The initial state of the DFA is the state where we expect to read a new token. Depending on the input, that initial StateFn returns the appropriate StateFn as the next state. The returned StateFn repeats this process until a token is eventually matched, at which point the StateFn calls Emit and returns nil so that the DFA transitions back to the initial state.
EOF conditions must be handled manually. At the very least, the initial state function should always check for EOF and emit an EOF token. Other states should not need to deal with it explicitly: EOF is not a valid rune, so functions like unicode.IsDigit or unicode.IsSpace simply return false for it. A common exception is tokens that require a terminator (like quoted strings), where EOF must be checked explicitly in order to emit an error when the terminator is missing.
Error handling ¶
The lex package provides a single built-in Error token. This token is automatically emitted whenever an I/O error occurs or on invalid UTF-8 input.
I/O errors are non-recoverable, that is any subsequent call to Lexer.Lex will return EOF.
State.Next and State.ReadRune always return valid runes. Any invalid UTF-8 input will be skipped and a matching Error token will be automatically generated.
Caveat: This may lead to Error tokens appearing alongside seemingly valid tokens in the token stream. For example, lexing numbers in "1\02" will yield an automatically generated Error token followed by some other token with literal value "12". This should be taken into account when implementing error handling strategies.
State sub-package ¶
The state sub-package provides state functions for lexing numbers, quoted strings and quoted characters. By convention, all functions in the state package expect that the first character of the lexed entity has already been read by Lexer.Next. Custom StateFn should also follow this convention.
Index ¶
- Constants
- Variables
- func IsValidOffset(offset int) bool
- type EncodingError
- type File
- type Lexer
- type Position
- type State
- func (s *State) Backup()
- func (s *State) Current() rune
- func (s *State) Emit(offset int, t Token, value interface{})
- func (s *State) Errorf(offset int, format string, args ...interface{})
- func (s *State) Init(initState StateFn) StateFn
- func (s *State) Next() rune
- func (s *State) Peek() rune
- func (s *State) Pos() int
- func (s *State) ReadRune() (rune, int, error)
- func (s *State) StartToken(offset int)
- func (s *State) TokenPos() int
- func (s *State) UnreadRune() error
- type StateFn
- type Token
Examples ¶
Constants ¶
const (
BackupBufferSize = 16 // Size of the undo buffer.
)
Undo buffer constants.
const EOF rune = -1
EOF is the return value from Next() when EOF is reached.
Variables ¶
var (
	ErrSeek   = errors.New("wrong file offset after seek")
	ErrNoSeek = errors.New("io.Reader does not support Seek")
	ErrLine   = errors.New("invalid line number")
)
Common errors.
var (
	ErrNulChar     = &EncodingError{"invalid NUL character"}
	ErrInvalidRune = &EncodingError{"invalid UTF-8 encoding"}
	ErrInvalidBOM  = &EncodingError{"invalid BOM in the middle of the file"}
)
Encoding errors.
var ErrInvalidUnreadRune = errors.New("invalid use of UnreadRune")
ErrInvalidUnreadRune is returned by State.UnreadRune if the undo buffer is empty.
Functions ¶
func IsValidOffset ¶
IsValidOffset returns true if offset is a valid file offset (i.e. offset >= 0).
Types ¶
type EncodingError ¶ added in v1.2.0
type EncodingError struct {
// contains filtered or unexported fields
}
An EncodingError may be emitted by State.ReadRune upon reading invalid UTF-8 data.
func (EncodingError) Error ¶ added in v1.2.0
func (e EncodingError) Error() string
type File ¶
A File represents an input file. It's a wrapper around an io.Reader that handles file offset to line/column conversion.
func (*File) AddLine ¶
AddLine adds a new line at the given file offset.
line is the 1-based line index.
If offset is before the offset of the last known line, or if line is not equal to the last known line number plus one, AddLine will panic.
func (*File) GetLineBytes ¶
GetLineBytes returns a byte slice containing the line for the given file offset.
Example ¶
This example shows how one could use File.GetLineBytes to display nicely formatted error messages. For the example's sake, we use a dummy lexer that errors on digits, newlines and EOF.
package main

import (
	"fmt"
	"strings"
	"unicode"
	"unicode/utf8"

	"github.com/db47h/lex"
	"golang.org/x/text/width"
)

// This example shows how one could use File.GetLineBytes to display nicely
// formatted error messages.
// For the example's sake, we use a dummy lexer that errors on digits, newlines
// and EOF.
func main() {
	const (
		tokEOF = iota
		tokAny
	)
	expectLT := func(s *lex.State) lex.StateFn {
		// digits are followed by a < in order to test proper Seek operation in input.
		if s.Next() != '<' {
			panic("seek to original pos failed")
		}
		return nil
	}
	input := "#〄 - Hello 世界 1<\ndéjà vu 2<"
	f := lex.NewFile("INPUT", strings.NewReader(input))
	l := lex.NewLexer(f, func(s *lex.State) lex.StateFn {
		switch r := s.Next(); r {
		case lex.EOF:
			s.Errorf(s.Pos(), "some error @EOF")
			s.Emit(s.Pos(), tokEOF, nil)
		case '0', '1', '2', '3', '4', '5', '6', '7', '8', '9':
			s.Errorf(s.Pos(), "digit")
			return expectLT
		case '\n':
			s.Errorf(s.Pos(), "newline")
		default:
			s.Emit(s.Pos(), tokAny, r)
		}
		return nil
	})
	for {
		tok, p, v := l.Lex()
		if tok == tokEOF {
			break
		}
		if tok == lex.Error {
			reportError(f, p, v.(error).Error())
		}
	}
	// The following output will display correctly only with monospaced fonts
	// and a UTF-8 locale. The caret alignment will also be off with some fonts
	// like Fira Code and East Asian characters.
}

// reportError reports a lexing error in the form:
//
//	file:line:col: error description
//	source line where the error occurred, followed by a line with a caret
//	at the position of the error:
//	            ^
func reportError(f *lex.File, p int, msg string) {
	pos := f.Position(p)
	fmt.Printf("%s: error %s\n", pos, msg)
	l, err := f.GetLineBytes(p)
	if err != nil {
		return
	}
	b := pos.Column - 1
	if b > len(l) {
		b = len(l)
	}
	fmt.Printf("|%s\n", l)
	fmt.Printf("|%*c^\n", getWidth(l[:b]), ' ')
	// or make it red!
	// fmt.Printf("|%*c\x1b[31m^\x1b[0m\n", getWidth(l[:b]), ' ')
}

// getWidth computes the width in text cells of a given byte slice
// (supposing rendering with a UTF-8 locale and monospaced font).
func getWidth(l []byte) int {
	w := 0
	for i := 0; i < len(l); {
		r, s := utf8.DecodeRune(l[i:])
		i += s
		if !unicode.IsGraphic(r) {
			continue
		}
		p := width.LookupRune(r)
		switch p.Kind() {
		case width.EastAsianFullwidth, width.EastAsianWide:
			w += 2
		case width.EastAsianAmbiguous:
			w += 1 // depends on user locale. 2 if locale is CJK, 1 otherwise.
		default:
			w += 1
		}
	}
	return w
}
Output:

INPUT:1:23: error digit
|#〄 - Hello 世界 1<
|                 ^
INPUT:1:25: error newline
|#〄 - Hello 世界 1<
|                   ^
INPUT:2:11: error digit
|déjà vu 2<
|        ^
INPUT:2:13: error some error @EOF
|déjà vu 2<
|          ^
func (*File) LineOffset ¶
LineOffset returns the file offset of the given line.
type Lexer ¶
type Lexer state
Lexer wraps the public methods of a lexer. This interface is intended for parsers that call NewLexer, then Lex until EOF.
func NewLexer ¶
NewLexer creates a new lexer associated with the given source file. A new lexer must be created for every source file to be lexed.
func (*Lexer) Lex ¶
Lex reads source text and returns the next item until EOF.
As a convention, once the end of file has been reached (or some non-recoverable error condition), Lex() must return a token type that indicates an EOF condition. A common strategy is to emit an Error token with io.EOF as a value.
type Position ¶
type Position struct {
	Filename string
	Line     int // 1-based line number
	Column   int // 1-based column number (byte index)
}
Position describes an arbitrary source position including the file, line, and column location.
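A Position is typically rendered in the usual file:line:col form. The String method below is only a sketch of that formatting, redeclaring the struct so the snippet is self-contained; it is not necessarily the package's own implementation:

```go
package main

import "fmt"

// Position mirrors the struct documented above (redeclared here so
// this illustrative snippet compiles on its own).
type Position struct {
	Filename string
	Line     int // 1-based line number
	Column   int // 1-based column number (byte index)
}

// String renders the position as file:line:col.
func (p Position) String() string {
	return fmt.Sprintf("%s:%d:%d", p.Filename, p.Line, p.Column)
}

func main() {
	fmt.Println(Position{"input.go", 3, 14}) // input.go:3:14
}
```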
type State ¶
type State state
State holds the internal state of the lexer while processing a given input. Note that the public fields should only be accessed from custom StateFn functions.
func (*State) Backup ¶
func (s *State) Backup()
Backup reverts the last call to Next. Backup can be called at most (BackupBufferSize-1) times in a row (i.e. with no calls to Next in between). Calling Backup beyond the start of the undo buffer or at the beginning of the input stream will fail silently: Pos will return -1 (an invalid offset) and Current will return utf8.RuneSelf, a value impossible to get by any other means.
func (*State) Emit ¶
Emit emits a single token of the given type and value. offset is the file offset for the token (usually s.TokenPos()).
If the emitted token is Error, the value must implement the error interface.
func (*State) Errorf ¶
Errorf emits an error token with type Error. The Item value is set to the result of calling fmt.Errorf(format, args...) and offset is the file offset.
func (*State) Init ¶
Init (re-)sets the initial state function for the lexer. It can be used by state functions to implement context switches (e.g. switch from accepting plain text to expressions in a template-like language). This function returns the previous initial state.
func (*State) Next ¶
Next returns the next rune in the input stream. If the end of the input has been reached, it will return EOF. If an I/O error occurs other than io.EOF, it will report the I/O error by calling Errorf, then return EOF.

Next only returns valid runes or -1 to indicate EOF. It filters out invalid runes, NUL bytes (0x00) and BOMs (U+FEFF) and reports them as errors by calling Errorf (except for a BOM at the beginning of the file, which is simply ignored).
func (*State) Peek ¶
Peek returns the next rune in the input stream without consuming it. This is equivalent to calling Next followed by Backup. At EOF, it simply returns EOF.
func (*State) Pos ¶
Pos returns the byte offset of the last rune returned by State.Next. Returns -1 if no input has been read yet.
func (*State) ReadRune ¶ added in v1.2.0
ReadRune reads a single UTF-8 encoded Unicode character and returns the rune and its size in bytes. The returned rune is always valid. Invalid runes, NUL bytes and misplaced BOMs are filtered out and emitted as Error tokens.
This function is here to facilitate interfacing with standard library scanners that need an io.RuneScanner. Custom lexers should use the Next function instead.
func (*State) StartToken ¶
StartToken sets offset as a token start offset. This is a utility function that, when used in conjunction with TokenPos, enables tracking of a token's start position across a StateFn chain without having to track it manually via closures or function parameters.
This is typically called at the start of the initial state function, just after an initial call to Next:
func stateInit(s *lexer.State) lexer.StateFn {
	r := s.Next()
	s.StartToken(s.Pos())
	switch {
	case r >= 'a' && r <= 'z':
		return stateIdentifier
	default:
		// ...
	}
	return nil
}

func stateIdentifier(s *lexer.State) lexer.StateFn {
	tokBuf := make([]rune, 0, 64)
	for r := s.Current(); r >= 'a' && r <= 'z'; r = s.Next() {
		tokBuf = append(tokBuf, r)
	}
	s.Backup()
	// TokenPos gives us the token starting offset set in stateInit
	s.Emit(s.TokenPos(), tokTypeIdentifier, string(tokBuf))
	return nil
}
func (*State) UnreadRune ¶ added in v1.2.0
UnreadRune reverts the last call to ReadRune. It is essentially the same as Backup except for the error return value.