lex

package module
v1.2.1
Published: Apr 24, 2020 License: MIT Imports: 5 Imported by: 3

README

lex

Overview

Package lex provides the core of a lexer built as a Deterministic Finite State Automaton whose states and associated actions are implemented as functions.

Clients of the package only need to provide state functions specialized in lexing the target language. The package provides facilities to stream input from an io.Reader with up to 15 bytes of look-ahead, as well as utility functions commonly used in lexers.

The implementation is similar to https://golang.org/src/text/template/parse/lex.go. See also Rob Pike's talk about combining states and actions into state functions: https://talks.golang.org/2011/lex.slide.

Read the full package documentation on pkg.go.dev.

Release notes

v1.2.1

Improvements to error handling:

Lexer.Errorf now generates error values instead of strings. Lexer.Emit enforces the use of error values for Error tokens.

Scanner now implements the io.RuneScanner interface.

This is a minor API breakage that does not impact any known client code.

v1.0.0

Initial release.

License

Package lex is released under the terms of the MIT license:

Copyright 2017-2020 Denis Bernard db047h@gmail.com

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

Documentation

Overview

Package lex provides the core of a lexer built as a Deterministic Finite State Automaton whose states and associated actions are implemented as functions.

Clients of the package only need to provide state functions specialized in lexing the target language. The package provides facilities to stream input from an io.Reader with up to 15 bytes of look-ahead, as well as utility functions commonly used in lexers.

State functions

The implementation is similar to https://golang.org/src/text/template/parse/lex.go. See also Rob Pike's talk about combining states and actions into state functions: https://talks.golang.org/2011/lex.slide.

TL;DR: State machines are often implemented with a switch statement:

// One iteration:
switch state {
case state1:
	state = action1()
case state2:
	state = action2()
// etc.
}

States represent where we are and what to expect while actions represent what we do and result in a new state. By taking advantage of the fact that functions are values, we can aggregate state and action. First we define a state function type:

type StateFn func(*State) StateFn

A StateFn is both state and action. It takes a lexer argument (to allow it to read from the input stream and emit tokens) and returns another state function.

Then the state transition loop can be rewritten like this:

// One iteration:
state = state(s) // s is the *State passed to every StateFn

This package can also be used as a toolbox for a more traditional switch-based lexer (see implementation below).

Implementation details

Unlike the Go text template package which uses Go channels as a means of asynchronous token emission, this package uses a FIFO queue instead. Benchmarks with an earlier implementation that used a channel showed that using a FIFO is about 5 times faster. There is also no need for the parser to take care of cancellation or draining the input in case of error.

The drawback of using a FIFO is that once a StateFn has called Emit, it should return as soon as possible so that the caller of Lex (usually a parser) can dequeue the lexed item. Since the item queue grows dynamically, it is also possible to write a state function that never returns until EOF is reached.
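The queue mechanics described here can be sketched with a minimal slice-backed FIFO (an illustration of the idea only, not this package's internals): Emit appends at the back, the parser dequeues from the front, and the queue grows dynamically as needed.

```go
package main

import "fmt"

type item struct {
	tok int
	val string
}

// queue is a minimal dynamically growing FIFO. Unlike a channel-based
// design, there are no goroutines and nothing to cancel or drain on error.
type queue struct {
	items []item
}

func (q *queue) emit(tok int, val string) {
	q.items = append(q.items, item{tok, val})
}

func (q *queue) next() (item, bool) {
	if len(q.items) == 0 {
		return item{}, false
	}
	it := q.items[0]
	q.items = q.items[1:]
	return it, true
}

func main() {
	var q queue
	q.emit(1, "ident")
	q.emit(2, "number")
	for it, ok := q.next(); ok; it, ok = q.next() {
		fmt.Println(it.tok, it.val)
	}
}
```

A production implementation would likely reuse storage (e.g. a ring buffer) instead of re-slicing, but the interface is the same.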

The initial state of the DFA is the state where we expect to read a new token. From that initial state, the lexer transitions to other states until a token is successfully matched or an error occurs. The state function that finds a match or error emits the corresponding token and finally returns nil to transition back to the initial state.

The "nil-return means initial state" convention enables building a library of state functions for common things like quoted strings or numbers where returning to the initial state is as simple as a nil return.

Implementing a lexer

A lexer for any given language is simply a set of state functions referencing each other. The initial state of the DFA is the state where we expect to read a new token. Depending on the input, that initial StateFn returns the appropriate StateFn as the next state. The returned StateFn repeats this process until a match is eventually found for a token, at which point the StateFn calls Emit and returns nil so that the DFA transitions back to the initial state.

EOF conditions must be handled manually. This means that, at the very least, the initial state function should always check for EOF and emit an EOF token. Other states should not have to deal with it explicitly: EOF is not a valid rune, so predicates such as unicode.IsDigit and unicode.IsSpace simply return false for it. A common exception is tokens that need a terminator (like quoted strings), where EOF should be checked explicitly in order to emit an error when the terminator is missing.

Error handling

The lex package provides a single built-in Error token. This token is automatically emitted whenever an I/O error occurs or on invalid UTF-8 input.

I/O errors are non-recoverable, that is any subsequent call to Lexer.Lex will return EOF.

State.Next and State.ReadRune always return valid runes. Any invalid UTF-8 input will be skipped and a matching Error token will be automatically generated.

Caveat: This may lead to Error tokens appearing alongside seemingly valid tokens in the token stream. For example, lexing numbers in "1\02" will yield an automatically generated Error token followed by some other token with literal value "12". This should be taken into account when implementing error handling strategies.
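A self-contained sketch of why this token stream arises: once the NUL byte is filtered out and reported, the surrounding digits become contiguous. (filterNUL is a hypothetical helper written for this illustration; it only mimics the filtering behavior described above.)

```go
package main

import "fmt"

// filterNUL drops NUL bytes from s, counting each one as an error,
// the way the lexer filters invalid input while continuing to scan.
func filterNUL(s string) (literal string, errors int) {
	b := make([]byte, 0, len(s))
	for i := 0; i < len(s); i++ {
		if s[i] == 0 {
			errors++ // the package would emit an Error token here
			continue
		}
		b = append(b, s[i])
	}
	return string(b), errors
}

func main() {
	lit, errs := filterNUL("1\x002")
	fmt.Printf("%d error(s), remaining literal %q\n", errs, lit)
	// prints: 1 error(s), remaining literal "12"
}
```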

State sub-package

The state sub-package provides state functions for lexing numbers, quoted strings and quoted characters. By convention, all functions in the state package expect that the first character of the lexed entity has already been read by Lexer.Next. Custom StateFn should also follow this convention.

Constants

const (
	BackupBufferSize = 16 // Size of the undo buffer.
)

Undo buffer constants.

const EOF rune = -1

EOF is the return value from Next() when EOF is reached.

Variables

var (
	ErrSeek   = errors.New("wrong file offset after seek")
	ErrNoSeek = errors.New("io.Reader does not support Seek")
	ErrLine   = errors.New("invalid line number")
)

Common errors.

var (
	ErrNulChar     = &EncodingError{"invalid NUL character"}
	ErrInvalidRune = &EncodingError{"invalid UTF-8 encoding"}
	ErrInvalidBOM  = &EncodingError{"invalid BOM in the middle of the file"}
)

Encoding errors.

var ErrInvalidUnreadRune = errors.New("invalid use of UnreadRune")

ErrInvalidUnreadRune is returned by State.UnreadRune if the undo buffer is empty.

Functions

func IsValidOffset

func IsValidOffset(offset int) bool

IsValidOffset returns true if offset is a valid file offset (i.e. offset >= 0).

Types

type EncodingError added in v1.2.0

type EncodingError struct {
	// contains filtered or unexported fields
}

An EncodingError may be emitted by State.ReadRune upon reading invalid UTF-8 data.

func (EncodingError) Error added in v1.2.0

func (e EncodingError) Error() string

type File

type File struct {
	io.Reader
	// contains filtered or unexported fields
}

A File represents an input file. It's a wrapper around an io.Reader that handles file offset to line/column conversion.

func NewFile

func NewFile(name string, r io.Reader) *File

NewFile returns a new File.

func (*File) AddLine

func (f *File) AddLine(offset int, line int)

AddLine adds a new line at the given file offset.

line is the 1-based line index.

If offset is before the offset of the last known line, or if line is not equal to the last known line number plus one, AddLine will panic.

func (*File) GetLineBytes

func (f *File) GetLineBytes(offset int) (l []byte, err error)

GetLineBytes returns a byte slice containing the line for the given file offset.

Example

This example shows how one could use File.GetLineBytes to display nicely formatted error messages. For the example's sake, we use a dummy lexer that errors on digits, newlines and EOF.

package main

import (
	"fmt"
	"strings"
	"unicode"
	"unicode/utf8"

	"github.com/db47h/lex"
	"golang.org/x/text/width"
)

// This example shows how one could use File.GetLineBytes to display nicely
// formatted error messages.
// For the example's sake, we use a dummy lexer that errors on digits, newlines
// and EOF.
func main() {
	const (
		tokEOF = iota
		tokAny
	)
	expectLT := func(s *lex.State) lex.StateFn {
		// digits are followed by a < in order to test proper Seek operation in input.
		if s.Next() != '<' {
			panic("seek to original pos failed")
		}
		return nil
	}
	input := "#〄 - Hello 世界 1<\ndéjà vu 2<"
	f := lex.NewFile("INPUT", strings.NewReader(input))
	l := lex.NewLexer(f, func(s *lex.State) lex.StateFn {
		switch r := s.Next(); r {
		case lex.EOF:
			s.Errorf(s.Pos(), "some error @EOF")
			s.Emit(s.Pos(), tokEOF, nil)
		case '0', '1', '2', '3', '4', '5', '6', '7', '8', '9':
			s.Errorf(s.Pos(), "digit")
			return expectLT
		case '\n':
			s.Errorf(s.Pos(), "newline")
		default:
			s.Emit(s.Pos(), tokAny, r)
		}
		return nil
	})
	for {
		tok, p, v := l.Lex()
		if tok == tokEOF {
			break
		}
		if tok == lex.Error {
			reportError(f, p, v.(error).Error())
		}
	}

	// The following output will display correctly only with monospaced fonts
	// and a UTF-8 locale. The caret alignment will also be off with some fonts
	// like Fira Code and East Asian characters.

}

// reportError reports a lexing error in the form:
//
//	file:line:col: error description
//		source line where the error occurred followed by a line with a caret at the position of the error.
//						      ^
func reportError(f *lex.File, p int, msg string) {
	pos := f.Position(p)
	fmt.Printf("%s: error %s\n", pos, msg)
	l, err := f.GetLineBytes(p)
	if err != nil {
		return
	}
	b := pos.Column - 1
	if b > len(l) {
		b = len(l)
	}
	fmt.Printf("|%s\n", l)
	fmt.Printf("|%*c^\n", getWidth(l[:b]), ' ')
	// or make it red!
	// fmt.Printf("|%*c\x1b[31m^\x1b[0m\n", getWidth(l[:b]), ' ')
}

// getWidth computes the width in text cells of a given byte slice.
// (supposing rendering with a UTF-8 locale and monospaced font)
func getWidth(l []byte) int {
	w := 0
	for i := 0; i < len(l); {
		r, s := utf8.DecodeRune(l[i:])
		i += s
		if !unicode.IsGraphic(r) {
			continue
		}
		p := width.LookupRune(r)
		switch p.Kind() {
		case width.EastAsianFullwidth, width.EastAsianWide:
			w += 2
		case width.EastAsianAmbiguous:
			w += 1 // depends on user locale. 2 if locale is CJK, 1 otherwise.
		default:
			w += 1
		}
	}
	return w
}
Output:

INPUT:1:23: error digit
|#〄 - Hello 世界 1<
|                  ^
INPUT:1:25: error newline
|#〄 - Hello 世界 1<
|                    ^
INPUT:2:11: error digit
|déjà vu 2<
|        ^
INPUT:2:13: error some error @EOF
|déjà vu 2<
|          ^

func (*File) LineOffset

func (f *File) LineOffset(line int) int

LineOffset returns the file offset of the given line.

func (*File) Name

func (f *File) Name() string

Name returns the file name.

func (*File) Position

func (f *File) Position(offset int) Position

Position returns the 1-based line and column for a given file offset. The returned column is a byte offset, not a rune offset.
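Because Column is a byte offset, clients that want a character (rune) column for display must convert it themselves. A sketch, assuming the byte column falls on a rune boundary within the given line (runeColumn is a hypothetical helper, not part of this package):

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// runeColumn converts a 1-based byte column within line into a
// 1-based rune column, assuming byteCol lies on a rune boundary.
func runeColumn(line string, byteCol int) int {
	return utf8.RuneCountInString(line[:byteCol-1]) + 1
}

func main() {
	line := "déjà vu"
	// 'v' starts at byte column 8 ("é" and "à" are 2 bytes each),
	// but it is only the 6th character on the line.
	fmt.Println(runeColumn(line, 8)) // prints 6
}
```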

type Lexer

type Lexer state

Lexer wraps the public methods of a lexer. This type is intended for parsers that call NewLexer, then Lex until EOF.

func NewLexer

func NewLexer(f *File, init StateFn) *Lexer

NewLexer creates a new lexer associated with the given source file. A new lexer must be created for every source file to be lexed.

func (*Lexer) File

func (l *Lexer) File() *File

File returns the File used as input for the lexer.

func (*Lexer) Lex

func (l *Lexer) Lex() (Token, int, interface{})

Lex reads source text and returns the next item until EOF.

As a convention, once the end of file has been reached (or some non-recoverable error condition), Lex() must return a token type that indicates an EOF condition. A common strategy is to emit an Error token with io.EOF as a value.

type Position

type Position struct {
	Filename string
	Line     int // 1-based line number
	Column   int // 1-based column number (byte index)
}

Position describes an arbitrary source position including the file, line, and column location.

func (Position) String

func (p Position) String() string

type State

type State state

State holds the internal state of the lexer while processing a given input. Note that the public fields should only be accessed from custom StateFn functions.

func (*State) Backup

func (s *State) Backup()

Backup reverts the last call to Next. Backup can be called at most (BackupBufferSize-1) times in a row (i.e. with no calls to Next in between). Calling Backup beyond the start of the undo buffer or at the beginning of the input stream will fail silently: Pos will return -1 (an invalid offset) and Current will return utf8.RuneSelf, a value impossible to obtain by any other means.

func (*State) Current

func (s *State) Current() rune

Current returns the last rune returned by State.Next.

func (*State) Emit

func (s *State) Emit(offset int, t Token, value interface{})

Emit emits a single token of the given type and value. offset is the file offset for the token (usually s.TokenPos()).

If the emitted token is Error, the value must implement the error interface.

func (*State) Errorf

func (s *State) Errorf(offset int, format string, args ...interface{})

Errorf emits an error token with type Error. The Item value is set to the result of calling fmt.Errorf(format, args...) and offset is the file offset.

func (*State) Init

func (s *State) Init(initState StateFn) StateFn

Init (re-)sets the initial state function for the lexer. It can be used by state functions to implement context switches (e.g. switch from accepting plain text to expressions in a template-like language). This function returns the previous initial state.

func (*State) Next

func (s *State) Next() rune

Next returns the next rune in the input stream. If the end of the input has been reached it will return EOF. If an I/O error occurs other than io.EOF, it will report the I/O error by calling Errorf, then return EOF.

Next only returns valid runes or -1 to indicate EOF. It filters out invalid runes, nul bytes (0x00) and BOMs (U+FEFF) and reports them as errors by calling Errorf (except for a BOM at the beginning of the file which is simply ignored).

func (*State) Peek

func (s *State) Peek() rune

Peek returns the next rune in the input stream without consuming it. This is equivalent to calling Next followed by Backup. At EOF, it simply returns EOF.

func (*State) Pos

func (s *State) Pos() int

Pos returns the byte offset of the last rune returned by State.Next. Returns -1 if no input has been read yet.

func (*State) ReadRune added in v1.2.0

func (s *State) ReadRune() (rune, int, error)

ReadRune reads a single UTF-8 encoded Unicode character and returns the rune and its size in bytes. The returned rune is always valid. Invalid runes, NUL bytes and misplaced BOMs are filtered out and emitted as Error tokens.

This function is here to facilitate interfacing with standard library scanners that need an io.RuneScanner. Custom lexers should use the Next function instead.

func (*State) StartToken

func (s *State) StartToken(offset int)

StartToken sets offset as a token start offset. This is a utility function that when used in conjunction with TokenPos enables tracking of a token start position across a StateFn chain without having to manually keep track of it via closures or function parameters.

This is typically called at the start of the initial state function, just after an initial call to Next:

func stateInit(s *lex.State) lex.StateFn {
	r := s.Next()
	s.StartToken(s.Pos())
	switch {
	case r >= 'a' && r <= 'z':
		return stateIdentifier
	default:
		// ...
	}
	return nil
}

func stateIdentifier(s *lex.State) lex.StateFn {
	tokBuf := make([]rune, 0, 64)
	for r := s.Current(); r >= 'a' && r <= 'z'; r = s.Next() {
		tokBuf = append(tokBuf, r)
	}
	s.Backup()
	// TokenPos gives us the token starting offset set in stateInit
	s.Emit(s.TokenPos(), tokTypeIdentifier, string(tokBuf))
	return nil
}

func (*State) TokenPos

func (s *State) TokenPos() int

TokenPos returns the last offset set by StartToken.

func (*State) UnreadRune added in v1.2.0

func (s *State) UnreadRune() error

UnreadRune reverts the last call to ReadRune. It is essentially the same as Backup except for the error return value.

type StateFn

type StateFn func(l *State) StateFn

A StateFn is a state function.

If a StateFn returns nil, the lexer resets the current token starting offset then transitions back to its initial state function.

type Token

type Token int

A Token represents the type of a token. Custom lexers can use any value >= 0.

const Error Token = -1

Error is the token type for error tokens.
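A typical client defines its token types with iota, which keeps every custom value >= 0 and therefore clear of Error. (The Token type and Error constant are redeclared below only to keep the sketch self-contained; real clients use lex.Token and lex.Error.)

```go
package main

import "fmt"

// Token mirrors the package's token type.
type Token int

// Error mirrors the package's built-in error token type.
const Error Token = -1

// Custom token types start at zero, so they can never collide with Error.
const (
	tokEOF Token = iota
	tokIdent
	tokNumber
)

func main() {
	fmt.Println(Error, tokEOF, tokIdent, tokNumber) // prints -1 0 1 2
}
```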

Directories

Path Synopsis
state	Package state provides state functions for lexing numbers, quoted strings and quoted characters.
