Documentation
¶
Overview ¶
Command egg is an Expression Grammar Generator. At the moment it handles LL(1) grammars.
Step by Step Example - JSON parser ¶
We start in a directory containing the grammar file.
~/tmp/json$ ls json.ebnf ~/tmp/json$
The JSON grammar is
~/tmp/json$ cat json.ebnf
Array = '[' [ Value { ',' Value } ] ']' .
JSON = Value .
Member = str ':' Value .
Object = '{' [ Member { ',' Member } ] '}' .
Value = str | number | Object | Array | "true" | "false" | "null" .
digit = `[0-9]` .
digits = `[0-9]+` .
exponent = ( 'e' | 'E' ) [ '+' | '-' ] digits .
fraction = '.' digits .
hex = `[0-9a-fA-F]` .
integer = digit | non_zero_digit digits | '-' digit | '-' non_zero_digit digits .
non_zero_digit = `[1-9]` .
number = ( integer | integer fraction | integer exponent | integer fraction exponent ) .
str = '"' { `[^"\\\x00-\x1F]` | `\\u` hex hex hex hex | `\\["\\\/bfnrt]` } '"'.
white_space = `( |\t|\r|\n)+` .
~/tmp/json$
Generate the parser from the above EBNF.
~/tmp/json$ egg -o parser.go -test parser_test.go -start JSON json.ebnf ~/tmp/json$ ls json.ebnf parser.go parser_test.go ~/tmp/json$
Format the resulting Go files.
~/tmp/json$ gofmt -l -w . parser.go ~/tmp/json$
Check the package docs.
~/tmp/json$ go doc Parser
type Parser struct {
// Has unexported fields.
}
func (p *Parser) Array() (r []int32)
func (p *Parser) EOF() bool
func (p *Parser) JSON() (r []int32)
func (p *Parser) Member() (r []int32)
func (p *Parser) Object() (r []int32)
func (p *Parser) Parse(name string, src []byte) (ast []int32, err error)
func (p *Parser) Scan() (r scanner.Token)
func (p *Parser) Token(n int32) (r scanner.Token)
func (p *Parser) Value() (r []int32)
~/tmp/json$
Initialize the module and pull in any dependencies.
~/tmp/json$ go mod init example.com/json && go mod tidy go: creating new go.mod: module example.com/json go: to add module requirements and sums: go mod tidy go: finding module for package modernc.org/scanner go: found modernc.org/scanner in modernc.org/scanner v1.2.0 ~/tmp/json$
Run some tests.
~/tmp/json$ go test -v -yes '^y_' -no '^n_' -may '^i_' -- ~/src/github.com/nst/JSONTestSuite/test_parsing/*.json
=== RUN Test
parser_test.go:61: tested files: 318
--- PASS: Test (0.34s)
PASS
ok example.com/json 0.357s
~/tmp/json$
Run some benchmarks.
~/tmp/json$ go test -bench . -- ~/src/modernc.org/egg/lib/testdata/json/bench/*.json goos: linux goarch: amd64 pkg: example.com/json cpu: AMD Ryzen 9 3900X 12-Core Processor Benchmark//home/jnml/src/modernc.org/egg/lib/testdata/json/bench/y_canada_geometry.json-24 387 3058389 ns/op 88.41 MB/s 4169885 B/op 54 allocs/op Benchmark//home/jnml/src/modernc.org/egg/lib/testdata/json/bench/y_citm_catalog.json-24 74 16133539 ns/op 107.06 MB/s 17511805 B/op 90 allocs/op Benchmark//home/jnml/src/modernc.org/egg/lib/testdata/json/bench/y_golang_source.json-24 30 34006219 ns/op 57.06 MB/s 44393455 B/op 72 allocs/op Benchmark//home/jnml/src/modernc.org/egg/lib/testdata/json/bench/y_string_unicode.json-24 18906 60256 ns/op 300.78 MB/s 11512 B/op 30 allocs/op Benchmark//home/jnml/src/modernc.org/egg/lib/testdata/json/bench/y_synthea_fhir.json-24 64 18625119 ns/op 107.84 MB/s 22372796 B/op 92 allocs/op Benchmark//home/jnml/src/modernc.org/egg/lib/testdata/json/bench/y_twitter_status.json-24 211 5556977 ns/op 113.64 MB/s 6411640 B/op 77 allocs/op PASS ok example.com/json 10.128s ~/tmp/json$
EBNF Grammars ¶
Expression grammars are a slightly modified version of the EBNF used by the Go language specification:
Syntax = { Production } .
Production = production_name '=' [ Expression ] '.' .
Expression = Term { '|' Term } .
Term = Factor { Factor } .
Factor = production_name | token | Group | Option | Repetition .
Group = '(' Expression ')' .
Option = '[' Expression ']' .
Repetition = '{' Expression '}' .
Productions are expressions constructed from terms and the following operators, in increasing precedence:
| alternation
() grouping
[] option (0 or 1 times)
{} repetition (0 to n times)
Lowercase production names are used to identify lexical (terminal) tokens. Non-terminals are in CamelCase.
Interpreted strings literals, like "foo", are tokens and will match literally, in this example the rune sequence "foo".
Raw string literals, like `[0-9]`, are tokens and are interpreted as regexps, in this example matching a character class '0'-'9'. Repetitions, like in `re{min,max}` are not supported.
Rune literals, like 'a', are tokens and will match literally, in this example the rune 0x61.
Comments in the expression grammar start with the '#' character and stop at the end of the line.
Lexical production names starting with '_' are reserved.
LL(1) Parsers ¶
Generated LL(1) parsers return an []int32. The slice encodes the AST as follows
- A non-terminal symbol NT has a header encoded as two int32 values (-NT, n), where 'n' is the number of following slice items that represent the AST of the non-terminal symbol.
- A terminal symbol T is represented by a positive index 'n'. The token for T can be obtained using the parser's Token method using 'n' as its argument.
This is testdata/parser/expr.ebnf
# A simple list of mathematical expressions.
# 'Start' is the entry point.
Start = Expr { ";" Expr } .
Expr = Term { operator Term } .
Term = operand | "(" Expr ")" .
# Lexical tokens (lowercased)
operator = "+" | "-" | "*" | "/" .
operand = `[0-9]+` .
white_space = `[ \t\n\r]+` .
The parser generated from the above grammar in expr_test.go defines these symbols
// Symbols
const (
expr_TOK_EOF = expr_Symbol(0) // EOF
expr_TOK_0028 = expr_Symbol(1) // '('
expr_TOK_0029 = expr_Symbol(2) // ')'
expr_TOK_003b = expr_Symbol(3) // ';'
expr_operand = expr_Symbol(4) // operand
expr_operator = expr_Symbol(5) // operator
expr_white_space = expr_Symbol(6) // white_space
expr_Start = expr_Symbol(7) // Start
expr_Expr = expr_Symbol(8) // Expr
expr_Term = expr_Symbol(9) // Term
)
Having this example walk function
func walk[S ~int32](p interface{ Token(int32) scanner.Token }, ast []int32, lvl int) {
for len(ast) != 0 {
next := int32(1)
switch n := ast[0]; {
case n < 0:
fmt.Printf("%s%v\n", strings.Repeat("· ", lvl), S(-n))
next = 2 + ast[1]
walk[S](p, ast[2:next], lvl+1)
default:
tok := p.Token(n)
fmt.Printf("%s %s [%v]\n", strings.Repeat("· ", lvl), tok, S(tok.Ch))
}
ast = ast[next:]
}
}
And this AST from parsing testdata/expr/number="42\n" is
[-7 6 -8 3 -9 1 0 1]
Calling
walk[expr_Symbol](p, ast, 0)
Prints (the part starting with # is an explanatory comment, it is not actually printed)
Start # -7, 6 Start=[-8 3 -9 1 0 1] · Expr # · -8, 3 · Expr=[-9 1 0] · · Term # · · -9, 1 · · Term=[0] · · · number:1:1: "" "42" U+0004 [operand] # · · · 0 · · · Token[0] · number:1:4: "\n" "" U+0000 [EOF] # · 1 · Token[1]
The parser's Token method signature is
func(*receiver) Token(n int32) scanner.Token
Usage ¶
$ egg [flags] <file>
The <file> argument names an Expression Grammar (EBNF) file.
Commnand Line Flags ¶
-constprefix <string>
Strings to prepend to names of generated constants. The default is an empty string.
-dbg
Generate a parser producing debug traces during parsing. The parser calls function `trc` with the same signature as fmt.Printf. You need to provide the trc function somewhere in your package.
-methodprefix <string>
Strings to prepend to names of generated parser methods. The default is an empty string.
-o <filename>
Name the output parser file. Example
$ egg -o parser.go ebnf.go
This flag is optional, output defaults to os.Stdout.
-package <string>
The string argument is the package name for the produced parser. The argument defaults to "main".
-start <string>
The string argument names the top level non terminal in the input EBNF file. The argument defaults to "Start".
-test <filename>
Write a rudimentary test/benchmark facility to <filename>. Example
$ egg -o parser.go -test parser_test.go
-type <string>
The string argument names the parser type. The argument defaults to "Parser".
-typeprefix <string>
Strings to prepend to names of other generated types. The default is an empty string.
-varprefix <string>
Strings to prepend to names of generated variables. The default is an empty string.
-white <string>
The string argument names the lexical/terminal production in the input EBNF file which defines white space separating tokens. The argument defaults to "white_space".
Test Facitlity Command Line Flags ¶
When using the produced test facility, all non flag argument on the command line are considered test inputs. Example
$ go test -- ~/foo/*.bar
The produced test command line flags are
-yes <string>
A regular expression for input base names that must parse without errors. Defaults to "" meaning all test inputs are expected to parse without errors.
Example: `-yes '^y_'` means all files with prefix `y_` must parse without errors.
-no <string>
A regular expression for input base names that must parse with errors. Defaults to "" meaning no test inputs are expected to parse with errors.
Example: `-no '^n_'` means all files with prefix `n_` must parse with errors.
-may <string>
A regular expression for input base names that may parse with errors but that is not considered a failure. Defaults to "" meaning no test inputs are expected to parse with errors.
Example: `-may '^i_'` means all files with prefix `i_` may parse with errors without failing the test.
Test Facitlity Benchmarking ¶
The benchmark considers all non flag argument on the command line as test inputs. Example
$ go test -bench . -- ~/foo/*.bar
