ragel

package module
v1.1.2 Latest Latest
Warning

This package is not in the latest version of its module.

Go to latest
Published: Sep 11, 2018 License: MIT Imports: 3 Imported by: 0

README

ragel-go

godocb goreportb

Obsolete Branch: please use the v2 branch

This package provides a driver for ragel based scanners in Go for streamed input. It takes care of all the boiler plate code, letting the user focus on the ragel state machine definition.

Intro

Streaming input data to ragel -- i.e. reading from an io.Reader as opposed to loading the whole input into memory, requires a driver, a piece of code that buffers input and manipulates pointers for the ragel engine.

Here is the Go version of the main loop from the C-like scanner in the ragel examples:

func scanner(r io.Reader) {
    var (
        buf = make([]byte, BUFSIZE)
        cs, act, ts, te int
        have, curLine = 0, 1
        done bool
    )

    // ragel init code
    %% write init;

    for !done {
        var (
            err error
            p = have
            ln, pe int
            eof = -1
        )

        if ( p == len(buf) ) {
            fmt.Fprintf(os.Stderr, "out of buffer space\n" )
            os.Exit(1)
        }

        for {
            ln, err = r.Read(buf[p:])
            if ln != 0 || err != nil {
                break
            }
        }
        pe = p + ln;

        // trigger end of file on any error
        if err != nil {
            eof = pe
            done = true
        }

        // Run the ragel state machine
        %% write exec;

        if cs == clang_error {
            fmt.Fprintf(os.Stderr, "parse error (invalid char)\n" )
            break
        }

        if ts == 0 {
            have = 0
        } else {
            // There is a prefix to preserve, shift it over.
            have = pe - ts;
            copy(buf, buf[ts:pe])
            te -= ts
            ts = 0
        }
    }
}

In this example, the state machine actions just write tokens to stdout. So we have quite a lot of scaffolding and a few important things are missing, like a convenient way to retrieve the tokens and some mechanism to keep track of line and column position for these tokens.

Implementation

The Scanner type provides all the functionality needed to read data from an io.Reader, buffering, token position tracking (line and column) and error handling.

The scanner reads input data in 32KiB chunks (the default buffer size), then tokenizes the whole buffer and pushes the tokens found in a FIFO queue.

Note that this limits the size of the largest possible token to about 32767 bytes (and any token longer than this will not necessarily cause an error).

An alternative to using a queue would be to run the scanning loop in a goroutine and use a channel to emit tokens (like text/template/parse in Go's standard library). This approach is however slower than using a queue by an order of magnitude.

Error handling is done by returning a ragel.Error token. This can happen in 3 cases: an IO error while reading from the input reader, the input buffer gets full, or an illegal symbol is encountered. In the latter case, scanning resumes at the following byte. The other two are non-recoverable errors; any subsequent call to scanner.Next returns ragel.EOF.

Errors can also be generated from the state machine's actions, in which case the caller decides if an error is fatal or not. For non-fatal errors, callers must update ragel's p variable or issue an appropriate fgoto).

UTF8 input is supported via ragel's contib/unicode2ragel.rb. See lang_test.rl for an example use.

How to

In essence, creating a scanner is as simple as creating a ragel machine definition for your language of choice, along with a tiny stub interface.

Ragel state machine definition

Here is an abbreviated version for the C-like language mentioned above (with support for comments, strings and char literals removed for brevity):

// Package myc provides a scanner for my-C language.
package scan // import "github.com/me/myc/scan"

import "github.com/db47h/ragel"

// Tokens for my-C
//
const (
    Ident ragel.Token = iota
    Float
    Symbol
)

// state machine definition
%%{
    machine lang;
    include WChar "utf8.rl";

    newline = '\n' @{ s.Newline(p) };
    any_count_line = any | newline;

    main := |*

    alnum_u = ualnum | '_';

    alpha_u = ualpha | '_';

    punct {
        s.Emit(ts, Symbol, string(data[ts:te]));
    };

    alpha_u alnum_u* {
        s.Emit(ts, Ident, string(data[ts:te]))
    };

    newline | 0x00..0x20 | 0x7f;

    udigit+ ('.' udigit+)* {
        s.Emit(ts, Float, string(data[ts:te]))
    };
    *|;
}%%

%%write data nofinal;

// FSM implements ragel.FSM. This is the stub interface between the ragel
// generated code for a given machine and ragel.Scanner.
//
type FSM struct {}

func (FSM) Init(s *ragel.Scanner) {
    var cs, ts, te, act int
    %%write init;
    s.SetState(cs, ts, te, act)
}

func (FSM) States() (start, err int) {
    return %%{ write start; }%%, %%{ write error; }%%
}

func (FSM) Run(s *ragel.Scanner, p, pe, eof int) (int, int) {
    cs, ts, te, act, data := s.GetState()
    %%write exec;
    s.SetState(cs, ts, te, act)
    return p, pe
}

Supposing that this is in a file named myc.rl, run it through ragel with the following command:

    ragel -Z -G2 myc.rl

This will generate a myc.go file. I tend to use the -G2 because it yields the fastest running code in most of my use cases. Your mileage may vary.

You might also want add go:generate directives in some other .go file in the same package:

//go:generate ragel -Z -G2 myc.rl
//go:generate sed -i "1s|.*|// Code generated by ragel. DO NOT EDIT.|" myc.go

The second directive changes the first line of the generated code so that it gets ignored by the Go tooling.

The FSM struct defined in the ragel file is the stub interface between the generated code and ragel.Scanner. Besides changing the struct name to suit your needs (e.g. make it private), this code should be used as-is. If you feel that you need to change it for some reason, please file an issue.

Scanning

Create a new ragel.Scanner and pass it the FSM struct defined above:

package main

import (
    "os"
    "fmt"

    "github.com/db47h/ragel"
    "github.com/me/myc/scan"
)

func main() {
    s := ragel.New("stdin", os.Stdin, scan.FSM{})
    for {
        pos, tok, literal := s.Next()
        switch tok {
        case ragel.EOF:
            return
        case ragel.Error:
            fmt.Fprintf(os.Stderr, "scan error: %v: %v\n", s.Pos(pos), literal)
        default:
            fmt.Printf("%v: %d %s\n", s.Pos(), tok, literal)
        }
    }
}

That's it.

API

The Scanner API has three users, each using a specific set of methods:

  • the ragel state machine (actions depending on scanned input)
    • Emit
    • Errorf
    • Newline
  • the stub interface
    • SetState
    • GetState
  • the scanner client (code that wants to scan some input)
    • New
    • Next
    • Pos

This could be abstracted away using interfaces but it would affect performance for no real benefit. Performance is the reason to use ragel in the first place.

LICENSE

Copyright 2018 Denis Bernard db047h@gmail.com

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

Documentation

Overview

Package ragel provides a thin wrapper for ragel based scanners.

The Scanner type provides all the functionality needed to read data from an io.Reader, buffering, token position tracking (line and column) and error handling.

The scanner reads input data in 32KiB chunks (the default buffer size), then tokenizes the whole buffer and pushes the tokens found in a FIFO queue.

Note that this limits the size of the largest possible token to about 32767 bytes (and any token longer than this will not necessarily cause an error).

Error handling is done by returning a ragel.Error token. This can happen in 3 cases: an IO error while reading from the input reader, the input buffer gets full, or an illegal symbol is encountered. In the latter case, scanning resumes at the following byte. The other two are non-recoverable errors; any subsequent call to scanner.Next returns ragel.EOF.

Errors can also be generated from the state machine's actions, in which case the caller decides if an error is fatal or not. For non-fatal errors, callers must update ragel's p variable or issue an appropriate fgoto).

UTF8 input is supported via ragel's contib/unicode2ragel.rb. See lang_test.rl for an example use.

Index

Constants

This section is empty.

Variables

This section is empty.

Functions

This section is empty.

Types

type FSM

type FSM interface {
	Init(l *Scanner)
	States() (start, err int)
	Run(l *Scanner, p, pe, eof int) (np, npe int)
}

A FSM provides an interface to ragel generated code for a given state machine. The implementation is part of the ragel machine definition file.

See lang_test.rl for a typical implementation.

type Position

type Position struct {
	FileName string
	Line     int // 1-based line number
	Column   int // 1-based column number in bytes
}

Position represents the position of a token as its 1-based line and column numbers.

func (Position) String

func (p Position) String() string

type Scanner

type Scanner struct {
	// contains filtered or unexported fields
}

Scanner holds the scanner's internal state while processing its input.

func New

func New(fileName string, r io.Reader, f FSM) *Scanner

New returns a new scanner for the given io.Reader and FSM.

The FSM implementation is provided by the ragel generated code.

func (*Scanner) Emit

func (s *Scanner) Emit(ts int, typ Token, value string)

Emit adds the given token to the output queue. The ts argument is usually the ragel variable ts.

func (*Scanner) Errorf

func (s *Scanner) Errorf(p int, fatal bool, format string, args ...interface{})

Errorf emits an Error token at position p. If the fatal argument is true, any subsequent call to Scanner.Next will return an EOF token, otherwise scanning will resume from position p (which the caller must update accordingly). The format and args arguments are passed through fmt.Sprintf to form the tokens's literal value.

func (*Scanner) GetState

func (s *Scanner) GetState() (cs, ts, te, act int, data []byte)

GetState gets the fsm state and buffer. This function must be called at the beginning of FSM.Run.

func (*Scanner) Newline

func (s *Scanner) Newline(p int)

Newline increments the line count. The p argument should be the ragel variable p.

func (*Scanner) Next

func (s *Scanner) Next() (offset int, token Token, literal string)

Next returns the next token in the input stream.

func (*Scanner) Pos

func (s *Scanner) Pos(offset int) Position

Pos returns the line/column Position for the given offset.

func (*Scanner) Reset

func (s *Scanner) Reset()

Reset resets the scanner to the start of the input stream. It will panic if the source reader is not an io.Seeker.

func (*Scanner) SetState

func (s *Scanner) SetState(cs, ts, te, act int)

SetState saves the fsm state. This function must be called before returning from FSM.Run and FSM.Init.

type Token

type Token int

Token represents a lexical token.

const (
	Error Token = -2 + iota
	EOF
)

Token types.

Directories

Path Synopsis
examples
csv
This example shows how to build a parser on top of a ragel scanner to parse CSV files.
This example shows how to build a parser on top of a ragel scanner to parse CSV files.

Jump to

Keyboard shortcuts

? : This menu
/ : Search site
f or F : Jump to
y or Y : Canonical URL