README

vellum vellum

Tests Coverage Status GoDoc Go Report Card License

A Go library implementing an FST (finite state transducer) capable of:

  • mapping between keys ([]byte) and a value (uint64)
  • enumerating keys in lexicographic order

Some additional goals of this implementation:

  • bounded memory use while building the FST
  • streaming out FST data while building
  • mmap FST runtime to support very large FTSs (optional)

Usage

Building an FST

To build an FST, create a new builder using the New() method. This method takes an io.Writer as an argument. As the FST is being built, data will be streamed to the writer as soon as possible. With this builder you MUST insert keys in lexicographic order. Inserting keys out of order will result in an error. After inserting the last key into the builder, you MUST call Close() on the builder. This will flush all remaining data to the underlying writer.

In memory:

  var buf bytes.Buffer
  builder, err := vellum.New(&buf, nil)
  if err != nil {
    log.Fatal(err)
  }

To disk:

  f, err := os.Create("/tmp/vellum.fst")
  if err != nil {
    log.Fatal(err)
  }
  builder, err := vellum.New(f, nil)
  if err != nil {
    log.Fatal(err)
  }

MUST insert keys in lexicographic order:

err = builder.Insert([]byte("cat"), 1)
if err != nil {
  log.Fatal(err)
}

err = builder.Insert([]byte("dog"), 2)
if err != nil {
  log.Fatal(err)
}

err = builder.Insert([]byte("fish"), 3)
if err != nil {
  log.Fatal(err)
}

err = builder.Close()
if err != nil {
  log.Fatal(err)
}
Using an FST

After closing the builder, the data can be used to instantiate an FST. If the data was written to disk, you can use the Open() method to mmap the file. If the data is already in memory, or you wish to load/mmap the data yourself, you can instantiate the FST with the Load() method.

Load in memory:

  fst, err := vellum.Load(buf.Bytes())
  if err != nil {
    log.Fatal(err)
  }

Open from disk:

  fst, err := vellum.Open("/tmp/vellum.fst")
  if err != nil {
    log.Fatal(err)
  }

Get key/value:

  val, exists, err = fst.Get([]byte("dog"))
  if err != nil {
    log.Fatal(err)
  }
  if exists {
    fmt.Printf("contains dog with val: %d\n", val)
  } else {
    fmt.Printf("does not contain dog")
  }

Iterate key/values:

  itr, err := fst.Iterator(startKeyInclusive, endKeyExclusive)
  for err == nil {
    key, val := itr.Current()
    fmt.Printf("contains key: %s val: %d", key, val)
    err = itr.Next()
  }
  if err != nil {
    log.Fatal(err)
  }
How does the FST get built?

A full example of the implementation is beyond the scope of this README, but let's consider a small example where we want to insert 3 key/value pairs.

First we insert "are" with the value 4.

step1

Next, we insert "ate" with the value 2.

step2

Notice how the values associated with the transitions were adjusted so that by summing them while traversing we still get the expected value.

At this point, we see that state 5 looks like state 3, and state 4 looks like state 2. But, we cannot yet combine them because future inserts could change this.

Now, we insert "see" with value 3. Once it has been added, we now know that states 5 and 4 can longer change. Since they are identical to 3 and 2, we replace them.

step3

Again, we see that states 7 and 8 appear to be identical to 2 and 3.

Having inserted our last key, we call Close() on the builder.

step4

Now, states 7 and 8 can safely be replaced with 2 and 3.

For additional information, see the references at the bottom of this document.

What does the serialized format look like?

We've broken out a separate document on the vellum disk format v1.

What if I want to use this on a system that doesn't have mmap?

The mmap library itself is guarded with system/architecture build tags, but we've also added an additional build tag in vellum. If you'd like to Open() a file based representation of an FST, but not use mmap, you can build the library with the nommap build tag. NOTE: if you do this, the entire FST will be read into memory.

Can I use this with Unicode strings?

Yes, however this implementation is only aware of the byte representation you choose. In order to find matches, you must work with some canonical byte representation of the string. In the future, some encoding-aware traversals may be possible on top of the lower-level byte transitions.

How did this library come to be?

In my work on the Bleve project I became aware of the power of the FST for many search-related tasks. The obvious starting point for such a thing in Go was the mafsa project. While working with mafsa I encountered some issues. First, it did not stream data to disk while building. Second, it chose to use a rune as the fundamental unit of transition in the FST, but I felt using a byte would be more powerful in the end. My hope is that higher-level encoding-aware traversals will be possible when necessary. Finally, as I reported bugs and submitted PRs I learned that the mafsa project was mainly a research project and no longer being maintained. I wanted to build something that could be used in production. As the project advanced more and more techniques from the BurntSushi/fst were adapted to our implementation.

Are there tools to work with vellum files?

Under the cmd/vellum subdirectory, there's a command-line tool which features subcommands that can allow you to create, inspect and query vellum files.

How can I generate a state transition diagram from a vellum file?

The vellum command-line tool has a "dot" subcommand that can emit graphviz dot output data from an input vellum file. The dot file can in turn be converted into an image using graphviz tools. Example...

$ vellum dot myFile.vellum > output.dot
$ dot -Tpng output.dot -o output.png

Much credit goes to two existing projects:

Most of the original implementation here started with my digging into the internals of mafsa. As the implementation progressed, I continued to borrow ideas/approaches from the BurntSushi/fst library as well.

For a great introduction to this topic, please read the blog post Index 1,600,000,000 Keys with Automata and Rust

Expand ▾ Collapse ▴

Documentation

Overview

    Package vellum is a library for building, serializing and executing an FST (finite state transducer).

    There are two distinct phases, building an FST and using it.

    When building an FST, you insert keys ([]byte) and their associated value (uint64). Insert operations MUST be done in lexicographic order. While building the FST, data is streamed to an underlying Writer. At the conclusion of building, you MUST call Close() on the builder.

    After completion of the build phase, you can either Open() the FST if you serialized it to disk. Alternatively, if you already have the bytes in memory, you can use Load(). By default, Open() will use mmap to avoid loading the entire file into memory.

    Once the FST is ready, you can use the Contains() method to see if a keys is in the FST. You can use the Get() method to see if a key is in the FST and retrieve it's associated value. And, you can use the Iterator method to enumerate key/value pairs within a specified range.

    Example
    Output:
    
    1
    2
    3
    

    Index

    Examples

    Constants

    This section is empty.

    Variables

    View Source
    var ErrIteratorDone = errors.New("iterator-done")

      ErrIteratorDone is returned by Iterator/Next/Seek methods when the Current() value pointed to by the iterator is greater than the last key in this FST, or outside the configured startKeyInclusive/endKeyExclusive range of the Iterator.

      View Source
      var ErrOutOfOrder = errors.New("values not inserted in lexicographic order")

        ErrOutOfOrder is returned when values are not inserted in lexicographic order.

        Functions

        func AutomatonContains

        func AutomatonContains(a Automaton, k []byte) bool

          AutomatonContains implements an generic Contains() method which works on any implementation of Automaton

          func Merge

          func Merge(w io.Writer, opts *BuilderOpts, itrs []Iterator, f MergeFunc) error

            Merge will iterate through the provided Iterators, merge duplicate keys with the provided MergeFunc, and build a new FST to the provided Writer.

            func MergeMax

            func MergeMax(vals []uint64) uint64

              MergeMax chooses the maximum value

              func MergeMin

              func MergeMin(vals []uint64) uint64

                MergeMin chooses the minimum value

                func MergeSum

                func MergeSum(vals []uint64) uint64

                  MergeSum sums the values

                  func TransducerGet

                  func TransducerGet(t Transducer, k []byte) (bool, uint64)

                    TransducerGet implements an generic Get() method which works on any implementation of Transducer The caller MUST check the boolean return value for a match. Zero is a valid value regardless of match status, and if it is NOT a match, the value collected so far is returned.

                    Types

                    type AlwaysMatch

                    type AlwaysMatch struct{}

                      AlwaysMatch is an Automaton implementation which always matches

                      func (*AlwaysMatch) Accept

                      func (m *AlwaysMatch) Accept(int, byte) int

                        Accept returns the next AlwaysMatch state

                        func (*AlwaysMatch) CanMatch

                        func (m *AlwaysMatch) CanMatch(int) bool

                          CanMatch always returns true

                          func (*AlwaysMatch) IsMatch

                          func (m *AlwaysMatch) IsMatch(int) bool

                            IsMatch always returns true

                            func (*AlwaysMatch) Start

                            func (m *AlwaysMatch) Start() int

                              Start returns the AlwaysMatch start state

                              func (*AlwaysMatch) WillAlwaysMatch

                              func (m *AlwaysMatch) WillAlwaysMatch(int) bool

                                WillAlwaysMatch always returns true

                                type Automaton

                                type Automaton interface {
                                
                                	// Start returns the start state
                                	Start() int
                                
                                	// IsMatch returns true if and only if the state is a match
                                	IsMatch(int) bool
                                
                                	// CanMatch returns true if and only if it is possible to reach a match
                                	// in zero or more steps
                                	CanMatch(int) bool
                                
                                	// WillAlwaysMatch returns true if and only if the current state matches
                                	// and will always match no matter what steps are taken
                                	WillAlwaysMatch(int) bool
                                
                                	// Accept returns the next state given the input to the specified state
                                	Accept(int, byte) int
                                }

                                  Automaton represents the general contract of a byte-based finite automaton

                                  type Builder

                                  type Builder struct {
                                  	// contains filtered or unexported fields
                                  }

                                    A Builder is used to build a new FST. When possible data is streamed out to the underlying Writer as soon as possible.

                                    func New

                                    func New(w io.Writer, opts *BuilderOpts) (*Builder, error)

                                      New returns a new Builder which will stream out the underlying representation to the provided Writer as the set is built.

                                      func (*Builder) Close

                                      func (b *Builder) Close() error

                                        Close MUST be called after inserting all values.

                                        func (*Builder) Insert

                                        func (b *Builder) Insert(key []byte, val uint64) error

                                          Insert the provided value to the set being built. NOTE: values must be inserted in lexicographical order.

                                          func (*Builder) Reset

                                          func (b *Builder) Reset(w io.Writer) error

                                          type BuilderOpts

                                          type BuilderOpts struct {
                                          	Encoder           int
                                          	RegistryTableSize int
                                          	RegistryMRUSize   int
                                          }

                                            BuilderOpts is a structure to let advanced users customize the behavior of the builder and some aspects of the generated FST.

                                            type FST

                                            type FST struct {
                                            	// contains filtered or unexported fields
                                            }

                                              FST is an in-memory representation of a finite state transducer, capable of returning the uint64 value associated with each []byte key stored, as well as enumerating all of the keys in order.

                                              func Load

                                              func Load(data []byte) (*FST, error)

                                                Load will return the FST represented by the provided byte slice.

                                                func Open

                                                func Open(path string) (*FST, error)

                                                  Open loads the FST stored in the provided path

                                                  func (*FST) Accept

                                                  func (f *FST) Accept(addr int, b byte) int

                                                    Accept returns the next state for this Automaton on input of byte b

                                                    func (*FST) AcceptWithVal

                                                    func (f *FST) AcceptWithVal(addr int, b byte) (int, uint64)

                                                      AcceptWithVal returns the next state for this Automaton on input of byte b and also returns the output value for the transition

                                                      func (*FST) CanMatch

                                                      func (f *FST) CanMatch(addr int) bool

                                                        CanMatch returns if this state can ever transition to a matching state in this Automaton

                                                        func (*FST) Close

                                                        func (f *FST) Close() error

                                                          Close will unmap any mmap'd data (if managed by vellum) and it will close the backing file (if managed by vellum). You MUST call Close() for any FST instance that is created.

                                                          func (*FST) Contains

                                                          func (f *FST) Contains(val []byte) (bool, error)

                                                            Contains returns true if this FST contains the specified key.

                                                            func (*FST) Debug

                                                            func (f *FST) Debug(callback func(int, interface{}) error) error

                                                              Debug is only intended for debug purposes, it simply asks the underlying decoder visit each state, and pass it to the provided callback.

                                                              func (*FST) Get

                                                              func (f *FST) Get(input []byte) (uint64, bool, error)

                                                                Get returns the value associated with the key. NOTE: a value of zero does not imply the key does not exist, you must consult the second return value as well.

                                                                func (*FST) GetMaxKey

                                                                func (f *FST) GetMaxKey() ([]byte, error)

                                                                func (*FST) GetMinKey

                                                                func (f *FST) GetMinKey() ([]byte, error)

                                                                func (*FST) IsMatch

                                                                func (f *FST) IsMatch(addr int) bool

                                                                  IsMatch returns if this state is a matching state in this Automaton

                                                                  func (*FST) IsMatchWithVal

                                                                  func (f *FST) IsMatchWithVal(addr int) (bool, uint64)

                                                                    IsMatchWithVal returns if this state is a matching state in this Automaton and also returns the final output value for this state

                                                                    func (*FST) Iterator

                                                                    func (f *FST) Iterator(startKeyInclusive, endKeyExclusive []byte) (*FSTIterator, error)

                                                                      Iterator returns a new Iterator capable of enumerating the key/value pairs between the provided startKeyInclusive and endKeyExclusive.

                                                                      func (*FST) Len

                                                                      func (f *FST) Len() int

                                                                        Len returns the number of entries in this FST instance.

                                                                        func (*FST) Reader

                                                                        func (f *FST) Reader() (*Reader, error)

                                                                          Reader() returns a Reader instance that a single thread may use to retrieve data from the FST

                                                                          func (*FST) Search

                                                                          func (f *FST) Search(aut Automaton, startKeyInclusive, endKeyExclusive []byte) (*FSTIterator, error)

                                                                            Search returns a new Iterator capable of enumerating the key/value pairs between the provided startKeyInclusive and endKeyExclusive that also satisfy the provided automaton.

                                                                            func (*FST) Start

                                                                            func (f *FST) Start() int

                                                                              Start returns the start state of this Automaton

                                                                              func (*FST) Type

                                                                              func (f *FST) Type() int

                                                                                Type returns the type of this FST instance.

                                                                                func (*FST) Version

                                                                                func (f *FST) Version() int

                                                                                  Version returns the encoding version used by this FST instance.

                                                                                  func (*FST) WillAlwaysMatch

                                                                                  func (f *FST) WillAlwaysMatch(int) bool

                                                                                    WillAlwaysMatch returns if from this state the Automaton will always be in a matching state

                                                                                    type FSTIterator

                                                                                    type FSTIterator struct {
                                                                                    	// contains filtered or unexported fields
                                                                                    }

                                                                                      FSTIterator is a structure for iterating key/value pairs in this FST in lexicographic order. Iterators should be constructed with the FSTIterator method on the parent FST structure.

                                                                                      func (*FSTIterator) Close

                                                                                      func (i *FSTIterator) Close() error

                                                                                        Close will free any resources held by this iterator.

                                                                                        func (*FSTIterator) Current

                                                                                        func (i *FSTIterator) Current() ([]byte, uint64)

                                                                                          Current returns the key and value currently pointed to by the iterator. If the iterator is not pointing at a valid value (because Iterator/Next/Seek) returned an error previously, it may return nil,0.

                                                                                          func (*FSTIterator) Next

                                                                                          func (i *FSTIterator) Next() error

                                                                                            Next advances this iterator to the next key/value pair. If there is none or the advancement goes beyond the configured endKeyExclusive, then ErrIteratorDone is returned.

                                                                                            func (*FSTIterator) Reset

                                                                                            func (i *FSTIterator) Reset(f *FST,
                                                                                            	startKeyInclusive, endKeyExclusive []byte, aut Automaton) error

                                                                                              Reset resets the Iterator' internal state to allow for iterator reuse (e.g. pooling).

                                                                                              func (*FSTIterator) Seek

                                                                                              func (i *FSTIterator) Seek(key []byte) error

                                                                                                Seek advances this iterator to the specified key/value pair. If this key is not in the FST, Current() will return the next largest key. If this seek operation would go past the last key, or outside the configured startKeyInclusive/endKeyExclusive then ErrIteratorDone is returned.

                                                                                                type Iterator

                                                                                                type Iterator interface {
                                                                                                
                                                                                                	// Current() returns the key/value pair currently pointed to.
                                                                                                	// The []byte of the key is ONLY guaranteed to be valid until
                                                                                                	// another call to Next/Seek/Close.  If you need it beyond that
                                                                                                	// point you MUST make a copy.
                                                                                                	Current() ([]byte, uint64)
                                                                                                
                                                                                                	// Next() advances the iterator to the next key/value pair.
                                                                                                	// If no more key/value pairs exist, ErrIteratorDone is returned.
                                                                                                	Next() error
                                                                                                
                                                                                                	// Seek() advances the iterator the specified key, or the next key
                                                                                                	// if it does not exist.
                                                                                                	// If no keys exist after that point, ErrIteratorDone is returned.
                                                                                                	Seek(key []byte) error
                                                                                                
                                                                                                	// Reset resets the Iterator' internal state to allow for iterator
                                                                                                	// reuse (e.g. pooling).
                                                                                                	Reset(f *FST, startKeyInclusive, endKeyExclusive []byte, aut Automaton) error
                                                                                                
                                                                                                	// Close() frees any resources held by this iterator.
                                                                                                	Close() error
                                                                                                }

                                                                                                  Iterator represents a means of visiting key/value pairs in order.

                                                                                                  type MergeFunc

                                                                                                  type MergeFunc func([]uint64) uint64

                                                                                                    MergeFunc is used to choose the new value for a key when merging a slice of iterators, and the same key is observed with multiple values. Values presented to the MergeFunc will be in the same order as the original slice creating the MergeIterator. This allows some MergeFunc implementations to prioritize one iterator over another.

                                                                                                    type MergeIterator

                                                                                                    type MergeIterator struct {
                                                                                                    	// contains filtered or unexported fields
                                                                                                    }

                                                                                                      MergeIterator implements the Iterator interface by traversing a slice of iterators and merging the contents of them. If the same key exists in mulitipe underlying iterators, a user-provided MergeFunc will be invoked to choose the new value.

                                                                                                      func NewMergeIterator

                                                                                                      func NewMergeIterator(itrs []Iterator, f MergeFunc) (*MergeIterator, error)

                                                                                                        NewMergeIterator creates a new MergeIterator over the provided slice of Iterators and with the specified MergeFunc to resolve duplicate keys.

                                                                                                        func (*MergeIterator) Close

                                                                                                        func (m *MergeIterator) Close() error

                                                                                                          Close will attempt to close all the underlying Iterators. If any errors are encountered, the first will be returned.

                                                                                                          func (*MergeIterator) Current

                                                                                                          func (m *MergeIterator) Current() ([]byte, uint64)

                                                                                                            Current returns the key and value currently pointed to by this iterator. If the iterator is not pointing at a valid value (because Iterator/Next/Seek) returned an error previously, it may return nil,0.

                                                                                                            func (*MergeIterator) Next

                                                                                                            func (m *MergeIterator) Next() error

                                                                                                              Next advances this iterator to the next key/value pair. If there is none, then ErrIteratorDone is returned.

                                                                                                              func (*MergeIterator) Seek

                                                                                                              func (m *MergeIterator) Seek(key []byte) error

                                                                                                                Seek advances this iterator to the specified key/value pair. If this key is not in the FST, Current() will return the next largest key. If this seek operation would go past the last key, then ErrIteratorDone is returned.

                                                                                                                type Reader

                                                                                                                type Reader struct {
                                                                                                                	// contains filtered or unexported fields
                                                                                                                }

                                                                                                                  A Reader is meant for a single threaded use

                                                                                                                  func (*Reader) Get

                                                                                                                  func (r *Reader) Get(input []byte) (uint64, bool, error)

                                                                                                                  type Transducer

                                                                                                                  type Transducer interface {
                                                                                                                  
                                                                                                                  	// all transducers are also automatons
                                                                                                                  	Automaton
                                                                                                                  
                                                                                                                  	// IsMatchWithValue returns true if and only if the state is a match
                                                                                                                  	// additionally it returns a states final value (if any)
                                                                                                                  	IsMatchWithVal(int) (bool, uint64)
                                                                                                                  
                                                                                                                  	// Accept returns the next state given the input to the specified state
                                                                                                                  	// additionally it returns the value associated with the transition
                                                                                                                  	AcceptWithVal(int, byte) (int, uint64)
                                                                                                                  }

                                                                                                                    Transducer represents the general contract of a byte-based finite transducer

                                                                                                                    Directories

                                                                                                                    Path Synopsis
                                                                                                                    cmd