Documentation ¶
Overview ¶
Package mafsa implements Minimal Acyclic Finite State Automata (MA-FSA) in a space-optimized way as described by Dacuik, Mihov, Watson, and Watson in their paper, "Incremental Construction of Minimal Acyclic Finite-State Automata" (2000). It also implements Minimal Perfect Hashing (MPH) as described by Lucceshi and Kowaltowski in their paper, "Applications of Finite Automata Representing Large Vocabularies" (1992).
Unscientifically speaking, this package lets you store large amounts of strings (with Unicode) in memory so that membership queries, prefix lookups, and fuzzy searches are fast. And because minimal perfect hashing is included, you can associate each entry in the tree with more data used by your application. See the README or the end of this documentation for a brief tutorial.
MA-FSA structures are a specific type of Deterministic Acyclic Finite State Automaton (DAFSA) which fold equivalent state transitions into each other starting from the suffix of each entry. Typical construction algorithms involve building out the entire tree first, then minimizing the completed tree. However, the method described in the paper above allows the tree to be minimized after every word insertion, provided the insertions are performed in lexicographical order, which drastically reduces memory usage compared to regular prefix trees ("tries").
The goal of this package is to provide a simple, useful, and correct implementation of MA-FSA. Though more complex algorithms exist for removal of items and unordered insertion, these features may be outside the scope of this package.
Membership queries are on the order of O(n), where n is the length of the input string, so basically O(1). It is advisable to keep n small since long entries without much in common, especially in the beginning or end of the string, will quickly overrun the optimizations that are available. In those cases, n-gram implementations might be preferable, though these will use more CPU.
This package provides two kinds of MA-FSA implementations. One, the BuildTree, facilitates the construction of an optimized tree and allows ordered insertions. The other, MinTree, is effectively read-only but uses significantly less memory and is ideal for production environments where only reads will be occurring.
Usually your build process will be separate from your production application, which will make heavy use of reading the structure.
To use this package, create a BuildTree and insert your items in lexicographical order:
bt := mafsa.New() bt.Insert("cities") // an error will be returned if input is < last input bt.Insert("city") bt.Insert("pities") bt.Insert("pity") bt.Finish()
The tree is now compressed to a minimum number of nodes and is ready to be saved.
err := bt.Save("filename") if err != nil { log.Fatal("Could not save data to file:", err) }
In your production application, then, you can read the file into a MinTree directly:
mt, err := mafsa.Load("filename") if err != nil { log.Fatal("Could not load data from file:", err) }
The mt variable is a *MinTree which has the same data as the original BuildTree, but without all the extra "scaffolding" that was required for adding new elements.
The package provides some basic read mechanisms.
// See if a string is a member of the set fmt.Println("Does tree contain 'cities'?", mt.Contains("cities")) fmt.Println("Does tree contain 'pitiful'?", mt.Contains("pitiful")) // You can traverse down to a certain node, if it exists fmt.Printf("'y' node is at: %p\n", mt.Traverse([]rune("city"))) // To traverse the tree and get the number of elements inserted // before the prefix specified _, idx := mt.IndexedTraverse([]rune("pit")) fmt.Println("Elements inserted before entries starting with 'pit':", idx)
Index ¶
- type BuildTree
- func (t *BuildTree) Contains(word string) bool
- func (t *BuildTree) Finish()
- func (t *BuildTree) Insert(val string) error
- func (t *BuildTree) MarshalBinary() ([]byte, error)
- func (t *BuildTree) Save(filename string) error
- func (t *BuildTree) String() string
- func (t *BuildTree) Traverse(word []rune) *BuildTreeNode
- type BuildTreeNode
- type Decoder
- type Encoder
- type Entry
- type MinTree
- func (t *MinTree) Contains(word string) bool
- func (t *MinTree) IndexedTraverse(word []rune) (*MinTreeNode, int)
- func (t *MinTree) IterateDepthFirst() chan string
- func (t *MinTree) String() string
- func (t *MinTree) Traverse(word []rune) *MinTreeNode
- func (t *MinTree) UnmarshalBinary(data []byte) error
- type MinTreeNode
Constants ¶
This section is empty.
Variables ¶
This section is empty.
Functions ¶
This section is empty.
Types ¶
type BuildTree ¶
type BuildTree struct { Root *BuildTreeNode // contains filtered or unexported fields }
BuildTree implements an MA-FSA that contains all the "scaffolding" (state) necessary to perform optimizations after inserting each item, which prevents the tree from growing too big.
func (*BuildTree) Finish ¶
func (t *BuildTree) Finish()
Finish completes the optimizations on a tree. You must call Finish at least once, like immediately after all entries have been inserted.
func (*BuildTree) Insert ¶
Insert adds val to the tree and performs optimizations to minimize the number of nodes in the tree. The inserted val must be lexicographically equal to or higher than the last inserted val.
func (*BuildTree) MarshalBinary ¶
MarshalBinary encodes t into a binary format. It implements the functionality described by the encoding.BinaryMarhsaler interface. The associated encoding.BinaryUnmarshaler type is MinTree.
func (*BuildTree) Save ¶
Save encodes the MA-FSA into a binary format and writes it to a file. The tree can later be restored by calling the package's Load function.
func (*BuildTree) String ¶
String returns a roughly human-readable string representing the basic structure of the tree. For debugging only. Do not use with very large trees.
func (*BuildTree) Traverse ¶
func (t *BuildTree) Traverse(word []rune) *BuildTreeNode
Traverse follows nodes down the tree according to word and returns the ending node if there was one or nil if there wasn't one. Note that this method alone does not indicate membership; some node may still be reached even if the word is not in the structure.
type BuildTreeNode ¶
type BuildTreeNode struct { Edges map[rune]*BuildTreeNode // contains filtered or unexported fields }
BuildTreeNode is a node in a BuildTree that contains all the info necessary to be a part of optimizations during item insertions.
func (*BuildTreeNode) OrderedEdges ¶
func (tn *BuildTreeNode) OrderedEdges() []rune
OrderedEdges returns the keys of the outgoing edges in lexigocraphical sorted order. It enables deterministic traversal of edges to child nodes.
type Decoder ¶
type Decoder struct {
// contains filtered or unexported fields
}
Decoder is a type which can decode a byte slice into a MinTree.
type Encoder ¶
type Encoder struct {
// contains filtered or unexported fields
}
Encoder is a type which can encode a BuildTree into a byte slice which can be written to a file.
type Entry ¶
type Entry struct { // The actual entry that was added to the tree Value string // Represents how many entries were added to the tree before this one Index int }
Entry represents a value in the tree along with its index number (minimal perfect hashing).
type MinTree ¶
type MinTree struct {
Root *MinTreeNode
}
MinTree implements a simple prefix tree with read-only functionality. It supports Minimal Perfect Hashing (MPH) so you can lookup entries by their index in a slice.
func Load ¶
Load loads an existing MA-FSA from a file specified by filename. It returns a read-only MA-FSA, or an error if loading failed.
func (*MinTree) IndexedTraverse ¶
func (t *MinTree) IndexedTraverse(word []rune) (*MinTreeNode, int)
IndexedTraverse visits nodes according to word and returns the last node, if there was one, along with its index number. The index number represents how many entries, including the returned node, are in the tree before entries beginning with the specified prefix. Such an indexed traversal is slightly slower than a regular traversal since this kind of traversal has to iterate more nodes and count the numbers on each node. Note that returning a non-nil node is not indicative of membership; a node may be be returned even if the word is not in the set. If last node is nil, the returned index is -1. The algorithm is adapted directly from the paper by Lucceshi and Kowaltowski, 1992 (section 5.2), with one slight modification for correctness because of a slight difference in the way our tree is implemented.
func (*MinTree) IterateDepthFirst ¶ added in v1.1.0
IterateDepthFirst returns an unbuffered channel which will be loaded with all items in the tree in lexicographical order. The caller need only range over the channel (which will be closed whan all items have been loaded).
func (*MinTree) String ¶
String serializes the tree roughly as a human-readable string for debugging purposes. Don't try this on large trees.
func (*MinTree) Traverse ¶
func (t *MinTree) Traverse(word []rune) *MinTreeNode
Traverse visits nodes according to word and returns the last node, if there was one. Note that returning a non-nil node is not indicative of membership; a node may be returned even if the word is not in the set. Traverse implements a fast traversal which does not count the index number of the last node.
func (*MinTree) UnmarshalBinary ¶
UnmarshalBinary decodes binary data into this MinTree. It implements the functionality described by the encoding.BinaryMarhsaler interface. The associated encoding.BinaryMarshaler type is BuildTree.
type MinTreeNode ¶
type MinTreeNode struct { Edges map[rune]*MinTreeNode Final bool Number int }
MinTreeNode is a node in a MinTree.
func (*MinTreeNode) OrderedEdges ¶
func (n *MinTreeNode) OrderedEdges() []rune
OrderedEdges returns the keys of the outgoing edges in lexigocraphical sorted order. It enables deterministic traversal of edges to child nodes.