README
Fuzzy
Fuzzy is a very fast spell checker and query suggester written in Golang.
Motivation:
- Sajari uses very large queries (hundreds of words) but needs to respond sub-second to these queries where possible. Common spell check algorithms are quite slow or very resource intensive.
- The aim was to achieve spell checks in sub 100usec per word (10,000 / second single core) with at least 60% accuracy and multi-language support.
- Currently we see sub 40usec per word and ~70% accuracy for a Levenshtein distance of 2 chars on a 2012 MacBook Pro (the English test set comes from Peter Norvig's article, see http://norvig.com/spell-correct.html).
- A 500 word query can be spell checked in ~0.02 sec, divided across the available CPU cores, which is good enough for us.
Notes:
- Each lookup currently runs as a single goroutine, so this could undoubtedly be made much faster using multiple cores, but the speed is already quite good.
- Accuracy is hit slightly because several correct words don't appear at all in the training text (data/big.txt).
- Fuzzy is a "Symmetric Delete Spelling Corrector", an approach described in blog posts by Wolf Garbe at Faroo.com (see http://blog.faroo.com/2012/06/07/improved-edit-distance-based-spelling-correction/).
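The symmetric delete idea can be sketched in a few lines: instead of generating insertions, swaps, and replacements, only deletions are indexed, for both dictionary terms and the input term; a match between the two variant sets implies a candidate correction. A minimal illustration (not the package's own code):

```go
package main

import "fmt"

// deletes1 returns every string formed by removing one rune from word.
// A symmetric delete corrector indexes these variants for dictionary
// terms, then matches them against the variants of the input term.
func deletes1(word string) []string {
	runes := []rune(word)
	out := make([]string, 0, len(runes))
	for i := range runes {
		out = append(out, string(runes[:i])+string(runes[i+1:]))
	}
	return out
}

func main() {
	fmt.Println(deletes1("uncle"))
	// A misspelling like "uncel" shares the variant "unce" with "uncle",
	// so "uncle" is found as a candidate without generating any
	// insertions or replacements.
}
```

Because only deletions are generated, the candidate set stays small compared with the full insert/replace/swap expansion Norvig-style correctors use.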
Config:
- Generally no config is required, but you can tweak the model for your application.
"threshold"
is the trigger point when a word becomes popular enough to build lookup keys for it. Setting this to "1" means any instance of a given word makes it a legitimate spelling. This typically corrects the most errors, but can also cause false positives if incorrect spellings exist in the training data. It also causes a much larger index to be built. By default this is set to 4."depth"
is the Levenshtein distance the model builds lookup keys for. For spelling correction, a setting of "2" is typically very good. At a distance of "3" the potential number of words is much, much larger, but adds little benefit to accuracy. For query prediction a larger number can be useful, but again is much more expensive. A depth of "1" and threshold of "1" for the 1st Norvig test set gives ~70% correction accuracy at ~5usec per check (e.g. ~200kHz), for many applications this will be good enough. At depths > 2, the false positives begin to hurt the accuracy.
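To see why larger depths are expensive, the following sketch (illustrative only, not the package's EditsMulti) counts the distinct delete-keys a single word generates at increasing depths:

```go
package main

import "fmt"

// deletesToDepth collects the distinct strings reachable from word by
// up to `depth` single-rune deletions. The size of this set is roughly
// the number of lookup keys the word contributes at that depth.
func deletesToDepth(word string, depth int) map[string]bool {
	seen := map[string]bool{}
	var rec func(w string, d int)
	rec = func(w string, d int) {
		if d == 0 {
			return
		}
		runes := []rune(w)
		for i := range runes {
			v := string(runes[:i]) + string(runes[i+1:])
			if !seen[v] {
				seen[v] = true
				rec(v, d-1)
			}
		}
	}
	rec(word, depth)
	return seen
}

func main() {
	// Key counts grow quickly with depth, and so does the index.
	for d := 1; d <= 4; d++ {
		fmt.Printf("depth %d: %d keys for \"dynamite\"\n", d, len(deletesToDepth("dynamite", d)))
	}
}
```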
Future improvements:
- Make some of the expensive processes concurrent.
- Add spelling checks for different languages. If you have misspellings in different languages, please add them or send them to us.
- Allow the term-score map to be read from an external term set (e.g. integrating this currently may double up on keeping a term count).
- Currently there is no method to delete lookup keys, so potentially this may cause bloating over time if the dictionary changes significantly.
- Add right-to-left deletion beyond the Levenshtein config depth (e.g. don't process all deletes except for query predictors).
Usage:
- Below is some example code showing how to use the package.
- An example showing how to train with a static set of words is contained in the fuzzy_test.go file, which uses the "big.txt" file to create an English dictionary.
- To integrate with your application (e.g. custom dictionary / word popularity), use the single word and multiword training functions shown in the example below. Each time you add a new instance of a given word, pass it to this function. The model will keep a count and build lookup keys once a word passes the threshold.
- We haven't tested with other languages, but this should work fine. Please let us know how you go.
support@sajari.com
package main

import (
	"fmt"

	"github.com/sajari/fuzzy"
)

func main() {
	model := fuzzy.NewModel()

	// For testing only; this is not advisable in production.
	model.SetThreshold(1)

	// This expands the distance searched, but costs more resources (memory and time).
	// For spell checking, "2" is typically enough; for query suggestions this can be higher.
	model.SetDepth(5)

	// Train multiple words simultaneously by passing a slice of strings to "Train".
	words := []string{"bob", "your", "uncle", "dynamite", "delicate", "biggest", "big", "bigger", "aunty", "you're"}
	model.Train(words)

	// Train word by word (typically triggered in your application once a given word is popular enough).
	model.TrainWord("single")

	// Check spelling.
	fmt.Println("\nSPELL CHECKS")
	fmt.Println("  Deletion test (yor)       : ", model.SpellCheck("yor"))
	fmt.Println("  Swap test (uncel)         : ", model.SpellCheck("uncel"))
	fmt.Println("  Replace test (dynemite)   : ", model.SpellCheck("dynemite"))
	fmt.Println("  Insert test (dellicate)   : ", model.SpellCheck("dellicate"))
	fmt.Println("  Two char test (dellicade) : ", model.SpellCheck("dellicade"))

	// Suggest completions.
	fmt.Println("\nQUERY SUGGESTIONS")
	fmt.Println("  \"bigge\". Did you mean?: ", model.Suggestions("bigge", false))
	fmt.Println("  \"bo\". Did you mean?: ", model.Suggestions("bo", false))
	fmt.Println("  \"dyn\". Did you mean?: ", model.Suggestions("dyn", false))

	// Autocomplete suggestions.
	suggested, _ := model.Autocomplete("bi")
	fmt.Printf("  \"bi\". Suggestions: %v", suggested)
}
Documentation

Overview
Eventually this should be removed. Currently it gives backwards compatibility to old versions that did not store the query count, which is now used for autocomplete.
Index
- Constants
- func Edits1(word string) []string
- func Levenshtein(a, b *string) int
- func SampleEnglish() []string
- type Autos
- type Counts
- type Method
- type Model
- func (model *Model) Autocomplete(input string) ([]string, error)
- func (model *Model) CheckKnown(input string, correct string) bool
- func (model *Model) EditsMulti(term string, depth int) []string
- func (model *Model) Init() *Model
- func (model *Model) Potentials(input string, exhaustive bool) map[string]*Potential
- func (model *Model) Save(filename string) error
- func (model *Model) SaveLight(filename string) error
- func (model *Model) SetCount(term string, count int, suggest bool)
- func (model *Model) SetDepth(val int)
- func (model *Model) SetDivergenceThreshold(val int)
- func (model *Model) SetThreshold(val int)
- func (model *Model) SetUseAutocomplete(val bool)
- func (model *Model) SpellCheck(input string) string
- func (model *Model) SpellCheckSuggestions(input string, n int) []string
- func (model *Model) Suggestions(input string, exhaustive bool) []string
- func (model *Model) Train(terms []string)
- func (model *Model) TrainQuery(term string)
- func (model *Model) TrainWord(term string)
- func (model *Model) WriteTo(w io.Writer) (int64, error)
- type OldModel
- type Pair
- type Potential
Constants

const (
	SpellDepthDefault              = 2
	SpellThresholdDefault          = 5
	SuffDivergenceThresholdDefault = 100
)

const (
	MethodIsWord                   Method = 0
	MethodSuggestMapsToInput              = 1
	MethodInputDeleteMapsToDict           = 2
	MethodInputDeleteMapsToSuggest        = 3
)
Variables
This section is empty.
Functions
func Levenshtein

func Levenshtein(a, b *string) int

Calculate the Levenshtein distance between two strings.
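As an illustration of what this computes, here is a standard two-row dynamic-programming edit distance in Go. This is a sketch of the algorithm, not the package's own code (the package's function takes *string arguments):

```go
package main

import "fmt"

// levenshtein computes the minimum number of single-rune insertions,
// deletions, and substitutions needed to turn a into b, using the
// two-row dynamic-programming formulation.
func levenshtein(a, b string) int {
	ra, rb := []rune(a), []rune(b)
	prev := make([]int, len(rb)+1)
	curr := make([]int, len(rb)+1)
	for j := range prev {
		prev[j] = j // distance from empty prefix of a
	}
	for i := 1; i <= len(ra); i++ {
		curr[0] = i
		for j := 1; j <= len(rb); j++ {
			cost := 1
			if ra[i-1] == rb[j-1] {
				cost = 0
			}
			curr[j] = minInt(prev[j]+1, minInt(curr[j-1]+1, prev[j-1]+cost))
		}
		prev, curr = curr, prev
	}
	return prev[len(rb)]
}

func minInt(a, b int) int {
	if a < b {
		return a
	}
	return b
}

func main() {
	fmt.Println(levenshtein("uncel", "uncle")) // a swap costs 2 plain edits
}
```

Note that a transposition ("uncel" vs "uncle") costs 2 under plain Levenshtein distance, which is why a depth of 2 catches swapped characters.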
func SampleEnglish
func SampleEnglish() []string
Types
type Model

type Model struct {
	Data                    map[string]*Counts  `json:"data"`
	Maxcount                int                 `json:"maxcount"`
	Suggest                 map[string][]string `json:"suggest"`
	Depth                   int                 `json:"depth"`
	Threshold               int                 `json:"threshold"`
	UseAutocomplete         bool                `json:"autocomplete"`
	SuffDivergence          int                 `json:"-"`
	SuffDivergenceThreshold int                 `json:"suff_threshold"`
	SuffixArr               *suffixarray.Index  `json:"-"`
	SuffixArrConcat         string              `json:"-"`
	sync.RWMutex
}
func FromReader
FromReader loads a model from a Reader
func (*Model) Autocomplete
For a given string, autocomplete using the suffix array model
func (*Model) CheckKnown

Test an input; if we get it wrong, look at why it is wrong. This function returns a bool indicating whether the guess was correct, as well as the term it is suggesting. Typically this function would be used for testing, not for production.
func (*Model) EditsMulti
Edits at any depth for a given term. The depth of the model is used
func (*Model) Potentials
Return the raw potential terms so they can be ranked externally to this package
func (*Model) SaveLight

Save a spelling model to disk, but discard all entries seen fewer than the threshold number of occurrences. The result is much smaller, and is all that is needed when the model is generated as a one-off, but it is not useful for incremental usage.
func (*Model) SetCount
Manually set the count of a word. Optionally trigger the creation of suggestion keys for the term. This function lets you build a model from an existing dictionary with word popularity counts without needing to run "TrainWord" repeatedly
func (*Model) SetDepth
Change the default depth value of the model. This sets how many character differences are indexed. The default is 2.
func (*Model) SetDivergenceThreshold

Optionally set the suffix array divergence threshold. This is the number of query training steps between rebuilds of the suffix array. A low number will be more accurate but will use more resources and create more garbage.
func (*Model) SetThreshold
Change the default threshold of the model. This is how many times a term must be seen before suggestions are created for it
func (*Model) SetUseAutocomplete

Optionally disable suffixarray-based autocomplete support.
func (*Model) SpellCheck
Return the most likely correction for the input term
func (*Model) SpellCheckSuggestions
Return the most likely corrections in order from best to worst
func (*Model) Suggestions
For a given input string, suggests potential replacements
func (*Model) TrainQuery
Train using a search query term. This builds a second popularity index of terms used to search, as opposed to generally occurring in corpus text