levenshtein

package module
v1.2.3 Latest Latest
Warning

This package is not in the latest version of its module.

Go to latest
Published: Mar 12, 2020 License: Apache-2.0 Imports: 0 Imported by: 96

README

A Go package for calculating the Levenshtein distance between two strings

Release GoDoc  Build Status Coverage Status Go Report Card

This package implements distance and similarity metrics for strings, based on the Levenshtein measure, in Go.

Project Status

v1.2.3 Stable: Guaranteed no breaking changes to the API in future v1.x releases. Probably safe to use in production, though provided on "AS IS" basis.

This package is being actively maintained. If you encounter any problems or have any suggestions for improvement, please open an issue. Pull requests are welcome.

Overview

The Levenshtein Distance between two strings is the minimum total cost of edits that would convert the first string into the second. The allowed edit operations are insertions, deletions, and substitutions, all at character (one UTF-8 code point) level. Each operation has a default cost of 1, but each can be assigned its own cost equal to or greater than 0.

A Distance of 0 means the two strings are identical, and the higher the value the more different the strings. Since in practice we are interested in finding if the two strings are "close enough", it often does not make sense to continue the calculation once the result is mathematically guaranteed to exceed a desired threshold. Providing this value to the Distance function allows it to take a shortcut and return a lower bound instead of an exact cost when the threshold is exceeded.

The Similarity function calculates the distance, then converts it into a normalized metric within the range 0..1, with 1 meaning the strings are identical, and 0 that they have nothing in common. A minimum similarity threshold can be provided to speed up the calculation of the metric for strings that are far too dissimilar for the purpose at hand. All values under this threshold are rounded down to 0.

The Match function provides a similarity metric, with the same range and meaning as Similarity, but with a bonus for string pairs that share a common prefix and have a similarity above a "bonus threshold". It uses the same method as proposed by Winkler for the Jaro distance, and the reasoning behind it is that these string pairs are very likely spelling variations or errors, and they are more closely linked than the edit distance alone would suggest.

The underlying Calculate function is also exported, to allow the building of other derivative metrics, if needed.

Installation

go get github.com/agext/levenshtein

License

Package levenshtein is released under the Apache 2.0 license. See the LICENSE file for details.

Documentation

Overview

Package levenshtein implements distance and similarity metrics for strings, based on the Levenshtein measure.

The Levenshtein `Distance` between two strings is the minimum total cost of edits that would convert the first string into the second. The allowed edit operations are insertions, deletions, and substitutions, all at character (one UTF-8 code point) level. Each operation has a default cost of 1, but each can be assigned its own cost equal to or greater than 0.

A `Distance` of 0 means the two strings are identical, and the higher the value the more different the strings. Since in practice we are interested in finding if the two strings are "close enough", it often does not make sense to continue the calculation once the result is mathematically guaranteed to exceed a desired threshold. Providing this value to the `Distance` function allows it to take a shortcut and return a lower bound instead of an exact cost when the threshold is exceeded.

The `Similarity` function calculates the distance, then converts it into a normalized metric within the range 0..1, with 1 meaning the strings are identical, and 0 that they have nothing in common. A minimum similarity threshold can be provided to speed up the calculation of the metric for strings that are far too dissimilar for the purpose at hand. All values under this threshold are rounded down to 0.

The `Match` function provides a similarity metric, with the same range and meaning as `Similarity`, but with a bonus for string pairs that share a common prefix and have a similarity above a "bonus threshold". It uses the same method as proposed by Winkler for the Jaro distance, and the reasoning behind it is that these string pairs are very likely spelling variations or errors, and they are more closely linked than the edit distance alone would suggest.

The underlying `Calculate` function is also exported, to allow the building of other derivative metrics, if needed.

Index

Constants

This section is empty.

Variables

This section is empty.

Functions

func Calculate

func Calculate(str1, str2 []rune, maxCost, insCost, subCost, delCost int) (dist, prefixLen, suffixLen int)

Calculate determines the Levenshtein distance between two strings, using the given costs for each edit operation. It returns the distance along with the lengths of the longest common prefix and suffix.

If maxCost is non-zero, the calculation stops as soon as the distance is determined to be greater than maxCost. Therefore, any return value higher than maxCost is a lower bound for the actual distance.

func Distance

func Distance(str1, str2 string, p *Params) int

Distance returns the Levenshtein distance between str1 and str2, using the default or provided cost values. Pass nil for the third argument to use the default cost of 1 for all three operations, with no maximum.

func Match

func Match(str1, str2 string, p *Params) float64

Match returns a similarity score adjusted by the same method as proposed by Winkler for the Jaro distance - giving a bonus to string pairs that share a common prefix, only if their similarity score is already over a threshold.

The score is in the range of 0..1, with 1 meaning the strings are identical, and 0 meaning they have nothing in common.

A nil third argument uses the default cost of 1 for all three operations, maximum length of common prefix to consider for bonus of 4, scaling factor of 0.1, and bonus threshold of 0.7.

If a non-zero MinScore value is provided in the parameters, scores lower than it will be returned as 0.

func Similarity

func Similarity(str1, str2 string, p *Params) float64

Similarity returns a score in the range of 0..1 for how similar the two strings are. A score of 1 means the strings are identical, and 0 means they have nothing in common.

A nil third argument uses the default cost of 1 for all three operations.

If a non-zero MinScore value is provided in the parameters, scores lower than it will be returned as 0.

Types

type Params

type Params struct {
	// contains filtered or unexported fields
}

Params represents a set of parameter values for the various formulas involved in the calculation of the Levenshtein string metrics.

func NewParams

func NewParams() *Params

NewParams creates a new set of parameters and initializes it with the default values.

func (*Params) BonusPrefix

func (p *Params) BonusPrefix(v int) *Params

BonusPrefix overrides the default value for the maximum length of common prefix to be considered for bonus by Match(). The new value must be zero or positive.

func (*Params) BonusScale

func (p *Params) BonusScale(v float64) *Params

BonusScale overrides the default value for the scaling factor used by Match() in calculating the bonus. The new value must be zero or positive. To guarantee that the similarity score remains in the interval 0..1, this scaling factor is not allowed to exceed 1 / BonusPrefix.

func (*Params) BonusThreshold

func (p *Params) BonusThreshold(v float64) *Params

BonusThreshold overrides the default value for the minimum similarity score for which Match() can assign a bonus. The new value must be zero or positive. Note that a threshold greater than 1 effectively makes Match() become the equivalent of Similarity().

func (*Params) Clone

func (p *Params) Clone() *Params

Clone returns a pointer to a copy of the receiver parameter set, or of a new default parameter set if the receiver is nil.

func (*Params) DelCost

func (p *Params) DelCost(v int) *Params

DelCost overrides the default value of 1 for the cost of deletion. The new value must be zero or positive.

func (*Params) InsCost

func (p *Params) InsCost(v int) *Params

InsCost overrides the default value of 1 for the cost of insertion. The new value must be zero or positive.

func (*Params) MaxCost

func (p *Params) MaxCost(v int) *Params

MaxCost overrides the default value of 0 (meaning unlimited) for the maximum cost. The calculation of Distance() stops when the result is guaranteed to exceed this maximum, returning a lower-bound rather than exact value. The new value must be zero or positive.

func (*Params) MinScore

func (p *Params) MinScore(v float64) *Params

MinScore overrides the default value of 0 for the minimum similarity score. Scores below this threshold are returned as 0 by Similarity() and Match(). The new value must be zero or positive. Note that a minimum greater than 1 can never be satisfied, resulting in a score of 0 for any pair of strings.

func (*Params) SubCost

func (p *Params) SubCost(v int) *Params

SubCost overrides the default value of 1 for the cost of substitution. The new value must be zero or positive.

Jump to

Keyboard shortcuts

? : This menu
/ : Search site
f or F : Jump to
y or Y : Canonical URL