Documentation ¶
Overview ¶
Package editdist provides a set of string comparison functions to compute the distance between a pair of strings under various distance metrics in string space. The higher the distance, the greater the difference between both strings.
There are two main distance metrics implemented in this package.
Levenshtein Distance ¶
In the original Levenshtein distance metric, the distance between any two string is measured by the minimum "edit operations" required to transform one string into the other. These edit operations are limited to (1) an insertion of a character, (2) a deletion of a character, and (3) a substitution of a character by another character.
For example, when using UnitPenalty indicator function as the character substitution penalty, the Levenshtein distance between "Hello" and "Hola" is 3, whereas the distance between "Hi" and "" (empty string) is 2.
Damerau–Levenshtein Distance ¶
The Damerau–Levenshtein distance metric is an improvement upon the original Levenshtein distance by allowing another kind of edit operation: a transposition of two adjacent characters.
For example, when using UnitPenalty indicator function as both the character substitution and character transposition penalty, the Damerau–Levenshtein distance between "Thrust" and "Thursday" is 4. Without transposition penalty, the distance under the original Levenshtein metric would have become 5.
Note: The efficient implementation of Damerau-Levenshtein distance is hard to come by, hence we use the weaker version of Damerau-Levenshtein distance called Optimal Alignment, which is described next.
Optimal Alignment Distance ¶
The optimal alignment distance is the restricted version of the Damerau–Levenshtein distance; specifically, each rune character in both the original and the target string is subjected to at most one edit operation, and only characters that are adjacent in the original input are allowed to be transposed.
Therefore, Damerau–Levenshtein distance between "trout" and "turn" is 3 (as shown by "trout" → "trut" → "turt" → "turn"). But the second operation would not be allowed in the optimal alignment metric, and thus yield a worse distance of 4 (as shown by "trout" → "tout" → "tut" → "turt" → "turn").
Index ¶
- func LevenshteinDist(fst, snd string, substPenalty RunePenaltyFunction) float64
- func MakeStringSimilarityFunction(inputFunction StringDistFunction) factory.SimilarityScoreFunction
- func OptimalAlignmentDist(fst, snd string, substPenalty, transPenalty RunePenaltyFunction) float64
- func UnitPenalty(c, d rune) float64
- type RunePenaltyFunction
- type StringDistFunction
Constants ¶
This section is empty.
Variables ¶
This section is empty.
Functions ¶
func LevenshteinDist ¶ added in v0.5.0
func LevenshteinDist(fst, snd string, substPenalty RunePenaltyFunction) float64
OptimalAlignmentDistance computes the original "Levenshtein distance" between two provided string, viewed as a sequence of rune characters.
Character substitution penalty function substPenalty has to be a symmetric function which takes in two rune character and outputs their substitution penalty between 0 an 1. If two identical runes are provided, ideally the output penalty has to be 0. As a special case to account for character insertions and deletions, if one of the input runes is rune 0, then substDistFunction should return the insertion/deletion penalty of the other rune provided to the function.
Time complexity ¶
The implementation used to compute the optimal alignment distance is a simple dynamic programming algorithm that takes O(|fst| * |snd|) time where |fst| and |snd| are the lengths of two input strings respectively. Additionally, the memory usage for this function is within the order of O(|snd|).
func MakeStringSimilarityFunction ¶ added in v0.5.0
func MakeStringSimilarityFunction(inputFunction StringDistFunction) factory.SimilarityScoreFunction
MakeStringSimilarityFunction is a higher-order function which converts the provided typical edit-distance computation function (i.e. function which receives two strings and returns edit distance between them) into the normalized string similarity function. The resulting function would return a score within the range from 0 to 1 where 1 indicates that both strings are identical and 0 indicates that both strings are totally distinct.
Implementation details: the denominator of the fraction is _not_ the length of the longer string. The reason for this is that some insertion/deletion errors incur sub-unit penalties. Without the size-fitting denominator, a malicious user may attack by saturating those insertions/deletions in order to decrease the total edit distances.
func OptimalAlignmentDist ¶
func OptimalAlignmentDist(fst, snd string, substPenalty, transPenalty RunePenaltyFunction) float64
OptimalAlignmentDistance computes the "optimal alignment distance" between two provided string, viewed as a sequence of rune characters.
Character substitution penalty function substPenalty has to be a symmetric function which takes in two rune character and outputs their substitution penalty between 0 an 1. If two identical runes are provided, ideally the output penalty has to be 0. As a special case to account for character insertions and deletions, if one of the input runes is rune 0, then substDistFunction should return the insertion/deletion penalty of the other rune provided to the function.
Adjacent character transposition penalty function transPenalty also has to be a symmetric function which takes in two non-zero rune characters and outputs their transposition penalty score between 0 and 2. Please note that if for a pair of runes, x and y, satisfies the equation
transDistFunction(x, y) >= 2 * substDistFunction(x, y)
then the transposition of those characters are effectively ignored since it is cheaper to just perform two substitutions in a row instead.
Time complexity ¶
The implementation used to compute the optimal alignment distance is a simple dynamic programming algorithm that takes O(|fst| * |snd|) time where |fst| and |snd| are the lengths of two input strings respectively. Additionally, the memory usage for this function is within the order of O(|snd|).
func UnitPenalty ¶
UnitPenalty is an indicator function which returns 1 if two input runes differ and returns 0 otherwise. This function can be used for both the substitution and the transposition penalties.
UnitPenalty is a symmetric function by nature, meaning that the order of the input arguments do not matter.
For example, UnitPenalty('a', 'b') is the penalty for substituting character 'a' with another character 'b' (the output of which is 1 in this particular case). On the other hand, UnitPenalty('c', 0) is the penalty for an insertion (or deletion) of the character 'c'.
Types ¶
type RunePenaltyFunction ¶ added in v0.5.0
RunePenaltyFunction is an umbrella type alias representing a _symmetric_ function which computes a penalty score in the context of two rune characters. There are two kinds of penalties: substitution penalty and transposition penalty.
Substitution Penalty ¶
Substitution Penalty refers to the cost of substituting one rune for the other (which includes character insertions and deletions as special cases). Specifically, a function of this kind takes in two runes as input arguments and returns the penalty value within 0 to 1 range, where 0 indicates that two runes are identical and 1 indicates that two rune are totally distinct and can never be mistakenly mixed up. The insertion and deletion penalties are represented by assigning rune 0 as one of the input arguments to the function.
Transposition Penalty ¶
Transposition Penalty refers to the cost of adjacent character transpositions (i.e. swapping location of two adjacent characters in the string). Specifically, a function of this kind takes in two runes as input arguments and returns the penalty value within 0 to 2 range, where 0 indicates that both runes are commonly and indistinctively interchangeable and 2 indicates that both runes are completely different (and thus the penalty is no different than performing two consecutive substitutions).
type StringDistFunction ¶ added in v0.5.0
StringDistFunction is a type alias for edit distance functions between two strings
func MakeLevenshteinDistFunction ¶ added in v0.5.0
func MakeLevenshteinDistFunction(substPenaltyFunction RunePenaltyFunction) StringDistFunction
MakeLevenshteinDistFunction is a higher-order function which uses the provided substitution penalty function to construct the modified version of the otherwise generic OptimalAlignmentDist function with those specified edit distance metrics. The resulting function would receive two input strings and output the distance between them.
func MakeOptimalAlignmentDistFunction ¶ added in v0.5.0
func MakeOptimalAlignmentDistFunction(substPenaltyFunction, transPenaltyFunction RunePenaltyFunction) StringDistFunction
MakeOptimalAlignmentDistFunction is a higher-order function which uses the provided substitution and the transposition penalty functions to construct the modified version of the otherwise generic OptimalAlignmentDist function with those specified edit distance metrics. The resulting function will receive two input strings and output the distance between them.
Source Files ¶
Directories ¶
Path | Synopsis |
---|---|
Package thai is a subpackage that provides additional editdist-related functions that are customized for strings containing Thai characters, etc.
|
Package thai is a subpackage that provides additional editdist-related functions that are customized for strings containing Thai characters, etc. |