editdist

package
v0.6.0 Latest Latest
Warning

This package is not in the latest version of its module.

Go to latest
Published: Apr 13, 2020 License: LGPL-3.0 Imports: 2 Imported by: 0

Documentation

Overview

Package editdist provides a set of string comparison functions to compute the distance between a pair of strings under various distance metrics in string space. The higher the distance, the greater the difference between both strings.

There are two main distance metrics implemented in this package.

Levenshtein Distance

In the original Levenshtein distance metric, the distance between any two string is measured by the minimum "edit operations" required to transform one string into the other. These edit operations are limited to (1) an insertion of a character, (2) a deletion of a character, and (3) a substitution of a character by another character.

For example, when using UnitPenalty indicator function as the character substitution penalty, the Levenshtein distance between "Hello" and "Hola" is 3, whereas the distance between "Hi" and "" (empty string) is 2.

Damerau–Levenshtein Distance

The Damerau–Levenshtein distance metric is an improvement upon the original Levenshtein distance by allowing another kind of edit operation: a transposition of two adjacent characters.

For example, when using UnitPenalty indicator function as both the character substitution and character transposition penalty, the Damerau–Levenshtein distance between "Thrust" and "Thursday" is 4. Without transposition penalty, the distance under the original Levenshtein metric would have become 5.

Note: The efficient implementation of Damerau-Levenshtein distance is hard to come by, hence we use the weaker version of Damerau-Levenshtein distance called Optimal Alignment, which is described next.

Optimal Alignment Distance

The optimal alignment distance is the restricted version of the Damerau–Levenshtein distance; specifically, each rune character in both the original and the target string is subjected to at most one edit operation, and only characters that are adjacent in the original input are allowed to be transposed.

Therefore, Damerau–Levenshtein distance between "trout" and "turn" is 3 (as shown by "trout" → "trut" → "turt" → "turn"). But the second operation would not be allowed in the optimal alignment metric, and thus yield a worse distance of 4 (as shown by "trout" → "tout" → "tut" → "turt" → "turn").

Index

Constants

This section is empty.

Variables

This section is empty.

Functions

func LevenshteinDist added in v0.5.0

func LevenshteinDist(fst, snd string, substPenalty RunePenaltyFunction) float64

OptimalAlignmentDistance computes the original "Levenshtein distance" between two provided string, viewed as a sequence of rune characters.

Character substitution penalty function substPenalty has to be a symmetric function which takes in two rune character and outputs their substitution penalty between 0 an 1. If two identical runes are provided, ideally the output penalty has to be 0. As a special case to account for character insertions and deletions, if one of the input runes is rune 0, then substDistFunction should return the insertion/deletion penalty of the other rune provided to the function.

Time complexity

The implementation used to compute the optimal alignment distance is a simple dynamic programming algorithm that takes O(|fst| * |snd|) time where |fst| and |snd| are the lengths of two input strings respectively. Additionally, the memory usage for this function is within the order of O(|snd|).

func MakeStringSimilarityFunction added in v0.5.0

func MakeStringSimilarityFunction(inputFunction StringDistFunction) factory.SimilarityScoreFunction

MakeStringSimilarityFunction is a higher-order function which converts the provided typical edit-distance computation function (i.e. function which receives two strings and returns edit distance between them) into the normalized string similarity function. The resulting function would return a score within the range from 0 to 1 where 1 indicates that both strings are identical and 0 indicates that both strings are totally distinct.

Implementation details: the denominator of the fraction is _not_ the length of the longer string. The reason for this is that some insertion/deletion errors incur sub-unit penalties. Without the size-fitting denominator, a malicious user may attack by saturating those insertions/deletions in order to decrease the total edit distances.

func OptimalAlignmentDist

func OptimalAlignmentDist(fst, snd string, substPenalty, transPenalty RunePenaltyFunction) float64

OptimalAlignmentDistance computes the "optimal alignment distance" between two provided string, viewed as a sequence of rune characters.

Character substitution penalty function substPenalty has to be a symmetric function which takes in two rune character and outputs their substitution penalty between 0 an 1. If two identical runes are provided, ideally the output penalty has to be 0. As a special case to account for character insertions and deletions, if one of the input runes is rune 0, then substDistFunction should return the insertion/deletion penalty of the other rune provided to the function.

Adjacent character transposition penalty function transPenalty also has to be a symmetric function which takes in two non-zero rune characters and outputs their transposition penalty score between 0 and 2. Please note that if for a pair of runes, x and y, satisfies the equation

transDistFunction(x, y) >= 2 * substDistFunction(x, y)

then the transposition of those characters are effectively ignored since it is cheaper to just perform two substitutions in a row instead.

Time complexity

The implementation used to compute the optimal alignment distance is a simple dynamic programming algorithm that takes O(|fst| * |snd|) time where |fst| and |snd| are the lengths of two input strings respectively. Additionally, the memory usage for this function is within the order of O(|snd|).

func UnitPenalty

func UnitPenalty(c, d rune) float64

UnitPenalty is an indicator function which returns 1 if two input runes differ and returns 0 otherwise. This function can be used for both the substitution and the transposition penalties.

UnitPenalty is a symmetric function by nature, meaning that the order of the input arguments do not matter.

For example, UnitPenalty('a', 'b') is the penalty for substituting character 'a' with another character 'b' (the output of which is 1 in this particular case). On the other hand, UnitPenalty('c', 0) is the penalty for an insertion (or deletion) of the character 'c'.

Types

type RunePenaltyFunction added in v0.5.0

type RunePenaltyFunction = func(rune, rune) float64

RunePenaltyFunction is an umbrella type alias representing a _symmetric_ function which computes a penalty score in the context of two rune characters. There are two kinds of penalties: substitution penalty and transposition penalty.

Substitution Penalty

Substitution Penalty refers to the cost of substituting one rune for the other (which includes character insertions and deletions as special cases). Specifically, a function of this kind takes in two runes as input arguments and returns the penalty value within 0 to 1 range, where 0 indicates that two runes are identical and 1 indicates that two rune are totally distinct and can never be mistakenly mixed up. The insertion and deletion penalties are represented by assigning rune 0 as one of the input arguments to the function.

Transposition Penalty

Transposition Penalty refers to the cost of adjacent character transpositions (i.e. swapping location of two adjacent characters in the string). Specifically, a function of this kind takes in two runes as input arguments and returns the penalty value within 0 to 2 range, where 0 indicates that both runes are commonly and indistinctively interchangeable and 2 indicates that both runes are completely different (and thus the penalty is no different than performing two consecutive substitutions).

type StringDistFunction added in v0.5.0

type StringDistFunction = func(string, string) float64

StringDistFunction is a type alias for edit distance functions between two strings

func MakeLevenshteinDistFunction added in v0.5.0

func MakeLevenshteinDistFunction(substPenaltyFunction RunePenaltyFunction) StringDistFunction

MakeLevenshteinDistFunction is a higher-order function which uses the provided substitution penalty function to construct the modified version of the otherwise generic OptimalAlignmentDist function with those specified edit distance metrics. The resulting function would receive two input strings and output the distance between them.

func MakeOptimalAlignmentDistFunction added in v0.5.0

func MakeOptimalAlignmentDistFunction(substPenaltyFunction, transPenaltyFunction RunePenaltyFunction) StringDistFunction

MakeOptimalAlignmentDistFunction is a higher-order function which uses the provided substitution and the transposition penalty functions to construct the modified version of the otherwise generic OptimalAlignmentDist function with those specified edit distance metrics. The resulting function will receive two input strings and output the distance between them.

Directories

Path Synopsis
Package thai is a subpackage that provides additional editdist-related functions that are customized for strings containing Thai characters, etc.
Package thai is a subpackage that provides additional editdist-related functions that are customized for strings containing Thai characters, etc.

Jump to

Keyboard shortcuts

? : This menu
/ : Search site
f or F : Jump to
y or Y : Canonical URL