Documentation ¶
Overview ¶
Package strsim provides comprehensive string similarity metrics, distance functions, and phonetic encoding algorithms.
It combines edit-distance, token-based, and phonetic algorithms in a single package with a unified API. Every similarity metric returns a normalized score in [0, 1], where 1.0 means identical; edit-distance metrics also return the raw distance.
Quick Start ¶
// Edit distance
d := strsim.Levenshtein("kitten", "sitting") // 3
// Normalized similarity [0, 1]
s := strsim.JaroWinklerSimilarity("martha", "marhta") // ~0.961
// Phonetic encoding
code := strsim.Soundex("Robert") // "R163"
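// Find best match from a list (candidates is any []string; these values are only an example)
candidates := []string{"goland", "golang", "erlang"}
m := strsim.FindBestMatch("golang", candidates, strsim.NewJaroWinkler())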
Interfaces ¶
All similarity metrics implement the Metric interface, allowing them to be used interchangeably. Distance metrics additionally implement DistanceMetric. Phonetic encoders implement the Encoder interface.
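The interface definitions are not reproduced on this page, so the following is only a sketch of what they might look like, inferred from the method signatures listed under Types. The method sets and the toy exactMetric type are assumptions for illustration, not the package's actual declarations.

```go
package main

import "fmt"

// Metric is an assumed shape: anything that reports a similarity in [0, 1].
type Metric interface {
	Similarity(a, b string) float64
}

// DistanceMetric is an assumed shape: a Metric that also reports a raw distance.
type DistanceMetric interface {
	Metric
	Distance(a, b string) int
}

// Encoder is an assumed shape: a phonetic encoder that produces a single code.
type Encoder interface {
	Encode(s string) string
}

// exactMetric is a hypothetical toy metric used only to exercise the interfaces.
type exactMetric struct{}

// Distance is 0 for identical strings and 1 otherwise.
func (exactMetric) Distance(a, b string) int {
	if a == b {
		return 0
	}
	return 1
}

// Similarity maps the toy distance into [0, 1].
func (m exactMetric) Similarity(a, b string) float64 {
	return 1 - float64(m.Distance(a, b))
}

// score accepts any Metric, so callers can swap algorithms freely.
func score(m Metric, a, b string) float64 {
	return m.Similarity(a, b)
}

func main() {
	var dm DistanceMetric = exactMetric{}
	fmt.Println(score(dm, "go", "go"), dm.Distance("go", "Go")) // prints "1 1"
}
```

Because DistanceMetric embeds Metric in this sketch, any distance metric can be passed wherever a plain Metric is expected.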
Index ¶
- func CosineSimilarity(a, b string) float64
- func DamerauLevenshtein(a, b string) int
- func DamerauLevenshteinSimilarity(a, b string) float64
- func DiceSimilarity(a, b string) float64
- func DoubleMetaphone(s string) (primary, alternate string)
- func DoubleMetaphoneMatch(a, b string) bool
- func Hamming(a, b string) (int, error)
- func HammingSimilarity(a, b string) float64
- func JaccardSimilarity(a, b string) float64
- func JaroSimilarity(a, b string) float64
- func JaroWinklerSimilarity(a, b string) float64
- func LCS(a, b string) int
- func LCSDistance(a, b string) int
- func LCSSimilarity(a, b string) float64
- func Levenshtein(a, b string) int
- func LevenshteinSimilarity(a, b string) float64
- func Metaphone(s string) string
- func MetaphoneMatch(a, b string) bool
- func NYSIIS(s string) string
- func NYSIISMatch(a, b string) bool
- func OSA(a, b string) int
- func OSASimilarity(a, b string) float64
- func OverlapSimilarity(a, b string) float64
- func Soundex(s string) string
- func SoundexMatch(a, b string) bool
- type DamerauLevenshteinMetric
- type DistanceMetric
- type DoubleMetaphoneEncoder
- type DualEncoder
- type Encoder
- type HammingMetric
- type JaroMetric
- type JaroWinklerMetric
- type LCSMetric
- type LevenshteinMetric
- type Match
- type MetaphoneEncoder
- type Metric
- type NYSIISEncoder
- type NgramMetric
- type OSAMetric
- type SoundexEncoder
Functions ¶
func CosineSimilarity ¶
CosineSimilarity returns the cosine similarity between a and b using bigrams.
func DamerauLevenshtein ¶
DamerauLevenshtein returns the true Damerau-Levenshtein distance between a and b.
func DamerauLevenshteinSimilarity ¶
DamerauLevenshteinSimilarity returns the normalized similarity between a and b as a value in [0, 1], based on the true Damerau-Levenshtein distance.
func DiceSimilarity ¶
DiceSimilarity returns the Sorensen-Dice similarity between a and b using bigrams.
func DoubleMetaphone ¶
DoubleMetaphone returns the primary and alternate Double Metaphone codes for the given string using the default code length of 4.
func DoubleMetaphoneMatch ¶
DoubleMetaphoneMatch reports whether two strings share at least one common Double Metaphone code (primary or alternate).
func Hamming ¶
Hamming returns the Hamming distance between two strings. It returns an error if the strings have different rune lengths, since Hamming distance is only defined for strings of equal length.
func HammingSimilarity ¶
HammingSimilarity returns the normalized Hamming similarity between a and b as a value in [0, 1]. Returns 1.0 when both strings are empty. Returns 0.0 when the strings have different rune lengths.
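As a rough illustration of the behaviour described for Hamming and HammingSimilarity, the sketch below compares strings rune by rune. The function names are hypothetical, and the normalization 1 - distance/length is an assumption; the package may normalize differently.

```go
package main

import (
	"errors"
	"fmt"
)

// hamming counts the rune positions at which a and b differ.
// It returns an error when the rune lengths differ.
func hamming(a, b string) (int, error) {
	ra, rb := []rune(a), []rune(b)
	if len(ra) != len(rb) {
		return 0, errors.New("hamming: strings have different rune lengths")
	}
	d := 0
	for i := range ra {
		if ra[i] != rb[i] {
			d++
		}
	}
	return d, nil
}

// hammingSimilarity maps the distance into [0, 1] as 1 - d/len (assumed formula).
// Unequal lengths yield 0.0 and two empty strings yield 1.0.
func hammingSimilarity(a, b string) float64 {
	d, err := hamming(a, b)
	if err != nil {
		return 0
	}
	n := len([]rune(a))
	if n == 0 {
		return 1
	}
	return 1 - float64(d)/float64(n)
}

func main() {
	d, _ := hamming("karolin", "kathrin")
	fmt.Printf("%d %.3f\n", d, hammingSimilarity("karolin", "kathrin")) // prints "3 0.571"
}
```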
func JaccardSimilarity ¶
JaccardSimilarity returns the Jaccard similarity between a and b using bigrams.
func JaroSimilarity ¶
JaroSimilarity returns the Jaro similarity between a and b as a value in [0, 1], where 1.0 means identical. The algorithm considers characters as matching if they appear within a window of max(len(a), len(b))/2 - 1 positions and counts transpositions among matched characters.
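The matching window and transposition count can be sketched as follows. This is a minimal standalone version of the textbook Jaro formula, with a hypothetical function name; it is not taken from the package source.

```go
package main

import "fmt"

// jaro computes the Jaro similarity of a and b over runes.
func jaro(a, b string) float64 {
	ra, rb := []rune(a), []rune(b)
	if len(ra) == 0 && len(rb) == 0 {
		return 1
	}
	if len(ra) == 0 || len(rb) == 0 {
		return 0
	}
	// Match window: max(len(a), len(b))/2 - 1, clamped at zero.
	longest := len(ra)
	if len(rb) > longest {
		longest = len(rb)
	}
	window := longest/2 - 1
	if window < 0 {
		window = 0
	}
	matchedA := make([]bool, len(ra))
	matchedB := make([]bool, len(rb))
	matches := 0
	for i := range ra {
		lo, hi := i-window, i+window+1
		if lo < 0 {
			lo = 0
		}
		if hi > len(rb) {
			hi = len(rb)
		}
		for j := lo; j < hi; j++ {
			if !matchedB[j] && ra[i] == rb[j] {
				matchedA[i], matchedB[j] = true, true
				matches++
				break
			}
		}
	}
	if matches == 0 {
		return 0
	}
	// Walk the matched runes of both strings in order; each mismatch is half a transposition.
	half, j := 0, 0
	for i := range ra {
		if !matchedA[i] {
			continue
		}
		for !matchedB[j] {
			j++
		}
		if ra[i] != rb[j] {
			half++
		}
		j++
	}
	m := float64(matches)
	t := float64(half) / 2
	return (m/float64(len(ra)) + m/float64(len(rb)) + (m-t)/m) / 3
}

func main() {
	fmt.Printf("%.4f\n", jaro("martha", "marhta")) // prints 0.9444
}
```

For "martha" and "marhta" all six characters match and the swapped "th"/"ht" pair counts as one transposition, giving (1 + 1 + 5/6) / 3.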
func JaroWinklerSimilarity ¶
JaroWinklerSimilarity returns the Jaro-Winkler similarity between a and b using default settings (BoostThreshold = 0.7, PrefixSize = 4).
func LCS ¶
LCS returns the length of the longest common subsequence of a and b. It uses O(min(m, n)) space with a two-row optimization.
func LCSDistance ¶
LCSDistance returns the LCS distance between a and b, defined as runeLen(a) + runeLen(b) - 2*LCS(a, b).
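The two-row optimization and the distance formula can be illustrated with a short sketch. The function names below are hypothetical, and the code is a generic dynamic-programming version, not the package's implementation.

```go
package main

import "fmt"

// lcsLength returns the length of the longest common subsequence of a and b.
// It keeps only two rows, each sized to the shorter string.
func lcsLength(a, b string) int {
	ra, rb := []rune(a), []rune(b)
	if len(ra) < len(rb) {
		ra, rb = rb, ra // keep rb as the shorter string
	}
	prev := make([]int, len(rb)+1)
	curr := make([]int, len(rb)+1)
	for i := 1; i <= len(ra); i++ {
		for j := 1; j <= len(rb); j++ {
			switch {
			case ra[i-1] == rb[j-1]:
				curr[j] = prev[j-1] + 1
			case prev[j] >= curr[j-1]:
				curr[j] = prev[j]
			default:
				curr[j] = curr[j-1]
			}
		}
		prev, curr = curr, prev
	}
	return prev[len(rb)]
}

// lcsDistance is runeLen(a) + runeLen(b) - 2*LCS(a, b).
func lcsDistance(a, b string) int {
	return len([]rune(a)) + len([]rune(b)) - 2*lcsLength(a, b)
}

func main() {
	fmt.Println(lcsLength("kitten", "sitting"), lcsDistance("kitten", "sitting")) // prints "4 5"
}
```

Here the common subsequence is "ittn", so the distance is 6 + 7 - 2*4 = 5.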
func LCSSimilarity ¶
LCSSimilarity returns the normalized similarity between a and b based on the LCS length as a value in [0, 1]. Returns 1.0 when both strings are empty.
func Levenshtein ¶
Levenshtein returns the standard Levenshtein distance between a and b using unit costs for all operations.
func LevenshteinSimilarity ¶
LevenshteinSimilarity returns the normalized similarity between a and b as a value in [0, 1] using the standard Levenshtein distance.
func Metaphone ¶
Metaphone returns the Metaphone code for the given string using the default code length of 4.
func MetaphoneMatch ¶
MetaphoneMatch reports whether two strings have the same Metaphone code.
func NYSIIS ¶
NYSIIS returns the NYSIIS code for the given string using the default code length of 6.
func NYSIISMatch ¶
NYSIISMatch reports whether two strings produce the same NYSIIS code.
func OSASimilarity ¶
OSASimilarity returns the normalized similarity between a and b as a value in [0, 1], based on the OSA distance.
func OverlapSimilarity ¶
OverlapSimilarity returns the overlap coefficient between a and b using bigrams.
func Soundex ¶
Soundex returns the American Soundex code for the given string using the default code length of 4.
func SoundexMatch ¶
SoundexMatch reports whether two strings have the same Soundex code using the default code length.
Types ¶
type DamerauLevenshteinMetric ¶
type DamerauLevenshteinMetric struct {
// ASCIIOnly skips rune conversion for faster ASCII processing.
// Produces incorrect results for multi-byte UTF-8 input.
ASCIIOnly bool
}
DamerauLevenshteinMetric computes the true Damerau-Levenshtein distance between two strings. Unlike OSA, it allows unrestricted transpositions, meaning substrings may be edited more than once.
Set ASCIIOnly to true for faster processing of pure-ASCII strings.
func NewDamerauLevenshtein ¶
func NewDamerauLevenshtein() *DamerauLevenshteinMetric
NewDamerauLevenshtein returns a new DamerauLevenshteinMetric instance.
func (*DamerauLevenshteinMetric) Distance ¶
func (m *DamerauLevenshteinMetric) Distance(a, b string) int
Distance returns the true Damerau-Levenshtein distance between a and b. It uses the dynamic-programming algorithm with a DA map that records, for each character, the last row in which it was seen. Time and space are O(m * n).
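A minimal sketch of the unrestricted algorithm with a last-row-seen map is shown below. It follows the commonly published formulation with a bordered matrix; the function and helper names are hypothetical and the code is not taken from the package.

```go
package main

import "fmt"

// minInt returns the smallest of its arguments.
func minInt(first int, rest ...int) int {
	m := first
	for _, v := range rest {
		if v < m {
			m = v
		}
	}
	return m
}

// damerauLevenshtein returns the unrestricted Damerau-Levenshtein distance.
func damerauLevenshtein(a, b string) int {
	ra, rb := []rune(a), []rune(b)
	m, n := len(ra), len(rb)
	maxDist := m + n
	// d has one extra border row and column holding maxDist.
	d := make([][]int, m+2)
	for i := range d {
		d[i] = make([]int, n+2)
	}
	d[0][0] = maxDist
	for i := 0; i <= m; i++ {
		d[i+1][0] = maxDist
		d[i+1][1] = i
	}
	for j := 0; j <= n; j++ {
		d[0][j+1] = maxDist
		d[1][j+1] = j
	}
	da := map[rune]int{} // last row of a in which each rune was seen
	for i := 1; i <= m; i++ {
		db := 0 // last column in this row where the runes matched
		for j := 1; j <= n; j++ {
			i1, j1 := da[rb[j-1]], db
			cost := 1
			if ra[i-1] == rb[j-1] {
				cost = 0
				db = j
			}
			d[i+1][j+1] = minInt(
				d[i][j]+cost, // substitution or match
				d[i+1][j]+1,  // insertion
				d[i][j+1]+1,  // deletion
				d[i1][j1]+(i-i1-1)+1+(j-j1-1), // transposition across a gap
			)
		}
		da[ra[i-1]] = i
	}
	return d[m+1][n+1]
}

func main() {
	fmt.Println(damerauLevenshtein("ca", "abc")) // prints 2
}
```

The pair "ca" and "abc" is the usual example separating the two variants: the unrestricted distance is 2 (transpose to "ac", then insert "b"), whereas OSA needs 3 edits.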
func (*DamerauLevenshteinMetric) Similarity ¶
func (m *DamerauLevenshteinMetric) Similarity(a, b string) float64
Similarity returns the normalized similarity between a and b as a value in [0, 1]. Returns 1.0 when both strings are empty.
type DistanceMetric ¶
DistanceMetric computes both raw edit distance and normalized similarity.
type DoubleMetaphoneEncoder ¶
type DoubleMetaphoneEncoder struct {
// MaxLength is the maximum length of each generated code. Default is 4.
MaxLength int
}
DoubleMetaphoneEncoder implements the Double Metaphone algorithm by Lawrence Philips. It produces two phonetic codes (primary and alternate) that capture different possible pronunciations, accounting for non-English name origins including Germanic, Slavic, Celtic, Greek, Italian, Spanish, and Chinese.
func NewDoubleMetaphone ¶
func NewDoubleMetaphone() *DoubleMetaphoneEncoder
NewDoubleMetaphone returns a new DoubleMetaphoneEncoder with the default code length of 4.
func (*DoubleMetaphoneEncoder) Encode ¶
func (e *DoubleMetaphoneEncoder) Encode(s string) (primary, alternate string)
Encode returns the primary and alternate Double Metaphone codes for the given string. Both codes are empty if the input contains no ASCII letters.
func (*DoubleMetaphoneEncoder) Match ¶
func (e *DoubleMetaphoneEncoder) Match(a, b string) bool
Match reports whether two strings share at least one common Double Metaphone code.
type DualEncoder ¶
DualEncoder produces primary and alternate phonetic encodings.
type HammingMetric ¶
type HammingMetric struct {
// ASCIIOnly skips rune conversion for faster ASCII processing.
// In this mode, length is measured in bytes instead of runes.
ASCIIOnly bool
}
HammingMetric computes the Hamming distance between two strings of equal length. Hamming distance is the number of positions at which the corresponding characters differ.
Set ASCIIOnly to true for faster processing of pure-ASCII strings. In this mode, length is measured in bytes instead of runes.
func (*HammingMetric) Distance ¶
func (m *HammingMetric) Distance(a, b string) int
Distance returns the Hamming distance between a and b. It panics if the strings have different lengths (rune length, or byte length if ASCIIOnly).
func (*HammingMetric) Similarity ¶
func (m *HammingMetric) Similarity(a, b string) float64
Similarity returns the normalized similarity between a and b as a value in [0, 1]. Returns 1.0 when both strings are empty. Returns 0.0 when the strings have different lengths.
type JaroMetric ¶
type JaroMetric struct {
// ASCIIOnly skips rune conversion for faster ASCII processing.
// Produces incorrect results for multi-byte UTF-8 input.
ASCIIOnly bool
}
JaroMetric computes the Jaro similarity between two strings. Jaro similarity is based on the number of matching characters and transpositions.
Set ASCIIOnly to true for faster processing of pure-ASCII strings.
func (*JaroMetric) Similarity ¶
func (m *JaroMetric) Similarity(a, b string) float64
Similarity returns the Jaro similarity between a and b.
type JaroWinklerMetric ¶
type JaroWinklerMetric struct {
// BoostThreshold is the minimum Jaro score required to apply the prefix
// bonus. Default: 0.7.
BoostThreshold float64
// PrefixSize is the maximum number of prefix characters considered for
// the bonus. Default: 4.
PrefixSize int
// ASCIIOnly skips rune conversion for faster ASCII processing.
// Produces incorrect results for multi-byte UTF-8 input.
ASCIIOnly bool
}
JaroWinklerMetric computes the Jaro-Winkler similarity between two strings. Jaro-Winkler extends Jaro with a prefix bonus that increases the score when the strings share a common prefix.
Set ASCIIOnly to true for faster processing of pure-ASCII strings.
func NewJaroWinkler ¶
func NewJaroWinkler() *JaroWinklerMetric
NewJaroWinkler returns a new JaroWinklerMetric with default settings (BoostThreshold = 0.7, PrefixSize = 4).
func (*JaroWinklerMetric) Similarity ¶
func (m *JaroWinklerMetric) Similarity(a, b string) float64
Similarity returns the Jaro-Winkler similarity between a and b as a value in [0, 1], where 1.0 means identical.
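The prefix bonus can be sketched in isolation as a function applied to an already computed Jaro score. The helper name is hypothetical, and the scaling factor of 0.1 is the conventional value, assumed here because this page does not state it.

```go
package main

import "fmt"

// winklerBoost applies the Jaro-Winkler prefix bonus to a Jaro score.
// Scores below boostThreshold are returned unchanged.
func winklerBoost(jaroScore float64, a, b string, boostThreshold float64, prefixSize int) float64 {
	if jaroScore < boostThreshold {
		return jaroScore
	}
	ra, rb := []rune(a), []rune(b)
	// Count the shared prefix, capped at prefixSize runes.
	prefix := 0
	for prefix < len(ra) && prefix < len(rb) && prefix < prefixSize && ra[prefix] == rb[prefix] {
		prefix++
	}
	const scaling = 0.1 // assumed conventional scaling factor
	return jaroScore + float64(prefix)*scaling*(1-jaroScore)
}

func main() {
	// 17.0/18 is the Jaro score of "martha" and "marhta".
	fmt.Printf("%.3f\n", winklerBoost(17.0/18, "martha", "marhta", 0.7, 4)) // prints 0.961
}
```

With the defaults, the shared prefix "mar" adds 3 * 0.1 * (1 - 0.9444) to the Jaro score, which reproduces the ~0.961 shown in Quick Start.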
type LCSMetric ¶
type LCSMetric struct {
// ASCIIOnly skips rune conversion for faster ASCII processing.
// Produces incorrect results for multi-byte UTF-8 input.
ASCIIOnly bool
}
LCSMetric computes the Longest Common Subsequence (LCS) between two strings. The LCS distance is defined as runeLen(a) + runeLen(b) - 2*LCS(a, b), the minimum total number of characters that must be deleted from the two strings to make them equal.
Set ASCIIOnly to true for faster processing of pure-ASCII strings.
func (*LCSMetric) Similarity ¶
Similarity returns the normalized similarity between a and b as a value in [0, 1]. Returns 1.0 when both strings are empty.
type LevenshteinMetric ¶
type LevenshteinMetric struct {
// InsertCost is the cost of inserting a character. Default: 1.
InsertCost int
// DeleteCost is the cost of deleting a character. Default: 1.
DeleteCost int
// ReplaceCost is the cost of replacing a character. Default: 1.
ReplaceCost int
// ASCIIOnly skips rune conversion for faster ASCII processing.
// Produces incorrect results for multi-byte UTF-8 input.
ASCIIOnly bool
}
LevenshteinMetric computes the Levenshtein edit distance between two strings. Levenshtein distance counts the minimum number of single-character insertions, deletions, and substitutions needed to transform one string into the other.
The metric supports configurable operation costs via InsertCost, DeleteCost, and ReplaceCost fields.
Set ASCIIOnly to true for faster processing of pure-ASCII strings. This skips the rune conversion but produces incorrect results for multi-byte UTF-8 input.
func NewLevenshtein ¶
func NewLevenshtein() *LevenshteinMetric
NewLevenshtein returns a new LevenshteinMetric with all operation costs set to 1 (standard Levenshtein distance).
func (*LevenshteinMetric) Distance ¶
func (m *LevenshteinMetric) Distance(a, b string) int
Distance returns the Levenshtein distance between a and b using the configured operation costs. It uses O(min(m, n)) space.
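One way to combine configurable costs with O(min(m, n)) space is a single rolling row over the shorter string. The sketch below uses a hypothetical costs struct in place of the metric's fields and is not the package's code. Note that swapping the operands also swaps the roles of insertion and deletion.

```go
package main

import "fmt"

// costs holds hypothetical per-operation weights.
type costs struct{ ins, del, rep int }

// min3 returns the smallest of three ints.
func min3(a, b, c int) int {
	if b < a {
		a = b
	}
	if c < a {
		a = c
	}
	return a
}

// levenshtein returns the weighted cost of transforming a into b.
func levenshtein(a, b string, c costs) int {
	ra, rb := []rune(a), []rune(b)
	// Keep the row over the shorter string. Transforming b into a instead
	// turns every insertion into a deletion and vice versa.
	if len(rb) > len(ra) {
		ra, rb = rb, ra
		c.ins, c.del = c.del, c.ins
	}
	row := make([]int, len(rb)+1)
	for j := range row {
		row[j] = j * c.ins
	}
	for i := 1; i <= len(ra); i++ {
		diag := row[0] // value of the cell up and to the left
		row[0] = i * c.del
		for j := 1; j <= len(rb); j++ {
			sub := diag
			if ra[i-1] != rb[j-1] {
				sub += c.rep
			}
			diag = row[j]
			row[j] = min3(row[j]+c.del, row[j-1]+c.ins, sub)
		}
	}
	return row[len(rb)]
}

func main() {
	fmt.Println(levenshtein("kitten", "sitting", costs{1, 1, 1})) // prints 3
}
```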
func (*LevenshteinMetric) Similarity ¶
func (m *LevenshteinMetric) Similarity(a, b string) float64
Similarity returns the normalized similarity between a and b as a value in [0, 1]. It normalizes by dividing the distance by the maximum possible distance (max rune length * max operation cost). Returns 1.0 when both strings are empty.
type Match ¶
type Match struct {
// Value is the matched candidate string.
Value string
// Similarity is the similarity score in [0, 1].
Similarity float64
// Index is the position of this candidate in the original slice.
Index int
}
Match represents a candidate string and its similarity score.
func FindAboveThreshold ¶
FindAboveThreshold returns all candidates whose similarity to query meets or exceeds the given threshold, sorted by similarity descending.
func FindBestMatch ¶
FindBestMatch returns the candidate with the highest similarity to query according to the given Metric. If candidates is empty, it returns a Match with Index -1 and Similarity 0.
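The signatures of FindBestMatch and FindAboveThreshold are not shown on this page, so the following sketch only mirrors the described behaviour. The parameter order, the similarityFunc type (standing in for Metric), and the toy prefixSimilarity metric are all assumptions for illustration.

```go
package main

import (
	"fmt"
	"sort"
)

// Match mirrors the struct documented above.
type Match struct {
	Value      string
	Similarity float64
	Index      int
}

// similarityFunc is a hypothetical stand-in for the Metric interface.
type similarityFunc func(a, b string) float64

// findBestMatch returns the highest-scoring candidate, or Index -1 if there are none.
func findBestMatch(query string, candidates []string, sim similarityFunc) Match {
	best := Match{Index: -1}
	for i, c := range candidates {
		if s := sim(query, c); best.Index == -1 || s > best.Similarity {
			best = Match{Value: c, Similarity: s, Index: i}
		}
	}
	return best
}

// findAboveThreshold returns candidates scoring at least threshold, best first.
func findAboveThreshold(query string, candidates []string, sim similarityFunc, threshold float64) []Match {
	var out []Match
	for i, c := range candidates {
		if s := sim(query, c); s >= threshold {
			out = append(out, Match{c, s, i})
		}
	}
	sort.SliceStable(out, func(i, j int) bool { return out[i].Similarity > out[j].Similarity })
	return out
}

// prefixSimilarity is a toy metric: shared prefix length over the longer rune length.
func prefixSimilarity(a, b string) float64 {
	ra, rb := []rune(a), []rune(b)
	longest := len(ra)
	if len(rb) > longest {
		longest = len(rb)
	}
	if longest == 0 {
		return 1
	}
	p := 0
	for p < len(ra) && p < len(rb) && ra[p] == rb[p] {
		p++
	}
	return float64(p) / float64(longest)
}

func main() {
	candidates := []string{"gopher", "golang", "python"}
	fmt.Println(findBestMatch("golang", candidates, prefixSimilarity)) // prints {golang 1 1}
}
```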
type MetaphoneEncoder ¶
type MetaphoneEncoder struct {
// MaxLength is the maximum length of the generated code. Default is 4.
MaxLength int
}
MetaphoneEncoder implements the original Metaphone phonetic algorithm by Lawrence Philips. It transforms an English word into a phonetic key that represents its approximate pronunciation.
func NewMetaphone ¶
func NewMetaphone() *MetaphoneEncoder
NewMetaphone returns a new MetaphoneEncoder with the default code length of 4.
func (*MetaphoneEncoder) Encode ¶
func (e *MetaphoneEncoder) Encode(s string) string
Encode returns the Metaphone code for the given string. It returns an empty string if the input contains no ASCII letters.
func (*MetaphoneEncoder) Match ¶
func (e *MetaphoneEncoder) Match(a, b string) bool
Match reports whether two strings produce the same Metaphone code.
type Metric ¶
Metric computes similarity between two strings. Implementations return values in [0, 1] where 1.0 means identical.
type NYSIISEncoder ¶
type NYSIISEncoder struct {
// MaxLength is the maximum length of the generated code. Default is 6.
// Set to 0 for unlimited length.
MaxLength int
}
NYSIISEncoder implements the New York State Identification and Intelligence System (NYSIIS) phonetic algorithm. It produces a code that groups similar-sounding names together. The algorithm handles common English name patterns and is particularly effective for American names.
func NewNYSIIS ¶
func NewNYSIIS() *NYSIISEncoder
NewNYSIIS returns a new NYSIISEncoder with the default code length of 6.
func (*NYSIISEncoder) Encode ¶
func (e *NYSIISEncoder) Encode(s string) string
Encode returns the NYSIIS code for the given string. It returns an empty string if the input contains no ASCII letters.
func (*NYSIISEncoder) Match ¶
func (e *NYSIISEncoder) Match(a, b string) bool
Match reports whether two strings produce the same NYSIIS code.
type NgramMetric ¶
type NgramMetric struct {
// Size is the n-gram size. Default: 2 (bigrams).
Size int
}
NgramMetric computes string similarity using n-gram based coefficients. It supports Cosine, Jaccard, Sorensen-Dice, and Overlap similarity.
func NewNgram ¶
func NewNgram() *NgramMetric
NewNgram returns a new NgramMetric with Size = 2 (bigrams).
func (*NgramMetric) Cosine ¶
func (m *NgramMetric) Cosine(a, b string) float64
Cosine returns the cosine similarity between a and b using term-frequency n-gram vectors: dot(A, B) / (||A|| * ||B||). Returns 1.0 when both strings are empty, 0.0 when only one is empty.
func (*NgramMetric) Dice ¶
func (m *NgramMetric) Dice(a, b string) float64
Dice returns the Sorensen-Dice similarity between a and b using set semantics: 2 * |intersection| / (|A| + |B|). Returns 1.0 when both strings are empty.
func (*NgramMetric) Jaccard ¶
func (m *NgramMetric) Jaccard(a, b string) float64
Jaccard returns the Jaccard similarity between a and b using set semantics: |intersection| / |union|. Returns 1.0 when both strings are empty.
func (*NgramMetric) Overlap ¶
func (m *NgramMetric) Overlap(a, b string) float64
Overlap returns the overlap coefficient between a and b: |intersection| / min(|A|, |B|). Returns 1.0 when both strings are empty.
func (*NgramMetric) Similarity ¶
func (m *NgramMetric) Similarity(a, b string) float64
Similarity returns the Sorensen-Dice similarity, which is the default similarity coefficient for n-gram based comparison.
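The four coefficients can be compared side by side with a small sketch. The function names are hypothetical, and the handling of inputs shorter than n (treated as having no n-grams and scoring 0.0 unless both strings are empty) is an assumption that may differ from the package.

```go
package main

import (
	"fmt"
	"math"
)

// ngrams returns the count of each n-gram of s, measured in runes.
func ngrams(s string, n int) map[string]int {
	r := []rune(s)
	out := map[string]int{}
	for i := 0; i+n <= len(r); i++ {
		out[string(r[i:i+n])]++
	}
	return out
}

// setCoefficients returns Dice, Jaccard and Overlap using set semantics,
// so only distinct n-grams are counted.
func setCoefficients(a, b string, n int) (dice, jaccard, overlap float64) {
	if a == "" && b == "" {
		return 1, 1, 1
	}
	ga, gb := ngrams(a, n), ngrams(b, n)
	if len(ga) == 0 || len(gb) == 0 {
		return 0, 0, 0
	}
	inter := 0
	for g := range ga {
		if _, ok := gb[g]; ok {
			inter++
		}
	}
	smaller := len(ga)
	if len(gb) < smaller {
		smaller = len(gb)
	}
	i := float64(inter)
	dice = 2 * i / float64(len(ga)+len(gb))
	jaccard = i / float64(len(ga)+len(gb)-inter)
	overlap = i / float64(smaller)
	return dice, jaccard, overlap
}

// cosine uses term-frequency vectors: dot(A, B) / (||A|| * ||B||).
func cosine(a, b string, n int) float64 {
	if a == "" && b == "" {
		return 1
	}
	ga, gb := ngrams(a, n), ngrams(b, n)
	if len(ga) == 0 || len(gb) == 0 {
		return 0
	}
	var dot, na, nb float64
	for g, ca := range ga {
		dot += float64(ca * gb[g])
		na += float64(ca * ca)
	}
	for _, cb := range gb {
		nb += float64(cb * cb)
	}
	return dot / (math.Sqrt(na) * math.Sqrt(nb))
}

func main() {
	d, j, o := setCoefficients("night", "nacht", 2)
	fmt.Printf("%.3f %.3f %.3f %.3f\n", d, j, o, cosine("night", "nacht", 2)) // prints "0.250 0.143 0.250 0.250"
}
```

"night" and "nacht" each have four distinct bigrams and share only "ht", which gives Dice 2/8, Jaccard 1/7, and Overlap 1/4.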
type OSAMetric ¶
type OSAMetric struct {
// ASCIIOnly skips rune conversion for faster ASCII processing.
// Produces incorrect results for multi-byte UTF-8 input.
ASCIIOnly bool
}
OSAMetric computes the Optimal String Alignment (restricted edit) distance between two strings. OSA extends the Levenshtein distance by also counting transpositions of two adjacent characters as a single operation, with the restriction that no substring is edited more than once.
Set ASCIIOnly to true for faster processing of pure-ASCII strings.
func (*OSAMetric) Distance ¶
Distance returns the OSA distance between a and b. It keeps only three rows of the dynamic-programming matrix, so space is O(min(m, n)).
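A three-row OSA computation might look like the sketch below. The function and helper names are hypothetical, and the code is a generic unit-cost version written for illustration rather than the package's implementation.

```go
package main

import "fmt"

// minInt returns the smallest of its arguments.
func minInt(first int, rest ...int) int {
	m := first
	for _, v := range rest {
		if v < m {
			m = v
		}
	}
	return m
}

// osa returns the Optimal String Alignment distance with unit costs.
func osa(a, b string) int {
	ra, rb := []rune(a), []rune(b)
	if len(rb) > len(ra) {
		ra, rb = rb, ra // unit-cost OSA is symmetric, so keep rows over the shorter string
	}
	n := len(rb)
	prev2 := make([]int, n+1) // row i-2, needed for transpositions
	prev := make([]int, n+1)  // row i-1
	curr := make([]int, n+1)  // row i
	for j := range prev {
		prev[j] = j
	}
	for i := 1; i <= len(ra); i++ {
		curr[0] = i
		for j := 1; j <= n; j++ {
			cost := 1
			if ra[i-1] == rb[j-1] {
				cost = 0
			}
			best := minInt(prev[j]+1, curr[j-1]+1, prev[j-1]+cost)
			// Adjacent transposition: the last two runes are swapped.
			if i > 1 && j > 1 && ra[i-1] == rb[j-2] && ra[i-2] == rb[j-1] {
				best = minInt(best, prev2[j-2]+1)
			}
			curr[j] = best
		}
		prev2, prev, curr = prev, curr, prev2
	}
	return prev[n]
}

func main() {
	fmt.Println(osa("ab", "ba"), osa("ca", "abc")) // prints "1 3"
}
```

The result for "ca" and "abc" is 3 because OSA cannot insert "b" between two characters it has already transposed.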
func (*OSAMetric) Similarity ¶
Similarity returns the normalized similarity between a and b as a value in [0, 1]. Returns 1.0 when both strings are empty.
type SoundexEncoder ¶
type SoundexEncoder struct {
// MaxLength is the length of the generated Soundex code. Default is 4.
MaxLength int
}
SoundexEncoder implements the American Soundex phonetic algorithm. By default it maps a name to a four-character code consisting of one letter followed by three digits, enabling approximate matching of names that sound alike.
func NewSoundex ¶
func NewSoundex() *SoundexEncoder
NewSoundex returns a new SoundexEncoder with the default code length of 4.
func (*SoundexEncoder) Encode ¶
func (e *SoundexEncoder) Encode(s string) string
Encode returns the Soundex code for the given string. It returns an empty string if the input contains no ASCII letters.
func (*SoundexEncoder) Match ¶
func (e *SoundexEncoder) Match(a, b string) bool
Match reports whether two strings produce the same Soundex code.
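For reference, the American Soundex rules described for SoundexEncoder can be sketched as below. This standalone version fixes the code length at 4 and ignores non-ASCII bytes; both choices, and the function name, are assumptions rather than the package's behaviour.

```go
package main

import "fmt"

// soundex returns a four-character American Soundex code, or "" if s has no ASCII letters.
func soundex(s string) string {
	// Digit for each letter a..z; '0' marks letters that are not coded.
	const codes = "01230120022455012623010202"
	out := make([]byte, 0, 4)
	var last byte // code of the previous letter that was not h or w
	for i := 0; i < len(s) && len(out) < 4; i++ {
		c := s[i] | 0x20 // fold ASCII letters to lower case
		if c < 'a' || c > 'z' {
			continue // skip anything that is not an ASCII letter
		}
		code := codes[c-'a']
		if len(out) == 0 {
			out = append(out, c-32) // keep the first letter, upper-cased
			last = code
			continue
		}
		if c == 'h' || c == 'w' {
			continue // h and w do not separate letters with the same code
		}
		if code != '0' && code != last {
			out = append(out, code)
		}
		last = code // a vowel resets last, so a repeated code after it is kept
	}
	if len(out) == 0 {
		return ""
	}
	for len(out) < 4 {
		out = append(out, '0') // pad short codes with zeros
	}
	return string(out)
}

func main() {
	fmt.Println(soundex("Robert"), soundex("Rupert"), soundex("Ashcraft")) // prints "R163 R163 A261"
}
```

"Robert" and "Rupert" share the code R163, which is the kind of match SoundexMatch reports.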