strsim

Version: v0.1.0 (latest) | Published: Mar 16, 2026 | License: MIT | Imports: 6 | Imported by: 0

Repository

github.com/jcoruiz/strsim

README ¶

strsim

Comprehensive string similarity metrics for Go. Edit distance, token-based similarity, and phonetic encoding — all in one package with a unified API.

Features

15+ algorithms in a single zero-dependency package
Unified interfaces: every metric implements Metric (similarity) or DistanceMetric (distance + similarity)
Correct Unicode handling: all algorithms operate on runes, not bytes, by default
Safe edge-case handling: two empty strings return 1.0 similarity (never NaN); HammingSimilarity returns 0.0 for strings of different lengths
Batch operations: FindBestMatch, FindTopN, FindAboveThreshold
Zero dependencies: stdlib only

Quick Start

go get github.com/jcoruiz/strsim

package main

import (
    "fmt"
    "github.com/jcoruiz/strsim"
)

func main() {
    // Edit distance
    fmt.Println(strsim.Levenshtein("kitten", "sitting"))  // 3

    // Normalized similarity [0, 1]
    fmt.Println(strsim.JaroWinklerSimilarity("martha", "marhta"))  // ~0.961

    // Phonetic matching
    fmt.Println(strsim.SoundexMatch("Robert", "Rupert"))  // true

    // Find best match from candidates
    candidates := []string{"golang", "gold", "golem", "gopher", "python"}
    match := strsim.FindBestMatch("go", candidates, strsim.NewJaroWinkler())
    fmt.Printf("%s (%.2f)\n", match.Value, match.Similarity)
}

Algorithms

Edit Distance

Algorithm	Function	Returns
Hamming	`Hamming(a, b)`	`int, error` (equal-length only)
Levenshtein	`Levenshtein(a, b)`	`int`
Optimal String Alignment	`OSA(a, b)`	`int`
Damerau-Levenshtein	`DamerauLevenshtein(a, b)`	`int`
Longest Common Subsequence	`LCSDistance(a, b)`	`int`

All edit distance functions have a corresponding *Similarity variant returning a normalized float64 in [0, 1].

Token-based / Set-based Similarity

Algorithm	Function	Notes
Jaro	`JaroSimilarity(a, b)`	Matching characters + transpositions
Jaro-Winkler	`JaroWinklerSimilarity(a, b)`	Jaro + prefix bonus
Cosine (n-gram)	`CosineSimilarity(a, b)`	TF-weighted n-gram vectors
Jaccard (n-gram)	`JaccardSimilarity(a, b)`	Set intersection / union
Sørensen-Dice (n-gram)	`DiceSimilarity(a, b)`	2·|intersection| / (|A| + |B|)
Overlap Coefficient	`OverlapSimilarity(a, b)`	|intersection| / min(|A|, |B|)

Phonetic Encoding

Algorithm	Function	Returns
American Soundex	`Soundex(s)`	`string` (4-char code)
Metaphone	`Metaphone(s)`	`string`
Double Metaphone	`DoubleMetaphone(s)`	`string, string` (primary, alternate)
NYSIIS	`NYSIIS(s)`	`string`

All phonetic functions have a corresponding *Match(a, b) variant returning bool.

Interfaces

Every metric implements at least one interface, making them interchangeable:

// Similarity metric — returns [0, 1] where 1.0 = identical.
type Metric interface {
    Similarity(a, b string) float64
}

// Distance metric — also returns raw edit distance.
type DistanceMetric interface {
    Metric
    Distance(a, b string) int
}

// Phonetic encoder.
type Encoder interface {
    Encode(s string) string
}

Using interfaces

// Use any Metric interchangeably
func findSimilar(query string, items []string, m strsim.Metric) {
    for _, item := range items {
        if m.Similarity(query, item) > 0.8 {
            fmt.Println(item)
        }
    }
}

// Works with any metric
findSimilar("golang", items, strsim.NewJaroWinkler())
findSimilar("golang", items, strsim.NewLevenshtein())
findSimilar("golang", items, strsim.NewDamerauLevenshtein())

Batch Operations

candidates := []string{"golang", "gold", "golem", "gopher", "python", "ruby"}
m := strsim.NewJaroWinkler()

// Best single match
best := strsim.FindBestMatch("go", candidates, m)

// Top N matches, sorted by similarity descending
top3 := strsim.FindTopN("go", candidates, 3, m)

// All matches above threshold
matches := strsim.FindAboveThreshold("go", candidates, 0.7, m)

Configurable Metrics

Most algorithms have configurable variants:

// Custom Levenshtein costs
lev := &strsim.LevenshteinMetric{
    InsertCost:  1,
    DeleteCost:  1,
    ReplaceCost: 2,
}
dist := lev.Distance("kitten", "sitting")

// Custom Jaro-Winkler parameters
jw := &strsim.JaroWinklerMetric{
    BoostThreshold: 0.7,
    PrefixSize:     4,
}
sim := jw.Similarity("martha", "marhta")

// Custom n-gram size
ng := &strsim.NgramMetric{Size: 3}  // trigrams instead of bigrams
sim = ng.Cosine("night", "nacht")

// Custom phonetic code length
enc := &strsim.SoundexEncoder{MaxLength: 6}
code := enc.Encode("Washington")

ASCII Fast Path

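The edit-distance and Jaro metrics (HammingMetric, LevenshteinMetric, OSAMetric, DamerauLevenshteinMetric, LCSMetric, JaroMetric, JaroWinklerMetric) support an ASCIIOnly mode that skips rune conversion for faster processing of pure-ASCII input. NgramMetric and the phonetic encoders have no such field. This mode produces incorrect results for multi-byte UTF-8 strings, so use it only when you know your input is ASCII (identifiers, codes, URLs, plain English text).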

// Skips rune conversion; the speedup depends on the algorithm (see the table below)
m := &strsim.LevenshteinMetric{
    InsertCost: 1, DeleteCost: 1, ReplaceCost: 1,
    ASCIIOnly: true,
}
dist := m.Distance("kitten", "sitting")

// Works with batch operations too
jw := &strsim.JaroWinklerMetric{
    BoostThreshold: 0.7, PrefixSize: 4,
    ASCIIOnly: true,
}
match := strsim.FindBestMatch("query", candidates, jw)

Algorithm	Rune (ns/op)	ASCII (ns/op)	Speedup
Hamming	113	10	11x
Damerau-Levenshtein	17,337	7,113	2.4x
Jaro-Winkler	474	341	1.4x
LCS	1,574	1,373	1.15x
Levenshtein	3,039	2,885	1.05x

Inputs: 43-char ASCII strings.
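To see why a byte-oriented fast path is only safe for ASCII input, the following sketch counts differing positions once per byte and once per rune. Both helpers (byteDiff, runeDiff) are hypothetical and written for this illustration; they are not taken from the library.

```go
package main

import "fmt"

// byteDiff counts differing positions byte by byte, roughly what a
// byte-oriented fast path would do. Hypothetical helper, not library code.
// It returns -1 when the byte lengths differ.
func byteDiff(a, b string) int {
	if len(a) != len(b) {
		return -1
	}
	d := 0
	for i := 0; i < len(a); i++ {
		if a[i] != b[i] {
			d++
		}
	}
	return d
}

// runeDiff counts differing positions rune by rune.
// It returns -1 when the rune lengths differ.
func runeDiff(a, b string) int {
	ra, rb := []rune(a), []rune(b)
	if len(ra) != len(rb) {
		return -1
	}
	d := 0
	for i := range ra {
		if ra[i] != rb[i] {
			d++
		}
	}
	return d
}

func main() {
	// Pure ASCII: both views agree.
	fmt.Println(byteDiff("abc", "abd"), runeDiff("abc", "abd")) // 1 1

	// U+65E5 and U+672C are one rune each but three bytes each in UTF-8,
	// and two of those three bytes differ.
	fmt.Println(byteDiff("\u65e5", "\u672c"), runeDiff("\u65e5", "\u672c")) // 2 1
}
```

For ASCII strings the two counts always agree, which is what makes the fast path safe there.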

Benchmarks

Measured on AMD Ryzen, Go 1.22 (go test -bench=. -benchmem):

Algorithm	ns/op	B/op	allocs/op
Soundex	30	21	3
Hamming	39	0	0
Jaro-Winkler	48	16	2
Jaro	54	32	2
Metaphone	56	32	3
NYSIIS	86	64	4
Double Metaphone	134	32	4
LCS	534	448	2
Cosine (bigram)	850	507	18
Levenshtein	1017	448	2
OSA	1030	672	3
Jaccard (bigram)	1172	1084	22

Inputs: "kitten"/"sitting" for edit distance, "Schneider" for phonetic, "night"/"nacht" for n-gram.

Why strsim?

The Go ecosystem for string similarity is fragmented across 8+ libraries, each with a subset of algorithms, inconsistent APIs, and known bugs:

go-edlib — NaN on empty strings, float32 precision, no phonetic
strutil — NaN on empty strings, negative similarity with custom costs
smetrics — operates on bytes not runes (broken Unicode), Soundex panics on empty input
matchr — most complete phonetic support but GPLv3 licensed

strsim consolidates everything into one MIT-licensed package with correct Unicode handling, consistent APIs, and no edge-case surprises.

License

MIT

Documentation ¶

Overview ¶

Package strsim provides comprehensive string similarity metrics, distance functions, and phonetic encoding algorithms.

It combines edit-distance, token-based, and phonetic algorithms in a single package with a unified API. Every metric returns a normalized similarity in [0, 1], where 1.0 means identical; distance metrics also return the raw edit distance.

Quick Start ¶

// Edit distance
d := strsim.Levenshtein("kitten", "sitting")  // 3

// Normalized similarity [0, 1]
s := strsim.JaroWinklerSimilarity("martha", "marhta")  // ~0.961

// Phonetic encoding
code := strsim.Soundex("Robert")  // "R163"

// Find best match from a list
m := strsim.FindBestMatch("golang", candidates, strsim.NewJaroWinkler())

Interfaces ¶

All similarity metrics implement the Metric interface, allowing them to be used interchangeably. Distance metrics additionally implement DistanceMetric. Phonetic encoders implement the Encoder interface.

Index ¶

func CosineSimilarity(a, b string) float64
func DamerauLevenshtein(a, b string) int
func DamerauLevenshteinSimilarity(a, b string) float64
func DiceSimilarity(a, b string) float64
func DoubleMetaphone(s string) (primary, alternate string)
func DoubleMetaphoneMatch(a, b string) bool
func Hamming(a, b string) (int, error)
func HammingSimilarity(a, b string) float64
func JaccardSimilarity(a, b string) float64
func JaroSimilarity(a, b string) float64
func JaroWinklerSimilarity(a, b string) float64
func LCS(a, b string) int
func LCSDistance(a, b string) int
func LCSSimilarity(a, b string) float64
func Levenshtein(a, b string) int
func LevenshteinSimilarity(a, b string) float64
func Metaphone(s string) string
func MetaphoneMatch(a, b string) bool
func NYSIIS(s string) string
func NYSIISMatch(a, b string) bool
func OSA(a, b string) int
func OSASimilarity(a, b string) float64
func OverlapSimilarity(a, b string) float64
func Soundex(s string) string
func SoundexMatch(a, b string) bool
type DamerauLevenshteinMetric
- func NewDamerauLevenshtein() *DamerauLevenshteinMetric
- func (m *DamerauLevenshteinMetric) Distance(a, b string) int
- func (m *DamerauLevenshteinMetric) Similarity(a, b string) float64
type DistanceMetric
type DoubleMetaphoneEncoder
- func NewDoubleMetaphone() *DoubleMetaphoneEncoder
- func (e *DoubleMetaphoneEncoder) Encode(s string) (primary, alternate string)
- func (e *DoubleMetaphoneEncoder) Match(a, b string) bool
type DualEncoder
type Encoder
type HammingMetric
- func NewHamming() *HammingMetric
- func (m *HammingMetric) Distance(a, b string) int
- func (m *HammingMetric) Similarity(a, b string) float64
type JaroMetric
- func NewJaro() *JaroMetric
- func (m *JaroMetric) Similarity(a, b string) float64
type JaroWinklerMetric
- func NewJaroWinkler() *JaroWinklerMetric
- func (m *JaroWinklerMetric) Similarity(a, b string) float64
type LCSMetric
- func NewLCS() *LCSMetric
- func (m *LCSMetric) Distance(a, b string) int
- func (m *LCSMetric) Similarity(a, b string) float64
type LevenshteinMetric
- func NewLevenshtein() *LevenshteinMetric
- func (m *LevenshteinMetric) Distance(a, b string) int
- func (m *LevenshteinMetric) Similarity(a, b string) float64
type Match
- func FindAboveThreshold(query string, candidates []string, threshold float64, m Metric) []Match
- func FindBestMatch(query string, candidates []string, m Metric) Match
- func FindTopN(query string, candidates []string, n int, m Metric) []Match
type MetaphoneEncoder
- func NewMetaphone() *MetaphoneEncoder
- func (e *MetaphoneEncoder) Encode(s string) string
- func (e *MetaphoneEncoder) Match(a, b string) bool
type Metric
type NYSIISEncoder
- func NewNYSIIS() *NYSIISEncoder
- func (e *NYSIISEncoder) Encode(s string) string
- func (e *NYSIISEncoder) Match(a, b string) bool
type NgramMetric
- func NewNgram() *NgramMetric
- func (m *NgramMetric) Cosine(a, b string) float64
- func (m *NgramMetric) Dice(a, b string) float64
- func (m *NgramMetric) Jaccard(a, b string) float64
- func (m *NgramMetric) Overlap(a, b string) float64
- func (m *NgramMetric) Similarity(a, b string) float64
type OSAMetric
- func NewOSA() *OSAMetric
- func (m *OSAMetric) Distance(a, b string) int
- func (m *OSAMetric) Similarity(a, b string) float64
type SoundexEncoder
- func NewSoundex() *SoundexEncoder
- func (e *SoundexEncoder) Encode(s string) string
- func (e *SoundexEncoder) Match(a, b string) bool

Constants ¶

This section is empty.

Variables ¶

This section is empty.

Functions ¶

func CosineSimilarity ¶

func CosineSimilarity(a, b string) float64

CosineSimilarity returns the cosine similarity between a and b using bigrams.

func DamerauLevenshtein ¶

func DamerauLevenshtein(a, b string) int

DamerauLevenshtein returns the true Damerau-Levenshtein distance between a and b.

func DamerauLevenshteinSimilarity ¶

func DamerauLevenshteinSimilarity(a, b string) float64

DamerauLevenshteinSimilarity returns the normalized similarity between a and b using the true Damerau-Levenshtein distance as a value in [0, 1].

func DiceSimilarity ¶

func DiceSimilarity(a, b string) float64

DiceSimilarity returns the Sorensen-Dice similarity between a and b using bigrams.

func DoubleMetaphone ¶

func DoubleMetaphone(s string) (primary, alternate string)

DoubleMetaphone returns the primary and alternate Double Metaphone codes for the given string using the default code length of 4.

func DoubleMetaphoneMatch ¶

func DoubleMetaphoneMatch(a, b string) bool

DoubleMetaphoneMatch reports whether two strings share at least one common Double Metaphone code (primary or alternate).

func Hamming ¶

func Hamming(a, b string) (int, error)

Hamming returns the Hamming distance between two strings. It returns an error if the strings have different rune lengths, since Hamming distance is only defined for strings of equal length.
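As an illustration of the rune-length check described above, here is a minimal sketch. The lowercase hamming function is a stand-in written for this example and is not the library's source.

```go
package main

import (
	"errors"
	"fmt"
)

// hamming is an illustrative stand-in, not the library's implementation.
// It compares rune by rune and rejects inputs of different rune length.
func hamming(a, b string) (int, error) {
	ra, rb := []rune(a), []rune(b)
	if len(ra) != len(rb) {
		return 0, errors.New("hamming: strings must have equal rune length")
	}
	d := 0
	for i := range ra {
		if ra[i] != rb[i] {
			d++
		}
	}
	return d, nil
}

func main() {
	d, err := hamming("karolin", "kathrin")
	fmt.Println(d, err) // 3 <nil>

	// "h\u00e9llo" is 5 runes (6 bytes); "hello!" is 6 runes (6 bytes).
	// A rune-based check rejects the pair even though the byte lengths match.
	_, err = hamming("h\u00e9llo", "hello!")
	fmt.Println(err != nil) // true
}
```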

func HammingSimilarity ¶

func HammingSimilarity(a, b string) float64

HammingSimilarity returns the normalized Hamming similarity between a and b as a value in [0, 1]. Returns 1.0 when both strings are empty. Returns 0.0 when the strings have different rune lengths.

func JaccardSimilarity ¶

func JaccardSimilarity(a, b string) float64

JaccardSimilarity returns the Jaccard similarity between a and b using bigrams.

func JaroSimilarity ¶

func JaroSimilarity(a, b string) float64

JaroSimilarity returns the Jaro similarity between a and b as a value in [0, 1], where 1.0 means identical. The algorithm considers characters as matching if they appear within a window of max(len(a), len(b))/2 - 1 positions and counts transpositions among matched characters.

func JaroWinklerSimilarity ¶

func JaroWinklerSimilarity(a, b string) float64

JaroWinklerSimilarity returns the Jaro-Winkler similarity between a and b using default settings (BoostThreshold = 0.7, PrefixSize = 4).
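The sketch below shows one common way to compute Jaro and Jaro-Winkler with the defaults named above (BoostThreshold = 0.7, PrefixSize = 4). It is not the library's code. The prefix scaling factor of 0.1 and the use of >= for the threshold comparison are assumptions taken from the usual textbook formulation.

```go
package main

import "fmt"

func maxInt(a, b int) int {
	if a > b {
		return a
	}
	return b
}

// jaro is an illustrative rune-based Jaro similarity.
func jaro(a, b string) float64 {
	ra, rb := []rune(a), []rune(b)
	la, lb := len(ra), len(rb)
	if la == 0 && lb == 0 {
		return 1
	}
	if la == 0 || lb == 0 {
		return 0
	}
	window := maxInt(la, lb)/2 - 1
	if window < 0 {
		window = 0
	}
	matchedA, matchedB := make([]bool, la), make([]bool, lb)
	matches := 0
	for i := 0; i < la; i++ {
		lo, hi := maxInt(0, i-window), i+window+1
		if hi > lb {
			hi = lb
		}
		for j := lo; j < hi; j++ {
			if !matchedB[j] && ra[i] == rb[j] {
				matchedA[i], matchedB[j] = true, true
				matches++
				break
			}
		}
	}
	if matches == 0 {
		return 0
	}
	// Walk both match sequences in order and count positions that disagree.
	outOfOrder, j := 0, 0
	for i := 0; i < la; i++ {
		if !matchedA[i] {
			continue
		}
		for !matchedB[j] {
			j++
		}
		if ra[i] != rb[j] {
			outOfOrder++
		}
		j++
	}
	m, t := float64(matches), float64(outOfOrder/2)
	return (m/float64(la) + m/float64(lb) + (m-t)/m) / 3
}

// jaroWinkler adds a prefix bonus on top of jaro.
// The scale of 0.1 and the >= comparison are assumptions.
func jaroWinkler(a, b string, boostThreshold float64, prefixSize int) float64 {
	j := jaro(a, b)
	if j < boostThreshold {
		return j
	}
	ra, rb := []rune(a), []rune(b)
	prefix := 0
	for prefix < prefixSize && prefix < len(ra) && prefix < len(rb) && ra[prefix] == rb[prefix] {
		prefix++
	}
	return j + float64(prefix)*0.1*(1-j)
}

func main() {
	fmt.Printf("%.4f\n", jaro("martha", "marhta"))                // 0.9444
	fmt.Printf("%.4f\n", jaroWinkler("martha", "marhta", 0.7, 4)) // 0.9611
}
```

With "martha" and "marhta" all six characters match, one transposition is counted, and the shared prefix "mar" contributes the bonus.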

func LCS ¶

func LCS(a, b string) int

LCS returns the length of the longest common subsequence of a and b. It uses O(min(m, n)) space with a two-row optimization.

func LCSDistance ¶

func LCSDistance(a, b string) int

LCSDistance returns the LCS distance between a and b, defined as runeLen(a) + runeLen(b) - 2*LCS(a, b).
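A two-row computation of the LCS length and the distance formula above might look like the following sketch. The function names lcsLen and lcsDistance are placeholders for this example and do not come from the library.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// lcsLen returns the length of the longest common subsequence using two
// rows sized by the shorter input. Illustrative only.
func lcsLen(a, b string) int {
	ra, rb := []rune(a), []rune(b)
	if len(ra) < len(rb) {
		ra, rb = rb, ra // keep rb as the shorter string
	}
	prev := make([]int, len(rb)+1)
	curr := make([]int, len(rb)+1)
	for i := 1; i <= len(ra); i++ {
		for j := 1; j <= len(rb); j++ {
			switch {
			case ra[i-1] == rb[j-1]:
				curr[j] = prev[j-1] + 1
			case prev[j] >= curr[j-1]:
				curr[j] = prev[j]
			default:
				curr[j] = curr[j-1]
			}
		}
		prev, curr = curr, prev
	}
	return prev[len(rb)]
}

// lcsDistance applies runeLen(a) + runeLen(b) - 2*LCS(a, b).
func lcsDistance(a, b string) int {
	return utf8.RuneCountInString(a) + utf8.RuneCountInString(b) - 2*lcsLen(a, b)
}

func main() {
	fmt.Println(lcsLen("kitten", "sitting"))      // 4 ("ittn")
	fmt.Println(lcsDistance("kitten", "sitting")) // 5
}
```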

func LCSSimilarity ¶

func LCSSimilarity(a, b string) float64

LCSSimilarity returns the normalized similarity between a and b based on the LCS length as a value in [0, 1]. Returns 1.0 when both strings are empty.

func Levenshtein ¶

func Levenshtein(a, b string) int

Levenshtein returns the standard Levenshtein distance between a and b using unit costs for all operations.

func LevenshteinSimilarity ¶

func LevenshteinSimilarity(a, b string) float64

LevenshteinSimilarity returns the normalized similarity between a and b as a value in [0, 1] using the standard Levenshtein distance.

func Metaphone ¶

func Metaphone(s string) string

Metaphone returns the Metaphone code for the given string using the default code length of 4.

func MetaphoneMatch ¶

func MetaphoneMatch(a, b string) bool

MetaphoneMatch reports whether two strings have the same Metaphone code.

func NYSIIS ¶

func NYSIIS(s string) string

NYSIIS returns the NYSIIS code for the given string using the default code length of 6.

func NYSIISMatch ¶

func NYSIISMatch(a, b string) bool

NYSIISMatch reports whether two strings produce the same NYSIIS code.

func OSA ¶

func OSA(a, b string) int

OSA returns the Optimal String Alignment distance between a and b.

func OSASimilarity ¶

func OSASimilarity(a, b string) float64

OSASimilarity returns the normalized similarity between a and b using the OSA distance as a value in [0, 1].

func OverlapSimilarity ¶

func OverlapSimilarity(a, b string) float64

OverlapSimilarity returns the overlap coefficient between a and b using bigrams.

func Soundex ¶

func Soundex(s string) string

Soundex returns the American Soundex code for the given string using the default code length of 4.

func SoundexMatch ¶

func SoundexMatch(a, b string) bool

SoundexMatch reports whether two strings have the same Soundex code using the default code length.

Types ¶

type DamerauLevenshteinMetric ¶

type DamerauLevenshteinMetric struct {
	// ASCIIOnly skips rune conversion for faster ASCII processing.
	// Produces incorrect results for multi-byte UTF-8 input.
	ASCIIOnly bool
}

DamerauLevenshteinMetric computes the true Damerau-Levenshtein distance between two strings. Unlike OSA, it allows unrestricted transpositions, meaning substrings may be edited more than once.

Set ASCIIOnly to true for faster processing of pure-ASCII strings.

func NewDamerauLevenshtein ¶

func NewDamerauLevenshtein() *DamerauLevenshteinMetric

NewDamerauLevenshtein returns a new DamerauLevenshteinMetric instance.

func (*DamerauLevenshteinMetric) Distance ¶

func (m *DamerauLevenshteinMetric) Distance(a, b string) int

Distance returns the true Damerau-Levenshtein distance between a and b using the algorithm with a DA (last-row-seen) map. Time and space are O(m * n).
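For readers unfamiliar with the DA-map formulation, here is a sketch of the textbook unrestricted Damerau-Levenshtein algorithm. It is written from the standard description, not copied from the library, and all names are local to the example.

```go
package main

import "fmt"

func minInt(vals ...int) int {
	m := vals[0]
	for _, v := range vals[1:] {
		if v < m {
			m = v
		}
	}
	return m
}

// damerauLevenshtein computes the unrestricted distance in O(m*n) time and space.
func damerauLevenshtein(a, b string) int {
	ra, rb := []rune(a), []rune(b)
	m, n := len(ra), len(rb)
	maxDist := m + n

	// d carries an extra leading row and column filled with maxDist so that
	// the transposition term is never chosen before a rune has been seen.
	d := make([][]int, m+2)
	for i := range d {
		d[i] = make([]int, n+2)
	}
	d[0][0] = maxDist
	for i := 0; i <= m; i++ {
		d[i+1][0], d[i+1][1] = maxDist, i
	}
	for j := 0; j <= n; j++ {
		d[0][j+1], d[1][j+1] = maxDist, j
	}

	da := make(map[rune]int) // last row (1-based) in which each rune of a occurred
	for i := 1; i <= m; i++ {
		db := 0 // last column in this row where ra[i-1] matched
		for j := 1; j <= n; j++ {
			i1, j1 := da[rb[j-1]], db
			cost := 1
			if ra[i-1] == rb[j-1] {
				cost, db = 0, j
			}
			d[i+1][j+1] = minInt(
				d[i][j]+cost, // substitution or match
				d[i+1][j]+1,  // insertion
				d[i][j+1]+1,  // deletion
				d[i1][j1]+(i-i1-1)+1+(j-j1-1), // transposition across a gap
			)
		}
		da[ra[i-1]] = i
	}
	return d[m+1][n+1]
}

func main() {
	fmt.Println(damerauLevenshtein("ab", "ba"))  // 1
	fmt.Println(damerauLevenshtein("CA", "ABC")) // 2
}
```

The pair "CA"/"ABC" is the usual example that separates the two variants: the unrestricted distance is 2 (transpose to "AC", then insert "B"), while OSA needs 3 edits because it may not edit the transposed substring again.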

func (*DamerauLevenshteinMetric) Similarity ¶

func (m *DamerauLevenshteinMetric) Similarity(a, b string) float64

Similarity returns the normalized similarity between a and b as a value in [0, 1]. Returns 1.0 when both strings are empty.

type DistanceMetric ¶

type DistanceMetric interface {
	Metric
	Distance(a, b string) int
}

DistanceMetric computes both raw edit distance and normalized similarity.

type DoubleMetaphoneEncoder ¶

type DoubleMetaphoneEncoder struct {
	// MaxLength is the maximum length of each generated code. Default is 4.
	MaxLength int
}

DoubleMetaphoneEncoder implements the Double Metaphone algorithm by Lawrence Philips. It produces two phonetic codes (primary and alternate) that capture different possible pronunciations, accounting for non-English name origins including Germanic, Slavic, Celtic, Greek, Italian, Spanish, and Chinese.

func NewDoubleMetaphone ¶

func NewDoubleMetaphone() *DoubleMetaphoneEncoder

NewDoubleMetaphone returns a new DoubleMetaphoneEncoder with the default code length of 4.

func (*DoubleMetaphoneEncoder) Encode ¶

func (e *DoubleMetaphoneEncoder) Encode(s string) (primary, alternate string)

Encode returns the primary and alternate Double Metaphone codes for the given string. Both codes are empty if the input contains no ASCII letters.

func (*DoubleMetaphoneEncoder) Match ¶

func (e *DoubleMetaphoneEncoder) Match(a, b string) bool

Match reports whether two strings share at least one common Double Metaphone code.

type DualEncoder ¶

type DualEncoder interface {
	Encode(s string) (primary, alternate string)
}

DualEncoder produces primary and alternate phonetic encodings.

type Encoder ¶

type Encoder interface {
	Encode(s string) string
}

Encoder produces a phonetic encoding of a string.

type HammingMetric ¶

type HammingMetric struct {
	// ASCIIOnly skips rune conversion for faster ASCII processing.
	// In this mode, length is measured in bytes instead of runes.
	ASCIIOnly bool
}

HammingMetric computes the Hamming distance between two strings of equal length. Hamming distance is the number of positions at which the corresponding characters differ.

Set ASCIIOnly to true for faster processing of pure-ASCII strings. In this mode, length is measured in bytes instead of runes.

func NewHamming ¶

func NewHamming() *HammingMetric

NewHamming returns a new HammingMetric instance.

func (*HammingMetric) Distance ¶

func (m *HammingMetric) Distance(a, b string) int

Distance returns the Hamming distance between a and b. It panics if the strings have different lengths (rune length, or byte length if ASCIIOnly).

func (*HammingMetric) Similarity ¶

func (m *HammingMetric) Similarity(a, b string) float64

Similarity returns the normalized similarity between a and b as a value in [0, 1]. Returns 1.0 when both strings are empty. Returns 0.0 when the strings have different lengths.

type JaroMetric ¶

type JaroMetric struct {
	// ASCIIOnly skips rune conversion for faster ASCII processing.
	// Produces incorrect results for multi-byte UTF-8 input.
	ASCIIOnly bool
}

JaroMetric computes the Jaro similarity between two strings. Jaro similarity is based on the number of matching characters and transpositions.

Set ASCIIOnly to true for faster processing of pure-ASCII strings.

func NewJaro ¶

func NewJaro() *JaroMetric

NewJaro returns a new JaroMetric instance.

func (*JaroMetric) Similarity ¶

func (m *JaroMetric) Similarity(a, b string) float64

Similarity returns the Jaro similarity between a and b.

type JaroWinklerMetric ¶

type JaroWinklerMetric struct {
	// BoostThreshold is the minimum Jaro score required to apply the prefix
	// bonus. Default: 0.7.
	BoostThreshold float64

	// PrefixSize is the maximum number of prefix characters considered for
	// the bonus. Default: 4.
	PrefixSize int

	// ASCIIOnly skips rune conversion for faster ASCII processing.
	// Produces incorrect results for multi-byte UTF-8 input.
	ASCIIOnly bool
}

JaroWinklerMetric computes the Jaro-Winkler similarity between two strings. Jaro-Winkler extends Jaro with a prefix bonus that increases the score when the strings share a common prefix.

Set ASCIIOnly to true for faster processing of pure-ASCII strings.

func NewJaroWinkler ¶

func NewJaroWinkler() *JaroWinklerMetric

NewJaroWinkler returns a new JaroWinklerMetric with default settings (BoostThreshold = 0.7, PrefixSize = 4).

func (*JaroWinklerMetric) Similarity ¶

func (m *JaroWinklerMetric) Similarity(a, b string) float64

Similarity returns the Jaro-Winkler similarity between a and b as a value in [0, 1], where 1.0 means identical.

type LCSMetric ¶

type LCSMetric struct {
	// ASCIIOnly skips rune conversion for faster ASCII processing.
	// Produces incorrect results for multi-byte UTF-8 input.
	ASCIIOnly bool
}

LCSMetric computes the Longest Common Subsequence (LCS) between two strings. The LCS distance is defined as len(a) + len(b) - 2*LCS(a, b), with lengths counted in runes, and represents the minimum number of characters that must be deleted from both strings to make them equal.

Set ASCIIOnly to true for faster processing of pure-ASCII strings.

func NewLCS ¶

func NewLCS() *LCSMetric

NewLCS returns a new LCSMetric instance.

func (*LCSMetric) Distance ¶

func (m *LCSMetric) Distance(a, b string) int

Distance returns the LCS distance between a and b.

func (*LCSMetric) Similarity ¶

func (m *LCSMetric) Similarity(a, b string) float64

Similarity returns the normalized similarity between a and b as a value in [0, 1]. Returns 1.0 when both strings are empty.

type LevenshteinMetric ¶

type LevenshteinMetric struct {
	// InsertCost is the cost of inserting a character. Default: 1.
	InsertCost int
	// DeleteCost is the cost of deleting a character. Default: 1.
	DeleteCost int
	// ReplaceCost is the cost of replacing a character. Default: 1.
	ReplaceCost int
	// ASCIIOnly skips rune conversion for faster ASCII processing.
	// Produces incorrect results for multi-byte UTF-8 input.
	ASCIIOnly bool
}

LevenshteinMetric computes the Levenshtein edit distance between two strings. Levenshtein distance counts the minimum number of single-character insertions, deletions, and substitutions needed to transform one string into the other.

The metric supports configurable operation costs via InsertCost, DeleteCost, and ReplaceCost fields.

Set ASCIIOnly to true for faster processing of pure-ASCII strings. This skips the rune conversion but produces incorrect results for multi-byte UTF-8 input.

func NewLevenshtein ¶

func NewLevenshtein() *LevenshteinMetric

NewLevenshtein returns a new LevenshteinMetric with all operation costs set to 1 (standard Levenshtein distance).

func (*LevenshteinMetric) Distance ¶

func (m *LevenshteinMetric) Distance(a, b string) int

Distance returns the Levenshtein distance between a and b using the configured operation costs. It uses O(min(m, n)) space.

func (*LevenshteinMetric) Similarity ¶

func (m *LevenshteinMetric) Similarity(a, b string) float64

Similarity returns the normalized similarity between a and b as a value in [0, 1]. It normalizes by dividing the distance by the maximum possible distance (max rune length * max operation cost). Returns 1.0 when both strings are empty.
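The normalization described above can be sketched as follows. The costs type and both functions are hypothetical stand-ins for this example. For simplicity the sketch sizes its two rows by len(b) instead of min(m, n), and it assumes non-negative costs.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// costs holds per-operation weights (hypothetical local type).
type costs struct{ ins, del, rep int }

func min3(a, b, c int) int {
	if b < a {
		a = b
	}
	if c < a {
		a = c
	}
	return a
}

// levenshtein returns the weighted cost of turning a into b, using two rows.
func levenshtein(a, b string, c costs) int {
	ra, rb := []rune(a), []rune(b)
	prev := make([]int, len(rb)+1)
	curr := make([]int, len(rb)+1)
	for j := range prev {
		prev[j] = j * c.ins
	}
	for i := 1; i <= len(ra); i++ {
		curr[0] = i * c.del
		for j := 1; j <= len(rb); j++ {
			sub := prev[j-1]
			if ra[i-1] != rb[j-1] {
				sub += c.rep
			}
			curr[j] = min3(sub, prev[j]+c.del, curr[j-1]+c.ins)
		}
		prev, curr = curr, prev
	}
	return prev[len(rb)]
}

// similarity divides the distance by (max rune length * max operation cost).
func similarity(a, b string, c costs) float64 {
	maxLen := utf8.RuneCountInString(a)
	if n := utf8.RuneCountInString(b); n > maxLen {
		maxLen = n
	}
	maxCost := min3(-c.ins, -c.del, -c.rep) * -1 // largest of the three costs
	if maxLen == 0 || maxCost == 0 {
		return 1 // both strings empty, or every edit is free
	}
	return 1 - float64(levenshtein(a, b, c))/float64(maxLen*maxCost)
}

func main() {
	unit := costs{ins: 1, del: 1, rep: 1}
	heavy := costs{ins: 1, del: 1, rep: 2}
	fmt.Println(levenshtein("kitten", "sitting", unit))         // 3
	fmt.Println(levenshtein("kitten", "sitting", heavy))        // 5
	fmt.Printf("%.3f\n", similarity("kitten", "sitting", unit)) // 0.571
	fmt.Printf("%.3f\n", similarity("kitten", "sitting", heavy)) // 0.643
}
```

Dividing by the largest possible distance keeps the result in [0, 1] even when the operation costs differ.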

type Match ¶

type Match struct {
	// Value is the matched candidate string.
	Value string
	// Similarity is the similarity score in [0, 1].
	Similarity float64
	// Index is the position of this candidate in the original slice.
	Index int
}

Match represents a candidate string and its similarity score.

func FindAboveThreshold ¶

func FindAboveThreshold(query string, candidates []string, threshold float64, m Metric) []Match

FindAboveThreshold returns all candidates whose similarity to query meets or exceeds the given threshold, sorted by similarity descending.

func FindBestMatch ¶

func FindBestMatch(query string, candidates []string, m Metric) Match

FindBestMatch returns the candidate with the highest similarity to query according to the given Metric. If candidates is empty, it returns a Match with Index -1 and Similarity 0.

func FindTopN ¶

func FindTopN(query string, candidates []string, n int, m Metric) []Match

FindTopN returns the top n candidates with the highest similarity to query, sorted by similarity descending. If n exceeds the number of candidates, all candidates are returned.
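A ranking helper of this kind can be sketched as score, sort, truncate. The Metric and Match types below are local copies that mirror the shapes documented here; prefixMetric is a toy metric invented for the example, and the stable sort used for tie-breaking is an assumption, not documented library behavior.

```go
package main

import (
	"fmt"
	"sort"
)

// Metric and Match are local copies of the documented shapes.
type Metric interface {
	Similarity(a, b string) float64
}

type Match struct {
	Value      string
	Similarity float64
	Index      int
}

// prefixMetric is a toy metric: shared prefix length over the longer length.
type prefixMetric struct{}

func (prefixMetric) Similarity(a, b string) float64 {
	ra, rb := []rune(a), []rune(b)
	longest := len(ra)
	if len(rb) > longest {
		longest = len(rb)
	}
	if longest == 0 {
		return 1
	}
	p := 0
	for p < len(ra) && p < len(rb) && ra[p] == rb[p] {
		p++
	}
	return float64(p) / float64(longest)
}

// topN scores every candidate, sorts by similarity descending, and keeps n.
func topN(query string, candidates []string, n int, m Metric) []Match {
	matches := make([]Match, 0, len(candidates))
	for i, c := range candidates {
		matches = append(matches, Match{Value: c, Similarity: m.Similarity(query, c), Index: i})
	}
	// Stable sort keeps the original order among equal scores (assumption).
	sort.SliceStable(matches, func(i, j int) bool {
		return matches[i].Similarity > matches[j].Similarity
	})
	if n < 0 {
		n = 0
	}
	if n < len(matches) {
		matches = matches[:n]
	}
	return matches
}

func main() {
	candidates := []string{"golang", "gold", "golem", "gopher", "python"}
	for _, m := range topN("go", candidates, 3, prefixMetric{}) {
		fmt.Printf("%s %.2f (index %d)\n", m.Value, m.Similarity, m.Index)
	}
	// gold 0.50 (index 1)
	// golem 0.40 (index 2)
	// golang 0.33 (index 0)
}
```

Because the helper only depends on the Metric interface, any similarity implementation can be swapped in without changing the ranking code.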

type MetaphoneEncoder ¶

type MetaphoneEncoder struct {
	// MaxLength is the maximum length of the generated code. Default is 4.
	MaxLength int
}

MetaphoneEncoder implements the original Metaphone phonetic algorithm by Lawrence Philips. It transforms an English word into a phonetic key that represents its approximate pronunciation.

func NewMetaphone ¶

func NewMetaphone() *MetaphoneEncoder

NewMetaphone returns a new MetaphoneEncoder with the default code length of 4.

func (*MetaphoneEncoder) Encode ¶

func (e *MetaphoneEncoder) Encode(s string) string

Encode returns the Metaphone code for the given string. It returns an empty string if the input contains no ASCII letters.

func (*MetaphoneEncoder) Match ¶

func (e *MetaphoneEncoder) Match(a, b string) bool

Match reports whether two strings produce the same Metaphone code.

type Metric ¶

type Metric interface {
	Similarity(a, b string) float64
}

Metric computes similarity between two strings. Implementations return values in [0, 1] where 1.0 means identical.

type NYSIISEncoder ¶

type NYSIISEncoder struct {
	// MaxLength is the maximum length of the generated code. Default is 6.
	// Set to 0 for unlimited length.
	MaxLength int
}

NYSIISEncoder implements the New York State Identification and Intelligence System (NYSIIS) phonetic algorithm. It produces a code that groups similar-sounding names together. The algorithm handles common English name patterns and is particularly effective for American names.

func NewNYSIIS ¶

func NewNYSIIS() *NYSIISEncoder

NewNYSIIS returns a new NYSIISEncoder with the default code length of 6.

func (*NYSIISEncoder) Encode ¶

func (e *NYSIISEncoder) Encode(s string) string

Encode returns the NYSIIS code for the given string. It returns an empty string if the input contains no ASCII letters.

func (*NYSIISEncoder) Match ¶

func (e *NYSIISEncoder) Match(a, b string) bool

Match reports whether two strings produce the same NYSIIS code.

type NgramMetric ¶

type NgramMetric struct {
	// Size is the n-gram size. Default: 2 (bigrams).
	Size int
}

NgramMetric computes string similarity using n-gram based coefficients. It supports Cosine, Jaccard, Sorensen-Dice, and Overlap similarity.

func NewNgram ¶

func NewNgram() *NgramMetric

NewNgram returns a new NgramMetric with Size = 2 (bigrams).

func (*NgramMetric) Cosine ¶

func (m *NgramMetric) Cosine(a, b string) float64

Cosine returns the cosine similarity between a and b using term-frequency n-gram vectors: dot(A, B) / (||A|| * ||B||). Returns 1.0 when both strings are empty, 0.0 when only one is empty.
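The formula can be sketched with plain maps of n-gram counts. This is an illustration, not the library's code. It assumes no padding at the string edges, so a string shorter than n yields no n-grams, and it treats identical inputs as similarity 1.0 as a shortcut.

```go
package main

import (
	"fmt"
	"math"
)

// ngramCounts returns term frequencies of rune n-grams (n >= 1, no padding).
func ngramCounts(s string, n int) map[string]int {
	r := []rune(s)
	counts := make(map[string]int)
	for i := 0; i+n <= len(r); i++ {
		counts[string(r[i:i+n])]++
	}
	return counts
}

// cosine computes dot(A, B) / (||A|| * ||B||) over n-gram count vectors.
func cosine(a, b string, n int) float64 {
	if a == b {
		return 1 // covers two empty strings and identical short strings
	}
	ca, cb := ngramCounts(a, n), ngramCounts(b, n)
	if len(ca) == 0 || len(cb) == 0 {
		return 0
	}
	dot, normA, normB := 0, 0, 0
	for g, x := range ca {
		dot += x * cb[g] // a missing key contributes 0
		normA += x * x
	}
	for _, y := range cb {
		normB += y * y
	}
	return float64(dot) / (math.Sqrt(float64(normA)) * math.Sqrt(float64(normB)))
}

func main() {
	// night: ni ig gh ht; nacht: na ac ch ht; one shared bigram out of four each.
	fmt.Printf("%.2f\n", cosine("night", "nacht", 2)) // 0.25
	fmt.Printf("%.2f\n", cosine("", "", 2))           // 1.00
	fmt.Printf("%.2f\n", cosine("abc", "", 2))        // 0.00
}
```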

func (*NgramMetric) Dice ¶

func (m *NgramMetric) Dice(a, b string) float64

Dice returns the Sorensen-Dice similarity between a and b using set semantics: 2 * |intersection| / (|A| + |B|). Returns 1.0 when both strings are empty.

func (*NgramMetric) Jaccard ¶

func (m *NgramMetric) Jaccard(a, b string) float64

Jaccard returns the Jaccard similarity between a and b using set semantics: |intersection| / |union|. Returns 1.0 when both strings are empty.

func (*NgramMetric) Overlap ¶

func (m *NgramMetric) Overlap(a, b string) float64

Overlap returns the overlap coefficient between a and b: |intersection| / min(|A|, |B|). Returns 1.0 when both strings are empty.
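The three set-based coefficients differ only in their denominators, which the sketch below makes explicit. The helper names are invented for this example, and the lack of edge padding is an assumption about how n-grams are extracted.

```go
package main

import "fmt"

// ngramSet returns the distinct rune n-grams of s (n >= 1, no padding).
func ngramSet(s string, n int) map[string]struct{} {
	r := []rune(s)
	set := make(map[string]struct{})
	for i := 0; i+n <= len(r); i++ {
		set[string(r[i:i+n])] = struct{}{}
	}
	return set
}

// coefficients returns Jaccard, Dice, and overlap for the n-gram sets of a and b.
func coefficients(a, b string, n int) (jaccard, dice, overlap float64) {
	if a == "" && b == "" {
		return 1, 1, 1
	}
	sa, sb := ngramSet(a, n), ngramSet(b, n)
	if len(sa) == 0 || len(sb) == 0 {
		return 0, 0, 0
	}
	inter := 0
	for g := range sa {
		if _, ok := sb[g]; ok {
			inter++
		}
	}
	smaller := len(sa)
	if len(sb) < smaller {
		smaller = len(sb)
	}
	i := float64(inter)
	jaccard = i / float64(len(sa)+len(sb)-inter) // |A ∩ B| / |A ∪ B|
	dice = 2 * i / float64(len(sa)+len(sb))      // 2|A ∩ B| / (|A| + |B|)
	overlap = i / float64(smaller)               // |A ∩ B| / min(|A|, |B|)
	return jaccard, dice, overlap
}

func main() {
	j, d, o := coefficients("night", "nacht", 2)
	fmt.Printf("%.3f %.3f %.3f\n", j, d, o) // 0.143 0.250 0.250
}
```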

func (*NgramMetric) Similarity ¶

func (m *NgramMetric) Similarity(a, b string) float64

Similarity returns the Sorensen-Dice similarity, which is the default similarity coefficient for n-gram based comparison.

type OSAMetric ¶

type OSAMetric struct {
	// ASCIIOnly skips rune conversion for faster ASCII processing.
	// Produces incorrect results for multi-byte UTF-8 input.
	ASCIIOnly bool
}

OSAMetric computes the Optimal String Alignment (restricted edit) distance between two strings. OSA extends the Levenshtein distance by also counting transpositions of two adjacent characters as a single operation, with the restriction that no substring is edited more than once.

Set ASCIIOnly to true for faster processing of pure-ASCII strings.

func NewOSA ¶

func NewOSA() *OSAMetric

NewOSA returns a new OSAMetric instance.

func (*OSAMetric) Distance ¶

func (m *OSAMetric) Distance(a, b string) int

Distance returns the OSA distance between a and b. It keeps only three rows of the dynamic-programming table, so space is O(min(m, n)).
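A three-row OSA computation could be sketched as below. This follows the standard restricted-edit recurrence with unit costs and is not taken from the library; the function names are local to the example.

```go
package main

import "fmt"

func min3(a, b, c int) int {
	if b < a {
		a = b
	}
	if c < a {
		a = c
	}
	return a
}

// osa returns the Optimal String Alignment distance with unit costs.
// Only the current row and the two rows before it are kept.
func osa(a, b string) int {
	ra, rb := []rune(a), []rune(b)
	if len(ra) < len(rb) {
		ra, rb = rb, ra // unit costs are symmetric, so swapping is safe
	}
	n := len(rb)
	prev2 := make([]int, n+1) // row i-2
	prev := make([]int, n+1)  // row i-1
	curr := make([]int, n+1)  // row i
	for j := range prev {
		prev[j] = j
	}
	for i := 1; i <= len(ra); i++ {
		curr[0] = i
		for j := 1; j <= n; j++ {
			cost := 1
			if ra[i-1] == rb[j-1] {
				cost = 0
			}
			v := min3(prev[j]+1, curr[j-1]+1, prev[j-1]+cost)
			// Adjacent transposition: the last two runes are swapped.
			if i > 1 && j > 1 && ra[i-1] == rb[j-2] && ra[i-2] == rb[j-1] {
				if t := prev2[j-2] + 1; t < v {
					v = t
				}
			}
			curr[j] = v
		}
		prev2, prev, curr = prev, curr, prev2
	}
	return prev[n]
}

func main() {
	fmt.Println(osa("ab", "ba"))  // 1
	fmt.Println(osa("CA", "ABC")) // 3
}
```

The "CA"/"ABC" pair costs 3 here because the restricted variant cannot transpose and then edit the same substring again.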

func (*OSAMetric) Similarity ¶

func (m *OSAMetric) Similarity(a, b string) float64

Similarity returns the normalized similarity between a and b as a value in [0, 1]. Returns 1.0 when both strings are empty.

type SoundexEncoder ¶

type SoundexEncoder struct {
	// MaxLength is the length of the generated Soundex code. Default is 4.
	MaxLength int
}

SoundexEncoder implements the American Soundex phonetic algorithm. By default it maps a name to a four-character code consisting of one letter followed by three digits, enabling approximate matching of names that sound alike. MaxLength changes the code length.
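For reference, the usual American Soundex rules can be sketched in a few lines. This is an independent illustration with a fixed code length of 4; it does not reproduce the library's source or its MaxLength handling.

```go
package main

import (
	"fmt"
	"strings"
)

// soundexDigit maps an uppercase ASCII letter to its Soundex digit,
// or 0 for vowels and other uncoded letters.
func soundexDigit(r rune) byte {
	switch r {
	case 'B', 'F', 'P', 'V':
		return '1'
	case 'C', 'G', 'J', 'K', 'Q', 'S', 'X', 'Z':
		return '2'
	case 'D', 'T':
		return '3'
	case 'L':
		return '4'
	case 'M', 'N':
		return '5'
	case 'R':
		return '6'
	}
	return 0
}

// soundex returns a 4-character code, or "" if s has no ASCII letters.
func soundex(s string) string {
	const length = 4
	out := make([]byte, 0, length)
	var last byte // digit of the previous coded letter, 0 after a vowel
	for _, r := range strings.ToUpper(s) {
		if r < 'A' || r > 'Z' {
			continue // ignore anything that is not an ASCII letter
		}
		digit := soundexDigit(r)
		if len(out) == 0 {
			out = append(out, byte(r)) // keep the first letter as is
			last = digit
			continue
		}
		if r == 'H' || r == 'W' {
			continue // H and W do not separate equal digits
		}
		if digit == 0 {
			last = 0 // a vowel separates equal digits
			continue
		}
		if digit != last {
			out = append(out, digit)
			if len(out) == length {
				break
			}
		}
		last = digit
	}
	if len(out) == 0 {
		return ""
	}
	for len(out) < length {
		out = append(out, '0') // pad short codes with zeros
	}
	return string(out)
}

func main() {
	fmt.Println(soundex("Robert"), soundex("Rupert")) // R163 R163
	fmt.Println(soundex("Ashcraft"))                  // A261
	fmt.Println(soundex("Tymczak"))                   // T522
}
```

"Ashcraft" shows the H/W rule (S and C collapse across the H) and "Tymczak" shows a vowel separating equal digits.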

func NewSoundex ¶

func NewSoundex() *SoundexEncoder

NewSoundex returns a new SoundexEncoder with the default code length of 4.

func (*SoundexEncoder) Encode ¶

func (e *SoundexEncoder) Encode(s string) string

Encode returns the Soundex code for the given string. It returns an empty string if the input contains no ASCII letters.

func (*SoundexEncoder) Match ¶

func (e *SoundexEncoder) Match(a, b string) bool

Match reports whether two strings produce the same Soundex code.

Directories ¶

Path	Synopsis
examples/basic (command)	Package main demonstrates basic usage of the strsim library.