ldist

package module
v0.6.0 Latest Latest
Warning

This package is not in the latest version of its module.

Go to latest
Published: May 4, 2026 License: MIT Imports: 4 Imported by: 0

README

ldist

Go Reference Test / CI codecov

A Go implementation of a Levenshtein distance based string matching algorithm. Designed to be easy, fast and flexible to use with no dependencies.

Installation

go get github.com/atomflunder/ldist

Usage

Basic Usage

You can calculate the Levenshtein distance between two strings using the Distance function.
By default, it uses the standard weights for substitution, insertion, and deletion.

import (
    "fmt"
    "github.com/atomflunder/ldist"
)

s1 := "kitten"
s2 := "  SITTING  "

// Gets you the default weights for the Levenshtein algorithm.
// The default weights are: Substitution: 1, Insertion: 1, Deletion: 1.
w := ldist.DefaultWeights()

dist := ldist.Distance(s1, s2, w)
fmt.Printf("Normal Distance: %d\n", dist)
// Output: Normal Distance: 11
Custom Weights

You can also specify custom weights for the Levenshtein algorithm by creating a Weights struct and passing it to the Distance function.
This allows you to assign different costs to substitutions, insertions, and deletions.

import (
    "fmt"
    "github.com/atomflunder/ldist"
)

s1 := "kitten"
s2 := "  SITTING  "

// Create custom weights for the Levenshtein algorithm.
w := ldist.Weights{
    Substitution: 3,
    Insertion: 1,
    Deletion: 2,
}

dist := ldist.Distance(s1, s2, w)
fmt.Printf("Custom Distance: %d\n", dist)
// Output: Custom Distance: 23
With Options

You can also specify some options for pre-processing the strings before calculating the distance.
For example, you can choose to ignore case and whitespace.

import (
    "fmt"
    "github.com/atomflunder/ldist"
)

s1 := "kitten"
s2 := "  SITTING  "

w := ldist.DefaultWeights()

// Uses the ToLowercase and RemoveWhitespace options to pre-process the strings before calculating the distance.
// This would convert "  SITTING  " into "sitting".
dist := ldist.Distance(s1, s2, w, ldist.ToLowercase, ldist.RemoveWhitespace)
fmt.Printf("Distance with Options: %d\n", dist)
// Output: Distance with Options: 3
Normalized Functions

The package also provides normalized versions of the distance and similarity functions, which return values between 0 and 1 based on the maximum possible distance for the given strings.

import (
    "fmt"
    "github.com/atomflunder/ldist"
)

s1 := "kitten"
s2 := "  SITTING  "

w := ldist.DefaultWeights()

// A normalized distance of 0 means the strings are identical, while a normalized distance of 1 means they are completely different.
// The normalized distance is calculated as the actual distance divided by the maximum possible distance for the given strings.
normalizedDist := ldist.NormalizedDistance(s1, s2, w, ldist.ToLowercase, ldist.RemoveWhitespace)

// A normalized similarity of 1 means the strings are identical, while a normalized similarity of 0 means they are completely different.
// The normalized similarity is calculated as 1 - normalized distance.
normalizedSim := ldist.NormalizedSimilarity(s1, s2, w, ldist.ToLowercase, ldist.RemoveWhitespace)

fmt.Printf("Normalized Distance: %.2f\n", normalizedDist)
fmt.Printf("Normalized Similarity: %.2f\n", normalizedSim)
// Output:
// Normalized Distance: 0.23
// Normalized Similarity: 0.77
Partial Similarity

You can also calculate the distance between two strings using partial matching, which finds the best matching substring in the longer string and calculates the distance based on that, with a penalty based on differing lengths.

import (
    "fmt"
    "github.com/atomflunder/ldist"
)

s1 := "test"
s2 := "this is a test"

w := ldist.DefaultWeights()

// The normal similarity for comparision.
normalizedSim := ldist.NormalizedSimilarity(s1, s2, w)

// The partial similarity finds the best matching substring in the longer string and calculates the similarity based on that.
partialSim := ldist.PartialSimilarity(s1, s2, w)

fmt.Printf("Normalized Similarity: %.2f\n", normalizedSim)
fmt.Printf("Partial Similarity: %.2f\n", partialSim)
// Output:
// Normalized Similarity: 0.44
// Partial Similarity: 0.75
Matches

There are also helper functions to check for matches based on a similarity threshold, and to find the best match from a list of candidates.

import (
    "fmt"
    "github.com/atomflunder/ldist"
)

s1 := "test"
s2 := "this is a test"

w := ldist.DefaultWeights()

// You can check if two strings are a match based on a similarity threshold using the Match function.
// You can specify the similarity function to use (e.g. NormalizedSimilarity or PartialSimilarity) and the threshold for a match (e.g. 0.4).
isMatch := ldist.Match(s1, s2, w, 0.4, ldist.NormalizedSimilarity)

fmt.Printf("Is Match: %t\n", isMatch)
// Output:
// Is Match: true

// Or you can find the best match from a list of candidates using the BestMatch function, 
// which returns the candidate with the highest similarity above the given threshold.
bestMatch := ldist.GetBestMatch(s1, []string{"this", "is", "a", "test"}, w, 0.8, ldist.NormalizedSimilarity)

fmt.Printf("Best Match: %+v\n", bestMatch)
// Output:
// Best Match: &{Candidate:test Similarity:1}

// Or you can get all matches above a certain threshold.
// (There is also a GetBestMatchesSorted function to auto-sort the matches by similarity.)
matches := ldist.GetBestMatches(s1, []string{"this", "is", "a", "test", "and", "test2"}, w, 0.8, ldist.NormalizedSimilarity)
fmt.Printf("Matches: %+v\n", matches)
// Output:
// Matches: [{Candidate:test Similarity:1} {Candidate:test2 Similarity:0.8888888888888888}]
Custom Options & Similarity Functions

Any Option is a function that takes in two string pointers to modify them in place. You can create your own custom options to pre-process the strings in any way you like before calculating the distance.

import (
    "fmt"
    "github.com/atomflunder/ldist"
)

s1 := "kitten"
s2 := "  SITTING  "

w := ldist.DefaultWeights()

// Custom option to modify each string in place.
// In a real use case you would probably want to do something more useful.
myOption := func (s1, s2 *string) {
	*s1, *s2 = "Option 1", "Option 2"
}

dist := ldist.Distance(s1, s2, w, myOption)
fmt.Printf("Distance with Custom Option: %d\n", dist)
// Output: Distance with Custom Option: 1

Akin to the Options, you can also create your own custom similarity functions to use with the matching functions.

import (
    "fmt"
    "github.com/atomflunder/ldist"
)

s1 := "test"
sl := []string{"this", "is", "a", "test"}

// Custom similarity function that returns 1 if the strings are exactly equal, and 0 otherwise.
mySimilarity := func(s1, s2 string, w ldist.Weights, options ...ldist.Option) float64 {
    if s1 == s2 {
        return 1
    }
    return 0
}

bestMatch := ldist.GetBestMatch(s1, sl, ldist.DefaultWeights(), 0.5, mySimilarity)
fmt.Printf("Best Match with Custom Similarity: %+v\n", bestMatch)
// Output:
// Best Match with Custom Similarity: &{Candidate:test Similarity:1}

API Reference

Can be found in the GoDoc.

Testing & Coverage

You can run the tests using the following command:

go test -v

Check the code coverage with:

go test -coverprofile -cover.out .
go tool cover -html cover.out -o cover.html

Performance & Benchmarks

You can run the benchmarks with:

go test -bench=.

Results (on my machine):

BenchmarkDistance-32                	11110988	        98.05 ns/op
BenchmarkNormalizedDistance-32      	12607795	        96.33 ns/op
BenchmarkNormalizedSimilarity-32    	12237752	        98.92 ns/op
BenchmarkLongStrings-32             	  981541	         1227 ns/op
BenchmarkSimilarLongStrings-32      	  141498	         8510 ns/op
BenchmarkPartialSimilarity-32       	 4566796	        260.0 ns/op

Contributing

Contributions are welcome! If you have any ideas for improvements or new features, please feel free to open an issue or submit a pull request.

License

This project is licensed under the MIT License. See the LICENSE file for details.

Documentation

Index

Examples

Constants

This section is empty.

Variables

This section is empty.

Functions

func Distance

func Distance(s1, s2 string, weights Weights, opts ...Option) int

Distance calculates the distance between two strings s1 and s2 using the provided weights for substitution, insertion, and deletion. Can use options to modify the input strings before calculating the distance, such as converting to lowercase, removing whitespace, or removing punctuation.

Example
weights := DefaultWeights()

s1 := "kitten"
s2 := "  SITTING  "

fmt.Printf("Normal Distance: %d\n", Distance(s1, s2, weights))
fmt.Printf("Distance With Options: %d\n", Distance(s1, s2, weights, ToLowercase, RemoveWhitespace))

weightsCustom := Weights{
	Substitution: 35,
	Insertion:    1,
	Deletion:     35,
}
s2Clean := "sitting"
fmt.Printf("Distance With Custom Weights: %d\n", Distance(s1, s2Clean, weightsCustom))
Output:
Normal Distance: 11
Distance With Options: 3
Distance With Custom Weights: 71

func Match added in v0.6.0

func Match(s1, s2 string, weights Weights, cutoff float64, similarityFunc SimilarityFunc, opts ...Option) bool

Match returns true if the normalized similarity between s1 and s2 is greater than or equal to the cutoff.

Example
weights := IndelWeights()

fmt.Println(Match("a test", "this is a test", weights, 0.5, NormalizedSimilarity))
fmt.Println(Match("a test", "this is a test", weights, 0.9, NormalizedSimilarity))
Output:
true
false

func NormalizedDistance

func NormalizedDistance(s1, s2 string, weights Weights, opts ...Option) float64

NormalizedDistance calculates the normalized distance between two strings s1 and s2 using the provided weights for substitution, insertion, and deletion. The normalized distance is the distance divided by the sum of the lengths of the two strings, resulting in a value between 0 and 1. This means that the normalized distance is 0 when the strings are identical and approaches 1 as the strings become more different.

Example
weights := DefaultWeights()

s1 := "kitten"
s2 := "sitting"

fmt.Printf("Normalized Distance: %.8f\n", NormalizedDistance(s1, s2, weights))
Output:
Normalized Distance: 0.23076923

func NormalizedSimilarity

func NormalizedSimilarity(s1, s2 string, weights Weights, opts ...Option) float64

NormalizedSimilarity calculates the normalized similarity between two strings s1 and s2 using the provided weights for substitution, insertion, and deletion. The normalized similarity is 1 minus the normalized distance, resulting in a value between 0 and 1. This means that the normalized similarity is 1 when the strings are identical and approaches 0 as the strings become more different.

Example
weights := DefaultWeights()

s1 := "kitten"
s2 := "sitting"

fmt.Printf("Normalized Similarity: %.8f\n", NormalizedSimilarity(s1, s2, weights))
Output:
Normalized Similarity: 0.76923077

func PartialSimilarity added in v0.4.0

func PartialSimilarity(s1, s2 string, weights Weights, opts ...Option) float64

PartialSimilarity calculates the partial similarity between two strings s1 and s2 using the provided weights for substitution, insertion, and deletion. The partial similarity is a measure of how similar the shorter string is to any substring of the longer string, with a penalty based on differing lengths. This may yield more desirable results when comparing strings of vastly differing lengths, depending on the use-case.

Example
weights := IndelWeights()

s1 := "a test"
s2 := "this is a test"

fmt.Printf("Partial Similarity: %.8f\n", PartialSimilarity(s1, s2, weights))
Output:
Partial Similarity: 0.85000000

func RemovePunctuation

func RemovePunctuation(s1, s2 *string)

RemovePunctuation removes common punctuation characters from both strings.

func RemoveWhitespace

func RemoveWhitespace(s1, s2 *string)

RemoveWhitespace removes all whitespace characters from both strings.

func ToAlphanumeric added in v0.2.0

func ToAlphanumeric(s1, s2 *string)

ToAlphanumeric removes all non-alphanumeric characters from both strings.

func ToLowercase

func ToLowercase(s1, s2 *string)

ToLowercase converts both strings to lowercase.

Types

type BestMatchResult added in v0.6.0

type BestMatchResult struct {
	Candidate  string  `json:"candidate"`
	Similarity float64 `json:"similarity"`
}

BestMatchResult represents a candidate string and its similarity score to the input string.

func GetBestMatch added in v0.6.0

func GetBestMatch(s1 string, candidates []string, weights Weights, cutoff float64, similarityFunc SimilarityFunc, opts ...Option) *BestMatchResult

GetBestMatch returns the candidate string with the highest similarity to s1 that meets the cutoff threshold. If no candidate meets the cutoff, it returns nil.

Example
weights := IndelWeights()

candidates := []string{"kitten", "sitting", "bitten", "written"}
result := GetBestMatch("kittens", candidates, weights, 0.5, NormalizedSimilarity)

if result != nil {
	fmt.Printf("Best Match: %s (Similarity: %.8f)\n", result.Candidate, result.Similarity)
}

noMatch := GetBestMatch("xyz", candidates, weights, 0.99, NormalizedSimilarity)
fmt.Printf("No Match: %v\n", noMatch)
Output:
Best Match: kitten (Similarity: 0.92307692)
No Match: <nil>

func GetBestMatches added in v0.6.0

func GetBestMatches(s1 string, candidates []string, weights Weights, cutoff float64, similarityFunc SimilarityFunc, opts ...Option) []BestMatchResult

GetBestMatches returns a slice of BestMatchResult for all candidates that have a similarity to s1 greater than or equal to the cutoff.

Example
weights := IndelWeights()

candidates := []string{"kitten", "sitting", "bitten", "written"}
results := GetBestMatches("kittens", candidates, weights, 0.5, NormalizedSimilarity)

fmt.Printf("Matches: %d\n", len(results))
for _, r := range results {
	fmt.Printf("  %s: %.8f\n", r.Candidate, r.Similarity)
}
Output:
Matches: 4
  kitten: 0.92307692
  sitting: 0.57142857
  bitten: 0.76923077
  written: 0.71428571

func GetBestMatchesSorted added in v0.6.0

func GetBestMatchesSorted(s1 string, candidates []string, weights Weights, cutoff float64, similarityFunc SimilarityFunc, opts ...Option) []BestMatchResult

GetBestMatchesSorted returns a slice of BestMatchResult for all candidates that have a similarity to s1 greater than or equal to the cutoff, sorted in descending order of similarity.

Example
weights := IndelWeights()

candidates := []string{"kitten", "sitting", "sitting", "bitten", "written"}
results := GetBestMatchesSorted("kittens", candidates, weights, 0.5, NormalizedSimilarity)

fmt.Printf("Sorted Matches: %d\n", len(results))
for _, r := range results {
	fmt.Printf("  %s: %.8f\n", r.Candidate, r.Similarity)
}
Output:
Sorted Matches: 5
  kitten: 0.92307692
  bitten: 0.76923077
  written: 0.71428571
  sitting: 0.57142857
  sitting: 0.57142857

type Option

type Option func(s1, s2 *string)

Option is a type alias for a function that takes two string pointers which should modify them in place.

type SimilarityFunc added in v0.6.0

type SimilarityFunc func(s1, s2 string, weights Weights, opts ...Option) float64

type Weights

type Weights struct {
	// Substitution is the cost of substituting one character for another.
	// By default it should be set to 1.
	Substitution int `json:"substitution"`
	// Insertion is the cost of inserting a character into a string.
	// By default it should be set to 1.
	Insertion int `json:"insertion"`
	// Deletion is the cost of deleting a character from a string.
	// By default it should be set to 1.
	Deletion int `json:"deletion"`
}

Weights defines the weights for substitution, insertion, and deletion operations.

func DefaultWeights added in v0.6.0

func DefaultWeights() Weights

DefaultWeights returns the default distance weights for substitution, insertion, and deletion. Substitution: 1, Insertion: 1, Deletion: 1

func IndelWeights added in v0.6.0

func IndelWeights() Weights

IndelWeights returns the distance weights for substitution, insertion, and deletion where substitutions are more expensive than insertions and deletions. Substitution: 2, Insertion: 1, Deletion: 1

Jump to

Keyboard shortcuts

? : This menu
/ : Search site
f or F : Jump to
y or Y : Canonical URL