edlib

package module
v1.6.0 Latest Latest
Warning

This package is not in the latest version of its module.

Go to latest
Published: Jan 31, 2022 License: MIT Imports: 5 Imported by: 40

README

Go-edlib : Edit distance and string comparison library

Coverage Status Go Report Card PkgGoDev

Golang string comparison and edit distance algorithms library featuring : Levenshtein, LCS, Hamming, Damerau levenshtein (OSA and Adjacent transpositions algorithms), Jaro-Winkler, Cosine, etc...


Table of Contents


Requirements

  • Go (v1.13+)

Introduction

Golang open-source library which includes most (and soon all) edit-distance and string comparision algorithms with some extra!
Designed to be fully compatible with Unicode characters!
This library is 100% test covered 😁

Features

Benchmarks

You can check an interactive Google chart with few benchmark cases for all similarity algorithms in this library through StringsSimilarity function here

However, if you want or need more details, you can also viewing benchmark raw output here, which also includes memory allocations and test cases output (similarity result and errors).

If you are on Linux and want to run them on your setup, you can run ./tests/benchmark.sh script.

Installation

Open bash into your project folder and run:

go get github.com/hbollon/go-edlib

And import it into your project:

import (
	"github.com/hbollon/go-edlib"
)

Run tests

If you are on Linux and want to run all unit tests just run ./tests/tests.sh script.

For Windows users you can run:

go test ./... # Add desired parameters to this command if you want

Documentation

You can find all the documentation here : Documentation

Examples

Calculate string similarity index between two string

You can use StringSimilarity(str1, str2, algorithm) function. algorithm parameter must one of the following constants:

// Algorithm identifiers
const (
	Levenshtein Algorithm = iota
	DamerauLevenshtein
	OSADamerauLevenshtein
	Lcs
	Hamming
	Jaro
	JaroWinkler
	Cosine
)

Example with levenshtein:

res, err := edlib.StringsSimilarity("string1", "string2", edlib.Levenshtein)
if err != nil {
  fmt.Println(err)
} else {
  fmt.Printf("Similarity: %f", res)
}

Execute fuzzy search based on string similarity algorithm

1. Most matching unique result without threshold

You can use FuzzySearch(str, strList, algorithm) function.

strList := []string{"test", "tester", "tests", "testers", "testing", "tsting", "sting"}
res, err := edlib.FuzzySearch("testnig", strList, edlib.Levenshtein)
if err != nil {
  fmt.Println(err)
} else {
  fmt.Printf("Result: %s", res)
}

Result: testing 
2. Most matching unique result with threshold

You can use FuzzySearchThreshold(str, strList, minSimilarity, algorithm) function.

strList := []string{"test", "tester", "tests", "testers", "testing", "tsting", "sting"}
res, err := edlib.FuzzySearchThreshold("testnig", strList, 0.7, edlib.Levenshtein)
if err != nil {
  fmt.Println(err)
} else {
  fmt.Printf("Result for 'testnig': %s", res)
}

res, err = edlib.FuzzySearchThreshold("hello", strList, 0.7, edlib.Levenshtein)
if err != nil {
  fmt.Println(err)
} else {
  fmt.Printf("Result for 'hello': %s", res)
}

Result for 'testnig': testing
Result for 'hello':
3. Most matching result set without threshold

You can use FuzzySearchSet(str, strList, resultQuantity, algorithm) function.

strList := []string{"test", "tester", "tests", "testers", "testing", "tsting", "sting"}
res, err := edlib.FuzzySearchSet("testnig", strList, 3, edlib.Levenshtein)
if err != nil {
  fmt.Println(err)
} else {
  fmt.Printf("Results: %s", strings.Join(res, ", "))
}

Results: testing, test, tester 
4. Most matching result set with threshold

You can use FuzzySearchSetThreshold(str, strList, resultQuantity, minSimilarity, algorithm) function.

strList := []string{"test", "tester", "tests", "testers", "testing", "tsting", "sting"}
res, err := edlib.FuzzySearchSetThreshold("testnig", strList, 3, 0.5, edlib.Levenshtein)
if err != nil {
  fmt.Println(err)
} else {
  fmt.Printf("Result for 'testnig' with '0.5' threshold: %s", strings.Join(res, " "))
}

res, err = edlib.FuzzySearchSetThreshold("testnig", strList, 3, 0.7, edlib.Levenshtein)
if err != nil {
  fmt.Println(err)
} else {
  fmt.Printf("Result for 'testnig' with '0.7' threshold: %s", strings.Join(res, " "))
}

Result for 'testnig' with '0.5' threshold: testing test tester
Result for 'testnig' with '0.7' threshold: testing

Get raw edit distance (Levenshtein, LCS, Damerau–Levenshtein, Hamming)

You can use one of the following function to get an edit distance between two strings :

Example with Levenshtein distance:

res := edlib.LevenshteinDistance("kitten", "sitting")
fmt.Printf("Result: %d", res)
Result: 3

LCS, LCS Backtrack and LCS Diff

1. Compute LCS(Longuest Common Subsequence) between two strings

You can use LCS(str1, str2) function.

lcs := edlib.LCS("ABCD", "ACBAD")
fmt.Printf("Length of their LCS: %d", lcs)
Length of their LCS: 3
2. Backtrack their LCS

You can use LCSBacktrack(str1, str2) function.

res, err := edlib.LCSBacktrack("ABCD", "ACBAD")
if err != nil {
  fmt.Println(err)
} else {
  fmt.Printf("LCS: %s", res)
}
LCS: ABD
3. Backtrack all their LCS

You can use LCSBacktrackAll(str1, str2) function.

res, err := edlib.LCSBacktrackAll("ABCD", "ACBAD")
if err != nil {
  fmt.Println(err)
} else {
  fmt.Printf("LCS: %s", strings.Join(res, ", "))
}
LCS: ABD, ACD
4. Get LCS Diff between two strings

You can use LCSDiff(str1, str2) function.

res, err := edlib.LCSDiff("computer", "houseboat")
if err != nil {
  fmt.Println(err)
} else {
  fmt.Printf("LCS: \n%s\n%s", res[0], res[1])
}
LCS Diff: 
 h c o m p u s e b o a t e r
 + -   - -   + + + + +   - -

Author

👤 Hugo Bollon

🤝 Contributing

Contributions, issues and feature requests are welcome!
Feel free to check issues page.

Show your support

Give a ⭐️ if this project helped you!

📝 License

Copyright © 2020 Hugo Bollon.
This project is MIT License licensed.

Documentation

Index

Constants

This section is empty.

Variables

This section is empty.

Functions

func CosineSimilarity added in v1.3.0

func CosineSimilarity(str1, str2 string, splitLength int) float32

CosineSimilarity use cosine algorithm to return a similarity index between string vectors Takes two strings as parameters, a split length which define the k-gram single length (if zero split string on whitespaces) and return an index.

func DamerauLevenshteinDistance

func DamerauLevenshteinDistance(str1, str2 string) int

DamerauLevenshteinDistance calculate the distance between two string This algorithm computes the true Damerau–Levenshtein distance with adjacent transpositions Allowing insertions, deletions, substitutions and transpositions to change one string to the second Compatible with non-ASCII characters

func FuzzySearch added in v1.2.0

func FuzzySearch(str string, strList []string, algo Algorithm) (string, error)

FuzzySearch realize an approximate search on a string list and return the closest one compared to the string input

func FuzzySearchSet added in v1.2.0

func FuzzySearchSet(str string, strList []string, quantity int, algo Algorithm) ([]string, error)

FuzzySearchSet realize an approximate search on a string list and return a set composed with x strings compared to the string input sorted by similarity with the base string. Takes the a quantity parameter to define the number of output strings desired (For example 3 in the case of the Google Keyboard word suggestion).

func FuzzySearchSetThreshold added in v1.2.0

func FuzzySearchSetThreshold(str string, strList []string, quantity int, minSim float32, algo Algorithm) ([]string, error)

FuzzySearchSetThreshold realize an approximate search on a string list and return a set composed with x strings compared to the string input sorted by similarity with the base string. Take a similarity threshold in parameter. Takes the a quantity parameter to define the number of output strings desired (For example 3 in the case of the Google Keyboard word suggestion). Takes also a threshold parameter for similarity with base string.

func FuzzySearchThreshold added in v1.2.0

func FuzzySearchThreshold(str string, strList []string, minSim float32, algo Algorithm) (string, error)

FuzzySearchThreshold realize an approximate search on a string list and return the closest one compared to the string input. Takes a similarity threshold in parameter.

func HammingDistance

func HammingDistance(str1, str2 string) (int, error)

HammingDistance calculate the edit distance between two given strings using only substitutions Return edit distance integer and an error

func JaccardSimilarity added in v1.4.0

func JaccardSimilarity(str1, str2 string, splitLength int) float32

JaccardSimilarity compute the jaccard similarity coeffecient between two strings Takes two strings as parameters, a split length which define the k-gram single length (if zero split string on whitespaces) and return an index.

func JaroSimilarity added in v1.1.0

func JaroSimilarity(str1, str2 string) float32

JaroSimilarity return a similarity index (between 0 and 1) It use Jaro distance algorithm and allow only transposition operation

func JaroWinklerSimilarity added in v1.1.0

func JaroWinklerSimilarity(str1, str2 string) float32

JaroWinklerSimilarity return a similarity index (between 0 and 1) Use Jaro similarity and after look for a common prefix (length <= 4)

func LCS

func LCS(str1, str2 string) int

LCS takes two strings and compute their LCS(Longuest Common Subsequence)

func LCSBacktrack added in v1.1.0

func LCSBacktrack(str1, str2 string) (string, error)

LCSBacktrack returns all choices taken during LCS process

func LCSBacktrackAll added in v1.1.0

func LCSBacktrackAll(str1, str2 string) ([]string, error)

LCSBacktrackAll returns an array containing all common substrings between str1 and str2

func LCSDiff added in v1.1.0

func LCSDiff(str1, str2 string) ([]string, error)

LCSDiff will backtrack through the lcs matrix and return the diff between the two sequences

func LCSEditDistance

func LCSEditDistance(str1, str2 string) int

LCSEditDistance determines the edit distance between two strings using LCS function (allow only insert and delete operations)

func LevenshteinDistance

func LevenshteinDistance(str1, str2 string) int

LevenshteinDistance calculate the distance between two string This algorithm allow insertions, deletions and substitutions to change one string to the second Compatible with non-ASCII characters

func OSADamerauLevenshteinDistance

func OSADamerauLevenshteinDistance(str1, str2 string) int

OSADamerauLevenshteinDistance calculate the distance between two string Optimal string alignment distance variant that use extension of the Wagner-Fisher dynamic programming algorithm Doesn't allow multiple transformations on a same substring Allowing insertions, deletions, substitutions and transpositions to change one string to the second Compatible with non-ASCII characters

func QgramDistance added in v1.6.0

func QgramDistance(str1, str2 string, splitLength int) int

QgramDistance compute the q-gram similarity between two strings Takes two strings as parameters, a split length which defines the k-gram shingle length

func QgramDistanceCustomNgram added in v1.6.0

func QgramDistanceCustomNgram(splittedStr1, splittedStr2 map[string]int) int

QgramDistanceCustomNgram compute the q-gram similarity between two custom set of individuals Takes two n-gram map as parameters

func QgramSimilarity added in v1.6.0

func QgramSimilarity(str1, str2 string, splitLength int) float32

QgramSimilarity compute a similarity index (between 0 and 1) between two strings from a Qgram distance Takes two strings as parameters, a split length which defines the k-gram shingle length

func Shingle added in v1.5.0

func Shingle(s string, k int) map[string]int

Shingle Find the k-gram of a string for a given k Takes a string and an integer as parameters and return a map. Returns an empty map if the string is empty or if k is 0

func ShingleSlice added in v1.5.0

func ShingleSlice(s string, k int) []string

ShingleSlice Find the k-gram of a string for a given k Takes a string and an integer as parameters and return a slice. Returns an empty slice if the string is empty or if k is 0

func SorensenDiceCoefficient added in v1.6.0

func SorensenDiceCoefficient(str1, str2 string, splitLength int) float32

SorensenDiceCoefficient computes the Sorensen-Dice coefficient between two strings Takes two strings as parameters, a split length which defines the k-gram shingle length

func StringsSimilarity added in v1.1.0

func StringsSimilarity(str1 string, str2 string, algo Algorithm) (float32, error)

StringsSimilarity return a similarity index [0..1] between two strings based on given edit distance algorithm in parameter. Use defined Algorithm type. Through this function, Cosine and Jaccard algorithms are used with Shingle split method with a length of 2.

Types

type Algorithm added in v1.3.0

type Algorithm uint8

Algorithm is an Integer type used to identify edit distance algorithms

const (
	Levenshtein Algorithm = iota
	DamerauLevenshtein
	OSADamerauLevenshtein
	Lcs
	Hamming
	Jaro
	JaroWinkler
	Cosine
	Jaccard
	SorensenDice
	Qgram
)

Algorithm identifiers

Directories

Path Synopsis
internal

Jump to

Keyboard shortcuts

? : This menu
/ : Search site
f or F : Jump to
y or Y : Canonical URL