disfun

package module
v0.0.0-...-62faa31 Latest Latest
Warning

This package is not in the latest version of its module.

Go to latest
Published: Sep 12, 2016 License: MIT Imports: 8 Imported by: 0

README

Go Report Card GoDoc

DisFun

Inspired by the Ruby Gem Distance-Measures and Measurable and my need to have available a number of distance functions for things like k-Nearest-Neighbor, k-Means, string metrics, and the like....

Algos

I'll add algorithm implementations as they arise. Some imlementations have been forked from existing projects where I have made modifcations to suite my purposes (see external_licenses).

Test Coverage

coverage: 87.3% of statements

Run benchmarks

go test -bench .

## or

go test -bench=.

## or for levenshtien

go test -bench Lev

## or for euclidean
go test -bench Euc

## or use gobenchui
gobenchui .

Documentation

Overview

Package disfun implements various distance functions.

Index

Constants

View Source
const (
	EarthRadius  = float64(6371)
	Substitution = float64(1)
	Insertion    = float64(1)
	Deletion     = float64(1)
	GapCost      = float64(0.5)
)

Variables

This section is empty.

Functions

func BrayCurtis

func BrayCurtis(x, y *mat64.Dense) (result float64)

BrayCurtis finds the distance by the ratio of the absolute sum of all differences of points in vectors [u,v]:

Example:
 dist((u, v) = dist(v, u)) =
	|v_1 - u_1| + |v_2 - u_2| + ... + |v_n - u_n| / |v_1 + u_1| + |v_2 + u_2| + ... + |v_n + u_n|

References:
http://mathworld.wolfram.com/TaxicabMetric.html
http://demonstrations.wolfram.com/TaxicabGeometry/

func DamerauLevenshtein

func DamerauLevenshtein(s1, s2 string) (distance int)

DamerauLevenshtein computes the Damerau-Levenshtein distance between two strings. The returned value - distance - is the number of insertions, deletions, substitutions, and transpositions it takes to transform one string (s1) into another (s2). Each step in the transformation "costs" one distance point. It is similar to the Optimal String Alignment, algorithm, but is more complex because it allows multiple edits on substrings.

References:
This implementation is based off of the one found on Wikipedia at
http://en.wikipedia.org/wiki/Damerau%E2%80%93Levenshtein_distance#Distance_with_adjacent_transpositions
as well as KevinStern's Java implementation found at
https://github.com/KevinStern/software-and-algorithms.

func DoubleMetaphone

func DoubleMetaphone(strInput string, maxLength int) (string, string)

DoubleMetaphone computes the Double-Metaphone value of the input string. This value is a phonetic representation of how the string sounds, with affordances for many different language dialects. It was originally developed by Lawrence Phillips in the 1990s.

More information about this algorithm can be found on Wikipedia at http://en.wikipedia.org/wiki/Metaphone.

func Hamming

func Hamming(s1 string, s2 string) (distance int, err error)

Hamming computes the Hamming distance between two equal-length strings. This is the number of times the two strings differ between characters at the same index. This implementation is based off of the algorithm description found at http://en.wikipedia.org/wiki/Hamming_distance.

func Haversine

func Haversine(onePhi, oneLambda, twoPhi, twoLambda float64) (c float64)

Haversine finds the shortest "as-the-crow-flies" distance between 2 points (φ,λ) on a sphere R... accounting for curvuture. In Navigation terms, for the example below, φ (phi) is latitude, λ (lambda) is longitude, R(adius) is earth’s radius (mean radius = 6,371km).

	Example:
	  a = sin²(Δφ/2) + cos φ1 ⋅ cos φ2 ⋅ sin²(Δλ/2)
	  c = 2 ⋅ atan2( √a, √(1−a) )
	  d = R ⋅ c

	References:
 https://github.com/kellydunn/golang-geo.
 http://www.codecodex.com/wiki/Calculate_Distance_Between_Two_Points_on_a_Globe
 http://www.movable-type.co.uk/scripts/latlong.html
 https://github.com/reddavis/Distance-Measures/blob/master/lib/distance_measures/haversine.rb

func HaversineLatLon

func HaversineLatLon(lat1, lon1, lat2, lon2 float64) float64

HaversineLatLon accepts 2 sets of lat/lon, it multiplies the EARTH_RADIUS by the result of the haversine function.

func HaversinePoint

func HaversinePoint(pointA1, pointA2, pointB1, pointB2, radius float64) float64

HaversinePoint accepts 2 sets of points with a radius, it multiplies radius by the result of the haversine function.

func Jaro

func Jaro(r1 string, r2 string) (distance float64)

Jaro computes the Jaro edit distance between two strings. It represents this with a float64 between 0 and 1 inclusive, with 0 indicating the two strings are not at all similar and 1 indicating the two strings are exact matches.

See http://en.wikipedia.org/wiki/Jaro%E2%80%93Winkler_distance for a full description.

func JaroWinkler

func JaroWinkler(r1 string, r2 string, longTolerance bool) (distance float64)

JaroWinkler computes the Jaro-Winkler edit distance between two strings. This is a modification of the Jaro algorithm that gives additional weight to prefix matches.

func Lev

func Lev(s1, s2 string) int

Lev (edit distance) gives similarity metric by calcuating number of positions for substitution, insertion, and deletion.

Currently bemchmarking 4 different implementations. One is a simple pass through (Lev), another (Leven) uses a struct and separate insertion, deletion, substitution functions with a handmade matrix called VectorCell. The third one (Levenshtein) uses the mat64 matrix package. The last one (EditDistance) is the fastest but I haven't checked it accuracy yet.

Levenshtein is the most accurate but also the most costly due to building matrices. I can get this down but using dense matrices for this is not a good idea. The other two functions are about the same speed but they are not very readable and not the most accurate... see the tests for differences in accuracy between Leven, Lev, and Levenshtein.

func LevEditDistance

func LevEditDistance(s1, s2 string) (distance int)

LevEditDistance computes the Levenshtein distance between two strings. The returned value - distance - is the number of insertions, deletions, and substitutions it takes to transform one string (s1) into another (s2). Each step in the transformation "costs" one distance point.

//Currently bemchmarking 4 different implementations. One is a simple pass through (Lev), another (Leven) uses a struct and separate insertion, deletion, substitution functions with a handmade matrix called VectorCell. The third one (Levenshtein) uses the mat64 matrix package. The last one (EditDistance) is the fastest but I haven't checked it accuracy yet.

func Manhattan

func Manhattan(x, y *mat64.Dense) (result float64)

Manhattan (also called TaxiCab) finds the distance by the absolute sum of all differences of points in vectors [u,v]:

Example:
 dist((u, v) = dist(v, u)) = |v_1 - u_1| + |v_2 - u_2| + ... + |v_n - u_n|

References:
http://mathworld.wolfram.com/TaxicabMetric.html
http://demonstrations.wolfram.com/TaxicabGeometry/

func NYSIIS

func NYSIIS(s1 string) string

NYSIIS computes the NYSIIS phonetic encoding of the input string. It is a modification of the traditional Soundex algorithm.

func OSA

func OSA(s1 string, s2 string) (distance int)

OSA computes the Optimal String Alignment distance between two strings. The returned value - distance - is the number of insertions, deletions, substitutions, and transpositions it takes to transform one string (s1) into another (s2). Each step in the transformation "costs" one distance point. It is similar to Damerau-Levenshtein, but is simpler because it does not allow multiple edits on any substring.

func Phonex

func Phonex(s1 string) string

Phonex computes the Phonex phonetic encoding of the input string. Phonex is a modification of the venerable Soundex algorithm. It accounts for a few more letter combinations to improve accuracy on some data sets.

This implementation is based off of the original C implementation by the creator - A. J. Lait - as found in his research paper entitled "An Assessment of Name Matching Algorithms."

func SmithWaterman

func SmithWaterman(s1 string, s2 string) float64

SmithWaterman computes the Smith-Waterman local sequence alignment for the two input strings. This was originally designed to find similar regions in strings representing DNA or protein sequences.

func Soundex

func Soundex(s1 string) string

Soundex computes the Soundex phonetic representation of the input string. It attempts to encode homophones with the same characters. More information can be found at http://en.wikipedia.org/wiki/Soundex.

Types

type Euclidean

type Euclidean struct{}

Euclidean finds the "ordinary" distance between vectors [u,v] given by the Pythagorean formula (sum of squares of all points):

Example:
  dist((u, v) = dist(v, u)) = √(v_1 - u_1)² + (v_2 - u_2)² + ... + (v_n - u_n)²

References:
http://reference.wolfram.com/language/ref/EuclideanDistance.html

func NewEuclidean

func NewEuclidean() *Euclidean

NewEuclidean initializes the Euclidean struct.

func (*Euclidean) Distance

func (e *Euclidean) Distance(u, v *mat64.Dense) float64

Distance finds the Euclidean distance.

func (*Euclidean) InnerProduct

func (e *Euclidean) InnerProduct(u, v *mat64.Dense) (result float64)

InnerProduct computes a Eucledian inner product.

type Leven

type Leven struct {
	S1         string
	S2         string
	Lens1      int
	Lens2      int
	Width      int
	VectorCell []int
}

Leven (edit distance) gives similarity metric by calcuating number of positions for substitution, insertion, and deletion.

Currently bemchmarking 4 different implementations. One is a simple pass through (Lev), another (Leven) uses a struct and separate insertion, deletion, substitution functions with a handmade matrix called VectorCell. The third one (Levenshtein) uses the mat64 matrix package. The last one (EditDistance) is the fastest but I haven't checked it accuracy yet.

Levenshtein is the most accurate but also the most costly due to building matrices. I can get this down but using dense matrices for this is not a good idea. The other two functions are about the same speed but they are not very readable and not the most accurate... see the tests for differences in accuracy between Leven, Lev, and Levenshtein.

func NewLeven

func NewLeven(s1, s2 string) *Leven

func (*Leven) Similarity

func (l *Leven) Similarity() float64

type Levenshtein

type Levenshtein struct {
	Source       []rune
	Target       []rune
	SourceString string
	TargeString  string
	RowHeight    int
	ColWidth     int
	M            *mat64.Dense
}

Levenshtein (edit distance) gives similarity metric by calcuating number of positions for substitution, insertion, and deletion.

Currently bemchmarking 4 different implementations. One is a simple pass through (Lev), another (Leven) uses a struct and separate insertion, deletion, substitution functions with a handmade matrix called VectorCell. The third one (Levenshtein) uses the mat64 matrix package. The last one (EditDistance) is the fastest but I haven't checked it accuracy yet.

Row     == Height
Column  == Width

Demo here: http://andrew.hedges.name/experiments/levenshtein/

Levenshtein is the most accurate but also the most costly due to building matrices. I can get this down but using dense matrices for this is not a good idea. The other two functions are about the same speed but they are not very readable and not the most accurate... see the tests for differences in accuracy between Leven, Lev, and Levenshtein.

func NewLevenshtein

func NewLevenshtein(source, target string) *Levenshtein

func (*Levenshtein) ComputeMatrix

func (l *Levenshtein) ComputeMatrix() *mat64.Dense

func (*Levenshtein) Similarity

func (l *Levenshtein) Similarity() float64

type Ngram

type Ngram struct {
	Set1 set.Strings
	Set2 set.Strings
	S1   string
	S2   string
	N    int
}

Ngram is continuous sequence of n-items from a given sequence. The distance is the relative number of items between these two sequences.

References:

https://webdocs.cs.ualberta.ca/~kondrak/papers/spire05.pdf
https://en.wikipedia.org/wiki/N-gram
http://m.wolframalpha.com/input/?i=n-grams+%22n-gram+example+of+n-grams+in+wolfram+alpha%22&x=0&y=0

func NewNgram

func NewNgram(n int, s1, s2 string) *Ngram

NewNgram initializes and creates data structures for the Ngram struct, which you can then call Similarity() on.

func (*Ngram) Build

func (n *Ngram) Build()

Build creates Set1 and Set2 of the Ngram.

func (*Ngram) JaccardCoEfficient

func (n *Ngram) JaccardCoEfficient() float64

JaccardCoEfficient calculates the similarity of two sets as the intersection divided by the union of the two sets.

func (*Ngram) Similarity

func (n *Ngram) Similarity() float64

Similarity bulds the two ngram sets and returns their similarity.

type String

type String struct {
	// contains filtered or unexported fields
}

String wraps a regular string with a small structure that provides more efficient indexing by code point index, as opposed to byte index. Scanning incrementally forwards or backwards is O(1) per index operation (although not as fast a range clause going forwards). Random access is O(N) in the length of the string, but the overhead is less than always scanning from the beginning. If the string is ASCII, random access is O(1). Unlike the built-in string type, String has internal mutable state and is not thread-safe.

func NewString

func NewString(contents string) *String

NewString returns a new UTF-8 string with the provided contents.

func (*String) At

func (s *String) At(i int) int

At returns the rune with index i in the String. The sequence of runes is the same as iterating over the contents with a "for range" clause.

func (*String) Init

func (s *String) Init(contents string) *String

Init initializes an existing String to hold the provided contents. It returns a pointer to the initialized String.

func (*String) IsASCII

func (s *String) IsASCII() bool

IsASCII returns a boolean indicating whether the String contains only ASCII bytes.

func (*String) RuneCount

func (s *String) RuneCount() int

RuneCount returns the number of runes (Unicode code points) in the String.

func (*String) Slice

func (s *String) Slice(i, j int) string

Slice returns the string sliced at rune positions [i:j].

func (*String) String

func (s *String) String() string

String returns the contents of the String. This method also means the String is directly printable by fmt.Print.

Jump to

Keyboard shortcuts

? : This menu
/ : Search site
f or F : Jump to
y or Y : Canonical URL