stardust

v0.1.1-0...-d6fc527 Latest
Warning

This package is not in the latest version of its module.

Published: Jun 11, 2020 License: MIT Imports: 10 Imported by: 1

README

stardust

Stardust, strdist. String distance measures for the command line.

Overview

$ stardust
NAME:
   stardust - String similarity measures for tab separated values.

USAGE:
   stardust [global options] command [command options] [arguments...]

VERSION:
   0.1.1

AUTHOR:
  Martin Czygan - <martin.czygan@gmail.com>

COMMANDS:
   adhoc         Adhoc distance
   cosine        Cosine word-wise
   coslev        Cosine word-wise and levenshtein combined
   dice          Sørensen–Dice coefficient
   hamming       Hamming distance
   jaro          Jaro distance
   jaro-winkler  Jaro-Winkler distance
   levenshtein   Levenshtein distance
   ngram         Ngram distance
   plain         Plain passthrough (for IO benchmarks)
   help, h       Shows a list of commands or help for one command

GLOBAL OPTIONS:
   -f '1,2'     c1,c2 the two columns to use for the comparison
   --delimiter, -d '    '   column delimiter (defaults to tab)
   --help, -h       show help
   --version, -v    print the version

For starters

$ stardust hamming "Hallo" "Hello"
Hallo   Hello   1

$ stardust ngram "Hallo" "Hello"
Hallo   Hello   0.2

$ stardust ngram "Hallo Welt" "Hello World"
Hallo Welt  Hello World 0.21428571428571427

Are the man pages of cp and mv more similar than those of ls and mv, when measured with a trigram model?

$ stardust ngram "$(echo $(man ls))" "$(echo $(man mv))" | cut -f3
0.29057337220602525

$ stardust ngram "$(echo $(man cp))" "$(echo $(man mv))" | cut -f3
0.4792746113989637

They seem to. And according to Jaro similarity?

$ stardust jaro "$(echo $(man ls))" "$(echo $(man mv))" | cut -f3
0.5597612762544908

$ stardust jaro "$(echo $(man cp))" "$(echo $(man mv))" | cut -f3
0.6376732132890776

Still.

Specific options

Some measures take additional options. For example, ngram accepts a --size option (default 3), which sets the n in n-gram.

$ stardust ngram --help
NAME:
   ngram - Ngram similarity

USAGE:
   command ngram [command options] [arguments...]

DESCRIPTION:
   Compute Ngram similarity, which lies between 0 and 1.

OPTIONS:
   --size, -s '3'   value of n

$ stardust ngram --size 2 "Hello" "Hallo"
Hello   Hallo   0.3333333333333333

$ stardust ngram --size 1 "Hallo" "Hello"
Hallo   Hello   0.6
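
The numbers above are consistent with a Jaccard ratio over sets of n-grams: "Hallo" and "Hello" share one trigram (llo) out of five distinct ones, which gives 0.2. The following is a minimal sketch of that calculation, not the package's own code. The helper names ngrams and jaccard are hypothetical, and splitting on runes rather than bytes is an assumption.

```go
package main

import "fmt"

// ngrams returns the set of distinct n-grams of s, counted in runes.
// Working on runes rather than bytes is an assumption of this sketch.
func ngrams(s string, n int) map[string]struct{} {
	set := make(map[string]struct{})
	r := []rune(s)
	for i := 0; i+n <= len(r); i++ {
		set[string(r[i:i+n])] = struct{}{}
	}
	return set
}

// jaccard returns |A ∩ B| / |A ∪ B| for two string sets.
func jaccard(a, b map[string]struct{}) float64 {
	if len(a) == 0 && len(b) == 0 {
		return 0 // the ratio is undefined here; 0 is a choice made for this sketch
	}
	shared := 0
	for k := range a {
		if _, ok := b[k]; ok {
			shared++
		}
	}
	return float64(shared) / float64(len(a)+len(b)-shared)
}

func main() {
	// Sizes 3, 2 and 1 give 0.2, 0.3333333333333333 and 0.6.
	for _, n := range []int{3, 2, 1} {
		fmt.Println(n, jaccard(ngrams("Hallo", n), ngrams("Hello", n)))
	}
}
```

The same sketch gives 3/14 for "Hallo Welt" and "Hello World", which matches the 0.21428571428571427 shown earlier.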

Input from files

Using example.tsv:

$ stardust ngram example.tsv | sort -t$'\t' -k3,3 -nr | head -3
Deutsches Museum    Deutsches Museum    1
Deutsche Suchthilfestatistik    Deutsches Museum    0.17647058823529413
Deutsche+Guggenheim magazine /  Deutsches Museum    0.16666666666666666

Which is equivalent to:

$ cat example.tsv | stardust ngram | sort -t$'\t' -k3,3 -nr | head -3
Deutsches Museum    Deutsches Museum    1
Deutsche Suchthilfestatistik    Deutsches Museum    0.17647058823529413
Deutsche+Guggenheim magazine /  Deutsches Museum    0.16666666666666666
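
For readers who want to see the shape of this pipeline in Go, here is a rough sketch of reading delimited lines, picking two columns, and appending a score. It is an illustration under stated assumptions, not the tool's implementation: scorePairs, scoreString and the toy exactMatch measure are hypothetical names, columns are assumed to be 1-based as in the -f '1,2' default, and skipping short lines is a choice made here.

```go
package main

import (
	"bufio"
	"fmt"
	"io"
	"strings"
)

// measure is a hypothetical stand-in for any of the string measures.
type measure func(a, b string) float64

// exactMatch is a toy measure used only to keep this sketch short.
func exactMatch(a, b string) float64 {
	if a == b {
		return 1
	}
	return 0
}

// scorePairs reads delimited lines from r, picks the 1-based columns c1 and
// c2, and writes "left<TAB>right<TAB>score" to w. Lines with too few columns
// are skipped, which is a choice made for this sketch.
func scorePairs(r io.Reader, w io.Writer, delim string, c1, c2 int, m measure) error {
	if c1 < 1 || c2 < 1 {
		return fmt.Errorf("columns must be >= 1, got %d and %d", c1, c2)
	}
	sc := bufio.NewScanner(r)
	for sc.Scan() {
		fields := strings.Split(sc.Text(), delim)
		if c1 > len(fields) || c2 > len(fields) {
			continue
		}
		left, right := fields[c1-1], fields[c2-1]
		if _, err := fmt.Fprintf(w, "%s\t%s\t%v\n", left, right, m(left, right)); err != nil {
			return err
		}
	}
	return sc.Err()
}

// scoreString is a convenience wrapper for trying scorePairs on a literal.
func scoreString(input, delim string, c1, c2 int, m measure) (string, error) {
	var out strings.Builder
	err := scorePairs(strings.NewReader(input), &out, delim, c1, c2, m)
	return out.String(), err
}

func main() {
	input := "Deutsches Museum\tDeutsches Museum\n" +
		"Deutsche Suchthilfestatistik\tDeutsches Museum\n"
	out, err := scoreString(input, "\t", 1, 2, exactMatch)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Print(out)
}
```

Because scorePairs takes an io.Reader, the same function could be handed os.Stdin or an opened file, which mirrors the two equivalent invocations above.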

Documentation

Constants

const Version = "0.1.1"

Version of the application

Functions

func Bigrams

func Bigrams(s string) set.Strings

Bigrams returns a set of 2-grams

func CompleteString

func CompleteString(pool []string, prefix string) []string

CompleteString returns all strings from pool that have a given prefix
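
A prefix filter of this kind can be sketched in a few lines. The function name complete below is hypothetical, and the sketch assumes matches are returned in pool order, which the documentation does not state.

```go
package main

import (
	"fmt"
	"strings"
)

// complete returns every string in pool that starts with prefix,
// preserving the order of pool (an assumption of this sketch).
func complete(pool []string, prefix string) []string {
	var out []string
	for _, s := range pool {
		if strings.HasPrefix(s, prefix) {
			out = append(out, s)
		}
	}
	return out
}

func main() {
	commands := []string{"hamming", "jaro", "jaro-winkler", "levenshtein"}
	fmt.Println(complete(commands, "jaro")) // [jaro jaro-winkler]
}
```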

func HammingDistance

func HammingDistance(a, b string) (int, error)

HammingDistance computes the Hamming distance for two strings of equal length
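
The Hamming distance counts the positions at which two equal-length strings differ. A minimal sketch follows, assuming comparison by rune and an error for unequal lengths. The name hamming and the error text are made up for this example.

```go
package main

import (
	"errors"
	"fmt"
)

// hamming counts differing positions between a and b, compared rune by rune.
// It returns an error when the lengths differ, since the measure is only
// defined for strings of equal length.
func hamming(a, b string) (int, error) {
	ra, rb := []rune(a), []rune(b)
	if len(ra) != len(rb) {
		return 0, errors.New("hamming: strings must have equal length")
	}
	d := 0
	for i := range ra {
		if ra[i] != rb[i] {
			d++
		}
	}
	return d, nil
}

func main() {
	d, err := hamming("Hallo", "Hello")
	fmt.Println(d, err) // 1 <nil>
}
```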

func JaccardSets

func JaccardSets(a, b set.Strings) float64

JaccardSets measures the Jaccard distance of two sets

func JaroDistance

func JaroDistance(a, b string) (float64, error)

JaroDistance computes the Jaro distance for two strings. From: https://github.com/xrash/smetrics
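
As background, the textbook Jaro similarity is (m/|a| + m/|b| + (m-t)/m) / 3, where m is the number of characters that match within a sliding window and t is half the number of matched characters that appear in a different order. The sketch below follows that definition. It is not copied from smetrics or from this package, the name jaro is hypothetical, and the handling of empty strings is a choice made here.

```go
package main

import "fmt"

// jaro returns the textbook Jaro similarity of a and b, in [0, 1].
func jaro(a, b string) float64 {
	ra, rb := []rune(a), []rune(b)
	if len(ra) == 0 && len(rb) == 0 {
		return 1 // two empty strings are treated as identical in this sketch
	}
	if len(ra) == 0 || len(rb) == 0 {
		return 0
	}
	longest := len(ra)
	if len(rb) > longest {
		longest = len(rb)
	}
	window := longest/2 - 1
	if window < 0 {
		window = 0
	}
	matchedA := make([]bool, len(ra))
	matchedB := make([]bool, len(rb))
	matches := 0
	for i := range ra {
		lo, hi := i-window, i+window+1
		if lo < 0 {
			lo = 0
		}
		if hi > len(rb) {
			hi = len(rb)
		}
		for j := lo; j < hi; j++ {
			if !matchedB[j] && ra[i] == rb[j] {
				matchedA[i], matchedB[j] = true, true
				matches++
				break
			}
		}
	}
	if matches == 0 {
		return 0
	}
	// Walk the matched characters of both strings in order and count
	// the positions where they disagree; two of those make one transposition.
	outOfOrder, j := 0, 0
	for i := range ra {
		if !matchedA[i] {
			continue
		}
		for !matchedB[j] {
			j++
		}
		if ra[i] != rb[j] {
			outOfOrder++
		}
		j++
	}
	m := float64(matches)
	t := float64(outOfOrder) / 2
	return (m/float64(len(ra)) + m/float64(len(rb)) + (m-t)/m) / 3
}

func main() {
	fmt.Printf("%.4f\n", jaro("MARTHA", "MARHTA")) // 0.9444
}
```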

func JaroWinklerDistance

func JaroWinklerDistance(a, b string, boostThreshold float64, prefixSize int) (float64, error)

JaroWinklerDistance computes the Jaro-Winkler distance for two strings. From: https://github.com/xrash/smetrics
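
The two extra parameters can be read as follows in the usual Jaro-Winkler formulation: a Jaro score above boostThreshold is raised in proportion to the length of the common prefix, capped at prefixSize. The sketch below shows only that adjustment and takes the Jaro score as input. The name winklerBoost is hypothetical, and the scaling factor 0.1 is the conventional value, assumed here rather than taken from the package.

```go
package main

import "fmt"

// winklerBoost applies the Winkler prefix adjustment to a Jaro score j.
// Scores at or below boostThreshold are returned unchanged.
func winklerBoost(j float64, a, b string, boostThreshold float64, prefixSize int) float64 {
	if j <= boostThreshold {
		return j
	}
	ra, rb := []rune(a), []rune(b)
	limit := prefixSize
	if len(ra) < limit {
		limit = len(ra)
	}
	if len(rb) < limit {
		limit = len(rb)
	}
	prefix := 0
	for prefix < limit && ra[prefix] == rb[prefix] {
		prefix++
	}
	// 0.1 is the conventional scaling factor (an assumption of this sketch).
	return j + 0.1*float64(prefix)*(1-j)
}

func main() {
	j := 17.0 / 18.0 // Jaro similarity of MARTHA and MARHTA
	fmt.Printf("%.4f\n", winklerBoost(j, "MARTHA", "MARHTA", 0.7, 4)) // 0.9611
}
```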

func LevenshteinDistance

func LevenshteinDistance(s, t string) (int, error)

LevenshteinDistance computes the Levenshtein distance for two strings
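
The Levenshtein distance is the minimum number of single-character insertions, deletions and substitutions needed to turn one string into the other. A compact sketch using the standard two-row dynamic programming table is shown below. The names levenshtein and min3 are hypothetical, and comparison by rune is an assumption.

```go
package main

import "fmt"

// min3 returns the smallest of three ints.
func min3(x, y, z int) int {
	if y < x {
		x = y
	}
	if z < x {
		x = z
	}
	return x
}

// levenshtein returns the edit distance between s and t, compared by rune.
// Only two rows of the DP table are kept in memory at a time.
func levenshtein(s, t string) int {
	a, b := []rune(s), []rune(t)
	prev := make([]int, len(b)+1)
	curr := make([]int, len(b)+1)
	for j := range prev {
		prev[j] = j // distance from the empty prefix of a
	}
	for i := 1; i <= len(a); i++ {
		curr[0] = i
		for j := 1; j <= len(b); j++ {
			cost := 1
			if a[i-1] == b[j-1] {
				cost = 0
			}
			curr[j] = min3(prev[j]+1, curr[j-1]+1, prev[j-1]+cost)
		}
		prev, curr = curr, prev
	}
	return prev[len(b)]
}

func main() {
	fmt.Println(levenshtein("kitten", "sitting")) // 3
	fmt.Println(levenshtein("Hallo", "Hello"))    // 1
}
```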

func NgramDistance

func NgramDistance(s, t string) (float64, error)

NgramDistance computes the trigram/Jaccard measure

func NgramDistanceSize

func NgramDistanceSize(s, t string, n int) (float64, error)

NgramDistanceSize computes the ngram/Jaccard measure for a given n

func Ngrams

func Ngrams(s string, n int) set.Strings

Ngrams returns a set of n-grams for a given string

func RecordGenerator

func RecordGenerator(c *cli.Context) chan *Record

RecordGenerator abstracts from the way the strings are specified, e.g. via stdin, a file, or directly on the command line

func RecordGeneratorFile

func RecordGeneratorFile(reader io.ReadCloser, c *ColumnSpec) chan *Record

RecordGeneratorFile produces pairs of values extracted according to a given column specification, using tab as the delimiter.

func RecordGeneratorFileDelimiter

func RecordGeneratorFileDelimiter(reader io.ReadCloser, c *ColumnSpec, delim string) chan *Record

RecordGeneratorFileDelimiter produces pairs of values extracted according to a given column specification and a custom field delimiter

func SorensenDiceDistance

func SorensenDiceDistance(a, b string) (float64, error)
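
This function carries no description. For orientation, the Sørensen–Dice coefficient of two sets is 2|A ∩ B| / (|A| + |B|). The sketch below applies it to character bigram sets, which is a common choice but an assumption here, as are the names bigrams and dice. Whether the package works on bigrams, sets or multisets is not stated.

```go
package main

import "fmt"

// bigrams returns the set of distinct 2-grams of s, counted in runes.
func bigrams(s string) map[string]struct{} {
	set := make(map[string]struct{})
	r := []rune(s)
	for i := 0; i+2 <= len(r); i++ {
		set[string(r[i:i+2])] = struct{}{}
	}
	return set
}

// dice returns 2|A ∩ B| / (|A| + |B|) over the bigram sets of a and b.
func dice(a, b string) float64 {
	x, y := bigrams(a), bigrams(b)
	if len(x)+len(y) == 0 {
		return 0 // undefined for two bigram-free strings; 0 is a choice of this sketch
	}
	shared := 0
	for k := range x {
		if _, ok := y[k]; ok {
			shared++
		}
	}
	return 2 * float64(shared) / float64(len(x)+len(y))
}

func main() {
	fmt.Println(dice("night", "nacht")) // 0.25
	fmt.Println(dice("Hallo", "Hello")) // 0.5
}
```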

func Trigrams

func Trigrams(s string) set.Strings

Trigrams returns a set of 3-grams

func Unigrams

func Unigrams(s string) set.Strings

Unigrams returns a set of 1-grams

Types

type ColumnSpec

type ColumnSpec struct {
	// contains filtered or unexported fields
}

ColumnSpec contains two column indexes

func ParseColumnSpec

func ParseColumnSpec(s string) (*ColumnSpec, error)

ParseColumnSpec parses a string like "2,3" into a ColumnSpec struct
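
Parsing such a specification might look like the sketch below. The columnSpec type, its field names and the error messages are hypothetical, since the real struct's fields are unexported. Treating columns as 1-based follows the -f '1,2' default shown in the help output but is still an assumption.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// columnSpec is a hypothetical stand-in holding two 1-based column indexes.
type columnSpec struct {
	left, right int
}

// parseColumnSpec parses a string like "2,3" into a columnSpec.
func parseColumnSpec(s string) (*columnSpec, error) {
	parts := strings.Split(s, ",")
	if len(parts) != 2 {
		return nil, fmt.Errorf("column spec %q: want two comma-separated numbers", s)
	}
	left, err := strconv.Atoi(strings.TrimSpace(parts[0]))
	if err != nil {
		return nil, fmt.Errorf("column spec %q: %v", s, err)
	}
	right, err := strconv.Atoi(strings.TrimSpace(parts[1]))
	if err != nil {
		return nil, fmt.Errorf("column spec %q: %v", s, err)
	}
	if left < 1 || right < 1 {
		return nil, fmt.Errorf("column spec %q: columns start at 1", s)
	}
	return &columnSpec{left: left, right: right}, nil
}

func main() {
	cs, err := parseColumnSpec("2,3")
	fmt.Println(cs.left, cs.right, err) // 2 3 <nil>
}
```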

type Record

type Record struct {
	Fields []string
	// contains filtered or unexported fields
}

Record represents a single input (fields) and two highlighted fields, left and right, that are used for comparison

func (*Record) Left

func (r *Record) Left() string

Left returns the left highlighted field

func (*Record) Right

func (r *Record) Right() string

Right returns the right highlighted field

Directories

Path Synopsis
cmd
