Documentation ¶
Constants ¶
This section is empty.
Variables ¶
Default settings. They can be changed before use; see the CLI flags in main.go.
Functions ¶
func DistBytes ¶
Compute the normalized compression distance between two byte slices ([]byte) using gzip. The result normally lies between 0.0 and 1.0, but can occasionally end up slightly above 1.0. See the attached article, especially Annex A.
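As an illustration, here is a minimal sketch of the usual normalized compression distance formula, NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y)), where C is the gzip-compressed length. The helper names `compressedLen` and `ncd` are hypothetical; this is not the package's own code, and the real DistBytes may differ in compression settings or details.

```go
package main

import (
	"bytes"
	"compress/gzip"
	"fmt"
)

// compressedLen returns the size in bytes of the gzip-compressed form of data.
func compressedLen(data []byte) int {
	var buf bytes.Buffer
	zw := gzip.NewWriter(&buf)
	zw.Write(data) // errors ignored: writes to a bytes.Buffer do not fail
	zw.Close()     // flushes pending data and writes the gzip trailer
	return buf.Len()
}

// ncd is a hypothetical helper applying the textbook NCD formula.
// It is an assumption about the approach, not the package's DistBytes.
func ncd(x, y []byte) float64 {
	cx, cy := compressedLen(x), compressedLen(y)

	// Compress the concatenation of both inputs.
	xy := make([]byte, 0, len(x)+len(y))
	xy = append(xy, x...)
	xy = append(xy, y...)
	cxy := compressedLen(xy)

	lo, hi := cx, cy
	if lo > hi {
		lo, hi = hi, lo
	}
	// hi is never zero: even empty input yields a gzip header and trailer.
	return float64(cxy-lo) / float64(hi)
}

func main() {
	a := []byte("the quick brown fox jumps over the lazy dog")
	b := []byte("a completely different sentence about compression")
	fmt.Printf("same input:      %.3f\n", ncd(a, a))
	fmt.Printf("different input: %.3f\n", ncd(a, b))
}
```

Because real compressors carry fixed overhead and are not perfectly idempotent, the value for identical inputs is small but not exactly zero, and values slightly above 1.0 are possible for unrelated inputs.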
func DistFile ¶
Compute the distance between two files, given by their paths. Useful text content is extracted before the distance is computed.
func DistString ¶
Compute the normalized compression distance between two strings using gzip. The result normally lies between 0.0 and 1.0, but can occasionally end up slightly above 1.0. See the attached article, especially Annex A.
func ExtractText ¶
Extract useful content. Currently tries gzip, zlib, zip, pure XML, and HTML, in that order, then removes repeated white space characters.
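The "try a format, fall back, then normalize white space" idea can be sketched as follows. This sketch handles only the gzip step and assumes that removing multiple white space characters means collapsing each run into a single space. The names `tryGunzip`, `collapseSpaces`, `extract`, and `gzipBytes` are hypothetical and are not part of the package.

```go
package main

import (
	"bytes"
	"compress/gzip"
	"fmt"
	"io"
	"strings"
)

// tryGunzip returns the decompressed payload if data is a valid gzip stream.
func tryGunzip(data []byte) ([]byte, bool) {
	zr, err := gzip.NewReader(bytes.NewReader(data))
	if err != nil {
		return nil, false // not gzip: the header check failed
	}
	defer zr.Close()
	out, err := io.ReadAll(zr)
	if err != nil {
		return nil, false // corrupt or truncated stream
	}
	return out, true
}

// collapseSpaces replaces every run of white space with one space and trims the ends.
func collapseSpaces(s string) string {
	return strings.Join(strings.Fields(s), " ")
}

// extract is a hypothetical, gzip-only stand-in for the described pipeline.
func extract(data []byte) string {
	if plain, ok := tryGunzip(data); ok {
		data = plain
	}
	return collapseSpaces(string(data))
}

// gzipBytes compresses data; used here only to build example input.
func gzipBytes(data []byte) []byte {
	var buf bytes.Buffer
	zw := gzip.NewWriter(&buf)
	zw.Write(data) // errors ignored: writes to a bytes.Buffer do not fail
	zw.Close()
	return buf.Bytes()
}

func main() {
	raw := []byte("hello    world\n\n  again")
	fmt.Printf("%q\n", extract(raw))            // prints "hello world again"
	fmt.Printf("%q\n", extract(gzipBytes(raw))) // prints "hello world again"
}
```

A fuller version would chain further attempts (zlib, zip, XML, HTML) in the same style, each falling through to the next on failure.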
func FilesInFolder ¶
Get all files in the folder, recursively, ignoring the .git folder. File names are returned as absolute paths.
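A recursive listing of this kind can be written with the standard library's `filepath.WalkDir`. The sketch below is one possible approach, not the package's actual code; `listFiles` is a hypothetical name, and returning an error alongside the slice is an assumption.

```go
package main

import (
	"fmt"
	"io/fs"
	"os"
	"path/filepath"
)

// listFiles is a hypothetical helper: it walks root recursively, skips any
// directory named ".git", and returns the absolute paths of regular entries.
func listFiles(root string) ([]string, error) {
	absRoot, err := filepath.Abs(root)
	if err != nil {
		return nil, err
	}
	var files []string
	err = filepath.WalkDir(absRoot, func(path string, d fs.DirEntry, err error) error {
		if err != nil {
			return err // stop on the first unreadable entry
		}
		if d.IsDir() {
			if d.Name() == ".git" {
				return filepath.SkipDir // do not descend into .git
			}
			return nil
		}
		// path is absolute because the walk started from absRoot.
		files = append(files, path)
		return nil
	})
	return files, err
}

func main() {
	files, err := listFiles(".")
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	for _, f := range files {
		fmt.Println(f)
	}
}
```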
Types ¶
type Cache ¶
type Cache struct {
M map[[sha256.Size * 2]byte]float64 // should not be used directly, nor relied upon. Public only because required for ease of saving as gob.
}
Cache for file-to-file distances. Computing a distance is very expensive, since we check for Word, Excel, zip, and other file types, so caching makes sense. Entries are keyed not by file names but by the hashes of both files, to ensure proper handling of file name or content changes.
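To illustrate how a key of type `[sha256.Size * 2]byte` could be built from two file contents, here is a minimal sketch. The names `pairKey` and `keyFor` are hypothetical, and sorting the two hashes so that (a, b) and (b, a) share one entry is an assumption; the package may order them differently.

```go
package main

import (
	"bytes"
	"crypto/sha256"
	"fmt"
)

// pairKey holds two SHA-256 digests back to back.
type pairKey [sha256.Size * 2]byte

// keyFor is a hypothetical helper that derives a cache key from the contents
// of two files. The digests are sorted so the key does not depend on argument
// order (an assumption, not confirmed by the package documentation).
func keyFor(a, b []byte) pairKey {
	ha, hb := sha256.Sum256(a), sha256.Sum256(b)
	if bytes.Compare(ha[:], hb[:]) > 0 {
		ha, hb = hb, ha
	}
	var k pairKey
	copy(k[:sha256.Size], ha[:])
	copy(k[sha256.Size:], hb[:])
	return k
}

func main() {
	cache := map[pairKey]float64{}
	x, y := []byte("first file content"), []byte("second file content")

	cache[keyFor(x, y)] = 0.42 // store a distance once
	d, ok := cache[keyFor(y, x)]
	fmt.Println(d, ok) // prints: 0.42 true
}
```

Keying on content hashes means a renamed file still hits the cache, while an edited file produces a new key and is recomputed.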
func NewCache ¶
func NewCache() *Cache
Create a new cache. Load from previously saved cache if there is one. Not thread safe.
func (*Cache) Clear ¶
func (c *Cache) Clear()
Clear cache in memory. Cache on file will be erased on next save.
type Matrix ¶
type Matrix struct {
// contains filtered or unexported fields
}
A distance matrix, optimised for storage efficiency. The zero value can be used immediately.
func ComputeEuclid ¶
Compute the Euclidean distance matrix for vectors. Used mainly for test purposes.
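For reference, a pairwise Euclidean distance matrix can be computed as in the sketch below. It uses a plain `[][]float64` rather than the package's Matrix type, assumes all vectors have the same length, and the names `euclid` and `euclidMatrix` are hypothetical.

```go
package main

import (
	"fmt"
	"math"
)

// euclid returns the Euclidean distance between a and b.
// It assumes len(a) == len(b).
func euclid(a, b []float64) float64 {
	var sum float64
	for i := range a {
		d := a[i] - b[i]
		sum += d * d
	}
	return math.Sqrt(sum)
}

// euclidMatrix is a hypothetical helper returning the full symmetric
// matrix of pairwise distances, with zeros on the diagonal.
func euclidMatrix(vectors [][]float64) [][]float64 {
	n := len(vectors)
	m := make([][]float64, n)
	for i := range m {
		m[i] = make([]float64, n)
	}
	for i := 0; i < n; i++ {
		for j := i + 1; j < n; j++ {
			d := euclid(vectors[i], vectors[j])
			m[i][j], m[j][i] = d, d // distance is symmetric
		}
	}
	return m
}

func main() {
	m := euclidMatrix([][]float64{{0, 0}, {3, 4}, {6, 8}})
	fmt.Println(m[0][1], m[1][2], m[0][2]) // prints: 5 5 10
}
```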
func ComputeFiles ¶
Compute the distance matrix for a group of files. Computations are cached for later reuse.
func ComputeFolder ¶
Compute the distance matrix for all files in the folder.
func ComputeString ¶
Compute the distance matrix between strings.
func (*Matrix) Dist ¶
Get distance between i and j. This is the minimum interface required by the cluster package.
func (*Matrix) Set ¶
Set a distance for (i,j). It also sets the same value for (j,i). Size will increase as needed.
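One way to get symmetric Set/Dist behaviour with compact storage and a usable zero value is to keep only the lower triangle in a flat slice. The sketch below shows that idea. The type `triMatrix`, its layout, and the method signatures are assumptions made for illustration; the package's Matrix has unexported fields and may be implemented differently.

```go
package main

import "fmt"

// triMatrix is a hypothetical type storing only the strict lower triangle
// (i > j) of a symmetric matrix whose diagonal is zero.
type triMatrix struct {
	data []float64
}

// index maps a pair (i, j) with i > j to its position in the flat slice.
func index(i, j int) int {
	return i*(i-1)/2 + j
}

// Set stores d for both (i, j) and (j, i), growing the storage as needed.
func (m *triMatrix) Set(i, j int, d float64) {
	if i == j {
		return // the diagonal is always zero in this sketch
	}
	if i < j {
		i, j = j, i
	}
	k := index(i, j)
	if k >= len(m.data) {
		m.data = append(m.data, make([]float64, k+1-len(m.data))...)
	}
	m.data[k] = d
}

// Dist returns the stored distance, or 0 for pairs never set.
func (m *triMatrix) Dist(i, j int) float64 {
	if i == j {
		return 0
	}
	if i < j {
		i, j = j, i
	}
	k := index(i, j)
	if k >= len(m.data) {
		return 0
	}
	return m.data[k]
}

func main() {
	var m triMatrix // zero value is ready to use
	m.Set(0, 3, 1.5)
	fmt.Println(m.Dist(3, 0), m.Dist(0, 3), m.Dist(1, 2)) // prints: 1.5 1.5 0
}
```

Storing n(n-1)/2 values instead of n*n roughly halves memory, at the cost of a small index computation on each access.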