Published: Nov 12, 2019 License: MIT Imports: 14 Imported by: 0

README

Clusters

Go implementations of several clustering algorithms (k-means++, DBSCAN, OPTICS), as well as utilities for importing data and estimating the optimal number of clusters.

The reason

This library was built out of the need for a collection of performant cluster analysis utilities for Go. Thanks to its numerous advantages (single-binary distribution, solid performance, a growing community), Go is becoming an attractive alternative to the languages commonly used for statistical computing and machine learning, yet it still lacks crucial tools and libraries. I use the floats package from the robust Gonum library to perform optimized vector calculations in tight loops.

Install

If you have Go 1.7+, run:

go get github.com/mpraski/clusters

Usage

The currently supported hard clustering algorithms are represented by the HardClusterer interface, which defines several common operations. As an example, we create, train and use a k-means++ clusterer:

var data [][]float64
var observation []float64

// Create a new KMeans++ clusterer with 1000 iterations,
// 8 clusters and a distance measurement function of type func([]float64, []float64) float64.
// Pass nil to use clusters.EuclideanDistance.
c, e := clusters.KMeans(1000, 8, clusters.EuclideanDistance)
if e != nil {
	panic(e)
}

// Use the data to train the clusterer
if e = c.Learn(data); e != nil {
	panic(e)
}

fmt.Printf("Clustered data set into %d\n", c.Sizes())

fmt.Printf("Assigned observation %v to cluster %d\n", observation, c.Predict(observation))

for index, number := range c.Guesses() {
	fmt.Printf("Assigned data point %v to cluster %d\n", data[index], number)
}

Algorithms currently supported are k-means++, DBSCAN and OPTICS.

Algorithms which support online learning can be trained via the Online() function, which relies on channel communication to coordinate the process:

c, e := clusters.KMeans(1000, 8, clusters.EuclideanDistance)
if e != nil {
	panic(e)
}

c = c.WithOnline(clusters.Online{
	Alpha:     0.5,
	Dimension: 4,
})

var (
	send   = make(chan []float64)
	finish = make(chan struct{})
)

events := c.Online(send, finish)

go func() {
	for e := range events {
		fmt.Printf("Classified observation %v into cluster: %d\n", e.Observation, e.Cluster)
	}
}()

for i := 0; i < 10000; i++ {
	point := make([]float64, 4)
	for j := 0; j < 4; j++ {
		point[j] = 10 * (rand.Float64() - 0.5)
	}
	send <- point
}

finish <- struct{}{}

fmt.Printf("Clustered data set into %d\n", c.Sizes())

The Estimator interface defines the operation of guessing the optimal number of clusters in a dataset. As of now the KMeansEstimator is implemented using the gap statistic with k-means++ as the clustering algorithm (see https://web.stanford.edu/~hastie/Papers/gap.pdf):

var data [][]float64

// Create a new KMeans++ estimator with 1000 iterations, 
// a maximum of 8 clusters and default (EuclideanDistance) distance measurement
c, e := clusters.KMeansEstimator(1000, 8, clusters.EuclideanDistance)
if e != nil {
	panic(e)
}

r, e := c.Estimate(data)
if e != nil {
	panic(e)
}

fmt.Printf("Estimated number of clusters: %d\n", r)

The library also provides an Importer to load data from a file (as of now the CSV importer is implemented):

// Import the first three columns from data.csv
i := clusters.CsvImporter()
d, e := i.Import("data.csv", 0, 2)
if e != nil {
	panic(e)
}

Development

The list of project goals includes:

  • Implement commonly used hard clustering algorithms
  • Implement commonly used soft clustering algorithms
  • Devise reliable tests of performance and quality of each algorithm

Benchmarks

Soon to come.

License

MIT

Documentation

Overview

Package clusters provides abstract definitions of clusterers as well as their implementations.

Index

Constants

This section is empty.

Variables

View Source
var (
	// EuclideanDistance is one of the common distance measurements
	EuclideanDistance = func(a, b []float64) float64 {
		var (
			s, t float64
		)

		for i := range a {
			t = a[i] - b[i]
			s += t * t
		}

		return math.Sqrt(s)
	}

	// EuclideanDistanceSquared is one of the common distance measurements
	EuclideanDistanceSquared = func(a, b []float64) float64 {
		var (
			s, t float64
		)

		for i := range a {
			t = a[i] - b[i]
			s += t * t
		}

		return s
	}
)

Functions

This section is empty.

Types

type Clusterer

type Clusterer interface {
	Learn([][]float64) error
}

Clusterer defines the learning operation common to all algorithms

type DistanceFunc

type DistanceFunc func([]float64, []float64) float64

DistanceFunc represents a function for measuring distance between n-dimensional vectors.
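Any function with this signature can be supplied wherever the library accepts a distance. As a minimal, self-contained sketch (not part of the library), a Manhattan (L1) distance compatible with DistanceFunc could look like this:

```go
package main

import (
	"fmt"
	"math"
)

// ManhattanDistance sums the absolute coordinate differences (the L1 norm).
// It matches the DistanceFunc signature func([]float64, []float64) float64,
// so it could be passed to a clusterer in place of EuclideanDistance.
func ManhattanDistance(a, b []float64) float64 {
	var s float64
	for i := range a {
		s += math.Abs(a[i] - b[i])
	}
	return s
}

func main() {
	fmt.Println(ManhattanDistance([]float64{0, 0}, []float64{3, 4})) // 7
}
```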

type Estimator

type Estimator interface {

	// Estimate provides an expected number of clusters in the dataset
	Estimate([][]float64) (int, error)
}

Estimator defines a computation used to determine an optimal number of clusters in the dataset

func KMeansEstimator

func KMeansEstimator(iterations, clusters int, distance DistanceFunc) (Estimator, error)

Implementation of cluster number estimator using gap statistic ("Estimating the number of clusters in a data set via the gap statistic", Tibshirani et al.) with k-means++ as clustering algorithm
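For intuition, the gap statistic compares log(W_k), the log of the pooled within-cluster dispersion, against its expectation under a null reference distribution, for each candidate k. A toy, self-contained sketch of computing W_k for a fixed assignment (an illustration of the quantity from the paper, not the library's internal code):

```go
package main

import (
	"fmt"
	"math"
)

// dispersion computes W_k = sum over clusters r of D_r/(2*n_r), where D_r is
// the sum of pairwise squared distances within cluster r and n_r its size
// (Tibshirani et al., "Estimating the number of clusters via the gap statistic").
func dispersion(data [][]float64, labels []int, k int) float64 {
	var w float64
	for r := 0; r < k; r++ {
		var d float64
		var n int
		for i := range data {
			if labels[i] != r {
				continue
			}
			n++
			for j := range data {
				if labels[j] != r {
					continue
				}
				for c := range data[i] {
					t := data[i][c] - data[j][c]
					d += t * t
				}
			}
		}
		if n > 0 {
			w += d / (2 * float64(n))
		}
	}
	return w
}

func main() {
	// Two tight 1-D clusters: {0, 1} and {10, 11}.
	data := [][]float64{{0}, {1}, {10}, {11}}
	labels := []int{0, 0, 1, 1}
	fmt.Println(math.Log(dispersion(data, labels, 2))) // log(1) = 0
}
```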

type HCEvent

type HCEvent struct {
	Cluster     int
	Observation []float64
}

HCEvent represents an intermediate result of a hard clustering computation; such events are transmitted periodically to the caller during online learning

type HardClusterer

type HardClusterer interface {

	// Sizes returns sizes of respective clusters
	Sizes() []int

	// Guesses returns mapping from data point indices to cluster numbers. Clusters' numbering begins at 1.
	Guesses() []int

	// Predict returns the number of the cluster to which the observation would be assigned
	Predict(observation []float64) int

	// IsOnline tells whether the algorithm supports online learning
	IsOnline() bool

	// WithOnline configures the algorithm for online learning with the given parameters
	WithOnline(Online) HardClusterer

	// Online begins the process of online training of an algorithm. Observations are sent on the observations channel;
	// once no more are expected, an empty struct needs to be sent on the done channel. The caller receives intermediate
	// results of computation via the returned channel.
	Online(observations chan []float64, done chan struct{}) chan *HCEvent

	// Clusterer provides the common learning operation
	Clusterer
}

HardClusterer defines a set of operations for hard clustering algorithms

func DBSCAN

func DBSCAN(minpts int, eps float64, workers int, distance DistanceFunc) (HardClusterer, error)

Implementation of the DBSCAN algorithm with concurrent nearest neighbour computation. The number of goroutines acting concurrently is controlled via the workers argument. Passing 0 will result in this number being chosen arbitrarily.
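For intuition about the minpts and eps parameters: DBSCAN treats a point as a core point when at least minpts points (including itself) lie within distance eps of it. A self-contained sketch of that neighbourhood test (illustrative only, not the library's internals):

```go
package main

import (
	"fmt"
	"math"
)

// regionQuery returns the indices of all points within Euclidean distance
// eps of data[p], including p itself. DBSCAN marks p as a core point when
// the result holds at least minpts indices.
func regionQuery(data [][]float64, p int, eps float64) []int {
	var neighbours []int
	for i := range data {
		var s float64
		for c := range data[p] {
			t := data[p][c] - data[i][c]
			s += t * t
		}
		if math.Sqrt(s) <= eps {
			neighbours = append(neighbours, i)
		}
	}
	return neighbours
}

func main() {
	data := [][]float64{{0, 0}, {0, 1}, {1, 0}, {10, 10}}
	n := regionQuery(data, 0, 1.5)
	// With minpts = 3, point 0 qualifies as a core point.
	fmt.Println(len(n), len(n) >= 3) // 3 true
}
```

The library parallelizes this neighbourhood search across the configured number of workers, which is why the query dominates DBSCAN's running time on large datasets.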

func KMeans

func KMeans(iterations, clusters int, distance DistanceFunc) (HardClusterer, error)

Implementation of k-means++ algorithm with online learning

func OPTICS

func OPTICS(minpts int, eps, xi float64, workers int, distance DistanceFunc) (HardClusterer, error)

Implementation of the OPTICS algorithm with concurrent nearest neighbour computation. The number of goroutines acting concurrently is controlled via the workers argument. Passing 0 will result in this number being chosen arbitrarily.

type Importer

type Importer interface {

	// Import fetches the data from a file; the start and end arguments allow the user
	// to specify the span of data columns to be imported (inclusive)
	Import(file string, start, end int) ([][]float64, error)
}

Importer defines the operation of importing a dataset from an external file

func CsvImporter

func CsvImporter() Importer

func JsonImporter

func JsonImporter() Importer

type Online

type Online struct {
	Alpha     float64
	Dimension int
}

Online represents parameters important for online learning in clustering algorithms.
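Dimension must match the length of incoming observations; Alpha plays the role of a learning rate. A common online centroid update uses it as an exponential moving-average weight. Whether the library uses exactly this rule is an assumption; the sketch below only illustrates the idea:

```go
package main

import "fmt"

// updateCentroid moves centre a fraction alpha of the way toward x:
// c <- c + alpha*(x - c). With alpha = 0.5 the centre halves its
// distance to each new observation; smaller alpha adapts more slowly.
func updateCentroid(centre, x []float64, alpha float64) {
	for i := range centre {
		centre[i] += alpha * (x[i] - centre[i])
	}
}

func main() {
	centre := []float64{0, 0}
	updateCentroid(centre, []float64{4, 8}, 0.5)
	fmt.Println(centre) // [2 4]
}
```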
