# base

package
Version: v0.0.0-...-bf439dd (latest)

Published: Jul 23, 2021 License: MIT

### Base Package

#### `import "github.com/cdipaolo/goml/base"`

This package helps define common patterns (interfaces) and lets you work with data: getting it into your programs and munging through it.

This package also implements optimization algorithms which can be made available to a user's own models by implementing easy to use interfaces.

## Documentation ¶

### Overview ¶

Package base declares models, interfaces, and methods to be used when working with the rest of the goml library. It also includes common functions, used both by the rest of the library and for the user's convenience, for working with data, persisting it to files, and optimizing functions.

### Constants ¶

This section is empty.

### Variables ¶

This section is empty.

### Functions ¶

#### func EuclideanDistance ¶

`func EuclideanDistance(u []float64, v []float64) float64`

EuclideanDistance returns the distance between two float64 vectors. NOTE that this function does not check that the vectors are the same length (to improve computation speed in, say, KNN). Make sure you pass in same-length vectors.
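Because the length check is left to the caller, one option is to validate once at the boundary of your own code. The following is a minimal sketch, not the library's implementation: `euclidean` is a local stand-in for the documented behaviour, and `checkedDistance` is a hypothetical wrapper name.

```go
package main

import (
	"errors"
	"fmt"
	"math"
)

// euclidean is a local stand-in for the documented behaviour: it assumes
// len(u) == len(v) and performs no check of its own.
func euclidean(u, v []float64) float64 {
	var sum float64
	for i := range u {
		d := u[i] - v[i]
		sum += d * d
	}
	return math.Sqrt(sum)
}

// checkedDistance is a hypothetical wrapper that validates the lengths
// before handing the vectors to the unchecked distance function.
func checkedDistance(u, v []float64) (float64, error) {
	if len(u) != len(v) {
		return 0, errors.New("vectors must have the same length")
	}
	return euclidean(u, v), nil
}

func main() {
	d, err := checkedDistance([]float64{0, 0}, []float64{3, 4})
	fmt.Println(d, err) // 5 <nil>
}
```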

#### func GaussianKernel ¶

`func GaussianKernel(sigma float64) func([]float64, []float64) float64`

GaussianKernel takes in a parameter for sigma (σ) and returns a valid (Gaussian) Radial Basis Function Kernel. If the input dimensions aren't valid, the kernel will return 0.0 (as if the vectors are orthogonal).

```
K(x, x`) = exp(-1 * |x - x`|^2 / (2σ^2))
```

This can be used within any model that can use kernels.

Sigma (σ) defaults to 1 if 0.0 is given.
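To make the formula and the two documented fallbacks concrete, here is a small sketch of a kernel with the same shape. It is an illustration written from the description above, not the library's source; `gaussianKernel` is a local name.

```go
package main

import (
	"fmt"
	"math"
)

// gaussianKernel is an illustrative stand-in, not the library's code.
// The returned function computes exp(-|x - y|^2 / (2σ^2)).
func gaussianKernel(sigma float64) func([]float64, []float64) float64 {
	if sigma == 0 {
		sigma = 1 // documented default when 0.0 is given
	}
	denom := 2 * sigma * sigma
	return func(x, y []float64) float64 {
		if len(x) != len(y) {
			return 0 // documented result for invalid dimensions
		}
		var sq float64
		for i := range x {
			d := x[i] - y[i]
			sq += d * d
		}
		return math.Exp(-sq / denom)
	}
}

func main() {
	k := gaussianKernel(0) // falls back to σ = 1
	fmt.Println(k([]float64{1, 2}, []float64{1, 2}))    // 1
	fmt.Println(k([]float64{1, 2}, []float64{1, 2, 3})) // 0
}
```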

#### func GradientAscent ¶

`func GradientAscent(d Ascendable) error`

GradientAscent operates on an Ascendable model and further optimizes the parameter vector Theta of the model, which is then used within the Predict function.

Gradient Ascent follows the following algorithm: θ[j] := θ[j] + α·∂J(θ)/∂θ[j]

where J(θ) is the cost function, α is the learning rate, and θ[j] is the j-th value in the parameter vector.
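The update rule can be sketched as a plain loop over a model that exposes the Ascendable method set. Everything below is illustrative: `ascendable`, `concave`, and `gradientAscent` are local names, the toy objective is an assumption chosen so the optimum is known, and the loop runs for the full iteration cap because the library's convergence test is not described here.

```go
package main

import "fmt"

// ascendable mirrors the method set of the Ascendable interface.
type ascendable interface {
	LearningRate() float64
	Dj(int) (float64, error)
	Theta() []float64
	MaxIterations() int
}

// concave is a toy model with J(θ) = -Σ (θ[j] - target[j])²,
// which is maximised at θ = target.
type concave struct {
	theta  []float64
	target []float64
}

func (c *concave) LearningRate() float64 { return 0.1 }
func (c *concave) MaxIterations() int    { return 200 }
func (c *concave) Theta() []float64      { return c.theta }
func (c *concave) Dj(j int) (float64, error) {
	if j < 0 || j >= len(c.theta) {
		return 0, fmt.Errorf("parameter index %d out of range", j)
	}
	return -2 * (c.theta[j] - c.target[j]), nil
}

// gradientAscent applies θ[j] := θ[j] + α·∂J/∂θ[j] for every j,
// computing the whole gradient before changing any parameter.
func gradientAscent(m ascendable) error {
	theta := m.Theta()
	grad := make([]float64, len(theta))
	for iter := 0; iter < m.MaxIterations(); iter++ {
		for j := range theta {
			g, err := m.Dj(j)
			if err != nil {
				return err
			}
			grad[j] = g
		}
		for j := range theta {
			theta[j] += m.LearningRate() * grad[j]
		}
	}
	return nil
}

func main() {
	m := &concave{theta: []float64{0, 0}, target: []float64{3, -1}}
	if err := gradientAscent(m); err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("%.3f\n", m.Theta()) // [3.000 -1.000]
}
```

Because Theta returns a slice, writes through it are visible to the model, which is why the loop can update the parameters in place.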

#### func LinearKernel ¶

`func LinearKernel() func([]float64, []float64) float64`

LinearKernel is the base kernel function. It will return a valid kernel for use within models that can use the Kernel Trick. The resultant kernel just takes the dot/inner product of its argument vectors.

```
K(x, x`) = x*x`
```

This is also a member of the Homogeneous Polynomial kernel family (where the degree is 1 in this case): https://en.wikipedia.org/wiki/Homogeneous_polynomial

Using this kernel is effectively the same as not using a kernel at all (for SVM and Kernel perceptron, at least).

#### func LoadDataFromCSV ¶

`func LoadDataFromCSV(filepath string) ([][]float64, []float64, error)`

LoadDataFromCSV takes in a path to a CSV file and loads that data into a 2D slice of 'X' values and a 1D slice of 'Y' (expected result) values. Errors are returned if there are any problems.

Expected Data Format:

- There should be no header/text lines.
- The 'Y' (expected value) column should be the last column of the CSV.

Example CSV file with 2 input parameters:

```
>>>>>>> BEGIN FILE
1.06,2.30,17
17.62,12.06,18.92
11.623,1.1,15.093
12.01,6,15.032
...
>>>>>>> END FILE
```
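The layout above (inputs first, expected value last, no header) can be split into X and Y with the standard library alone. This sketch parses an in-memory string so it stays self-contained; `parseXY` is a hypothetical helper and not part of the package.

```go
package main

import (
	"encoding/csv"
	"fmt"
	"strconv"
	"strings"
)

// parseXY is a hypothetical helper: every column except the last goes
// into x, and the last column goes into y.
func parseXY(data string) ([][]float64, []float64, error) {
	rows, err := csv.NewReader(strings.NewReader(data)).ReadAll()
	if err != nil {
		return nil, nil, err
	}
	var x [][]float64
	var y []float64
	for _, row := range rows {
		if len(row) < 2 {
			return nil, nil, fmt.Errorf("row needs at least one input and one expected value: %v", row)
		}
		vals := make([]float64, len(row))
		for i, field := range row {
			v, err := strconv.ParseFloat(strings.TrimSpace(field), 64)
			if err != nil {
				return nil, nil, err
			}
			vals[i] = v
		}
		last := len(vals) - 1
		x = append(x, vals[:last])
		y = append(y, vals[last])
	}
	return x, y, nil
}

func main() {
	x, y, err := parseXY("1.06,2.30,17\n17.62,12.06,18.92\n")
	fmt.Println(x, y, err) // [[1.06 2.3] [17.62 12.06]] [17 18.92] <nil>
}
```

A header line would fail in `strconv.ParseFloat`, which is one reason the format asks for data rows only.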

#### func LoadDataFromCSVToStream ¶

`func LoadDataFromCSVToStream(filepath string, data chan Datapoint, errors chan error)`

LoadDataFromCSVToStream loads a CSV data file just like LoadDataFromCSV, but it pushes each row into a data channel as it scans. This is useful for very large CSV files where you would want to learn (using the online model methods) as you read the data, so as to minimize memory usage.

The errors channel will be passed any errors.

When the function returns, either in the case of an error, or at the end of reading, both the data stream channel and the errors channel will be closed.
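Since both channels are closed on return, a consumer can drain them in one select loop and stop when both are closed. The sketch below imitates that contract with local stand-ins (`datapoint`, `streamRows`, `collect` are hypothetical names) and reads from a string rather than a file. It assumes the producer stops at the first bad row, which may differ from the library's behaviour.

```go
package main

import (
	"bufio"
	"fmt"
	"strconv"
	"strings"
)

// datapoint is a local stand-in for the package's Datapoint struct.
type datapoint struct {
	X []float64
	Y []float64
}

// streamRows pushes one datapoint per line and closes both channels
// when it returns, whether it finished or hit an error.
func streamRows(data string, out chan datapoint, errs chan error) {
	defer close(out)
	defer close(errs)
	sc := bufio.NewScanner(strings.NewReader(data))
	for sc.Scan() {
		fields := strings.Split(sc.Text(), ",")
		vals := make([]float64, len(fields))
		for i, f := range fields {
			v, err := strconv.ParseFloat(strings.TrimSpace(f), 64)
			if err != nil {
				errs <- err
				return // assumption: stop at the first bad row
			}
			vals[i] = v
		}
		last := len(vals) - 1
		out <- datapoint{X: vals[:last], Y: vals[last:]}
	}
	if err := sc.Err(); err != nil {
		errs <- err
	}
}

// collect drains both channels until each one has been closed.
func collect(data string) ([]datapoint, []error) {
	out := make(chan datapoint)
	errs := make(chan error)
	go streamRows(data, out, errs)
	var points []datapoint
	var problems []error
	for out != nil || errs != nil {
		select {
		case p, ok := <-out:
			if !ok {
				out = nil // closed: stop selecting on it
				continue
			}
			points = append(points, p)
		case err, ok := <-errs:
			if !ok {
				errs = nil
				continue
			}
			problems = append(problems, err)
		}
	}
	return points, problems
}

func main() {
	points, problems := collect("1.06,2.30,17\n17.62,12.06,18.92\n")
	fmt.Println(len(points), len(problems)) // 2 0
}
```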

#### func ManhattanDistance ¶

`func ManhattanDistance(u []float64, v []float64) float64`

ManhattanDistance returns the Manhattan distance between two float64 vectors. This is the sum of the absolute differences between each pair of corresponding values.

Example Points:

```
 .
 |
2|
 |______.
     2
```

Note that the Euclidean distance between these 2 points is 2*sqrt(2) ≈ 2.828. The Manhattan distance is 4.

NOTE that this function does not check that the vectors are the same length (to improve computation speed in, say, KNN). Make sure you pass in same-length vectors.

#### func Normalize ¶

`func Normalize(x [][]float64)`

Normalize takes in an array of arrays of inputs and normalizes each 'row' of data, in place, to unit vector length.

That is: x[i][j] := x[i][j] / |x[i]|
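As a worked illustration of the rule, the sketch below divides each row by its Euclidean length in place. `normalize` is a local stand-in, and leaving all-zero rows untouched is an assumption made here to avoid dividing by zero; the library's handling of that case is not documented above.

```go
package main

import (
	"fmt"
	"math"
)

// normalize scales every row of x to unit length, modifying x in place.
func normalize(x [][]float64) {
	for i := range x {
		var sq float64
		for _, v := range x[i] {
			sq += v * v
		}
		norm := math.Sqrt(sq)
		if norm == 0 {
			continue // assumption: leave all-zero rows as they are
		}
		for j := range x[i] {
			x[i][j] /= norm
		}
	}
}

func main() {
	x := [][]float64{{3, 4}, {0, 0}}
	normalize(x)
	fmt.Println(x) // [[0.6 0.8] [0 0]]
}
```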

#### func NormalizePoint ¶

`func NormalizePoint(x []float64)`

NormalizePoint is the same as Normalize, but it only operates on a single datapoint, normalizing its value to unit length.

#### func OnlyAsciiLetters ¶

`func OnlyAsciiLetters(r rune) bool`

OnlyAsciiLetters is a transform function that will only let a-zA-Z through

#### func OnlyAsciiWords ¶

`func OnlyAsciiWords(r rune) bool`

OnlyAsciiWords is a transform function that will only let a-zA-Z and spaces through

#### func OnlyAsciiWordsAndNumbers ¶

`func OnlyAsciiWordsAndNumbers(r rune) bool`

OnlyAsciiWordsAndNumbers is a transform function that will only let 0-9, a-zA-Z, and spaces through

#### func OnlyLetters ¶

`func OnlyLetters(r rune) bool`

OnlyLetters is a transform function that lets any unicode letter through

#### func OnlyWords ¶

`func OnlyWords(r rune) bool`

OnlyWords is a transform function that lets any unicode letter through as well as spaces

#### func OnlyWordsAndNumbers ¶

`func OnlyWordsAndNumbers(r rune) bool`

OnlyWordsAndNumbers is a transform function that lets any unicode letter or digit through as well as spaces
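The general idea behind these rune predicates can be sketched with the standard library. Note the assumptions: `keepWordsAndNumbers` and `filter` are local, hypothetical names, and the predicate here returns true for runes to keep. The text above does not state whether the package's own functions return true for runes that pass or for runes that are dropped, so check that before plugging them into a helper like this one.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// keepWordsAndNumbers is a local predicate: true means "keep this rune".
func keepWordsAndNumbers(r rune) bool {
	return unicode.IsLetter(r) || unicode.IsDigit(r) || r == ' '
}

// filter drops every rune for which keep returns false.
func filter(s string, keep func(rune) bool) string {
	return strings.Map(func(r rune) rune {
		if keep(r) {
			return r
		}
		return -1 // a negative rune tells strings.Map to drop the character
	}, s)
}

func main() {
	fmt.Println(filter("Héllo, wörld 42!", keepWordsAndNumbers)) // Héllo wörld 42
}
```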

#### func PolynomialKernel ¶

`func PolynomialKernel(d int, constants ...float64) func([]float64, []float64) float64`

PolynomialKernel takes in the degree of the polynomial as its main argument, plus an optional constant (any extra args passed are added together and count as the constant), and returns a valid kernel in the Polynomial Function Kernel family. This kernel can be used with all models that take kernels.

```
K(x, x`) = (x*x` + c)^d
```

Note that if no extra argument is passed (no constant) then the kernel is a Homogeneous Polynomial Kernel (as opposed to Inhomogeneous!) Also, if there is no constant and d=1, then the returned kernel is equivalent to (though less efficient than) LinearKernel().

`d` will default to 1 if 0 is given.
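A short sketch shows how the variadic constants fold into `c` and how the degree-1, no-constant case reduces to a dot product. `polynomialKernel` is a local illustration, not the library's source, and returning 0 for mismatched lengths is an assumption borrowed from the Gaussian kernel's documented behaviour.

```go
package main

import (
	"fmt"
	"math"
)

// polynomialKernel returns a function computing (x·y + c)^d, where c is
// the sum of all extra arguments.
func polynomialKernel(d int, constants ...float64) func([]float64, []float64) float64 {
	if d == 0 {
		d = 1 // documented default
	}
	var c float64
	for _, v := range constants {
		c += v
	}
	return func(x, y []float64) float64 {
		if len(x) != len(y) {
			return 0 // assumption: treat mismatched vectors as orthogonal
		}
		var dot float64
		for i := range x {
			dot += x[i] * y[i]
		}
		return math.Pow(dot+c, float64(d))
	}
}

func main() {
	x, y := []float64{1, 2}, []float64{3, 4}
	fmt.Println(polynomialKernel(2, 1)(x, y)) // (11 + 1)^2 = 144
	fmt.Println(polynomialKernel(1)(x, y))    // plain dot product: 11
}
```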

#### func SaveDataToCSV ¶

`func SaveDataToCSV(filepath string, x [][]float64, y []float64, highPrecision bool) error`

SaveDataToCSV takes in an absolute filepath, as well as a 2D array of 'X' values and a 1D array of 'Y', or expected values, combines them into the same format that LoadDataFromCSV expects, and saves that data to a file, returning any errors.

highPrecision is a boolean where if true the values will be stored with a 64 bit precision when converting the floats to strings. Otherwise (if it's false) it uses 32 bits.
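The effect of the precision flag can be seen with `strconv.FormatFloat`, whose last argument selects 32-bit or 64-bit rounding. This is only a sketch of the idea: `formatRow` is a hypothetical helper, and the `'g'` format with shortest-representation precision is an assumption, since the exact format the library uses is not stated here.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// formatRow renders one CSV line with the inputs first and the expected
// value last, using 64-bit or 32-bit float formatting.
func formatRow(x []float64, y float64, highPrecision bool) string {
	bits := 32
	if highPrecision {
		bits = 64
	}
	fields := make([]string, 0, len(x)+1)
	for _, v := range x {
		fields = append(fields, strconv.FormatFloat(v, 'g', -1, bits))
	}
	fields = append(fields, strconv.FormatFloat(y, 'g', -1, bits))
	return strings.Join(fields, ",")
}

func main() {
	x := []float64{1.0 / 3.0, 2}
	fmt.Println(formatRow(x, 17, true))  // 0.3333333333333333,2,17
	fmt.Println(formatRow(x, 17, false)) // 0.33333334,2,17
}
```

The 32-bit form is shorter on disk, but values read back from it will not round-trip to the original float64.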

#### func StochasticGradientAscent ¶

`func StochasticGradientAscent(d StochasticAscendable) error`

StochasticGradientAscent operates on a StochasticAscendable model and further optimizes the parameter vector Theta of the model, which is then used within the Predict function. Stochastic gradient ascent updates the parameter vector after looking at each individual training example. As a result it may never settle exactly on the optimum, and an individual update can even move the cost function in the wrong direction, but it will typically converge faster than batch gradient ascent (implemented as func GradientAscent(d Ascendable) error) because of that very difference.

Gradient Ascent follows the following algorithm: θ[j] := θ[j] + α·∂J(θ)/∂θ[j]

where J(θ) is the cost function, α is the learning rate, and θ[j] is the j-th value in the parameter vector.
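The per-example update can be sketched with a toy model whose optimum is the mean of its training values. All names here (`stochasticAscendable`, `meanModel`, `stochasticGradientAscent`) are local illustrations, and treating MaxIterations as the number of passes over the data is an assumption. The final value lands near the mean of 2.5 rather than exactly on it, which is the non-convergence the description mentions.

```go
package main

import (
	"errors"
	"fmt"
)

// stochasticAscendable mirrors the StochasticAscendable method set.
type stochasticAscendable interface {
	LearningRate() float64
	Examples() int
	Dij(int, int) (float64, error)
	Theta() []float64
	MaxIterations() int
}

// meanModel has per-example cost J_i(θ) = -(θ[0] - xs[i])², so the sum
// over all examples is maximised when θ[0] is the mean of xs.
type meanModel struct {
	xs    []float64
	theta []float64
}

func (m *meanModel) LearningRate() float64 { return 0.01 }
func (m *meanModel) Examples() int         { return len(m.xs) }
func (m *meanModel) MaxIterations() int    { return 500 }
func (m *meanModel) Theta() []float64      { return m.theta }
func (m *meanModel) Dij(i, j int) (float64, error) {
	if i < 0 || i >= len(m.xs) || j != 0 {
		return 0, errors.New("index out of range")
	}
	return -2 * (m.theta[0] - m.xs[i]), nil
}

// stochasticGradientAscent updates θ after every single example.
func stochasticGradientAscent(m stochasticAscendable) error {
	theta := m.Theta()
	grad := make([]float64, len(theta))
	for iter := 0; iter < m.MaxIterations(); iter++ { // assumption: one iteration = one pass
		for i := 0; i < m.Examples(); i++ {
			for j := range theta {
				g, err := m.Dij(i, j)
				if err != nil {
					return err
				}
				grad[j] = g
			}
			for j := range theta {
				theta[j] += m.LearningRate() * grad[j]
			}
		}
	}
	return nil
}

func main() {
	m := &meanModel{xs: []float64{1, 2, 3, 4}, theta: []float64{0}}
	if err := stochasticGradientAscent(m); err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("%.2f\n", m.Theta()[0]) // close to, but not exactly, 2.50
}
```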

#### func TanhKernel ¶

`func TanhKernel(k float64, constants ...float64) func([]float64, []float64) float64`

TanhKernel takes in a required Kappa modifier parameter (defaults to 1.0 if 0.0 is given) and optional float64 args afterwards, which will be added together to create a constant term (the general recommended use is to just pass one arg as the constant if you need it).

```
K(x, x`) = tanh(κx*x` + c)
```

Note that c must be less than 0 (if c >= 0 it defaults to -1.0). κ must be greater than 0 for most cases, but not all, which is why no default is forced on it.
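The defaults described above can be traced in a small sketch. `tanhKernel` is a local illustration written from this description rather than the library's source, and returning 0 for mismatched lengths is an assumption.

```go
package main

import (
	"fmt"
	"math"
)

// tanhKernel returns a function computing tanh(κ·(x·y) + c).
func tanhKernel(k float64, constants ...float64) func([]float64, []float64) float64 {
	if k == 0 {
		k = 1 // documented default for κ
	}
	var c float64
	for _, v := range constants {
		c += v
	}
	if c >= 0 {
		c = -1 // documented default when the constant is not negative
	}
	return func(x, y []float64) float64 {
		if len(x) != len(y) {
			return 0 // assumption: treat mismatched vectors as orthogonal
		}
		var dot float64
		for i := range x {
			dot += x[i] * y[i]
		}
		return math.Tanh(k*dot + c)
	}
}

func main() {
	k := tanhKernel(0) // κ = 1, c = -1 after defaults
	fmt.Println(k([]float64{1, 0}, []float64{1, 0})) // tanh(1 - 1) = 0
}
```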

### Types ¶

#### type Ascendable ¶

```
type Ascendable interface {
	// LearningRate returns the learning rate α
	// to be used in Gradient Ascent as the
	// modifier term
	LearningRate() float64

	// Dj returns the derivative of the cost function
	// J(θ) with respect to the j-th parameter of
	// the hypothesis, θ[j]. Called as Dj(j)
	Dj(int) (float64, error)

	// Theta returns a pointer to the parameter vector
	// theta, which is 1D vector of floats
	Theta() []float64

	// MaxIterations returns the maximum number of
	// iterations to try using gradient ascent. Might
	// return after fewer if strong convergence is
	// detected, but it'll let the user set a cap.
	MaxIterations() int
}
```

Ascendable is an interface that can be used with batch gradient descent where the parameter vector theta is in one dimension only (so softmax regression would need its own model, for example).

#### type Datapoint ¶

```
type Datapoint struct {
	X []float64 `json:"x"`
	Y []float64 `json:"y"`
}
```

Datapoint is used in some models where it is cleaner to pass data as a struct rather than just as 1D and 2D arrays, as the Generalized Linear Models do, for example. X corresponds to the inputs and Y corresponds to the result of the hypothesis.

This is used with the Perceptron, for example, so data can be easily passed in channels while staying encapsulated well.

#### type DistanceMeasure ¶

`type DistanceMeasure func([]float64, []float64) float64`

DistanceMeasure is any function that maps two vectors of float64s to a float64. Used for vector distance calculations

#### func LNorm ¶

`func LNorm(p int) DistanceMeasure`

LNorm returns a DistanceMeasure of the L-p norm. L norms are a generalized family that includes the Euclidean and Manhattan distances.

- (p = 1) -> Manhattan Distance
- (p = 2) -> Euclidean Distance

NOTE that this function does not check that the vectors are the same length (to improve computation speed in, say, KNN). Make sure you pass in same-length vectors.
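The relationship between p and the two named distances can be checked with a short sketch that returns a closure per p. `lNorm` and `distanceMeasure` are local stand-ins, the sketch assumes p ≥ 1, and like the documented function it performs no length check. The points used are the two from the ManhattanDistance example.

```go
package main

import (
	"fmt"
	"math"
)

// distanceMeasure mirrors the DistanceMeasure function type.
type distanceMeasure func([]float64, []float64) float64

// lNorm returns the distance (Σ |u[i] - v[i]|^p)^(1/p) for p >= 1.
func lNorm(p int) distanceMeasure {
	return func(u, v []float64) float64 {
		var sum float64
		for i := range u {
			sum += math.Pow(math.Abs(u[i]-v[i]), float64(p))
		}
		return math.Pow(sum, 1/float64(p))
	}
}

func main() {
	a, b := []float64{0, 2}, []float64{2, 0}
	fmt.Printf("%.3f\n", lNorm(1)(a, b)) // 4.000 (Manhattan)
	fmt.Printf("%.3f\n", lNorm(2)(a, b)) // 2.828 (Euclidean)
}
```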

#### type Model ¶

```
type Model interface {
	// The variadic argument in Predict is an
	// optional arg which (if true) tells the
	// function to first normalize the input to
	// vector unit length. Use (and only use) this
	// if you trained on normalized inputs.
	Predict([]float64, ...bool) ([]float64, error)

	// PersistToFile and RestoreFromFile both take
	// in paths (absolute paths!) to files and
	// persists the necessary data to the filepath
	// such that you can RestoreFromFile later and
	// have the same instance. Helpful when you want
	// to train a model, save it to a file, then
	// open it later for prediction
	PersistToFile(string) error
	RestoreFromFile(string) error
}
```

Model is an interface for models that train, in a supervised manner, on a 2D array of data (called x) and an array (y) of solution data. Predict takes in a vector of floats and returns the predicted response (again as floats) and an error, if any.

#### type OnlineModel ¶

```
type OnlineModel interface {
	Predict([]float64) ([]float64, error)

	// OnlineLearn has no outputs so you can run the data
	// within a separate goroutine! A channel of
	// errors is passed so you know when there's been
	// an error in learning, though learning will
	// just ignore the datapoint that caused the
	// error and continue on.
	//
	// Most times errors are caused when passed
	// datapoints are not of a consistent dimension.
	//
	// The function passed is a callback that is called
	// whenever the parameter vector theta is updated
	OnlineLearn(chan error, func([]float64))

	// used in learning for the algorithm

	// PersistToFile and RestoreFromFile both take
	// in paths (absolute paths!) to files and
	// persists the necessary data to the filepath
	// such that you can RestoreFromFile later and
	// have the same instance. Helpful when you want
	// to train a model, save it to a file, then
	// open it later for prediction
	PersistToFile(string) error
	RestoreFromFile(string) error
}
```

OnlineModel differs from Model because the learning can take place in a goroutine because the data is passed through a channel, ending when the channel is closed.
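The goroutine pattern this describes can be sketched with a toy learner. Everything here is hypothetical: `runningMean`, its `stream` field, and `train` are local names, and the interface above does not show how a real model receives its data channel. Closing the errors channel when learning ends is also an assumption of this sketch.

```go
package main

import "fmt"

// datapoint is a local stand-in for the package's Datapoint struct.
type datapoint struct {
	X []float64
	Y []float64
}

// runningMean is a toy learner: theta[0] tracks the mean of Y[0].
type runningMean struct {
	stream chan datapoint
	theta  []float64
	seen   int
}

// OnlineLearn reads datapoints until the stream is closed. Bad points
// are reported on errs and skipped, and learning continues.
func (m *runningMean) OnlineLearn(errs chan error, onUpdate func([]float64)) {
	defer close(errs) // assumption: signal completion by closing errs
	for p := range m.stream {
		if len(p.Y) != 1 {
			errs <- fmt.Errorf("expected 1 output value, got %d", len(p.Y))
			continue
		}
		m.seen++
		m.theta[0] += (p.Y[0] - m.theta[0]) / float64(m.seen)
		onUpdate(m.theta)
	}
}

// train feeds points to the learner while it runs in its own goroutine.
func train(points []datapoint) (theta []float64, updates int, problems int) {
	m := &runningMean{stream: make(chan datapoint), theta: []float64{0}}
	// Buffered so the learner never blocks while reporting an error.
	errs := make(chan error, len(points))
	go m.OnlineLearn(errs, func([]float64) { updates++ })
	for _, p := range points {
		m.stream <- p
	}
	close(m.stream) // closing the data channel ends learning
	for range errs {
		problems++
	}
	return m.theta, updates, problems
}

func main() {
	theta, updates, problems := train([]datapoint{
		{Y: []float64{2}},
		{Y: []float64{1, 2}}, // wrong dimension: reported and skipped
		{Y: []float64{4}},
	})
	fmt.Println(theta, updates, problems) // [3] 2 1
}
```

Reading the results only after the errors channel is closed keeps the caller from racing with the learner goroutine.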

#### type OnlineTextModel ¶

```
type OnlineTextModel interface {
	// Predict takes a document and returns the
	// expected class found by the model
	Predict(string) uint8

	// OnlineLearn has no outputs so you can run the data
	// within a separate goroutine! A channel of
	// errors is passed so you know when there's been
	// an error in learning, though learning will
	// just ignore the datapoint that caused the
	// error and continue on.
	OnlineLearn(chan<- error)

	// used in learning for the algorithm

	// PersistToFile and RestoreFromFile both take
	// in paths (absolute paths!) to files and
	// persists the necessary data to the filepath
	// such that you can RestoreFromFile later and
	// have the same instance. Helpful when you want
	// to train a model, save it to a file, then
	// open it later for prediction
	PersistToFile(string) error
	RestoreFromFile(string) error
}
```

OnlineTextModel holds the interface for text classifiers. They have the regular learn & predict functions, but don't include an updating callback func in OnlineLearn because the parameter vector passed would very often be _huge_, and therefore would be a detriment to performance.

#### type OptimizationMethod ¶

`type OptimizationMethod string`

OptimizationMethod defines a type enum which (using constants declared below) lets a user pass in an optimization method to use when creating a new model.

```
const (
	BatchGA OptimizationMethod = "Batch Gradient Ascent"
)
```

Constants declare the types of optimization methods you can use.

#### type StochasticAscendable ¶

```
type StochasticAscendable interface {
	// LearningRate returns the learning rate α
	// to be used in Gradient Ascent as the
	// modifier term
	LearningRate() float64

	// Examples returns the number of examples in the
	// training set the model is using
	Examples() int

	// Dij returns the derivative of the cost function
	// J(θ) with respect to the j-th parameter of
	// the hypothesis, θ[j], for the training example
	// x[i]. Called as Dij(i,j)
	Dij(int, int) (float64, error)

	// Theta returns a pointer to the parameter vector
	// theta, which is 1D vector of floats
	Theta() []float64

	// MaxIterations returns the maximum number of
	// iterations to try using gradient ascent. Might
	// return after fewer if strong convergence is
	// detected, but it'll let the user set a cap.
	MaxIterations() int
}
```

StochasticAscendable is an interface that can be used with stochastic gradient descent where the parameter vector theta is in one dimension only (so softmax regression would need its own model, for example).

#### type TextDatapoint ¶

```
type TextDatapoint struct {
	X string `json:"x"`
	Y uint8  `json:"y"`
}
```

TextDatapoint is the data structure expected for text classification models. The passed types, therefore, are inherently different from the other structures. X is now a string (a document, which would usually be a sentence or multiple sentences). Y is now a uint8 denoting the class, because you can't regress on text classification (at least not well/effectively).