nlp

package module

Version: v0.0.0-...-26d441f (latest). Published: May 11, 2021. License: MIT. Imports: 20. Imported by: 21.

Repository

github.com/james-bowman/nlp

README ¶

Natural Language Processing

Implementations of selected machine learning algorithms for natural language processing in golang. The primary focus for the package is the statistical semantics of plain-text documents supporting semantic analysis and retrieval of semantically similar documents.

Built upon the Gonum package for linear algebra and scientific computing with some inspiration taken from Python's scikit-learn and Gensim.

Check out the companion blog post or the Go documentation page for full usage and examples.

Features

LSA (Latent Semantic Analysis aka Latent Semantic Indexing (LSI)) implementation using truncated SVD (Singular Value Decomposition) for dimensionality reduction.
Fast comparison and retrieval of semantically similar documents using the SimHash (random hyperplanes/sign random projection) algorithm with multi-index and Forest schemes for LSH (Locality Sensitive Hashing) to support fast, approximate cosine similarity/angular distance comparisons and approximate nearest neighbour search using significantly less memory and processing time.
Random Indexing (RI) and Reflective Random Indexing (RRI) (which extends RI to support indirect inference) for scalable Latent Semantic Analysis (LSA) over large, web-scale corpora.
Latent Dirichlet Allocation (LDA) using a parallelised implementation of the fast SCVB0 (Stochastic Collapsed Variational Bayesian inference) algorithm for unsupervised topic extraction.
PCA (Principal Component Analysis)
TF-IDF weighting to account for frequently occurring words
Sparse matrix implementations used for more efficient memory usage and processing over large document corpora.
Stop word removal to remove frequently occurring English words e.g. "the", "and"
Feature hashing ('the hashing trick') implementation (using MurmurHash3) for reduced memory requirements and reduced reliance on training data
Similarity/distance measures to calculate the similarity/distance between feature vectors.

Planned

Expanded persistence support
Stemming to treat words with common root as the same e.g. "go" and "going"
Clustering algorithms e.g. Hierarchical, K-means, etc.
Classification algorithms e.g. SVM, KNN, random forest, etc.

Documentation ¶

Overview ¶

Package nlp provides implementations of selected machine learning algorithms for natural language processing of text corpora. The primary focus is the statistical semantics of plain-text documents supporting semantic analysis and retrieval of semantically similar documents.

The package makes use of the Gonum (http://www.gonum.org/) library for linear algebra and scientific computing with some inspiration taken from Python's scikit-learn (http://scikit-learn.org/stable/) and Gensim (https://radimrehurek.com/gensim/).

The primary intended use case is to support document input as text strings encoded as a matrix of numerical feature vectors called a `term document matrix`. Each column in the matrix corresponds to a document in the corpus and each row corresponds to a unique term occurring in the corpus. The individual elements within the matrix contain the frequency with which each term occurs within each document (referred to as `term frequency`). Whilst textual data from document corpora are the primary intended use case, the algorithms can be used with other types of data from other sources once encoded (vectorised) into a suitable matrix e.g. image data, sound data, users/products, etc.
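The layout described above can be sketched with nothing but the standard library. This is an illustrative sketch, not the package's implementation: the function name termDocumentMatrix is hypothetical, tokenisation is a plain whitespace split, and the matrix is dense rather than sparse.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// termDocumentMatrix returns the sorted vocabulary and a dense matrix of
// term counts in which rows are terms and columns are documents.
func termDocumentMatrix(docs []string) ([]string, [][]int) {
	// Collect the unique terms across the whole corpus.
	index := map[string]int{}
	var vocab []string
	for _, doc := range docs {
		for _, term := range strings.Fields(strings.ToLower(doc)) {
			if _, seen := index[term]; !seen {
				index[term] = 0
				vocab = append(vocab, term)
			}
		}
	}
	// Assign row numbers in alphabetical order so the output is stable.
	sort.Strings(vocab)
	for i, term := range vocab {
		index[term] = i
	}

	// Count how often each term (row) occurs in each document (column).
	counts := make([][]int, len(vocab))
	for i := range counts {
		counts[i] = make([]int, len(docs))
	}
	for j, doc := range docs {
		for _, term := range strings.Fields(strings.ToLower(doc)) {
			counts[index[term]][j]++
		}
	}
	return vocab, counts
}

func main() {
	vocab, counts := termDocumentMatrix([]string{
		"the cow jumped over the moon",
		"the little dog laughed",
	})
	for i, term := range vocab {
		fmt.Println(term, counts[i])
	}
}
```

Each printed row is one term followed by its frequency in every document, which is the `term frequency` referred to above.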

These matrices can be processed and manipulated through the application of additional transformations for weighting features, identifying relationships or optimising the data for analysis, information retrieval and/or predictions.

Typically the algorithms in this package implement one of three primary interfaces:

Vectoriser - Taking document input as strings and outputting matrices of numerical features e.g. term frequency.
Transformer - Takes matrices of numerical features and applies some logic/transformation to output a new matrix.
Comparer - Functions taking two vectors (columns from a matrix) and outputting a distance/similarity measure.

One of the implementations of Vectoriser is Pipeline which can be used to wire together pipelines composed of a Vectoriser and one or more Transformers arranged in serial so that the output from each stage forms the input of the next. This can be used to construct a classic LSI (Latent Semantic Indexing) pipeline (vectoriser -> TF.IDF weighting -> Truncated SVD):

// stopWords is a []string of words to exclude e.g. "the", "and"
pipeline := nlp.NewPipeline(
	nlp.NewCountVectoriser(stopWords...),
	nlp.NewTfidfTransformer(),
	nlp.NewTruncatedSVD(100),
)

Whilst they take different inputs, both Vectorisers and Transformers have 3 primary methods:

Fit() - Trains the model based upon the supplied, input training data.
Transform() - Transforms the input into the output matrix (requires the model to be already fitted by a previous call to Fit() or FitTransform()).
FitTransform() - Convenience method combining Fit() and Transform() methods to transform input data, fitting the model to the input data in the process.
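To make the Fit/Transform/FitTransform contract concrete, here is a minimal standard-library sketch of a TF-IDF style weighting stage. Everything in it is an assumption made for illustration: the type tfidfWeights is hypothetical, matrices are plain [][]float64 rather than Gonum matrices, and the smoothed formula ln((1+N)/(1+df)) is one common variant that may differ from the one the package uses.

```go
package main

import (
	"errors"
	"fmt"
	"math"
)

// tfidfWeights is a hypothetical transformer holding the state learnt by fit.
type tfidfWeights struct {
	idf []float64 // one inverse document frequency per term (row)
}

// fit learns an idf weight for each term from the training matrix
// (rows are terms, columns are documents).
func (t *tfidfWeights) fit(counts [][]float64) {
	t.idf = make([]float64, len(counts))
	for i, row := range counts {
		df := 0 // number of documents containing term i
		for _, c := range row {
			if c > 0 {
				df++
			}
		}
		// Assumed smoothing variant: ln((1+N)/(1+df)).
		t.idf[i] = math.Log(float64(1+len(row)) / float64(1+df))
	}
}

// transform applies the previously learnt weights to a matrix of counts.
func (t *tfidfWeights) transform(counts [][]float64) ([][]float64, error) {
	if len(t.idf) != len(counts) {
		return nil, errors.New("transform called before fit, or with a different number of terms")
	}
	out := make([][]float64, len(counts))
	for i, row := range counts {
		out[i] = make([]float64, len(row))
		for j, c := range row {
			out[i][j] = c * t.idf[i]
		}
	}
	return out, nil
}

// fitTransform fits the model to counts and transforms the same data.
func (t *tfidfWeights) fitTransform(counts [][]float64) ([][]float64, error) {
	t.fit(counts)
	return t.transform(counts)
}

func main() {
	model := &tfidfWeights{}
	weighted, err := model.fitTransform([][]float64{
		{1, 1, 1}, // a term present in every document
		{2, 0, 0}, // a term present in only one document
	})
	if err != nil {
		fmt.Println(err)
		return
	}
	fmt.Println(weighted)
}
```

The term that occurs in every document is weighted down to zero, while the rarer term keeps a positive weight. A later call to transform reuses the weights learnt from the training data.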

Example ¶

package main

import (
	"fmt"

	"github.com/james-bowman/nlp"
	"github.com/james-bowman/nlp/measures/pairwise"
	"gonum.org/v1/gonum/mat"
)

func main() {
	testCorpus := []string{
		"The quick brown fox jumped over the lazy dog",
		"hey diddle diddle, the cat and the fiddle",
		"the cow jumped over the moon",
		"the little dog laughed to see such fun",
		"and the dish ran away with the spoon",
	}

	var stopWords = []string{"a", "about", "above", "above", "across", "after", "afterwards", "again", "against", "all", "almost", "alone", "along", "already", "also", "although", "always", "am", "among", "amongst", "amoungst", "amount", "an", "and", "another", "any", "anyhow", "anyone", "anything", "anyway", "anywhere", "are", "around", "as", "at", "back", "be", "became", "because", "become", "becomes", "becoming", "been", "before", "beforehand", "behind", "being", "below", "beside", "besides", "between", "beyond", "bill", "both", "bottom", "but", "by", "call", "can", "cannot", "cant", "co", "con", "could", "couldnt", "cry", "de", "describe", "detail", "do", "done", "down", "due", "during", "each", "eg", "eight", "either", "eleven", "else", "elsewhere", "empty", "enough", "etc", "even", "ever", "every", "everyone", "everything", "everywhere", "except", "few", "fifteen", "fify", "fill", "find", "fire", "first", "five", "for", "former", "formerly", "forty", "found", "four", "from", "front", "full", "further", "get", "give", "go", "had", "has", "hasnt", "have", "he", "hence", "her", "here", "hereafter", "hereby", "herein", "hereupon", "hers", "herself", "him", "himself", "his", "how", "however", "hundred", "ie", "if", "in", "inc", "indeed", "interest", "into", "is", "it", "its", "itself", "keep", "last", "latter", "latterly", "least", "less", "ltd", "made", "many", "may", "me", "meanwhile", "might", "mill", "mine", "more", "moreover", "most", "mostly", "move", "much", "must", "my", "myself", "name", "namely", "neither", "never", "nevertheless", "next", "nine", "no", "nobody", "none", "noone", "nor", "not", "nothing", "now", "nowhere", "of", "off", "often", "on", "once", "one", "only", "onto", "or", "other", "others", "otherwise", "our", "ours", "ourselves", "out", "over", "own", "part", "per", "perhaps", "please", "put", "rather", "re", "same", "see", "seem", "seemed", "seeming", "seems", "serious", "several", "she", "should", "show", "side", "since", "sincere", "six", 
"sixty", "so", "some", "somehow", "someone", "something", "sometime", "sometimes", "somewhere", "still", "such", "system", "take", "ten", "than", "that", "the", "their", "them", "themselves", "then", "thence", "there", "thereafter", "thereby", "therefore", "therein", "thereupon", "these", "they", "thickv", "thin", "third", "this", "those", "though", "three", "through", "throughout", "thru", "thus", "to", "together", "too", "top", "toward", "towards", "twelve", "twenty", "two", "un", "under", "until", "up", "upon", "us", "very", "via", "was", "we", "well", "were", "what", "whatever", "when", "whence", "whenever", "where", "whereafter", "whereas", "whereby", "wherein", "whereupon", "wherever", "whether", "which", "while", "whither", "who", "whoever", "whole", "whom", "whose", "why", "will", "with", "within", "without", "would", "yet", "you", "your", "yours", "yourself", "yourselves"}

	query := "the brown fox ran around the dog"

	vectoriser := nlp.NewCountVectoriser(stopWords...)
	transformer := nlp.NewTfidfTransformer()

	// set k (the number of dimensions following truncation) to 4
	reducer := nlp.NewTruncatedSVD(4)

	lsiPipeline := nlp.NewPipeline(vectoriser, transformer, reducer)

	// Transform the corpus into an LSI fitting the model to the documents in the process
	lsi, err := lsiPipeline.FitTransform(testCorpus...)
	if err != nil {
		fmt.Printf("Failed to process documents because %v", err)
		return
	}

	// run the query through the same pipeline that was fitted to the corpus
	// to project it into the same dimensional space
	queryVector, err := lsiPipeline.Transform(query)
	if err != nil {
		fmt.Printf("Failed to process query because %v", err)
		return
	}

	// iterate over document feature vectors (columns) in the LSI matrix and compare
	// each with the query vector for similarity.  Similarity is measured by the cosine
	// of the angle between the two vectors, known as the cosine similarity
	highestSimilarity := -1.0
	var matched int
	_, docs := lsi.Dims()
	for i := 0; i < docs; i++ {
		similarity := pairwise.CosineSimilarity(queryVector.(mat.ColViewer).ColView(0), lsi.(mat.ColViewer).ColView(i))
		if similarity > highestSimilarity {
			matched = i
			highestSimilarity = similarity
		}
	}

	fmt.Printf("Matched '%s'", testCorpus[matched])
}

Output:

Matched 'The quick brown fox jumped over the lazy dog'
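The comparison step in the example relies on cosine similarity. As a rough standard-library sketch of the measure itself (the function below is hypothetical, works on plain slices of equal length, and is not the pairwise package's code):

```go
package main

import (
	"fmt"
	"math"
)

// cosineSimilarity returns the cosine of the angle between a and b, which
// are assumed to have the same length. It returns 0 if either vector has
// zero magnitude.
func cosineSimilarity(a, b []float64) float64 {
	var dot, normA, normB float64
	for i := range a {
		dot += a[i] * b[i]
		normA += a[i] * a[i]
		normB += b[i] * b[i]
	}
	if normA == 0 || normB == 0 {
		return 0
	}
	return dot / (math.Sqrt(normA) * math.Sqrt(normB))
}

func main() {
	fmt.Println(cosineSimilarity([]float64{1, 2}, []float64{2, 4}))  // same direction
	fmt.Println(cosineSimilarity([]float64{1, 0}, []float64{0, 1}))  // orthogonal
	fmt.Println(cosineSimilarity([]float64{1, 0}, []float64{-1, 0})) // opposite direction
}
```

Because the measure depends only on direction, a short query and a long document can still score highly when they use terms in similar proportions.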

Index ¶

func ColDo(m mat.Matrix, fn func(j int, vec mat.Vector))
func ColNonZeroElemDo(m mat.Matrix, j int, fn func(i, j int, v float64))
func CreateRandomProjectionTransform(newDims, origDims int, density float64, rnd *rand.Rand) mat.Matrix
type ClassicLSH
- func NewClassicLSH(functions, tables int) *ClassicLSH
- func (l *ClassicLSH) GetCandidates(query *sparse.BinaryVec, k int) []interface{}
- func (l *ClassicLSH) Put(id interface{}, signature *sparse.BinaryVec)
- func (l *ClassicLSH) Remove(id interface{})
type CountVectoriser
- func NewCountVectoriser(stopWords ...string) *CountVectoriser
- func (v *CountVectoriser) Fit(train ...string) Vectoriser
- func (v *CountVectoriser) FitTransform(docs ...string) (mat.Matrix, error)
- func (v *CountVectoriser) Transform(docs ...string) (mat.Matrix, error)
type Hasher
type HashingVectoriser
- func NewHashingVectoriser(numFeatures int, stopWords ...string) *HashingVectoriser
- func (v *HashingVectoriser) Fit(train ...string) Vectoriser
- func (v *HashingVectoriser) FitTransform(docs ...string) (mat.Matrix, error)
- func (v *HashingVectoriser) PartialFit(train ...string) Vectoriser
- func (v *HashingVectoriser) Transform(docs ...string) (mat.Matrix, error)
type Indexer
type LSHForest
- func NewLSHForest(functions int, tables int) *LSHForest
- func (l *LSHForest) GetCandidates(query *sparse.BinaryVec, k int) []interface{}
- func (l *LSHForest) Put(id interface{}, signature *sparse.BinaryVec)
- func (l *LSHForest) Remove(id interface{})
type LSHIndex
- func NewLSHIndex(approx bool, hasher Hasher, store LSHScheme, distance pairwise.Comparer) *LSHIndex
- func (l *LSHIndex) Index(v mat.Vector, id interface{})
- func (l *LSHIndex) Remove(id interface{})
- func (l *LSHIndex) Search(q mat.Vector, k int) []Match
type LSHScheme
type LatentDirichletAllocation
- func NewLatentDirichletAllocation(k int) *LatentDirichletAllocation
- func (l *LatentDirichletAllocation) Components() mat.Matrix
- func (l *LatentDirichletAllocation) Fit(m mat.Matrix) Transformer
- func (l *LatentDirichletAllocation) FitTransform(m mat.Matrix) (mat.Matrix, error)
- func (l *LatentDirichletAllocation) Perplexity(m mat.Matrix) float64
- func (l *LatentDirichletAllocation) Transform(m mat.Matrix) (mat.Matrix, error)
type LearningSchedule
- func (l LearningSchedule) Calc(iteration float64) float64
type LinearScanIndex
- func NewLinearScanIndex(compareFN pairwise.Comparer) *LinearScanIndex
- func (b *LinearScanIndex) Index(v mat.Vector, id interface{})
- func (b *LinearScanIndex) Remove(id interface{})
- func (b *LinearScanIndex) Search(qv mat.Vector, k int) []Match
type Match
type OnlineTransformer
type OnlineVectoriser
type PCA
- func NewPCA(k int) *PCA
- func (p *PCA) ExplainedVariance() []float64
- func (p *PCA) Fit(m mat.Matrix) Transformer
- func (p *PCA) FitTransform(m mat.Matrix) (mat.Matrix, error)
- func (p *PCA) Transform(m mat.Matrix) (mat.Matrix, error)
type Pipeline
- func NewPipeline(vectoriser Vectoriser, transformers ...Transformer) *Pipeline
- func (p *Pipeline) Fit(docs ...string) Vectoriser
- func (p *Pipeline) FitTransform(docs ...string) (mat.Matrix, error)
- func (p *Pipeline) Transform(docs ...string) (mat.Matrix, error)
type RRIBasis
type RandomIndexing
- func NewRandomIndexing(k int, density float64) *RandomIndexing
- func NewReflectiveRandomIndexing(k int, basis RRIBasis, reflections int, density float64) *RandomIndexing
- func (r *RandomIndexing) Components() mat.Matrix
- func (r *RandomIndexing) Fit(m mat.Matrix) Transformer
- func (r *RandomIndexing) FitTransform(m mat.Matrix) (mat.Matrix, error)
- func (r *RandomIndexing) PartialFit(m mat.Matrix) OnlineTransformer
- func (r *RandomIndexing) SetComponents(m mat.Matrix)
- func (r *RandomIndexing) Transform(m mat.Matrix) (mat.Matrix, error)
type RandomProjection
- func NewRandomProjection(k int, density float64) *RandomProjection
- func (r *RandomProjection) Fit(m mat.Matrix) Transformer
- func (r *RandomProjection) FitTransform(m mat.Matrix) (mat.Matrix, error)
- func (r *RandomProjection) Transform(m mat.Matrix) (mat.Matrix, error)
type RegExpTokeniser
- func (t *RegExpTokeniser) ForEachIn(text string, f func(token string))
- func (t *RegExpTokeniser) Tokenise(text string) []string
type SignRandomProjection
- func NewSignRandomProjection(bits int) *SignRandomProjection
- func (s *SignRandomProjection) Fit(m mat.Matrix) Transformer
- func (s *SignRandomProjection) FitTransform(m mat.Matrix) (mat.Matrix, error)
- func (s *SignRandomProjection) Transform(m mat.Matrix) (mat.Matrix, error)
type SimHash
- func NewSimHash(bits int, dim int) *SimHash
- func (h *SimHash) Hash(v mat.Vector) *sparse.BinaryVec
type TfidfTransformer
- func NewTfidfTransformer() *TfidfTransformer
- func (t *TfidfTransformer) Fit(matrix mat.Matrix) Transformer
- func (t *TfidfTransformer) FitTransform(matrix mat.Matrix) (mat.Matrix, error)
- func (t *TfidfTransformer) Load(r io.Reader) error
- func (t TfidfTransformer) Save(w io.Writer) error
- func (t *TfidfTransformer) Transform(matrix mat.Matrix) (mat.Matrix, error)
type Tokeniser
- func NewTokeniser(stopWords ...string) Tokeniser
type Transformer
type TruncatedSVD
- func NewTruncatedSVD(k int) *TruncatedSVD
- func (t *TruncatedSVD) Fit(mat mat.Matrix) Transformer
- func (t *TruncatedSVD) FitTransform(m mat.Matrix) (mat.Matrix, error)
- func (t *TruncatedSVD) Load(r io.Reader) error
- func (t TruncatedSVD) Save(w io.Writer) error
- func (t *TruncatedSVD) Transform(m mat.Matrix) (mat.Matrix, error)
type Vectoriser

Functions ¶

func ColDo ¶

func ColDo(m mat.Matrix, fn func(j int, vec mat.Vector))

ColDo executes fn for each column j in m. If the matrix implements the mat.ColViewer interface then this interface will be used to iterate over the column vectors more efficiently. If the matrix implements the sparse.TypeConverter interface then the matrix will be converted to a CSC matrix (which implements the mat.ColViewer interface) so that it can benefit from the same optimisation.

func ColNonZeroElemDo ¶

func ColNonZeroElemDo(m mat.Matrix, j int, fn func(i, j int, v float64))

ColNonZeroElemDo executes fn for each non-zero element in column j of matrix m. If m implements mat.ColNonZeroDoer then this interface will be used to perform the iteration.

func CreateRandomProjectionTransform ¶

func CreateRandomProjectionTransform(newDims, origDims int, density float64, rnd *rand.Rand) mat.Matrix

CreateRandomProjectionTransform returns a new random matrix for Random Projections of shape newDims x origDims. The matrix will be randomly populated using probability distributions where density is used as the probability that each element will be populated. Populated values will be randomly selected from [-1, 1] scaled according to the density and dimensions of the matrix. If rnd is nil then a new random number generator will be created and used.
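One way to picture such a matrix is the sparse random projection sketch below. It is an assumption-laden illustration rather than the package's code: the name sparseRandomMatrix is hypothetical, the result is a dense [][]float64, and the scale factor sqrt(1/(density*newDims)) is one conventional choice for sparse random projections.

```go
package main

import (
	"fmt"
	"math"
	"math/rand"
)

// sparseRandomMatrix returns a newDims x origDims matrix in which each
// element is non-zero with probability density. Non-zero elements are
// +scale or -scale with equal probability.
func sparseRandomMatrix(newDims, origDims int, density float64, rnd *rand.Rand) [][]float64 {
	if rnd == nil {
		// Fall back to a fresh generator when none is supplied.
		rnd = rand.New(rand.NewSource(1))
	}
	// Assumed scaling, chosen so projected vectors keep a comparable length.
	scale := math.Sqrt(1 / (density * float64(newDims)))

	m := make([][]float64, newDims)
	for i := range m {
		m[i] = make([]float64, origDims)
		for j := range m[i] {
			if rnd.Float64() >= density {
				continue // leave this element at zero
			}
			if rnd.Intn(2) == 0 {
				m[i][j] = scale
			} else {
				m[i][j] = -scale
			}
		}
	}
	return m
}

func main() {
	m := sparseRandomMatrix(3, 8, 0.25, rand.New(rand.NewSource(7)))
	for _, row := range m {
		fmt.Println(row)
	}
}
```

Multiplying this matrix by an origDims-dimensional column vector yields a newDims-dimensional projection of it.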

Types ¶

type ClassicLSH ¶

type ClassicLSH struct {
	// contains filtered or unexported fields
}

ClassicLSH supports finding top-k Approximate Nearest Neighbours (ANN) using Locality Sensitive Hashing (LSH). Classic LSH scheme is based on using hash tables to store items by their locality sensitive hash code based on the work of A. Gionis et al. Items that map to the same bucket (their hash codes collide) are similar. Multiple hash tables are used to improve recall where some similar items would otherwise hash to separate, neighbouring buckets in only a single table.

A. Gionis, P. Indyk, and R. Motwani, “Similarity Search in High Dimensions via Hashing,” VLDB ’99 Proc. 25th Int. Conf. Very Large Data Bases, vol. 99, no. 1, pp. 518–529, 1999. http://www.cs.princeton.edu/courses/archive/spring13/cos598C/Gionis.pdf http://portal.acm.org/citation.cfm?id=671516

func NewClassicLSH ¶

func NewClassicLSH(functions, tables int) *ClassicLSH

NewClassicLSH creates a new ClassicLSH with the configured number of hash tables and hash functions per table. The length of hash signatures used in this type's methods (Put() and GetCandidates()) should be exactly equal to functions * tables. The Classic LSH algorithm uses multiple hash tables to improve recall for similar items that hash to nearby buckets within a specific hash table.

func (*ClassicLSH) GetCandidates ¶

func (l *ClassicLSH) GetCandidates(query *sparse.BinaryVec, k int) []interface{}

GetCandidates returns the IDs of candidate nearest neighbours. It is up to the calling code to further filter these candidates based on distance to arrive at the top-k approximate nearest neighbours. The number of candidates returned may be smaller or larger than k. The method panics if the signature is not the same length as tables * functions.

func (*ClassicLSH) Put ¶

func (l *ClassicLSH) Put(id interface{}, signature *sparse.BinaryVec)

Put stores the specified LSH signature and associated ID in the LSH index. The method panics if the signature is not the same length as tables * functions.

func (*ClassicLSH) Remove ¶

func (l *ClassicLSH) Remove(id interface{})

Remove removes the specified item from the LSH index.
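The multi-table idea behind the classic scheme can be sketched as follows. This is a simplified illustration, not the ClassicLSH implementation: the type bandedLSH and its methods are hypothetical, and signatures are represented as strings of '0' and '1' characters instead of sparse.BinaryVec values.

```go
package main

import "fmt"

// bandedLSH keeps one hash table per band of the signature.
type bandedLSH struct {
	functions int                   // number of signature bits per table
	tables    []map[string][]string // bucket key -> ids stored in that bucket
}

func newBandedLSH(functions, tables int) *bandedLSH {
	l := &bandedLSH{functions: functions, tables: make([]map[string][]string, tables)}
	for i := range l.tables {
		l.tables[i] = map[string][]string{}
	}
	return l
}

// band returns the slice of the signature used as the bucket key in table t.
func (l *bandedLSH) band(signature string, t int) string {
	return signature[t*l.functions : (t+1)*l.functions]
}

// check panics unless the signature is exactly functions * tables long.
func (l *bandedLSH) check(signature string) {
	if len(signature) != l.functions*len(l.tables) {
		panic("signature length must equal functions * tables")
	}
}

// put stores id in one bucket of every table.
func (l *bandedLSH) put(id, signature string) {
	l.check(signature)
	for t, table := range l.tables {
		key := l.band(signature, t)
		table[key] = append(table[key], id)
	}
}

// candidates returns every id sharing a bucket with the query in any table.
func (l *bandedLSH) candidates(query string) []string {
	l.check(query)
	seen := map[string]bool{}
	var ids []string
	for t, table := range l.tables {
		for _, id := range table[l.band(query, t)] {
			if !seen[id] {
				seen[id] = true
				ids = append(ids, id)
			}
		}
	}
	return ids
}

func main() {
	index := newBandedLSH(2, 2)
	index.put("a", "0011")
	index.put("b", "0000")
	index.put("c", "1100")
	fmt.Println(index.candidates("0010"))
}
```

An item becomes a candidate if it collides with the query in at least one table, which is why adding tables improves recall. The candidates still need to be ranked by distance afterwards.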

type CountVectoriser ¶

type CountVectoriser struct {
	// Vocabulary is a map of words to indices that point to the row number representing
	// that word in the term document matrix output from the Transform() and FitTransform()
	// methods.  The Vocabulary map is populated by the Fit() or FitTransform() methods
	// based upon the words occurring in the datasets supplied to those methods.  Within
	// Transform(), any words found in the test data set that were not present in the
	// training data set supplied to Fit() will not have an entry in the Vocabulary
	// and will be ignored.
	Vocabulary map[string]int

	// Tokeniser is used to tokenise input text into features.
	Tokeniser Tokeniser
}

CountVectoriser can be used to encode one or more text documents into a term document matrix where each column represents a document within the corpus and each row represents a term present in the training data set. Each element represents the frequency the corresponding term appears in the corresponding document e.g. tf(t, d) = 5 would mean that term t (perhaps the word "dog") appears 5 times in the document d.

func NewCountVectoriser ¶

func NewCountVectoriser(stopWords ...string) *CountVectoriser

NewCountVectoriser creates a new CountVectoriser. stopWords is a potentially empty slice of words to be removed from the corpus.

func (*CountVectoriser) Fit ¶

func (v *CountVectoriser) Fit(train ...string) Vectoriser

Fit processes the supplied training data (a variable number of strings representing documents). Each word appearing inside the training data will be added to the Vocabulary. The Fit() method is intended to be called once to train the model in a batch context. Calling the Fit() method a second time has the effect of re-training the model from scratch (discarding the previously learnt vocabulary).

func (*CountVectoriser) FitTransform ¶

func (v *CountVectoriser) FitTransform(docs ...string) (mat.Matrix, error)

FitTransform is exactly equivalent to calling Fit() followed by Transform() on the same documents. This is a convenience where separate training data is not being used to fit the model i.e. the model is fitted on the fly to the test data. The returned matrix is a sparse matrix type.

func (*CountVectoriser) Transform ¶

func (v *CountVectoriser) Transform(docs ...string) (mat.Matrix, error)

Transform transforms the supplied documents into a term document matrix where each column is a feature vector representing one of the supplied documents. Each element represents the frequency with which the associated term for that row occurred within that document. The returned matrix is a sparse matrix type.
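The relationship between Fit(), the Vocabulary and Transform() can be illustrated with a small standard-library sketch. The lower-case type countVectoriser below is hypothetical and much simpler than the package's CountVectoriser: it splits on whitespace, applies no stop word removal and returns a dense matrix.

```go
package main

import (
	"fmt"
	"strings"
)

// countVectoriser is a hypothetical, simplified stand-in for illustration.
type countVectoriser struct {
	vocabulary map[string]int // word -> row number
}

// fit discards any previous vocabulary and assigns a row to each new word.
func (v *countVectoriser) fit(train ...string) {
	v.vocabulary = map[string]int{}
	for _, doc := range train {
		for _, word := range strings.Fields(strings.ToLower(doc)) {
			if _, ok := v.vocabulary[word]; !ok {
				v.vocabulary[word] = len(v.vocabulary)
			}
		}
	}
}

// transform counts vocabulary words per document. Words that were not seen
// during fit have no row and are ignored.
func (v *countVectoriser) transform(docs ...string) [][]int {
	m := make([][]int, len(v.vocabulary))
	for i := range m {
		m[i] = make([]int, len(docs))
	}
	for j, doc := range docs {
		for _, word := range strings.Fields(strings.ToLower(doc)) {
			if row, ok := v.vocabulary[word]; ok {
				m[row][j]++
			}
		}
	}
	return m
}

func main() {
	v := &countVectoriser{}
	v.fit("the cow jumped")
	// "dog" was not in the training data, so it is ignored.
	fmt.Println(v.transform("the the dog"))
}
```

The number of rows is fixed by the training data, so matrices produced by later calls to transform share the same dimensional space.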

type Hasher ¶

type Hasher interface {
	// Hash hashes the input vector into a BinaryVector hash representation
	Hash(mat.Vector) *sparse.BinaryVec
}

Hasher interface represents a Locality Sensitive Hashing algorithm whereby the proximity of data points is preserved in the hash space i.e. similar data points will be hashed to values close together in the hash space.
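A minimal sketch of one such algorithm, sign random projection, is shown below using only the standard library. The names signHash and hamming are hypothetical, the signature is a []bool rather than a sparse.BinaryVec, and the hyperplanes are drawn from a seeded generator purely for the demonstration.

```go
package main

import (
	"fmt"
	"math/rand"
)

// signHash returns one bit per hyperplane: true when v lies on the
// non-negative side of that hyperplane.
func signHash(v []float64, hyperplanes [][]float64) []bool {
	bits := make([]bool, len(hyperplanes))
	for i, h := range hyperplanes {
		var dot float64
		for j := range v {
			dot += v[j] * h[j]
		}
		bits[i] = dot >= 0
	}
	return bits
}

// hamming counts the positions at which two signatures differ.
func hamming(a, b []bool) int {
	d := 0
	for i := range a {
		if a[i] != b[i] {
			d++
		}
	}
	return d
}

func main() {
	// Draw 16 random hyperplanes in 3 dimensions.
	rnd := rand.New(rand.NewSource(42))
	planes := make([][]float64, 16)
	for i := range planes {
		planes[i] = []float64{rnd.NormFloat64(), rnd.NormFloat64(), rnd.NormFloat64()}
	}

	a := []float64{1, 2, 3}
	b := []float64{1.1, 2, 2.9} // points in nearly the same direction as a
	c := []float64{-3, 1, -2}   // points in a very different direction

	fmt.Println("a vs b:", hamming(signHash(a, planes), signHash(b, planes)))
	fmt.Println("a vs c:", hamming(signHash(a, planes), signHash(c, planes)))
}
```

Vectors separated by a small angle fall on the same side of most hyperplanes, so their signatures tend to differ in few bits.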

type HashingVectoriser ¶

type HashingVectoriser struct {
	NumFeatures int
	Tokeniser   Tokeniser
}

HashingVectoriser can be used to encode one or more text documents into a term document matrix where each column represents a document within the corpus and each row represents a term. Each element represents the frequency the corresponding term appears in the corresponding document e.g. tf(t, d) = 5 would mean that term t (perhaps the word "dog") appears 5 times in the document d.

func NewHashingVectoriser ¶

func NewHashingVectoriser(numFeatures int, stopWords ...string) *HashingVectoriser

NewHashingVectoriser creates a new HashingVectoriser. If stopWords is not an empty slice then English stop words will be removed. numFeatures specifies the number of features that should be present in produced vectors. Each word in a document is hashed and the hash modulo numFeatures gives the row in the matrix corresponding to that word.
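The hashing step can be sketched as follows. Note the substitution: the README states that the package uses MurmurHash3, whereas this sketch uses FNV-1a from the standard library so that it stays self-contained. The function hashedCounts is hypothetical and handles a single document.

```go
package main

import (
	"fmt"
	"hash/fnv"
	"strings"
)

// hashedCounts maps each word of doc to a row by hashing it and taking the
// hash modulo numFeatures, then counts occurrences per row. No vocabulary
// is stored, so distinct words may collide in the same row.
func hashedCounts(doc string, numFeatures int) []float64 {
	vec := make([]float64, numFeatures)
	for _, word := range strings.Fields(strings.ToLower(doc)) {
		h := fnv.New32a() // FNV-1a stands in for MurmurHash3 here
		h.Write([]byte(word))
		vec[h.Sum32()%uint32(numFeatures)]++
	}
	return vec
}

func main() {
	fmt.Println(hashedCounts("the dog chased the other dog", 8))
}
```

Because the row for a word depends only on its hash, the same word always lands in the same row without any prior training.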

func (*HashingVectoriser) Fit ¶

func (v *HashingVectoriser) Fit(train ...string) Vectoriser

Fit does nothing for a HashingVectoriser. As the HashingVectoriser vectorises features based on their hash, it does not require a pre-determined vocabulary to map features to their correct row in the vector. It is effectively stateless and does not require fitting to training data. The method is included for compatibility with other vectorisers.

func (*HashingVectoriser) FitTransform ¶

func (v *HashingVectoriser) FitTransform(docs ...string) (mat.Matrix, error)

FitTransform for a HashingVectoriser is exactly equivalent to calling Transform() with the same documents. For most vectorisers, Fit() must be called prior to Transform() and so this method is a convenience where separate training data is not used to fit the model. For a HashingVectoriser, fitting is not required. As with Fit(), this method is included with the HashingVectoriser for compatibility with other vectorisers. The returned matrix is a sparse matrix type.

func (*HashingVectoriser) PartialFit ¶

func (v *HashingVectoriser) PartialFit(train ...string) Vectoriser

PartialFit does nothing for a HashingVectoriser. As the HashingVectoriser vectorises features based on their hash, it does not require a pre-learnt vocabulary to map features to the correct row in the feature vector. This method is included for compatibility with other vectorisers.

func (*HashingVectoriser) Transform ¶

func (v *HashingVectoriser) Transform(docs ...string) (mat.Matrix, error)

Transform transforms the supplied documents into a term document matrix where each column is a feature vector representing one of the supplied documents. Each element represents the frequency with which the associated term for that row occurred within that document. The returned matrix is a sparse matrix type.

type Indexer ¶

type Indexer interface {
	Index(v mat.Vector, id interface{})
	Search(q mat.Vector, k int) []Match
	Remove(ids interface{})
}

Indexer indexes vectors to support Nearest Neighbour (NN) similarity searches across the indexed vectors.

type LSHForest ¶

type LSHForest struct {
	// contains filtered or unexported fields
}

LSHForest is an implementation of the LSH Forest Locality Sensitive Hashing scheme based on the work of M. Bawa et al.

M. Bawa, T. Condie, and P. Ganesan, “LSH forest: self-tuning indexes for similarity search,” Proc. 14th Int. Conf. World Wide Web - WWW ’05, p. 651, 2005. http://dl.acm.org/citation.cfm?id=1060745.1060840

func NewLSHForest ¶

func NewLSHForest(functions int, tables int) *LSHForest

NewLSHForest creates a new LSHForest Locality Sensitive Hashing scheme with the specified number of hash tables and hash functions per table.

func (*LSHForest) GetCandidates ¶

func (l *LSHForest) GetCandidates(query *sparse.BinaryVec, k int) []interface{}

GetCandidates returns the IDs of candidate nearest neighbours. It is up to the calling code to further filter these candidates based on distance to arrive at the top-k approximate nearest neighbours. The number of candidates returned may be smaller or larger than k.

func (*LSHForest) Put ¶

func (l *LSHForest) Put(id interface{}, signature *sparse.BinaryVec)

Put stores the specified LSH signature and associated ID in the LSH index.

func (*LSHForest) Remove ¶

func (l *LSHForest) Remove(id interface{})

Remove removes the specified item from the LSH index.

type LSHIndex ¶

type LSHIndex struct {
	// contains filtered or unexported fields
}

LSHIndex is an LSH (Locality Sensitive Hashing) based index supporting Approximate Nearest Neighbour (ANN) search in O(log n). The storage required by the index will depend upon the underlying LSH scheme used but will typically be higher than O(n). In use cases where accurate Nearest Neighbour search is required other types of index should be considered like LinearScanIndex.

func NewLSHIndex ¶

func NewLSHIndex(approx bool, hasher Hasher, store LSHScheme, distance pairwise.Comparer) *LSHIndex

NewLSHIndex creates a new LSHIndex. When queried, the initial candidate nearest neighbours returned by the underlying LSH indexing algorithm are further filtered by comparing distances to the query vector using the supplied distance metric. If approx is true, the filtering comparison is performed on the hashes and if approx is false, then the comparison is performed on the original vectors instead. This will have time and storage implications as comparing the original vectors will be more accurate but slower and require the original vectors be stored for the comparison. The LSH algorithm and underlying LSH indexing algorithm may both be specified as hasher and store parameters respectively.
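The filtering step can be pictured with the sketch below. It is a simplified illustration under stated assumptions, not the LSHIndex implementation: estimatedCosine and topK are hypothetical helpers, candidate IDs are plain ints, and the estimate cos(pi * hamming / bits) applies to sign random projection signatures, where the fraction of differing bits approximates the angle between the vectors divided by pi.

```go
package main

import (
	"fmt"
	"math"
	"sort"
)

// estimatedCosine approximates the cosine similarity of two vectors from the
// Hamming distance between their sign random projection signatures.
func estimatedCosine(hamming, bits int) float64 {
	return math.Cos(math.Pi * float64(hamming) / float64(bits))
}

// topK returns up to k candidate ids ordered by descending score.
func topK(candidates []int, score func(id int) float64, k int) []int {
	ranked := append([]int(nil), candidates...) // copy so the input is untouched
	sort.SliceStable(ranked, func(a, b int) bool {
		return score(ranked[a]) > score(ranked[b])
	})
	if len(ranked) > k {
		ranked = ranked[:k]
	}
	return ranked
}

func main() {
	// Hamming distance from the query signature to each candidate's
	// 64-bit signature (hypothetical values).
	distances := map[int]int{10: 30, 11: 4, 12: 17}
	candidates := []int{10, 11, 12}

	best := topK(candidates, func(id int) float64 {
		return estimatedCosine(distances[id], 64)
	}, 2)
	fmt.Println(best)
}
```

Scoring on hashes only needs the signatures, while scoring on the original vectors would swap the score function for an exact measure at the cost of keeping those vectors in memory. The sketch sorts its result for clarity, which is not something the index itself promises.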

func (*LSHIndex) Index ¶

func (l *LSHIndex) Index(v mat.Vector, id interface{})

Index indexes the supplied vector along with its associated ID.

func (*LSHIndex) Remove ¶

func (l *LSHIndex) Remove(id interface{})

Remove removes the vector with the specified id from the index. If no vector is found with the specified id the method will simply do nothing.

func (*LSHIndex) Search ¶

func (l *LSHIndex) Search(q mat.Vector, k int) []Match

Search searches for the top-k approximate nearest neighbours in the index. The method returns up to the top-k most similar items in unsorted order. The method may return fewer than k items if fewer than k neighbours are found.

type LSHScheme ¶

type LSHScheme interface {
	// Put stores the specified LSH signature and associated ID in the LSH index
	Put(id interface{}, signature *sparse.BinaryVec)

	// GetCandidates returns the IDs of candidate nearest neighbours.  It is up to
	// the calling code to further filter these candidates based on distance to arrive
	// at the top-k approximate nearest neighbours.  The number of candidates returned
	// may be smaller or larger than k.
	GetCandidates(query *sparse.BinaryVec, k int) []interface{}

	// Remove removes the specified item from the LSH index
	Remove(id interface{})
}

LSHScheme interface represents LSH indexing schemes to support Approximate Nearest Neighbour (ANN) search.

type LatentDirichletAllocation ¶

type LatentDirichletAllocation struct {
	// Iterations is the maximum number of training iterations
	Iterations int

	// PerplexityTolerance is the tolerance of perplexity below which the Fit method will stop iterating
	// and complete.  If the evaluated perplexity is below the tolerance, fitting will terminate successfully
	// without necessarily completing all of the configured number of training iterations.
	PerplexityTolerance float64

	// PerplexityEvaluationFrequency is the frequency with which to test Perplexity against PerplexityTolerance inside
	// Fit.  A value <= 0 will not evaluate Perplexity at all and simply iterate for `Iterations` iterations.
	PerplexityEvaluationFrequency int

	// BatchSize is the size of mini batches used during training
	BatchSize int

	// K is the number of topics
	K int

	// BurnInPasses is the number of `burn-in` passes across the documents in the
	// training data to learn the document statistics before we start collecting topic statistics.
	BurnInPasses int

	// TransformationPasses is the number of passes to transform new documents given a previously
	// fitted topic model
	TransformationPasses int

	// MeanChangeTolerance is the tolerance of change to Theta between burn in passes.
	// If the level of change between passes is below the tolerance, the burn in will complete
	// without necessarily completing the configured number of passes.
	MeanChangeTolerance float64

	// ChangeEvaluationFrequency is the frequency with which to test the mean change against
	// MeanChangeTolerance during burn-in and transformation.  A value <= 0 will not evaluate
	// the mean change at all and simply iterate for `BurnInPasses` iterations.
	ChangeEvaluationFrequency int

	// Alpha is the prior of theta (the documents over topics distribution)
	Alpha float64

	// Eta is the prior of phi (the topics over words distribution)
	Eta float64

	// RhoPhi is the learning rate for phi (the topics over words distribution)
	RhoPhi LearningSchedule

	// RhoTheta is the learning rate for theta (the documents over topics distribution)
	RhoTheta LearningSchedule

	// Rnd is the random number generator used to generate the initial distributions
	// for nTheta (the document over topic distribution), nPhi (the topic over word
	// distribution) and nZ (the topic assignments).
	Rnd *rand.Rand

	// Processes is the degree of parallelisation, or more specifically, the number of
	// concurrent go routines to use during fitting.
	Processes int
	// contains filtered or unexported fields
}

LatentDirichletAllocation (LDA) for fast unsupervised topic extraction. LDA processes documents and learns their latent topic model estimating the posterior document over topic probability distribution (the probabilities of each document being allocated to each topic) and the posterior topic over word probability distribution.

This transformer uses a parallel implementation of the SCVB0 (Stochastic Collapsed Variational Bayes) Algorithm (https://arxiv.org/pdf/1305.2452.pdf) by Jimmy Foulds with optional `clumping` optimisations.

Example ¶

package main

import (
	"fmt"

	"github.com/james-bowman/nlp"
)

var stopWords = []string{"a", "about", "above", "above", "across", "after", "afterwards", "again", "against", "all", "almost", "alone", "along", "already", "also", "although", "always", "am", "among", "amongst", "amoungst", "amount", "an", "and", "another", "any", "anyhow", "anyone", "anything", "anyway", "anywhere", "are", "around", "as", "at", "back", "be", "became", "because", "become", "becomes", "becoming", "been", "before", "beforehand", "behind", "being", "below", "beside", "besides", "between", "beyond", "bill", "both", "bottom", "but", "by", "call", "can", "cannot", "cant", "co", "con", "could", "couldnt", "cry", "de", "describe", "detail", "do", "done", "down", "due", "during", "each", "eg", "eight", "either", "eleven", "else", "elsewhere", "empty", "enough", "etc", "even", "ever", "every", "everyone", "everything", "everywhere", "except", "few", "fifteen", "fify", "fill", "find", "fire", "first", "five", "for", "former", "formerly", "forty", "found", "four", "from", "front", "full", "further", "get", "give", "go", "had", "has", "hasnt", "have", "he", "hence", "her", "here", "hereafter", "hereby", "herein", "hereupon", "hers", "herself", "him", "himself", "his", "how", "however", "hundred", "ie", "if", "in", "inc", "indeed", "interest", "into", "is", "it", "its", "itself", "keep", "last", "latter", "latterly", "least", "less", "ltd", "made", "many", "may", "me", "meanwhile", "might", "mill", "mine", "more", "moreover", "most", "mostly", "move", "much", "must", "my", "myself", "name", "namely", "neither", "never", "nevertheless", "next", "nine", "no", "nobody", "none", "noone", "nor", "not", "nothing", "now", "nowhere", "of", "off", "often", "on", "once", "one", "only", "onto", "or", "other", "others", "otherwise", "our", "ours", "ourselves", "out", "over", "own", "part", "per", "perhaps", "please", "put", "rather", "re", "same", "see", "seem", "seemed", "seeming", "seems", "serious", "several", "she", "should", "show", "side", "since", "sincere", "six", 
"sixty", "so", "some", "somehow", "someone", "something", "sometime", "sometimes", "somewhere", "still", "such", "system", "take", "ten", "than", "that", "the", "their", "them", "themselves", "then", "thence", "there", "thereafter", "thereby", "therefore", "therein", "thereupon", "these", "they", "thickv", "thin", "third", "this", "those", "though", "three", "through", "throughout", "thru", "thus", "to", "together", "too", "top", "toward", "towards", "twelve", "twenty", "two", "un", "under", "until", "up", "upon", "us", "very", "via", "was", "we", "well", "were", "what", "whatever", "when", "whence", "whenever", "where", "whereafter", "whereas", "whereby", "wherein", "whereupon", "wherever", "whether", "which", "while", "whither", "who", "whoever", "whole", "whom", "whose", "why", "will", "with", "within", "without", "would", "yet", "you", "your", "yours", "yourself", "yourselves"}

func main() {
	corpus := []string{
		"The quick brown fox jumped over the lazy dog",
		"The cow jumped over the moon",
		"The little dog laughed to see such fun",
	}

	// Create a pipeline with a count vectoriser and LDA transformer for 2 topics
	vectoriser := nlp.NewCountVectoriser(stopWords...)
	lda := nlp.NewLatentDirichletAllocation(2)
	pipeline := nlp.NewPipeline(vectoriser, lda)

	docsOverTopics, err := pipeline.FitTransform(corpus...)
	if err != nil {
		fmt.Printf("Failed to model topics for documents because %v", err)
		return
	}

	// Examine Document over topic probability distribution
	dr, dc := docsOverTopics.Dims()
	for doc := 0; doc < dc; doc++ {
		fmt.Printf("\nTopic distribution for document: '%s' -", corpus[doc])
		for topic := 0; topic < dr; topic++ {
			if topic > 0 {
				fmt.Printf(",")
			}
			fmt.Printf(" Topic #%d=%f", topic, docsOverTopics.At(topic, doc))
		}
	}

	// Examine Topic over word probability distribution
	topicsOverWords := lda.Components()
	tr, tc := topicsOverWords.Dims()

	vocab := make([]string, len(vectoriser.Vocabulary))
	for k, v := range vectoriser.Vocabulary {
		vocab[v] = k
	}
	for topic := 0; topic < tr; topic++ {
		fmt.Printf("\nWord distribution for Topic #%d -", topic)
		for word := 0; word < tc; word++ {
			if word > 0 {
				fmt.Printf(",")
			}
			fmt.Printf(" '%s'=%f", vocab[word], topicsOverWords.At(topic, word))
		}
	}
}

Output:

func NewLatentDirichletAllocation ¶

func NewLatentDirichletAllocation(k int) *LatentDirichletAllocation

NewLatentDirichletAllocation returns a new LatentDirichletAllocation type initialised with default values for k topics.

func (*LatentDirichletAllocation) Components ¶

func (l *LatentDirichletAllocation) Components() mat.Matrix

Components returns the topic over words probability distribution. The returned matrix is of dimensions K x W, where K is the number of topics and W is the number of rows in the training matrix, so each column represents a unique word in the vocabulary.

func (*LatentDirichletAllocation) Fit ¶

func (l *LatentDirichletAllocation) Fit(m mat.Matrix) Transformer

Fit fits the model to the specified matrix m. The latent topics, and probability distribution of topics over words, are learnt and stored to be used for future transformations and analysis.

func (*LatentDirichletAllocation) FitTransform ¶

func (l *LatentDirichletAllocation) FitTransform(m mat.Matrix) (mat.Matrix, error)

FitTransform is approximately equivalent to calling Fit() followed by Transform() on the same matrix. This is a useful shortcut where separate training data is not being used to fit the model, i.e. the model is fitted on the fly to the test data. The returned matrix contains the document over topic distributions where each element is the probability of the corresponding document being related to the corresponding topic. The returned matrix is a Dense matrix of shape K x C where K is the number of topics and C is the number of columns in the input matrix (representing the documents).

func (*LatentDirichletAllocation) Perplexity ¶

func (l *LatentDirichletAllocation) Perplexity(m mat.Matrix) float64

Perplexity calculates the perplexity of the matrix m against the trained model. m is first transformed into corresponding posterior estimates for document over topic distributions and then used to calculate the perplexity.
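As a rough illustration, the sketch below computes perplexity using the standard definition exp(-sum(n_dw * log p(w|d)) / sum(n_dw)), with p(w|d) obtained by mixing the topic over word distributions by the document's topic proportions. Whether the package computes it in exactly this way is an assumption; the `perplexity` function and the row-per-document layout of `theta` are hypothetical and chosen for readability only.

```go
package main

import (
	"fmt"
	"math"
)

// perplexity computes exp(-sum(n_dw * log p(w|d)) / sum(n_dw)) where
// p(w|d) = sum_k theta[d][k] * phi[k][w].
// counts is docs x words, theta is docs x topics, phi is topics x words.
func perplexity(counts, theta, phi [][]float64) float64 {
	var logLik, total float64
	for d, doc := range counts {
		for w, n := range doc {
			if n == 0 {
				continue // word does not occur in this document
			}
			var p float64
			for k := range phi {
				p += theta[d][k] * phi[k][w]
			}
			logLik += n * math.Log(p)
			total += n
		}
	}
	if total == 0 {
		return math.NaN() // undefined for an empty corpus
	}
	return math.Exp(-logLik / total)
}

func main() {
	counts := [][]float64{{2, 0, 1, 0}, {0, 1, 0, 3}}
	theta := [][]float64{{0.5, 0.5}, {0.9, 0.1}}
	// Both topics are uniform over the 4 words, so every word has
	// probability 0.25 and the perplexity is approximately 4.
	phi := [][]float64{{0.25, 0.25, 0.25, 0.25}, {0.25, 0.25, 0.25, 0.25}}
	fmt.Printf("perplexity: %.4f\n", perplexity(counts, theta, phi))
}
```

Lower values indicate that the model assigns higher probability to the observed words, which is why Fit can stop early once the value falls below PerplexityTolerance.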

func (*LatentDirichletAllocation) Transform ¶

func (l *LatentDirichletAllocation) Transform(m mat.Matrix) (mat.Matrix, error)

Transform transforms the input matrix into a matrix representing the distribution of the documents over topics. The returned matrix contains the document over topic distributions where each element is the probability of the corresponding document being related to the corresponding topic. The returned matrix is a Dense matrix of shape K x C where K is the number of topics and C is the number of columns in the input matrix (representing the documents).

type LearningSchedule ¶

type LearningSchedule struct {
	// S is the scale of the step size for the learning rate.
	S float64

	// Tau is the learning offset. The learning offset downweights the
	// learning rate from early iterations.
	Tau float64

	// Kappa controls the learning decay.  This is the amount the learning rate
	// reduces each iteration.  This is typically a value between 0.5 and 1.0.
	Kappa float64
}

LearningSchedule is used to calculate the learning rate for each iteration using a natural gradient descent algorithm.

func (LearningSchedule) Calc ¶

func (l LearningSchedule) Calc(iteration float64) float64

Calc returns the learning rate for the specified iteration
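To show how S, Tau and Kappa interact, the sketch below assumes the schedule has the form S / (Tau + iteration)^Kappa, a form commonly used for stochastic variational inference. The `schedule` type and its `calc` method are hypothetical stand-ins and are not taken from the package source.

```go
package main

import (
	"fmt"
	"math"
)

// schedule is a hypothetical stand-in for LearningSchedule.
type schedule struct {
	S, Tau, Kappa float64
}

// calc assumes the learning rate decays as S / (Tau + iteration)^Kappa.
func (s schedule) calc(iteration float64) float64 {
	return s.S / math.Pow(s.Tau+iteration, s.Kappa)
}

func main() {
	rho := schedule{S: 10, Tau: 1000, Kappa: 0.9}
	for _, it := range []float64{1, 10, 100, 1000} {
		fmt.Printf("iteration %4.0f: learning rate %.6f\n", it, rho.calc(it))
	}
}
```

Under this assumption a larger Tau flattens the early part of the curve, while a Kappa closer to 1.0 makes the rate fall away faster in later iterations.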

type LinearScanIndex ¶

type LinearScanIndex struct {
	// contains filtered or unexported fields
}

LinearScanIndex supports Nearest Neighbour (NN) similarity searches across indexed vectors performing queries in O(n) and requiring O(n) storage. As the name implies, LinearScanIndex performs a linear scan across all indexed vectors comparing them each in turn with the specified query vector using the configured pairwise distance metric. LinearScanIndex is accurate and will always return the true top-k nearest neighbours as opposed to some other types of index, like LSHIndex, which perform Approximate Nearest Neighbour (ANN) searches and trade some recall accuracy for performance over large scale datasets.

func NewLinearScanIndex ¶

func NewLinearScanIndex(compareFN pairwise.Comparer) *LinearScanIndex

NewLinearScanIndex constructs a new empty LinearScanIndex which will use the specified pairwise distance metric to determine nearest neighbours based on similarity.

func (*LinearScanIndex) Index ¶

func (b *LinearScanIndex) Index(v mat.Vector, id interface{})

Index adds the specified vector v with associated id to the index.

func (*LinearScanIndex) Remove ¶

func (b *LinearScanIndex) Remove(id interface{})

Remove removes the vector with the specified id from the index. If no vector is found with the specified id the method will simply do nothing.

func (*LinearScanIndex) Search ¶

func (b *LinearScanIndex) Search(qv mat.Vector, k int) []Match

Search searches for the top-k nearest neighbours in the index. The method returns up to the top-k most similar items in unsorted order. The method may return fewer than k items if fewer than k neighbours are found.
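The idea behind a linear scan can be sketched with the standard library alone. The code below is not the package's implementation: the `match` type, `cosineDistance` and `linearScan` are hypothetical names, cosine distance is assumed as the metric, and the results are sorted for readability, whereas Search returns its matches unsorted.

```go
package main

import (
	"fmt"
	"math"
	"sort"
)

// match mirrors the shape of Match: an identifier and a distance.
type match struct {
	Distance float64
	ID       string
}

// cosineDistance returns 1 - cos(a, b) for equal-length vectors:
// 0 means the same direction and 1 means orthogonal.
func cosineDistance(a, b []float64) float64 {
	var dot, na, nb float64
	for i := range a {
		dot += a[i] * b[i]
		na += a[i] * a[i]
		nb += b[i] * b[i]
	}
	if na == 0 || nb == 0 {
		return 1
	}
	return 1 - dot/(math.Sqrt(na)*math.Sqrt(nb))
}

// linearScan compares the query with every indexed vector, so each
// query costs O(n) distance computations.
func linearScan(index map[string][]float64, query []float64, k int) []match {
	matches := make([]match, 0, len(index))
	for id, v := range index {
		matches = append(matches, match{Distance: cosineDistance(query, v), ID: id})
	}
	// Sorting is a simplification; a bounded heap would avoid the full sort.
	sort.Slice(matches, func(i, j int) bool { return matches[i].Distance < matches[j].Distance })
	if len(matches) > k {
		matches = matches[:k]
	}
	return matches
}

func main() {
	index := map[string][]float64{
		"a": {1, 0},
		"b": {0, 1},
		"c": {1, 1},
	}
	for _, m := range linearScan(index, []float64{1, 0.1}, 2) {
		fmt.Printf("%s: %.4f\n", m.ID, m.Distance)
	}
}
```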

type Match ¶

type Match struct {
	Distance float64
	ID       interface{}
}

Match represents a matching item for nearest neighbour similarity searches. It contains both the ID of the matching item and the distance from the queried item. The distance is represented as a score from 0 (exact match) to 1 (orthogonal) depending upon the metric used.

type OnlineTransformer ¶

type OnlineTransformer interface {
	Transformer
	PartialFit(mat.Matrix) OnlineTransformer
}

OnlineTransformer is an extension to the Transformer interface that supports online (streaming/mini-batch) training as opposed to just batch.

type OnlineVectoriser ¶

type OnlineVectoriser interface {
	Vectoriser
	PartialFit(...string) OnlineVectoriser
}

OnlineVectoriser is an extension to the Vectoriser interface that supports online (streaming/mini-batch) training as opposed to just batch.

type PCA ¶

type PCA struct {
	// K is the number of components
	K int
	// contains filtered or unexported fields
}

PCA calculates the principal components of a matrix, or the axes of greatest variance, and then projects matrices onto those axes. See https://en.wikipedia.org/wiki/Principal_component_analysis for further details.

func NewPCA ¶

func NewPCA(k int) *PCA

NewPCA constructs a new Principal Component Analysis transformer to reduce the dimensionality, projecting matrices onto the axes of greatest variance.

func (*PCA) ExplainedVariance ¶

func (p *PCA) ExplainedVariance() []float64

ExplainedVariance returns a slice of float64 values representing the variances of the principal component scores.

func (*PCA) Fit ¶

func (p *PCA) Fit(m mat.Matrix) Transformer

Fit calculates the principal component directions (axes of greatest variance) within the training data which can then be used to project matrices onto those principal components using the Transform() method.

func (*PCA) FitTransform ¶

func (p *PCA) FitTransform(m mat.Matrix) (mat.Matrix, error)

FitTransform is approximately equivalent to calling Fit() followed by Transform() on the same matrix. This is a useful shortcut where separate training data is not being used to fit the model i.e. the model is fitted on the fly to the test data.

func (*PCA) Transform ¶

func (p *PCA) Transform(m mat.Matrix) (mat.Matrix, error)

Transform projects the matrix onto the first K principal components calculated during training (the Fit() method). The returned matrix will be of reduced dimensionality compared to the input (K x c compared to r x c of the input).

type Pipeline ¶

type Pipeline struct {
	Vectoriser   Vectoriser
	Transformers []Transformer
}

Pipeline is a mechanism for composing processing pipelines out of a vectoriser and transformation steps. For example, to compose a classic LSA/LSI pipeline (vectorisation -> TFIDF transformation -> Truncated SVD) one could use a Pipeline as follows:

lsaPipeline := NewPipeline(NewCountVectoriser(), NewTfidfTransformer(), NewTruncatedSVD(100))

func NewPipeline ¶

func NewPipeline(vectoriser Vectoriser, transformers ...Transformer) *Pipeline

NewPipeline constructs a new processing pipeline with the supplied Vectoriser and one or more transformers.

func (*Pipeline) Fit ¶

func (p *Pipeline) Fit(docs ...string) Vectoriser

Fit fits the model(s) to the supplied training data

func (*Pipeline) FitTransform ¶

func (p *Pipeline) FitTransform(docs ...string) (mat.Matrix, error)

FitTransform transforms the supplied documents into a matrix representation of numerical feature vectors fitting the model to the supplied data in the process.

func (*Pipeline) Transform ¶

func (p *Pipeline) Transform(docs ...string) (mat.Matrix, error)

Transform transforms the supplied documents into a matrix representation of numerical feature vectors using a model(s) previously fitted to supplied training data.

type RRIBasis ¶

type RRIBasis int

RRIBasis represents the initial basis for the index/elemental vectors used for Reflective Random Indexing.

const (
	// DocBasedRRI represents columns (documents/contexts in a term-document
	// matrix) forming the initial basis for index/elemental vectors in Random Indexing
	DocBasedRRI RRIBasis = iota

	// TermBasedRRI indicates rows (terms in a term-document matrix)
	// form the initial basis for index/elemental vectors in Reflective Random Indexing.
	TermBasedRRI
)

type RandomIndexing ¶

type RandomIndexing struct {
	// K specifies the number of dimensions for the semantic space
	K int

	// Density specifies the proportion of non-zero elements in the
	// elemental vectors
	Density float64

	// Type specifies the initial basis for the elemental vectors
	// i.e. whether they initially represent the rows or columns
	// This is only relevant for Reflective Random Indexing
	Type RRIBasis

	// Reflections specifies the number of reflective training cycles
	// to run during fitting for RRI (Reflective Random Indexing). For
	// Random Indexing (non-reflective) this is 0.
	Reflections int
	// contains filtered or unexported fields
}

RandomIndexing is a method of dimensionality reduction used for Latent Semantic Analysis in a similar way to TruncatedSVD and PCA. Random Indexing is designed to solve limitations of very high dimensional vector space model implementations for modelling term co-occurrence in language processing, such as SVD typically used for LSA/LSI (Latent Semantic Analysis/Latent Semantic Indexing). In implementation it bears some similarity to other random projection techniques such as those implemented in RandomProjection and SignRandomProjection within this package. The RandomIndexing type can also be used to perform Reflective Random Indexing, which extends the Random Indexing model with additional training cycles to better support indirect inference, i.e. finding synonyms where the words do not appear together in documents.

func NewRandomIndexing ¶

func NewRandomIndexing(k int, density float64) *RandomIndexing

NewRandomIndexing returns a new RandomIndexing transformer configured to transform term document matrices into k dimensional space. The density parameter specifies the density of the index/elemental vectors used to project the input matrix into lower dimensional space i.e. the proportion of elements that are non-zero.

func NewReflectiveRandomIndexing ¶

func NewReflectiveRandomIndexing(k int, basis RRIBasis, reflections int, density float64) *RandomIndexing

NewReflectiveRandomIndexing returns a new RandomIndexing type configured for Reflective Random Indexing. Reflective Random Indexing applies additional (reflective) training cycles on top of Random Indexing to capture indirect inferences (synonyms), i.e. similarity between terms that do not directly co-occur within the same context/document. basis specifies the basis for the reflective random indexing, i.e. whether the initial, random index/elemental vectors should represent documents (columns) or terms (rows). reflections is the number of additional training cycles to apply to build the elemental vectors. Specifying basis == DocBasedRRI and reflections == 0 is equivalent to conventional Random Indexing.

func (*RandomIndexing) Components ¶

func (r *RandomIndexing) Components() mat.Matrix

Components returns a t x k matrix where `t` is the number of terms (rows) in the training data matrix. The rows in this matrix are the `context` vectors for RI each one representing a semantic representation of a term based upon the contexts in which it has appeared within the training data.

func (*RandomIndexing) Fit ¶

func (r *RandomIndexing) Fit(m mat.Matrix) Transformer

Fit trains the model, creating random index/elemental vectors to be used to construct the new projected feature vectors ('context' vectors) in the reduced semantic dimensional space. If configured for Reflective Random Indexing then Fit may actually run multiple training cycles as specified during construction. The Fit method trains the model in batch mode so is intended to be called once, for online/streaming or mini-batch training please consider the PartialFit method instead.

func (*RandomIndexing) FitTransform ¶

func (r *RandomIndexing) FitTransform(m mat.Matrix) (mat.Matrix, error)

FitTransform is approximately equivalent to calling Fit() followed by Transform() on the same matrix. This is a useful shortcut where separate training data is not being used to fit the model i.e. the model is fitted on the fly to the test data. The returned matrix is a sparse CSR format matrix of shape k x c.

func (*RandomIndexing) PartialFit ¶

func (r *RandomIndexing) PartialFit(m mat.Matrix) OnlineTransformer

PartialFit extends the model to take account of the specified matrix m. The context vectors are learnt and stored to be used for future transformations and analysis. PartialFit performs Random Indexing even if the Transformer is configured for Reflective Random Indexing, so if RRI is required please train using the Fit() method as a batch operation. Unlike the Fit() method, the PartialFit() method is designed to be called multiple times to support online and mini-batch learning, whereas the Fit() method is only intended to be called once for batch learning.

func (*RandomIndexing) SetComponents ¶

func (r *RandomIndexing) SetComponents(m mat.Matrix)

SetComponents sets a t x k matrix where `t` is the number of terms (rows) in the training data matrix.

func (*RandomIndexing) Transform ¶

func (r *RandomIndexing) Transform(m mat.Matrix) (mat.Matrix, error)

Transform applies the transform, projecting matrix m into the lower dimensional semantic space. The output matrix will be of shape k x c and will be a sparse CSR format matrix. The transformation for each document vector is simply the accumulation of all trained context vectors relating to terms appearing in the document. These are weighted by the frequency the term appears in the document.
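The accumulation described above can be sketched for a single document using plain slices. This is an illustration only: `projectDocument` is a hypothetical helper, dense slices stand in for the sparse matrices used by the package, and the context vectors are made up.

```go
package main

import "fmt"

// projectDocument accumulates the context vectors of the terms in a
// document, weighting each by the term's frequency in that document.
// contexts is t x k (one row per term); termFreqs has length t.
func projectDocument(contexts [][]float64, termFreqs []float64) []float64 {
	if len(contexts) == 0 {
		return nil
	}
	out := make([]float64, len(contexts[0]))
	for term, freq := range termFreqs {
		if freq == 0 {
			continue // term absent from the document
		}
		for d, w := range contexts[term] {
			out[d] += freq * w
		}
	}
	return out
}

func main() {
	contexts := [][]float64{
		{1, 0, -1}, // context vector for term 0
		{0, 2, 0},  // context vector for term 1
		{1, 1, 1},  // context vector for term 2
	}
	doc := []float64{2, 0, 1} // term 0 appears twice, term 2 once
	fmt.Println(projectDocument(contexts, doc)) // prints [3 1 -1]
}
```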

type RandomProjection ¶

type RandomProjection struct {
	K       int
	Density float64
	// contains filtered or unexported fields
}

RandomProjection is a method of dimensionality reduction based upon the Johnson–Lindenstrauss lemma stating that a small set of points in a high-dimensional space can be embedded into a space of much lower dimension in such a way that distances between the points are nearly preserved.

The technique projects the original matrix orthogonally onto a random subspace, transforming the elements of the original matrix into a lower dimensional representation. Computing orthogonal matrices is expensive and so this technique uses specially generated random matrices (hence the name) following the principle that in high dimensional spaces, there are lots of nearly orthogonal matrices.
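One common way to generate such a matrix is the sparse scheme sketched below, in which roughly a `density` proportion of entries are non-zero and take the value +v or -v with v = sqrt(1 / (density * k)). This is an assumption made for illustration and may differ from how the package builds its matrix; `randomMatrix`, `project` and the `seed` parameter are hypothetical.

```go
package main

import (
	"fmt"
	"math"
	"math/rand"
)

// randomMatrix builds a k x dims projection matrix in which roughly a
// `density` proportion of entries are non-zero, each +v or -v with equal
// probability, where v = sqrt(1 / (density * k)).
func randomMatrix(k, dims int, density float64, seed int64) [][]float64 {
	rnd := rand.New(rand.NewSource(seed))
	v := math.Sqrt(1 / (density * float64(k)))
	m := make([][]float64, k)
	for i := range m {
		m[i] = make([]float64, dims)
		for j := range m[i] {
			if rnd.Float64() >= density {
				continue // leave this entry as zero
			}
			if rnd.Intn(2) == 0 {
				m[i][j] = v
			} else {
				m[i][j] = -v
			}
		}
	}
	return m
}

// project maps a column vector of length dims to a vector of length k.
func project(m [][]float64, vec []float64) []float64 {
	out := make([]float64, len(m))
	for i, row := range m {
		for j, w := range row {
			out[i] += w * vec[j]
		}
	}
	return out
}

func main() {
	m := randomMatrix(4, 10, 0.5, 1)
	doc := []float64{1, 0, 2, 0, 0, 1, 0, 0, 3, 0}
	fmt.Println(len(project(m, doc))) // prints 4
}
```

A lower density makes the matrix cheaper to store and multiply, at the cost of a noisier approximation of the original distances.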

func NewRandomProjection ¶

func NewRandomProjection(k int, density float64) *RandomProjection

NewRandomProjection creates and returns a new RandomProjection transformer. The RandomProjection will use a specially generated random matrix of the specified density and dimensionality k to perform the transform to k dimensional space.

func (*RandomProjection) Fit ¶

func (r *RandomProjection) Fit(m mat.Matrix) Transformer

Fit creates the random (almost) orthogonal matrix used to project input matrices into the new reduced dimensional subspace.

func (*RandomProjection) FitTransform ¶

func (r *RandomProjection) FitTransform(m mat.Matrix) (mat.Matrix, error)

FitTransform is approximately equivalent to calling Fit() followed by Transform() on the same matrix. This is a useful shortcut where separate training data is not being used to fit the model i.e. the model is fitted on the fly to the test data. The returned matrix is a sparse CSR format matrix of shape k x c.

func (*RandomProjection) Transform ¶

func (r *RandomProjection) Transform(m mat.Matrix) (mat.Matrix, error)

Transform applies the transformation, projecting the input matrix into the reduced dimensional subspace. The transformed matrix will be a sparse CSR format matrix of shape k x c.

type RegExpTokeniser ¶

type RegExpTokeniser struct {
	RegExp    *regexp.Regexp
	StopWords map[string]bool
}

RegExpTokeniser implements Tokeniser interface using a basic RegExp pattern for unary-gram word tokeniser supporting optional stop word removal

func (*RegExpTokeniser) ForEachIn ¶

func (t *RegExpTokeniser) ForEachIn(text string, f func(token string))

ForEachIn iterates over each token within text and invokes function f with the token as parameter. If StopWords is not nil then any tokens from text present in StopWords will be ignored.

func (*RegExpTokeniser) Tokenise ¶

func (t *RegExpTokeniser) Tokenise(text string) []string

Tokenise returns a slice of all the tokens contained in string text. If StopWords is not nil then any tokens from text present in StopWords will be removed from the slice.
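The behaviour described above can be approximated with the standard library. The sketch below is not the package's code: the `\p{L}+` pattern, the lower-casing step and the names `wordPattern` and `tokenise` are assumptions made for illustration.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// wordPattern matches runs of Unicode letters (an assumed pattern).
var wordPattern = regexp.MustCompile(`\p{L}+`)

// tokenise lower-cases text, extracts the tokens matching wordPattern and
// drops any token present in stopWords.
func tokenise(stopWords map[string]bool, text string) []string {
	var tokens []string
	for _, tok := range wordPattern.FindAllString(strings.ToLower(text), -1) {
		if stopWords[tok] { // lookups on a nil map are safe and return false
			continue
		}
		tokens = append(tokens, tok)
	}
	return tokens
}

func main() {
	stop := map[string]bool{"the": true, "over": true}
	fmt.Println(tokenise(stop, "The quick brown fox jumped over the lazy dog"))
	// prints [quick brown fox jumped lazy dog]
}
```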

type SignRandomProjection ¶

type SignRandomProjection struct {
	// Bits represents the number of bits the output vectors should
	// be in length and hence the number of random hyperplanes needed
	// for the transformation
	Bits int
	// contains filtered or unexported fields
}

SignRandomProjection represents a transform of a matrix into a lower dimensional space. Sign Random Projection is a method of Locality Sensitive Hashing (LSH) sometimes referred to as the random hyperplane method. A set of random hyperplanes are created in the original dimensional space and then input matrices are expressed relative to the random hyperplanes as follows:

For each column vector in the input matrix, construct a corresponding output
bit vector with each bit (i) calculated as follows:
	if dot(vector, hyperplane[i]) > 0
		bit[i] = 1
	else
		bit[i] = 0

Whilst similar to other methods of random projection this method is unique in that it uses only a single bit in the output matrix to represent the sign of the result of the comparison (Dot product) with each hyperplane so encodes vector representations with very low memory and processor requirements whilst preserving relative distance between vectors from the original space. Hamming similarity (and distance) between the transformed vectors in the subspace can approximate Angular similarity (and distance) (which is strongly related to Cosine similarity) of the associated vectors from the original space.
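The pseudocode above can be written out in a few lines of Go. In this sketch the hyperplanes are fixed rather than random so that the result is deterministic, and bits are held in a []bool rather than a packed binary vector; `signBits` and `hamming` are hypothetical helper names.

```go
package main

import "fmt"

// signBits returns one bit per hyperplane: true where dot(vec, plane) > 0.
func signBits(vec []float64, hyperplanes [][]float64) []bool {
	bits := make([]bool, len(hyperplanes))
	for i, plane := range hyperplanes {
		var dot float64
		for j, w := range plane {
			dot += w * vec[j]
		}
		bits[i] = dot > 0
	}
	return bits
}

// hamming counts the positions at which two equal-length bit vectors differ.
func hamming(a, b []bool) int {
	n := 0
	for i := range a {
		if a[i] != b[i] {
			n++
		}
	}
	return n
}

func main() {
	// Fixed hyperplanes keep the example deterministic; in practice they are random.
	planes := [][]float64{{1, 0}, {0, 1}, {1, 1}, {1, -1}}
	a := signBits([]float64{2, 1}, planes)
	b := signBits([]float64{1, 2}, planes)
	c := signBits([]float64{-2, -1}, planes)
	fmt.Println(hamming(a, b), hamming(a, c)) // prints 1 4
}
```

Vectors pointing in similar directions (a and b) differ in few bits, while opposite vectors (a and c) differ in every bit.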

func NewSignRandomProjection ¶

func NewSignRandomProjection(bits int) *SignRandomProjection

NewSignRandomProjection constructs a new SignRandomProjection transformer to reduce the dimensionality. The number of random hyperplanes used by the transformer is specified by `bits`, which is also the dimensionality of the output, transformed matrices.

func (*SignRandomProjection) Fit ¶

func (s *SignRandomProjection) Fit(m mat.Matrix) Transformer

Fit creates the random hyperplanes from the input training data matrix, m, and stores the hyperplanes as a transform to apply to matrices.

func (*SignRandomProjection) FitTransform ¶

func (s *SignRandomProjection) FitTransform(m mat.Matrix) (mat.Matrix, error)

FitTransform is approximately equivalent to calling Fit() followed by Transform() on the same matrix. This is a useful shortcut where separate training data is not being used to fit the model i.e. the model is fitted on the fly to the test data. The returned matrix is a Binary matrix or BinaryVec type depending upon whether m is Matrix or Vector.

func (*SignRandomProjection) Transform ¶

func (s *SignRandomProjection) Transform(m mat.Matrix) (mat.Matrix, error)

Transform applies the transform decomposed from the training data matrix in Fit() to the input matrix. The columns in the resulting output matrix will be a low dimensional binary representation of the columns within the original i.e. a hash or fingerprint that can be quickly and efficiently compared with other similar vectors. Hamming similarity in the new dimensional space can be used to approximate Cosine similarity between the vectors of the original space. The returned matrix is a Binary matrix or BinaryVec type depending upon whether m is Matrix or Vector.

type SimHash ¶

type SimHash struct {
	// contains filtered or unexported fields
}

SimHash implements the SimHash Locality Sensitive Hashing (LSH) algorithm for angular distance using sign random projections based on the work of Moses S. Charikar. The distance between the original vectors is preserved through the hashing process such that hashed vectors can be compared using Hamming Similarity for a faster, more space efficient, approximation of Cosine Similarity for the original vectors.

Charikar, Moses S. "Similarity Estimation Techniques from Rounding Algorithms" in Proceedings of the thirty-fourth annual ACM symposium on Theory of computing - STOC ’02, 2002, p. 380. https://www.cs.princeton.edu/courses/archive/spr04/cos598B/bib/CharikarEstim.pdf

func NewSimHash ¶

func NewSimHash(bits int, dim int) *SimHash

NewSimHash constructs a new SimHash creating a set of locality sensitive hash functions which are combined to accept input vectors of length dim and produce hashed binary vector fingerprints of length bits. This method creates a series of random hyperplanes which are then compared to each input vector to produce the output hashed binary vector encoding the input vector's location in vector space relative to the hyperplanes. Each bit in the output vector corresponds to the sign (1/0 for +/-) of the result of the dot product comparison with each random hyperplane.
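For random hyperplanes, the probability that a given bit differs between two fingerprints is theta / pi, where theta is the angle between the original vectors. The sketch below uses that relationship to turn a Hamming distance into an estimate of cosine similarity; `estimateCosine` is a hypothetical helper and not part of the package.

```go
package main

import (
	"fmt"
	"math"
)

// estimateCosine converts the Hamming distance between two `bits`-long
// fingerprints into an estimate of the cosine similarity of the original
// vectors, using P(bit differs) = theta / pi for random hyperplanes.
func estimateCosine(hammingDistance, bits int) float64 {
	theta := math.Pi * float64(hammingDistance) / float64(bits)
	return math.Cos(theta)
}

func main() {
	for _, h := range []int{0, 16, 32, 64} {
		fmt.Printf("hamming %2d of 64 bits -> cosine ~ %.3f\n", h, estimateCosine(h, 64))
	}
}
```

The estimate becomes more reliable as the number of bits grows, since each bit is an independent sample of the same probability.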

func (*SimHash) Hash ¶

func (h *SimHash) Hash(v mat.Vector) *sparse.BinaryVec

Hash accepts a Vector and outputs a BinaryVec (which also implements the Gonum Vector interface). This method will panic if the input vector is of a different length than the dim parameter used when constructing the SimHash.

type TfidfTransformer ¶

type TfidfTransformer struct {
	// contains filtered or unexported fields
}

TfidfTransformer takes a raw term document matrix and weights each raw term frequency value depending upon how commonly it occurs across all documents within the corpus. For example a very commonly occurring word like `the` is likely to occur in all documents and so would be weighted down. More precisely, TfidfTransformer applies a tf-idf algorithm to the matrix where each term frequency is multiplied by the inverse document frequency. Inverse document frequency is calculated as log(n/df) where df is the number of documents in which the term occurs and n is the total number of documents within the corpus. We add 1 to both n and df before division to prevent division by zero.
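The weighting described above, with 1 added to both n and df, can be worked through with a small sketch. The `idf` function is a hypothetical helper written from the description and is not the package's code.

```go
package main

import (
	"fmt"
	"math"
)

// idf returns the smoothed inverse document frequency log((1+n)/(1+df)),
// where n is the number of documents and df the number containing the term.
func idf(n, df int) float64 {
	return math.Log(float64(1+n) / float64(1+df))
}

func main() {
	n := 3 // documents in the corpus
	fmt.Printf("term in every document: %.4f\n", idf(n, 3))
	fmt.Printf("term in one document:   %.4f\n", idf(n, 1))
	// A raw count of 2 for the rarer term is weighted as tf * idf.
	fmt.Printf("tf-idf for tf=2, df=1:  %.4f\n", 2*idf(n, 1))
}
```

A term that occurs in every document receives a weight of zero, which is how very common words such as `the` are weighted down.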

func NewTfidfTransformer ¶

func NewTfidfTransformer() *TfidfTransformer

NewTfidfTransformer constructs a new TfidfTransformer.

func (*TfidfTransformer) Fit ¶

func (t *TfidfTransformer) Fit(matrix mat.Matrix) Transformer

Fit takes a training term document matrix, counts term occurrences across all documents and constructs an inverse document frequency transform to apply to matrices in subsequent calls to Transform().

func (*TfidfTransformer) FitTransform ¶

func (t *TfidfTransformer) FitTransform(matrix mat.Matrix) (mat.Matrix, error)

FitTransform is exactly equivalent to calling Fit() followed by Transform() on the same matrix. This is a convenience where separate training data is not being used to fit the model i.e. the model is fitted on the fly to the test data. The returned matrix is a sparse matrix type.

func (*TfidfTransformer) Load ¶

func (t *TfidfTransformer) Load(r io.Reader) error

Load binary deserialises the previously serialised model into the receiver. This is useful for loading a previously trained and saved model from another context (e.g. offline training) for use within another context (e.g. production) for reproducible results. Load should only be performed with trusted data.

func (TfidfTransformer) Save ¶

func (t TfidfTransformer) Save(w io.Writer) error

Save binary serialises the model and writes it into w. This is useful for persisting a trained model to disk so that it may be loaded (using the Load() method) in another context (e.g. production) for reproducible results.

func (*TfidfTransformer) Transform ¶

func (t *TfidfTransformer) Transform(matrix mat.Matrix) (mat.Matrix, error)

Transform applies the inverse document frequency (IDF) transform by multiplying each term frequency by its corresponding IDF value. This has the effect of weighting each term frequency according to how often it appears across the whole document corpus so that naturally frequent occurring words are given less weight than uncommon ones. The returned matrix is a sparse matrix type.

type Tokeniser ¶

type Tokeniser interface {
	// ForEachIn iterates over each token within text and invokes function
	// f with the token as parameter
	ForEachIn(text string, f func(token string))

	// Tokenise returns a slice of all the tokens contained in string
	// text
	Tokenise(text string) []string
}

Tokeniser is the interface for tokenisers, allowing substitution of different tokenisation strategies (e.g. Regexp) and also supporting different token types, n-grams and languages.

func NewTokeniser ¶

func NewTokeniser(stopWords ...string) Tokeniser

NewTokeniser returns a new, default Tokeniser implementation. stopWords is a potentially empty string slice containing the words that should be removed from the corpus. The default RegExpTokeniser will split words by whitespace/tabs: "\t\n\f\r "

type Transformer ¶

type Transformer interface {
	Fit(mat.Matrix) Transformer
	Transform(mat mat.Matrix) (mat.Matrix, error)
	FitTransform(mat mat.Matrix) (mat.Matrix, error)
}

Transformer provides a common interface for transformer steps.

type TruncatedSVD ¶

type TruncatedSVD struct {
	// Components is the truncated term matrix (matrix U of the Singular Value Decomposition
	// (A=USV^T)).  The matrix will be of size m, k where m = the number of unique terms
	// in the training data and k = the number of elements to truncate to (specified by
	// attribute K) or m or n (the number of documents in the training data) whichever of
	// the 3 values is smaller.
	Components *mat.Dense

	// K is the number of dimensions to which the output, transformed, matrix should be
	// truncated.  The matrix output by the FitTransform() and Transform() methods will
	// be n rows by min(m, n, K) columns, where n is the number of columns in the original,
	// input matrix and min(m, n, K) is the lowest value of m, n, K where m is the number of
	// rows in the original, input matrix.
	K int
}

TruncatedSVD implements the Singular Value Decomposition factorisation of matrices. This produces an approximation of the input matrix at a lower rank. This is a core component of LSA (Latent Semantic Analysis).
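
The size rule stated in the field comments, min(m, n, K), can be written out as a small helper to make the edge cases concrete. The name truncatedRank is hypothetical and not part of the package.

```go
package main

import "fmt"

// truncatedRank returns min(m, n, k): the number of dimensions kept when
// an m x n term-document matrix is truncated to at most k dimensions.
func truncatedRank(m, n, k int) int {
	r := k
	if m < r {
		r = m
	}
	if n < r {
		r = n
	}
	return r
}

func main() {
	// Large corpus: K is the limiting value.
	fmt.Println(truncatedRank(5000, 300, 100))
	// Small corpus: the number of documents is the limiting value.
	fmt.Println(truncatedRank(50, 30, 100))
}
```

In other words, requesting a K larger than the number of terms or documents does not produce extra dimensions; the result is capped by the smaller side of the input matrix.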

func NewTruncatedSVD ¶

func NewTruncatedSVD(k int) *TruncatedSVD

NewTruncatedSVD creates a new TruncatedSVD transformer with K (the truncated dimensionality) set to the specified value k.

func (*TruncatedSVD) Fit ¶

func (t *TruncatedSVD) Fit(mat mat.Matrix) Transformer

Fit performs the SVD factorisation on the input training data matrix, mat, and stores the output term matrix as a transform to apply to matrices in the Transform method.

func (*TruncatedSVD) FitTransform ¶

func (t *TruncatedSVD) FitTransform(m mat.Matrix) (mat.Matrix, error)

FitTransform is approximately equivalent to calling Fit() followed by Transform() on the same matrix. This is a useful shortcut where separate training data is not being used to fit the model i.e. the model is fitted on the fly to the test data. The returned matrix is a dense matrix type.

func (*TruncatedSVD) Load ¶

func (t *TruncatedSVD) Load(r io.Reader) error

Load binary deserialises the previously serialised model into the receiver. This is useful for loading a previously trained and saved model from another context (e.g. offline training) for use within another context (e.g. production) for reproducible results. Load should only be performed with trusted data.

func (TruncatedSVD) Save ¶

func (t TruncatedSVD) Save(w io.Writer) error

Save binary serialises the model and writes it into w. This is useful for persisting a trained model to disk so that it may be loaded (using the Load() method) in another context (e.g. production) for reproducible results.
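
The Save/Load round trip can be sketched with a stand-in type that has the same method shapes (a value receiver for Save, a pointer receiver for Load). Everything here is an assumption made for illustration: the model type, the roundTrip helper, and the use of encoding/gob are hypothetical, and the package's actual serialisation format is not shown.

```go
package main

import (
	"bytes"
	"encoding/gob"
	"fmt"
	"io"
)

// model is a hypothetical stand-in for a trained transformer.
type model struct {
	K       int
	Weights []float64
}

// Save writes the model to w. Using gob is an assumption of this sketch.
func (m model) Save(w io.Writer) error {
	return gob.NewEncoder(w).Encode(m)
}

// Load replaces the receiver's state with the model read from r.
func (m *model) Load(r io.Reader) error {
	return gob.NewDecoder(r).Decode(m)
}

// roundTrip saves m into an in-memory buffer (standing in for a file on
// disk) and loads it back into a fresh value.
func roundTrip(m model) (model, error) {
	var buf bytes.Buffer
	if err := m.Save(&buf); err != nil {
		return model{}, err
	}
	var restored model
	if err := restored.Load(&buf); err != nil {
		return model{}, err
	}
	return restored, nil
}

func main() {
	restored, err := roundTrip(model{K: 2, Weights: []float64{0.5, 1.5}})
	if err != nil {
		panic(err)
	}
	fmt.Println(restored.K, restored.Weights)
}
```

Because both methods take only io.Writer and io.Reader, the same calls work unchanged with an os.File, a network connection, or a buffer in a test.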

func (*TruncatedSVD) Transform ¶

func (t *TruncatedSVD) Transform(m mat.Matrix) (mat.Matrix, error)

Transform applies the transform decomposed from the training data matrix in Fit() to the input matrix. The resulting output matrix will be the closest approximation to the input matrix at a reduced rank. The returned matrix is a dense matrix type.

type Vectoriser ¶

type Vectoriser interface {
	Fit(...string) Vectoriser
	Transform(...string) (mat.Matrix, error)
	FitTransform(...string) (mat.Matrix, error)
}

Vectoriser provides a common interface for vectorisers that take a variable set of string arguments and produce a numerical matrix of features.
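
To illustrate what "strings in, numerical matrix out" can mean, the sketch below builds a term-document count matrix from a set of documents. The countVectorise helper is hypothetical and much simpler than the package's vectorisers: it splits on whitespace only and returns plain slices rather than a mat.Matrix.

```go
package main

import (
	"fmt"
	"strings"
)

// countVectorise is a hypothetical helper. It learns a vocabulary from
// docs and returns it with a term-document count matrix that has one row
// per term and one column per document.
func countVectorise(docs ...string) (map[string]int, [][]float64) {
	// First pass: assign each new word the next free row index.
	vocab := make(map[string]int)
	for _, doc := range docs {
		for _, w := range strings.Fields(doc) {
			if _, ok := vocab[w]; !ok {
				vocab[w] = len(vocab)
			}
		}
	}
	// Second pass: count occurrences of each term in each document.
	counts := make([][]float64, len(vocab))
	for i := range counts {
		counts[i] = make([]float64, len(docs))
	}
	for j, doc := range docs {
		for _, w := range strings.Fields(doc) {
			counts[vocab[w]][j]++
		}
	}
	return vocab, counts
}

func main() {
	vocab, counts := countVectorise("a b a", "b c")
	fmt.Println(vocab)
	fmt.Println(counts)
}
```

The two passes correspond loosely to the Fit step (learning the vocabulary) and the Transform step (counting against it).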

Directories ¶

Path	Synopsis
measures
pairwise