nlp

package module
Version: v0.0.0-...-26d441f
Published: May 11, 2021 License: MIT Imports: 20 Imported by: 15

README

Natural Language Processing

nlp

Implementations of selected machine learning algorithms for natural language processing in golang. The primary focus for the package is the statistical semantics of plain-text documents supporting semantic analysis and retrieval of semantically similar documents.

Built upon the Gonum package for linear algebra and scientific computing with some inspiration taken from Python's scikit-learn and Gensim.

Check out the companion blog post or the Go documentation page for full usage and examples.


Features

Planned

  • Expanded persistence support
  • Stemming to treat words with common root as the same e.g. "go" and "going"
  • Clustering algorithms e.g. Hierarchical, K-means, etc.
  • Classification algorithms e.g. SVM, KNN, random forest, etc.

References

  1. Rosario, Barbara. Latent Semantic Indexing: An overview. INFOSYS 240 Spring 2000
  2. Latent Semantic Analysis, a scholarpedia article on LSA written by Tom Landauer, one of the creators of LSA.
  3. Thomo, Alex. Latent Semantic Analysis (Tutorial).
  4. Latent Semantic Indexing. Stanford NLP Course
  5. Charikar, Moses S. "Similarity Estimation Techniques from Rounding Algorithms" in Proceedings of the thirty-fourth annual ACM symposium on Theory of computing - STOC ’02, 2002, p. 380.
  6. M. Bawa, T. Condie, and P. Ganesan, “LSH forest: self-tuning indexes for similarity search,” Proc. 14th Int. Conf. World Wide Web - WWW ’05, p. 651, 2005.
  7. A. Gionis, P. Indyk, and R. Motwani, “Similarity Search in High Dimensions via Hashing,” VLDB ’99 Proc. 25th Int. Conf. Very Large Data Bases, vol. 99, no. 1, pp. 518–529, 1999.
  8. Kanerva, Pentti, Kristoferson, Jan and Holst, Anders (2000). Random Indexing of Text Samples for Latent Semantic Analysis
  9. Rangan, Venkat. Discovery of Related Terms in a corpus using Reflective Random Indexing
  10. Vasuki, Vidya and Cohen, Trevor. Reflective random indexing for semi-automatic indexing of the biomedical literature
  11. QasemiZadeh, Behrang and Handschuh, Siegfried. Random Indexing Explained with High Probability
  12. Foulds, James; Boyles, Levi; Dubois, Christopher; Smyth, Padhraic; Welling, Max (2013). Stochastic Collapsed Variational Bayesian Inference for Latent Dirichlet Allocation

Documentation

Overview

Package nlp provides implementations of selected machine learning algorithms for natural language processing of text corpora. The primary focus is the statistical semantics of plain-text documents supporting semantic analysis and retrieval of semantically similar documents.

The package makes use of the Gonum (http://www.gonum.org/) library for linear algebra and scientific computing with some inspiration taken from Python's scikit-learn (http://scikit-learn.org/stable/) and Gensim (https://radimrehurek.com/gensim/).

Overview

The primary intended use case is to support document input as text strings encoded as a matrix of numerical feature vectors called a `term document matrix`. Each column in the matrix corresponds to a document in the corpus and each row corresponds to a unique term occurring in the corpus. The individual elements within the matrix contain the frequency with which each term occurs within each document (referred to as `term frequency`). Whilst textual data from document corpora are the primary intended use case, the algorithms can be used with other types of data from other sources once encoded (vectorised) into a suitable matrix e.g. image data, sound data, users/products, etc.

These matrices can be processed and manipulated through the application of additional transformations for weighting features, identifying relationships or optimising the data for analysis, information retrieval and/or predictions.

Typically the algorithms in this package implement one of three primary interfaces:

Vectoriser - Takes document input as strings and outputs matrices of numerical features e.g. term frequency.
Transformer - Takes matrices of numerical features and applies some logic/transformation to output a new matrix.
Comparer - Takes two vectors (columns from a matrix) and outputs a distance/similarity measure.

One of the implementations of Vectoriser is Pipeline which can be used to wire together pipelines composed of a Vectoriser and one or more Transformers arranged in serial so that the output from each stage forms the input of the next. This can be used to construct a classic LSI (Latent Semantic Indexing) pipeline (vectoriser -> TF.IDF weighting -> Truncated SVD):

pipeline := nlp.NewPipeline(
	nlp.NewCountVectoriser(stopWords...), // stopWords is a []string and may be empty
	nlp.NewTfidfTransformer(),
	nlp.NewTruncatedSVD(100),
)

Whilst they take different inputs, both Vectorisers and Transformers have 3 primary methods:

Fit() - Trains the model based upon the supplied, input training data.
Transform() - Transforms the input into the output matrix (requires the model to be already fitted by a previous call to Fit() or FitTransform()).
FitTransform() - Convenience method combining Fit() and Transform() methods to transform input data, fitting the model to the input data in the process.
Example
package main

import (
	"fmt"

	"github.com/james-bowman/nlp"
	"github.com/james-bowman/nlp/measures/pairwise"
	"gonum.org/v1/gonum/mat"
)

func main() {
	testCorpus := []string{
		"The quick brown fox jumped over the lazy dog",
		"hey diddle diddle, the cat and the fiddle",
		"the cow jumped over the moon",
		"the little dog laughed to see such fun",
		"and the dish ran away with the spoon",
	}

	var stopWords = []string{"a", "about", "above", "above", "across", "after", "afterwards", "again", "against", "all", "almost", "alone", "along", "already", "also", "although", "always", "am", "among", "amongst", "amoungst", "amount", "an", "and", "another", "any", "anyhow", "anyone", "anything", "anyway", "anywhere", "are", "around", "as", "at", "back", "be", "became", "because", "become", "becomes", "becoming", "been", "before", "beforehand", "behind", "being", "below", "beside", "besides", "between", "beyond", "bill", "both", "bottom", "but", "by", "call", "can", "cannot", "cant", "co", "con", "could", "couldnt", "cry", "de", "describe", "detail", "do", "done", "down", "due", "during", "each", "eg", "eight", "either", "eleven", "else", "elsewhere", "empty", "enough", "etc", "even", "ever", "every", "everyone", "everything", "everywhere", "except", "few", "fifteen", "fify", "fill", "find", "fire", "first", "five", "for", "former", "formerly", "forty", "found", "four", "from", "front", "full", "further", "get", "give", "go", "had", "has", "hasnt", "have", "he", "hence", "her", "here", "hereafter", "hereby", "herein", "hereupon", "hers", "herself", "him", "himself", "his", "how", "however", "hundred", "ie", "if", "in", "inc", "indeed", "interest", "into", "is", "it", "its", "itself", "keep", "last", "latter", "latterly", "least", "less", "ltd", "made", "many", "may", "me", "meanwhile", "might", "mill", "mine", "more", "moreover", "most", "mostly", "move", "much", "must", "my", "myself", "name", "namely", "neither", "never", "nevertheless", "next", "nine", "no", "nobody", "none", "noone", "nor", "not", "nothing", "now", "nowhere", "of", "off", "often", "on", "once", "one", "only", "onto", "or", "other", "others", "otherwise", "our", "ours", "ourselves", "out", "over", "own", "part", "per", "perhaps", "please", "put", "rather", "re", "same", "see", "seem", "seemed", "seeming", "seems", "serious", "several", "she", "should", "show", "side", "since", "sincere", "six", 
"sixty", "so", "some", "somehow", "someone", "something", "sometime", "sometimes", "somewhere", "still", "such", "system", "take", "ten", "than", "that", "the", "their", "them", "themselves", "then", "thence", "there", "thereafter", "thereby", "therefore", "therein", "thereupon", "these", "they", "thickv", "thin", "third", "this", "those", "though", "three", "through", "throughout", "thru", "thus", "to", "together", "too", "top", "toward", "towards", "twelve", "twenty", "two", "un", "under", "until", "up", "upon", "us", "very", "via", "was", "we", "well", "were", "what", "whatever", "when", "whence", "whenever", "where", "whereafter", "whereas", "whereby", "wherein", "whereupon", "wherever", "whether", "which", "while", "whither", "who", "whoever", "whole", "whom", "whose", "why", "will", "with", "within", "without", "would", "yet", "you", "your", "yours", "yourself", "yourselves"}

	query := "the brown fox ran around the dog"

	vectoriser := nlp.NewCountVectoriser(stopWords...)
	transformer := nlp.NewTfidfTransformer()

	// set k (the number of dimensions following truncation) to 4
	reducer := nlp.NewTruncatedSVD(4)

	lsiPipeline := nlp.NewPipeline(vectoriser, transformer, reducer)

	// Transform the corpus into an LSI fitting the model to the documents in the process
	lsi, err := lsiPipeline.FitTransform(testCorpus...)
	if err != nil {
		fmt.Printf("Failed to process documents because %v", err)
		return
	}

	// run the query through the same pipeline that was fitted to the corpus
	// to project it into the same dimensional space
	queryVector, err := lsiPipeline.Transform(query)
	if err != nil {
		fmt.Printf("Failed to process documents because %v", err)
		return
	}

	// iterate over document feature vectors (columns) in the LSI matrix and compare
	// with the query vector for similarity.  Similarity is measured as the cosine of
	// the angle between the two vectors, known as the cosine similarity
	highestSimilarity := -1.0
	var matched int
	_, docs := lsi.Dims()
	for i := 0; i < docs; i++ {
		similarity := pairwise.CosineSimilarity(queryVector.(mat.ColViewer).ColView(0), lsi.(mat.ColViewer).ColView(i))
		if similarity > highestSimilarity {
			matched = i
			highestSimilarity = similarity
		}
	}

	fmt.Printf("Matched '%s'", testCorpus[matched])
}
Output:

Matched 'The quick brown fox jumped over the lazy dog'
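
The comparison step in the example above relies on pairwise.CosineSimilarity. As a rough illustration of what such a measure computes, here is a sketch using only the standard library. The function name cosineSimilarity, the use of plain []float64 slices, and the choice to return 0 for zero-length or mismatched vectors are assumptions made for this sketch and are not taken from the package.

```go
package main

import (
	"fmt"
	"math"
)

// cosineSimilarity returns the cosine of the angle between a and b.
// It returns 0 if either vector has zero magnitude or the lengths differ
// (a simplification chosen for this sketch).
func cosineSimilarity(a, b []float64) float64 {
	if len(a) != len(b) {
		return 0
	}
	var dot, normA, normB float64
	for i := range a {
		dot += a[i] * b[i]
		normA += a[i] * a[i]
		normB += b[i] * b[i]
	}
	if normA == 0 || normB == 0 {
		return 0
	}
	return dot / (math.Sqrt(normA) * math.Sqrt(normB))
}

func main() {
	// Orthogonal vectors have no similarity; parallel vectors have maximal similarity.
	fmt.Println(cosineSimilarity([]float64{1, 0}, []float64{0, 1}))
	fmt.Println(cosineSimilarity([]float64{1, 2}, []float64{2, 4}))
}
```

Because the measure depends only on the angle between the vectors, documents of very different lengths can still score as highly similar when they use terms in similar proportions.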

Constants

This section is empty.

Variables

This section is empty.

Functions

func ColDo

func ColDo(m mat.Matrix, fn func(j int, vec mat.Vector))

ColDo executes fn for each column j in m. If the matrix implements the mat.ColViewer interface then this interface will be used to iterate over the column vectors more efficiently. If the matrix implements the sparse.TypeConverter interface then the matrix will be converted to a CSC matrix (which implements the mat.ColViewer interface) so that it can benefit from the same optimisation.

func ColNonZeroElemDo

func ColNonZeroElemDo(m mat.Matrix, j int, fn func(i, j int, v float64))

ColNonZeroElemDo executes fn for each non-zero element in column j of matrix m. If m implements mat.ColNonZeroDoer then this interface will be used to perform the iteration.

func CreateRandomProjectionTransform

func CreateRandomProjectionTransform(newDims, origDims int, density float64, rnd *rand.Rand) mat.Matrix

CreateRandomProjectionTransform returns a new random matrix for Random Projections of shape newDims x origDims. The matrix will be randomly populated using probability distributions where density is used as the probability that each element will be populated. Populated values will be randomly selected from [-1, 1] scaled according to the density and dimensions of the matrix. If rnd is nil then a new random number generator will be created and used.
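
To make the description above more concrete, the following sketch builds a sparse random matrix with only the standard library. It is not the package's implementation: the names randomProjection and project are hypothetical, the matrix is a plain [][]float64, and the scale factor sqrt(1/(density*newDims)) is an assumption borrowed from common sparse random projection schemes rather than something confirmed by this documentation.

```go
package main

import (
	"fmt"
	"math"
	"math/rand"
)

// randomProjection returns a newDims x origDims matrix in which each element
// is non-zero with probability density. Non-zero elements are +scale or -scale
// with equal probability. The scale sqrt(1/(density*newDims)) is an assumption.
func randomProjection(newDims, origDims int, density float64, rnd *rand.Rand) [][]float64 {
	if rnd == nil {
		rnd = rand.New(rand.NewSource(1)) // hypothetical default seed
	}
	scale := math.Sqrt(1 / (density * float64(newDims)))
	m := make([][]float64, newDims)
	for i := range m {
		m[i] = make([]float64, origDims)
		for j := range m[i] {
			if rnd.Float64() >= density {
				continue // leave the element as zero
			}
			if rnd.Intn(2) == 0 {
				m[i][j] = scale
			} else {
				m[i][j] = -scale
			}
		}
	}
	return m
}

// project multiplies the projection matrix p by the column vector v,
// reducing v from origDims to newDims dimensions.
func project(p [][]float64, v []float64) []float64 {
	out := make([]float64, len(p))
	for i, row := range p {
		for j, x := range row {
			out[i] += x * v[j]
		}
	}
	return out
}

func main() {
	p := randomProjection(3, 10, 0.5, rand.New(rand.NewSource(42)))
	v := []float64{1, 0, 2, 0, 0, 1, 0, 0, 3, 0}
	fmt.Println(project(p, v)) // a 3-dimensional projection of v
}
```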

Types

type ClassicLSH

type ClassicLSH struct {
	// contains filtered or unexported fields
}

ClassicLSH supports finding top-k Approximate Nearest Neighbours (ANN) using Locality Sensitive Hashing (LSH). The classic LSH scheme, based on the work of A. Gionis et al., uses hash tables to store items by their locality sensitive hash code. Items that map to the same bucket (their hash codes collide) are likely to be similar. Multiple hash tables are used to improve recall, because with only a single table some similar items would hash to separate, neighbouring buckets.

A. Gionis, P. Indyk, and R. Motwani, “Similarity Search in High Dimensions via Hashing,” VLDB ’99 Proc. 25th Int. Conf. Very Large Data Bases, vol. 99, no. 1, pp. 518–529, 1999. http://www.cs.princeton.edu/courses/archive/spring13/cos598C/Gionis.pdf http://portal.acm.org/citation.cfm?id=671516

func NewClassicLSH

func NewClassicLSH(functions, tables int) *ClassicLSH

NewClassicLSH creates a new ClassicLSH with the configured number of hash tables and hash functions per table. The length of hash signatures used in this type's methods (Put() and GetCandidates()) should be exactly equal to functions * tables. The Classic LSH algorithm uses multiple hash tables to improve recall for similar items that hash to nearby buckets within a specific hash table.

func (*ClassicLSH) GetCandidates

func (l *ClassicLSH) GetCandidates(query *sparse.BinaryVec, k int) []interface{}

GetCandidates returns the IDs of candidate nearest neighbours. It is up to the calling code to further filter these candidates based on distance to arrive at the top-k approximate nearest neighbours. The number of candidates returned may be smaller or larger than k. The method panics if the signature is not the same length as tables * functions.

func (*ClassicLSH) Put

func (l *ClassicLSH) Put(id interface{}, signature *sparse.BinaryVec)

Put stores the specified LSH signature and associated ID in the LSH index. The method panics if the signature is not the same length as tables * functions.

func (*ClassicLSH) Remove

func (l *ClassicLSH) Remove(id interface{})

Remove removes the specified item from the LSH index.
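
The relationship between the signature length and functions * tables can be illustrated with a toy index. In this sketch the signature is split into one run of bits per table, and each run acts as the bucket key in its table. The type classicLSH, its methods, the []bool signatures, the int IDs and the contiguous layout of the runs are all assumptions for illustration and may differ from how the package organises its tables.

```go
package main

import "fmt"

// classicLSH is a toy index holding one hash table per run of signature bits.
type classicLSH struct {
	functions int
	tables    []map[string][]int // bucket key -> item IDs
}

func newClassicLSH(functions, tables int) *classicLSH {
	l := &classicLSH{functions: functions, tables: make([]map[string][]int, tables)}
	for i := range l.tables {
		l.tables[i] = make(map[string][]int)
	}
	return l
}

// check panics if the signature length is not functions * tables.
func (l *classicLSH) check(sig []bool) {
	if len(sig) != l.functions*len(l.tables) {
		panic("signature length must equal functions * tables")
	}
}

// key returns the bucket key for table t: the t-th run of l.functions bits.
func (l *classicLSH) key(sig []bool, t int) string {
	b := make([]byte, l.functions)
	for i := range b {
		b[i] = '0'
		if sig[t*l.functions+i] {
			b[i] = '1'
		}
	}
	return string(b)
}

// put stores id under its bucket in every table.
func (l *classicLSH) put(id int, sig []bool) {
	l.check(sig)
	for t := range l.tables {
		k := l.key(sig, t)
		l.tables[t][k] = append(l.tables[t][k], id)
	}
}

// candidates returns the de-duplicated IDs that share a bucket with the
// query in at least one table.
func (l *classicLSH) candidates(query []bool) []int {
	l.check(query)
	seen := make(map[int]bool)
	var ids []int
	for t := range l.tables {
		for _, id := range l.tables[t][l.key(query, t)] {
			if !seen[id] {
				seen[id] = true
				ids = append(ids, id)
			}
		}
	}
	return ids
}

func main() {
	l := newClassicLSH(2, 2)
	l.put(1, []bool{true, false, true, true})
	l.put(2, []bool{false, false, false, true})
	// The query collides with item 1 in the second table only.
	fmt.Println(l.candidates([]bool{false, true, true, true})) // prints [1]
}
```

An item becomes a candidate as soon as it collides with the query in any one table, which is why adding tables tends to improve recall at the cost of extra storage.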

type CountVectoriser

type CountVectoriser struct {
	// Vocabulary is a map of words to indices that point to the row number representing
	// that word in the term document matrix output from the Transform() and FitTransform()
	// methods.  The Vocabulary map is populated by the Fit() or FitTransform() methods
	// based upon the words occurring in the datasets supplied to those methods.  Within
	// Transform(), any words found in the test data set that were not present in the
	// training data set supplied to Fit() will not have an entry in the Vocabulary
	// and will be ignored.
	Vocabulary map[string]int

	// Tokeniser is used to tokenise input text into features.
	Tokeniser Tokeniser
}

CountVectoriser can be used to encode one or more text documents into a term document matrix where each column represents a document within the corpus and each row represents a term present in the training data set. Each element represents the frequency with which the corresponding term appears in the corresponding document e.g. tf(t, d) = 5 would mean that term t (perhaps the word "dog") appears 5 times in the document d.

func NewCountVectoriser

func NewCountVectoriser(stopWords ...string) *CountVectoriser

NewCountVectoriser creates a new CountVectoriser. stopWords is a potentially empty slice of words to be removed from the corpus.

func (*CountVectoriser) Fit

func (v *CountVectoriser) Fit(train ...string) Vectoriser

Fit processes the supplied training data (a variable number of strings representing documents). Each word appearing inside the training data will be added to the Vocabulary. The Fit() method is intended to be called once to train the model in a batch context. Calling the Fit() method a second time has the effect of re-training the model from scratch (discarding the previously learnt vocabulary).

func (*CountVectoriser) FitTransform

func (v *CountVectoriser) FitTransform(docs ...string) (mat.Matrix, error)

FitTransform is exactly equivalent to calling Fit() followed by Transform() on the same documents. This is a convenience where separate training data is not being used to fit the model i.e. the model is fitted on the fly to the test data. The returned matrix is a sparse matrix type.

func (*CountVectoriser) Transform

func (v *CountVectoriser) Transform(docs ...string) (mat.Matrix, error)

Transform transforms the supplied documents into a term document matrix where each column is a feature vector representing one of the supplied documents. Each element represents the frequency with which the associated term for that row occurred within that document. The returned matrix is a sparse matrix type.
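
The interplay between Fit() and Transform() described above can be sketched with the standard library alone. This is a simplified stand-in and not the package's code: the type countVectoriser, lower-casing, whitespace tokenisation and the dense [][]float64 result are assumptions (the package uses a Tokeniser and returns a sparse matrix).

```go
package main

import (
	"fmt"
	"strings"
)

// countVectoriser is a simplified stand-in for a vocabulary-based vectoriser.
type countVectoriser struct {
	vocabulary map[string]int // word -> row index
}

// fit learns the vocabulary from scratch, discarding any previous one.
func (v *countVectoriser) fit(train ...string) {
	v.vocabulary = make(map[string]int)
	for _, doc := range train {
		for _, w := range strings.Fields(strings.ToLower(doc)) {
			if _, ok := v.vocabulary[w]; !ok {
				v.vocabulary[w] = len(v.vocabulary)
			}
		}
	}
}

// transform returns a term document matrix (rows are vocabulary words,
// columns are docs). Words missing from the vocabulary are ignored.
func (v *countVectoriser) transform(docs ...string) [][]float64 {
	m := make([][]float64, len(v.vocabulary))
	for i := range m {
		m[i] = make([]float64, len(docs))
	}
	for j, doc := range docs {
		for _, w := range strings.Fields(strings.ToLower(doc)) {
			if i, ok := v.vocabulary[w]; ok {
				m[i][j]++
			}
		}
	}
	return m
}

func main() {
	v := &countVectoriser{}
	v.fit("the cow jumped over the moon")
	// "dish" and "spoon" were never seen during fit, so they are dropped.
	fmt.Println(v.transform("the dish jumped over the spoon")) // prints [[2] [0] [1] [1] [0]]
}
```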

type Hasher

type Hasher interface {
	// Hash hashes the input vector into a BinaryVector hash representation
	Hash(mat.Vector) *sparse.BinaryVec
}

Hasher interface represents a Locality Sensitive Hashing algorithm whereby the proximity of data points is preserved in the hash space i.e. similar data points will be hashed to values close together in the hash space.

type HashingVectoriser

type HashingVectoriser struct {
	NumFeatures int
	Tokeniser   Tokeniser
}

HashingVectoriser can be used to encode one or more text documents into a term document matrix where each column represents a document within the corpus and each row represents a term. Each element represents the frequency with which the corresponding term appears in the corresponding document e.g. tf(t, d) = 5 would mean that term t (perhaps the word "dog") appears 5 times in the document d.

func NewHashingVectoriser

func NewHashingVectoriser(numFeatures int, stopWords ...string) *HashingVectoriser

NewHashingVectoriser creates a new HashingVectoriser. If stopWords is not an empty slice then English stop words will be removed. numFeatures specifies the number of features that should be present in produced vectors. Each word in a document is hashed and the hash modulo numFeatures gives the row in the matrix corresponding to that word.
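
The mapping from a word to a row can be sketched as follows. The helper names row and hashingVectorise are hypothetical, and the use of FNV-1a from the standard library is an assumption: the documentation above does not say which hash function the package uses.

```go
package main

import (
	"fmt"
	"hash/fnv"
	"strings"
)

// row maps a word to a row index in [0, numFeatures) by hashing it.
// FNV-1a is an assumption; the package may use a different hash function.
func row(word string, numFeatures int) int {
	h := fnv.New32a()
	h.Write([]byte(word)) // Write on a hash.Hash never returns an error
	return int(h.Sum32() % uint32(numFeatures))
}

// hashingVectorise returns a numFeatures x len(docs) matrix of term counts.
// No vocabulary is stored, so distinct words may collide in the same row.
func hashingVectorise(numFeatures int, docs ...string) [][]float64 {
	m := make([][]float64, numFeatures)
	for i := range m {
		m[i] = make([]float64, len(docs))
	}
	for j, doc := range docs {
		for _, w := range strings.Fields(strings.ToLower(doc)) {
			m[row(w, numFeatures)][j]++
		}
	}
	return m
}

func main() {
	m := hashingVectorise(8, "the cow jumped over the moon")
	fmt.Println(m)
}
```

A larger numFeatures reduces the chance of collisions between unrelated words at the cost of taller (though still sparse) feature vectors.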

func (*HashingVectoriser) Fit

func (v *HashingVectoriser) Fit(train ...string) Vectoriser

Fit does nothing for a HashingVectoriser. As the HashingVectoriser vectorises features based on their hash, it does not require a pre-determined vocabulary to map features to their correct row in the vector. It is effectively stateless and does not require fitting to training data. The method is included for compatibility with other vectorisers.

func (*HashingVectoriser) FitTransform

func (v *HashingVectoriser) FitTransform(docs ...string) (mat.Matrix, error)

FitTransform for a HashingVectoriser is exactly equivalent to calling Transform() with the same documents. For most vectorisers, Fit() must be called prior to Transform() and so FitTransform() is a convenience where separate training data is not used to fit the model. A HashingVectoriser does not require fitting, so, as with Fit(), this method is included only for compatibility with other vectorisers. The returned matrix is a sparse matrix type.

func (*HashingVectoriser) PartialFit

func (v *HashingVectoriser) PartialFit(train ...string) Vectoriser

PartialFit does nothing for a HashingVectoriser. As the HashingVectoriser vectorises features based on their hash, it does not require a pre-learnt vocabulary to map features to the correct row in the feature vector. This method is included for compatibility with other vectorisers.

func (*HashingVectoriser) Transform

func (v *HashingVectoriser) Transform(docs ...string) (mat.Matrix, error)

Transform transforms the supplied documents into a term document matrix where each column is a feature vector representing one of the supplied documents. Each element represents the frequency with which the associated term for that row occurred within that document. The returned matrix is a sparse matrix type.

type Indexer

type Indexer interface {
	Index(v mat.Vector, id interface{})
	Search(q mat.Vector, k int) []Match
	Remove(ids interface{})
}

Indexer indexes vectors to support Nearest Neighbour (NN) similarity searches across the indexed vectors.

type LSHForest

type LSHForest struct {
	// contains filtered or unexported fields
}

LSHForest is an implementation of the LSH Forest Locality Sensitive Hashing scheme based on the work of M. Bawa et al.

M. Bawa, T. Condie, and P. Ganesan, “LSH forest: self-tuning indexes for similarity search,” Proc. 14th Int. Conf. World Wide Web - WWW ’05, p. 651, 2005. http://dl.acm.org/citation.cfm?id=1060745.1060840

func NewLSHForest

func NewLSHForest(functions int, tables int) *LSHForest

NewLSHForest creates a new LSHForest Locality Sensitive Hashing scheme with the specified number of hash tables and hash functions per table.

func (*LSHForest) GetCandidates

func (l *LSHForest) GetCandidates(query *sparse.BinaryVec, k int) []interface{}

GetCandidates returns the IDs of candidate nearest neighbours. It is up to the calling code to further filter these candidates based on distance to arrive at the top-k approximate nearest neighbours. The number of candidates returned may be smaller or larger than k.

func (*LSHForest) Put

func (l *LSHForest) Put(id interface{}, signature *sparse.BinaryVec)

Put stores the specified LSH signature and associated ID in the LSH index.

func (*LSHForest) Remove

func (l *LSHForest) Remove(id interface{})

Remove removes the specified item from the LSH index.

type LSHIndex

type LSHIndex struct {
	// contains filtered or unexported fields
}

LSHIndex is an LSH (Locality Sensitive Hashing) based index supporting Approximate Nearest Neighbour (ANN) search in O(log n). The storage required by the index will depend upon the underlying LSH scheme used but will typically be higher than O(n). In use cases where accurate Nearest Neighbour search is required other types of index should be considered like LinearScanIndex.

func NewLSHIndex

func NewLSHIndex(approx bool, hasher Hasher, store LSHScheme, distance pairwise.Comparer) *LSHIndex

NewLSHIndex creates a new LSHIndex. When queried, the initial candidate nearest neighbours returned by the underlying LSH indexing algorithm are further filtered by comparing distances to the query vector using the supplied distance metric. If approx is true, the filtering comparison is performed on the hashes and if approx is false, then the comparison is performed on the original vectors instead. This will have time and storage implications as comparing the original vectors will be more accurate but slower and require the original vectors be stored for the comparison. The LSH algorithm and underlying LSH indexing algorithm may both be specified as hasher and store parameters respectively.

func (*LSHIndex) Index

func (l *LSHIndex) Index(v mat.Vector, id interface{})

Index indexes the supplied vector along with its associated ID.

func (*LSHIndex) Remove

func (l *LSHIndex) Remove(id interface{})

Remove removes the vector with the specified id from the index. If no vector is found with the specified id the method will simply do nothing.

func (*LSHIndex) Search

func (l *LSHIndex) Search(q mat.Vector, k int) []Match

Search searches for the top-k approximate nearest neighbours in the index. The method returns up to the top-k most similar items in unsorted order. The method may return fewer than k items if fewer than k neighbours are found.
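
The filtering stage that turns LSH candidates into at most k matches can be sketched as below. This is an illustration only: the match struct, the topK and euclidean helpers, the int IDs and the map of stored vectors are hypothetical, Euclidean distance stands in for whatever pairwise.Comparer is supplied, and the sketch sorts its results for simplicity whereas Search is documented as returning items in unsorted order.

```go
package main

import (
	"fmt"
	"math"
	"sort"
)

// match pairs a candidate ID with its distance to the query (hypothetical type).
type match struct {
	id       int
	distance float64
}

// euclidean returns the Euclidean distance between equal-length vectors.
func euclidean(a, b []float64) float64 {
	var sum float64
	for i := range a {
		d := a[i] - b[i]
		sum += d * d
	}
	return math.Sqrt(sum)
}

// topK ranks the candidate IDs by distance to query and returns at most k.
// It assumes every candidate ID has a stored vector of the query's length.
func topK(query []float64, candidates []int, vectors map[int][]float64, k int) []match {
	matches := make([]match, 0, len(candidates))
	for _, id := range candidates {
		matches = append(matches, match{id: id, distance: euclidean(query, vectors[id])})
	}
	sort.Slice(matches, func(i, j int) bool {
		return matches[i].distance < matches[j].distance
	})
	if len(matches) > k {
		matches = matches[:k]
	}
	return matches
}

func main() {
	vectors := map[int][]float64{
		1: {0, 0},
		2: {1, 1},
		3: {5, 5},
	}
	// Pretend the LSH scheme returned all three IDs as candidates.
	for _, m := range topK([]float64{0.5, 0}, []int{1, 2, 3}, vectors, 2) {
		fmt.Printf("id=%d distance=%.3f\n", m.id, m.distance)
	}
}
```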

type LSHScheme

type LSHScheme interface {
	// Put stores the specified LSH signature and associated ID in the LSH index
	Put(id interface{}, signature *sparse.BinaryVec)

	// GetCandidates returns the IDs of candidate nearest neighbours.  It is up to
	// the calling code to further filter these candidates based on distance to arrive
	// at the top-k approximate nearest neighbours.  The number of candidates returned
	// may be smaller or larger than k.
	GetCandidates(query *sparse.BinaryVec, k int) []interface{}

	// Remove removes the specified item from the LSH index
	Remove(id interface{})
}

LSHScheme interface represents LSH indexing schemes to support Approximate Nearest Neighbour (ANN) search.

type LatentDirichletAllocation

type LatentDirichletAllocation struct {
	// Iterations is the maximum number of training iterations
	Iterations int

	// PerplexityTolerance is the tolerance of perplexity below which the Fit method will stop iterating
	// and complete.  If the evaluated perplexity is below the tolerance, fitting will terminate successfully
	// without necessarily completing all of the configured number of training iterations.
	PerplexityTolerance float64

	// PerplexityEvaluationFrequency is the frequency with which to test Perplexity against PerplexityTolerance inside
	// Fit.  A value <= 0 will not evaluate Perplexity at all and simply iterate for `Iterations` iterations.
	PerplexityEvaluationFrequency int

	// BatchSize is the size of mini batches used during training
	BatchSize int

	// K is the number of topics
	K int

	// BurnInPasses is the number of `burn-in` passes across the documents in the
	// training data to learn the document statistics before we start collecting topic statistics.
	BurnInPasses int

	// TransformationPasses is the number of passes to transform new documents given a previously
	// fitted topic model
	TransformationPasses int

	// MeanChangeTolerance is the tolerance of change to Theta between burn in passes.
	// If the level of change between passes is below the tolerance, the burn in will complete
	// without necessarily completing the configured number of passes.
	MeanChangeTolerance float64

	// ChangeEvaluationFrequency is the frequency with which to test the mean change against
	// MeanChangeTolerance during burn-in and transformation.  A value <= 0 will not evaluate
	// the mean change at all and simply iterate for `BurnInPasses` iterations.
	ChangeEvaluationFrequency int

	// Alpha is the prior of theta (the documents over topics distribution)
	Alpha float64

	// Eta is the prior of phi (the topics over words distribution)
	Eta float64

	// RhoPhi is the learning rate for phi (the topics over words distribution)
	RhoPhi LearningSchedule

	// RhoTheta is the learning rate for theta (the documents over topics distribution)
	RhoTheta LearningSchedule

	// Rnd is the random number generator used to generate the initial distributions
	// for nTheta (the document over topic distribution), nPhi (the topic over word
	// distribution) and nZ (the topic assignments).
	Rnd *rand.Rand

	// Processes is the degree of parallelisation, or more specifically, the number of
	// concurrent go routines to use during fitting.
	Processes int
	// contains filtered or unexported fields
}

LatentDirichletAllocation (LDA) for fast unsupervised topic extraction. LDA processes documents and learns their latent topic model estimating the posterior document over topic probability distribution (the probabilities of each document being allocated to each topic) and the posterior topic over word probability distribution.

This transformer uses a parallel implementation of the SCVB0 (Stochastic Collapsed Variational Bayes) Algorithm (https://arxiv.org/pdf/1305.2452.pdf) by Jimmy Foulds with optional `clumping` optimisations.

Example
package main

import (
	"fmt"

	"github.com/james-bowman/nlp"
)

var stopWords = []string{"a", "about", "above", "above", "across", "after", "afterwards", "again", "against", "all", "almost", "alone", "along", "already", "also", "although", "always", "am", "among", "amongst", "amoungst", "amount", "an", "and", "another", "any", "anyhow", "anyone", "anything", "anyway", "anywhere", "are", "around", "as", "at", "back", "be", "became", "because", "become", "becomes", "becoming", "been", "before", "beforehand", "behind", "being", "below", "beside", "besides", "between", "beyond", "bill", "both", "bottom", "but", "by", "call", "can", "cannot", "cant", "co", "con", "could", "couldnt", "cry", "de", "describe", "detail", "do", "done", "down", "due", "during", "each", "eg", "eight", "either", "eleven", "else", "elsewhere", "empty", "enough", "etc", "even", "ever", "every", "everyone", "everything", "everywhere", "except", "few", "fifteen", "fify", "fill", "find", "fire", "first", "five", "for", "former", "formerly", "forty", "found", "four", "from", "front", "full", "further", "get", "give", "go", "had", "has", "hasnt", "have", "he", "hence", "her", "here", "hereafter", "hereby", "herein", "hereupon", "hers", "herself", "him", "himself", "his", "how", "however", "hundred", "ie", "if", "in", "inc", "indeed", "interest", "into", "is", "it", "its", "itself", "keep", "last", "latter", "latterly", "least", "less", "ltd", "made", "many", "may", "me", "meanwhile", "might", "mill", "mine", "more", "moreover", "most", "mostly", "move", "much", "must", "my", "myself", "name", "namely", "neither", "never", "nevertheless", "next", "nine", "no", "nobody", "none", "noone", "nor", "not", "nothing", "now", "nowhere", "of", "off", "often", "on", "once", "one", "only", "onto", "or", "other", "others", "otherwise", "our", "ours", "ourselves", "out", "over", "own", "part", "per", "perhaps", "please", "put", "rather", "re", "same", "see", "seem", "seemed", "seeming", "seems", "serious", "several", "she", "should", "show", "side", "since", "sincere", "six", 
"sixty", "so", "some", "somehow", "someone", "something", "sometime", "sometimes", "somewhere", "still", "such", "system", "take", "ten", "than", "that", "the", "their", "them", "themselves", "then", "thence", "there", "thereafter", "thereby", "therefore", "therein", "thereupon", "these", "they", "thickv", "thin", "third", "this", "those", "though", "three", "through", "throughout", "thru", "thus", "to", "together", "too", "top", "toward", "towards", "twelve", "twenty", "two", "un", "under", "until", "up", "upon", "us", "very", "via", "was", "we", "well", "were", "what", "whatever", "when", "whence", "whenever", "where", "whereafter", "whereas", "whereby", "wherein", "whereupon", "wherever", "whether", "which", "while", "whither", "who", "whoever", "whole", "whom", "whose", "why", "will", "with", "within", "without", "would", "yet", "you", "your", "yours", "yourself", "yourselves"}

func main() {
	corpus := []string{
		"The quick brown fox jumped over the lazy dog",
		"The cow jumped over the moon",
		"The little dog laughed to see such fun",
	}

	// Create a pipeline with a count vectoriser and LDA transformer for 2 topics
	vectoriser := nlp.NewCountVectoriser(stopWords...)
	lda := nlp.NewLatentDirichletAllocation(2)
	pipeline := nlp.NewPipeline(vectoriser, lda)

	docsOverTopics, err := pipeline.FitTransform(corpus...)
	if err != nil {
		fmt.Printf("Failed to model topics for documents because %v", err)
		return
	}

	// Examine Document over topic probability distribution
	dr, dc := docsOverTopics.Dims()
	for doc := 0; doc < dc; doc++ {
		fmt.Printf("\nTopic distribution for document: '%s' -", corpus[doc])
		for topic := 0; topic < dr; topic++ {
			if topic > 0 {
				fmt.Printf(",")
			}
			fmt.Printf(" Topic #%d=%f", topic, docsOverTopics.At(topic, doc))
		}
	}

	// Examine Topic over word probability distribution
	topicsOverWords := lda.Components()
	tr, tc := topicsOverWords.Dims()

	vocab := make([]string, len(vectoriser.Vocabulary))
	for k, v := range vectoriser.Vocabulary {
		vocab[v] = k
	}
	for topic := 0; topic < tr; topic++ {
		fmt.Printf("\nWord distribution for Topic #%d -", topic)
		for word := 0; word < tc; word++ {
			if word > 0 {
				fmt.Printf(",")
			}
			fmt.Printf(" '%s'=%f", vocab[word], topicsOverWords.At(topic, word))
		}
	}
}
Output:

func NewLatentDirichletAllocation

func NewLatentDirichletAllocation(k int) *LatentDirichletAllocation

NewLatentDirichletAllocation returns a new LatentDirichletAllocation type initialised with default values for k topics.

func (*LatentDirichletAllocation) Components

func (l *LatentDirichletAllocation) Components() mat.Matrix

Components returns the topic over words probability distribution. The returned matrix is of dimensions K x W where K is the number of topics and W is the number of rows in the training matrix. Each column of the returned matrix represents a unique word in the vocabulary.

func (*LatentDirichletAllocation) Fit

Fit fits the model to the specified matrix m. The latent topics, and probability distribution of topics over words, are learnt and stored to be used for future transformations and analysis.

func (*LatentDirichletAllocation) FitTransform

func (l *LatentDirichletAllocation) FitTransform(m mat.Matrix) (mat.Matrix, error)

FitTransform is approximately equivalent to calling Fit() followed by Transform() on the same matrix. This is a useful shortcut where separate training data is not being used to fit the model i.e. the model is fitted on the fly to the test data. The returned matrix contains the document over topic distributions where each element is the probability of the corresponding document being related to the corresponding topic. The returned matrix is a Dense matrix of shape K x C where K is the number of topics and C is the number of columns in the input matrix (representing the documents).

func (*LatentDirichletAllocation) Perplexity

func (l *LatentDirichletAllocation) Perplexity(m mat.Matrix) float64

Perplexity calculates the perplexity of the matrix m against the trained model. m is first transformed into corresponding posterior estimates for document over topic distributions and then used to calculate the perplexity.

func (*LatentDirichletAllocation) Transform

func (l *LatentDirichletAllocation) Transform(m mat.Matrix) (mat.Matrix, error)

Transform transforms the input matrix into a matrix representing the distribution of the documents over topics. The returned matrix contains the document over topic distributions where each element is the probability of the corresponding document being related to the corresponding topic. The returned matrix is a Dense matrix of shape K x C where K is the number of topics and C is the number of columns in the input matrix (representing the documents).
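
To make the K x C layout concrete, the following sketch picks the most probable topic for each document from such a matrix. It uses a plain slice of slices in place of a mat.Matrix, and the function name dominantTopics and the sample values are hypothetical, chosen only for illustration.

```go
package main

import "fmt"

// dominantTopics returns, for each document (column), the index of the
// topic (row) with the highest probability in a K x C matrix.
// Ties are resolved in favour of the lowest topic index.
func dominantTopics(docsOverTopics [][]float64) []int {
	if len(docsOverTopics) == 0 {
		return nil
	}
	best := make([]int, len(docsOverTopics[0]))
	for doc := range best {
		for topic := range docsOverTopics {
			if docsOverTopics[topic][doc] > docsOverTopics[best[doc]][doc] {
				best[doc] = topic
			}
		}
	}
	return best
}

func main() {
	// Hypothetical distributions for 2 topics (rows) over 3 documents (columns).
	m := [][]float64{
		{0.9, 0.2, 0.5},
		{0.1, 0.8, 0.5},
	}
	fmt.Println(dominantTopics(m)) // prints [0 1 0]
}
```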

type LearningSchedule

type LearningSchedule struct {
	// S is the scale of the step size for the learning rate.
	S float64

	// Tau is the learning offset. The learning offset downweights the
	// learning rate from early iterations.
	Tau float64

	// Kappa controls the learning decay.  This is the amount the learning rate
	// reduces each iteration.  This is typically a value between 0.5 and 1.0.
	Kappa float64
}

LearningSchedule is used to calculate the learning rate for each iteration using a natural gradient descent algorithm.

func (LearningSchedule) Calc

func (l LearningSchedule) Calc(iteration float64) float64

Calc returns the learning rate for the specified iteration

type LinearScanIndex

type LinearScanIndex struct {
	// contains filtered or unexported fields
}

LinearScanIndex supports Nearest Neighbour (NN) similarity searches across indexed vectors performing queries in O(n) and requiring O(n) storage. As the name implies, LinearScanIndex performs a linear scan across all indexed vectors comparing them each in turn with the specified query vector using the configured pairwise distance metric. LinearScanIndex is accurate and will always return the true top-k nearest neighbours as opposed to some other types of index, like LSHIndex, which perform Approximate Nearest Neighbour (ANN) searches and trade some recall accuracy for performance over large scale datasets.

func NewLinearScanIndex

func NewLinearScanIndex(compareFN pairwise.Comparer) *LinearScanIndex

NewLinearScanIndex constructs a new empty LinearScanIndex which will use the specified pairwise distance metric to determine nearest neighbours based on similarity.

func (*LinearScanIndex) Index

func (b *LinearScanIndex) Index(v mat.Vector, id interface{})

Index adds the specified vector v with associated id to the index.

func (*LinearScanIndex) Remove

func (b *LinearScanIndex) Remove(id interface{})

Remove removes the vector with the specified id from the index. If no vector is found with the specified id the method will simply do nothing.

func (*LinearScanIndex) Search

func (b *LinearScanIndex) Search(qv mat.Vector, k int) []Match

Search searches for the top-k nearest neighbours in the index. The method returns up to the top-k most similar items in unsorted order. The method may return fewer than k items if fewer than k neighbours are found.

type Match

type Match struct {
	Distance float64
	ID       interface{}
}

Match represents a matching item for nearest neighbour similarity searches. It contains both the ID of the matching item and the distance from the queried item. The distance is represented as a score from 0 (exact match) to 1 (orthogonal) depending upon the metric used.

type OnlineTransformer

type OnlineTransformer interface {
	Transformer
	PartialFit(mat.Matrix) OnlineTransformer
}

OnlineTransformer is an extension to the Transformer interface that supports online (streaming/mini-batch) training as opposed to just batch.

type OnlineVectoriser

type OnlineVectoriser interface {
	Vectoriser
	PartialFit(...string) OnlineVectoriser
}

OnlineVectoriser is an extension to the Vectoriser interface that supports online (streaming/mini-batch) training as opposed to just batch.

type PCA

type PCA struct {
	// K is the number of components
	K int
	// contains filtered or unexported fields
}

PCA calculates the principal components of a matrix, or the axes of greatest variance, and then projects matrices onto those axes. See https://en.wikipedia.org/wiki/Principal_component_analysis for further details.

func NewPCA

func NewPCA(k int) *PCA

NewPCA constructs a new Principal Component Analysis transformer to reduce the dimensionality, projecting matrices onto the axes of greatest variance.

func (*PCA) ExplainedVariance

func (p *PCA) ExplainedVariance() []float64

ExplainedVariance returns a slice of float64 values representing the variances of the principal component scores.

func (*PCA) Fit

func (p *PCA) Fit(m mat.Matrix) Transformer

Fit calculates the principal component directions (axes of greatest variance) within the training data which can then be used to project matrices onto those principal components using the Transform() method.

func (*PCA) FitTransform

func (p *PCA) FitTransform(m mat.Matrix) (mat.Matrix, error)

FitTransform is approximately equivalent to calling Fit() followed by Transform() on the same matrix. This is a useful shortcut where separate training data is not being used to fit the model i.e. the model is fitted on the fly to the test data.

func (*PCA) Transform

func (p *PCA) Transform(m mat.Matrix) (mat.Matrix, error)

Transform projects the matrix onto the first K principal components calculated during training (the Fit() method). The returned matrix will be of reduced dimensionality compared to the input (K x c compared to r x c of the input).

type Pipeline

type Pipeline struct {
	Vectoriser   Vectoriser
	Transformers []Transformer
}

Pipeline is a mechanism for composing processing pipelines out of a vectoriser and transformation steps. For example, to compose a classic LSA/LSI pipeline (vectorisation -> TFIDF transformation -> Truncated SVD) one could use a Pipeline as follows:

lsaPipeline := NewPipeline(NewCountVectoriser(), NewTfidfTransformer(), NewTruncatedSVD(100))

func NewPipeline

func NewPipeline(vectoriser Vectoriser, transformers ...Transformer) *Pipeline

NewPipeline constructs a new processing pipeline with the supplied Vectoriser and one or more transformers.

func (*Pipeline) Fit

func (p *Pipeline) Fit(docs ...string) Vectoriser

Fit fits the model(s) to the supplied training data

func (*Pipeline) FitTransform

func (p *Pipeline) FitTransform(docs ...string) (mat.Matrix, error)

FitTransform transforms the supplied documents into a matrix representation of numerical feature vectors fitting the model to the supplied data in the process.

func (*Pipeline) Transform

func (p *Pipeline) Transform(docs ...string) (mat.Matrix, error)

Transform transforms the supplied documents into a matrix representation of numerical feature vectors using a model(s) previously fitted to supplied training data.

type RRIBasis

type RRIBasis int

RRIBasis represents the initial basis for the index/elemental vectors used for Reflective Random Indexing.

const (
	// DocBasedRRI represents columns (documents/contexts in a term-document
	// matrix) forming the initial basis for index/elemental vectors in Random Indexing
	DocBasedRRI RRIBasis = iota

	// TermBasedRRI indicates rows (terms in a term-document matrix)
	// form the initial basis for index/elemental vectors in Reflective Random Indexing.
	TermBasedRRI
)

type RandomIndexing

type RandomIndexing struct {
	// K specifies the number of dimensions for the semantic space
	K int

	// Density specifies the proportion of non-zero elements in the
	// elemental vectors
	Density float64

	// Type specifies the initial basis for the elemental vectors
	// i.e. whether they initially represent the rows or columns
	// This is only relevant for Reflective Random Indexing
	Type RRIBasis

	// Reflections specifies the number of reflective training cycles
	// to run during fitting for RRI (Reflective Random Indexing). For
	// Random Indexing (non-reflective) this is 0.
	Reflections int
	// contains filtered or unexported fields
}

RandomIndexing is a method of dimensionality reduction used for Latent Semantic Analysis in a similar way to TruncatedSVD and PCA. Random Indexing is designed to solve limitations of very high dimensional vector space model implementations for modelling term co-occurrence in language processing such as SVD typically used for LSA/LSI (Latent Semantic Analysis/Latent Semantic Indexing). In implementation it bears some similarity to other random projection techniques such as those implemented in RandomProjection and SignRandomProjection within this package. The RandomIndexing type can also be used to perform Reflective Random Indexing which extends the Random Indexing model with additional training cycles to better support indirect inference i.e. find synonyms where the words do not appear together in documents.

func NewRandomIndexing

func NewRandomIndexing(k int, density float64) *RandomIndexing

NewRandomIndexing returns a new RandomIndexing transformer configured to transform term document matrices into k dimensional space. The density parameter specifies the density of the index/elemental vectors used to project the input matrix into lower dimensional space i.e. the proportion of elements that are non-zero.

func NewReflectiveRandomIndexing

func NewReflectiveRandomIndexing(k int, basis RRIBasis, reflections int, density float64) *RandomIndexing

NewReflectiveRandomIndexing returns a new RandomIndexing type configured for Reflective Random Indexing. Reflective Random Indexing applies additional (reflective) training cycles on top of Random Indexing to capture indirect inferences (synonyms), i.e. similarity between terms that do not directly co-occur within the same context/document. basis specifies the basis for the reflective random indexing i.e. whether the initial, random index/elemental vectors should represent documents (columns) or terms (rows). reflections is the number of additional training cycles to apply to build the elemental vectors. Specifying basis == DocBasedRRI and reflections == 0 is equivalent to conventional Random Indexing.

func (*RandomIndexing) Components

func (r *RandomIndexing) Components() mat.Matrix

Components returns a t x k matrix where `t` is the number of terms (rows) in the training data matrix. The rows in this matrix are the `context` vectors for RI each one representing a semantic representation of a term based upon the contexts in which it has appeared within the training data.

func (*RandomIndexing) Fit

Fit trains the model, creating random index/elemental vectors to be used to construct the new projected feature vectors ('context' vectors) in the reduced semantic dimensional space. If configured for Reflective Random Indexing then Fit may actually run multiple training cycles as specified during construction. The Fit method trains the model in batch mode so is intended to be called once, for online/streaming or mini-batch training please consider the PartialFit method instead.

func (*RandomIndexing) FitTransform

func (r *RandomIndexing) FitTransform(m mat.Matrix) (mat.Matrix, error)

FitTransform is approximately equivalent to calling Fit() followed by Transform() on the same matrix. This is a useful shortcut where separate training data is not being used to fit the model i.e. the model is fitted on the fly to the test data. The returned matrix is a sparse CSR format matrix of shape k x c.

func (*RandomIndexing) PartialFit

func (r *RandomIndexing) PartialFit(m mat.Matrix) OnlineTransformer

PartialFit extends the model to take account of the specified matrix m. The context vectors are learnt and stored to be used for future transformations and analysis. PartialFit performs Random Indexing even if the Transformer is configured for Reflective Random Indexing so if RRI is required please train using the Fit() method as a batch operation. Unlike the Fit() method, the PartialFit() method is designed to be called multiple times to support online and mini-batch learning whereas the Fit() method is only intended to be called once for batch learning.

func (*RandomIndexing) SetComponents

func (r *RandomIndexing) SetComponents(m mat.Matrix)

SetComponents sets a t x k matrix where `t` is the number of terms (rows) in the training data matrix.

func (*RandomIndexing) Transform

func (r *RandomIndexing) Transform(m mat.Matrix) (mat.Matrix, error)

Transform applies the transform, projecting matrix m into the lower dimensional semantic space. The output matrix will be of shape k x c and will be a sparse CSR format matrix. The transformation for each document vector is simply the accumulation of all trained context vectors relating to terms appearing in the document. These are weighted by the frequency the term appears in the document.
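
The accumulation step described above can be sketched as follows, using plain slices rather than the sparse matrix types the package works with. The project function and the sample context vectors are hypothetical and only illustrate the frequency-weighted sum.

```go
package main

import "fmt"

// project builds a k-dimensional document vector by summing the context
// vectors of the terms in the document, each weighted by the term's
// frequency. contexts is t x k (one row per term); doc holds t term
// frequencies for a single document.
func project(contexts [][]float64, doc []float64) []float64 {
	if len(contexts) == 0 {
		return nil
	}
	out := make([]float64, len(contexts[0]))
	for term, freq := range doc {
		if freq == 0 {
			continue // the term does not appear in this document
		}
		for d, c := range contexts[term] {
			out[d] += freq * c
		}
	}
	return out
}

func main() {
	// Hypothetical context vectors for 3 terms in a 2-dimensional space.
	contexts := [][]float64{
		{1, 0},
		{0, 1},
		{1, -1},
	}
	doc := []float64{2, 0, 1} // term 0 appears twice, term 2 once
	fmt.Println(project(contexts, doc)) // prints [3 -1]
}
```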

type RandomProjection

type RandomProjection struct {
	K       int
	Density float64
	// contains filtered or unexported fields
}

RandomProjection is a method of dimensionality reduction based upon the Johnson–Lindenstrauss lemma stating that a small set of points in a high-dimensional space can be embedded into a space of much lower dimension in such a way that distances between the points are nearly preserved.

The technique projects the original matrix orthogonally onto a random subspace, transforming the elements of the original matrix into a lower dimensional representation. Computing orthogonal matrices is expensive and so this technique uses specially generated random matrices (hence the name) following the principle that in high dimensional spaces, there are lots of nearly orthogonal matrices.

func NewRandomProjection

func NewRandomProjection(k int, density float64) *RandomProjection

NewRandomProjection creates and returns a new RandomProjection transformer. The RandomProjection will use a specially generated random matrix of the specified density and dimensionality k to perform the transform to k dimensional space.

func (*RandomProjection) Fit

Fit creates the random (almost) orthogonal matrix used to project input matrices into the new reduced dimensional subspace.

func (*RandomProjection) FitTransform

func (r *RandomProjection) FitTransform(m mat.Matrix) (mat.Matrix, error)

FitTransform is approximately equivalent to calling Fit() followed by Transform() on the same matrix. This is a useful shortcut where separate training data is not being used to fit the model i.e. the model is fitted on the fly to the test data. The returned matrix is a sparse CSR format matrix of shape k x c.

func (*RandomProjection) Transform

func (r *RandomProjection) Transform(m mat.Matrix) (mat.Matrix, error)

Transform applies the transformation, projecting the input matrix into the reduced dimensional subspace. The transformed matrix will be a sparse CSR format matrix of shape k x c.

type RegExpTokeniser

type RegExpTokeniser struct {
	RegExp    *regexp.Regexp
	StopWords map[string]bool
}

RegExpTokeniser implements the Tokeniser interface using a basic RegExp pattern for unigram word tokenisation, with optional stop word removal.

func (*RegExpTokeniser) ForEachIn

func (t *RegExpTokeniser) ForEachIn(text string, f func(token string))

ForEachIn iterates over each token within text and invokes function f with the token as parameter. If StopWords is not nil then any tokens from text present in StopWords will be ignored.

func (*RegExpTokeniser) Tokenise

func (t *RegExpTokeniser) Tokenise(text string) []string

Tokenise returns a slice of all the tokens contained in string text. If StopWords is not nil then any tokens from text present in StopWords will be removed from the slice.

type SignRandomProjection

type SignRandomProjection struct {
	// Bits represents the number of bits the output vectors should
	// be in length and hence the number of random hyperplanes needed
	// for the transformation
	Bits int
	// contains filtered or unexported fields
}

SignRandomProjection represents a transform of a matrix into a lower dimensional space. Sign Random Projection is a method of Locality Sensitive Hashing (LSH) sometimes referred to as the random hyperplane method. A set of random hyperplanes is created in the original dimensional space and then input matrices are expressed relative to the random hyperplanes as follows:

For each column vector in the input matrix, construct a corresponding output
bit vector with each bit (i) calculated as follows:
	if dot(vector, hyperplane[i]) > 0
		bit[i] = 1
	else
		bit[i] = 0

Whilst similar to other methods of random projection this method is unique in that it uses only a single bit in the output matrix to represent the sign of the result of the comparison (Dot product) with each hyperplane so encodes vector representations with very low memory and processor requirements whilst preserving relative distance between vectors from the original space. Hamming similarity (and distance) between the transformed vectors in the subspace can approximate Angular similarity (and distance) (which is strongly related to Cosine similarity) of the associated vectors from the original space.

func NewSignRandomProjection

func NewSignRandomProjection(bits int) *SignRandomProjection

NewSignRandomProjection constructs a new SignRandomProjection transformer to reduce the dimensionality. The transformer uses a number of random hyperplanes specified by `bits`, which is also the dimensionality of the output, transformed matrices.

func (*SignRandomProjection) Fit

Fit creates the random hyperplanes from the input training data matrix, mat and stores the hyperplanes as a transform to apply to matrices.

func (*SignRandomProjection) FitTransform

func (s *SignRandomProjection) FitTransform(m mat.Matrix) (mat.Matrix, error)

FitTransform is approximately equivalent to calling Fit() followed by Transform() on the same matrix. This is a useful shortcut where separate training data is not being used to fit the model i.e. the model is fitted on the fly to the test data. The returned matrix is a Binary matrix or BinaryVec type depending upon whether m is Matrix or Vector.

func (*SignRandomProjection) Transform

func (s *SignRandomProjection) Transform(m mat.Matrix) (mat.Matrix, error)

Transform applies the transform decomposed from the training data matrix in Fit() to the input matrix. The columns in the resulting output matrix will be a low dimensional binary representation of the columns within the original i.e. a hash or fingerprint that can be quickly and efficiently compared with other similar vectors. Hamming similarity in the new dimensional space can be used to approximate Cosine similarity between the vectors of the original space. The returned matrix is a Binary matrix or BinaryVec type depending upon whether m is Matrix or Vector.

type SimHash

type SimHash struct {
	// contains filtered or unexported fields
}

SimHash implements the SimHash Locality Sensitive Hashing (LSH) algorithm for angular distance using sign random projections based on the work of Moses S. Charikar. The distance between the original vectors is preserved through the hashing process such that hashed vectors can be compared using Hamming Similarity for a faster, more space efficient, approximation of Cosine Similarity for the original vectors.

Charikar, Moses S. "Similarity Estimation Techniques from Rounding Algorithms" in Proceedings of the thirty-fourth annual ACM symposium on Theory of computing - STOC ’02, 2002, p. 380. https://www.cs.princeton.edu/courses/archive/spr04/cos598B/bib/CharikarEstim.pdf

func NewSimHash

func NewSimHash(bits int, dim int) *SimHash

NewSimHash constructs a new SimHash creating a set of locality sensitive hash functions which are combined to accept input vectors of length dim and produce hashed binary vector fingerprints of length bits. This method creates a series of random hyperplanes which are then compared to each input vector to produce the output hashed binary vector encoding the input vector's location in vector space relative to the hyperplanes. Each bit in the output vector corresponds to the sign (1/0 for +/-) of the result of the dot product comparison with each random hyperplane.

func (*SimHash) Hash

func (h *SimHash) Hash(v mat.Vector) *sparse.BinaryVec

Hash accepts a Vector and outputs a BinaryVec (which also implements the Gonum Vector interface). This method will panic if the input vector is of a different length than the dim parameter used when constructing the SimHash.

type TfidfTransformer

type TfidfTransformer struct {
	// contains filtered or unexported fields
}

TfidfTransformer takes a raw term document matrix and weights each raw term frequency value depending upon how commonly it occurs across all documents within the corpus. For example a very commonly occurring word like `the` is likely to occur in all documents and so would be weighted down. More precisely, TfidfTransformer applies a tf-idf algorithm to the matrix where each term frequency is multiplied by the inverse document frequency. Inverse document frequency is calculated as log(n/df) where df is the number of documents in which the term occurs and n is the total number of documents within the corpus. We add 1 to both n and df before division to prevent division by zero.

func NewTfidfTransformer

func NewTfidfTransformer() *TfidfTransformer

NewTfidfTransformer constructs a new TfidfTransformer.

func (*TfidfTransformer) Fit

func (t *TfidfTransformer) Fit(matrix mat.Matrix) Transformer

Fit takes a training term document matrix, counts term occurrences across all documents and constructs an inverse document frequency transform to apply to matrices in subsequent calls to Transform().

func (*TfidfTransformer) FitTransform

func (t *TfidfTransformer) FitTransform(matrix mat.Matrix) (mat.Matrix, error)

FitTransform is exactly equivalent to calling Fit() followed by Transform() on the same matrix. This is a convenience where separate training data is not being used to fit the model i.e. the model is fitted on the fly to the test data. The returned matrix is a sparse matrix type.

func (*TfidfTransformer) Load

func (t *TfidfTransformer) Load(r io.Reader) error

Load binary deserialises the previously serialised model into the receiver. This is useful for loading a previously trained and saved model from another context (e.g. offline training) for use within another context (e.g. production) for reproducible results. Load should only be performed with trusted data.

func (TfidfTransformer) Save

func (t TfidfTransformer) Save(w io.Writer) error

Save binary serialises the model and writes it into w. This is useful for persisting a trained model to disk so that it may be loaded (using the Load() method) in another context (e.g. production) for reproducible results.

func (*TfidfTransformer) Transform

func (t *TfidfTransformer) Transform(matrix mat.Matrix) (mat.Matrix, error)

Transform applies the inverse document frequency (IDF) transform by multiplying each term frequency by its corresponding IDF value. This has the effect of weighting each term frequency according to how often it appears across the whole document corpus so that naturally frequent occurring words are given less weight than uncommon ones. The returned matrix is a sparse matrix type.

type Tokeniser

type Tokeniser interface {
	// ForEachIn iterates over each token within text and invokes function
	// f with the token as parameter
	ForEachIn(text string, f func(token string))

	// Tokenise returns a slice of all the tokens contained in string
	// text
	Tokenise(text string) []string
}

Tokeniser is an interface for tokenisers, allowing substitution of different tokenisation strategies (e.g. Regexp) and also supporting different token types, n-grams and languages.

func NewTokeniser

func NewTokeniser(stopWords ...string) Tokeniser

NewTokeniser returns a new, default Tokeniser implementation. stopWords is a potentially empty string slice that contains the words that should be removed from the corpus. The default RegExpTokeniser will split words by whitespace/tabs: "\t\n\f\r "

type Transformer

type Transformer interface {
	Fit(mat.Matrix) Transformer
	Transform(mat mat.Matrix) (mat.Matrix, error)
	FitTransform(mat mat.Matrix) (mat.Matrix, error)
}

Transformer provides a common interface for transformer steps.

type TruncatedSVD

type TruncatedSVD struct {
	// Components is the truncated term matrix (matrix U of the Singular Value Decomposition
	// (A=USV^T)).  The matrix will be of size m, k where m = the number of unique terms
	// in the training data and k = the number of elements to truncate to (specified by
	// attribute K) or m or n (the number of documents in the training data) whichever of
	// the 3 values is smaller.
	Components *mat.Dense

	// K is the number of dimensions to which the output, transformed, matrix should be
	// truncated to.  The matrix output by the FitTransform() and Transform() methods will
	// be n rows by min(m, n, K) columns, where n is the number of columns in the original,
	// input matrix and min(m, n, K) is the lowest value of m, n, K where m is the number of
	// rows in the original, input matrix.
	K int
}

TruncatedSVD implements the Singular Value Decomposition factorisation of matrices. This produces an approximation of the input matrix at a lower rank. This is a core component of LSA (Latent Semantic Analysis).

func NewTruncatedSVD

func NewTruncatedSVD(k int) *TruncatedSVD

NewTruncatedSVD creates a new TruncatedSVD transformer with K (the truncated dimensionality) being set to the specified value k

func (*TruncatedSVD) Fit

func (t *TruncatedSVD) Fit(mat mat.Matrix) Transformer

Fit performs the SVD factorisation on the input training data matrix, mat, and stores the output term matrix as a transform to apply to matrices in the Transform() method.

func (*TruncatedSVD) FitTransform

func (t *TruncatedSVD) FitTransform(m mat.Matrix) (mat.Matrix, error)

FitTransform is approximately equivalent to calling Fit() followed by Transform() on the same matrix. This is a useful shortcut where separate training data is not being used to fit the model i.e. the model is fitted on the fly to the test data. The returned matrix is a dense matrix type.

func (*TruncatedSVD) Load

func (t *TruncatedSVD) Load(r io.Reader) error

Load binary deserialises the previously serialised model into the receiver. This is useful for loading a previously trained and saved model from another context (e.g. offline training) for use within another context (e.g. production) for reproducible results. Load should only be performed with trusted data.

func (TruncatedSVD) Save

func (t TruncatedSVD) Save(w io.Writer) error

Save binary serialises the model and writes it into w. This is useful for persisting a trained model to disk so that it may be loaded (using the Load() method) in another context (e.g. production) for reproducible results.

func (*TruncatedSVD) Transform

func (t *TruncatedSVD) Transform(m mat.Matrix) (mat.Matrix, error)

Transform applies the transform decomposed from the training data matrix in Fit() to the input matrix. The resulting output matrix will be the closest approximation to the input matrix at a reduced rank. The returned matrix is a dense matrix type.

type Vectoriser

type Vectoriser interface {
	Fit(...string) Vectoriser
	Transform(...string) (mat.Matrix, error)
	FitTransform(...string) (mat.Matrix, error)
}

Vectoriser provides a common interface for vectorisers that take a variable set of string arguments and produce a numerical matrix of features.

Directories

Path Synopsis
measures
