Documentation ¶
Overview ¶
Package nlp provides implementations of selected machine learning algorithms for natural language processing of text corpora. The initial focus is on algorithms supporting LSA (Latent Semantic Analysis), often referred to as LSI (Latent Semantic Indexing) in the context of information retrieval.
The algorithms in the package typically accept documents as text strings, which are then encoded as a matrix of numerical feature vectors called a `term document matrix`. Columns in this matrix represent the documents in the corpus and rows represent the terms occurring in the documents. Each element of the matrix contains the number of occurrences of the corresponding term in the associated document.
This matrix can be manipulated through the application of additional transformations for weighting features, identifying relationships or optimising the data for analysis, information retrieval and/or predictions.
A common transformation weights features to remove natural biases that would otherwise skew results, e.g. commonly occurring words like `the`, `of` and `and` should carry lower weight than unusual words.
Term Document matrices typically have a very large number of dimensions and so transformations are often applied to reduce the dimensionality using techniques such as Locality Sensitive Hashing or Latent Semantic Analysis (typically performed using matrix SVD - `Singular Value Decomposition`) which approximates the original term document matrix with a new matrix of much lower rank (typically around 100 rather than 1000s). Truncated SVD is a fundamental part of LSA (Latent Semantic Analysis aka Latent Semantic Indexing) and serves a number of purposes:
1. The reduced dimensionality of the data theoretically requires less memory.
2. As less significant dimensions are removed, there is less `noise` in the data which could have artificially skewed results.
3. Perhaps most importantly, the SVD effectively encodes the co-occurrence of terms within the documents to capture semantic meaning rather than simply the presence (or lack of presence) of words. This combats the problem of synonymy (a common challenge in NLP) where different words in the English language can be used to mean the same thing (synonyms). In LSA, documents can have a high degree of semantic similarity with very few words in common.
The columns of the post-SVD matrix (each a feature vector representing a document within the corpus) can be compared for similarity with each other (for clustering) or with a query (also represented as a feature vector projected into the same dimensional space). Similarity is measured by the angle between the two feature vectors being considered.
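The angle-based comparison described above can be sketched with the standard library alone. This is a minimal illustration rather than the package's own implementation: the function name `cosineSimilarity`, the plain `[]float64` vectors and the choice to return 0 for zero-length vectors are all assumptions made for the example.

```go
package main

import (
	"fmt"
	"math"
)

// cosineSimilarity returns the cosine of the angle between a and b, which
// must have the same length. It returns 0 if either vector has zero
// magnitude (an assumption made here to avoid dividing by zero).
func cosineSimilarity(a, b []float64) float64 {
	var dot, normA, normB float64
	for i := range a {
		dot += a[i] * b[i]
		normA += a[i] * a[i]
		normB += b[i] * b[i]
	}
	if normA == 0 || normB == 0 {
		return 0
	}
	return dot / (math.Sqrt(normA) * math.Sqrt(normB))
}

func main() {
	// Hypothetical query and document feature vectors in the same space.
	query := []float64{1, 2, 0}
	docs := [][]float64{{2, 4, 0}, {0, 0, 3}, {1, 0, 1}}

	// Pick the document whose vector has the smallest angle to the query.
	best, bestSim := -1, -1.0
	for i, d := range docs {
		if sim := cosineSimilarity(query, d); sim > bestSim {
			best, bestSim = i, sim
		}
	}
	fmt.Printf("best match: document %d (similarity %.3f)\n", best, bestSim)
}
```

Because only the angle matters, a long document and a short one with the same proportions of terms score as near-identical.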
Example ¶
testCorpus := []string{
	"The quick brown fox jumped over the lazy dog",
	"hey diddle diddle, the cat and the fiddle",
	"the cow jumped over the moon",
	"the little dog laughed to see such fun",
	"and the dish ran away with the spoon",
}

query := "the brown fox ran around the dog"

vectoriser := NewCountVectoriser(true)
transformer := NewTfidfTransformer()

// set k (the number of dimensions following truncation) to 4
reducer := NewTruncatedSVD(4)

lsiPipeline := NewPipeline(vectoriser, transformer, reducer)

// Transform the corpus into an LSI fitting the model to the documents in the process
lsi, err := lsiPipeline.FitTransform(testCorpus...)
if err != nil {
	fmt.Printf("Failed to process documents because %v", err)
	return
}

// run the query through the same pipeline that was fitted to the corpus
// to project it into the same dimensional space
queryVector, err := lsiPipeline.Transform(query)
if err != nil {
	fmt.Printf("Failed to process documents because %v", err)
	return
}

// iterate over document feature vectors (columns) in the LSI and compare with the
// query vector for similarity. Similarity is determined by the angle between
// the vectors, measured as the cosine similarity
highestSimilarity := -1.0
var matched int
_, docs := lsi.Dims()
for i := 0; i < docs; i++ {
	similarity := pairwise.CosineSimilarity(queryVector.(mat.ColViewer).ColView(0), lsi.(mat.ColViewer).ColView(i))
	if similarity > highestSimilarity {
		matched = i
		highestSimilarity = similarity
	}
}

fmt.Printf("Matched '%s'", testCorpus[matched])
Output: Matched 'The quick brown fox jumped over the lazy dog'
Index ¶
- func ColDo(m mat.Matrix, fn func(j int, vec mat.Vector))
- func CreateRandomProjectionTransform(newDims, origDims int, density float64) mat.Matrix
- func MorfModeWord(word string) string
- type CountVectoriser
- func (v *CountVectoriser) Fit(train ...string) Vectoriser
- func (v *CountVectoriser) FitTransform(docs ...string) (mat.Matrix, error)
- func (c *CountVectoriser) Load(r io.Reader) error
- func (c *CountVectoriser) Save(w io.Writer) error
- func (v *CountVectoriser) Transform(docs ...string) (mat.Matrix, error)
- type HashingVectoriser
- type PCA
- type Pipeline
- type RIBasis
- type RandomIndexing
- type RandomProjection
- type RegExpTokeniser
- type SignRandomProjection
- type SimHash
- type TfidfTransformer
- func (t *TfidfTransformer) Fit(mat mat.Matrix) Transformer
- func (t *TfidfTransformer) FitTransform(mat mat.Matrix) (mat.Matrix, error)
- func (t *TfidfTransformer) GetTransform() *sparse.DIA
- func (t *TfidfTransformer) Load(r io.Reader) error
- func (t TfidfTransformer) Save(w io.Writer) error
- func (t *TfidfTransformer) Transform(mat mat.Matrix) (mat.Matrix, error)
- type Tokeniser
- type Transformer
- type TruncatedSVD
- type Vectoriser
- type VsTokeniser
Examples ¶
Constants ¶
This section is empty.
Variables ¶
This section is empty.
Functions ¶
func CreateRandomProjectionTransform ¶
CreateRandomProjectionTransform returns a new random matrix for Random Projections of shape newDims x origDims. The matrix will be randomly populated using probability distributions where density is used as the probability that each element will be populated. Populated values will be randomly selected from [-1, 1] scaled according to the density and dimensions of the matrix.
func MorfModeWord ¶
Types ¶
type CountVectoriser ¶
type CountVectoriser struct {
	// Vocabulary is a map of words to indices that point to the row number representing
	// that word in the term document matrix output from the Transform() and FitTransform()
	// methods. The Vocabulary map is populated by the Fit() or FitTransform() methods
	// based upon the words occurring in the datasets supplied to those methods. Within
	// Transform(), any words found in the test data set that were not present in the
	// training data set supplied to Fit() will not have an entry in the Vocabulary
	// and will be ignored.
	Vocabulary map[string]int

	// Tokeniser is used to tokenise input text into features.
	Tokeniser Tokeniser
}
CountVectoriser can be used to encode one or more text documents into a term document matrix where each column represents a document within the corpus and each row represents a term present in the training data set. Each element represents the frequency the corresponding term appears in the corresponding document e.g. tf(t, d) = 5 would mean that term t (perhaps the word "dog") appears 5 times in the document d.
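The encoding described above can be illustrated with a small standard-library sketch. It is not the package's implementation: the helper `termDocumentMatrix`, the whitespace tokenisation, the lower-casing and the dense `[][]float64` result are all assumptions chosen to keep the example short (the package itself returns sparse matrices).

```go
package main

import (
	"fmt"
	"strings"
)

// termDocumentMatrix builds a vocabulary (term -> row index) from docs and
// returns it with a terms x documents matrix of raw term counts.
// Tokenisation here is a simple lower-case whitespace split (an assumption).
func termDocumentMatrix(docs []string) (map[string]int, [][]float64) {
	// First pass ("fit"): assign each new term the next free row index.
	vocab := make(map[string]int)
	for _, doc := range docs {
		for _, tok := range strings.Fields(strings.ToLower(doc)) {
			if _, ok := vocab[tok]; !ok {
				vocab[tok] = len(vocab)
			}
		}
	}

	// Second pass ("transform"): count occurrences per document (column).
	counts := make([][]float64, len(vocab))
	for i := range counts {
		counts[i] = make([]float64, len(docs))
	}
	for j, doc := range docs {
		for _, tok := range strings.Fields(strings.ToLower(doc)) {
			counts[vocab[tok]][j]++
		}
	}
	return vocab, counts
}

func main() {
	vocab, counts := termDocumentMatrix([]string{
		"the dog chased the cat",
		"the cat slept",
	})
	// Row for "the": one count per document, i.e. tf("the", d) for each d.
	fmt.Println(counts[vocab["the"]])
}
```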
func NewCountVectoriser ¶
func NewCountVectoriser(removeStopwords bool) *CountVectoriser
NewCountVectoriser creates a new CountVectoriser. If removeStopwords is true then English stop words will be removed.
func VsNewCountVectoriser ¶
func VsNewCountVectoriser(removeStopwords bool) *CountVectoriser
func (*CountVectoriser) Fit ¶
func (v *CountVectoriser) Fit(train ...string) Vectoriser
Fit processes the supplied training data (a variable number of strings representing documents). Each word appearing in the training data will be added to the Vocabulary.
func (*CountVectoriser) FitTransform ¶
func (v *CountVectoriser) FitTransform(docs ...string) (mat.Matrix, error)
FitTransform is exactly equivalent to calling Fit() followed by Transform() on the same documents. This is a convenience where separate training data is not being used to fit the model i.e. the model is fitted on the fly to the test data. The returned matrix is a sparse matrix type.
func (*CountVectoriser) Transform ¶
func (v *CountVectoriser) Transform(docs ...string) (mat.Matrix, error)
Transform transforms the supplied documents into a term document matrix where each column is a feature vector representing one of the supplied documents. Each element represents the frequency with which the associated term for that row occurred within that document. The returned matrix is a sparse matrix type.
type HashingVectoriser ¶
HashingVectoriser can be used to encode one or more text documents into a term document matrix where each column represents a document within the corpus and each row represents a term. Each element represents the frequency the corresponding term appears in the corresponding document e.g. tf(t, d) = 5 would mean that term t (perhaps the word "dog") appears 5 times in the document d.
func NewHashingVectoriser ¶
func NewHashingVectoriser(removeStopwords bool, numFeatures int) *HashingVectoriser
NewHashingVectoriser creates a new HashingVectoriser. If removeStopwords is true then English stop words will be removed. numFeatures specifies the number of features that should be present in produced vectors. Each word in a document is hashed, and the hash modulo numFeatures gives the row in the matrix corresponding to that word.
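The hash-then-modulo mapping can be sketched as follows. This is an illustration under stated assumptions, not the package's code: it uses FNV-1a from the standard library as the hash function (the hash the package actually uses is not stated here), and `hashRow` and `hashingVector` are hypothetical helper names.

```go
package main

import (
	"fmt"
	"hash/fnv"
	"strings"
)

// hashRow maps a token to a row index in [0, numFeatures) by hashing the
// token and taking the result modulo numFeatures. FNV-1a is an assumption.
func hashRow(token string, numFeatures int) int {
	h := fnv.New32a()
	h.Write([]byte(token))
	return int(h.Sum32() % uint32(numFeatures))
}

// hashingVector returns the feature vector (one matrix column) for a single
// document. No vocabulary is needed, so there is nothing to fit.
func hashingVector(doc string, numFeatures int) []float64 {
	vec := make([]float64, numFeatures)
	for _, tok := range strings.Fields(strings.ToLower(doc)) {
		vec[hashRow(tok, numFeatures)]++
	}
	return vec
}

func main() {
	vec := hashingVector("the dog chased the other dog", 16)
	fmt.Println("row for 'dog':", hashRow("dog", 16))
	fmt.Println(vec)
}
```

Different words can collide on the same row; a larger numFeatures makes collisions less likely at the cost of longer vectors.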
func (*HashingVectoriser) Fit ¶
func (v *HashingVectoriser) Fit(train ...string) Vectoriser
Fit does nothing for a HashingVectoriser. As the HashingVectoriser vectorises features based on their hash, it does not require a pre-determined vocabulary to map features to their correct row in the vector. It is effectively stateless and does not require fitting to training data. The method is included for compatibility with other vectorisers.
func (*HashingVectoriser) FitTransform ¶
func (v *HashingVectoriser) FitTransform(docs ...string) (mat.Matrix, error)
FitTransform for a HashingVectoriser is exactly equivalent to calling Transform() with the same documents. For most vectorisers, Fit() must be called prior to Transform(), so FitTransform() is a convenience where separate training data is not used to fit the model. For a HashingVectoriser, fitting is not required. As with Fit(), this method is included with the HashingVectoriser for compatibility with other vectorisers. The returned matrix is a sparse matrix type.
func (*HashingVectoriser) Transform ¶
func (v *HashingVectoriser) Transform(docs ...string) (mat.Matrix, error)
Transform transforms the supplied documents into a term document matrix where each column is a feature vector representing one of the supplied documents. Each element represents the frequency with which the associated term for that row occurred within that document. The returned matrix is a sparse matrix type.
type PCA ¶
type PCA struct {
	// K is the number of components
	K int
	// contains filtered or unexported fields
}
PCA calculates the principal components of a matrix (the axes of greatest variance) and then projects matrices onto those axes. See https://en.wikipedia.org/wiki/Principal_component_analysis for further details.
func NewPCA ¶
NewPCA constructs a new Principal Component Analysis transformer to reduce dimensionality by projecting matrices onto the axes of greatest variance.
func (*PCA) ExplainedVariance ¶
ExplainedVariance returns a slice of float64 values representing the variances of the principal component scores.
func (*PCA) Fit ¶
func (p *PCA) Fit(m mat.Matrix) Transformer
Fit calculates the principal component directions (axes of greatest variance) within the training data, which can then be used to project matrices onto those principal components using the Transform() method.
func (*PCA) FitTransform ¶
FitTransform is approximately equivalent to calling Fit() followed by Transform() on the same matrix. This is a useful shortcut where separate training data is not being used to fit the model i.e. the model is fitted on the fly to the test data.
type Pipeline ¶
type Pipeline struct {
	Vectoriser   Vectoriser
	Transformers []Transformer
}
Pipeline is a mechanism for composing processing pipelines out of vectoriser and transformation steps. For example, to compose a classic LSA/LSI pipeline (vectorisation -> TFIDF transformation -> Truncated SVD) one could use a Pipeline as follows:
lsaPipeline := NewPipeline(NewCountVectoriser(false), NewTfidfTransformer(), NewTruncatedSVD(100))
func NewPipeline ¶
func NewPipeline(vectoriser Vectoriser, transformers ...Transformer) *Pipeline
NewPipeline constructs a new processing pipeline with the supplied Vectoriser and one or more transformers.
func (*Pipeline) Fit ¶
func (p *Pipeline) Fit(docs ...string) Vectoriser
Fit fits the model(s) to the supplied training data.
func (*Pipeline) FitTransform ¶
FitTransform transforms the supplied documents into a matrix representation of numerical feature vectors fitting the model to the supplied data in the process.
type RIBasis ¶
type RIBasis int
RIBasis represents the initial basis for the elemental vectors used for Random Indexing.
const (
	// RowBasedRI indicates rows (terms in a term-document matrix)
	// forming the initial basis for elemental vectors in Random Indexing.
	// This is the basis used for Random Indexing of documents. Reflective
	// Random Indexing can use either rows or columns as the initial
	// basis for elemental vectors.
	RowBasedRI RIBasis = iota

	// ColBasedRI represents columns (documents/contexts in a term-document
	// matrix) forming the initial basis for elemental vectors in Random Indexing
	ColBasedRI
)
type RandomIndexing ¶
type RandomIndexing struct {
	// K specifies the number of dimensions for the semantic space
	K int

	// Density specifies the proportion of non-zero elements in the
	// elemental vectors
	Density float64

	// Type specifies the initial basis for the elemental vectors
	// i.e. whether they initially represent the rows or columns
	// For Random Indexing this should be RowBasedRI, for RRI
	// (Reflective Random Indexing) it can be either RowBasedRI or
	// ColBasedRI
	Type RIBasis

	// Reflections specifies the number of reflective training cycles
	// to run during fitting for RRI (Reflective Random Indexing).
	// If Type is ColBasedRI then Reflections must be >= 1
	Reflections int
	// contains filtered or unexported fields
}
RandomIndexing is a method of dimensionality reduction similar to TruncatedSVD and PCA. Random Indexing is designed to solve limitations of very high dimensional vector space model implementations for modelling term co-occurrence in language processing, such as SVD as used by LSA/LSI (Latent Semantic Analysis/Latent Semantic Indexing). In implementation it bears some similarity to other random projection techniques such as those implemented in RandomProjection and SignRandomProjection. The RandomIndexing type can also be used to perform Reflective Random Indexing, which extends the Random Indexing model with additional training cycles to support indirect inferences i.e. finding synonyms where the words do not appear together in documents.
func NewRandomIndexing ¶
func NewRandomIndexing(k int, density float64) *RandomIndexing
NewRandomIndexing returns a new RandomIndexing transformer configured to transform term document matrices into k dimensional space. The density parameter specifies the density of the elemental vectors used to project the input matrix into lower dimensional space i.e. the proportion of elements that are non-zero. As RandomIndexing makes use of sparse matrix formats, specifying lower values for density will result in lower memory usage.
func NewReflectiveRandomIndexing ¶
func NewReflectiveRandomIndexing(k int, t RIBasis, reflections int, density float64) *RandomIndexing
NewReflectiveRandomIndexing returns a new RandomIndexing type configured for Reflective Random Indexing. Reflective Random Indexing applies additional (reflective) training cycles on top of Random Indexing to capture indirect inferences (synonyms) i.e. similarity between terms that do not directly co-occur within the same context/document. t specifies the basis for the reflective random indexing i.e. whether the initial, random elemental vectors should represent columns or rows. reflections is the number of training cycles to apply. If t == RowBasedRI and reflections == 0 then the created type will perform conventional Random Indexing. NewReflectiveRandomIndexing will panic if t == ColBasedRI and reflections < 1 because column based Reflective Random Indexing requires at least one reflective training cycle to generate the row based elemental vectors required for RI/RRI.
func (*RandomIndexing) Fit ¶
func (r *RandomIndexing) Fit(m mat.Matrix) Transformer
Fit trains the model, creating random elemental vectors that will later be used to construct the new projected feature vectors in the reduced semantic dimensional space. If configured for Reflective Random Indexing then Fit may actually run multiple training cycles as specified during construction.
func (*RandomIndexing) FitTransform ¶
FitTransform is approximately equivalent to calling Fit() followed by Transform() on the same matrix. This is a useful shortcut where separate training data is not being used to fit the model i.e. the model is fitted on the fly to the test data. The returned matrix is a sparse CSC format matrix of shape k x c.
type RandomProjection ¶
RandomProjection is a method of dimensionality reduction based upon the Johnson–Lindenstrauss lemma stating that a small set of points in a high-dimensional space can be embedded into a space of much lower dimension in such a way that distances between the points are nearly preserved.
The technique projects the original matrix orthogonally onto a random subspace, transforming the elements of the original matrix into a lower dimensional representation. Computing orthogonal matrices is expensive and so this technique uses specially generated random matrices (hence the name) following the principle that in high dimensional spaces, there are lots of nearly orthogonal matrices.
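A sparse random projection of this kind can be sketched with the standard library. The sketch below is not the package's implementation: `sparseRandomMatrix` and `project` are hypothetical helpers, and the scaling factor sqrt(1 / (density * newDims)) is an assumption borrowed from common sparse random projection schemes rather than something stated in this documentation.

```go
package main

import (
	"fmt"
	"math"
	"math/rand"
)

// sparseRandomMatrix returns a newDims x origDims matrix in which each
// element is non-zero with probability density. Non-zero elements are +s or
// -s with equal probability, where s = sqrt(1 / (density * newDims)).
// The scaling is an assumption, not taken from the package.
func sparseRandomMatrix(newDims, origDims int, density float64, seed int64) [][]float64 {
	rng := rand.New(rand.NewSource(seed))
	scale := math.Sqrt(1 / (density * float64(newDims)))
	m := make([][]float64, newDims)
	for i := range m {
		m[i] = make([]float64, origDims)
		for j := range m[i] {
			if rng.Float64() >= density {
				continue // leave the element as zero
			}
			if rng.Intn(2) == 0 {
				m[i][j] = scale
			} else {
				m[i][j] = -scale
			}
		}
	}
	return m
}

// project multiplies the projection matrix p by the column vector v,
// producing a vector with one element per row of p.
func project(p [][]float64, v []float64) []float64 {
	out := make([]float64, len(p))
	for i, row := range p {
		for j, x := range row {
			out[i] += x * v[j]
		}
	}
	return out
}

func main() {
	p := sparseRandomMatrix(3, 8, 0.5, 42)
	doc := []float64{1, 0, 2, 0, 0, 1, 0, 3}
	fmt.Println(project(p, doc)) // a 3-dimensional representation of doc
}
```

The same matrix must be reused for every vector that is to be compared, which is why the transformer generates it once during fitting.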
func NewRandomProjection ¶
func NewRandomProjection(k int, density float64) *RandomProjection
NewRandomProjection creates and returns a new RandomProjection transformer. The RandomProjection will use a specially generated random matrix of the specified density and dimensionality k to perform the transform to k dimensional space.
func (*RandomProjection) Fit ¶
func (r *RandomProjection) Fit(m mat.Matrix) Transformer
Fit creates the random (almost) orthogonal matrix used to project input matrices into the new reduced dimensional subspace.
func (*RandomProjection) FitTransform ¶
FitTransform is approximately equivalent to calling Fit() followed by Transform() on the same matrix. This is a useful shortcut where separate training data is not being used to fit the model i.e. the model is fitted on the fly to the test data. The returned matrix is a sparse CSR format matrix of shape k x c.
type RegExpTokeniser ¶
RegExpTokeniser implements the Tokeniser interface using a basic RegExp pattern for unigram word tokenisation, with optional stop word removal.
func (*RegExpTokeniser) ForEachIn ¶
func (t *RegExpTokeniser) ForEachIn(text string, f func(token string))
ForEachIn iterates over each token within text and invokes function f with the token as parameter. If StopWords is not nil then any tokens from text present in StopWords will be ignored.
func (*RegExpTokeniser) Tokenise ¶
func (t *RegExpTokeniser) Tokenise(text string) []string
Tokenise returns a slice of all the tokens contained in string text. If StopWords is not nil then any tokens from text present in StopWords will be removed from the slice.
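A regular-expression tokeniser with stop word filtering might look like the sketch below. The pattern, the lower-casing and the `map[string]bool` stop word set are assumptions made for illustration; the pattern and stop word representation used by RegExpTokeniser are not shown in this documentation.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// wordPattern matches runs of letters. This pattern is an assumption; the
// pattern used by the package may differ.
var wordPattern = regexp.MustCompile(`\p{L}+`)

// tokenise lower-cases text, splits it into word tokens and drops any token
// present in stopWords. A nil stopWords map disables stop word removal.
func tokenise(text string, stopWords map[string]bool) []string {
	var tokens []string
	for _, tok := range wordPattern.FindAllString(strings.ToLower(text), -1) {
		if stopWords[tok] {
			continue // lookups on a nil map return false, so nothing is dropped
		}
		tokens = append(tokens, tok)
	}
	return tokens
}

func main() {
	stop := map[string]bool{"the": true, "over": true}
	fmt.Println(tokenise("The quick brown fox jumped over the lazy dog", stop))
}
```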
type SignRandomProjection ¶
type SignRandomProjection struct {
	// Bits represents the number of bits the output vectors should
	// be in length and hence the number of random projections needed
	// for the transformation
	Bits int
	// contains filtered or unexported fields
}
SignRandomProjection represents a transform of a matrix into a lower dimensional space. Sign Random Projection is a method of Locality Sensitive Hashing (LSH) sometimes referred to as the random hyperplane method. A set of random hyperplanes are projected into dimensional space and then input matrices are expressed relative to the random projections as follows:
For each column vector in the input matrix, construct a corresponding output bit vector with each bit (i) calculated as follows:

	if dot(vector, projection[i]) > 0
		bit[i] = 1
	else
		bit[i] = 0

Although similar to other methods of random projection, this method is unique in that it uses a single bit in the output matrix to represent the sign of the result of the comparison (dot product) with each projection, and so is very space and computationally efficient. Hamming similarity (and distance) between the transformed vectors in this new space can approximate Angular similarity (and distance), which is strongly related to Cosine similarity, of the associated vectors from the original space with significant reductions in both memory usage and processing time.
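The bit calculation and the Hamming comparison can be sketched using only the standard library. This is an illustration, not the package's implementation: `signBits` and `hammingSimilarity` are hypothetical helpers, bits are held in a `[]bool` rather than a packed binary vector, and drawing hyperplane components from a normal distribution is an assumption.

```go
package main

import (
	"fmt"
	"math/rand"
)

// signBits returns one bit per hyperplane: true where the dot product of vec
// and the hyperplane is positive, false otherwise.
func signBits(vec []float64, planes [][]float64) []bool {
	bits := make([]bool, len(planes))
	for i, plane := range planes {
		var dot float64
		for j, x := range vec {
			dot += x * plane[j]
		}
		bits[i] = dot > 0
	}
	return bits
}

// hammingSimilarity returns the fraction of positions at which a and b
// agree. Both slices must have the same, non-zero length.
func hammingSimilarity(a, b []bool) float64 {
	same := 0
	for i := range a {
		if a[i] == b[i] {
			same++
		}
	}
	return float64(same) / float64(len(a))
}

func main() {
	// Generate 64 random hyperplanes in 3 dimensions (normally distributed
	// components are an assumption made for this sketch).
	rng := rand.New(rand.NewSource(1))
	const bits, dims = 64, 3
	planes := make([][]float64, bits)
	for i := range planes {
		planes[i] = make([]float64, dims)
		for j := range planes[i] {
			planes[i][j] = rng.NormFloat64()
		}
	}

	// Vectors separated by a small angle should agree on most bits.
	a := signBits([]float64{1, 2, 3}, planes)
	b := signBits([]float64{1, 2, 3.5}, planes)
	fmt.Printf("hamming similarity: %.2f\n", hammingSimilarity(a, b))
}
```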
func NewSignRandomProjection ¶
func NewSignRandomProjection(bits int) *SignRandomProjection
NewSignRandomProjection constructs a new SignRandomProjection transformer to reduce the dimensionality. The number of random hyperplanes used by the transformer is specified by `bits`, which is also the dimensionality of the output, transformed matrices.
func (*SignRandomProjection) Fit ¶
func (s *SignRandomProjection) Fit(m mat.Matrix) Transformer
Fit creates the random hyperplanes from the input training data matrix, m, and stores the hyperplanes as a transform to apply to matrices.
func (*SignRandomProjection) FitTransform ¶
FitTransform is approximately equivalent to calling Fit() followed by Transform() on the same matrix. This is a useful shortcut where separate training data is not being used to fit the model i.e. the model is fitted on the fly to the test data. The returned matrix is a Binary matrix or BinaryVec type depending upon whether m is Matrix or Vector.
func (*SignRandomProjection) Transform ¶
Transform applies the transform decomposed from the training data matrix in Fit() to the input matrix. The columns in the resulting output matrix will be a low dimensional binary representation of the columns within the original i.e. a hash or fingerprint that can be quickly and efficiently compared with other similar vectors. Hamming similarity in the new dimensional space can be used to approximate Cosine similarity between the vectors of the original space. The returned matrix is a Binary matrix or BinaryVec type depending upon whether m is Matrix or Vector.
type SimHash ¶
type SimHash struct {
// contains filtered or unexported fields
}
SimHash implements the SimHash Locality Sensitive Hashing (LSH) algorithm using sign random projections (Moses S. Charikar, https://www.cs.princeton.edu/courses/archive/spr04/cos598B/bib/CharikarEstim.pdf). The distance between the original vectors is preserved through the hashing process such that hashed vectors can be compared using Hamming Similarity for a faster, more space efficient approximation of Cosine Similarity for the original vectors.
func NewSimHash ¶
NewSimHash constructs a new SimHash creating a set of locality sensitive hash functions which are combined to accept input vectors of length dim and produce hashed binary vector fingerprints of length bits. This method creates a series of random hyperplanes which are then compared to each input vector to produce the output hashed binary vector encoding the input vector's location in vector space relative to the hyperplanes. Each bit in the output vector corresponds to the sign (1/0 for +/-) of the result of the dot product comparison with each random hyperplane.
type TfidfTransformer ¶
type TfidfTransformer struct {
// contains filtered or unexported fields
}
TfidfTransformer takes a raw term document matrix and weights each raw term frequency value depending upon how commonly it occurs across all documents within the corpus. For example a very commonly occurring word like `the` is likely to occur in all documents and so would be weighted down. More precisely, TfidfTransformer applies a tf-idf algorithm to the matrix where each term frequency is multiplied by the inverse document frequency. Inverse document frequency is calculated as log(n/df) where df is the number of documents in which the term occurs and n is the total number of documents within the corpus. We add 1 to both n and df before division to prevent division by zero.
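The weighting described above, with 1 added to both n and df, can be written out as a short standard-library sketch. It follows the formula as stated in the text and is not the package's implementation; the helper `tfidf` and the dense `[][]float64` matrix are assumptions made for the example.

```go
package main

import (
	"fmt"
	"math"
)

// tfidf returns a weighted copy of a terms x documents count matrix. Each
// row (term) is multiplied by idf = log((1 + n) / (1 + df)), where n is the
// number of documents and df the number of documents containing the term.
func tfidf(counts [][]float64) [][]float64 {
	weighted := make([][]float64, len(counts))
	for i, row := range counts {
		n := len(row) // total number of documents
		df := 0       // number of documents containing term i
		for _, c := range row {
			if c > 0 {
				df++
			}
		}
		idf := math.Log(float64(1+n) / float64(1+df))
		weighted[i] = make([]float64, n)
		for j, c := range row {
			weighted[i][j] = c * idf
		}
	}
	return weighted
}

func main() {
	counts := [][]float64{
		{2, 1, 3}, // a term such as "the", present in every document
		{1, 0, 0}, // a term such as "fox", present in one document
	}
	for _, row := range tfidf(counts) {
		fmt.Println(row)
	}
}
```

Under this formula a term that occurs in every document receives an IDF of log(1) = 0, so its weighted frequencies vanish entirely, while rarer terms keep a positive weight.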
func NewTfidfTransformer ¶
func NewTfidfTransformer() *TfidfTransformer
NewTfidfTransformer constructs a new TfidfTransformer.
func (*TfidfTransformer) Fit ¶
func (t *TfidfTransformer) Fit(mat mat.Matrix) Transformer
Fit takes a training term document matrix, counts term occurrences across all documents and constructs an inverse document frequency transform to apply to matrices in subsequent calls to Transform().
func (*TfidfTransformer) FitTransform ¶
FitTransform is exactly equivalent to calling Fit() followed by Transform() on the same matrix. This is a convenience where separate training data is not being used to fit the model i.e. the model is fitted on the fly to the test data. The returned matrix is a sparse matrix type.
func (*TfidfTransformer) GetTransform ¶
func (t *TfidfTransformer) GetTransform() *sparse.DIA
func (*TfidfTransformer) Load ¶
func (t *TfidfTransformer) Load(r io.Reader) error
Load binary deserialises the previously serialised model into the receiver. This is useful for loading a previously trained and saved model from another context (e.g. offline training) for use within another context (e.g. production) for reproducible results. Load should only be performed with trusted data.
func (TfidfTransformer) Save ¶
func (t TfidfTransformer) Save(w io.Writer) error
Save binary serialises the model and writes it into w. This is useful for persisting a trained model to disk so that it may be loaded (using the Load() method) in another context (e.g. production) for reproducible results.
func (*TfidfTransformer) Transform ¶
Transform applies the inverse document frequency (IDF) transform by multiplying each term frequency by its corresponding IDF value. This has the effect of weighting each term frequency according to how often it appears across the whole document corpus so that naturally frequent occurring words are given less weight than uncommon ones. The returned matrix is a sparse matrix type.
type Tokeniser ¶
type Tokeniser interface {
	// ForEachIn iterates over each token within text and invokes function
	// f with the token as parameter
	ForEachIn(text string, f func(token string))

	// Tokenise returns a slice of all the tokens contained in string
	// text
	Tokenise(text string) []string
}
Tokeniser is an interface for tokenisers, allowing substitution of different tokenisation strategies (e.g. Regexp) and supporting different token types (e.g. n-grams) and languages.
func NewTokeniser ¶
NewTokeniser returns a new, default Tokeniser implementation. If removeStopwords is true then stop words will be removed from tokens.
func VsNewTokeniser ¶
type Transformer ¶
type Transformer interface {
	Fit(mat.Matrix) Transformer
	Transform(mat mat.Matrix) (mat.Matrix, error)
	FitTransform(mat mat.Matrix) (mat.Matrix, error)
}
Transformer provides a common interface for transformer steps.
type TruncatedSVD ¶
type TruncatedSVD struct {
	// Components is the truncated term matrix (matrix U of the Singular Value Decomposition
	// (A=USV^T)). The matrix will be of size m, k where m = the number of unique terms
	// in the training data and k = the number of elements to truncate to (specified by
	// attribute K) or m or n (the number of documents in the training data) whichever of
	// the 3 values is smaller.
	Components *mat.Dense

	// K is the number of dimensions to which the output, transformed, matrix should be
	// truncated. The matrix output by the FitTransform() and Transform() methods will
	// be n rows by min(m, n, K) columns, where n is the number of columns in the original,
	// input matrix and min(m, n, K) is the lowest value of m, n, K where m is the number of
	// rows in the original, input matrix.
	K int
}
TruncatedSVD implements the Singular Value Decomposition factorisation of matrices. This produces an approximation of the input matrix at a lower rank. This is a core component of LSA (Latent Semantic Analysis).
func NewTruncatedSVD ¶
func NewTruncatedSVD(k int) *TruncatedSVD
NewTruncatedSVD creates a new TruncatedSVD transformer with K (the truncated dimensionality) being set to the specified value k
func (*TruncatedSVD) Fit ¶
func (t *TruncatedSVD) Fit(mat mat.Matrix) Transformer
Fit performs the SVD factorisation on the input training data matrix, mat, and stores the output term matrix as a transform to apply to matrices in the Transform() method.
func (*TruncatedSVD) FitTransform ¶
FitTransform is approximately equivalent to calling Fit() followed by Transform() on the same matrix. This is a useful shortcut where separate training data is not being used to fit the model i.e. the model is fitted on the fly to the test data. The returned matrix is a dense matrix type.
func (*TruncatedSVD) Load ¶
func (t *TruncatedSVD) Load(r io.Reader) error
Load binary deserialises the previously serialised model into the receiver. This is useful for loading a previously trained and saved model from another context (e.g. offline training) for use within another context (e.g. production) for reproducible results. Load should only be performed with trusted data.
func (TruncatedSVD) Save ¶
func (t TruncatedSVD) Save(w io.Writer) error
Save binary serialises the model and writes it into w. This is useful for persisting a trained model to disk so that it may be loaded (using the Load() method) in another context (e.g. production) for reproducible results.
type Vectoriser ¶
type Vectoriser interface {
	Fit(...string) Vectoriser
	Transform(...string) (mat.Matrix, error)
	FitTransform(...string) (mat.Matrix, error)
}
Vectoriser provides a common interface for vectorisers that take a variable set of string arguments and produce a numerical matrix of features.
type VsTokeniser ¶
func (*VsTokeniser) ForEachIn ¶
func (t *VsTokeniser) ForEachIn(text string, f func(token string))
ForEachIn iterates over each token within text and invokes function f with the token as parameter
func (*VsTokeniser) Tokenise ¶
func (t *VsTokeniser) Tokenise(text string) []string
Tokenise returns a slice of all the tokens contained in string text. If StopWords is not nil then any tokens from text present in StopWords will be removed from the slice.