README

Natural Language Processing

nlp

An implementation of selected machine learning algorithms for basic natural language processing in golang. The initial focus for this project is Latent Semantic Analysis to allow retrieval/searching, clustering and classification of text documents based upon semantic content.

Built upon the Gonum library for linear algebra and scientific computing with some inspiration taken from Python's scikit-learn.

Check out the companion blog post or the Go documentation page for full usage and examples.


Features

Planned

  • Ability to persist trained vectorisers
  • LDA (Latent Dirichlet Allocation) implementation for topic extraction
  • Stemming to treat words with a common root as the same e.g. "go" and "going"
  • Clustering algorithms e.g. Hierarchical, K-means, etc.
  • Classification algorithms e.g. SVM, random forest, etc.

References

  1. Rosario, Barbara. Latent Semantic Indexing: An overview. INFOSYS 240 Spring 2000
  2. Latent Semantic Analysis, a scholarpedia article on LSA written by Tom Landauer, one of the creators of LSA.
  3. Thomo, Alex. Latent Semantic Analysis (Tutorial).
  4. Latent Semantic Indexing. Stanford NLP Course
  5. Charikar, Moses S. Similarity Estimation Techniques from Rounding Algorithms
  6. Kanerva, Pentti, Kristoferson, Jan and Holst, Anders (2000). Random Indexing of Text Samples for Latent Semantic Analysis
  7. Rangan, Venkat. Discovery of Related Terms in a corpus using Reflective Random Indexing
  8. Vasuki, Vidya and Cohen, Trevor. Reflective random indexing for semi-automatic indexing of the biomedical literature

Documentation

Overview

Package nlp provides implementations of selected machine learning algorithms for natural language processing of text corpora. The initial primary focus is on algorithms supporting LSA (Latent Semantic Analysis), often referred to as Latent Semantic Indexing in the context of information retrieval.

The algorithms in the package typically support document input as text strings, which are then encoded as a matrix of numerical feature vectors called a `term document matrix`. Columns in this matrix represent the documents in the corpus and the rows represent terms occurring in the documents. The individual elements within the matrix contain counts of the number of occurrences of each term in the associated document.

This matrix can be manipulated through the application of additional transformations for weighting features, identifying relationships or optimising the data for analysis, information retrieval and/or predictions.

A common transformation weights features to remove natural biases which would otherwise skew results, e.g. commonly occurring words like `the`, `of`, `and`, etc. should carry lower weight than unusual words.

Term document matrices typically have a very large number of dimensions, so transformations are often applied to reduce the dimensionality using techniques such as Locality Sensitive Hashing or Latent Semantic Analysis (typically performed using matrix SVD - `Singular Value Decomposition`), which approximates the original term document matrix with a new matrix of much lower rank (typically around 100 rather than thousands). Truncated SVD is a fundamental part of LSA (Latent Semantic Analysis aka Latent Semantic Indexing) and serves a number of purposes:

1. The reduced dimensionality of the data theoretically requires less memory.

2. As less significant dimensions are removed, there is less `noise` in the data which could have artificially skewed results.

3. Perhaps most importantly, the SVD effectively encodes the co-occurrence of terms within the documents to capture semantic meaning rather than simply the presence (or absence) of words. This combats the problem of synonymy (a common challenge in NLP) where different words can be used to mean the same thing (synonyms). In LSA, documents can have a high degree of semantic similarity with very few words in common.

The post SVD matrix (with each column being a feature vector representing a document within the corpus) can be compared for similarity with each other (for clustering) or with a query (also represented as a feature vector projected into the same dimensional space). Similarity is measured by the angle between the two feature vectors being considered.

Example
testCorpus := []string{
	"The quick brown fox jumped over the lazy dog",
	"hey diddle diddle, the cat and the fiddle",
	"the cow jumped over the moon",
	"the little dog laughed to see such fun",
	"and the dish ran away with the spoon",
}

query := "the brown fox ran around the dog"

vectoriser := NewCountVectoriser(true)
transformer := NewTfidfTransformer()

// set k (the number of dimensions following truncation) to 4
reducer := NewTruncatedSVD(4)

lsiPipeline := NewPipeline(vectoriser, transformer, reducer)

// Transform the corpus into an LSI fitting the model to the documents in the process
lsi, err := lsiPipeline.FitTransform(testCorpus...)
if err != nil {
	fmt.Printf("Failed to process documents because %v", err)
	return
}

// run the query through the same pipeline that was fitted to the corpus to
// project it into the same dimensional space
queryVector, err := lsiPipeline.Transform(query)
if err != nil {
	fmt.Printf("Failed to process documents because %v", err)
	return
}

// iterate over document feature vectors (columns) in the LSI and compare with the
// query vector for similarity.  Similarity is measured by the cosine of the angle
// between the two vectors (cosine similarity)
highestSimilarity := -1.0
var matched int
_, docs := lsi.Dims()
for i := 0; i < docs; i++ {
	similarity := pairwise.CosineSimilarity(queryVector.(mat.ColViewer).ColView(0), lsi.(mat.ColViewer).ColView(i))
	if similarity > highestSimilarity {
		matched = i
		highestSimilarity = similarity
	}
}

fmt.Printf("Matched '%s'", testCorpus[matched])
Output:

Matched 'The quick brown fox jumped over the lazy dog'

Index

Examples

Constants

This section is empty.

Variables

This section is empty.

Functions

func ColDo

func ColDo(m mat.Matrix, fn func(j int, vec mat.Vector))

ColDo executes fn for each column j in m
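
As a short illustrative sketch (the matrix below stands in for any mat.Matrix, e.g. a term document matrix):

// print the Euclidean (L2) norm of each column vector in a matrix
m := mat.NewDense(3, 2, []float64{
	1, 0,
	0, 1,
	1, 1,
})
ColDo(m, func(j int, vec mat.Vector) {
	fmt.Printf("column %d has L2 norm %f\n", j, mat.Norm(vec, 2))
})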

func CreateRandomProjectionTransform

func CreateRandomProjectionTransform(newDims, origDims int, density float64) mat.Matrix

CreateRandomProjectionTransform returns a new random matrix for Random Projections of shape newDims x origDims. The matrix will be randomly populated using probability distributions where density is used as the probability that each element will be populated. Populated values will be randomly selected from [-1, 1] scaled according to the density and dimensions of the matrix.
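
A minimal sketch of applying the returned transform (the dimensions and density below are illustrative only):

// random projection transform of shape newDims x origDims (here 100 x 10000)
// with roughly 1% of elements populated
projection := CreateRandomProjectionTransform(100, 10000, 0.01)

// tdm stands in for a 10000 term x 50 document term document matrix,
// in practice the output of a Vectoriser
tdm := mat.NewDense(10000, 50, nil)

// multiplying the transform with the term document matrix projects it into
// the lower dimensional space, yielding a 100 x 50 matrix
var reduced mat.Dense
reduced.Mul(projection, tdm)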

func MorfModeWord

func MorfModeWord(word string) string

Types

type CountVectoriser

type CountVectoriser struct {
	// Vocabulary is a map of words to indices that point to the row number representing
	// that word in the term document matrix output from the Transform() and FitTransform()
	// methods.  The Vocabulary map is populated by the Fit() or FitTransform() methods
	// based upon the words occurring in the datasets supplied to those methods.  Within
	// Transform(), any words found in the test data set that were not present in the
	// training data set supplied to Fit() will not have an entry in the Vocabulary
	// and will be ignored.
	Vocabulary map[string]int

	// Tokeniser is used to tokenise input text into features.
	Tokeniser Tokeniser
}

CountVectoriser can be used to encode one or more text documents into a term document matrix where each column represents a document within the corpus and each row represents a term present in the training data set. Each element represents the frequency the corresponding term appears in the corresponding document e.g. tf(t, d) = 5 would mean that term t (perhaps the word "dog") appears 5 times in the document d.
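
A minimal sketch of typical usage, in the style of the package example above (the documents and term are illustrative only):

vectoriser := NewCountVectoriser(false)

// fit to, and transform, a tiny corpus in a single step
tdm, err := vectoriser.FitTransform(
	"the quick brown fox",
	"the lazy dog",
)
if err != nil {
	fmt.Printf("Failed to vectorise documents because %v", err)
	return
}

terms, docs := tdm.Dims()
fmt.Printf("%d terms x %d documents\n", terms, docs)

// element (row, column) holds the term frequency tf(t, d); the row for a
// given term can be looked up in the Vocabulary map
row := vectoriser.Vocabulary["fox"]
fmt.Printf("tf('fox', doc 0) = %.0f\n", tdm.At(row, 0))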

func NewCountVectoriser

func NewCountVectoriser(removeStopwords bool) *CountVectoriser

NewCountVectoriser creates a new CountVectoriser. If removeStopwords is true then English stop words will be removed.

func VsNewCountVectoriser

func VsNewCountVectoriser(removeStopwords bool) *CountVectoriser

func (*CountVectoriser) Fit

func (v *CountVectoriser) Fit(train ...string) Vectoriser

Fit processes the supplied training data (a variable number of strings representing documents). Each word appearing inside the training data will be added to the Vocabulary

func (*CountVectoriser) FitTransform

func (v *CountVectoriser) FitTransform(docs ...string) (mat.Matrix, error)

FitTransform is exactly equivalent to calling Fit() followed by Transform() on the same matrix. This is a convenience where separate training data is not being used to fit the model i.e. the model is fitted on the fly to the test data. The returned matrix is a sparse matrix type.

func (*CountVectoriser) Load

func (c *CountVectoriser) Load(r io.Reader) error

func (*CountVectoriser) Save

func (c *CountVectoriser) Save(w io.Writer) error

func (*CountVectoriser) Transform

func (v *CountVectoriser) Transform(docs ...string) (mat.Matrix, error)

Transform transforms the supplied documents into a term document matrix where each column is a feature vector representing one of the supplied documents. Each element represents the frequency with which the associated term for that row occurred within that document. The returned matrix is a sparse matrix type.

type HashingVectoriser

type HashingVectoriser struct {
	NumFeatures int
	Tokeniser   Tokeniser
}

HashingVectoriser can be used to encode one or more text documents into a term document matrix where each column represents a document within the corpus and each row represents a term. Each element represents the frequency the corresponding term appears in the corresponding document e.g. tf(t, d) = 5 would mean that term t (perhaps the word "dog") appears 5 times in the document d.

func NewHashingVectoriser

func NewHashingVectoriser(removeStopwords bool, numFeatures int) *HashingVectoriser

NewHashingVectoriser creates a new HashingVectoriser. If removeStopwords is true then English stop words will be removed. numFeatures specifies the number of features that should be present in produced vectors. Each word in a document is hashed and the mod of the hash and numFeatures gives the row in the matrix corresponding to that word.
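
Because the mapping from word to row is determined purely by the hash, no vocabulary needs to be learned before transforming documents. A sketch (the number of features is illustrative only):

// 2^20 features keeps hash collisions between distinct words unlikely
vectoriser := NewHashingVectoriser(true, 1<<20)

// no prior call to Fit() is required
tdm, err := vectoriser.Transform(
	"the quick brown fox",
	"the lazy dog",
)
if err != nil {
	fmt.Printf("Failed to vectorise documents because %v", err)
	return
}

features, docs := tdm.Dims()
fmt.Printf("%d features x %d documents\n", features, docs)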

func (*HashingVectoriser) Fit

func (v *HashingVectoriser) Fit(train ...string) Vectoriser

Fit does nothing for a HashingVectoriser. As the HashingVectoriser vectorises features based on their hash, it does not require a pre-determined vocabulary to map features to their correct row in the vector. It is effectively stateless and does not require fitting to training data. The method is included for compatibility with other vectorisers.

func (*HashingVectoriser) FitTransform

func (v *HashingVectoriser) FitTransform(docs ...string) (mat.Matrix, error)

FitTransform for a HashingVectoriser is exactly equivalent to calling Transform() with the same matrix. For most vectorisers, Fit() must be called prior to Transform() and so this method is a convenience where separate training data is not used to fit the model. For a HashingVectoriser, fitting is not required and so this method is exactly equivalent to Transform(). As with Fit(), this method is included with the HashingVectoriser for compatibility with other vectorisers. The returned matrix is a sparse matrix type.

func (*HashingVectoriser) Transform

func (v *HashingVectoriser) Transform(docs ...string) (mat.Matrix, error)

Transform transforms the supplied documents into a term document matrix where each column is a feature vector representing one of the supplied documents. Each element represents the frequency with which the associated term for that row occurred within that document. The returned matrix is a sparse matrix type.

type PCA

type PCA struct {
	// K is the number of components
	K int
	// contains filtered or unexported fields
}

PCA calculates the principal components of a matrix, i.e. the axes of greatest variance, and then projects matrices onto those axes. See https://en.wikipedia.org/wiki/Principal_component_analysis for further details.
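
A minimal sketch (the data and number of components are illustrative only):

// project a 4 dimensional dataset (4 rows x 6 column vectors) onto its
// 2 principal components
pca := NewPCA(2)

data := mat.NewDense(4, 6, []float64{
	2.0, 2.1, 1.9, 8.0, 8.2, 7.9,
	1.0, 1.1, 0.9, 4.0, 4.1, 3.9,
	0.5, 0.4, 0.6, 2.0, 2.1, 1.9,
	0.1, 0.2, 0.1, 0.9, 1.0, 1.1,
})

reduced, err := pca.FitTransform(data)
if err != nil {
	fmt.Printf("Failed to project matrix because %v", err)
	return
}

k, c := reduced.Dims()
fmt.Printf("%d components x %d observations\n", k, c)

// variances of the principal component scores
fmt.Println(pca.ExplainedVariance())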

func NewPCA

func NewPCA(k int) *PCA

NewPCA constructs a new Principal Component Analysis transformer to reduce the dimensionality, projecting matrices onto the axes of greatest variance.

func (*PCA) ExplainedVariance

func (p *PCA) ExplainedVariance() []float64

ExplainedVariance returns a slice of float64 values representing the variances of the principal component scores.

func (*PCA) Fit

func (p *PCA) Fit(m mat.Matrix) Transformer

Fit calculates the principal component directions (axis of greatest variance) within the training data which can then be used to project matrices onto those principal components using the Transform() method.

func (*PCA) FitTransform

func (p *PCA) FitTransform(m mat.Matrix) (mat.Matrix, error)

FitTransform is approximately equivalent to calling Fit() followed by Transform() on the same matrix. This is a useful shortcut where separate training data is not being used to fit the model i.e. the model is fitted on the fly to the test data.

func (*PCA) Transform

func (p *PCA) Transform(m mat.Matrix) (mat.Matrix, error)

Transform projects the matrix onto the first K principal components calculated during training (the Fit() method). The returned matrix will be of reduced dimensionality compared to the input (K x c compared to r x c of the input).

type Pipeline

type Pipeline struct {
	Vectoriser   Vectoriser
	Transformers []Transformer
}

Pipeline is a mechanism for composing processing pipelines out of vectorisers and transformation steps. For example, to compose a classic LSA/LSI pipeline (vectorisation -> TFIDF transformation -> Truncated SVD) one could use a Pipeline as follows:

lsaPipeline := NewPipeline(NewCountVectoriser(false), NewTfidfTransformer(), NewTruncatedSVD(100))

func NewPipeline

func NewPipeline(vectoriser Vectoriser, transformers ...Transformer) *Pipeline

NewPipeline constructs a new processing pipeline with the supplied Vectoriser and one or more transformers.

func (*Pipeline) Fit

func (p *Pipeline) Fit(docs ...string) Vectoriser

Fit fits the model(s) to the supplied training data

func (*Pipeline) FitTransform

func (p *Pipeline) FitTransform(docs ...string) (mat.Matrix, error)

FitTransform transforms the supplied documents into a matrix representation of numerical feature vectors fitting the model to the supplied data in the process.

func (*Pipeline) Transform

func (p *Pipeline) Transform(docs ...string) (mat.Matrix, error)

Transform transforms the supplied documents into a matrix representation of numerical feature vectors using a model(s) previously fitted to supplied training data.

type RIBasis

type RIBasis int

RIBasis represents the initial basis for the elemental vectors used for Random Indexing

const (
	// RowBasedRI indicates rows (terms in a term-document matrix)
	// forming the initial basis for elemental vectors in Random Indexing.
	// This is the basis used for Random Indexing of documents; Reflective
	// Random Indexing can use either rows or columns as the initial
	// basis for elemental vectors.
	RowBasedRI RIBasis = iota

	// ColBasedRI represents columns (documents/contexts in a term-document
	// matrix) forming the initial basis for elemental vectors in Random Indexing
	ColBasedRI
)

type RandomIndexing

type RandomIndexing struct {
	// K specifies the number of dimensions for the semantic space
	K int

	// Density specifies the proportion of non-zero elements in the
	// elemental vectors
	Density float64

	// Type specifies the initial basis for the elemental vectors,
	// i.e. whether they initially represent the rows or columns.
	// For Random Indexing this should be RowBasedRI; for RRI
	// (Reflective Random Indexing) it can be either RowBasedRI or
	// ColBasedRI.
	Type RIBasis

	// Reflections specifies the number of reflective training cycles
	// to run during fitting for RRI (Reflective Random Indexing).
	// If Type is ColBasedRI then Reflections must be >= 1
	Reflections int
	// contains filtered or unexported fields
}

RandomIndexing is a method of dimensionality reduction similar to TruncatedSVD and PCA. Random Indexing is designed to address the limitations of very high dimensional vector space models for modelling term co-occurrence in language processing, such as the SVD used by LSA/LSI (Latent Semantic Analysis/Latent Semantic Indexing). In implementation it bears some similarity to other random projection techniques such as those implemented in RandomProjection and SignRandomProjection. The RandomIndexing type can also be used to perform Reflective Random Indexing, which extends the Random Indexing model with additional training cycles to support indirect inferences, i.e. finding synonyms where the words do not appear together in documents.
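
A minimal sketch of applying it to a term document matrix (the documents, dimensionality and density are illustrative only; in practice k would be much smaller than the vocabulary size):

// vectorise a small corpus into a term document matrix
tdm, err := NewCountVectoriser(true).FitTransform(
	"the quick brown fox jumped over the lazy dog",
	"the cow jumped over the moon",
	"the little dog laughed to see such fun",
)
if err != nil {
	fmt.Printf("Failed to vectorise documents because %v", err)
	return
}

// project the term document matrix into a 100 dimensional semantic space
// using elemental vectors with 1% non-zero elements
indexer := NewRandomIndexing(100, 0.01)
reduced, err := indexer.FitTransform(tdm)
if err != nil {
	fmt.Printf("Failed to index documents because %v", err)
	return
}

k, docs := reduced.Dims()
fmt.Printf("%d x %d\n", k, docs)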

func NewRandomIndexing

func NewRandomIndexing(k int, density float64) *RandomIndexing

NewRandomIndexing returns a new RandomIndexing transformer configured to transform term document matrices into k dimensional space. The density parameter specifies the density of the elemental vectors used to project the input matrix into lower dimensional space i.e. the proportion of elements that are non-zero. As RandomIndexing makes use of sparse matrix formats, specifying lower values for density will result in lower memory usage.

func NewReflectiveRandomIndexing

func NewReflectiveRandomIndexing(k int, t RIBasis, reflections int, density float64) *RandomIndexing

NewReflectiveRandomIndexing returns a new RandomIndexing type configured for Reflective Random Indexing. Reflective Random Indexing applies additional (reflective) training cycles on top of Random Indexing to capture indirect inferences (synonyms), i.e. similarity between terms that do not directly co-occur within the same context/document. t specifies the basis for the reflective random indexing i.e. whether the initial, random elemental vectors should represent columns or rows. reflections is the number of training cycles to apply. If t == RowBasedRI and reflections == 0 then the created type will perform conventional Random Indexing. NewReflectiveRandomIndexing will panic if t == ColBasedRI and reflections < 1 because column based Reflective Random Indexing requires at least one reflective training cycle to generate the row based elemental vectors required for RI/RRI.
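
For example (parameter values are illustrative only):

// column (document) based Reflective Random Indexing with 2 reflective
// training cycles, projecting into a 100 dimensional space with 1% density
rri := NewReflectiveRandomIndexing(100, ColBasedRI, 2, 0.01)

// row based with 0 reflections is conventional Random Indexing,
// equivalent to NewRandomIndexing(100, 0.01)
ri := NewReflectiveRandomIndexing(100, RowBasedRI, 0, 0.01)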

func (*RandomIndexing) Fit

func (r *RandomIndexing) Fit(m mat.Matrix) Transformer

Fit trains the model, creating random elemental vectors that will later be used to construct the new projected feature vectors in the reduced semantic dimensional space. If configured for Reflective Random Indexing then Fit may actually run multiple training cycles as specified during construction.

func (*RandomIndexing) FitTransform

func (r *RandomIndexing) FitTransform(m mat.Matrix) (mat.Matrix, error)

FitTransform is approximately equivalent to calling Fit() followed by Transform() on the same matrix. This is a useful shortcut where separate training data is not being used to fit the model i.e. the model is fitted on the fly to the test data. The returned matrix is a sparse CSC format matrix of shape k x c.

func (*RandomIndexing) Transform

func (r *RandomIndexing) Transform(matrix mat.Matrix) (mat.Matrix, error)

Transform applies the transform, projecting matrix into the lower dimensional semantic space. The output matrix will be of shape k x c and will be a sparse CSC format matrix.

type RandomProjection

type RandomProjection struct {
	K       int
	Density float64
	// contains filtered or unexported fields
}

RandomProjection is a method of dimensionality reduction based upon the Johnson–Lindenstrauss lemma stating that a small set of points in a high-dimensional space can be embedded into a space of much lower dimension in such a way that distances between the points are nearly preserved.

The technique projects the original matrix orthogonally onto a random subspace, transforming the elements of the original matrix into a lower dimensional representation. Computing orthogonal matrices is expensive and so this technique uses specially generated random matrices (hence the name) following the principle that in high dimensional spaces, there are lots of nearly orthogonal matrices.

func NewRandomProjection

func NewRandomProjection(k int, density float64) *RandomProjection

NewRandomProjection creates and returns a new RandomProjection transformer. The RandomProjection will use a specially generated random matrix of the specified density and dimensionality k to perform the transform to k dimensional space.

func (*RandomProjection) Fit

func (r *RandomProjection) Fit(m mat.Matrix) Transformer

Fit creates the random (almost) orthogonal matrix used to project input matrices into the new reduced dimensional subspace.

func (*RandomProjection) FitTransform

func (r *RandomProjection) FitTransform(m mat.Matrix) (mat.Matrix, error)

FitTransform is approximately equivalent to calling Fit() followed by Transform() on the same matrix. This is a useful shortcut where separate training data is not being used to fit the model i.e. the model is fitted on the fly to the test data. The returned matrix is a sparse CSR format matrix of shape k x c.

func (*RandomProjection) Transform

func (r *RandomProjection) Transform(m mat.Matrix) (mat.Matrix, error)

Transform applies the transformation, projecting the input matrix into the reduced dimensional subspace. The transformed matrix will be a sparse CSR format matrix of shape k x c.

type RegExpTokeniser

type RegExpTokeniser struct {
	RegExp    *regexp.Regexp
	StopWords map[string]bool
}

RegExpTokeniser implements the Tokeniser interface using a basic regular expression pattern, acting as a unigram word tokeniser with optional stop word removal.

func (*RegExpTokeniser) ForEachIn

func (t *RegExpTokeniser) ForEachIn(text string, f func(token string))

ForEachIn iterates over each token within text and invokes function f with the token as parameter. If StopWords is not nil then any tokens from text present in StopWords will be ignored.

func (*RegExpTokeniser) Tokenise

func (t *RegExpTokeniser) Tokenise(text string) []string

Tokenise returns a slice of all the tokens contained in string text. If StopWords is not nil then any tokens from text present in StopWords will be removed from the slice.

type SignRandomProjection

type SignRandomProjection struct {
	// Bits represents the number of bits the output vectors should
	// be in length and hence the number of random projections needed
	// for the transformation
	Bits int
	// contains filtered or unexported fields
}

SignRandomProjection represents a transform of a matrix into a lower dimensional space. Sign Random Projection is a method of Locality Sensitive Hashing (LSH) sometimes referred to as the random hyperplane method. A set of random hyperplanes is generated and input matrices are then expressed relative to those random projections as follows:

For each column vector in the input matrix, construct a corresponding output
bit vector with each bit (i) calculated as follows:
	if dot(vector, projection[i]) > 0
		bit[i] = 1
	else
		bit[i] = 0

Unlike other methods of random projection, this method uses a single bit in the output matrix to represent the sign of the result of the comparison (dot product) with each projection and so is very space and computationally efficient. Hamming similarity (and distance) between the transformed vectors in this new space can approximate angular similarity (and distance) (which is strongly related to cosine similarity) of the associated vectors from the original space with significant reductions in both memory usage and processing time.
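
A sketch comparing two documents by the Hamming similarity of their binary fingerprints (the similarity calculation is written out longhand purely for illustration):

// vectorise a tiny corpus and hash each document (column) down to a
// 256 bit binary fingerprint
tdm, err := NewCountVectoriser(true).FitTransform(
	"the quick brown fox jumped over the lazy dog",
	"the cow jumped over the moon",
)
if err != nil {
	fmt.Printf("Failed to vectorise documents because %v", err)
	return
}

srp := NewSignRandomProjection(256)
fingerprints, err := srp.FitTransform(tdm)
if err != nil {
	fmt.Printf("Failed to transform documents because %v", err)
	return
}

// Hamming similarity: the proportion of bit positions in which the
// two fingerprints agree
bits, _ := fingerprints.Dims()
matches := 0
for i := 0; i < bits; i++ {
	if fingerprints.At(i, 0) == fingerprints.At(i, 1) {
		matches++
	}
}
fmt.Printf("Hamming similarity = %.2f\n", float64(matches)/float64(bits))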

func NewSignRandomProjection

func NewSignRandomProjection(bits int) *SignRandomProjection

NewSignRandomProjection constructs a new SignRandomProjection transformer to reduce the dimensionality. The transformer uses a number of random hyperplanes specified by bits, which is also the dimensionality of the output, transformed matrices.

func (*SignRandomProjection) Fit

func (s *SignRandomProjection) Fit(m mat.Matrix) Transformer

Fit creates the random hyperplanes from the input training data matrix, m, and stores the hyperplanes as a transform to apply to matrices.

func (*SignRandomProjection) FitTransform

func (s *SignRandomProjection) FitTransform(m mat.Matrix) (mat.Matrix, error)

FitTransform is approximately equivalent to calling Fit() followed by Transform() on the same matrix. This is a useful shortcut where separate training data is not being used to fit the model i.e. the model is fitted on the fly to the test data. The returned matrix is a Binary matrix or BinaryVec type depending upon whether m is Matrix or Vector.

func (*SignRandomProjection) Transform

func (s *SignRandomProjection) Transform(m mat.Matrix) (mat.Matrix, error)

Transform applies the transform decomposed from the training data matrix in Fit() to the input matrix. The columns in the resulting output matrix will be a low dimensional binary representation of the columns within the original i.e. a hash or fingerprint that can be quickly and efficiently compared with other similar vectors. Hamming similarity in the new dimensional space can be used to approximate Cosine similarity between the vectors of the original space. The returned matrix is a Binary matrix or BinaryVec type depending upon whether m is Matrix or Vector.

type SimHash

type SimHash struct {
	// contains filtered or unexported fields
}

SimHash implements the SimHash Locality Sensitive Hashing (LSH) algorithm using sign random projections (Moses S. Charikar, https://www.cs.princeton.edu/courses/archive/spr04/cos598B/bib/CharikarEstim.pdf). The distance between the original vectors is preserved through the hashing process such that hashed vectors can be compared using Hamming Similarity for a faster, more space efficient, approximation of Cosine Similarity for the original vectors.

func NewSimHash

func NewSimHash(bits int, dim int) *SimHash

NewSimHash constructs a new SimHash creating a set of locality sensitive hash functions which are combined to accept input vectors of length dim and produce hashed binary vector fingerprints of length bits. This method creates a series of random hyperplanes which are then compared to each input vector to produce the output hashed binary vector encoding the input vector's location in vector space relative to the hyperplanes. Each bit in the output vector corresponds to the sign (1/0 for +/-) of the result of the dot product comparison with each random hyperplane.

func (*SimHash) Hash

func (h *SimHash) Hash(v mat.Vector) *sparse.BinaryVec

Hash accepts a Vector and outputs a BinaryVec (which also implements the Gonum Vector interface). This method will panic if the input vector is of a different length than the dim parameter used when constructing the SimHash.
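
A minimal sketch (vector values and sizes are illustrative only):

// hash 4 dimensional vectors down to 64 bit binary fingerprints
hasher := NewSimHash(64, 4)

v1 := mat.NewVecDense(4, []float64{1, 0, 3, 5})
v2 := mat.NewVecDense(4, []float64{1, 0, 3, 4})

h1 := hasher.Hash(v1)
h2 := hasher.Hash(v2)

// similar input vectors should produce fingerprints that agree in most
// bit positions
matches := 0
for i := 0; i < h1.Len(); i++ {
	if h1.AtVec(i) == h2.AtVec(i) {
		matches++
	}
}
fmt.Printf("fingerprints agree in %d of %d bits\n", matches, h1.Len())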

type TfidfTransformer

type TfidfTransformer struct {
	// contains filtered or unexported fields
}

TfidfTransformer takes a raw term document matrix and weights each raw term frequency value depending upon how commonly it occurs across all documents within the corpus. For example a very commonly occurring word like `the` is likely to occur in all documents and so would be weighted down. More precisely, TfidfTransformer applies a tf-idf algorithm to the matrix where each term frequency is multiplied by the inverse document frequency. Inverse document frequency is calculated as log(n/df) where df is the number of documents in which the term occurs and n is the total number of documents within the corpus. We add 1 to both n and df before division to prevent division by zero.
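
Expressed in code, the weighting described above is equivalent to the following (a literal transcription of the formula for illustration; the transformer's internal implementation may differ in detail):

// tfidf weights a raw term frequency tf for a term appearing in df of the
// n documents in the corpus
func tfidf(tf float64, df, n int) float64 {
	idf := math.Log(float64(1+n) / float64(1+df))
	return tf * idf
}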

func NewTfidfTransformer

func NewTfidfTransformer() *TfidfTransformer

NewTfidfTransformer constructs a new TfidfTransformer.

func (*TfidfTransformer) Fit

func (t *TfidfTransformer) Fit(mat mat.Matrix) Transformer

Fit takes a training term document matrix, counts term occurrences across all documents and constructs an inverse document frequency transform to apply to matrices in subsequent calls to Transform().

func (*TfidfTransformer) FitTransform

func (t *TfidfTransformer) FitTransform(mat mat.Matrix) (mat.Matrix, error)

FitTransform is exactly equivalent to calling Fit() followed by Transform() on the same matrix. This is a convenience where separate training data is not being used to fit the model i.e. the model is fitted on the fly to the test data. The returned matrix is a sparse matrix type.

func (*TfidfTransformer) GetTransform

func (t *TfidfTransformer) GetTransform() *sparse.DIA

func (*TfidfTransformer) Load

func (t *TfidfTransformer) Load(r io.Reader) error

Load binary deserialises the previously serialised model into the receiver. This is useful for loading a previously trained and saved model from another context (e.g. offline training) for use within another context (e.g. production) for reproducible results. Load should only be performed with trusted data.

func (TfidfTransformer) Save

func (t TfidfTransformer) Save(w io.Writer) error

Save binary serialises the model and writes it into w. This is useful for persisting a trained model to disk so that it may be loaded (using the Load() method) in another context (e.g. production) for reproducible results.

func (*TfidfTransformer) Transform

func (t *TfidfTransformer) Transform(mat mat.Matrix) (mat.Matrix, error)

Transform applies the inverse document frequency (IDF) transform by multiplying each term frequency by its corresponding IDF value. This has the effect of weighting each term frequency according to how often it appears across the whole document corpus so that naturally frequently occurring words are given less weight than uncommon ones. The returned matrix is a sparse matrix type.

type Tokeniser

type Tokeniser interface {
	// ForEachIn iterates over each token within text and invokes function
	// f with the token as parameter
	ForEachIn(text string, f func(token string))

	// Tokenise returns a slice of all the tokens contained in string
	// text
	Tokenise(text string) []string
}

Tokeniser is the interface for tokenisers, allowing substitution of different tokenisation strategies (e.g. regular expression based) and also supporting different token types (e.g. n-grams) and languages.
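
For example, a custom tokenisation strategy can be supplied by implementing the interface and assigning it to a vectoriser's exported Tokeniser field (a sketch; the whitespace tokeniser below is purely illustrative):

// whitespaceTokeniser splits text on white space with no stop word removal
type whitespaceTokeniser struct{}

func (t whitespaceTokeniser) ForEachIn(text string, f func(token string)) {
	for _, token := range strings.Fields(strings.ToLower(text)) {
		f(token)
	}
}

func (t whitespaceTokeniser) Tokenise(text string) []string {
	return strings.Fields(strings.ToLower(text))
}

// use it in place of the default tokeniser
vectoriser := NewCountVectoriser(false)
vectoriser.Tokeniser = whitespaceTokeniser{}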

func NewTokeniser

func NewTokeniser(removeStopwords bool) Tokeniser

NewTokeniser returns a new, default Tokeniser implementation. If removeStopwords is true then stop words will be removed from tokens

func VsNewTokeniser

func VsNewTokeniser(removeStopwords bool) Tokeniser

type Transformer

type Transformer interface {
	Fit(mat.Matrix) Transformer
	Transform(mat mat.Matrix) (mat.Matrix, error)
	FitTransform(mat mat.Matrix) (mat.Matrix, error)
}

Transformer provides a common interface for transformer steps.

type TruncatedSVD

type TruncatedSVD struct {
	// Components is the truncated term matrix (matrix U of the Singular Value Decomposition
	// (A=USV^T)).  The matrix will be of size m x k, where m is the number of unique terms
	// in the training data and k is the smallest of the attribute K, m and n (the number
	// of documents in the training data).
	Components *mat.Dense

	// K is the number of dimensions to which the output, transformed, matrix should be
	// truncated.  The matrix output by the FitTransform() and Transform() methods will
	// be n rows by min(m, n, K) columns, where n is the number of columns in the original,
	// input matrix and min(m, n, K) is the smallest of m, n and K, with m the number of
	// rows in the original, input matrix.
	K int
}

TruncatedSVD implements truncated Singular Value Decomposition (SVD) factorisation of matrices, producing an approximation of the input matrix at a lower rank. This is a core component of LSA (Latent Semantic Analysis).

func NewTruncatedSVD

func NewTruncatedSVD(k int) *TruncatedSVD

NewTruncatedSVD creates a new TruncatedSVD transformer with K (the truncated dimensionality) being set to the specified value k

func (*TruncatedSVD) Fit

func (t *TruncatedSVD) Fit(mat mat.Matrix) Transformer

Fit performs the SVD factorisation on the input training data matrix, mat, and stores the output term matrix as a transform to apply to matrices passed to the Transform() method.

func (*TruncatedSVD) FitTransform

func (t *TruncatedSVD) FitTransform(m mat.Matrix) (mat.Matrix, error)

FitTransform is approximately equivalent to calling Fit() followed by Transform() on the same matrix. This is a useful shortcut where separate training data is not being used to fit the model i.e. the model is fitted on the fly to the test data. The returned matrix is a dense matrix type.

func (*TruncatedSVD) Load

func (t *TruncatedSVD) Load(r io.Reader) error

Load binary deserialises the previously serialised model into the receiver. This is useful for loading a previously trained and saved model from another context (e.g. offline training) for use within another context (e.g. production) for reproducible results. Load should only be performed with trusted data.

func (TruncatedSVD) Save

func (t TruncatedSVD) Save(w io.Writer) error

Save binary serialises the model and writes it into w. This is useful for persisting a trained model to disk so that it may be loaded (using the Load() method) in another context (e.g. production) for reproducible results.
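
A sketch of a save/load round trip between contexts (error handling abbreviated; the training matrix and file name are illustrative only):

// offline: fit the model to a previously constructed term document
// matrix tdm and persist it
svd := NewTruncatedSVD(100)
svd.Fit(tdm)

f, err := os.Create("svd.model")
if err != nil {
	fmt.Printf("Failed to create file because %v", err)
	return
}
if err := svd.Save(f); err != nil {
	fmt.Printf("Failed to save model because %v", err)
	return
}
f.Close()

// later, in another context: load the persisted model and use it to
// transform new matrices into the same space
restored := NewTruncatedSVD(100)
f, err = os.Open("svd.model")
if err != nil {
	fmt.Printf("Failed to open file because %v", err)
	return
}
if err := restored.Load(f); err != nil {
	fmt.Printf("Failed to load model because %v", err)
	return
}
f.Close()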

func (*TruncatedSVD) Transform

func (t *TruncatedSVD) Transform(m mat.Matrix) (mat.Matrix, error)

Transform applies the transform decomposed from the training data matrix in Fit() to the input matrix. The resulting output matrix will be the closest approximation to the input matrix at a reduced rank. The returned matrix is a dense matrix type.

type Vectoriser

type Vectoriser interface {
	Fit(...string) Vectoriser
	Transform(...string) (mat.Matrix, error)
	FitTransform(...string) (mat.Matrix, error)
}

Vectoriser provides a common interface for vectorisers that take a variable set of string arguments and produce a numerical matrix of features.

type VsTokeniser

type VsTokeniser struct {
	ModeWord        bool
	RemoveStopwords bool
}

func (*VsTokeniser) ForEachIn

func (t *VsTokeniser) ForEachIn(text string, f func(token string))

ForEachIn iterates over each token within text and invokes function f with the token as parameter

func (*VsTokeniser) Tokenise

func (t *VsTokeniser) Tokenise(text string) []string

Tokenise returns a slice of all the tokens contained in string text. If RemoveStopwords is true then any stop words will be removed from the slice.

Directories

Path Synopsis
measures
