text

package v0.0.0-...-00e0c84
Published: Jul 15, 2022 License: MIT Imports: 13 Imported by: 28

README

Text Classification

import "github.com/cdipaolo/goml/text"


This package implements text classification algorithms. For algorithms that could otherwise be used numerically (most, if not all, of them), this package makes working with text documents easier than hand-rolling a bag-of-words model and integrating it with other models.

implemented models

  • multiclass naive bayes
  • term frequency - inverse document frequency
    • this model lets you easily calculate keywords from documents, as well as general importance scores for any word (within its document) that you can throw at it!
    • because this is so similar to Bayes under the hood, you train TFIDF by casting a trained Bayes model to it, such as tfidf := TFIDF(*myNaiveBayesModel) (see the sketch after this list)
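
A minimal sketch of that cast-and-query flow, assuming myNaiveBayesModel is a trained *NaiveBayes (the variable name and input sentences are just examples):

// cast the trained Bayes model to a TFIDF model
tfidf := TFIDF(*myNaiveBayesModel)

// score a single word against a sentence
score := tfidf.TFIDF("city", "I love the city")

// pull the three most important words from a sentence,
// returned sorted by importance
keywords := tfidf.MostImportantWords("I love the city by the bay", 3)

fmt.Println(score, keywords)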

example online naive bayes sentiment analysis

This is the general text classification example from the GoDoc package comment. Look there and at the tests for more detailed and varied examples of usage:

// create the channel of data and errors
stream := make(chan base.TextDatapoint, 100)
errors := make(chan error)

// make a new NaiveBayes model with
// 2 classes expected (classes in
// datapoints will now expect {0,1}.
// in general, given n as the classes
// variable, the model will expect
// datapoint classes in {0,...,n-1})
//
// Note that the model is filtering
// the text to omit anything except
// words and numbers (and spaces
// obviously)
model := NewNaiveBayes(stream, 2, base.OnlyWordsAndNumbers)

go model.OnlineLearn(errors)

stream <- base.TextDatapoint{
	X: "I love the city",
	Y: 1,
}

stream <- base.TextDatapoint{
	X: "I hate Los Angeles",
	Y: 0,
}

stream <- base.TextDatapoint{
	X: "My mother is not a nice lady",
	Y: 0,
}

close(stream)

for {
	err, more := <-errors
	if more {
		fmt.Printf("Error passed: %v", err)
	} else {
		// the errors channel was closed;
		// training is done!
		break
	}
}

// now you can predict like normal
class := model.Predict("My mother is in Los Angeles") // 0

Documentation

Overview

Package text holds models which make text classification easy. They are regular models, but take strings as arguments so you can feed in documents rather than large, hand-constructed word vectors. Although models might represent the words as these vectors, the munging of a document is hidden from the user.

The simplest model, although surprisingly effective, is Naive Bayes. If you want to read more about the specific model, check out the docs for the NaiveBayes struct/model.

The following example is an online Naive Bayes model used for sentiment analysis.

Example Online Naive Bayes Text Classifier (multiclass):

// create the channel of data and errors
stream := make(chan base.TextDatapoint, 100)
errors := make(chan error)

// make a new NaiveBayes model with
// 2 classes expected (classes in
// datapoints will now expect {0,1}.
// in general, given n as the classes
// variable, the model will expect
// datapoint classes in {0,...,n-1})
//
// Note that the model is filtering
// the text to omit anything except
// words and numbers (and spaces
// obviously)
model := NewNaiveBayes(stream, 2, base.OnlyWordsAndNumbers)

go model.OnlineLearn(errors)

stream <- base.TextDatapoint{
	X: "I love the city",
	Y: 1,
}

stream <- base.TextDatapoint{
	X: "I hate Los Angeles",
	Y: 0,
}

stream <- base.TextDatapoint{
	X: "My mother is not a nice lady",
	Y: 0,
}

close(stream)

for {
	err, more := <-errors
	if more {
		fmt.Printf("Error passed: %v", err)
	} else {
		// the errors channel was closed;
		// training is done!
		break
	}
}

// now you can predict like normal
class := model.Predict("My mother is in Los Angeles") // 0

Index

Constants

This section is empty.

Variables

This section is empty.

Functions

This section is empty.

Types

type Frequencies

type Frequencies []Frequency

Frequencies is a slice of word frequencies (stored as a separate type so it can be sorted)

func TermFrequencies

func TermFrequencies(document []string) Frequencies

TermFrequencies gives the TermFrequency of all words in a document, and is more efficient at doing so than calling that function multiple times
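
A small usage sketch, where the pre-tokenized document below is made up for illustration:

doc := []string{"i", "love", "the", "city", "the", "city"}

// compute the term frequency of every word in the document
freqs := TermFrequencies(doc)

for _, f := range freqs {
	fmt.Printf("%s: %f\n", f.Word, f.Frequency)
}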

func (Frequencies) Len

func (f Frequencies) Len() int

Len gives the length of a frequency array

func (Frequencies) Less

func (f Frequencies) Less(i, j int) bool

Less reports whether the ith element of a frequency list is less than the jth element by comparing their TFIDF values

func (Frequencies) Swap

func (f Frequencies) Swap(i, j int)

Swap swaps two indexed values in a frequency slice

type Frequency

type Frequency struct {
	Word      string  `json:"word"`
	Frequency float64 `json:"frequency,omitempty"`
	TFIDF     float64 `json:"tfidf_score,omitempty"`
}

Frequency holds word frequency information so you don't have to hold a map[string]float64 and can, therefore, sort

type NaiveBayes

type NaiveBayes struct {
	// Words holds a map of words
	// to their corresponding Word
	// structure
	Words concurrentMap `json:"words"`

	// Count holds the number of times
	// class i was seen as Count[i]
	Count []uint64 `json:"count"`

	// Probabilities holds the probability
	// that class Y is class i as
	// Probabilities[i]
	Probabilities []float64 `json:"probabilities"`

	// DocumentCount holds the number of
	// documents that have been seen
	DocumentCount uint64 `json:"document_count"`

	// DictCount holds the size of the
	// NaiveBayes model's vocabulary
	DictCount uint64 `json:"vocabulary_size"`

	// Tokenizer is used by the model
	// to split the input into tokens
	Tokenizer Tokenizer `json:"tokenizer"`

	// Output is the io.Writer used for logging
	// and printing. Defaults to os.Stdout.
	Output io.Writer `json:"-"`
	// contains filtered or unexported fields
}

NaiveBayes is a general classification model that calculates the probability that a datapoint is part of a class by using Bayes Rule:

P(y|x) = P(x|y)*P(y)/P(x)

The unique part of this model is that it assumes words are unrelated to each other. For example, the probability of seeing the word 'penis' in spam emails if you've already seen 'viagra' might be different than if you hadn't seen it. The model ignores this fact because computing the full Bayesian model would take much longer, and would grow significantly with each word you see.

https://en.wikipedia.org/wiki/Naive_Bayes_classifier http://cs229.stanford.edu/notes/cs229-notes2.pdf

Based on Bayes Rule, we can easily calculate the numerator: P(x|y) is just the number of times x is seen where the class is y, and P(y) is just the number of times y = class divided by the number of training examples/words. The denominator is also easy to calculate, but if you recognize that it's just a constant (the probability of seeing a certain document given the dataset), we can make the following transformation to be able to classify without as much computation:

Class(x) = argmax_c{P(y = c) * ∏P(x|y = c)}

And we can use logarithmic transformations to make this calculation more computer-practical (multiplying a bunch of probabilities on [0,1] will always result in a very small number which could easily underflow the float value):

Class(x) = argmax_c{ log(P(y = c)) + Σ log(P(x|y = c)) }

Much better. That's our model!
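
As a rough illustration of that last equation (this is not the package's internal code), the log-space argmax is just a couple of loops, assuming you already have the log prior per class and the log conditional probability per word per class, and using the standard library math package:

// logPrior[c] holds log(P(y = c)); logCond[c][word] holds log(P(word | y = c))
func classify(words []string, logPrior []float64, logCond []map[string]float64) int {
	best, bestScore := 0, math.Inf(-1)
	for c := range logPrior {
		score := logPrior[c]
		for _, w := range words {
			score += logCond[c][w]
		}
		if score > bestScore {
			best, bestScore = c, score
		}
	}
	return best
}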

func NewNaiveBayes

func NewNaiveBayes(stream <-chan base.TextDatapoint, classes uint8, sanitize func(rune) bool) *NaiveBayes

NewNaiveBayes returns a NaiveBayes model with the given number of classes instantiated, ready to learn off the given data stream. The sanitization function is set to the given function; it must comply with the transform.RemoveFunc interface.

func (*NaiveBayes) OnlineLearn

func (b *NaiveBayes) OnlineLearn(errors chan<- error)

OnlineLearn lets the NaiveBayes model learn from the datastream, waiting for new data to come into the stream from a separate goroutine

func (*NaiveBayes) PersistToFile

func (b *NaiveBayes) PersistToFile(path string) error

PersistToFile takes in an absolute filepath and saves the parameter vector θ to the file, which can be restored later. The function will take paths from the current directory, but functions better with absolute paths.

The data is stored as JSON because it's one of the most efficient storage methods (you only need one extra comma per feature, plus two brackets, total!) and it's extendable.
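
A sketch of the save/load round trip, assuming model is the trained *NaiveBayes from the example above; the file path is arbitrary, and passing a nil stream to NewNaiveBayes is an assumption here since the restored model is only used for prediction:

// save the trained model's parameters as JSON
err := model.PersistToFile("/tmp/sentiment_model.json")
if err != nil {
	fmt.Printf("could not persist model: %v\n", err)
}

// later: rebuild a model (nil stream: we only predict)
// and load the saved parameters back in
restored := NewNaiveBayes(nil, 2, base.OnlyWordsAndNumbers)
err = restored.RestoreFromFile("/tmp/sentiment_model.json")
if err != nil {
	fmt.Printf("could not restore model: %v\n", err)
}

class := restored.Predict("I love the city")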

func (*NaiveBayes) Predict

func (b *NaiveBayes) Predict(sentence string) uint8

Predict takes in a document, predicts the class of the document based on the training data passed so far, and returns the class estimated for the document.

func (*NaiveBayes) Probability

func (b *NaiveBayes) Probability(sentence string) (uint8, float64)

Probability takes in a small document and returns the estimated class of the document based on the model, as well as the probability that the document is part of that class.

NOTE: you should only use this for small documents because, as discussed in the docs for the model, the probability will often times underflow because you are multiplying together a bunch of probabilities which range on [0,1]. As such, the returned float could be NaN, and the predicted class could be 0 always.

Basically, use Predict to be robust for larger documents. Use Probability only on relatively small (MAX of maybe a dozen words - basically just sentences and words) documents.
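
A short sketch of the distinction, reusing the model from the example above (the sentences are arbitrary):

// robust for longer documents
class := model.Predict("I hate the traffic and the smog in Los Angeles")

// fine for short sentences, and also returns the estimated probability
class, prob := model.Probability("I hate Los Angeles")

fmt.Println(class, prob)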

func (*NaiveBayes) Restore

func (b *NaiveBayes) Restore(data []byte) error

Restore takes the bytes of a NaiveBayes model and restores a model from it. It defaults the sanitizer to base.OnlyWordsAndNumbers and the tokenizer to a SimpleTokenizer that splits on spaces.

This would be useful if training a model and saving it into a project using go-bindata (look it up) so you don't have to persist a large file and deal with paths on a production system. This option is included in text models vs. others because the text models usually have much larger storage requirements.
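
A sketch (not go-bindata specifically): if you already have the persisted model's JSON bytes in memory, you can restore straight from them (the nil stream passed to NewNaiveBayes is an assumption, as in the sketch above):

// modelJSON would hold the raw JSON of a previously persisted model,
// e.g. embedded into the binary or fetched from anywhere
var modelJSON []byte

restored := NewNaiveBayes(nil, 2, base.OnlyWordsAndNumbers)
if err := restored.Restore(modelJSON); err != nil {
	fmt.Printf("could not restore model: %v\n", err)
}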

func (*NaiveBayes) RestoreFromFile

func (b *NaiveBayes) RestoreFromFile(path string) error

RestoreFromFile takes in a path to a saved parameter vector theta and assigns it to the model it's operating on. The only parameters not in the vector are the sanitization and tokenization functions, which default to base.OnlyWordsAndNumbers and SimpleTokenizer{SplitOn: " "}

The path must be an absolute path or a path from the current directory

This would be useful for persisting a trained model between runs.

func (*NaiveBayes) RestoreWithFuncs

func (b *NaiveBayes) RestoreWithFuncs(data io.Reader, sanitizer func(rune) bool, tokenizer Tokenizer) error

RestoreWithFuncs takes raw JSON data of a model and restores a model from it. The tokenizer and sanitizer passed in will be assigned to the restored model.

func (*NaiveBayes) String

func (b *NaiveBayes) String() string

String implements the fmt interface for clean printing. Here we're using it to print the model as the equation h(θ)=... where h is the perceptron hypothesis model.

func (*NaiveBayes) UpdateSanitize

func (b *NaiveBayes) UpdateSanitize(sanitize func(rune) bool)

UpdateSanitize updates the NaiveBayes model's text sanitization transformation function

func (*NaiveBayes) UpdateStream

func (b *NaiveBayes) UpdateStream(stream chan base.TextDatapoint)

UpdateStream updates the NaiveBayes model's text datastream

func (*NaiveBayes) UpdateTokenizer

func (b *NaiveBayes) UpdateTokenizer(tokenizer Tokenizer)

UpdateTokenizer updates the NaiveBayes model's tokenizer. The default implementation will convert the input to lower case and split on the space character.

type SimpleTokenizer

type SimpleTokenizer struct {
	SplitOn string
}

SimpleTokenizer splits sentences into tokens delimited by its SplitOn string – space, for example

func (*SimpleTokenizer) Tokenize

func (t *SimpleTokenizer) Tokenize(sentence string) []string

Tokenize splits input sentences into a lowercase slice of strings. The tokenizer's SplitOn string is used as the delimiter.
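
For instance, to split on hyphens instead of spaces (the delimiter and sentence are arbitrary, and model is assumed to be an existing *NaiveBayes):

t := &SimpleTokenizer{SplitOn: "-"}

tokens := t.Tokenize("Los-Angeles-Is-Large")
// tokens is now the lowercased slice ["los" "angeles" "is" "large"]

// swap the tokenizer on the existing model
model.UpdateTokenizer(t)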

type TFIDF

type TFIDF NaiveBayes

TFIDF is a Term Frequency - Inverse Document Frequency model that is created from a trained NaiveBayes model (they are very similar under the hood, so you can just train NaiveBayes and convert it into TFIDF)

This is not necessarily a probabilistic model, and it doesn't give classification. It can be used to determine the 'importance' of a word in a document, though, which is useful in, say, keyword tagging.

Term frequency is basically just the adjusted frequency of a word within a document/sentence: termFrequency(word, doc) = 0.5 + 0.5 * word.Count / max{ w.Count | w ∈ doc }

Inverse document frequency is basically how little the term is mentioned within all of your documents: invDocumentFrequency(word, Docs) = log( len(Docs) ) - log( 1 + |{ d ∈ Docs | word ∈ d }| )

TFIDF is the multiplication of those two functions, giving you a score that is larger when the word is more important and smaller when the word is less important
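
A standalone sketch of those two formulas (not the package's internal implementation), using the standard library math package, to make the arithmetic concrete:

// termFrequency computes 0.5 + 0.5 * count(word, doc) / max count in doc
// (assumes doc is non-empty)
func termFrequency(word string, doc []string) float64 {
	counts := map[string]float64{}
	max := 0.0
	for _, w := range doc {
		counts[w]++
		if counts[w] > max {
			max = counts[w]
		}
	}
	return 0.5 + 0.5*counts[word]/max
}

// inverseDocumentFrequency computes log(len(docs)) - log(1 + number of docs containing word)
func inverseDocumentFrequency(word string, docs [][]string) float64 {
	containing := 0.0
	for _, doc := range docs {
		for _, w := range doc {
			if w == word {
				containing++
				break
			}
		}
	}
	return math.Log(float64(len(docs))) - math.Log(1+containing)
}

// tfidf is the product of the two
func tfidf(word string, doc []string, docs [][]string) float64 {
	return termFrequency(word, doc) * inverseDocumentFrequency(word, docs)
}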

func (*TFIDF) InverseDocumentFrequency

func (t *TFIDF) InverseDocumentFrequency(word string) float64

InverseDocumentFrequency returns the 'uniqueness' of a word within the corpus defined within a trained NaiveBayes model.

Look at the TFIDF docs to see more about how this is calculated

func (*TFIDF) MostImportantWords

func (t *TFIDF) MostImportantWords(sentence string, n int) Frequencies

MostImportantWords runs TFIDF on a whole document, returning the n most important words in the document. If n is greater than the number of words then all words will be returned.

The returned keyword slice is sorted by importance

func (*TFIDF) TFIDF

func (t *TFIDF) TFIDF(word string, sentence string) float64

TFIDF returns the TermFrequency- InverseDocumentFrequency of a word within a corpus given by the trained NaiveBayes model

Look at the TFIDF docs to see more about how this is calculated

func (*TFIDF) TermFrequency

func (t *TFIDF) TermFrequency(word string, document []string) float64

TermFrequency returns the term frequency of a word within a corpus defined by the trained NaiveBayes model

Look at the TFIDF docs to see more about how this is calculated

type Tokenizer

type Tokenizer interface {
	Tokenize(string) []string
}

Tokenizer accepts a sentence as input and breaks it down into a slice of tokens
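
Any type with that single method can be plugged into a NaiveBayes model via UpdateTokenizer. For example, a hypothetical tokenizer (not part of the package) that lowercases, splits on whitespace, and trims surrounding punctuation, using the standard library strings package:

// PunctTokenizer lowercases the input, splits it on whitespace,
// and trims leading/trailing punctuation from each token
type PunctTokenizer struct{}

func (p *PunctTokenizer) Tokenize(sentence string) []string {
	fields := strings.Fields(strings.ToLower(sentence))
	tokens := make([]string, 0, len(fields))
	for _, f := range fields {
		tokens = append(tokens, strings.Trim(f, ".,!?;:\"'"))
	}
	return tokens
}

// use it on an existing model
model.UpdateTokenizer(&PunctTokenizer{})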

type Word

type Word struct {
	// Count holds the number of times
	// the word has been seen in each
	// class (i in Count[i] is the class)
	Count []uint64

	// Seen holds the number of times
	// the word has been seen. This
	// is the same as
	//    foldl (+) 0 Count
	// in Haskell syntax, but is included
	// so you wouldn't have to calculate
	// this every time you wanted to
	// recalc the probabilities (foldl
	// is the same as reduce, basically.)
	Seen uint64

	// DocsSeen is the same as Seen but
	// a word is only counted once even
	// if it's in a document multiple times
	DocsSeen uint64 `json:"-"`
}

Word holds the structural information needed to calculate the probability of a word being part of each class in the training data
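
As a tiny illustration of that bookkeeping (the numbers are hypothetical): a word seen three times in class-0 documents and once in class-1 documents would be stored roughly as:

w := Word{
	Count:    []uint64{3, 1}, // per-class counts
	Seen:     4,              // 3 + 1, the fold over Count
	DocsSeen: 2,              // it appeared in 2 distinct documents, say
}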
