Documentation ¶
Overview ¶
Package bayes implements a Naive Bayes trainer and classifier. The code is located at https://github.com/gnames/bayes
The Naive Bayes rule calculates the probability of a hypothesis from prior knowledge about the hypothesis, as well as from evidence that supports or diminishes that probability. Prior knowledge can dramatically influence the posterior probability of a hypothesis. For example, the assumption that an adult bird that cannot fly is a penguin is very unlikely in the northern hemisphere, but very likely in Antarctica. Bayes' theorem is often depicted as
P(H|E) = P(H) * P(E|H) / P(E)
where H is our hypothesis, E is new evidence, P(H) is the prior probability that H is true, P(E|H) is the known probability of the evidence when H is true, and P(E) is the known probability of E in all known cases. P(H|E) is the posterior probability of the hypothesis H adjusted according to the new evidence E.
Finding the probability that a hypothesis is true can be considered a classification event. Given prior knowledge and new evidence, we can assign an entity to the hypothesis with the highest posterior probability.
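As a minimal illustration of the formula above (plain Go, not part of the package API), the following sketch computes a posterior probability for the penguin example with made-up numbers:

package main

import "fmt"

// posterior applies Bayes' theorem: P(H|E) = P(H) * P(E|H) / P(E).
func posterior(pH, pEgivenH, pE float64) float64 {
	return pH * pEgivenH / pE
}

func main() {
	// Hypothesis H: the bird is a penguin. Evidence E: it cannot fly.
	// All numbers below are invented for illustration only.
	pPenguin := 0.01     // prior probability of meeting a penguin
	pNoFlyPenguin := 1.0 // penguins never fly, so P(E|H) = 1
	pNoFly := 0.05       // probability that a random adult bird cannot fly
	fmt.Println(posterior(pPenguin, pNoFlyPenguin, pNoFly)) // 0.2
}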
Using odds instead of probabilities ¶
It is possible to represent Bayes' theorem using odds. Odds describe how likely a hypothesis is compared to all other possible hypotheses.
odds = P(H) / (1 - P(H))
P(H) = odds / (1 + odds)
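A minimal sketch of these two conversions (plain Go, independent of the package API):

package main

import "fmt"

// odds converts a probability into odds.
func odds(p float64) float64 { return p / (1 - p) }

// probability converts odds back into a probability.
func probability(o float64) float64 { return o / (1 + o) }

func main() {
	fmt.Println(odds(0.5))        // 1
	fmt.Println(probability(1.0)) // 0.5
}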
Using odds allows us to simplify Bayes' calculations
oddsPosterior = oddsPrior * likelihood
where likelihood is
likelihood = P(E|H)/P(E|H')
P(E|H') in this case is the known probability of the evidence when H is not true. If we have several pieces of evidence that are independent of each other, the posterior odds can be calculated as the product of the prior odds and the likelihoods of all given pieces of evidence.
oddsPosterior = oddsPrior * likelihood1 * likelihood2 * likelihood3 ...
Each subsequent piece of evidence modifies the prior odds. If the pieces of evidence are not independent (for example, inability to fly and a propensity for nesting on the ground in birds), they skew the outcome. In reality the given evidence is quite often not completely independent. This is where Naive Bayes gets its name: people who apply it "naively" assume that their evidence is completely independent. In practice the Naive Bayes approach often shows good results in spite of this known fallacy.
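The chained multiplication above can be sketched as follows (plain Go, independent of the package API):

package main

import "fmt"

// posteriorOdds multiplies prior odds by the likelihood of each
// (assumed independent) piece of evidence.
func posteriorOdds(priorOdds float64, likelihoods ...float64) float64 {
	o := priorOdds
	for _, lh := range likelihoods {
		o *= lh
	}
	return o
}

func main() {
	// Invented numbers: prior odds of 0.1 and two pieces of evidence
	// with likelihoods 20 and 3.
	fmt.Println(posteriorOdds(0.1, 20, 3)) // 6
}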
Training and prior odds ¶
It is quite possible that, while the likelihoods of the evidence are representative of the classification data, the prior odds from the training are not. As in the previous example, the evidence that a bird cannot fly supports the 'penguin' hypothesis much better in Antarctica, because the odds of meeting a penguin there are much higher than in the northern hemisphere. Therefore the package provides the ability to supply prior odds at a classification event.
Terminology ¶
In natural language processing, pieces of `evidence` are often called `features`. We follow the same convention in this package.
Hypotheses are often called `classes` or `labels`. Based on the outcome, we classify an entity (in other words, assign a label to it). In this package we use the term `label` for hypotheses. Every label receives a number of elements, or `tokens`, each with a set of features.
Index ¶
- func IgnorePriorOdds(nb *NaiveBayes) error
- func Odds(l Labeler, lf LabelFreq) (float64, error)
- func RegisterLabel(m map[string]Labeler)
- func WithPriorOdds(lf LabelFreq) func(*NaiveBayes) error
- type FeatureFreq
- type FeatureName
- type FeatureTotal
- type FeatureValue
- type Featurer
- type LabelFreq
- type LabeledFeatures
- type Labeler
- type Likelihoods
- type NaiveBayes
- func (nb *NaiveBayes) Dump() []byte
- func (nb *NaiveBayes) MarshalJSON() ([]byte, error)
- func (nb *NaiveBayes) Predict(fs []Featurer, opts ...OptionNB) (Posterior, error)
- func (nb *NaiveBayes) Restore(dump []byte)
- func (nb *NaiveBayes) TrainingPrior(l Labeler) (float64, error)
- func (nb *NaiveBayes) UnmarshalJSON(data []byte) (err error)
- type OptionNB
- type Posterior
Constants ¶
This section is empty.
Variables ¶
This section is empty.
Functions ¶
func IgnorePriorOdds ¶
func IgnorePriorOdds(nb *NaiveBayes) error
IgnorePriorOdds might be needed in a multistep Bayes calculation where the prior odds are already accounted for.
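Because IgnorePriorOdds has the same signature as OptionNB, it can presumably be passed as an option to Predict; a sketch under that assumption:

package example

import "github.com/gnames/bayes"

// predictWithoutPrior runs a prediction with prior odds ignored. It assumes
// IgnorePriorOdds can be supplied as an OptionNB, which this page implies
// but does not show explicitly.
func predictWithoutPrior(nb *bayes.NaiveBayes, fs []bayes.Featurer) (bayes.Posterior, error) {
	return nb.Predict(fs, bayes.IgnorePriorOdds)
}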
func RegisterLabel ¶
RegisterLabel takes a map from strings to the Labeler interface. This map is required to unmarshal JSON data from a string to a label using LabelFactory.
func WithPriorOdds ¶
func WithPriorOdds(lf LabelFreq) func(*NaiveBayes) error
WithPriorOdds allows dynamically changing the prior odds used in calculations. Sometimes the prior odds at a classification event are very different from the ones acquired during training. If, for example, the 'real' prior odds are 100 times larger, the calculated posterior odds will be 100 times smaller than they are supposed to be.
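A sketch of supplying prior odds at prediction time; the LabelFreq value is assumed to be assembled by the caller, since its underlying type is not shown on this page:

package example

import "github.com/gnames/bayes"

// predictWithPrior supplies label frequencies collected outside of training,
// so the prior odds reflect the classification environment rather than the
// training set.
func predictWithPrior(
	nb *bayes.NaiveBayes,
	fs []bayes.Featurer,
	lf bayes.LabelFreq,
) (bayes.Posterior, error) {
	return nb.Predict(fs, bayes.WithPriorOdds(lf))
}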
Types ¶
type FeatureFreq ¶
type FeatureFreq map[FeatureName]map[FeatureValue]map[Labeler]float64
FeatureFreq is a map for collecting frequencies of a training feature set. FeatureFreq is used for calculating Likelihoods of a NaiveBayes classifier.
type FeatureTotal ¶
type FeatureTotal map[FeatureName]map[FeatureValue]float64
FeatureTotal is used for calculating multinomial likelihoods. For example, if we are interested in calculating the likelihood of a feature `f`, its likelihood would be
L = P(f|H) / P(f|H')
where `H` is the main "hypothesis" or "label", and `H'` is the combination of all other hypotheses.
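A worked sketch of this ratio computed from raw counts (plain Go, illustrating the math rather than the package internals):

package main

import "fmt"

// likelihood computes L = P(f|H) / P(f|H') from raw training counts:
// countFH – how often feature value f occurred together with label H,
// countH  – how many tokens carry label H,
// countF  – how often f occurred with any label (a FeatureTotal-style count),
// total   – total number of training tokens.
func likelihood(countFH, countH, countF, total float64) float64 {
	pFH := countFH / countH                         // P(f|H)
	pFNotH := (countF - countFH) / (total - countH) // P(f|H')
	return pFH / pFNotH
}

func main() {
	// Invented counts: f occurred 30 times with H out of 40 H-tokens,
	// and 40 times overall out of 200 training tokens.
	fmt.Println(likelihood(30, 40, 40, 200)) // 12
}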
type Featurer ¶
type Featurer interface {
	// Name defines an id of a feature
	Name() FeatureName

	// Value defines the value of a feature. The value set can be simple
	// 'true|false' or more complex 'red|blue|grey|yellow'
	Value() FeatureValue
}
Featurer is an interface for a piece of "evidence" used for training a NaiveBayes classifier, or for the classification of an unknown entity.
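A hypothetical implementation of the interface; it assumes that FeatureName and FeatureValue are string-based types, which this page does not state explicitly:

package example

import "github.com/gnames/bayes"

// boolFeature is a hypothetical Featurer with a 'true|false' value set.
type boolFeature struct {
	name  string
	value bool
}

func (f boolFeature) Name() bayes.FeatureName {
	return bayes.FeatureName(f.name)
}

func (f boolFeature) Value() bayes.FeatureValue {
	if f.value {
		return bayes.FeatureValue("true")
	}
	return bayes.FeatureValue("false")
}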
type LabelFreq ¶
LabelFreq is a collection of counts for every Label in the training dataset. This information allows calculating prior odds for a Label.
type LabeledFeatures ¶
LabeledFeatures is the data used for supervised training of the NaiveBayes algorithm.
type Labeler ¶
Labeler is an interface representing a "hypothesis" or "class" that the NaiveBayes classifier is aware of.
func LabelFactory ¶
LabelFactory takes a string and returns a Label. This function is mostly used for unmarshalling data from JSON into a NaiveBayes object.
type Likelihoods ¶
type Likelihoods map[Labeler]map[FeatureName]map[FeatureValue]float64
Likelihoods provides the likelihood of a feature appearing for a particular label.
type NaiveBayes ¶
type NaiveBayes struct {
	// Labels is a list of "hypotheses", "classes", "categories", "labels".
	// It contains all labels created by training.
	Labels []Labeler `json:"-"`

	// FeatureFreq keeps count of all the features for the labels.
	FeatureFreq `json:"-"`

	// LabelFreq keeps counts of the tokens belonging to each label
	LabelFreq `json:"-"`

	// FeatureTotal keeps total count of tokens for each feature.
	FeatureTotal `json:"feature_total"`

	// Total is a total number of tokens used for training.
	Total float64 `json:"total"`

	// IgnorePriorOdds is set true if prior odds do not need to be used in the
	// returning result.
	IgnorePriorOdds bool

	Output io.Writer `json:"-"`
	// contains filtered or unexported fields
}
NaiveBayes is a classifier for assigning an entity represented by its features to a label.
func NewNaiveBayes ¶
func NewNaiveBayes() *NaiveBayes
NewNaiveBayes is a constructor for the NaiveBayes object. It initializes several important defaults and sets options that modify the behavior of the NaiveBayes object. Currently the constructor supports options such as WithPriorOdds and IgnorePriorOdds.
func TrainNB ¶
func TrainNB(lfs []LabeledFeatures, opts ...OptionNB) *NaiveBayes
TrainNB takes data from a training dataset and returns a trained classifier.
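A sketch of the training step; the internal structure of LabeledFeatures is not shown on this page, so the slice is assumed to be built by the caller:

package example

import "github.com/gnames/bayes"

// train builds a classifier from labeled training data. How the
// []LabeledFeatures slice is assembled depends on the Labeler and
// Featurer implementations chosen by the caller.
func train(lfs []bayes.LabeledFeatures) *bayes.NaiveBayes {
	return bayes.TrainNB(lfs)
}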
func (*NaiveBayes) Dump ¶
func (nb *NaiveBayes) Dump() []byte
Dump serializes a NaiveBayes object into a JSON format.
func (*NaiveBayes) MarshalJSON ¶
func (nb *NaiveBayes) MarshalJSON() ([]byte, error)
MarshalJSON serializes a NaiveBayes object to JSON.
func (*NaiveBayes) Predict ¶
func (nb *NaiveBayes) Predict(fs []Featurer, opts ...OptionNB) (Posterior, error)
Predict is a general function that runs the NaiveBayes classifier against the trained set. It can take a different PriorOdds value to influence the calculation of the posterior odds.
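A minimal usage sketch; the fields of Posterior are not documented on this page, so the result is only printed:

package example

import (
	"fmt"

	"github.com/gnames/bayes"
)

// classify predicts the label odds for an entity described by its features.
func classify(nb *bayes.NaiveBayes, fs []bayes.Featurer) error {
	post, err := nb.Predict(fs)
	if err != nil {
		return err
	}
	fmt.Printf("%+v\n", post) // inspect the Posterior result
	return nil
}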
func (*NaiveBayes) Restore ¶
func (nb *NaiveBayes) Restore(dump []byte)
Restore deserializes JSON text into a NaiveBayes object. The function needs to know how to convert a string that represents a label into an object. Use the RegisterLabel function to inject a string-to-Label conversion map.
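A round-trip sketch combining Dump and Restore; it assumes RegisterLabel has already been called with the appropriate string-to-Label map:

package example

import "github.com/gnames/bayes"

// roundTrip serializes a trained classifier and loads it into a new one.
// RegisterLabel must have been called beforehand so Restore can convert
// label strings back into Labeler values.
func roundTrip(nb *bayes.NaiveBayes) *bayes.NaiveBayes {
	dump := nb.Dump() // JSON representation of the classifier
	nb2 := bayes.NewNaiveBayes()
	nb2.Restore(dump)
	return nb2
}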
func (*NaiveBayes) TrainingPrior ¶
func (nb *NaiveBayes) TrainingPrior(l Labeler) (float64, error)
TrainingPrior returns the prior odds calculated from the training set.
func (*NaiveBayes) UnmarshalJSON ¶
func (nb *NaiveBayes) UnmarshalJSON(data []byte) (err error)
UnmarshalJSON deserializes JSON data to a NaiveBayes object.
type OptionNB ¶
type OptionNB func(*NaiveBayes) error
OptionNB is a type for options supplied to NaiveBayes classifier. It can support either flags or parameterized options.