bayes

package module
v0.1.0
Published: Jun 14, 2018 License: MIT Imports: 5 Imported by: 5

README

bayes

A simple implementation of a Naive Bayes classifier. More details are in the docs.

Development

Testing

Install ginkgo, a BDD testing framework for Go, together with the gomega matcher library:

go get github.com/onsi/ginkgo/ginkgo
go get github.com/onsi/gomega

To run the tests, go to the root directory of the project and run

ginkgo

# or

go test

Other implementations:

Go, Java, Python, R, Ruby

Documentation

Overview

Package bayes implements a Naive Bayes trainer and classifier. Code is located at https://github.com/gnames/bayes

The Naive Bayes rule calculates the probability of a hypothesis from prior knowledge about the hypothesis, as well as from evidence that supports or diminishes the probability of the hypothesis. Prior knowledge can dramatically influence the posterior probability of a hypothesis. For example, assuming that an adult bird that cannot fly is a penguin is very unlikely in the northern hemisphere, but very likely in Antarctica. Bayes' theorem is often depicted as

P(H|E) = P(H) * P(E|H) / P(E)

where H is our hypothesis, E is new evidence, P(H) is the prior probability that H is true, P(E|H) is the known probability of the evidence when H is true, and P(E) is the known probability of E across all known cases. P(H|E) is the posterior probability of the hypothesis H adjusted according to the new evidence E.
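
As a small worked illustration of the formula, with numbers invented purely for the example:

package main

import "fmt"

func main() {
	// Invented numbers: H = "the bird is a penguin",
	// E = "the bird cannot fly".
	pH := 0.01       // P(H): prior probability of meeting a penguin
	pEgivenH := 0.99 // P(E|H): probability that a penguin cannot fly
	pE := 0.05       // P(E): probability that any bird cannot fly

	pHgivenE := pH * pEgivenH / pE          // P(H|E) = P(H) * P(E|H) / P(E)
	fmt.Printf("P(H|E) = %.3f\n", pHgivenE) // 0.198
}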

Finding the probability that a hypothesis is true can be considered a classification event. Given prior knowledge and new evidence, we can assign an entity to the hypothesis that has the highest posterior probability.

Using odds instead of probabilities

It is possible to represent Bayes' theorem using odds. Odds describe how likely a hypothesis is in comparison to all other possible hypotheses.

odds = P(H) / (1 - P(H))

P(H) = odds / (1 + odds)

Using odds allows us to simplify Bayes calculations

oddsPosterior = oddsPrior * likelihood

where likelihood is

likelihood = P(E|H)/P(E|H')

P(E|H') in this case is the known probability of the evidence when H is not true. If we have several pieces of evidence that are independent of each other, the posterior odds can be calculated as the product of the prior odds and the likelihoods of all the given evidence.

oddsPosterior = oddsPrior * likelihood1 * likelihood2 * likelihood3 ...

Each subsequent piece of evidence modifies the prior odds. If the evidence is not independent (for example, the inability to fly and a propensity for nesting on the ground in birds), it skews the outcome. In reality the given evidence is quite often not completely independent. This is how Naive Bayes got its name: people who apply it "naively" assume that their evidence is completely independent. In practice the Naive Bayes approach often shows good results in spite of this known fallacy.
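
A minimal sketch of the odds form in Go, with an invented prior and two invented likelihood ratios:

package main

import "fmt"

func main() {
	pH := 0.01                 // P(H): prior probability
	oddsPrior := pH / (1 - pH) // odds = P(H) / (1 - P(H))

	// Likelihood ratios P(E|H)/P(E|H') for two pieces of evidence that
	// we naively treat as independent.
	likelihood1 := 20.0 // the bird cannot fly
	likelihood2 := 3.0  // the bird nests on the ground

	oddsPosterior := oddsPrior * likelihood1 * likelihood2
	pPosterior := oddsPosterior / (1 + oddsPosterior) // back to a probability

	fmt.Printf("posterior odds: %.3f, posterior probability: %.3f\n",
		oddsPosterior, pPosterior)
}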

Training and prior odds

It is quite possible that, while the likelihoods of the evidence are representative for the classification data, the prior odds from the training are not. As in the previous example, evidence that a bird cannot fly supports a 'penguin' hypothesis much better in Antarctica, because the odds of meeting a penguin there are much higher than in the northern hemisphere. Therefore we provide the ability to supply a prior odds value at a classification event.

Terminology

In natural language processing, pieces of `evidence` are often called `features`. We follow the same convention in this package.

Hypotheses are often called `classes` or `labels`. Based on the outcome, we classify an entity (in other words, assign a label to it). In this package we use the term `label` for hypotheses. Every label receives a number of elements, or `tokens`, each with a set of features.

Index

Constants

This section is empty.

Variables

This section is empty.

Functions

func IgnorePriorOdds

func IgnorePriorOdds(nb *NaiveBayes) error

IgnorePriorOdds might be needed in a multistep Bayes calculation where the prior odds have already been accounted for.
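
For example, since IgnorePriorOdds already has the OptionNB signature, it can be passed directly to Predict. This is only a sketch: nb and fs stand for a hypothetical trained classifier and feature slice, and the package is assumed to be imported as bayes.

post, err := nb.Predict(fs, bayes.IgnorePriorOdds)
if err != nil {
	log.Fatal(err)
}
// post now holds posterior odds computed without the prior odds.
fmt.Println(post.MaxLabel, post.MaxOdds)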

func Odds

func Odds(l Labeler, lf LabelFreq) (float64, error)

Odds returns the odds for a label in a given label frequency distribution.

func RegisterLabel

func RegisterLabel(m map[string]Labeler)

RegisterLabel takes a map from strings to the Labeler interface. This map is required to unmarshal JSON data from a string to a label using LabelFactory.
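
For example (a sketch; birdLabel is a hypothetical Labeler implementation like the one shown under the Labeler type below):

// Register the string names of all labels before calling Restore
// or UnmarshalJSON.
bayes.RegisterLabel(map[string]bayes.Labeler{
	"penguin": birdLabel("penguin"),
	"swift":   birdLabel("swift"),
})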

func WithPriorOdds

func WithPriorOdds(lf LabelFreq) func(*NaiveBayes) error

WithPriorOdds allows dynamic change of the prior odds used in calculations. Sometimes the prior odds at a classification event are very different from the ones acquired during training. If, for example, the 'real' prior odds are 100 times larger, the calculated posterior odds will be 100 times smaller than they are supposed to be.
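
A sketch of overriding the training priors during classification; nb, fs and the birdLabel type are hypothetical, as in the other sketches:

// Prior counts that better reflect the population we classify
// (for example, birds observed in Antarctica).
prior := bayes.LabelFreq{
	birdLabel("penguin"): 900,
	birdLabel("swift"):   100,
}
post, err := nb.Predict(fs, bayes.WithPriorOdds(prior))
if err != nil {
	log.Fatal(err)
}
fmt.Println(post.MaxLabel)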

Types

type FeatureFreq

type FeatureFreq map[FeatureName]map[FeatureValue]map[Labeler]float64

FeatureFreq is a map for collecting frequencies of a training feature set. FeatureFreq is used for calculating Likelihoods of a NaiveBayes classifier.

type FeatureName

type FeatureName string

FeatureName is the name of a Feature.

type FeatureTotal

type FeatureTotal map[FeatureName]map[FeatureValue]float64

FeatureTotal is used for calculating multinomial likelihoods. For example, if we are interested in calculating the likelihood of a feature `f`, its likelihood would be

L = P(f|H)/P(f|H')

where `H` is the main "hypothesis" or "label" and `H'` is the combination of all other hypotheses.

type FeatureValue

type FeatureValue string

FeatureValue is the value of a Feature.

type Featurer

type Featurer interface {
	// Name defines an id of a feature
	Name() FeatureName
	// Value defines the value of a feature. The value set can be simple
	// 'true|false' or more complex 'red|blue|grey|yellow'
	Value() FeatureValue
}

Featurer is an interface for a piece of "evidence" we use for training a NaiveBayes classifier, or for the classification of an unknown entity.
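
A minimal sketch of a type satisfying Featurer; the birdFeature type is hypothetical and not part of the package:

// birdFeature is a hypothetical implementation of the Featurer interface.
type birdFeature struct {
	name  bayes.FeatureName
	value bayes.FeatureValue
}

// Name returns the feature's id, for example "flight".
func (f birdFeature) Name() bayes.FeatureName { return f.name }

// Value returns the feature's value, for example "true" or "false".
func (f birdFeature) Value() bayes.FeatureValue { return f.value }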

type LabelFreq

type LabelFreq map[Labeler]float64

LabelFreq is a collection of counts for every Label in the training dataset. This information allows calculating the prior odds for a Label.

type LabeledFeatures

type LabeledFeatures struct {
	Features []Featurer
	Label    Labeler
}

LabeledFeatures is the data used for supervised training of the NaiveBayes algorithm.

type Labeler

type Labeler interface {
	fmt.Stringer
}

Labeler is an interface representing a "hypothesis" or "class" that NaiveBayes is aware of.
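
Because Labeler only embeds fmt.Stringer, a simple string-based type is enough. The birdLabel type here is hypothetical:

// birdLabel is a hypothetical implementation of the Labeler interface.
type birdLabel string

// String satisfies fmt.Stringer and therefore Labeler.
func (l birdLabel) String() string { return string(l) }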

func LabelFactory

func LabelFactory(s string) (Labeler, error)

LabelFactory takes a string and returns a Label. This function is mostly used for unmarshalling data from JSON to a NaiveBayes object.

type Likelihoods

type Likelihoods map[Labeler]map[FeatureName]map[FeatureValue]float64

Likelihoods provides the likelihood of a feature appearing for a particular label.

type NaiveBayes

type NaiveBayes struct {
	// Labels is a list of "hypotheses", "classes", "categories", "labels".
	// It contains all labels created by training.
	Labels []Labeler `json:"-"`
	// FeatureFreq keeps count of all the features for the labels.
	FeatureFreq `json:"-"`
	// LabelFreq keeps counts of the tokens belonging to each label
	LabelFreq `json:"-"`
	// FeatureTotal keeps total count of tokens for each feature.
	FeatureTotal `json:"feature_total"`
	// Total is a total number of tokens used for training.
	Total float64 `json:"total"`
	// IgnorePriorOdds is set true if prior odds do not need to be used in the
	// returning result.
	IgnorePriorOdds bool

	Output io.Writer `json:"-"`
	// contains filtered or unexported fields
}

NaiveBayes is a classifier for assigning an entity represented by its features to a label.

func NewNaiveBayes

func NewNaiveBayes() *NaiveBayes

NewNaiveBayes is a constructor for a NaiveBayes object. It initializes several important defaults and sets options that modify the behavior of the NaiveBayes object. The options available in this package (IgnorePriorOdds and WithPriorOdds) are described under Functions above.

func TrainNB

func TrainNB(lfs []LabeledFeatures, opts ...OptionNB) *NaiveBayes

TrainNB takes data from a training dataset and returns a trained classifier.
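
A minimal training sketch based on the documented signatures; the birdFeature and birdLabel types (the same hypothetical implementations shown under Featurer and Labeler above) and the toy data are invented for illustration:

package main

import (
	"fmt"

	"github.com/gnames/bayes"
)

// Hypothetical Labeler implementation.
type birdLabel string

func (l birdLabel) String() string { return string(l) }

// Hypothetical Featurer implementation.
type birdFeature struct {
	name  bayes.FeatureName
	value bayes.FeatureValue
}

func (f birdFeature) Name() bayes.FeatureName   { return f.name }
func (f birdFeature) Value() bayes.FeatureValue { return f.value }

func main() {
	// A tiny invented training set: every token is a label plus its features.
	lfs := []bayes.LabeledFeatures{
		{
			Label: birdLabel("penguin"),
			Features: []bayes.Featurer{
				birdFeature{"flight", "false"},
				birdFeature{"nesting", "ground"},
			},
		},
		{
			Label: birdLabel("swift"),
			Features: []bayes.Featurer{
				birdFeature{"flight", "true"},
				birdFeature{"nesting", "cliff"},
			},
		},
	}

	nb := bayes.TrainNB(lfs)
	fmt.Println(nb.Total) // total number of tokens used for training
}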

func (*NaiveBayes) Dump

func (nb *NaiveBayes) Dump() []byte

Dump serializes a NaiveBayes object into JSON format.

func (*NaiveBayes) MarshalJSON

func (nb *NaiveBayes) MarshalJSON() ([]byte, error)

MarshalJSON serializes a NaiveBayes object to JSON.

func (*NaiveBayes) Predict

func (nb *NaiveBayes) Predict(fs []Featurer,
	opts ...OptionNB) (Posterior, error)

Predict is a general function that runs the NaiveBayes classifier against the trained set. It can take a different prior odds value to influence the calculation of the posterior odds.
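
A classification sketch, continuing with the hypothetical nb, birdFeature and birdLabel from the TrainNB example above:

fs := []bayes.Featurer{
	birdFeature{"flight", "false"},
	birdFeature{"nesting", "ground"},
}
post, err := nb.Predict(fs)
if err != nil {
	log.Fatal(err)
}
// The winning label and its posterior odds.
fmt.Println(post.MaxLabel, post.MaxOdds)
// Posterior odds for every label.
for label, odds := range post.LabelOdds {
	fmt.Println(label, odds)
}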

func (*NaiveBayes) Restore

func (nb *NaiveBayes) Restore(dump []byte)

Restore deserializes JSON text into a NaiveBayes object. The function needs to know how to convert a string that represents a label into an object. Use the RegisterLabel function to inject a string-to-Label conversion map.
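
A serialization round-trip sketch; nb is the hypothetical trained classifier from the examples above, and birdLabel the hypothetical label type:

// Serialize the trained classifier to JSON.
dump := nb.Dump()

// Tell the package how to turn label strings back into Labeler values.
bayes.RegisterLabel(map[string]bayes.Labeler{
	"penguin": birdLabel("penguin"),
	"swift":   birdLabel("swift"),
})

// Deserialize into a fresh classifier.
nb2 := bayes.NewNaiveBayes()
nb2.Restore(dump)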

func (*NaiveBayes) TrainingPrior

func (nb *NaiveBayes) TrainingPrior(l Labeler) (float64, error)

TrainingPrior returns the prior odds calculated from the training set.

func (*NaiveBayes) UnmarshalJSON

func (nb *NaiveBayes) UnmarshalJSON(data []byte) (err error)

UnmarshalJSON deserializes JSON data to a NaiveBayes object.

type OptionNB

type OptionNB func(*NaiveBayes) error

OptionNB is a type for options supplied to the NaiveBayes classifier. It supports both flag-like and parameterized options.

type Posterior

type Posterior struct {
	LabelOdds map[Labeler]float64
	MaxLabel  Labeler
	MaxOdds   float64
	LabelFreq
	Likelihoods
}

Posterior contains the outcomes from the NaiveBayes classifier.
