bayes

package module
v0.2.0
Published: Oct 8, 2021 License: MIT Imports: 7 Imported by: 5

README

bayes

An implementation of a Naive Bayes classifier. More details are in the docs.

Usage

This package classifies a new entity into one of several categories (labels) according to the entity's features. The algorithm uses known (training) data to calculate the weight of each feature for each category.

import (
	"fmt"
	"log"

	"github.com/gnames/bayes"
	ft "github.com/gnames/bayes/ent/feature"
)

func Example() {
	// There are two jars of cookies; they are our training set.
	// Cookies can be round or star-shaped, and they are either plain
	// or chocolate-chip.
	jar1 := ft.Label("Jar1")
	jar2 := ft.Label("Jar2")

	// Every labeled feature-set provides data for one cookie. It tells
	// which jar the cookie belongs to, as well as its kind and shape.
	cookie1 := ft.LabeledFeatures{
		Label: jar1,
		Features: []ft.Feature{
			{Name: ft.Name("kind"), Value: ft.Val("plain")},
			{Name: ft.Name("shape"), Value: ft.Val("round")},
		},
	}
	cookie2 := ft.LabeledFeatures{
		Label: jar1,
		Features: []ft.Feature{
			{Name: ft.Name("kind"), Value: ft.Val("plain")},
			{Name: ft.Name("shape"), Value: ft.Val("star")},
		},
	}
	cookie3 := ft.LabeledFeatures{
		Label: jar1,
		Features: []ft.Feature{
			{Name: ft.Name("kind"), Value: ft.Val("chocolate")},
			{Name: ft.Name("shape"), Value: ft.Val("star")},
		},
	}
	cookie4 := ft.LabeledFeatures{
		Label: jar1,
		Features: []ft.Feature{
			{Name: ft.Name("kind"), Value: ft.Val("plain")},
			{Name: ft.Name("shape"), Value: ft.Val("round")},
		},
	}
	cookie5 := ft.LabeledFeatures{
		Label: jar1,
		Features: []ft.Feature{
			{Name: ft.Name("kind"), Value: ft.Val("plain")},
			{Name: ft.Name("shape"), Value: ft.Val("round")},
		},
	}
	cookie6 := ft.LabeledFeatures{
		Label: jar2,
		Features: []ft.Feature{
			{Name: ft.Name("kind"), Value: ft.Val("chocolate")},
			{Name: ft.Name("shape"), Value: ft.Val("star")},
		},
	}
	cookie7 := ft.LabeledFeatures{
		Label: jar2,
		Features: []ft.Feature{
			{Name: ft.Name("kind"), Value: ft.Val("chocolate")},
			{Name: ft.Name("shape"), Value: ft.Val("star")},
		},
	}
	cookie8 := ft.LabeledFeatures{
		Label: jar2,
		Features: []ft.Feature{
			{Name: ft.Name("kind"), Value: ft.Val("chocolate")},
			{Name: ft.Name("shape"), Value: ft.Val("star")},
		},
	}

	lfs := []ft.LabeledFeatures{
		cookie1, cookie2, cookie3, cookie4, cookie5, cookie6, cookie7, cookie8,
	}

	nb := bayes.New()
	nb.Train(lfs)
	oddsPrior, err := nb.PriorOdds(jar1)
	if err != nil {
		log.Println(err)
	}

	// If we got a chocolate, star-shaped cookie, which jar did it most
	// likely come from?
	aCookie := []ft.Feature{
		{Name: ft.Name("kind"), Value: ft.Val("chocolate")},
		{Name: ft.Name("shape"), Value: ft.Val("star")},
	}

	res, err := nb.PosteriorOdds(aCookie)
	if err != nil {
		fmt.Println(err)
	}

	// A random cookie is more likely to come from Jar1, but a chocolate,
	// star-shaped cookie is more likely to come from Jar2.
	fmt.Printf("Prior odds for Jar1 are %0.2f\n", oddsPrior)
	fmt.Printf("The cookie came from %s, with odds %0.2f\n", res.MaxLabel, res.MaxOdds)
	// Output:
	// Prior odds for Jar1 are 1.67
	// The cookie came from Jar2, with odds 7.50
}

Development

Testing

Install ginkgo, a BDD testing framework for Go.

go get github.com/onsi/ginkgo/ginkgo
go get github.com/onsi/gomega

To run the tests, go to the root directory of the project and run

ginkgo

# or

go test

Other implementations:

Go, Java, Python, R, Ruby

Documentation

Overview

Package bayes implements a Naive Bayes trainer and classifier. The code is located at https://github.com/gnames/bayes

The Naive Bayes rule calculates the probability of a hypothesis from prior knowledge about the hypothesis, as well as from evidence that supports or diminishes the probability of the hypothesis. Prior knowledge can dramatically influence the posterior probability of a hypothesis. For example, assuming that an adult bird that cannot fly is a penguin is very unlikely in the northern hemisphere, but very likely in Antarctica. Bayes' theorem is often depicted as

P(H|E) = P(H) * P(E|H) / P(E)

where H is our hypothesis, E is new evidence, P(H) is the prior probability of H being true, P(E|H) is the known probability of the evidence when H is true, and P(E) is the known probability of E in all known cases. P(H|E) is the posterior probability of the hypothesis H adjusted according to the new evidence E.

Finding the probability that a hypothesis is true can be considered a classification event: given prior knowledge and new evidence, we can assign an entity to the hypothesis with the highest posterior probability.
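
To make the formula concrete, here is a back-of-the-envelope sketch in Go that applies Bayes' rule to the cookie-jar data from the example below (the counts come from that training set; this is illustrative arithmetic, not the package API):

package main

import "fmt"

func main() {
	// Hypothesis H: the cookie came from Jar2.
	// Evidence E: the cookie is chocolate.
	pH := 3.0 / 8.0       // P(H): 3 of the 8 cookies are in Jar2
	pEgivenH := 3.0 / 3.0 // P(E|H): all Jar2 cookies are chocolate
	pE := 4.0 / 8.0       // P(E): 4 of the 8 cookies are chocolate

	pHgivenE := pH * pEgivenH / pE
	fmt.Printf("P(Jar2|chocolate) = %0.2f\n", pHgivenE)
	// Output: P(Jar2|chocolate) = 0.75
}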

Using odds instead of probabilities

It is possible to represent Bayes' theorem using odds. Odds describe how likely a hypothesis is in comparison to all other possible hypotheses.

odds = P(H) / (1 - P(H))

P(H) = odds / (1 + odds)
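
In Go, these two conversions are one-liners (a minimal sketch, not part of the package API):

// oddsFromProb converts a probability into odds.
func oddsFromProb(p float64) float64 {
	return p / (1 - p)
}

// probFromOdds converts odds back into a probability.
func probFromOdds(odds float64) float64 {
	return odds / (1 + odds)
}

For instance, P(Jar1) = 5/8 in the example below converts to odds of (5/8) / (3/8) = 5/3 ≈ 1.67, the prior odds the example prints.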

Using odds allows us to simplify the Bayes calculations

oddsPosterior = oddsPrior * likelihood

where likelihood is

likelihood = P(E|H)/P(E|H')

Here P(E|H') is the known probability of the evidence when H is not true. If we have several pieces of evidence that are independent of each other, the posterior odds can be calculated as the product of the prior odds and the likelihoods of all the given pieces of evidence.

oddsPosterior = oddsPrior * likelihood1 * likelihood2 * likelihood3 ...

Each subsequent piece of evidence modifies the prior odds. If the pieces of evidence are not independent (for example, inability to fly and a propensity for nesting on the ground in birds), they skew the outcome. In reality, the given pieces of evidence are quite often not completely independent. This is how Naive Bayes got its name: people who apply it "naively" assume that their evidence is completely independent. In practice, the Naive Bayes approach often shows good results in spite of this known fallacy.
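
Applied to the cookie data from the example below, this product reproduces the package's output by hand (again a sketch, not the package API):

package main

import "fmt"

func main() {
	// Prior odds that a cookie comes from Jar2: 3 cookies vs 5 in Jar1.
	priorOdds := 3.0 / 5.0

	// likelihood = P(E|Jar2) / P(E|Jar1) for each feature.
	likeChocolate := (3.0 / 3.0) / (1.0 / 5.0) // 5.0
	likeStar := (3.0 / 3.0) / (2.0 / 5.0)      // 2.5

	posteriorOdds := priorOdds * likeChocolate * likeStar
	fmt.Printf("Posterior odds for Jar2: %0.2f\n", posteriorOdds)
	// Output: Posterior odds for Jar2: 7.50
}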

Training and prior odds

It is quite possible that, while the likelihoods of the evidence are representative of the classification data, the prior odds from the training are not. As in the previous example, the evidence that a bird cannot fly supports the 'penguin' hypothesis much better in Antarctica, because the odds of meeting a penguin there are much higher than in the northern hemisphere. Therefore we provide the ability to supply prior odds at classification time, as sketched below.
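
The sketch below shows how such priors could be passed at classification time via the OptPriorOdds option. The labels, features, and odds values here are hypothetical; only the function signatures come from this package.

import (
	"fmt"
	"log"

	"github.com/gnames/bayes"
	ft "github.com/gnames/bayes/ent/feature"
)

// classifyInAntarctica assumes nb was trained elsewhere; the odds
// values are made up for illustration.
func classifyInAntarctica(nb bayes.Bayes, bird []ft.Feature) {
	priors := map[ft.Label]float64{
		ft.Label("penguin"):     9.0,  // assumed: penguins are common here
		ft.Label("flying bird"): 0.11, // assumed: flying birds are rare
	}
	res, err := nb.PosteriorOdds(bird, bayes.OptPriorOdds(priors))
	if err != nil {
		log.Println(err)
		return
	}
	fmt.Printf("%s, with odds %0.2f\n", res.MaxLabel, res.MaxOdds)
}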

Terminology

In natural language processing, pieces of `evidence` are often called `features`. We follow the same convention in this package.

Hypotheses are often called `classes` or `labels`. Based on the outcome, we classify an entity (in other words, we assign a label to the entity). In this package we use the term `label` for hypotheses. Every label receives a number of elements, or `tokens`, each with a set of features.
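
In code, these terms map onto the package's feature types; one labeled token from the example below looks like this:

lf := ft.LabeledFeatures{
	Label: ft.Label("Jar1"), // the hypothesis (class / label)
	Features: []ft.Feature{ // the evidence for this token
		{Name: ft.Name("kind"), Value: ft.Val("plain")},
		{Name: ft.Name("shape"), Value: ft.Val("round")},
	},
}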

Example
package main

import (
	"fmt"
	"log"

	"github.com/gnames/bayes"
	ft "github.com/gnames/bayes/ent/feature"
)

func main() {
	// There are two jars of cookies; they are our training set.
	// Cookies can be round or star-shaped, and they are either plain
	// or chocolate-chip.
	jar1 := ft.Label("Jar1")
	jar2 := ft.Label("Jar2")

	// Every labeled feature-set provides data for one cookie. It tells
	// which jar the cookie belongs to, as well as its kind and shape.
	cookie1 := ft.LabeledFeatures{
		Label: jar1,
		Features: []ft.Feature{
			{Name: ft.Name("kind"), Value: ft.Val("plain")},
			{Name: ft.Name("shape"), Value: ft.Val("round")},
		},
	}
	cookie2 := ft.LabeledFeatures{
		Label: jar1,
		Features: []ft.Feature{
			{Name: ft.Name("kind"), Value: ft.Val("plain")},
			{Name: ft.Name("shape"), Value: ft.Val("star")},
		},
	}
	cookie3 := ft.LabeledFeatures{
		Label: jar1,
		Features: []ft.Feature{
			{Name: ft.Name("kind"), Value: ft.Val("chocolate")},
			{Name: ft.Name("shape"), Value: ft.Val("star")},
		},
	}
	cookie4 := ft.LabeledFeatures{
		Label: jar1,
		Features: []ft.Feature{
			{Name: ft.Name("kind"), Value: ft.Val("plain")},
			{Name: ft.Name("shape"), Value: ft.Val("round")},
		},
	}
	cookie5 := ft.LabeledFeatures{
		Label: jar1,
		Features: []ft.Feature{
			{Name: ft.Name("kind"), Value: ft.Val("plain")},
			{Name: ft.Name("shape"), Value: ft.Val("round")},
		},
	}
	cookie6 := ft.LabeledFeatures{
		Label: jar2,
		Features: []ft.Feature{
			{Name: ft.Name("kind"), Value: ft.Val("chocolate")},
			{Name: ft.Name("shape"), Value: ft.Val("star")},
		},
	}
	cookie7 := ft.LabeledFeatures{
		Label: jar2,
		Features: []ft.Feature{
			{Name: ft.Name("kind"), Value: ft.Val("chocolate")},
			{Name: ft.Name("shape"), Value: ft.Val("star")},
		},
	}
	cookie8 := ft.LabeledFeatures{
		Label: jar2,
		Features: []ft.Feature{
			{Name: ft.Name("kind"), Value: ft.Val("chocolate")},
			{Name: ft.Name("shape"), Value: ft.Val("star")},
		},
	}

	lfs := []ft.LabeledFeatures{
		cookie1, cookie2, cookie3, cookie4, cookie5, cookie6, cookie7, cookie8,
	}

	nb := bayes.New()
	nb.Train(lfs)
	oddsPrior, err := nb.PriorOdds(jar1)
	if err != nil {
		log.Println(err)
	}

	// If we got a chocolate, star-shaped cookie, which jar did it most
	// likely come from?
	aCookie := []ft.Feature{
		{Name: ft.Name("kind"), Value: ft.Val("chocolate")},
		{Name: ft.Name("shape"), Value: ft.Val("star")},
	}

	res, err := nb.PosteriorOdds(aCookie)
	if err != nil {
		fmt.Println(err)
	}

	// A random cookie is more likely to come from Jar1, but a chocolate,
	// star-shaped cookie is more likely to come from Jar2.
	fmt.Printf("Prior odds for Jar1 are %0.2f\n", oddsPrior)
	fmt.Printf("The cookie came from %s, with odds %0.2f\n", res.MaxLabel, res.MaxOdds)
}
Output:

Prior odds for Jar1 are 1.67
The cookie came from Jar2, with odds 7.50


Types

type Bayes added in v0.2.0

type Bayes interface {
	Trainer
	Serializer
	Calc
}

Bayes provides methods for calculating posterior and prior odds with the Bayes algorithm, for training on manually curated data, and for saving and loading Bayes data to/from long-term storage.

func New added in v0.2.0

func New() Bayes

New creates a new instance of a Bayes object. This object needs to receive its data either from training or from loading a dump of previous training data.

type Calc added in v0.2.0

type Calc interface {
	// PriorOdds method returns Odds from the training.
	PriorOdds(ft.Label) (float64, error)
	// PosteriorOdds uses a set of features to determine which label
	// they most probably belong to.
	PosteriorOdds([]ft.Feature, ...Option) (posterior.Odds, error)
}

Calc provides methods for calculating prior and posterior odds from new data, making it possible to classify the data according to its features.

type Option added in v0.2.0

type Option func(nb *bayes)

func OptIgnorePriorOdds added in v0.2.0

func OptIgnorePriorOdds(b bool) Option

OptIgnorePriorOdds might be needed if the prior odds are already accounted for.
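
A minimal usage sketch, reusing the trained nb and the aCookie features from the example above (log and fmt imports assumed):

res, err := nb.PosteriorOdds(aCookie, bayes.OptIgnorePriorOdds(true))
if err != nil {
	log.Println(err)
}
fmt.Printf("%s, with odds %0.2f\n", res.MaxLabel, res.MaxOdds)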

func OptPriorOdds added in v0.2.0

func OptPriorOdds(lc map[ft.Label]float64) Option

OptPriorOdds allows dynamic changes to the prior odds used in calculations. Sometimes the prior odds at a classification event are very different from the ones acquired during training. If, for example, the 'real' prior odds are 100 times larger, the calculated posterior odds will be 100 times smaller than they are supposed to be.

type Serializer added in v0.2.0

type Serializer interface {
	// Inspect returns simplified, publicly accessible information that
	// is normally private to the Bayes object.
	Inspect() output.Output
	// Load takes a slice of bytes that corresponds to output.Output and
	// creates a Bayes instance from it.
	Load([]byte) error
	// Dump takes the internal data of a Bayes instance, converts it to
	// output.Output, and serializes it to a slice of bytes.
	Dump() ([]byte, error)
}

Serializer provides methods for dumping the data of a Bayes object to a slice of bytes, and for rebuilding a Bayes object from such data.
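
A persistence round-trip sketch, assuming nb is a trained Bayes instance as in the example above (the file name is hypothetical, and the byte format is whatever Dump produces; os, in addition to log, is assumed imported):

// Dump the trained classifier to bytes and persist them.
data, err := nb.Dump()
if err != nil {
	log.Fatal(err)
}
if err := os.WriteFile("bayes.dump", data, 0644); err != nil {
	log.Fatal(err)
}

// Later, rebuild a classifier from the saved bytes.
nb2 := bayes.New()
if err := nb2.Load(data); err != nil {
	log.Fatal(err)
}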

type Trainer added in v0.2.0

type Trainer interface {
	Train([]ft.LabeledFeatures)
}

The Trainer interface provides a method for training a Bayes object with data from the training set.

Directories

Path Synopsis
ent
