botanic

Version: v0.0.0-...-dbea759
Published: Jan 23, 2018 License: MIT Imports: 9 Imported by: 0

README

botanic

Regression-tree library and tool written in Golang

Basic concepts

  • Feature. In machine learning, a feature is 'an individual measurable property or characteristic of a phenomenon being observed' according to Wikipedia. In botanic we distinguish between continuous features, which can take any real number value, and discrete features, which take values from a list of strings.
  • Sample. In botanic a sample represents a phenomenon being observed in terms of specific values for the features that are measured.
  • Set. In botanic a set is a collection of samples. We usually start with a set that holds all our data as samples and split it into a training set, which holds the majority of the samples and is used to grow a tree, and a testing set, which holds the rest and is used to test the grown tree.
  • Tree. A tree in botanic is a decision or regression tree, also known as Classification and Regression Tree (CART). With botanic you can grow trees that predict a discrete feature (called the class feature) training them with a set. Once they are grown, you can test them with another set to see how well they predict the class feature, and use them to predict the class feature of a sample.

botanic CLI

The botanic command is a CLI tool that allows you to manage:

  • sets of data: converting them between formats or even splitting them into training and testing sets.
  • regression trees: growing them from a data set, testing them with another data set or using them interactively to predict the value for a sample.
Installation

Assuming you have Go set up in your system, just run:

go get -u github.com/pbanos/botanic/cmd/botanic

Make sure $GOPATH/bin is listed in your PATH.

Use

The botanic command currently accepts the following commands:

Help command

Running botanic help shows the available botanic commands and a description of each:

$ botanic help
A tool to grow regression trees from your data, test them, and use them to make predictions

Usage:
  botanic [command]

Available Commands:
  help        Help about any command
  set         Manage sets of data
  tree        Manage regression trees
  version     Print the version number of botanic

Flags:
  -h, --help      help for botanic
  -v, --verbose

Use "botanic [command] --help" for more information about a command.
$
Set command

The botanic set command dumps an existing set of samples into another set, each in any of the following formats:

  • CSV (as a file ending in .csv or by default read from STDIN or dumped to STDOUT if nothing is specified)
  • an SQLite3 database (specified as a file ending in .db)
  • a PostgreSQL database (specified as a PostgreSQL Connection URI)

We can see the flags and subcommands available for it running it with the --help or -h flag:

$ botanic set --help
Manage sets of data

Usage:
  botanic set [flags]
  botanic set [command]

Available Commands:
  split       Split a set into two sets

Flags:
  -h, --help              help for set
  -i, --input string      path to an input CSV (.csv) or SQLite3 (.db) file, or a PostgreSQL DB connection URL with data to use to grow the tree (defaults to STDIN, interpreted as CSV)
  -m, --metadata string   path to a YML file with metadata describing the different features available available on the input file (required)
  -o, --output string     path to a CSV (.csv) or SQLite3 (.db) file, or a PostgreSQL DB connection URL to dump the output set (defaults to STDOUT in CSV)

Global Flags:
  -v, --verbose

Use "botanic set [command] --help" for more information about a command.
$

When working with this command or its subcommands, we will always need to specify the input set with the -i or --input flag, as well as the metadata YAML file that describes the features of the set with the -m or --metadata flag (explained below). For example, the following command will read a set from the data.csv file and output it into an SQLite3 database on the data.db file:

botanic set --input data.csv -m metadata.yml -o data.db
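
How botanic tells the three formats apart internally is not described here. As a minimal sketch, assuming the decision is made from the file extension or URI scheme exactly as listed above (setFormat and its return values are hypothetical names, not part of botanic):

```go
package main

import (
	"fmt"
	"strings"
)

// setFormat guesses the storage format of a set from the value given to
// --input or --output. The rules are assumptions derived from the README
// text, not botanic's actual implementation.
func setFormat(location string) string {
	switch {
	case location == "":
		return "csv-stdio" // nothing specified: CSV over STDIN or STDOUT
	case strings.HasPrefix(location, "postgres://"),
		strings.HasPrefix(location, "postgresql://"):
		return "postgresql"
	case strings.HasSuffix(location, ".db"):
		return "sqlite3"
	case strings.HasSuffix(location, ".csv"):
		return "csv"
	default:
		return "unknown"
	}
}

func main() {
	locations := []string{
		"",
		"data.csv",
		"data.db",
		"postgres://myuser:mypasswd@mypostgress.server.com/mydb",
	}
	for _, loc := range locations {
		fmt.Printf("%q -> %s\n", loc, setFormat(loc))
	}
}
```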
Metadata YAML file

When working with botanic sets, we will need a metadata YAML file that describes the features we are working with. The schema for the metadata YAML file is very simple:

  • There is a features key at the root
  • Under the features key there will be a key with the name of every feature available (as identified on the sets this describes). The value for each feature will be:
    • The string continuous if the feature is continuous
    • An array of string values that are valid for the feature if the feature is discrete

Example:

features:
  Age: continuous
  Education:
    - "high school"
    - "bachelors"
    - "masters"
  Income:
    - low
    - high
  Marital Status:
    - married
    - single
  Prediction:
    - "will buy"
    - "won't buy"
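
To make the distinction between continuous and discrete features concrete, here is a small sketch of how a value could be checked against such metadata. The feature type and its valid method are hypothetical and only illustrate the schema; they are not botanic's feature package.

```go
package main

import (
	"fmt"
	"strconv"
)

// feature is a hypothetical stand-in for one metadata entry: it is treated
// as continuous when values is nil and as discrete otherwise.
type feature struct {
	name   string
	values []string // valid values of a discrete feature
}

// valid reports whether raw is an acceptable value for the feature:
// any parseable number for a continuous feature, or one of the listed
// strings for a discrete feature.
func (f feature) valid(raw string) bool {
	if f.values == nil {
		_, err := strconv.ParseFloat(raw, 64)
		return err == nil
	}
	for _, v := range f.values {
		if v == raw {
			return true
		}
	}
	return false
}

func main() {
	age := feature{name: "Age"}
	income := feature{name: "Income", values: []string{"low", "high"}}

	fmt.Println(age.valid("23"), age.valid("young"))
	fmt.Println(income.valid("low"), income.valid("medium"))
}
```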
CSV sets

CSV will probably be the entry format for data into a botanic CLI workflow: nothing prevents you from generating a DB-based set from scratch, but CSV is easier. CSV sets should have a first row, or header, with the names of all the features of your samples (or at least those listed in your metadata YAML file). Every row after that one represents a sample and should contain the values for the features in the order given by the header.

An example of a CSV set with 3 samples would be:

Marital Status,Prediction,Age,Education,Income
married,won't buy,23,bachelors,high
married,won't buy,42,bachelors,low
married,will buy,56,masters,low
Split subcommand

The botanic set split command allows splitting an input set into 2 different sets with different samples: the output set and the split set. This will come in handy when you want to split your data into a training set and a test set.

We can see the available flags for this subcommand running it with the --help or -h flag:

$ botanic set split --help
Split a set into an output set and a split set

Usage:
  botanic set split [flags]

Flags:
  -h, --help                    help for split
  -s, --split-output string     path to a CSV (.csv) or SQLite3 (.db) file, or a PostgreSQL DB connection URL to dump the output of the split set (required)
  -p, --split-probability int   probability as percent integer that a sample of the set will be assigned to the split set (default 20)

Global Flags:
  -i, --input string      path to an input CSV (.csv) or SQLite3 (.db) file, or a PostgreSQL DB connection URL with data to use to grow the tree (defaults to STDIN, interpreted as CSV)
  -m, --metadata string   path to a YML file with metadata describing the different features available available on the input file (required)
  -o, --output string     path to a CSV (.csv) or SQLite3 (.db) file, or a PostgreSQL DB connection URL to dump the output set (defaults to STDOUT in CSV)
  -v, --verbose
$

Most flags are self-explanatory, but the -p or --split-probability flag deserves a special mention: it specifies the probability that a sample will end up in the split set, as a percent (without the % symbol).

For example, to split the set in the data.db we generated before into a train set in a train.db SQLite3 file and a test set in a PostgreSQL database accessible at postgres://myuser:mypasswd@mypostgress.server.com/mydb, with a 90%-10% distribution among sets, we would run:

botanic set split -i data.db -o train.db -s postgres://myuser:mypasswd@mypostgress.server.com/mydb -m metadata.yml --split-probability 10
Tree command

The botanic tree command works with trees.

On its own it shows a tree on the standard output, but it has some available subcommands. We can see the flags and subcommands available for it running it with the --help or -h flag:

$ botanic tree --help
Manage regression trees and use them to predict values for samples

Usage:
  botanic tree [flags]
  botanic tree [command]

Available Commands:
  grow        Grow a tree from a set of data
  predict     Predict a value for a sample answering questions
  test        Test the performance of a tree

Flags:
  -h, --help              help for tree
  -m, --metadata string   path to a YML file with metadata describing the different features used on a tree or available on an input set (required)
  -t, --tree string       path to a file from which the tree to show will be read and parsed as JSON (required)

Global Flags:
  -v, --verbose

Use "botanic tree [command] --help" for more information about a command.
$

To show a given tree, previously generated with the grow subcommand described below, we will specify the following flags:

  • the path to the metadata YAML file describing the features in the tree, with the --metadata or -m flag
  • the path to the JSON file with the tree to show, with the --tree or -t flag

For example, to show a tree in a tree.json file with our metadata.yml as metadata file we would run:

botanic tree -m metadata.yml -t tree.json
Grow subcommand

The botanic tree grow command grows a tree from an input set of data: the training set of data.

We can see the flags available for it running it with the --help or -h flag:

$ botanic tree grow --help
Grow a tree from a set of data to predict a certain feature.

Usage:
  botanic tree grow [flags]

Flags:
  -c, --class-feature string   name of the feature the generated tree should predict (required)
      --concurrency int        limit to concurrent workers on the tree and on DB connections opened at a time (defaults to 1) (default 1)
      --cpu-intensive          force the use of cpu-intensive subsetting to decrease memory use at the cost of increasing time
  -h, --help                   help for grow
  -i, --input string           path to an input CSV (.csv) or SQLite3 (.db) file, or a PostgreSQL DB connection URL with data to use to grow the tree (defaults to STDIN, interpreted as CSV)
      --memory-intensive       force the use of memory-intensive subsetting to decrease time at the cost of increasing memory use
  -o, --output string          path to a file to which the generated tree will be written in JSON format (defaults to STDOUT)
  -p, --prune string           pruning strategy to apply, the following are valid: default, minimum-information-gain:[VALUE], none (default "default")

Global Flags:
  -m, --metadata string   path to a YML file with metadata describing the different features used on a tree or available on an input set (required)
  -v, --verbose
$

To grow a tree we will probably have to specify:

  • an input or training set, with the --input or -i flag. This can be a CSV file, an SQLite3 file (with the .db extension) or a PostgreSQL URL. The default is a CSV set read from STDIN.
  • the class feature the tree has to predict, with the --class-feature or -c flag
  • the path to the metadata YAML file describing the features in the set, with the --metadata or -m flag

The following optional flags can also be useful:

  • --output or -o specifies where to store the resulting tree, in JSON format. It defaults to STDOUT.
  • --prune or -p defines the pruning strategy to apply while growing the tree: branches whose development does not help in improving predictions enough will be pruned, that is, their subbranches will be discarded. The following strategies are available:
    • default: the default one
    • minimum-information-gain: this strategy imposes a minimum value for the information gain obtained from the subbranching. This value can be specified appending :VALUE to the strategy, for example: --prune minimum-information-gain:0.05
    • none: this strategy disables pruning
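
The strategy argument is a name with an optional :VALUE suffix. As a sketch of how such a value could be parsed (parsePrune is a hypothetical helper; rejecting a value on default and none is an assumption, not documented behaviour):

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parsePrune splits a --prune value into the strategy name and, for
// minimum-information-gain, its threshold. hasValue reports whether a
// :VALUE suffix was present.
func parsePrune(arg string) (name string, value float64, hasValue bool, err error) {
	parts := strings.SplitN(arg, ":", 2)
	name = parts[0]
	switch name {
	case "default", "none":
		// Assumption: these strategies take no value.
		if len(parts) == 2 {
			return "", 0, false, fmt.Errorf("strategy %q takes no value", name)
		}
		return name, 0, false, nil
	case "minimum-information-gain":
		if len(parts) == 1 {
			return name, 0, false, nil
		}
		value, err = strconv.ParseFloat(parts[1], 64)
		if err != nil {
			return "", 0, false, fmt.Errorf("invalid threshold %q: %v", parts[1], err)
		}
		return name, value, true, nil
	default:
		return "", 0, false, fmt.Errorf("unknown pruning strategy %q", name)
	}
}

func main() {
	for _, arg := range []string{"default", "minimum-information-gain:0.05", "none", "bogus"} {
		name, value, hasValue, err := parsePrune(arg)
		fmt.Println(name, value, hasValue, err)
	}
}
```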

If the input or training set is in a CSV file, the following optional flags are available:

  • --cpu-intensive selects a set implementation that keeps a single copy of the set's samples in memory, at the cost of a longer processing time
  • --memory-intensive selects a set implementation that copies the samples of the training set for every subset it needs to build to grow the tree. This speeds up processing at the cost of a significant increase in memory use.

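If the input or training set is in an SQLite3 database file, the flag --max-db-conns allows limiting the number of open file handles to the given number. The help output above does not list this flag; it lists --concurrency instead, which limits both the concurrent workers on the tree and the DB connections opened at a time. Such a limit is especially useful when running the tool with a considerable number of features and samples on an OS that imposes a low limit on the number of files a process can have open at a given time, such as Mac OS X.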

For example, to grow a tree that predicts the Prediction feature, using the training set we generated before in the SQLite3 file train.db, our metadata.yml as metadata file and so that the output tree is written to a tree.json file we would run:

botanic tree grow -c Prediction -m metadata.yml -i train.db -o tree.json
Test subcommand

The botanic tree test command takes a tree and a test set and provides information on how well the tree predicts the samples on the testing set.

We can see the flags available for it running it with the --help or -h flag:

$ botanic tree test --help
Test the performance of a tree against a test data set

Usage:
  botanic tree test [flags]

Flags:
  -h, --help           help for test
  -i, --input string   path to an input CSV (.csv) or SQLite3 (.db) file, or a PostgreSQL DB connection URL with data to use to grow the tree (defaults to STDIN, interpreted as CSV)
  -t, --tree string    path to a file from which the tree to test will be read and parsed as JSON (required)

Global Flags:
  -m, --metadata string   path to a YML file with metadata describing the different features used on a tree or available on an input set (required)
  -v, --verbose
$

For example, to test the tree we grew into the tree.json file with the testing set we split into our PostgreSQL database accessible at postgres://myuser:mypasswd@mypostgress.server.com/mydb we would run:

botanic tree test -i postgres://myuser:mypasswd@mypostgress.server.com/mydb -m metadata.yml -t tree.json

The output could be similar to:

0.969696 success rate, failed to make a prediction for 0 samples

The success rate is the number of successful predictions divided by the number of samples in the testing set, while the failures to make a prediction count the samples for which the generated tree has no data to make a prediction at all.
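
The two figures in that output can be computed as in the sketch below. The sample type, the evaluate function and the toy predictor are hypothetical, and counting unpredictable samples in the denominator is an assumption based on the description above.

```go
package main

import "fmt"

type sample map[string]string

// evaluate predicts the class feature of every sample and compares the
// prediction with the value the sample actually has. predict returns false
// when no prediction can be made for the sample.
func evaluate(samples []sample, classFeature string, predict func(sample) (string, bool)) (rate float64, failed int) {
	if len(samples) == 0 {
		return 0, 0
	}
	hits := 0
	for _, s := range samples {
		predicted, ok := predict(s)
		switch {
		case !ok:
			failed++
		case predicted == s[classFeature]:
			hits++
		}
	}
	return float64(hits) / float64(len(samples)), failed
}

// toyPredict stands in for a grown tree: it only looks at Income.
func toyPredict(s sample) (string, bool) {
	switch s["Income"] {
	case "high":
		return "will buy", true
	case "low":
		return "won't buy", true
	}
	return "", false
}

func main() {
	samples := []sample{
		{"Income": "high", "Prediction": "will buy"},
		{"Income": "low", "Prediction": "won't buy"},
		{"Income": "low", "Prediction": "will buy"},
		{"Prediction": "will buy"},
	}
	rate, failed := evaluate(samples, "Prediction", toyPredict)
	fmt.Printf("%f success rate, failed to make a prediction for %d samples\n", rate, failed)
}
```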

Predict subcommand

The botanic tree predict subcommand can be used to predict the value for the class feature of a sample using a generated tree. The subcommand does not expect to have all available data for the sample, but rather will interact with the user to gather the values for the features as they are needed to traverse the tree.

We can see the flags available for the subcommand running it with the --help or -h flag:

$ botanic tree predict --help
Use the loaded tree to predict the class feature value for a sample answering a reduced set of question about its features

Usage:
  botanic tree predict [flags]

Flags:
  -h, --help                     help for predict
  -t, --tree string              path to a file from which the tree to test will be read and parsed as JSON (required)
  -u, --undefined-value string   value to input to define a sample's value for a feature as undefined (default "?")

Global Flags:
  -m, --metadata string   path to a YML file with metadata describing the different features used on a tree or available on an input set (required)
  -v, --verbose
$

Again, most of the flags are self-explanatory, but the --undefined-value or -u flag deserves a special mention. A generated tree allows predicting a sample even when this has no available value for a feature that determines the subtree to go down to: at every level a subtree for the scenario where the value is undefined is developed. This flag allows specifying which answer to a feature should be interpreted by the subcommand as the undefined value. You should make sure the one you use does not match an available feature's value.
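
The role of the undefined value during traversal can be sketched as follows. The node type, the predict function and the toy tree are hypothetical and handle discrete features only; they illustrate the idea of a dedicated subtree for undefined answers, not botanic's tree package.

```go
package main

import "fmt"

// node is a hypothetical tree node: a leaf when feature is empty,
// otherwise a question about feature with one subtree per answer plus
// one subtree for the undefined value.
type node struct {
	feature    string
	children   map[string]*node
	undefined  *node
	prediction string
}

// predict walks the tree asking for feature values only as they are
// needed. It returns false when the tree has no subtree for an answer.
func predict(n *node, answer func(feature string) string, undefinedValue string) (string, bool) {
	for n != nil && n.feature != "" {
		a := answer(n.feature)
		if a == undefinedValue {
			n = n.undefined
			continue
		}
		n = n.children[a] // nil when the answer is unknown to the tree
	}
	if n == nil {
		return "", false
	}
	return n.prediction, true
}

// toyTree builds a tiny made-up tree for the example.
func toyTree() *node {
	leaf := func(p string) *node { return &node{prediction: p} }
	return &node{
		feature: "Income",
		children: map[string]*node{
			"low": leaf("won't buy"),
			"high": {
				feature:   "Education",
				children:  map[string]*node{"masters": leaf("will buy"), "bachelors": leaf("won't buy")},
				undefined: leaf("will buy"),
			},
		},
		undefined: leaf("won't buy"),
	}
}

func main() {
	answers := map[string]string{"Income": "high", "Education": "?"}
	ask := func(feature string) string { return answers[feature] }
	fmt.Println(predict(toyTree(), ask, "?"))
}
```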

Version command

The botanic version command shows the version number for the botanic command:

$ botanic version
botanic v0.0.1
$

State and roadmap

The project is currently unstable, and its APIs and tool commands may change.

These are the items planned for future development of the botanic tool and libraries:

  • Use dep to manage dependencies
  • Implement multi-process distributed growing of a tree
  • Develop an etcd-backed tree node store implementation
  • Implement other DB adaptors for sets:
    • MySQL
    • MongoDB

Documentation

Overview

Package botanic provides functions to grow a regression tree (tree.Tree)

Functions

func BranchOut

func BranchOut(ctx context.Context, task *queue.Task, t *tree.Tree, ps *PruningStrategy) (tasks []*queue.Task, e error)

BranchOut takes a context, a task, a tree and a pruning strategy, develops the node in the task using the task's set and available features to predict the tree's class feature, and returns a set of tasks to develop the resulting children nodes, or an error.

func Seed

func Seed(ctx context.Context, classFeature feature.Feature, features []feature.Feature, s set.Set, q queue.Queue, ns tree.NodeStore) (*tree.Tree, error)

Seed takes a context, a class feature, a slice of features, a set of data, a queue and a node store and sets everything up so that workers that consume from the queue afterwards grow a tree that predicts the given class feature using the features in the given slice and according to the training data on the given set. Specifically, it will create the root node of the tree on the node store and push a task to branch it out on the queue. The function returns the tree that can be grown, or an error if the node cannot be created on the store or the task cannot be pushed to the queue (in the amount of time allowed by the given context).

func Work

func Work(ctx context.Context, t *tree.Tree, q queue.Queue, ps *PruningStrategy, emptyQueueSleep time.Duration) error

Work takes a context, a tree, a queue, a pruning strategy and an emptyQueueSleep duration and enters a loop in which it:

* pulls a task from the queue,
* branches its node out into new subnodes using BranchOut
* pushes the tasks for the new subnodes into the queue
* marks the task as completed on the queue

If at some point no task can be pulled from the queue and the sum of tasks running and pending on the queue is 0, the worker ends returning nil. If no task can be pulled but the sum is not 0, then the worker will sleep for the given emptyQueueSleep duration and then retry.

Work will return a non-nil error if the given context times out or is cancelled, if BranchOut returns a non-nil error or if an operation with the given queue returns a non-nil error.

Types

type Partition

type Partition struct {
	Feature feature.Feature
	Tasks   []*queue.Task
	// contains filtered or unexported fields
}

Partition represents a partition of a set according to a feature into subtrees with an information gain to predict the class feature

func NewContinuousPartition

func NewContinuousPartition(ctx context.Context, s set.Set, f *feature.ContinuousFeature, classFeature feature.Feature, p Pruner) (*Partition, error)

NewContinuousPartition takes a context.Context, a set, a continuous feature, a class feature and a pruner and returns a partition of the set for the given feature. The result may be nil if the obtained information gain is considered insufficient.

func NewDiscretePartition

func NewDiscretePartition(ctx context.Context, s set.Set, f *feature.DiscreteFeature, classFeature feature.Feature, p Pruner) (*Partition, error)

NewDiscretePartition takes a context.Context, a set, a discrete feature, a class feature and a pruner and returns a partition of the set for the given feature. The result may be nil if the obtained information gain is considered insufficient.

type Pruner

type Pruner interface {
	Prune(ctx context.Context, s set.Set, p *Partition, classFeature feature.Feature) (bool, error)
}

Pruner is an interface wrapping the Prune method, that can be used to decide whether a partition is good enough to become part of a tree or if it must be pruned instead.

The Prune method takes a context, a set, a partition and a class feature and returns a boolean: true to indicate the partition must be pruned, false to allow it to be added to the tree and developed further.

func DefaultPruner

func DefaultPruner() Pruner

DefaultPruner returns a Pruner whose Prune method evaluates a minimum information gain for the partition and returns true if the partition's information gain is below this minimum and false otherwise. This minimum is calculated as (1/N) x log2(N-1) + (1/N) x [log2(3k-2) - (k x Entropy(S) - k1 x Entropy(S1) - k2 x Entropy(S2) - ... - ki x Entropy(Si))] with

* S being the set and N being the number of elements in it
* k being the number of different values for the class feature on the set
* k1, k2, ... ki being the number of different values for the class feature on the subset for the partition subtree 1, 2, ... i
* S1, S2, ... Si being the subset of data for the partition subtree 1, 2, ... i

func FixedInformationGainPruner

func FixedInformationGainPruner(informationGainThreshold float64) Pruner

FixedInformationGainPruner takes an informationGainThreshold float64 value and returns a Pruner whose Prune method returns whether the informationGainThreshold is greater than or equal to the received partition's information gain.

func NoPruner

func NoPruner() Pruner

NoPruner returns a Pruner whose Prune method always returns false, that is, never prunes.

type PrunerFunc

type PrunerFunc func(ctx context.Context, s set.Set, p *Partition, classFeature feature.Feature) (bool, error)

PrunerFunc wraps a function with the Prune method signature to implement the Pruner interface

func (PrunerFunc) Prune

func (pf PrunerFunc) Prune(ctx context.Context, s set.Set, p *Partition, classFeature feature.Feature) (bool, error)

Prune takes a context.Context, a set, a partition and a class Feature and invokes the PrunerFunc with those parameters to return its boolean result.

type PruningStrategy

type PruningStrategy struct {
	// Pruner is applied during the partition
	// of a node's set with a feature to determine
	// if the result is worth incorporating
	// into the tree.
	Pruner
	// MinimumEntropy is the maximum value of
	// entropy for a node that prevents it from
	// being branched out at all. In other words,
	// nodes whose training set of data has an
	// entropy equal or below this will not be
	// developed.
	MinimumEntropy float64
}

PruningStrategy holds the configuration that determines when a node should not be partitioned further, or at all.

Directories

Path                           Synopsis
cmd/botanic                    Botanic is a tool to grow and use regression trees from sets of samples
feature                        Package feature defines features and criteria for those features
feature/yaml                   Package yaml provides methods to parse feature.Feature specifications, also known as metadata, from YAML documents.
queue                          Package queue defines tasks to be performed to grow a tree as well as an interface for a Queue to manage them.
set                            Package set defines interfaces for sets as samples.
set/csv                        Package csv provides functions to read/write a set.Set as CSV
set/inputsample                Package inputsample provides an implementation of set.Sample that is read from an io.Reader.
set/sqlset                     Package sqlset provides implementations of set.Set that use SQL database as backends.
set/sqlset/pgadapter           Package pgadapter provides an implementation of the Adapter interface in the sqlset package that works over a PostgreSQL database.
set/sqlset/sqlite3adapter      Package sqlite3adapter provides an implementation of the Adapter interface in the sqlset package that works over a SQLite3 database.
tree                           Package tree defines a regression tree
tree/json                      Package json provides functions that marshall/unmarshall a tree.Tree as/from JSON
