featuremill

package module
v0.0.0-...-c214923 Latest
Warning

This package is not in the latest version of its module.

Published: Oct 26, 2023 License: Apache-2.0 Imports: 8 Imported by: 0

README

# MLFeatureExtractorGo

MLFeatureExtractorGo is a stateless, deterministic feature extractor for machine learning use cases, implemented in Go.

The text feature extractor relies on a hashing vectorizer, which makes feature extraction fast with very low overhead. Because feature IDs are derived by hashing, no state has to be stored and the output stays consistent. This gives the following advantages:

- Fast: the hashing vectorizer creates feature IDs directly, so no pre-processing or look-ups are needed
- Economical memory usage even with large volumes of data: no index or state data is stored
- Easy to scale horizontally: with no stored state, no coordination is needed
- Features and values can be added at a later date

These advantages come at a cost. MLFeatureExtractorGo does not support transformations such as Inverse Document Frequency (IDF) for text features, which can be useful for diminishing the effect of common text features. If this significantly affects predictions for your algorithm or dataset, consider adding this support or exploring a different approach or tool.
Other optimizations, such as number scaling, can be implemented given some basic metrics about the numerical data in your datastore, such as the dataset's maximum value.

## Output format

libsvm format (`<index_number>:<value>`), or a slice of such strings if the feature generates multiple vectors (text, timestamp, date).
This format makes the library easy to use along with hector.

## Supported data

- IP - converted to an integer representation (preserves locality) and min/max scaled
- timestamp - represented as 3 seasonality vectors: minute of hour, hour of day, and day of week
- date - represented as 2 seasonality vectors: day of week and month of year
- text - tokenized by word and one-hot encoded using a hashing vectorizer
- numerical - scaled with a log scaler (default) or a min/max rescaler
- boolean - represented as 0 or 1
- categorical - not tokenized, but one-hot encoded using a hashing vectorizer

## Documentation

### Godocs and tests

See the godocs, the included example, and the tests for sample usage.

### Using MLFeatureExtractorGo

For each incoming processed sample, append the returned string or expanded slice to a features slice.
To assemble the final sample in libsvm format, sort the slice and join it with the sample label:

```go
// features holds the "index:value" strings returned by the extractors; "0" is the sample label.
sort.Strings(features)
sample := "0 " + strings.Join(features, " ")
```
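
The assembly step can also be sketched as a small standalone program. The feature strings here are made-up placeholders rather than real extractor output, and `featureIndex` and `buildSample` are hypothetical helpers that are not part of this package. The sketch orders features by numeric index, since `sort.Strings` compares lexicographically (so `"10:1"` sorts before `"9:0.5"`) while libsvm-format readers generally expect ascending indices.

```go
package main

import (
	"fmt"
	"sort"
	"strconv"
	"strings"
)

// featureIndex parses the numeric index of an "index:value" string.
// It returns -1 for malformed input (hypothetical helper, not part of the package).
func featureIndex(f string) int {
	i := strings.Index(f, ":")
	if i < 0 {
		return -1
	}
	n, err := strconv.Atoi(f[:i])
	if err != nil {
		return -1
	}
	return n
}

// buildSample sorts the features by numeric index and prefixes the label.
func buildSample(label string, features []string) string {
	sorted := append([]string(nil), features...) // copy so the caller's slice is untouched
	sort.SliceStable(sorted, func(i, j int) bool {
		return featureIndex(sorted[i]) < featureIndex(sorted[j])
	})
	return label + " " + strings.Join(sorted, " ")
}

func main() {
	features := []string{"10:1", "9:0.5", "123:1"} // placeholder values
	fmt.Println(buildSample("0", features))        // prints "0 9:0.5 10:1 123:1"
}
```

If every feature ID had the same number of digits, the lexicographic and numeric orders would coincide and the plain `sort.Strings` call would be enough.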

Documentation

Index

Constants

This section is empty.

Variables

This section is empty.

Functions

func ExtractBoolean

func ExtractBoolean(field, boolean string) (string, error)

ExtractBoolean returns a 0/1 vector for the deterministic feature ID

func ExtractCategorical

func ExtractCategorical(field, category string) string

ExtractCategorical returns a one-hot vector (value 1) for a deterministic categorical feature ID

func ExtractDate

func ExtractDate(field, date string) ([]string, error)

ExtractDate returns a slice of 2 scaled seasonality vectors: day of week and month of year, each with a deterministic feature ID

func ExtractIP

func ExtractIP(field, addr string) (string, error)

ExtractIP returns a vector that is a scaled integer representation of IPv4 and IPv6 IPs with a deterministic feature ID
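
One way to read "scaled integer representation" is sketched below. This is an assumption about the approach, not the package's actual code: `scaleIP` is a hypothetical function that interprets the address bytes as a big-endian integer and divides by the largest value of that address family (2^32-1 for IPv4, 2^128-1 for IPv6), so numerically close addresses map to close values in [0, 1].

```go
package main

import (
	"fmt"
	"math/big"
	"net"
)

// scaleIP converts an IPv4 or IPv6 address to a value in [0, 1].
// Hypothetical sketch: the real ExtractIP may scale differently.
func scaleIP(addr string) (float64, error) {
	ip := net.ParseIP(addr)
	if ip == nil {
		return 0, fmt.Errorf("invalid IP address: %q", addr)
	}
	raw := ip.To4() // 4 bytes for IPv4 (including IPv4-mapped IPv6), else nil
	if raw == nil {
		raw = ip.To16() // 16 bytes for IPv6
	}
	n := new(big.Int).SetBytes(raw) // big-endian integer value of the address

	// limit = 2^(bits) - 1, the largest address in this family.
	limit := new(big.Int).Lsh(big.NewInt(1), uint(len(raw)*8))
	limit.Sub(limit, big.NewInt(1))

	q := new(big.Float).Quo(new(big.Float).SetInt(n), new(big.Float).SetInt(limit))
	f, _ := q.Float64()
	return f, nil
}

func main() {
	for _, a := range []string{"10.0.0.1", "10.0.0.2", "2001:db8::1"} {
		v, err := scaleIP(a)
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Printf("%s -> %g\n", a, v)
	}
}
```

Because IPv6 values lose precision when narrowed to `float64`, neighbouring IPv6 addresses can map to the same scaled value in this sketch.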

func ExtractNumerical

func ExtractNumerical(field string, num float32, min float32, max float32) string

ExtractNumerical returns a min/max scaled vector for a deterministic feature ID

func ExtractNumericalMaxMin

func ExtractNumericalMaxMin(field string, num float32, min float32, max float32) string

ExtractNumericalMaxMin returns a min/max scaled vector for a deterministic feature ID
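
The two scaling strategies named in the README can be sketched as follows. Both functions are hypothetical illustrations: `minMaxScale` uses the standard formula (num - lo) / (hi - lo), and `logScale` assumes a log(1 + num) transform for non-negative input, which may differ from the formula the package actually uses. The feature ID `42` in the output is a placeholder.

```go
package main

import (
	"fmt"
	"math"
)

// minMaxScale maps num from the range [lo, hi] onto [0, 1].
// A degenerate range (hi <= lo) yields 0 to avoid dividing by zero.
func minMaxScale(num, lo, hi float32) float32 {
	if hi <= lo {
		return 0
	}
	return (num - lo) / (hi - lo)
}

// logScale compresses large magnitudes with log(1 + num).
// Assumption: inputs are non-negative; negative inputs are clamped to 0.
func logScale(num float32) float32 {
	if num < 0 {
		num = 0
	}
	return float32(math.Log1p(float64(num)))
}

func main() {
	// 42 is a placeholder feature ID, not one produced by the package.
	fmt.Printf("42:%g\n", minMaxScale(5, 0, 10))
	fmt.Printf("42:%g\n", logScale(1000))
}
```

Min/max scaling needs the dataset's bounds up front, while a log transform needs no dataset statistics, which fits the stateless design described in the README.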

func ExtractText

func ExtractText(text, delim string) []string

ExtractText returns a slice of "featureID:1" strings for each token in the string
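
The hashing-vectorizer idea behind this function can be sketched as below. Everything here is an assumption made for illustration: the FNV-1a hash, the feature-space size `numFeatures`, the 1-based IDs, and the de-duplication of repeated tokens are not taken from the package, and `hashTokens` is a hypothetical stand-in for ExtractText.

```go
package main

import (
	"fmt"
	"hash/fnv"
	"strings"
)

// numFeatures is an assumed size of the hashed feature space.
const numFeatures = 1 << 20

// featureID hashes a token to a stable ID in [1, numFeatures].
// The same token always yields the same ID, so no vocabulary is stored.
func featureID(token string) uint32 {
	h := fnv.New32a()
	h.Write([]byte(token)) // Write on a hash.Hash never returns an error
	return h.Sum32()%numFeatures + 1
}

// hashTokens splits text on delim and emits one "featureID:1" per distinct ID.
func hashTokens(text, delim string) []string {
	seen := make(map[uint32]bool)
	var out []string
	for _, tok := range strings.Split(text, delim) {
		if tok == "" {
			continue // skip empty tokens from repeated delimiters
		}
		id := featureID(tok)
		if seen[id] {
			continue // token already seen, or a hash collision
		}
		seen[id] = true
		out = append(out, fmt.Sprintf("%d:1", id))
	}
	return out
}

func main() {
	fmt.Println(hashTokens("the quick brown fox jumps over the lazy dog", " "))
}
```

Distinct tokens can collide on the same ID; a larger feature space makes that less likely at the cost of a sparser vector. The same hashing applies to a categorical value if the whole value is treated as a single token.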

func ExtractTimestamp

func ExtractTimestamp(field, timestamp string) ([]string, error)

ExtractTimestamp returns a slice of 3 scaled seasonality vectors: minute of hour, hour of day, and day of week, each with a deterministic feature ID
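
A rough sketch of how three seasonality values could be derived is shown below. The input format and scaling are assumptions, since the documentation does not state them: `seasonality` is a hypothetical function that treats the timestamp as Unix seconds in UTC and divides each component by its maximum (59, 23, and 6). Pairing the values with feature IDs is omitted.

```go
package main

import (
	"fmt"
	"strconv"
	"time"
)

// seasonality returns minute-of-hour, hour-of-day and day-of-week,
// each scaled to [0, 1]. Assumption: timestamp is Unix seconds, read in UTC.
func seasonality(timestamp string) ([3]float64, error) {
	secs, err := strconv.ParseInt(timestamp, 10, 64)
	if err != nil {
		return [3]float64{}, fmt.Errorf("invalid timestamp %q: %w", timestamp, err)
	}
	t := time.Unix(secs, 0).UTC()
	return [3]float64{
		float64(t.Minute()) / 59,  // 0..59
		float64(t.Hour()) / 23,    // 0..23
		float64(t.Weekday()) / 6,  // Sunday = 0 .. Saturday = 6
	}, nil
}

func main() {
	v, err := seasonality("0") // 1970-01-01 00:00:00 UTC, a Thursday
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(v)
}
```

Linear scaling places Saturday and Sunday at opposite ends of the range even though they are adjacent days; a cyclical encoding would avoid that, but is outside what the documentation describes.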

Types

This section is empty.
