data

package
v0.16.1
Published: Dec 19, 2024 License: Apache-2.0 Imports: 32 Imported by: 2

Documentation

Overview

Package data is a collection of tools that facilitate data loading and preprocessing.

Constants

This section is empty.

Variables

This section is empty.

Functions

func Batch added in v0.2.0

func Batch(backend backends.Backend, ds train.Dataset, batchSize int, createLeadingAxis, dropIncompleteBatch bool) train.Dataset

Batch creates a dataset that groups the examples of `ds` into batches of size `batchSize`.

It uses GoMLX to batch the tensors themselves, so it takes a backends.Backend as its first parameter. This also means that it yields examples already stored "on device" -- on whichever platform the Backend was configured with.

Typically, Batch can benefit from ReadAhead, so that while training or evaluation on one batch is happening, the next batch is being built. Consider using ReadAhead on the Batch dataset, even with a buffer of only 1.

Args:

  • `backend`: will be used to create the graph that actually does the batching.
  • `ds`: the dataset to be batched.
  • `batchSize`: size of each batch. The last batch of an epoch can be smaller if there are not enough examples left to fill it (unless `dropIncompleteBatch` is set).
  • `createLeadingAxis`: usually set to true, it will create a new leading axis that becomes the batch dimension. Otherwise, it simply concatenates the individual results at the axis 0 -- this can be used for instance to increase the size of a batch.
  • `dropIncompleteBatch`: at the end of an epoch, if there are not enough examples to fill a batch and this is set to true, the last batch is dropped. Otherwise, a partial batch is returned -- since it has a different shape, this may trigger the re-compilation of a graph. Returning the partial batch is usually desirable for evaluation, but not for training.

Returns a `train.Dataset` that yields batched examples.
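
The effect of `dropIncompleteBatch` can be illustrated without GoMLX. The following is a minimal sketch on plain Go slices, not the package's implementation (which batches tensors on the backend). The helper `batchSlices` is hypothetical and assumes `batchSize > 0`.

```go
package main

import "fmt"

// batchSlices is a hypothetical helper that groups examples into batches of
// batchSize. It only mimics the documented semantics on plain slices and
// assumes batchSize > 0.
func batchSlices(examples []int, batchSize int, dropIncompleteBatch bool) [][]int {
	var batches [][]int
	for start := 0; start < len(examples); start += batchSize {
		end := start + batchSize
		if end > len(examples) {
			if dropIncompleteBatch {
				break // Not enough examples left to fill a batch.
			}
			end = len(examples) // Keep the partial (smaller) last batch.
		}
		batches = append(batches, examples[start:end])
	}
	return batches
}

func main() {
	examples := []int{0, 1, 2, 3, 4, 5, 6}
	fmt.Println(batchSlices(examples, 3, false)) // [[0 1 2] [3 4 5] [6]]
	fmt.Println(batchSlices(examples, 3, true))  // [[0 1 2] [3 4 5]]
}
```

With 7 examples and a batch size of 3, the last batch holds a single example unless it is dropped, which is the shape change that can trigger a graph re-compilation.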

func ByteCountIEC added in v0.2.0

func ByteCountIEC[T interface {
	int | int64 | uint64 | uint | uintptr
}](count T) string

ByteCountIEC converts a byte count to a string using the appropriate unit (B, KiB, MiB, GiB, ...). It uses the binary prefix system from IEC -- so powers of 1024 (as opposed to powers of 1000).
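
As an illustration of the binary prefix system, here is a minimal sketch of such a conversion using only the standard library. The function `byteCountIEC` is a hypothetical stand-in, and the exact formatting produced by the package's ByteCountIEC (number of decimals, spacing) is an assumption.

```go
package main

import "fmt"

// byteCountIEC is a hypothetical stand-in that formats a byte count using
// powers of 1024. The output format is an assumption of this sketch.
func byteCountIEC(count uint64) string {
	const unit = 1024
	if count < unit {
		return fmt.Sprintf("%d B", count)
	}
	div, exp := uint64(unit), 0
	for n := count / unit; n >= unit; n /= unit {
		div *= unit
		exp++
	}
	return fmt.Sprintf("%.1f %ciB", float64(count)/float64(div), "KMGTPE"[exp])
}

func main() {
	fmt.Println(byteCountIEC(512))     // 512 B
	fmt.Println(byteCountIEC(1536))    // 1.5 KiB
	fmt.Println(byteCountIEC(1 << 20)) // 1.0 MiB
}
```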

func CopyWithProgressBar added in v0.2.0

func CopyWithProgressBar(dst io.Writer, src io.Reader, contentLength int64) (n int64, err error)

CopyWithProgressBar is similar to io.Copy, but updates the progress bar with the amount of data copied.

It requires knowing the amount of data to copy up-front.

func Download

func Download(url, filePath string, showProgressBar bool) (size int64, err error)

Download file from url and save at given path. Attempts to create directory if it doesn't yet exist.

Optionally, use showProgressBar.

func DownloadAndUntarIfMissing

func DownloadAndUntarIfMissing(url, baseDir, tarFile, targetUntarDir, checkHash string) error

DownloadAndUntarIfMissing downloads `tarFile` from the given url if the file is not there yet, and then untars it if the target directory is missing.

If checkHash is provided, it checks that the file matches the hash, and fails otherwise.

func DownloadAndUnzipIfMissing

func DownloadAndUnzipIfMissing(url, zipFile, unzipBaseDir, targetUnzipDir, checkHash string) error

DownloadAndUnzipIfMissing downloads `zipFile` from the given url if the file is not there yet, and then unzips it under the directory `unzipBaseDir` if the target directory `targetUnzipDir` is missing.

It's recommended that all paths be absolute.

If checkHash is provided, it checks that the file matches the hash, and fails otherwise.

func DownloadIfMissing

func DownloadIfMissing(url, filePath, checkHash string) error

DownloadIfMissing will check if the path exists already, and if not it will download the file from the given URL.

If checkHash is provided, it checks that the file matches the hash, and fails otherwise.

func FileExists

func FileExists(path string) bool

FileExists returns true if file or directory exists.

func Freeing added in v0.9.0

func Freeing(ds train.Dataset) *freeingDataset

Freeing implements a sequential dataset (it should not be parallelized) that immediately releases the yielded inputs and labels in between each `Yield` call, without waiting for garbage collection.

This is needed for datasets with large inputs, to prevent more than one input from being alive in the accelerator's (GPU) memory at the same time, in case garbage collection hasn't run yet. Notice that Go's garbage collector has no notion of the associated resource usage in the GPU and is not able to respond to it by itself.

It works by keeping a reference to previously yielded values and freeing them before yielding the next one.

While you can wrap a parallelized (with Parallel) dataset with Freeing, the other way around will break: Freeing will free the yielded tensors before they are actually used.
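
The "keep a reference and free before the next yield" pattern can be sketched in plain Go as follows. The `buffer` and `freeingSource` types are hypothetical and only stand in for accelerator-backed tensors and the dataset wrapper; this is not the package's implementation.

```go
package main

import "fmt"

// buffer stands in for an accelerator-backed tensor (hypothetical type).
type buffer struct {
	id    int
	freed bool
}

// free pretends to release the device memory immediately.
func (b *buffer) free() { b.freed = true }

// freeingSource yields buffers and frees the previously yielded one before
// producing the next, so at most one buffer is alive at a time.
type freeingSource struct {
	next     int
	previous *buffer
}

func (s *freeingSource) yield() *buffer {
	if s.previous != nil {
		s.previous.free()
	}
	b := &buffer{id: s.next}
	s.next++
	s.previous = b
	return b
}

func main() {
	src := &freeingSource{}
	first := src.yield()
	second := src.yield()
	fmt.Println(first.freed, second.freed) // true false
}
```

The sketch also shows why parallelizing on top of such a wrapper breaks: a consumer still holding `first` would find it already freed once `second` has been yielded.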

func Map added in v0.4.0

func Map(ds train.Dataset, mapFn MapExampleFn) train.Dataset

Map maps a dataset through a transformation with a (normal Go) function that runs on the host CPU.

See MapWithGraphFn for a function that runs on the accelerator, with a graph building function.

func MapWithGraphFn added in v0.9.0

func MapWithGraphFn(backend backends.Backend, ctx *context.Context, dataset train.Dataset, graphFn MapGraphFn) train.Dataset

MapWithGraphFn returns a `train.Dataset` that yields the result of applying (mapping) the graph function `graphFn` to the batches yielded by the provided `dataset`. The function is executed by the given `backend`. If `ctx` is nil, a new one is created.

The graph building function `graphFn` can return a different number of `inputs` or `labels` than what it was given, but these numbers should never change -- always return the same number of inputs and labels.

func NewConstantDataset added in v0.13.0

func NewConstantDataset() train.Dataset

NewConstantDataset returns a dataset that always yields the scalar 0.

This is useful when training something that generates its own inputs and labels -- like trying to approximate a function with another function.

It loops indefinitely.

func Normalization added in v0.4.0

func Normalization(backend backends.Backend, ds train.Dataset, inputsIndex int, independentAxes ...int) (mean, stddev *tensors.Tensor, err error)

Normalization calculates the normalization parameters `mean` and `stddev` for the `inputsIndex`-th input from the given dataset.

These values can later be used for normalization by simply applying `Div(Sub(x, mean), stddev)`. To use as side inputs in an ML model, just set them to variables.

The parameter `independentAxes` lists the axes that should not be normalized together. A typical value is -1, the feature axis (last axis), so that each feature gets its own normalization.

Notice that for any feature that happens to be constant, the `stddev` will be 0, and trying to normalize (divide) by it will lead to numeric errors. Use ReplaceZerosByOnes to avoid these numeric issues.
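
The arithmetic behind per-feature normalization, including the zero `stddev` case, can be sketched on plain slices. This is not the package's graph-based implementation: the helpers `meanAndStddev` and `replaceZerosByOnes` are hypothetical, and using the population standard deviation is an assumption.

```go
package main

import (
	"fmt"
	"math"
)

// meanAndStddev computes the per-feature mean and population standard
// deviation over rows shaped [numExamples][numFeatures]. The population
// formula is an assumption of this sketch.
func meanAndStddev(rows [][]float64) (mean, stddev []float64) {
	numFeatures := len(rows[0])
	mean = make([]float64, numFeatures)
	stddev = make([]float64, numFeatures)
	for _, row := range rows {
		for j, v := range row {
			mean[j] += v
		}
	}
	for j := range mean {
		mean[j] /= float64(len(rows))
	}
	for _, row := range rows {
		for j, v := range row {
			d := v - mean[j]
			stddev[j] += d * d
		}
	}
	for j := range stddev {
		stddev[j] = math.Sqrt(stddev[j] / float64(len(rows)))
	}
	return mean, stddev
}

// replaceZerosByOnes mirrors the idea of ReplaceZerosByOnes on a plain slice.
func replaceZerosByOnes(xs []float64) []float64 {
	out := make([]float64, len(xs))
	for i, x := range xs {
		if x == 0 {
			out[i] = 1
		} else {
			out[i] = x
		}
	}
	return out
}

func main() {
	rows := [][]float64{{1, 10}, {3, 10}} // The second feature is constant.
	mean, stddev := meanAndStddev(rows)
	safe := replaceZerosByOnes(stddev)
	for _, row := range rows {
		normalized := make([]float64, len(row))
		for j, v := range row {
			normalized[j] = (v - mean[j]) / safe[j]
		}
		fmt.Println(normalized) // [-1 0] then [1 0]
	}
}
```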

func ParseGzipCSVFile added in v0.9.0

func ParseGzipCSVFile(filePath string, perRowFn func(row []string) error) error

ParseGzipCSVFile opens a `CSV.gz` file and iterates over each of its rows, calling `perRowFn`, with a slice of strings for each cell value in the row.

func ReadAhead added in v0.2.0

func ReadAhead(ds train.Dataset, bufferSize int) train.Dataset

ReadAhead returns a Dataset that reads ahead up to bufferSize elements of the given `ds`, so that results are immediately available when Yield is called.

It uses ParallelDataset to implement it.
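
The idea of reading ahead can be sketched with a goroutine and a buffered channel. This is a simplified illustration in plain Go, not the ParallelDataset implementation; `readAhead` and the `next` callback are hypothetical.

```go
package main

import "fmt"

// readAhead starts a goroutine that calls next repeatedly and parks up to
// bufferSize results in a channel, so the consumer rarely has to wait.
// next returns ok=false when the source is exhausted.
func readAhead(next func() (value int, ok bool), bufferSize int) <-chan int {
	out := make(chan int, bufferSize)
	go func() {
		defer close(out)
		for {
			v, ok := next()
			if !ok {
				return
			}
			out <- v
		}
	}()
	return out
}

func main() {
	i := 0
	next := func() (int, bool) {
		if i >= 5 {
			return 0, false
		}
		i++
		return i * i, true
	}
	for v := range readAhead(next, 1) {
		fmt.Println(v) // 1, 4, 9, 16, 25, one per line
	}
}
```

Even a buffer of 1 lets the producer prepare the next element while the consumer is busy with the current one.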

func ReplaceTildeInDir

func ReplaceTildeInDir(dir string) string

ReplaceTildeInDir replaces a leading "~" in `dir` with the user's home directory. It returns `dir` unchanged if it doesn't start with "~".

It may panic with an error if `dir` has an unknown user (e.g: `~unknown/...`)

func ReplaceZerosByOnes added in v0.4.0

func ReplaceZerosByOnes(x *Node) *Node

ReplaceZerosByOnes replaces any zero values in x by one. This is useful if normalizing a value with a standard deviation (`stddev`) that has zeros.

func Take added in v0.4.0

func Take(ds train.Dataset, n int) train.Dataset

Take returns a wrapper to `ds`, a `train.Dataset` that only yields `n` batches.

func Untar

func Untar(baseDir, tarFile string) error

Untar extracts the tar file, choosing the decompression according to the file suffix: .gz for gzip, .bz2 for bzip2.

func Unzip

func Unzip(zipFile, zipBaseDir string) error

Unzip file, from the given zipBaseDir.

func ValidateChecksum

func ValidateChecksum(path, checkHash string) error

ValidateChecksum verifies that the checksum of the file in the given path matches the checksum given. If validation fails, it will remove the file (!) and return an error.

Types

type InMemoryDataset added in v0.2.0

type InMemoryDataset struct {
	// contains filtered or unexported fields
}

InMemoryDataset represents a Dataset that has been completely read into the memory of the device it was created with -- the platform of the associated `backends.Backend`.

It supports batching, shuffling (with and without replacement) and can be duplicated (only one copy of the underlying data is used).

Finally, it supports serialization and deserialization, to accelerate loading of the data -- in case generating the original dataset is expensive (e.g: image transformations).

func GobDeserializeInMemory added in v0.2.0

func GobDeserializeInMemory(backend backends.Backend, deviceNums []backends.DeviceNum, decoder *gob.Decoder) (mds *InMemoryDataset, err error)

GobDeserializeInMemory deserializes a dataset from the decoder. It requires a `backends.Backend` and the deviceNum(s) where the data is going to be stored -- it drops the local storage copy of the values.

If deviceNums is nil, it defaults to []DeviceNum{0}, which is safe in most cases.

No sampling configuration is recovered, and the InMemoryDataset created is sequential (no random sampling) that reads through only one epoch. The random number generator is also newly initialized (see InMemoryDataset.WithRand).

func InMemory added in v0.2.0

func InMemory(backend backends.Backend, ds train.Dataset, dsIsBatched bool) (mds *InMemoryDataset, err error)

InMemory creates dataset that reads the whole contents of `ds` into memory.

It uses GoMLX to batch the tensors themselves, so it takes a backends.Backend as its first parameter. The data will be cached in the platform (device) the Backend was configured with.

Args:

  • `backend`: will be used to create the graphs that do the caching and the batching.
  • `ds`: dataset to be cached. It is read in full once, concatenating the results in the cache.
  • `dsIsBatched`: whether the input `ds` is batched, and its leading (first) axis is a batch size. If true, the count of examples is adjusted accordingly. Notice that if true, the batch size must be the same for all elements of the inputs and labels yielded by `ds`.

Returns an `InMemoryDataset` that is initially not shuffled and not batched. You can configure how you want to use it with the other configuration methods.

func InMemoryFromData added in v0.4.0

func InMemoryFromData(backend backends.Backend, name string, inputs []any, labels []any) (mds *InMemoryDataset, err error)

InMemoryFromData creates an InMemoryDataset from the static data given -- it is immediately converted to a tensor, if not a tensor already. The first dimension of each element of inputs and labels must be the batch size, and the same for every element.

This is useful for writing unit tests, with small datasets provided inline.

Example: A dataset with one input tensor and one label tensor. Each with two examples.

mds, err := InMemoryFromData(backend, "test",
	[]any{[][]float32{{1, 2}, {3, 4}}},
	[]any{[][]float32{{3}, {7}}})

func (*InMemoryDataset) BatchSize added in v0.2.0

func (mds *InMemoryDataset) BatchSize(n int, dropIncompleteBatch bool) *InMemoryDataset

BatchSize configures the InMemoryDataset to return batches of the given size. If dropIncompleteBatch is set to true, it will simply drop examples if there are not enough to fill a batch -- this can only happen on the last batch of an epoch. Otherwise, it will return a partially filled batch.

If `n` is set to 0, it reverts back to yielding one example at a time.

It returns the modified InMemoryDataset, so calls can be cascaded if one wants.

func (*InMemoryDataset) Copy added in v0.2.0

func (mds *InMemoryDataset) Copy() *InMemoryDataset

Copy returns a copy of the dataset. It uses the same underlying data -- so very little memory is used.

The copy comes configured by default with sequential reading (not random sampling), non-looping, and reset.

func (*InMemoryDataset) FinalizeAll added in v0.2.0

func (mds *InMemoryDataset) FinalizeAll()

FinalizeAll will immediately free all the underlying data (and not wait for the garbage collector). This invalidates not only this InMemoryDataset, but also all other copies that use the same data (created with Copy).

This is not concurrency safe: if there are concurrent calls to sampling, this may lead to an undefined state or errors.

func (*InMemoryDataset) GobSerialize added in v0.2.0

func (mds *InMemoryDataset) GobSerialize(encoder *gob.Encoder) (err error)

GobSerialize in-memory content to the encoder.

Only the underlying data is serialized. Neither the backends.Backend nor the sampling configuration is serialized. The contents of the `spec` (see WithSpec) are also not serialized.

func (*InMemoryDataset) Infinite added in v0.2.0

func (mds *InMemoryDataset) Infinite(infinite bool) *InMemoryDataset

Infinite sets whether the dataset should loop indefinitely. The default is `infinite = false`, which causes the dataset to go through the data only once before returning io.EOF.

It returns the modified InMemoryDataset, so calls can be cascaded if one wants.

func (*InMemoryDataset) Memory added in v0.2.0

func (mds *InMemoryDataset) Memory() uintptr

Memory returns an approximation of the memory being used.

func (*InMemoryDataset) Name added in v0.2.0

func (mds *InMemoryDataset) Name() string

Name implements `train.Dataset`

func (*InMemoryDataset) NumExamples added in v0.2.0

func (mds *InMemoryDataset) NumExamples() int

NumExamples returns the number of examples cached.

func (*InMemoryDataset) RandomWithReplacement added in v0.2.0

func (mds *InMemoryDataset) RandomWithReplacement() *InMemoryDataset

RandomWithReplacement configures the InMemoryDataset to return random elements with replacement. If this is configured, Shuffle is canceled.

It returns the modified InMemoryDataset, so calls can be cascaded if one wants.

func (*InMemoryDataset) Reset added in v0.2.0

func (mds *InMemoryDataset) Reset()

Reset implements `train.Dataset`

func (*InMemoryDataset) SetName added in v0.4.0

func (mds *InMemoryDataset) SetName(name string) *InMemoryDataset

SetName sets the name of the dataset, and returns the updated dataset.

func (*InMemoryDataset) Shuffle added in v0.2.0

func (mds *InMemoryDataset) Shuffle() *InMemoryDataset

Shuffle configures the InMemoryDataset to shuffle the order of the data. It returns random elements without replacement. If this is configured, RandomWithReplacement is canceled.

The data is reshuffled at each call to Reset(). This happens automatically if the dataset is configured to loop (see Infinite).

It returns the modified InMemoryDataset, so calls can be cascaded if one wants.

func (*InMemoryDataset) TakeN added in v0.11.0

func (mds *InMemoryDataset) TakeN(n int) *InMemoryDataset

TakeN configures the dataset to take only `n` examples before returning io.EOF. If set to 0 or -1, it takes as many examples as there are. If configured, it automatically disables InMemoryDataset.Infinite.

func (*InMemoryDataset) WithRand added in v0.2.0

func (mds *InMemoryDataset) WithRand(rng *rand.Rand) *InMemoryDataset

WithRand sets the random number generator (RNG) for shuffling or random sampling. This allows for repeatable deterministic random sampling, if one wants. The default is to use an RNG initialized with the current nanosecond time.

If dataset is configured with Shuffle, this re-shuffles the dataset immediately.

It returns the modified InMemoryDataset, so calls can be cascaded if one wants.

func (*InMemoryDataset) WithSpec added in v0.2.0

func (mds *InMemoryDataset) WithSpec(spec any) *InMemoryDataset

WithSpec sets the `spec` that is returned in Yield. The default is to use the one read from the original dataset passed to InMemory call. This allows one to set to something different.

It returns the modified InMemoryDataset, so calls can be cascaded if one wants.

func (*InMemoryDataset) Yield added in v0.2.0

func (mds *InMemoryDataset) Yield() (spec any, inputs []*tensors.Tensor, labels []*tensors.Tensor, err error)

Yield implements `train.Dataset`.

Returns next batch's inputs and labels or single example if BatchSize is set to 0.

type MapExampleFn added in v0.9.0

type MapExampleFn func(inputs, labels []*tensors.Tensor) (mappedInputs, mappedLabels []*tensors.Tensor)

MapExampleFn is a normal Go function that applies a transformation to the inputs/labels of a dataset.

type MapGraphFn added in v0.4.0

type MapGraphFn func(ctx *context.Context, inputs, labels []*Node) (mappedInputs, mappedLabels []*Node)

MapGraphFn is a graph building function that transforms inputs and labels.

type ParallelDataset

type ParallelDataset struct {
	Dataset train.Dataset
	// contains filtered or unexported fields
}

ParallelDataset is a wrapper around a `train.Dataset` that parallelizes calls to Yield. See details in CustomParallel.

func CustomParallel added in v0.2.0

func CustomParallel(ds train.Dataset) *ParallelDataset

CustomParallel builds a ParallelDataset that can be used to parallelize any train.Dataset, as long as the underlying dataset ds is thread-safe.

ParallelDataset can be further configured (see Parallelism and Buffer), and then one has to call Start before actually using the Dataset.

To avoid leaking goroutines, call ParallelDataset.Done when exiting.

The order of the yields is not preserved -- the parallelization may yield results in a different order, and in some exceptional circumstances may create an order bias (results that are faster to generate being yielded first).

Example:

var ds train.Dataset
ds = NewMyDataset(...)
ds = data.CustomParallel(ds).Buffer(10).Start()
MyTrainFunc(ds)
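
The reason the underlying dataset must be thread-safe, and why the order is not preserved, can be sketched with goroutines feeding a shared channel. This is a simplified illustration in plain Go, not the ParallelDataset implementation; `safeCounter` and `parallelYield` are hypothetical.

```go
package main

import (
	"fmt"
	"sort"
	"sync"
)

// safeCounter is a thread-safe source yielding 0..limit-1 exactly once each.
type safeCounter struct {
	mu    sync.Mutex
	next  int
	limit int
}

func (c *safeCounter) yield() (int, bool) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if c.next >= c.limit {
		return 0, false
	}
	v := c.next
	c.next++
	return v, true
}

// parallelYield runs `parallelism` goroutines calling yield concurrently and
// collects the results in a buffered channel. Arrival order is not guaranteed.
func parallelYield(src *safeCounter, parallelism, buffer int) <-chan int {
	out := make(chan int, buffer)
	var wg sync.WaitGroup
	for i := 0; i < parallelism; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for {
				v, ok := src.yield()
				if !ok {
					return
				}
				out <- v
			}
		}()
	}
	go func() {
		wg.Wait() // Close the channel only after all workers have finished.
		close(out)
	}()
	return out
}

func main() {
	var got []int
	for v := range parallelYield(&safeCounter{limit: 8}, 4, 10) {
		got = append(got, v)
	}
	sort.Ints(got)   // Sort only to make the printed result deterministic.
	fmt.Println(got) // [0 1 2 3 4 5 6 7]
}
```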

func Parallel added in v0.2.0

func Parallel(ds train.Dataset) *ParallelDataset

Parallel parallelizes yield calls of any thread-safe train.Dataset.

It uses CustomParallel and automatically starts it with the default parameters.

To avoid leaking goroutines, call ParallelDataset.Done when exiting.

The order of the yields is not preserved -- the parallelization may yield results in a different order, and in some exceptional circumstances may create an order bias (results that are faster to generate being yielded first).

Example:

var ds train.Dataset
ds = NewMyDataset(...)
ds = data.Parallel(ds)
MyTrainFunc(ds)

func (*ParallelDataset) Buffer added in v0.2.0

func (pd *ParallelDataset) Buffer(n int) *ParallelDataset

Buffer sets the size of the buffer reserved in the channel that collects the parallel yields. Notice there is already an intrinsic buffering that happens in the goroutines sampling in parallel.

This must be called before a call to Start.

It returns the updated ParallelDataset, so calls can be cascaded.

func (*ParallelDataset) Done added in v0.11.0

func (pd *ParallelDataset) Done()

Done stops all the goroutines of the parallel dataset and waits for them to finish.

func (*ParallelDataset) Name

func (pd *ParallelDataset) Name() string

Name implements train.Dataset.

func (*ParallelDataset) Parallelism

func (pd *ParallelDataset) Parallelism(n int) *ParallelDataset

Parallelism is the number of goroutines to start, each calling `ds.Yield()` in parallel to accelerate the generation of batches. If set to 0 (the default), it will use the number of cores in the system plus 1.

It also allocates a buffer (in a Go channel) for each goroutine.

This must be called before a call to Start.

It returns the updated ParallelDataset, so calls can be cascaded.

func (*ParallelDataset) Reset

func (pd *ParallelDataset) Reset()

Reset implements train.Dataset.

func (*ParallelDataset) Start added in v0.2.0

func (pd *ParallelDataset) Start() *ParallelDataset

Start indicates that the configuration of the dataset is finished, and makes it a valid Dataset.

After Start its configuration can no longer be changed.

It returns the updated ParallelDataset, so calls can be cascaded.

func (*ParallelDataset) WithName added in v0.9.0

func (pd *ParallelDataset) WithName(name string) *ParallelDataset

WithName sets the name of the parallel dataset. It defaults to the original dataset name.

It returns the updated ParallelDataset, so calls can be cascaded.

func (*ParallelDataset) Yield

func (pd *ParallelDataset) Yield() (spec any, inputs []*tensors.Tensor, labels []*tensors.Tensor, err error)

Yield implements train.Dataset.

Directories

Path          Synopsis
downloader    Package downloader implements parallel download of multiple URLs, with progress report callbacks.
hdf5          Package hdf5 provides a trivial API to access HDF5 file contents.
huggingface   Package huggingface 🤗 provides functionality to download HuggingFace (HF) models and extract tensors stored in the ".safetensors" format.
