Documentation
¶
Overview ¶
Package data is a collection of tools that facilitate data loading and preprocessing.
Index ¶
- func Batch(manager *Manager, ds train.Dataset, batchSize int, ...) train.Dataset
- func ByteCountIEC(count int64) string
- func CopyWithProgressBar(dst io.Writer, src io.Reader, contentLength int64) (n int64, err error)
- func Download(url, filePath string, showProgressBar bool) (size int64, err error)
- func DownloadAndUntarIfMissing(url, baseDir, tarFile, targetUntarDir, checkHash string) error
- func DownloadAndUnzipIfMissing(url, zipFile, unzipBaseDir, targetUnzipDir, checkHash string) error
- func DownloadIfMissing(url, filePath, checkHash string) error
- func FileExists(path string) bool
- func ReadAhead(ds train.Dataset, bufferSize int) train.Dataset
- func ReplaceTildeInDir(dir string) string
- func Untar(baseDir, tarFile string) error
- func Unzip(zipFile, zipBaseDir string) error
- func ValidateChecksum(path, checkHash string) error
- type InMemoryDataset
- func (mds *InMemoryDataset) BatchSize(n int, dropIncompleteBatch bool) *InMemoryDataset
- func (mds *InMemoryDataset) Copy() *InMemoryDataset
- func (mds *InMemoryDataset) FinalizeAll()
- func (mds *InMemoryDataset) GobSerialize(encoder *gob.Encoder) (err error)
- func (mds *InMemoryDataset) Infinite(infinite bool) *InMemoryDataset
- func (mds *InMemoryDataset) Memory() int64
- func (mds *InMemoryDataset) Name() string
- func (mds *InMemoryDataset) NumExamples() int
- func (mds *InMemoryDataset) RandomWithReplacement() *InMemoryDataset
- func (mds *InMemoryDataset) Reset()
- func (mds *InMemoryDataset) Shuffle() *InMemoryDataset
- func (mds *InMemoryDataset) WithRand(rng *rand.Rand) *InMemoryDataset
- func (mds *InMemoryDataset) WithSpec(spec any) *InMemoryDataset
- func (mds *InMemoryDataset) Yield() (spec any, inputs []tensor.Tensor, labels []tensor.Tensor, err error)
- type ParallelDataset
- func (pd *ParallelDataset) Buffer(n int) *ParallelDataset
- func (pd *ParallelDataset) Name() string
- func (pd *ParallelDataset) Parallelism(n int) *ParallelDataset
- func (pd *ParallelDataset) Reset()
- func (pd *ParallelDataset) Start() *ParallelDataset
- func (pd *ParallelDataset) Yield() (spec any, inputs []tensor.Tensor, labels []tensor.Tensor, err error)
Constants ¶
This section is empty.
Variables ¶
This section is empty.
Functions ¶
func Batch ¶ added in v0.2.0
func Batch(manager *Manager, ds train.Dataset, batchSize int, createLeadingAxis, dropIncompleteBatch bool) train.Dataset
Batch creates a dataset that batches `ds` into batches of size `batchSize`.
It uses GoMLX to batch the tensors themselves, so it takes a graph.Manager as its first parameter. This also means that it yields examples already stored "on device" -- whichever platform the Manager was configured with.
Typically, Batch can benefit from ReadAhead, so that while training or evaluation of one batch is happening, the next batch is being built. Consider using ReadAhead on the Batch dataset, even with a buffer of only 1.
Args:
- `manager`: will be used to create the graph that actually does the batching.
- `ds`: the dataset to be batched.
- `batchSize`: size of each batch, except when there are no more examples, in which case batches can be smaller (unless `dropIncompleteBatch` is set).
- `createLeadingAxis`: usually set to true, it will create a new leading axis that becomes the batch dimension. Otherwise, it simply concatenates the individual results along axis 0 -- this can be used, for instance, to increase the size of a batch.
- `dropIncompleteBatch`: at the end of an epoch, if there are not enough examples to fill a batch, and this is set to true, the last batch is dropped. Otherwise, it returns only a partial batch -- having a different shape, this may trigger the re-compilation of a graph. Returning a partial batch is usually desirable for evaluation, but not for training.
Returns a `train.Dataset` that yields batched examples.
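For illustration, a minimal sketch combining Batch with ReadAhead; `manager` and `myDataset` are assumed to exist (a graph.Manager and a hypothetical per-example train.Dataset, respectively):

    // myDataset is a hypothetical dataset yielding one example at a time.
    var ds train.Dataset = myDataset

    // Batch 32 examples, creating a new leading (batch) axis, and drop
    // the last incomplete batch so all batches have the same shape.
    ds = data.Batch(manager, ds, 32, true, true)

    // Read ahead 1 batch: the next batch is built while the current
    // one is being consumed by training.
    ds = data.ReadAhead(ds, 1)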
func ByteCountIEC ¶ added in v0.2.0
ByteCountIEC converts a byte count to a string using the appropriate unit (B, KiB, MiB, GiB, ...). It uses the binary prefix system from IEC -- that is, powers of 1024 (as opposed to powers of 1000).
func CopyWithProgressBar ¶ added in v0.2.0
CopyWithProgressBar is similar to io.Copy, but updates the progress bar with the amount of data copied.
It requires knowing the amount of data to copy up-front.
func Download ¶
Download downloads a file from the given URL and saves it at the given path. It attempts to create the directory if it doesn't exist yet.
Optionally, set showProgressBar to display the download progress.
func DownloadAndUntarIfMissing ¶
DownloadAndUntarIfMissing downloads tarFile from the given url, if the file is not there yet, and then untars it if the target directory is missing.
If checkHash is provided, it checks that the file matches the hash, or fails.
func DownloadAndUnzipIfMissing ¶
DownloadAndUnzipIfMissing downloads zipFile from the given url, if the file is not there yet, and then unzips it under the directory `unzipBaseDir`, if the target directory `targetUnzipDir` is missing.
If checkHash is provided, it checks that the file matches the hash, or fails.
func DownloadIfMissing ¶
DownloadIfMissing will check if the path exists already, and if not it will download the file from the given URL.
If checkHash is provided, it checks that the file matches the hash, or fails.
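As a sketch, a typical use combining these helpers; the URL, file names, directory, and hash value below are all hypothetical placeholders:

    url := "https://example.com/dataset.tar.gz"             // hypothetical
    baseDir := data.ReplaceTildeInDir("~/.cache/mydataset") // expands "~"
    err := data.DownloadAndUntarIfMissing(
        url, baseDir, "dataset.tar.gz", "dataset",
        "0123abcd...") // placeholder hash value
    if err != nil {
        log.Fatalf("dataset download failed: %v", err)
    }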
func FileExists ¶
FileExists returns true if file or directory exists.
func ReadAhead ¶ added in v0.2.0
ReadAhead returns a Dataset that reads bufferSize elements of the given `ds` so that when Yield is called, the results are immediate.
It uses ParallelDataset to implement it.
func ReplaceTildeInDir ¶
ReplaceTildeInDir replaces a leading "~" in dir with the user's home directory. It returns dir unchanged if it doesn't start with "~".
func Untar ¶
Untar untars the given file, using decompression according to its suffix: .gz for gzip, .bz2 for bzip2.
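func Unzip ¶
Unzip unzips the given `zipFile` into the directory `zipBaseDir`.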
func ValidateChecksum ¶
ValidateChecksum verifies that the checksum of the file in the given path matches the given checksum. If it fails, it will remove the file (!) and return an error.
Types ¶
type InMemoryDataset ¶ added in v0.2.0
type InMemoryDataset struct {
// contains filtered or unexported fields
}
InMemoryDataset represents a Dataset that has been completely read into the memory of the device it was created with -- the platform of the associated `graph.Manager`.
It supports batching, shuffling (with and without replacement) and can be duplicated (only one copy of the underlying data is used).
Finally, it supports serialization and deserialization, to accelerate loading of the data -- in case generating the original dataset is expensive (e.g., image transformations).
func GobDeserializeInMemory ¶ added in v0.2.0
func GobDeserializeInMemory(manager *Manager, decoder *gob.Decoder) (mds *InMemoryDataset, err error)
GobDeserializeInMemory deserializes an InMemoryDataset from the decoder. It requires a `graph.Manager` for the dataset to be properly recreated.
No sampling configuration is recovered, and the InMemoryDataset created reads sequentially (no random sampling) through only one epoch. The random number generator is also newly initialized (see InMemoryDataset.WithRand).
func InMemory ¶ added in v0.2.0
func InMemory(manager *Manager, ds train.Dataset, dsIsBatched bool) (mds *InMemoryDataset, err error)
InMemory creates a dataset that reads the whole contents of `ds` into memory.
It uses GoMLX to batch the tensors themselves, so it takes a graph.Manager as its first parameter. Data will be cached on the platform (device) the Manager was configured with.
Args:
- `manager`: will be used to create the graph that does the caching and, later, the actual batching.
- `ds`: dataset to be cached. It is read in full once, concatenating the results in the cache.
- `dsIsBatched`: whether the input `ds` is batched, with its leading (first) axis being the batch size. If true, the count of examples is adjusted accordingly. Notice that if true, the batch size must be the same for all elements of the inputs and labels yielded by `ds`.
Returns an `InMemoryDataset` that is initially not shuffled and not batched. You can configure how you want to use it with the other configuration methods.
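A sketch of creating and configuring an InMemoryDataset; `manager` and the source dataset `ds` are assumed to exist:

    // Cache the whole (non-batched) dataset on the device, once.
    mds, err := data.InMemory(manager, ds, false)
    if err != nil {
        log.Fatalf("failed to cache dataset: %v", err)
    }

    // Configuration calls can be cascaded: reshuffle every epoch and
    // yield batches of 128, dropping the last incomplete batch.
    mds = mds.Shuffle().BatchSize(128, true)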
func (*InMemoryDataset) BatchSize ¶ added in v0.2.0
func (mds *InMemoryDataset) BatchSize(n int, dropIncompleteBatch bool) *InMemoryDataset
BatchSize configures the InMemoryDataset to return batches of the given size. If dropIncompleteBatch is set to true, it will simply drop examples if there are not enough to fill a batch -- this can only happen on the last batch of an epoch. Otherwise, it will return a partially filled batch.
If `n` is set to 0, it reverts back to yielding one example at a time.
It returns the modified InMemoryDataset, so calls can be cascaded if one wants.
func (*InMemoryDataset) Copy ¶ added in v0.2.0
func (mds *InMemoryDataset) Copy() *InMemoryDataset
Copy returns a copy of the dataset. It uses the same underlying data -- so very little memory is used.
The copy comes configured by default with sequential reading (not random sampling), non-looping, and reset.
func (*InMemoryDataset) FinalizeAll ¶ added in v0.2.0
func (mds *InMemoryDataset) FinalizeAll()
FinalizeAll will immediately free all the underlying data (and not wait for the garbage collector). This invalidates not only this InMemoryDataset, but also all other copies that use the same data (created with Copy).
This is not concurrency safe: if there are concurrent calls to sampling, this may lead to an undefined state or errors.
func (*InMemoryDataset) GobSerialize ¶ added in v0.2.0
func (mds *InMemoryDataset) GobSerialize(encoder *gob.Encoder) (err error)
GobSerialize in-memory content to the encoder.
Only the underlying data is serialized. Neither the graph.Manager nor the sampling configuration is serialized. The contents of the `spec` (see WithSpec) are also not serialized.
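A sketch of a save/load round trip, pairing GobSerialize with GobDeserializeInMemory; the cache file path is a hypothetical placeholder:

    // Save: only the underlying data is written.
    f, err := os.Create("/tmp/dataset.cache") // hypothetical path
    if err != nil {
        log.Fatal(err)
    }
    if err = mds.GobSerialize(gob.NewEncoder(f)); err != nil {
        log.Fatal(err)
    }
    _ = f.Close()

    // Load: requires a graph.Manager; sampling configuration and spec
    // are not recovered (see GobDeserializeInMemory).
    f, err = os.Open("/tmp/dataset.cache")
    if err != nil {
        log.Fatal(err)
    }
    mds, err = data.GobDeserializeInMemory(manager, gob.NewDecoder(f))
    if err != nil {
        log.Fatal(err)
    }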
func (*InMemoryDataset) Infinite ¶ added in v0.2.0
func (mds *InMemoryDataset) Infinite(infinite bool) *InMemoryDataset
Infinite sets whether the dataset should loop indefinitely. The default is `infinite = false`, which causes the dataset to go through the data only once before returning io.EOF.
It returns the modified InMemoryDataset, so calls can be cascaded if one wants.
func (*InMemoryDataset) Memory ¶ added in v0.2.0
func (mds *InMemoryDataset) Memory() int64
Memory returns an approximation of the memory being used.
func (*InMemoryDataset) Name ¶ added in v0.2.0
func (mds *InMemoryDataset) Name() string
Name implements `train.Dataset`.
func (*InMemoryDataset) NumExamples ¶ added in v0.2.0
func (mds *InMemoryDataset) NumExamples() int
NumExamples returns the number of examples cached.
func (*InMemoryDataset) RandomWithReplacement ¶ added in v0.2.0
func (mds *InMemoryDataset) RandomWithReplacement() *InMemoryDataset
RandomWithReplacement configures the InMemoryDataset to return random elements with replacement. If this is configured, Shuffle is canceled.
It returns the modified InMemoryDataset, so calls can be cascaded if one wants.
func (*InMemoryDataset) Reset ¶ added in v0.2.0
func (mds *InMemoryDataset) Reset()
Reset implements `train.Dataset`.
func (*InMemoryDataset) Shuffle ¶ added in v0.2.0
func (mds *InMemoryDataset) Shuffle() *InMemoryDataset
Shuffle configures the InMemoryDataset to shuffle the order of the data. It returns random elements without replacement. If this is configured, RandomWithReplacement is canceled.
At each call to Reset() the data is reshuffled. This happens automatically if the dataset is configured to loop (see Infinite).
It returns the modified InMemoryDataset, so calls can be cascaded if one wants.
func (*InMemoryDataset) WithRand ¶ added in v0.2.0
func (mds *InMemoryDataset) WithRand(rng *rand.Rand) *InMemoryDataset
WithRand sets the random number generator (RNG) for shuffling or random sampling. This allows for repeatable deterministic random sampling, if one wants. The default is to use an RNG initialized with the current nanosecond time.
If the dataset is configured with Shuffle, this re-shuffles the dataset immediately.
It returns the modified InMemoryDataset, so calls can be cascaded if one wants.
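For repeatable runs, a sketch seeding the RNG explicitly (assuming the `rand` package expected by WithRand is the standard one):

    // A fixed seed makes the shuffling order deterministic across runs.
    rng := rand.New(rand.NewSource(42))
    mds = mds.Shuffle().WithRand(rng)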
func (*InMemoryDataset) WithSpec ¶ added in v0.2.0
func (mds *InMemoryDataset) WithSpec(spec any) *InMemoryDataset
WithSpec sets the `spec` that is returned by Yield. The default is to use the one read from the original dataset passed to the InMemory call. This allows one to set it to something different.
It returns the modified InMemoryDataset, so calls can be cascaded if one wants.
type ParallelDataset ¶
ParallelDataset is a wrapper around a `train.Dataset` that parallelizes calls to Yield. See details in CustomParallel.
func CustomParallel ¶ added in v0.2.0
func CustomParallel(ds train.Dataset) *ParallelDataset
CustomParallel builds a ParallelDataset that can be used to parallelize any train.Dataset, as long as the underlying dataset ds is thread-safe.
ParallelDataset can be further configured (see Parallelism and Buffer), and then one has to call Start before actually using the Dataset.
To avoid leaking goroutines, call ParallelDataset.Cancel when exiting.
Example:
    var ds train.Dataset
    ds = NewMyDataset(...)
    ds = data.CustomParallel(ds).Buffer(10).Start()
    MyTrainFunc(ds)
func Parallel ¶ added in v0.2.0
func Parallel(ds train.Dataset) *ParallelDataset
Parallel parallelizes the Yield calls of any thread-safe train.Dataset.
It uses CustomParallel and automatically starts it with the default parameters.
To avoid leaking goroutines, call ParallelDataset.Cancel when exiting.
Example:
    var ds train.Dataset
    ds = NewMyDataset(...)
    ds = data.Parallel(ds)
    MyTrainFunc(ds)
func (*ParallelDataset) Buffer ¶ added in v0.2.0
func (pd *ParallelDataset) Buffer(n int) *ParallelDataset
Buffer sets the size of the buffer reserved in the channel that collects the parallel yields. Notice there is already an intrinsic buffering that happens per goroutine (see Parallelism).
This must be called before a call to Start.
It returns the updated ParallelDataset, so calls can be cascaded.
func (*ParallelDataset) Name ¶
func (pd *ParallelDataset) Name() string
Name implements train.Dataset.
func (*ParallelDataset) Parallelism ¶
func (pd *ParallelDataset) Parallelism(n int) *ParallelDataset
Parallelism sets the number of goroutines to start, each calling `ds.Yield()` in parallel, to accelerate the generation of batches. If set to 0 (the default), it will use the number of cores in the system plus 1.
It also allocates a buffer (in a Go channel) for each goroutine.
This must be called before a call to Start.
It returns the updated ParallelDataset, so calls can be cascaded.
func (*ParallelDataset) Start ¶ added in v0.2.0
func (pd *ParallelDataset) Start() *ParallelDataset
Start indicates that the dataset is done being configured, and that it starts being a valid Dataset.
After Start its configuration can no longer be changed.
It returns the updated ParallelDataset, so calls can be cascaded.