Documentation
¶
Overview ¶
Package data is a collection of tools that facilitate data loading and preprocessing.
Index ¶
- func Batch(manager *Manager, ds train.Dataset, batchSize int, ...) train.Dataset
- func ByteCountIEC(count int64) string
- func CopyWithProgressBar(dst io.Writer, src io.Reader, contentLength int64) (n int64, err error)
- func Download(url, filePath string, showProgressBar bool) (size int64, err error)
- func DownloadAndUntarIfMissing(url, baseDir, tarFile, targetUntarDir, checkHash string) error
- func DownloadAndUnzipIfMissing(url, zipFile, unzipBaseDir, targetUnzipDir, checkHash string) error
- func DownloadIfMissing(url, filePath, checkHash string) error
- func FileExists(path string) bool
- func ReadAhead(ds train.Dataset, bufferSize int) train.Dataset
- func ReplaceTildeInDir(dir string) string
- func Untar(baseDir, tarFile string) error
- func Unzip(zipFile, zipBaseDir string) error
- func ValidateChecksum(path, checkHash string) error
- type InMemoryDataset
- func (mds *InMemoryDataset) BatchSize(n int, dropIncompleteBatch bool) *InMemoryDataset
- func (mds *InMemoryDataset) Copy() *InMemoryDataset
- func (mds *InMemoryDataset) FinalizeAll()
- func (mds *InMemoryDataset) GobSerialize(encoder *gob.Encoder) (err error)
- func (mds *InMemoryDataset) Infinite(infinite bool) *InMemoryDataset
- func (mds *InMemoryDataset) Memory() int64
- func (mds *InMemoryDataset) Name() string
- func (mds *InMemoryDataset) NumExamples() int
- func (mds *InMemoryDataset) RandomWithReplacement() *InMemoryDataset
- func (mds *InMemoryDataset) Reset()
- func (mds *InMemoryDataset) Shuffle() *InMemoryDataset
- func (mds *InMemoryDataset) WithRand(rng *rand.Rand) *InMemoryDataset
- func (mds *InMemoryDataset) WithSpec(spec any) *InMemoryDataset
- func (mds *InMemoryDataset) Yield() (spec any, inputs []tensor.Tensor, labels []tensor.Tensor, err error)
- type ParallelDataset
- func (pd *ParallelDataset) Buffer(n int) *ParallelDataset
- func (pd *ParallelDataset) Name() string
- func (pd *ParallelDataset) Parallelism(n int) *ParallelDataset
- func (pd *ParallelDataset) Reset()
- func (pd *ParallelDataset) Start() *ParallelDataset
- func (pd *ParallelDataset) Yield() (spec any, inputs []tensor.Tensor, labels []tensor.Tensor, err error)
Constants ¶
This section is empty.
Variables ¶
This section is empty.
Functions ¶
func Batch ¶ added in v0.2.0
func Batch(manager *Manager, ds train.Dataset, batchSize int, createLeadingAxis, dropIncompleteBatch bool) train.Dataset
Batch creates a dataset that batches `ds` into batches of size `batchSize`.
It uses GoMLX to batch the tensors themselves, so it takes a graph.Manager as its first parameter. This also means that it yields examples already stored "on device" -- whichever platform the Manager was configured with.
Typically, Batch can benefit from ReadAhead, so that while training or evaluation of one batch is happening, the next batch is being built. Consider using ReadAhead on the Batch dataset, even with a buffer of only 1.
Args:
- `manager`: will be used to create the graph that actually does the batching.
- `ds`: the dataset to be batched.
- `batchSize`: size of each batch, except when there are no more examples, in which case batches can be smaller (unless `dropIncompleteBatch` is set).
- `createLeadingAxis`: usually set to true, it will create a new leading axis that becomes the batch dimension. Otherwise, it simply concatenates the individual results along axis 0 -- this can be used, for instance, to increase the size of a batch.
- `dropIncompleteBatch`: at the end of an epoch, if there are not enough examples to fill a batch, and this is set to true, the last batch is dropped. Otherwise, it returns only a partial batch -- having a different shape, this may trigger the re-compilation of a graph. Returning a partial batch is usually desirable for evaluation, but not for training.
Returns a `train.Dataset` that yields batched examples.
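For illustration, a minimal sketch combining Batch with ReadAhead; `manager` and `myDataset` are assumed to exist (a graph.Manager and a hypothetical per-example train.Dataset, respectively):

    // myDataset is a hypothetical dataset yielding one example at a time.
    var ds train.Dataset = myDataset

    // Batch 32 examples, creating a new leading (batch) axis, and drop
    // the last incomplete batch so all batches have the same shape.
    ds = data.Batch(manager, ds, 32, true, true)

    // Read ahead 1 batch: the next batch is built while the current
    // one is being consumed by training.
    ds = data.ReadAhead(ds, 1)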
func ByteCountIEC ¶ added in v0.2.0
ByteCountIEC converts a byte count to a string using the appropriate unit (B, KiB, MiB, GiB, ...). It uses the binary prefix system from IEC -- that is, powers of 1024 (as opposed to powers of 1000).
func CopyWithProgressBar ¶ added in v0.2.0
CopyWithProgressBar is similar to io.Copy, but updates the progress bar with the amount of data copied.
It requires knowing the amount of data to copy up-front.
func Download ¶
Download downloads a file from the given URL and saves it at the given path. It attempts to create the directory if it doesn't exist yet.
Optionally, set showProgressBar to display the download progress.
func DownloadAndUntarIfMissing ¶
DownloadAndUntarIfMissing downloads tarFile from the given url, if the file is not there yet, and then untars it if the target directory is missing.
If checkHash is provided, it checks that the file matches the hash, or fails.
func DownloadAndUnzipIfMissing ¶
DownloadAndUnzipIfMissing downloads zipFile from the given url, if the file is not there yet, and then unzips it under the directory `unzipBaseDir`, if the target directory `targetUnzipDir` is missing.
If checkHash is provided, it checks that the file matches the hash, or fails.
func DownloadIfMissing ¶
DownloadIfMissing will check if the path exists already, and if not it will download the file from the given URL.
If checkHash is provided, it checks that the file matches the hash, or fails.
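As a sketch, a typical use combining these helpers; the URL, file names, directory, and hash value below are all hypothetical placeholders:

    url := "https://example.com/dataset.tar.gz"             // hypothetical
    baseDir := data.ReplaceTildeInDir("~/.cache/mydataset") // expands "~"
    err := data.DownloadAndUntarIfMissing(
        url, baseDir, "dataset.tar.gz", "dataset",
        "0123abcd...") // placeholder hash value
    if err != nil {
        log.Fatalf("dataset download failed: %v", err)
    }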
func FileExists ¶
FileExists returns true if file or directory exists.
func ReadAhead ¶ added in v0.2.0
ReadAhead returns a Dataset that reads bufferSize elements of the given `ds` so that when Yield is called, the results are immediate.
It uses ParallelDataset to implement it.
func ReplaceTildeInDir ¶
ReplaceTildeInDir replaces a leading "~" in dir with the user's home directory. It returns dir unchanged if it doesn't start with "~".
func Untar ¶
Untar untars the given file, using decompression according to its suffix: .gz for gzip, .bz2 for bzip2.
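func Unzip ¶
Unzip unzips the given `zipFile` into the directory `zipBaseDir`.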
func ValidateChecksum ¶
ValidateChecksum verifies that the checksum of the file in the given path matches the given checksum. If it fails, it will remove the file (!) and return an error.
Types ¶
type InMemoryDataset ¶ added in v0.2.0
type InMemoryDataset struct {
// contains filtered or unexported fields
}
InMemoryDataset represents a Dataset that has been completely read into the memory of the device it was created with -- the platform of the associated `graph.Manager`.
It supports batching, shuffling (with and without replacement) and can be duplicated (only one copy of the underlying data is used).
Finally, it supports serialization and deserialization, to accelerate loading of the data -- in case generating the original dataset is expensive (e.g., image transformations).
func GobDeserializeInMemory ¶ added in v0.2.0
func GobDeserializeInMemory(manager *Manager, decoder *gob.Decoder) (mds *InMemoryDataset, err error)
GobDeserializeInMemory deserializes an InMemoryDataset from the decoder. It requires a `graph.Manager` for the dataset to be properly recreated.
No sampling configuration is recovered, and the InMemoryDataset created reads sequentially (no random sampling) through only one epoch. The random number generator is also newly initialized (see InMemoryDataset.WithRand).
func InMemory ¶ added in v0.2.0
func InMemory(manager *Manager, ds train.Dataset, dsIsBatched bool) (mds *InMemoryDataset, err error)
InMemory creates a dataset that reads the whole contents of `ds` into memory.
It uses GoMLX to batch the tensors themselves, so it takes a graph.Manager as its first parameter. Data will be cached on the platform (device) the Manager was configured with.
Args:
- `manager`: will be used to create the graph that does the caching and, later, the actual batching.
- `ds`: dataset to be cached. It is read in full once, concatenating the results in the cache.
- `dsIsBatched`: whether the input `ds` is batched, with its leading (first) axis being the batch size. If true, the count of examples is adjusted accordingly. Notice that if true, the batch size must be the same for all elements of the inputs and labels yielded by `ds`.
Returns an `InMemoryDataset` that is initially not shuffled and not batched. You can configure how you want to use it with the other configuration methods.
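A sketch of creating and configuring an InMemoryDataset; `manager` and the source dataset `ds` are assumed to exist:

    // Cache the whole (non-batched) dataset on the device, once.
    mds, err := data.InMemory(manager, ds, false)
    if err != nil {
        log.Fatalf("failed to cache dataset: %v", err)
    }

    // Configuration calls can be cascaded: reshuffle every epoch and
    // yield batches of 128, dropping the last incomplete batch.
    mds = mds.Shuffle().BatchSize(128, true)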
func (*InMemoryDataset) BatchSize ¶ added in v0.2.0
func (mds *InMemoryDataset) BatchSize(n int, dropIncompleteBatch bool) *InMemoryDataset
BatchSize configures the InMemoryDataset to return batches of the given size. If dropIncompleteBatch is set to true, it will simply drop examples if there are not enough to fill a batch -- this can only happen on the last batch of an epoch. Otherwise, it will return a partially filled batch.
If `n` is set to 0, it reverts back to yielding one example at a time.
It returns the modified InMemoryDataset, so calls can be cascaded if one wants.
func (*InMemoryDataset) Copy ¶ added in v0.2.0
func (mds *InMemoryDataset) Copy() *InMemoryDataset
Copy returns a copy of the dataset. It uses the same underlying data -- so very little memory is used.
The copy comes configured by default with sequential reading (not random sampling), non-looping, and reset.
func (*InMemoryDataset) FinalizeAll ¶ added in v0.2.0
func (mds *InMemoryDataset) FinalizeAll()
FinalizeAll will immediately free all the underlying data (and not wait for the garbage collector). This invalidates not only this InMemoryDataset, but also all other copies that use the same data (created with Copy).
This is not concurrency safe: if there are concurrent calls to sampling, this may lead to an undefined state or errors.
func (*InMemoryDataset) GobSerialize ¶ added in v0.2.0
func (mds *InMemoryDataset) GobSerialize(encoder *gob.Encoder) (err error)
GobSerialize in-memory content to the encoder.
Only the underlying data is serialized. Neither the graph.Manager nor the sampling configuration is serialized. The contents of the `spec` (see WithSpec) are also not serialized.
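A sketch of a save/load round trip, pairing GobSerialize with GobDeserializeInMemory; the cache file path is a hypothetical placeholder:

    // Save: only the underlying data is written.
    f, err := os.Create("/tmp/dataset.cache") // hypothetical path
    if err != nil {
        log.Fatal(err)
    }
    if err = mds.GobSerialize(gob.NewEncoder(f)); err != nil {
        log.Fatal(err)
    }
    _ = f.Close()

    // Load: requires a graph.Manager; sampling configuration and spec
    // are not recovered (see GobDeserializeInMemory).
    f, err = os.Open("/tmp/dataset.cache")
    if err != nil {
        log.Fatal(err)
    }
    mds, err = data.GobDeserializeInMemory(manager, gob.NewDecoder(f))
    if err != nil {
        log.Fatal(err)
    }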
func (*InMemoryDataset) Infinite ¶ added in v0.2.0
func (mds *InMemoryDataset) Infinite(infinite bool) *InMemoryDataset
Infinite sets whether the dataset should loop indefinitely. The default is `infinite = false`, which causes the dataset to go through the data only once before returning io.EOF.
It returns the modified InMemoryDataset, so calls can be cascaded if one wants.
func (*InMemoryDataset) Memory ¶ added in v0.2.0
func (mds *InMemoryDataset) Memory() int64
Memory returns an approximation of the memory being used.
func (*InMemoryDataset) Name ¶ added in v0.2.0
func (mds *InMemoryDataset) Name() string
Name implements `train.Dataset`.
func (*InMemoryDataset) NumExamples ¶ added in v0.2.0
func (mds *InMemoryDataset) NumExamples() int
NumExamples returns the number of examples cached.
func (*InMemoryDataset) RandomWithReplacement ¶ added in v0.2.0
func (mds *InMemoryDataset) RandomWithReplacement() *InMemoryDataset
RandomWithReplacement configures the InMemoryDataset to return random elements with replacement. If this is configured, Shuffle is canceled.
It returns the modified InMemoryDataset, so calls can be cascaded if one wants.
func (*InMemoryDataset) Reset ¶ added in v0.2.0
func (mds *InMemoryDataset) Reset()
Reset implements `train.Dataset`.
func (*InMemoryDataset) Shuffle ¶ added in v0.2.0
func (mds *InMemoryDataset) Shuffle() *InMemoryDataset
Shuffle configures the InMemoryDataset to shuffle the order of the data. It returns random elements without replacement. If this is configured, RandomWithReplacement is canceled.
At each call to Reset() the data is reshuffled. This happens automatically if the dataset is configured to loop (see Infinite).
It returns the modified InMemoryDataset, so calls can be cascaded if one wants.
func (*InMemoryDataset) WithRand ¶ added in v0.2.0
func (mds *InMemoryDataset) WithRand(rng *rand.Rand) *InMemoryDataset
WithRand sets the random number generator (RNG) for shuffling or random sampling. This allows for repeatable deterministic random sampling, if one wants. The default is to use an RNG initialized with the current nanosecond time.
If the dataset is configured with Shuffle, this re-shuffles the dataset immediately.
It returns the modified InMemoryDataset, so calls can be cascaded if one wants.
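For repeatable runs, a sketch seeding the RNG explicitly (assuming the `rand` package expected by WithRand is the standard one):

    // A fixed seed makes the shuffling order deterministic across runs.
    rng := rand.New(rand.NewSource(42))
    mds = mds.Shuffle().WithRand(rng)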
func (*InMemoryDataset) WithSpec ¶ added in v0.2.0
func (mds *InMemoryDataset) WithSpec(spec any) *InMemoryDataset
WithSpec sets the `spec` that is returned by Yield. The default is to use the one read from the original dataset passed to the InMemory call. This allows one to set it to something different.
It returns the modified InMemoryDataset, so calls can be cascaded if one wants.
type ParallelDataset ¶
ParallelDataset is a wrapper around a `train.Dataset` that parallelizes calls to Yield. See details in CustomParallel.
func CustomParallel ¶ added in v0.2.0
func CustomParallel(ds train.Dataset) *ParallelDataset
CustomParallel builds a ParallelDataset that can be used to parallelize any train.Dataset, as long as the underlying dataset ds is thread-safe.
ParallelDataset can be further configured (see Parallelism and Buffer), and then one has to call Start before actually using the Dataset.
To avoid leaking goroutines, call ParallelDataset.Cancel when exiting.
Example:
    var ds train.Dataset
    ds = NewMyDataset(...)
    ds = data.CustomParallel(ds).Buffer(10).Start()
    MyTrainFunc(ds)
func Parallel ¶ added in v0.2.0
func Parallel(ds train.Dataset) *ParallelDataset
Parallel parallelizes the Yield calls of any thread-safe train.Dataset.
It uses CustomParallel and automatically starts it with the default parameters.
To avoid leaking goroutines, call ParallelDataset.Cancel when exiting.
Example:
    var ds train.Dataset
    ds = NewMyDataset(...)
    ds = data.Parallel(ds)
    MyTrainFunc(ds)
func (*ParallelDataset) Buffer ¶ added in v0.2.0
func (pd *ParallelDataset) Buffer(n int) *ParallelDataset
Buffer sets the size of the buffer reserved in the channel that collects the parallel yields. Notice there is already an intrinsic buffering that happens per goroutine (see Parallelism).
This must be called before a call to Start.
It returns the updated ParallelDataset, so calls can be cascaded.
func (*ParallelDataset) Name ¶
func (pd *ParallelDataset) Name() string
Name implements train.Dataset.
func (*ParallelDataset) Parallelism ¶
func (pd *ParallelDataset) Parallelism(n int) *ParallelDataset
Parallelism sets the number of goroutines to start, each calling `ds.Yield()` in parallel, to accelerate the generation of batches. If set to 0 (the default), it will use the number of cores in the system plus 1.
It also allocates a buffer (in a Go channel) for each goroutine.
This must be called before a call to Start.
It returns the updated ParallelDataset, so calls can be cascaded.
func (*ParallelDataset) Start ¶ added in v0.2.0
func (pd *ParallelDataset) Start() *ParallelDataset
Start indicates that the dataset is done being configured, and that it starts being a valid Dataset.
After Start its configuration can no longer be changed.
It returns the updated ParallelDataset, so calls can be cascaded.