data

package
v0.2.6
Warning

This package is not in the latest version of its module.

Published: Feb 16, 2024 License: Apache-2.0 Imports: 2 Imported by: 0

Documentation

Overview

Package data provides primitives for representing and organizing data sets. In addition to the traditional sharded data set, in which every node in the cluster holds a replica of the data, it supports a partitioned data set, in which the data is split into partitions across the nodes in the cluster.

Index

Constants

This section is empty.

Variables

This section is empty.

Functions

This section is empty.

Types

type Dataset

type Dataset interface {
	// Getitem retrieves a data sample for the given rank and requested size.
	// It must return the index identifying the scheduled data sample and that
	// sample's size.
	Getitem(rank int, size int64) (int64, int64)

	// Rand retrieves an arbitrary data sample from the data set.
	Rand(rank int) (int64, int64)

	// OnEpochEnd is called at the end of an epoch during training.
	OnEpochEnd(epoch int64)

	// OnTrainEnd terminates the training environment.
	OnTrainEnd()
}

Dataset represents a data set. Besides Getitem and Rand, which retrieve data samples, an implementation must provide the callbacks OnEpochEnd and OnTrainEnd, which are called at the end of each epoch and at the end of training, respectively.
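The call order implied by the interface can be sketched as follows. This is a minimal illustration, not the package's own code: the interface is copied locally so the example compiles on its own, and countingDataset and run are hypothetical names introduced here. The stub's sample-selection policy is arbitrary; only the callback order is the point.

```go
package main

import "fmt"

// Dataset mirrors the documented interface so that this sketch compiles
// without importing the package.
type Dataset interface {
	Getitem(rank int, size int64) (int64, int64)
	Rand(rank int) (int64, int64)
	OnEpochEnd(epoch int64)
	OnTrainEnd()
}

// countingDataset is a hypothetical stub that records how often each
// callback is invoked.
type countingDataset struct {
	sizes     []int64
	epochEnds int
	trainEnds int
}

// Getitem ignores the size hint and returns the sample at index
// rank modulo the number of samples (an arbitrary stub policy).
func (d *countingDataset) Getitem(rank int, size int64) (int64, int64) {
	index := int64(rank % len(d.sizes))
	return index, d.sizes[index]
}

func (d *countingDataset) Rand(rank int) (int64, int64) { return d.Getitem(rank, 0) }
func (d *countingDataset) OnEpochEnd(epoch int64)       { d.epochEnds++ }
func (d *countingDataset) OnTrainEnd()                  { d.trainEnds++ }

// run drives a Dataset through one training run: one Getitem call per rank
// per epoch, OnEpochEnd after every epoch, and OnTrainEnd once at the end.
// It returns the number of samples fetched.
func run(d Dataset, epochs int64, ranks int) int {
	fetched := 0
	for epoch := int64(0); epoch < epochs; epoch++ {
		for rank := 0; rank < ranks; rank++ {
			d.Getitem(rank, 0)
			fetched++
		}
		d.OnEpochEnd(epoch)
	}
	d.OnTrainEnd()
	return fetched
}

func main() {
	d := &countingDataset{sizes: []int64{8, 16, 32}}
	fetched := run(d, 3, 2)
	fmt.Println(fetched, d.epochEnds, d.trainEnds)
}
```

Because run accepts the interface rather than a concrete type, the same driver would work with either of the data set types documented here.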

func New

func New(sizes, groups []int64, seed int64, partition bool) Dataset

New creates a new data set with the given arguments.

type PartitionedDataset

type PartitionedDataset struct {
	// contains filtered or unexported fields
}

PartitionedDataset represents a partitioned data set where each of the nodes in the cluster holds only a portion of the given data set.

func NewPartitionedDataset

func NewPartitionedDataset(sizes, groups []int64, seed int64) *PartitionedDataset

NewPartitionedDataset creates a new partitioned data set with the given arguments.

func (*PartitionedDataset) Getitem

func (d *PartitionedDataset) Getitem(rank int, size int64) (_, _ int64)

Getitem looks for the data sample with the size nearest to the given size in the partition with the given rank.
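To illustrate why rank matters for a partitioned lookup, here is a rough sketch of a nearest-size search restricted to one rank's partition. Everything in it is an assumption made for illustration: sample, split, nearest, and the owner slice are hypothetical, and the package's groups argument may or may not play the role that owner plays here.

```go
package main

import "fmt"

// sample pairs a data sample's index with its size.
type sample struct{ index, size int64 }

// split assigns sample i to the rank named by owner[i].
// The owner slice is hypothetical and must be as long as sizes.
func split(sizes []int64, owner []int) map[int][]sample {
	parts := make(map[int][]sample)
	for i, size := range sizes {
		parts[owner[i]] = append(parts[owner[i]], sample{int64(i), size})
	}
	return parts
}

// nearest scans the partition held by rank and returns the index and size
// of the sample whose size is closest to the requested size. ok is false
// when the rank holds no samples.
func nearest(parts map[int][]sample, rank int, size int64) (index, got int64, ok bool) {
	best := -1
	var bestDiff int64
	for i, s := range parts[rank] {
		diff := s.size - size
		if diff < 0 {
			diff = -diff
		}
		if best < 0 || diff < bestDiff {
			best, bestDiff = i, diff
		}
	}
	if best < 0 {
		return 0, 0, false
	}
	s := parts[rank][best]
	return s.index, s.size, true
}

func main() {
	parts := split([]int64{10, 50, 12, 48}, []int{0, 0, 1, 1})
	// The same requested size resolves to different samples on different ranks.
	for rank := 0; rank < 2; rank++ {
		index, size, _ := nearest(parts, rank, 47)
		fmt.Println(rank, index, size)
	}
}
```

A linear scan keeps the sketch short; a sorted container would make the lookup logarithmic.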

func (*PartitionedDataset) OnEpochEnd

func (d *PartitionedDataset) OnEpochEnd(epoch int64)

OnEpochEnd restores the data partitions.

func (*PartitionedDataset) OnTrainEnd

func (d *PartitionedDataset) OnTrainEnd()

OnTrainEnd terminates the training environment.

func (*PartitionedDataset) Rand

func (d *PartitionedDataset) Rand(rank int) (_, _ int64)

Rand selects a random data sample from the data set.
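Since the constructors take a seed, random selection is presumably meant to be reproducible. The sketch below shows one common way to get that with the standard library's math/rand; picker, newPicker, and pick are hypothetical names, and nothing here is confirmed to match how the package draws samples.

```go
package main

import (
	"fmt"
	"math/rand"
)

// picker draws samples from a private, seeded random source.
type picker struct {
	sizes []int64
	rng   *rand.Rand
}

// newPicker creates a picker over a non-empty slice of sample sizes.
func newPicker(sizes []int64, seed int64) *picker {
	return &picker{sizes: sizes, rng: rand.New(rand.NewSource(seed))}
}

// pick returns the index and size of a uniformly chosen sample.
func (p *picker) pick() (index, size int64) {
	index = int64(p.rng.Intn(len(p.sizes)))
	return index, p.sizes[index]
}

func main() {
	sizes := []int64{8, 16, 32, 64}
	a, b := newPicker(sizes, 42), newPicker(sizes, 42)
	// Two pickers built from the same seed yield the same sequence.
	for i := 0; i < 3; i++ {
		ai, _ := a.pick()
		bi, _ := b.pick()
		fmt.Println(ai == bi)
	}
}
```

Using a private *rand.Rand rather than the package-level functions keeps the sequence independent of any other code that draws random numbers.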

type Sample

type Sample struct {
	btree.ItemBase
}

Sample represents a single data sample in the data set.

func NewSample

func NewSample(index, size int64) Sample

NewSample creates a new data sample with the given arguments.

func (Sample) Less

func (s Sample) Less(than btree.Item) bool

Less reports whether the current data sample sorts before the given item. This lets the underlying container return any of the items that share a key, in no fixed order, while preserving the sort order.
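One way to read this is that samples are ordered by a key that several samples can share. The sketch below assumes, without confirmation from the source, that the key is the sample size. Item, sample, and equivalent are local stand-ins, not the package's btree types.

```go
package main

import (
	"fmt"
	"sort"
)

// Item is a local stand-in for an ordered container's item interface.
type Item interface {
	Less(than Item) bool
}

// sample carries an index and a size; only size takes part in ordering
// (an assumption made for this sketch).
type sample struct{ index, size int64 }

// Less orders samples by size alone, so samples of equal size are
// interchangeable as far as the container is concerned.
func (s sample) Less(than Item) bool {
	return s.size < than.(sample).size
}

// equivalent reports whether neither item sorts before the other.
func equivalent(a, b Item) bool {
	return !a.Less(b) && !b.Less(a)
}

func main() {
	samples := []sample{{0, 32}, {1, 8}, {2, 32}, {3, 16}}
	sort.SliceStable(samples, func(i, j int) bool { return samples[i].Less(samples[j]) })
	fmt.Println(samples)
	// Indices 0 and 2 differ, but the samples share a key.
	fmt.Println(equivalent(sample{0, 32}, sample{2, 32}))
}
```

Leaving the index out of the comparison is what allows a lookup by size to return any of the matching samples.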

type ShardedDataset

type ShardedDataset struct {
	// contains filtered or unexported fields
}

ShardedDataset represents a sharded data set where every node in the cluster holds a replica of the whole data set; it therefore ignores rank when looking up a data sample.

func NewShardedDataset

func NewShardedDataset(sizes []int64, seed int64) *ShardedDataset

NewShardedDataset creates a new sharded data set with the given arguments.

func (*ShardedDataset) Getitem

func (d *ShardedDataset) Getitem(rank int, size int64) (_, _ int64)

Getitem looks for the data sample with the size nearest to the given size.
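A nearest-size lookup over sizes kept in ascending order can be done with a binary search. The following is a sketch under that assumption; nearestSorted is a hypothetical helper, its tie-breaking rule is a choice made here, and the package's own container and rule may differ.

```go
package main

import (
	"fmt"
	"sort"
)

// nearestSorted returns the position in sorted (ascending order) of the
// value closest to size, or -1 if sorted is empty. When two neighbours are
// equally close, the smaller one wins.
func nearestSorted(sorted []int64, size int64) int {
	if len(sorted) == 0 {
		return -1
	}
	// i is the first position whose value is at least size.
	i := sort.Search(len(sorted), func(i int) bool { return sorted[i] >= size })
	if i == 0 {
		return 0
	}
	if i == len(sorted) {
		return len(sorted) - 1
	}
	// Compare the neighbours on either side of the insertion point.
	if size-sorted[i-1] <= sorted[i]-size {
		return i - 1
	}
	return i
}

func main() {
	sorted := []int64{8, 16, 32, 64}
	for _, size := range []int64{1, 20, 30, 100} {
		pos := nearestSorted(sorted, size)
		fmt.Println(size, sorted[pos])
	}
}
```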

func (*ShardedDataset) OnEpochEnd

func (d *ShardedDataset) OnEpochEnd(epoch int64)

OnEpochEnd restores the data samples.
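The description suggests that samples are consumed as they are scheduled and put back when the epoch ends. A minimal sketch of that pattern, assuming it applies, is shown below; pool, take, and restore are hypothetical and say nothing about the package's internal layout.

```go
package main

import "fmt"

// pool tracks which sample sizes are still schedulable in the current epoch.
type pool struct {
	available []int64 // not yet handed out this epoch
	taken     []int64 // handed out this epoch
}

// newPool copies sizes so the caller's slice is never modified.
func newPool(sizes []int64) *pool {
	return &pool{available: append([]int64(nil), sizes...)}
}

// take removes the last available sample and records it as taken.
// ok is false once the pool is exhausted.
func (p *pool) take() (size int64, ok bool) {
	if len(p.available) == 0 {
		return 0, false
	}
	last := len(p.available) - 1
	size = p.available[last]
	p.available = p.available[:last]
	p.taken = append(p.taken, size)
	return size, true
}

// restore returns every taken sample to the pool, as an epoch-end callback
// might do.
func (p *pool) restore() {
	p.available = append(p.available, p.taken...)
	p.taken = p.taken[:0]
}

func main() {
	p := newPool([]int64{8, 16, 32})
	p.take()
	p.take()
	fmt.Println(len(p.available), len(p.taken))
	p.restore()
	fmt.Println(len(p.available), len(p.taken))
}
```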

func (*ShardedDataset) OnTrainEnd

func (d *ShardedDataset) OnTrainEnd()

OnTrainEnd terminates the training environment.

func (*ShardedDataset) Rand

func (d *ShardedDataset) Rand(rank int) (_, _ int64)

Rand selects a random data sample from the data set.
