Documentation ¶
Overview ¶
Package dataset contains utilities for working with datasets. Datasets hold columnar data across multiple pages.
Index ¶
- func CompareValues(a, b Value) int
- func WalkPredicate(p Predicate, fn func(p Predicate) bool)
- type AndPredicate
- type BuilderOptions
- type Column
- type ColumnBuilder
- type ColumnInfo
- type CompressionOptions
- type Dataset
- type EqualPredicate
- type FalsePredicate
- type FuncPredicate
- type GreaterThanPredicate
- type LessThanPredicate
- type MemColumn
- type MemPage
- type NotPredicate
- type OrPredicate
- type Page
- type PageData
- type PageInfo
- type Pages
- type Predicate
- type Reader
- type ReaderOptions
- type Row
- type StatisticsOptions
- type Value
- func (v Value) ByteArray() []byte
- func (v Value) Int64() int64
- func (v Value) IsNil() bool
- func (v Value) IsZero() bool
- func (v Value) MarshalBinary() (data []byte, err error)
- func (v Value) String() string
- func (v Value) Type() datasetmd.ValueType
- func (v Value) Uint64() uint64
- func (v *Value) UnmarshalBinary(data []byte) error
Constants ¶
This section is empty.
Variables ¶
This section is empty.
Functions ¶
func CompareValues ¶
CompareValues returns -1 if a<b, 0 if a==b, or 1 if a>b. CompareValues panics if a and b are not the same type.
As a special case, either a or b may be nil. Two nil values are equal, and a nil value is always less than a non-nil value.
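The nil-ordering rule can be illustrated with a small, self-contained sketch. It does not use the package's Value type: compareNullable and the use of *int64 are hypothetical stand-ins, with a nil pointer playing the role of a nil Value. The type-mismatch panic is not modelled.

```go
package main

import "fmt"

// compareNullable is a hypothetical stand-in for CompareValues. It uses *int64
// in place of Value; a nil pointer plays the role of a nil Value.
func compareNullable(a, b *int64) int {
	switch {
	case a == nil && b == nil:
		return 0 // two nil values are equal
	case a == nil:
		return -1 // nil is always less than a non-nil value
	case b == nil:
		return 1
	case *a < *b:
		return -1
	case *a > *b:
		return 1
	default:
		return 0
	}
}

func main() {
	one, two := int64(1), int64(2)
	fmt.Println(compareNullable(nil, nil))   // 0
	fmt.Println(compareNullable(nil, &one))  // -1
	fmt.Println(compareNullable(&two, &one)) // 1
}
```

Sorting with a comparison of this shape places all nil entries ahead of the non-nil ones.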
func WalkPredicate ¶ added in v3.5.0
WalkPredicate traverses a predicate in depth-first order: it starts by calling fn(p). If fn(p) returns true, WalkPredicate is invoked recursively with fn for each of the non-nil children of p, followed by a call of fn(nil).
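The traversal order described above resembles that of ast.Inspect in the Go standard library. The following sketch reproduces it on a hypothetical predicate tree; pred, andPred, notPred, eqPred, walk, and visitOrder are illustrative names and not part of this package.

```go
package main

import "fmt"

// pred and the types below are hypothetical stand-ins for Predicate,
// AndPredicate, NotPredicate and EqualPredicate.
type pred interface{}

type andPred struct{ left, right pred }
type notPred struct{ inner pred }
type eqPred struct{ column, value string }

// walk calls fn(p) first. If fn returns true, it visits the non-nil children
// of p in order and then calls fn(nil) to signal that p is finished.
func walk(p pred, fn func(pred) bool) {
	if p == nil || !fn(p) {
		return
	}
	switch p := p.(type) {
	case andPred:
		walk(p.left, fn)
		walk(p.right, fn)
	case notPred:
		walk(p.inner, fn)
	}
	fn(nil)
}

// visitOrder records the sequence of calls made to fn, using "end" for the
// trailing fn(nil) call of each visited node.
func visitOrder(p pred) []string {
	var order []string
	walk(p, func(p pred) bool {
		switch p := p.(type) {
		case nil:
			order = append(order, "end")
		case andPred:
			order = append(order, "and")
		case notPred:
			order = append(order, "not")
		case eqPred:
			order = append(order, p.column+"="+p.value)
		}
		return true
	})
	return order
}

func main() {
	p := andPred{
		left:  eqPred{column: "app", value: "loki"},
		right: notPred{inner: eqPred{column: "level", value: "debug"}},
	}
	fmt.Println(visitOrder(p))
	// [and app=loki end not level=debug end end end]
}
```

Pairing each non-nil call with its later fn(nil) call lets a caller track depth with a simple counter or stack.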
Types ¶
type AndPredicate ¶ added in v3.5.0
type AndPredicate struct{ Left, Right Predicate }
An AndPredicate is a Predicate which asserts that a row may only be included if both the Left and Right Predicate are true.
type BuilderOptions ¶
type BuilderOptions struct {
	// PageSizeHint is the soft limit for the size of the page. Builders try to
	// fill pages as close to this size as possible, but the actual size may be
	// slightly larger or smaller.
	PageSizeHint int
	// Value is the value type of data to write.
	Value datasetmd.ValueType
	// Encoding is the encoding algorithm to use for values.
	Encoding datasetmd.EncodingType
	// Compression is the compression algorithm to use for values.
	Compression datasetmd.CompressionType
	// CompressionOptions holds optional configuration for compression.
	CompressionOptions CompressionOptions
	// Statistics holds optional configuration for statistics.
	Statistics StatisticsOptions
}
BuilderOptions configures common settings for building pages.
type Column ¶
type Column interface {
	// ColumnInfo returns the metadata for the Column.
	ColumnInfo() *ColumnInfo
	// ListPages returns the set of ordered pages in the column.
	ListPages(ctx context.Context) result.Seq[Page]
}
A Column represents a sequence of values within a dataset. Columns are split up across one or more [Page]s to limit the amount of memory needed to read a portion of the column at a time.
type ColumnBuilder ¶
type ColumnBuilder struct {
// contains filtered or unexported fields
}
A ColumnBuilder builds a sequence of Value entries of a common type into a column. Values are accumulated into a buffer and then flushed into [MemPage]s once the size of data exceeds a configurable limit.
func NewColumnBuilder ¶
func NewColumnBuilder(name string, opts BuilderOptions) (*ColumnBuilder, error)
NewColumnBuilder creates a new ColumnBuilder from the optional name and provided options. NewColumnBuilder returns an error if the options are invalid.
func (*ColumnBuilder) Append ¶
func (cb *ColumnBuilder) Append(row int, value Value) error
Append adds a new value into cb with the given zero-indexed row number. If the row number is higher than the current number of rows in cb, null values are added up to the new row.
Append returns an error if the row number is out-of-order.
func (*ColumnBuilder) Backfill ¶
func (cb *ColumnBuilder) Backfill(row int)
Backfill adds NULLs into cb up to (but not including) the provided row number. If values exist up to the provided row number, Backfill does nothing.
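The row-number handling of Append and Backfill can be sketched without the real builder. sketchBuilder, appendValue, and backfillTo below are hypothetical; the sketch assumes that "out-of-order" means a row number lower than the number of rows already written, and it ignores page flushing entirely.

```go
package main

import "fmt"

// sketchBuilder is a hypothetical stand-in for ColumnBuilder that keeps every
// row in memory. A nil entry represents a NULL.
type sketchBuilder struct {
	rows []*int64
}

// backfillTo adds NULLs up to (but not including) row. It does nothing if
// values already exist up to row.
func (b *sketchBuilder) backfillTo(row int) {
	for len(b.rows) < row {
		b.rows = append(b.rows, nil)
	}
}

// appendValue stores value at the zero-indexed row, padding any gap with
// NULLs. Assumption: a row lower than the current row count is out of order.
func (b *sketchBuilder) appendValue(row int, value int64) error {
	if row < len(b.rows) {
		return fmt.Errorf("row %d is out of order: %d rows already exist", row, len(b.rows))
	}
	b.backfillTo(row)
	b.rows = append(b.rows, &value)
	return nil
}

func main() {
	var b sketchBuilder
	_ = b.appendValue(0, 10)
	_ = b.appendValue(3, 40) // rows 1 and 2 become NULL
	err := b.appendValue(1, 20)
	fmt.Println(len(b.rows), b.rows[1] == nil, err != nil) // 4 true true
}
```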
func (*ColumnBuilder) EstimatedSize ¶
func (cb *ColumnBuilder) EstimatedSize() int
EstimatedSize returns the estimated size of all data in cb. EstimatedSize includes the compressed size of all cut pages in cb, followed by the size estimate of the in-progress page.
Because compression isn't considered for the in-progress page, EstimatedSize tends to overestimate the actual size after flushing.
func (*ColumnBuilder) Flush ¶
func (cb *ColumnBuilder) Flush() (*MemColumn, error)
Flush converts data in cb into a MemColumn. Afterwards, cb is reset to a fresh state and can be reused.
func (*ColumnBuilder) Reset ¶
func (cb *ColumnBuilder) Reset()
Reset clears all data in cb and resets it to a fresh state.
type ColumnInfo ¶
type ColumnInfo struct {
	Name string // Name of the column, if any.
	Type datasetmd.ValueType // Type of values in the column.
	Compression datasetmd.CompressionType // Compression used for the column.
	RowsCount int // Total number of rows in the column.
	ValuesCount int // Total number of non-NULL values in the column.
	CompressedSize int // Total size of all pages in the column after compression.
	UncompressedSize int // Total size of all pages in the column before compression.
	Statistics *datasetmd.Statistics // Optional statistics for the column.
}
ColumnInfo describes a column.
type CompressionOptions ¶
type CompressionOptions struct {
	// Zstd holds encoding options for Zstd compression. Only used for
	// [datasetmd.COMPRESSION_TYPE_ZSTD].
	Zstd []zstd.EOption
}
CompressionOptions customizes the compressor used when building pages.
type Dataset ¶
type Dataset interface {
	// ListColumns returns the set of [Column]s in the Dataset. The order of
	// Columns in the returned sequence must be consistent across calls.
	ListColumns(ctx context.Context) result.Seq[Column]
	// ListPages retrieves a set of [Pages] given a list of [Column]s.
	// Implementations of Dataset may use ListPages to optimize for batch reads.
	// The order of [Pages] in the returned sequence must match the order of the
	// columns argument.
	ListPages(ctx context.Context, columns []Column) result.Seq[Pages]
	// ReadPages returns the set of [PageData] for the specified slice of pages.
	// Implementations of Dataset may use ReadPages to optimize for batch reads.
	// The order of [PageData] in the returned sequence must match the order of
	// the pages argument.
	ReadPages(ctx context.Context, pages []Page) result.Seq[PageData]
}
A Dataset holds a collection of [Columns], each of which is split into a set of Pages and further split into a sequence of [Values].
Dataset is read-only; callers must not modify any of the values returned by methods in Dataset.
func FromMemory ¶
FromMemory returns an in-memory Dataset from the given list of [MemColumn]s.
type EqualPredicate ¶ added in v3.5.0
type EqualPredicate struct {
	Column Column // Column to check.
	Value Value // Value to check equality for.
}
An EqualPredicate is a Predicate which asserts that a row may only be included if the Value of the Column is equal to the Value.
type FalsePredicate ¶ added in v3.5.0
type FalsePredicate struct{}
FalsePredicate is a Predicate which always returns false.
type FuncPredicate ¶ added in v3.5.0
type FuncPredicate struct {
	Column Column // Column to check.
	// Keep is invoked with the column and value pair to check. Keep is given
	// the Column instance to allow for reusing the same function across
	// multiple columns, if necessary.
	//
	// If Keep returns true, the row is kept.
	Keep func(column Column, value Value) bool
}
FuncPredicate is a Predicate which asserts that a row may only be included if the Value of the Column passes the Keep function.
Instances of FuncPredicate are ineligible for page filtering and should only be used when there isn't a more explicit Predicate implementation.
type GreaterThanPredicate ¶ added in v3.5.0
type GreaterThanPredicate struct {
	Column Column // Column to check.
	Value Value // Value for which rows in Column must be greater than.
}
A GreaterThanPredicate is a Predicate which asserts that a row may only be included if the Value of the Column is greater than the provided Value.
type LessThanPredicate ¶ added in v3.5.0
type LessThanPredicate struct {
	Column Column // Column to check.
	Value Value // Value for which rows in Column must be less than.
}
A LessThanPredicate is a Predicate which asserts that a row may only be included if the Value of the Column is less than the provided Value.
type MemColumn ¶
type MemColumn struct {
	Info ColumnInfo // Information about the column.
	Pages []*MemPage // The set of pages in the column.
}
MemColumn holds a set of pages of a common type.
func (*MemColumn) ColumnInfo ¶
func (c *MemColumn) ColumnInfo() *ColumnInfo
ColumnInfo implements Column and returns c.Info.
type MemPage ¶
type MemPage struct {
	Info PageInfo // Information about the page.
	Data PageData // Data for the page.
}
MemPage holds an encoded (and optionally compressed) sequence of Value entries of a common type. Use ColumnBuilder to construct sets of pages.
type NotPredicate ¶ added in v3.5.0
type NotPredicate struct{ Inner Predicate }
A NotPredicate is a Predicate which asserts that a row may only be included if the inner Predicate is false.
type OrPredicate ¶ added in v3.5.0
type OrPredicate struct{ Left, Right Predicate }
An OrPredicate is a Predicate which asserts that a row may only be included if either the Left or Right Predicate is true.
type Page ¶
type Page interface {
	// PageInfo returns the metadata for the Page.
	PageInfo() *PageInfo
	// ReadPage returns the [PageData] for the Page.
	ReadPage(ctx context.Context) (PageData, error)
}
A Page holds an encoded and optionally compressed sequence of [Value]s within a Column.
type PageData ¶
type PageData []byte
PageData holds the raw data for a page. Data is formatted as:
<uvarint(presence-bitmap-size)> <presence-bitmap> <values-data>
The presence-bitmap is a bitmap-encoded sequence of booleans, where values describe which rows are present (1) or nil (0). The presence bitmap is always stored uncompressed.
values-data is then the encoded and optionally compressed sequence of non-NULL values.
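As a rough illustration of this layout, the sketch below splits a raw page into its two sections using encoding/binary. splitPageData is a hypothetical helper, not part of this package; it does not decode the bitmap or the values, since their internal encodings are not described here.

```go
package main

import (
	"encoding/binary"
	"errors"
	"fmt"
)

// splitPageData is a hypothetical helper that separates raw page bytes into
// the presence bitmap and the values data, following the layout
// <uvarint(presence-bitmap-size)> <presence-bitmap> <values-data>.
func splitPageData(data []byte) (bitmap, values []byte, err error) {
	size, n := binary.Uvarint(data)
	if n <= 0 {
		return nil, nil, errors.New("invalid presence bitmap size")
	}
	rest := data[n:]
	if size > uint64(len(rest)) {
		return nil, nil, errors.New("presence bitmap exceeds page data")
	}
	return rest[:size], rest[size:], nil
}

func main() {
	// Build a fake page: a 2-byte placeholder bitmap followed by opaque values.
	page := binary.AppendUvarint(nil, 2)
	page = append(page, 0xAA, 0xBB)
	page = append(page, "values"...)

	bitmap, values, err := splitPageData(page)
	fmt.Println(len(bitmap), string(values), err) // 2 values <nil>
}
```

Because the presence bitmap is stored uncompressed, a reader could inspect it before deciding whether to decompress the values section at all.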
type PageInfo ¶
type PageInfo struct {
	UncompressedSize int // UncompressedSize is the size of a page before compression.
	CompressedSize int // CompressedSize is the size of a page after compression.
	CRC32 uint32 // CRC32 checksum of the page after encoding and compression.
	RowCount int // RowCount is the number of rows in the page, including NULLs.
	ValuesCount int // ValuesCount is the number of non-NULL values in the page.
	Encoding datasetmd.EncodingType // Encoding used for values in the page.
	Stats *datasetmd.Statistics // Optional statistics for the page.
}
PageInfo describes a page.
type Predicate ¶ added in v3.5.0
type Predicate interface {
// contains filtered or unexported methods
}
Predicate is an expression used to filter rows in a Reader.
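How the predicate types compose can be sketched with a toy evaluator. Everything below is hypothetical: the real Predicate interface has unexported methods and is evaluated by the Reader, whereas this sketch models a row as a map from column name to int64 and gives each stand-in type a keep method. NULL handling and page filtering are not modelled.

```go
package main

import "fmt"

// row is a hypothetical row: column name to int64 value.
type row map[string]int64

// expr is a hypothetical stand-in for Predicate.
type expr interface {
	keep(r row) bool
}

type (
	andExpr   struct{ left, right expr }
	orExpr    struct{ left, right expr }
	notExpr   struct{ inner expr }
	falseExpr struct{}
	equalExpr struct {
		column string
		value  int64
	}
	greaterThanExpr struct {
		column string
		value  int64
	}
	lessThanExpr struct {
		column string
		value  int64
	}
)

func (p andExpr) keep(r row) bool         { return p.left.keep(r) && p.right.keep(r) }
func (p orExpr) keep(r row) bool          { return p.left.keep(r) || p.right.keep(r) }
func (p notExpr) keep(r row) bool         { return !p.inner.keep(r) }
func (falseExpr) keep(row) bool           { return false }
func (p equalExpr) keep(r row) bool       { return r[p.column] == p.value }
func (p greaterThanExpr) keep(r row) bool { return r[p.column] > p.value }
func (p lessThanExpr) keep(r row) bool    { return r[p.column] < p.value }

func main() {
	// level > 2 AND NOT (status == 500)
	p := andExpr{
		left:  greaterThanExpr{column: "level", value: 2},
		right: notExpr{inner: equalExpr{column: "status", value: 500}},
	}
	rows := []row{
		{"level": 3, "status": 200},
		{"level": 3, "status": 500},
		{"level": 1, "status": 200},
	}
	for i, r := range rows {
		fmt.Println(i, p.keep(r))
	}
	// 0 true
	// 1 false
	// 2 false
}
```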
type Reader ¶ added in v3.5.0
type Reader struct {
// contains filtered or unexported fields
}
A Reader reads [Row]s from a Dataset.
func NewReader ¶ added in v3.5.0
func NewReader(opts ReaderOptions) *Reader
NewReader creates a new Reader from the provided options.
func (*Reader) Close ¶ added in v3.5.0
Close closes the Reader. Closed Readers can be reused by calling Reader.Reset.
func (*Reader) Read ¶ added in v3.5.0
Read reads up to the next len(s) rows from r and stores them into s. It returns the number of rows read and any error encountered. At the end of the Dataset, Read returns 0, io.EOF.
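A typical loop over a reader with this contract drains rows in fixed-size batches until io.EOF. The sketch below uses hypothetical stand-ins (row, rowReader, sliceReader, readAll) and a simplified Read signature, since the real method signature is not shown here.

```go
package main

import (
	"errors"
	"fmt"
	"io"
)

// row and rowReader are hypothetical stand-ins for Row and Reader. The Read
// signature is an assumption made for this sketch.
type row struct{ index int }

type rowReader interface {
	Read(s []row) (int, error)
}

// sliceReader serves rows from memory and returns 0, io.EOF once exhausted.
type sliceReader struct{ rows []row }

func (r *sliceReader) Read(s []row) (int, error) {
	if len(r.rows) == 0 {
		return 0, io.EOF
	}
	n := copy(s, r.rows)
	r.rows = r.rows[n:]
	return n, nil
}

// readAll drains r in batches of batchSize rows, treating io.EOF as the
// normal end of the data rather than as a failure.
func readAll(r rowReader, batchSize int) ([]row, error) {
	if batchSize <= 0 {
		return nil, errors.New("batchSize must be positive")
	}
	buf := make([]row, batchSize)
	var out []row
	for {
		n, err := r.Read(buf)
		out = append(out, buf[:n]...) // keep rows returned alongside an error
		if errors.Is(err, io.EOF) {
			return out, nil
		}
		if err != nil {
			return out, err
		}
	}
}

func main() {
	r := &sliceReader{rows: []row{{0}, {1}, {2}, {3}, {4}}}
	rows, err := readAll(r, 2)
	fmt.Println(len(rows), err) // 5 <nil>
}
```

Reusing one buffer across calls avoids a per-batch allocation; rows that must outlive the next call are copied out first, as readAll does with append.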
func (*Reader) Reset ¶ added in v3.5.0
func (r *Reader) Reset(opts ReaderOptions)
Reset discards any state and resets the Reader with a new set of options. This permits reusing a Reader rather than allocating a new one.
type ReaderOptions ¶ added in v3.5.0
type ReaderOptions struct {
	Dataset Dataset // Dataset to read from.
	// Columns to read from the Dataset. It is invalid to provide a Column that
	// is not in Dataset.
	//
	// The set of Columns can include columns not used in Predicate; such columns
	// are considered non-predicate columns.
	Columns []Column
	// Predicate filters the data returned by a Reader. Predicate is optional; if
	// nil, all rows from Columns are returned.
	//
	// Expressions in Predicate may only reference columns in Columns.
	Predicate Predicate
	// TargetCacheSize configures the amount of memory to target for caching
	// pages in memory. The cache may exceed this size if the combined size of
	// all pages required for a single call to [Reader.Read] exceeds this value.
	//
	// TargetCacheSize is used to download and cache additional pages in advance
	// of when they're needed. If TargetCacheSize is 0, only immediately required
	// pages are cached.
	TargetCacheSize int
}
ReaderOptions configures how a Reader will read [Row]s.
type Row ¶
type Row struct {
	Index int // Index of the row in the dataset.
	Values []Value // Values for the row, one per [Column].
}
A Row in a Dataset is a set of values across multiple columns with the same row number.
type StatisticsOptions ¶ added in v3.5.0
type StatisticsOptions struct {
	// StoreRangeStats indicates whether to store value range statistics for the
	// column and pages.
	StoreRangeStats bool
	// StoreCardinalityStats indicates whether to store cardinality estimations,
	// facilitated by HyperLogLog.
	StoreCardinalityStats bool
}
StatisticsOptions customizes the collection of statistics for a column.
type Value ¶
type Value struct {
// contains filtered or unexported fields
}
A Value represents a single value within a dataset. Unlike [any], Values can be constructed without allocations. The zero Value corresponds to nil.
func ByteArrayValue ¶ added in v3.5.0
ByteArrayValue returns a Value for a byte slice representing a string.
func (Value) ByteArray ¶ added in v3.5.0
ByteArray returns v's value as a byte slice. If v is not a string, ByteArray returns a byte slice of the form "VALUE_TYPE_T", where T is the underlying type of v.
func (Value) Int64 ¶
Int64 returns v's value as an int64. It panics if v is not a datasetmd.VALUE_TYPE_INT64.
func (Value) MarshalBinary ¶ added in v3.5.0
MarshalBinary encodes v into a binary representation. Non-NULL values encode first with the type (encoded as uvarint), followed by an encoded value, where:
- datasetmd.VALUE_TYPE_INT64 encodes as a varint.
- datasetmd.VALUE_TYPE_UINT64 encodes as a uvarint.
- datasetmd.VALUE_TYPE_STRING encodes the string as a sequence of bytes.
NULL values encode as nil.
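The encoding rules above can be sketched with encoding/binary. The numeric type tags (typeInt64, typeUint64, typeString) and the encode helpers are assumptions made for illustration; the actual datasetmd.ValueType numbers are not given here and may differ.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// Hypothetical type tags. The real datasetmd.ValueType values may differ.
const (
	typeInt64  = 1
	typeUint64 = 2
	typeString = 3
)

// encodeInt64 writes the type tag as a uvarint, then the value as a varint.
func encodeInt64(v int64) []byte {
	buf := binary.AppendUvarint(nil, typeInt64)
	return binary.AppendVarint(buf, v)
}

// encodeUint64 writes the type tag as a uvarint, then the value as a uvarint.
func encodeUint64(v uint64) []byte {
	buf := binary.AppendUvarint(nil, typeUint64)
	return binary.AppendUvarint(buf, v)
}

// encodeString writes the type tag as a uvarint, then the raw string bytes.
func encodeString(s string) []byte {
	buf := binary.AppendUvarint(nil, typeString)
	return append(buf, s...)
}

func main() {
	fmt.Printf("% x\n", encodeInt64(-1))    // 01 01
	fmt.Printf("% x\n", encodeUint64(300))  // 02 ac 02
	fmt.Printf("% x\n", encodeString("hi")) // 03 68 69
}
```

Since the string bytes carry no length prefix in this layout, a decoder would treat everything after the type tag as the string, so the encoded value has to be delimited by its container.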
func (Value) String ¶
String returns v's value as a string. Because of Go's String method convention, if v is not a string, String returns a string of the form "VALUE_TYPE_T", where T is the underlying type of v.
func (Value) Type ¶
Type returns the datasetmd.ValueType of v. If v is nil, Type returns datasetmd.VALUE_TYPE_UNSPECIFIED.
func (Value) Uint64 ¶
Uint64 returns v's value as a uint64. It panics if v is not a datasetmd.VALUE_TYPE_UINT64.
func (*Value) UnmarshalBinary ¶ added in v3.5.0
UnmarshalBinary decodes a Value from a binary representation. See Value.MarshalBinary for the encoding format.
Source Files ¶
- column.go
- column_builder.go
- column_reader.go
- column_stats.go
- dataset.go
- page.go
- page_builder.go
- page_compress_writer.go
- page_reader.go
- predicate.go
- reader.go
- reader_basic.go
- reader_downloader.go
- row_range.go
- row_ranges.go
- value.go
- value_encoding.go
- value_encoding_bitmap.go
- value_encoding_delta.go
- value_encoding_plain.go