utils

package
v7.0.1
Warning

This package is not in the latest version of its module.

Published: Jul 14, 2022 License: Apache-2.0, BSD-2-Clause, BSD-3-Clause, + 8 more Imports: 17 Imported by: 0

Documentation

Overview

Package utils contains internal utilities for the parquet library that are not intended to be exposed to external consumers, such as interfaces, bitmap readers and writers, and the RLE encoder and decoder.

Constants

const (
	MaxIndexType = math.MaxInt32
	MinIndexType = math.MinInt32
)

Max and Min constants for the IndexType

const (
	MaxValuesPerLiteralRun = (1 << 6) * 8
)

Variables

var (
	ToLEUint32  = func(x uint32) uint32 { return x }
	ToLEUint64  = func(x uint64) uint64 { return x }
	ToLEInt32   = func(x int32) int32 { return x }
	ToLEInt64   = func(x int64) int64 { return x }
	ToLEFloat32 = func(x float32) float32 { return x }
	ToLEFloat64 = func(x float64) float64 { return x }
)

Functions

func BytesToBools

func BytesToBools(in []byte, out []bool)

BytesToBools efficiently populates a slice of booleans from an input bitmap
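As a rough illustration of what populating booleans from a bitmap involves, the sketch below unpacks bits one at a time. The helper name bytesToBools is hypothetical, and the sketch assumes the least-significant-bit-first ordering used by Arrow bitmaps; the real function is optimized and may handle length mismatches differently.

```go
package main

import "fmt"

// bytesToBools is a hypothetical stand-in, not the library function.
// It reads bit i from byte i/8 at bit position i%8 (LSB first, assumed)
// and stops when either the bitmap or the output slice is exhausted.
func bytesToBools(in []byte, out []bool) {
	for i := range out {
		if i/8 >= len(in) {
			return
		}
		out[i] = in[i/8]&(1<<uint(i%8)) != 0
	}
}

func main() {
	out := make([]bool, 10)
	bytesToBools([]byte{0x05, 0x02}, out)
	fmt.Println(out)
}
```

Here bits 0 and 2 of the first byte and bit 1 of the second byte are set, so indexes 0, 2 and 9 of out become true.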

func GetMinMaxInt32

func GetMinMaxInt32(v []int32) (min, max int32)

GetMinMaxInt32 returns the min and max for an int32 slice, using AVX2 or SSE4 CPU extensions if available, falling back to a pure Go implementation if they are unavailable or if built with the noasm tag.

func GetMinMaxInt64

func GetMinMaxInt64(v []int64) (min, max int64)

GetMinMaxInt64 returns the min and max for an int64 slice, using AVX2 or SSE4 CPU extensions if available, falling back to a pure Go implementation if they are unavailable or if built with the noasm tag.

func GetMinMaxUint32

func GetMinMaxUint32(v []uint32) (min, max uint32)

GetMinMaxUint32 returns the min and max for a uint32 slice, using AVX2 or SSE4 CPU extensions if available, falling back to a pure Go implementation if they are unavailable or if built with the noasm tag.

func GetMinMaxUint64

func GetMinMaxUint64(v []uint64) (min, max uint64)

GetMinMaxUint64 returns the min and max for a uint64 slice, using AVX2 or SSE4 CPU extensions if available, falling back to a pure Go implementation if they are unavailable or if built with the noasm tag.
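The pure Go fallback mentioned for the GetMinMax functions can be pictured as a single scalar pass. The following is a minimal sketch for int32 only; the name minMaxInt32 and the result for an empty slice are assumptions of this example, not documented behavior.

```go
package main

import (
	"fmt"
	"math"
)

// minMaxInt32 is a hypothetical scalar fallback: one pass over the slice
// with two comparisons per element. For an empty slice it returns
// (MaxInt32, MinInt32), which is an assumption of this sketch.
func minMaxInt32(v []int32) (lo, hi int32) {
	lo, hi = math.MaxInt32, math.MinInt32
	for _, x := range v {
		if x < lo {
			lo = x
		}
		if x > hi {
			hi = x
		}
	}
	return lo, hi
}

func main() {
	lo, hi := minMaxInt32([]int32{3, -7, 12, 0})
	fmt.Println(lo, hi) // -7 12
}
```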

func IsMultipleOf64

func IsMultipleOf64(v int64) bool

IsMultipleOf64 returns whether v is a multiple of 64.

func LeastSignificantBitMask

func LeastSignificantBitMask(index int64) uint64

LeastSignificantBitMask returns a bit mask that selects the least significant bits of a value, up to the bit index passed in. For example, for a mask covering the 4 least significant bits, call LeastSignificantBitMask(4).
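Both IsMultipleOf64 and LeastSignificantBitMask reduce to one bit operation each. The sketch below shows one way to write them; the lowercase names are hypothetical, and the mask assumes 0 <= index <= 64.

```go
package main

import "fmt"

// isMultipleOf64 checks that the low six bits are zero.
func isMultipleOf64(v int64) bool {
	return v&63 == 0
}

// leastSignificantBitMask builds a mask with the low `index` bits set.
// In Go a shift by 64 yields 0, so index == 64 produces an all-ones mask.
func leastSignificantBitMask(index int64) uint64 {
	return (uint64(1) << uint(index)) - 1
}

func main() {
	fmt.Println(isMultipleOf64(128), isMultipleOf64(65)) // true false
	fmt.Printf("%#x\n", leastSignificantBitMask(4))      // 0xf
}
```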

func Max

func Max(a, b int64) int64

Max is a convenience Max function for int64

func MaxBufferSize

func MaxBufferSize(width, numValues int) int

func MaxInt

func MaxInt(a, b int) int

MaxInt is a convenience Max function for int

func Min

func Min(a, b int64) int64

Min is a convenience Min function for int64

func MinBufferSize

func MinBufferSize(bitWidth int) int

func MinInt

func MinInt(a, b int) int

MinInt is a convenience Min function for int

func VisitBitBlocks

func VisitBitBlocks(bitmap []byte, offset, length int64, visitValid func(pos int64), visitInvalid func())

VisitBitBlocks is a utility for iterating through the blocks of bits in a bitmap, calling visitValid or visitInvalid as appropriate for each bit. visitValid is called with the bit offset of the valid bit. Avoid it inside tight, performance-sensitive loops; construct those loops manually instead.
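The callback contract can be illustrated with a bit-by-bit loop. This sketch does not process blocks the way the library does; it only shows which callback fires for which bit. The name visitBits is hypothetical, and passing positions relative to offset (rather than absolute bit offsets) is an assumption.

```go
package main

import "fmt"

// visitBits is a hypothetical, unoptimized equivalent of the callback
// contract: visitValid for every set bit, visitInvalid for every unset bit.
// Positions are reported relative to offset (assumed), bits are LSB first.
func visitBits(bitmap []byte, offset, length int64, visitValid func(pos int64), visitInvalid func()) {
	for i := int64(0); i < length; i++ {
		p := offset + i
		if bitmap[p/8]&(1<<uint(p%8)) != 0 {
			visitValid(i)
		} else {
			visitInvalid()
		}
	}
}

func main() {
	nulls := 0
	visitBits([]byte{0x05}, 0, 4,
		func(pos int64) { fmt.Println("valid at", pos) },
		func() { nulls++ })
	fmt.Println("nulls:", nulls)
}
```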

func VisitSetBitRuns

func VisitSetBitRuns(bitmap []byte, bitmapOffset int64, length int64, visitFn VisitFn) error

VisitSetBitRuns is just a convenience function for calling NewSetBitRunReader and then VisitSetBitRuns

Types

type BitBlockCount

type BitBlockCount struct {
	Len    int16
	Popcnt int16
}

BitBlockCount is returned by the various bit block counter utilities in order to return a length of bits and the population count of that slice of bits.

func (BitBlockCount) AllSet

func (b BitBlockCount) AllSet() bool

AllSet returns true if all the bits were 1 in this set, i.e. Popcnt == Len.

func (BitBlockCount) NoneSet

func (b BitBlockCount) NoneSet() bool

NoneSet returns true if all the bits were 0 in this set, i.e. Popcnt == 0.

type BitBlockCounter

type BitBlockCounter struct {
	// contains filtered or unexported fields
}

BitBlockCounter is a utility for grabbing chunks of a bitmap at a time and efficiently counting the number of bits which are 1.

func NewBitBlockCounter

func NewBitBlockCounter(bitmap []byte, startOffset, nbits int64) *BitBlockCounter

NewBitBlockCounter returns a BitBlockCounter for the passed bitmap starting at startOffset of length nbits.

func (*BitBlockCounter) NextFourWords

func (b *BitBlockCounter) NextFourWords() BitBlockCount

NextFourWords returns the next run of available bits, usually 256. The returned pair contains the size of run and the number of true values. The last block will have a length less than 256 if the bitmap length is not a multiple of 256, and will return 0-length blocks in subsequent invocations.

func (*BitBlockCounter) NextWord

func (b *BitBlockCounter) NextWord() BitBlockCount

NextWord returns the next run of available bits, usually 64. The returned pair contains the size of run and the number of true values. The last block will have a length less than 64 if the bitmap length is not a multiple of 64, and will return 0-length blocks in subsequent invocations.
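To make the word-at-a-time counting concrete, here is a minimal sketch that walks a bitmap 64 bits at a time and records a length and population count per block. It assumes a byte-aligned start (no bit offset) and LSB-first bits; blockCount and countWords are hypothetical names.

```go
package main

import (
	"encoding/binary"
	"fmt"
	"math/bits"
)

// blockCount mirrors the shape of BitBlockCount for this example only.
type blockCount struct{ Len, Popcnt int16 }

func (b blockCount) AllSet() bool  { return b.Popcnt == b.Len }
func (b blockCount) NoneSet() bool { return b.Popcnt == 0 }

// countWords consumes nbits from bitmap in 64-bit words. The final block
// is shorter when nbits is not a multiple of 64.
func countWords(bitmap []byte, nbits int64) []blockCount {
	var out []blockCount
	for nbits > 0 {
		n := int64(64)
		if nbits < n {
			n = nbits
		}
		var buf [8]byte
		copy(buf[:], bitmap) // zero-padded when fewer than 8 bytes remain
		word := binary.LittleEndian.Uint64(buf[:])
		if n < 64 {
			word &= (uint64(1) << uint(n)) - 1 // ignore bits past the end
		}
		out = append(out, blockCount{int16(n), int16(bits.OnesCount64(word))})
		if len(bitmap) > 8 {
			bitmap = bitmap[8:]
		} else {
			bitmap = nil
		}
		nbits -= n
	}
	return out
}

func main() {
	bitmap := []byte{0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF}
	for _, b := range countWords(bitmap, 70) {
		fmt.Println(b.Len, b.Popcnt, b.AllSet())
	}
}
```

With 70 bits all set, the sketch yields one block of 64 bits followed by one block of 6 bits, both reporting AllSet.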

type BitReader

type BitReader struct {
	// contains filtered or unexported fields
}

BitReader implements functionality for reading bits or bytes buffering up to a uint64 at a time from the reader in order to improve efficiency. It also provides methods to read multiple bytes in one read such as encoded ints/values.

This BitReader is the basis for the other utility classes like RLE decoding and such, providing the necessary functions for interpreting the values.

func NewBitReader

func NewBitReader(r reader) *BitReader

NewBitReader takes in a reader that implements io.Reader, io.ReaderAt and io.Seeker interfaces and returns a BitReader for use with various bit level manipulations.

func (*BitReader) CurOffset

func (b *BitReader) CurOffset() int64

CurOffset returns the current byte offset into the data that the reader is at.

func (*BitReader) GetAligned

func (b *BitReader) GetAligned(nbytes int, v interface{}) bool

GetAligned reads nbytes from the underlying stream into the passed interface value, returning false if there aren't enough bytes remaining in the stream or if an invalid type is passed. The bytes are read aligned to byte boundaries.

v must be a pointer to a byte or sized uint type (*byte, *uint16, *uint32, *uint64). Encoded values are assumed to be little endian.

func (*BitReader) GetBatch

func (b *BitReader) GetBatch(bits uint, out []uint64) (int, error)

GetBatch fills out by repeatedly decoding values from the stream, each encoded using bits as the number of bits per value. The values are expected to be bit packed, so they are unpacked to populate out.

func (*BitReader) GetBatchBools

func (b *BitReader) GetBatchBools(out []bool) (int, error)

GetBatchBools is like GetBatch but optimized for reading bits as boolean values

func (*BitReader) GetBatchIndex

func (b *BitReader) GetBatchIndex(bits uint, out []IndexType) (i int, err error)

GetBatchIndex is like GetBatch but for IndexType (used for dictionary decoding)

func (*BitReader) GetValue

func (b *BitReader) GetValue(width int) (uint64, bool)

GetValue returns a single value that is bit packed using width as the number of bits and returns false if there weren't enough bits remaining.
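Reading a bit-packed value amounts to collecting width consecutive bits from the stream. The sketch below does this one bit at a time over a byte slice, without the uint64 buffering the real reader uses. The bitReader type here is hypothetical and assumes values are packed starting from the least significant bit of each byte.

```go
package main

import "fmt"

// bitReader is a hypothetical, unbuffered reader over a byte slice.
type bitReader struct {
	data   []byte
	bitPos int
}

// getValue returns the next `width` bits as a uint64, or false if the
// width is out of range or not enough bits remain.
func (r *bitReader) getValue(width int) (uint64, bool) {
	if width < 0 || width > 64 || r.bitPos+width > len(r.data)*8 {
		return 0, false
	}
	var v uint64
	for i := 0; i < width; i++ {
		p := r.bitPos + i
		bit := uint64(r.data[p/8]>>uint(p%8)) & 1
		v |= bit << uint(i)
	}
	r.bitPos += width
	return v, true
}

func main() {
	r := &bitReader{data: []byte{0xB1}} // 0b1011_0001
	a, _ := r.getValue(3)
	b, _ := r.getValue(5)
	_, ok := r.getValue(1)
	fmt.Println(a, b, ok) // 1 22 false
}
```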

func (*BitReader) GetVlqInt

func (b *BitReader) GetVlqInt() (uint64, bool)

GetVlqInt reads a VLQ-encoded int from the stream. The encoded value must start at the beginning of a byte, and this returns false if there weren't enough bytes in the buffer or reader. It calls ReadByte, which in turn retrieves byte-aligned values from the reader.

func (*BitReader) GetZigZagVlqInt

func (b *BitReader) GetZigZagVlqInt() (int64, bool)

GetZigZagVlqInt reads a zigzag encoded integer, returning false if there weren't enough bytes remaining.
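The two decoding steps can be sketched with the standard library. This example assumes the VLQ layout matches the unsigned varint format implemented by encoding/binary (7 data bits per byte, low group first, high bit as continuation flag); the helper names and the extra byte-count return value are hypothetical.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// getVlqInt decodes one unsigned varint from buf, reporting the bytes
// consumed and false if buf ends before the value is complete.
func getVlqInt(buf []byte) (uint64, int, bool) {
	v, n := binary.Uvarint(buf)
	return v, n, n > 0
}

// getZigZagVlqInt decodes a varint and then undoes zigzag encoding,
// which maps 0, 1, 2, 3, ... back to 0, -1, 1, -2, ...
func getZigZagVlqInt(buf []byte) (int64, int, bool) {
	u, n, ok := getVlqInt(buf)
	if !ok {
		return 0, 0, false
	}
	return int64(u>>1) ^ -int64(u&1), n, true
}

func main() {
	v, n, _ := getVlqInt([]byte{0xAC, 0x02})
	fmt.Println(v, n) // 300 2
	z, _, _ := getZigZagVlqInt([]byte{0x03})
	fmt.Println(z) // -2
}
```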

func (*BitReader) ReadByte

func (b *BitReader) ReadByte() (byte, error)

ReadByte reads a single aligned byte from the underlying stream, returning an error if there aren't enough bytes left.

func (*BitReader) Reset

func (b *BitReader) Reset(r reader)

Reset allows reusing a BitReader by setting a new reader and resetting the internal state back to zeros.

type BitRun

type BitRun struct {
	Len int64
	Set bool
}

BitRun represents a run of bits with the same value of length Len with Set representing if the group of bits were 1 or 0.

func (BitRun) String

func (b BitRun) String() string

type BitRunReader

type BitRunReader interface {
	NextRun() BitRun
}

BitRunReader is an interface that is usable by multiple callers to provide multiple types of bit run readers such as a reverse reader and so on.

It's a convenience interface for counting contiguous set/unset bits in a bitmap. Where BitBlockCounter can be used, prefer it, as it is faster than BitRunReader.

func NewBitRunReader

func NewBitRunReader(bitmap []byte, offset int64, length int64) BitRunReader

NewBitRunReader returns a reader for the given bitmap, offset and length that grabs runs of the same value bit at a time for easy iteration.

type BitWriter

type BitWriter struct {
	// contains filtered or unexported fields
}

BitWriter is a utility for writing values of specific bit widths to a stream using a uint64 as a buffer to build up between flushing for efficiency.

func NewBitWriter

func NewBitWriter(w io.WriterAt) *BitWriter

NewBitWriter initializes a new bit writer to write to the passed in interface using WriteAt to write the appropriate offsets and values.

func (*BitWriter) Clear

func (b *BitWriter) Clear()

Clear resets the writer so that subsequent writes will start from offset 0, allowing reuse of the underlying buffer and writer.

func (*BitWriter) Flush

func (b *BitWriter) Flush(align bool)

Flush flushes any buffered data to the underlying writer. Pass true if the next write should be byte-aligned after this flush.

func (*BitWriter) ReserveBytes

func (b *BitWriter) ReserveBytes(nbytes int) int

ReserveBytes reserves the next aligned nbytes, skipping them and returning the offset to use with WriteAt to write to those reserved bytes. Used for RLE encoding to fill in the indicators after encoding.

func (*BitWriter) WriteAligned

func (b *BitWriter) WriteAligned(val uint64, nbytes int) bool

WriteAligned writes the value val as a little endian value in exactly nbytes byte-aligned to the underlying writer, flushing via Flush(true) before writing nbytes without buffering.

func (*BitWriter) WriteAt

func (b *BitWriter) WriteAt(val []byte, off int64) (int, error)

WriteAt fulfills the io.WriterAt interface to write len(val) bytes from val to the underlying byte slice starting at offset off. It returns the number of bytes written from val (0 <= n <= len(val)) and any error encountered. This allows writing full bytes directly to the underlying writer.

func (*BitWriter) WriteValue

func (b *BitWriter) WriteValue(v uint64, nbits uint) error

WriteValue writes the value v using nbits to pack it, returning an error if the write fails.
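Packing a value means placing its low nbits bits at the current bit position of the output. The sketch below writes into a fixed byte slice one bit at a time rather than buffering in a uint64 and flushing through an io.WriterAt as the real writer does. The bitWriter type and its error messages are hypothetical, and LSB-first packing is assumed.

```go
package main

import (
	"errors"
	"fmt"
)

// bitWriter is a hypothetical, unbuffered writer over a fixed byte slice.
type bitWriter struct {
	buf    []byte
	bitPos int
}

// writeValue packs the low nbits of v at the current bit position.
func (w *bitWriter) writeValue(v uint64, nbits uint) error {
	if nbits > 64 {
		return errors.New("nbits must be at most 64")
	}
	if w.bitPos+int(nbits) > len(w.buf)*8 {
		return errors.New("not enough space left in buffer")
	}
	for i := uint(0); i < nbits; i++ {
		if (v>>i)&1 == 1 {
			p := w.bitPos + int(i)
			w.buf[p/8] |= 1 << uint(p%8)
		}
	}
	w.bitPos += int(nbits)
	return nil
}

func main() {
	w := &bitWriter{buf: make([]byte, 1)}
	_ = w.writeValue(1, 3)
	_ = w.writeValue(22, 5)
	fmt.Printf("%#x\n", w.buf[0]) // 0xb1
}
```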

func (*BitWriter) WriteVlqInt

func (b *BitWriter) WriteVlqInt(v uint64) bool

WriteVlqInt writes v as a vlq encoded integer byte-aligned to the underlying writer without buffering.

func (*BitWriter) WriteZigZagVlqInt

func (b *BitWriter) WriteZigZagVlqInt(v int64) bool

WriteZigZagVlqInt writes a zigzag encoded integer byte-aligned to the underlying writer without buffering.
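On the encoding side, zigzag folds the sign into the lowest bit so that small negative numbers stay short once varint encoded. A minimal sketch, assuming the VLQ layout matches the unsigned varint format of encoding/binary; zigZag and appendZigZagVlq are hypothetical helpers that append to a byte slice instead of writing through a BitWriter.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// zigZag maps 0, -1, 1, -2, ... to 0, 1, 2, 3, ...
func zigZag(v int64) uint64 {
	return uint64(v<<1) ^ uint64(v>>63)
}

// appendZigZagVlq appends the zigzag varint encoding of v to dst.
func appendZigZagVlq(dst []byte, v int64) []byte {
	var tmp [binary.MaxVarintLen64]byte
	n := binary.PutUvarint(tmp[:], zigZag(v))
	return append(dst, tmp[:n]...)
}

func main() {
	fmt.Printf("% x\n", appendZigZagVlq(nil, -2))  // 03
	fmt.Printf("% x\n", appendZigZagVlq(nil, 150)) // ac 02
}
```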

func (*BitWriter) Written

func (b *BitWriter) Written() int

Written returns the number of bytes that have been written to the BitWriter, not how many bytes have been flushed. Use Flush to ensure that all data is flushed to the underlying writer.

type BitmapWriter

type BitmapWriter interface {
	// Set sets the current bit that will be written
	Set()
	// Clear clears the current bit that will be written
	Clear()
	// Next advances to the next bit for the writer
	Next()
	// Finish flushes the current byte out to the bitmap slice
	Finish()
	// AppendWord takes nbits from word which should be an LSB bitmap and appends them to the bitmap.
	AppendWord(word uint64, nbits int64)
	// AppendBools appends the bit representation of the bools slice, returning the number
	// of bools that were able to fit in the remaining length of the bitmapwriter.
	AppendBools(in []bool) int
	// Pos is the current position that will be written next
	Pos() int
	// Reset allows reusing the bitmapwriter by resetting Pos to start with length as
	// the number of bits that the writer can write.
	Reset(start, length int)
}

BitmapWriter is an interface for bitmap writers so that we can use multiple implementations or swap if necessary.
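The Set, Clear, Next and Pos methods describe a cursor moving over a bitmap. The following sketch implements just those four directly on a byte slice; it is a hypothetical, simplified type that writes each bit immediately instead of accumulating the current byte, so it has no need for Finish, and it assumes LSB-first bit order.

```go
package main

import "fmt"

// bitmapWriter is a hypothetical minimal cursor over a bitmap.
type bitmapWriter struct {
	bitmap []byte
	pos    int
}

// Set turns the current bit on.
func (w *bitmapWriter) Set() { w.bitmap[w.pos/8] |= 1 << uint(w.pos%8) }

// Clear turns the current bit off.
func (w *bitmapWriter) Clear() { w.bitmap[w.pos/8] &^= 1 << uint(w.pos%8) }

// Next advances the cursor by one bit.
func (w *bitmapWriter) Next() { w.pos++ }

// Pos reports the position that will be written next.
func (w *bitmapWriter) Pos() int { return w.pos }

func main() {
	w := &bitmapWriter{bitmap: make([]byte, 1)}
	for _, valid := range []bool{true, false, true, true} {
		if valid {
			w.Set()
		} else {
			w.Clear()
		}
		w.Next()
	}
	fmt.Printf("%#x %d\n", w.bitmap[0], w.Pos()) // 0xd 4
}
```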

func NewBitmapWriter

func NewBitmapWriter(bitmap []byte, start, length int) BitmapWriter

func NewFirstTimeBitmapWriter

func NewFirstTimeBitmapWriter(buf []byte, start, length int64) BitmapWriter

NewFirstTimeBitmapWriter creates a bitmap writer that might clobber any bit values following the bits written to the bitmap. In exchange, it is faster than the writer created with NewBitmapWriter.

type DictionaryConverter

type DictionaryConverter interface {
	// Copy takes an interface{} which must be a slice of the appropriate type, and will be populated
	// by the dictionary values at the indexes from the IndexType slice
	Copy(interface{}, []IndexType) error
	// Fill fills interface{} which must be a slice of the appropriate type, with the value
	// specified by the dictionary index passed in.
	Fill(interface{}, IndexType) error
	// FillZero fills interface{}, which must be a slice of the appropriate type, with the zero value
	// for the given type.
	FillZero(interface{})
	// IsValid validates that all of the indexes passed in are valid indexes for the dictionary
	IsValid(...IndexType) bool
}

DictionaryConverter is an interface used for dealing with RLE decoding and encoding when working with dictionaries to get values from indexes.
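A concrete converter helps show how the four methods relate. The sketch below implements the same method set for a dictionary of int32 values; int32Dict is a hypothetical type written for this example, IndexType is redeclared locally, and the error messages and length checks are assumptions.

```go
package main

import (
	"errors"
	"fmt"
)

type IndexType = int32

// int32Dict is a hypothetical converter backed by a slice of int32 values.
type int32Dict struct{ values []int32 }

// IsValid reports whether every index refers to a dictionary entry.
func (d *int32Dict) IsValid(idxes ...IndexType) bool {
	for _, i := range idxes {
		if i < 0 || int(i) >= len(d.values) {
			return false
		}
	}
	return true
}

// Copy writes the dictionary value for each index into out.
func (d *int32Dict) Copy(out interface{}, idxes []IndexType) error {
	o, ok := out.([]int32)
	if !ok || len(o) < len(idxes) {
		return errors.New("out must be a []int32 at least as long as idxes")
	}
	if !d.IsValid(idxes...) {
		return errors.New("index out of range")
	}
	for i, idx := range idxes {
		o[i] = d.values[idx]
	}
	return nil
}

// Fill sets every element of out to the value at idx.
func (d *int32Dict) Fill(out interface{}, idx IndexType) error {
	o, ok := out.([]int32)
	if !ok || !d.IsValid(idx) {
		return errors.New("out must be []int32 and idx must be valid")
	}
	for i := range o {
		o[i] = d.values[idx]
	}
	return nil
}

// FillZero sets every element of out to zero.
func (d *int32Dict) FillZero(out interface{}) {
	if o, ok := out.([]int32); ok {
		for i := range o {
			o[i] = 0
		}
	}
}

func main() {
	d := &int32Dict{values: []int32{10, 20, 30}}
	out := make([]int32, 4)
	if err := d.Copy(out, []IndexType{2, 0, 1, 2}); err != nil {
		fmt.Println("copy failed:", err)
		return
	}
	fmt.Println(out) // [30 10 20 30]
}
```

A repeated run in the RLE stream maps naturally onto Fill (one index, many outputs), while a literal run maps onto Copy (one index per output).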

type IndexType

type IndexType = int32

IndexType is the type we're going to use for Dictionary indexes, currently an alias to int32

type OptionalBitBlockCounter

type OptionalBitBlockCounter struct {
	// contains filtered or unexported fields
}

OptionalBitBlockCounter is a useful counter to iterate through a possibly non-existent validity bitmap to allow us to write one code path for both the with-nulls and no-nulls cases without giving up a lot of performance.

func NewOptionalBitBlockCounter

func NewOptionalBitBlockCounter(bitmap []byte, offset, length int64) *OptionalBitBlockCounter

NewOptionalBitBlockCounter constructs and returns a new bit block counter that properly handles the case when the bitmap is nil. If the bitmap is guaranteed to be non-nil, prefer NewBitBlockCounter.

func (*OptionalBitBlockCounter) NextBlock

func (obc *OptionalBitBlockCounter) NextBlock() BitBlockCount

NextBlock returns the block count for the next word when the bitmap is available; otherwise it returns a block with length up to INT16_MAX when there is no validity bitmap (so all the referenced values are not null).

func (*OptionalBitBlockCounter) NextWord

func (obc *OptionalBitBlockCounter) NextWord() BitBlockCount

NextWord is like NextBlock, but returns a word-sized block even when there is no validity bitmap

type RleDecoder

type RleDecoder struct {
	// contains filtered or unexported fields
}

func NewRleDecoder

func NewRleDecoder(data *bytes.Reader, width int) *RleDecoder

func (*RleDecoder) GetBatch

func (r *RleDecoder) GetBatch(values []uint64) int

func (*RleDecoder) GetBatchSpaced

func (r *RleDecoder) GetBatchSpaced(vals []uint64, nullcount int, validBits []byte, validBitsOffset int64) (int, error)

func (*RleDecoder) GetBatchWithDict

func (r *RleDecoder) GetBatchWithDict(dc DictionaryConverter, vals interface{}) (int, error)

func (*RleDecoder) GetBatchWithDictByteArray

func (r *RleDecoder) GetBatchWithDictByteArray(dc DictionaryConverter, vals []parquet.ByteArray) (int, error)

func (*RleDecoder) GetBatchWithDictFixedLenByteArray

func (r *RleDecoder) GetBatchWithDictFixedLenByteArray(dc DictionaryConverter, vals []parquet.FixedLenByteArray) (int, error)

func (*RleDecoder) GetBatchWithDictFloat32

func (r *RleDecoder) GetBatchWithDictFloat32(dc DictionaryConverter, vals []float32) (int, error)

func (*RleDecoder) GetBatchWithDictFloat64

func (r *RleDecoder) GetBatchWithDictFloat64(dc DictionaryConverter, vals []float64) (int, error)

func (*RleDecoder) GetBatchWithDictInt32

func (r *RleDecoder) GetBatchWithDictInt32(dc DictionaryConverter, vals []int32) (int, error)

func (*RleDecoder) GetBatchWithDictInt64

func (r *RleDecoder) GetBatchWithDictInt64(dc DictionaryConverter, vals []int64) (int, error)

func (*RleDecoder) GetBatchWithDictInt96

func (r *RleDecoder) GetBatchWithDictInt96(dc DictionaryConverter, vals []parquet.Int96) (int, error)

func (*RleDecoder) GetBatchWithDictSpaced

func (r *RleDecoder) GetBatchWithDictSpaced(dc DictionaryConverter, vals interface{}, nullCount int, validBits []byte, validBitsOffset int64) (int, error)

func (*RleDecoder) GetBatchWithDictSpacedByteArray

func (r *RleDecoder) GetBatchWithDictSpacedByteArray(dc DictionaryConverter, vals []parquet.ByteArray, nullCount int, validBits []byte, validBitsOffset int64) (totalProcessed int, err error)

func (*RleDecoder) GetBatchWithDictSpacedFixedLenByteArray

func (r *RleDecoder) GetBatchWithDictSpacedFixedLenByteArray(dc DictionaryConverter, vals []parquet.FixedLenByteArray, nullCount int, validBits []byte, validBitsOffset int64) (totalProcessed int, err error)

func (*RleDecoder) GetBatchWithDictSpacedFloat32

func (r *RleDecoder) GetBatchWithDictSpacedFloat32(dc DictionaryConverter, vals []float32, nullCount int, validBits []byte, validBitsOffset int64) (totalProcessed int, err error)

func (*RleDecoder) GetBatchWithDictSpacedFloat64

func (r *RleDecoder) GetBatchWithDictSpacedFloat64(dc DictionaryConverter, vals []float64, nullCount int, validBits []byte, validBitsOffset int64) (totalProcessed int, err error)

func (*RleDecoder) GetBatchWithDictSpacedInt32

func (r *RleDecoder) GetBatchWithDictSpacedInt32(dc DictionaryConverter, vals []int32, nullCount int, validBits []byte, validBitsOffset int64) (totalProcessed int, err error)

func (*RleDecoder) GetBatchWithDictSpacedInt64

func (r *RleDecoder) GetBatchWithDictSpacedInt64(dc DictionaryConverter, vals []int64, nullCount int, validBits []byte, validBitsOffset int64) (totalProcessed int, err error)

func (*RleDecoder) GetBatchWithDictSpacedInt96

func (r *RleDecoder) GetBatchWithDictSpacedInt96(dc DictionaryConverter, vals []parquet.Int96, nullCount int, validBits []byte, validBitsOffset int64) (totalProcessed int, err error)

func (*RleDecoder) GetValue

func (r *RleDecoder) GetValue() (uint64, bool)

func (*RleDecoder) Next

func (r *RleDecoder) Next() bool

func (*RleDecoder) Reset

func (r *RleDecoder) Reset(data *bytes.Reader, width int)

type RleEncoder

type RleEncoder struct {
	BitWidth int
	// contains filtered or unexported fields
}

func NewRleEncoder

func NewRleEncoder(w io.WriterAt, width int) *RleEncoder

func (*RleEncoder) Clear

func (r *RleEncoder) Clear()

func (*RleEncoder) Flush

func (r *RleEncoder) Flush() int

func (*RleEncoder) Put

func (r *RleEncoder) Put(value uint64) error

Put buffers input values 8 at a time. After seeing all 8 values, it decides whether they should be encoded as a literal or repeated run.

type SetBitRun

type SetBitRun struct {
	Pos    int64
	Length int64
}

SetBitRun describes a run of contiguous set bits in a bitmap with Pos being the starting position of the run and Length being the number of bits.

func (SetBitRun) AtEnd

func (s SetBitRun) AtEnd() bool

AtEnd returns true if this bit run is the end of the set by checking that the length is 0.

func (SetBitRun) Equal

func (s SetBitRun) Equal(rhs SetBitRun) bool

Equal returns whether rhs is the same run as s

type SetBitRunReader

type SetBitRunReader interface {
	// NextRun will return the next run of contiguous set bits in the bitmap
	NextRun() SetBitRun
	// Reset allows re-using the reader by providing a new bitmap, offset and length. The arguments
	// match the New function for the reader being used.
	Reset([]byte, int64, int64)
	// VisitSetBitRuns calls visitFn for each set in a loop starting from the current position
	// it's roughly equivalent to simply looping, calling NextRun and calling visitFn on the run
	// for each run.
	VisitSetBitRuns(visitFn VisitFn) error
}

SetBitRunReader is an interface for reading groups of contiguous set bits from a bitmap. The interface allows us to create different reader implementations that share the same interface easily such as a reverse set reader.

func NewReverseSetBitRunReader

func NewReverseSetBitRunReader(validBits []byte, startOffset, numValues int64) SetBitRunReader

NewReverseSetBitRunReader returns a SetBitRunReader like NewSetBitRunReader, except it will return runs starting from the end of the bitmap until it reaches startOffset rather than starting at startOffset and reading from there. The SetBitRuns will still operate the same, so Pos will still be the position of the "left-most" bit of the run or the "start" of the run. It just returns runs starting from the end instead of starting from the beginning.

func NewSetBitRunReader

func NewSetBitRunReader(validBits []byte, startOffset, numValues int64) SetBitRunReader

NewSetBitRunReader returns a SetBitRunReader for the bitmap starting at startOffset which will read numValues bits.
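What a set-bit run reader produces can be shown with a simple scan that collects every run up front. This is a bit-by-bit sketch, not the library's word-based implementation; setBitRuns is a hypothetical helper, bits are assumed LSB first, and reporting Pos relative to startOffset is an assumption.

```go
package main

import "fmt"

// setBitRun mirrors the shape of SetBitRun for this example only.
type setBitRun struct{ Pos, Length int64 }

// setBitRuns returns every run of contiguous set bits in the range.
func setBitRuns(bitmap []byte, startOffset, numValues int64) []setBitRun {
	var runs []setBitRun
	runStart := int64(-1) // -1 means "not currently inside a run"
	for i := int64(0); i < numValues; i++ {
		p := startOffset + i
		set := bitmap[p/8]&(1<<uint(p%8)) != 0
		switch {
		case set && runStart < 0:
			runStart = i
		case !set && runStart >= 0:
			runs = append(runs, setBitRun{runStart, i - runStart})
			runStart = -1
		}
	}
	if runStart >= 0 {
		runs = append(runs, setBitRun{runStart, numValues - runStart})
	}
	return runs
}

func main() {
	// 0x6E = 0b0110_1110: bits 1-3 and 5-6 are set.
	fmt.Println(setBitRuns([]byte{0x6E}, 0, 8)) // [{1 3} {5 2}]
}
```

A reverse reader would yield the same runs in the opposite order, with Pos still marking the start of each run.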

type TellWrapper

type TellWrapper struct {
	io.Writer
	// contains filtered or unexported fields
}

TellWrapper wraps any io.Writer to add a Tell function that tracks the position based on calls to Write. It does not take into account any calls to Seek or any writes that don't go through the TellWrapper.
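The idea behind such a wrapper is to count bytes as they pass through Write. A minimal sketch, with tellWriter and sliceWriter as hypothetical types made up for this example:

```go
package main

import (
	"fmt"
	"io"
)

// tellWriter counts the bytes written through it.
type tellWriter struct {
	io.Writer
	pos int64
}

// Write forwards to the wrapped writer and advances the position by the
// number of bytes actually written.
func (w *tellWriter) Write(p []byte) (int, error) {
	n, err := w.Writer.Write(p)
	w.pos += int64(n)
	return n, err
}

// Tell reports how many bytes have gone through Write so far.
func (w *tellWriter) Tell() int64 { return w.pos }

// Close closes the wrapped writer only if it is an io.Closer.
func (w *tellWriter) Close() error {
	if c, ok := w.Writer.(io.Closer); ok {
		return c.Close()
	}
	return nil
}

// sliceWriter is a tiny in-memory io.Writer used only by this example.
type sliceWriter struct{ data []byte }

func (s *sliceWriter) Write(p []byte) (int, error) {
	s.data = append(s.data, p...)
	return len(p), nil
}

func main() {
	w := &tellWriter{Writer: &sliceWriter{}}
	w.Write([]byte("abc"))
	w.Write([]byte("de"))
	fmt.Println(w.Tell()) // 5
}
```

Because the count only advances inside Write, bytes written directly to the wrapped writer are invisible to Tell, which matches the caveat above.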

func (*TellWrapper) Close

func (w *TellWrapper) Close() error

Close makes TellWrapper an io.Closer so that calling Close will also call Close on the wrapped writer if it has a Close function.

func (*TellWrapper) Tell

func (w *TellWrapper) Tell() int64

func (*TellWrapper) Write

func (w *TellWrapper) Write(p []byte) (n int, err error)

type VisitFn

type VisitFn func(pos int64, length int64) error

VisitFn is a callback function for visiting runs of contiguous bits

type WriteCloserTell

type WriteCloserTell interface {
	io.WriteCloser
	Tell() int64
}

WriteCloserTell is an interface adding a Tell function to a WriteCloser so if the underlying writer has a Close function, it is exposed and not hidden.

type WriterAtBuffer

type WriterAtBuffer struct {
	// contains filtered or unexported fields
}

WriterAtBuffer is a convenience struct for providing a WriteAt function to a byte slice for use with things that want an io.WriterAt

func (*WriterAtBuffer) Len

func (w *WriterAtBuffer) Len() int

Len returns the length of the underlying byte slice.

func (*WriterAtBuffer) WriteAt

func (w *WriterAtBuffer) WriteAt(p []byte, off int64) (n int, err error)

WriteAt fulfills the io.WriterAt interface to write len(p) bytes from p to the underlying byte slice starting at offset off. It returns the number of bytes written from p (0 <= n <= len(p)) and any error encountered.
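A byte slice can satisfy io.WriterAt with a single copy call. The sketch below uses a hypothetical writerAtBuffer type; how it reports writes that do not fit (a short count plus io.ErrShortWrite) is an assumption and may differ from the library.

```go
package main

import (
	"errors"
	"fmt"
	"io"
)

// writerAtBuffer is a hypothetical fixed-size buffer with WriteAt.
type writerAtBuffer struct{ buf []byte }

// Compile-time check that the type satisfies io.WriterAt.
var _ io.WriterAt = (*writerAtBuffer)(nil)

// Len returns the length of the underlying byte slice.
func (w *writerAtBuffer) Len() int { return len(w.buf) }

// WriteAt copies p into the buffer at off without growing it.
func (w *writerAtBuffer) WriteAt(p []byte, off int64) (int, error) {
	if off < 0 || off > int64(len(w.buf)) {
		return 0, errors.New("offset out of range")
	}
	n := copy(w.buf[off:], p)
	if n < len(p) {
		return n, io.ErrShortWrite
	}
	return n, nil
}

func main() {
	w := &writerAtBuffer{buf: make([]byte, 4)}
	n, err := w.WriteAt([]byte{1, 2}, 1)
	fmt.Println(n, err, w.buf) // 2 <nil> [0 1 2 0]
}
```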

type WriterAtWithLen

type WriterAtWithLen interface {
	io.WriterAt
	Len() int
}

WriterAtWithLen is an interface for an io.WriterAt with a Len function

func NewWriterAtBuffer

func NewWriterAtBuffer(buf []byte) WriterAtWithLen

NewWriterAtBuffer returns an object which fulfills the io.WriterAt interface by taking ownership of the passed in slice.

type WriterTell

type WriterTell interface {
	io.Writer
	Tell() int64
}

WriterTell is an interface that adds a Tell function to an io.Writer
