parquet

package
v0.3.2 Latest Latest
Warning

This package is not in the latest version of its module.

Go to latest
Published: Apr 25, 2025 License: AGPL-3.0 Imports: 25 Imported by: 0

Documentation

Index

Constants

This section is empty.

Variables

This section is empty.

Functions

func CompactDataFiles

func CompactDataFiles(ctx context.Context, updateFunc func(CompactionStatus), patterns ...PartitionPattern) error

func DeleteParquetFiles

func DeleteParquetFiles(partition *config.Partition, from time.Time) (rowCount int, err error)

func PartitionMatchesPatterns added in v0.2.0

func PartitionMatchesPatterns(table, partition string, patterns []PartitionPattern) bool

Types

type ColumnSchemaChange added in v0.3.0

type ColumnSchemaChange struct {
	Name    string
	OldType string
	NewType string
}

type CompactionStatus

type CompactionStatus struct {
	Source      int
	Dest        int
	Uncompacted int
}

func (*CompactionStatus) BriefString

func (total *CompactionStatus) BriefString() string

func (*CompactionStatus) Update

func (total *CompactionStatus) Update(counts CompactionStatus)

func (*CompactionStatus) VerboseString

func (total *CompactionStatus) VerboseString() string

type ConversionError added in v0.2.0

type ConversionError struct {
	SourceFile   string
	BaseError    error
	RowsAffected int64
	// contains filtered or unexported fields
}

func NewConversionError added in v0.2.0

func NewConversionError(msg string, rowsAffected int64, path string) *ConversionError

func (*ConversionError) Error added in v0.2.0

func (c *ConversionError) Error() string

func (*ConversionError) Merge added in v0.3.0

func (c *ConversionError) Merge(err error)

Merge adds a second error to the conversion error message.

type Converter added in v0.2.0

type Converter struct {

	// the partition being collected
	Partition *config.Partition
	// contains filtered or unexported fields
}

Converter struct executes all the conversions for a single collection it therefore has a unique execution id, and will potentially convert of multiple JSONL files each file is assumed to have the filename format <execution_id>_<chunkNumber>.jsonl so when new input files are available, we simply store the chunk number

func NewParquetConverter added in v0.2.0

func NewParquetConverter(ctx context.Context, cancel context.CancelFunc, executionId string, partition *config.Partition, sourceDir string, tableSchema *schema.TableSchema, statusFunc func(int64, int64, ...error)) (*Converter, error)

func (*Converter) AddChunk added in v0.2.0

func (w *Converter) AddChunk(executionId string, chunk int32) error

AddChunk adds a new chunk to the list of chunks to be processed if this is the first chunk, determine if we have a full conversionSchema yet and if not infer from the chunk signal the scheduler that `chunks are available

func (*Converter) Close added in v0.2.0

func (w *Converter) Close()

func (*Converter) InferSchemaForJSONLFile added in v0.3.0

func (w *Converter) InferSchemaForJSONLFile(filePath string) (*schema.TableSchema, error)

func (*Converter) WaitForConversions added in v0.2.0

func (w *Converter) WaitForConversions(ctx context.Context)

WaitForConversions waits for all jobs to be processed or for the context to be cancelled

type FileRootProvider

type FileRootProvider struct {
	// contains filtered or unexported fields
}

FileRootProvider provides a unique file root for parquet files based on the current time to the nanosecond. If multiple files are created in the same nanosecond, the provider will increment the time by a nanosecond to ensure the file root is unique.

func (*FileRootProvider) GetFileRoot

func (p *FileRootProvider) GetFileRoot() string

GetFileRoot returns a unique file root for a parquet file format is "data_<timestamp>_<microseconds>"

type PartitionPattern

type PartitionPattern struct {
	Table     string
	Partition string
}

PartitionPattern represents a pattern used to match partitions. It consists of a table pattern and a partition pattern, both of which are used to match a given table and partition name.

func NewPartitionPattern

func NewPartitionPattern(partition *config.Partition) PartitionPattern

type SchemaChangeError added in v0.3.0

type SchemaChangeError struct {
	ChangedColumns []ColumnSchemaChange
}

func (*SchemaChangeError) Error added in v0.3.0

func (e *SchemaChangeError) Error() string

Jump to

Keyboard shortcuts

? : This menu
/ : Search site
f or F : Jump to
y or Y : Canonical URL