convert

package
v0.1.0
Published: Jun 26, 2026 License: MIT Imports: 7 Imported by: 0

Documentation

Overview

Package convert turns a producer's Parquet shard into a tatami file. It is the fleet-adoption bridge: ami and ccrawl-cli already write zstd Parquet through parquet-go, so this package reads one of those files column by column and re-encodes it as tatami. The same crawl output thereby gains tatami's blob separation, shared dictionaries, and pruning structures without any change to the producer.

The mapping is schema-driven, not hardcoded to one producer. It reads the Parquet leaf schema, maps each column to a tatami logical type, and applies a small set of overridable heuristics: a large body column (markdown, body, html) is separated into the blob region, the identity columns (doc_id, url, digest) get a membership filter, and every other string column gets a dictionary hint on the expectation that it is low-cardinality. The format library itself stays Parquet-free; only this package and the CLI import parquet-go.
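The name-based defaults described above can be pictured as a small classification function. This is a minimal sketch for illustration only: the role type, its constants, and defaultRole are hypothetical names that do not come from this package, and the real implementation may weigh the schema differently.

```go
package main

import "fmt"

// role is a hypothetical label for how a column is treated. It is not part
// of the convert package.
type role string

const (
	roleBlob  role = "blob"  // separated into the blob region
	roleBloom role = "bloom" // gets a membership filter
	roleDict  role = "dict"  // hinted toward dictionary encoding
	rolePlain role = "plain" // no special treatment
)

// defaultRole applies the documented defaults by column name: body columns
// become blobs, identity columns get a filter, and any other string column
// gets a dictionary hint.
func defaultRole(name string, isString bool) role {
	switch name {
	case "markdown", "body", "html":
		return roleBlob
	case "doc_id", "url", "digest":
		return roleBloom
	}
	if isString {
		return roleDict
	}
	return rolePlain
}

func main() {
	// The column list below is made up for the example.
	cols := []struct {
		name     string
		isString bool
	}{
		{"doc_id", true},
		{"body", true},
		{"lang", true},
		{"status", false},
	}
	for _, c := range cols {
		fmt.Printf("%-8s %s\n", c.name, defaultRole(c.name, c.isString))
	}
}
```

Because the rules key on the leaf schema and not on a producer, any shard that uses these column names would pick up the same treatment.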

Index

Constants

This section is empty.

Variables

This section is empty.

Functions

This section is empty.

Types

type Options

type Options struct {
	// Blob names the columns to separate into the blob region as BLOBREF. When
	// nil, columns named markdown, body, or html are separated automatically.
	Blob []string
	// Bloom names the columns to build a per-group membership filter on. When nil,
	// columns named doc_id, url, or digest get one automatically.
	Bloom []string
	// Dict names the string columns to hint toward dictionary encoding. When nil,
	// every string column that is not an identity or body column is hinted.
	Dict []string
	// BatchRows is how many rows to read and append at a time. Zero selects 4096,
	// which bounds memory regardless of shard size.
	BatchRows int
	// Writer is passed through to the tatami writer (row-group size, page size,
	// blob run target).
	Writer tatami.WriterOptions
}

Options tunes a conversion. The zero value is valid: it auto-picks the blob, dictionary, and bloom columns by name and streams in the producer's row order with no sort key.
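The zero-value behaviour can be sketched as a defaulting step. The Options type below is a local copy of only three fields so the example compiles on its own; orDefault and effective are hypothetical helpers, and raw_html is a made-up column name. Treating any non-nil slice as a full override is an assumption drawn from the "When nil" wording in the field comments.

```go
package main

import "fmt"

// Options is a local mirror of three fields of convert.Options, copied here
// so the sketch is self-contained.
type Options struct {
	Blob      []string
	Bloom     []string
	BatchRows int
}

// orDefault returns the caller's list unless it is nil. Assumption: a non-nil
// slice, even an empty one, replaces the defaults entirely.
func orDefault(override, defaults []string) []string {
	if override == nil {
		return defaults
	}
	return override
}

// effective fills in the documented defaults for unset fields.
func effective(o Options) Options {
	o.Blob = orDefault(o.Blob, []string{"markdown", "body", "html"})
	o.Bloom = orDefault(o.Bloom, []string{"doc_id", "url", "digest"})
	if o.BatchRows == 0 {
		o.BatchRows = 4096
	}
	return o
}

func main() {
	// Zero value: every default applies.
	fmt.Printf("%+v\n", effective(Options{}))
	// Override the blob columns and batch size; Bloom keeps its defaults.
	fmt.Printf("%+v\n", effective(Options{Blob: []string{"raw_html"}, BatchRows: 1024}))
}
```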

type Stats

type Stats struct {
	Rows     int64
	Columns  int
	InBytes  int64
	OutBytes int64
}

Stats summarizes one conversion.

func File

func File(inPath, outPath string, opts Options) (Stats, error)

File converts the Parquet file at inPath into a tatami file at outPath.
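The BatchRows option implies that a conversion streams rows through a fixed-size buffer. The loop below is one way such a bounded copy could look; it is a sketch and not this package's code. The rowReader and rowWriter interfaces, copyBatched, and the in-memory test types are all hypothetical stand-ins for the Parquet reader and the tatami writer, and rows are modelled as plain ints.

```go
package main

import (
	"errors"
	"fmt"
	"io"
)

// rowReader and rowWriter are hypothetical stand-ins for the Parquet reader
// and the tatami writer.
type rowReader interface {
	ReadRows(buf []int) (int, error) // returns io.EOF once the source is drained
}

type rowWriter interface {
	AppendRows(rows []int) error
}

// copyBatched moves rows from r to w through one reusable buffer, so memory
// use depends on batchRows and not on the size of the input.
func copyBatched(r rowReader, w rowWriter, batchRows int) (int64, error) {
	if batchRows == 0 {
		batchRows = 4096 // documented default
	}
	buf := make([]int, batchRows)
	var total int64
	for {
		n, err := r.ReadRows(buf)
		if n > 0 {
			if werr := w.AppendRows(buf[:n]); werr != nil {
				return total, werr
			}
			total += int64(n)
		}
		if errors.Is(err, io.EOF) {
			return total, nil
		}
		if err != nil {
			return total, err
		}
	}
}

// sliceReader serves rows from memory for the demonstration.
type sliceReader struct{ rows []int }

func (s *sliceReader) ReadRows(buf []int) (int, error) {
	n := copy(buf, s.rows)
	s.rows = s.rows[n:]
	if len(s.rows) == 0 {
		return n, io.EOF
	}
	return n, nil
}

// countWriter records how many batches and rows it received.
type countWriter struct{ batches, rows int }

func (c *countWriter) AppendRows(rows []int) error {
	c.batches++
	c.rows += len(rows)
	return nil
}

func main() {
	r := &sliceReader{rows: make([]int, 10000)}
	w := &countWriter{}
	n, err := copyBatched(r, w, 0)
	fmt.Println(n, w.batches, err) // prints "10000 3 <nil>"
}
```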

func (Stats) Ratio

func (s Stats) Ratio() float64

Ratio is the converted size as a fraction of the source size, so a value below one means tatami is smaller. It returns zero when the source is empty.
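The documented behaviour corresponds to a guarded division of OutBytes by InBytes. The sketch below redeclares Stats locally so it compiles on its own; the method body is an assumption consistent with the description, not a copy of the package source, and the byte counts are invented for the example.

```go
package main

import "fmt"

// Stats is a local mirror of convert.Stats for this self-contained sketch.
type Stats struct {
	Rows     int64
	Columns  int
	InBytes  int64
	OutBytes int64
}

// Ratio returns OutBytes / InBytes, or zero when the source is empty, which
// avoids a division by zero.
func (s Stats) Ratio() float64 {
	if s.InBytes == 0 {
		return 0
	}
	return float64(s.OutBytes) / float64(s.InBytes)
}

func main() {
	s := Stats{Rows: 100, Columns: 8, InBytes: 1000, OutBytes: 750}
	fmt.Printf("%.2f\n", s.Ratio())       // prints 0.75
	fmt.Printf("%.2f\n", Stats{}.Ratio()) // prints 0.00
}
```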
