Documentation
Overview
Package convert turns a producer's Parquet shard into a tatami file. It is the fleet-adoption bridge: ami and ccrawl-cli already write zstd-compressed Parquet through parquet-go, and this package reads one of those files column by column and re-encodes it as tatami, so the same crawl output gains tatami's blob separation, shared dictionaries, and pruning structures without any change to the producer.
The mapping is schema-driven, not hardcoded to one producer. It reads the Parquet leaf schema, maps each column to a tatami logical type, and applies a small set of overridable heuristics: a large body column (markdown, body, html) is separated into the blob region, the low-cardinality string columns get a dictionary hint, and the identity columns (doc_id, url, digest) get a membership filter. The format library itself stays Parquet-free; only this package and the CLI import parquet-go.
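The default name-based heuristics can be pictured with a minimal, self-contained sketch. It does not use the package's real code: the `column`, `plan`, and `pickColumns` names are hypothetical, and the rule that every remaining string column gets a dictionary hint is taken from the Options field comments rather than from the implementation.

```go
package main

import "fmt"

// column is a hypothetical stand-in for one Parquet leaf column. The real
// package derives this information from the Parquet schema.
type column struct {
	Name     string
	IsString bool
}

// plan records which treatment each column name receives (hypothetical type).
type plan struct {
	Blob  []string // separated into the blob region
	Bloom []string // get a membership filter
	Dict  []string // hinted toward dictionary encoding
}

// Default column names, as listed in the overview text.
var (
	bodyNames     = map[string]bool{"markdown": true, "body": true, "html": true}
	identityNames = map[string]bool{"doc_id": true, "url": true, "digest": true}
)

// pickColumns applies the name-based defaults: body columns go to the blob
// region, identity columns get a filter, and any other string column gets a
// dictionary hint. Non-string columns that match no name are left alone.
func pickColumns(cols []column) plan {
	var p plan
	for _, c := range cols {
		switch {
		case bodyNames[c.Name]:
			p.Blob = append(p.Blob, c.Name)
		case identityNames[c.Name]:
			p.Bloom = append(p.Bloom, c.Name)
		case c.IsString:
			p.Dict = append(p.Dict, c.Name)
		}
	}
	return p
}

func main() {
	p := pickColumns([]column{
		{Name: "doc_id", IsString: true},
		{Name: "url", IsString: true},
		{Name: "lang", IsString: true},
		{Name: "status", IsString: false},
		{Name: "markdown", IsString: true},
	})
	fmt.Println("blob:", p.Blob)
	fmt.Println("bloom:", p.Bloom)
	fmt.Println("dict:", p.Dict)
}
```

Keeping the decision in one switch makes the precedence explicit: a column named like a body or identity column is never also dictionary-hinted.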
Index
Constants
This section is empty.
Variables
This section is empty.
Functions
This section is empty.
Types
type Options
type Options struct {
	// Blob names the columns to separate into the blob region as BLOBREF. When
	// nil, columns named markdown, body, or html are separated automatically.
	Blob []string

	// Bloom names the columns to build a per-group membership filter on. When nil,
	// columns named doc_id, url, or digest get one automatically.
	Bloom []string

	// Dict names the string columns to hint toward dictionary encoding. When nil,
	// every string column that is not an identity or body column is hinted.
	Dict []string

	// BatchRows is how many rows to read and append at a time. Zero selects 4096,
	// which bounds memory regardless of shard size.
	BatchRows int

	// Writer is passed through to the tatami writer (row-group size, page size,
	// blob run target).
	Writer tatami.WriterOptions
}
Options tunes a conversion. The zero value is valid: it auto-picks the blob, dictionary, and bloom columns by name and streams in the producer's row order with no sort key.
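As an illustration of the zero-value behavior, the sketch below mirrors two of the fields in a local `options` type and resolves their defaults. Everything here is hypothetical scaffolding: `options`, `effectiveBatchRows`, and `effectiveBlob` are not part of the package, the Writer field is omitted because its type lives in tatami, and treating a non-nil empty slice as "separate nothing" is an assumption that the field comments suggest but do not state.

```go
package main

import "fmt"

// options mirrors only two fields of convert.Options for this sketch.
type options struct {
	Blob      []string
	BatchRows int
}

// defaultBatchRows is the documented batch size when BatchRows is zero.
const defaultBatchRows = 4096

// effectiveBatchRows returns the batch size a conversion would use.
func effectiveBatchRows(o options) int {
	if o.BatchRows == 0 {
		return defaultBatchRows
	}
	return o.BatchRows
}

// effectiveBlob returns the explicit list when one is set, otherwise the
// name-based default. Assumption: a non-nil empty slice disables the default.
func effectiveBlob(o options, schema []string) []string {
	if o.Blob != nil {
		return o.Blob
	}
	var out []string
	for _, name := range schema {
		switch name {
		case "markdown", "body", "html":
			out = append(out, name)
		}
	}
	return out
}

func main() {
	schema := []string{"doc_id", "url", "lang", "markdown"}

	// Zero value: defaults apply.
	zero := options{}
	fmt.Println(effectiveBatchRows(zero), effectiveBlob(zero, schema))

	// Explicit settings override the defaults.
	custom := options{Blob: []string{"raw_text"}, BatchRows: 1024}
	fmt.Println(effectiveBatchRows(custom), effectiveBlob(custom, schema))
}
```

Distinguishing nil from an explicit list is what lets the zero value stay useful while still allowing a caller to override each heuristic independently.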