convert

package

v0.1.0 Latest Latest Go to latest Published: Jun 26, 2026 License: MIT Imports: 7 Imported by: 0

Details

Valid go.mod file
Redistributable license
Tagged version
Stable version
Learn more about best practices

Repository

github.com/tamnd/tatami

Links

Open Source Insights

Documentation ¶

Overview ¶

Package convert turns a producer's Parquet shard into a tatami file. It is the fleet-adoption bridge: ami and ccrawl-cli already write zstd Parquet through parquet-go, and this reads one of those files column by column and re-encodes it as tatami, so the same crawl output gains tatami's blob separation, shared dictionaries, and pruning structures without a producer change.

The mapping is schema-driven, not hardcoded to one producer. It reads the Parquet leaf schema, maps each column to a tatami logical type, and applies a small set of overridable heuristics: a large body column (markdown, body, html) is separated into the blob region, the low-cardinality string columns get a dictionary hint, and the identity columns (doc_id, url, digest) get a membership filter. The format library itself stays Parquet-free; only this package and the CLI import parquet-go.

Index ¶

type Options
type Stats
- func File(inPath, outPath string, opts Options) (Stats, error)
- func (s Stats) Ratio() float64

Constants ¶

This section is empty.

Variables ¶

This section is empty.

Functions ¶

This section is empty.

Types ¶

type Options ¶

type Options struct {
	// Blob names the columns to separate into the blob region as BLOBREF. When
	// nil, columns named markdown, body, or html are separated automatically.
	Blob []string
	// Bloom names the columns to build a per-group membership filter on. When nil,
	// columns named doc_id, url, or digest get one automatically.
	Bloom []string
	// Dict names the string columns to hint toward dictionary encoding. When nil,
	// every string column that is not an identity or body column is hinted.
	Dict []string
	// BatchRows is how many rows to read and append at a time. Zero selects 4096,
	// which bounds memory regardless of shard size.
	BatchRows int
	// Writer is passed through to the tatami writer (row-group size, page size,
	// blob run target).
	Writer tatami.WriterOptions
}

Options tunes a conversion. The zero value is valid: it auto-picks the blob, dictionary, and bloom columns by name and streams in the producer's row order with no sort key.

type Stats ¶

type Stats struct {
	Rows     int64
	Columns  int
	InBytes  int64
	OutBytes int64
}

Stats summarizes one conversion.

func File ¶

func File(inPath, outPath string, opts Options) (Stats, error)

File converts the Parquet file at inPath into a tatami file at outPath.

func (Stats) Ratio ¶

func (s Stats) Ratio() float64

Ratio is the converted size as a fraction of the source size, so a value below one means tatami is smaller. It returns zero when the source is empty.

Source Files ¶

View all Source files

convert.go

?	: This menu
/	: Search site
f or F	: Jump to
y or Y	: Canonical URL