grizzly

v0.2.1
Published: Jun 11, 2026 License: MIT Imports: 22 Imported by: 0

README

grizzly

A dataframe library for Go, written from scratch on the standard library alone. Load rows, work in columns: typed columnar storage, first-class nulls, and a small, explicit API.

df, err := grizzly.FromCSV("sales.csv", grizzly.Schema{
    {Name: "store", Type: grizzly.String},
    {Name: "price", Type: grizzly.Float64},
})

pricey, _ := df.Gt("price", 1.0) // mask of the rows with price > 1.0
out, _ := df.Where(pricey)

summary, _ := out.GroupBy("store").Agg(
    grizzly.Sum("price"),
    grizzly.Avg("price").As("avg"),
)
fmt.Print(summary)

Features

  • Row-oriented ingestion, column-oriented storage. Load data the way you have it (Go structs, CSV, JSON) and grizzly transposes it once into typed, contiguous columns (float64, string, bool) that operations scan sequentially, in cache-friendly order.
  • First-class nulls. Arrow-style validity bitmaps (null-free columns pay nothing), comma-ok access (Value(i) (T, bool)), SQL semantics in aggregations (nulls are skipped, never counted as zero) and Kleene three-valued logic in filters.
  • Explicit schemas, no guessing. Untyped sources (CSV, JSON) load against a schema you declare — a "08001" zip code never silently becomes a number. Struct loading needs no schema: the types come from the fields.
  • The core operations. Comparators + combinable masks (Eq/Lt/Gt..., And/Or/Not) materialized by Where; Select; GroupBy(...).Agg(...) with inspectable aggregation specs; stable Sort/SortDesc (nulls first); whole-column Sum/Avg/Min/Max/Count.
  • Round-tripping writers. ToCSV/ToJSON mirror the loaders; JSON output reloads byte-exact.
  • Fast by measurement, not folklore. Parallel chunked CSV and JSON loading, hand-written byte-level parsers and writers, pairwise summation (NumPy's algorithm: error O(ε·log n) — and faster than the naive loop). Every optimization landed with a benchstat comparison and an oracle test against the standard library.
  • Zero dependencies. The standard library is the only import, and that is a permanent design constraint.

Install

go get github.com/gverdugo-dev/grizzly

Requires Go 1.26+.

Quick tour

Every snippet below is a runnable example test — the documentation cannot drift from the behavior.

Load

From structs (one column per exported field, renamed by tags):

type sale struct {
    Product string  `grizzly:"product"`
    Price   float64 `grizzly:"price"`
    Sold    bool    `grizzly:"sold"`
}
df, err := grizzly.FromStructs([]sale{
    {Product: "apple", Price: 1.5, Sold: true},
    {Product: "pear", Price: 2, Sold: false},
})

From CSV or JSON — file paths or any io.Reader, against an explicit schema. Empty CSV cells and JSON nulls load as real nulls, never as fake zeros:

schema := grizzly.Schema{
    {Name: "city", Type: grizzly.String},
    {Name: "temp", Type: grizzly.Float64},
}
df, err := grizzly.FromCSV("cities.csv", schema)       // or FromCSVReader
df, err = grizzly.FromJSON("cities.json", schema)      // or FromJSONReader

fmt.Print(df)
// city      temp
// madrid    21.5
// bilbao    null
// valencia  28.25

Filter

Comparators build masks; masks combine; Where materializes once. Null comparisons follow three-valued logic — unknown rows don't pass:

warm, _ := df.Gt("temp", 20.0)
mild, _ := df.Lt("temp", 30.0)
out, _ := df.Where(warm.And(mild))

Group and aggregate

Groups come out in first-appearance order — deterministic, every run:

summary, err := df.GroupBy("store").Agg(
    grizzly.Sum("price"),
    grizzly.Avg("price").As("avg"),
)
// store  price  avg
// north  2      1
// south  4      2

Sort, select, write

out, _ := df.Sort("temp")        // stable; nulls first (SortDesc to reverse)
out, _ = out.Select("temp", "city")
err = out.ToCSV("out.csv")       // nulls become empty cells
err = out.ToJSON("out.json")     // literal nulls; reloads byte-exact

Inspect

fmt.Print(df) renders a truncated table; df.Info() gives the pandas-style summary: per-column dtype, non-null counts and memory usage.

Data model

Three column types: float64, string and bool — the JSON model. Every number is a float64 (exact for integers up to 2^53); bool columns are bit-packed. Every column is nullable, backed by a validity bitmap that costs nothing while a column has no nulls. The rationale behind each of these decisions is documented in docs/design.
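
The 2^53 limit mentioned above can be checked directly. The following is a small standalone sketch (it does not use grizzly; the helper name exactInFloat64 is made up for illustration) showing where integers stop surviving a round trip through float64:

```go
package main

import "fmt"

// exactInFloat64 reports whether n survives a round trip through float64.
func exactInFloat64(n int64) bool {
	return int64(float64(n)) == n
}

func main() {
	const limit = int64(1) << 53 // 9007199254740992

	fmt.Println(exactInFloat64(limit - 1)) // true
	fmt.Println(exactInFloat64(limit))     // true
	fmt.Println(exactInFloat64(limit + 1)) // false: rounds to a neighbouring even value
}
```

Identifiers or counters that can exceed this range are safer in a string column.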

Performance

The loaders parse in parallel chunks and the hot paths are byte-level (no encoding/json or encoding/csv in the fast lanes — but their behavior is pinned by fuzz and oracle tests against them). On a 1M-row, 3-column dataset (commodity laptop):

Operation                   v0.1.0        v0.2.0
CSV file load               0.54s         0.15s
JSON file load              3.46s         0.37s
CSV/JSON write (100k rows)  ~395k allocs  13 allocs

Benchmarks live in bench_test.go; methodology and comparisons against other engines are tracked in the dev notes and re-validated before being cited.

Status and versioning

grizzly is pre-1.0: the API can change between minor versions. Releases follow semver via git tags — patches are fixes only, each minor is a milestone (v0.1.0 core API → v0.2.0 performance → v0.3.0 relational, in progress). The roadmap lives in the living document.

grizzly is also a from-scratch project by design: every data structure — validity bitmaps, hash-based grouping, the JSON parser — is built and documented here rather than imported, with notes explaining each design decision and the alternatives it beat. The official documentation is in docs/; if you want to understand how a dataframe engine works from the inside, the dev notes are the guided tour.

License

MIT.

Documentation

Overview

Package grizzly is a dataframe library built from scratch as a learning project.

Data is ingested thinking in rows (structs, CSV, JSON) but stored and processed in columns: each column keeps its values in a contiguous typed slice, which is what makes operations like Sum or Filter fast.

Variables

var ErrColumnNotFound = errors.New("column does not exist in dataframe")

ErrColumnNotFound is returned when the requested column does not exist in the dataframe.

var ErrNoValidValues = errors.New("no valid values in column")

ErrNoValidValues is returned by aggregations that have no honest answer over zero values (Avg, Min, Max of an empty or all-null column) — the Go-flavored equivalent of SQL returning NULL in that case.

var ErrTypeMismatch = errors.New("operation not supported for column type")

ErrTypeMismatch is returned when an operation is applied to a column whose type does not support it (e.g. Sum over a string column).
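
Sentinel errors like these are usually matched with errors.Is, which works whether or not the library wraps them with extra context. The sketch below is standalone: errNoValidValues and avg are local stand-ins written for illustration, not grizzly's own code, and the wrapping with %w is an assumption.

```go
package main

import (
	"errors"
	"fmt"
)

// errNoValidValues is a local stand-in for a sentinel such as ErrNoValidValues.
var errNoValidValues = errors.New("no valid values in column")

// avg averages the valid entries. When there is nothing to average it
// returns the sentinel wrapped with the column name.
func avg(col string, values []float64, valid []bool) (float64, error) {
	var sum float64
	var n int
	for i, v := range values {
		if valid[i] {
			sum += v
			n++
		}
	}
	if n == 0 {
		return 0, fmt.Errorf("avg of %q: %w", col, errNoValidValues)
	}
	return sum / float64(n), nil
}

func main() {
	_, err := avg("temp", []float64{0, 0}, []bool{false, false})
	if errors.Is(err, errNoValidValues) {
		fmt.Println("nothing to average; treat the result as NULL")
	}
}
```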

Functions

func SetLogger

func SetLogger(l *slog.Logger)

SetLogger routes grizzly's internal logs to the given logger.

By default grizzly is silent. Call this to see what the library is doing, e.g. for debugging:

grizzly.SetLogger(slog.Default())

Types

type AggSpec

type AggSpec struct {
	// contains filtered or unexported fields
}

AggSpec describes one aggregation for GroupedDataframe.Agg: which operation over which column, and optionally the output column's name. Build specs with the package-level constructors Sum, Avg, Min, Max and Count, and rename with As. These constructors are package functions, distinct from the Dataframe methods of the same names: methods belong to their receiver type's method set rather than to the package scope, so grizzly.Sum("price") and df.Sum("price") coexist.
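
A package-level function and a method may share a name because they are declared in different scopes. The sketch below is standalone; spec and table are hypothetical miniatures, not grizzly's AggSpec and Dataframe.

```go
package main

import "fmt"

// spec and table are hypothetical stand-ins for AggSpec and Dataframe.
type spec struct{ op, col string }

type table struct{ price []float64 }

// Sum is a package-level function: it only describes an aggregation.
func Sum(col string) spec { return spec{op: "sum", col: col} }

// Sum is also a method on table: it computes a result immediately.
// Methods are declared in the receiver type's method set, not in the
// package scope, so the two declarations do not collide.
func (t table) Sum() float64 {
	var total float64
	for _, v := range t.price {
		total += v
	}
	return total
}

func main() {
	t := table{price: []float64{1.5, 0.5}}
	fmt.Println(Sum("price")) // {sum price}
	fmt.Println(t.Sum())      // 2
}
```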

func Avg

func Avg(col string) AggSpec

Avg returns an AggSpec that averages the named column per group.

func Count

func Count(col string) AggSpec

Count returns an AggSpec that counts the valid (non-null) rows of the named column per group — SQL's COUNT(col).

func Max

func Max(col string) AggSpec

Max returns an AggSpec that takes the per-group maximum of the named column.

func Min

func Min(col string) AggSpec

Min returns an AggSpec that takes the per-group minimum of the named column.

func Sum

func Sum(col string) AggSpec

Sum returns an AggSpec that sums the named column per group.

func (AggSpec) As

func (s AggSpec) As(name string) AggSpec

As renames the aggregation's output column, like SQL's AS:

df.GroupBy("store").Agg(
	grizzly.Sum("price"),                 // output column: price
	grizzly.Avg("price").As("avg_price"), // output column: avg_price
)

Without As the output keeps the source column's name, so aggregating the same column twice requires renaming at least one of them (duplicate names fail when the result dataframe is built).

type BoolColumn

type BoolColumn struct {
	// contains filtered or unexported fields
}

BoolColumn is a column of bool values stored packed, Arrow-style: one bit per row in bitmap words, not one byte per row — 8x smaller than a []bool, and counting trues is a popcount away.

Packing breaks the "1 slice element = 1 row" coincidence the other columns enjoy: len(values) counts words (64 rows each), so the logical row count needs its own length field, and bounds checks are written by hand instead of falling out of slice indexing.

Nulls are tracked by a validity bitmap like every other column: at null rows the packed value bit is an arbitrary placeholder (0) that operations never read. A nil bitmap means the column has no nulls.
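
The packing described above can be sketched in a few lines. This is an illustration under assumptions, not grizzly's implementation: packedBools, pack, get and countTrue are hypothetical names, and null tracking is left out.

```go
package main

import (
	"fmt"
	"math/bits"
)

// packedBools is a hypothetical bit-packed bool vector: one bit per row,
// 64 rows per word, with the logical length stored separately.
type packedBools struct {
	words []uint64
	n     int
}

func pack(values []bool) packedBools {
	p := packedBools{words: make([]uint64, (len(values)+63)/64), n: len(values)}
	for i, v := range values {
		if v {
			p.words[i/64] |= 1 << uint(i%64)
		}
	}
	return p
}

// get checks bounds by hand: indexing words alone would accept any row
// below 64*len(words), including rows past the logical length.
func (p packedBools) get(i int) bool {
	if i < 0 || i >= p.n {
		panic("row index out of range")
	}
	return p.words[i/64]&(1<<uint(i%64)) != 0
}

// countTrue needs one popcount per word.
func (p packedBools) countTrue() int {
	total := 0
	for _, w := range p.words {
		total += bits.OnesCount64(w)
	}
	return total
}

func main() {
	p := pack([]bool{true, false, true})
	fmt.Println(p.get(2), p.countTrue(), len(p.words)) // true 2 1
}
```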

func NewBoolColumn

func NewBoolColumn(name string, values []bool) *BoolColumn

NewBoolColumn returns a BoolColumn with the given name and values, packed to one bit per row. The input slice is only read during construction; the column keeps no reference to it. The column has no nulls; use NewBoolColumnWithNulls to mark some rows as null.

func NewBoolColumnWithNulls

func NewBoolColumnWithNulls(name string, values []bool, valid []bool) (*BoolColumn, error)

NewBoolColumnWithNulls returns a BoolColumn where valid[i] reports whether values[i] is a real value (true) or a null (false). values and valid must have the same length; otherwise an error is returned.

Both slices are only read during construction (values are packed, the mask is compacted — and dropped entirely when it contains no false entries).

func (*BoolColumn) DType

func (c *BoolColumn) DType() DType

DType returns Bool.

func (*BoolColumn) IsValid

func (c *BoolColumn) IsValid(i int) bool

IsValid reports whether row i holds a value (true) or a null (false). It panics if i is out of range.

func (*BoolColumn) Len

func (c *BoolColumn) Len() int

Len returns the number of values in the column.

func (*BoolColumn) Name

func (c *BoolColumn) Name() string

Name returns the column's name.

func (*BoolColumn) NullCount

func (c *BoolColumn) NullCount() int

NullCount returns the number of null rows in the column.

func (*BoolColumn) Value

func (c *BoolColumn) Value(i int) (bool, bool)

Value returns the value at row i and whether it is valid (comma-ok): ok is false when the row is null, and the returned value must then be ignored. It panics if i is out of range.

type Column

type Column interface {
	// Name returns the column's name, unique within a Dataframe.
	Name() string
	// Len returns the number of values in the column.
	Len() int
	// DType returns the column's data type.
	DType() DType
	// IsValid reports whether row i holds a value (true) or a null (false).
	// It panics if i is out of range.
	IsValid(i int) bool
	// NullCount returns the number of null rows in the column.
	NullCount() int
}

Column is the contract every typed column satisfies.

A column stores its values in a contiguous typed slice (e.g. []float64), which is what makes columnar processing fast: sequential memory access, no per-value heap allocations, no type assertions in hot loops.

The interface intentionally exposes only type-agnostic behavior. Anything that needs the concrete values (Sum, Filter...) type-switches on the concrete implementation (*Float64Column, *StringColumn...) to reach the underlying slice directly.
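
The pattern can be sketched as follows. The types here (column, floatCol, stringCol) are hypothetical miniatures written for illustration; the point is that the type switch happens once, outside the loop, and the loop then runs over a plain typed slice.

```go
package main

import "fmt"

// column, floatCol and stringCol are hypothetical miniatures of the
// Column interface and two concrete implementations.
type column interface {
	Name() string
	Len() int
}

type floatCol struct {
	name   string
	values []float64
}

type stringCol struct {
	name   string
	values []string
}

func (c *floatCol) Name() string  { return c.name }
func (c *floatCol) Len() int      { return len(c.values) }
func (c *stringCol) Name() string { return c.name }
func (c *stringCol) Len() int     { return len(c.values) }

// sum type-switches once, then scans the typed slice directly.
func sum(c column) (float64, error) {
	switch col := c.(type) {
	case *floatCol:
		var total float64
		for _, v := range col.values {
			total += v
		}
		return total, nil
	default:
		return 0, fmt.Errorf("sum not supported for column %q", c.Name())
	}
}

func main() {
	total, err := sum(&floatCol{name: "price", values: []float64{1.5, 0.5}})
	fmt.Println(total, err) // 2 <nil>

	_, err = sum(&stringCol{name: "store"})
	fmt.Println(err)
}
```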

type DType

type DType string

DType identifies the data type of a column. Grizzly supports a closed set of column types; every Column implementation reports exactly one DType.

Using a small named type instead of bare strings gives us compile-time safety wherever a function expects "a column type" rather than "any string".

const (
	Float64 DType = "float64"
	String  DType = "string"
	Bool    DType = "bool"
)

The supported column data types. A deliberately closed set: float64 is the only numeric type (integers load as float64, exact up to 2^53 — the JSON model), plus string and bool.

type Dataframe

type Dataframe struct {
	// contains filtered or unexported fields
}

Dataframe is an in-memory, column-oriented table.

Columns are kept in a slice (not a map) to preserve their insertion order, which matters when printing or serializing. Lookup by name is a linear scan — fine for the realistic number of columns in a dataframe (tens, not millions).

func FromCSV

func FromCSV(path string, schema Schema) (Dataframe, error)

FromCSV builds a Dataframe from a CSV file with a header row. See FromCSVReader for the format, schema and null rules — both paths load identical dataframes.

Unlike FromCSVReader, it reads the whole file into memory and, when the file is big enough, parses it in parallel by chunks (see from_csv_parallel.go): a file path offers random access, which is what makes splitting into byte ranges possible. Use FromCSVReader when streaming matters more than speed.
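
One way to split an in-memory file into byte ranges is to pick evenly spaced cut points and move each forward to the next line boundary. The sketch below illustrates that idea only; it is not the code in from_csv_parallel.go. splitAtNewlines is a hypothetical helper, and it assumes that no quoted field contains a newline and that the header row has already been handled.

```go
package main

import (
	"bytes"
	"fmt"
)

// splitAtNewlines cuts data into roughly n byte ranges, moving each cut
// forward to just past the next '\n' so that no line straddles two chunks.
func splitAtNewlines(data []byte, n int) [][]byte {
	if n < 1 {
		n = 1
	}
	target := len(data)/n + 1
	var chunks [][]byte
	for len(data) > 0 {
		if target >= len(data) {
			chunks = append(chunks, data)
			break
		}
		nl := bytes.IndexByte(data[target:], '\n')
		if nl < 0 {
			chunks = append(chunks, data)
			break
		}
		end := target + nl + 1
		chunks = append(chunks, data[:end])
		data = data[end:]
	}
	return chunks
}

func main() {
	data := []byte("a,1\nb,2\nc,3\nd,4\n")
	for _, c := range splitAtNewlines(data, 2) {
		fmt.Printf("%q\n", c)
	}
}
```

Each chunk can then be parsed by its own goroutine and the partial columns concatenated in chunk order.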

func FromCSVReader

func FromCSVReader(r io.Reader, schema Schema) (Dataframe, error)

FromCSVReader builds a Dataframe from a stream of CSV data with a header row, using the given schema to decide each column's type (CSV itself is untyped: every value arrives as a string until someone decides otherwise).

The stream is read one record at a time, never whole in memory, with the record slice reused across reads (ReuseRecord) so the hot loop does not allocate per row.

The schema also selects and orders the columns: source columns not listed in it are ignored, and the Dataframe's column order is the schema's order, not the stream's. A schema column missing from the header is an error.

Null rule: an empty cell in a Float64 or Bool column is a null (validity bit 0). In a String column it stays a real empty string: CSV cannot distinguish an empty cell from a legitimate "", and silently turning every "" into a null (as pandas does by default) destroys real data. Explicit over magic; a configurable null marker can be added later if needed.
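
The streaming loop and the null rule can be sketched with encoding/csv alone. This is a simplified illustration for a single float64 column, not grizzly's loader; readFloatColumn and parseTemps are hypothetical helpers.

```go
package main

import (
	"encoding/csv"
	"fmt"
	"io"
	"strconv"
	"strings"
)

// readFloatColumn streams CSV from r and collects column idx as float64s.
// An empty cell becomes a null: valid[i] is false and values[i] holds a
// placeholder 0.
func readFloatColumn(r io.Reader, idx int) ([]float64, []bool, error) {
	cr := csv.NewReader(r)
	cr.ReuseRecord = true // Read returns the same backing slice on every call
	if _, err := cr.Read(); err != nil { // skip the header row
		return nil, nil, err
	}
	var values []float64
	var valid []bool
	for {
		rec, err := cr.Read()
		if err == io.EOF {
			return values, valid, nil
		}
		if err != nil {
			return nil, nil, err
		}
		if idx < 0 || idx >= len(rec) {
			return nil, nil, fmt.Errorf("record has no column %d", idx)
		}
		if rec[idx] == "" {
			values, valid = append(values, 0), append(valid, false)
			continue
		}
		f, err := strconv.ParseFloat(rec[idx], 64)
		if err != nil {
			return nil, nil, err
		}
		values, valid = append(values, f), append(valid, true)
	}
}

// parseTemps is a convenience wrapper for in-memory input.
func parseTemps(data string) ([]float64, []bool, error) {
	return readFloatColumn(strings.NewReader(data), 1)
}

func main() {
	values, valid, err := parseTemps("city,temp\nmadrid,21.5\nbilbao,\nvalencia,28.25\n")
	fmt.Println(values, valid, err) // [21.5 0 28.25] [true false true] <nil>
}
```

Because ReuseRecord is set, the strings in rec must be consumed (parsed or copied) before the next call to Read.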

Example

ExampleFromCSVReader loads typed columns from untyped CSV text: the schema decides the types, and an empty cell in a float64 column loads as a null (printed as "null", never as a fake 0).

package main

import (
	"fmt"
	"log"
	"strings"

	"github.com/gverdugo-dev/grizzly"
)

func main() {
	data := "city,temp\nmadrid,21.5\nbilbao,\nvalencia,28.25\n"
	schema := grizzly.Schema{
		{Name: "city", Type: grizzly.String},
		{Name: "temp", Type: grizzly.Float64},
	}
	df, err := grizzly.FromCSVReader(strings.NewReader(data), schema)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Print(df)
}

Output:
city      temp
madrid    21.5
bilbao    null
valencia  28.25

func FromJSON

func FromJSON(path string, schema Schema) (Dataframe, error)

FromJSON builds a Dataframe from a JSON file containing an array of objects (one object = one row). See FromJSONReader for the format, schema and null rules — valid files load identically through both.

Unlike FromJSONReader, it reads the whole file into memory and parses the bytes directly (see from_json_bytes.go), skipping encoding/json's per-value machinery. Use FromJSONReader when streaming matters more than speed.

func FromJSONReader

func FromJSONReader(r io.Reader, schema Schema) (Dataframe, error)

FromJSONReader builds a Dataframe from a stream containing a JSON array of objects (one object = one row), using the given schema to decide each column's type and the column order — JSON objects are unordered, so the stream cannot provide an order even if it wanted to.

It decodes at token level, streaming: rows are never materialized as map[string]any. The first benchmark showed that intermediate layout (1M heap-allocated maps, 10M values boxed in any) made JSON ~6x slower than our own CSV path. Values are decoded through a reused per-column pointer, so they go stream → typed slice with no per-value allocation.

A JSON null becomes a null row in that column (validity bit 0). Schema columns missing from a row, or holding a value of the wrong JSON type, are still errors: explicit over magic — an absent key is a malformed row, a literal null is data.
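
A token-level loop of this kind can be approximated with encoding/json's Decoder. The sketch below is a simplified illustration for one float64 key, not grizzly's parser; readTemps, parseTemps and expectDelim are hypothetical helpers. It reuses one decode slot for every row, treats a literal null as a null row, and rejects rows where the key is absent.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
	"io"
	"strings"
)

// expectDelim consumes one token and checks that it is the delimiter want.
func expectDelim(dec *json.Decoder, want json.Delim) error {
	tok, err := dec.Token()
	if err != nil {
		return err
	}
	if d, ok := tok.(json.Delim); !ok || d != want {
		return fmt.Errorf("expected %v, got %v", want, tok)
	}
	return nil
}

// readTemps streams a JSON array of objects and collects the "temp" key.
func readTemps(r io.Reader) ([]float64, []bool, error) {
	dec := json.NewDecoder(r)
	if err := expectDelim(dec, '['); err != nil {
		return nil, nil, err
	}
	var (
		values []float64
		valid  []bool
		slot   float64 // decode target reused for every row
	)
	for dec.More() {
		if err := expectDelim(dec, '{'); err != nil {
			return nil, nil, err
		}
		seen := false
		for dec.More() {
			key, err := dec.Token()
			if err != nil {
				return nil, nil, err
			}
			if key != "temp" {
				var skip json.RawMessage // other keys: consume and drop the value
				if err := dec.Decode(&skip); err != nil {
					return nil, nil, err
				}
				continue
			}
			cell := &slot
			if err := dec.Decode(&cell); err != nil {
				return nil, nil, err // includes values of the wrong JSON type
			}
			seen = true
			if cell == nil { // a literal null sets the pointer to nil
				values, valid = append(values, 0), append(valid, false)
			} else {
				values, valid = append(values, *cell), append(valid, true)
			}
		}
		if err := expectDelim(dec, '}'); err != nil {
			return nil, nil, err
		}
		if !seen {
			return nil, nil, errors.New(`row is missing key "temp"`)
		}
	}
	return values, valid, expectDelim(dec, ']')
}

// parseTemps is a convenience wrapper for in-memory input.
func parseTemps(data string) ([]float64, []bool, error) {
	return readTemps(strings.NewReader(data))
}

func main() {
	fmt.Println(parseTemps(`[{"city":"madrid","temp":21.5},{"city":"bilbao","temp":null}]`))
}
```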

func FromStructs

func FromStructs[T any](rows []T) (Dataframe, error)

FromStructs builds a Dataframe from a slice of structs: one column per exported struct field, named after the field (or its `grizzly:"..."` tag).

type Sale struct {
	Product string  `grizzly:"product"`
	Price   float64 `grizzly:"price"`
}
df, err := grizzly.FromStructs([]Sale{...})

This is the row-oriented front door of grizzly: the caller thinks in rows (one struct = one row), and this function transposes them into typed columns once, at the boundary. No schema is needed — the struct fields already declare the types.

Unexported fields are skipped. A field of an unsupported type is an error rather than silently dropped: explicit over magic.
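
The field rules above map naturally onto the reflect package. The following sketch only derives column names; it is an illustration under assumptions, not grizzly's implementation, and columnNames, sale and bad are hypothetical names.

```go
package main

import (
	"fmt"
	"reflect"
)

type sale struct {
	Product string  `grizzly:"product"`
	Price   float64 // no tag: the field name is used
	note    string  // unexported: skipped
}

type bad struct {
	Count int // int is outside float64, string and bool
}

// columnNames lists one column name per exported field of struct type T:
// the `grizzly` tag when present, otherwise the field name.
func columnNames[T any]() ([]string, error) {
	var zero T
	t := reflect.TypeOf(zero)
	if t == nil || t.Kind() != reflect.Struct {
		return nil, fmt.Errorf("want a struct type, got %v", t)
	}
	var names []string
	for i := 0; i < t.NumField(); i++ {
		f := t.Field(i)
		if !f.IsExported() {
			continue
		}
		switch f.Type.Kind() {
		case reflect.Float64, reflect.String, reflect.Bool:
		default:
			return nil, fmt.Errorf("field %s: unsupported type %s", f.Name, f.Type)
		}
		name := f.Tag.Get("grizzly")
		if name == "" {
			name = f.Name
		}
		names = append(names, name)
	}
	return names, nil
}

func main() {
	_ = sale{note: "unused"} // keeps the unexported field referenced
	fmt.Println(columnNames[sale]()) // [product Price] <nil>
	fmt.Println(columnNames[bad]())
}
```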

Example

ExampleFromStructs shows the row-oriented front door: a slice of structs becomes a dataframe, one column per exported field, renamed by tags.

package main

import (
	"fmt"
	"log"

	"github.com/gverdugo-dev/grizzly"
)

func main() {
	type sale struct {
		Product string  `grizzly:"product"`
		Price   float64 `grizzly:"price"`
		Sold    bool    `grizzly:"sold"`
	}
	df, err := grizzly.FromStructs([]sale{
		{Product: "apple", Price: 1.5, Sold: true},
		{Product: "pear", Price: 2, Sold: false},
		{Product: "orange", Price: 0.5, Sold: true},
	})
	if err != nil {
		log.Fatal(err)
	}
	fmt.Print(df)
}

Output:
product  price  sold
apple    1.5    true
pear     2      false
orange   0.5    true

func NewDataframe

func NewDataframe(cols ...Column) (Dataframe, error)

NewDataframe builds a Dataframe from the given columns.

All columns must have the same length and unique names; otherwise an error is returned. This is the low-level constructor — friendlier, row-oriented constructors (from structs, CSV, JSON) build on top of it.

func (Dataframe) Avg

func (d Dataframe) Avg(name string) (float64, error)

Avg returns the average of the valid values in the numeric column with the given name: the sum of the valid values divided by their count, per the SQL aggregate convention (nulls are skipped, in both numerator and denominator).

It returns ErrColumnNotFound if the column does not exist, ErrTypeMismatch if the column is not numeric, and ErrNoValidValues if the column is empty or all-null (an average over zero values has no honest answer).

func (Dataframe) Column

func (d Dataframe) Column(name string) (Column, error)

Column returns the column with the given name, or ErrColumnNotFound.

func (Dataframe) Count

func (d Dataframe) Count(name string) (int, error)

Count returns the number of valid (non-null) rows in the column with the given name — SQL's COUNT(col), not COUNT(*). It works for any column type, and an empty or all-null column counts 0 (not an error: the count of nothing is zero).

It returns ErrColumnNotFound if the column does not exist.

func (Dataframe) Eq

func (d Dataframe) Eq(name string, value any) (Mask, error)

Eq returns a mask of the rows where the named column equals value. The value's type must match the column's dtype (float64, string or bool); null rows compare to unknown and are dropped by Where.

It returns ErrColumnNotFound if the column does not exist and ErrTypeMismatch if the value's type does not match the column's.

func (Dataframe) Ge

func (d Dataframe) Ge(name string, value any) (Mask, error)

Ge returns a mask of the rows where the named column is greater than or equal to value. See Lt for the supported column types and errors.

func (Dataframe) GroupBy

func (d Dataframe) GroupBy(names ...string) GroupedDataframe

GroupBy groups the dataframe's rows by the values of the named key column, returning a GroupedDataframe to aggregate with Agg:

result, err := df.GroupBy("store").Agg(grizzly.Sum("price"))

Null keys follow the SQL rule: all nulls form one group together (for grouping, null equals null — the opposite of comparisons, where null is unknown). No row is silently dropped.

The signature is variadic for forward compatibility, but only a single key column is supported for now; multiple names are an error.

GroupBy returns no error itself so that the call chains: Go does not allow a method call on a multi-valued return, so df.GroupBy(...).Agg(...) would not compile if GroupBy also returned an error. Errors (unknown column, unsupported key type, multiple names) are deferred and reported by Agg, the same pattern database/sql uses with QueryRow and Row.Scan.
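
The deferred-error pattern can be sketched as follows. grouped, groupBy and agg are hypothetical stand-ins written for illustration, not grizzly's types.

```go
package main

import (
	"errors"
	"fmt"
)

// grouped is a hypothetical stand-in for GroupedDataframe: it carries
// either a usable key or the error that made grouping impossible.
type grouped struct {
	key string
	err error
}

// groupBy returns a single value so the call chains; validation errors
// are stored, not returned.
func groupBy(columns []string, names ...string) grouped {
	if len(names) != 1 {
		return grouped{err: errors.New("exactly one key column is supported")}
	}
	for _, c := range columns {
		if c == names[0] {
			return grouped{key: c}
		}
	}
	return grouped{err: fmt.Errorf("column %q does not exist", names[0])}
}

// agg is the first point where the stored error can surface.
func (g grouped) agg() (string, error) {
	if g.err != nil {
		return "", g.err
	}
	return "aggregated by " + g.key, nil
}

func main() {
	cols := []string{"store", "price"}

	out, err := groupBy(cols, "store").agg()
	fmt.Println(out, err) // aggregated by store <nil>

	_, err = groupBy(cols, "region").agg()
	fmt.Println(err)
}
```

The cost of this design is that a mistake in the GroupBy call is reported one step later than it was made.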

Example

ExampleDataframe_GroupBy groups rows by a key column and aggregates per group. Output columns keep the source column's name unless renamed with As; groups appear in first-appearance order.

package main

import (
	"fmt"
	"log"

	"github.com/gverdugo-dev/grizzly"
)

func main() {
	type sale struct {
		Store string  `grizzly:"store"`
		Price float64 `grizzly:"price"`
	}
	df, _ := grizzly.FromStructs([]sale{
		{"north", 1.5}, {"south", 2.5}, {"north", 0.5}, {"south", 1.5},
	})

	out, err := df.GroupBy("store").Agg(
		grizzly.Sum("price"),
		grizzly.Avg("price").As("avg"),
	)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Print(out)
}

Output:
store  price  avg
north  2      1
south  4      2
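
First-appearance order does not come for free from a Go map, whose iteration order is randomized. A common way to get it, sketched here under assumptions (sumByKey is a hypothetical helper and null keys are left out), is to let a map find each key's group index while a slice records the keys in arrival order:

```go
package main

import "fmt"

// sumByKey sums values per key. The map locates a key's group; the
// order slice remembers keys in the order they first appear.
func sumByKey(keys []string, values []float64) ([]string, []float64) {
	index := make(map[string]int)
	var order []string
	var sums []float64
	for i, k := range keys {
		g, ok := index[k]
		if !ok {
			g = len(order)
			index[k] = g
			order = append(order, k)
			sums = append(sums, 0)
		}
		sums[g] += values[i]
	}
	return order, sums
}

func main() {
	keys := []string{"north", "south", "north", "south"}
	prices := []float64{1.5, 2.5, 0.5, 1.5}
	fmt.Println(sumByKey(keys, prices)) // [north south] [2 4]
}
```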

func (Dataframe) Gt

func (d Dataframe) Gt(name string, value any) (Mask, error)

Gt returns a mask of the rows where the named column is greater than value. See Lt for the supported column types and errors.

func (Dataframe) Info

func (d Dataframe) Info() string

Info returns a structural summary of the dataframe, in the spirit of pandas' DataFrame.info(): one line per column with its name, dtype, non-null count and approximate memory footprint.

func (Dataframe) Le

func (d Dataframe) Le(name string, value any) (Mask, error)

Le returns a mask of the rows where the named column is less than or equal to value. See Lt for the supported column types and errors.

func (Dataframe) Lt

func (d Dataframe) Lt(name string, value any) (Mask, error)

Lt returns a mask of the rows where the named column is less than value. Supported on float64 and string columns (strings compare lexicographically, like Go's < operator).

It returns ErrColumnNotFound if the column does not exist and ErrTypeMismatch if the value's type does not match the column's or the column does not support ordering (bool).

func (Dataframe) Max

func (d Dataframe) Max(name string) (float64, error)

Max returns the largest valid value in the numeric column with the given name. Null rows are skipped.

It returns ErrColumnNotFound if the column does not exist, ErrTypeMismatch if the column is not numeric, and ErrNoValidValues if the column is empty or all-null (the maximum of nothing does not exist).

func (Dataframe) Min

func (d Dataframe) Min(name string) (float64, error)

Min returns the smallest valid value in the numeric column with the given name. Null rows are skipped.

It returns ErrColumnNotFound if the column does not exist, ErrTypeMismatch if the column is not numeric, and ErrNoValidValues if the column is empty or all-null (the minimum of nothing does not exist).

func (Dataframe) Ne

func (d Dataframe) Ne(name string, value any) (Mask, error)

Ne returns a mask of the rows where the named column differs from value. Implemented as Eq().Not(), which preserves Kleene semantics: a null row is unknown for Ne too (null != x is not true in SQL either).

It returns ErrColumnNotFound if the column does not exist and ErrTypeMismatch if the value's type does not match the column's.
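
Kleene logic for a single row can be written out explicitly. The sketch below is standalone and illustrative: tri, not, and, or and passes are hypothetical names, with each truth value stored as a value bit plus a validity bit, the same layout a mask with nulls would use.

```go
package main

import "fmt"

// tri is a three-valued truth value: a value bit plus a validity bit.
// valid == false means unknown, and value is then ignored.
type tri struct{ value, valid bool }

var (
	yes     = tri{true, true}
	no      = tri{false, true}
	unknown = tri{false, false}
)

// not keeps unknown as unknown.
func not(a tri) tri {
	if !a.valid {
		return unknown
	}
	return tri{!a.value, true}
}

// and: a known false decides the result even if the other side is unknown.
func and(a, b tri) tri {
	if (a.valid && !a.value) || (b.valid && !b.value) {
		return no
	}
	if a.valid && b.valid {
		return yes
	}
	return unknown
}

// or: a known true decides the result even if the other side is unknown.
func or(a, b tri) tri {
	if (a.valid && a.value) || (b.valid && b.value) {
		return yes
	}
	if a.valid && b.valid {
		return no
	}
	return unknown
}

// passes is the filter rule: keep a row only if it is valid AND true.
func passes(a tri) bool { return a.valid && a.value }

func main() {
	// null != x: Eq yields unknown, Not keeps it unknown, the row is dropped.
	fmt.Println(passes(not(unknown)))    // false
	fmt.Println(and(no, unknown) == no)  // true
	fmt.Println(or(yes, unknown) == yes) // true
}
```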

func (Dataframe) NumCols

func (d Dataframe) NumCols() int

NumCols returns the number of columns in the dataframe.

func (Dataframe) NumRows

func (d Dataframe) NumRows() int

NumRows returns the number of rows in the dataframe.

func (Dataframe) Select

func (d Dataframe) Select(names ...string) (Dataframe, error)

Select returns a new Dataframe containing the named columns, in the given order — column projection, SQL's SELECT clause (Where is the WHERE). Duplicate names are an error.

Columns are shared with the original dataframe, not copied: columns are immutable after construction, so sharing is safe and Select is O(columns) regardless of row count. Polars does the same with its immutable buffers.

It returns ErrColumnNotFound if any requested column does not exist.

Example

ExampleDataframe_Select projects columns by name, in the requested order, without copying any data.

package main

import (
	"fmt"

	"github.com/gverdugo-dev/grizzly"
)

func main() {
	type city struct {
		Name string  `grizzly:"city"`
		Temp float64 `grizzly:"temp"`
	}
	df, _ := grizzly.FromStructs([]city{
		{"madrid", 21.5}, {"bilbao", 12.5},
	})

	out, _ := df.Select("temp", "city")
	fmt.Print(out)
}

Output:
temp  city
21.5  madrid
12.5  bilbao

func (Dataframe) Sort

func (d Dataframe) Sort(name string) (Dataframe, error)

Sort returns a new Dataframe with the rows ordered by the named column, ascending. Null rows sort first, under SortDesc too (Polars' rule: nulls group at the top in both directions). The sort is stable: rows with equal keys keep their original relative order.

Columnar engines do not reorder rows in place: Sort orders a permutation of row indices and then gathers every column through it once — the same indices-then-gather pattern GroupBy uses.

It returns ErrColumnNotFound if the column does not exist.
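
The indices-then-gather pattern can be sketched with the standard sort package. This is an illustration under assumptions, not grizzly's code: sortedOrder and gatherStrings are hypothetical helpers, and NaN keys are not handled.

```go
package main

import (
	"fmt"
	"sort"
)

// sortedOrder returns row indices ordered by key ascending, nulls first.
// sort.SliceStable keeps rows with equal keys in their original order.
func sortedOrder(keys []float64, valid []bool) []int {
	idx := make([]int, len(keys))
	for i := range idx {
		idx[i] = i
	}
	sort.SliceStable(idx, func(a, b int) bool {
		i, j := idx[a], idx[b]
		if valid[i] != valid[j] {
			return !valid[i] // the null row goes before the valid one
		}
		return valid[i] && keys[i] < keys[j]
	})
	return idx
}

// gatherStrings applies the permutation to one column.
func gatherStrings(col []string, idx []int) []string {
	out := make([]string, len(idx))
	for k, i := range idx {
		out[k] = col[i]
	}
	return out
}

func main() {
	city := []string{"sevilla", "bilbao", "madrid"}
	temp := []float64{35.5, 0, 21.5}
	valid := []bool{true, false, true}

	order := sortedOrder(temp, valid)
	fmt.Println(order, gatherStrings(city, order)) // [1 2 0] [bilbao madrid sevilla]
}
```

The key column is compared once to build the permutation; every other column is only copied through it.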

Example

ExampleDataframe_Sort orders rows by a column, returning a new dataframe. Nulls sort first, in both directions.

package main

import (
	"fmt"
	"strings"

	"github.com/gverdugo-dev/grizzly"
)

func main() {
	data := `[
		{"city": "sevilla", "temp": 35.5},
		{"city": "bilbao",  "temp": null},
		{"city": "madrid",  "temp": 21.5}
	]`
	schema := grizzly.Schema{
		{Name: "city", Type: grizzly.String},
		{Name: "temp", Type: grizzly.Float64},
	}
	df, _ := grizzly.FromJSONReader(strings.NewReader(data), schema)

	out, _ := df.Sort("temp")
	fmt.Print(out)
}

Output:
city     temp
bilbao   null
madrid   21.5
sevilla  35.5

func (Dataframe) SortDesc

func (d Dataframe) SortDesc(name string) (Dataframe, error)

SortDesc returns a new Dataframe with the rows ordered by the named column, descending. Null rows still sort first; see Sort.

func (Dataframe) String

func (d Dataframe) String() string

String renders the dataframe as an aligned text table.

Implementing this single method makes Dataframe satisfy fmt.Stringer, so the whole fmt package picks it up automatically: fmt.Println(df) prints this table. Output is truncated at maxPrintRows.

func (Dataframe) Sum

func (d Dataframe) Sum(name string) (float64, error)

Sum returns the sum of the numeric column with the given name.

Null rows are skipped (SQL aggregate semantics): the sum of the valid values is returned, and an all-null column sums to 0.

The summation is pairwise (recursive halving with sequential leaves), so the floating-point rounding error grows as O(ε·log n) instead of the naive loop's O(ε·n). This is the algorithm NumPy uses, and it explains why pandas and Polars agreed with each other (and not with grizzly's old sequential loop) on the last decimals of the benchmark checksum.

It returns ErrColumnNotFound if the column does not exist, and ErrTypeMismatch if the column is not numeric.
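
Pairwise summation is short enough to sketch in full. The version below is a standalone illustration, not grizzly's implementation: the leaf size of 128 is an assumed value, and null handling is left out.

```go
package main

import "fmt"

// leafSize is an assumed cutoff: at or below it a plain loop is used.
const leafSize = 128

// pairwiseSum splits the slice in half recursively and adds the two
// partial sums, so each value passes through O(log n) additions instead
// of up to n.
func pairwiseSum(xs []float64) float64 {
	if len(xs) <= leafSize {
		var s float64
		for _, x := range xs {
			s += x
		}
		return s
	}
	mid := len(xs) / 2
	return pairwiseSum(xs[:mid]) + pairwiseSum(xs[mid:])
}

// naiveSum is the sequential loop, for comparison.
func naiveSum(xs []float64) float64 {
	var s float64
	for _, x := range xs {
		s += x
	}
	return s
}

func main() {
	xs := make([]float64, 1_000_000)
	for i := range xs {
		xs[i] = 0.1
	}
	// The exact answer is 100000; compare how far each result drifts.
	fmt.Println(naiveSum(xs), pairwiseSum(xs))
}
```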

Example

ExampleDataframe_Sum shows the whole-column aggregations. Nulls are skipped, SQL-style: Avg divides by the count of valid values.

package main

import (
	"fmt"

	"github.com/gverdugo-dev/grizzly"
)

func main() {
	temp, _ := grizzly.NewFloat64ColumnWithNulls("temp",
		[]float64{21.5, 0, 28.25}, []bool{true, false, true})
	df, _ := grizzly.NewDataframe(temp)

	sum, _ := df.Sum("temp")
	avg, _ := df.Avg("temp")
	n, _ := df.Count("temp")
	fmt.Println(sum, avg, n)
}

Output:
49.75 24.875 2

func (Dataframe) ToCSV

func (d Dataframe) ToCSV(path string) error

ToCSV writes the dataframe to a CSV file with a header row, creating or truncating the file. It delegates to ToCSVWriter; see that function for the format and null rules.

func (Dataframe) ToCSVWriter

func (d Dataframe) ToCSVWriter(w io.Writer) error

ToCSVWriter writes the dataframe to a stream as CSV: one header row with the column names, then one record per row, columns in dataframe order.

Null rule (mirror of FromCSVReader): a null is written as an empty cell. For Float64 and Bool columns this round-trips exactly — an empty cell loads back as a null. For String columns it does NOT: grizzly deliberately reads an empty string cell as a real "" (see FromCSVReader), so a null string written by ToCSVWriter comes back as a valid empty string. Use ToJSONWriter when exact null round-trips matter; a configurable null marker may be added later.

Like ToJSONWriter, it writes bytes straight into a buffered writer instead of going through encoding/csv: csv.Writer only accepts []string records, which forced one string allocation per numeric cell (strconv.FormatFloat — the writer benchmark's whole allocation count). Here floats render via strconv.AppendFloat into one reused scratch buffer. Quoting follows encoding/csv's exact rules (see csvFieldNeedsQuotes), so the output is byte-for-byte what the previous implementation produced.
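
The scratch-buffer technique can be sketched for a single float64 column. This is an illustration, not grizzly's writer: writeFloatColumn and renderFloats are hypothetical helpers, and the 'f' format with shortest precision is an assumption about the output style.

```go
package main

import (
	"bufio"
	"fmt"
	"io"
	"strconv"
	"strings"
)

// writeFloatColumn writes one value per line. Each number is rendered
// into the same scratch buffer, so no string is allocated per cell.
// A null row becomes an empty line.
func writeFloatColumn(w io.Writer, values []float64, valid []bool) error {
	if len(values) != len(valid) {
		return fmt.Errorf("values and valid differ in length: %d vs %d", len(values), len(valid))
	}
	bw := bufio.NewWriter(w)
	scratch := make([]byte, 0, 32)
	for i, v := range values {
		if valid[i] {
			scratch = strconv.AppendFloat(scratch[:0], v, 'f', -1, 64)
			bw.Write(scratch)
		}
		bw.WriteByte('\n')
	}
	// bufio.Writer remembers the first write error; Flush reports it.
	return bw.Flush()
}

// renderFloats is a convenience wrapper that returns the output as a string.
func renderFloats(values []float64, valid []bool) (string, error) {
	var sb strings.Builder
	err := writeFloatColumn(&sb, values, valid)
	return sb.String(), err
}

func main() {
	out, err := renderFloats([]float64{21.5, 0, 28.25}, []bool{true, false, true})
	fmt.Printf("%q %v\n", out, err) // "21.5\n\n28.25\n" <nil>
}
```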

Example

ExampleDataframe_ToCSVWriter writes the dataframe as CSV. Nulls become empty cells (note bilbao's missing temp), which round-trip for float64 and bool columns — see ToCSVWriter for the string caveat.

package main

import (
	"log"
	"os"
	"strings"

	"github.com/gverdugo-dev/grizzly"
)

func main() {
	src := `[{"city": "madrid", "temp": 21.5}, {"city": "bilbao", "temp": null}]`
	schema := grizzly.Schema{
		{Name: "city", Type: grizzly.String},
		{Name: "temp", Type: grizzly.Float64},
	}
	df, _ := grizzly.FromJSONReader(strings.NewReader(src), schema)

	if err := df.ToCSVWriter(os.Stdout); err != nil {
		log.Fatal(err)
	}
}

Output:
city,temp
madrid,21.5
bilbao,

func (Dataframe) ToJSON

func (d Dataframe) ToJSON(path string) error

ToJSON writes the dataframe to a JSON file as an array of objects, creating or truncating the file. It delegates to ToJSONWriter; see that function for the format and null rules.

func (Dataframe) ToJSONWriter

func (d Dataframe) ToJSONWriter(w io.Writer) error

ToJSONWriter writes the dataframe to a stream as a compact JSON array of objects, one object per row with the columns as keys, in dataframe order — the exact shape FromJSONReader loads.

Null rule: a null row is written as a literal null, so the output round-trips exactly through FromJSONReader (a property CSV cannot offer for string columns; see ToCSVWriter).

Like FromJSONReader, it streams at token level: rows are emitted straight to the buffered writer, never materialized as map[string]any (which would heap-allocate one map per row and box every value).

JSON has no NaN or Infinity: a Float64 column holding one (possible via FromStructs) is an error rather than invalid output.
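
The standard library takes the same position: encoding/json returns an error when asked to marshal NaN or an infinity. A pre-write check in that spirit might look like the sketch below, where checkFinite is a hypothetical helper and not part of grizzly.

```go
package main

import (
	"encoding/json"
	"fmt"
	"math"
)

// checkFinite rejects values that JSON cannot represent.
func checkFinite(values []float64) error {
	for i, v := range values {
		if math.IsNaN(v) || math.IsInf(v, 0) {
			return fmt.Errorf("row %d: %v has no JSON representation", i, v)
		}
	}
	return nil
}

func main() {
	// encoding/json refuses NaN as well.
	_, err := json.Marshal(math.NaN())
	fmt.Println(err != nil) // true

	fmt.Println(checkFinite([]float64{1.5, math.Inf(1)}))
	fmt.Println(checkFinite([]float64{1.5, 2})) // <nil>
}
```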

Example

ExampleDataframe_ToJSONWriter writes the dataframe as a compact JSON array of objects — the exact shape FromJSONReader loads, with literal nulls, so the output round-trips exactly.

package main

import (
	"log"
	"os"
	"strings"

	"github.com/gverdugo-dev/grizzly"
)

func main() {
	src := `[{"city": "madrid", "temp": 21.5}, {"city": "bilbao", "temp": null}]`
	schema := grizzly.Schema{
		{Name: "city", Type: grizzly.String},
		{Name: "temp", Type: grizzly.Float64},
	}
	df, _ := grizzly.FromJSONReader(strings.NewReader(src), schema)

	if err := df.ToJSONWriter(os.Stdout); err != nil {
		log.Fatal(err)
	}
}

Output:
[{"city":"madrid","temp":21.5},{"city":"bilbao","temp":null}]

func (Dataframe) Where

func (d Dataframe) Where(m Mask) (Dataframe, error)

Where returns a new Dataframe containing the rows whose mask bit is valid AND true — SQL WHERE semantics: an unknown (a comparison that involved a null) is not true, so the row is dropped.

This is the single materialization point of a filter: every column's surviving values (and their validity) are gathered into fresh buffers. It returns an error if the mask's length does not match the dataframe's row count.
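As a rough illustration of "valid AND true, then gather", here is a stdlib-only sketch. It uses []bool slices instead of bitmaps for readability and handles a single float64 column. The names keepRows, gather and where, and the value/known pair standing in for a mask, are assumptions made for this example and not grizzly internals.

```go
package main

import (
	"errors"
	"fmt"
)

// keepRows returns the indices of rows whose mask entry is known AND true.
// value[i] is the comparison result; known[i] is false when the comparison
// involved a null, so that row is dropped.
func keepRows(value, known []bool) []int {
	var idx []int
	for i := range value {
		if known[i] && value[i] {
			idx = append(idx, i)
		}
	}
	return idx
}

// gather copies the selected rows of one column (values and validity)
// into fresh buffers.
func gather(values []float64, valid []bool, idx []int) ([]float64, []bool) {
	outV := make([]float64, len(idx))
	outOK := make([]bool, len(idx))
	for j, i := range idx {
		outV[j], outOK[j] = values[i], valid[i]
	}
	return outV, outOK
}

// where checks lengths, selects the surviving rows once, then gathers.
func where(values []float64, valid, maskValue, maskKnown []bool) ([]float64, []bool, error) {
	n := len(values)
	if len(valid) != n || len(maskValue) != n || len(maskKnown) != n {
		return nil, nil, errors.New("mask length does not match row count")
	}
	v, ok := gather(values, valid, keepRows(maskValue, maskKnown))
	return v, ok, nil
}

func main() {
	temps := []float64{21.5, 0, 28.25, 35.5}
	valid := []bool{true, false, true, true} // row 1 is null
	// temp > 20: row 1 is unknown because its input is null.
	maskValue := []bool{true, false, true, true}
	maskKnown := []bool{true, false, true, true}
	fmt.Println(where(temps, valid, maskValue, maskKnown))
}
```

With several columns, the index list would be computed once and reused for every column's gather, which is why combining many conditions still costs a single materialization.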

Example

ExampleDataframe_Where filters rows by combining comparison masks: build masks with the comparators, combine with And/Or/Not, materialize once with Where. Errors are elided to keep the example focused.

package main

import (
	"fmt"

	"github.com/gverdugo-dev/grizzly"
)

func main() {
	type city struct {
		Name string  `grizzly:"city"`
		Temp float64 `grizzly:"temp"`
	}
	df, _ := grizzly.FromStructs([]city{
		{"madrid", 21.5}, {"bilbao", 12.5}, {"valencia", 28.25}, {"sevilla", 35.5},
	})

	warm, _ := df.Gt("temp", 20.0)
	mild, _ := df.Lt("temp", 30.0)
	out, _ := df.Where(warm.And(mild))
	fmt.Print(out)
}
Output:
city      temp
madrid    21.5
valencia  28.25

type Field

type Field struct {
	Name string
	Type DType
}

Field declares one column of a Schema: its name in the source and its type.

type Float64Column

type Float64Column struct {
	// contains filtered or unexported fields
}

Float64Column is a column of float64 values backed by a contiguous slice.

Nulls are tracked by a validity bitmap (see bitmap.go): at null rows the values slice holds a placeholder (0.0) that operations never read. A nil bitmap means the column has no nulls.

func NewFloat64Column

func NewFloat64Column(name string, values []float64) *Float64Column

NewFloat64Column returns a Float64Column with the given name and values.

The slice is stored as-is (not copied): the caller must not mutate it after construction. The column has no nulls; use NewFloat64ColumnWithNulls to mark some rows as null.
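The "stored as-is" caveat is ordinary Go slice aliasing. The sketch below uses a stand-in type (column and newColumn are hypothetical, not grizzly's Float64Column) to show what a later write to the caller's slice does, and how passing a copy avoids it.

```go
package main

import "fmt"

// column is a stand-in for this example: like the constructor described
// above, it keeps the caller's slice without copying it.
type column struct{ values []float64 }

func newColumn(values []float64) *column { return &column{values: values} }

func main() {
	data := []float64{1, 2, 3}
	shared := newColumn(data)

	// Passing a copy keeps the column independent of later writes.
	private := newColumn(append([]float64(nil), data...))

	data[0] = 99 // both data and shared.values use the same backing array

	fmt.Println(shared.values[0], private.values[0])
}
```

Skipping the copy is what makes construction O(1); the cost is that the caller has to treat the slice as handed over.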

func NewFloat64ColumnWithNulls

func NewFloat64ColumnWithNulls(name string, values []float64, valid []bool) (*Float64Column, error)

NewFloat64ColumnWithNulls returns a Float64Column where valid[i] reports whether values[i] is a real value (true) or a null (false). values and valid must have the same length; otherwise an error is returned.

The []bool mask is the ergonomic boundary representation: it is compacted here, once, into the internal validity bitmap (and dropped entirely when it contains no false entries). The value stored at a null row is kept as a placeholder that operations never read.
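The compaction step can be sketched as follows. This is a minimal illustration, assuming a layout of 64 rows per uint64 word with a set bit meaning "valid"; grizzly's actual bitmap layout is not shown in this documentation, and packValidity, isValid and nullCount are names made up for the example.

```go
package main

import (
	"fmt"
	"math/bits"
)

// packValidity compacts a []bool into a bitmap: bit i%64 of word i/64 is set
// when row i is valid. It returns nil when every entry is true, so data
// without nulls carries no bitmap at all.
func packValidity(valid []bool) []uint64 {
	words := make([]uint64, (len(valid)+63)/64)
	anyNull := false
	for i, ok := range valid {
		if ok {
			words[i/64] |= uint64(1) << (uint(i) % 64)
		} else {
			anyNull = true
		}
	}
	if !anyNull {
		return nil
	}
	return words
}

// isValid reads row i; a nil bitmap means "no nulls".
func isValid(bitmap []uint64, i int) bool {
	return bitmap == nil || bitmap[i/64]&(uint64(1)<<(uint(i)%64)) != 0
}

// nullCount returns the number of null rows among the first n rows.
// Padding bits past n are never set, so they do not disturb the count.
func nullCount(bitmap []uint64, n int) int {
	if bitmap == nil {
		return 0
	}
	set := 0
	for _, w := range bitmap {
		set += bits.OnesCount64(w)
	}
	return n - set
}

func main() {
	valid := []bool{true, false, true}
	bm := packValidity(valid)
	fmt.Println(bm, isValid(bm, 1), nullCount(bm, len(valid)))
}
```

A bitmap costs one bit per row instead of the one byte per row of a []bool, and the nil case lets null-free columns skip validity checks entirely.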

func (*Float64Column) DType

func (c *Float64Column) DType() DType

DType returns Float64.

func (*Float64Column) IsValid

func (c *Float64Column) IsValid(i int) bool

IsValid reports whether row i holds a value (true) or a null (false). It panics if i is out of range.

func (*Float64Column) Len

func (c *Float64Column) Len() int

Len returns the number of values in the column.

func (*Float64Column) Name

func (c *Float64Column) Name() string

Name returns the column's name.

func (*Float64Column) NullCount

func (c *Float64Column) NullCount() int

NullCount returns the number of null rows in the column.

func (*Float64Column) Value

func (c *Float64Column) Value(i int) (float64, bool)

Value returns the value at row i and whether it is valid (comma-ok): ok is false when the row is null, and the returned value must then be ignored. It panics if i is out of range.

type GroupedDataframe

type GroupedDataframe struct {
	// contains filtered or unexported fields
}

GroupedDataframe is the intermediate result of Dataframe.GroupBy: the source dataframe plus the grouping decision, waiting for Agg to say which aggregations to compute.

It is cheap: grouping does not split the data into per-group copies, it only stamps every row with a group id (the factorize pattern). groupIDs is a flat []int parallel to the rows, and firstRows remembers where each group's key first appeared — which both makes the output deterministic (Go maps iterate in random order on purpose) and lets Agg rebuild the key column by gathering those rows.
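The factorize pattern described above can be sketched in a few lines. This is an illustration over a plain []string key column, not grizzly's implementation: the function name factorize is an assumption, and null keys are ignored for brevity. The groupIDs and firstRows results mirror the roles the paragraph gives them.

```go
package main

import "fmt"

// factorize stamps every row with a group id, assigned in first-appearance
// order. groupIDs is parallel to keys; firstRows[g] is the row where the key
// of group g first appeared.
func factorize(keys []string) (groupIDs, firstRows []int) {
	seen := make(map[string]int, len(keys))
	groupIDs = make([]int, len(keys))
	for row, k := range keys {
		id, ok := seen[k]
		if !ok {
			id = len(firstRows)
			seen[k] = id
			firstRows = append(firstRows, row)
		}
		groupIDs[row] = id
	}
	return groupIDs, firstRows
}

func main() {
	stores := []string{"north", "south", "north", "east", "south"}
	ids, first := factorize(stores)
	fmt.Println(ids, first)

	// Rebuild the key column by gathering the first row of each group.
	for _, r := range first {
		fmt.Println(stores[r])
	}
}
```

The map is only used for lookups; the output order comes from firstRows, so it does not depend on map iteration order.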

func (GroupedDataframe) Agg

func (g GroupedDataframe) Agg(specs ...AggSpec) (Dataframe, error)

Agg computes the given aggregations over the groups and returns the result as a new Dataframe. The key column comes first, keeping its original name, with one row per group in first-appearance order. It is followed by one float64 column per spec, named after its source column unless renamed with AggSpec.As. Renaming is required when the same column is aggregated twice, because duplicate output names are an error.

Null semantics mirror the whole-column aggregations, translated to per-group form: nulls are skipped; a group whose values are all null yields 0 for Sum and Count (the sum/count of nothing) and null for Avg, Min and Max (the per-group equivalent of ErrNoValidValues — a group cannot return an error, so it returns null).
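The per-group null rules can be made concrete with a small sketch. It assumes rows have already been stamped with group ids and uses []bool validity for readability; groupSumAvg is a hypothetical helper, and it sums with a plain loop rather than the pairwise summation the library advertises.

```go
package main

import "fmt"

// groupSumAvg computes per-group Sum and Avg, skipping null rows.
// For a group with no valid values, sum stays 0 and avgValid is false,
// which represents a null Avg.
func groupSumAvg(values []float64, valid []bool, groupIDs []int, nGroups int) (sum, avg []float64, avgValid []bool) {
	sum = make([]float64, nGroups)
	avg = make([]float64, nGroups)
	avgValid = make([]bool, nGroups)
	count := make([]int, nGroups)

	for row, g := range groupIDs {
		if !valid[row] {
			continue // nulls are skipped, never counted as zero
		}
		sum[g] += values[row]
		count[g]++
	}
	for g := range sum {
		if count[g] > 0 {
			avg[g] = sum[g] / float64(count[g])
			avgValid[g] = true
		}
	}
	return sum, avg, avgValid
}

func main() {
	values := []float64{10, 20, 0, 0}
	valid := []bool{true, true, false, false} // group 1 is all null
	groupIDs := []int{0, 0, 1, 1}
	fmt.Println(groupSumAvg(values, valid, groupIDs, 2))
}
```

Min and Max would follow the Avg shape: a result value plus a validity flag that stays false when the group had nothing to compare.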

Sum, Avg, Min and Max require a float64 column; Count accepts a column of any type. Agg returns ErrColumnNotFound for unknown columns and ErrTypeMismatch when Sum, Avg, Min or Max is applied to a column that is not float64.

type Mask

type Mask struct {
	// contains filtered or unexported fields
}

Mask is the result of a comparison over a Dataframe column: one bit per row telling whether that row satisfies the condition. Masks are produced by the Dataframe comparators (Eq, Ne, Lt, Le, Gt, Ge), combined with And, Or and Not, and finally consumed by Dataframe.Where, which materializes the surviving rows — one materialization no matter how many conditions were combined.

A mask carries Kleene three-valued logic: comparing a null row yields neither true nor false but unknown, tracked in a validity bitmap exactly like a column's (nil = no unknowns). Combinators implement the Kleene truth tables at word level — 64 rows per instruction — and Where keeps only rows whose bit is valid AND true, mirroring SQL's WHERE.

func (Mask) And

func (m Mask) And(o Mask) Mask

And returns the logical AND of the two masks under Kleene three-valued logic: the result is known when both operands are known, or when either known operand is false (false decides an AND on its own — false AND unknown is false, not unknown). It panics if the lengths differ.
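A word-level Kleene AND can be sketched with two bit operations per output word. This is an illustration under assumptions, not grizzly's code: each operand is modeled as a value word plus a "known" word (set bit means the row is not unknown), and kleeneAndWord is a name invented here.

```go
package main

import "fmt"

// kleeneAndWord combines 64 rows at once. av/bv hold the value bits and
// ak/bk the known bits of each operand; value bits at unknown positions
// are ignored.
func kleeneAndWord(av, ak, bv, bk uint64) (v, k uint64) {
	aFalse := ak &^ av // rows where a is known and false
	bFalse := bk &^ bv // rows where b is known and false

	// Known when both sides are known, or when either side is a known false.
	k = (ak & bk) | aFalse | bFalse
	// True only when both sides are known and true.
	v = av & bv & ak & bk
	return v, k
}

func main() {
	// Row 0: true AND true, row 1: false AND unknown,
	// row 2: true AND unknown, row 3: unknown AND unknown.
	av, ak := uint64(0b0101), uint64(0b0111)
	bv, bk := uint64(0b0001), uint64(0b0001)

	v, k := kleeneAndWord(av, ak, bv, bk)
	fmt.Printf("value=%04b known=%04b\n", v, k)
}
```

Row 1 comes out known and false even though one operand is unknown, which is the "false decides an AND on its own" rule; rows 2 and 3 stay unknown.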

func (Mask) Not

func (m Mask) Not() Mask

Not returns the logical negation of the mask. Unknown stays unknown (NOT null is null, per Kleene), so the validity bitmap is shared untouched — only the value bits flip.

func (Mask) Or

func (m Mask) Or(o Mask) Mask

Or returns the logical OR of the two masks under Kleene three-valued logic: the result is known when both operands are known, or when either known operand is true (true decides an OR on its own — true OR unknown is true, not unknown). It panics if the lengths differ.
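The OR case is the mirror image at word level. As before, this is a sketch under assumptions rather than grizzly's code: operands are modeled as a value word plus a "known" word, and kleeneOrWord is a hypothetical name.

```go
package main

import "fmt"

// kleeneOrWord combines 64 rows at once. av/bv hold the value bits and
// ak/bk the known bits of each operand; value bits at unknown positions
// are ignored.
func kleeneOrWord(av, ak, bv, bk uint64) (v, k uint64) {
	aTrue := av & ak // rows where a is known and true
	bTrue := bv & bk // rows where b is known and true

	// Known when both sides are known, or when either side is a known true.
	k = (ak & bk) | aTrue | bTrue
	// True when either side is a known true.
	v = aTrue | bTrue
	return v, k
}

func main() {
	// Row 0: false OR false, row 1: true OR unknown,
	// row 2: false OR unknown, row 3: unknown OR unknown.
	av, ak := uint64(0b0010), uint64(0b0111)
	bv, bk := uint64(0b0000), uint64(0b0001)

	v, k := kleeneOrWord(av, ak, bv, bk)
	fmt.Printf("value=%04b known=%04b\n", v, k)
}
```

Row 1 is known and true despite the unknown operand, while row 2 stays unknown because a known false cannot decide an OR.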

type Schema

type Schema []Field

Schema declares the columns to load from an untyped source (CSV, JSON): which ones, their types, and their order in the resulting Dataframe.

grizzly never guesses types — a "08001" zip code stays a string because the user said so, not an int because it looked like one. The schema is a slice (not a map) so the user also controls column order, which untyped sources can't provide reliably: JSON objects are unordered by definition.
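To show why a declared schema matters, the sketch below parses a tiny CSV with the standard library, converting each column only as declared. It is not how grizzly loads data: field, kind and loadColumns are stand-ins invented for this example, and empty cells (nulls) are not handled.

```go
package main

import (
	"encoding/csv"
	"fmt"
	"strconv"
	"strings"
)

// kind and field are hypothetical stand-ins for a declared column type and
// a schema entry.
type kind int

const (
	kindString kind = iota
	kindFloat64
)

type field struct {
	name string
	kind kind
}

// loadColumns reads CSV with a header row and returns one slice per schema
// field, in schema order, parsed with the declared type only.
func loadColumns(src string, schema []field) ([]any, error) {
	recs, err := csv.NewReader(strings.NewReader(src)).ReadAll()
	if err != nil {
		return nil, err
	}
	if len(recs) == 0 {
		return nil, fmt.Errorf("missing header row")
	}
	pos := make(map[string]int, len(recs[0]))
	for i, h := range recs[0] {
		pos[h] = i
	}
	cols := make([]any, len(schema))
	for c, f := range schema {
		p, ok := pos[f.name]
		if !ok {
			return nil, fmt.Errorf("column %q not found", f.name)
		}
		switch f.kind {
		case kindString:
			out := make([]string, 0, len(recs)-1)
			for _, r := range recs[1:] {
				out = append(out, r[p]) // kept verbatim, leading zeros included
			}
			cols[c] = out
		case kindFloat64:
			out := make([]float64, 0, len(recs)-1)
			for _, r := range recs[1:] {
				v, err := strconv.ParseFloat(r[p], 64)
				if err != nil {
					return nil, fmt.Errorf("column %q: %w", f.name, err)
				}
				out = append(out, v)
			}
			cols[c] = out
		}
	}
	return cols, nil
}

func main() {
	src := "zip,price\n08001,1.5\n28013,2.25\n"
	cols, err := loadColumns(src, []field{
		{name: "price", kind: kindFloat64},
		{name: "zip", kind: kindString},
	})
	fmt.Println(cols, err)
}
```

Because the schema is a slice, the output order (price, then zip) follows the declaration rather than the order of the header row.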

schema := grizzly.Schema{
	{Name: "product", Type: grizzly.String},
	{Name: "price", Type: grizzly.Float64},
}

type StringColumn

type StringColumn struct {
	// contains filtered or unexported fields
}

StringColumn is a column of string values backed by a contiguous slice.

Nulls are tracked by a validity bitmap (see bitmap.go): at null rows the values slice holds a placeholder ("") that operations never read. A nil bitmap means the column has no nulls.

func NewStringColumn

func NewStringColumn(name string, values []string) *StringColumn

NewStringColumn returns a StringColumn with the given name and values.

The slice is stored as-is (not copied): the caller must not mutate it after construction. The column has no nulls; use NewStringColumnWithNulls to mark some rows as null.

func NewStringColumnWithNulls

func NewStringColumnWithNulls(name string, values []string, valid []bool) (*StringColumn, error)

NewStringColumnWithNulls returns a StringColumn where valid[i] reports whether values[i] is a real value (true) or a null (false). values and valid must have the same length; otherwise an error is returned.

The []bool mask is the ergonomic boundary representation: it is compacted here, once, into the internal validity bitmap (and dropped entirely when it contains no false entries). The value stored at a null row is kept as a placeholder that operations never read.

func (*StringColumn) DType

func (c *StringColumn) DType() DType

DType returns String.

func (*StringColumn) IsValid

func (c *StringColumn) IsValid(i int) bool

IsValid reports whether row i holds a value (true) or a null (false). It panics if i is out of range.

func (*StringColumn) Len

func (c *StringColumn) Len() int

Len returns the number of values in the column.

func (*StringColumn) Name

func (c *StringColumn) Name() string

Name returns the column's name.

func (*StringColumn) NullCount

func (c *StringColumn) NullCount() int

NullCount returns the number of null rows in the column.

func (*StringColumn) Value

func (c *StringColumn) Value(i int) (string, bool)

Value returns the value at row i and whether it is valid (comma-ok): ok is false when the row is null, and the returned value must then be ignored. It panics if i is out of range.

Directories

Path Synopsis
cmd/playground (command)
