Documentation
¶
Overview ¶
Package grizzly is a dataframe library built from scratch as a learning project.
Data is ingested thinking in rows (structs, CSV, JSON) but stored and processed in columns: each column keeps its values in a contiguous typed slice, which is what makes operations like Sum or Filter fast.
Index ¶
- Variables
- func SetLogger(l *slog.Logger)
- type AggSpec
- type BoolColumn
- type Column
- type DType
- type Dataframe
- func FromCSV(path string, schema Schema) (Dataframe, error)
- func FromCSVReader(r io.Reader, schema Schema) (Dataframe, error)
- func FromJSON(path string, schema Schema) (Dataframe, error)
- func FromJSONReader(r io.Reader, schema Schema) (Dataframe, error)
- func FromStructs[T any](rows []T) (Dataframe, error)
- func NewDataframe(cols ...Column) (Dataframe, error)
- func (d Dataframe) Avg(name string) (float64, error)
- func (d Dataframe) Column(name string) (Column, error)
- func (d Dataframe) Count(name string) (int, error)
- func (d Dataframe) Eq(name string, value any) (Mask, error)
- func (d Dataframe) Ge(name string, value any) (Mask, error)
- func (d Dataframe) GroupBy(names ...string) GroupedDataframe
- func (d Dataframe) Gt(name string, value any) (Mask, error)
- func (d Dataframe) Info() string
- func (d Dataframe) Le(name string, value any) (Mask, error)
- func (d Dataframe) Lt(name string, value any) (Mask, error)
- func (d Dataframe) Max(name string) (float64, error)
- func (d Dataframe) Min(name string) (float64, error)
- func (d Dataframe) Ne(name string, value any) (Mask, error)
- func (d Dataframe) NumCols() int
- func (d Dataframe) NumRows() int
- func (d Dataframe) Select(names ...string) (Dataframe, error)
- func (d Dataframe) Sort(name string) (Dataframe, error)
- func (d Dataframe) SortDesc(name string) (Dataframe, error)
- func (d Dataframe) String() string
- func (d Dataframe) Sum(name string) (float64, error)
- func (d Dataframe) ToCSV(path string) error
- func (d Dataframe) ToCSVWriter(w io.Writer) error
- func (d Dataframe) ToJSON(path string) error
- func (d Dataframe) ToJSONWriter(w io.Writer) error
- func (d Dataframe) Where(m Mask) (Dataframe, error)
- type Field
- type Float64Column
- type GroupedDataframe
- type Mask
- type Schema
- type StringColumn
Examples ¶
Constants ¶
This section is empty.
Variables ¶
var ErrColumnNotFound = errors.New("column does not exist in dataframe")
ErrColumnNotFound is returned when the requested column does not exist in the dataframe.
var ErrNoValidValues = errors.New("no valid values in column")
ErrNoValidValues is returned by aggregations that have no honest answer over zero values (Avg, Min, Max of an empty or all-null column) — the Go-flavored equivalent of SQL returning NULL in that case.
var ErrTypeMismatch = errors.New("operation not supported for column type")
ErrTypeMismatch is returned when an operation is applied to a column whose type does not support it (e.g. Sum over a string column).
Functions ¶
Types ¶
type AggSpec ¶
type AggSpec struct {
// contains filtered or unexported fields
}
AggSpec describes one aggregation for GroupedDataframe.Agg: which operation over which column, and optionally the output column's name. Build specs with the package-level constructors Sum, Avg, Min, Max and Count — package functions, distinct from the Dataframe methods of the same names (Go resolves them by receiver, so grizzly.Sum("price") and df.Sum("price") coexist) — and rename with As.
func Count ¶
Count returns an AggSpec that counts the valid (non-null) rows of the named column per group — SQL's COUNT(col).
func (AggSpec) As ¶
As renames the aggregation's output column, like SQL's AS:
df.GroupBy("store").Agg(
grizzly.Sum("price"), // output column: price
grizzly.Avg("price").As("avg_price"), // output column: avg_price
)
Without As the output keeps the source column's name, so aggregating the same column twice requires renaming at least one of them (duplicate names fail when the result dataframe is built).
type BoolColumn ¶
type BoolColumn struct {
// contains filtered or unexported fields
}
BoolColumn is a column of bool values stored packed, Arrow-style: one bit per row in bitmap words, not one byte per row — 8x smaller than a []bool, and counting trues is a popcount away.
Packing breaks the "1 slice element = 1 row" coincidence the other columns enjoy: len(values) counts words (64 rows each), so the logical row count needs its own length field, and bounds checks are written by hand instead of falling out of slice indexing.
Nulls are tracked by a validity bitmap like every other column: at null rows the packed value bit is an arbitrary placeholder (0) that operations never read. A nil bitmap means the column has no nulls.
func NewBoolColumn ¶
func NewBoolColumn(name string, values []bool) *BoolColumn
NewBoolColumn returns a BoolColumn with the given name and values, packed to one bit per row. The input slice is only read during construction; the column keeps no reference to it. The column has no nulls; use NewBoolColumnWithNulls to mark some rows as null.
func NewBoolColumnWithNulls ¶
func NewBoolColumnWithNulls(name string, values []bool, valid []bool) (*BoolColumn, error)
NewBoolColumnWithNulls returns a BoolColumn where valid[i] reports whether values[i] is a real value (true) or a null (false). values and valid must have the same length; otherwise an error is returned.
Both slices are only read during construction (values are packed, the mask is compacted — and dropped entirely when it contains no false entries).
func (*BoolColumn) IsValid ¶
func (c *BoolColumn) IsValid(i int) bool
IsValid reports whether row i holds a value (true) or a null (false). It panics if i is out of range.
func (*BoolColumn) Len ¶
func (c *BoolColumn) Len() int
Len returns the number of values in the column.
func (*BoolColumn) NullCount ¶
func (c *BoolColumn) NullCount() int
NullCount returns the number of null rows in the column.
type Column ¶
type Column interface {
// Name returns the column's name, unique within a Dataframe.
Name() string
// Len returns the number of values in the column.
Len() int
// DType returns the column's data type.
DType() DType
// IsValid reports whether row i holds a value (true) or a null (false).
// It panics if i is out of range.
IsValid(i int) bool
// NullCount returns the number of null rows in the column.
NullCount() int
}
Column is the contract every typed column satisfies.
A column stores its values in a contiguous typed slice (e.g. []float64), which is what makes columnar processing fast: sequential memory access, no per-value heap allocations, no type assertions in hot loops.
The interface intentionally exposes only type-agnostic behavior. Anything that needs the concrete values (Sum, Filter...) type-switches on the concrete implementation (*Float64Column, *StringColumn...) to reach the underlying slice directly.
type DType ¶
type DType string
DType identifies the data type of a column. Grizzly supports a closed set of column types; every Column implementation reports exactly one DType.
Using a small named type instead of bare strings gives us compile-time safety wherever a function expects "a column type" rather than "any string".
type Dataframe ¶
type Dataframe struct {
// contains filtered or unexported fields
}
Dataframe is an in-memory, column-oriented table.
Columns are kept in a slice (not a map) to preserve their insertion order, which matters when printing or serializing. Lookup by name is a linear scan — fine for the realistic number of columns in a dataframe (tens, not millions).
func FromCSV ¶
FromCSV builds a Dataframe from a CSV file with a header row. See FromCSVReader for the format, schema and null rules — both paths load identical dataframes.
Unlike FromCSVReader, it reads the whole file into memory and, when the file is big enough, parses it in parallel by chunks (see from_csv_parallel.go): a file path offers random access, which is what makes splitting into byte ranges possible. Use FromCSVReader when streaming matters more than speed.
func FromCSVReader ¶
FromCSVReader builds a Dataframe from a stream of CSV data with a header row, using the given schema to decide each column's type (CSV itself is untyped: every value arrives as a string until someone decides otherwise).
The stream is read one record at a time, never whole in memory, with the record slice reused across reads (ReuseRecord) so the hot loop does not allocate per row.
The schema also selects and orders the columns: source columns not listed in it are ignored, and the Dataframe's column order is the schema's order, not the stream's. A schema column missing from the header is an error.
Null rule: an empty cell in a Float64 column is a null (validity bit 0). In a String column it stays a real empty string — CSV cannot distinguish an empty cell from a legitimate "", and silently turning every "" into a null (as pandas does) destroys real data. Explicit over magic; a configurable null marker can be added later if needed.
Example ¶
ExampleFromCSVReader loads typed columns from untyped CSV text: the schema decides the types, and an empty cell in a float64 column loads as a null (printed as "null", never as a fake 0).
package main
import (
"fmt"
"log"
"strings"
"github.com/gverdugo-dev/grizzly"
)
func main() {
data := "city,temp\nmadrid,21.5\nbilbao,\nvalencia,28.25\n"
schema := grizzly.Schema{
{Name: "city", Type: grizzly.String},
{Name: "temp", Type: grizzly.Float64},
}
df, err := grizzly.FromCSVReader(strings.NewReader(data), schema)
if err != nil {
log.Fatal(err)
}
fmt.Print(df)
}
Output: city temp madrid 21.5 bilbao null valencia 28.25
func FromJSON ¶
FromJSON builds a Dataframe from a JSON file containing an array of objects (one object = one row). See FromJSONReader for the format, schema and null rules — valid files load identically through both.
Unlike FromJSONReader, it reads the whole file into memory and parses the bytes directly (see from_json_bytes.go), skipping encoding/json's per-value machinery. Use FromJSONReader when streaming matters more than speed.
func FromJSONReader ¶
FromJSONReader builds a Dataframe from a stream containing a JSON array of objects (one object = one row), using the given schema to decide each column's type and the column order — JSON objects are unordered, so the stream cannot provide an order even if it wanted to.
It decodes at token level, streaming: rows are never materialized as map[string]any. The first benchmark showed that intermediate layout (1M heap-allocated maps, 10M values boxed in any) made JSON ~6x slower than our own CSV path. Values are decoded through a reused per-column pointer, so they go stream → typed slice with no per-value allocation.
A JSON null becomes a null row in that column (validity bit 0). Schema columns missing from a row, or holding a value of the wrong JSON type, are still errors: explicit over magic — an absent key is a malformed row, a literal null is data.
func FromStructs ¶
FromStructs builds a Dataframe from a slice of structs: one column per exported struct field, named after the field (or its `grizzly:"..."` tag).
type Sale struct {
Product string `grizzly:"product"`
Price float64 `grizzly:"price"`
}
df, err := grizzly.FromStructs([]Sale{...})
This is the row-oriented front door of grizzly: the caller thinks in rows (one struct = one row), and this function transposes them into typed columns once, at the boundary. No schema is needed — the struct fields already declare the types.
Unexported fields are skipped. A field of an unsupported type is an error rather than silently dropped: explicit over magic.
Example ¶
ExampleFromStructs shows the row-oriented front door: a slice of structs becomes a dataframe, one column per exported field, renamed by tags.
package main
import (
"fmt"
"log"
"github.com/gverdugo-dev/grizzly"
)
func main() {
type sale struct {
Product string `grizzly:"product"`
Price float64 `grizzly:"price"`
Sold bool `grizzly:"sold"`
}
df, err := grizzly.FromStructs([]sale{
{Product: "apple", Price: 1.5, Sold: true},
{Product: "pear", Price: 2, Sold: false},
{Product: "orange", Price: 0.5, Sold: true},
})
if err != nil {
log.Fatal(err)
}
fmt.Print(df)
}
Output: product price sold apple 1.5 true pear 2 false orange 0.5 true
func NewDataframe ¶
NewDataframe builds a Dataframe from the given columns.
All columns must have the same length and unique names; otherwise an error is returned. This is the low-level constructor — friendlier, row-oriented constructors (from structs, CSV, JSON) build on top of it.
func (Dataframe) Avg ¶
Avg returns the average of the valid values in the numeric column with the given name: the sum of the valid values divided by their count, per the SQL aggregate convention (nulls are skipped, in both numerator and denominator).
It returns ErrColumnNotFound if the column does not exist, ErrTypeMismatch if the column is not numeric, and ErrNoValidValues if the column is empty or all-null (an average over zero values has no honest answer).
func (Dataframe) Count ¶
Count returns the number of valid (non-null) rows in the column with the given name — SQL's COUNT(col), not COUNT(*). It works for any column type, and an empty or all-null column counts 0 (not an error: the count of nothing is zero).
It returns ErrColumnNotFound if the column does not exist.
func (Dataframe) Eq ¶
Eq returns a mask of the rows where the named column equals value. The value's type must match the column's dtype (float64, string or bool); null rows compare to unknown and are dropped by Where.
It returns ErrColumnNotFound if the column does not exist and ErrTypeMismatch if the value's type does not match the column's.
func (Dataframe) Ge ¶
Ge returns a mask of the rows where the named column is greater than or equal to value. See Lt for the supported column types and errors.
func (Dataframe) GroupBy ¶
func (d Dataframe) GroupBy(names ...string) GroupedDataframe
GroupBy groups the dataframe's rows by the values of the named key column, returning a GroupedDataframe to aggregate with Agg:
result, err := df.GroupBy("store").Agg(grizzly.Sum("price"))
Null keys follow the SQL rule: all nulls form one group together (for grouping, null equals null — the opposite of comparisons, where null is unknown). No row is silently dropped.
The signature is variadic for forward compatibility, but only a single key column is supported for now; multiple names are an error.
GroupBy returns no error itself so the call chains — df.GroupBy(...).Agg(...) would not compile against a multi-valued return. Errors (unknown column, unsupported key type, multiple names) are deferred and reported by Agg, the same pattern sql.Row.Scan uses.
Example ¶
ExampleDataframe_GroupBy groups rows by a key column and aggregates per group. Output columns keep the source column's name unless renamed with As; groups appear in first-appearance order.
package main
import (
"fmt"
"log"
"github.com/gverdugo-dev/grizzly"
)
func main() {
type sale struct {
Store string `grizzly:"store"`
Price float64 `grizzly:"price"`
}
df, _ := grizzly.FromStructs([]sale{
{"north", 1.5}, {"south", 2.5}, {"north", 0.5}, {"south", 1.5},
})
out, err := df.GroupBy("store").Agg(
grizzly.Sum("price"),
grizzly.Avg("price").As("avg"),
)
if err != nil {
log.Fatal(err)
}
fmt.Print(out)
}
Output: store price avg north 2 1 south 4 2
func (Dataframe) Gt ¶
Gt returns a mask of the rows where the named column is greater than value. See Lt for the supported column types and errors.
func (Dataframe) Info ¶
Info returns a structural summary of the dataframe, in the spirit of pandas' DataFrame.info(): one line per column with its name, dtype, non-null count and approximate memory footprint.
func (Dataframe) Le ¶
Le returns a mask of the rows where the named column is less than or equal to value. See Lt for the supported column types and errors.
func (Dataframe) Lt ¶
Lt returns a mask of the rows where the named column is less than value. Supported on float64 and string columns (strings compare lexicographically, like Go's < operator).
It returns ErrColumnNotFound if the column does not exist and ErrTypeMismatch if the value's type does not match the column's or the column does not support ordering (bool).
func (Dataframe) Max ¶
Max returns the largest valid value in the numeric column with the given name. Null rows are skipped.
It returns ErrColumnNotFound if the column does not exist, ErrTypeMismatch if the column is not numeric, and ErrNoValidValues if the column is empty or all-null (the maximum of nothing does not exist).
func (Dataframe) Min ¶
Min returns the smallest valid value in the numeric column with the given name. Null rows are skipped.
It returns ErrColumnNotFound if the column does not exist, ErrTypeMismatch if the column is not numeric, and ErrNoValidValues if the column is empty or all-null (the minimum of nothing does not exist).
func (Dataframe) Ne ¶
Ne returns a mask of the rows where the named column differs from value. Implemented as Eq().Not(), which preserves Kleene semantics: a null row is unknown for Ne too (null != x is not true in SQL either).
It returns ErrColumnNotFound if the column does not exist and ErrTypeMismatch if the value's type does not match the column's.
func (Dataframe) Select ¶
Select returns a new Dataframe containing the named columns, in the given order — column projection, SQL's SELECT clause (Where is the WHERE). Duplicate names are an error.
Columns are shared with the original dataframe, not copied: columns are immutable after construction, so sharing is safe and Select is O(columns) regardless of row count. polars does the same with its immutable buffers.
It returns ErrColumnNotFound if any requested column does not exist.
Example ¶
ExampleDataframe_Select projects columns by name, in the requested order, without copying any data.
package main
import (
"fmt"
"github.com/gverdugo-dev/grizzly"
)
func main() {
type city struct {
Name string `grizzly:"city"`
Temp float64 `grizzly:"temp"`
}
df, _ := grizzly.FromStructs([]city{
{"madrid", 21.5}, {"bilbao", 12.5},
})
out, _ := df.Select("temp", "city")
fmt.Print(out)
}
Output: temp city 21.5 madrid 12.5 bilbao
func (Dataframe) Sort ¶
Sort returns a new Dataframe with the rows ordered by the named column, ascending. Null rows sort first (also under SortDesc — polars' rule: the holes group at the top in both directions). The sort is stable: rows with equal keys keep their original relative order.
Columnar engines do not reorder rows in place: Sort orders a permutation of row indices and then gathers every column through it once — the same indices-then-gather pattern GroupBy uses.
It returns ErrColumnNotFound if the column does not exist.
Example ¶
ExampleDataframe_Sort orders rows by a column, returning a new dataframe. Nulls sort first, in both directions.
package main
import (
"fmt"
"strings"
"github.com/gverdugo-dev/grizzly"
)
func main() {
data := `[
{"city": "sevilla", "temp": 35.5},
{"city": "bilbao", "temp": null},
{"city": "madrid", "temp": 21.5}
]`
schema := grizzly.Schema{
{Name: "city", Type: grizzly.String},
{Name: "temp", Type: grizzly.Float64},
}
df, _ := grizzly.FromJSONReader(strings.NewReader(data), schema)
out, _ := df.Sort("temp")
fmt.Print(out)
}
Output: city temp bilbao null madrid 21.5 sevilla 35.5
func (Dataframe) SortDesc ¶
SortDesc returns a new Dataframe with the rows ordered by the named column, descending. Null rows still sort first; see Sort.
func (Dataframe) String ¶
String renders the dataframe as an aligned text table.
Implementing this single method makes Dataframe satisfy fmt.Stringer, so the whole fmt package picks it up automatically: fmt.Println(df) prints this table. Output is truncated at maxPrintRows.
func (Dataframe) Sum ¶
Sum returns the sum of the numeric column with the given name.
Null rows are skipped (SQL aggregate semantics): the sum of the valid values is returned, and an all-null column sums to 0.
The summation is pairwise (recursive halving with sequential leaves), so the floating-point rounding error grows O(ε·log n) instead of the naive loop's O(ε·n) — the same algorithm NumPy uses, and the reason pandas/polars agreed with each other (and not with grizzly's old sequential loop) on the benchmark checksum's last decimals.
It returns ErrColumnNotFound if the column does not exist, and ErrTypeMismatch if the column is not numeric.
Example ¶
ExampleDataframe_Sum shows the whole-column aggregations. Nulls are skipped, SQL-style: Avg divides by the count of valid values.
package main
import (
"fmt"
"github.com/gverdugo-dev/grizzly"
)
func main() {
temp, _ := grizzly.NewFloat64ColumnWithNulls("temp",
[]float64{21.5, 0, 28.25}, []bool{true, false, true})
df, _ := grizzly.NewDataframe(temp)
sum, _ := df.Sum("temp")
avg, _ := df.Avg("temp")
n, _ := df.Count("temp")
fmt.Println(sum, avg, n)
}
Output: 49.75 24.875 2
func (Dataframe) ToCSV ¶
ToCSV writes the dataframe to a CSV file with a header row, creating or truncating the file. It delegates to ToCSVWriter; see that function for the format and null rules.
func (Dataframe) ToCSVWriter ¶
ToCSVWriter writes the dataframe to a stream as CSV: one header row with the column names, then one record per row, columns in dataframe order.
Null rule (mirror of FromCSVReader): a null is written as an empty cell. For Float64 and Bool columns this round-trips exactly — an empty cell loads back as a null. For String columns it does NOT: grizzly deliberately reads an empty string cell as a real "" (see FromCSVReader), so a null string written by ToCSVWriter comes back as a valid empty string. Use ToJSONWriter when exact null round-trips matter; a configurable null marker may be added later.
Like ToJSONWriter, it writes bytes straight into a buffered writer instead of going through encoding/csv: csv.Writer only accepts []string records, which forced one string allocation per numeric cell (strconv.FormatFloat — the writer benchmark's whole allocation count). Here floats render via strconv.AppendFloat into one reused scratch buffer. Quoting follows encoding/csv's exact rules (see csvFieldNeedsQuotes), so the output is byte-for-byte what the previous implementation produced.
Example ¶
ExampleDataframe_ToCSVWriter writes the dataframe as CSV. Nulls become empty cells (note bilbao's missing temp), which round-trip for float64 and bool columns — see ToCSVWriter for the string caveat.
package main
import (
"log"
"os"
"strings"
"github.com/gverdugo-dev/grizzly"
)
func main() {
src := `[{"city": "madrid", "temp": 21.5}, {"city": "bilbao", "temp": null}]`
schema := grizzly.Schema{
{Name: "city", Type: grizzly.String},
{Name: "temp", Type: grizzly.Float64},
}
df, _ := grizzly.FromJSONReader(strings.NewReader(src), schema)
if err := df.ToCSVWriter(os.Stdout); err != nil {
log.Fatal(err)
}
}
Output: city,temp madrid,21.5 bilbao,
func (Dataframe) ToJSON ¶
ToJSON writes the dataframe to a JSON file as an array of objects, creating or truncating the file. It delegates to ToJSONWriter; see that function for the format and null rules.
func (Dataframe) ToJSONWriter ¶
ToJSONWriter writes the dataframe to a stream as a compact JSON array of objects, one object per row with the columns as keys, in dataframe order — the exact shape FromJSONReader loads.
Null rule: a null row is written as a literal null, so the output round-trips exactly through FromJSONReader (a property CSV cannot offer for string columns; see ToCSVWriter).
Like FromJSONReader, it streams at token level: rows are emitted straight to the buffered writer, never materialized as map[string]any (which would heap-allocate one map per row and box every value).
JSON has no NaN or Infinity: a Float64 column holding one (possible via FromStructs) is an error rather than invalid output.
Example ¶
ExampleDataframe_ToJSONWriter writes the dataframe as a compact JSON array of objects — the exact shape FromJSONReader loads, with literal nulls, so the output round-trips exactly.
package main
import (
"log"
"os"
"strings"
"github.com/gverdugo-dev/grizzly"
)
func main() {
src := `[{"city": "madrid", "temp": 21.5}, {"city": "bilbao", "temp": null}]`
schema := grizzly.Schema{
{Name: "city", Type: grizzly.String},
{Name: "temp", Type: grizzly.Float64},
}
df, _ := grizzly.FromJSONReader(strings.NewReader(src), schema)
if err := df.ToJSONWriter(os.Stdout); err != nil {
log.Fatal(err)
}
}
Output: [{"city":"madrid","temp":21.5},{"city":"bilbao","temp":null}]
func (Dataframe) Where ¶
Where returns a new Dataframe containing the rows whose mask bit is valid AND true — SQL WHERE semantics: an unknown (a comparison that involved a null) is not true, so the row is dropped.
This is the single materialization point of a filter: every column's surviving values (and their validity) are gathered into fresh buffers. It returns an error if the mask's length does not match the dataframe's row count.
Example ¶
ExampleDataframe_Where filters rows by combining comparison masks: build masks with the comparators, combine with And/Or/Not, materialize once with Where. Errors are elided to keep the example focused.
package main
import (
"fmt"
"github.com/gverdugo-dev/grizzly"
)
func main() {
type city struct {
Name string `grizzly:"city"`
Temp float64 `grizzly:"temp"`
}
df, _ := grizzly.FromStructs([]city{
{"madrid", 21.5}, {"bilbao", 12.5}, {"valencia", 28.25}, {"sevilla", 35.5},
})
warm, _ := df.Gt("temp", 20.0)
mild, _ := df.Lt("temp", 30.0)
out, _ := df.Where(warm.And(mild))
fmt.Print(out)
}
Output: city temp madrid 21.5 valencia 28.25
type Float64Column ¶
type Float64Column struct {
// contains filtered or unexported fields
}
Float64Column is a column of float64 values backed by a contiguous slice.
Nulls are tracked by a validity bitmap (see bitmap.go): at null rows the values slice holds a placeholder (0.0) that operations never read. A nil bitmap means the column has no nulls.
func NewFloat64Column ¶
func NewFloat64Column(name string, values []float64) *Float64Column
NewFloat64Column returns a Float64Column with the given name and values.
The slice is stored as-is (not copied): the caller must not mutate it after construction. The column has no nulls; use NewFloat64ColumnWithNulls to mark some rows as null.
func NewFloat64ColumnWithNulls ¶
func NewFloat64ColumnWithNulls(name string, values []float64, valid []bool) (*Float64Column, error)
NewFloat64ColumnWithNulls returns a Float64Column where valid[i] reports whether values[i] is a real value (true) or a null (false). values and valid must have the same length; otherwise an error is returned.
The []bool mask is the ergonomic boundary representation: it is compacted here, once, into the internal validity bitmap (and dropped entirely when it contains no false entries). The value stored at a null row is kept as a placeholder that operations never read.
func (*Float64Column) IsValid ¶
func (c *Float64Column) IsValid(i int) bool
IsValid reports whether row i holds a value (true) or a null (false). It panics if i is out of range.
func (*Float64Column) Len ¶
func (c *Float64Column) Len() int
Len returns the number of values in the column.
func (*Float64Column) NullCount ¶
func (c *Float64Column) NullCount() int
NullCount returns the number of null rows in the column.
type GroupedDataframe ¶
type GroupedDataframe struct {
// contains filtered or unexported fields
}
GroupedDataframe is the intermediate result of Dataframe.GroupBy: the source dataframe plus the grouping decision, waiting for Agg to say which aggregations to compute.
It is cheap: grouping does not split the data into per-group copies, it only stamps every row with a group id (the factorize pattern). groupIDs is a flat []int parallel to the rows, and firstRows remembers where each group's key first appeared — which both makes the output deterministic (Go maps iterate in random order on purpose) and lets Agg rebuild the key column by gathering those rows.
func (GroupedDataframe) Agg ¶
func (g GroupedDataframe) Agg(specs ...AggSpec) (Dataframe, error)
Agg computes the given aggregations over the groups and returns the result as a new Dataframe: the key column first (one row per group, in first-appearance order, named like the original), then one float64 column per spec — named after its source column, or as renamed with AggSpec.As (required when the same column is aggregated twice: duplicate output names are an error).
Null semantics mirror the whole-column aggregations, translated to per-group form: nulls are skipped; a group whose values are all null yields 0 for Sum and Count (the sum/count of nothing) and null for Avg, Min and Max (the per-group equivalent of ErrNoValidValues — a group cannot return an error, so it returns null).
Sum, Avg, Min and Max require a float64 column; Count accepts any. It returns ErrColumnNotFound for unknown columns and ErrTypeMismatch for unsupported ones.
type Mask ¶
type Mask struct {
// contains filtered or unexported fields
}
Mask is the result of a comparison over a Dataframe column: one bit per row telling whether that row satisfies the condition. Masks are produced by the Dataframe comparators (Eq, Ne, Lt, Le, Gt, Ge), combined with And, Or and Not, and finally consumed by Dataframe.Where, which materializes the surviving rows — one materialization no matter how many conditions were combined.
A mask carries Kleene three-valued logic: comparing a null row yields neither true nor false but unknown, tracked in a validity bitmap exactly like a column's (nil = no unknowns). Combinators implement the Kleene truth tables at word level — 64 rows per instruction — and Where keeps only rows whose bit is valid AND true, mirroring SQL's WHERE.
func (Mask) And ¶
And returns the logical AND of the two masks under Kleene three-valued logic: the result is known when both operands are known, or when either known operand is false (false decides an AND on its own — false AND unknown is false, not unknown). It panics if the lengths differ.
func (Mask) Not ¶
Not returns the logical negation of the mask. Unknown stays unknown (NOT null is null, per Kleene), so the validity bitmap is shared untouched — only the value bits flip.
type Schema ¶
type Schema []Field
Schema declares the columns to load from an untyped source (CSV, JSON): which ones, their types, and their order in the resulting Dataframe.
grizzly never guesses types — a "08001" zip code stays a string because the user said so, not an int because it looked like one. The schema is a slice (not a map) so the user also controls column order, which untyped sources can't provide reliably: JSON objects are unordered by definition.
schema := grizzly.Schema{
{Name: "product", Type: grizzly.String},
{Name: "price", Type: grizzly.Float64},
}
type StringColumn ¶
type StringColumn struct {
// contains filtered or unexported fields
}
StringColumn is a column of string values backed by a contiguous slice.
Nulls are tracked by a validity bitmap (see bitmap.go): at null rows the values slice holds a placeholder ("") that operations never read. A nil bitmap means the column has no nulls.
func NewStringColumn ¶
func NewStringColumn(name string, values []string) *StringColumn
NewStringColumn returns a StringColumn with the given name and values.
The slice is stored as-is (not copied): the caller must not mutate it after construction. The column has no nulls; use NewStringColumnWithNulls to mark some rows as null.
func NewStringColumnWithNulls ¶
func NewStringColumnWithNulls(name string, values []string, valid []bool) (*StringColumn, error)
NewStringColumnWithNulls returns a StringColumn where valid[i] reports whether values[i] is a real value (true) or a null (false). values and valid must have the same length; otherwise an error is returned.
The []bool mask is the ergonomic boundary representation: it is compacted here, once, into the internal validity bitmap (and dropped entirely when it contains no false entries). The value stored at a null row is kept as a placeholder that operations never read.
func (*StringColumn) IsValid ¶
func (c *StringColumn) IsValid(i int) bool
IsValid reports whether row i holds a value (true) or a null (false). It panics if i is out of range.
func (*StringColumn) Len ¶
func (c *StringColumn) Len() int
Len returns the number of values in the column.
func (*StringColumn) NullCount ¶
func (c *StringColumn) NullCount() int
NullCount returns the number of null rows in the column.