package ggstat
v0.0.0-...-abd1f79
Published: Mar 23, 2017 License: BSD-3-Clause Imports: 10 Imported by: 6

Documentation

Index

Constants

This section is empty.

Variables

This section is empty.

Functions

func Agg

func Agg(xs ...string) func(aggs ...Aggregator) Aggregate

Agg constructs an Aggregate transform from a grouping column and a set of Aggregators.

TODO: Does this belong in ggstat? The specific aggregator functions probably do, but the concept could go in package table.

Types

type Aggregate

type Aggregate struct {
	// Xs is the list of column names to group values by before
	// computing aggregate functions.
	Xs []string

	// Aggregators is the set of Aggregator functions to apply to
	// each group of values.
	Aggregators []Aggregator
}

Aggregate computes aggregate functions of a table grouped by distinct values of a column or set of columns.

Aggregate first groups the table by the Xs columns. Each group produces a single row in the output table containing the unique value of each of the Xs columns, constant columns from the input, and any columns that have a single value within every group (they're "effectively" constant). The remaining output columns are produced by applying the Aggregator functions to the group.
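The grouping semantics can be sketched in plain Go. This is an illustrative stand-in for Agg("dist")(AggMean("ns")), not the ggstat/table API; the row type and column names are hypothetical.

```go
package main

import "fmt"

// row is a stand-in for a table row; "dist" is the grouping
// column and "ns" the value column (names are illustrative).
type row struct {
	dist string
	ns   float64
}

// aggregateMean groups rows by the dist column and computes the
// mean of ns within each group, producing one output row (here, a
// map entry) per distinct dist value.
func aggregateMean(rows []row) map[string]float64 {
	sum := map[string]float64{}
	n := map[string]int{}
	for _, r := range rows {
		sum[r.dist] += r.ns
		n[r.dist]++
	}
	out := map[string]float64{}
	for k := range sum {
		out[k] = sum[k] / float64(n[k])
	}
	return out
}

func main() {
	rows := []row{{"a", 1}, {"a", 3}, {"b", 10}}
	fmt.Println(aggregateMean(rows)) // map[a:2 b:10]
}
```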

func (Aggregate) F

type Aggregator

type Aggregator func(input table.Grouping, output *table.Builder)

An Aggregator is a function that aggregates each group of input into one row and adds it to output. It may be based on multiple columns from input and may add multiple columns to output.

func AggCount

func AggCount(label string) Aggregator

AggCount returns an aggregate function that computes the number of rows in each group. The resulting column will be named label, or "count" if label is "".

func AggGeoMean

func AggGeoMean(cols ...string) Aggregator

AggGeoMean returns an aggregate function that computes the geometric mean of each of cols. The resulting columns will be named "geomean <col>" and will have the same type as <col>.

func AggMax

func AggMax(cols ...string) Aggregator

AggMax returns an aggregate function that computes the maximum of each of cols. The resulting columns will be named "max <col>" and will have the same type as <col>.

func AggMean

func AggMean(cols ...string) Aggregator

AggMean returns an aggregate function that computes the mean of each of cols. The resulting columns will be named "mean <col>" and will have the same type as <col>.

func AggMin

func AggMin(cols ...string) Aggregator

AggMin returns an aggregate function that computes the minimum of each of cols. The resulting columns will be named "min <col>" and will have the same type as <col>.

func AggQuantile

func AggQuantile(prefix string, quantile float64, cols ...string) Aggregator

AggQuantile returns an aggregate function that computes a quantile of each of cols. quantile has a range of [0,1]. The resulting columns will be named "<prefix> <col>" and will have the same type as <col>.
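One common way to compute a quantile is linear interpolation between order statistics, sketched below; ggstat's exact interpolation rule may differ, so treat this as illustrative only.

```go
package main

import (
	"fmt"
	"math"
	"sort"
)

// quantile returns the q-th quantile (q in [0,1]) of xs, using
// linear interpolation between adjacent order statistics.
func quantile(xs []float64, q float64) float64 {
	s := append([]float64(nil), xs...) // don't mutate the input
	sort.Float64s(s)
	pos := q * float64(len(s)-1)
	i := int(math.Floor(pos))
	if i >= len(s)-1 {
		return s[len(s)-1]
	}
	frac := pos - float64(i)
	return s[i]*(1-frac) + s[i+1]*frac
}

func main() {
	xs := []float64{3, 1, 4, 1, 5}
	fmt.Println(quantile(xs, 0.5)) // median: 3
}
```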

func AggSum

func AggSum(cols ...string) Aggregator

AggSum returns an aggregate function that computes the sum of each of cols. The resulting columns will be named "sum <col>" and will have the same type as <col>.

func AggUnique

func AggUnique(cols ...string) Aggregator

AggUnique returns an aggregate function retains the unique value of each of cols within each aggregate group, or panics if some group contains more than one value for one of these columns.

Note that Aggregate will automatically retain columns that happen to be unique. AggUnique can be used to enforce at aggregation time that certain columns *must* be unique (and get a nice error if they are not).

type Bin

type Bin struct {
	// X is the name of the column to use for samples.
	X string

	// W is the optional name of the column to use for sample
	// weights. It may be "" to weight each sample as 1.
	W string

	// Width controls how wide each bin should be. If not provided
	// or 0, a width will be chosen to produce 30 bins. If X is an
	// integer column, this width will be treated as an integer as
	// well.
	Width float64

	// Breaks is the set of break points to use as boundaries
	// between bins. The interval of each bin is [Breaks[i],
	// Breaks[i+1]). Data points before the first break are
	// dropped. If provided, Width and Center are ignored.
	Breaks table.Slice

	// SplitGroups indicates that each group in the table should
	// have separate bounds based on the data in that group alone.
	// The default, false, indicates that the binning function
	// should use the bounds of all of the data combined. This
	// makes it easier to compare bins across groups.
	SplitGroups bool
}

XXX If this is just based on the number of bins, it can come up with really ugly boundary numbers. If the bin width is specified, then you could also specify the left edge and bins will be placed at [align+width*N, align+width*(N+1)]. ggplot2 also lets you specify the center alignment.

XXX In Matlab and NumPy, bins are open on the right *except* for the last bin, which is closed on both.

XXX Number of bins/bin width/specify boundaries, same bins across all groups/separate for each group/based on shared scales (don't have that information here), relative or absolute histogram (Matlab has lots more).

XXX Scale transform.

The result of Bin has two columns in addition to constant columns from the input:

- Column X is the left edge of the bin.

- Column W is the sum of the rows' weights, or column "count" is the number of rows in the bin.
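The fixed-width binning rule can be sketched in plain Go. This illustrates only the unweighted count case with an implicit origin at min; the real transform also handles weights, integer columns, and explicit Breaks.

```go
package main

import (
	"fmt"
	"math"
)

// binCounts assigns each sample to the half-open bin
// [min+width*i, min+width*(i+1)) and counts rows per bin,
// keyed by each bin's left edge.
func binCounts(xs []float64, min, width float64) map[float64]int {
	counts := map[float64]int{}
	for _, x := range xs {
		i := math.Floor((x - min) / width)
		counts[min+width*i]++
	}
	return counts
}

func main() {
	xs := []float64{0.5, 1.5, 1.9, 3.2}
	fmt.Println(binCounts(xs, 0, 1)) // map[0:1 1:2 3:1]
}
```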

func (Bin) F

func (b Bin) F(g table.Grouping) table.Grouping

type Density

type Density struct {
	// X is the name of the column to use for samples.
	X string

	// W is the optional name of the column to use for sample
	// weights. It may be "" to uniformly weight samples.
	W string

	// N is the number of points to sample the KDE at. If N is 0,
	// a reasonable default is used.
	//
	// TODO: This is particularly sensitive to the scale
	// transform.
	//
	// TODO: Base the default on the bandwidth. If the bandwidth
	// is really narrow, we may need a lot of samples to exceed
	// the Nyquist rate.
	N int

	// Domain specifies the domain at which to sample this function.
	// If Domain is nil, it defaults to DomainData{}.
	Domain FunctionDomainer

	// Kernel is the kernel to use for the KDE.
	Kernel stats.KDEKernel

	// Bandwidth is the bandwidth to use for the KDE.
	//
	// If this is zero, the bandwidth is computed from the data
	// using a default bandwidth estimator (currently
	// stats.BandwidthScott).
	Bandwidth float64

	// BoundaryMethod is the boundary correction method to use for
	// the KDE. The default value is BoundaryReflect; however, the
	// default bounds are effectively +/-inf, which is equivalent
	// to performing no boundary correction.
	BoundaryMethod stats.KDEBoundaryMethod

	// [BoundaryMin, BoundaryMax) specify a bounded support for
	// the KDE. If both are 0 (their default values), they are
	// treated as +/-inf.
	//
	// To specify a half-bounded support, set Min to math.Inf(-1)
	// or Max to math.Inf(1).
	BoundaryMin float64
	BoundaryMax float64
}

Density constructs a probability density estimate from a set of samples using kernel density estimation.

X is the only required field. All other fields have reasonable default zero values.

The result of Density has three columns in addition to constant columns from the input:

- Column X is the points at which the density estimate is sampled.

- Column "probability density" is the density estimate.

- Column "cumulative density" is the cumulative density estimate.

func (Density) F

type DomainData

type DomainData struct {
	// Widen expands the domain by Widen times the span of the
	// data.
	//
	// A value of 1.0 means to use exactly the bounds of the data.
	// If Widen is 0, it is treated as 1.1 (that is, widen the
	// domain by 10%, or 5% on the left and 5% on the right).
	Widen float64

	// SplitGroups indicates that each group in the table should
	// have a separate domain based on the data in that group
	// alone. The default, false, indicates that the domain should
	// be based on all of the data in the table combined. This
	// makes it possible to stack functions and easier to compare
	// them across groups.
	SplitGroups bool
}

DomainData is a FunctionDomainer that computes domains based on the bounds of the data.
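The Widen rule amounts to simple interval arithmetic, sketched below (an illustrative helper, not part of the package):

```go
package main

import "fmt"

// widenDomain applies DomainData's Widen rule: the span of the
// data is multiplied by widen, with the extra span split evenly
// between the two sides. (The package default of 1.1 widens a
// [0, 10] domain to [-0.5, 10.5].)
func widenDomain(min, max, widen float64) (float64, float64) {
	extra := (max - min) * (widen - 1) / 2
	return min - extra, max + extra
}

func main() {
	lo, hi := widenDomain(0, 10, 1.5)
	fmt.Println(lo, hi) // -2.5 12.5
}
```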

func (DomainData) FunctionDomain

func (r DomainData) FunctionDomain(g table.Grouping, col string) func(gid table.GroupID) (min, max float64)

type DomainFixed

type DomainFixed struct {
	Min, Max float64
}

DomainFixed is a FunctionDomainer that returns a fixed domain.

func (DomainFixed) FunctionDomain

func (r DomainFixed) FunctionDomain(g table.Grouping, col string) func(gid table.GroupID) (min, max float64)

type ECDF

type ECDF struct {
	// X is the name of the column to use for samples.
	X string

	// W is the optional name of the column to use for sample
	// weights. It may be "" to uniformly weight samples.
	W string

	// Label, if not "", gives a label for the samples. It is used
	// to construct more specific names for the output columns. It
	// should be a plural noun.
	Label string

	// Domain specifies the domain of the returned ECDF. If the
	// domain is wider than the bounds of the data in a group,
	// ECDF will add a point below the smallest sample and above
	// the largest sample to make the 0 and 1 levels clear. If
	// Domain is nil, it defaults to DomainData{}.
	Domain FunctionDomainer
}

ECDF constructs an empirical CDF from a set of samples.

X is the only required field. All other fields have reasonable default zero values.

The result of ECDF has three columns in addition to constant columns from the input. The names of the columns depend on whether Label is "".

- Column X is the points at which the CDF changes (a subset of the samples).

- Column "cumulative density" or "cumulative density of <label>" is the cumulative density estimate.

- Column "cumulative count" (if W and Label are ""), "cumulative weight" (if W is not "", but Label is "") or "cumulative <label>" (if Label is not "") is the cumulative count or weight of samples. That is, cumulative density times the total weight of the samples.
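The underlying quantity is the standard empirical CDF. A minimal unweighted sketch (the transform additionally tabulates only the points where the CDF changes and handles weights):

```go
package main

import (
	"fmt"
	"sort"
)

// ecdf returns the empirical CDF of xs evaluated at x: the
// fraction of samples <= x.
func ecdf(xs []float64, x float64) float64 {
	s := append([]float64(nil), xs...)
	sort.Float64s(s)
	// SearchFloat64s finds the first index with s[i] >= x;
	// advance past equal samples to count those <= x.
	n := sort.SearchFloat64s(s, x)
	for n < len(s) && s[n] == x {
		n++
	}
	return float64(n) / float64(len(s))
}

func main() {
	xs := []float64{1, 2, 2, 4}
	fmt.Println(ecdf(xs, 2)) // 0.75
}
```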

func (ECDF) F

func (s ECDF) F(g table.Grouping) table.Grouping

type Function

type Function struct {
	// X is the name of the column to use for input domain of this
	// function.
	X string

	// N is the number of points to sample the function at. If N
	// is 0, a reasonable default is used.
	N int

	// Domain specifies the domain at which to sample this function.
	// If Domain is nil, it defaults to DomainData{}.
	Domain FunctionDomainer

	// Fn is the continuous univariate function to sample. Fn will
	// be called with each table in the grouping and the X values
	// at which it should be sampled. Fn must add its output
	// columns to out. The output table will already contain the
	// sample points bound to the X column.
	Fn func(gid table.GroupID, in *table.Table, sampleAt []float64, out *table.Builder)
}

Function samples a continuous univariate function at N points in the domain computed by Domain.

The result of Function binds column X to the X values at which the function is sampled and retains constant columns from the input. The computed function can add arbitrary columns for its output.
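The sampling grid itself is just N evenly spaced points over the computed domain, as this illustrative helper shows (it is not the package's code):

```go
package main

import "fmt"

// samplePoints returns n evenly spaced points spanning
// [min, max]: the X values at which Function would invoke Fn.
// Assumes n >= 2.
func samplePoints(min, max float64, n int) []float64 {
	pts := make([]float64, n)
	for i := range pts {
		pts[i] = min + (max-min)*float64(i)/float64(n-1)
	}
	return pts
}

func main() {
	fmt.Println(samplePoints(0, 1, 5)) // [0 0.25 0.5 0.75 1]
}
```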

func (Function) F

type FunctionDomainer

type FunctionDomainer interface {
	// FunctionDomain computes the domain of a particular column
	// within a table. It takes a Grouping and a column in that
	// Grouping to compute the domain of and returns a function
	// that computes the domain for a specific group in the
	// Grouping. This makes it possible for FunctionDomain to
	// easily compute either Grouping-wide domains, or per-Table
	// domains.
	//
	// The returned domain may be (NaN, NaN) to indicate that
	// there is no data and the domain is vacuous.
	FunctionDomain(g table.Grouping, col string) func(gid table.GroupID) (min, max float64)
}

A FunctionDomainer computes the domain over which to evaluate a statistical function.

type LOESS

type LOESS struct {
	// X and Y are the names of the columns to use for X and Y
	// values of data points, respectively.
	X, Y string

	// N is the number of points to sample the regression at. If N
	// is 0, a reasonable default is used.
	N int

	// Domain specifies the domain at which to sample this function.
	// If Domain is nil, it defaults to DomainData{}.
	Domain FunctionDomainer

	// Degree specifies the degree of the local fit function. If
	// it is 0, it is treated as 2.
	Degree int

	// Span controls the smoothness of the fit. If it is 0, it is
	// treated as 0.5. The span must be between 0 and 1, where
	// smaller values fit the data more tightly.
	Span float64
}

LOESS constructs a locally-weighted least squares polynomial regression for the data (X, Y).

X and Y are required. All other fields have reasonable default zero values.

The result of LOESS has two columns in addition to constant columns from the input:

- Column X is the points at which the LOESS function is sampled.

- Column Y is the result of the LOESS function.

TODO: Confidence intervals/bootstrap distributions?

TODO: Robust LOESS? See https://www.mathworks.com/help/curvefit/smoothing-data.html#bq_6ys3-3

func (LOESS) F

type LeastSquares

type LeastSquares struct {
	// X and Y are the names of the columns to use for X and Y
	// values of data points, respectively.
	X, Y string

	// N is the number of points to sample the regression at. If N
	// is 0, a reasonable default is used.
	N int

	// Domain specifies the domain at which to sample this function.
	// If Domain is nil, it defaults to DomainData{}.
	Domain FunctionDomainer

	// Degree specifies the degree of the fit polynomial. If it is
	// 0, it is treated as 1.
	Degree int
}

LeastSquares constructs a least squares polynomial regression for the data (X, Y).

X and Y are required. All other fields have reasonable default zero values.

The result of LeastSquares has two columns in addition to constant columns from the input:

- Column X is the points at which the fit function is sampled.

- Column Y is the result of the fit function.

TODO: Confidence intervals/bootstrap distributions?
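For the default degree of 1 the fit has the familiar closed form, sketched here in plain Go (LeastSquares generalizes this to arbitrary-degree polynomials and then samples the fit over the domain):

```go
package main

import "fmt"

// fitLine computes the ordinary least squares line y = a + b*x
// for the points (xs[i], ys[i]) from the standard normal
// equations for simple linear regression.
func fitLine(xs, ys []float64) (a, b float64) {
	n := float64(len(xs))
	var sx, sy, sxx, sxy float64
	for i := range xs {
		sx += xs[i]
		sy += ys[i]
		sxx += xs[i] * xs[i]
		sxy += xs[i] * ys[i]
	}
	b = (n*sxy - sx*sy) / (n*sxx - sx*sx)
	a = (sy - b*sx) / n
	return a, b
}

func main() {
	// Exact data y = 1 + 2x.
	a, b := fitLine([]float64{0, 1, 2}, []float64{1, 3, 5})
	fmt.Println(a, b) // 1 2
}
```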

func (LeastSquares) F

type Normalize

type Normalize struct {
	// X is the name of the column to use to find the denominator
	// row. If X is "", Index is used instead.
	X string

	// Index is the row index of the denominator row if X is ""
	// (otherwise it is ignored). Index may be negative, in which
	// case it is added to the number of rows (e.g., -1 is the
	// last row).
	Index int

	// By is a function func([]T) int that returns the index of
	// the denominator row given column X. By may be nil, in which
	// case it defaults to generic.ArgMin.
	By interface{}

	// Cols is a slice of the names of columns to normalize
	// relative to the corresponding DenomCols value in the
	// denominator row. Cols may be nil, in which case it defaults
	// to all integral and floating point columns.
	Cols []string

	// DenomCols is a slice of the names of columns used as the
	// denominator. DenomCols may be nil, in which case it
	// defaults to Cols (i.e. each column will be normalized to
	// the value from that column in the denominator row.)
	// Otherwise, DenomCols must be the same length as Cols.
	DenomCols []string
}

Normalize normalizes each group such that some data point is 1.

Either X or Index is required (though 0 is a reasonable value of Index).

The result of Normalize is the same as the input table, plus additional columns for each normalized column. These columns will be named "normalized <col>" where <col> is the name of the original column and will have type []float64.
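The per-column arithmetic can be sketched in plain Go for the simplest case, Index set and Cols equal to DenomCols (an illustrative helper, not the package's API):

```go
package main

import "fmt"

// normalizeBy divides each value in col by the value at the
// denominator row index. As in Normalize, a negative index counts
// from the end, and the result is float64.
func normalizeBy(col []float64, index int) []float64 {
	if index < 0 {
		index += len(col) // e.g., -1 is the last row
	}
	denom := col[index]
	out := make([]float64, len(col))
	for i, v := range col {
		out[i] = v / denom
	}
	return out
}

func main() {
	fmt.Println(normalizeBy([]float64{2, 4, 8}, 0))  // [1 2 4]
	fmt.Println(normalizeBy([]float64{2, 4, 8}, -1)) // [0.25 0.5 1]
}
```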

func (Normalize) F
