synth

package
v0.10.2
Published: May 21, 2026 License: MIT Imports: 16 Imported by: 0

Documentation

Overview

Package synth produces synthetic .pulse cohorts from either a schema declaration ("from-schema") or a statistical profile of a real cohort ("from-profile"). The generator is deterministic given a seed and writes directly into the .pulse binary format using the encoding package, so outputs are byte-identical for the same (spec, seed) pair.

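Two top-level entry points are exported by this package, each with an in-memory variant (SynthBytes and ProfileBytes):

Synth(fs afero.Fs, spec *Spec, output string, opts Options) (*Result, error)
ProfileFile(fs afero.Fs, path string, opts ProfileOptions) (*Profile, error)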

Pulse embedders should use the higher-level pulse.Pulse.Synth and pulse.Pulse.Profile facade methods instead.

Constants

const (
	DistUniform             = "uniform"
	DistNormal              = "normal"
	DistLogNormal           = "lognormal"
	DistExponential         = "exponential"
	DistPoisson             = "poisson"
	DistPareto              = "pareto"
	DistBernoulli           = "bernoulli"
	DistMonotonicFrom       = "monotonic_from"
	DistWeightedCategorical = "weighted_categorical"
	DistUniformDate         = "uniform_date"
	DistRegex               = "regex"
	DistConstant            = "constant"
)

Distribution kind constants.

Variables

This section is empty.

Functions

func AllDistributions

func AllDistributions() []string

AllDistributions returns the registered kind names in sorted order. Used by the manifest and tests.

Types

type CategoricalProfile

type CategoricalProfile struct {
	Cardinality int           `json:"cardinality"`
	Top         []CategoryHit `json:"top"`
}

CategoricalProfile holds the top-K observed values and the total distinct count.
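A minimal sketch of how such a summary could be computed is shown below. The types are local mirrors of the ones documented here, `profileCategorical` is a hypothetical helper, and the sketch assumes that Weight is the relative frequency of a value, which the documentation does not state.

```go
package main

import (
	"fmt"
	"sort"
)

// Local mirrors of the documented types (JSON tags omitted).
type CategoryHit struct {
	Value  string
	Weight float64
}

type CategoricalProfile struct {
	Cardinality int
	Top         []CategoryHit
}

// profileCategorical (hypothetical) counts values and keeps the k most
// frequent ones. Assumption: Weight is the value's relative frequency.
func profileCategorical(values []string, k int) CategoricalProfile {
	counts := make(map[string]int)
	for _, v := range values {
		counts[v]++
	}
	hits := make([]CategoryHit, 0, len(counts))
	for v, c := range counts {
		hits = append(hits, CategoryHit{Value: v, Weight: float64(c) / float64(len(values))})
	}
	// Descending weight; ties broken by value so the order is deterministic
	// despite random map iteration.
	sort.Slice(hits, func(i, j int) bool {
		if hits[i].Weight != hits[j].Weight {
			return hits[i].Weight > hits[j].Weight
		}
		return hits[i].Value < hits[j].Value
	})
	if len(hits) > k {
		hits = hits[:k]
	}
	return CategoricalProfile{Cardinality: len(counts), Top: hits}
}

func main() {
	p := profileCategorical([]string{"a", "b", "a", "c", "a", "b"}, 2)
	fmt.Printf("cardinality=%d top=%v\n", p.Cardinality, p.Top)
}
```

Cardinality counts all distinct values, including those that fall outside the top K, so a reader can tell how much of the tail was dropped.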

type CategoryHit

type CategoryHit struct {
	Value  string  `json:"value"`
	Weight float64 `json:"weight"`
}

CategoryHit records a categorical value with its observed weight.

type ConstraintSpec

type ConstraintSpec struct {
	Expr string `json:"expr"`
}

ConstraintSpec wraps an expression evaluated against the in-memory row.

type CorrelationSpec

type CorrelationSpec struct {
	A           string  `json:"a"`
	B           string  `json:"b"`
	Correlation float64 `json:"correlation"`
}

CorrelationSpec declares a target Pearson correlation between two numeric fields.

type CorrelationStat

type CorrelationStat struct {
	A   string  `json:"a"`
	B   string  `json:"b"`
	Rho float64 `json:"rho"`
}

CorrelationStat is a captured pairwise correlation entry.

type DateProfile

type DateProfile struct {
	Start    string `json:"start"`
	End      string `json:"end"`
	Weekdays [7]int `json:"weekdays"`
}

DateProfile holds the (start, end) range of date values plus a weekday histogram for Mode-A reconstruction.
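As an illustration, the sketch below builds such a summary from a list of date strings. It makes two assumptions that the documentation does not confirm: dates use the `2006-01-02` layout, and the histogram is indexed by Go's time.Weekday (Sunday = 0). `profileDates` is a hypothetical helper.

```go
package main

import (
	"fmt"
	"time"
)

// Local mirror of the documented type (JSON tags omitted).
type DateProfile struct {
	Start    string
	End      string
	Weekdays [7]int
}

// layout is an assumed date format; the real format is not documented here.
const layout = "2006-01-02"

// profileDates (hypothetical) records the earliest and latest date and
// counts how many dates fall on each weekday, indexed by time.Weekday.
func profileDates(dates []string) (DateProfile, error) {
	var p DateProfile
	var minT, maxT time.Time
	for i, s := range dates {
		t, err := time.Parse(layout, s)
		if err != nil {
			return DateProfile{}, fmt.Errorf("date %q: %w", s, err)
		}
		if i == 0 || t.Before(minT) {
			minT = t
		}
		if i == 0 || t.After(maxT) {
			maxT = t
		}
		p.Weekdays[t.Weekday()]++
	}
	if len(dates) > 0 {
		p.Start = minT.Format(layout)
		p.End = maxT.Format(layout)
	}
	return p, nil
}

func main() {
	p, err := profileDates([]string{"2024-01-01", "2024-01-02", "2024-01-08", "2023-12-31"})
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("%s..%s weekdays=%v\n", p.Start, p.End, p.Weekdays)
}
```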

type FieldProfile

type FieldProfile struct {
	Name        string              `json:"name"`
	Type        string              `json:"type"`
	Description string              `json:"description,omitempty"`
	NullRate    float64             `json:"null_rate"`
	Numeric     *NumericProfile     `json:"numeric,omitempty"`
	Categorical *CategoricalProfile `json:"categorical,omitempty"`
	Date        *DateProfile        `json:"date,omitempty"`
	// Precision/Scale carry decimal128 metadata so synth-from-profile can
	// reconstruct the original field shape.
	Precision uint8 `json:"precision,omitempty"`
	Scale     uint8 `json:"scale,omitempty"`
}

FieldProfile holds per-field summary statistics. Exactly one of Numeric, Categorical, or Date is populated based on the field's type.

type FieldSpec

type FieldSpec struct {
	Name         string         `json:"name"`
	Type         string         `json:"type"`
	Nullable     bool           `json:"nullable,omitempty"`
	Description  string         `json:"description,omitempty"`
	Distribution string         `json:"distribution"`
	Params       map[string]any `json:"params,omitempty"`

	// Precision and Scale apply to decimal128.
	Precision uint8 `json:"precision,omitempty"`
	Scale     uint8 `json:"scale,omitempty"`

	// NullRate is the per-row probability that the field will be null.
	// Only meaningful when Nullable is true; ignored otherwise.
	NullRate float64 `json:"null_rate,omitempty"`
}

FieldSpec is a single column declaration.

type NumericProfile

type NumericProfile struct {
	Min  float64 `json:"min"`
	Max  float64 `json:"max"`
	Mean float64 `json:"mean"`
	Std  float64 `json:"std"`
	// Percentiles holds {p1, p5, p25, p50, p75, p95, p99} when
	// IncludeStats was on; nil otherwise.
	Percentiles []float64 `json:"percentiles,omitempty"`
}

NumericProfile is the detail block for numeric fields.

type Options

type Options struct {
	// Seed makes the output deterministic. Same spec + same seed must
	// produce a byte-identical .pulse file.
	Seed int64
}

Options modulates how the spec is realized.

type Profile

type Profile struct {
	RowCount int               `json:"row_count"`
	Fields   []FieldProfile    `json:"fields"`
	Pairwise []CorrelationStat `json:"pairwise,omitempty"`
	Warnings []string          `json:"warnings,omitempty"`
	Meta     map[string]any    `json:"meta,omitempty"`
}

Profile is a serialization-friendly statistical summary of a cohort. It contains everything needed to drive synth from-profile without retaining any individual rows from the source data.

func ProfileBytes

func ProfileBytes(data []byte, opts ProfileOptions) (*Profile, error)

ProfileBytes summarizes a .pulse file given its raw bytes.

func ProfileFile

func ProfileFile(fs afero.Fs, path string, opts ProfileOptions) (*Profile, error)

ProfileFile reads a .pulse file from fs and produces a Profile.

func (*Profile) MarshalJSON

func (p *Profile) MarshalJSON() ([]byte, error)

MarshalJSON serializes the profile.

type ProfileOptions

type ProfileOptions struct {
	// TopK is the number of top categorical values to capture per
	// categorical field. Defaults to 32 when zero.
	TopK int
	// IncludeStats turns on percentile / stdev / kurtosis collection.
	// When false, only mean / min / max / null-rate are recorded for
	// numeric fields. Defaults to true.
	IncludeStats bool
	// IncludeCorrelations enables pairwise Pearson correlation capture
	// between numeric fields. Off by default to keep profile size bounded.
	IncludeCorrelations bool
	// CorrelationTopK caps the number of strongest |rho| pairs retained.
	// Defaults to 16 when IncludeCorrelations is true.
	CorrelationTopK int
	// SampleLimit caps the number of records ingested for the profile.
	// Zero = unlimited.
	SampleLimit int
}

ProfileOptions modulates how Profile summarizes a cohort.
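The documented zero-value defaults can be sketched as a normalization step. `withDefaults` and `recordsToIngest` are hypothetical helpers operating on a local mirror of the struct. The sketch leaves IncludeStats untouched: a plain Go bool has a zero value of false, and how the package realizes a default of true is not described here.

```go
package main

import "fmt"

// Local mirror of the documented type.
type ProfileOptions struct {
	TopK                int
	IncludeStats        bool
	IncludeCorrelations bool
	CorrelationTopK     int
	SampleLimit         int
}

// withDefaults (hypothetical) fills in the documented defaults for fields
// left at zero. IncludeStats is deliberately not modified.
func withDefaults(o ProfileOptions) ProfileOptions {
	if o.TopK == 0 {
		o.TopK = 32
	}
	if o.IncludeCorrelations && o.CorrelationTopK == 0 {
		o.CorrelationTopK = 16
	}
	return o
}

// recordsToIngest (hypothetical) applies SampleLimit, where zero means
// unlimited.
func recordsToIngest(total int, o ProfileOptions) int {
	if o.SampleLimit > 0 && total > o.SampleLimit {
		return o.SampleLimit
	}
	return total
}

func main() {
	o := withDefaults(ProfileOptions{IncludeCorrelations: true, SampleLimit: 500})
	fmt.Printf("%+v\n", o)
	fmt.Println(recordsToIngest(2000, o))
}
```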

type Result

type Result struct {
	RowsGenerated int      `json:"rows_generated"`
	RowsRejected  int      `json:"rows_rejected"`
	OutputPath    string   `json:"output_path"`
	Warnings      []string `json:"warnings,omitempty"`
}

Result is what the writer reports after a successful Synth call.

func Synth

func Synth(fs afero.Fs, spec *Spec, output string, opts Options) (*Result, error)

Synth materializes a synthetic .pulse file from a Spec into the given path on the provided filesystem. It is the package-level entry point; pulse.Pulse.Synth wraps it.

func SynthBytes

func SynthBytes(spec *Spec, opts Options) ([]byte, *Result, error)

SynthBytes returns the byte slice that Synth would have written, without touching the filesystem. Useful for round-trip tests and embedders that stream the output elsewhere.

type Spec

type Spec struct {
	// RowCount is the number of rows to generate. Required (>0).
	RowCount int `json:"row_count"`

	// Fields declares each column with its type and distribution.
	Fields []FieldSpec `json:"fields"`

	// Constraints, if non-empty, are evaluated per row by the expression
	// evaluator. Rows that fail any constraint are rejected and re-drawn.
	Constraints []ConstraintSpec `json:"constraints,omitempty"`

	// MaxRejectionRate caps the fraction of rejected rows during
	// constraint-driven rejection sampling. Defaults to 0.5 if zero.
	MaxRejectionRate float64 `json:"max_rejection_rate,omitempty"`

	// Correlations lists optional pairwise correlations to induce via
	// Gaussian copula post-processing. Each entry references two numeric
	// fields by name with a target Pearson correlation in [-1, 1]. Only
	// numeric fields can participate.
	Correlations []CorrelationSpec `json:"correlations,omitempty"`
}

Spec is the parsed top-level synthesis request. It is the in-memory shape of from-schema JSON. A from-profile call builds a Spec internally from the Profile and shares the rest of the writer pipeline.

func ParseSpec

func ParseSpec(raw []byte) (*Spec, error)

ParseSpec parses from-schema Spec JSON. It returns a SERVICE_VALIDATION error if the input does not have the expected shape.
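To make the JSON shape concrete, here is a self-contained sketch that decodes a spec into local mirrors of the documented types, using the JSON tags shown above. The `params` keys, the `type` strings, and the constraint expression syntax are hypothetical placeholders, since this documentation does not list them. `parseSpec` is a stand-in, not the package function, and returns plain errors.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
)

// Local mirrors of the documented types, trimmed to the fields used here.
type FieldSpec struct {
	Name         string         `json:"name"`
	Type         string         `json:"type"`
	Nullable     bool           `json:"nullable,omitempty"`
	Distribution string         `json:"distribution"`
	Params       map[string]any `json:"params,omitempty"`
	NullRate     float64        `json:"null_rate,omitempty"`
}

type ConstraintSpec struct {
	Expr string `json:"expr"`
}

type Spec struct {
	RowCount         int              `json:"row_count"`
	Fields           []FieldSpec      `json:"fields"`
	Constraints      []ConstraintSpec `json:"constraints,omitempty"`
	MaxRejectionRate float64          `json:"max_rejection_rate,omitempty"`
}

// exampleSpec uses hypothetical type names, params keys, and expression
// syntax; only the top-level JSON keys come from the documented struct tags.
const exampleSpec = `{
  "row_count": 1000,
  "fields": [
    {"name": "age", "type": "float64", "distribution": "normal",
     "params": {"mean": 40, "std": 12}},
    {"name": "plan", "type": "string", "nullable": true, "null_rate": 0.1,
     "distribution": "constant", "params": {"value": "free"}}
  ],
  "constraints": [{"expr": "age >= 18"}],
  "max_rejection_rate": 0.25
}`

// parseSpec (stand-in) decodes the JSON and applies the documented
// "RowCount is required (>0)" rule plus a non-empty fields check.
func parseSpec(raw []byte) (*Spec, error) {
	var s Spec
	if err := json.Unmarshal(raw, &s); err != nil {
		return nil, fmt.Errorf("decode spec: %w", err)
	}
	if s.RowCount <= 0 {
		return nil, errors.New("row_count must be > 0")
	}
	if len(s.Fields) == 0 {
		return nil, errors.New("fields must not be empty")
	}
	return &s, nil
}

func main() {
	s, err := parseSpec([]byte(exampleSpec))
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(s.RowCount, len(s.Fields), s.Constraints[0].Expr)
}
```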

func SpecFromProfile

func SpecFromProfile(p *Profile, rowCount int) *Spec

SpecFromProfile builds a Spec the synth pipeline can execute. Numeric fields are reconstructed as normal distributions parameterized by the profiled mean and std and clipped to the observed [min, max], categorical fields as weighted_categorical, and date fields as uniform_date over the observed range.
