Documentation
Overview ¶
Package synth produces synthetic .pulse cohorts from either a schema declaration ("from-schema") or a statistical profile of a real cohort ("from-profile"). The generator is deterministic given a seed and writes directly into the .pulse binary format using the encoding package, so outputs are byte-identical for the same (spec, seed) pair.
Two top-level entry points are exported by this package:
Synth(spec Spec, opts Options) (*Result, error)
Profile(schema, records, opts ProfileOptions) (*Profile, error)
Pulse embedders should use the higher-level pulse.Pulse.Synth and pulse.Pulse.Profile facade methods instead.
Index ¶
Constants ¶
const (
DistUniform = "uniform"
DistNormal = "normal"
DistLogNormal = "lognormal"
DistExponential = "exponential"
DistPoisson = "poisson"
DistPareto = "pareto"
DistBernoulli = "bernoulli"
DistMonotonicFrom = "monotonic_from"
DistWeightedCategorical = "weighted_categorical"
DistUniformDate = "uniform_date"
DistRegex = "regex"
DistConstant = "constant"
)
Distribution kind constants.
Variables ¶
This section is empty.
Functions ¶
func AllDistributions ¶
func AllDistributions() []string
AllDistributions returns the registered kind names in sorted order. Used by the manifest and tests.
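A sorted listing like this could be produced roughly as follows. This is a minimal sketch: the kinds map and the allDistributions helper are hypothetical local stand-ins that mirror the Dist* constants, not the package's actual registry.

```go
package main

import (
	"fmt"
	"sort"
)

// kinds is a hypothetical local registry; the names mirror the Dist* constants.
var kinds = map[string]struct{}{
	"uniform": {}, "normal": {}, "lognormal": {}, "exponential": {},
	"poisson": {}, "pareto": {}, "bernoulli": {}, "monotonic_from": {},
	"weighted_categorical": {}, "uniform_date": {}, "regex": {}, "constant": {},
}

// allDistributions returns the registered kind names in sorted order.
// Map iteration order is random in Go, so the explicit sort is what makes
// the result stable across calls.
func allDistributions() []string {
	out := make([]string, 0, len(kinds))
	for k := range kinds {
		out = append(out, k)
	}
	sort.Strings(out)
	return out
}

func main() {
	for _, k := range allDistributions() {
		fmt.Println(k)
	}
}
```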
Types ¶
type CategoricalProfile ¶
type CategoricalProfile struct {
Cardinality int `json:"cardinality"`
Top []CategoryHit `json:"top"`
}
CategoricalProfile holds the top-K observed values and the total distinct count.
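The sketch below shows one way a top-K summary with a distinct count could be computed. The hit type is a hypothetical stand-in for CategoryHit (its real fields are not listed here), and treating the weight as a fraction of all observed values is an assumption.

```go
package main

import (
	"fmt"
	"sort"
)

// hit is a hypothetical stand-in for CategoryHit.
type hit struct {
	Value  string
	Weight float64 // share of all observed values (an assumption)
}

// topK returns the number of distinct values and the k most frequent ones.
// A zero k falls back to 32, matching the documented TopK default.
func topK(values []string, k int) (int, []hit) {
	if k == 0 {
		k = 32
	}
	counts := make(map[string]int)
	for _, v := range values {
		counts[v]++
	}
	hits := make([]hit, 0, len(counts))
	for v, c := range counts {
		hits = append(hits, hit{Value: v, Weight: float64(c) / float64(len(values))})
	}
	// Sort by weight descending, then by value so ties are deterministic.
	sort.Slice(hits, func(i, j int) bool {
		if hits[i].Weight != hits[j].Weight {
			return hits[i].Weight > hits[j].Weight
		}
		return hits[i].Value < hits[j].Value
	})
	if len(hits) > k {
		hits = hits[:k]
	}
	return len(counts), hits
}

func main() {
	card, top := topK([]string{"a", "b", "a", "c", "a", "b"}, 2)
	fmt.Println(card, top)
}
```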
type CategoryHit ¶
CategoryHit records a categorical value with its observed weight.
type ConstraintSpec ¶
type ConstraintSpec struct {
Expr string `json:"expr"`
}
ConstraintSpec wraps an expression evaluated against the in-memory row.
type CorrelationSpec ¶
type CorrelationSpec struct {
A string `json:"a"`
B string `json:"b"`
Correlation float64 `json:"correlation"`
}
CorrelationSpec declares a target Pearson correlation between two numeric fields.
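For reference, the sample Pearson correlation that such a target refers to can be computed as in the sketch below. The pearson helper is illustrative only and is not part of this package.

```go
package main

import (
	"fmt"
	"math"
)

// pearson returns the sample Pearson correlation of x and y.
// It returns NaN when the inputs differ in length, have fewer than two
// points, or when either series has zero variance.
func pearson(x, y []float64) float64 {
	if len(x) != len(y) || len(x) < 2 {
		return math.NaN()
	}
	n := float64(len(x))
	var sumX, sumY float64
	for i := range x {
		sumX += x[i]
		sumY += y[i]
	}
	meanX, meanY := sumX/n, sumY/n

	var sxy, sxx, syy float64
	for i := range x {
		dx, dy := x[i]-meanX, y[i]-meanY
		sxy += dx * dy
		sxx += dx * dx
		syy += dy * dy
	}
	if sxx == 0 || syy == 0 {
		return math.NaN()
	}
	return sxy / math.Sqrt(sxx*syy)
}

func main() {
	x := []float64{1, 2, 3, 4}
	fmt.Println(pearson(x, []float64{2, 4, 6, 8})) // perfectly correlated
	fmt.Println(pearson(x, []float64{8, 6, 4, 2})) // perfectly anti-correlated
}
```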
type CorrelationStat ¶
CorrelationStat is a captured pairwise correlation entry.
type DateProfile ¶
type DateProfile struct {
Start string `json:"start"`
End string `json:"end"`
Weekdays [7]int `json:"weekdays"`
}
DateProfile holds the (start, end) range of date values plus a weekday histogram for Mode-A reconstruction.
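A date summary of this shape could be collected as sketched below. Two assumptions are made because the source does not state them: dates are formatted as YYYY-MM-DD, and the histogram is indexed like Go's time.Weekday (Sunday = 0). The dateProfile type is a local mirror, not the package type.

```go
package main

import (
	"fmt"
	"time"
)

const layout = "2006-01-02" // assumed date format

// dateProfile is a local mirror of DateProfile for illustration.
type dateProfile struct {
	Start    string
	End      string
	Weekdays [7]int // indexed by time.Weekday, Sunday = 0 (an assumption)
}

// profileDates records the earliest and latest date plus a weekday histogram.
func profileDates(dates []string) (dateProfile, error) {
	var p dateProfile
	var lo, hi time.Time
	for i, s := range dates {
		t, err := time.Parse(layout, s)
		if err != nil {
			return dateProfile{}, fmt.Errorf("parse %q: %w", s, err)
		}
		if i == 0 || t.Before(lo) {
			lo = t
		}
		if i == 0 || t.After(hi) {
			hi = t
		}
		p.Weekdays[t.Weekday()]++
	}
	if len(dates) > 0 {
		p.Start, p.End = lo.Format(layout), hi.Format(layout)
	}
	return p, nil
}

func main() {
	p, err := profileDates([]string{"2024-01-08", "2024-01-01", "2024-01-07"})
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(p.Start, p.End, p.Weekdays)
}
```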
type FieldProfile ¶
type FieldProfile struct {
Name string `json:"name"`
Type string `json:"type"`
Description string `json:"description,omitempty"`
NullRate float64 `json:"null_rate"`
Numeric *NumericProfile `json:"numeric,omitempty"`
Categorical *CategoricalProfile `json:"categorical,omitempty"`
Date *DateProfile `json:"date,omitempty"`
// Precision/Scale carry decimal128 metadata so synth-from-profile can
// reconstruct the original field shape.
Precision uint8 `json:"precision,omitempty"`
Scale uint8 `json:"scale,omitempty"`
}
FieldProfile holds per-field summary statistics. Exactly one of Numeric, Categorical, or Date is populated based on the field's type.
type FieldSpec ¶
type FieldSpec struct {
Name string `json:"name"`
Type string `json:"type"`
Nullable bool `json:"nullable,omitempty"`
Description string `json:"description,omitempty"`
Distribution string `json:"distribution"`
Params map[string]any `json:"params,omitempty"`
// Precision and Scale apply to decimal128.
Precision uint8 `json:"precision,omitempty"`
Scale uint8 `json:"scale,omitempty"`
// NullRate is the per-row probability that the field will be null.
// Only meaningful when Nullable is true; ignored otherwise.
NullRate float64 `json:"null_rate,omitempty"`
}
FieldSpec is a single column declaration.
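The interaction between Nullable and NullRate can be illustrated with the sketch below. The fieldSpec type mirrors only the relevant fields, and the effectiveNullRate and nullMask helpers are hypothetical; clamping the rate to [0, 1] is an assumption.

```go
package main

import (
	"fmt"
	"math/rand"
)

// fieldSpec mirrors the FieldSpec fields that matter for null handling.
type fieldSpec struct {
	Name     string
	Nullable bool
	NullRate float64
}

// effectiveNullRate returns the probability actually used per row:
// zero for non-nullable fields, otherwise NullRate clamped to [0, 1].
func effectiveNullRate(f fieldSpec) float64 {
	if !f.Nullable {
		return 0
	}
	if f.NullRate < 0 {
		return 0
	}
	if f.NullRate > 1 {
		return 1
	}
	return f.NullRate
}

// nullMask decides, row by row, whether the field is null.
func nullMask(f fieldSpec, n int, seed int64) []bool {
	rng := rand.New(rand.NewSource(seed))
	rate := effectiveNullRate(f)
	mask := make([]bool, n)
	for i := range mask {
		mask[i] = rng.Float64() < rate
	}
	return mask
}

func main() {
	strict := fieldSpec{Name: "id", Nullable: false, NullRate: 0.9}
	loose := fieldSpec{Name: "note", Nullable: true, NullRate: 0.3}
	fmt.Println(effectiveNullRate(strict), effectiveNullRate(loose))
	fmt.Println(nullMask(loose, 10, 1))
}
```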
type NumericProfile ¶
type NumericProfile struct {
Min float64 `json:"min"`
Max float64 `json:"max"`
Mean float64 `json:"mean"`
Std float64 `json:"std"`
// Percentiles holds {p1, p5, p25, p50, p75, p95, p99} when
// IncludeStats was on; nil otherwise.
Percentiles []float64 `json:"percentiles,omitempty"`
}
NumericProfile is the detail block for numeric fields.
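The seven percentile slots could be filled as in the sketch below. It uses the nearest-rank method, which is an assumption: the source does not say which percentile definition the profiler applies. The percentiles helper is illustrative only.

```go
package main

import (
	"fmt"
	"sort"
)

// levels matches the documented order {p1, p5, p25, p50, p75, p95, p99}.
var levels = []int{1, 5, 25, 50, 75, 95, 99}

// percentiles returns nearest-rank percentiles of values, or nil when
// values is empty. The input slice is not modified.
func percentiles(values []float64) []float64 {
	if len(values) == 0 {
		return nil
	}
	sorted := append([]float64(nil), values...)
	sort.Float64s(sorted)
	n := len(sorted)

	out := make([]float64, len(levels))
	for i, p := range levels {
		// rank = ceil(p*n/100), computed in integer arithmetic.
		rank := (p*n + 99) / 100
		if rank < 1 {
			rank = 1
		}
		out[i] = sorted[rank-1]
	}
	return out
}

func main() {
	values := make([]float64, 100)
	for i := range values {
		values[i] = float64(100 - i) // 100, 99, ..., 1
	}
	fmt.Println(percentiles(values)) // prints [1 5 25 50 75 95 99]
}
```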
type Options ¶
type Options struct {
// Seed makes the output deterministic. Same spec + same seed must
// produce a byte-identical .pulse file.
Seed int64
}
Options modulates how the spec is realized.
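The reason a seed yields reproducible output can be seen in the sketch below, which uses math/rand from the standard library. Which random source the package actually uses is not stated in the source, so this only demonstrates the general principle; the draw helper is hypothetical.

```go
package main

import (
	"fmt"
	"math/rand"
)

// draw returns n standard-normal samples from a generator seeded with seed.
// A private *rand.Rand is used so no global state can affect the sequence.
func draw(seed int64, n int) []float64 {
	rng := rand.New(rand.NewSource(seed))
	out := make([]float64, n)
	for i := range out {
		out[i] = rng.NormFloat64()
	}
	return out
}

func main() {
	a, b := draw(42, 3), draw(42, 3)
	same := true
	for i := range a {
		if a[i] != b[i] {
			same = false
		}
	}
	fmt.Println("same seed, same sequence:", same)
}
```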
type Profile ¶
type Profile struct {
RowCount int `json:"row_count"`
Fields []FieldProfile `json:"fields"`
Pairwise []CorrelationStat `json:"pairwise,omitempty"`
Warnings []string `json:"warnings,omitempty"`
Meta map[string]any `json:"meta,omitempty"`
}
Profile is a serialization-friendly statistical summary of a cohort. It contains everything needed to drive synth from-profile without retaining any individual rows from the source data.
func ProfileBytes ¶
func ProfileBytes(data []byte, opts ProfileOptions) (*Profile, error)
ProfileBytes summarizes a .pulse file given its raw bytes.
func ProfileFile ¶
ProfileFile reads a .pulse file from fs and produces a Profile.
func (*Profile) MarshalJSON ¶
MarshalJSON serializes the profile.
type ProfileOptions ¶
type ProfileOptions struct {
// TopK is the number of top categorical values to capture per
// categorical field. Defaults to 32 when zero.
TopK int
// IncludeStats turns on percentile / stdev / kurtosis collection.
// When false, only mean / min / max / null-rate are recorded for
// numeric fields. Defaults to true.
IncludeStats bool
// IncludeCorrelations enables pairwise Pearson correlation capture
// between numeric fields. Off by default to keep profile size bounded.
IncludeCorrelations bool
// CorrelationTopK caps the number of strongest |rho| pairs retained.
// Defaults to 16 when IncludeCorrelations is true.
CorrelationTopK int
// SampleLimit caps the number of records ingested for the profile.
// Zero = unlimited.
SampleLimit int
}
ProfileOptions modulates how Profile summarizes a cohort.
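The documented zero-value defaults could be resolved as in the sketch below. The profileOptions type is a local mirror and withDefaults is a hypothetical helper. How the IncludeStats default of true is applied is not described here, since a zero-valued bool is false, so the sketch leaves that field untouched.

```go
package main

import "fmt"

// profileOptions is a local mirror of ProfileOptions for illustration.
type profileOptions struct {
	TopK                int
	IncludeStats        bool
	IncludeCorrelations bool
	CorrelationTopK     int
	SampleLimit         int
}

// withDefaults fills in the documented defaults for zero-valued fields.
func withDefaults(o profileOptions) profileOptions {
	if o.TopK == 0 {
		o.TopK = 32
	}
	// CorrelationTopK only gets a default when correlations are enabled.
	if o.IncludeCorrelations && o.CorrelationTopK == 0 {
		o.CorrelationTopK = 16
	}
	// SampleLimit stays zero, which means unlimited.
	return o
}

func main() {
	fmt.Printf("%+v\n", withDefaults(profileOptions{}))
	fmt.Printf("%+v\n", withDefaults(profileOptions{IncludeCorrelations: true, TopK: 5}))
}
```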
type Result ¶
type Result struct {
RowsGenerated int `json:"rows_generated"`
RowsRejected int `json:"rows_rejected"`
OutputPath string `json:"output_path"`
Warnings []string `json:"warnings,omitempty"`
}
Result is what the writer reports after a successful Synth call.
type Spec ¶
type Spec struct {
// RowCount is the number of rows to generate. Required (>0).
RowCount int `json:"row_count"`
// Fields declares each column with its type and distribution.
Fields []FieldSpec `json:"fields"`
// Constraints, if non-empty, are evaluated per row by the expression
// evaluator. Rows that fail any constraint are rejected and re-drawn.
Constraints []ConstraintSpec `json:"constraints,omitempty"`
// MaxRejectionRate caps the fraction of rejected rows during
// constraint-driven rejection sampling. Defaults to 0.5 if zero.
MaxRejectionRate float64 `json:"max_rejection_rate,omitempty"`
// Correlations lists optional pairwise correlations to induce via
// Gaussian copula post-processing. Each entry references two numeric
// fields by name with a target Pearson correlation in [-1, 1]. Only
// numeric fields can participate.
Correlations []CorrelationSpec `json:"correlations,omitempty"`
}
Spec is the parsed top-level synthesis request. It is the in-memory shape of from-schema JSON. A from-profile call builds a Spec internally from the Profile and shares the rest of the writer pipeline.
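Constraint-driven rejection sampling with a rejection cap can be sketched as follows. Several details are assumptions, because the source does not specify them: exceeding the cap returns an error, the rate is only checked once at least RowCount rows have been attempted, and constraints are modeled as a plain Go predicate rather than an expression string. The row type and generate helper are hypothetical.

```go
package main

import (
	"errors"
	"fmt"
	"math/rand"
)

// row is a hypothetical two-column record.
type row struct{ A, B float64 }

// generate draws rows until rowCount of them pass accept. Rejected rows are
// re-drawn. It fails when the rejected fraction exceeds maxRejectionRate.
func generate(rowCount int, maxRejectionRate float64, seed int64,
	accept func(row) bool) ([]row, int, error) {
	if rowCount <= 0 {
		return nil, 0, errors.New("row count must be > 0")
	}
	if maxRejectionRate == 0 {
		maxRejectionRate = 0.5 // documented default
	}
	rng := rand.New(rand.NewSource(seed))
	rows := make([]row, 0, rowCount)
	rejected := 0
	for len(rows) < rowCount {
		r := row{A: rng.Float64(), B: rng.Float64()}
		if accept(r) {
			rows = append(rows, r)
			continue
		}
		rejected++
		attempts := len(rows) + rejected
		// Judge the rate only after rowCount attempts so that a few early
		// rejections do not abort the run (an assumption of this sketch).
		if attempts >= rowCount &&
			float64(rejected)/float64(attempts) > maxRejectionRate {
			return nil, rejected, fmt.Errorf(
				"rejection rate %.2f exceeds cap %.2f",
				float64(rejected)/float64(attempts), maxRejectionRate)
		}
	}
	return rows, rejected, nil
}

func main() {
	rows, rejected, err := generate(100, 0, 1, func(r row) bool { return r.A < 0.9 })
	fmt.Println(len(rows), rejected, err)

	_, _, err = generate(10, 0, 1, func(row) bool { return false })
	fmt.Println(err)
}
```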
func SpecFromProfile ¶
SpecFromProfile builds a Spec that the synth pipeline can execute. Numeric fields are reconstructed as normal distributions parameterized by the profiled mean and std and clipped to the observed [min, max]; categorical fields become weighted_categorical; date fields become uniform_date over the observed range.
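The numeric reconstruction can be pictured with the sketch below, which draws from a normal distribution and clamps each draw to the profiled bounds. Reading "clipped" as clamping, rather than re-drawing out-of-range values, is an assumption; the numericProfile mirror and sampleClipped helper are hypothetical.

```go
package main

import (
	"fmt"
	"math"
	"math/rand"
)

// numericProfile mirrors the NumericProfile fields used for reconstruction.
type numericProfile struct {
	Min, Max, Mean, Std float64
}

// sampleClipped draws n values from Normal(Mean, Std) and clamps each one
// into [Min, Max]. Clamping piles probability mass onto the bounds, which
// is a known trade-off of this simple approach.
func sampleClipped(p numericProfile, n int, seed int64) []float64 {
	rng := rand.New(rand.NewSource(seed))
	out := make([]float64, n)
	for i := range out {
		v := p.Mean + p.Std*rng.NormFloat64()
		out[i] = math.Max(p.Min, math.Min(p.Max, v))
	}
	return out
}

func main() {
	p := numericProfile{Min: 0, Max: 10, Mean: 5, Std: 20}
	fmt.Println(sampleClipped(p, 5, 7))
}
```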