Documentation ¶
Overview ¶
artifact.go defines run artifacts, campaign summaries, and persistence logic.
Key types:
- RunResult: captures outcome of a single instance run
- CampaignSummary: aggregates results across all runs in a campaign
- RunArtifact: on-disk representation of a run for recheck/audit
Artifacts are written to stable paths under the run root directory.
composite.go defines multi-step composite instances.
Composite instances execute a sequence of steps (each a campaign instance) with configurable gating policies to determine overall pass/fail.
Use cases:
- WAL+sync durability: write → crash → verify
- Compaction stress: write → compact → verify
filter.go implements tag-based instance filtering.
Filters allow selecting a subset of instances based on their computed tags. Syntax: "key=value,key!=value,key=v1|v2" (comma-separated AND, pipe for OR).
Example: "tier=quick,tool=stresstest" selects quick-tier stress instances.
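The filter syntax above can be illustrated with a minimal sketch. The names used here (clause, parseFilter, matches) are hypothetical and are not the package's ParseFilter or Filter implementation; tags are modeled as a plain map for brevity.

```go
package main

import (
	"fmt"
	"strings"
)

// clause is a hypothetical stand-in for one "key=v1|v2" or "key!=v" term.
type clause struct {
	key    string
	negate bool     // true for "!="
	values []string // alternatives separated by "|"
}

// parseFilter splits the expression on commas (AND) and each value list on pipes (OR).
func parseFilter(expr string) ([]clause, error) {
	var out []clause
	for _, part := range strings.Split(expr, ",") {
		part = strings.TrimSpace(part)
		if part == "" {
			continue
		}
		var c clause
		var rest string
		if i := strings.Index(part, "!="); i >= 0 {
			c.key, rest, c.negate = part[:i], part[i+2:], true
		} else if i := strings.Index(part, "="); i >= 0 {
			c.key, rest = part[:i], part[i+1:]
		} else {
			return nil, fmt.Errorf("invalid clause %q", part)
		}
		c.values = strings.Split(rest, "|")
		out = append(out, c)
	}
	return out, nil
}

// matches reports whether the tags satisfy every clause.
func matches(clauses []clause, tags map[string]string) bool {
	for _, c := range clauses {
		hit := false
		for _, v := range c.values {
			if tags[c.key] == v {
				hit = true
				break
			}
		}
		// "=" needs a hit; "!=" needs no hit.
		if hit == c.negate {
			return false
		}
	}
	return true
}

func main() {
	f, err := parseFilter("tier=quick,tool=stresstest|crashtest")
	if err != nil {
		panic(err)
	}
	fmt.Println(matches(f, map[string]string{"tier": "quick", "tool": "crashtest"}))
	fmt.Println(matches(f, map[string]string{"tier": "nightly", "tool": "crashtest"}))
}
```

The first call matches because both clauses hold; the second fails on the tier clause.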
instance.go defines the Instance type and instance-level operations.
An Instance is a named, reproducible test configuration specifying:
- Which tool to run (stresstest, crashtest, goldentest, adversarialtest)
- Command-line arguments with placeholders (<SEED>, <RUN_DIR>, <DB_PATH>)
- Seeds for deterministic reproduction
- Stop conditions defining pass/fail criteria
- Oracle requirements for C++ verification
known_failures.go implements failure fingerprinting and quarantine tracking.
Known failures are identified by stable fingerprints (hash of instance name, seed, failure kind, and key error details). When a failure recurs, it can be classified as a duplicate and optionally quarantined to prevent blocking CI.
Quarantine policies:
- QuarantineNone: failure is not quarantined, fails the campaign
- QuarantineAllowed: failure is known and allowed, does not fail the campaign
- QuarantineSkip: the instance is skipped entirely
matrix.go defines the campaign instance matrix.
The matrix contains all predefined instances organized by tier (quick/nightly). Each instance is a specific test configuration with reproducible seeds.
Quick tier: fast feedback for local dev and CI pull requests. Nightly tier: comprehensive coverage for scheduled runs.
minimize.go implements failure minimization for stresstest runs.
When a stresstest fails, minimization attempts to reduce the reproduction parameters (duration, threads, keys) to find the smallest configuration that still reproduces the failure. This makes debugging faster.
Reduction strategy: binary search on each dimension independently.
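A binary search over a single dimension might look like the sketch below. It assumes the failure is monotonic in the parameter (every value at or above some threshold still fails), which real, flaky failures may not satisfy. The function name minimizeDim and the callback shape are hypothetical.

```go
package main

import "fmt"

// minimizeDim returns the smallest value in [lo, hi] for which fails reports
// true. It assumes hi is known to fail and that failure is monotonic.
func minimizeDim(lo, hi int, fails func(int) bool) int {
	for lo < hi {
		mid := lo + (hi-lo)/2
		if fails(mid) {
			hi = mid // mid still reproduces; try smaller
		} else {
			lo = mid + 1 // mid is too small to reproduce
		}
	}
	return hi
}

func main() {
	// Pretend the bug needs at least 12 threads; start from 64 with a floor of 4.
	threads := minimizeDim(4, 64, func(n int) bool { return n >= 12 })
	fmt.Println(threads)
}
```

Each call to fails stands in for one full tool run, so the number of runs per dimension grows with the logarithm of the search range.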
oracle.go provides access to C++ RocksDB tools for consistency verification.
The oracle uses ldb and sst_dump from a RocksDB build to verify that Go-produced databases are bit-compatible with the C++ implementation.
Environment variables:
- ROCKSDB_PATH: path to RocksDB build directory (derives ldb and sst_dump)
- LDB_PATH: explicit path to ldb binary (overrides ROCKSDB_PATH)
- SST_DUMP_PATH: explicit path to sst_dump binary (overrides ROCKSDB_PATH)
recheck.go implements artifact rechecking for policy re-evaluation.
Recheck mode allows re-evaluating existing campaign artifacts against current policies and oracle tools. This is useful when:
- Oracle tools become available after initial run
- Quarantine policies change
- Stop conditions are updated
Recheck does not re-run the tool; it only re-evaluates persisted artifacts.
runner.go implements the core campaign execution engine.
The Runner orchestrates instance execution with:
- Tool invocation with timeout and cancellation
- Artifact persistence (run.json, logs, DB snapshots)
- Oracle gating (require ldb checkconsistency before pass)
- Failure fingerprinting and deduplication
- Skip policy enforcement
- Summary generation (summary.json, governance.json)
skip.go implements instance-level skip policies.
Skip policies allow excluding instances BEFORE they run, based on:
- Exact instance name
- Group prefix (e.g., "status.durability")
- Tag matching (e.g., tier=quick, tool=crashtest)
Unlike quarantine (fingerprint-based, post-failure), skip is instance-level and prevents the run from starting. Skipped instances are recorded in summary.json.
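The three matching modes can be sketched as follows. The skipRule type and shouldSkip function are hypothetical simplifications, not the package's SkipPolicy or ShouldSkip; in particular, requiring all of a rule's tags to match is an assumption.

```go
package main

import (
	"fmt"
	"strings"
)

// skipRule is a hypothetical rule; empty fields are ignored.
type skipRule struct {
	Name   string            // exact instance name
	Prefix string            // group prefix
	Tags   map[string]string // assumed: every listed tag must match
	Reason string
}

// tagsMatch reports whether every wanted tag is present with the same value.
func tagsMatch(want, have map[string]string) bool {
	for k, v := range want {
		if have[k] != v {
			return false
		}
	}
	return true
}

// shouldSkip returns the reason of the first matching rule.
func shouldSkip(rules []skipRule, name string, tags map[string]string) (string, bool) {
	for _, r := range rules {
		switch {
		case r.Name != "" && r.Name == name:
			return r.Reason, true
		case r.Prefix != "" && strings.HasPrefix(name, r.Prefix):
			return r.Reason, true
		case len(r.Tags) > 0 && tagsMatch(r.Tags, tags):
			return r.Reason, true
		}
	}
	return "", false
}

func main() {
	rules := []skipRule{{Prefix: "status.durability", Reason: "tracked separately"}}
	reason, skip := shouldSkip(rules, "status.durability.wal_sync", map[string]string{"tier": "quick"})
	fmt.Println(skip, reason)
}
```

The instance name and reason in main are made-up sample values.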
status.go defines status/durability repro instances.
These instances reproduce specific durability and consistency scenarios documented in docs/status/durability_report.md. They serve as regression tests for known failure modes and recovery behaviors.
stop.go defines stop conditions for instance runs.
Stop conditions specify what constitutes success vs failure for a run. They control termination requirements, verification passes, and oracle checks.
sweep.go implements sweep instances for parameter matrix expansion.
A sweep instance defines a base configuration with varying parameters. At run time, it expands into multiple concrete instances by taking the Cartesian product of all parameter values.
Example: cycles=[1,2,3] × mode=[sync,async] → 6 concrete runs.
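The expansion in the example above can be sketched as a Cartesian product over parameter lists. The param type and expand function are hypothetical and do not reflect SweepParam or SweepCase.

```go
package main

import "fmt"

// param is a hypothetical sweep parameter with its candidate values.
type param struct {
	name   string
	values []string
}

// expand returns one assignment map per combination of parameter values.
func expand(params []param) []map[string]string {
	cases := []map[string]string{{}} // start with a single empty assignment
	for _, p := range params {
		var next []map[string]string
		for _, c := range cases {
			for _, v := range p.values {
				// Copy the partial assignment before extending it.
				m := make(map[string]string, len(c)+1)
				for k, val := range c {
					m[k] = val
				}
				m[p.name] = v
				next = append(next, m)
			}
		}
		cases = next
	}
	return cases
}

func main() {
	cases := expand([]param{
		{"cycles", []string{"1", "2", "3"}},
		{"mode", []string{"sync", "async"}},
	})
	fmt.Println(len(cases)) // 6
	for _, c := range cases {
		fmt.Println(c["cycles"], c["mode"])
	}
}
```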
synthetic.go provides synthetic failure injection for CI testing.
Synthetic failures allow deterministic testing of the minimization and failure classification pipelines without relying on actual test failures. They produce stable fingerprints for verifying deduplication and quarantine.
tags.go defines the structured tag set for instances.
Tags provide metadata for filtering, grouping, and classification. Each instance computes its tags from its configuration (tier, tool, etc.).
Package campaign implements the Jepsen-style campaign runner for RockyardKV.
This package provides:
- Taxonomy types for campaign configuration (tiers, tools, fault models)
- Instance definitions for the campaign matrix
- Oracle gating and tool execution
- Artifact bundle writing and failure fingerprinting
- Campaign execution and reporting
trace.go implements trace capture and argument injection for stresstest.
When trace capture is enabled, the runner injects -trace-out and -trace-max-size arguments into stresstest invocations to capture operation traces for debugging.
Index ¶
- Constants
- Variables
- func AllGroups() []string
- func AllTagKeys() []string
- func BuildReplayCommand(tracePath, dbPath, binDir string) string
- func CheckTraceSize(tracePath string, config TraceConfig) (size int64, exceeded bool, err error)
- func ComputeFingerprint(instanceName string, seed int64, failureKind, failureReason, logPath string) string
- func CopyFile(src, dst string) error
- func EnsureDir(path string) error
- func EnsureTraceDir(runDir string, config TraceConfig) error
- func GateCheck(instance *Instance, oracle *Oracle) error
- func GlobalTimeout(tier Tier) int
- func InjectTraceArgs(args []string, runDir string, config TraceConfig) ([]string, string)
- func InstanceTimeout(tier Tier) int
- func ResolveStepArgs(args []string, runDir string, seed int64, dbPath string) []string
- func StepRunDir(instanceRunDir string, stepName string) string
- func TracePaths(runDir string, config TraceConfig) (traceFile, truncatedMarker string)
- func ValidateSkipPolicyTags(p *SkipPolicy) error
- func WriteCampaignSummary(runRoot string, tier Tier, startTime, endTime time.Time, results []*RunResult, ...) error
- func WriteGovernanceReport(runRoot string, results []*RunResult, skipped []SkipSummary, ...) error
- func WriteReplayScript(runDir, tracePath, dbPath, binDir string) error
- func WriteRunArtifact(result *RunResult) error
- func WriteTruncatedMarker(runDir string, config TraceConfig, bytesWritten int64) error
- type CampaignSummary
- type CompositeInstance
- type CompositeResult
- type FailureClass
- type FaultErrorType
- type FaultKind
- type FaultModel
- type FaultScope
- type Filter
- type FilterClause
- type FilterOp
- type GatingPolicy
- type GovernanceFailure
- type GovernanceReport
- type Instance
- type InstanceSkipPolicies
- type KnownFailure
- type KnownFailures
- func (kf *KnownFailures) All() []*KnownFailure
- func (kf *KnownFailures) Count() int
- func (kf *KnownFailures) Get(fingerprint string) *KnownFailure
- func (kf *KnownFailures) GetQuarantinePolicy(fingerprint string) QuarantinePolicy
- func (kf *KnownFailures) IsDuplicate(fingerprint string) bool
- func (kf *KnownFailures) IsQuarantined(fingerprint string) bool
- func (kf *KnownFailures) Record(fingerprint, instance, timestamp string) bool
- type MarkerRecheckResult
- type MinBounds
- type MinimizeConfig
- type MinimizeResult
- type Minimizer
- type Oracle
- type OracleRecheckResult
- type PolicyRecheckResult
- type QuarantinePolicy
- type RecheckResult
- type Rechecker
- type ReductionStep
- type RunArtifact
- type RunResult
- type RunSummary
- type Runner
- func (r *Runner) Run(ctx context.Context) (*CampaignSummary, error)
- func (r *Runner) RunCompositeInstances(ctx context.Context, composites []CompositeInstance) (*CampaignSummary, error)
- func (r *Runner) RunGroup(ctx context.Context, group string) (*CampaignSummary, error)
- func (r *Runner) RunInstances(ctx context.Context, instances []Instance) (*CampaignSummary, error)
- func (r *Runner) RunSweepInstances(ctx context.Context, sweeps []SweepInstance) (*CampaignSummary, error)
- type RunnerConfig
- type SkipPolicy
- type SkipResult
- type SkipSummary
- type Step
- type StepResult
- type StopCondition
- type SweepCase
- type SweepInstance
- type SweepParam
- type SyntheticFailConfig
- type Tags
- type Tier
- type Tool
- type ToolResult
- type TraceConfig
- type TraceResult
Constants ¶
const SchemaVersion = "1.1.0"
SchemaVersion is the current version of the artifact schema. Bump rules:
- Major: interpretation changes (field meaning, fingerprint algorithm, pass/fail logic)
- Minor: additive fields that don't change meaning or pass/fail
- Patch: tooling bugfixes that don't change schema
Variables ¶
var ErrOracleNotConfigured = errors.New("oracle not configured: set ROCKSDB_PATH or provide explicit tool paths")
ErrOracleNotConfigured indicates the oracle tools are not available.
var ErrOracleToolNotFound = errors.New("oracle tool not found")
ErrOracleToolNotFound indicates a specific oracle tool was not found.
Functions ¶
func AllTagKeys ¶
func AllTagKeys() []string
AllTagKeys returns all valid tag keys for filter validation.
func BuildReplayCommand ¶
func BuildReplayCommand(tracePath, dbPath, binDir string) string
BuildReplayCommand generates the traceanalyzer replay command for a trace file.
func CheckTraceSize ¶
func CheckTraceSize(tracePath string, config TraceConfig) (size int64, exceeded bool, err error)
CheckTraceSize checks whether a trace file exists and reports its size. It returns the size and whether that size exceeds the configured limit.
func ComputeFingerprint ¶
func ComputeFingerprint(instanceName string, seed int64, failureKind, failureReason, logPath string) string
ComputeFingerprint computes a failure fingerprint that includes:
- Instance name (to avoid collisions across instances)
- Seed (to identify specific run)
- Failure kind (enum-like category)
- Failure reason (specific message)
- Log tail (for extra signal)
The fingerprint is a SHA-256 digest truncated to 16 hex characters (64 bits).
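A truncated SHA-256 fingerprint can be sketched as below. This is not the package's algorithm: the field order, the NUL separator, and the omission of the log tail are assumptions made for illustration, and the sample values in main are invented.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"strconv"
)

// fingerprint is a hypothetical simplification that hashes four fields and
// keeps the first 16 hex characters of the digest.
func fingerprint(instance string, seed int64, kind, reason string) string {
	h := sha256.New()
	for _, field := range []string{instance, strconv.FormatInt(seed, 10), kind, reason} {
		h.Write([]byte(field))
		h.Write([]byte{0}) // separator so "ab","c" hashes differently from "a","bc"
	}
	return hex.EncodeToString(h.Sum(nil))[:16]
}

func main() {
	fmt.Println(fingerprint("stress.read.corruption.1in7", 42, "sample_kind", "sample reason"))
}
```

Because the inputs are hashed deterministically, the same instance, seed, and failure details always yield the same fingerprint, which is what makes deduplication possible.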
func EnsureTraceDir ¶
func EnsureTraceDir(runDir string, config TraceConfig) error
EnsureTraceDir creates the trace directory if trace capture is enabled.
func GateCheck ¶
func GateCheck(instance *Instance, oracle *Oracle) error
GateCheck verifies that the oracle is available if required by the instance. Returns an error if oracle is required but not available.
func GlobalTimeout ¶
func GlobalTimeout(tier Tier) int
GlobalTimeout returns the global timeout for a campaign run based on tier.
func InjectTraceArgs ¶
func InjectTraceArgs(args []string, runDir string, config TraceConfig) ([]string, string)
InjectTraceArgs adds -trace-out and -trace-max-size to the argument list if not already present. Returns the modified args and the trace file path. If -trace-out is already specified (either as "-trace-out <path>" or "-trace-out=<path>"), returns the existing path without modification (but still injects -trace-max-size if missing).
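The detection of both flag spellings might be sketched as follows. The function injectTrace is hypothetical: it takes the default trace path and size limit as explicit parameters rather than deriving them from a run directory and TraceConfig, and passing the size as a separate argument is an assumption.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// injectTrace appends -trace-out and -trace-max-size when they are missing
// and returns the new argument list plus the effective trace path.
func injectTrace(args []string, defaultPath string, maxSize int64) ([]string, string) {
	path := ""
	hasMax := false
	for i, a := range args {
		switch {
		case a == "-trace-out" && i+1 < len(args):
			path = args[i+1] // "-trace-out <path>" form
		case strings.HasPrefix(a, "-trace-out="):
			path = strings.TrimPrefix(a, "-trace-out=") // "-trace-out=<path>" form
		case a == "-trace-max-size" || strings.HasPrefix(a, "-trace-max-size="):
			hasMax = true
		}
	}
	out := append([]string(nil), args...) // copy so the caller's slice is untouched
	if path == "" {
		path = defaultPath
		out = append(out, "-trace-out", path)
	}
	if !hasMax {
		out = append(out, "-trace-max-size", strconv.FormatInt(maxSize, 10))
	}
	return out, path
}

func main() {
	out, path := injectTrace([]string{"-trace-out=/tmp/existing.trace"}, "/tmp/default.trace", 1<<20)
	fmt.Println(path)
	fmt.Println(out)
}
```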
func InstanceTimeout ¶
func InstanceTimeout(tier Tier) int
InstanceTimeout returns the default timeout for an instance based on tier.
func ResolveStepArgs ¶
func ResolveStepArgs(args []string, runDir string, seed int64, dbPath string) []string
ResolveStepArgs returns args with placeholders replaced. dbPath is the DB path discovered from a previous step (or empty).
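Placeholder substitution of this kind can be sketched with strings.NewReplacer. The function resolveArgs below is a hypothetical illustration; how the package treats an empty dbPath is not documented here, so this sketch simply substitutes the empty string.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// resolveArgs returns a copy of args with <RUN_DIR>, <SEED>, and <DB_PATH>
// replaced by the supplied values.
func resolveArgs(args []string, runDir string, seed int64, dbPath string) []string {
	r := strings.NewReplacer(
		"<RUN_DIR>", runDir,
		"<SEED>", strconv.FormatInt(seed, 10),
		"<DB_PATH>", dbPath,
	)
	out := make([]string, len(args))
	for i, a := range args {
		out[i] = r.Replace(a)
	}
	return out
}

func main() {
	args := []string{"-db=<RUN_DIR>/db", "-seed", "<SEED>"}
	fmt.Println(resolveArgs(args, "/tmp/run1", 42, ""))
}
```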
func StepRunDir ¶
func StepRunDir(instanceRunDir string, stepName string) string
StepRunDir returns the run directory for a specific step.
func TracePaths ¶
func TracePaths(runDir string, config TraceConfig) (traceFile, truncatedMarker string)
TracePaths returns the paths for trace artifacts in a run directory.
func ValidateSkipPolicyTags ¶
func ValidateSkipPolicyTags(p *SkipPolicy) error
ValidateSkipPolicyTags returns an error if any tag key in the policy is unknown.
func WriteCampaignSummary ¶
func WriteCampaignSummary(runRoot string, tier Tier, startTime, endTime time.Time, results []*RunResult, skipped []SkipSummary) error
WriteCampaignSummary writes the summary.json file to the run root.
func WriteGovernanceReport ¶
func WriteGovernanceReport(runRoot string, results []*RunResult, skipped []SkipSummary, knownFailures *KnownFailures) error
WriteGovernanceReport writes the governance.json file to the run root. This artifact provides an at-a-glance triage view for operators.
func WriteReplayScript ¶
func WriteReplayScript(runDir, tracePath, dbPath, binDir string) error
WriteReplayScript writes a replay.sh script to the run directory.
func WriteRunArtifact ¶
func WriteRunArtifact(result *RunResult) error
WriteRunArtifact writes the run.json file to the run directory. Also writes duplicate_of.txt if the failure is a duplicate.
func WriteTruncatedMarker ¶
func WriteTruncatedMarker(runDir string, config TraceConfig, bytesWritten int64) error
WriteTruncatedMarker writes a marker file indicating trace truncation.
Types ¶
type CampaignSummary ¶
type CampaignSummary struct {
SchemaVersion string `json:"schema_version"`
Tier string `json:"tier"`
StartTime time.Time `json:"start_time"`
EndTime time.Time `json:"end_time"`
DurationMs int64 `json:"duration_ms"`
TotalRuns int `json:"total_runs"`
PassedRuns int `json:"passed_runs"`
FailedRuns int `json:"failed_runs"`
SkippedRuns int `json:"skipped_runs"`
UniqueErrors int `json:"unique_errors"`
AllPassed bool `json:"all_passed"`
Runs []RunSummary `json:"runs"`
// Skipped instances and their reasons
Skipped []SkipSummary `json:"skipped,omitempty"`
// Governance fields for failure classification and deduplication
NewFailures int `json:"new_failures"`
KnownFailures int `json:"known_failures"`
Duplicates int `json:"duplicates"`
Unquarantined int `json:"unquarantined"`
OracleRequired int `json:"oracle_required"`
OracleGated int `json:"oracle_gated"`
}
CampaignSummary is the JSON structure written to summary.json after a campaign.
func ReadCampaignSummary ¶
func ReadCampaignSummary(runRoot string) (*CampaignSummary, error)
ReadCampaignSummary reads summary.json from a run root.
type CompositeInstance ¶
type CompositeInstance struct {
Instance
// Steps are the execution steps (in order).
// If nil or empty, the Instance.Tool/Args are used as a single step.
Steps []Step
// GatingPolicy determines how step results combine.
// Default: GateAllSteps (fail if ANY step fails).
GatingPolicy GatingPolicy
}
CompositeInstance extends Instance with multi-step execution support.
func StatusCompositeInstances ¶
func StatusCompositeInstances() []CompositeInstance
StatusCompositeInstances returns composite (multi-step) instances. These instances execute multiple steps with a gating policy.
func (*CompositeInstance) IsComposite ¶
func (c *CompositeInstance) IsComposite() bool
IsComposite returns true if this instance has multiple steps.
func (*CompositeInstance) ToSteps ¶
func (c *CompositeInstance) ToSteps() []Step
ToSteps converts the instance to a list of steps. If no explicit steps, creates a single step from Instance fields.
type CompositeResult ¶
type CompositeResult struct {
// Steps contains results for each step.
Steps []StepResult
// Passed indicates if the composite instance passed per its gating policy.
Passed bool
// FailureReason summarizes why the instance failed.
FailureReason string
// GatingPolicy that was applied.
GatingPolicy GatingPolicy
}
CompositeResult captures the outcome of a composite instance.
func (*CompositeResult) ComputePassed ¶
func (c *CompositeResult) ComputePassed()
ComputePassed evaluates the gating policy against step results.
type FailureClass ¶
type FailureClass string
FailureClass categorizes failures for governance reporting.
const (
// FailureClassNone means the run passed.
FailureClassNone FailureClass = ""
// FailureClassNew is a new failure not previously seen.
FailureClassNew FailureClass = "new_failure"
// FailureClassKnown is a failure that matches a quarantined known failure.
FailureClassKnown FailureClass = "known_failure"
// FailureClassDuplicate is a repeat of a failure already seen in this campaign run.
FailureClassDuplicate FailureClass = "duplicate"
)
type FaultErrorType ¶
type FaultErrorType string
FaultErrorType represents the error type for fault injection.
const (
// ErrorTypeStatus returns a status error (retryable).
ErrorTypeStatus FaultErrorType = "status"
// ErrorTypeCorruption returns a corruption error (fatal).
ErrorTypeCorruption FaultErrorType = "corruption"
// ErrorTypeTruncated returns a truncated error.
ErrorTypeTruncated FaultErrorType = "truncated"
)
type FaultKind ¶
type FaultKind string
FaultKind represents the type of fault to inject.
const (
// FaultNone means no fault injection.
FaultNone FaultKind = "none"
// FaultRead injects read errors.
FaultRead FaultKind = "read"
// FaultWrite injects write errors.
FaultWrite FaultKind = "write"
// FaultSync injects sync/fsync errors.
FaultSync FaultKind = "sync"
// FaultCrash injects process crashes.
FaultCrash FaultKind = "crash"
// FaultCorrupt injects data corruption.
FaultCorrupt FaultKind = "corrupt"
)
type FaultModel ¶
type FaultModel struct {
// Kind is the type of fault to inject.
Kind FaultKind
// ErrorType is the error type for the fault (status, corruption, truncated).
ErrorType FaultErrorType
// OneIn is the probability denominator (e.g., 7 means 1/7 chance).
OneIn int
// Scope is where faults are injected.
Scope FaultScope
}
FaultModel describes the fault injection configuration.
func (FaultModel) String ¶
func (f FaultModel) String() string
String returns a human-readable description of the fault model.
type FaultScope ¶
type FaultScope string
FaultScope represents where faults are injected.
const (
// ScopeWorker injects faults in worker goroutines.
ScopeWorker FaultScope = "worker"
// ScopeFlusher injects faults in the flusher goroutine.
ScopeFlusher FaultScope = "flusher"
// ScopeReopener injects faults during DB reopen.
ScopeReopener FaultScope = "reopener"
// ScopeGlobal injects faults globally (all goroutines).
ScopeGlobal FaultScope = "global"
)
type Filter ¶
type Filter struct {
Clauses []FilterClause
}
Filter represents a parsed filter expression.
func ParseFilter ¶
ParseFilter parses a filter string into a Filter. Format: "key=value,key!=value,key=val1|val2"
- Comma separates clauses (AND semantics)
- Pipe separates values within a clause (OR semantics)
- "=" for equality, "!=" for inequality
type FilterClause ¶
type FilterClause struct {
Key string
Op FilterOp
Values []string // Multiple values for OR (pipe-separated)
}
FilterClause represents a single filter clause (key op values).
func (FilterClause) Match ¶
func (c FilterClause) Match(tags Tags) bool
Match returns true if the tags match this clause.
func (FilterClause) String ¶
func (c FilterClause) String() string
String returns a string representation of the clause.
type GatingPolicy ¶
type GatingPolicy string
GatingPolicy defines how multi-step instance results are combined.
const (
// GateAllSteps fails if ANY step fails.
GateAllSteps GatingPolicy = "all_steps"
// GateLastStep fails ONLY if the last step fails.
// Earlier step failures are recorded but don't fail the instance.
GateLastStep GatingPolicy = "last_step"
)
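The difference between the two gating policies can be sketched as below. The policy type and passed function are hypothetical stand-ins for GatingPolicy and CompositeResult.ComputePassed; treating an empty step list as a pass is an assumption.

```go
package main

import "fmt"

type policy string

const (
	allSteps policy = "all_steps"
	lastStep policy = "last_step"
)

// passed evaluates per-step outcomes (true means the step passed) under a policy.
func passed(p policy, steps []bool) bool {
	if len(steps) == 0 {
		return true // assumption: nothing to gate on
	}
	if p == lastStep {
		return steps[len(steps)-1]
	}
	for _, ok := range steps {
		if !ok {
			return false
		}
	}
	return true
}

func main() {
	// write -> crash -> verify, where the crash step "fails" by design.
	steps := []bool{true, false, true}
	fmt.Println(passed(allSteps, steps)) // false
	fmt.Println(passed(lastStep, steps)) // true
}
```

A last-step policy suits sequences such as write, crash, verify, where an intermediate step is expected to exit abnormally and only the final verification decides the outcome.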
type GovernanceFailure ¶
type GovernanceFailure struct {
Instance string `json:"instance"`
Seed int64 `json:"seed"`
Fingerprint string `json:"fingerprint"`
IssueID string `json:"issue_id,omitempty"`
FailureKind string `json:"failure_kind,omitempty"`
}
GovernanceFailure contains details about a failure for triage.
type GovernanceReport ¶
type GovernanceReport struct {
SchemaVersion string `json:"schema_version"`
// Summary counts
TotalFailures int `json:"total_failures"`
NewFailures int `json:"new_failures"`
KnownFailures int `json:"known_failures"`
Duplicates int `json:"duplicates"`
Unquarantined int `json:"unquarantined"`
SkippedInstances int `json:"skipped_instances"`
// Actionable items
UnquarantinedDuplicates []GovernanceFailure `json:"unquarantined_duplicates,omitempty"`
QuarantinedHits []GovernanceFailure `json:"quarantined_hits,omitempty"`
SkippedList []SkipSummary `json:"skipped,omitempty"`
// Next steps for operators
NextSteps string `json:"next_steps"`
}
GovernanceReport is the machine-readable triage report for operators. Written to governance.json in the run root.
type Instance ¶
type Instance struct {
// Name is the unique instance identifier.
// Should be descriptive: "stress.read.corruption.1in7"
Name string
// Tier is the intensity level (quick or nightly).
Tier Tier
// RequiresOracle indicates if C++ oracle tools are required.
// If true, the runner will fail fast if oracle is not configured.
RequiresOracle bool
// Tool is the test binary to execute.
Tool Tool
// Args are the command-line arguments for the tool.
// Use "<RUN_DIR>" as a placeholder for the run directory.
// Use "<SEED>" as a placeholder for the seed value.
Args []string
// Env are additional environment variables for the tool.
Env map[string]string
// Seeds are the seed values to run. Each seed produces a separate run.
Seeds []int64
// FaultModel describes the fault injection configuration.
FaultModel FaultModel
// Stop defines the stopping conditions for this instance.
Stop StopCondition
}
Instance represents a single campaign test instance. Each instance defines a specific test configuration to run.
func FilterInstances ¶
FilterInstances returns instances that match the filter.
func GetInstances ¶
GetInstances returns the instances for the specified tier. Includes both campaign instances (stress, crash, golden) and status instances (durability, adversarial).
func GetStatusInstances ¶
GetStatusInstances returns status instances filtered by group prefix. If group is empty, returns all status instances.
func NightlyInstances ¶
func NightlyInstances() []Instance
NightlyInstances returns the instance matrix for the nightly tier. Nightly tier is for thorough testing that can run for hours.
func QuickInstances ¶
func QuickInstances() []Instance
QuickInstances returns the instance matrix for the quick tier. Quick tier is for local development and CI on pull requests.
func StatusInstances ¶
func StatusInstances() []Instance
StatusInstances returns the simple instance matrix for status/durability checks. For composite instances (multi-step), see StatusCompositeInstances(). For sweep instances (parameter expansion), see StatusSweepInstances().
func SyntheticInstance ¶
func SyntheticInstance() *Instance
SyntheticInstance returns a test-only instance that fails deterministically. It is gated behind the ROCKYARDKV_SYNTHETIC_FAIL=1 environment variable to prevent accidental use.
Usage:
ROCKYARDKV_SYNTHETIC_FAIL=1 bin/campaignrunner -group=synthetic -minimize
func (*Instance) BinaryName ¶
BinaryName returns the binary name for the tool (without path).
func (*Instance) BinaryPath ¶
BinaryPath returns the full path to the tool binary. Uses binDir to construct path for test binaries (e.g., "./bin/stresstest"). For go test, returns "go" since it's expected to be on PATH.
func (*Instance) ComputeTags ¶
ComputeTags derives the Tags from an Instance.
func (*Instance) ResolveArgs ¶
ResolveArgs returns the arguments with placeholders replaced.
type InstanceSkipPolicies ¶
type InstanceSkipPolicies struct {
// contains filtered or unexported fields
}
InstanceSkipPolicies manages a set of skip policies.
func NewInstanceSkipPolicies ¶
func NewInstanceSkipPolicies(path string) *InstanceSkipPolicies
NewInstanceSkipPolicies creates a new skip policy manager. If path is non-empty, policies are loaded from disk.
func (*InstanceSkipPolicies) Add ¶
func (sp *InstanceSkipPolicies) Add(policy *SkipPolicy)
Add adds a new skip policy.
func (*InstanceSkipPolicies) Count ¶
func (sp *InstanceSkipPolicies) Count() int
Count returns the number of skip policies.
func (*InstanceSkipPolicies) LoadWithValidation ¶
func (sp *InstanceSkipPolicies) LoadWithValidation() error
LoadWithValidation loads policies and returns any validation errors. Use this when callers want to surface validation issues to users.
func (*InstanceSkipPolicies) Save ¶
func (sp *InstanceSkipPolicies) Save() error
Save writes skip policies to disk.
func (*InstanceSkipPolicies) ShouldSkip ¶
func (sp *InstanceSkipPolicies) ShouldSkip(inst *Instance) *SkipResult
ShouldSkip returns a SkipResult if the instance should be skipped, nil otherwise.
type KnownFailure ¶
type KnownFailure struct {
Fingerprint string `json:"fingerprint"`
Instance string `json:"instance"`
FirstSeen string `json:"first_seen"`
Count int `json:"count"`
Description string `json:"description,omitempty"`
// IssueID links the failure to a tracking issue (e.g., "GH-123").
IssueID string `json:"issue_id,omitempty"`
// Quarantine defines how this known failure should be handled.
// If empty, the failure is not quarantined and will fail the campaign.
Quarantine QuarantinePolicy `json:"quarantine,omitempty"`
}
KnownFailure represents a previously seen failure fingerprint.
type KnownFailures ¶
type KnownFailures struct {
// contains filtered or unexported fields
}
KnownFailures tracks failure fingerprints for deduplication.
func NewKnownFailures ¶
func NewKnownFailures(path string) *KnownFailures
NewKnownFailures creates a new known failures tracker. If path is non-empty, failures are persisted to disk.
func (*KnownFailures) All ¶
func (kf *KnownFailures) All() []*KnownFailure
All returns all known failures.
func (*KnownFailures) Count ¶
func (kf *KnownFailures) Count() int
Count returns the number of known failure fingerprints.
func (*KnownFailures) Get ¶
func (kf *KnownFailures) Get(fingerprint string) *KnownFailure
Get returns the known failure for a fingerprint, or nil if not found.
func (*KnownFailures) GetQuarantinePolicy ¶
func (kf *KnownFailures) GetQuarantinePolicy(fingerprint string) QuarantinePolicy
GetQuarantinePolicy returns the quarantine policy for a fingerprint. Returns QuarantineNone if the fingerprint is not known or not quarantined.
func (*KnownFailures) IsDuplicate ¶
func (kf *KnownFailures) IsDuplicate(fingerprint string) bool
IsDuplicate returns true if the fingerprint has been seen before.
func (*KnownFailures) IsQuarantined ¶
func (kf *KnownFailures) IsQuarantined(fingerprint string) bool
IsQuarantined returns true if the fingerprint is known AND has a quarantine policy.
func (*KnownFailures) Record ¶
func (kf *KnownFailures) Record(fingerprint, instance, timestamp string) bool
Record adds or updates a failure fingerprint. Returns true if this is a new (not duplicate) failure.
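The new-versus-duplicate behavior of Record can be sketched with an in-memory map. The tracker type is hypothetical and omits persistence, timestamps, and quarantine policy.

```go
package main

import "fmt"

// tracker is a hypothetical in-memory stand-in for a known-failures store.
type tracker struct {
	counts map[string]int
}

func newTracker() *tracker { return &tracker{counts: make(map[string]int)} }

// record notes one occurrence and reports whether the fingerprint is new.
func (t *tracker) record(fp string) bool {
	t.counts[fp]++
	return t.counts[fp] == 1
}

// isDuplicate reports whether the fingerprint has been recorded before.
func (t *tracker) isDuplicate(fp string) bool { return t.counts[fp] > 0 }

func main() {
	t := newTracker()
	fmt.Println(t.record("a1b2c3d4e5f60718")) // first sighting
	fmt.Println(t.record("a1b2c3d4e5f60718")) // repeat
}
```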
type MarkerRecheckResult ¶
type MarkerRecheckResult struct {
// Passed indicates if verification markers indicate success.
Passed bool `json:"passed"`
// Reason explains the result.
Reason string `json:"reason"`
}
MarkerRecheckResult contains verification marker re-parse details.
type MinBounds ¶
type MinBounds struct {
// MinDuration is the minimum test duration.
// Default: 5 seconds.
MinDuration time.Duration
// MinThreads is the minimum number of threads.
// Default: 4.
MinThreads int
// MinKeys is the minimum number of keys.
// Default: 500.
MinKeys int
}
MinBounds defines the minimum values for parameter reduction during minimization.
func DefaultMinBounds ¶
func DefaultMinBounds() MinBounds
DefaultMinBounds returns the Red Team approved minimization bounds.
type MinimizeConfig ¶
type MinimizeConfig struct {
// Enabled controls whether minimization is active.
Enabled bool
// Bounds defines the minimum parameter values.
Bounds MinBounds
// AllowedFailureKinds is the set of failure kinds eligible for minimization.
// Empty means all failure kinds are eligible.
AllowedFailureKinds map[string]bool
}
MinimizeConfig controls the minimization process.
func DefaultMinimizeConfig ¶
func DefaultMinimizeConfig() MinimizeConfig
DefaultMinimizeConfig returns the default minimization configuration.
type MinimizeResult ¶
type MinimizeResult struct {
// Success indicates if minimization found a smaller reproducer.
Success bool `json:"success"`
// OriginalArgs are the original instance arguments.
OriginalArgs []string `json:"original_args"`
// MinimalArgs are the minimized arguments (if successful).
MinimalArgs []string `json:"minimal_args,omitempty"`
// Steps records each reduction attempt.
Steps []ReductionStep `json:"steps"`
// FinalDuration is the minimized duration.
FinalDuration string `json:"final_duration,omitempty"`
// FinalThreads is the minimized thread count.
FinalThreads int `json:"final_threads,omitempty"`
// FinalKeys is the minimized key count.
FinalKeys int `json:"final_keys,omitempty"`
// PreservedFailureKind is the failure kind that was preserved across reduction.
PreservedFailureKind string `json:"preserved_failure_kind,omitempty"`
// TotalAttempts is the number of runs performed during minimization.
TotalAttempts int `json:"total_attempts"`
// TotalDurationMs is the total time spent minimizing.
TotalDurationMs int64 `json:"total_duration_ms"`
}
MinimizeResult captures the outcome of a minimization attempt.
type Minimizer ¶
type Minimizer struct {
// contains filtered or unexported fields
}
Minimizer reduces failing test cases to minimal parameters.
func NewMinimizer ¶
func NewMinimizer(runner *Runner, config MinimizeConfig) *Minimizer
NewMinimizer creates a new minimizer with the given runner and config.
func (*Minimizer) Minimize ¶
Minimize attempts to reduce a failing instance to minimal parameters. It uses sequential reduction: duration → threads → keys. Within each parameter, it uses binary search.
func (*Minimizer) ShouldMinimize ¶
ShouldMinimize returns true if the failure is eligible for minimization.
type Oracle ¶
type Oracle struct {
// RocksDBPath is the path to the RocksDB source/build directory.
// Should contain ldb and sst_dump binaries.
RocksDBPath string
// LDBPath is the explicit path to the ldb binary.
// If empty, uses RocksDBPath/ldb.
LDBPath string
// SSTDumpPath is the explicit path to the sst_dump binary.
// If empty, uses RocksDBPath/sst_dump.
SSTDumpPath string
}
Oracle provides access to the C++ RocksDB tools (ldb, sst_dump). These tools are used to verify database consistency and format correctness.
func NewOracleFromEnv ¶
func NewOracleFromEnv() *Oracle
NewOracleFromEnv creates an Oracle from environment variables.
Environment variable precedence:
- LDB_PATH: explicit path to ldb binary (overrides ROCKSDB_PATH-derived path)
- SST_DUMP_PATH: explicit path to sst_dump binary (overrides ROCKSDB_PATH-derived path)
- ROCKSDB_PATH: path to RocksDB build directory (derives ldb and sst_dump from it)
Returns nil if neither ROCKSDB_PATH nor the tool-specific paths are set.
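The precedence rules above might be sketched as follows. The toolPaths type and resolve function are hypothetical; the sketch takes a lookup function so it can be exercised without touching the real environment, and it does not check that the binaries exist.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// toolPaths is a hypothetical holder for the resolved binary locations.
type toolPaths struct {
	LDB     string
	SSTDump string
}

// resolve applies the documented precedence: explicit tool paths win, and
// ROCKSDB_PATH fills in whatever is missing. It returns nil if nothing is set.
func resolve(getenv func(string) string) *toolPaths {
	root := getenv("ROCKSDB_PATH")
	ldb := getenv("LDB_PATH")
	sst := getenv("SST_DUMP_PATH")
	if root == "" && ldb == "" && sst == "" {
		return nil
	}
	if ldb == "" && root != "" {
		ldb = filepath.Join(root, "ldb")
	}
	if sst == "" && root != "" {
		sst = filepath.Join(root, "sst_dump")
	}
	return &toolPaths{LDB: ldb, SSTDump: sst}
}

func main() {
	fmt.Printf("%+v\n", resolve(os.Getenv))
}
```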
func (*Oracle) Available ¶
Available returns true if the oracle tools are configured and accessible.
func (*Oracle) CheckConsistency ¶
func (o *Oracle) CheckConsistency(dbPath string) *ToolResult
CheckConsistency runs `ldb checkconsistency` on the database. Returns OK if the database passes all consistency checks.
func (*Oracle) DumpManifest ¶
func (o *Oracle) DumpManifest(dbPath string) *ToolResult
DumpManifest runs `ldb manifest_dump` on the database.
type OracleRecheckResult ¶
type OracleRecheckResult struct {
// Performed indicates if oracle check was run.
Performed bool `json:"performed"`
// Skipped indicates oracle check was skipped (not required or oracle unavailable).
Skipped bool `json:"skipped,omitempty"`
// SkipReason explains why oracle check was skipped.
SkipReason string `json:"skip_reason,omitempty"`
// OK indicates if the oracle check passed.
OK bool `json:"ok"`
// ExitCode is the oracle tool exit code.
ExitCode int `json:"exit_code,omitempty"`
// StdoutPath is the path to captured stdout.
StdoutPath string `json:"stdout_path,omitempty"`
// StderrPath is the path to captured stderr.
StderrPath string `json:"stderr_path,omitempty"`
// Summary is a brief inline summary.
Summary string `json:"summary,omitempty"`
}
OracleRecheckResult contains oracle tool re-check details.
type PolicyRecheckResult ¶
type PolicyRecheckResult struct {
// Passed indicates if the run passes current policy.
Passed bool `json:"passed"`
// Reason explains why it passed or failed.
Reason string `json:"reason"`
// Verified indicates if the run can be marked as VERIFIED.
// False when oracle is required but missing.
Verified bool `json:"verified"`
}
PolicyRecheckResult contains stop-condition policy evaluation.
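A minimal sketch of how the Passed and Verified fields might relate, assuming a run that passes its stop conditions still counts as passed when a required oracle check is missing, but cannot be marked VERIFIED. `evaluatePolicy` and its parameters are hypothetical; the struct is redeclared so the example compiles on its own.

```go
package main

import "fmt"

// PolicyRecheckResult mirrors the documented struct.
type PolicyRecheckResult struct {
	Passed   bool   `json:"passed"`
	Reason   string `json:"reason"`
	Verified bool   `json:"verified"`
}

// evaluatePolicy is a hypothetical helper. It withholds Verified when the
// oracle is required but no oracle check was performed.
func evaluatePolicy(runPassed, oracleRequired, oraclePerformed bool) PolicyRecheckResult {
	switch {
	case !runPassed:
		return PolicyRecheckResult{Reason: "run failed its stop conditions"}
	case oracleRequired && !oraclePerformed:
		return PolicyRecheckResult{Passed: true, Reason: "passed, but the required oracle check is missing"}
	default:
		return PolicyRecheckResult{Passed: true, Reason: "passed all current stop conditions", Verified: true}
	}
}

func main() {
	fmt.Printf("%+v\n", evaluatePolicy(true, true, false))
	fmt.Printf("%+v\n", evaluatePolicy(true, true, true))
}
```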
type QuarantinePolicy ¶
type QuarantinePolicy string
QuarantinePolicy defines how a known failure should be handled.
const (
// QuarantineNone means the failure is not quarantined and will fail the campaign.
QuarantineNone QuarantinePolicy = ""
// QuarantineAllowed means the failure is expected and allowed to occur.
QuarantineAllowed QuarantinePolicy = "allowed"
// QuarantineSkip means the instance should be skipped entirely.
QuarantineSkip QuarantinePolicy = "skip"
)
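As a rough illustration of how these policies could feed into the campaign verdict, the sketch below decides whether a single run outcome should count against the campaign. `failsCampaign` is a hypothetical helper, and treating QuarantineSkip like QuarantineAllowed for an instance that ran anyway is an assumption, not documented behavior.

```go
package main

import "fmt"

// QuarantinePolicy and its constants mirror the declarations above so the
// sketch compiles on its own.
type QuarantinePolicy string

const (
	QuarantineNone    QuarantinePolicy = ""
	QuarantineAllowed QuarantinePolicy = "allowed"
	QuarantineSkip    QuarantinePolicy = "skip"
)

// failsCampaign is a hypothetical helper: it reports whether one run outcome
// should fail the campaign under the given policy.
func failsCampaign(passed bool, policy QuarantinePolicy) bool {
	if passed {
		return false
	}
	switch policy {
	case QuarantineAllowed, QuarantineSkip:
		// Known and tolerated failure: do not block CI (assumption for "skip").
		return false
	default:
		// QuarantineNone or an unrecognized value: treat as a real failure.
		return true
	}
}

func main() {
	for _, p := range []QuarantinePolicy{QuarantineNone, QuarantineAllowed, QuarantineSkip} {
		fmt.Printf("policy=%q, failed run blocks campaign: %v\n", p, failsCampaign(false, p))
	}
}
```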
type RecheckResult ¶
type RecheckResult struct {
// RecheckTime is when the recheck was performed.
RecheckTime time.Time `json:"recheck_time"`
// RecheckSchemaVersion is the schema version used for this recheck.
RecheckSchemaVersion string `json:"recheck_schema_version"`
// OracleRecheck contains the oracle re-check outcome.
OracleRecheck *OracleRecheckResult `json:"oracle_recheck,omitempty"`
// MarkerRecheck contains the verification marker re-parse outcome.
MarkerRecheck *MarkerRecheckResult `json:"marker_recheck,omitempty"`
// FingerprintRecomputed is the recomputed fingerprint (if failure).
FingerprintRecomputed string `json:"fingerprint_recomputed,omitempty"`
// PolicyResult contains the pass/fail evaluation with current stop conditions.
PolicyResult *PolicyRecheckResult `json:"policy_result"`
}
RecheckResult captures the outcome of re-evaluating an existing run.
type Rechecker ¶
type Rechecker struct {
Oracle *Oracle
StopConditions map[string]StopCondition // instance name -> stop condition
}
Rechecker re-evaluates existing run artifacts.
func NewRechecker ¶
NewRechecker creates a new Rechecker.
func (*Rechecker) RecheckCampaign ¶
func (r *Rechecker) RecheckCampaign(runRoot string) ([]RecheckResult, error)
RecheckCampaign re-evaluates all runs in a campaign run root.
func (*Rechecker) RecheckRun ¶
func (r *Rechecker) RecheckRun(runDir string) (*RecheckResult, error)
RecheckRun re-evaluates a single run directory.
type ReductionStep ¶
type ReductionStep struct {
Parameter string `json:"parameter"` // "duration", "threads", or "keys"
OriginalVal string `json:"original_value"`
ReducedVal string `json:"reduced_value"`
StillFails bool `json:"still_fails"`
DurationMs int64 `json:"duration_ms"`
}
ReductionStep records a single step in the minimization process.
type RunArtifact ¶
type RunArtifact struct {
SchemaVersion string `json:"schema_version"`
Instance string `json:"instance"`
Seed int64 `json:"seed"`
BinaryPath string `json:"binary_path"`
StartTime time.Time `json:"start_time"`
EndTime time.Time `json:"end_time"`
DurationMs int64 `json:"duration_ms"`
ExitCode int `json:"exit_code"`
Passed bool `json:"passed"`
Failure string `json:"failure,omitempty"`
FailureKind string `json:"failure_kind,omitempty"`
Fingerprint string `json:"fingerprint,omitempty"`
IsDuplicate bool `json:"is_duplicate,omitempty"`
// Oracle check fields
OracleExitCode *int `json:"oracle_exit_code,omitempty"`
OracleOutput string `json:"oracle_output,omitempty"`
// Trace capture fields
TracePath string `json:"trace_path,omitempty"`
TraceBytesWriten int64 `json:"trace_bytes_written,omitempty"`
TraceTruncated bool `json:"trace_truncated,omitempty"`
ReplayCommand string `json:"replay_command,omitempty"`
// Minimization fields
Minimized bool `json:"minimized,omitempty"`
MinimizedResult *MinimizeResult `json:"minimized_result,omitempty"`
// Tags for filtering (computed at write time)
Tags *Tags `json:"tags,omitempty"`
}
RunArtifact is the JSON structure written to run.json in each run directory.
type RunResult ¶
type RunResult struct {
// Instance is the instance that was run.
Instance *Instance
// Seed is the seed value used for this run.
Seed int64
// RunDir is the directory containing all run artifacts.
RunDir string
// BinaryPath is the resolved path to the binary that was executed.
BinaryPath string
// StartTime is when the run started.
StartTime time.Time
// EndTime is when the run ended.
EndTime time.Time
// ExitCode is the process exit code.
ExitCode int
// Passed indicates if the run passed all stop conditions.
Passed bool
// FailureReason describes why the run failed (if it did).
FailureReason string
// FailureKind categorizes the failure type for fingerprinting.
FailureKind string
// Fingerprint is the failure fingerprint for deduplication.
// Empty string if the run passed.
Fingerprint string
// IsDuplicate indicates if this failure fingerprint was already known.
IsDuplicate bool
// FailureClass categorizes the failure for governance reporting.
FailureClass FailureClass
// QuarantinePolicy is the policy for this failure (if it's a known failure).
QuarantinePolicy QuarantinePolicy
// OracleResult is the result of oracle verification (if performed).
OracleResult *ToolResult
// TraceResult contains trace capture information (if enabled).
TraceResult *TraceResult
// MinimizeResult contains minimization results (if performed).
MinimizeResult *MinimizeResult
}
RunResult represents the outcome of a single instance run.
func RunSyntheticFailure ¶
func RunSyntheticFailure(ctx context.Context, config SyntheticFailConfig, runDir string) *RunResult
RunSyntheticFailure executes a synthetic failure for CI testing. Returns a RunResult that simulates a deterministic, classifiable failure.
type RunSummary ¶
type RunSummary struct {
Instance string `json:"instance"`
Seed int64 `json:"seed"`
Passed bool `json:"passed"`
Failure string `json:"failure,omitempty"`
Fingerprint string `json:"fingerprint,omitempty"`
FailureClass FailureClass `json:"failure_class,omitempty"`
DurationMs int64 `json:"duration_ms"`
}
RunSummary is a brief summary of each run for the campaign summary.
type Runner ¶
type Runner struct {
// contains filtered or unexported fields
}
Runner executes campaign instances.
func NewRunner ¶
func NewRunner(config RunnerConfig) *Runner
NewRunner creates a new campaign runner.
func (*Runner) Run ¶
func (r *Runner) Run(ctx context.Context) (*CampaignSummary, error)
Run executes all instances for the configured tier. Returns the campaign summary and any error.
func (*Runner) RunCompositeInstances ¶
func (r *Runner) RunCompositeInstances(ctx context.Context, composites []CompositeInstance) (*CampaignSummary, error)
RunCompositeInstances executes composite (multi-step) instances with Phase-1-grade artifacts.
func (*Runner) RunGroup ¶
RunGroup executes the instances whose names match the group prefix. If group is empty, it runs all instances for the tier. If group starts with "status.", it runs the status instances. The special groups "status.composite" and "status.sweep" run the composite and sweep instances, respectively.
func (*Runner) RunInstances ¶
RunInstances executes the specified instances.
func (*Runner) RunSweepInstances ¶
func (r *Runner) RunSweepInstances(ctx context.Context, sweeps []SweepInstance) (*CampaignSummary, error)
RunSweepInstances expands and runs sweep instances.
type RunnerConfig ¶
type RunnerConfig struct {
// Tier is the intensity level.
Tier Tier
// RunRoot is the root directory for all run artifacts.
RunRoot string
// BinDir is the directory containing test binaries.
// Defaults to "./bin" if empty.
BinDir string
// Oracle is the C++ oracle for consistency checks.
// May be nil if oracle is not available.
Oracle *Oracle
// KnownFailures tracks failure fingerprints for deduplication.
KnownFailures *KnownFailures
// FailFast stops the campaign on the first failure.
FailFast bool
// Verbose enables verbose output.
Verbose bool
// Output is where to write progress messages.
Output io.Writer
// InstanceTimeout is the per-instance timeout in seconds.
// If 0, uses the default for the tier.
InstanceTimeout int
// GlobalTimeout is the global campaign timeout in seconds.
// If 0, uses the default for the tier.
GlobalTimeout int
// Trace controls trace capture behavior.
Trace TraceConfig
// Minimize controls minimization behavior.
Minimize MinimizeConfig
// Filter restricts which instances to run.
// If nil, all instances are run.
Filter *Filter
// RequireQuarantine enforces that repeat failures must be quarantined.
// If true, unquarantined duplicate failures cause the campaign to fail.
RequireQuarantine bool
// SkipPolicies defines instance-level skip policies.
// Instances matching a skip policy are not run and are recorded as skipped.
SkipPolicies *InstanceSkipPolicies
}
RunnerConfig configures the campaign runner.
type SkipPolicy ¶
type SkipPolicy struct {
// InstanceName is the exact instance name to skip (highest priority).
InstanceName string `json:"instance_name,omitempty"`
// Group matches instances whose name starts with this prefix.
// For example, "status.durability" matches "status.durability.cycles4".
Group string `json:"group,omitempty"`
// Tags matches instances with all specified tag values.
// For example, {"tier": "nightly", "kind": "crash"} matches all nightly crash tests.
Tags map[string]string `json:"tags,omitempty"`
// Reason is a human-readable explanation for why the instance is skipped.
Reason string `json:"reason"`
// IssueID links to a tracking issue (e.g., "GH-456").
IssueID string `json:"issue_id,omitempty"`
}
SkipPolicy represents an instance-level skip policy. Unlike fingerprint-based quarantine, this skips instances BEFORE they run.
func (*SkipPolicy) Matches ¶
func (p *SkipPolicy) Matches(inst *Instance) bool
Matches returns true if this policy matches the given instance.
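The matching rules described in the SkipPolicy field comments can be sketched as follows. The types are simplified, hypothetical stand-ins (the real Instance computes its tags), and the sketch assumes the criteria are checked in priority order, so only the first non-empty one decides the match.

```go
package main

import (
	"fmt"
	"strings"
)

// instance is a simplified, hypothetical stand-in for Instance.
type instance struct {
	Name string
	Tags map[string]string
}

// skipPolicy carries only the matching fields of SkipPolicy.
type skipPolicy struct {
	InstanceName string
	Group        string
	Tags         map[string]string
}

// matches checks the exact name first, then the group prefix, then the tags.
// An empty policy matches nothing.
func (p skipPolicy) matches(inst instance) bool {
	switch {
	case p.InstanceName != "":
		return inst.Name == p.InstanceName
	case p.Group != "":
		return strings.HasPrefix(inst.Name, p.Group)
	case len(p.Tags) > 0:
		// Every tag named by the policy must be present with the same value.
		for k, want := range p.Tags {
			if inst.Tags[k] != want {
				return false
			}
		}
		return true
	default:
		return false
	}
}

func main() {
	inst := instance{
		Name: "status.durability.cycles4",
		Tags: map[string]string{"tier": "nightly", "kind": "crash"},
	}
	byGroup := skipPolicy{Group: "status.durability"}
	byTags := skipPolicy{Tags: map[string]string{"tier": "nightly", "kind": "crash"}}
	fmt.Println(byGroup.matches(inst), byTags.matches(inst))
}
```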
type SkipResult ¶
type SkipResult struct {
InstanceName string `json:"instance_name"`
Reason string `json:"reason"`
IssueID string `json:"issue_id,omitempty"`
Policy string `json:"policy"` // Which policy matched (for debugging)
}
SkipResult records why an instance was skipped.
type SkipSummary ¶
type SkipSummary struct {
Instance string `json:"instance"`
Reason string `json:"reason"`
IssueID string `json:"issue_id,omitempty"`
}
SkipSummary records an instance that was skipped.
type Step ¶
type Step struct {
// Name identifies this step (e.g., "crashtest", "collision-check").
Name string
// Tool is the binary to execute.
Tool Tool
// Args are the command-line arguments.
// Supports placeholders: <RUN_DIR>, <SEED>, <DB_DIR>, <PREV_DB_DIR>.
Args []string
// Env are additional environment variables.
Env map[string]string
// RequiresOracle indicates if this step needs oracle tools.
RequiresOracle bool
// DiscoverDBPath indicates the runner should discover the DB path
// from the previous step's artifacts and make it available as <DB_DIR>.
DiscoverDBPath bool
}
Step represents a single execution step in a composite instance.
type StepResult ¶
type StepResult struct {
// StepName identifies which step this result is for.
StepName string
// Passed indicates if this step succeeded.
Passed bool
// ExitCode is the process exit code.
ExitCode int
// FailureReason describes why the step failed (if applicable).
FailureReason string
// DurationMs is how long the step took.
DurationMs int64
// DBPath is the discovered DB path (if DiscoverDBPath was set).
DBPath string
// LogPath is the path to this step's log file.
LogPath string
}
StepResult captures the outcome of a single step execution.
type StopCondition ¶
type StopCondition struct {
// RequireTermination requires the process to terminate within the timeout.
// If false, the runner kills the process after the timeout but does not treat it as a failure.
RequireTermination bool
// RequireFinalVerificationPass requires the tool's final verification to pass.
// For stresstest, this means expected state verification.
// For crashtest, this means recovery verification.
RequireFinalVerificationPass bool
// RequireOracleCheckConsistencyOK requires `ldb checkconsistency` to return OK.
// Only applies to instances with RequiresOracle=true.
RequireOracleCheckConsistencyOK bool
// DedupeByFingerprint enables deduplication by failure fingerprint.
// When true, repeated failures with the same fingerprint are marked as duplicates.
DedupeByFingerprint bool
}
StopCondition defines when an instance run is considered complete and what constitutes success vs failure.
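One way to picture how these flags combine is a small evaluator over the observed facts of a run. This is a sketch under assumptions: `observation` and `evaluate` are hypothetical, the checks are applied in field order, and DedupeByFingerprint is left out of the verdict because it affects duplicate marking rather than pass/fail.

```go
package main

import "fmt"

// StopCondition mirrors the documented struct.
type StopCondition struct {
	RequireTermination              bool
	RequireFinalVerificationPass    bool
	RequireOracleCheckConsistencyOK bool
	DedupeByFingerprint             bool
}

// observation is a hypothetical record of what happened during a run.
type observation struct {
	TimedOut                bool
	FinalVerificationPassed bool
	RequiresOracle          bool
	OracleOK                bool
}

// evaluate returns the verdict and the first reason a requirement was not met.
func evaluate(c StopCondition, o observation) (bool, string) {
	if c.RequireTermination && o.TimedOut {
		return false, "process did not terminate within the timeout"
	}
	if c.RequireFinalVerificationPass && !o.FinalVerificationPassed {
		return false, "final verification failed"
	}
	// The oracle requirement only applies to instances that need the oracle.
	if c.RequireOracleCheckConsistencyOK && o.RequiresOracle && !o.OracleOK {
		return false, "ldb checkconsistency did not return OK"
	}
	return true, "all required conditions met"
}

func main() {
	cond := StopCondition{RequireTermination: true, RequireFinalVerificationPass: true}
	ok, reason := evaluate(cond, observation{TimedOut: true})
	fmt.Println(ok, reason)
}
```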
func DefaultStopCondition ¶
func DefaultStopCondition() StopCondition
DefaultStopCondition returns the default stop condition for most instances.
type SweepCase ¶
type SweepCase struct {
// ID is a stable identifier for this case (e.g., "cycles_4_mode_drop").
ID string
// Params maps parameter names to their values for this case.
Params map[string]string
}
SweepCase represents a single concrete case in a sweep expansion.
func DisableWALFaultFSMinimizeCases ¶
func DisableWALFaultFSMinimizeCases() []SweepCase
DisableWALFaultFSMinimizeCases returns the sweep cases for disablewal-faultfs-minimize. These mirror the cases in scripts/status/run_durability_repros.sh.
type SweepInstance ¶
type SweepInstance struct {
// Base is the base instance (used as template).
Base Instance
// Params are the parameters to sweep over.
Params []SweepParam
// Cases are the explicit cases to run (if provided, Params is ignored).
// This allows defining arbitrary combinations rather than full cross-product.
Cases []SweepCase
}
SweepInstance defines a parameterized instance that expands into multiple runs.
func StatusSweepInstances ¶
func StatusSweepInstances() []SweepInstance
StatusSweepInstances returns sweep (parameter expansion) instances. These instances expand into multiple concrete runs.
func (*SweepInstance) Expand ¶
func (s *SweepInstance) Expand() []Instance
Expand returns the concrete instances for this sweep. Each returned instance has a unique Name derived from the sweep case.
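When no explicit Cases are given, expanding Params amounts to a cross-product. The sketch below shows one way to build it, assuming case IDs are formed by joining name and value pairs with underscores, as in the "cycles_4_mode_drop" example. `expandCases` is a hypothetical helper and the parameter values are illustrative.

```go
package main

import "fmt"

// SweepParam and SweepCase mirror the documented types.
type SweepParam struct {
	Name   string
	Values []string
}

type SweepCase struct {
	ID     string
	Params map[string]string
}

// expandCases builds the full cross-product of the parameter values,
// preserving parameter order in each case ID.
func expandCases(params []SweepParam) []SweepCase {
	if len(params) == 0 {
		return nil
	}
	cases := []SweepCase{{Params: map[string]string{}}}
	for _, p := range params {
		var next []SweepCase
		for _, c := range cases {
			for _, v := range p.Values {
				// Copy the partial assignment so cases do not share a map.
				merged := make(map[string]string, len(c.Params)+1)
				for k, val := range c.Params {
					merged[k] = val
				}
				merged[p.Name] = v
				id := p.Name + "_" + v
				if c.ID != "" {
					id = c.ID + "_" + id
				}
				next = append(next, SweepCase{ID: id, Params: merged})
			}
		}
		cases = next
	}
	return cases
}

func main() {
	params := []SweepParam{
		{Name: "cycles", Values: []string{"4", "8"}},
		{Name: "mode", Values: []string{"drop", "sync"}},
	}
	for _, c := range expandCases(params) {
		fmt.Println(c.ID)
	}
}
```

Explicit Cases avoid this growth when only a few combinations are of interest, which is why they take precedence over Params.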
type SweepParam ¶
type SweepParam struct {
// Name is the parameter name (e.g., "cycles", "mode").
Name string
// Values are the values to sweep over.
Values []string
}
SweepParam represents a parameter that can be varied in a sweep.
type SyntheticFailConfig ¶
type SyntheticFailConfig struct {
// Enabled activates the synthetic failure mode.
Enabled bool
// FailAfterOps causes failure after N operations.
// Used to exercise minimization: minimizer should reduce N.
FailAfterOps int
// FailureKind is the classification for the synthetic failure.
FailureKind string
// FailureMessage is the human-readable failure reason.
FailureMessage string
}
SyntheticFailConfig configures the synthetic failure hook. This is used for CI testing of minimization and failure classification.
type Tags ¶
type Tags struct {
// Campaign is the campaign identifier (e.g., "C05").
Campaign string `json:"campaign"`
// Tier is the execution tier (quick/nightly).
Tier string `json:"tier"`
// Tool is the binary used (stresstest/crashtest/goldentest/adversarialtest/sstdump).
Tool string `json:"tool"`
// Kind is the high-level category (stress/crash/golden/status/adversarial).
Kind string `json:"kind"`
// OracleRequired indicates if the instance requires C++ oracle tools.
OracleRequired bool `json:"oracle_required"`
// Group is the group prefix (e.g., "status.durability").
Group string `json:"group"`
// FaultKind is the fault injection kind (none/read/write/sync/crash/corrupt).
FaultKind string `json:"fault_kind"`
// FaultScope is the fault injection scope (worker/flusher/reopener/global).
FaultScope string `json:"fault_scope"`
// Extra contains optional instance-specific metadata.
Extra map[string]string `json:"extra,omitempty"`
}
Tags represents the structured tag set for an instance. Required tags are always present; optional tags use the Extra map.
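To show how the struct tags shape the serialized form, the sketch below encodes a tag set with encoding/json. The struct is redeclared so the example is self-contained, `encode` is a hypothetical helper, and the field values are illustrative. Because Extra uses omitempty, it is left out of the output when nil, while the required tags are always emitted.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Tags mirrors the documented struct.
type Tags struct {
	Campaign       string            `json:"campaign"`
	Tier           string            `json:"tier"`
	Tool           string            `json:"tool"`
	Kind           string            `json:"kind"`
	OracleRequired bool              `json:"oracle_required"`
	Group          string            `json:"group"`
	FaultKind      string            `json:"fault_kind"`
	FaultScope     string            `json:"fault_scope"`
	Extra          map[string]string `json:"extra,omitempty"`
}

// encode is a hypothetical helper that returns the compact JSON form.
func encode(t Tags) (string, error) {
	b, err := json.Marshal(t)
	if err != nil {
		return "", err
	}
	return string(b), nil
}

func main() {
	s, err := encode(Tags{
		Campaign:   "C05",
		Tier:       "quick",
		Tool:       "stresstest",
		Kind:       "stress",
		Group:      "stress",
		FaultKind:  "none",
		FaultScope: "global",
	})
	if err != nil {
		fmt.Println("encode failed:", err)
		return
	}
	fmt.Println(s)
}
```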
type Tier ¶
type Tier string
Tier represents the test intensity level. Each tier has different duration, concurrency, and thoroughness settings.
type Tool ¶
type Tool string
Tool represents the test binary to execute.
const (
// ToolStress runs the stresstest binary for concurrent workloads.
ToolStress Tool = "stresstest"
// ToolCrash runs the crashtest binary for crash recovery testing.
ToolCrash Tool = "crashtest"
// ToolAdversarial runs the adversarialtest binary for corruption attacks.
ToolAdversarial Tool = "adversarialtest"
// ToolGolden runs the goldentest suite for C++ compatibility.
ToolGolden Tool = "goldentest"
// ToolSSTDump runs the sstdump binary for SST inspection/verification.
ToolSSTDump Tool = "sstdump"
)
type ToolResult ¶
ToolResult contains the result of running an oracle tool.
func (*ToolResult) OK ¶
func (r *ToolResult) OK() bool
OK returns true if the tool exited successfully.
type TraceConfig ¶
type TraceConfig struct {
// Enabled controls whether trace capture is active.
Enabled bool
// MaxSizeBytes is the maximum trace file size before truncation.
// Default: 256MB (256 * 1024 * 1024).
MaxSizeBytes int64
// TraceDir is the subdirectory under the run directory for trace files.
// Default: "trace".
TraceDir string
}
TraceConfig defines trace capture behavior for campaign runs.
func DefaultTraceConfig ¶
func DefaultTraceConfig() TraceConfig
DefaultTraceConfig returns the default trace configuration.
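The defaults named in the TraceConfig field comments (256MB and "trace") could be applied as in the sketch below. `withDefaults` is a hypothetical helper, not the package's DefaultTraceConfig, and it assumes a zero value means "use the default".

```go
package main

import "fmt"

// TraceConfig mirrors the documented struct.
type TraceConfig struct {
	Enabled      bool
	MaxSizeBytes int64
	TraceDir     string
}

const defaultMaxTraceBytes = 256 * 1024 * 1024 // 256MB, per the field comment

// withDefaults fills unset fields with the documented defaults and leaves
// explicitly configured values untouched.
func withDefaults(c TraceConfig) TraceConfig {
	if c.MaxSizeBytes == 0 {
		c.MaxSizeBytes = defaultMaxTraceBytes
	}
	if c.TraceDir == "" {
		c.TraceDir = "trace"
	}
	return c
}

func main() {
	cfg := withDefaults(TraceConfig{Enabled: true})
	fmt.Println(cfg.Enabled, cfg.MaxSizeBytes, cfg.TraceDir)
}
```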
type TraceResult ¶
type TraceResult struct {
// Path is the path to the trace file (if captured).
Path string
// BytesWritten is the number of bytes written to the trace file.
BytesWritten int64
// Truncated indicates if the trace was truncated due to size limits.
Truncated bool
// ReplayCommand is the command to replay this trace.
ReplayCommand string
}
TraceResult captures the outcome of trace handling for a run.
func CollectTraceResult ¶
func CollectTraceResult(runDir, dbPath, binDir string, config TraceConfig) *TraceResult
CollectTraceResult gathers trace information after a run completes.