scoring

package
v0.3.4
Published: Jun 3, 2026 License: MIT Imports: 17 Imported by: 0

Documentation

Overview

Package scoring defines and applies the dot-agents outcome-scoring rubric: an explainable quality score per agent-run iteration and session, computed from already-captured telemetry.

This file is the versioned rubric data structure. It is the Go twin of docs/OUTCOME_SCORING_RUBRIC.md and must agree with it: the document is the canonical contract, this code is its machine-readable form. Changing the rubric means editing both in the same commit and bumping RubricVersion.

score_iteration.go provides the single-iteration scoring entry point workflow-client-commands close-task uses to score the just-closed iteration without rerunning the whole log.

Old scores are computed from immutable inputs (iter-N.yaml + commit SHA + frozen transcript window) so they stay valid forever once written — only a RubricVersion bump or a signals-shape change warrants backfill. The default close-task flow therefore wants "score iteration N", not "score every iteration the log has ever recorded."

signal_hook_outcomes.go is the R1.5 hook-outcome objective extractor: it reads the per-iteration `.agents/active/iteration-log/iter-N.hook-outcomes.yaml` sidecar written by `da workflow hook-outcome write` (commands/workflow/hook_outcome.go, R1.5 t1) and folds the records into a single `hook_outcomes` SignalValue feeding the rubric.

Scope per R1.5 t1b (PR #97) and R1.5 design D3/D4:

  • Only `intervention_class` in {`prevent_before_action`, `remediate_at_stop`} contributes to scoring.
  • `continuity_advice` (pre_compact) and `observe_tool_result` (post-tool) are deferred to R1.5.1 and never reach the sub-score under RubricVersion 2.1.0 — they remain in the sidecar as audit-only observations.
  • A pre-action and a terminal-remediation record sharing the same `correlation_id` + `rule_id` collapse to one record (the more severe, remediate) per D4 dedup, so a prevented-then-stop record does not double-count.

Sub-score (per design D3):

  • At least one collapsed record at `remediate` ⇒ 0.0.
  • All collapsed records at `advise`, no `remediate` ⇒ 0.6.
  • All collapsed records at `allow`, no `advise`/`remediate` ⇒ 1.0.
  • No in-scope records (or sidecar missing/unreadable) ⇒ absent (preserves the "absent does not vote" invariant from scorer.go).

The extractor is read-only and pure-ish: it does file I/O bounded to the sidecar path, never mutates it, and never escalates a parse failure to an error — every recoverable failure mode degrades to AbsentSignal so a malformed sidecar cannot break the rest of the score.
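
The sub-score rules above can be sketched as a small pure function. This is a minimal illustration, not the package's extractor: the `outcome` type, the `subScore` helper, and the plain-string severities are hypothetical, and the D4 dedup is assumed to have already run. The sketch also assumes that a mix of `allow` and `advise` records scores 0.6 (worst severity wins), which the rules above do not spell out.

```go
package main

import "fmt"

// outcome is a hypothetical, already-collapsed hook record; only its
// severity matters for the sub-score.
type outcome struct {
	Severity string // "allow", "advise", or "remediate"
}

// subScore maps collapsed records to (value, present). present=false
// stands in for the absent signal, which does not vote.
func subScore(records []outcome) (float64, bool) {
	if len(records) == 0 {
		return 0, false // no in-scope records: absent
	}
	sawAdvise := false
	for _, r := range records {
		switch r.Severity {
		case "remediate":
			return 0.0, true // any remediate dominates
		case "advise":
			sawAdvise = true
		}
	}
	if sawAdvise {
		return 0.6, true
	}
	return 1.0, true // assumption: everything else counts as allow
}

func main() {
	v, ok := subScore([]outcome{{"allow"}, {"advise"}})
	fmt.Println(v, ok)
}
```

Returning a presence flag next to the value keeps "no evidence" distinct from a real 0.0, which is what lets the scorer drop the signal from the weighted mean.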

Constants

const BandUnscored = "unscored"

BandUnscored is the band reported when no signal was present and the iteration therefore has no numeric score.

const RubricVersion = "2.1.0"

RubricVersion is the semantic version of the active outcome-scoring rubric.

Bumping it is a deliberate, reviewable act — see the versioning policy in docs/OUTCOME_SCORING_RUBRIC.md. major: the signal set or combination method changed. minor: a weight or sub-score mapping changed. patch: docs or band thresholds only. Every persisted score records the version it was computed under, so a rubric change never silently invalidates historical scores.

Variables

This section is empty.

Functions

func DeriveLinkedTracesToOutcomes

func DeriveLinkedTracesToOutcomes(rec IterationRecord) bool

DeriveLinkedTracesToOutcomes computes the legacy iteration-log marker from the structured replacement that the structured-claims task introduced.

The proposal calls the boolean "uncomputed -- the clearest marker of the gap" between captured evidence and outcome assessment. The structured-claims schema deprecated the self-reported boolean and replaced it with a named list of trace ↔ outcome pairs: verifier.linked_traces. The boolean is now fully derivable — true exactly when the iteration recorded at least one pair — so persisting it is the close-out act for the proposal's named gap.

Returning a derived value (instead of trusting the legacy self-report) is the point: the rubber-stamp era of the boolean is over.

An iteration may carry more than one verifier entry (unit + api + ui-e2e, say); the marker is true when at least one of them recorded a pair.
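
A minimal sketch of the derivation described above, assuming each verifier entry exposes its pairs as a slice. The `verifier` type and its `LinkedTraces` field name are hypothetical stand-ins; only the `TraceRef` / `OutcomeRef` pair shape comes from the documented LinkedTrace type.

```go
package main

import "fmt"

// linkedTrace mirrors the documented LinkedTrace shape.
type linkedTrace struct {
	TraceRef   string
	OutcomeRef string
}

// verifier is a hypothetical stand-in for a verifier record; the
// LinkedTraces field name is assumed for illustration.
type verifier struct {
	LinkedTraces []linkedTrace
}

// deriveLinked reports whether any verifier recorded at least one
// trace/outcome pair.
func deriveLinked(verifiers []verifier) bool {
	for _, v := range verifiers {
		if len(v.LinkedTraces) > 0 {
			return true
		}
	}
	return false
}

func main() {
	vs := []verifier{
		{}, // a verifier that recorded no pairs
		{LinkedTraces: []linkedTrace{{TraceRef: "trace-1", OutcomeRef: "outcome-1"}}},
	}
	fmt.Println(deriveLinked(vs))
}
```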

func IterationScorePath

func IterationScorePath(iterLogDir string, iteration int) string

IterationScorePath returns the sidecar path for an iteration's score: the iter-N.score.yaml file adjacent to iter-N.yaml in the iteration log dir.

func ScoreIteration

func ScoreIteration(iterLogDir, repoDir string, n int, transcriptDirs ...string) (Score, IterationRecord, error)

ScoreIteration computes the Score and returns the source IterationRecord for iteration n alone. iterLogDir + repoDir + transcriptDirs match BuildSignalSets's contract so callers reuse the same path conventions.

The implementation runs BuildSignalSets across the full log (needed so the per-iteration commit-window resolver sees its neighbours) and then returns the single matching SignalSet's Score. The wider pipeline is pure-ish — the heavy work is git topology + transcript scan, both of which are bounded by the iter-log size — so the optimisation target is not "skip the pipeline" but "skip the per-iter sidecar writes," which the close-task --recompute=current default already achieves.

Returns an error when iter-N.yaml does not exist or BuildSignalSets itself errors.

func SessionScorePath

func SessionScorePath(iterLogDir, sessionID string) (string, error)

SessionScorePath returns the sidecar path for a session's score, named by the session_id so distinct sessions never collide. Callers must pass a non-empty session id; empty ids are not addressable on disk.

func WriteIterationScore

func WriteIterationScore(iterLogDir string, score Score) (string, error)

WriteIterationScore persists a single iteration score to its sidecar path and returns the written path. The file is written atomically: a temp file in the same directory is renamed into place so concurrent readers never see a partially-written sidecar.

func WriteIterationScoreWithRecord

func WriteIterationScoreWithRecord(iterLogDir string, s Score, rec IterationRecord) (string, error)

WriteIterationScoreWithRecord is the augmented per-iteration write: it derives the linked_traces_to_outcomes marker from rec and persists it alongside the score.

func WriteIterationScores

func WriteIterationScores(iterLogDir string, scores []Score) ([]string, error)

WriteIterationScores persists every per-iteration score and returns the written paths in input order. The first write failure returns immediately with the paths written up to that point — the caller decides whether to retry or roll back.

func WriteSessionScore

func WriteSessionScore(iterLogDir string, ss SessionScore) (string, error)

WriteSessionScore persists a session aggregate to its sidecar path.

func WriteSessionScores

func WriteSessionScores(iterLogDir string, sessions []SessionScore) ([]string, error)

WriteSessionScores persists every session aggregate and returns the written paths in input order.

Types

type AgentInfo

type AgentInfo struct {
	SessionID string
	Harness   string
	Model     string
}

AgentInfo identifies the agent harness behind an iteration. Populated only by v2 entries; the session ID anchors transcript backfill.

type AgentRole

type AgentRole string

AgentRole identifies which role made a self-reported claim. The v2 iteration-log blocks are role-owned, so the integrity track can attribute a claimed-vs-observed gap to the role responsible for it.

const (
	// RoleImpl is the implementation agent / loop worker.
	RoleImpl AgentRole = "impl"
	// RoleVerifier is the verification agent.
	RoleVerifier AgentRole = "verifier"
	// RoleReview is the review agent.
	RoleReview AgentRole = "review"
)

type BackfillSignals

type BackfillSignals struct {
	// Iteration is the iteration number these signals belong to.
	Iteration int

	// TokenEfficiency is the token_efficiency signal: cache_hit_rate as a
	// sub-score in [0,1]. Present from native session_tokens when the
	// iteration log captured it, otherwise backfilled from the transcript
	// window; absent when no transcript covers the window.
	TokenEfficiency SignalValue

	// ToolErrorRate is the fraction of tool calls in the iteration's window
	// that errored, feeding the assemble slice's correction_pressure signal.
	// ToolErrorRatePresent is false when no tool-call evidence was found — a
	// caller must check it before reading ToolErrorRate.
	ToolErrorRate        float64
	ToolErrorRatePresent bool
}

BackfillSignals carries the transcript-reconstructed signals for one iteration. Both fields are always set; absence is expressed inside the SignalValue / present-bool, never by a nil struct.

func BackfillIterations

func BackfillIterations(records []IterationRecord, repoDir string, transcriptDirs ...string) ([]BackfillSignals, error)

BackfillIterations reconstructs token-efficiency and tool-error signals for a sorted slice of iteration records. It is the thin, side-effecting wrapper around the pure scanner: it resolves each iteration's commit time via git in repoDir, derives the per-iteration windows, and scans every transcriptDir.

records MUST be sorted ascending by iteration (LoadIterationLog returns them so). The window for iteration N is (commitTime(N-1), commitTime(N)]; the first iteration's window is open on the left. An iteration whose commit SHA git cannot resolve is reported with both signals absent rather than failing the whole batch — squashed and rebased history is expected.

type CombinationMethod

type CombinationMethod string

CombinationMethod names how per-signal sub-scores combine into the final score. Naming it (rather than hard-coding the formula in the scorer) keeps a change of method a reviewable diff against the rubric.

const CombineWeightedMeanRenormalized CombinationMethod = "weighted_mean_renormalized"

CombineWeightedMeanRenormalized is the combination method:

score = Σ(weightᵢ · sub_scoreᵢ) / Σ(weightᵢ)   over present signals i

Absent signals drop out of both sums; the remaining weights renormalize, so a missing signal neither inflates nor deflates the score. If every signal is absent the iteration is unscored.
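
A minimal sketch of the formula above. The `signal` type and `combine` helper are hypothetical; the package's own scorer also produces a per-signal breakdown, which is omitted here.

```go
package main

import "fmt"

// signal is a hypothetical flattened view of one rubric signal.
type signal struct {
	Weight   float64
	SubScore float64
	Present  bool
}

// combine returns the renormalized weighted mean over present signals.
// The second result is false when every signal is absent (unscored).
func combine(signals []signal) (float64, bool) {
	var num, den float64
	for _, s := range signals {
		if !s.Present {
			continue // absent signals drop out of both sums
		}
		num += s.Weight * s.SubScore
		den += s.Weight
	}
	if den == 0 {
		return 0, false
	}
	return num / den, true
}

func main() {
	v, scored := combine([]signal{
		{Weight: 0.20, SubScore: 1.0, Present: true},
		{Weight: 0.18, SubScore: 0.5, Present: true},
		{Weight: 0.09, Present: false}, // does not vote
	})
	fmt.Printf("%.4f %v\n", v, scored)
}
```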

type GitSignals

type GitSignals struct {
	// LandedObserved is the objective `landed` signal: did the iteration's
	// commit survive into the trunk. 1.0 when reachable and not reverted, 0.0
	// when reverted or orphaned, absent when the commit SHA cannot be resolved.
	LandedObserved SignalValue
	// ScopeObserved is the objective `scope` signal: the fraction of the
	// commit's changed files that fall inside the task's declared write_scope.
	// Absent when no write_scope is resolvable for the task — true for most
	// historical iterations, where the assemble slice falls back to scope_note.
	ScopeObserved SignalValue
}

GitSignals carries the objective, git-derived sub-scores for one iteration. Each field is a SignalValue: a sub-score in [0,1] when the topology could be read, or absent when it could not — absent is first-class and the assemble slice falls back to the self-reported source.

func ExtractGitSignals

func ExtractGitSignals(rec IterationRecord, repoDir string) (GitSignals, error)

ExtractGitSignals computes the git-topology objective signals for one iteration against the repository at repoDir.

It returns an error only for unexpected failures — chiefly repoDir not being a git repository. A commit SHA that simply does not resolve is NOT an error: it is an absent signal, because v1 iteration-log entries carry abbreviated SHAs from since-rebased history and some carry no SHA at all.

type ImplBlock

type ImplBlock struct {
	Summary           string
	ScopeNote         string
	Retries           int
	FocusedTestsAdded int
	FocusedTestsPass  OptionalBool
	// TestsTotalPass is v1's iteration-wide test pass flag. v2 records the
	// equivalent per verifier, so this stays unset for v2 entries.
	TestsTotalPass OptionalBool
	SelfAssessment SelfAssessment
}

ImplBlock is the implementation-role contribution to an iteration.

type IntegrityObservation

type IntegrityObservation struct {
	Signal   SignalID
	Role     AgentRole
	Claimed  SignalValue
	Observed SignalValue
}

IntegrityObservation pairs an agent's self-reported claim with the objective observation for one two-way signal, attributed to the role that made the claim. The claimed-vs-observed gap is the integrity signal — it never feeds the numeric score (see docs/OUTCOME_SCORING_RUBRIC.md).

func (IntegrityObservation) Comparable

func (o IntegrityObservation) Comparable() bool

Comparable reports whether both sides are present, so Delta is meaningful.

func (IntegrityObservation) Delta

func (o IntegrityObservation) Delta() float64

Delta is observed minus claimed when both sides are present: a negative delta is an over-claim, the agent having reported better than reality. It is 0 when the pair is not Comparable.
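
The two methods can be illustrated with a small sketch. The `signalValue` fields shown here are assumptions made for illustration, since the real SignalValue shape is not documented in this section.

```go
package main

import "fmt"

// signalValue is a hypothetical stand-in for SignalValue.
type signalValue struct {
	Present bool
	Value   float64
}

// observation pairs a claimed value with an observed one.
type observation struct {
	Claimed  signalValue
	Observed signalValue
}

// isComparable reports whether both sides are present.
func (o observation) isComparable() bool {
	return o.Claimed.Present && o.Observed.Present
}

// delta is observed minus claimed, or 0 when the pair is not comparable.
func (o observation) delta() float64 {
	if !o.isComparable() {
		return 0
	}
	return o.Observed.Value - o.Claimed.Value
}

func main() {
	// The agent claimed 1.0 but only 0.5 was observed: an over-claim.
	o := observation{
		Claimed:  signalValue{Present: true, Value: 1.0},
		Observed: signalValue{Present: true, Value: 0.5},
	}
	fmt.Println(o.isComparable(), o.delta())
}
```

Because a non-comparable pair also yields 0, callers that need to tell "no gap" from "no data" should check comparability first.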

type IterationObjectives

type IterationObjectives struct {
	// RanCliCommand: did the agent actually invoke a CLI tool in the window.
	RanCliCommand SignalValue
	// CommittedAfterTests: did a test command run in the window before the
	// iteration's commit.
	CommittedAfterTests SignalValue
	// ReadLoopState: was loop-state.md read in the window.
	ReadLoopState SignalValue
}

IterationObjectives carries facts about an iteration checked objectively from the agent transcripts, in place of the rubber-stamped self_assessment booleans the boolean-effectiveness analysis (iter-66 dogfood) showed carried no information.

Each entry is a SignalValue — Present with a 0/1 sub-score for a checked fact, Absent when the transcript window had no coverage. These observations surface alongside the score as a parallel record of process discipline; they do not enter the numeric score directly. The self-report counterparts have been removed from the schema, so there is nothing to pair against in the integrity track.

func ExtractIterationObjectives

func ExtractIterationObjectives(_ IterationRecord, window IterationWindow, transcriptDirs ...string) IterationObjectives

ExtractIterationObjectives runs every objective check for one iteration. window scopes the transcript scans (use the backfill's IterationWindow); empty transcriptDirs leave the transcript-derived signals absent.

The rec argument is retained for future objective checks that need diff stats or other iteration-log facts; the current set is transcript-only.

type IterationRecord

type IterationRecord struct {
	SchemaVersion int
	Iteration     int
	Date          string
	Wave          string
	TaskID        string
	Commit        string
	FilesChanged  int
	LinesAdded    int
	LinesRemoved  int
	FirstCommit   bool
	CheckpointAt  string

	Agent         AgentInfo
	SessionTokens *TokenUsage // nil when the entry never captured token telemetry

	Impl      ImplBlock
	Verifiers []VerifierRecord
	Review    ReviewBlock
}

IterationRecord is one agent-run iteration, normalized from either iteration-log schema into a single shape the signal extractors consume.

The iteration log carries two schemas: v1 is flat (test/scope/self-assessment fields at the top level), v2 nests them under role-owned impl / verifiers / review blocks. Both normalize here. Fields a given schema never carried are left zero — an OptionalBool stays unset, so "the agent did not report" is distinguishable from "the agent reported false".

func LoadIterationLog

func LoadIterationLog(dir string) ([]IterationRecord, error)

LoadIterationLog reads an iteration-log directory — every iter-*.yaml plus the historical.yaml archive — into one iteration-sorted slice of records.

historical.yaml duplicates the early iterations that also have a dedicated iter-N.yaml; the dedicated file is canonical and wins. A historical entry is kept only when no dedicated file covers that iteration.
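
The precedence rule above (dedicated file wins, historical entry fills gaps) can be sketched like this. The `record` type, its `Source` field, and `mergeRecords` are hypothetical; parsing and file discovery are left out.

```go
package main

import (
	"fmt"
	"sort"
)

// record is a hypothetical, pared-down iteration record.
type record struct {
	Iteration int
	Source    string // "dedicated" or "historical", for illustration only
}

// mergeRecords keeps the dedicated record whenever both sources cover an
// iteration, and returns the result sorted ascending by iteration.
func mergeRecords(dedicated, historical []record) []record {
	byIter := make(map[int]record, len(dedicated)+len(historical))
	for _, r := range historical {
		byIter[r.Iteration] = r
	}
	for _, r := range dedicated {
		byIter[r.Iteration] = r // dedicated overwrites historical
	}
	out := make([]record, 0, len(byIter))
	for _, r := range byIter {
		out = append(out, r)
	}
	sort.Slice(out, func(i, j int) bool { return out[i].Iteration < out[j].Iteration })
	return out
}

func main() {
	merged := mergeRecords(
		[]record{{2, "dedicated"}, {3, "dedicated"}},
		[]record{{1, "historical"}, {2, "historical"}},
	)
	for _, r := range merged {
		fmt.Println(r.Iteration, r.Source)
	}
}
```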

func ParseIterationRecord

func ParseIterationRecord(data []byte) (IterationRecord, error)

ParseIterationRecord parses one iteration-log document, dispatching on its schema_version. It is the schema-aware seam: callers never see the raw v1/v2 difference.

type IterationWindow

type IterationWindow struct {
	Iteration int
	Start     time.Time
	End       time.Time
}

IterationWindow pairs an iteration with the transcript time window that belongs to it: (Start, End]. Start is the predecessor commit's time (zero for the first iteration); End is this iteration's commit time.

type IterlogSignals

type IterlogSignals struct {
	// ScopeClaimed is the self-reported scope adherence, read from the
	// impl block's scope_note: on-target → 1.0, partial → 0.5,
	// scope-breach → 0.0. A free-text note whose lowercased form begins
	// "on-target" also maps to 1.0. Empty or unrecognized → absent.
	ScopeClaimed SignalValue

	// TestsClaimed is the self-reported test outcome: the fraction of the
	// *set* tri-state pass flags (impl.focused_tests_pass,
	// impl.tests_total_pass, and each verifiers[].tests_total_pass) whose
	// value is true. Absent when no such flag is set anywhere in the entry.
	TestsClaimed SignalValue

	// Verifier is the OBJECTIVE verifier result. Over the entry's verifier
	// records whose status is pass/fail/partial (unknown excluded) it is the
	// mean of pass→1.0, partial→0.5, fail→0.0; a verifier's recorded
	// result_artifact, when readable under repoRoot, overrides its inline
	// status. For v1 entries (no verifiers array) it falls back to
	// impl.tests_total_pass as a proxy. Absent when no verifier evidence
	// exists.
	Verifier SignalValue

	// VerifierClaimed is the self-reported verification diligence — at v2.0.0
	// it is read from linked_traces_to_outcomes alone, after the
	// boolean-effectiveness analysis showed ran_cli_command and
	// committed_after_tests were rubber-stamped (~98% true, no information).
	// The objective replacements live in IterationObjectives; this field will
	// be retired once structured-claims expands linked_traces_to_outcomes into
	// a named-trace list.
	VerifierClaimed SignalValue

	// LandedClaimed is the self-reported landing outcome: the
	// self_assessment.persisted_via_workflow_commands note combined with
	// review.overall_decision (accept→1.0, reject→0.0, escalate→0.5).
	// Absent when neither source says anything.
	LandedClaimed SignalValue

	// Retries is the raw retry count: impl.retries plus the sum of every
	// verifiers[].retries. A raw input to correction_pressure.
	Retries int

	// UserCorrections is the raw count of post-invocation corrections read
	// from the review-decision.yaml artifact (review.decision_artifact
	// resolved under repoRoot): the length of post_invocation.user_corrections
	// plus post_invocation.retries_in_loop. 0 when the artifact is
	// unavailable. A raw input to correction_pressure.
	UserCorrections int
}

IterlogSignals is the bundle of signal values an iteration's own iteration-log entry yields — the self-reported claims plus the objective verifier evidence reachable from the entry's recorded artifact paths.

It is one input to the assemble slice, not a finished score. The SignalValue fields are sub-scores in [0,1] (or absent); Retries and UserCorrections are raw counts the assemble slice folds into the correction_pressure signal. This slice does no signal combination and no claimed-vs-observed integrity pairing — see docs/OUTCOME_SCORING_RUBRIC.md.

func ExtractIterlogSignals

func ExtractIterlogSignals(rec IterationRecord, repoRoot string) IterlogSignals

ExtractIterlogSignals derives the iteration-log-native signal values for one iteration record. repoRoot resolves the repo-relative verification-artifact paths the record carries (verifier result_artifact, review decision_artifact); when repoRoot is "" or an artifact is missing or unreadable the affected signal degrades gracefully — it becomes absent, or contributes 0. This function never panics and never returns an error.

type LinkedTrace

type LinkedTrace struct {
	TraceRef   string
	OutcomeRef string
}

LinkedTrace is one entry of verifier.linked_traces.

type OptionalBool

type OptionalBool struct {
	Set   bool
	Value bool
}

OptionalBool is a tri-state boolean parsed leniently from the iteration log. The schema specifies boolean|null for the test-pass flags, but real entries also carry integers (a pass count, e.g. tests_total_pass: 3) and strings. Unset (the zero value) means the agent did not report — distinct from a reported false.

func (*OptionalBool) UnmarshalYAML

func (o *OptionalBool) UnmarshalYAML(node *yaml.Node) error

UnmarshalYAML coerces a YAML scalar into a tri-state bool: null and the empty string stay unset; a real boolean maps directly; a non-zero integer is true; a parseable string is taken at face value.
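
The coercion rules can be sketched without a YAML dependency. Everything here is an assumption made for illustration: `coerce` is a hypothetical helper, the `kind` argument stands in for whatever scalar type information the YAML node carries, a zero integer is taken as a reported false, and an unparseable string is left unset.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// optionalBool mirrors the documented tri-state shape.
type optionalBool struct {
	Set   bool
	Value bool
}

// coerce maps a scalar to a tri-state bool. kind is a hypothetical tag:
// "null", "bool", "int", or "str".
func coerce(kind, value string) optionalBool {
	value = strings.TrimSpace(value)
	if kind == "null" || value == "" {
		return optionalBool{} // unset: the agent did not report
	}
	if kind == "int" {
		n, err := strconv.Atoi(value)
		if err != nil {
			return optionalBool{}
		}
		return optionalBool{Set: true, Value: n != 0} // e.g. a pass count
	}
	b, err := strconv.ParseBool(value)
	if err != nil {
		return optionalBool{} // assumption: unparseable strings stay unset
	}
	return optionalBool{Set: true, Value: b}
}

func main() {
	fmt.Println(coerce("int", "3"), coerce("str", "false"), coerce("null", ""))
}
```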

type PersistedContribution

type PersistedContribution struct {
	Signal          SignalID `yaml:"signal"`
	Label           string   `yaml:"label"`
	Present         bool     `yaml:"present"`
	SubScore        float64  `yaml:"sub_score"`
	Detail          string   `yaml:"detail,omitempty"`
	NominalWeight   float64  `yaml:"nominal_weight"`
	EffectiveWeight float64  `yaml:"effective_weight"`
	Contribution    float64  `yaml:"contribution"`
}

PersistedContribution is one row of the per-signal breakdown as written to disk.

type PersistedScore

type PersistedScore struct {
	Iteration     int                     `yaml:"iteration"`
	RubricVersion string                  `yaml:"rubric_version"`
	Scored        bool                    `yaml:"scored"`
	Value         float64                 `yaml:"value"`
	Band          string                  `yaml:"band"`
	Breakdown     []PersistedContribution `yaml:"breakdown"`
	// LinkedTracesToOutcomes is the derived legacy iteration-log marker:
	// true when verifier.linked_traces names at least one trace ↔ outcome pair.
	// Computed by DeriveLinkedTracesToOutcomes; populated by BuildPersistedScore.
	// Omitted from the YAML when the score was written without an IterationRecord.
	LinkedTracesToOutcomes bool `yaml:"linked_traces_to_outcomes,omitempty"`
}

PersistedScore is the on-disk YAML shape for one iteration's score. It is the durable record consumed by R1's CLI and (later) the R2 dashboard, so the shape is explicit and stable rather than reusing the in-memory Score directly. The breakdown is preserved row-per-row in rubric order so that renderers do not need access to the rubric to reproduce a stable display.

func BuildPersistedScore

func BuildPersistedScore(s Score, rec IterationRecord) PersistedScore

BuildPersistedScore augments a Score with the derived per-iteration markers before it is written. It is the caller-side constructor for the on-disk shape: anything that comes from outside the pure scorer (the linked-traces marker today, possibly more markers later) is folded in here rather than leaking into the Rubric.Score signature.

Callers that only have the Score (no IterationRecord) can keep using the raw WriteIterationScore path; the marker field on PersistedScore is omitted from the YAML when unset, so legacy callers do not regress.

type ReviewBlock

type ReviewBlock struct {
	Phase1Decision   string // accept | reject | escalate | ""
	Phase2Decision   string
	OverallDecision  string
	FailedGates      []string
	DecisionArtifact string
}

ReviewBlock is the review-role contribution. Present only on v2 entries.

type Rubric

type Rubric struct {
	Version     string
	Combination CombinationMethod
	Signals     []SignalSpec
	Bands       []ScoreBand
}

Rubric is the versioned outcome-scoring rubric: the signal set, their weights, the combination method, and the score bands. It is data, not behaviour — the scorer task consumes it; it does not redefine it.

func DefaultRubric

func DefaultRubric() Rubric

DefaultRubric returns the active, versioned rubric (RubricVersion).

Weights (2.1.0): correctness signals (landed 0.20, verifier 0.18, tests 0.17) total 0.55 and dominate; process signals (correction_pressure 0.13, scope 0.13, hook_outcomes 0.10) total 0.36; efficiency (token_efficiency 0.09) is the remainder. The 2.0.2 → 2.1.0 rebalance introduces hook_outcomes (objective hook-gate evidence — see R1.5 design D3 in .agents/workflow/specs/r1-5-hook-enforcement-telemetry/design.md) and trims every existing weight proportionally so the correctness / process / efficiency shape is preserved. Rationale and per-signal sourcing live in docs/OUTCOME_SCORING_RUBRIC.md.

func (Rubric) Band

func (r Rubric) Band(score float64) string

Band returns the band name for a numeric score in [0, 1]. Scores outside [0, 1] clamp to the nearest band. Callers report BandUnscored themselves for the no-signal case; this method always returns a numeric-range band.
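
A minimal sketch of the lookup, assuming a band ladder sorted descending by Min as ScoreBand describes. The band names and thresholds below are hypothetical; the real ladder lives in the rubric.

```go
package main

import "fmt"

// band mirrors the documented ScoreBand shape.
type band struct {
	Name string
	Min  float64
}

// bandFor returns the first band whose inclusive lower bound the score
// reaches. bands must be sorted descending by Min.
func bandFor(bands []band, score float64) string {
	if len(bands) == 0 {
		return ""
	}
	// Clamp out-of-range scores into [0, 1].
	if score < 0 {
		score = 0
	}
	if score > 1 {
		score = 1
	}
	for _, b := range bands {
		if score >= b.Min {
			return b.Name
		}
	}
	// Not reached when the ladder has a band anchored at 0.
	return bands[len(bands)-1].Name
}

func main() {
	// Hypothetical ladder for illustration only.
	bands := []band{{"high", 0.8}, {"mid", 0.5}, {"low", 0}}
	fmt.Println(bandFor(bands, 0.8), bandFor(bands, 0.79), bandFor(bands, -1))
}
```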

func (Rubric) Score

func (r Rubric) Score(set SignalSet) Score

Score applies the rubric to one SignalSet. It is pure — no I/O, no git, no transcripts. The combination method is weighted_mean_renormalized: absent signals drop out of both the numerator and the denominator, the remaining weights renormalize, and the present signals' contributions sum exactly to Value.

When every rubric signal is absent on set, Scored is false, Value is 0, and Band is BandUnscored. The Breakdown is still populated — every rubric signal gets a row marked Present=false — so the "unscored" explanation is complete.

func (Rubric) ScoreAll

func (r Rubric) ScoreAll(sets []SignalSet) []Score

ScoreAll applies the rubric to every SignalSet in sets, returning one Score per set in input order. A typical caller pairs ScoreAll with BuildSignalSets to produce the full per-iteration score series for an iteration log.

func (Rubric) Signal

func (r Rubric) Signal(id SignalID) (SignalSpec, bool)

Signal returns the spec for id and true, or a zero spec and false if id is not part of the rubric.

func (Rubric) TwoWaySignals

func (r Rubric) TwoWaySignals() []SignalSpec

TwoWaySignals returns the signals that carry both an objective and a self-reported source — the ones the scorer feeds into the integrity track.

func (Rubric) Validate

func (r Rubric) Validate() error

Validate checks the rubric's internal invariants: a pinned version and combination method, a non-empty signal set with unique IDs and positive weights summing to 1.0, and a non-empty band ladder sorted descending by Min with a band anchored at 0. It is the guard that keeps a rubric edit internally consistent.
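
The signal and band invariants listed above can be sketched as follows. This is not the package's Validate: the types are pared-down stand-ins, the version and combination-method checks are omitted, the floating-point tolerance on the weight sum is an assumption, and the band names and thresholds are hypothetical. The weights in `main` are the 2.1.0 values quoted under DefaultRubric.

```go
package main

import (
	"errors"
	"fmt"
	"math"
)

type signalSpec struct {
	ID     string
	Weight float64
}

type band struct {
	Name string
	Min  float64
}

// validate checks unique IDs, positive weights summing to 1.0, and a
// strictly descending band ladder whose last band is anchored at 0.
func validate(signals []signalSpec, bands []band) error {
	if len(signals) == 0 {
		return errors.New("rubric has no signals")
	}
	seen := make(map[string]bool, len(signals))
	sum := 0.0
	for _, s := range signals {
		if seen[s.ID] {
			return fmt.Errorf("duplicate signal id %q", s.ID)
		}
		seen[s.ID] = true
		if s.Weight <= 0 {
			return fmt.Errorf("signal %q has a non-positive weight", s.ID)
		}
		sum += s.Weight
	}
	if math.Abs(sum-1.0) > 1e-9 { // tolerance is an assumption
		return fmt.Errorf("weights sum to %v, want 1.0", sum)
	}
	if len(bands) == 0 {
		return errors.New("rubric has no bands")
	}
	for i := 1; i < len(bands); i++ {
		if bands[i].Min >= bands[i-1].Min {
			return errors.New("bands are not sorted descending by Min")
		}
	}
	if bands[len(bands)-1].Min != 0 {
		return errors.New("no band anchored at 0")
	}
	return nil
}

func main() {
	signals := []signalSpec{
		{"landed", 0.20}, {"verifier", 0.18}, {"tests", 0.17},
		{"correction_pressure", 0.13}, {"scope", 0.13},
		{"hook_outcomes", 0.10}, {"token_efficiency", 0.09},
	}
	bands := []band{{"high", 0.8}, {"mid", 0.5}, {"low", 0}} // hypothetical
	fmt.Println(validate(signals, bands))
}
```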

type Score

type Score struct {
	Iteration     int
	RubricVersion string
	// Value is the final score in [0, 1]; 0 when Scored is false.
	Value float64
	// Scored is false when every rubric signal was absent on the input set —
	// the rubric refuses to invent a score from nothing.
	Scored bool
	// Band is the human-readable label for Value, or BandUnscored when Scored
	// is false.
	Band string
	// Breakdown carries one row per rubric signal, in the rubric's declared
	// order, for stable rendering.
	Breakdown []SignalContribution
}

Score is the rubric-applied outcome for one iteration: the numeric value, the human-readable band, and the per-signal breakdown that explains how the value was assembled.

The score is explainable by construction: every signal in the rubric contributes one row to Breakdown, and the contributions of the present signals sum exactly to Value. A signal that was absent contributes a row with Present=false, EffectiveWeight=0, Contribution=0 — present in the breakdown so the explanation is complete, but voting nothing.

type ScoreBand

type ScoreBand struct {
	Name string
	Min  float64
}

ScoreBand is a human-readable label for a numeric-score range, identified by its inclusive lower bound. Bands are held sorted descending by Min.

type SelfAssessment

type SelfAssessment struct {
	ReadLoopState                 bool         // deprecated — objective check replaces
	OneItemOnly                   OptionalBool // kept (best scope-lift); tri-state
	CommittedAfterTests           bool         // deprecated — objective check replaces
	AlignedWithCanonicalTasks     OptionalBool // kept; tri-state
	PersistedViaWorkflowCommands  string
	RanCliCommand                 bool // deprecated — objective check replaces
	ExercisedNewScenario          bool
	TestsPositiveAndNegative      bool
	TestsUsedSandbox              bool
	CliProducedActionableFeedback string
	LinkedTracesToOutcomes        bool
	StayedUnder10Files            bool         // deprecated — arbitrary threshold
	NoDestructiveCommands         OptionalBool // kept; tri-state
	ScopedTestsToWriteScope       OptionalBool // kept; tri-state
	TddRefreshPerformed           bool         // deprecated — 0/22 true historically
}

SelfAssessment is the superset of agent-reported discipline flags. v1 carries them all in one flat block; v2 splits them across the impl and verifier blocks. The integrity track reads claims from here.

Surviving keeper fields are OptionalBool, so absent in the YAML stays distinct from a reported false; the boolean-effectiveness analysis showed that conflating the two was poisoning their information content. The fields the analysis identified as dead or rubber-stamped (ReadLoopState, CommittedAfterTests, RanCliCommand, StayedUnder10Files, TddRefreshPerformed) are kept as bool only so historical v1 entries still parse. The extractor no longer reads them; where an objective replacement exists, the transcript scanners in objective_checks.go check the fact directly.

type SessionIterRef

type SessionIterRef struct {
	Iteration int     `yaml:"iteration"`
	Scored    bool    `yaml:"scored"`
	Value     float64 `yaml:"value"`
	Band      string  `yaml:"band"`
}

SessionIterRef carries enough per-iteration detail to render a session view without having to fan out and read every iter-N.score.yaml sidecar.

type SessionScore

type SessionScore struct {
	SessionID     string           `yaml:"session_id"`
	RubricVersion string           `yaml:"rubric_version"`
	Iterations    []int            `yaml:"iterations"`
	Scored        bool             `yaml:"scored"`
	Value         float64          `yaml:"value"`
	Band          string           `yaml:"band"`
	PerIteration  []SessionIterRef `yaml:"per_iteration"`
}

SessionScore is the per-session aggregate written alongside the per-iteration scores. Value is the mean of the scored per-iteration values for the session; unscored iterations drop out of the average — the same "absent does not vote" rule the rubric uses for absent signals.

A session with no scored iterations is itself unscored: Scored=false, Value=0, Band=BandUnscored.

func AggregateSessions

func AggregateSessions(r Rubric, records []IterationRecord, scores []Score) []SessionScore

AggregateSessions groups the scores by their iteration's session_id and produces one SessionScore per session.

records and scores must align by iteration order (the order BuildSignalSets and ScoreAll return). Iterations whose record carries an empty session_id are silently skipped — an unaddressable session cannot be persisted as a sidecar, and the proposal mandates per-session addressability.

Output is sorted by SessionID for deterministic iteration / diffing.
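
The grouping, skipping, and averaging rules above can be sketched together. The types are hypothetical and flattened: the real function takes aligned record and score slices plus a Rubric for banding, all of which this sketch leaves out.

```go
package main

import (
	"fmt"
	"sort"
)

// iterScore is a hypothetical per-iteration score tagged with its session.
type iterScore struct {
	SessionID string
	Scored    bool
	Value     float64
}

// sessionAgg is a pared-down session aggregate.
type sessionAgg struct {
	SessionID string
	Scored    bool
	Value     float64
}

// aggregate groups by session, skips empty session IDs, averages only
// the scored iterations, and sorts the output by SessionID.
func aggregate(scores []iterScore) []sessionAgg {
	type acc struct {
		sum float64
		n   int
	}
	groups := map[string]*acc{}
	for _, s := range scores {
		if s.SessionID == "" {
			continue // unaddressable session: skipped
		}
		a, ok := groups[s.SessionID]
		if !ok {
			a = &acc{}
			groups[s.SessionID] = a
		}
		if s.Scored {
			a.sum += s.Value
			a.n++
		}
	}
	out := make([]sessionAgg, 0, len(groups))
	for id, a := range groups {
		agg := sessionAgg{SessionID: id}
		if a.n > 0 {
			agg.Scored = true
			agg.Value = a.sum / float64(a.n)
		}
		out = append(out, agg)
	}
	sort.Slice(out, func(i, j int) bool { return out[i].SessionID < out[j].SessionID })
	return out
}

func main() {
	for _, s := range aggregate([]iterScore{
		{"b", true, 1.0}, {"a", true, 0.5}, {"a", false, 0}, {"a", true, 1.0},
	}) {
		fmt.Println(s.SessionID, s.Scored, s.Value)
	}
}
```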

type SignalContribution

type SignalContribution struct {
	Signal          SignalID
	Label           string
	Present         bool
	SubScore        float64
	Detail          string
	NominalWeight   float64
	EffectiveWeight float64
	Contribution    float64
}

SignalContribution is one row of the explainable per-signal breakdown.

type SignalID

type SignalID string

SignalID identifies one input signal in the rubric.

const (
	// SignalLanded scores whether the iteration's work survived to master.
	SignalLanded SignalID = "landed"
	// SignalVerifier scores whether the iteration's verification gates passed.
	SignalVerifier SignalID = "verifier"
	// SignalTests scores whether the iteration's tests passed.
	SignalTests SignalID = "tests"
	// SignalCorrectionPressure scores how little the iteration had to be
	// corrected — retries, user corrections, and tool-call errors.
	SignalCorrectionPressure SignalID = "correction_pressure"
	// SignalScope scores whether the iteration stayed within its write-scope.
	SignalScope SignalID = "scope"
	// SignalTokenEfficiency scores model/cache usage efficiency.
	SignalTokenEfficiency SignalID = "token_efficiency"
	// SignalHookOutcomes scores objective hook-gate outcomes captured by
	// `da workflow hook-outcome write` (R1.5). Added at RubricVersion 2.1.0
	// per docs/OUTCOME_SCORING_RUBRIC.md; the extractor lives in
	// signal_hook_outcomes.go.
	SignalHookOutcomes SignalID = "hook_outcomes"
)

type SignalSet

type SignalSet struct {
	Iteration int

	Landed             SignalValue
	Verifier           SignalValue
	Tests              SignalValue
	CorrectionPressure SignalValue
	Scope              SignalValue
	HookOutcomes       SignalValue
	TokenEfficiency    SignalValue

	// Integrity holds one observation per two-way signal that had at least one
	// of its claimed / observed sides present. It is a parallel output — it
	// never affects the numeric score.
	Integrity []IntegrityObservation

	// Objective carries the transcript-derived process-discipline checks that
	// replaced the rubber-stamped self_assessment booleans
	// (read_loop_state / committed_after_tests / ran_cli_command). These have
	// no self-reported counterpart in the current schema, so they surface as
	// observational facts alongside the score rather than as integrity pairs.
	Objective IterationObjectives
}

SignalSet is the rubric's seven typed input signals for one iteration (the objective values the scorer consumes), together with the integrity observations (claimed vs observed) for the two-way signals.

It is the output of the signals task: the seam between raw telemetry and the scorer. AssembleSignalSet builds one from the three extractor partials (iteration-log, git, and backfill signals) plus the iteration objectives and the hook_outcomes value; BuildSignalSets builds them for a whole iteration log.

func AssembleSignalSet

func AssembleSignalSet(rec IterationRecord, il IterlogSignals, gs GitSignals, bf BackfillSignals, obj IterationObjectives, hookOutcomes SignalValue) SignalSet

AssembleSignalSet joins the extractor partials for one iteration into the rubric's typed input set. It is pure — the scorer task consumes its output.

landed, token_efficiency, and hook_outcomes are objective-only. verifier and tests come from the iteration log and its verification artifacts. scope prefers the objective git measurement and falls back to the self-reported scope_note. correction_pressure is composed from retries, user corrections, and the transcript error rate. The IterationObjectives are recorded as observational facts on the result; they do not enter the score directly.

hookOutcomes is folded from the iter-N.hook-outcomes.yaml sidecar (R1.5); pass AbsentSignal("...") when no sidecar exists or the iteration predates R1.5 — the renormalizing combination then drops it from the vote.

func BuildSignalSets

func BuildSignalSets(iterLogDir, repoDir string, transcriptDirs ...string) ([]SignalSet, error)

BuildSignalSets loads an iteration log and runs every extractor, returning one assembled SignalSet per iteration in iteration order.

repoDir is the dot-agents repo root — used for git topology and for resolving the repo-relative verification-artifact paths. transcriptDirs are the agent session-log roots (~/.claude/projects/..., ~/.codex/sessions) for token backfill; pass none to score only natively-captured token telemetry.

func (SignalSet) Value

func (s SignalSet) Value(id SignalID) SignalValue

Value returns the SignalValue for a rubric signal ID, so the scorer can walk the rubric's signal list. An unknown ID yields an absent value.
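
A lookup of this shape could be written as a switch with an absent default. The sketch below is illustrative only: the lowercase types and the two-signal set are hypothetical simplifications of SignalID, SignalValue, and SignalSet, and the wording of the absent detail string is an assumption.

```go
package main

import "fmt"

// Hypothetical, trimmed-down stand-ins for the package's types.
type signalID string

const (
	sigLanded signalID = "landed"
	sigTests  signalID = "tests"
)

type signalValue struct {
	Present  bool
	SubScore float64
	Detail   string
}

type signalSet struct {
	Landed signalValue
	Tests  signalValue
}

// value maps a signal ID to its field; unknown IDs yield an absent value.
func (s signalSet) value(id signalID) signalValue {
	switch id {
	case sigLanded:
		return s.Landed
	case sigTests:
		return s.Tests
	default:
		// Assumed detail text; the real package may word this differently.
		return signalValue{Detail: "unknown signal " + string(id)}
	}
}

func main() {
	set := signalSet{Landed: signalValue{Present: true, SubScore: 1, Detail: "merged"}}
	for _, id := range []signalID{sigLanded, sigTests, "bogus"} {
		fmt.Printf("%s -> %+v\n", id, set.value(id))
	}
}
```

Returning an absent value instead of an error lets a scorer walk any rubric's signal list without special-casing IDs the set does not know.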

type SignalSpec

type SignalSpec struct {
	ID          SignalID
	Label       string
	Weight      float64
	Description string
	// TwoWay marks a signal that has both an objective source — which scores
	// the run — and a paired self-reported source. For a two-way signal the
	// scorer also emits a claimed-vs-observed delta into the integrity track;
	// that delta never affects the numeric score. See the integrity-track
	// section of docs/OUTCOME_SCORING_RUBRIC.md.
	TwoWay bool
}

SignalSpec is the rubric entry for one input signal: its identity, its weight in the combination, and a one-line description of what it measures.

type SignalValue

type SignalValue struct {
	Present bool
	// SubScore is meaningful only when Present; PresentSignal keeps it in [0,1].
	SubScore float64
	// Detail is a short human-readable note on what produced the value, for the
	// explainable score breakdown — e.g. "2/2 verifiers passed" or, when
	// absent, "no verifier records".
	Detail string
}

SignalValue is one signal's contribution for one iteration: either a sub-score in [0,1], or absent when the telemetry to compute it was not available.

Absent is first-class. The scorer renormalizes weights over the signals that are present, so an absent signal neither inflates nor deflates the score — see docs/OUTCOME_SCORING_RUBRIC.md. The signal extractors return SignalValue; the scorer consumes it.
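
As a rough illustration of renormalization, the sketch below divides each present signal's nominal weight by the total weight of the present signals. The names `weighted` and `combine` are hypothetical, and the sketch assumes the combined score is the plain sum of effective weight times sub-score on a [0,1] scale; docs/OUTCOME_SCORING_RUBRIC.md remains the authority on the actual formula and scale.

```go
package main

import "fmt"

// weighted is a hypothetical pairing of a nominal weight with a signal value.
type weighted struct {
	Weight   float64
	Present  bool
	SubScore float64
}

// combine renormalizes weights over the present signals and sums their
// contributions. It reports scored=false when nothing is present.
func combine(signals []weighted) (score float64, scored bool) {
	var total float64
	for _, s := range signals {
		if s.Present {
			total += s.Weight
		}
	}
	if total == 0 {
		return 0, false
	}
	for _, s := range signals {
		if !s.Present {
			continue // absent signals neither inflate nor deflate the score
		}
		effective := s.Weight / total
		score += effective * s.SubScore
	}
	return score, true
}

func main() {
	signals := []weighted{
		{Weight: 0.25, Present: true, SubScore: 1.0},
		{Weight: 0.25, Present: true, SubScore: 0.5},
		{Weight: 0.50, Present: false},
	}
	score, scored := combine(signals)
	fmt.Printf("scored=%v score=%.2f\n", scored, score)
}
```

In this example the absent signal's 0.50 weight is redistributed, so the two present signals each carry an effective weight of 0.5.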

func AbsentSignal

func AbsentSignal(detail string) SignalValue

AbsentSignal builds an absent SignalValue; detail explains why it is absent.

func ExtractHookOutcomesSignal

func ExtractHookOutcomesSignal(iterLogDir string, n int) SignalValue

ExtractHookOutcomesSignal computes the `hook_outcomes` sub-score for one iteration. iterLogDir is the same directory BuildSignalSets walks; n is the iteration number (matching the iter-N.yaml entry).

Returns an absent SignalValue when the sidecar does not exist, cannot be parsed, or contains no scored records. A non-existent sidecar is not an error — it is the common case for iterations recorded before R1.5 shipped and for iterations whose sentinel was inactive throughout.
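
The "missing sidecar is not an error" behavior might look like the sketch below. It is an assumption-laden illustration: `readSidecar` is a hypothetical helper, the file-name pattern `iter-%d.hook-outcomes.yaml` is inferred from the sidecar name above (the real extractor may format N differently), and YAML parsing and record scoring are left out entirely.

```go
package main

import (
	"errors"
	"fmt"
	"io/fs"
	"os"
	"path/filepath"
)

// readSidecar returns the raw sidecar bytes, or a non-empty reason the
// signal should be treated as absent. It never returns an error: a missing
// or unreadable sidecar just means the signal does not vote.
func readSidecar(iterLogDir string, n int) ([]byte, string) {
	// Assumed file-name pattern; hypothetical for this sketch.
	path := filepath.Join(iterLogDir, fmt.Sprintf("iter-%d.hook-outcomes.yaml", n))
	data, err := os.ReadFile(path)
	switch {
	case errors.Is(err, fs.ErrNotExist):
		return nil, "no hook-outcomes sidecar"
	case err != nil:
		return nil, "unreadable hook-outcomes sidecar: " + err.Error()
	}
	return data, ""
}

func main() {
	data, reason := readSidecar("/nonexistent-iteration-log-dir", 3)
	if reason != "" {
		fmt.Println("absent:", reason)
		return
	}
	fmt.Printf("read %d bytes; parsing is out of scope for this sketch\n", len(data))
}
```

A caller would pass the reason string to AbsentSignal so the explainable breakdown records why the signal dropped out.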

func PresentSignal

func PresentSignal(subScore float64, detail string) SignalValue

PresentSignal builds a present SignalValue, clamping the sub-score to [0,1].
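
The two constructors can be pictured with the following sketch. The lowercase names are hypothetical mirrors of SignalValue, PresentSignal, and AbsentSignal, written only to show the clamping and the absent zero value; the real functions may handle edge cases such as NaN differently.

```go
package main

import "fmt"

// signalValue is a hypothetical mirror of SignalValue.
type signalValue struct {
	Present  bool
	SubScore float64
	Detail   string
}

// present clamps the sub-score into [0,1] before marking it present.
func present(subScore float64, detail string) signalValue {
	if subScore < 0 {
		subScore = 0
	}
	if subScore > 1 {
		subScore = 1
	}
	return signalValue{Present: true, SubScore: subScore, Detail: detail}
}

// absent records only the reason the signal could not be computed.
func absent(detail string) signalValue {
	return signalValue{Detail: detail}
}

func main() {
	fmt.Printf("%+v\n", present(1.5, "2/2 verifiers passed"))
	fmt.Printf("%+v\n", present(-0.2, "out-of-range input"))
	fmt.Printf("%+v\n", absent("no verifier records"))
}
```

Clamping in the constructor means downstream code can rely on the [0,1] invariant for every present value without re-checking it.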

type TestAdded

type TestAdded struct {
	Name string
	Kind string // positive | negative | edge | regression
}

TestAdded is one entry of verifier.tests_added_by_kind.

type TokenUsage

type TokenUsage struct {
	InputTokens         int
	OutputTokens        int
	CacheReadTokens     int
	CacheCreationTokens int
	CacheHitRate        float64
}

TokenUsage is the per-iteration token telemetry. v2 entries may carry it natively; for the rest it is reconstructed by the backfill slice.

type VerifierRecord

type VerifierRecord struct {
	Type       string
	Status     string // pass | fail | partial | unknown
	GatePassed bool
	TestsAdded int
	// TestsAddedByKind is the structured replacement for the deprecated
	// tests_positive_and_negative boolean — the actual test names with their
	// scenario kind. Each name should resolve in the diff.
	TestsAddedByKind []TestAdded
	// LinkedTraces is the structured replacement for the deprecated
	// linked_traces_to_outcomes boolean — concrete trace ↔ outcome pairs.
	LinkedTraces   []LinkedTrace
	TestsTotalPass OptionalBool
	Retries        int
	ResultArtifact string
	SelfAssessment SelfAssessment
}

VerifierRecord is one verifier's contribution. Present only on v2 entries; v1 had no verifiers array.
