Documentation ¶
Overview ¶
Package review is the Stage-1 namespace for the self-review, critique, and quality-scoring types in package engine. See ../REFACTOR_PLAN.md.
Index ¶
- func CompareApproaches(solutions []Solution) string
- func DefaultScoreFn(solution string) float64
- func FormatConsensus(result *ConsensusResult) string
- func FormatInline(comments []ReviewComment) string
- func FormatReport(report *ReviewReport) string
- func FormatReview(result *ReviewResult) string
- func FormatSelfAssessment(a *Assessment) string
- func PairwiseSimilarity(a, b string) float64
- func ScoreByCompleteness(content string) float64
- func ScoreByLength(content string) float64
- func ShouldRetry(solutions []Solution) bool
- type Assessment
- type Bot
- type Comment
- type ConsensusResult
- type ConsensusSampler
- type Critic
- func (c *Critic) BuildPrompt(original, patched, intent string) string
- func (c *Critic) Model() string
- func (c *Critic) ParseVerdict(response string) *PatchVerdict
- func (c *Critic) PreScreenPatch(originalContent, patchedContent, intent string) *PatchVerdict
- func (c *Critic) ShouldBlock(verdict *PatchVerdict) bool
- type PatchVerdict
- type QualityScorer
- type Report
- type ResponseContext
- type ReviewBot
- type ReviewComment
- type ReviewReport
- type ReviewResult
- type ReviewRule
- type Rule
- type Sample
- type ScoreWeights
- type ScoredResponse
- type SelfAssessor
- func (sa *SelfAssessor) Assess(ctx TaskContext) *Assessment
- func (sa *SelfAssessor) AverageScore(n int) float64
- func (sa *SelfAssessor) GetTrend(dimension string) string
- func (sa *SelfAssessor) IdentifyStrengths(ctx TaskContext) []string
- func (sa *SelfAssessor) IdentifyWeaknesses(ctx TaskContext) []string
- func (sa *SelfAssessor) SuggestImprovements(ctx TaskContext) []string
- type Solution
- type SolutionReviewer
- type TaskContext
Constants ¶
This section is empty.
Variables ¶
This section is empty.
Functions ¶
func CompareApproaches ¶
func CompareApproaches(solutions []Solution) string
CompareApproaches analyzes how the given solutions approached the problem and returns a human-readable comparison.
func DefaultScoreFn ¶
func DefaultScoreFn(solution string) float64
DefaultScoreFn combines length and completeness scoring.
func FormatConsensus ¶
func FormatConsensus(result *ConsensusResult) string
FormatConsensus produces a human-readable summary of the consensus result.
func FormatInline ¶
func FormatInline(comments []ReviewComment) string
FormatInline produces GitHub-style inline review comments.
func FormatReport ¶
func FormatReport(report *ReviewReport) string
FormatReport produces a human-readable summary of a review report.
func FormatReview ¶
func FormatReview(result *ReviewResult) string
FormatReview produces a formatted summary of the review result.
func FormatSelfAssessment ¶
func FormatSelfAssessment(a *Assessment) string
FormatSelfAssessment produces a human-readable summary of an assessment.
func PairwiseSimilarity ¶
func PairwiseSimilarity(a, b string) float64
PairwiseSimilarity computes the Jaccard similarity between two strings based on their word sets.
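Jaccard similarity over word sets is the size of the intersection divided by the size of the union. The following is a minimal sketch of that measure, not the package's implementation: the names wordSet and jaccard are hypothetical, and lowercasing the input and returning 1.0 for two empty strings are assumptions of the sketch.

```go
package main

import (
	"fmt"
	"strings"
)

// wordSet splits s on whitespace and returns the set of distinct words.
// Lowercasing is an assumption made for this sketch.
func wordSet(s string) map[string]struct{} {
	set := make(map[string]struct{})
	for _, w := range strings.Fields(strings.ToLower(s)) {
		set[w] = struct{}{}
	}
	return set
}

// jaccard returns |A ∩ B| / |A ∪ B| over the word sets of a and b.
// Two empty inputs are treated as identical (1.0); that choice is an assumption.
func jaccard(a, b string) float64 {
	sa, sb := wordSet(a), wordSet(b)
	if len(sa) == 0 && len(sb) == 0 {
		return 1.0
	}
	inter := 0
	for w := range sa {
		if _, ok := sb[w]; ok {
			inter++
		}
	}
	union := len(sa) + len(sb) - inter
	return float64(inter) / float64(union)
}

func main() {
	// 3 shared words out of 5 distinct words.
	fmt.Printf("%.2f\n", jaccard("fix the parser bug", "fix the lexer bug"))
}
```

Because the measure works on sets, word order and repeated words do not affect the result.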
func ScoreByCompleteness ¶
func ScoreByCompleteness(content string) float64
ScoreByCompleteness scores content based on structural indicators: code blocks, file mentions, and numbered steps.
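One way such a structural score could be computed is sketched below. The per-indicator increments (0.4, 0.3, 0.3), the regular expressions, and the function name completeness are all assumptions chosen for illustration; the documentation above does not state the actual values.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

var (
	// fileMention matches tokens that look like file names, e.g. "main.go".
	// The extension list is a hypothetical choice for this sketch.
	fileMention = regexp.MustCompile(`\b[\w/-]+\.(go|py|js|ts|md|json|yaml)\b`)
	// numberedStep matches lines that start with "1." or "2)".
	numberedStep = regexp.MustCompile(`(?m)^\s*\d+[.)]\s`)
)

// completeness adds a fixed increment for each structural indicator found.
// The increments are assumptions, not the package's real weights.
func completeness(content string) float64 {
	fence := strings.Repeat("`", 3) // a Markdown code fence
	score := 0.0
	if strings.Contains(content, fence) {
		score += 0.4
	}
	if fileMention.MatchString(content) {
		score += 0.3
	}
	if numberedStep.MatchString(content) {
		score += 0.3
	}
	return score
}

func main() {
	fence := strings.Repeat("`", 3)
	answer := "1. Edit main.go\n2. Run the tests\n" + fence + "go\nfmt.Println(1)\n" + fence
	fmt.Printf("%.1f\n", completeness(answer))
	fmt.Printf("%.1f\n", completeness("looks fine"))
}
```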
func ScoreByLength ¶
func ScoreByLength(content string) float64
ScoreByLength scores content based on whether its length is reasonable. Content that is too short or too long is penalized.
func ShouldRetry ¶
func ShouldRetry(solutions []Solution) bool
ShouldRetry reports whether additional attempts should be made. It returns true if the best score so far is below 0.7.
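The retry rule can be sketched as a scan for the best score followed by a threshold check. The type attempt below is a stand-in for Solution, and treating an empty slice as "retry" is an assumption of the sketch; only the 0.7 cutoff comes from the documentation above.

```go
package main

import "fmt"

// attempt is a stand-in for the package's Solution type; only Score matters here.
type attempt struct {
	Score float64
}

// retryThreshold mirrors the 0.7 cutoff stated in the ShouldRetry description.
const retryThreshold = 0.7

// shouldRetry reports whether the best score seen so far is below the threshold.
// Treating an empty slice as "retry" is an assumption of this sketch.
func shouldRetry(attempts []attempt) bool {
	best := 0.0
	for _, a := range attempts {
		if a.Score > best {
			best = a.Score
		}
	}
	return best < retryThreshold
}

func main() {
	fmt.Println(shouldRetry([]attempt{{Score: 0.4}, {Score: 0.65}})) // true
	fmt.Println(shouldRetry([]attempt{{Score: 0.4}, {Score: 0.9}}))  // false
}
```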
Types ¶
type Assessment ¶
type Assessment struct {
Score float64 // overall score 0.0 - 1.0
Dimensions map[string]float64 // per-dimension scores
Strengths []string // things that went well
Weaknesses []string // things that went poorly
Improvements []string // actionable suggestions for next time
TaskType string // classification of the task
Timestamp time.Time // when the assessment was made
}
Assessment captures the result of a self-evaluation after completing a task.
type ConsensusResult ¶
ConsensusResult holds the outcome of multi-sample consensus.
type ConsensusSampler ¶
type ConsensusSampler struct {
NumSamples int
Strategy string // "majority", "best_score", "synthesize"
ScoreFn func(solution string) float64
// contains filtered or unexported fields
}
ConsensusSampler implements the multi-sample consensus pattern inspired by SWE-agent's "Ask Colleagues" approach: generate N solutions in parallel, then select the best one using a configurable strategy.
func NewConsensusSampler ¶
func NewConsensusSampler(numSamples int) *ConsensusSampler
NewConsensusSampler creates a ConsensusSampler with the given number of samples. If numSamples is <= 0, it defaults to 3.
func (*ConsensusSampler) SampleSolutions ¶
func (cs *ConsensusSampler) SampleSolutions(ctx context.Context, prompt string, generateFn func(context.Context, string) (string, error)) (*ConsensusResult, error)
SampleSolutions generates N solutions in parallel, scores each, and selects a winner based on the configured strategy.
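The parallel sample-then-select pattern can be sketched with goroutines and a sync.WaitGroup. This is an illustration under stated assumptions, not the package's code: candidate, sampleBest, and demo are hypothetical names, selection is by highest score only, and failed samples are simply skipped.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"sync"
)

// candidate is a hypothetical stand-in for the package's Sample type.
type candidate struct {
	Content string
	Score   float64
	Err     error
}

// sampleBest calls generate n times concurrently, scores each successful
// output, and returns the highest-scoring one.
func sampleBest(
	ctx context.Context,
	prompt string,
	n int,
	generate func(context.Context, string) (string, error),
	score func(string) float64,
) (candidate, error) {
	results := make([]candidate, n)
	var wg sync.WaitGroup
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			out, err := generate(ctx, prompt)
			c := candidate{Content: out, Err: err}
			if err == nil {
				c.Score = score(out)
			}
			results[i] = c // each goroutine writes only its own slot
		}(i)
	}
	wg.Wait()

	best := -1
	for i, c := range results {
		if c.Err != nil {
			continue // skipping failed samples is an assumption of this sketch
		}
		if best == -1 || c.Score > results[best].Score {
			best = i
		}
	}
	if best == -1 {
		return candidate{}, errors.New("all samples failed")
	}
	return results[best], nil
}

// demo runs sampleBest with a fake generator that returns canned answers
// and fails on its second call.
func demo() (string, error) {
	answers := []string{"short", "", "a much longer answer"}
	var mu sync.Mutex
	calls := 0
	generate := func(ctx context.Context, prompt string) (string, error) {
		mu.Lock()
		defer mu.Unlock()
		calls++
		if calls == 2 {
			return "", errors.New("simulated failure")
		}
		return answers[calls-1], nil
	}
	byLength := func(s string) float64 { return float64(len(s)) }
	best, err := sampleBest(context.Background(), "fix the bug", 3, generate, byLength)
	return best.Content, err
}

func main() {
	fmt.Println(demo())
}
```

Writing each result into a preallocated slot avoids a mutex around the results slice; the fake generator needs its own lock only because it keeps a shared call counter.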
type Critic ¶
type Critic struct {
// contains filtered or unexported fields
}
Critic provides fast pre-validation of patches using a cheap model before expensive execution. It generates a prompt for the cheap model and parses the response into a structured verdict.
func (*Critic) BuildPrompt ¶
func (c *Critic) BuildPrompt(original, patched, intent string) string
BuildPrompt constructs a prompt for the cheap model to evaluate a patch.
func (*Critic) ParseVerdict ¶
func (c *Critic) ParseVerdict(response string) *PatchVerdict
ParseVerdict parses a model response into a structured PatchVerdict.
func (*Critic) PreScreenPatch ¶
func (c *Critic) PreScreenPatch(originalContent, patchedContent, intent string) *PatchVerdict
PreScreenPatch pre-screens a patch against the stated intent and returns a verdict. In this implementation it does not call a model itself: it constructs a PatchVerdict from a simple heuristic comparison of the original and patched content. To consult the cheap model, the caller is expected to build the prompt, send it to the model, and pass the response to ParseVerdict.
func (*Critic) ShouldBlock ¶
func (c *Critic) ShouldBlock(verdict *PatchVerdict) bool
ShouldBlock reports whether the patch should be blocked, that is, whether the verdict is "incorrect" with confidence greater than 0.8.
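The blocking rule reduces to one conjunction over the verdict fields. In this sketch the type verdict stands in for PatchVerdict, and returning false for a nil verdict is an assumption.

```go
package main

import "fmt"

// verdict is a stand-in for PatchVerdict with only the fields the rule needs.
type verdict struct {
	Likely     string
	Confidence float64
}

// shouldBlock applies the documented rule: "incorrect" with confidence > 0.8.
// A nil verdict is treated as "do not block"; that is an assumption.
func shouldBlock(v *verdict) bool {
	if v == nil {
		return false
	}
	return v.Likely == "incorrect" && v.Confidence > 0.8
}

func main() {
	fmt.Println(shouldBlock(&verdict{Likely: "incorrect", Confidence: 0.95})) // true
	fmt.Println(shouldBlock(&verdict{Likely: "incorrect", Confidence: 0.8}))  // false
	fmt.Println(shouldBlock(&verdict{Likely: "uncertain", Confidence: 0.99})) // false
}
```

Note that the comparison is strict, so a confidence of exactly 0.8 does not block.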
type PatchVerdict ¶
type PatchVerdict struct {
Likely string // "correct", "incorrect", "uncertain"
Issues []string // specific issues found
Confidence float64 // 0-1
}
PatchVerdict is the result of a critic's pre-screening of a patch.
type QualityScorer ¶
type QualityScorer struct {
Weights ScoreWeights
History []ScoredResponse
// contains filtered or unexported fields
}
QualityScorer evaluates LLM response quality across multiple dimensions and provides feedback for the self-improvement loop.
func NewQualityScorer ¶
func NewQualityScorer() *QualityScorer
NewQualityScorer creates a QualityScorer with default weights.
func (*QualityScorer) AverageScore ¶
func (qs *QualityScorer) AverageScore(n int) float64
AverageScore computes the average composite score over the last n responses.
func (*QualityScorer) FormatReport ¶
func (qs *QualityScorer) FormatReport(last int) string
FormatReport generates a formatted quality report for the most recent responses, where last is the number of responses to include.
func (*QualityScorer) GenerateFeedback ¶
func (qs *QualityScorer) GenerateFeedback(scored *ScoredResponse) []string
GenerateFeedback produces human-readable suggestions based on the scored response.
func (*QualityScorer) Score ¶
func (qs *QualityScorer) Score(ctx ResponseContext) *ScoredResponse
Score evaluates a response across all quality dimensions and returns a composite result.
func (*QualityScorer) TrendAnalysis ¶
func (qs *QualityScorer) TrendAnalysis() string
TrendAnalysis returns a human-readable description of quality trends.
type ResponseContext ¶
type ResponseContext struct {
UserPrompt string
AssistantResponse string
ToolCallCount int
ToolErrors int
FilesModified []string
TestsPassed bool
LintPassed bool
TokensUsed int
Duration time.Duration
}
ResponseContext provides the context needed to evaluate a response's quality.
type ReviewBot ¶
type ReviewBot struct {
Rules []ReviewRule
Severity string // minimum severity to report: "error", "warning", "info"
// contains filtered or unexported fields
}
ReviewBot is a rule-based code review engine that produces structured feedback without requiring an LLM call.
func NewReviewBot ¶
func NewReviewBot() *ReviewBot
NewReviewBot creates a ReviewBot pre-loaded with 20+ built-in rules.
func (*ReviewBot) ReviewDiff ¶
func (rb *ReviewBot) ReviewDiff(diffInput string) (*ReviewReport, error)
ReviewDiff parses a unified diff and reviews only changed lines.
func (*ReviewBot) ReviewFile ¶
func (rb *ReviewBot) ReviewFile(path, content string) (*ReviewReport, error)
ReviewFile reviews a full file's content.
type ReviewComment ¶
type ReviewComment struct {
File string
Line int
Severity string // "error", "warning", "info"
Category string
Message string
Suggestion string
RuleID string
}
ReviewComment represents a single piece of review feedback.
func FilterBySeverity ¶
func FilterBySeverity(comments []ReviewComment, minSeverity string) []ReviewComment
FilterBySeverity returns only comments at or above the specified minimum severity.
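Filtering "at or above" a severity needs an ordering over the three severity strings. The sketch below assumes the ordering info < warning < error and drops comments whose severity is not one of those three; comment, severityRank, and filterBySeverity are hypothetical names.

```go
package main

import "fmt"

// comment is a stand-in for ReviewComment with only the fields used here.
type comment struct {
	Severity string
	Message  string
}

// severityRank orders severities; a higher value is more severe.
// The ordering is an assumption based on the usual meaning of the labels.
var severityRank = map[string]int{"info": 1, "warning": 2, "error": 3}

// filterBySeverity keeps comments whose severity ranks at or above min.
// An unknown min ranks as 0, so every known severity passes.
func filterBySeverity(comments []comment, min string) []comment {
	threshold := severityRank[min]
	var kept []comment
	for _, c := range comments {
		rank, ok := severityRank[c.Severity]
		if ok && rank >= threshold {
			kept = append(kept, c)
		}
	}
	return kept
}

func main() {
	all := []comment{
		{Severity: "info", Message: "consider renaming x"},
		{Severity: "warning", Message: "error value ignored"},
		{Severity: "error", Message: "possible SQL injection"},
	}
	for _, c := range filterBySeverity(all, "warning") {
		fmt.Println(c.Severity, c.Message)
	}
}
```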
type ReviewReport ¶
type ReviewReport struct {
Comments []ReviewComment
FilesReviewed int
IssuesFound int
BySeverity map[string]int
Duration time.Duration
}
ReviewReport summarizes the results of a code review.
type ReviewResult ¶
type ReviewResult struct {
Best *Solution
All []Solution
Attempts int
TotalDuration time.Duration
TotalTokens int
Agreement float64
}
ReviewResult holds the outcome of the multi-attempt review process.
type ReviewRule ¶
type ReviewRule struct {
ID string
Name string
Category string // "security", "performance", "correctness", "style", "testing"
Language string
Check func(file string, lines []string, diffLines []diff.DiffLine) []ReviewComment
}
ReviewRule defines a single review check that can be applied to code.
type Sample ¶
Sample represents a single generated solution with metadata.
func MajorityVote ¶
MajorityVote finds the most similar/common solution using pairwise similarity. The sample with the highest average similarity to all others wins.
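The selection rule described above can be sketched as follows: compute each sample's average similarity to every other sample and keep the highest. The function names, the index-based return value, and the first-wins tie-breaking are assumptions of this sketch; the similarity helper is a plain word-set Jaccard measure.

```go
package main

import (
	"fmt"
	"strings"
)

// similarity is a word-set Jaccard measure used only for this sketch.
func similarity(a, b string) float64 {
	setA := map[string]bool{}
	for _, w := range strings.Fields(a) {
		setA[w] = true
	}
	setB := map[string]bool{}
	for _, w := range strings.Fields(b) {
		setB[w] = true
	}
	inter := 0
	for w := range setA {
		if setB[w] {
			inter++
		}
	}
	union := len(setA) + len(setB) - inter
	if union == 0 {
		return 1
	}
	return float64(inter) / float64(union)
}

// majorityIndex returns the index of the sample with the highest average
// similarity to all other samples, or -1 for an empty input.
// On ties the earliest sample wins; that is an assumption.
func majorityIndex(samples []string) int {
	best, bestAvg := -1, -1.0
	for i := range samples {
		total := 0.0
		for j := range samples {
			if i != j {
				total += similarity(samples[i], samples[j])
			}
		}
		avg := 0.0
		if len(samples) > 1 {
			avg = total / float64(len(samples)-1)
		}
		if avg > bestAvg {
			best, bestAvg = i, avg
		}
	}
	return best
}

func main() {
	samples := []string{"panic now", "return a + b", "return b + a"}
	fmt.Println(majorityIndex(samples)) // the two "return" samples agree, so index 1 wins
}
```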
func Synthesize ¶
Synthesize combines elements from all solutions, weighted by score. It selects unique paragraphs from higher-scoring samples first.
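A minimal sketch of paragraph-level synthesis, assuming paragraphs are separated by blank lines and duplicates are detected by exact match after trimming. The type scored and the function synthesize are hypothetical stand-ins, not the package's API.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// scored is a stand-in for Sample with only the fields used here.
type scored struct {
	Content string
	Score   float64
}

// synthesize walks samples from highest to lowest score and collects each
// paragraph the first time it appears.
func synthesize(samples []scored) string {
	ordered := make([]scored, len(samples))
	copy(ordered, samples) // leave the caller's slice untouched
	sort.SliceStable(ordered, func(i, j int) bool {
		return ordered[i].Score > ordered[j].Score
	})

	seen := map[string]bool{}
	var parts []string
	for _, s := range ordered {
		for _, p := range strings.Split(s.Content, "\n\n") {
			p = strings.TrimSpace(p)
			if p == "" || seen[p] {
				continue
			}
			seen[p] = true
			parts = append(parts, p)
		}
	}
	return strings.Join(parts, "\n\n")
}

func main() {
	out := synthesize([]scored{
		{Content: "Use a mutex.\n\nAdd a test.", Score: 0.6},
		{Content: "Use a mutex.\n\nDocument the lock order.", Score: 0.9},
	})
	fmt.Println(out)
}
```

In the example, the paragraphs of the 0.9 sample come first, and "Use a mutex." appears once even though both samples contain it.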
type ScoreWeights ¶
type ScoreWeights struct {
Completeness float64 // did it address the full request?
Correctness float64 // is the code syntactically valid?
Conciseness float64 // not overly verbose?
ToolUsage float64 // efficient use of tools?
Safety float64 // no dangerous operations?
}
ScoreWeights defines the relative importance of each quality dimension. All values should be in [0,1] and sum to 1.
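With weights that sum to 1, a composite score is naturally a weighted sum of the per-dimension scores. The sketch below assumes that formula, along with lowercase dimension keys; the weight values shown are illustrative and are not claimed to be what DefaultWeights returns.

```go
package main

import "fmt"

// weights is a stand-in for ScoreWeights.
type weights struct {
	Completeness, Correctness, Conciseness, ToolUsage, Safety float64
}

// valid reports whether every weight is in [0,1] and the weights sum to 1
// within a small tolerance.
func (w weights) valid() bool {
	sum := 0.0
	for _, v := range []float64{w.Completeness, w.Correctness, w.Conciseness, w.ToolUsage, w.Safety} {
		if v < 0 || v > 1 {
			return false
		}
		sum += v
	}
	diff := sum - 1
	if diff < 0 {
		diff = -diff
	}
	return diff < 1e-9
}

// composite returns the weighted sum of per-dimension scores.
// The map keys are hypothetical names for the five dimensions.
func composite(w weights, dims map[string]float64) float64 {
	return w.Completeness*dims["completeness"] +
		w.Correctness*dims["correctness"] +
		w.Conciseness*dims["conciseness"] +
		w.ToolUsage*dims["tool_usage"] +
		w.Safety*dims["safety"]
}

func main() {
	w := weights{Completeness: 0.3, Correctness: 0.3, Conciseness: 0.1, ToolUsage: 0.1, Safety: 0.2}
	dims := map[string]float64{
		"completeness": 1.0, "correctness": 0.5, "conciseness": 1.0,
		"tool_usage": 1.0, "safety": 1.0,
	}
	fmt.Println(w.valid())
	fmt.Printf("%.2f\n", composite(w, dims))
}
```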
func DefaultWeights ¶
func DefaultWeights() ScoreWeights
DefaultWeights returns a balanced set of scoring weights.
type ScoredResponse ¶
type ScoredResponse struct {
Score float64 // 0-1 overall composite score
Breakdown map[string]float64 // per-dimension scores
Feedback []string // human-readable improvement suggestions
Timestamp time.Time
Model string
TaskType string
}
ScoredResponse holds the quality evaluation of a single LLM response.
type SelfAssessor ¶
type SelfAssessor struct {
History []Assessment
// contains filtered or unexported fields
}
SelfAssessor evaluates agent performance after each task and tracks trends.
func NewSelfAssessor ¶
func NewSelfAssessor() *SelfAssessor
NewSelfAssessor creates a new SelfAssessor with an empty history.
func (*SelfAssessor) Assess ¶
func (sa *SelfAssessor) Assess(ctx TaskContext) *Assessment
Assess evaluates the agent's performance on a completed task across multiple dimensions and records the assessment in history.
func (*SelfAssessor) AverageScore ¶
func (sa *SelfAssessor) AverageScore(n int) float64
AverageScore computes the average overall score of the last n assessments. If n is 0 or exceeds history length, all assessments are averaged.
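The windowing rule can be sketched over a plain slice of scores. Here averageLast is a hypothetical name, and returning 0 for an empty history and treating a negative n like 0 are assumptions; the "n is 0 or exceeds history length" behavior follows the description above.

```go
package main

import "fmt"

// averageLast averages the last n scores. If n is 0 or larger than the
// history, all scores are averaged. Negative n is treated like 0, and an
// empty history yields 0; both are assumptions of this sketch.
func averageLast(scores []float64, n int) float64 {
	if len(scores) == 0 {
		return 0
	}
	if n <= 0 || n > len(scores) {
		n = len(scores)
	}
	sum := 0.0
	for _, s := range scores[len(scores)-n:] {
		sum += s
	}
	return sum / float64(n)
}

func main() {
	history := []float64{0.2, 0.4, 0.6, 0.8}
	fmt.Printf("%.2f\n", averageLast(history, 2))  // mean of 0.6 and 0.8
	fmt.Printf("%.2f\n", averageLast(history, 0))  // mean of all four
	fmt.Printf("%.2f\n", averageLast(history, 10)) // n exceeds history: all four
}
```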
func (*SelfAssessor) GetTrend ¶
func (sa *SelfAssessor) GetTrend(dimension string) string
GetTrend analyzes the trend of a given dimension over the last 10 assessments. Returns "improving", "stable", or "declining".
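The documentation does not say how the trend is derived from the last 10 assessments. One plausible approach, shown purely as an assumption, compares the mean of the older half of the window with the mean of the newer half and applies a small dead band; the 0.05 tolerance and the names trend and mean are hypothetical.

```go
package main

import "fmt"

// mean returns the arithmetic mean of vals, or 0 for an empty slice.
func mean(vals []float64) float64 {
	if len(vals) == 0 {
		return 0
	}
	sum := 0.0
	for _, v := range vals {
		sum += v
	}
	return sum / float64(len(vals))
}

// trend classifies the last 10 values of one dimension by comparing the
// older half of the window with the newer half. This half-versus-half
// comparison and the 0.05 tolerance are assumptions, not the package's method.
func trend(values []float64) string {
	const window = 10
	const tolerance = 0.05
	if len(values) > window {
		values = values[len(values)-window:]
	}
	if len(values) < 2 {
		return "stable" // too little data to call a direction
	}
	mid := len(values) / 2
	older, newer := mean(values[:mid]), mean(values[mid:])
	switch {
	case newer-older > tolerance:
		return "improving"
	case older-newer > tolerance:
		return "declining"
	default:
		return "stable"
	}
}

func main() {
	fmt.Println(trend([]float64{0.2, 0.3, 0.7, 0.8}))
	fmt.Println(trend([]float64{0.8, 0.7, 0.3, 0.2}))
	fmt.Println(trend([]float64{0.5, 0.52, 0.49, 0.5}))
}
```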
func (*SelfAssessor) IdentifyStrengths ¶
func (sa *SelfAssessor) IdentifyStrengths(ctx TaskContext) []string
IdentifyStrengths returns a list of things that went well.
func (*SelfAssessor) IdentifyWeaknesses ¶
func (sa *SelfAssessor) IdentifyWeaknesses(ctx TaskContext) []string
IdentifyWeaknesses returns a list of things that went poorly.
func (*SelfAssessor) SuggestImprovements ¶
func (sa *SelfAssessor) SuggestImprovements(ctx TaskContext) []string
SuggestImprovements returns actionable suggestions for future tasks.
type Solution ¶
type Solution struct {
ID int
Content string
Score float64
Duration time.Duration
TokensUsed int
Errors []string
FilesModified []string
}
Solution represents a single attempted solution with metadata.
type SolutionReviewer ¶
type SolutionReviewer struct {
MaxAttempts int
ScoreFn func(solution string) float64
// contains filtered or unexported fields
}
SolutionReviewer implements a multi-attempt solution review pattern inspired by SWE-agent's reviewer: run the agent N times, score each solution, and select the best one. This improves reliability by sampling multiple approaches.
func NewSolutionReviewer ¶
func NewSolutionReviewer(maxAttempts int) *SolutionReviewer
NewSolutionReviewer creates a SolutionReviewer with the given max attempts. If maxAttempts is <= 0, it defaults to 3.
func (*SolutionReviewer) ReviewAndSelect ¶
func (sr *SolutionReviewer) ReviewAndSelect(ctx context.Context, task string, solveFn func(context.Context, string) (*Solution, error)) (*ReviewResult, error)
ReviewAndSelect runs solveFn up to MaxAttempts times, scores each solution, selects the best one, and calculates agreement across attempts.
func (*SolutionReviewer) ScoreSolution ¶
func (sr *SolutionReviewer) ScoreSolution(solution *Solution) float64
ScoreSolution evaluates a solution using default scoring criteria:
- Has code changes (+0.3)
- No errors (+0.3)
- Reasonable length (+0.2)
- Files modified (+0.2)
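The four criteria above add up to a maximum of 1.0. The sketch below shows the additive structure only: how "has code changes" and "reasonable length" are detected is not documented, so the "diff --git" check and the 50 to 5000 character bounds are hypothetical, as are the names attempt and scoreAttempt.

```go
package main

import (
	"fmt"
	"strings"
)

// attempt is a stand-in for Solution with only the fields used here.
type attempt struct {
	Content       string
	Errors        []string
	FilesModified []string
}

// scoreAttempt adds the documented increment for each criterion that holds.
func scoreAttempt(a attempt) float64 {
	score := 0.0
	// Has code changes (+0.3). Looking for a diff header is a hypothetical test.
	if strings.Contains(a.Content, "diff --git") {
		score += 0.3
	}
	// No errors (+0.3).
	if len(a.Errors) == 0 {
		score += 0.3
	}
	// Reasonable length (+0.2). The 50..5000 bounds are assumptions.
	if n := len(a.Content); n >= 50 && n <= 5000 {
		score += 0.2
	}
	// Files modified (+0.2).
	if len(a.FilesModified) > 0 {
		score += 0.2
	}
	return score
}

func main() {
	good := attempt{
		Content:       "diff --git a/main.go b/main.go\n+func add(a, b int) int { return a + b }",
		FilesModified: []string{"main.go"},
	}
	bad := attempt{Errors: []string{"build failed"}}
	fmt.Printf("%.1f\n", scoreAttempt(good))
	fmt.Printf("%.1f\n", scoreAttempt(bad))
}
```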