anomaly

package v0.0.0-...-ffc4fba

Published: Mar 13, 2026 License: Apache-2.0 Imports: 20 Imported by: 0

README

Temporal Anomaly Detection

Preface

Large codebases evolve through thousands of commits. Most commits follow a predictable rhythm — a handful of files changed, tens to hundreds of lines modified. When a commit or time period suddenly deviates from that baseline, it often signals something worth investigating.

Problem

Teams need to detect sudden quality degradation in commit history:

  • "When did the massive vendoring event happen that doubled our codebase?"
  • "Which time periods had abnormal churn that might indicate rushed changes?"
  • "Are there refactoring bursts that correlate with post-release cleanup?"

Manual review of thousands of commits is impractical. Simple thresholds (e.g., ">100 files changed") miss context: 100 files might be normal for a monorepo but alarming for a small service.

How the analyzer solves it

The Anomaly analyzer applies Z-score statistical analysis over a sliding window of per-tick metrics. Instead of fixed thresholds, it detects deviations relative to the repository's own recent baseline. A commit changing 500 files is flagged only if the surrounding ticks average 20 files — it adapts to each repository's rhythm.

Historical context

Z-score anomaly detection is a foundational technique in statistical process control (SPC), originating from Walter Shewhart's control charts at Bell Labs in the 1920s. The concept is simple: if a data point is more than N standard deviations from the mean, it is an outlier. Applied to software engineering, this detects "change-point events" in repository evolution — moments where development patterns shift abruptly.

Real-world examples

  • Vendor imports: A sudden spike of thousands of added lines with zero removed lines flags a bulk import (e.g., vendoring a dependency).
  • Major refactors: A tick with 500+ files changed when the rolling average is 30 files indicates a large-scale restructuring.
  • Release cleanup: Post-release ticks often show abnormal churn as teams fix accumulated technical debt.
  • Regressions: A sudden drop in net churn (large deletions) after a period of steady growth may indicate a reverted feature.

How the analyzer works

  1. Metric collection: For each commit, the analyzer records per-tick metrics from plumbing analyzers: files changed, lines added, lines removed, net churn (added - removed), language diversity, and author count.
  2. Tick aggregation: Multiple commits in the same time tick are aggregated into a single data point.
  3. Z-score computation: For each tick, a trailing sliding window (default: 20 ticks) computes the rolling mean and population standard deviation. The Z-score measures how many standard deviations the current tick deviates from the window.
  4. Multi-metric detection: Z-scores are computed independently for each tracked metric. A tick is flagged as anomalous if any metric exceeds the threshold (default: 2.0 sigma).
  5. Severity ranking: Anomalies are sorted by the maximum absolute Z-score across all metrics, so the most extreme deviations appear first.
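Steps 4 and 5 above can be sketched in a few lines. This is a simplified stand-in, not the package's own code: the `tickZ` type and `detect` function are hypothetical, and real ticks carry a full `ZScoreSet` rather than a map.

```go
package main

import (
	"fmt"
	"math"
	"sort"
)

// tickZ is a hypothetical simplified tick: its index plus per-metric Z-scores.
type tickZ struct {
	Tick    int
	ZScores map[string]float64
}

// maxAbs returns the maximum absolute Z-score across all metrics,
// mirroring ZScoreSet.MaxAbs.
func maxAbs(zs map[string]float64) float64 {
	var m float64
	for _, z := range zs {
		if a := math.Abs(z); a > m {
			m = a
		}
	}
	return m
}

// detect flags a tick when any metric exceeds the threshold, then sorts
// anomalies so the most extreme deviations appear first.
func detect(ticks []tickZ, threshold float64) []tickZ {
	var anomalies []tickZ
	for _, t := range ticks {
		if maxAbs(t.ZScores) > threshold {
			anomalies = append(anomalies, t)
		}
	}
	sort.Slice(anomalies, func(i, j int) bool {
		return maxAbs(anomalies[i].ZScores) > maxAbs(anomalies[j].ZScores)
	})
	return anomalies
}

func main() {
	ticks := []tickZ{
		{Tick: 1, ZScores: map[string]float64{"net_churn": 0.4, "files_changed": 1.1}},
		{Tick: 2, ZScores: map[string]float64{"net_churn": 2.5, "files_changed": 0.3}},
		{Tick: 3, ZScores: map[string]float64{"net_churn": -3.8, "files_changed": 2.1}},
	}
	for _, a := range detect(ticks, 2.0) {
		fmt.Println(a.Tick, maxAbs(a.ZScores))
	}
}
```

Note that a large negative Z-score (e.g. a mass deletion) ranks just as high as a large positive one, since severity uses the absolute value.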

Zero-variance handling

When the sliding window has zero variance (all identical values) and the current value differs, a sentinel Z-score of 100.0 is assigned. This correctly flags a spike against a perfectly stable baseline (e.g., the first non-trivial commit after a series of identical ticks).

Limitations

  • Tick granularity: The analyzer operates on ticks (time buckets), not individual commits. Anomalies point to ticks, not specific commit hashes.
  • No semantic analysis: A 1000-line addition of auto-generated code and a 1000-line hand-written refactor look identical. The analyzer detects statistical anomalies, not semantic ones.
  • Early window noise: The first few ticks have small sliding windows, making Z-scores less reliable. The sentinel value (100.0) handles the extreme case but borderline anomalies in early ticks should be interpreted with caution.
  • Merge commits: With --first-parent, merge commits are treated as single events. Without it, individual commits within a merge are counted separately, which may dilute or amplify anomalies.

Configuration

Key                          CLI flag             Default  Description
history.anomaly.threshold    --anomaly-threshold  2.0      Z-score threshold in standard deviations
history.anomaly.window_size  --anomaly-window     20       Sliding window size in ticks

A lower threshold makes detection more sensitive (more anomalies flagged); a larger window yields a smoother baseline (less reactive to recent changes).

Metrics

Each metric implements the Metric[In, Out] interface from pkg/metrics.

anomalies

Type: risk

List of detected anomalous ticks, sorted by severity (highest absolute Z-score first).

Output fields:

  • tick - Time period index where anomaly was detected
  • z_scores - Per-metric Z-scores (net_churn, files_changed, lines_added, lines_removed, language_diversity, author_count)
  • max_abs_z_score - Maximum absolute Z-score across all metrics (severity measure)
  • metrics - Raw metric values for the tick (files_changed, lines_added, lines_removed, net_churn)
  • files - List of files changed in the anomalous tick

time_series

Type: time_series

Per-tick metrics with anomaly annotations. Every tick in the analysis period has an entry, regardless of whether it was flagged as anomalous.

Output fields:

  • tick - Time period index
  • metrics - Raw metric values (files_changed, lines_added, lines_removed, net_churn)
  • is_anomaly - Whether the tick was flagged as anomalous
  • churn_z_score - Net churn Z-score for the tick

aggregate

Type: aggregate

Summary statistics for the entire analysis period.

Output fields:

  • total_ticks - Number of time periods analyzed
  • total_anomalies - Number of ticks flagged as anomalous
  • anomaly_rate - Percentage of ticks that are anomalous
  • threshold - Z-score threshold used
  • window_size - Sliding window size used
  • churn_mean - Mean net churn across all ticks
  • churn_stddev - Standard deviation of net churn
  • files_mean - Mean files changed per tick
  • files_stddev - Standard deviation of files changed

Plot output

The plot format (--format plot) generates an interactive HTML report with:

  1. Net Churn Over Time - Line chart showing net churn per tick with anomalous ticks highlighted as red scatter points.
  2. Anomaly Detection Summary - Stats grid showing total ticks, anomalies detected, anomaly rate (with color-coded badge), and highest Z-score.

Further plans

  • Per-file anomaly detection (flag files that appear disproportionately in anomalous ticks).
  • Commit-level granularity option (detect anomalous individual commits, not just ticks).
  • Correlation with other analyzers (e.g., complexity spikes in anomalous ticks).
  • Configurable metric weights (e.g., prioritize files_changed over lines_added).

Documentation

Overview

Package anomaly provides temporal anomaly detection over commit history. It uses Z-score analysis with a sliding window to detect sudden quality degradation in per-tick metrics (files changed, lines added/removed, churn).

Index

Constants

const (
	ConfigAnomalyThreshold  = "TemporalAnomaly.Threshold"
	ConfigAnomalyWindowSize = "TemporalAnomaly.WindowSize"
)

Configuration keys.

const (
	DefaultAnomalyThreshold  = float32(2.0)
	DefaultAnomalyWindowSize = 20

	// MinWindowSize is the minimum valid sliding window size.
	MinWindowSize = 2
	// MinThreshold is the minimum valid Z-score threshold.
	MinThreshold = float32(0.1)
)

Default configuration values.

const (
	KindTimeSeries      = "time_series"
	KindAnomalyRecord   = "anomaly_record"
	KindAggregate       = "aggregate"
	KindExternalAnomaly = "external_anomaly"
	KindExternalSummary = "external_summary"
)

Store record kind constants.

Variables

This section is empty.

Functions

func AggregateCommitsToTicks

func AggregateCommitsToTicks(
	commitMetrics map[string]*CommitAnomalyData,
	commitsByTick map[int][]gitlib.Hash,
) map[int]*TickMetrics

AggregateCommitsToTicks builds per-tick metrics from per-commit data grouped by the commits_by_tick mapping. This replaces the need for a separate per-tick accumulation path during Consume.
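The shape of this aggregation can be sketched with simplified stand-in types. The real CommitAnomalyData and TickMetrics carry more fields (files, languages, author IDs), and string hashes stand in for gitlib.Hash here:

```go
package main

import "fmt"

// Simplified stand-ins for CommitAnomalyData and TickMetrics.
type commitData struct{ FilesChanged, LinesAdded, LinesRemoved int }
type tickMetrics struct{ FilesChanged, LinesAdded, LinesRemoved, NetChurn int }

// aggregate mirrors the idea of AggregateCommitsToTicks: sum per-commit
// metrics into one data point per tick using the commits-by-tick mapping.
func aggregate(commits map[string]commitData, byTick map[int][]string) map[int]tickMetrics {
	out := make(map[int]tickMetrics)
	for tick, hashes := range byTick {
		tm := out[tick]
		for _, h := range hashes {
			c := commits[h]
			tm.FilesChanged += c.FilesChanged
			tm.LinesAdded += c.LinesAdded
			tm.LinesRemoved += c.LinesRemoved
		}
		tm.NetChurn = tm.LinesAdded - tm.LinesRemoved
		out[tick] = tm
	}
	return out
}

func main() {
	commits := map[string]commitData{
		"a1": {FilesChanged: 2, LinesAdded: 50, LinesRemoved: 10},
		"b2": {FilesChanged: 1, LinesAdded: 5, LinesRemoved: 20},
	}
	ticks := aggregate(commits, map[int][]string{0: {"a1", "b2"}})
	fmt.Println(ticks[0]) // prints {3 55 30 25}
}
```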

func ComputeZScores

func ComputeZScores(values []float64, window int) []float64

ComputeZScores computes the Z-score for each value using a trailing sliding window of the given size. For index i, the window is values[max(0, i-window):i]. The Z-score measures how many standard deviations the current value is from the window mean. When the standard deviation is zero and the value equals the mean, the Z-score is 0. When the standard deviation is zero and the value differs from the mean, the Z-score is +/- stats.ZScoreMaxSentinel.

func EnrichAndRewrite

func EnrichAndRewrite(
	store analyze.ReportStore,
	analyzerID string,
	windowSize int,
	threshold float64,
) error

EnrichAndRewrite reads the anomaly analyzer's structured store kinds, detects cross-analyzer anomalies from other analyzers' store data, then rewrites all anomaly kinds (original + enrichment) to the store.

func GenerateStoreSections

func GenerateStoreSections(reader analyze.ReportReader) ([]plotpage.Section, error)

GenerateStoreSections reads pre-computed anomaly data from a ReportReader and builds the same plot sections as GenerateSections, without materializing a full Report or recomputing metrics.

func RegisterPlotSections

func RegisterPlotSections()

RegisterPlotSections registers the anomaly plot section renderer with the analyze package.

func RegisterStoreTimeSeriesExtractor

func RegisterStoreTimeSeriesExtractor(analyzerFlag string, fn StoreTimeSeriesExtractor)

RegisterStoreTimeSeriesExtractor registers a store-based extractor for the given analyzer flag.

func WriteEnrichmentToStore

func WriteEnrichmentToStore(
	w analyze.ReportWriter,
	externalAnomalies []ExternalAnomaly,
	externalSummaries []ExternalSummary,
) error

WriteEnrichmentToStore writes external anomaly and summary records to the writer. Called by the enrichment pipeline after cross-analyzer anomaly detection.

Types

type AggregateData

type AggregateData struct {
	TotalTicks          int     `json:"total_ticks"           yaml:"total_ticks"`
	TotalAnomalies      int     `json:"total_anomalies"       yaml:"total_anomalies"`
	AnomalyRate         float64 `json:"anomaly_rate"          yaml:"anomaly_rate"`
	Threshold           float32 `json:"threshold"             yaml:"threshold"`
	WindowSize          int     `json:"window_size"           yaml:"window_size"`
	ChurnMean           float64 `json:"churn_mean"            yaml:"churn_mean"`
	ChurnStdDev         float64 `json:"churn_stddev"          yaml:"churn_stddev"`
	FilesMean           float64 `json:"files_mean"            yaml:"files_mean"`
	FilesStdDev         float64 `json:"files_stddev"          yaml:"files_stddev"`
	LangDiversityMean   float64 `json:"lang_diversity_mean"   yaml:"lang_diversity_mean"`
	LangDiversityStdDev float64 `json:"lang_diversity_stddev" yaml:"lang_diversity_stddev"`
	AuthorCountMean     float64 `json:"author_count_mean"     yaml:"author_count_mean"`
	AuthorCountStdDev   float64 `json:"author_count_stddev"   yaml:"author_count_stddev"`
}

AggregateData contains summary statistics for the anomaly analysis.

func ReadAggregateIfPresent

func ReadAggregateIfPresent(reader analyze.ReportReader, kinds []string) (AggregateData, error)

ReadAggregateIfPresent reads the single aggregate record, returning zero value if absent.

type Analyzer

type Analyzer struct {
	*analyze.BaseHistoryAnalyzer[*ComputedMetrics]
	common.NoStateHibernation

	TreeDiff  *plumbing.TreeDiffAnalyzer
	Ticks     *plumbing.TicksSinceStart
	LineStats *plumbing.LinesStatsCalculator
	Languages *plumbing.LanguagesDetectionAnalyzer
	Identity  *plumbing.IdentityDetector

	// Configuration (read-only after Configure).
	Threshold  float32
	WindowSize int
	// contains filtered or unexported fields
}

Analyzer detects temporal anomalies in commit history using Z-score analysis over a sliding window of per-tick metrics. Per-commit results are emitted as TCs; accumulated state lives in the Aggregator, not in the analyzer.

func NewAnalyzer

func NewAnalyzer() *Analyzer

NewAnalyzer creates a new anomaly analyzer.

func (*Analyzer) ApplySnapshot

func (h *Analyzer) ApplySnapshot(snap analyze.PlumbingSnapshot)

ApplySnapshot restores plumbing state from a previously captured snapshot.

func (*Analyzer) CPUHeavy

func (h *Analyzer) CPUHeavy() bool

CPUHeavy returns false because the anomaly analyzer does not perform expensive UAST processing per commit.

func (*Analyzer) Configure

func (h *Analyzer) Configure(facts map[string]any) error

Configure applies configuration from the provided facts map.

func (*Analyzer) Consume

func (h *Analyzer) Consume(_ context.Context, ac *analyze.Context) (analyze.TC, error)

Consume processes a single commit and returns a TC with per-commit metrics. The analyzer does not retain any per-commit state; all output is in the TC.

func (*Analyzer) ExtractCommitTimeSeries

func (h *Analyzer) ExtractCommitTimeSeries(report analyze.Report) map[string]any

ExtractCommitTimeSeries extracts per-commit anomaly metrics from a finalized report. Implements analyze.CommitTimeSeriesProvider.

func (*Analyzer) Fork

func (h *Analyzer) Fork(n int) []analyze.HistoryAnalyzer

Fork creates independent copies of the analyzer for parallel processing.

func (*Analyzer) Initialize

func (h *Analyzer) Initialize(_ *gitlib.Repository) error

Initialize prepares the analyzer for processing commits.

func (*Analyzer) Merge

func (h *Analyzer) Merge(_ []analyze.HistoryAnalyzer)

Merge is a no-op. Per-commit results are emitted as TCs and collected by the framework, not accumulated inside the analyzer.

func (*Analyzer) Name

func (h *Analyzer) Name() string

Name returns the analyzer name.

func (*Analyzer) ReleaseSnapshot

func (h *Analyzer) ReleaseSnapshot(_ analyze.PlumbingSnapshot)

ReleaseSnapshot releases resources owned by the snapshot. The anomaly analyzer does not hold UAST trees, so this is a no-op.

func (*Analyzer) SnapshotPlumbing

func (h *Analyzer) SnapshotPlumbing() analyze.PlumbingSnapshot

SnapshotPlumbing captures the current plumbing output state.

func (*Analyzer) WriteToStore

func (h *Analyzer) WriteToStore(ctx context.Context, ticks []analyze.TICK, w analyze.ReportWriter) error

WriteToStore implements analyze.StoreWriter. It converts ticks to a report, computes all metrics, and streams pre-computed results as individual records:

  • "time_series": per-tick TimeSeriesEntry records (sorted by tick).
  • "anomaly_record": per-anomaly Record entries (sorted by Z-score desc).
  • "aggregate": single AggregateData record.

type CommitAnomalyData

type CommitAnomalyData struct {
	FilesChanged int            `json:"files_changed"`
	LinesAdded   int            `json:"lines_added"`
	LinesRemoved int            `json:"lines_removed"`
	NetChurn     int            `json:"net_churn"`
	Files        []string       `json:"files,omitempty"`
	Languages    map[string]int `json:"languages,omitempty"`
	AuthorID     int            `json:"author_id"`
}

CommitAnomalyData holds raw metrics for a single commit.

type ComputedMetrics

type ComputedMetrics struct {
	Anomalies         []Record          `json:"anomalies"                    yaml:"anomalies"`
	TimeSeries        []TimeSeriesEntry `json:"time_series"                  yaml:"time_series"`
	Aggregate         AggregateData     `json:"aggregate"                    yaml:"aggregate"`
	ExternalAnomalies []ExternalAnomaly `json:"external_anomalies,omitempty" yaml:"external_anomalies,omitempty"`
	ExternalSummaries []ExternalSummary `json:"external_summaries,omitempty" yaml:"external_summaries,omitempty"`
}

ComputedMetrics holds all computed metric results for the anomaly analyzer.

func ComputeAllMetrics

func ComputeAllMetrics(report analyze.Report) (*ComputedMetrics, error)

ComputeAllMetrics runs all anomaly metrics and returns the results.

func (*ComputedMetrics) AnalyzerName

func (m *ComputedMetrics) AnalyzerName() string

AnalyzerName returns the name of the analyzer.

func (*ComputedMetrics) ToJSON

func (m *ComputedMetrics) ToJSON() any

ToJSON returns the metrics in a format suitable for JSON marshaling.

func (*ComputedMetrics) ToYAML

func (m *ComputedMetrics) ToYAML() any

ToYAML returns the metrics in a format suitable for YAML marshaling.

type ExternalAnomaly

type ExternalAnomaly struct {
	Source    string  `json:"source"    yaml:"source"`
	Dimension string  `json:"dimension" yaml:"dimension"`
	Tick      int     `json:"tick"      yaml:"tick"`
	ZScore    float64 `json:"z_score"   yaml:"z_score"`
	RawValue  float64 `json:"raw_value" yaml:"raw_value"`
}

ExternalAnomaly describes an anomaly detected on an external analyzer's time series dimension.

func ReadExternalAnomaliesIfPresent

func ReadExternalAnomaliesIfPresent(reader analyze.ReportReader, kinds []string) ([]ExternalAnomaly, error)

ReadExternalAnomaliesIfPresent reads all external_anomaly records.

type ExternalSummary

type ExternalSummary struct {
	Source    string  `json:"source"    yaml:"source"`
	Dimension string  `json:"dimension" yaml:"dimension"`
	Mean      float64 `json:"mean"      yaml:"mean"`
	StdDev    float64 `json:"stddev"    yaml:"stddev"`
	Anomalies int     `json:"anomalies" yaml:"anomalies"`
	HighestZ  float64 `json:"highest_z" yaml:"highest_z"`
}

ExternalSummary summarizes anomaly detection results for one external dimension.

func ReadExternalSummariesIfPresent

func ReadExternalSummariesIfPresent(reader analyze.ReportReader, kinds []string) ([]ExternalSummary, error)

ReadExternalSummariesIfPresent reads all external_summary records.

type RawMetrics

type RawMetrics struct {
	FilesChanged      int `json:"files_changed"      yaml:"files_changed"`
	LinesAdded        int `json:"lines_added"        yaml:"lines_added"`
	LinesRemoved      int `json:"lines_removed"      yaml:"lines_removed"`
	NetChurn          int `json:"net_churn"          yaml:"net_churn"`
	LanguageDiversity int `json:"language_diversity" yaml:"language_diversity"`
	AuthorCount       int `json:"author_count"       yaml:"author_count"`
}

RawMetrics holds the raw metric values for a single tick.

type Record

type Record struct {
	Tick         int        `json:"tick"            yaml:"tick"`
	ZScores      ZScoreSet  `json:"z_scores"        yaml:"z_scores"`
	MaxAbsZScore float64    `json:"max_abs_z_score" yaml:"max_abs_z_score"`
	Metrics      RawMetrics `json:"metrics"         yaml:"metrics"`
	Files        []string   `json:"files"           yaml:"files"`
}

Record describes a detected anomaly at a specific tick.

func ReadAnomaliesIfPresent

func ReadAnomaliesIfPresent(reader analyze.ReportReader, kinds []string) ([]Record, error)

ReadAnomaliesIfPresent reads all anomaly_record records, returning nil if absent.

type ReportData

type ReportData struct {
	Anomalies         []Record
	TickMetrics       map[int]*TickMetrics
	Threshold         float32
	WindowSize        int
	ExternalAnomalies []ExternalAnomaly
	ExternalSummaries []ExternalSummary
}

ReportData is the parsed input data for anomaly metrics computation.

func ParseReportData

func ParseReportData(report analyze.Report) (*ReportData, error)

ParseReportData extracts ReportData from an analyzer report. Expects canonical format: commit_metrics and commits_by_tick.

type StoreTimeSeriesExtractor

type StoreTimeSeriesExtractor func(reader analyze.ReportReader) (ticks []int, dimensions map[string][]float64)

StoreTimeSeriesExtractor extracts tick-indexed dimensions from a store reader. It is used by analyzers that write structured store kinds.

type TickData

type TickData struct {
	// CommitMetrics maps commit hash (hex) to per-commit CommitAnomalyData.
	CommitMetrics map[string]*CommitAnomalyData
}

TickData is the per-tick aggregated payload for the anomaly analyzer. It holds per-commit metrics for the canonical report format.

type TickMetrics

type TickMetrics struct {
	FilesChanged int
	LinesAdded   int
	LinesRemoved int
	NetChurn     int
	Files        []string
	Languages    map[string]int   // language name → file count for this tick.
	AuthorIDs    map[int]struct{} // unique author IDs seen in this tick.
}

TickMetrics holds the raw metrics collected for a single tick.

type TimeSeriesEntry

type TimeSeriesEntry struct {
	Tick              int        `json:"tick"               yaml:"tick"`
	Metrics           RawMetrics `json:"metrics"            yaml:"metrics"`
	IsAnomaly         bool       `json:"is_anomaly"         yaml:"is_anomaly"`
	ChurnZScore       float64    `json:"churn_z_score"      yaml:"churn_z_score"`
	LanguageDiversity int        `json:"language_diversity" yaml:"language_diversity"`
	AuthorCount       int        `json:"author_count"       yaml:"author_count"`
}

TimeSeriesEntry holds per-tick data for the time series output.

func ReadTimeSeriesIfPresent

func ReadTimeSeriesIfPresent(reader analyze.ReportReader, kinds []string) ([]TimeSeriesEntry, error)

ReadTimeSeriesIfPresent reads all time_series records, returning nil if absent.

type ZScoreSet

type ZScoreSet struct {
	NetChurn          float64 `json:"net_churn"          yaml:"net_churn"`
	FilesChanged      float64 `json:"files_changed"      yaml:"files_changed"`
	LinesAdded        float64 `json:"lines_added"        yaml:"lines_added"`
	LinesRemoved      float64 `json:"lines_removed"      yaml:"lines_removed"`
	LanguageDiversity float64 `json:"language_diversity" yaml:"language_diversity"`
	AuthorCount       float64 `json:"author_count"       yaml:"author_count"`
}

ZScoreSet holds per-metric Z-scores for a single tick.

func (ZScoreSet) MaxAbs

func (z ZScoreSet) MaxAbs() float64

MaxAbs returns the maximum absolute Z-score across all metrics.
