README
¶
Flake Detector
Long-running test harness to detect flaky tests across unit, integration, and stress suites. It runs selected suites repeatedly, stops on first flake or hard failure, and saves concise diagnostics and logs for triage.
Why use it
- Catch intermittent test failures early
- Reproduce racey timing issues locally
- Produce logs and a JSON report for debugging and CI artifacts
Script
Path: scripts/flake_detector.sh
Requires: bash, Go toolchain available on PATH
What it does
- Iterates test runs for the selected suites (unit, integration, stress)
- Prints run-by-run timing and status
- Early-stops when a failure is detected, distinguishing:
- Flaky failure: previously passed, then failed in a later iteration
- Non-flaky failure: fails on the first attempt
- Optionally writes per-run logs and a JSON summary
Environment variables (options)
MAX_RUNS(default:1000)- Max iterations before stopping if no flake is found
TEST_SET(default:unit,integration,stress)- Comma-separated list of suites to run; choose any of:
unit,integration,stress
- Comma-separated list of suites to run; choose any of:
PARALLELISM(default:4)- Passed to
go test -p(package-level parallelism)
- Passed to
RACE(default:1)- Enable Go race detector for unit/integration runs (
0to disable)
- Enable Go race detector for unit/integration runs (
STRESS(default:1)- Enable stress tests by exporting
PARTI_STRESS=1when running thestresssuite
- Enable stress tests by exporting
TEST_TIMEOUT(default:6m)go test -timeoutfor unit/integration runs
STRESS_TIMEOUT(default:20m)go test -timeoutfor stress runs (stress may legitimately exceed 6 minutes)
SLOW_LOG_THRESHOLD_SECONDS(default:15)- Print a note when a single suite exceeds this duration
REPORT_JSON(optional)- Path to a JSON summary file written at the end or on early exit
SAVE_LOGS_DIR(optional)- Directory to store per-run logs; directory is created if needed
GOFLAGS(optional)- Additional flags forwarded to
go test(e.g.,-run,-bench,-count=1retained by script)
- Additional flags forwarded to
Suites
unit: all non-integration/non-stress packagesintegration:./test/integrationstress:./test/stress(requiresPARTI_STRESS=1, set by the script whenSTRESS=1)
Typical usage
Run a quick flake hunt for unit + integration only (recommended fast loop):
MAX_RUNS=10 \
TEST_SET=unit,integration \
PARALLELISM=4 \
RACE=1 \
TEST_TIMEOUT=6m \
SAVE_LOGS_DIR=.flake_logs \
REPORT_JSON=.flake_logs/fast_report.json \
./scripts/flake_detector.sh
Run stress-only with extended timeout to avoid false timeouts:
MAX_RUNS=1 \
TEST_SET=stress \
PARALLELISM=2 \
STRESS_TIMEOUT=20m \
SAVE_LOGS_DIR=.flake_logs \
./scripts/flake_detector.sh
Full run for all suites (long):
MAX_RUNS=3 \
TEST_SET=unit,integration,stress \
PARALLELISM=2 \
SAVE_LOGS_DIR=.flake_logs \
REPORT_JSON=.flake_logs/report.json \
./scripts/flake_detector.sh
Output and exit codes
Console output shows:
- Start timestamp and options
- Per-run start/finish lines per suite
- Slow-run warnings beyond
SLOW_LOG_THRESHOLD_SECONDS - Early-exit diagnostics with the last ~50 lines of failing logs
Exit codes:
0: Completed all iterations with no failures- Non-zero: Stopped early due to failure; the script prints whether it was detected as flaky (passed first, then failed) or non-flaky (failed on the first attempt).
Artifacts (when configured):
SAVE_LOGS_DIR: Saves per-run, per-suite logs named by timestampREPORT_JSON: Summary report (high-level stats, last failure context, and timing); suitable for CI artifact collection
CI suggestions
Add two jobs to your CI:
- Fast daily flake loop (unit + integration)
name: Flake hunt (fast)
on:
schedule:
- cron: '0 3 * * *'
jobs:
flake-fast:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
- uses: actions/setup-go@v5
with:
go-version: '1.25.x'
- name: Run fast loop
run: |
mkdir -p .flake_logs
MAX_RUNS=20 TEST_SET=unit,integration PARALLELISM=4 RACE=1 TEST_TIMEOUT=6m \
SAVE_LOGS_DIR=.flake_logs REPORT_JSON=.flake_logs/fast_report.json \
./scripts/flake_detector.sh
- name: Upload logs
if: always()
uses: actions/upload-artifact@v4
with:
name: flake-fast-logs
path: .flake_logs
- Weekly stress-only, extended timeout
name: Flake hunt (stress)
on:
schedule:
- cron: '0 4 * * 0'
jobs:
flake-stress:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
- uses: actions/setup-go@v5
with:
go-version: '1.25.x'
- name: Run stress-only
run: |
mkdir -p .flake_logs
MAX_RUNS=1 TEST_SET=stress PARALLELISM=2 STRESS_TIMEOUT=30m \
SAVE_LOGS_DIR=.flake_logs REPORT_JSON=.flake_logs/stress_report.json \
./scripts/flake_detector.sh
- name: Upload logs
if: always()
uses: actions/upload-artifact@v4
with:
name: flake-stress-logs
path: .flake_logs
Tips & troubleshooting
- Stress timing: If the
stresssuite times out around 6 minutes, increaseSTRESS_TIMEOUT(e.g.,20m–30m). - Speed vs coverage: Keep
RACE=1for fidelity; if you need speed for long loops, temporarily setRACE=0. - Parallelism: Tune
PARALLELISMbased on cores and memory. Very high values can increase flakiness. - Repro a single test: Use
GOFLAGSto add-runfilters to a particular package or test while keeping flake loop semantics. - Disk usage: With
SAVE_LOGS_DIRset, logs accumulate. Clean the directory periodically.
Examples for quick copy/paste
Fast loop (local):
MAX_RUNS=5 TEST_SET=unit,integration PARALLELISM=4 RACE=1 TEST_TIMEOUT=6m \
SAVE_LOGS_DIR=.flake_logs REPORT_JSON=.flake_logs/fast_report.json \
./scripts/flake_detector.sh
Stress-only (local):
MAX_RUNS=1 TEST_SET=stress PARALLELISM=2 STRESS_TIMEOUT=20m \
SAVE_LOGS_DIR=.flake_logs \
./scripts/flake_detector.sh
Questions or improvements? Open an issue or PR with proposed changes to scripts/flake_detector.sh and this README.
Click to show internal directories.
Click to hide internal directories.