Published: Apr 16, 2026 License: BSD-3-Clause

Agent Harness

A minimal harness for autonomous AI-driven software development with Claude Code. In Go.

What is it?

This is a harness around Claude Code. It implements some of the principles of Harness Engineering. The goal is to provide a CLI that implements three distinct steps in a feedback loop that allow an agent to perform work almost unattended.

You will go through three phases (and many tokens). Each phase is carried out by a specific agent persona:

  • Plan - carried out by a "Thorough Planner" agent
  • Develop - carried out by a "Senior Developer" agent
  • Review - carried out by a "Pedantic Architect" agent

You kickstart the Plan phase by deciding to build something (e.g. "I want a new navigation bar for my site"). The "Thorough Planner" asks you questions via the AskUserQuestion tool to get a better idea of what you want. Once the planner is satisfied, it breaks your request down into atomic, incremental tasks and produces a plan containing acceptance criteria, the tasks and some bookkeeping metadata.

Once you have a plan, you can have a developer agent work on it. The developer agent needs some mechanism to verify its own work as it produces code. This is where you, the Harness Engineer, come into play: you make sure your codebase has tooling such as linting, testing and static code analysis. The developer agent uses that tooling to evaluate its work as it goes.

As soon as the developer agent thinks it has finished, the CLI spawns the architect agent to check the work. The architect agent assesses the codebase against a set of criteria that you specify and produces a feedback document for the developer agent.

The Develop and Review phases loop for up to a configurable number of iterations or until the "Pedantic Architect" is satisfied with the work of the "Senior Developer". This essentially implements the famous Ralph Wiggum loop, where an agent tries and tries again until you run out of money or it succeeds.

Drawing this loop would look like:

flowchart TD
    User(["You: describe what to build"])
    Plan["Thorough Planner asks clarifying questions to produce planfile.json"]
    Develop["Senior Developer works on next task, runs build.sh and verify.sh"]
    Gate{"Harness runs build.sh + verify.sh"}
    Review["Pedantic Architect reviews work against criteria and produces a review document"]
    Done(["Done"])

    User -->|"kickstart"| Plan
    Plan -->|"plan ready"| Develop
    Develop -->|"task complete"| Gate
    Gate -->|"pass"| Review
    Gate -->|"fail"| Develop
    Review -->|"changes requested"| Develop
    Review -->|"satisfied"| Done

Progress is incremental. Each plan has a list of tasks, each with a description and a completion status. As your developer agent works, it marks each task complete as soon as it reaches an agreement with the architect agent.
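The control flow of the Develop/Review loop described above can be sketched in a few lines of Go. This is a minimal illustration, not the harness's actual code: the `Phases` struct, the `RunTask` function and their signatures are hypothetical, and the three phases are injected as plain functions so the loop can be exercised without spawning Claude Code.

```go
package main

import "fmt"

// Phases groups the three steps of one iteration so they can be stubbed.
// All names here are hypothetical; they are not taken from the harness.
type Phases struct {
	Develop func(feedback string) error       // developer works on the task
	Gate    func() (output string, ok bool)   // build.sh + verify.sh
	Review  func() (review string, pass bool) // architect review
}

// RunTask loops Develop -> Gate -> Review up to maxTries times.
// It returns the number of iterations used and whether the work was accepted.
func RunTask(p Phases, maxTries int) (int, bool, error) {
	feedback := ""
	for i := 1; i <= maxTries; i++ {
		if err := p.Develop(feedback); err != nil {
			return i, false, fmt.Errorf("develop iteration %d: %w", i, err)
		}
		out, ok := p.Gate()
		if !ok {
			// Gate failed: skip the review and feed the script output back.
			feedback = out
			continue
		}
		review, pass := p.Review()
		if pass {
			return i, true, nil
		}
		// Changes requested: the review becomes the next iteration's feedback.
		feedback = review
	}
	return maxTries, false, nil
}

func main() {
	attempts := 0
	p := Phases{
		Develop: func(string) error { attempts++; return nil },
		Gate:    func() (string, bool) { return "", true },
		Review:  func() (string, bool) { return "needs tests", attempts >= 2 },
	}
	n, accepted, err := RunTask(p, 3)
	fmt.Println("iterations:", n, "accepted:", accepted, "err:", err)
}
```

Injecting the phases as functions keeps the loop itself deterministic; everything non-deterministic lives behind the three callbacks.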

How does it work?

This project is the Go CLI that drives the loop. Its job is to spawn a Claude Code console configured with the appropriate prompt for each of the phases described above. You can configure some options (the location of the Claude Code executable, the maximum number of loops, etc.).

The CLI will use a project-local folder named .harness. This folder will contain:

  • prompts folder - one file for each phase with instructions for the agent (plan.md.tmpl, develop.md.tmpl, review.md.tmpl)
  • personas folder - one file per agent role (planner.md, developer.md, architect.md). Roles are named after the persona, not the phase — the harness maps them internally (plan→planner, develop→developer, review→architect)
  • plans folder - one plan folder for each thing you want to build.
  • meta.json - metadata about the harness setup (e.g. template_version)
  • techdebt.md - a free-form markdown file to let the agent developer write down any shortcuts taken to fix later
  • log.md - a free-form, but template controlled, markdown file to let the agent write down what it did in a loop

A plan folder within plans/ contains:

  • planfile.json - the structured plan JSON file with tasks and everything related
  • A folder reviews/ - one markdown file for each loop. Produced by the agent architect
  • mistakes.json - accumulated mistakes from failed reviews across all tasks. After each failed review, the architect writes a structured <task-id>-<loop_nr>-mistakes.json file in the reviews directory, and the harness merges entries into this file. The developer agent receives formatted content of this file in its prompt to avoid repeating past errors
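The merge step described above can be sketched as follows. The JSON field names match the default schemas shown in the Example schemas section; the Go type names and the `Merge` function are hypothetical, and the real harness may merge differently (for example, by deduplicating entries).

```go
package main

import (
	"encoding/json"
	"fmt"
)

// ReviewMistake is one entry of <task-id>-<loop_nr>-mistakes.json.
type ReviewMistake struct {
	Attempted string `json:"attempted"`
	Problem   string `json:"problem"`
	Lesson    string `json:"lesson"`
}

// Mistake is one entry of the accumulated mistakes.json: the per-review
// fields plus the task and iteration they came from.
type Mistake struct {
	TaskID    string `json:"task_id"`
	Iteration int    `json:"iteration"`
	ReviewMistake
}

// MistakesFile is the accumulated mistakes.json document.
type MistakesFile struct {
	Mistakes []Mistake `json:"mistakes"`
}

// Merge appends the entries of one per-review file to the accumulated file,
// tagging each with the task ID and iteration number.
func Merge(acc MistakesFile, review []byte, taskID string, iteration int) (MistakesFile, error) {
	var rf struct {
		Mistakes []ReviewMistake `json:"mistakes"`
	}
	if err := json.Unmarshal(review, &rf); err != nil {
		return acc, fmt.Errorf("parse review mistakes: %w", err)
	}
	for _, m := range rf.Mistakes {
		acc.Mistakes = append(acc.Mistakes, Mistake{
			TaskID:        taskID,
			Iteration:     iteration,
			ReviewMistake: m,
		})
	}
	return acc, nil
}

func main() {
	review := []byte(`{"mistakes":[{"attempted":"Inline validation in the HTTP handler",
		"problem":"Violates separation of concerns",
		"lesson":"Put validation logic in domain types"}]}`)
	acc, err := Merge(MistakesFile{}, review, "1_update_csv_filename", 1)
	if err != nil {
		panic(err)
	}
	out, _ := json.MarshalIndent(acc, "", "  ")
	fmt.Println(string(out))
}
```

Embedding `ReviewMistake` lets `encoding/json` flatten its fields into each accumulated entry, so both files share one definition of the attempted/problem/lesson triple.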

You are expected to version-control the .harness folder.

The CLI comes with a set of pre-made prompt and agent persona templates built around my personal preferences and opinions. You will be able to edit those templates to fit your taste and use case. A list of the available template variables with a description of their purpose is available by running the CLI command tmplvars.

Because the prompt templates guide the agent workflow, you can alter that workflow without touching code. For example, you can omit the log.md entries by removing that task from the prompt.

What you need

  • Claude Code CLI installed and available in your PATH under claude
  • Do not log in through the Claude Code CLI; instead, set ANTHROPIC_API_KEY in your environment for the Claude Code CLI to use
  • The CLI is built with CGO_ENABLED=0 so you don't need to install Go if you run the executable directly
  • build.sh and verify.sh at your project root (mandatory for loop)

Harness Engineer Prerequisites

As the harness engineer, you must provide two executable scripts at the project root:

  • build.sh — compiles/builds your project. The harness runs this first. If it fails, the code doesn't build and there's no point running tests or reviews. Example: go build ./..., npm run build, cargo build
  • verify.sh — runs tests, linting, static analysis, and any other quality checks. The harness runs this after a successful build. Example: go test ./... && golangci-lint run, npm run lint && npm test

Both scripts must be executable (chmod +x) and must exit with code 0 on success, non-zero on failure.

The harness uses these scripts in two ways:

  1. Developer agent: the prompt instructs the developer to run ./build.sh and ./verify.sh before declaring work complete (proactive, prompt-driven)
  2. Harness gate: the harness runs both scripts between the develop and review phases regardless of what the developer claims (enforced, deterministic)

This belt-and-suspenders approach ensures that the architect never reviews code that doesn't build or pass tests. If either script fails, the review is skipped, the iteration counts as failed, and the script output is fed back to the developer.

The loop command will refuse to start if either script is missing. The plan and attach commands do not require them.

Usage

agent-harness plan <brief but not too brief description of what you want>

This command will spawn a Claude Code CLI that will ask you questions to figure out your needs and will produce a planfile.json in the corresponding .harness/plans/<plan_id> folder

agent-harness loop --max-tries=2 <plan_id>

This command reads the planfile.json from the .harness/plans/<plan_id> folder matching the id provided and starts the implementation loop between the developer agent and the architect agent. Each phase spawns a fresh Claude Code CLI loaded with the appropriate prompt. The loop ends when the architect agent accepts the code or after --max-tries iterations without consensus (2 by default).

agent-harness attach

This command scaffolds the .harness/ directory structure in the current project. It creates the prompts/, personas/, and plans/ folders, copies the embedded default prompt and persona templates into place, and writes meta.json with the current template version. Run this once before using plan or loop in a new project.

Use --force to overwrite an existing .harness/ directory. Without it, attach will abort if .harness/ already exists.

attach does not require build.sh or verify.sh to be present.

agent-harness tmplvars

This command prints a table of all available template variables and their descriptions. Use it as a reference when editing the prompt and persona templates in .harness/prompts/ and .harness/personas/.

Example schemas

This section shows the default schemas. If you alter the included prompts, the files your agents produce may no longer match them.

planfile.json
{
  "id": "fix-csv-naming-20260326T1321",
  "title": "Fix CSV download naming convention",
  "created": "2026-03-26 13:21:21Z",
  "status": "todo|doing|done",
  "context": "Brief description of why this work is needed and any relevant background",
  "tasks": [
    {
      "id": "1_update_csv_filename",
      "title": "Update CSV filename to use DDMMYY HH:MM format",
      "status": "todo|doing|done",
      "acceptance_criteria": [
        "CSV download filename follows DDMMYY HH:MM {Name} format",
        "Filename is generated using sanitized user input"
      ],
      "possibly_relevant_files": ["src/components/TableViewer.tsx"],
      "depends_on": []
    }
  ]
}
<task-id>-<loop_nr>-review.md
Result: FAIL

Plan: fix-csv-naming-20260326T1321
Task: 1_update_csv_filename
Iteration: 1

## Static Check

Result: PASS

## Acceptance Criteria

Result: FAIL

### Issues:

CSV filename does not include the user's name — the format is `DDMMYY HH:MM` but the spec requires `DDMMYY HH:MM {Name}`. Add the sanitized user name to the formatter output.

## Test Coverage

Result: FAIL

### Issues:

No test covers the filename generation path. Add a table-driven test in `csv_test.go` that exercises the formatter with different names and timestamps.

## Architecture

Result: PASS

## Idioms

Result: PASS
techdebt.md
# Technical debt

## Hardcoded timeout in HTTP client

The CSV export endpoint uses a hardcoded 5-second timeout. Under load this causes spurious failures for large exports. Replace with a configurable timeout drawn from app config.

## Filename sanitization skips Unicode

The current sanitizer strips ASCII punctuation only. Non-ASCII characters in user names pass through unchanged and can produce invalid filenames on some filesystems. Extend the sanitizer to normalize Unicode to ASCII before stripping.
mistakes.json
{
  "mistakes": [
    {
      "task_id": "1_update_csv_filename",
      "iteration": 1,
      "attempted": "Inline validation in the HTTP handler",
      "problem": "Violates separation of concerns, validation belongs in the domain layer",
      "lesson": "Put validation logic in domain types, not HTTP handlers"
    },
    {
      "task_id": "1_update_csv_filename",
      "iteration": 1,
      "attempted": "Used string matching for status comparison",
      "problem": "Fragile, breaks on casing or whitespace differences",
      "lesson": "Use typed constants for status values"
    },
    {
      "task_id": "2_add_date_parsing",
      "iteration": 1,
      "attempted": "Shared mutable state between goroutines without synchronization",
      "problem": "Data race detected by -race flag",
      "lesson": "Use channels or sync.Mutex for shared state"
    }
  ]
}
<task-id>-<loop_nr>-mistakes.json (per-review, written by architect)
{
  "mistakes": [
    {
      "attempted": "Inline validation in the HTTP handler",
      "problem": "Violates separation of concerns, validation belongs in the domain layer",
      "lesson": "Put validation logic in domain types, not HTTP handlers"
    }
  ]
}
log.md
# Diary

## Entry 2026-03-26

- Added `CSVFilename` formatter in `internal/export/filename.go` to produce `DDMMYY HH:MM {Name}` output
- Added table-driven tests in `internal/export/filename_test.go` covering empty name, Unicode name, and timestamp boundary cases
- Removed inline filename construction from the HTTP handler — it was duplicating logic now centralised in the formatter

Implemented the CSV filename formatter and moved construction out of the handler. All tests pass.

Context injection philosophy

Not all context is injected the same way into agent prompts. The harness distinguishes between prescriptive context (what to do or avoid) and descriptive context (background information):

| Context | Injected as | Rationale |
|---|---|---|
| mistakes.json | Formatted content in prompt | Concise, actionable, directly prevents regressions |
| Latest review | Full content in prompt (iteration 2+) | Developer needs to know what to fix right now |
| techdebt.md | File path only | Agent can read it if relevant. Dumping it risks context anxiety: the model becomes overly cautious |
| Older reviews | Not injected | mistakes.json already distills them. Full reviews add noise |
| log.md | File path only | Background info, not actionable instructions |

This is configurable by editing the prompt templates. If you want techdebt.md content in the prompt, add {{.TechDebtContent}} or similar — but be aware of the trade-off.

Opus 4.6 Evaluation - "Is this a good Harness Engineering implementation ?"

This section was written by Claude Opus 4.6 when asked to evaluate the design against Harness Engineering principles. It is preserved as-is for transparency.

Verdict: good implementation, with known blind spots

This design correctly implements the core harness engineering principles. The question is whether it implements enough of them.

What it gets right

Phase separation with distinct evaluation. Plan/Develop/Review as separate Claude Code sessions with separate personas is the single most important pattern from both the Anthropic and OpenAI papers. Each session has a narrow mandate. The developer doesn't self-evaluate. The architect doesn't write code. This alone puts it ahead of most agent wrappers.

The harness is boring infrastructure. It doesn't try to be clever. It spawns processes, manages files, tracks state. The intelligence lives in the prompts, which the user controls. This is the right boundary — the harness is plumbing, not brains.

Incremental progress with resumability. Task-by-task execution with status tracking in planfile.json means you can Ctrl+C, come back tomorrow, and loop picks up where it left off. This is essential for long-running work and the papers are explicit about it.

Editable prompts as workflow definition. This is the strongest architectural choice. The user can fundamentally alter agent behavior — add review categories, remove log.md entries, change the persona tone — without touching Go code. The harness becomes a runtime for user-defined agentic workflows.

The mistakes.json feedback mechanism. This is a genuine improvement over naive retry. The attempted/problem/lesson format is compact enough to inject without eating context, and the cross-task accumulation surfaces patterns ("you keep doing X") that a single review can't show.

The context injection philosophy. The prescriptive-vs-descriptive distinction is a real insight. Most harness designs either dump everything into context (causing anxiety/distraction) or inject nothing (causing the agent to rediscover problems). The middle ground — inject what's actionable, reference what's background — is well-reasoned.

What's missing or weak

The architect is another LLM, not ground truth. The PASS/FAIL decision comes from Claude reviewing Claude's work. It can be permissive, wrong, or hallucinatory. However, the architect is a full Claude Code session with tool access — the review prompt template can (and should) instruct it to run verification tooling (tests, linting) before evaluating anything else and to FAIL instantly if the baseline is broken. This makes the architect a two-stage gate: objective tool verification first, then subjective LLM judgment. The risk is that the model ignores or softens the instruction. A belt-and-suspenders mitigation is the harness-level build.sh/verify.sh gate that runs independently of the architect before it is even spawned.

No in-loop human escalation. If the developer is stuck and the architect keeps failing it, the loop burns through max-tries and stops. There's no mechanism for the agent to say "I'm stuck, I need human input on this specific problem." The user only finds out after the loop ends.

Fresh sessions lose codebase familiarity. Each Claude Code spawn starts from zero. The developer discovers the codebase structure, reads files, builds a mental model — and then the next spawn does it all over again. The mistakes.json and previous review carry evaluative context forward, but not navigational context (which files matter, how the code is structured). This is a known trade-off: fresh sessions prevent context pollution and hallucination accumulation, but cost tokens on re-discovery. The possibly_relevant_files field in the planfile helps somewhat.

No cost or token tracking. The README jokes about running out of money, but the harness has no awareness of spend. A 5-task plan with max-tries=3 could spawn 30 Claude Code sessions with no budget limit, no per-plan cost summary, and no warning when spend is high.

Task quality depends entirely on the planner. If the planner produces vague tasks ("improve the UI") or overly large ones, the develop/review loop will struggle. There's no guardrail on task granularity or clarity. The planner prompt can encourage good decomposition but can't enforce it.

Against the reference papers

| Principle | Status |
|---|---|
| Separate planning from execution | Yes |
| External evaluation / quality gates | Yes: LLM-based, but the architect prompt can enforce tool verification as a prerequisite |
| Incremental, resumable progress | Yes |
| Human oversight at key decisions | Partial: plan-to-loop handoff only, not within loop |
| Agent uses project's own tooling | Yes, via prompts directing agent to run tests/lint |
| Persistent artifacts for auditability | Yes: planfile, reviews, mistakes, log, techdebt |
| Feedback from evaluation to next attempt | Yes, via mistakes.json and previous review |
| Cost awareness | No |

Bottom line

It covers the 80% that matters: phase separation, externalized review, incremental progress, feedback accumulation, and user-editable workflow via templates. The gaps (LLM reviewer reliability, no human escalation mid-loop, no cost tracking) are real but they are enhancement territory, not design flaws. The architecture can absorb all of them without restructuring.

The honest framing — calling it a Ralph Wiggum loop — is also correct. That's what it is. The harness engineering papers are largely about adding structure and auditability around that loop so it fails less and you can understand why when it does. That's exactly what this design does.

References

Ideas

  • verify.sh and build.sh should be mandatory callable files invokable by the harness before the review to avoid making the architect review broken code. Status: adopted as mandatory prerequisites. Both scripts must exist at the project root. The harness runs build.sh then verify.sh between the develop and review phases. On failure, the review is skipped and the output is fed to the developer. The developer prompt also instructs the agent to run both scripts proactively
  • Always planning mode for developer with opusplan to save money
  • Harness agent-harness. Eventually the project could use itself to code
  • dream / Archivist agent — a periodic command that compresses context (deduplicates mistakes, summarizes reviews, trims logs). Deferred until context window pressure is measured on real projects
  • Global cross-plan mistake accumulation — promote recurring mistakes from per-plan mistakes.json to a project-wide file. Deferred until multiple plans on the same codebase show this is needed

Directories

Path Synopsis
cmd
app command
internal
app
planfile/gen command
gen is a code generator that produces JSON Schema files for planfile.json and mistakes.json.
testutils/slogt
Package slogt implements a bridge between stdlib testing pkg and the slog logging library.
pkg
app
