What Is Harness Engineering? Control Loops for Reliable AI Agents
Add structured control loops to your AI coding agents with harness engineering — turning unpredictable model output into reliable software delivery.
Harness engineering builds an execution system around an AI coding model: a boundary of tools, checks, and feedback loops that turns a raw language model into a predictable production agent. Without a harness, an AI agent is just a chat interface with file access. With one, it becomes a governed, observable, recoverable contributor to your engineering team.
The idea draws from control theory: a system senses its output, compares it to a target, and adjusts its behaviour. In AI coding the target is a specification; the output is generated code; the sensor is a verification pipeline; the adjustment is a corrective loop — automatic (re-prompt with error) or manual (human review). This article introduces the key components of a harness and how they compose into a production-ready control loop.
The control loop: sense, compare, correct, approve
Every reliable AI agent runs on some variation of a four-step loop:
| Step | What happens | Example |
|---|---|---|
| Sense | Capture the agent’s output: code diff, file writes, shell commands | git diff after an edit (git diff --cached once the changes are staged) |
| Compare | Check output against acceptance criteria | Run tests, lint, type-check, policy rules |
| Correct | If a check fails, feed the error back to the model with context | Re-prompt with compiler errors and line numbers |
| Approve | If all checks pass, allow the change to proceed | Human approval gate or auto-merge on green |
The harness does not replace the model’s reasoning. It wraps the model in guardrails so mistakes are caught before they reach your codebase.
Consider the simplest possible harness: a shell script that passes the model’s output through a linter before accepting it.
#!/bin/bash
# Minimal verification gate: run lint after every AI edit.
# your-ai-repair-tool is a placeholder for whatever CLI re-prompts your model.
if ! lint_output=$(npx eslint . 2>&1); then
  echo "Lint failed. Feeding errors back to model..."
  echo "$lint_output" | your-ai-repair-tool --context "Fix these lint errors"
  exit 1
fi
echo "Lint passed. Change accepted."
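The same gate generalises to the full four-step loop. The Python sketch below is illustrative only: `propose` and `verify` are hypothetical callables standing in for the model call and the verification pipeline, and the limit of three attempts is an arbitrary assumption.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Verdict:
    passed: bool
    errors: str = ""  # error text to feed back to the model on failure


def run_control_loop(
    propose: Callable[[str, Optional[str]], str],
    verify: Callable[[str], Verdict],
    task: str,
    max_attempts: int = 3,
) -> Optional[str]:
    """Return an approved change, or None if every attempt fails."""
    feedback: Optional[str] = None
    for _ in range(max_attempts):
        change = propose(task, feedback)  # sense: capture the agent's output
        verdict = verify(change)          # compare: check against the criteria
        if verdict.passed:
            return change                 # approve: let the change proceed
        feedback = verdict.errors         # correct: re-prompt with the errors
    return None                           # out of attempts: escalate to a human
```

Bounding the number of attempts matters: without a limit, a model that keeps producing the same failing change would loop forever.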
What goes into a production-grade harness
Verification pipeline. Deterministic checks (unit tests, type checking, lint rules) paired with inferential checks such as an independent model review or security scan. The deterministic checks catch anything that can be expressed as a pass/fail rule; the inferential checks look for the edge cases and logic errors that no rule encodes. The Harness Engineering book calls this “sensor fusion”: combining multiple signals for a more reliable verdict than any single check can provide.
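One way to combine the two kinds of signal is sketched below. The policy is an assumption made for illustration, not something the book prescribes: a failed deterministic check always blocks, while a failed inferential check routes the change to a human. `CheckResult` and `fuse` are hypothetical names.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class CheckResult:
    name: str
    passed: bool
    deterministic: bool  # True for tests, lint, types; False for model review


def fuse(results: List[CheckResult]) -> str:
    """Combine check results into one verdict (assumed policy, see above)."""
    if not results:
        return "needs-review"  # no checks ran, so nothing vouches for the change
    if any(not r.passed for r in results if r.deterministic):
        return "reject"        # a failed deterministic check always blocks
    if any(not r.passed for r in results if not r.deterministic):
        return "needs-review"  # inferential doubt goes to a human
    return "approve"
```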
State and continuity. An agent needs context to do its job: the current file tree, recent conversation history, environment variables, and any artefacts from previous steps. The harness manages this state so the agent does not lose its place across iterations or interruptions.
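A minimal sketch of persisted state, assuming a single JSON file is enough for a first harness. The `HarnessState` class and its fields are hypothetical; a real harness would likely track more.

```python
import json
from dataclasses import asdict, dataclass, field
from pathlib import Path
from typing import List


@dataclass
class HarnessState:
    task: str
    attempt: int = 0
    history: List[str] = field(default_factory=list)  # errors from earlier attempts

    def save(self, path: Path) -> None:
        """Write the state to disk so an interrupted run can resume."""
        path.write_text(json.dumps(asdict(self)), encoding="utf-8")

    @classmethod
    def load(cls, path: Path) -> "HarnessState":
        """Rebuild the state from a file written by save()."""
        return cls(**json.loads(path.read_text(encoding="utf-8")))
```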
Observability. Every action the agent takes should be logged: what prompt it received, what code it generated, which checks passed or failed, and how long each step took. Without it, you cannot debug a bad output or measure whether your harness improves reliability. Tools such as OpenTelemetry provide a standard way to collect this telemetry from agent workflows, just as they do for distributed systems.
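Before adopting a telemetry framework, structured logging can be sketched with the standard library alone. In this illustrative example the in-memory list stands in for a log file or exporter, and `logged_step` and the record fields are assumptions rather than an established schema.

```python
import json
import time
from contextlib import contextmanager
from typing import Dict, Iterator, List


@contextmanager
def logged_step(log: List[str], step: str) -> Iterator[Dict[str, object]]:
    """Time one harness step and append a JSON line describing it."""
    record: Dict[str, object] = {"step": step, "status": "ok"}
    start = time.perf_counter()
    try:
        yield record  # the caller may attach extra fields to the record
    except Exception as exc:
        record["status"] = "error"
        record["error"] = repr(exc)
        raise
    finally:
        record["duration_ms"] = round((time.perf_counter() - start) * 1000)
        log.append(json.dumps(record))


log: List[str] = []
with logged_step(log, "lint") as rec:
    rec["passed"] = True  # record the outcome of the check
```

Because the record is written in the `finally` clause, a step that raises still leaves a log line behind.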
Recovery. When an agent enters a bad state — infinite loop, corrupted file, invalid configuration — the harness must be able to reset or roll back. This usually means snapshotting the workspace before each agent action and providing a rollback command.
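Snapshot and rollback can be sketched by copying the workspace, assuming it is small enough to copy before each action. The `snapshot` and `rollback` helpers are hypothetical names.

```python
import shutil
import tempfile
from pathlib import Path


def snapshot(workspace: Path) -> Path:
    """Copy the workspace to a temporary directory and return the copy's path."""
    backup = Path(tempfile.mkdtemp(prefix="harness-snapshot-")) / "workspace"
    shutil.copytree(workspace, backup)
    return backup


def rollback(workspace: Path, backup: Path) -> None:
    """Replace the workspace with the snapshot taken before the agent ran."""
    shutil.rmtree(workspace)
    shutil.copytree(backup, workspace)
```

In a Git repository, a stash or a throwaway worktree can serve the same purpose without a full copy.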
Governance. Policy rules that the agent cannot override: never write to production credentials, never modify CI/CD configuration without approval, never exceed a cost budget. These rules are encoded in the harness, not in the prompt, so they apply regardless of what the model decides to do.
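Rules like these can live in code that runs after the model has produced a change, so the model has no say in them. The sketch below is illustrative: the glob patterns and the budget figure are hypothetical values, and the patterns match only paths relative to the repository root.

```python
from fnmatch import fnmatch
from typing import Iterable, List

# Hypothetical policy: glob patterns the agent may never write to.
PROTECTED_PATHS = [".env*", "secrets/*", ".github/workflows/*"]
MAX_COST_USD = 5.00  # hypothetical per-task budget


def policy_violations(changed_files: Iterable[str], cost_usd: float) -> List[str]:
    """Return readable violations; an empty list means the change may proceed."""
    violations = [
        f"protected path modified: {path}"
        for path in changed_files
        if any(fnmatch(path, pattern) for pattern in PROTECTED_PATHS)
    ]
    if cost_usd > MAX_COST_USD:
        violations.append(
            f"cost budget exceeded: {cost_usd:.2f} > {MAX_COST_USD:.2f}"
        )
    return violations
```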
Ad hoc prompting vs harness engineering
| | Ad hoc prompting | Harness engineering |
|---|---|---|
| Verification | Manual review, no gate | Automated pipeline, gate per step |
| Error recovery | Start a new chat, lose context | Re-prompt with error context, preserve state |
| Observability | Screenshots and memory | Structured logs and metrics |
| Governance | Whatever the model agrees to | Enforced policy, cannot be bypassed |
| Repeatability | Every session is different | Same harness, consistent behaviour |
For teams adopting agentic coding workflows (discussed in Agentic Coding Pro), the shift from ad hoc prompting to harness engineering is the highest-leverage investment. It does not require better models. It requires better systems around the models you already have.
A worked example: harnessed feature generation
Imagine you ask your AI agent to add a rate limiter to an API endpoint. Without a harness, the agent writes the middleware but runs npm test only if you remember to ask, and you then copy any errors back into the chat by hand.
With a harness, the same request flows through a controlled pipeline.
Add an Express rate limiter to the /api/orders endpoint using express-rate-limit.
Constraints:
- Limit: 100 requests per 15 minutes per IP
- Return 429 with a JSON body: { "error": "rate_limit_exceeded" }
- Use the standard X-RateLimit-* response headers
- Add tests for happy path, limit exceeded, and header presence
- Do not modify existing middleware
The harness executes this prompt through its agent, captures the output, and runs the verification pipeline:
{
  "step": "verify",
  "checks": [
    { "name": "lint", "passed": true },
    { "name": "type-check", "passed": true },
    { "name": "unit-tests", "passed": true, "summary": "12 passed, 0 failed" },
    { "name": "rate-limit-tests", "passed": true, "summary": "3 passed, 0 failed" }
  ],
  "duration_ms": 8470,
  "decision": "auto-approve"
}
Each check is a deterministic gate. The harness does not ask the model whether the code is correct — it runs the tests and uses their exit codes as ground truth. This is the central insight of harness engineering: the model proposes; the harness disposes.
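A report of this shape could be produced by a function like the one sketched here, which treats exit code 0 as the only definition of "passed". The `run_checks` name is hypothetical, and the `"reject"` decision value is an assumption, since the report above shows only the approving case.

```python
import subprocess
import time
from typing import Dict, List


def run_checks(commands: Dict[str, List[str]]) -> dict:
    """Run each command and derive the decision from exit codes alone."""
    start = time.perf_counter()
    checks = []
    for name, argv in commands.items():
        result = subprocess.run(argv, capture_output=True, text=True)
        checks.append({"name": name, "passed": result.returncode == 0})
    # Approve only if at least one check ran and every check passed.
    all_passed = bool(checks) and all(check["passed"] for check in checks)
    return {
        "step": "verify",
        "checks": checks,
        "duration_ms": round((time.perf_counter() - start) * 1000),
        "decision": "auto-approve" if all_passed else "reject",
    }
```

For the rate limiter task, a call might look like `run_checks({"lint": ["npx", "eslint", "."], "unit-tests": ["npm", "test"]})`.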
Why harness engineering matters now
The capabilities of frontier models have improved dramatically, but their reliability has not kept pace. A model that writes excellent code 80% of the time still breaks 20% of the time. In manual workflows that 20% is friction; in automated ones it is an incident waiting to happen.
Harness engineering addresses this by treating unreliability as a systems problem rather than a model problem. Instead of waiting for a model that never makes mistakes, you build a system that assumes them and catches them before they propagate.
The Anthropic research on building effective agents makes a similar observation: the most reliable agentic systems are not the ones with the most capable models, but the ones with the best-designed tool use, verification, and error-handling layers.
Building your first harness
You do not need a complex orchestration framework to start. A first harness can be as simple as three pieces:
- A shell script that runs lint and tests after every agent edit
- A git stash / git checkout rollback if the tests fail
- A log file that records each attempt’s outcome, duration, and verdict
From there you can incrementally add type checking, security scanning, approval gates, and observability. The Harness Engineering book provides a structured ninety-day adoption roadmap for teams moving from zero harness to production-grade control loops.
Start with the loop — sense, compare, correct, approve — and make it visible. Every time the harness catches a mistake you have evidence the investment pays off. Every miss signals which check to add next.
The future of AI coding is harnessed
Harness engineering is not an alternative to better models; it is the necessary complement. Models will never be perfect, and neither is any other software. The question is not whether your AI agent will make mistakes, but whether your system will catch them before they reach production.
Building that system is harness engineering — and for any team aiming to run AI coding agents at scale, it is the single most important practice you can adopt.