Harness Engineering

The infrastructure layer around the model that turns a capable LLM into a reliable production agent.

Why harnesses, not just models

A model can perform well in isolated testing and still fail in production — not because the model regressed, but because it lacked the right context, called the wrong tool, lost important state, or had no reliable way to verify whether a task was actually completed correctly. These are system failures, not model failures, and the system surrounding the model — the harness — is what determines whether a capable model translates into a reliable agent. Most production agent failures trace back to harness gaps, not model limitations.

Three phases of the discipline

The field has moved through three overlapping phases. Prompt engineering (roughly 2022-2023) focused on wording — how a request was phrased changed output quality, and models functioned largely as smart autocomplete requiring constant human steering. Context engineering (2024-2025) shifted the bottleneck from wording to information — curating what files, rules, and constraints entered the model's context so it could reason about a specific situation rather than generating something generic. Harness engineering (2026) addresses a different problem: autonomy, accuracy, and control over longer-running, more independent agent behavior — not just what the model sees, but the entire operational envelope it runs inside.

The five layers of a production harness

  • Tool orchestration governs which tools are available when, and how their outputs are integrated back into the agent's reasoning.
  • Verification loops check whether an action actually achieved its intended effect, rather than assuming success once a tool call returns without error.
  • Context and memory management (covered in depth elsewhere) gives the agent continuity across a session and across tasks, so it does not have to relearn the same constraints repeatedly.
  • Guardrails constrain what the agent is permitted to do, catching unsafe, destructive, or out-of-scope actions before they execute rather than after.
  • Observability makes the agent's decision process inspectable after the fact (what it tried, why, and what failed), which is what makes debugging a misbehaving agent tractable rather than guesswork.
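As a rough illustration of how several of these layers can meet around a single tool call, the sketch below wraps tool execution with a blocklist guardrail, a caller-supplied verification check, and a trace for observability. The `Harness` class, its fields, and the outcome labels are hypothetical names chosen for this example; they do not come from any particular framework.

```python
from dataclasses import dataclass, field
from typing import Any, Callable


@dataclass
class Harness:
    """Hypothetical minimal harness: guardrail, verification, and trace around tools."""
    tools: dict[str, Callable[..., Any]]
    blocked: set[str] = field(default_factory=set)              # guardrail: never allowed
    trace: list[dict[str, Any]] = field(default_factory=list)   # observability log

    def call(self, name: str, verify: Callable[[Any], bool], /, **args: Any) -> Any:
        event: dict[str, Any] = {"tool": name, "args": args}
        self.trace.append(event)  # record the attempt before anything can fail

        # Guardrail: reject before execution, not after.
        if name in self.blocked or name not in self.tools:
            event["outcome"] = "blocked"
            raise PermissionError(f"tool {name!r} is not permitted")

        result = self.tools[name](**args)

        # Verification: a returned value is not yet a confirmed outcome.
        if not verify(result):
            event["outcome"] = "unverified"
            raise RuntimeError(f"tool {name!r} returned but failed verification")

        event["outcome"] = "verified"
        return result


harness = Harness(
    tools={"add": lambda a, b: a + b, "delete_all": lambda: None},
    blocked={"delete_all"},
)
total = harness.call("add", lambda r: r == 3, a=1, b=2)
```

Every attempt lands in `harness.trace`, including blocked ones, so a later debugging session can see what the agent tried and not only what succeeded.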

Sub-agent orchestration

For sufficiently complex tasks, a single agent maintaining all context for an entire job becomes its own context engineering problem. The orchestrator pattern addresses this by having one main agent maintain a high-level plan and delegate focused, well-scoped sub-tasks to sub-agents that operate in their own clean context windows, returning condensed results rather than their full working trace. This keeps any single agent's context focused and makes failures easier to localize — a sub-agent that goes wrong affects only its own narrow task, not the entire job's context.
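The context isolation described above can be sketched in a few lines. This is a toy model under stated assumptions: `model` is a stand-in callable rather than a real LLM client, and `run_subagent`, `orchestrate`, and `SubResult` are hypothetical names. The point is only that each sub-agent starts from a fresh context and hands back a condensed summary, so the orchestrator's context grows by one line per sub-task instead of by a full working trace.

```python
from dataclasses import dataclass
from typing import Callable

# Stand-in for an LLM call: takes the current context, returns the next message.
Model = Callable[[list[str]], str]


@dataclass
class SubResult:
    task: str
    summary: str  # condensed result only; the working trace is discarded


def run_subagent(task: str, model: Model, steps: int = 3) -> SubResult:
    context = [f"Task: {task}"]  # clean context: nothing inherited from the parent
    for _ in range(steps):
        context.append(model(context))  # intermediate work stays local
    return SubResult(task=task, summary=context[-1])


def orchestrate(plan: list[str], model: Model) -> list[str]:
    main_context = ["Plan: " + "; ".join(plan)]
    for task in plan:
        result = run_subagent(task, model)
        # Only the condensed result enters the orchestrator's context.
        main_context.append(f"{result.task} -> {result.summary}")
    return main_context


def fake_model(context: list[str]) -> str:
    return f"note {len(context)}"


main_context = orchestrate(["check billing", "scan logs"], fake_model)
```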

Verification as a first-class concern

A harness that doesn't verify its own output is operating on faith. Verification loops — re-checking a file was actually written correctly, a deployment actually succeeded, a calculation actually matches expected bounds — catch the silent failures that a model's own confidence cannot be trusted to flag. This is especially important for long-running, multi-step tasks where an early undetected error compounds across every subsequent step.

Example verification step configuration
{
  "step": "deploy_service",
  "tool_result": { "status": "ok", "deployment_id": "dep_42" },
  "verification": {
    "type": "http_check",
    "url": "https://api.example.com/health",
    "expect": { "status_code": 200, "body_contains": "healthy" }
  },
  "on_failure": "retry_with_rollback"
}
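One way a harness might act on a configuration like the one above is sketched here. The function name `run_verification` and the injected `http_get` callable are assumptions made for illustration; injecting the HTTP call keeps the example runnable without network access, and a real harness would pass in its own client. The field names are taken from the configuration above.

```python
from typing import Callable

# Hypothetical HTTP client signature: url -> (status_code, body).
HttpGet = Callable[[str], tuple[int, str]]

step = {
    "step": "deploy_service",
    "tool_result": {"status": "ok", "deployment_id": "dep_42"},
    "verification": {
        "type": "http_check",
        "url": "https://api.example.com/health",
        "expect": {"status_code": 200, "body_contains": "healthy"},
    },
    "on_failure": "retry_with_rollback",
}


def run_verification(step: dict, http_get: HttpGet) -> str:
    """Return 'verified', or the step's on_failure action if the check fails."""
    check = step["verification"]
    if check["type"] != "http_check":
        raise ValueError(f"unsupported verification type: {check['type']}")

    status_code, body = http_get(check["url"])
    expect = check["expect"]
    ok = status_code == expect["status_code"] and expect["body_contains"] in body

    # Note that tool_result["status"] is deliberately ignored here:
    # the decision rests on the observed health check, not the tool's own report.
    return "verified" if ok else step["on_failure"]
```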

Part II — Workflow vs agent

Anthropic distinguishes workflows — LLMs orchestrated through predefined code paths — from agents — systems where the LLM dynamically directs its own tool use and process. Most production systems are workflows with agent-like steps, not fully autonomous agents. Start with the simplest workflow that solves the task; add autonomy only where the path cannot be predicted in advance (unknown number of files to edit, unknown retrieval hops, unknown subtasks).
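The distinction can be made concrete with two small functions. This is a simplified sketch, not Anthropic's code: `llm` and `choose` are stand-in callables, and the function names are hypothetical. In the workflow, the code fixes the sequence of calls; in the agent, the model's choice decides the next step, and the harness only bounds how many steps it may take.

```python
from typing import Callable


def workflow(text: str, llm: Callable[[str], str]) -> str:
    """Workflow: the code path is fixed in advance (summarize, then translate)."""
    summary = llm(f"summarize: {text}")
    return llm(f"translate: {summary}")


def agent(
    goal: str,
    choose: Callable[[str, list[str]], tuple[str, str]],
    tools: dict[str, Callable[[str], str]],
    max_steps: int = 5,
) -> list[str]:
    """Agent: the model picks each next action; the harness enforces a step budget."""
    history: list[str] = []
    for _ in range(max_steps):
        action, arg = choose(goal, history)  # model-directed control flow
        if action == "finish":
            break
        history.append(f"{action}({arg}) -> {tools[action](arg)}")
    return history
```

The `max_steps` bound is the minimal form of control over an otherwise open-ended loop: even if the model never chooses to finish, the run terminates.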

Part II — Pattern selection matrix

| Pattern | Use when | Avoid when |
| --- | --- | --- |
| Prompt chaining | Fixed multi-step pipeline | Steps need dynamic branching |
| Routing | Input maps to distinct specialists | Single generalist suffices |
| Parallelization | Independent subtasks | Subtasks depend on each other |
| Orchestrator-workers | Subtasks unknown until runtime | Subtasks are fixed and few |
| Evaluator-optimizer | Clear rubric + iterative gain | Speed matters more than quality |
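To make one row of the matrix concrete, here is a minimal sketch of the evaluator-optimizer pattern. The `generate` and `evaluate` callables stand in for model calls, and the threshold and round limit are arbitrary illustrative defaults. The loop shows why the pattern trades speed for quality: every round costs at least one generation and one evaluation.

```python
from typing import Callable


def evaluator_optimizer(
    task: str,
    generate: Callable[[str, str], str],            # (task, feedback) -> draft
    evaluate: Callable[[str], tuple[float, str]],   # draft -> (score, feedback)
    threshold: float = 0.8,
    max_rounds: int = 3,
) -> tuple[str, float]:
    """Iterate generate/evaluate until the rubric score clears the threshold."""
    feedback = ""
    best_draft, best_score = "", float("-inf")
    for _ in range(max_rounds):  # bounded, so the loop cannot run forever
        draft = generate(task, feedback)
        score, feedback = evaluate(draft)
        if score > best_score:
            best_draft, best_score = draft, score
        if score >= threshold:
            break
    return best_draft, best_score
```

Returning the best draft seen, rather than the last one, protects against a late round that scores worse than an earlier one.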

Anti-patterns: monolithic agents with every tool loaded; unbounded planner loops; missing observability so failures are invisible.

Part II — Orchestrator-workers deep dive

The orchestrator analyzes the task, emits a structured delegation plan (XML or JSON), and assigns workers with scoped context. Workers return condensed results — not full traces. The orchestrator synthesizes and verifies. Use Promise.allSettled-style semantics: partial worker failure should not crash the entire job. Each worker gets its own timeout, idempotency key, and response schema (status, data, confidence).
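In Python, the closest analogue to Promise.allSettled-style semantics is to catch failures inside each worker wrapper so that `asyncio.gather` always receives a result. The sketch below assumes hypothetical worker coroutines and a `run_workers` helper; it covers the per-worker timeout and the `status`/`data` part of the response schema, and leaves idempotency keys and confidence scores out for brevity.

```python
import asyncio
from typing import Any, Awaitable, Callable

# name -> (coroutine function, timeout in seconds); both are illustrative.
Workers = dict[str, tuple[Callable[[], Awaitable[Any]], float]]


async def run_workers(workers: Workers) -> dict[str, dict[str, Any]]:
    async def guarded(fn: Callable[[], Awaitable[Any]], timeout_s: float) -> dict[str, Any]:
        try:
            data = await asyncio.wait_for(fn(), timeout=timeout_s)
            return {"status": "ok", "data": data}
        except asyncio.TimeoutError:
            return {"status": "failed", "data": {"error": "timeout"}}
        except Exception as exc:  # one worker's failure must not crash the job
            return {"status": "failed", "data": {"error": str(exc)}}

    results = await asyncio.gather(*(guarded(fn, t) for fn, t in workers.values()))
    return dict(zip(workers, results))


async def billing() -> dict[str, bool]:
    return {"refund_allowed": True}


async def technical() -> dict[str, str]:
    raise ValueError("log store unavailable")


async def slow() -> dict[str, str]:
    await asyncio.sleep(1)
    return {}


outcome = asyncio.run(
    run_workers({"billing": (billing, 0.5), "technical": (technical, 0.5), "slow": (slow, 0.01)})
)
```

The orchestrator can then decide what to do with a mix of `ok` and `failed` entries instead of losing the successful results to a single exception.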

Anthropic's cookbook orchestrator-workers notebook demonstrates dynamic subtask generation — superior to hardcoded parallel branches when the decomposition depends on the specific input.

Part II — LangGraph as a state machine harness

LangGraph models agent execution as a state graph: nodes are steps (LLM call, tool call, human approval), edges are transitions conditioned on state. Checkpointing persists state between steps — enabling long-running jobs, human-in-the-loop interrupts, and time-travel debugging (rewind to a prior checkpoint and branch). This is harness engineering made explicit: the graph is code; the model is one node type among several.
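The underlying idea can be shown without the library. The following is a plain-Python sketch of a state graph with checkpointing; it is not LangGraph's API, and `run_graph`, the node names, and the `"END"` sentinel are all hypothetical. Nodes are functions from state to state, a routing function picks the next node from the current state, and a snapshot is saved after every step so a run can be resumed or branched from any earlier point.

```python
import copy
from typing import Callable

State = dict
Node = Callable[[State], State]


def run_graph(
    nodes: dict[str, Node],
    route: Callable[[str, State], str],
    state: State,
    start: str,
    checkpoints: list[tuple[str, State]],
    max_steps: int = 20,
) -> State:
    current, steps = start, 0
    while current != "END":
        if steps >= max_steps:
            raise RuntimeError("step budget exhausted")
        state = nodes[current](copy.deepcopy(state))
        checkpoints.append((current, copy.deepcopy(state)))  # persist after each step
        current = route(current, state)
        steps += 1
    return state


def draft(s: State) -> State:
    s["attempts"] += 1
    return s


def review(s: State) -> State:
    s["approved"] = s["attempts"] >= 2  # stand-in for a human or model approval
    return s


def route(node: str, s: State) -> str:
    if node == "draft":
        return "review"
    return "END" if s["approved"] else "draft"


nodes = {"draft": draft, "review": review}
checkpoints: list[tuple[str, State]] = []
final = run_graph(nodes, route, {"attempts": 0, "approved": False}, "draft", checkpoints)

# "Time travel": resume from the state saved after the first review.
node, saved = checkpoints[1]
replayed = run_graph(nodes, route, saved, route(node, saved), [])
```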

Part II — Verification: outcome over transcript

A successful tool response is not a successful outcome. The harness must verify world state: file exists and hash matches, database row inserted, HTTP health check passes, invoice total within bounds. Verification loops belong in the graph as explicit nodes — not as hope that the model self-checks.
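For the first of those checks, a minimal sketch of a world-state verification might look like this. The function names are illustrative; the only assumption is that the harness knows the SHA-256 of the content it intended to write.

```python
import hashlib
from pathlib import Path


def sha256_of(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


def verify_written_file(path: Path, expected_sha256: str) -> bool:
    """True only if the file exists and its content hash matches what was intended."""
    if not path.is_file():
        return False  # the write tool may have reported success anyway
    return sha256_of(path.read_bytes()) == expected_sha256
```

A graph node built on a check like this would run after the write step and route to a retry or failure branch when it returns `False`.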

Case study: A deployment agent announced success after a CI API returned 200, but the deployment was only queued, not live. Fix: add a verify_health node that polls until success or timeout, and have the regression eval assert on the row in the outcome table rather than on the text of the assistant's message.
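A polling check of the kind used in that fix could be sketched as follows. The `probe` callable is an assumption standing in for whatever reports real deployment health (an HTTP health endpoint, for example), and the timeout and interval are left to the caller.

```python
import time
from typing import Callable


def verify_health(probe: Callable[[], bool], timeout_s: float, interval_s: float) -> bool:
    """Poll `probe` until it reports healthy or the time budget runs out."""
    deadline = time.monotonic() + timeout_s
    while True:
        if probe():
            return True  # the deployment is actually live
        if time.monotonic() + interval_s > deadline:
            return False  # another wait would overrun the budget; report failure
        time.sleep(interval_s)
```

A queued deployment fails the first few probes and passes once it goes live, which is the case a single 200 from the CI API could not distinguish.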

Orchestrator delegation schema
{
  "delegation": {
    "task_id": "audit-442",
    "workers": [
      { "id": "billing", "prompt_scope": "Refund policy only", "timeout_ms": 12000 },
      { "id": "technical", "prompt_scope": "API logs last 1h", "timeout_ms": 15000 }
    ],
    "response_schema": {
      "status": ["ok", "partial", "failed"],
      "data": "object",
      "confidence": "number"
    },
    "verify_before_merge": ["billing.confidence > 0.7", "technical.status != failed"]
  }
}
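The schema above does not define how the `verify_before_merge` expressions are evaluated. As one possible reading, the sketch below assumes each rule has the form `worker.field OP literal` with whitespace-separated parts, and compares against a number when the literal parses as one and against a plain string otherwise. This grammar and the helper names are assumptions made for illustration.

```python
import operator
from typing import Any, Callable

OPS: dict[str, Callable[[Any, Any], bool]] = {
    ">": operator.gt, "<": operator.lt, ">=": operator.ge,
    "<=": operator.le, "==": operator.eq, "!=": operator.ne,
}


def check_rule(rule: str, results: dict[str, dict[str, Any]]) -> bool:
    """Evaluate one rule such as 'billing.confidence > 0.7' (assumed grammar)."""
    path, op, literal = rule.split()
    worker, key = path.split(".")
    actual = results[worker][key]
    try:
        expected: Any = float(literal)  # numeric literal, e.g. 0.7
    except ValueError:
        expected = literal              # bare word, e.g. failed
    return OPS[op](actual, expected)


def ready_to_merge(rules: list[str], results: dict[str, dict[str, Any]]) -> bool:
    return all(check_rule(rule, results) for rule in rules)


rules = ["billing.confidence > 0.7", "technical.status != failed"]
results = {
    "billing": {"status": "ok", "data": {}, "confidence": 0.82},
    "technical": {"status": "partial", "data": {}, "confidence": 0.4},
}
mergeable = ready_to_merge(rules, results)
```

Using a small fixed operator table avoids evaluating rule strings as code, which matters when the rules are emitted by a model.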

Further reading

Orchestration. The model is not the agent; the model is only the reasoning engine. The agent is the harness: the Python or TypeScript code loop that calls the model, validates the tool call, handles errors, and calls the model again.

  • Building Effective Agents (Anthropic Research): a must-read. It cuts through framework complexity and argues that simple, composable patterns (such as orchestrator-workers, evaluator-optimizer, and routing) tend to hold up better in production than monolithic autonomous approaches.
  • LangGraph Conceptual Guide: a strong choice among current orchestration frameworks because it treats agent execution as a state machine (a state graph), which enables persistence, human-in-the-loop interrupts, and time-travel debugging (rewinding to an earlier checkpoint and branching from it).