Agentic Harness: Code-Driven LLM Optimization
- An agentic harness is an orchestration mechanism that uses recursive, agent-controlled code search to optimize LLM interactions.
- It leverages full-history access to execution traces for precise failure diagnosis and causal reasoning without modifying model weights.
- Reported results show gains in text classification (+7.7 accuracy points with about 4× fewer context tokens), math reasoning, and coding.
An agentic harness is the explicit mechanism, typically manifested as an orchestrating code layer or search/optimization procedure, that governs how information is stored, retrieved, formatted, and presented to an LLM. Crucially, this mechanism can itself be autonomously engineered or optimized by an agent acting over a record of prior harnesses and their executions. The distinction between manual and agentic harnesses is central: in the latter, the search-and-refinement loop is recursively agent-driven, allowing systematic improvement and adaptation solely through outer-loop reasoning on code, traces, and results, without direct modification of model weights (Lee et al., 30 Mar 2026).
1. Scope and Formal Definition
A harness $H$ for a fixed LLM $M$ is executable code that specifies:
- What interaction-derived context to persist (memory policies)
- What information to retrieve and how for subsequent prompts (retrieval policy)
- Prompt/step formatting, tool-call structure, and execution ordering (orchestration and I/O semantics)
Given a sample $x$, the harness $H$ induces a rollout trajectory $\tau$, dictating the prompts presented to the model $M$ and the state transitions throughout the task instance.
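The three responsibilities listed above can be pictured as a small interface. The following is a minimal sketch under stated assumptions, not the interface used by Meta-Harness: the class `KeywordHarness`, its method names, the `max_items` parameter, and the stub `echo_model` are all hypothetical names introduced for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, List

# Hypothetical stand-in for a fixed LLM: any callable from prompt to text.
Model = Callable[[str], str]


@dataclass
class KeywordHarness:
    """Toy harness: memory policy, retrieval policy, and prompt formatting."""
    memory: List[str] = field(default_factory=list)
    max_items: int = 2  # retrieval budget (assumed parameter)

    def persist(self, sample: str, output: str) -> None:
        # Memory policy: keep every (sample, output) pair as one line.
        self.memory.append(f"{sample} -> {output}")

    def retrieve(self, sample: str) -> List[str]:
        # Retrieval policy: rank stored lines by word overlap with the sample.
        words = set(sample.lower().split())
        scored = [(len(words & set(m.lower().split())), m) for m in self.memory]
        ranked = sorted(scored, key=lambda t: t[0], reverse=True)
        return [m for overlap, m in ranked if overlap > 0][: self.max_items]

    def format_prompt(self, sample: str) -> str:
        # Formatting policy: retrieved context first, then the new input.
        context = "\n".join(self.retrieve(sample))
        return f"Context:\n{context}\n\nInput: {sample}\nAnswer:"

    def rollout(self, model: Model, sample: str) -> dict:
        # One step of a trajectory: build prompt, call model, update state.
        prompt = self.format_prompt(sample)
        output = model(prompt)
        self.persist(sample, output)
        return {"prompt": prompt, "output": output}


def echo_model(prompt: str) -> str:
    # Stub model so the sketch runs without any API access.
    return "label-A" if "refund" in prompt else "label-B"


harness = KeywordHarness()
first = harness.rollout(echo_model, "refund request for order")
second = harness.rollout(echo_model, "another refund question")
```

In this toy setup the second prompt contains the line persisted by the first call, which shows how the harness, rather than the model, determines what context each prompt carries.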
An agentic harness is one whose search, evaluation, and refinement are themselves agent-controlled: a coding agent (“proposer”) is given full access to source code, scores, and execution traces (not just scalar rewards) and operates over a shared filesystem archive. Instead of being hand-modified, the harness is iteratively evolved through agent-conducted code reasoning, diagnosis, and patching, informed by explicit exploration of prior failures and successes (Lee et al., 30 Mar 2026).
2. Agentic Harness Search and Optimization Procedure
The canonical agentic harness workflow is embodied by the Meta-Harness system (Lee et al., 30 Mar 2026), structured as follows:
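- Initialization: Seed a directory $\mathcal{D}$ with hand-designed baseline harnesses, run each on the model $M$ against a task distribution $\mathcal{X}$, and log execution traces and scores.
- Agentic Search Loop:
  - The proposer $P$ reads code, logs, and artifacts in $\mathcal{D}$ using direct shell tools (e.g., grep, cat), diagnoses failures (e.g., “prompt-cleanup regression”), and generates natural-language hypotheses and plans.
  - $P$ applies code edits (local or global) to produce new candidate harnesses $H'$.
  - Each $H'$ is interface-validated and executed on $\mathcal{X}$, and the newly generated execution traces and scores are appended to $\mathcal{D}$.
- Objective: Find $H^*$ maximizing expected rollout score, often under a context-token constraint:
$$H^* = \arg\max_{H} \; \mathbb{E}_{x \sim \mathcal{X},\ \tau \sim p_M(H, x)} \left[ r(\tau, x) \right]$$
Here $p_M(H, x)$ denotes the distribution over rollout trajectories $\tau$ obtained by running harness $H$ with model $M$ on sample $x$, and $r(\tau, x)$ is the rollout score; when a context-token budget applies, the maximization is restricted to harnesses within that budget.
Non-dominated (Pareto-optimal) points in (accuracy, context cost) space are designated as frontiers.
Notably, there are no fixed mutation operators; $P$ freely explores code space, guided by file-level history and execution outcomes. The filesystem can grow arbitrarily large (millions of tokens); $P$ navigates it selectively rather than reading it exhaustively.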
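The loop described above can be sketched in a few lines. This is an illustrative sketch only, not the Meta-Harness implementation: the archive layout (`harness.py`, `traces.json`, `score.json`), the function names, and the toy task are assumptions, and `stub_proposer` is a hard-coded rule standing in for what would in practice be an LLM coding agent with shell access.

```python
import json
import tempfile
from pathlib import Path
from typing import Callable, List, Tuple

Task = Tuple[str, str]  # (input, expected output); toy stand-in for the task set


def load_harness(source: str) -> Callable[[str], str]:
    # Interface validation: the candidate must define a callable `run`.
    namespace: dict = {}
    exec(source, namespace)
    run = namespace.get("run")
    if not callable(run):
        raise ValueError("candidate harness must define run(x)")
    return run


def evaluate(archive: Path, name: str, source: str, tasks: List[Task]) -> float:
    # Execute the candidate and append code, traces, and score to the archive.
    run = load_harness(source)
    traces = [{"input": x, "output": run(x), "expected": y} for x, y in tasks]
    score = sum(t["output"] == t["expected"] for t in traces) / len(tasks)
    cand = archive / name
    cand.mkdir(parents=True)
    (cand / "harness.py").write_text(source)
    (cand / "traces.json").write_text(json.dumps(traces, indent=2))
    (cand / "score.json").write_text(json.dumps({"score": score}))
    return score


def stub_proposer(archive: Path) -> str:
    # Placeholder for the coding agent: inspect failing traces of the best
    # candidate so far and patch its code with one hard-coded fix.
    best = max(archive.iterdir(),
               key=lambda c: json.loads((c / "score.json").read_text())["score"])
    traces = json.loads((best / "traces.json").read_text())
    failures = [t for t in traces if t["output"] != t["expected"]]
    source = (best / "harness.py").read_text()
    if any(t["input"] != t["input"].strip() for t in failures):
        source = source.replace("return x.upper()", "return x.strip().upper()")
    return source


tasks = [("a", "A"), (" b ", "B")]
baseline = "def run(x):\n    return x.upper()\n"

with tempfile.TemporaryDirectory() as tmp:
    archive = Path(tmp)
    scores = [evaluate(archive, "h0", baseline, tasks)]
    for step in range(1, 3):
        candidate = stub_proposer(archive)
        scores.append(evaluate(archive, f"h{step}", candidate, tasks))
```

The design point the sketch tries to preserve is that the proposer receives the archive itself (code, traces, and scores) rather than a scalar reward, so its edit can be conditioned on the specific failing inputs.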
3. Filesystem Access and Causal Reasoning
A defining characteristic is the proposer's access to raw code and full execution traces (prompts, tool calls, model outputs, state updates), as well as scalar metrics stored as artifacts. This historical archive supports:
- Failure Diagnosis: Isolating problematic code patterns by correlating trace anomalies with regression events.
- Causal Hypothesis Generation: Natural language planning to suggest targeted changes (e.g., “isolate marker stripping from prompt edits”).
- Auditable Causal Chains: Because separate code variants are kept alongside their logs, overfitting shows up directly in the code (e.g., brittle if-chains) and can be audited by a human, which is not typically possible when overfitting occurs in weight space.
Ablation studies show that systems that rely only on scalar-reward feedback or compressed trace summaries underperform compared to those granting full-history trace access; execution traces are empirically the key ingredient in successful automated harness search (Lee et al., 30 Mar 2026).
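As a rough illustration of the failure-diagnosis step, the sketch below correlates a trace anomaly with score regressions across an archive. Everything in it is assumed for the example: the `trace.log` file name, the `[[MARKER]]` anomaly, the scores, and the parent links are invented, and the `grep` helper is a simplified stand-in for the shell tool of the same name.

```python
import re
import tempfile
from pathlib import Path
from typing import Dict, List


def grep(root: Path, pattern: str) -> Dict[str, List[str]]:
    # Minimal grep: map each candidate directory to its matching trace lines.
    regex = re.compile(pattern)
    hits: Dict[str, List[str]] = {}
    for path in sorted(root.glob("*/trace.log")):
        lines = [ln for ln in path.read_text().splitlines() if regex.search(ln)]
        if lines:
            hits[path.parent.name] = lines
    return hits


def regressions(scores: Dict[str, float], parents: Dict[str, str]) -> List[str]:
    # A regression is a candidate scoring below the harness it was edited from.
    return sorted(c for c, p in parents.items() if scores[c] < scores[p])


# Hypothetical archive contents, invented for this example.
logs = {
    "h0": "prompt: classify <text>\noutput: spam\n",
    "h1": "prompt: classify <text>\noutput: [[MARKER]] spam\n",
    "h2": "prompt: classify <text>\noutput: ham\n",
}
scores = {"h0": 0.62, "h1": 0.41, "h2": 0.66}
parents = {"h1": "h0", "h2": "h0"}

with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    for name, text in logs.items():
        (root / name).mkdir()
        (root / name / "trace.log").write_text(text)
    regressed = regressions(scores, parents)
    marker_hits = grep(root, r"\[\[MARKER\]\]")
    # Candidates where the anomaly and the score drop coincide.
    suspects = [c for c in regressed if c in marker_hits]
```

Co-occurrence of an anomaly and a regression is only a hypothesis about the cause; in the workflow described above, the proposer would test it by producing a targeted code change and re-running the candidate.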
4. Empirical Results in LLM Applications
Meta-Harness demonstrates the performance and efficiency of agentic harness search in several domains (Lee et al., 30 Mar 2026):
| Task | Baseline | Best Discovered Harness | Improvement |
|---|---|---|---|
| Online Text Classification | ACE accuracy: 40.9% | 48.6% accuracy, 4× fewer tokens | +7.7 pts accuracy, 11.4k vs 50.8k tokens |
| Retrieval-Augmented Math Reasoning | BM25: 37.5% pass@1 | 38.8% pass@1 | +1.3 pts over BM25, +4.7 pts over no retrieval |
| Agentic Coding (TerminalBench-2) | Terminus-KIRA: 74.7% | 76.4% (Opus-4.6), 37.6% (Haiku-4.5) | Surpasses best hand-engineered baselines |
Characteristic harness patterns (e.g., two-stage draft-verification in classification) and code snippets discovered by the agentic harness show domain-specific optimization, with discovered harnesses generalizing across unseen models and out-of-distribution datasets.
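The accuracy and token figures in the text classification row can be used to illustrate what a non-dominated point in (accuracy, context cost) space means. The sketch below is a generic Pareto filter, not code from the paper; the third point, "cheap variant", is hypothetical and is included only so that the frontier contains more than one entry.

```python
from typing import List, NamedTuple


class Point(NamedTuple):
    name: str
    accuracy: float  # percent, higher is better
    tokens: float    # context tokens, lower is better


def dominates(a: Point, b: Point) -> bool:
    # a dominates b if it is no worse on both axes and strictly better on one.
    no_worse = a.accuracy >= b.accuracy and a.tokens <= b.tokens
    strictly = a.accuracy > b.accuracy or a.tokens < b.tokens
    return no_worse and strictly


def pareto_frontier(points: List[Point]) -> List[Point]:
    # Keep every point that no other point dominates.
    return [p for p in points if not any(dominates(q, p) for q in points)]


points = [
    Point("ACE baseline", 40.9, 50_800),        # from the table above
    Point("discovered harness", 48.6, 11_400),  # from the table above
    Point("cheap variant", 45.0, 5_000),        # hypothetical, for illustration
]
frontier = [p.name for p in pareto_frontier(points)]
token_ratio = 50_800 / 11_400  # baseline tokens per discovered-harness token
```

With these inputs the ACE baseline is dominated by the discovered harness on both axes, while the hypothetical cheap variant stays on the frontier because it trades accuracy for a lower token cost.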
5. Agentic Harness Research Directions and Insights
Several key takeaways and research directions emerge (Lee et al., 30 Mar 2026):
- Full-history access and code-level search provide richer optimization signals than compressed or scalar feedback.
- Explicit causal reasoning: The agentic loop supports fault localization and systematic, testable code changes.
- Generalization: Agentically discovered harnesses transfer to previously unseen models and out-of-distribution datasets.
- Human-readability: Unlike black-box model overfitting, harness code logic is directly inspectable and modifiable.
- Future directions include co-evolution of harness and model weights, automated agent skill-text generation, and application to new domains (beyond text and coding).
The agentic harness framework thus operationalizes model-centric system improvement wholly through code-space search, guided by outer-loop agentic reasoning. This marks a fundamental shift from model weight-centric optimization to infrastructure-level code search for performance control and adaptation (Lee et al., 30 Mar 2026).
6. Category-Specific Agentic Harness Exemplars
Distinct agentic harness paradigms have been instantiated in specialized domains:
- Fuzzing (HarnessAgent): The harness automates function signature extraction, dependency retrieval, compile-fix loops, and validation for program fuzzing, achieving up to 87% harness validation success in C projects (Yang et al., 3 Dec 2025).
- Retrieval-Augmented Generation: Harness engineering orchestrates agentic selection of retrieval strategies, planning, multi-agent collaboration, and context-refinement in RAG systems (Singh et al., 15 Jan 2025).
- Formal Specification (VeriAct): A loop over LLM-driven spec synthesis, formal verification, and harness-based correctness/completeness assessment achieves significant gains over prompt-centric approaches (Misu et al., 31 Mar 2026).
- Compiler Bug Repair: Agentic harnesses expose system-internal APIs and tool wrappers, guiding LLM agents through environment setup, debugging, patch proposal, and regression validation (Zheng et al., 20 Mar 2026).
These instantiations share a unifying pattern: modular composition of tools, memory, and orchestration primitives, with agent-controlled, iterative refinement predicated on access to full execution artifacts.
References:
- "Meta-Harness: End-to-End Optimization of Model Harnesses" (Lee et al., 30 Mar 2026)
- "HarnessAgent: Scaling Automatic Fuzzing Harness Construction with Tool-Augmented LLM Pipelines" (Yang et al., 3 Dec 2025)
- "Agentic Retrieval-Augmented Generation: A Survey on Agentic RAG" (Singh et al., 15 Jan 2025)
- "VeriAct: Beyond Verifiability -- Agentic Synthesis of Correct and Complete Formal Specifications" (Misu et al., 31 Mar 2026)
- "Agentic Harness for Real-World Compilers" (Zheng et al., 20 Mar 2026)