
Agentic Harness: Code-Driven LLM Optimization

Updated 19 April 2026
  • Agentic harness is an advanced orchestration mechanism that uses recursive, agent-controlled code search to optimize LLM interactions.
  • It leverages full-history access to execution traces for precise failure diagnosis and causal reasoning without modifying model weights.
  • Empirical results show significant improvements in tasks like text classification, math reasoning, and coding, with enhanced accuracy and efficiency.

An agentic harness is the explicit mechanism, typically an orchestrating code layer or a search/optimization procedure, that governs how information is stored, retrieved, formatted, and presented to an LLM. Crucially, the mechanism itself can be engineered or optimized autonomously by an agent working over a record of prior harnesses and their executions. The distinction between manual and agentic harnesses is central: in the latter, the search-and-refinement loop is itself agent-driven, so the harness improves and adapts solely through outer-loop reasoning over code, traces, and results, with no direct modification of model weights (Lee et al., 30 Mar 2026).

1. Scope and Formal Definition

A harness $H$ for a fixed LLM $M$ is executable code that specifies:

  • What interaction-derived context to persist (memory policies)
  • What information to retrieve for subsequent prompts, and how (retrieval policy)
  • Prompt/step formatting, tool-call structure, and execution ordering (orchestration and I/O semantics)

Given a sample $x \sim \mathcal{X}$, $H$ induces a rollout trajectory $\tau \sim p_M(H, x)$: the harness dictates the prompts presented to $M$ and the state transitions throughout the task instance.
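The three responsibilities listed above can be sketched as a small class. This is a toy illustration under stated assumptions, not the harness interface from the cited work: `SimpleHarness`, `rollout`, and `echo_model` are hypothetical names, the memory policy is "keep the last k items", and retrieval is naive word overlap.

```python
from dataclasses import dataclass, field
from typing import Callable, List

# Hypothetical names throughout; none of these come from the source.

@dataclass
class SimpleHarness:
    """Toy harness H: memory policy, retrieval policy, and prompt formatting."""
    max_memory: int = 3                      # memory policy: keep the last k items
    memory: List[str] = field(default_factory=list)

    def persist(self, item: str) -> None:
        # Memory policy: store the item, evicting the oldest beyond max_memory.
        self.memory.append(item)
        self.memory = self.memory[-self.max_memory:]

    def retrieve(self, query: str) -> List[str]:
        # Retrieval policy: return stored items that share a word with the query.
        words = set(query.lower().split())
        return [m for m in self.memory if words & set(m.lower().split())]

    def format_prompt(self, query: str) -> str:
        # Orchestration and I/O: place retrieved context ahead of the query.
        context = "\n".join(self.retrieve(query))
        return f"Context:\n{context}\n\nQuestion: {query}"


def rollout(harness: SimpleHarness, model: Callable[[str], str],
            steps: List[str]) -> List[dict]:
    """Run a fixed model M under harness H and return the trajectory tau."""
    trajectory = []
    for step in steps:
        prompt = harness.format_prompt(step)
        output = model(prompt)
        harness.persist(f"{step} -> {output}")   # state transition
        trajectory.append({"input": step, "prompt": prompt, "output": output})
    return trajectory


def echo_model(prompt: str) -> str:
    # Stand-in for the LLM M: returns the prompt length as a string.
    return str(len(prompt))
```

Calling `rollout(SimpleHarness(max_memory=2), echo_model, ["red apple", "green apple", "blue sky"])` yields three trajectory records. The second prompt contains the stored "red apple" interaction because the two queries share the word "apple". Changing any of the three methods changes the trajectory without touching the model.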

An agentic harness is one whose search, evaluation, and refinement are themselves agent-controlled: a coding agent (the “proposer”) is given full access to source code, scores, and execution traces (not just scalar rewards) and operates over a shared filesystem archive. Instead of being modified by hand, the harness is evolved iteratively through agent-conducted code reasoning, diagnosis, and patching, informed by explicit exploration of prior failures and successes (Lee et al., 30 Mar 2026).

2. Agentic Harness Search and Optimization Procedure

The canonical agentic harness workflow is embodied by the Meta-Harness system (Lee et al., 30 Mar 2026), structured as follows:

  • Initialization: Seed a directory $\mathcal{D}$ with hand-designed baseline harnesses, run each on $M$ against a task distribution $\mathcal{X}$, and log execution traces and scores.
  • Agentic Search Loop:
    • The proposer $P$ reads code, logs, and artifacts in $\mathcal{D}$ using direct shell tools (e.g., grep, cat), diagnoses failures (e.g., “prompt-cleanup regression”), and generates natural language hypotheses and plans.
    • $P$ applies code edits (local or global) to produce new candidate harnesses $H'$.
    • Each $H'$ is interface-validated and executed on $\mathcal{X}$, and the newly generated execution traces and scores are appended to $\mathcal{D}$.
  • Objective: Find $H^*$ maximizing expected rollout score, often under a context-token constraint:

$$H^* = \arg\max_{H} \; \mathbb{E}_{x \sim \mathcal{X},\ \tau \sim p_M(H, x)} \left[ r(\tau, x) \right]$$

where $r(\tau, x)$ denotes the score of rollout $\tau$ on task instance $x$.

Non-dominated (Pareto-optimal) points in (accuracy, context cost) space form the Pareto frontier.
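Non-domination in (accuracy, context cost) space can be made concrete with a short filter. The sketch below assumes higher accuracy and fewer context tokens are both preferred; the `Candidate` type and the example values are illustrative and are not results from the source.

```python
from typing import List, NamedTuple

class Candidate(NamedTuple):
    name: str
    accuracy: float       # higher is better
    context_tokens: int   # lower is better

def dominates(a: Candidate, b: Candidate) -> bool:
    """True if a is no worse than b on both axes and strictly better on one."""
    no_worse = a.accuracy >= b.accuracy and a.context_tokens <= b.context_tokens
    strictly_better = a.accuracy > b.accuracy or a.context_tokens < b.context_tokens
    return no_worse and strictly_better

def pareto_frontier(cands: List[Candidate]) -> List[Candidate]:
    """Keep every candidate that no other candidate dominates."""
    return [c for c in cands if not any(dominates(o, c) for o in cands)]

# Illustrative values only.
example = [
    Candidate("a", 0.40, 50_000),
    Candidate("b", 0.48, 11_000),
    Candidate("c", 0.50, 30_000),
    Candidate("d", 0.45, 40_000),
]
frontier = pareto_frontier(example)
```

Here `b` and `c` remain on the frontier: `b` is cheaper, `c` is more accurate, and neither dominates the other, while `a` and `d` are both dominated by `b`.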

Notably, there are no fixed mutation operators; $P$ freely explores code space, guided by file-level history and execution outcomes. The archive can grow arbitrarily large (millions of tokens), so $P$ navigates it selectively rather than reading it exhaustively.
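The overall loop (seed, propose, validate, execute, append to the archive) can be sketched as follows. This is a minimal sketch under stated assumptions, not the Meta-Harness implementation: the proposer, validator, and evaluator are passed in as plain callables, a "harness" is reduced to a one-line string such as `K=3`, and the file names `harness.py`, `traces.json`, and `score.json` are hypothetical.

```python
import json
import tempfile
from pathlib import Path
from typing import Callable, Dict, List, Tuple

Evaluator = Callable[[str], Tuple[float, List[dict]]]

def evaluate_and_log(archive: Path, name: str, code: str, evaluate: Evaluator) -> float:
    """Run one candidate and store its code, traces, and score in the archive."""
    score, traces = evaluate(code)
    cand_dir = archive / name
    cand_dir.mkdir(parents=True)
    (cand_dir / "harness.py").write_text(code)
    (cand_dir / "traces.json").write_text(json.dumps(traces))
    (cand_dir / "score.json").write_text(json.dumps({"score": score}))
    return score

def search(archive: Path, seeds: Dict[str, str],
           propose: Callable[[Path], str], validate: Callable[[str], bool],
           evaluate: Evaluator, iterations: int) -> Tuple[str, Dict[str, float]]:
    """Seed the archive, then repeatedly propose, validate, execute, and log."""
    scores = {n: evaluate_and_log(archive, n, c, evaluate) for n, c in seeds.items()}
    for i in range(iterations):
        code = propose(archive)        # the proposer is free to read the whole archive
        if not validate(code):         # interface validation before execution
            continue
        name = f"cand_{i:03d}"
        scores[name] = evaluate_and_log(archive, name, code, evaluate)
    return max(scores, key=scores.get), scores

# --- Toy stand-ins (hypothetical) for the agent, validator, and task suite ---
def toy_evaluate(code: str) -> Tuple[float, List[dict]]:
    k = int(code.split("=")[1])
    return 1.0 - abs(k - 5) / 10, [{"k": k}]

def toy_validate(code: str) -> bool:
    return code.startswith("K=") and code[2:].isdigit()

def toy_propose(archive: Path) -> str:
    # Read scores from the archive, pick the best candidate, and nudge K upward.
    best_dir = max((p.parent for p in archive.glob("*/score.json")),
                   key=lambda d: json.loads((d / "score.json").read_text())["score"])
    k = int((best_dir / "harness.py").read_text().split("=")[1])
    return f"K={k + 1}"

with tempfile.TemporaryDirectory() as tmp:
    best, scores = search(Path(tmp), {"seed": "K=1"},
                          toy_propose, toy_validate, toy_evaluate, iterations=6)
```

In the setting the text describes, `propose` would be a coding agent that inspects the archive with shell tools and writes real harness code; the toy proposer only shows where that agent plugs in and that it receives the archive itself rather than a scalar reward.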

3. Filesystem Access and Causal Reasoning

A defining characteristic is the proposer's access to raw code and full execution traces (prompts, tool calls, model outputs, state updates), as well as scalar metrics stored as artifacts. This historical archive supports:

  • Failure Diagnosis: Isolating problematic code patterns by correlating trace anomalies with regression events.
  • Causal Hypothesis Generation: Natural language planning to suggest targeted changes (e.g., “isolate marker stripping from prompt edits”).
  • Auditable Causal Chains: Because each code variant is kept alongside its logs, overfitting shows up directly in the code (e.g., brittle if-chains) and can be audited by a human, which is not typically possible when overfitting occurs in weight space.

Ablation studies show that systems that rely only on scalar-reward feedback or compressed trace summaries underperform compared to those granting full-history trace access; execution traces are empirically the key ingredient in successful automated harness search (Lee et al., 30 Mar 2026).
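A small example suggests why per-instance traces carry more signal than a scalar score. The sketch assumes a hypothetical trace schema with `task_id`, `correct`, and `prompt` fields; the data is invented for illustration.

```python
from typing import Dict, List

Trace = Dict[str, object]

def accuracy(traces: List[Trace]) -> float:
    """Scalar summary: fraction of instances answered correctly."""
    return sum(bool(t["correct"]) for t in traces) / len(traces)

def find_regressions(parent: List[Trace], child: List[Trace]) -> List[dict]:
    """Return instances the parent harness solved but the child harness failed."""
    parent_by_id = {t["task_id"]: t for t in parent}
    regressions = []
    for t in child:
        before = parent_by_id.get(t["task_id"])
        if before is not None and before["correct"] and not t["correct"]:
            regressions.append({"task_id": t["task_id"], "before": before, "after": t})
    return regressions

# Invented traces: both harnesses score 2/3, but on different instances.
parent_traces = [
    {"task_id": "t1", "correct": True,  "prompt": "Q1 [marker]"},
    {"task_id": "t2", "correct": True,  "prompt": "Q2 [marker]"},
    {"task_id": "t3", "correct": False, "prompt": "Q3 [marker]"},
]
child_traces = [
    {"task_id": "t1", "correct": True,  "prompt": "Q1"},
    {"task_id": "t2", "correct": False, "prompt": "Q2"},
    {"task_id": "t3", "correct": True,  "prompt": "Q3"},
]
regressed = find_regressions(parent_traces, child_traces)
```

The two accuracies are identical, so a scalar-only optimizer sees no difference between the harnesses. The trace comparison isolates `t2` and exposes the before and after prompts, which is the kind of evidence a proposer needs to form a targeted hypothesis.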

4. Empirical Results in LLM Applications

Meta-Harness demonstrates the performance and efficiency of agentic harness search in several domains (Lee et al., 30 Mar 2026):

| Task | Baseline | Best Discovered Harness | Improvement |
|---|---|---|---|
| Online Text Classification | ACE: 40.9% accuracy | 48.6% accuracy, 4× fewer tokens | +7.7 pts accuracy; 11.4k vs 50.8k tokens |
| Retrieval-Augmented Math Reasoning | BM25: 37.5% pass@1 | 38.8% pass@1 | +4.7 pts over no retrieval; +1.3 pts over BM25 |
| Agentic Coding (TerminalBench-2) | Terminus-KIRA: 74.7% | 76.4% (Opus-4.6), 37.6% (Haiku-4.5) | Surpasses best hand-engineered baselines |

The harness patterns and code snippets that the search discovers (e.g., two-stage draft-verification in classification) are domain-specific, yet the discovered harnesses generalize to unseen models and out-of-distribution datasets.
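The two-stage draft-verification pattern mentioned above can be illustrated generically. The source does not give the discovered harness's code, so this is only a sketch of the general idea: `draft_then_verify` and the keyword-based stubs are hypothetical, and in a real harness both stages would be LLM calls.

```python
from typing import Callable, List

def draft_then_verify(text: str, labels: List[str],
                      draft: Callable[[str], str],
                      verify: Callable[[str, str], bool],
                      fallback: str) -> str:
    """Stage 1 drafts a label; stage 2 accepts it or falls back."""
    candidate = draft(text)
    if candidate in labels and verify(text, candidate):
        return candidate
    return fallback

# Keyword stubs standing in for the two model calls (illustrative only).
def stub_draft(text: str) -> str:
    return "positive" if "good" in text else "negative"

def stub_verify(text: str, label: str) -> bool:
    # Reject a positive draft when the text contains a negation.
    return not (label == "positive" and "not" in text.split())

LABELS = ["positive", "negative"]
results = [
    draft_then_verify(t, LABELS, stub_draft, stub_verify, fallback="negative")
    for t in ["good movie", "not good", "bad plot"]
]
```

Separating the stages gives the second call a narrower job (checking one proposed label) than the first (choosing among all labels), and it leaves a per-stage trace that a proposer can inspect when a prediction goes wrong.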

5. Agentic Harness Research Directions and Insights

Several key takeaways and research directions emerge (Lee et al., 30 Mar 2026):

  • Full-history access and code-level search provide richer optimization signals than compressed or scalar feedback.
  • Explicit causal reasoning: The agentic loop supports fault localization and systematic, testable code changes.
  • Generalization: Agentically discovered harnesses transfer to previously unseen model architectures and datasets.
  • Human-readability: Unlike black-box model overfitting, harness code logic is directly inspectable and modifiable.
  • Future directions include co-evolution of harness and model weights, automated agent skill-text generation, and application to new domains (beyond text and coding).

The agentic harness framework thus operationalizes model-centric system improvement wholly through code-space search, guided by outer-loop agentic reasoning. This marks a fundamental shift from model weight-centric optimization to infrastructure-level code search for performance control and adaptation (Lee et al., 30 Mar 2026).

6. Category-Specific Agentic Harness Exemplars

Distinct agentic harness paradigms have been instantiated in specialized domains:

  • Fuzzing (HarnessAgent): The harness automates function signature extraction, dependency retrieval, compile-fix loops, and validation for program fuzzing, achieving up to 87% harness validation success in C projects (Yang et al., 3 Dec 2025).
  • Retrieval-Augmented Generation: Harness engineering orchestrates agentic selection of retrieval strategies, planning, multi-agent collaboration, and context-refinement in RAG systems (Singh et al., 15 Jan 2025).
  • Formal Specification (VeriAct): A loop over LLM-driven spec synthesis, formal verification, and harness-based correctness/completeness assessment achieves significant gains over prompt-centric approaches (Misu et al., 31 Mar 2026).
  • Compiler Bug Repair: Agentic harnesses expose system-internal APIs and tool wrappers, guiding LLM agents through environment setup, debugging, patch proposal, and regression validation (Zheng et al., 20 Mar 2026).

These instantiations share a unifying pattern: modular composition of tools, memory, and orchestration primitives, with agent-controlled, iterative refinement that depends on access to full execution artifacts.

