HarnessX: Adaptive Agent Harness Foundry
- HarnessX is a composable, adaptive agent harness foundry that systematically assembles, adapts, and evolves control modules for LLM agents.
- It operationalizes harness engineering as algebraically composable objects, enabling integration of heterogeneous control, memory, and tool orchestration strategies.
- Trace-driven adaptation via the AEGIS engine closes the loop between runtime scaffolding and model learning, improving agent benchmarking and performance consistency.
HarnessX is a composable, adaptive, and evolvable agent harness foundry designed to systematically assemble, adapt, and evolve agent harnesses, the infrastructural software layer that mediates how LLM agents observe, reason, and act. HarnessX operationalizes harness engineering as a first-class, algebraically composable object, enabling the integration of heterogeneous control, memory, and tool orchestration strategies atop diverse model backends. Through a trace-driven adaptation protocol, HarnessX uses execution data to close the loop between runtime scaffolding and both harness and model learning. Empirical studies across established agent benchmarks indicate that harness configuration, rather than model selection alone, is often the dominant source of variability in agent performance, which motivates rigorous harness disclosure and adaptive composition as levers for agent progress (Chen et al., 12 Jun 2026, Zhang et al., 7 May 2026, Wang et al., 15 May 2026).
1. HarnessX: Structural Foundations
HarnessX defines the agent harness as a tuple of a model configuration (main agent, judges, fallbacks) and a harness configuration encoding the runtime control logic. The harness configuration formalizes:
- A hook-to-processor mapping: assigns each of the system’s lifecycle hooks (covering steps such as context construction, tool mediation, and verification) an ordered list of processors, which are composable, type-annotated control elements.
- A slot set: exposes a finite set of slots (e.g., tool registries, sandboxes) shared among processors.
Each processor implements a typed interface and attaches to exactly one hook; its singleton-group and ordering metadata support algebraic safety checks. This yields a compositional substrate in which harness primitives can be inserted, removed, and replaced through a substitution algebra (Chen et al., 12 Jun 2026).
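The structure described above can be sketched as a pair of small data classes. This is a minimal illustration, not the HarnessX API: the names `Processor`, `HarnessConfig`, `validate`, and `run_hook`, the dictionary-based `Context` payload, and the rule that a singleton group may appear at most once per hook are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List, Optional

Context = Dict[str, Any]  # hypothetical payload passed between processors


@dataclass(frozen=True)
class Processor:
    name: str
    hook: str                               # a processor attaches to exactly one hook
    fn: Callable[[Context], Context]        # the typed transformation it applies
    singleton_group: Optional[str] = None   # assumed: at most one member per hook
    order: int = 0                          # ordering metadata within a hook


@dataclass
class HarnessConfig:
    hooks: Dict[str, List[Processor]] = field(default_factory=dict)
    slots: Dict[str, Any] = field(default_factory=dict)  # shared resources

    def validate(self) -> None:
        """Run the safety checks implied by the processor metadata."""
        for hook, procs in self.hooks.items():
            seen_groups = set()
            for p in procs:
                if p.hook != hook:
                    raise ValueError(f"{p.name} is registered on the wrong hook")
                if p.singleton_group is not None:
                    if p.singleton_group in seen_groups:
                        raise ValueError(f"duplicate group {p.singleton_group!r}")
                    seen_groups.add(p.singleton_group)
            orders = [p.order for p in procs]
            if orders != sorted(orders):
                raise ValueError(f"processors on {hook!r} violate their ordering")

    def run_hook(self, hook: str, ctx: Context) -> Context:
        """Apply the hook's processors in list order; unknown hooks are no-ops."""
        for p in self.hooks.get(hook, []):
            ctx = p.fn(ctx)
        return ctx
```

Keeping validation separate from execution means a candidate configuration can be rejected before any agent rollout is spent on it.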
2. HarnessX Substitution Algebra and Composability
Harness configuration manipulations in HarnessX are modeled as algebraic edit operations:
- Insert adds a processor to the ordered list at a given hook.
- Remove deletes every processor that belongs to a given singleton group, across all hooks.
All edits compose via ordinary function composition, preserving type and behavioral invariants. The substitution algebra guarantees an identity edit, associativity of composition, and commutativity of edits that touch disjoint hooks. These properties eliminate the need for global pipeline rewrites and enable safe variant isolation and deterministic gating (see Section 6).
This modularity is critical for supporting rapid, reversible experimentation with harness variants tailored to benchmarks, task clusters, or model capabilities (Chen et al., 12 Jun 2026).
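The algebraic properties named in this section can be illustrated with edits modeled as pure functions from harness to harness. The sketch below is an assumption-laden toy, not the HarnessX implementation: the tuple-based harness representation, the helper names `insert`, `remove`, `compose`, and `identity`, and the hook names `context`, `tool`, and `verify` are all hypothetical.

```python
from typing import Callable, Dict, Optional, Tuple

Proc = Tuple[str, Optional[str]]        # (processor name, singleton group)
Harness = Dict[str, Tuple[Proc, ...]]   # hook -> ordered processors
Edit = Callable[[Harness], Harness]     # an edit never mutates its input


def identity(h: Harness) -> Harness:
    """The identity edit leaves the harness unchanged."""
    return h


def insert(hook: str, proc: Proc) -> Edit:
    """Build an edit that appends `proc` to the list at `hook`."""
    def apply(h: Harness) -> Harness:
        new = dict(h)
        new[hook] = h.get(hook, ()) + (proc,)
        return new
    return apply


def remove(group: str) -> Edit:
    """Build an edit that drops every processor in `group`, on all hooks."""
    def apply(h: Harness) -> Harness:
        return {hook: tuple(p for p in procs if p[1] != group)
                for hook, procs in h.items()}
    return apply


def compose(*edits: Edit) -> Edit:
    """Compose edits, applying them left to right."""
    def apply(h: Harness) -> Harness:
        for edit in edits:
            h = edit(h)
        return h
    return apply


base: Harness = {"context": (("summarize", "ctx"),), "verify": ()}
add_retry = insert("tool", ("retry", "retry_policy"))
add_check = insert("verify", ("schema_check", None))

# Edits on disjoint hooks commute: either order yields the same harness.
evolved = compose(add_retry, add_check)(base)
print(evolved == compose(add_check, add_retry)(base))
```

Because every edit returns a new mapping, `base` stays available as a rollback point, which is one way to obtain the reversibility described above.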
3. AEGIS: Trace-Driven Harness Evolution
HarnessX adapts harness configurations using AEGIS, an evolution engine casting harness adaptation as a Markov decision process (MDP) over symbolic control artifacts:
- State: the pair consisting of the current harness and the trace corpus.
- Action: A typed harness edit (a processor-level manipulation).
- Reward: Verifier-derived score aggregated over a batch of traces.
The adaptation loop integrates four distinct modules:
- Digester: Summarizes raw execution traces into task-level failure/success, implicated processors, and evidence anchors.
- Planner: Enumerates unexplored directions in parameter/hook/processor space using historical adaptation outcomes.
- Evolver: Proposes concrete, type-safe harness edits, each annotated with a change manifest (mechanism, attribution signature, and predicted impact).
- Critic and Deterministic Gate: Enforces manifest–trace consistency, blocks reward hacking, prevents catastrophic forgetting (no regression on previously solved tasks), and accepts only behaviorally validated edits.
AEGIS addresses RL pathologies in symbolic harness evolution, notably reward hacking, catastrophic forgetting, and under-exploration by enforcing manifest–trace consistency and explicit exploration (Chen et al., 12 Jun 2026).
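The gating step can be pictured as a pure function over per-task verifier scores collected before and after a candidate edit. The following is a hedged sketch rather than the AEGIS gate itself: the function name `gate_accepts`, the `solved_at` threshold, and the specific reading of manifest–trace consistency (every task the manifest claims to fix must actually score higher) are assumptions introduced for illustration.

```python
from typing import Dict, Set


def gate_accepts(before: Dict[str, float],
                 after: Dict[str, float],
                 claimed_tasks: Set[str],
                 solved_at: float = 1.0) -> bool:
    """Accept a candidate edit only if it passes three deterministic checks."""
    if not before:
        return False  # nothing to validate against

    # 1. No catastrophic forgetting: previously solved tasks stay solved.
    for task, score in before.items():
        if score >= solved_at and after.get(task, 0.0) < solved_at:
            return False

    # 2. Manifest-trace consistency (assumed form): each task the change
    #    manifest claims to improve must show a strictly higher score.
    for task in claimed_tasks:
        if after.get(task, 0.0) <= before.get(task, 0.0):
            return False

    # 3. The batch-level reward must strictly improve.
    mean_before = sum(before.values()) / len(before)
    mean_after = sum(after.get(task, 0.0) for task in before) / len(before)
    return mean_after > mean_before
```

The function is deterministic in its inputs, so replaying the same traces always reproduces the same accept or reject decision.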
4. Closing the Harness–Model Learning Loop
HarnessX alternates harness adaptation with model reinforcement learning (RL) through a shared replay buffer. The protocol is:
- Roll out the agent under the current harness and log the resulting traces.
- Verify traces with a fixed harness-level verifier to obtain rewards.
- Insert the verified traces and their rewards into the replay buffer.
- Evolve the harness using AEGIS.
- Cache action log-probabilities at insertion (for off-policy RL).
- Update the model policy using Group Relative Policy Optimization (GRPO), with trajectories grouped by task identity so that advantage estimates are normalized across harnesses (Chen et al., 12 Jun 2026).
No additional rollouts are required; all model updates are amortized over AEGIS-generated execution traces.
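The task-grouped normalization can be sketched as follows. This shows only the group-relative advantage commonly used in GRPO (reward minus group mean, divided by group standard deviation), not the full policy update or the exact objective used by HarnessX; the function name, the `(task_id, reward)` record layout, the use of the population standard deviation, and the `eps` stabilizer are assumptions.

```python
from collections import defaultdict
from statistics import mean, pstdev
from typing import Dict, List, Tuple


def group_relative_advantages(records: List[Tuple[str, float]],
                              eps: float = 1e-8) -> List[float]:
    """Normalize each reward against the other rollouts of the same task.

    `records` holds (task_id, reward) pairs drawn from the replay buffer.
    Rollouts produced under different harnesses fall into the same group
    whenever they share a task_id.
    """
    by_task: Dict[str, List[float]] = defaultdict(list)
    for task_id, reward in records:
        by_task[task_id].append(reward)

    # Per-task mean and population standard deviation.
    stats = {task: (mean(rs), pstdev(rs)) for task, rs in by_task.items()}

    # eps keeps the division finite when all rewards in a group are equal.
    return [(reward - stats[task][0]) / (stats[task][1] + eps)
            for task, reward in records]
```

Grouping by task rather than by harness means a reward is judged relative to how hard the task is, so traces gathered under older harness versions remain comparable with newer ones.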
5. Reliability, Evaluation Protocols, and Disclosure
HarnessX is situated within a broader context in which agent outcomes on long-horizon tasks are dominated by harness-induced variance (HV), which often exceeds model-induced variance (MV) by large factors on controlled grids. Standardized, harness-aware evaluation is therefore essential for scientific rigor (Zhang et al., 7 May 2026).
HarnessX evaluation leverages:
- The ETCSOVG Harness Card: requires full disclosure of the Execution, Tool, Context, Scheduling, Observability, Verification, and Governance configuration.
- Variance-decomposition protocol: (i) evaluate a grid that crosses multiple harnesses with multiple models; (ii) report per-cell benchmark scores, HV, MV, the HV/MV ratio, model-pair ranking flips, and harness-model interaction statistics.
- Trajectory-level reliability metrics: for example, recovery rate after an anomaly, context retention, and control lag.
These protocols demystify where performance gains originate and ensure interpretability of agent benchmarking.
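A variance-decomposition report of this kind could be computed along the following lines. The source does not give the exact estimators, so the definitions here are assumptions: HV is taken as the variance across harnesses averaged over models, MV as the variance across models averaged over harnesses, and a ranking flip as a model pair whose order differs between at least two harnesses. The grid values are illustrative toy numbers, not benchmark results.

```python
from itertools import combinations
from statistics import mean, pvariance
from typing import Dict

Grid = Dict[str, Dict[str, float]]  # harness -> model -> benchmark score


def variance_decomposition(grid: Grid) -> Dict[str, float]:
    harnesses = list(grid)
    models = list(grid[harnesses[0]])

    # Assumed HV: spread across harnesses for a fixed model, averaged over models.
    hv = mean([pvariance([grid[h][m] for h in harnesses]) for m in models])
    # Assumed MV: spread across models for a fixed harness, averaged over harnesses.
    mv = mean([pvariance([grid[h][m] for m in models]) for h in harnesses])

    # A model pair "flips" if its ordering differs between two harnesses.
    flips = 0
    for a, b in combinations(models, 2):
        orderings = {grid[h][a] > grid[h][b]
                     for h in harnesses if grid[h][a] != grid[h][b]}
        if len(orderings) == 2:
            flips += 1

    ratio = hv / mv if mv else float("inf")
    return {"HV": hv, "MV": mv, "HV/MV": ratio, "ranking_flips": flips}


# Illustrative toy grid: two hypothetical harnesses crossed with two models.
toy_grid: Grid = {
    "harness_a": {"model_x": 40.0, "model_y": 50.0},
    "harness_b": {"model_x": 80.0, "model_y": 70.0},
}
report = variance_decomposition(toy_grid)
print(report)
```

In the toy grid, switching harness moves scores far more than switching model and also reverses the model ranking, which is the pattern the protocol is designed to surface.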
6. Empirical Findings and Case Analyses
HarnessX’s efficacy is demonstrated on ALFWorld, GAIA, WebShop, 9-Bench, and SWE-bench Verified, using Claude Sonnet 4.6, GPT-5.4, and Qwen 3.5-9B as agent families. Key results (Chen et al., 12 Jun 2026):
| Benchmark | Agent | Baseline (%) | Evolved (%) | Δ (pp) |
|---|---|---|---|---|
| ALFWorld | Sonnet 4.6 | 83.6 | 94.8 | +11.2 |
| ALFWorld | GPT-5.4 | 76.9 | 97.8 | +20.9 |
| ALFWorld | Qwen 3.5 | 53.0 | 97.0 | +44.0 |
| WebShop | Sonnet 4.6 | 60.0 | 76.0 | +16.0 |
| WebShop | GPT-5.4 | 55.0 | 73.0 | +18.0 |
| WebShop | Qwen 3.5 | 36.0 | 49.0 | +13.0 |
| GAIA | Sonnet 4.6 | 73.8 | 83.5 | +9.7 |
| GAIA | GPT-5.4 | 73.8 | 73.8 | 0.0 |
| GAIA | Qwen 3.5 | 20.3 | 37.4 | +17.1 |
| 0-Bench | GPT-5.4 | 76.2 | 90.7 | +14.5 |
| SWE-bench Verified | GPT-5.4 | 45.5 | 63.6 | +18.1 |
- Average gain: +14.5 pp across 15 configurations (the table above lists 11 of them).
- Inverse scaling: harness evolution yields the largest improvement for the weakest agent baseline (+44.0 pp for Qwen 3.5-9B on ALFWorld).
- Variant isolation: supports stable compositional evolution; e.g., resolving catastrophic regressions on GAIA by routing clusters to harness variants rather than a monolith.
- In practical evolution episodes, accepted batches contained prompt, tool, and retry-policy edits that the deterministic gate linked directly to the resolution of persistent task clusters, with the attributed improvements recorded in trace-anchored manifests.
7. Harness Principles: Decomposition, Guidance, and Alignment
HarnessX also incorporates formal and empirical guidance on harness design from trajectory-alignment theory (Wang et al., 15 May 2026):
- Decomposition: Maps tasks into ordered subgoals; a decomposition whose granularity falls outside the agent’s cumulative progress windows (for example, one that is too fine-grained) induces drift, inefficiency, or failure.
- Guidance: Reweights local agent behavior; a positive retention gap is necessary for guidance to improve recoverability.
- Partial harnesses: Often optimal to scaffold only the initial execution stages and allow autonomous planning for residuals.
- Alignment: Structural guidance and decomposition must align with agent capability—misaligned guidance or over-pruning can sharply degrade success rates or induce hallucinated execution.
- Retry budgets and tolerance: Must be tuned relative to the agent’s achievable progress; they can absorb small mismatches but cannot fix structural misalignment.
These principles directly inform the processor and hook design within HarnessX, as well as AEGIS’s hypothesis space for edit generation.
8. Broader Context, Significance, and Open Questions
HarnessX substantiates the thesis that execution harnesses are not neutral interfaces but dominant factors in how much agent capability is realized and measured on complex tasks (Zhang et al., 7 May 2026). As a result, model benchmarking without full harness disclosure or variance decomposition is methodologically incomplete. The system's compositional structure is a practical engine for evolvability, variant isolation, and robust regression detection. Full trace observability, encompassing event, tool, and control-flow logs, is essential for diagnosability and safety assurance, because scalar outcome rates alone are insufficient.
HarnessX's co-evolutionary protocol leverages the complementary strengths of non-parametric harness evolution and parametric model learning, surpassing either in isolation. Open questions remain regarding meta-agent design, harness evolution for continuous (e.g., robotic) action domains, automated detection of under-exploration, curriculum generation, and hybrid symbolic-neural evolvers.
In summary, HarnessX systematizes and operationalizes the compositional, adaptive, and evolvable agent harness paradigm, establishing both theoretical and empirical standards for next-generation LLM agent benchmarking and improvement (Chen et al., 12 Jun 2026, Zhang et al., 7 May 2026, Wang et al., 15 May 2026).