Harness Self-Evolution

Updated 4 July 2026

Harness self-evolution is the process by which an agent’s external runtime—comprising prompts, tools, skills, and middleware—is automatically adapted using execution evidence instead of weight updates.
It enables persistent evolution of non-parametric components by converting episodic task experiences into systematic, auditable improvements across multiple harness layers.
Techniques such as retrospective replay, trace-guided repair, and source-level rewriting provide practical methods for enhancing agent performance without altering the core model.

Harness self-evolution denotes the automatic adaptation of an agent’s harness—the persistent, non-parametric layer around a fixed model that includes prompts, tools, skills, memory, workflows, middleware, runtime state, verification logic, and governance surfaces—using execution evidence rather than weight updates. Across recent work, the harness is treated as an editable software object: a “persistent collection of tools, prompts, and skills” in one line of work, the full ETCLOVG stack in another, and more generally the runtime environment that coordinates memory, skills, protocols, permission, control, and observability (Pan et al., 4 Jun 2026, Chen et al., 4 Jun 2026, Zhou et al., 9 Apr 2026). The unifying claim is that many capabilities earlier attributed to model weights can instead be reorganized into external infrastructure, then revised online through trajectory analysis, retrospective replay, trace-guided repair, or source-level rewriting (Ning et al., 18 May 2026, Cai et al., 21 May 2026).

1. Definition and conceptual scope

In the narrowest formalization, a task $t$ is solved under a harness $h$ by producing a trajectory $\tau = \mathrm{solve}(h,t)$, and self-evolution changes $h$ while keeping the backbone model fixed (Pan et al., 4 Jun 2026). In broader systems formulations, the harness spans prompts, tool interfaces, context and memory, lifecycle and orchestration, observability, verification and evaluation, and governance and security; HarnessFix names these seven layers with the ETCLOVG taxonomy and treats harness repair as edits to those artifacts rather than retraining the base model (Chen et al., 4 Jun 2026). The externalization view sharpens this further: memory externalizes state, skills externalize procedural expertise, protocols externalize interaction structure, and harness engineering is the unification layer that coordinates them into governed execution (Zhou et al., 9 Apr 2026).
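The narrow formalization can be illustrated with a toy Python sketch. It is not taken from any cited system: `Harness`, `frozen_model`, `solve`, and `evolve` are hypothetical names, the "model" is a stub, and the update rule is invented purely to show that only $h$ changes between iterations.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Harness:
    """Hypothetical container for the non-parametric state h."""
    system_prompt: str
    tools: tuple[str, ...] = ()


def frozen_model(prompt: str, tools: tuple[str, ...], task: str) -> list[str]:
    """Stub backbone model. Evolution never modifies this function."""
    steps = [f"read:{task}"]
    if "run_tests" in tools:
        steps.append("call:run_tests")
    steps.append("answer")
    return steps


def solve(h: Harness, t: str) -> list[str]:
    """tau = solve(h, t): the trajectory produced for task t under harness h."""
    return frozen_model(h.system_prompt, h.tools, t)


def evolve(h: Harness, trajectories: list[list[str]]) -> Harness:
    """Toy update rule (an assumption): add a verification tool
    if no past trajectory ever called one."""
    if not any("call:run_tests" in tau for tau in trajectories):
        return replace(h, tools=h.tools + ("run_tests",))
    return h


h0 = Harness(system_prompt="You are a coding agent.")
tau0 = solve(h0, "fix-bug-17")
h1 = evolve(h0, [tau0])          # new harness, same model
tau1 = solve(h1, "fix-bug-17")
```

Because `Harness` is immutable, each evolution step yields a new version rather than mutating the old one, which keeps earlier harnesses available for comparison.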

This literature distinguishes harness self-evolution from model fine-tuning in both mechanism and objective. Weight adaptation changes the parametric policy. Harness self-evolution changes the software shell through which the policy observes, reasons, acts, verifies, and persists experience. Some systems make this distinction explicit by defining the scaffold or harness as “the union of the system prompt, tool-dispatch logic, answer extraction, and any supporting infrastructure, every part of the agent that is fixed code rather than model output” (Hebbar et al., 26 May 2026). Others argue that code is increasingly not just an output target but the operational substrate of reasoning, acting, environment modeling, and verification, so harness evolution is naturally expressed as code evolution rather than prompt tweaking alone (Ning et al., 18 May 2026).

A recurring conceptual split is between local adaptation and persistent change. Trajectories, notes, and task-specific logs are local artifacts; reusable guidance, tools, middleware, and routing rules are persistent harness state. Milkyway makes this distinction explicit by separating checkpoint notes for a particular unresolved forecasting question from the persistent future-prediction harness $H_t=(F_t,E_t,U_t)$, where $F_t$ is factor tracking, $E_t$ is evidence gathering and interpretation, and $U_t$ is uncertainty handling and prediction revision guidance (Wei et al., 17 Apr 2026). This suggests that harness self-evolution is fundamentally about converting episodic evidence into persistent execution structure.

2. Mutable substrates and representational forms

The harness is represented in several concrete forms, and that representation strongly shapes what can evolve. In RHO, the harness is literally a directory on disk containing Markdown instruction files, Markdown skill files, and executable scripts such as bin/repair-verify, python_package_smoke.py, and are_helper.py; all adaptation happens by adding, deleting, or changing files in that directory (Pan et al., 4 Jun 2026). Self-Harness similarly constrains evolution to the DeepAgent harness definition file, including system prompt text, tool set configuration, runtime parameters, and optional middleware prompts, while keeping the base model fixed (Zhang et al., 8 Jun 2026). AHE makes the editable action space explicit through file-level “component observability”: systemprompt.md, tool descriptions, tool implementations, middleware, skills, sub-agents, and long-term memory are separate files under a single workspace, enabling localized edits and rollback (Lin et al., 28 Apr 2026).
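A directory-backed harness makes file-level edits and rollback straightforward. The sketch below is a minimal illustration using only the Python standard library; the `snapshot` and `rollback` helpers and the file contents are assumptions, not code from RHO, Self-Harness, or AHE.

```python
import shutil
import tempfile
from pathlib import Path


def snapshot(harness_dir: Path, store: Path, version: int) -> Path:
    """Copy the whole harness directory so a later edit can be undone."""
    target = store / f"v{version}"
    shutil.copytree(harness_dir, target)
    return target


def rollback(harness_dir: Path, snap: Path) -> None:
    """Replace the live harness directory with a saved snapshot."""
    shutil.rmtree(harness_dir)
    shutil.copytree(snap, harness_dir)


def list_files(harness_dir: Path) -> list[str]:
    return sorted(
        p.relative_to(harness_dir).as_posix()
        for p in harness_dir.rglob("*") if p.is_file()
    )


root = Path(tempfile.mkdtemp())
harness = root / "harness"
(harness / "skills").mkdir(parents=True)
(harness / "systemprompt.md").write_text("Be careful.\n")

snap = snapshot(harness, root / "snapshots", version=0)

# One evolution step: change an instruction file and add a skill file.
(harness / "systemprompt.md").write_text("Be careful. Run tests before answering.\n")
(harness / "skills" / "verify.md").write_text("# Verify\nRe-run the failing test.\n")
edited_files = list_files(harness)

# The edit did not help, so restore the previous version.
rollback(harness, snap)
restored_files = list_files(harness)
```

A version-control system would serve the same purpose in practice; whole-directory copies are used here only to keep the example self-contained.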

Other work expands the editable surface beyond text artifacts. MOSS argues that text-mutable artifacts (skills, prompt configurations, memory schemas, workflow graphs) leave routing, hook ordering, state invariants, and dispatch untouched because those live in source code rather than text files. It therefore treats source-level adaptation as a Turing-complete, strict superset of text-layer evolution, with deterministic effect and resistance to long-context drift (Cai et al., 21 May 2026). Adaptive Auto-Harness organizes harnesses as a git-backed tree of branches, so each branch corresponds to a regime-specific harness rather than a single ever-growing monolith (Liu et al., 1 Jun 2026). SEA constrains self-modification to a small steering adapter $L_1$ and a versioned harness $L_2$ around a frozen base model, explicitly restricting the mutable hypothesis space to auditable components (Sengupta, 1 Jul 2026).

These representational choices imply different notions of evolvability. File-level and branch-level representations support localized edits, rollback, and solve-time routing. Source-level representations extend reach to structural failures in orchestration code. Adapter-plus-harness representations make policy distance and statistical gating tractable. This suggests that “harness self-evolution” is not a single mechanism but a family of update regimes conditioned by what is considered an editable harness surface.

3. Main algorithmic patterns

One major pattern is retrospective optimization from past trajectories. RHO performs coreset selection over stored trajectories, generates multiple parallel rollouts on difficult and diverse past tasks, diagnoses them through self-validation and self-consistency, proposes multiple candidate harnesses, and selects a candidate by pairwise self-preference over trajectories (Pan et al., 4 Jun 2026). Its coreset is drawn with a determinantal point process (DPP) kernel built from LLM-produced difficulty scores and structural fingerprints, while candidate selection uses an aggregate self-preference score over baseline-versus-candidate rollouts. The essential move is to replace unavailable ground-truth validation with structured self-judgment over replayed trajectories.
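To make the idea of a difficulty-and-diversity coreset concrete, here is a hedged sketch of greedy DPP-style selection in pure Python. The kernel form (difficulty times Jaccard similarity of fingerprints times difficulty), the greedy determinant maximization, and all names are assumptions chosen for illustration; the source does not specify RHO's exact kernel or selection procedure.

```python
def jaccard(a: frozenset, b: frozenset) -> float:
    """Similarity of two structural fingerprints (sets of features)."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def det(m: list[list[float]]) -> float:
    """Determinant by Gaussian elimination with partial pivoting."""
    a = [row[:] for row in m]
    n, d = len(a), 1.0
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(a[r][c]))
        if abs(a[p][c]) < 1e-12:
            return 0.0
        if p != c:
            a[c], a[p] = a[p], a[c]
            d = -d
        d *= a[c][c]
        for r in range(c + 1, n):
            f = a[r][c] / a[c][c]
            for k in range(c, n):
                a[r][k] -= f * a[c][k]
    return d


def greedy_dpp(difficulty: list[float], fingerprints: list[frozenset], k: int) -> list[int]:
    """Greedily pick up to k items that maximize det of the kernel submatrix.
    Assumed kernel: L[i][j] = q_i * sim(i, j) * q_j."""
    n = len(difficulty)
    L = [[difficulty[i] * jaccard(fingerprints[i], fingerprints[j]) * difficulty[j]
          for j in range(n)] for i in range(n)]
    chosen: list[int] = []
    while len(chosen) < k:
        best, best_det = None, 0.0
        for i in range(n):
            if i in chosen:
                continue
            idx = chosen + [i]
            d = det([[L[r][c] for c in idx] for r in idx])
            if d > best_det:
                best, best_det = i, d
        if best is None:  # every remaining item is redundant
            break
        chosen.append(best)
    return chosen


difficulty = [0.9, 0.85, 0.5, 0.3]
fingerprints = [frozenset({"edit", "test"}), frozenset({"edit", "test"}),
                frozenset({"shell", "network"}), frozenset({"edit"})]
coreset = greedy_dpp(difficulty, fingerprints, 2)
```

In this toy input, task 1 is hard but structurally identical to task 0, so the determinant criterion skips it in favor of the easier but structurally different task 2.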

A second pattern is trace-guided, layer-aware diagnosis and repair. HarnessFix compiles raw traces and harness code into a Harness-aware Trace Intermediate Representation (HTIR) consisting of TraceSteps, temporal links, provenance links, control-flow links, node-local evidence, and ETCLOVG layer responsibility facets (Chen et al., 4 Jun 2026). A diagnosis agent attributes failures to responsible steps and layers, consolidates recurring diagnoses into flaw records, maps them to layer-specific repair operators, instantiates repair specifications, and accepts patches only if target flaw occurrence decreases while new regressions stay under a bound. The mechanism is explicitly scoped: rather than “improve the prompt,” it asks which harness layer caused the failure and which operator is admissible.
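The acceptance rule at the end of this pipeline can be sketched as a small predicate. This is an illustrative reading of the criterion stated above, not HarnessFix's code: `ReplayResult`, `accept_patch`, and the way regressions are counted (tasks that passed before the patch and fail after it) are assumptions.

```python
from dataclasses import dataclass


@dataclass
class ReplayResult:
    flaw_hits: int     # number of traces in which the target flaw was diagnosed
    passed: set[str]   # ids of tasks that passed


def accept_patch(before: ReplayResult, after: ReplayResult,
                 max_new_regressions: int) -> bool:
    """Accept only if the target flaw occurs less often and the number of
    newly failing tasks stays within the allowed bound."""
    flaw_reduced = after.flaw_hits < before.flaw_hits
    regressions = before.passed - after.passed
    return flaw_reduced and len(regressions) <= max_new_regressions


before = ReplayResult(flaw_hits=6, passed={"t1", "t2", "t3"})
good = ReplayResult(flaw_hits=2, passed={"t1", "t2", "t3", "t4"})
risky = ReplayResult(flaw_hits=1, passed={"t1", "t4"})

decision_good = accept_patch(before, good, max_new_regressions=1)
decision_risky = accept_patch(before, risky, max_new_regressions=1)
```

The second candidate removes more flaw occurrences but breaks two previously passing tasks, so it is rejected under a bound of one regression.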

A third pattern is weakness mining plus regression-tested proposal selection. Self-Harness iterates over Weakness Mining, Harness Proposal, and Proposal Validation (Zhang et al., 8 Jun 2026). Failed traces are clustered by deterministic failure signatures, each a triple of verifier-level cause, causal status, and implicated agent mechanism. The same model then proposes several diverse but minimal harness edits tied to those patterns, and a proposal is accepted only if it is non-regressing on both the held-in and held-out splits. This is a direct non-regression gate over harness evolution.
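A minimal sketch of signature-based clustering and a non-regression check follows. The field names, example values, and the simplified gate (the candidate's score on each split must be at least the baseline's) are assumptions for illustration; the source does not give Self-Harness's exact conditions.

```python
from collections import defaultdict
from typing import NamedTuple


class Signature(NamedTuple):
    verifier_cause: str   # what the verifier reported
    causal_status: str    # e.g. root cause vs. downstream symptom
    mechanism: str        # implicated agent mechanism


def mine_weaknesses(failed_traces: list[dict]) -> list[tuple[Signature, list[str]]]:
    """Group failed traces by signature, largest cluster first."""
    clusters: dict[Signature, list[str]] = defaultdict(list)
    for trace in failed_traces:
        sig = Signature(trace["cause"], trace["status"], trace["mechanism"])
        clusters[sig].append(trace["task"])
    return sorted(clusters.items(), key=lambda kv: (-len(kv[1]), kv[0]))


def non_regressing(baseline: dict[str, float], candidate: dict[str, float]) -> bool:
    """Simplified gate: no score drop on either split."""
    return all(candidate[s] >= baseline[s] for s in ("held_in", "held_out"))


traces = [
    {"task": "t1", "cause": "test_failed", "status": "root", "mechanism": "tool_dispatch"},
    {"task": "t2", "cause": "test_failed", "status": "root", "mechanism": "tool_dispatch"},
    {"task": "t3", "cause": "timeout", "status": "root", "mechanism": "planning"},
]
weaknesses = mine_weaknesses(traces)
top_signature, top_tasks = weaknesses[0]
```

Because signatures are plain tuples, clustering is deterministic: the same failed traces always produce the same groups, regardless of trace order within a group's membership.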

A fourth pattern is observability-driven evolution. AHE structures the loop around component observability, experience observability, and decision observability (Lin et al., 28 Apr 2026). Past runs are distilled into analysis/overview.md and per-task detail reports, while every edit is paired with a change manifest recording predicted fixes, risk tasks, and targeted files. The next iteration attributes which predictions came true and can roll back edits that failed their contract. The harness therefore evolves through explicit, falsifiable edit hypotheses rather than opaque trial-and-error.
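The idea of a falsifiable change manifest can be sketched as a record that is audited against the next run. The `ChangeManifest` fields mirror the description above, but the `audit` function and its rollback policy (revert when no predicted fix materialized or when a flagged risk task broke) are assumptions, not AHE's actual rules.

```python
from dataclasses import dataclass


@dataclass
class ChangeManifest:
    edited_files: list[str]
    predicted_fixes: set[str]   # tasks the edit claims it will fix
    risk_tasks: set[str]        # tasks the edit might break


def audit(manifest: ChangeManifest, passed_before: set[str],
          passed_after: set[str]) -> dict:
    """Compare the manifest's predictions with observed outcomes."""
    fixed = {t for t in manifest.predicted_fixes
             if t in passed_after and t not in passed_before}
    broken = {t for t in manifest.risk_tasks
              if t in passed_before and t not in passed_after}
    keep = bool(fixed) and not broken
    return {
        "fixed": fixed,
        "missed": manifest.predicted_fixes - fixed,
        "broken": broken,
        "rollback": not keep,
    }


manifest = ChangeManifest(
    edited_files=["systemprompt.md"],
    predicted_fixes={"a", "b"},
    risk_tasks={"c"},
)
report_ok = audit(manifest, passed_before={"c", "d"}, passed_after={"a", "c", "d"})
report_bad = audit(manifest, passed_before={"c", "d"}, passed_after={"a", "b", "d"})
```

Recording the prediction before the run is what makes the edit falsifiable: the audit can only confirm or refute claims that were written down in advance.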

Other domains reuse the same logic with domain-specific evidence. Milkyway extracts internal feedback from temporal contrasts between repeated predictions on the same unresolved question, updates its future-prediction harness before the outcome is known, and only later performs a retrospective check after resolution (Wei et al., 17 Apr 2026). EvoTrainer applies the pattern on the training side: the policy improves within a run, but the training harness evolves across runs by revising diagnostics, reward shaping, filtering logic, and reusable skills such as StdGroupFilter (Chen et al., 2 Jun 2026).

4. Extensions beyond prompt and skill editing

Several systems widen harness self-evolution beyond prompt libraries and skillbooks. SIA explicitly joins harness updates and weight updates in a single loop: a Feedback-Agent rewrites the scaffold of a task-specific agent or configures a weight-update phase, interleaving “H” and “W” steps under a step budget (Hebbar et al., 26 May 2026). Its central claim is that harness updates make the model agentic, while weight updates build domain intuition that no prompt or scaffold can instill. This is not pure harness evolution, but it positions harness self-evolution as one lever in a broader self-improvement system.

MOSS goes further by moving from text-layer mutation to source-level rewriting. Each evolution run is anchored to a batch of production-failure evidence, passes through Locate, Plan, Plan-Review, Implement, Code-Review, Task-Evaluate, and Verdict stages, and verifies candidates by replaying the batch in ephemeral trial workers before promotion through user-consent-gated container swap with health-probe-gated rollback (Cai et al., 21 May 2026). The practical consequence is that routing logic, mediator code, hook ordering, and other harness-level structural failures become reachable to self-evolution.
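The staged structure of such a run can be sketched as a pipeline that stops at the first failing stage. Only the stage names come from the description above; `run_evolution`, the handler interface, and the stub handlers are hypothetical, and the consent-gated promotion and health-probe rollback are not modeled.

```python
from typing import Callable

STAGES = ["Locate", "Plan", "Plan-Review", "Implement",
          "Code-Review", "Task-Evaluate", "Verdict"]


def run_evolution(evidence_batch: list[str],
                  handlers: dict[str, Callable[[dict], bool]]) -> dict:
    """Run each stage in order; a candidate is promotable only if every
    stage, including the final Verdict, succeeds."""
    context = {"evidence": list(evidence_batch), "log": []}
    for stage in STAGES:
        ok = handlers[stage](context)
        context["log"].append((stage, ok))
        if not ok:
            return {"promote": False, "stopped_at": stage, "log": context["log"]}
    return {"promote": True, "stopped_at": None, "log": context["log"]}


# Stub handlers: every stage passes except a deliberately failing code review.
always_ok = {stage: (lambda ctx: True) for stage in STAGES}
failing_review = dict(always_ok)
failing_review["Code-Review"] = lambda ctx: False

result_ok = run_evolution(["failure-001", "failure-002"], always_ok)
result_blocked = run_evolution(["failure-001"], failing_review)
```

Stopping at the first failed stage means a rejected plan or review never reaches implementation or replay, which keeps later and costlier stages from running on candidates that have already failed.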

Adaptive Auto-Harness addresses a different extension: sustained deployment on open-ended streams. It decomposes regret against an oracle harness into evolution loss and adaptation loss, then attacks them with a stateful multi-agent evolver and a harness tree with solve-time routing (Liu et al., 1 Jun 2026). Rather than densely editing one shared harness forever, it keeps branch-specific harnesses such as branch/crypto, branch/pwn, or branch/lvl3, and a router selects a branch per task. This suggests that long-run harness evolution may require not only updating components but also partitioning the harness space itself.
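Solve-time routing over a harness tree can be illustrated with a deliberately simple keyword router. The branch names follow the examples above, but the keyword sets, the `route` function, the overlap scoring, and the `main` fallback are assumptions; the source does not describe how the actual router makes its choice.

```python
def route(task_description: str, branches: dict[str, set[str]],
          default: str = "main") -> str:
    """Pick the branch whose keywords overlap most with the task text;
    fall back to the default branch when nothing matches."""
    words = set(task_description.lower().split())
    scores = {name: len(words & keywords) for name, keywords in branches.items()}
    best = max(scores, key=scores.get, default=default)
    return best if scores.get(best, 0) > 0 else default


# Hypothetical keyword index for two regime-specific branches.
branches = {
    "branch/crypto": {"rsa", "cipher", "decrypt", "ciphertext"},
    "branch/pwn": {"overflow", "heap", "rop"},
}

chosen_crypto = route("Decrypt the RSA ciphertext", branches)
chosen_pwn = route("Exploit the heap overflow", branches)
chosen_default = route("Summarize this article", branches)
```

Keeping a default branch means tasks from an unseen regime still get a harness, while regime-specific edits stay isolated in their own branches.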

SEA focuses on statistical control. It confines self-modification to a small steering adapter and a versioned harness, then admits each modification only through an anytime-valid gate under a fixed total error budget, allocating a share of that budget to each gate so the total is controlled (Sengupta, 1 Jul 2026). Best-of-$N$, micro-step search, self-authored reproduction oracles, search-layer control, and self-repair then supply the dense in-loop verification signal required by those gates. In this formulation, harness self-evolution is not only adaptive but certified.
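One standard way to build an anytime-valid gate is a betting-style test martingale, sketched below. This is an assumption for illustration, not SEA's construction: the null hypothesis (the candidate beats the baseline with probability at most 1/2), the fixed bet size, and the budget schedule proportional to $1/k^2$ are all choices made here.

```python
import math


def alpha_schedule(alpha: float, k: int) -> float:
    """Error budget for the k-th proposed modification (k = 1, 2, ...).
    The shares alpha * 6 / (pi^2 * k^2) sum to alpha over all k."""
    return alpha * 6.0 / (math.pi ** 2 * k ** 2)


def anytime_gate(outcomes, alpha_k: float, bet: float = 0.5) -> tuple[bool, int]:
    """Sequential test of H0: P(candidate beats baseline) <= 1/2.

    outcomes yields 1 when the candidate wins a paired comparison, else 0.
    Under H0 the wealth process is a nonnegative supermartingale, so by
    Ville's inequality it reaches 1/alpha_k with probability at most alpha_k,
    no matter when the gate is checked. Returns (accepted, comparisons used).
    """
    wealth, n = 1.0, 0
    for x in outcomes:
        n += 1
        wealth *= 1.0 + bet * (2 * x - 1)   # x1.5 on a win, x0.5 on a loss
        if wealth >= 1.0 / alpha_k:
            return True, n
    return False, n


budget_first = alpha_schedule(0.05, k=1)
accepted_strong = anytime_gate([1] * 20, budget_first)
accepted_noisy = anytime_gate([1, 0] * 50, budget_first)
```

Because the guarantee holds at every stopping time, the gate can be checked after each comparison and can stop early without inflating the error rate. The decreasing schedule is what keeps the total error across an unbounded sequence of modifications within the overall budget.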

5. Empirical landscape and evaluation methodology

The empirical record is heterogeneous in domains, metrics, and update surfaces, so direct numerical comparison across systems is not meaningful. Even so, reported outcomes establish that harness self-evolution is not confined to a single benchmark family.

System	Domain(s)	Reported outcome
RHO	SWE-Bench Pro, Terminal-Bench 2, GAIA-2	0.59→0.78; 0.71→0.76; 0.29→0.37
HarnessFix	SWE-Bench Verified, Terminal-Bench 2.0 Verified, GAIA, AppWorld	15.2%–50.0% over initial harnesses
Self-Harness	Terminal-Bench-2.0	40.5%→61.9%; 23.8%→38.1%; 42.9%→57.1%
AHE	Terminal-Bench 2	69.7%→77.0% pass@1
Milkyway	FutureX, FutureWorld	44.07→60.90; 62.22→77.96
MOSS	OpenClaw	0.25→0.61 mean grader score
SEA	52-instance SWE-bench Verified subset	GLM 5.2 24→28; GPT 29→34
SIA	LawBench, GPU kernels, scRNA denoising	56.6% on LawBench; 91.9% runtime reduction on GPU kernels; 502% on denoising over the initial baseline

These results point to several recurrent findings. First, harness updates can materially shift long-horizon behavior without changing model weights, as seen in RHO, AHE, Self-Harness, Milkyway, and MOSS (Pan et al., 4 Jun 2026, Lin et al., 28 Apr 2026, Zhang et al., 8 Jun 2026, Wei et al., 17 Apr 2026, Cai et al., 21 May 2026). Second, the editable surface matters: methods that can alter tools, middleware, verification, or source code often target failures that prompt-only systems cannot reach (Chen et al., 4 Jun 2026, Cai et al., 21 May 2026). Third, long-run evaluation is fragile: useful intermediate snapshots can collapse later, and frequent updates may fail to improve held-out performance.

SEAGym was introduced precisely because isolated task scores and single sequential curves obscure whether a harness update is reusable, overfits recent tasks, increases cost, or harms older behavior (Zheng et al., 16 Jun 2026). It organizes evaluation into train batches, frozen update-validation, held-out ID transfer, held-out OOD transfer, replay diagnostics, and cost records. Its metrics include local and cumulative update-validation gain, ID and OOD gain, and forgetting/regression. The paper’s central evaluation claim is that these views provide complementary signals: frequent updates may not improve held-out performance, useful snapshots may collapse later, and source diversity and backend choice affect harness reliability.
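The following sketch shows how such complementary views might be computed from per-snapshot scores. The metric definitions here are assumptions for illustration, not SEAGym's formulas: gains are plain score differences against the previous or initial snapshot, and forgetting is the drop from the best earlier replay score.

```python
def evaluate_snapshots(scores: list[dict[str, int]]) -> list[dict[str, int]]:
    """scores[i] holds the scores of harness snapshot i (0 = initial harness)
    on the frozen validation set ('val'), held-out ID and OOD sets, and a
    replay set of older tasks. Scores are in percentage points."""
    base = scores[0]
    report = []
    for i in range(1, len(scores)):
        prev, cur = scores[i - 1], scores[i]
        best_replay = max(s["replay"] for s in scores[:i])
        report.append({
            "local_val_gain": cur["val"] - prev["val"],
            "cumulative_val_gain": cur["val"] - base["val"],
            "id_gain": cur["id"] - base["id"],
            "ood_gain": cur["ood"] - base["ood"],
            "forgetting": max(0, best_replay - cur["replay"]),
        })
    return report


# Hypothetical scores for an initial harness and two successive updates.
snapshots = [
    {"val": 50, "id": 40, "ood": 30, "replay": 60},
    {"val": 58, "id": 46, "ood": 31, "replay": 62},
    {"val": 61, "id": 44, "ood": 27, "replay": 55},
]
report = evaluate_snapshots(snapshots)
```

In this made-up example the second update still improves the validation score, yet it loses OOD performance and forgets older tasks, which is the kind of divergence a single sequential curve would hide.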

6. Reliability, capability distinctions, and open problems

The field increasingly treats reliability as a first-class concern rather than a side effect of better prompts. HarnessFix enforces repair specifications with explicit target-improvement and regression-bound criteria (Chen et al., 4 Jun 2026). Self-Harness blocks promotion unless both held-in and held-out splits are non-regressing (Zhang et al., 8 Jun 2026). AHE attaches every edit to a falsifiable manifest and can revert file-level changes when predicted fixes do not materialize (Lin et al., 28 Apr 2026). SEA adds anytime-valid gates and certificates, explicitly recognizing that self-evolving agents break exogeneity assumptions because the data, evaluator, components, and hypothesis space are produced by the policy being updated (Sengupta, 1 Jul 2026).

A separate line of analysis distinguishes two capabilities that had often been conflated: harness-updating and harness-benefit. “Harness Updating Is Not Harness Benefit” shows that harness-updating is flat in base capability—models from different capability tiers can produce harness updates that lead to surprisingly similar gains—whereas harness-benefit is non-monotonic in base capability (Lin et al., 28 May 2026). Weak-tier models may fail to activate relevant harness artifacts or activate them but fail to follow them faithfully; mid-tier models benefit most; strong-tier models benefit less than mid-tier. This suggests that better evolvers are not always the bottleneck. A plausible implication is that investment in the task-solving agent’s ability to invoke and obey harness artifacts can matter more than making the evolver larger.

Open problems recur across the literature. Some are methodological: sparse failures, ambiguous attribution, flaky benchmarks, weak replay coverage, and the absence of harness-aware evaluation beyond final task success (Chen et al., 4 Jun 2026, Zheng et al., 16 Jun 2026, Ning et al., 18 May 2026). Others are architectural: context bloat, complexity creep, shared-state consistency across multiple agents, and the question of how far self-modification should extend into governance-critical components (Zhou et al., 9 Apr 2026, Ning et al., 18 May 2026). Still others are deployment-specific: replay safety, irreversible environments, distribution shift in open-ended streams, and the need for human steering when history lacks the required signal (Liu et al., 1 Jun 2026, Wei et al., 17 Apr 2026). Taken together, these works suggest that harness self-evolution is moving from ad hoc prompt refinement toward a more general science of editable agent runtimes—one in which diagnosis, verification, routing, rollback, and governance are as central as the update itself.