HarnessOpt: Optimizing Agentic Harnesses

Updated 19 May 2026

HarnessOpt is a framework for systematically designing and optimizing the controlling interface between agents and their operational tasks.
It applies methods such as Bayesian configuration search, multi-agent feedback loops, and code synthesis to improve harness performance.
Reported outcomes include higher task pass rates, better cost efficiency, and vulnerability discovery, with applications spanning agentic and hardware systems.

HarnessOpt refers to methodologies, algorithms, and frameworks for systematic harness optimization: the process of designing, evolving, and optimizing the controlling "harness" that mediates how agents—especially LLMs and coding agents—interact with their environment, tools, and tasks. HarnessOpt appears in multiple research domains: agentic software engineering, algorithm discovery with LLMs, Bayesian configuration optimization, vulnerability-oriented multi-agent coordination, and classical cable harness layout. Across these domains, harness optimization is central to maximizing task performance, ensuring operational stability, protecting against exploitative solutions, and guaranteeing physical or logical compliance.

1. Definitions and Core Principles

Harness optimization is the systematic design, parameterization, and autonomous evolution of all agent-facing interfaces, scaffolding, and infrastructure components that mediate between the core model(s) and their operational task environment (Lin et al., 28 Apr 2026, Lee et al., 30 Mar 2026, Ishibashi et al., 13 May 2026, Sengupta et al., 22 Apr 2026, Liu et al., 22 Apr 2026). In modern agentic systems, the "harness" encompasses not only prompt templates but also middleware, tool interfaces, sub-agent orchestration, execution wrappers, memory management, component pipelines, and explicit decision logic.

A general formalization (editors' synthesis from Lin et al., 28 Apr 2026; Sengupta et al., 22 Apr 2026; Lee et al., 30 Mar 2026):

Let the harness $H$ be an executable program or configuration controlling a fixed model $M$.
The performance metric is $\mu(H) = \mathbb{E}_{(x, \tau)}[r(\tau, x)]$, where $\tau$ is the trajectory induced by running $M$ with $H$ on task $x$, and $r$ is a scalar reward.
The harness optimization problem is $H^* = \arg\max_H \mu(H) \quad \text{subject to} \quad \mathrm{cost}(H) \leq B$, where the cost may be a token, compute, or wall-time budget.
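
As a minimal sketch of this formalization, the Python below estimates $\mu(H)$ by averaging rewards over a finite task sample and selects the best candidate whose average cost stays within $B$. The `Rollout` type, the `Harness` callable signature, and the toy candidates are illustrative assumptions rather than part of any cited framework, and real systems search a far larger space than a fixed dictionary of candidates.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Callable, Dict, Optional, Sequence, Tuple


@dataclass(frozen=True)
class Rollout:
    reward: float  # r(tau, x) for one trajectory
    cost: float    # e.g. tokens spent on that trajectory (assumed unit)


# Hypothetical interface: a harness maps a task id to one rollout of the fixed model M.
Harness = Callable[[str], Rollout]


def estimate(harness: Harness, tasks: Sequence[str]) -> Tuple[float, float]:
    """Monte Carlo estimate of (mu(H), mean cost) over a task sample."""
    rollouts = [harness(x) for x in tasks]
    return mean(r.reward for r in rollouts), mean(r.cost for r in rollouts)


def select_harness(candidates: Dict[str, Harness], tasks: Sequence[str],
                   budget: float) -> Optional[str]:
    """Return the name of the highest-mu candidate with mean cost <= budget."""
    best_name, best_mu = None, float("-inf")
    for name, harness in candidates.items():
        mu, cost = estimate(harness, tasks)
        if cost <= budget and mu > best_mu:
            best_name, best_mu = name, mu
    return best_name


# Toy candidates with made-up rewards and costs.
TASKS = ["t1", "t2", "t3", "t4"]
CANDIDATES: Dict[str, Harness] = {
    "cheap": lambda x: Rollout(1.0 if x in {"t1", "t2"} else 0.0, 100.0),
    "thorough": lambda x: Rollout(0.0 if x == "t4" else 1.0, 400.0),
    "lavish": lambda x: Rollout(1.0, 900.0),
}

print(select_harness(CANDIDATES, TASKS, budget=500.0))  # thorough
```

With a budget of 500, the "lavish" candidate is excluded by the cost constraint even though it has the highest estimated reward, which is the behaviour the constrained argmax describes.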

Key harness optimization regimes:

Closed-loop, trajectory-level evolution of code-agent harnesses (Lin et al., 28 Apr 2026)
Outer-loop LLM-driven search over harness code or dataflow (Lee et al., 30 Mar 2026)
Bayesian flag configuration search with cost and safety constraints (Sengupta et al., 22 Apr 2026)
Meta-design of multi-agent protocol graphs (Liu et al., 22 Apr 2026)
Physical cable harness Pareto optimization (length/bundle tradeoffs) (Karlsson et al., 2023)

2. HarnessOpt Methodologies and Algorithms

Several algorithmic families underpin contemporary HarnessOpt:

(a) Observability-Driven Evolution

Agentic Harness Engineering (AHE) (Lin et al., 28 Apr 2026) instruments the canonical design–execute–inspect–revise cycle with distinct "observability pillars":

Component observability: Explicit, file-level representations for all harness components (prompts, skills, tools, middleware, memory), mapped to atomic, revertible edit actions (git-compatible patches).
Experience observability: Layered reduction of raw multi-million-token agent trajectories to drill-down, machine-consumable evidence corpora, enabling evidence-driven component revision.
Decision observability: Mandatory, falsifiable contracts for every edit, pairing each patch with explicit task-level predictions and post-hoc automated verification; failed contracts trigger granular rollback.

The full AHE loop executes multi-phase rollouts, evidence distillation, rollback/attribution, predictive editing, and version-controlled commits.
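
The contract-and-rollback idea can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the AHE implementation: the `EditContract` fields, the per-task pass/fail dictionaries, and the keep rule (at least one predicted fix confirmed and no unexpected regression) are all hypothetical choices made for the example.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet


@dataclass(frozen=True)
class EditContract:
    edit_id: str
    predicted_fixes: FrozenSet[str]                       # tasks expected to flip fail -> pass
    tolerated_regressions: FrozenSet[str] = frozenset()   # tasks allowed to flip pass -> fail


def verify(contract: EditContract, before: Dict[str, bool], after: Dict[str, bool]) -> str:
    """Compare per-task pass/fail before and after an edit; return 'keep' or 'rollback'."""
    fixed = {t for t, ok in after.items() if ok and not before.get(t, False)}
    regressed = {t for t, ok in before.items() if ok and not after.get(t, False)}
    confirmed = contract.predicted_fixes & fixed
    unexpected = regressed - contract.tolerated_regressions
    # Assumed rule: keep only if some prediction came true and nothing broke unexpectedly.
    return "keep" if confirmed and not unexpected else "rollback"


contract = EditContract("patch-001", frozenset({"a"}))
before = {"a": False, "b": True, "c": False}
print(verify(contract, before, {"a": True, "b": True, "c": False}))   # keep
print(verify(contract, before, {"a": True, "b": False, "c": False}))  # rollback
```

Because each contract is tied to a single edit, a "rollback" verdict can revert that edit alone while leaving the other patches in place.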

(b) Automated Harness Code Search

Meta-Harness (Lee et al., 30 Mar 2026) uses a coding agent as a harness proposer, with a file-backed memory used to score, trace, and analyze all prior harness attempts. The proposer generates harness candidates by inspecting and editing code at arbitrary granularity, from heuristic fixes to end-to-end rewrites, guided by execution traces and Pareto-optimal tradeoffs between accuracy and context cost.
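
To make the accuracy versus context-cost tradeoff concrete, the following sketch filters a set of scored harness candidates down to its Pareto front. The `Candidate` record and the numbers are invented for illustration; the sketch covers only the dominance filter, not the proposer or its memory.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass(frozen=True)
class Candidate:
    name: str
    accuracy: float      # higher is better
    context_tokens: int  # lower is better


def dominates(a: Candidate, b: Candidate) -> bool:
    """a dominates b if it is no worse on both objectives and strictly better on one."""
    no_worse = a.accuracy >= b.accuracy and a.context_tokens <= b.context_tokens
    better = a.accuracy > b.accuracy or a.context_tokens < b.context_tokens
    return no_worse and better


def pareto_front(candidates: Sequence[Candidate]) -> List[Candidate]:
    """Keep every candidate that no other candidate dominates (O(n^2) scan)."""
    return [c for c in candidates if not any(dominates(o, c) for o in candidates)]


# Made-up scores for four harness attempts.
ATTEMPTS = [
    Candidate("h1", 0.70, 8_000),
    Candidate("h2", 0.75, 12_000),
    Candidate("h3", 0.72, 15_000),  # dominated by h2: less accurate and more costly
    Candidate("h4", 0.60, 4_000),
]

print([c.name for c in pareto_front(ATTEMPTS)])  # ['h1', 'h2', 'h4']
```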

(c) Bayesian Configuration Search

HARBOR (Sengupta et al., 22 Apr 2026) formalizes mixed-variable harness optimization as constrained noisy Bayesian optimization. The search space consists of boolean, categorical, discrete, and continuous configuration flags. The algorithm employs a block-additive SAAS GP surrogate (with axis-aligned priors and cross-block interaction), multi-fidelity cost-aware qNEHVI acquisition, TuRBO trust regions, and posterior chance-constrained safety rejection to maximize pass-rate under a deployment cost ceiling.

Cold-start correction is applied for configuration features requiring warmup, and a silent-flag detector prunes ineffective or broken flag blocks from the search.
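
The chance-constrained rejection step can be illustrated in isolation. The sketch below assumes that a surrogate has already produced a Gaussian posterior (mean and standard deviation) over the deployment cost of each configuration, and it keeps a configuration only if the posterior probability of staying under the ceiling is at least $1 - \delta$. The function names, the Gaussian assumption, and the sample posteriors are hypothetical; this is not HARBOR's code and omits the GP surrogate, acquisition function, and trust regions.

```python
from statistics import NormalDist
from typing import Dict, List, Tuple


def is_probably_feasible(mean_cost: float, std_cost: float,
                         ceiling: float, delta: float = 0.05) -> bool:
    """True if P(cost <= ceiling) >= 1 - delta under a Gaussian posterior."""
    if std_cost <= 0.0:
        return mean_cost <= ceiling  # degenerate posterior: compare the point estimate
    return NormalDist(mean_cost, std_cost).cdf(ceiling) >= 1.0 - delta


def reject_unsafe(posteriors: Dict[str, Tuple[float, float]],
                  ceiling: float, delta: float = 0.05) -> List[str]:
    """Keep only configurations whose posterior cost passes the chance constraint."""
    return [name for name, (m, s) in posteriors.items()
            if is_probably_feasible(m, s, ceiling, delta)]


# Made-up posterior (mean, std) of cost per configuration.
POSTERIORS = {"cfg_a": (80.0, 10.0), "cfg_b": (95.0, 10.0), "cfg_c": (99.0, 0.2)}
print(reject_unsafe(POSTERIORS, ceiling=100.0))  # ['cfg_a', 'cfg_c']
```

Note that `cfg_c` passes despite a mean close to the ceiling because its posterior is tight, while `cfg_b` is rejected because its uncertainty leaves too much probability mass above the ceiling.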

(d) Feedback-Driven Multi-Agent Harness Synthesis

AgentFlow (Liu et al., 22 Apr 2026) models multi-agent harnesses as well-typed, directed graphs in a DSL that explicitly parametrizes agent roles, prompts, model assignments, allowed tools, inter-agent edges (both data and guarded control flow), and fan-out nodes. The search space is vast, so AgentFlow leverages type-safe proposal generation (by LLM), runtime feedback extraction (pass/fail, coverage, sanitizer, stdout), and LLM-driven diagnosis to iteratively repair and improve harnesses.
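
A typed harness graph with a static well-formedness check might look like the following sketch. The `AgentNode` and `Edge` records, their field names, and the specific rules checked are assumptions made for illustration and do not reproduce AgentFlow's DSL; the point is only that malformed proposals can be rejected before any costly rollout.

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Optional, Sequence


@dataclass(frozen=True)
class AgentNode:
    name: str
    role: str
    model: str
    tools: FrozenSet[str]
    fan_out: int = 1  # number of parallel copies of this agent


@dataclass(frozen=True)
class Edge:
    src: str
    dst: str
    kind: str                    # "data" or "control"
    guard: Optional[str] = None  # condition label, required on control edges


def check_well_formed(nodes: Sequence[AgentNode], edges: Sequence[Edge],
                      known_tools: FrozenSet[str]) -> List[str]:
    """Return a list of human-readable errors; an empty list means the graph is valid."""
    errors: List[str] = []
    names = {n.name for n in nodes}
    if len(names) != len(nodes):
        errors.append("duplicate node names")
    for n in nodes:
        unknown = sorted(n.tools - known_tools)
        if unknown:
            errors.append(f"{n.name}: unknown tools {unknown}")
        if n.fan_out < 1:
            errors.append(f"{n.name}: fan_out must be >= 1")
    for e in edges:
        if e.src not in names or e.dst not in names:
            errors.append(f"edge {e.src}->{e.dst}: unknown endpoint")
        if e.kind not in {"data", "control"}:
            errors.append(f"edge {e.src}->{e.dst}: unknown kind {e.kind!r}")
        elif e.kind == "control" and e.guard is None:
            errors.append(f"edge {e.src}->{e.dst}: control edge needs a guard")
    return errors


TOOLS = frozenset({"shell", "fuzzer"})
NODES = [AgentNode("planner", "plan", "model-a", frozenset()),
         AgentNode("fuzz", "execute", "model-b", frozenset({"fuzzer"}), fan_out=4)]
EDGES = [Edge("planner", "fuzz", "data"),
         Edge("fuzz", "planner", "control", guard="crash_found")]

print(check_well_formed(NODES, EDGES, TOOLS))  # []
```

The error strings returned by a check of this kind are also a natural input to an LLM-driven repair step, since they state which node or edge violated which rule.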

(e) Domain-Specific Methods

In hardware, HarnessOpt algorithms for cable harness layout (Karlsson et al., 2023) solve a constrained multi-commodity flow problem on a spatial grid with competing objectives (length minimization versus bundling), using deterministic local-search or Lagrangian-relaxation heuristics to produce Pareto frontiers and to route bundles around obstacles and according to routing preferences.
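
The length versus bundling tradeoff can be illustrated with a much simpler stand-in than the cited method: a greedy sequential router on a 2D grid in which grid edges already occupied by an earlier cable are cheaper to reuse. This swapped-in heuristic is an assumption made for the example, not the multi-commodity flow or Lagrangian-relaxation approach of the paper; `discount` (assumed to lie in [0, 1)) is a hypothetical knob whose sweep traces out different tradeoff points.

```python
import heapq
from typing import Dict, FrozenSet, List, Optional, Sequence, Set, Tuple

Cell = Tuple[int, int]
Edge = FrozenSet[Cell]


def route(width: int, height: int, blocked: Set[Cell], start: Cell, goal: Cell,
          used: Set[Edge], discount: float) -> Optional[List[Cell]]:
    """Dijkstra on a 4-connected grid; edges already in `used` cost 1 - discount."""
    dist: Dict[Cell, float] = {start: 0.0}
    prev: Dict[Cell, Cell] = {}
    heap = [(0.0, start)]
    while heap:
        d, node = heapq.heappop(heap)
        if node == goal:
            break
        if d > dist[node]:
            continue  # stale heap entry
        x, y = node
        for nxt in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
            if not (0 <= nxt[0] < width and 0 <= nxt[1] < height) or nxt in blocked:
                continue
            step = 1.0 - discount if frozenset((node, nxt)) in used else 1.0
            if d + step < dist.get(nxt, float("inf")):
                dist[nxt] = d + step
                prev[nxt] = node
                heapq.heappush(heap, (d + step, nxt))
    if goal not in dist:
        return None
    path = [goal]
    while path[-1] != start:
        path.append(prev[path[-1]])
    return path[::-1]


def route_all(width: int, height: int, blocked: Set[Cell],
              cables: Sequence[Tuple[Cell, Cell]], discount: float) -> Tuple[int, int]:
    """Route cables one by one; return (total wire length, distinct grid edges occupied)."""
    used: Set[Edge] = set()
    total = 0
    for start, goal in cables:
        path = route(width, height, blocked, start, goal, used, discount)
        if path is None:
            raise ValueError(f"no route from {start} to {goal}")
        edges = [frozenset(pair) for pair in zip(path, path[1:])]
        total += len(edges)
        used.update(edges)
    return total, len(used)


CABLES = [((0, 0), (4, 0)), ((0, 1), (4, 1))]
print(route_all(5, 3, set(), CABLES, discount=0.0))  # (8, 8)
print(route_all(5, 3, set(), CABLES, discount=0.9))  # (10, 6)
```

With no discount both cables run straight (8 units of wire over 8 distinct edges); with a strong discount the second cable detours to share the first cable's channel, spending 2 more units of wire but occupying only 6 distinct edges.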

3. Action Spaces, Feedback, and Search Dimensions

The harness edit/action space is highly heterogeneous:

Edit granularity: File-level (AHE), code block/module-level (Meta-Harness/AgentFlow), or configuration flags (HARBOR).
Supported objects: Prompts, tool APIs, database policies, internal state machines, orchestration graphs, environmental integration, memory backends.
Search moves: Add/delete/modify/revert component; rewire agent schedules; patch prompt templates; adjust memory thresholds; enable/disable compression; change protocol topology (for multi-agent); or set parallelization/isolation parameters.

Empirically, the feasible search space explodes combinatorially with system size, that is, with the number of roles, model choices, tools, prompt grammars, edge wirings, and fan-out degrees (Liu et al., 22 Apr 2026). Typed well-formedness checks, static validation, and cold-start pruning are essential for managing this complexity.

Diagnostic signals can include:

Task-level binary or scalar rewards (pass@1, accuracy, F1)
Layered trajectory evidence (failure clusters, root-cause analysis)
Per-edit attribution (fix/rollback verdicts, regression predictions)
Sanitizer, coverage, and runtime traces (for code/bug finding)
Domain-specific constraints (cable length, physical clearance)

Contracts and explicit edit manifests (AHE) operationalize feedback by associating a predicted-fix/regression set per action, enabling precise quantitative attribution and patch rollback.
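
Per-edit attribution of this kind reduces to set comparisons between what an edit predicted and what was observed. The sketch below computes precision and recall for one predicted set; the function name, the task ids, and the convention of returning 0.0 when a denominator is empty are assumptions for illustration.

```python
from typing import Set, Tuple


def precision_recall(predicted: Set[str], observed: Set[str]) -> Tuple[float, float]:
    """Precision and recall of a predicted task set against the observed one."""
    hits = len(predicted & observed)
    precision = hits / len(predicted) if predicted else 0.0  # assumed convention
    recall = hits / len(observed) if observed else 0.0       # assumed convention
    return precision, recall


# Hypothetical edit: predicted to fix four tasks, actually fixed three (two as predicted).
predicted_fixes = {"task_a", "task_b", "task_c", "task_d"}
observed_fixes = {"task_a", "task_b", "task_e"}

precision, recall = precision_recall(predicted_fixes, observed_fixes)
print(f"precision={precision:.2f} recall={recall:.2f}")  # precision=0.50 recall=0.67
```

The same function applies unchanged to regression predictions by passing the predicted and observed pass-to-fail sets instead.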

4. Empirical Results and Performance Metrics

HarnessOpt frameworks yield substantial empirical gains versus baseline or hand-engineered harnesses:

Coding agents (AHE): pass@1 on Terminal-Bench 2 rises from 69.7% (seed harness) to 77.0% (AHE), above the 71.9% of Codex-CLI, with positive transfer (in both token usage and pass@1) to SWE-bench Verified and to multiple alternate model families (Lin et al., 28 Apr 2026).
Algorithm discovery: Under fixed budgets (40M tokens), quality-optimized harnesses (deep multistep agent sessions, hack-detection, Git worktree isolation) discover higher-performing algorithms than quantity-optimized fast API loops, with fewer hack exploits and near-linear wall-clock scaling (3.2–3.9× speedup for parallelization) (Ishibashi et al., 13 May 2026).
Multi-agent LLM harnesses: AgentFlow improves the Terminal-Bench 2 pass rate to 84.3% (the highest in the reported snapshot), outperforming established baseline pipelines and discovering previously unknown critical vulnerabilities in a real codebase under real-world constraints (10 zero-day bugs in Google Chrome) (Liu et al., 22 Apr 2026).
Bayesian flag optimization: HARBOR matches or exceeds carefully hand-tuned configurations with fewer flags enabled, filtering out non-contributing or error-prone enhancements. Large flag spaces ($F \gg 16$) cannot be efficiently explored by manual stacking (Sengupta et al., 22 Apr 2026).

Key metrics (per context):

Task-level pass@1 or accuracy
Cost efficiency (task success per million tokens)
Pareto optimality (accuracy vs cost)
Fix and regression precision/recall per edit
Wall-clock scaling under parallelization constraints
Zero-day discovery count (for vulnerability harnesses)
Exact and normalized objective gaps to MIP/PSO in hardware routing

5. Practical Guidelines and Limitations

Robust, cost-effective harness optimization is contingent upon systematic observation, structured versioning, and feedback attribution. The literature converges on several core best practices:

Enforce explicit versionable representations for all components (file-level, flags, graph-DSL)
Structure the search loop with incremental, observable, and reversible actions
Use layered trajectory and runtime evidence rather than relying on compressed pass/fail signals alone (Lin et al., 28 Apr 2026, Lee et al., 30 Mar 2026)
Integrate falsifiable contracts for every harness edit, to enable precise rollback (Lin et al., 28 Apr 2026)
Pareto-optimize accuracy versus cost (token budget, wall-time, memory), reporting and enforcing deployment cost ceilings (Sengupta et al., 22 Apr 2026, Lee et al., 30 Mar 2026)
Mandate hack- and regression-detection filters at scale as agent capabilities grow (Ishibashi et al., 13 May 2026)
Parallelize safely using isolation primitives (e.g. Git worktrees) (Ishibashi et al., 13 May 2026)
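
As one possible way to apply the worktree guideline, the sketch below wraps `git worktree add` and `git worktree remove` in a context manager so that each parallel run edits its own checkout. The helper names and the directory layout are hypothetical; the code assumes `git` is on the PATH and that `repo` is an existing repository, and it is not taken from any of the cited systems.

```python
import subprocess
from contextlib import contextmanager
from pathlib import Path
from typing import Iterator, List


def add_cmd(path: Path, commit: str) -> List[str]:
    """Command that creates a detached worktree at `path` for `commit`."""
    return ["git", "worktree", "add", "--detach", str(path), commit]


def remove_cmd(path: Path) -> List[str]:
    """Command that deletes the worktree, discarding uncommitted changes."""
    return ["git", "worktree", "remove", "--force", str(path)]


@contextmanager
def isolated_worktree(repo: Path, path: Path, commit: str = "HEAD") -> Iterator[Path]:
    """Create a private checkout for one agent run and clean it up afterwards."""
    subprocess.run(add_cmd(path, commit), cwd=repo, check=True)
    try:
        yield path
    finally:
        subprocess.run(remove_cmd(path), cwd=repo, check=True)


print(add_cmd(Path("run-0"), "HEAD"))
```

A caller would typically use `with isolated_worktree(repo, worktree_dir) as wt:` around one agent session, giving each concurrent session a distinct `worktree_dir` so that file edits in one run cannot interfere with another.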

Major limitations and open problems include:

Regression prediction remains unreliable (low precision/recall ~11%), impeding consistent improvement (Lin et al., 28 Apr 2026)
Cold-start bias for stateful/warmup-dependent features inflates early variance; unbiased estimators and reward correction are necessary (Sengupta et al., 22 Apr 2026)
Compute cost remains high for full-benchmark rollouts, trajectory analysis, and workspace management; addressing this requires efficient surrogate models, selective rollouts, or multi-fidelity strategies
Joint optimization of interacting harness components is unsolved; component ablations sum superlinearly (Lin et al., 28 Apr 2026, Ishibashi et al., 13 May 2026)
Most results remain "in-the-small" (fixed agent/harness, benchmark suite, or hardware context); transferability is promising but not guaranteed

6. Cross-Domain Applications and Comparative Impact

HarnessOpt frameworks demonstrate generality across a range of domains:

Agentic coding/LLM systems: Observability-driven evolution, contract-based versioning, and explicit feedback dramatically outperform legacy, static harness designs (Lin et al., 28 Apr 2026, Lee et al., 30 Mar 2026).
Algorithm discovery: Quality-focused agentic harnesses, equipped with hack detection and safe parallelism, dominate in constrained resource settings (Ishibashi et al., 13 May 2026).
Vulnerability discovery: Harness synthesis via a graph DSL captures complex multi-agent interplay that is discoverable only through feedback-driven, diagnosis-guided search (Liu et al., 22 Apr 2026).
Hardware/cable optimization: Harness routing via deterministic local-search and Lagrangian-relaxation heuristics achieves near-optimal topologies, good scalability, and constraint compliance relative to stochastic alternatives (Karlsson et al., 2023).

Taken together, the cited results support one conclusion: when the harness is the locus of control and coordination, its systematic optimization (automated, rigorously observable, and explicitly feedback-driven) is necessary for scaling agentic performance under practical task, cost, and safety constraints.

References

Agentic Harness Engineering: Observability-Driven Automatic Evolution of Coding-Agent Harnesses (Lin et al., 28 Apr 2026)
Meta-Harness: End-to-End Optimization of Model Harnesses (Lee et al., 30 Mar 2026)
Effective Harness Engineering for Algorithm Discovery with Coding Agents (Ishibashi et al., 13 May 2026)
HARBOR: Automated Harness Optimization (Sengupta et al., 22 Apr 2026)
Synthesizing Multi-Agent Harnesses for Vulnerability Discovery (Liu et al., 22 Apr 2026)
Automatic cable harness layout routing in a customizable 3D environment (Karlsson et al., 2023)