Towards Direct Evaluation of Harness Optimizers via Priority Ranking

Published 21 May 2026 in cs.AI | (2605.22505v1)

Abstract: Harness optimization enables automated agent creation by having an optimizer agent iteratively update the harness of target agents. Despite its success, current studies evaluate optimizers solely by observing target agents' performance gains. This indirect end-improvement evaluation neglects optimizers' actions at intermediate steps, which are often erroneous and hinder agent performance. Therefore, it is unclear whether harness optimization is driven by optimizers' informed update actions or simply trial-and-error. This necessitates direct evaluation of harness optimizers. However, evaluating harness optimizers directly is non-trivial and costly due to the lack of oracle harnesses. To address this, we present a simple, low-cost design to directly evaluate them, namely priority ranking. By asking harness optimizers to rank components (e.g., tools) in a given harness by their potential to improve/hinder agent performance when updated, our design quantifies optimizer ability at the step level without expensive rollouts or manual examination. More importantly, optimizers' ranking performance correlates with their ability to improve agents in actual multi-step harness optimization, establishing priority ranking as a reliable predictor of optimization ability. Priority ranking is enabled by Shor, a collection of 182 human-verified optimization scenarios spanning across domains, designs, and time stages. Codes and data can be found at https://github.com/k59118/Harness_Optimizer_Evaluation.

Summary

  • The paper introduces priority ranking to directly evaluate harness optimizer decisions, revealing frequent detrimental updates and poor self-assessment.
  • It demonstrates that 44.8–48.2% of optimizer modifications degrade performance, underscoring the need for step-level evaluation over traditional agent-centric metrics.
  • The study shows a significant correlation between ranking performance and multi-step agent improvement, offering a cost-effective screening tool for CI/CD pipelines.

Direct Evaluation of Harness Optimizers: The Priority Ranking Approach

Background and Motivation

Harness optimization for LLM-based agents automates the iterative improvement of agent scaffolds (including prompts, tool configurations, memory mechanisms, and workflows), enabling substantial agent performance gains across domains such as software engineering, text-to-SQL, and customer support. However, prior evaluation paradigms have been agent-centric, relying exclusively on the end-task performance improvement (end-SR) of target agents. This indirect metric neglects the optimizer’s intermediate actions, often masking detrimental updates and providing no insight into whether optimization is guided by informed decisions or trial-and-error.

Empirical analyses in this work demonstrate that optimizer missteps are frequent and persistent: nearly half of harness updates degrade agent performance, and over 94% of non-prompt-related errors at intermediate optimization steps remain uncorrected in final agent configurations. State-of-the-art harness optimizers also exhibit near-random accuracy in predicting whether their own updates will be beneficial. These findings undermine the assumption that agent-centric evaluation is a reliable proxy for optimizer quality and establish the need for direct, step-level assessment of optimizer decision-making.

Priority Ranking: Methodological Design

To address the lack of actionable, direct metrics for harness optimizer evaluation, the paper introduces priority ranking. This evaluation protocol requires the optimizer to rank harness components (prompt, tool, memory, workflow) by their relative potential to impact agent performance if updated at the current optimization step. This reframing removes the need for expensive full rollouts or manual harness audits, turning the evaluation into a tractable, non-iterative language modeling task.

The core innovation lies in the SHOR dataset: 182 human-verified optimization scenarios spanning multiple domains and optimization stages. Each scenario is annotated by consensus among state-of-the-art coding agents and systematically filtered for inter-annotator consistency and meaningful performance gaps between ranked components.

Given a harness and its optimization trajectory, an optimizer is evaluated on its ability to recover the consensus priority ranking for the set of editable components. Metrics include Acc@1 (identifying the single highest-priority component) and NDCG (the quality of the entire ranking permutation).
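The two metrics can be illustrated with a minimal sketch. The function names, the component labels, and the linear relevance grades (top-ranked component gets the highest gain) are assumptions made for illustration; the summary does not specify how the paper derives NDCG gains from the consensus ranking.

```python
import math
from typing import Sequence


def acc_at_1(predicted: Sequence[str], gold: Sequence[str]) -> float:
    """1.0 if the optimizer's top-ranked component matches the consensus top-1."""
    return 1.0 if predicted[0] == gold[0] else 0.0


def ndcg(predicted: Sequence[str], gold: Sequence[str]) -> float:
    """NDCG of a predicted ranking against a consensus ranking.

    Assumption: relevance is linear in the consensus position, so the
    top component of n gets gain n and the last gets gain 1.
    """
    if sorted(predicted) != sorted(gold):
        raise ValueError("predicted and gold must rank the same components")
    n = len(gold)
    relevance = {component: n - i for i, component in enumerate(gold)}

    def dcg(ranking: Sequence[str]) -> float:
        # Standard logarithmic position discount: 1 / log2(position + 1),
        # with positions starting at 1.
        return sum(relevance[c] / math.log2(pos + 2) for pos, c in enumerate(ranking))

    return dcg(predicted) / dcg(gold)


# Hypothetical scenario: consensus says the tool component matters most.
gold_ranking = ["tool", "prompt", "memory", "workflow"]
optimizer_ranking = ["prompt", "tool", "memory", "workflow"]
print(acc_at_1(optimizer_ranking, gold_ranking))
print(round(ndcg(optimizer_ranking, gold_ranking), 3))
```

Acc@1 only rewards getting the single most important component right, while NDCG gives partial credit for rankings that place high-priority components near the top.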

Empirical Analysis and Findings

Error Analysis and Limitations of End-Improvement Evaluation

Detailed trajectory analyses confirm that:

  • Erroneous Update Frequency: 44.8–48.2% of optimizer-initiated harness modifications are detrimental to agent performance.
  • Error Persistence: Intermediate errors in workflow, tool, and memory components persist to the final agent configuration in over 94% of cases.
  • Poor Update Awareness: Optimizers’ prediction accuracy for self-evaluation of their actions is uniformly close to random (0.33–0.56 depending on domain and model).

Therefore, the prevailing end-improvement evaluation scheme is insufficient for ensuring harness optimizers’ reliability and correctness.
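As a rough illustration of how such step-level statistics could be computed from a logged optimization trajectory, consider the sketch below. The `UpdateStep` record, its field names, and the definition of "detrimental" as a strict drop in success rate are all assumptions for illustration, not the paper's logging format or exact criteria.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class UpdateStep:
    """One harness update in a trajectory (hypothetical record layout)."""
    component: str              # which component was edited
    sr_before: float            # agent success rate before the update
    sr_after: float             # agent success rate after the update
    predicted_beneficial: bool  # optimizer's own forecast for this update


def detrimental_rate(steps: Sequence[UpdateStep]) -> float:
    """Fraction of updates that strictly lowered the agent's success rate."""
    return sum(s.sr_after < s.sr_before for s in steps) / len(steps)


def self_assessment_accuracy(steps: Sequence[UpdateStep]) -> float:
    """Fraction of updates where the forecast matched the observed outcome.

    Assumption: an update counts as beneficial only if the success rate
    strictly increased.
    """
    correct = sum(
        s.predicted_beneficial == (s.sr_after > s.sr_before) for s in steps
    )
    return correct / len(steps)


# Made-up trajectory, for illustration only.
trajectory = [
    UpdateStep("prompt", 0.40, 0.45, True),
    UpdateStep("tool", 0.45, 0.38, True),
    UpdateStep("memory", 0.38, 0.38, False),
    UpdateStep("workflow", 0.38, 0.30, True),
]
print(detrimental_rate(trajectory))
print(self_assessment_accuracy(trajectory))
```

Note that an end-improvement metric would only compare the first and last success rates of such a trajectory, which hides how many individual steps went wrong along the way.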

Performance of Priority Ranking

Optimizers’ performance on priority ranking reveals several key insights:

  • General Weakness: SOTA optimizers struggle to identify the appropriate component(s) for prioritized updates, with the leading top-1 accuracy (Acc@1) at only 0.305, and even the highest overall NDCG values not indicating robust ranking competence.
  • Lack of Cross-domain Generalization: Optimizers with high performance in one domain (e.g., software engineering) fail to transfer ranking performance to others (e.g., customer support or text-to-SQL).
  • Invalidation of Agent Harness Quality as Optimizer Harness Proxy: Strong agent performance under a given harness does not imply that the same harness is effective when used by the optimizer itself; a configuration that benefits a target agent may be suboptimal as an optimizer harness.

Predictive Value and Efficiency

Crucially, the paper establishes a statistically significant correlation (Pearson r = 0.60, p = 0.038) between an optimizer’s priority ranking performance and its realized ability to improve agent performance in multi-step optimization. This correlation holds in both in-domain and out-of-domain settings and across optimization stages, and is strongest for mid-stage harnesses, i.e., T ∈ [6, 10] iterations.
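For readers who want to run the same kind of check on their own optimizers, the following sketch computes a Pearson correlation and the associated t statistic by hand. The score lists are made-up placeholders, not the paper's data, and the helper names are illustrative; turning the t statistic into a p-value additionally requires a Student t distribution with n - 2 degrees of freedom (for example from a statistics library).

```python
import math
from typing import Sequence


def pearson_r(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Sample Pearson correlation coefficient between two paired samples."""
    n = len(xs)
    if n != len(ys) or n < 3:
        raise ValueError("need two samples of equal length, at least 3 points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def t_statistic(r: float, n: int) -> float:
    """t statistic for testing r != 0 with n paired observations (df = n - 2)."""
    return r * math.sqrt((n - 2) / (1.0 - r * r))


# Placeholder values: one entry per optimizer configuration.
ranking_scores = [0.52, 0.61, 0.48, 0.70, 0.66, 0.55]      # e.g. NDCG on ranking
end_improvements = [0.03, 0.08, 0.01, 0.09, 0.05, 0.06]    # e.g. gain in end-SR

r = pearson_r(ranking_scores, end_improvements)
print(f"r = {r:.3f}, t = {t_statistic(r, len(ranking_scores)):.3f}")
```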

From a practical perspective, priority ranking is also efficient:

  • Cost and Time: At least 8x cheaper and 17x faster than full-harness optimization rollouts for comparative evaluation.
  • Predictive Utility: Offers a low-latency, actionable screening tool for optimizer selection and monitoring in CI/CD pipelines, supporting industrial agent deployment lifecycles.
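One way such a screening step might look in a pipeline is a simple threshold gate over per-scenario ranking scores. This is a sketch under stated assumptions: the function name, the use of a mean score, and the threshold value are all hypothetical, and the paper does not prescribe a specific gating rule.

```python
from statistics import mean
from typing import Sequence


def passes_ranking_gate(scenario_scores: Sequence[float], min_mean_score: float) -> bool:
    """Return True if the candidate optimizer's mean ranking score meets the bar.

    scenario_scores: one ranking score (e.g. NDCG) per evaluation scenario.
    min_mean_score:  hypothetical threshold, e.g. the score of the optimizer
                     currently in production.
    """
    if not scenario_scores:
        raise ValueError("need at least one scenario score")
    return mean(scenario_scores) >= min_mean_score


# Placeholder scores for a candidate optimizer on three scenarios.
candidate_scores = [0.71, 0.64, 0.80]
if passes_ranking_gate(candidate_scores, min_mean_score=0.65):
    print("candidate optimizer passes the screening gate")
else:
    print("candidate optimizer rejected before any full rollout")
```

Because the gate needs no agent rollouts, it can run on every change to the optimizer, with full multi-step optimization reserved for candidates that pass.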

Actionable Insights for Optimizer Development

A key experimental result is that providing optimizers with oracle information about which harness component contains a flaw improves their error resolution rate by 17–51 percentage points (domain-dependent, up to 72 points). This implies that optimizers possess the architectural and functional capacity to execute repairs when given explicit prioritization cues, but lack effective diagnostic mechanisms for self-directed prioritization.

Consequently, the design and training of future harness optimizers should explicitly incorporate priority prediction submodules that decouple the identification of optimization targets from the execution of updates, to improve resilience, generalization, and effectiveness.
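The decoupling described here can be sketched as two separate callables, one that ranks components and one that rewrites a chosen component. The type aliases, the dictionary representation of a harness, and the function names are assumptions for illustration; they are not an interface defined by the paper.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical harness representation: component name -> current content.
Harness = Dict[str, str]
# A prioritizer ranks components from most to least worth updating.
Prioritizer = Callable[[Harness], List[str]]
# An updater rewrites one named component and returns its new content.
Updater = Callable[[Harness, str], str]


def optimization_step(
    harness: Harness, prioritize: Prioritizer, update: Updater
) -> Tuple[Harness, str]:
    """Run one step: pick the top-priority component, then update only it."""
    ranking = prioritize(harness)
    if not ranking or ranking[0] not in harness:
        raise ValueError("prioritizer must return known component names")
    target = ranking[0]
    new_harness = dict(harness)  # leave the input harness unmodified
    new_harness[target] = update(harness, target)
    return new_harness, target


# Toy stand-ins for what would be model calls in practice.
def toy_prioritizer(harness: Harness) -> List[str]:
    return ["tool", "prompt", "memory", "workflow"]


def toy_updater(harness: Harness, component: str) -> str:
    return harness[component] + " (revised)"


harness = {"prompt": "p0", "tool": "t0", "memory": "m0", "workflow": "w0"}
updated, edited = optimization_step(harness, toy_prioritizer, toy_updater)
print(edited, updated[edited])
```

With this separation, the prioritizer can be evaluated on its own with priority ranking, and an oracle ranking can be swapped in to measure how much of the optimizer's shortfall comes from diagnosis rather than repair.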

Implications for Theory and Industrial Practice

The transition from black-box end-improvement evaluation to step-level, component-wise prioritization establishes a new theoretical substrate for studying optimizer competence, bridging the gap between agentic system design and introspective optimizer diagnostics.

From an applied perspective, the results highlight silent failure risks in production automation scenarios: agents deployed with harnesses refined under agent-centric, optimizer-blind evaluation protocols may inherit undetected, persistent faults. Priority ranking can mitigate these risks, both as a pre-deployment screening gate and as a continuous quality-control measure in production workflows.

Moreover, as harness optimization becomes more prevalent in closed- and open-source agentic systems, the disconnect between agent harness quality and optimizer harness quality will have direct consequences for the reproducibility and portability of agentic research and benchmarks.

Limitations and Future Directions

Limitations include the granularity of harness component decomposition (four broad categories), single-agent focus (multi-agent/multi-harness systems remain unaddressed), and annotator homogeneity (limited to a set of SOTA coding agents). Further work should generalize the SHOR evaluation pipeline to finer-grained harness architectures, collaborative or competitive agent environments, and annotator pools with greater diversity.

Additionally, the paper’s framework does not address the mitigation of emergent safety risks in self-evolving agent systems, an area of growing importance in practical deployments.

Conclusion

This work motivates and introduces direct, step-level evaluation for harness optimizers in agentic LLM systems. The proposed priority ranking task, underpinned by the SHOR dataset, enables scalable and interpretable appraisal of optimizer competence while substantially reducing evaluation burden. Priority ranking not only serves as an efficient proxy for multi-step optimization ability, but also points to actionable directions for developing next-generation optimizers with explicit prioritization and diagnostic capabilities. These findings have immediate implications for theory development as well as for the design, deployment, and maintenance of industrial AI agent systems.

Reference: "Towards Direct Evaluation of Harness Optimizers via Priority Ranking" (2605.22505)
