- The paper challenges traditional benchmarks by revealing that they conflate model performance with system harness quality, causing ≥20 percentage point variations across setups.
- It demonstrates that single reference solutions obscure valid alternatives, penalizing functionally superior outcomes and hindering robust system design.
- The paper advocates for a redesign of benchmarks with component-level evaluations and multi-shape verifiers to accurately capture agentic software engineering complexities.
Coding Benchmarks and Their Misalignment with Agentic Software Engineering
Introduction
Current evaluation protocols for coding agents, based largely on traditional code generation benchmarks, are fundamentally misaligned with the realities and needs of agentic software engineering. The prevailing approach, which yields an end-to-end score typically computed against a single reference solution, conflates distinct components within agent-driven software systems, penalizes valid alternatives, and precludes actionable component-level signals for system iteration. This position paper identifies and analyzes critical symptoms of this misalignment and advocates for re-engineering benchmarks to reflect the composite nature of agentic systems.
Agentic Systems and the System Harness
In modern practice, coding agents function as orchestrated composites—so-called system harnesses—rather than monolithic models. Such systems combine LLMs, tool-use scaffolds, multi-tier environments, curated contexts, and feedback signals operating at varying timescales and granularities. The agent harness (model plus basic tooling for a single task) is enveloped by a system harness that decomposes high-level goals, handles environment management, orchestrates task dispatch, and routes through a hierarchy of verification and feedback mechanisms.
The system harness integrates the following critical components:
- Task decomposition and dispatch: Transforms high-level objectives into tractable, verifiable units.
- Agent harness selection and configuration: The choice of models, prompts, and available tools markedly affects outcomes.
- Dynamic environment management: Incorporates runtime state, external services, and evolving repositories.
- Context curation: Assembles relevant knowledge, code artifacts, and problem-specific context for invocation.
- Multi-tier feedback: Includes inner-loop (fast, local, automated verifiers), middle-loop (reviewer and simulation feedback), and outer-loop (production/user signals).
Each component can independently shift system performance by margins comparable to inter-generational model jumps, underscoring the reductionism of attributing benchmark outcomes to the model alone.
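One way to make these components concrete is to treat the harness as an explicit configuration record, so that two runs can be compared axis by axis. The sketch below is illustrative only: the `HarnessConfig` fields mirror the five components listed above, but the field names, the example values, and the helpers `single_axis_variants` and `changed_axes` are assumptions and do not come from the paper.

```python
from dataclasses import dataclass, fields, replace


@dataclass(frozen=True)
class HarnessConfig:
    """Hypothetical record of the components that make up one system harness."""
    model: str
    decomposition: str       # how goals are split into verifiable units
    agent_tools: tuple       # tools exposed to the agent harness
    environment: str         # runtime, services, repository snapshot
    context_strategy: str    # how context is curated per invocation
    feedback_tiers: tuple    # e.g. ("inner", "middle", "outer")


def single_axis_variants(base, axis, values):
    """Return copies of `base` that differ from it in exactly one field."""
    if axis not in {f.name for f in fields(base)}:
        raise ValueError(f"unknown harness axis: {axis}")
    return [replace(base, **{axis: value}) for value in values]


def changed_axes(a, b):
    """List the field names on which two configurations differ."""
    return [f.name for f in fields(a) if getattr(a, f.name) != getattr(b, f.name)]


# Placeholder values, chosen only to exercise the helpers.
base = HarnessConfig(
    model="model-a",
    decomposition="flat",
    agent_tools=("shell", "editor"),
    environment="container-v1",
    context_strategy="retrieval",
    feedback_tiers=("inner",),
)
variants = single_axis_variants(base, "context_strategy", ["none", "full-repo"])
```

Holding every field but one fixed is what allows a score difference to be attributed to that component rather than to the model.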
Analysis of Benchmark Misalignment
The paper identifies three central symptoms that together render current benchmarking paradigms inadequate for agentic systems.
Conflation of Model and Harness
Conventional leaderboards (e.g., TerminalBench, SWE-Bench, SWE-Lancer) report a single performance metric attributed to a “model” without explicitly defining the system harness, which obscures the substantial variance contributed by scaffold design, orchestration, containerization, and evaluation environment. Empirical results demonstrate ≥20 percentage point swings for the same base model across harnesses and setups, sometimes exceeding the gains from state-of-the-art model progress. Methodologically, this undermines the validity of direct model comparisons: what leaderboards actually compare is whole systems, so observed differences are substantially misattributed to the model.
Recommendation: Mandatory harness-aware metadata and submission of ablations along non-model axes would increase evaluation validity and transparency.
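As a rough illustration of what harness-aware metadata enables, the sketch below attaches the scaffold and environment to each leaderboard entry and reports, per model, the spread between its best and worst harness in percentage points. The `RunRecord` schema and `harness_swing` helper are hypothetical, and the numbers are made up for the example; they are not results from the paper.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class RunRecord:
    """One leaderboard entry with the harness made explicit (hypothetical schema)."""
    model: str
    scaffold: str       # agent scaffold or tool-use loop
    environment: str    # container image or sandbox identifier
    pass_rate: float    # fraction of tasks resolved, in [0, 1]


def harness_swing(records):
    """Per model, return max minus min pass rate across harnesses, in percentage points."""
    by_model = defaultdict(list)
    for record in records:
        by_model[record.model].append(record.pass_rate)
    return {
        model: round((max(rates) - min(rates)) * 100, 1)
        for model, rates in by_model.items()
    }


# Made-up numbers for illustration only.
records = [
    RunRecord("model-a", "scaffold-x", "env-1", 0.42),
    RunRecord("model-a", "scaffold-y", "env-1", 0.63),
    RunRecord("model-b", "scaffold-x", "env-1", 0.50),
]
swings = harness_swing(records)
```

A model that appears with only one harness, like `model-b` here, has a swing of zero by construction, which is itself a signal that its score cannot yet be separated from its harness.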
Single Reference Solution Anchoring
Most benchmarks grade outcomes against a single “gold” reference or its derivative test suite (e.g., in SWE-Bench), penalizing alternative solutions that are functionally valid or even superior. This approach is not robust to the inherent multiplicity of solutions that arise in agentic code synthesis, especially for open-ended or design-intensive tasks (e.g., architectural refactoring, API evolution). The narrow evaluation scope ignores the broader invariants that characterize software quality, such as abstraction adherence, code reuse, system architecture, and maintainability.
Recent studies reveal that (i) solution leakage, (ii) test suite incompleteness, and (iii) metric divergence from ground-truth developer assessments are common in current benchmarks, further exacerbating construct validity issues.
Recommendation: Adoption of multi-shape verifiers, property testing, and behavioral specifications decoupled from specific implementations to reflect functional intent and design desiderata.
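The difference between reference anchoring and a behavioral specification can be shown on a toy task, "return the k largest items", where the output order is left unspecified. The sketch below is an assumption-laden illustration, not the paper's verifier design: the task, both implementations, and the two verifiers are invented for the example, and the property check uses only the standard library rather than a dedicated property-testing framework.

```python
import heapq
import random
from collections import Counter


def reference_top_k(xs, k):
    """The 'gold' solution: the k largest items in descending order."""
    return sorted(xs, reverse=True)[:k]


def alternative_top_k(xs, k):
    """An equally valid solution that happens to return ascending order."""
    return sorted(heapq.nlargest(k, xs))


def _cases(trials, seed):
    """Yield one fixed case where order matters, then seeded random cases."""
    rng = random.Random(seed)
    yield [3, 1, 2], 2
    for _ in range(trials):
        xs = [rng.randint(-50, 50) for _ in range(rng.randint(0, 20))]
        yield xs, rng.randint(0, len(xs))


def exact_match_verifier(candidate, trials=200, seed=0):
    """Reference-anchored grading: output must equal the gold output exactly."""
    return all(
        candidate(list(xs), k) == reference_top_k(xs, k)
        for xs, k in _cases(trials, seed)
    )


def property_verifier(candidate, trials=200, seed=0):
    """Behavioral grading: check the specification, not one implementation."""
    for xs, k in _cases(trials, seed):
        out = candidate(list(xs), k)
        if len(out) != k:
            return False                      # must return exactly k items
        if Counter(out) - Counter(xs):
            return False                      # items must come from the input
        rest = Counter(xs) - Counter(out)
        if out and rest and min(out) < max(rest):
            return False                      # nothing left out may be larger
    return True
```

Under these assumptions, `exact_match_verifier(alternative_top_k)` rejects the alternative because of ordering alone, while `property_verifier` accepts both implementations and still rejects an incorrect one such as `lambda xs, k: xs[:k]`.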
Lack of Component-Level Signal
End-to-end metrics from agent runs provide no actionable attribution for diagnosing or evolving system harnesses. Failures are opaque with respect to which component (context construction, tool selection, feedback mechanism, etc.) may be responsible. Without decomposed evaluation, improvement cycles devolve into expensive, intuition-driven ablations, impeding systematic progress and optimization.
Recommendation: Evaluation of individual harness components in isolation (e.g., context utility, verifier robustness, agent compliance with invariants) to support targeted system evolution—analogous to the established practice of orthogonal unit and integration tests in software engineering.
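As one possible instance of isolated component evaluation, context curation can be scored without running the agent at all, by comparing the files a curation step selected against the files that a known fix actually touched. The sketch below assumes such per-task ground truth is available; the `context_scores` function and the file names are hypothetical.

```python
def context_scores(retrieved, relevant):
    """Return (precision, recall) of a context-curation step for one task.

    retrieved: file paths the harness placed in the agent's context
    relevant:  file paths known to matter for the task (assumed ground truth)
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 1.0
    return precision, recall


# Hypothetical task: four files retrieved, two files actually needed.
precision, recall = context_scores(
    retrieved=["api/routes.py", "api/models.py", "README.md", "setup.py"],
    relevant=["api/routes.py", "db/schema.py"],
)
```

A low recall here points at context construction before any end-to-end run is spent, which is the kind of attribution a single pass rate cannot provide.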
Implications and Future Directions
The misalignment between current benchmarks and the realities of agentic software engineering has several practical and theoretical implications:
- Practical: Deployment decisions based on single-score, model-centric benchmarking signals may lead to suboptimal system choices and hinder progress in harness optimization. Operationalizing robust measurement will require significant benchmark redesign, the introduction of new metadata protocols, and the development of verifiers capable of grading design-level and behavioral invariants.
- Theoretical: Benchmarking agentic systems is formally a measurement challenge in the sense of social science measurement theory: operationalizations (test suites, oracle definitions) must faithfully reflect target constructs (effective software engineering, system design quality) without encoding implementation particulars. This echoes Wallach et al.’s recent arguments distinguishing construct validity and operationalization in GenAI evaluation (Wallach et al., 1 Feb 2025).
Open research problems include:
- Automated construction of verifiers for design invariants and structural properties
- Specifying tasks and feedback protocols that incentivize and measure generalization and system-level code quality, not only passing reference tests
- Developing harness-specific evaluation methods and meta-benchmark repositories
Recent community efforts, e.g., SkillsBench (Li et al., 13 Feb 2026), ProgramBench (Yang et al., 5 May 2026), and Meta-Harness (Lee et al., 30 Mar 2026), demonstrate movement in this direction, but a comprehensive, widely adopted protocol remains elusive.
Conclusion
Coding benchmarks designed for pre-agentic paradigms are inadequate for measuring progress in agentic software engineering. They systematically conflate model and harness, obscure solution multiplicity, and fail to provide decomposable, actionable signals. The field must adopt harness-aware, component-centric, and behaviorally grounded evaluation protocols. Progress hinges on solving the operationalization gap: specifying what agentic systems should accomplish in a measurable but implementation-agnostic way. Absent such re-alignment, future benchmarks risk grading advancement only in proximity to past solutions rather than in realized improvements to the engineering process.