- The paper introduces a novel benchmark that evaluates coding agents' ability to sustain correctness over 100 interaction turns.
- It employs a domain-agnostic, fully programmatic REST API framework to generate tasks, verify changes, and isolate harness effects.
- The findings reveal that even advanced models rapidly degrade without effective feedback and retry strategies, highlighting the need for robust error recovery.
Authoritative Summary of "StaminaBench: Stress-Testing Coding Agents over 100 Interaction Turns" (2606.19613)
Motivation and Benchmark Framework
The paper introduces StaminaBench, a rigorous benchmark designed to evaluate coding agents under long-horizon, multi-turn scenarios mimicking real-world iterative software development. It departs from prevailing evaluation paradigms (e.g., HumanEval, SWE-Bench) that assess agent performance on single, self-contained tasks. Instead, StaminaBench systematically generates REST API server implementation tasks that evolve through up to 100 procedurally defined change requests. This measures agent stamina: the capability to maintain correctness across as many as 100 interaction turns while confronting compounding complexity and accumulated context.
The benchmark methodology is grounded in a general domain-agnostic framework: state space S is parameterized as OpenAPI-like schemas; initial states are sampled programmatically; actions (change requests) are strictly typed; transitions deterministically update the reference schema; and correctness is verified via fully programmatic, language-agnostic HTTP-based test suites. Crucially, there is no LLM dependence in scenario, change, or test generation, ensuring reproducibility and result fidelity.
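The framework described above can be illustrated with a minimal sketch. The names `Schema`, `AddField`, `RenameField`, and `apply_change` are hypothetical and chosen for illustration only; the paper's actual schema representation and change types are not reproduced here. The sketch assumes a state that maps resources to typed fields and shows how strictly typed change requests can update a reference schema deterministically.

```python
from dataclasses import dataclass, field
from typing import Dict, Union


@dataclass(frozen=True)
class Schema:
    """Hypothetical, simplified stand-in for an OpenAPI-like schema state."""
    # resource name -> {field name -> type name}
    resources: Dict[str, Dict[str, str]] = field(default_factory=dict)


@dataclass(frozen=True)
class AddField:
    """Typed change request: add a new field to a resource."""
    resource: str
    name: str
    type_name: str


@dataclass(frozen=True)
class RenameField:
    """Typed change request: rename an existing field of a resource."""
    resource: str
    old: str
    new: str


Change = Union[AddField, RenameField]


def apply_change(schema: Schema, change: Change) -> Schema:
    """Deterministic transition: return a new schema, leaving the input untouched."""
    # Copy every nested dict so the previous state stays intact.
    resources = {name: dict(fields) for name, fields in schema.resources.items()}
    fields = resources[change.resource]
    if isinstance(change, AddField):
        if change.name in fields:
            raise ValueError(f"field {change.name!r} already exists")
        fields[change.name] = change.type_name
    elif isinstance(change, RenameField):
        fields[change.new] = fields.pop(change.old)
    else:
        raise TypeError(f"unsupported change: {change!r}")
    return Schema(resources)
```

Because each transition is a pure function of the previous state and the change, replaying the same change sequence always yields the same reference schema, which is the property that lets tests be derived without an LLM in the loop.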
Experimental Setup and Evaluation
The study evaluates seven open-source LLMs (24B–744B parameters) and six agent harnesses across 20 scenarios, each extending over 100 turns. Each coding agent operates in an isolated Docker container, receives only a natural-language description of the requirements and changes, and is optionally given test feedback. Implementations may be written in Python, JavaScript, or Rust, and testing is performed exclusively through the REST API endpoints.
Metrics include:
- Average interaction turns passed before failure
- Pass rate (completion of all 100 turns)
- Cost efficiency measured by API token consumption.
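As a rough illustration of how these metrics could be aggregated, the sketch below assumes each run is recorded as a `(turns_passed, tokens_used)` pair. The function name `summarize` and the "tokens per passed turn" normalization are assumptions made for this example; the summary does not state the paper's exact cost formula.

```python
from statistics import mean
from typing import Dict, List, Tuple


def summarize(runs: List[Tuple[int, int]], total_turns: int = 100) -> Dict[str, float]:
    """Aggregate per-run results given as (turns_passed, tokens_used) pairs."""
    turns = [t for t, _ in runs]
    tokens = [k for _, k in runs]
    return {
        # Average number of turns passed before the first unrecovered failure.
        "avg_turns_passed": mean(turns),
        # Fraction of runs that completed every turn.
        "pass_rate": sum(t == total_turns for t in turns) / len(runs),
        # Assumed cost metric: tokens spent per successfully passed turn.
        "tokens_per_passed_turn": sum(tokens) / max(sum(turns), 1),
    }
```

For example, `summarize([(100, 2000), (6, 300), (14, 700)])` averages 40 turns passed, with one of three runs completing all 100 turns.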
Ablations are performed on feedback granularity (minimal, medium, detailed), retry budgets, scenario generation approaches (LLM-driven vs programmatic), implementation language, and agent harness.
Core Findings and Numerical Results
Early Failure and Limitations in Stamina
Without feedback and retries, all tested models fail within an average of 5–6 turns, including state-of-the-art models (GLM-5: 6.2 turns with best harness). This reveals a fundamental brittleness in multi-turn, vibe-coding-style programming even under clear and exhaustive instructions. The majority of failures arise from incomplete or incorrect implementation of changes, context loss, and validation errors.
Impact of Feedback and Retry Strategies
Enabling test feedback and a retry loop (R=2) yields dramatic improvements: stronger models see up to a 12x increase in turns completed (GLM-5: 57.0 turns with detailed feedback and the OpenCode harness). The gain is largest for detailed assertion-level feedback, which is rarely available in practice. Raising the retry budget to as many as 10 attempts yields further improvement, but the gains plateau after 3–5 attempts.
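The feedback-and-retry protocol can be sketched as the loop below. This is a simplified illustration, not the benchmark's harness code: `run_session`, `attempt`, and `run_tests` are hypothetical names, and the sketch assumes that a retry budget R means R additional attempts after the first one, which the summary does not specify.

```python
from typing import Callable, List, Optional, Sequence


def run_session(
    changes: Sequence[str],
    attempt: Callable[[str, Optional[List[str]]], None],
    run_tests: Callable[[], List[str]],
    retries: int = 2,
) -> int:
    """Return how many turns pass before the first unrecovered failure.

    attempt(change, feedback) asks the agent to implement a change; feedback
    is None on the first try and the list of failing checks on each retry.
    run_tests() returns the failing checks (an empty list means all pass).
    """
    passed = 0
    for change in changes:
        feedback: Optional[List[str]] = None
        for _ in range(1 + retries):  # assumed: one initial try plus `retries` retries
            attempt(change, feedback)
            failures = run_tests()
            if not failures:
                break  # turn passed, move on to the next change request
            feedback = failures  # hand the test output back to the agent
        else:
            return passed  # retry budget exhausted, session ends here
        passed += 1
    return passed
```

With `retries=0` the loop reduces to the no-feedback setting, where a single failed turn ends the session.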
Harness Influence and Agent Infrastructure
Agent harness effects are substantial: OpenCode is consistently the best-performing harness and OpenHands the worst, with up to a 6x difference in completed turns across harnesses for the same model. Harnesses supplied by model providers do not reliably perform better. Infrastructure errors (tool misuse, self-kill via pkill, stuck loops) become more frequent as retry budgets grow, emerging as a fundamental bottleneck alongside model limitations.
Language and Sampling Strategy Effects
Implementation language has a small effect: JavaScript shows a positive but statistically non-significant trend for GLM-5 and Kimi K2.5, while Rust performance is lower, reflecting disparities in model pretraining coverage. LLM-driven and programmatic scenario/transition sampling exhibit similar failure modes and task difficulty.
Failure Taxonomy and Context Compression Pathology
Detailed failure analysis classifies errors into missing features, validation errors, cascade deletion bugs, improper renames, regressions, hallucinated features, endpoint mismatches, type and default value mishandling, server crashes, stuck loops, agent suicide, and invalid tool calls. As session length increases, agents exhibit context compression failures, often disregarding critical instructions, hallucinating unrequested changes, and breaking previously functional components. Infrastructure failures (e.g., harness process kill) dominate at extended retry allocations.
Theoretical and Practical Implications
StaminaBench demonstrates that multi-turn performance is a fundamentally distinct capability from single-task proficiency; no current agent is able to sustain correctness for even modestly long sessions. The benchmark underscores the need for research in:
- Reliable context management and repeated instruction retention
- Robust harness development and error recovery
- Model architectures tailored for iterative, compounding codebase modifications
- Scalable language-agnostic evaluation and black-box test automation
From a practical standpoint, industry-grade coding assistants remain profoundly limited in real iterative workflows, especially when detailed test feedback is unavailable. The benchmark quantitatively reframes expectations for agent adoption in production-level software engineering.
Future Directions
StaminaBench’s modular framework is extensible to other domains with procedurally generable interfaces, enabling comparative longitudinal studies of agentic stamina across complex tasks. As model and harness development advances, StaminaBench can serve as a diagnostic layer that separates compositional, context-adaptive reasoning (and infrastructure reliability) from mere code generation proficiency. Hybrid agent architectures, improved memory mechanisms, and active state tracking are key research frontiers. Closed-source comparisons are currently hampered by API costs but would be informative for wider benchmarking.
Conclusion
StaminaBench establishes a novel lens for stress-testing coding agent reliability in realistic, multi-turn settings. Agents demonstrate rapid degradation amidst iterative complexity, with performance contingent on feedback, harness quality, and infrastructure robustness. The benchmark, released for community use, is positioned as an essential tool for advancing agentic software engineering research and bridging the gap between static task benchmarks and longitudinal, session-based workflows (2606.19613).