StaminaBench: Stress-Testing Coding Agents over 100 Interaction Turns

Published 17 Jun 2026 in cs.SE and cs.AI | (2606.19613v1)

Abstract: We introduce StaminaBench, a benchmark that measures the stamina of coding agents: how many consecutive interaction turns (change requests) they can handle before failing. Unlike the prevailing fraction-of-tasks-solved metric, this matches real vibe-coding where sessions run dozens or hundreds of turns. In StaminaBench, agents implement a REST API server and modify it across a tunable number of procedurally generated follow-up change requests - 100 in our experiments, resulting in codebases of up to 6,000 lines. Tests are generated fully programmatically without LLM involvement, ensuring reproducibility and reliability; change sequences are drawn from either a hardcoded or LLM-driven sampler, both constrained to a structured action space to ensure changes are valid. The agent and the server run in an isolated environment and communicate with the benchmark through HTTP, making testing fully black-box and language-agnostic. We evaluate six agent harnesses paired with seven open-source LLMs across 20 scenarios of 100 turns each and find that: (1) all the tested models fail within 5-6 turns, confirming that vibe-coding-style programming without thorough testing produces bugs; (2) passing test feedback back to the agent and allowing it to retry improves passed turn count by up to 12x; and (3) a good harness is required for strong performance: stronger models exhibit up to a 6x gap between their best and worst harness, while weaker models fail with any harness. We release the benchmark and the generated tasks to enable further research into multi-turn coding agent behavior. Benchmark code and data: github.com/amazon-science/StaminaBench.


Authors (5)

Summary

The paper introduces a novel benchmark that evaluates coding agents' ability to sustain correctness over 100 interaction turns.
It employs a domain-agnostic, fully programmatic REST API framework to generate tasks, verify changes, and isolate harness effects.
The findings reveal that even advanced models rapidly degrade without effective feedback and retry strategies, highlighting the need for robust error recovery.

Authoritative Summary of "StaminaBench: Stress-Testing Coding Agents over 100 Interaction Turns" (2606.19613)

Motivation and Benchmark Framework

The paper introduces StaminaBench, a benchmark designed to evaluate coding agents in long-horizon, multi-turn scenarios that mimic real-world iterative software development. It departs from prevailing evaluation paradigms (e.g., HumanEval, SWE-Bench) that assess agent performance on single, self-contained tasks. Instead, StaminaBench procedurally generates REST API server implementation tasks that evolve through a tunable number of change requests (100 in the paper's experiments). This measures agent stamina: the ability to maintain correctness across many consecutive interaction turns while complexity compounds and context accumulates.

The benchmark methodology is grounded in a general, domain-agnostic framework: the state space $S$ is parameterized as OpenAPI-like schemas; initial states are sampled programmatically; actions (change requests) are strictly typed; transitions deterministically update the reference schema; and correctness is verified via fully programmatic, language-agnostic HTTP-based test suites. Crucially, test generation involves no LLM, which ensures reproducibility and reliability. Change sequences are drawn from either a hardcoded or an LLM-driven sampler, and both are constrained to the same structured action space so that every change is valid.
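To make the framework concrete, here is a minimal sketch of a typed action space with deterministic transitions over a schema. It is an illustration only: the `Schema` layout, the three action classes, and the `is_valid` and `apply_action` helpers are hypothetical names chosen for this example, not the benchmark's actual data model.

```python
from dataclasses import dataclass
from typing import Dict, Union

# Hypothetical state: resource name -> {field name -> type name}.
Schema = Dict[str, Dict[str, str]]


@dataclass(frozen=True)
class AddField:
    resource: str
    name: str
    field_type: str


@dataclass(frozen=True)
class RenameField:
    resource: str
    old: str
    new: str


@dataclass(frozen=True)
class RemoveField:
    resource: str
    name: str


Action = Union[AddField, RenameField, RemoveField]


def is_valid(schema: Schema, action: Action) -> bool:
    """Check that an action can be applied to the current schema."""
    fields = schema.get(action.resource)
    if fields is None:
        return False
    if isinstance(action, AddField):
        return action.name not in fields
    if isinstance(action, RenameField):
        return action.old in fields and action.new not in fields
    return action.name in fields  # RemoveField


def apply_action(schema: Schema, action: Action) -> Schema:
    """Deterministic transition: return a new schema, leaving the input untouched."""
    if not is_valid(schema, action):
        raise ValueError(f"invalid action: {action}")
    new_schema = {res: dict(fields) for res, fields in schema.items()}
    fields = new_schema[action.resource]
    if isinstance(action, AddField):
        fields[action.name] = action.field_type
    elif isinstance(action, RenameField):
        fields[action.new] = fields.pop(action.old)
    else:
        del fields[action.name]
    return new_schema
```

Because any sampler, hardcoded or LLM-driven, can only emit actions that pass `is_valid`, the reference schema stays well-formed after every turn, and tests can be derived from it mechanically.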

Experimental Setup and Evaluation

The study evaluates seven open-source LLMs (24B–744B parameters) paired with six agent harnesses across 20 scenarios of 100 turns each. Each coding agent runs in an isolated Docker environment, receives only a natural-language description of the requirements and changes, and is optionally given test feedback. Implementations may be written in several languages (Python, JavaScript, Rust); testing is performed exclusively through the server's REST API endpoints.
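As a rough illustration of what black-box, language-agnostic testing looks like, the sketch below checks a server purely over HTTP. Everything in it is assumed for the example: the `/books` endpoint, the `ToyHandler` stand-in server, and the `check_list_endpoint` helper are hypothetical and do not come from the benchmark.

```python
import json
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer


class ToyHandler(BaseHTTPRequestHandler):
    """Stand-in for an agent-written server; the test never inspects its code."""

    def do_GET(self):
        if self.path == "/books":
            body = json.dumps([{"title": "Dune"}]).encode()
            self.send_response(200)
            self.send_header("Content-Type", "application/json")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)
        else:
            self.send_response(404)
            self.send_header("Content-Length", "0")
            self.end_headers()

    def log_message(self, *args):
        pass  # keep the demo quiet


def check_list_endpoint(base_url, path, required_fields):
    """Return True if GET path answers 200 and every item has the required fields."""
    # Bypass any proxy settings so the request goes straight to the local server.
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    with opener.open(base_url + path, timeout=5) as resp:
        if resp.status != 200:
            return False
        items = json.loads(resp.read())
    return all(field in item for item in items for field in required_fields)


def run_demo():
    """Start the toy server on a free port and run two checks against it."""
    server = ThreadingHTTPServer(("127.0.0.1", 0), ToyHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    base_url = f"http://127.0.0.1:{server.server_address[1]}"
    try:
        return (
            check_list_endpoint(base_url, "/books", ["title"]),
            check_list_endpoint(base_url, "/books", ["isbn"]),
        )
    finally:
        server.shutdown()
        server.server_close()
```

Since the check only speaks HTTP, the same test would apply unchanged to a server written in Python, JavaScript, or Rust.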

Metrics include:

Average interaction turns passed before failure
Pass rate (completion of all 100 turns)
Cost efficiency, measured by API token consumption
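The first two metrics can be sketched as follows, assuming each scenario is recorded as a list of per-turn pass/fail booleans. That recording format and the function names are assumptions made for this example; token-based cost is left out because it depends on provider-specific accounting.

```python
from statistics import mean

TOTAL_TURNS = 100  # turns per scenario in the paper's experiments


def turns_passed(results):
    """Count consecutive passing turns before the first failure."""
    count = 0
    for ok in results:
        if not ok:
            break
        count += 1
    return count


def summarize(scenarios, total_turns=TOTAL_TURNS):
    """Aggregate per-scenario results into the two headline metrics."""
    passed = [turns_passed(results) for results in scenarios]
    return {
        "avg_turns_passed": mean(passed),
        # Fraction of scenarios in which every turn passed.
        "pass_rate": sum(p == total_turns for p in passed) / len(passed),
    }
```

Note that under this reading a turn that passes after an earlier failure does not count: the metric tracks the unbroken run from the start of the session.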

Ablations are performed on feedback granularity (minimal, medium, detailed), retry budgets, scenario generation approaches (LLM-driven vs programmatic), implementation language, and agent harness.

Core Findings and Numerical Results

Early Failure and Limitations in Stamina

Without feedback and retries, all tested models fail within roughly 5–6 turns on average, including state-of-the-art models (GLM-5: 6.2 turns with its best harness). This reveals a fundamental brittleness in multi-turn, vibe-coding-style programming, even when instructions are clear and exhaustive. Most failures arise from incomplete or incorrect implementation of changes, context loss, and validation errors.

Impact of Feedback and Retry Strategies

Enabling test feedback and a retry loop (R=2) yields dramatic improvements: stronger models see up to a 12x increase in turns completed (GLM-5: 57.0 turns with detailed feedback and the OpenCode harness). The gain is largest for detailed, assertion-level feedback, which is rarely available in practice. Raising the retry budget up to 10 helps further, but the gains flatten after 3–5 attempts.
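A feedback-and-retry loop of this kind might look like the sketch below. It assumes that R counts retries after the first attempt and that feedback is a plain string; both are assumptions, as are the `run_session`, `agent_step`, and `run_tests` names, which are hypothetical callables rather than the benchmark's interface.

```python
from typing import Callable, List, Optional


def run_session(
    requests: List[str],
    agent_step: Callable[[str, Optional[str]], None],
    run_tests: Callable[[int], Optional[str]],
    max_retries: int = 2,
) -> int:
    """Return the number of consecutive turns passed.

    agent_step(request, feedback) lets the agent edit the codebase;
    run_tests(turn) returns None on success or a failure report otherwise.
    """
    for turn, request in enumerate(requests):
        agent_step(request, None)  # first attempt, no feedback yet
        failure = run_tests(turn)
        retries = 0
        while failure is not None and retries < max_retries:
            agent_step(request, failure)  # retry with the test report
            failure = run_tests(turn)
            retries += 1
        if failure is not None:
            return turn  # session ends at the first unrecovered turn
    return len(requests)
```

With `max_retries=0` this reduces to the no-feedback setting, so the same loop covers both regimes and the retry budget becomes a single ablation parameter.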

Harness Influence and Agent Infrastructure

Agent harness effects are substantial: OpenCode is consistently the best-performing harness and OpenHands the worst, with up to a 6x difference in completed turns across harnesses for the same model. Harnesses supplied by model providers do not reliably perform better. Infrastructure errors (tool misuse, self-kill via pkill, stuck loops) appear more often as the retry budget grows, emerging as a fundamental bottleneck alongside model limitations.

Language and Sampling Strategy Effects

JavaScript shows a positive but statistically non-significant trend for GLM-5 and Kimi K2.5; Rust performance is lower, which the summary attributes to disparities in model pretraining coverage. LLM-driven and programmatic scenario/transition sampling exhibit similar failure modes and task difficulty.

Failure Taxonomy and Context Compression Pathology

Detailed failure analysis classifies errors into missing features, validation errors, cascade deletion bugs, improper renames, regressions, hallucinated features, endpoint mismatches, type and default value mishandling, server crashes, stuck loops, agent suicide, and invalid tool calls. As session length increases, agents exhibit context compression failures, often disregarding critical instructions, hallucinating unrequested changes, and breaking previously functional components. Infrastructure failures (e.g., harness process kill) dominate at extended retry allocations.

Theoretical and Practical Implications

StaminaBench demonstrates that multi-turn performance is a distinct capability from single-task proficiency: without feedback and retries, none of the evaluated agents sustains correctness for even modestly long sessions. The benchmark underscores the need for research in:

Reliable context management and repeated instruction retention
Robust harness development and error recovery
Model architectures tailored for iterative, compounding codebase modifications
Scalable language-agnostic evaluation and black-box test automation

From a practical standpoint, industry-grade coding assistants remain profoundly limited in real iterative workflows, especially when detailed test feedback is unavailable. The benchmark quantitatively reframes expectations for agent adoption in production-level software engineering.

Future Directions

StaminaBench’s modular framework is extensible to other domains with procedurally generable interfaces, enabling comparative longitudinal studies of agentic stamina across complex tasks. As model and harness development advances, StaminaBench can serve as a diagnostic layer that separates compositional, context-adaptive reasoning (and infrastructure reliability) from mere code generation proficiency. Hybrid agent architectures, improved memory mechanisms, and active state tracking are key research frontiers. Closed-source comparisons are currently hampered by API costs but would be informative for wider benchmarking.

Conclusion

StaminaBench establishes a novel lens for stress-testing coding agent reliability in realistic, multi-turn settings. Agents demonstrate rapid degradation amidst iterative complexity, with performance contingent on feedback, harness quality, and infrastructure robustness. The benchmark, released for community use, is positioned as an essential tool for advancing agentic software engineering research and bridging the gap between static task benchmarks and longitudinal, session-based workflows (2606.19613).
