Stateful SWE-Bench Evaluation

Updated 3 July 2026

Stateful SWE-bench is a suite of software engineering benchmarks that incorporate persistent, evolving state to simulate real-world multi-turn coding challenges.
These protocols formalize tasks as Markov Decision Processes, modeling progressive state transitions across environment setup, code implementation, and test generation phases.
Empirical studies demonstrate that incorporating user modeling, continual learning, and contrast-based metrics improves evaluation rigor and advances autonomous coding agent research.

Stateful SWE-bench is a class of software engineering (SWE) benchmarks and evaluation protocols that incorporate persistent, temporally evolving state (in the user/interaction context, the system environment, or the reasoning history) into the assessment of autonomous coding agents. Unlike stateless, one-shot code-generation challenges, stateful SWE-bench settings require agents to model or leverage accumulated context, maintain or manipulate execution/workspace state, and reason over multi-turn or multi-phase solution trajectories. Modern stateful SWE-bench variants encompass multi-session user modeling, end-to-end environment–implementation–verification pipelines, continual-learning task streams, and state-aware benchmarking methodologies.

1. Conceptualization and Motivation

Stateful SWE-bench emerges in response to the inadequacy of traditional stateless evaluations for realistic, autonomous software agent assessment. Legacy benchmarks such as HumanEval, MBPP, and the original SWE-bench (Jimenez et al., 2023) treat each task as an independent, atomic event: the agent receives a textual issue description and a static codebase snapshot, produces code edits, and is immediately scored by a test suite. These paradigms do not reflect the ongoing nature of real-world software engineering, where agents must:

Retain and exploit user or system preferences across sessions
Coordinate sequential development phases (e.g., environment setup, implementation, test generation)
Transfer knowledge between temporally ordered tasks
Manage the cumulative effects of actions in dynamic interactive settings

Stateful SWE-bench protocols thus simulate or instrument these longitudinal phenomena, enabling rigorous evaluation of agent capabilities in realistically evolving contexts (Zhou et al., 24 Oct 2025, Guan et al., 13 May 2026, Joshi et al., 13 Jun 2025, Melis, 15 Jun 2026).

2. Formal Task Definitions and State Representations

Leading stateful SWE-bench protocols formally specify tasks as Markov Decision Processes (MDPs), with domain-specific definitions of state $S$, action space $A$, and deterministic or stochastic transition functions $T$. The "SWE-Cycle" benchmark (Guan et al., 13 May 2026) exemplifies this approach:

Environment Reconstruction (Env):
- $S_{env}$: On-disk repository snapshot (files, folder tree), no dependencies
- $A_{env}$: System/environment mutating commands (e.g., install, write, configure)
- $T_{env}(s_t, a_t) = s_{t+1}$: Transition to the state where the side effects of $a_t$ are applied
Code Implementation (Impl):
- $S_{impl}$: Pre-configured codebase, issue description $I$, test suite $T_{gold}$
- $A_{impl}$: Patch operations over AST or text (insert, delete, modify code)
- $T_{impl}(s_t, a_t) = s_{t+1}$: Updated codebase after applying $a_t$
Verification Test Generation (TestGen):
- $S_{test}$: Patched codebase, issue description $I$, reference tests
- $A_{test}$: Add/modify test files, configure runners
- $T_{test}(s_t, a_t) = s_{t+1}$: State updated to contain agent-authored discriminative tests
FullCycle (End-to-End):
- $S_{full}$: Bare repository, issue description $I$, empty execution environment
- $A_{full}$: $A_{env} \cup A_{impl} \cup A_{test}$ in a unified, uninterrupted session
- $T_{full}$ for the entire sequence, storing all state in a single container with zero external resets

State passing is explicit: each phase serializes its output (environment snapshot, repo state, test suite) as input for downstream phases, enforcing strict continuity and prohibiting human intervention.
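The explicit state passing described above can be illustrated with a minimal Python sketch. It is not the SWE-Cycle harness: `WorkspaceState`, `install`, `write_file`, `add_test`, and `run_full_cycle` are hypothetical names introduced here, and the "environment" is an in-memory record rather than a real container. The sketch only shows the shape of the idea: each action is a transition $T(s_t, a_t) = s_{t+1}$, and each phase starts from the state the previous phase produced, with no reset in between.

```python
from dataclasses import dataclass, replace
from typing import Callable, Dict, List, Tuple


@dataclass(frozen=True)
class WorkspaceState:
    """Hypothetical snapshot handed from phase to phase (not the SWE-Cycle schema)."""
    files: Tuple[Tuple[str, str], ...] = ()   # (path, content) pairs
    installed: Tuple[str, ...] = ()           # dependencies installed so far
    tests: Tuple[str, ...] = ()               # agent-authored test files


# An action maps one state to the next: T(s_t, a_t) = s_{t+1}.
Action = Callable[[WorkspaceState], WorkspaceState]


def install(pkg: str) -> Action:
    """Env-phase action: record a dependency as installed."""
    return lambda s: replace(s, installed=s.installed + (pkg,))


def write_file(path: str, content: str) -> Action:
    """Impl-phase action: create or overwrite a file."""
    def apply(s: WorkspaceState) -> WorkspaceState:
        kept = tuple((p, c) for p, c in s.files if p != path)
        return replace(s, files=kept + ((path, content),))
    return apply


def add_test(path: str) -> Action:
    """TestGen-phase action: register an agent-authored test file."""
    return lambda s: replace(s, tests=s.tests + (path,))


def run_full_cycle(
    initial: WorkspaceState,
    phases: List[Tuple[str, List[Action]]],
) -> Tuple[WorkspaceState, Dict[str, WorkspaceState]]:
    """Run phases in order; each phase starts from the previous phase's output."""
    state = initial
    snapshots: Dict[str, WorkspaceState] = {}
    for name, actions in phases:
        for action in actions:
            state = action(state)
        snapshots[name] = state  # serialized hand-off point for the next phase
    return state, snapshots
```

Calling `run_full_cycle` with `env`, `impl`, and `testgen` phases returns the final state plus one snapshot per phase, so a harness could score each phase on exactly the state the agent left behind.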

3. User and Memory Modeling in Stateful Interactions

Stateful SWE-bench methodologies extend state tracking to user-agent interaction history and persistent user preferences. The ToM-SWE framework (Zhou et al., 24 Oct 2025) introduces:

Developer Profiles ($P$): Encapsulate interaction traits (verbosity, question timing), coding preferences (frameworks, libraries)
Session Histories ($H$): Sets of prior user-agent transcripts, exposing temporal dependencies
Persistent User Model ($M$): Inferred via theory-of-mind (ToM) agents; consumed by the SWE agent when choosing actions

A typical policy incorporates both in-session context ($c_t$) and long-term user modeling:

$a_t \sim \pi(\cdot \mid c_t, M)$

Agents are evaluated on their ability to (1) infer persistent preferences, (2) respect user styles, and (3) minimize unnecessary clarifications. The interaction is typically simulated by a profile-conditioned user simulator and task generator, with satisfaction and efficiency metrics scored by LLM-powered evaluators.
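A policy that combines in-session context with a persistent user model can be sketched as follows. This is a deliberately simple, rule-based stand-in and not the ToM-SWE method, which uses LLM-based theory-of-mind agents: `UserModel`, `choose_action`, the preference keys, and the `min_support` threshold are all assumptions made for illustration. The point is the decision order: prefer what the user said in this session, fall back to preferences inferred from earlier sessions, and ask only when neither is available.

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class UserModel:
    """Hypothetical persistent user model: preference counts across sessions."""
    counts: Dict[str, Counter] = field(default_factory=dict)

    def update(self, session: List[Dict[str, str]]) -> None:
        """Fold one session (a list of stated preferences per turn) into the model."""
        for turn in session:
            for key, value in turn.items():
                self.counts.setdefault(key, Counter())[value] += 1

    def preference(self, key: str, min_support: int = 2) -> Optional[str]:
        """Return the most frequent value if it was seen at least min_support times."""
        if key not in self.counts:
            return None
        value, n = self.counts[key].most_common(1)[0]
        return value if n >= min_support else None


def choose_action(context: Dict[str, str], model: UserModel, key: str) -> str:
    """In-session context wins, then the long-term model, else ask the user."""
    if key in context:
        return f"use:{context[key]}"
    inferred = model.preference(key)
    if inferred is not None:
        return f"use:{inferred}"
    return f"ask:{key}"
```

Under this sketch, an agent that has seen the same testing framework requested in two earlier sessions would use it without asking, which is the behavior the "minimize unnecessary clarifications" criterion rewards.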

4. Evaluation Protocols and Metrics

Stateful SWE-bench evaluation metrics measure phase-resolved correctness, state integrity, and learning/dynamic adaptation:

Phase Scores (SWE-Cycle) (Guan et al., 13 May 2026):
- Static and dynamic sub-scores per instance/phase
- Normalized aggregate score and solved rate
- End-to-end: aggregate score and solved rate computed as means over composite phase scores
Continual Learning Metrics (SWE-Bench-CL) (Joshi et al., 13 Jun 2025):
- Average Accuracy (AA): Mean final performance over sequence
- Forgetting (F): Loss from peak prior performance on earlier tasks
- Forward/Backward Transfer (FT/BWT): Gains/losses on new/prior tasks due to incremental learning
- Composite Continual-Learning Score (CCLS): Weighted sum of the above, with a stability-plasticity $F_\beta$ harmonic term
Stateful Decision-Problem Benchmarking (Melis, 15 Jun 2026):
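- Replaces naive absolute metrics with contrast (delta) estimators, i.e., estimated performance differences between candidate programs, of the form $\hat{\Delta}_{AB} = \hat{\mu}_A - \hat{\mu}_B$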
- Experiments use randomized or blocked trial designs for consistent identification of optimal programs, accounting for uncontrolled environmental state drift
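The continual-learning metrics listed above can be made concrete with a short sketch. It uses the standard definitions from the continual-learning literature over an accuracy matrix `R`, where `R[i][j]` is the accuracy on task `j` after training through task `i`; the exact formulas in SWE-Bench-CL may differ, and the `baseline` argument for forward transfer (accuracy of an untrained reference agent on each task) is an assumption of this sketch.

```python
from typing import Sequence

Matrix = Sequence[Sequence[float]]


def average_accuracy(R: Matrix) -> float:
    """Mean accuracy over all tasks after the final task has been learned."""
    final = R[-1]
    return sum(final) / len(final)


def forgetting(R: Matrix) -> float:
    """Mean drop from the best earlier accuracy on each task to its final accuracy."""
    T = len(R)
    if T < 2:
        return 0.0
    drops = []
    for j in range(T - 1):
        # Peak accuracy on task j between learning it and the step before the last.
        peak = max(R[i][j] for i in range(j, T - 1))
        drops.append(peak - R[-1][j])
    return sum(drops) / len(drops)


def backward_transfer(R: Matrix) -> float:
    """Mean change on earlier tasks between just-learned and final accuracy."""
    T = len(R)
    if T < 2:
        return 0.0
    return sum(R[-1][j] - R[j][j] for j in range(T - 1)) / (T - 1)


def forward_transfer(R: Matrix, baseline: Sequence[float]) -> float:
    """Mean gain on each task before training on it, relative to a baseline."""
    T = len(R)
    if T < 2:
        return 0.0
    return sum(R[j - 1][j] - baseline[j] for j in range(1, T)) / (T - 1)
```

A negative backward transfer together with positive forgetting indicates that learning later tasks degraded earlier ones; a composite score would then weight these quantities against average accuracy.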

5. Implementation Strategies and Agent Architectures

Modern stateful SWE-bench protocols leverage system architectures and memory systems capable of retaining, retrieving, and condensing long-run interaction or reasoning histories:

Graph-based Agent State (LangGraph) (Joshi et al., 13 Jun 2025): Encodes agent observations, actions, plans, and tool invocations as nodes in a persistent, updatable graph structure at each turn
FAISS-based Semantic Memory (Joshi et al., 13 Jun 2025): Stores representations (embeddings) of prior solved tasks, facilitating retrieval of analogous experience for new tasks
Dynamic Reasoning Contexts (SWE-AGILE) (Lian et al., 13 Apr 2026): Maintains a sliding window of detailed reasoning over the $N$ most recent steps, storing earlier reasoning as digests. Compression functions optimize semantic fidelity under token constraints. This approach mitigates context window limitations and preserves deep System-2 reasoning chains
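The sliding-window idea can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the SWE-AGILE implementation: `ReasoningContext` and `truncate_digest` are hypothetical names, and the compressor simply truncates text, whereas a real system would presumably produce a learned or LLM-generated summary.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Callable, Deque, List


def truncate_digest(text: str, max_words: int = 8) -> str:
    """Placeholder compressor: keep the first few words of a reasoning step."""
    words = text.split()
    if len(words) <= max_words:
        return text
    return " ".join(words[:max_words]) + " ..."


@dataclass
class ReasoningContext:
    """Keeps the last `window` steps in full and older steps as digests."""
    window: int = 3
    compress: Callable[[str], str] = truncate_digest
    digests: List[str] = field(default_factory=list)
    recent: Deque[str] = field(default_factory=deque)

    def add_step(self, reasoning: str) -> None:
        """Append a step; steps that fall out of the window are compressed."""
        self.recent.append(reasoning)
        while len(self.recent) > self.window:
            self.digests.append(self.compress(self.recent.popleft()))

    def render(self) -> str:
        """Build the prompt context: digests first, then full recent steps."""
        lines = [f"[digest] {d}" for d in self.digests]
        lines += [f"[full] {r}" for r in self.recent]
        return "\n".join(lines)
```

Keeping digests in chronological order ahead of the full-detail window preserves the overall trajectory while bounding the number of tokens spent on old steps.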

Pseudocode implementations exemplify strict state-carryover, memory-prompting, and hybrid retrieval-training structures.
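As one illustration of memory-prompting, the sketch below retrieves prior experiences that resemble a new issue. It is a pure-Python stand-in for a vector index such as FAISS, chosen so that the example runs without extra dependencies: the bag-of-words `embed` function, the `Experience` record, and the `min_score` threshold are assumptions of this sketch, not components of any cited system. Only solved experiences above the similarity threshold are returned, and at most `k` of them.

```python
import math
from collections import Counter
from dataclasses import dataclass
from typing import List, Tuple


def embed(text: str) -> Counter:
    """Toy bag-of-words embedding; a real system would use a learned encoder."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


@dataclass
class Experience:
    """Hypothetical record of a previously attempted task."""
    issue: str
    patch_summary: str
    solved: bool


class SemanticMemory:
    """Stores embedded experiences and retrieves the most similar solved ones."""

    def __init__(self) -> None:
        self._items: List[Tuple[Counter, Experience]] = []

    def add(self, exp: Experience) -> None:
        self._items.append((embed(exp.issue), exp))

    def retrieve(self, query: str, k: int = 2, min_score: float = 0.2) -> List[Experience]:
        """Return up to k solved experiences with similarity >= min_score."""
        q = embed(query)
        scored = [(cosine(q, vec), exp) for vec, exp in self._items if exp.solved]
        scored = [(s, e) for s, e in scored if s >= min_score]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return [e for _, e in scored[:k]]
```

The retrieved `patch_summary` strings would then be placed in the agent's prompt for the new task; capping `k` and filtering on `solved` keeps irrelevant or incorrect history out of the context.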

6. Empirical Findings and Performance Analysis

Empirical studies across benchmarks reveal that:

Cross-phase or cross-session state dependencies substantially increase task difficulty. In SWE-Cycle, the solve rate for FullCycle (simultaneous end-to-end, stateful execution) drops below 14%, even though best-in-class agents achieve 78%–97% per-phase accuracy in isolation. Most of the dynamic boost in the end-to-end setting is offset by cumulative static errors and verification bottlenecks (Guan et al., 13 May 2026).
User modeling and memory improve objective and subjective outcomes. ToM-SWE achieves 59.7% success on stateful scenarios, far exceeding the 18.1% baseline, and earns consistently higher satisfaction scores via persistent preference tracking (Zhou et al., 24 Oct 2025).
Continual learning agents with external memory show improved accuracy, reduced forgetting, and better transfer compared to memoryless baselines. Memory-aware architectures support faster resolution of recurring bug types and more efficient adaptation to evolving codebases, as measured by AA, F, FT, and CCLS (Joshi et al., 13 Jun 2025).
Contrast-based estimators yield consistent decisions even under unobservable, stateful environmental drift, avoiding the misleading bias of traditional averaging protocols (Melis, 15 Jun 2026).
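The last finding can be illustrated with a small worked example. The sketch below is a generic paired, blocked contrast and not the estimator from the cited work: `blocked_contrast` and `naive_difference` are hypothetical helper names, and the scores are made-up numbers chosen so that environmental drift adds the same offset to every measurement taken in a given block.

```python
from statistics import mean
from typing import Sequence


def blocked_contrast(a_scores: Sequence[float], b_scores: Sequence[float]) -> float:
    """Mean within-block difference A - B; any shared per-block offset cancels."""
    if len(a_scores) != len(b_scores):
        raise ValueError("each block needs one score for A and one for B")
    return mean(a - b for a, b in zip(a_scores, b_scores))


def naive_difference(a_scores: Sequence[float], b_scores: Sequence[float]) -> float:
    """Difference of absolute averages; biased if A and B ran under different drift."""
    return mean(a_scores) - mean(b_scores)


# Illustrative numbers: true quality A = 0.60, B = 0.50; drift adds
# 0.0, 0.1, 0.2, 0.3 to every score measured in blocks 0..3.
a_blocked = [0.6, 0.7, 0.8, 0.9]   # A measured in every block
b_blocked = [0.5, 0.6, 0.7, 0.8]   # B measured in the same blocks

a_early = [0.6, 0.7]               # A measured only in blocks 0-1
b_late = [0.7, 0.8]                # B measured only in blocks 2-3
```

With these numbers, `blocked_contrast(a_blocked, b_blocked)` recovers a positive difference in favor of A, while `naive_difference(a_early, b_late)` has the opposite sign because B happened to run when the drift offset was larger.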

7. Practical Recommendations and Future Directions

Best practices for constructing and deploying stateful SWE-bench protocols include:

Instrument all benchmarks to pass state explicitly between phases or sessions; prohibit human resets or ad hoc re-initializations.
Utilize explicit user simulators or developer profiles to test agents' ability to recognize, retain, and exploit long-term user context.
Cap memory context to avoid overflow, and filter prior experiences for relevance and correctness.
Adopt randomized or block-based experiment designs for performance benchmarking under stateful system dynamics, always relying on relative (contrast) metrics rather than absolute averages.
Incorporate continual fine-tuning, curriculum ordering, and diagnostic logging to enable adaptive, robust agent training.

Further research is exploring algorithmic improvements in memory compression, adaptive window sizing, theory-of-mind augmentation, and cross-language or cross-domain generalization.

Stateful SWE-bench defines a suite of rigorous, state-dependent evaluations for autonomous code agents, encompassing environment manipulation, persistent user modeling, multi-turn continual learning, and robust benchmarking under dynamic system state. State formalization, explicit memory, and hybrid evaluation protocols are essential for measuring, and ultimately achieving, practical agent autonomy in real-world software engineering (Jimenez et al., 2023, Zhou et al., 24 Oct 2025, Lian et al., 13 Apr 2026, Joshi et al., 13 Jun 2025, Guan et al., 13 May 2026, Melis, 15 Jun 2026).