Stateful SWE-Bench Evaluation

Updated 3 July 2026
  • Stateful SWE-bench is a suite of software engineering benchmarks that incorporate persistent, evolving state to simulate real-world multi-turn coding challenges.
  • These protocols formalize tasks as Markov Decision Processes, modeling progressive state transitions across environment setup, code implementation, and test generation phases.
  • Empirical studies demonstrate that incorporating user modeling, continual learning, and contrast-based metrics improves evaluation rigor and advances autonomous coding agent research.

Stateful SWE-bench is a class of software engineering (SWE) benchmarks and evaluation protocols that incorporate persistent, temporally evolving state (in the user/interaction context, the system environment, or the reasoning history) into the assessment of autonomous coding agents. Unlike stateless, one-shot code-generation challenges, stateful SWE-bench settings require agents to model or leverage accumulated context, maintain or manipulate execution/workspace state, and reason over multi-turn or multi-phase solution trajectories. Modern stateful SWE-bench variants encompass multi-session user modeling, end-to-end environment–implementation–verification pipelines, continual-learning task streams, and state-aware benchmarking methodologies.

1. Conceptualization and Motivation

Stateful SWE-bench emerges in response to the inadequacy of traditional stateless evaluations for realistic, autonomous software agent assessment. Legacy benchmarks such as HumanEval, MBPP, and the original SWE-bench (Jimenez et al., 2023) treat each task as an independent, atomic event: the agent receives a textual issue description and a static codebase snapshot, produces code edits, and is immediately scored by a test suite. These paradigms do not reflect the ongoing nature of real-world software engineering, where agents must:

  • Retain and exploit user or system preferences across sessions
  • Coordinate sequential development phases (e.g., environment setup, implementation, test generation)
  • Transfer knowledge between temporally ordered tasks
  • Manage the cumulative effects of actions in dynamic interactive settings

Stateful SWE-bench protocols thus simulate or instrument these longitudinal phenomena, enabling rigorous evaluation of agent capabilities in realistically evolving contexts (Zhou et al., 24 Oct 2025, Guan et al., 13 May 2026, Joshi et al., 13 Jun 2025, Melis, 15 Jun 2026).

2. Formal Task Definitions and State Representations

Leading stateful SWE-bench protocols formally specify tasks as Markov Decision Processes (MDPs), with domain-specific definitions of state $S$, action space $A$, and deterministic or stochastic transition functions $T$. The "SWE-Cycle" benchmark (Guan et al., 13 May 2026) exemplifies this approach:

  • Environment Reconstruction (Env):
    • $S_{env}$: On-disk repository snapshot (files, folder tree) with no dependencies installed
    • $A_{env}$: System/environment-mutating commands (e.g., install, write, configure)
    • $T_{env}(s_t, a_t) = s_{t+1}$: Transition to the state in which the side effects of $a_t$ have been applied
  • Code Implementation (Impl):
    • $S_{impl}$: Pre-configured codebase, issue description $I$, test suite $T_{gold}$
    • $A_{impl}$: Patch operations over AST or text (insert, delete, modify code)
    • $T_{impl}(s_t, a_t) = s_{t+1}$: Updated codebase after applying $a_t$
  • Verification Test Generation (TestGen):
    • $S_{testgen}$: Patched codebase, issue description $I$, reference tests
    • $A_{testgen}$: Add/modify test files, configure runners
    • $T_{testgen}(s_t, a_t) = s_{t+1}$: State updated to contain agent-authored discriminative tests
  • FullCycle (End-to-End):
    • $S_{full}$: Bare repository, issue description $I$, empty execution environment
    • $A_{full}$: $A_{env} \cup A_{impl} \cup A_{testgen}$ in a unified, uninterrupted session
    • $T_{full}(s_t, a_t) = s_{t+1}$ for the entire sequence, storing all state in a single container with zero external resets

State passing is explicit: each phase serializes its output (environment snapshot, repo state, test suite) as input for downstream phases, enforcing strict continuity and prohibiting human intervention.
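The phase structure and explicit state passing described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the SWE-Cycle implementation: `WorkspaceState`, `install`, `write_file`, `add_test`, `run_phase`, and `full_cycle` are hypothetical names, and the state is a toy in-memory record rather than a real container snapshot.

```python
from dataclasses import dataclass, replace
from typing import Callable, Iterable, Tuple


@dataclass(frozen=True)
class WorkspaceState:
    """Hypothetical serialized state handed from one phase to the next."""
    files: Tuple[Tuple[str, str], ...] = ()  # (path, content) pairs
    packages: Tuple[str, ...] = ()           # installed dependencies
    tests: Tuple[str, ...] = ()              # agent-authored test files


# An action maps a state s_t to the next state s_{t+1}.
Action = Callable[[WorkspaceState], WorkspaceState]


def install(pkg: str) -> Action:
    """Environment action: record a dependency as installed."""
    return lambda s: replace(s, packages=s.packages + (pkg,))


def write_file(path: str, content: str) -> Action:
    """Implementation action: create or overwrite one file."""
    def apply(s: WorkspaceState) -> WorkspaceState:
        kept = tuple((p, c) for p, c in s.files if p != path)
        return replace(s, files=kept + ((path, content),))
    return apply


def add_test(path: str) -> Action:
    """Test-generation action: register an agent-authored test file."""
    return lambda s: replace(s, tests=s.tests + (path,))


def run_phase(state: WorkspaceState, actions: Iterable[Action]) -> WorkspaceState:
    """Apply actions in order; each transition consumes the previous state."""
    for act in actions:
        state = act(state)
    return state


def full_cycle(initial, env_actions, impl_actions, test_actions):
    """Chain the three phases with no reset between them."""
    s = run_phase(initial, env_actions)
    s = run_phase(s, impl_actions)
    return run_phase(s, test_actions)


bare = WorkspaceState(files=(("app.py", "def f(): return 1\n"),))
final = full_cycle(
    bare,
    [install("pytest")],
    [write_file("app.py", "def f(): return 2\n")],
    [add_test("tests/test_f.py")],
)
```

Because the state is immutable and each phase receives only the output of the previous one, any error introduced early (for example, a missing dependency) is carried into every later phase, which is the continuity property the protocol enforces.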

3. User and Memory Modeling in Stateful Interactions

Stateful SWE-bench methodologies extend state tracking to user-agent interaction history and persistent user preferences. The ToM-SWE framework (Zhou et al., 24 Oct 2025) introduces:

  • Developer Profiles: Encapsulate interaction traits (verbosity, question timing) and coding preferences (frameworks, libraries)
  • Session Histories: Sets of prior user-agent transcripts, exposing temporal dependencies
  • Persistent User Model: Inferred via theory-of-mind (ToM) agents; consumed by the SWE agent when choosing actions

A typical policy conditions each action on both the in-session context and the long-term user model.

Tasks are evaluated on agents' ability to (1) infer persistent preferences, (2) respect user styles, and (3) minimize unnecessary clarifications. This simulation is typically realized by a profile-conditioned user simulator and task generator, with satisfaction and efficiency metrics scored by LLM-powered evaluators.
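As a toy illustration of a policy that consults both the in-session context and a persistent user model, consider the sketch below. It is an assumption-laden stand-in, not the ToM-SWE method: `UserModel`, `choose_action`, and the `prefer key=value` transcript convention are hypothetical, and the keyword-based preference inference replaces what would be an LLM-driven theory-of-mind agent.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class UserModel:
    """Hypothetical persistent model of one developer's preferences."""
    preferences: Dict[str, str] = field(default_factory=dict)

    def update(self, transcript: List[str]) -> None:
        """Toy inference: pick up lines of the form 'prefer key=value'."""
        for line in transcript:
            if line.startswith("prefer "):
                key, _, value = line[len("prefer "):].partition("=")
                self.preferences[key.strip()] = value.strip()


def choose_action(context: Dict[str, str], user_model: UserModel, slot: str) -> str:
    """Resolve a decision slot, asking the user only as a last resort.

    The in-session context takes priority, then the persistent user model.
    """
    value = context.get(slot) or user_model.preferences.get(slot)
    if value is None:
        return f"ask_user:{slot}"
    return f"implement:{slot}={value}"


model = UserModel()
model.update(["prefer test_framework=pytest", "thanks, looks good"])

from_memory = choose_action({}, model, "test_framework")
needs_question = choose_action({}, model, "web_framework")
overridden = choose_action({"test_framework": "unittest"}, model, "test_framework")
```

Here the agent avoids a clarification question when an earlier session already revealed the preference, asks only when neither source has an answer, and lets an explicit in-session instruction override the stored preference.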

4. Evaluation Protocols and Metrics

Stateful SWE-bench evaluation metrics measure phase-resolved correctness, state integrity, and learning/dynamic adaptation:

  • Phase Scores (SWE-Cycle) (Guan et al., 13 May 2026):
    • Static and dynamic sub-scores per instance and phase
    • Normalized aggregate score and solved rate
    • End-to-end: aggregates computed as means over composite phase scores
  • Continual Learning Metrics (SWE-Bench-CL) (Joshi et al., 13 Jun 2025):
    • Average Accuracy (AA): Mean final performance over sequence
    • Forgetting (F): Loss from peak prior performance on earlier tasks
    • Forward/Backward Transfer (FT/BWT): Gains/losses on new/prior tasks due to incremental learning
    • Composite Continual-Learning Score (CCLS): Weighted sum of the above, with a stability-plasticity harmonic (F-score) term
  • Stateful Decision-Problem Benchmarking (Melis, 15 Jun 2026):
    • Replaces naive absolute metrics with contrast (delta) estimators of the difference between candidate programs
    • Experiments use randomized or blocked trial designs for consistent identification of optimal programs, accounting for uncontrolled environmental state drift
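The continual-learning metrics listed above can be made concrete with the conventional definitions from the continual-learning literature. The sketch below assumes those textbook forms, which may differ in detail from the exact formulas used in SWE-Bench-CL; the matrix `R` and the function names are illustrative. `R[i][j]` is the success rate on task `j` measured after the agent has processed tasks `0..i`, and at least two tasks are required.

```python
from typing import Sequence

Matrix = Sequence[Sequence[float]]


def average_accuracy(R: Matrix) -> float:
    """Mean performance over all tasks after the final task in the sequence."""
    final = R[-1]
    return sum(final) / len(final)


def forgetting(R: Matrix) -> float:
    """Mean drop from the best earlier score to the final score, per earlier task."""
    T = len(R)
    drops = []
    for j in range(T - 1):
        peak = max(R[i][j] for i in range(T - 1))  # best score before the final step
        drops.append(peak - R[-1][j])
    return sum(drops) / len(drops)


def backward_transfer(R: Matrix) -> float:
    """Mean change on each earlier task between first learning it and the end."""
    T = len(R)
    return sum(R[-1][j] - R[j][j] for j in range(T - 1)) / (T - 1)


# Illustrative (made-up) scores for a three-task sequence.
R = [
    [0.6, 0.1, 0.0],
    [0.5, 0.7, 0.2],
    [0.4, 0.6, 0.8],
]

aa = average_accuracy(R)
f = forgetting(R)
bwt = backward_transfer(R)
```

With these made-up numbers the final average accuracy is 0.6, forgetting is 0.15, and backward transfer is -0.15, so the agent loses ground on earlier tasks as it learns later ones.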

5. Implementation Strategies and Agent Architectures

Modern stateful SWE-bench protocols leverage system architectures and memory systems capable of retaining, retrieving, and condensing long-run interaction or reasoning histories:

  • Graph-based Agent State (LangGraph) (Joshi et al., 13 Jun 2025): Encodes agent observations, actions, plans, and tool invocations as nodes in a persistent, updatable graph structure at each turn
  • FAISS-based Semantic Memory (Joshi et al., 13 Jun 2025): Stores representations (embeddings) of prior solved tasks, facilitating retrieval of analogous experience for new tasks
  • Dynamic Reasoning Contexts (SWE-AGILE) (Lian et al., 13 Apr 2026): Maintains a sliding window of detailed reasoning over the N most recent steps, storing earlier reasoning as digests. Compression functions optimize semantic fidelity under token constraints. This approach mitigates context window limitations and preserves deep System-2 reasoning chains
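The sliding-window idea in the last item can be sketched as follows. This is a simplified illustration, not the SWE-AGILE implementation: `ReasoningContext` and `truncate_digest` are hypothetical names, and plain truncation stands in for whatever learned or LLM-based compression function a real system would use.

```python
from collections import deque
from typing import Callable, Deque, List


def truncate_digest(text: str, max_chars: int = 40) -> str:
    """Placeholder compressor: keep a short prefix of the reasoning text."""
    return text if len(text) <= max_chars else text[: max_chars - 3] + "..."


class ReasoningContext:
    """Keeps the last `window` reasoning steps in full and older ones as digests."""

    def __init__(self, window: int, compress: Callable[[str], str] = truncate_digest):
        self.window = window
        self.compress = compress
        self.digests: List[str] = []
        self.recent: Deque[str] = deque()

    def add_step(self, reasoning: str) -> None:
        """Append a step; steps that fall out of the window are compressed."""
        self.recent.append(reasoning)
        while len(self.recent) > self.window:
            self.digests.append(self.compress(self.recent.popleft()))

    def render(self) -> str:
        """Build the prompt context: digests first, then full recent steps."""
        lines = [f"[digest] {d}" for d in self.digests]
        lines += [f"[full] {r}" for r in self.recent]
        return "\n".join(lines)


ctx = ReasoningContext(window=2)
for i in range(1, 5):
    ctx.add_step(f"step {i}: " + "x" * 50)

prompt = ctx.render()
```

After four steps with a window of two, the first two steps survive only as 40-character digests while the last two remain verbatim, so prompt size grows slowly even as the trajectory gets longer.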

Pseudocode implementations exemplify strict state-carryover, memory-prompting, and hybrid retrieval-training structures.
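A memory-prompting loop of this kind can be approximated without any external dependency. The sketch below is a stand-in under stated assumptions: it replaces FAISS and learned embeddings with bag-of-words vectors and a brute-force cosine search, and `TaskMemory`, `embed`, and `cosine` are hypothetical names chosen for illustration.

```python
import math
from collections import Counter
from typing import List, Tuple


def embed(text: str) -> Counter:
    """Toy embedding: lowercase bag-of-words counts (stand-in for a real encoder)."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)  # Counter yields 0 for missing tokens
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


class TaskMemory:
    """Stores solved tasks and retrieves the most similar ones for a new issue."""

    def __init__(self) -> None:
        self.entries: List[Tuple[Counter, str, str]] = []

    def add(self, issue: str, fix_summary: str) -> None:
        self.entries.append((embed(issue), issue, fix_summary))

    def retrieve(self, query: str, k: int = 1) -> List[Tuple[str, str]]:
        """Return up to k (issue, fix_summary) pairs ranked by similarity."""
        q = embed(query)
        ranked = sorted(self.entries, key=lambda e: cosine(q, e[0]), reverse=True)
        return [(issue, fix) for _, issue, fix in ranked[:k]]


memory = TaskMemory()
memory.add("null pointer error in config parser", "guard against missing keys")
memory.add("timezone offset wrong in date formatter", "use aware datetimes")

hits = memory.retrieve("config parser crashes with null pointer", k=1)
```

The retrieved fix summary would then be prepended to the agent's prompt for the new task; the parameter `k` caps how much prior experience enters the context.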

6. Empirical Findings and Performance Analysis

Empirical studies across benchmarks reveal that:

  • Cross-phase or cross-session state dependencies substantially increase task difficulty. In SWE-Cycle, the solve rate for FullCycle (end-to-end, stateful execution in a single session) drops below 14%, even though best-in-class agents achieve 78%–97% per-phase accuracy in isolation. Most of the dynamic boost in the end-to-end setting is offset by cumulative static errors and verification bottlenecks (Guan et al., 13 May 2026).
  • User modeling and memory improve objective and subjective outcomes. ToM-SWE achieves 59.7% success on stateful scenarios, far exceeding the 18.1% baseline, and earns consistently higher satisfaction scores via persistent preference tracking (Zhou et al., 24 Oct 2025).
  • Continual learning agents with external memory show improved accuracy, reduced forgetting, and better transfer compared to memoryless baselines. Memory-aware architectures support faster resolution of recurring bug types and more efficient adaptation to evolving codebases, as measured by AA, F, FT, and CCLS (Joshi et al., 13 Jun 2025).
  • Contrast-based estimators yield consistent decisions even under unobservable, stateful environmental drift, avoiding the misleading bias of traditional averaging protocols (Melis, 15 Jun 2026).
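The last point can be illustrated with a small deterministic example. It assumes, purely for illustration, that environmental drift is additive and shared by both programs within a block; the latencies, the drift values, and the function name `blocked_contrast` are made up. In practice the run order within each block would also be randomized.

```python
from statistics import mean
from typing import Sequence, Tuple


def blocked_contrast(pairs: Sequence[Tuple[float, float]]) -> float:
    """Mean within-block difference (A minus B); shared drift cancels out."""
    return mean(a - b for a, b in pairs)


# Hypothetical true latencies in ms: program A is faster than B by 3 ms.
true_a, true_b = 100.0, 103.0

# Unobserved state drift that grows from block to block (e.g. cache or load).
drift = [0.0, 5.0, 10.0, 15.0]

# Blocked design: both programs are measured inside every block.
pairs = [(true_a + d, true_b + d) for d in drift]
delta = blocked_contrast(pairs)

# Naive design: B measured in the early blocks, A in the late blocks,
# then absolute averages are compared.
naive = mean(true_a + d for d in drift[2:]) - mean(true_b + d for d in drift[:2])
```

The blocked contrast recovers the true difference of -3 ms, while the naive comparison of absolute averages reports +7 ms and would select the wrong program.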

7. Practical Recommendations and Future Directions

Best practices for constructing and deploying stateful SWE-bench protocols include:

  • Instrument all benchmarks to pass state explicitly between phases or sessions; prohibit human resets or ad hoc re-initializations.
  • Utilize explicit user simulators or developer profiles to test agents' ability to recognize, retain, and exploit long-term user context.
  • Cap memory context to avoid overflow, and filter prior experiences for relevance and correctness.
  • Adopt randomized or block-based experiment designs for performance benchmarking under stateful system dynamics, always relying on relative (contrast) metrics rather than absolute averages.
  • Incorporate continual fine-tuning, curriculum ordering, and diagnostic logging to enable adaptive, robust agent training.

Further research is exploring algorithmic improvements in memory compression, adaptive window sizing, theory-of-mind augmentation, and cross-language or cross-domain generalization.


Stateful SWE-bench defines a suite of rigorous, state-dependent evaluations for autonomous code agents, encompassing environment manipulation, persistent user modeling, multi-turn continual learning, and robust benchmarking under dynamic system state. State formalization, explicit memory, and hybrid evaluation protocols are essential for measuring, and ultimately achieving, practical agent autonomy in real-world software engineering (Jimenez et al., 2023, Zhou et al., 24 Oct 2025, Lian et al., 13 Apr 2026, Joshi et al., 13 Jun 2025, Guan et al., 13 May 2026, Melis, 15 Jun 2026).
