Self-play SWE-RL Framework
- Self-play SWE-RL (SSR) is a reinforcement learning framework that trains LLM software agents by interleaving bug injection and repair within sandboxed codebases.
- SSR employs an MDP-based dual-role curriculum with integrated tool commands and PPO optimization to iteratively enhance debugging capabilities.
- Evaluations on SWE-bench Verified and SWE-bench Pro demonstrate SSR's superior resolve rates, highlighting its effectiveness in autonomously generating and solving challenging software bugs.
Self-play SWE-RL (SSR) is a reinforcement learning framework for training LLM software agents directly on real-world codebases by engaging them in an iterative, dual-role game of bug injection and bug repair. Unlike prior RL paradigms that rely on curated human data, SSR requires only access to sandboxed source repositories with their dependencies, dispensing entirely with human-authored issues or tests. SSR formalizes this dual-role curriculum as a Markov Decision Process (MDP), optimizing a single LLM policy to alternately act as a “bug injector” and a “solver,” thereby autonomously generating and solving software debugging tasks of increasing difficulty (Wei et al., 21 Dec 2025).
1. Formal Model and Problem Definition
SSR employs an MDP in which the agent alternates between two roles within a sandboxed repository environment: injector (introducing a bug) and solver (repairing the injected bug). The state is a tuple $s = (R, \text{role}, p, \tau)$, where:
- $R$: repository snapshot (filesystem + Docker context).
- $\text{role} \in \{\text{injector}, \text{solver}\}$.
- $p$: prompt encoding instructions specific to the current role.
- $\tau$: for the solver, a formal test specification (a patch representing the inverse of the test weakening performed during injection).
The action space corresponds to discrete tokens output by the LLM, consumed by a tool interface capable of bash scripting, source edits, patch generation, and other developer operations. Trajectories proceed stepwise until a <tool:submit> call yields a terminal interaction (i.e., bug-artifact submission or repair attempt), at which point consistency or test checks trigger state transitions and rewards. The environment provides up to $k$ solver attempts per bug, constituting one episode.
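The state tuple can be encoded as in the following minimal sketch; the class and field names are hypothetical, since the paper does not prescribe a concrete API:

```python
# Illustrative encoding of the SSR state tuple (R, role, p, tau).
# All class and field names here are hypothetical, not taken from the paper.
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Role(Enum):
    INJECTOR = "injector"
    SOLVER = "solver"


@dataclass
class SSRState:
    repo: str                 # R: sandboxed repository snapshot (filesystem + Docker context)
    role: Role                # current role of the shared policy
    prompt: str               # p: role-specific instructions rendered into the LLM context
    test_spec: Optional[str]  # tau (solver only): oracle test patch, the inverse of the test weakening
```

An episode instantiates this state once in the injector role and up to $k$ times in the solver role, once per repair attempt.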
The reward schema is carefully designed to incentivize difficulty-adjusted bug synthesis and robust bug repair (see the code sketch after this list):
- Injector: a negative reward if the submitted bug fails consistency checks; a reduced reward if all or none of the $k$ solver attempts succeed (degenerate difficulty); the full reward for intermediate difficulty, i.e., a solve rate strictly between 0 and 1.
- Solver: binary ($+1$ if the post-repair test suite passes, $-1$ otherwise).
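A self-contained sketch of this reward schema follows; the concrete reward constants are not reproduced in this summary, so the values below ($-1$, $0$, $+1$) are placeholders that only preserve the described ordering:

```python
# Hedged sketch of the SSR reward schema. The constants are placeholders that preserve
# the ordering: consistency failure < degenerate difficulty < intermediate difficulty.

def injector_reward(consistent: bool, solver_outcomes: list[bool]) -> float:
    if not consistent:
        return -1.0                       # bug artifact rejected by consistency checks
    solve_rate = sum(solver_outcomes) / len(solver_outcomes)
    if solve_rate in (0.0, 1.0):
        return 0.0                        # all or no attempts succeeded: reduced reward
    return 1.0                            # intermediate difficulty: full reward


def solver_reward(passes_tests: bool) -> float:
    return 1.0 if passes_tests else -1.0  # binary repair reward


# Example: 2 of 4 repair attempts succeed, so the injected bug is neither trivial nor unsolvable.
assert injector_reward(True, [True, False, True, False]) == 1.0
assert solver_reward(False) == -1.0
```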
2. LLM Agent Architecture and Tool API
SSR uses Code World Model (CWM-sft, 32B parameters) as its LLM backbone. The injector and solver roles share a single set of parameters; role distinction is accomplished purely through prompting:
- Injector prompt: guides test discovery, test-parser/test-script authoring, code-hunk removal/history reversion, and test weakening.
- Solver prompt: supplies the oracle test patch (the reverse of the injector's test weakening), which defines the minimal repair target.
The agent issues tool-labeled tokens (e.g., <tool:bash>, <tool:edit>, <tool:submit>) controlling repository state and evaluation via bash scripts, editors, and test parsing. The agent’s decision history—comprising tool calls and associated state observations—forms the autoregressive input at each RL step.
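An illustrative dispatcher for this kind of tool-tagged output is sketched below; the closing-tag syntax and the `apply_edit` semantics are assumptions, since the exact tag grammar used by SSR/CWM is not specified here:

```python
# Hypothetical dispatcher for tool-tagged LLM output such as
#   <tool:bash> pytest -x </tool>   or   <tool:submit> ... </tool>.
# The closing-tag syntax and edit semantics are assumptions, not SSR's actual grammar.
import re
import subprocess

TOOL_RE = re.compile(r"<tool:(?P<name>\w+)>(?P<body>.*?)</tool>", re.DOTALL)


def apply_edit(patch_text: str, workdir: str) -> str:
    """One plausible 'edit' semantics: apply a unified diff to the working tree."""
    proc = subprocess.run(["git", "apply", "-"], input=patch_text, cwd=workdir,
                          capture_output=True, text=True)
    return proc.stderr or "EDIT OK"


def dispatch(llm_output: str, workdir: str) -> list[str]:
    """Execute each tool call in order; return the observations fed back into the context."""
    observations = []
    for match in TOOL_RE.finditer(llm_output):
        name, body = match.group("name"), match.group("body").strip()
        if name == "bash":
            proc = subprocess.run(body, shell=True, cwd=workdir,
                                  capture_output=True, text=True, timeout=120)
            observations.append(proc.stdout + proc.stderr)
        elif name == "edit":
            observations.append(apply_edit(body, workdir))
        elif name == "submit":
            observations.append("SUBMITTED")   # terminal action: ends the trajectory
            break
    return observations
```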
3. Reinforcement Learning Objective and Optimization
SSR simultaneously optimizes both roles' policy trajectories using policy-gradient methods (proximal policy optimization, PPO). The objective is the standard discounted expected return

$$J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\Big[\textstyle\sum_{t} \gamma^{t} r_t\Big].$$

Policy updates employ a PPO clipped surrogate loss with normalized advantages estimated via a learned value head $V_\phi$. The final loss combines the policy-gradient term, value regression, and entropy regularization:

$$\mathcal{L}(\theta, \phi) = \mathcal{L}_{\text{clip}}(\theta) + c_1\,\mathcal{L}_{\text{value}}(\phi) - c_2\,\mathcal{H}\big[\pi_\theta\big].$$

A high-parallelism regime is used: 448 rollout actors provide trajectories to a central learner (64 GPUs) for advantage calculation and parameter updates.
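A compact PyTorch sketch of this combined objective is shown below; the clipping constant and the coefficients $c_1$, $c_2$ follow common PPO defaults rather than SSR's reported settings:

```python
# Standard PPO clipped loss with value regression and an entropy bonus, as described above.
# clip_eps, c1, c2 are common PPO defaults, not SSR's exact hyperparameters.
import torch


def ppo_loss(logp, logp_old, advantages, values, returns, entropy,
             clip_eps=0.2, c1=0.5, c2=0.01):
    # Per-batch advantage normalization.
    adv = (advantages - advantages.mean()) / (advantages.std() + 1e-8)

    ratio = torch.exp(logp - logp_old)                           # pi_theta / pi_old
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * adv
    policy_loss = -torch.min(ratio * adv, clipped).mean()        # clipped surrogate

    value_loss = torch.nn.functional.mse_loss(values, returns)   # value-head regression
    return policy_loss + c1 * value_loss - c2 * entropy.mean()   # combined objective


# Example with dummy tensors for a batch of 8 samples.
B = 8
print(float(ppo_loss(torch.randn(B), torch.randn(B), torch.randn(B),
                     torch.randn(B), torch.randn(B), torch.rand(B))))
```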
4. Self-Play Training Loop and Curriculum
The SSR training loop is as follows:
- [Injector phase] Discover test commands, generate a suite of testing tools, perform either segment removal or full commit reversion to create a new bug by diffing with the clean code, and weaken the corresponding tests.
- Submit the bug artifact (including all supporting test scripts and diffs) if consistency checks pass (e.g., nontrivial failing/passing test quota, changed files).
- [Solver phase] Apply the bug and test-weakening diffs to a new repo copy. The solver is then given only the oracle “repair patch” and attempts up to $k$ repairs per bug.
- Each solution attempt is validated by running the original or updated test suite; pass/fail results are used for reward and for constructing higher-order bugs in later curriculum stages (failed patches are recycled as new, composite bugs).
The curriculum is emergent: bug-injection strategies are sampled among direct injection, code removal, and history-aware (full commit revert), with history-aware variants producing the strongest final metrics. The solver-feedback mechanism in the injector’s reward encourages generating bugs neither too easy nor too hard, calibrating difficulty throughout training.
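The loop above can be condensed into pseudocode-style Python; all environment and policy methods (inject, attempt_repair, passes_tests, compose, recycle) are hypothetical placeholders rather than the paper's API, and the rewards reuse the placeholder values from the earlier sketch:

```python
# High-level sketch of one SSR self-play iteration. Environment/policy methods are
# hypothetical placeholders; reward values reuse the earlier placeholder schema.

def self_play_iteration(policy, env, k, strategies=("direct", "removal", "history_revert")):
    trajectories = []  # (trajectory, reward) pairs handed to the PPO learner

    # Injector phase: sample an injection strategy and produce a bug artifact
    # (bug diff, weakened tests, and supporting test scripts).
    bug = env.inject(policy, strategy=env.sample(strategies))
    if not env.consistency_checks(bug):
        trajectories.append((bug.trajectory, -1.0))          # inconsistent submission is penalized
        return trajectories

    # Solver phase: up to k repair attempts, each scored by the post-repair test suite.
    outcomes = []
    for _ in range(k):
        patch, traj = env.attempt_repair(policy, bug)
        ok = env.passes_tests(bug, patch)
        trajectories.append((traj, 1.0 if ok else -1.0))     # binary solver reward
        outcomes.append(ok)
        if not ok:
            env.recycle(env.compose(bug, patch))             # failed patch becomes a higher-order bug

    # Difficulty-calibrated injector reward: full reward only for intermediate solve rates.
    solve_rate = sum(outcomes) / len(outcomes)
    trajectories.append((bug.trajectory, 1.0 if 0.0 < solve_rate < 1.0 else 0.0))
    return trajectories
```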
5. Empirical Evaluation and Benchmarks
SSR is evaluated on SWE-bench Verified (500 human-verified Python issues with tests) and SWE-bench Pro (731 public, complex real-world issues). The resolve-rate metric is used: the proportion of issues fixed in a single model attempt (no ensembling or reranking). SSR achieves:
- +10.4 absolute points over baseline RL (human issue/test access) on SWE-bench Verified.
- +7.8 absolute points on SWE-bench Pro.
SSR dominates the baseline over the entire training trajectory, despite never seeing natural-language issues or human annotations.
Ablation studies show:
- Both injection-only and repair-only agents underperform compared to full SSR self-play.
- Higher-order bug chaining increases diversity and robustness.
- Removing solve-rate feedback results in only marginal resolve-rate drop due to inherent reward signal noise.
6. Training Details and Hyperparameters
Key architectural and training settings (collected into an illustrative config sketch after this list):
- Hardware: 512×H100 GPUs (448 rollout, 64 learner).
- RL: async PPO within the CWM-RL codebase.
- Context window: 131,072 tokens; global batch: 16M tokens (16×grad-accumulation, group size 8).
- Learning rate schedule: linear warm-up to the peak value.
- Injector curriculum requires minimums (10 passing tests, 2 changed files, 3 failing tests).
- Duration: 150 global steps (2.5B tokens).
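These settings can be gathered into a single configuration sketch for reference; field names are hypothetical, and values not stated above (e.g., the peak learning rate) are left unset rather than guessed:

```python
# Illustrative collection of the reported SSR training settings. Field names are
# hypothetical; values not stated in the text are left as None rather than guessed.
from dataclasses import dataclass
from typing import Optional


@dataclass
class SSRTrainConfig:
    rollout_gpus: int = 448              # H100s running rollout actors
    learner_gpus: int = 64               # H100s running the central PPO learner
    context_window: int = 131_072        # tokens
    global_batch_tokens: int = 16_000_000
    grad_accumulation: int = 16
    group_size: int = 8
    lr_schedule: str = "linear_warmup"
    peak_lr: Optional[float] = None      # not reproduced in this summary
    min_passing_tests: int = 10          # injector consistency minimums
    min_changed_files: int = 2
    min_failing_tests: int = 3
    total_steps: int = 150               # ~2.5B training tokens
```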
7. Limitations, Impact, and Future Directions
SSR demonstrates that self-play can bootstrap a curriculum of software-engineering tasks, enabling autonomous agentic learning unconstrained by human data curation. Limitations include exclusive reliance on unit tests (no “hidden” human-oracle or unseen validation cases), a single-policy design that may benefit from role-splitting or mixture-of-experts, and the persistence of some training instability at large scale (gibberish rollouts observed).
Future research priorities involve architectural decoupling of roles, denser/graded feedback signals for long-horizon tasks, distributional bug control to prevent duplicates, and scaling to higher abstraction levels such as multi-file or full-version changes. SSR’s empirical trajectory suggests a path toward superintelligent software agents capable of solving previously unseen challenges and autonomously synthesizing new software artifacts (Wei et al., 21 Dec 2025).