
SWE-RL: Reinforcement Learning for Software Engineering

Updated 17 October 2025
  • SWE-RL is a suite of reinforcement learning methods that trains LLMs using comprehensive software evolution data for advanced code synthesis and repair.
  • It employs scalable techniques like GRPO with sequence similarity-based rewards to handle real-world, multi-file, and long-context reasoning challenges.
  • Benchmarking on multilingual datasets and training with test-driven feedback have enabled SWE-RL to achieve transferable, state-of-the-art performance across diverse software engineering tasks.

SWE-RL refers to a suite of reinforcement learning (RL) methods, datasets, and benchmarks designed to advance LLM reasoning for real-world software engineering, and more broadly denotes a rapidly expanding research area at the intersection of RL, software evolution data, and agentic or agentless frameworks for automated code synthesis, repair, and feature development. The field is anchored by the introduction of SWE-RL in the context of scaling RL-based LLMs for software engineering (Wei et al., 25 Feb 2025), with significant extensions in datasets, multilingual benchmarking, training algorithms, and system infrastructure in subsequent work.

1. Foundations and Motivation

SWE-RL was introduced to address the gap between RL-driven LLM progress in domains such as competitive programming or mathematics (e.g., DeepSeek-R1) and the more complex, context-rich tasks in real-world software engineering (Wei et al., 25 Feb 2025). The core insight is to leverage the rich process record of open-source software evolution—specifically the entirety of a software project’s lifecycle, including code snapshots, code diffs, pull requests, and issues—as a training substrate. This enables LLMs to learn by reconstructing the implicit reasoning processes and repair strategies of human developers as they navigate, localize, and fix bugs using only natural language and code contexts.

This approach is fundamentally different from prior work that focuses on tightly specified tasks with strong, deterministic reward signals derived from execution outcomes. In SWE-RL, rewards often reflect looser criteria such as sequence similarity or code-edit distance, and the environment is inherently more diverse and noisy due to real repository artifacts. The methodology thus seeks to induce stronger generalization and transferable reasoning capabilities by training models on massive, heterogeneous software evolution data.

2. Methodological Frameworks

SWE-RL’s operational pipeline is characterized by a lightweight, rule-based reward system, scalable RL optimization with Group Relative Policy Optimization (GRPO), and a curated dataset derived from open-source evolution events:

  • Reward Function: At each RL step, the model outputs a candidate code patch for a given issue and code context. If the output is incorrectly formatted, a penalty (–1) is returned. Otherwise, the reward is the sequence similarity score between the model prediction and the oracle patch from the GitHub PR, computed via Python’s difflib.SequenceMatcher (a minimal sketch follows this list):

\mathcal{R}(\tau) = \begin{cases} -1, & \text{format error} \\ \text{compare}\left(\text{patch}_{\text{pred}},\, \text{patch}_{\text{gt}}\right), & \text{otherwise} \end{cases}

  • Policy Optimization: The update objective employs GRPO: for each sampled group, rewards are normalized into relative advantages, importance ratios are clipped, and a KL divergence penalty regulates drift from the reference policy (a sketch of this update appears at the end of this section):

\mathcal{J}(\theta) = \mathbb{E}\left[\frac{1}{G}\sum_{i=1}^{G} \left( \min\!\left(r_i A_i,\; \text{clip}(r_i,\, 1-\epsilon,\, 1+\epsilon)\, A_i \right) - \beta \cdot \text{KL}\!\left(\pi_\theta \,\|\, \pi_{\text{ref}}\right) \right)\right]

  • Data Conditioning: Training instances include a full issue description, the complete set of code files affected (including both modified and relevant unmodified files), and the human-authored patch. This forces the model to implicitly learn fault localization before generating repairs.
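A minimal sketch of the rule-based reward described above, assuming the prediction and oracle patch are both plain unified-diff strings; the format check here is a simplified stand-in for the paper's actual parsing rules:

```python
import difflib
from typing import Optional

def swe_rl_reward(predicted_patch: Optional[str], oracle_patch: str) -> float:
    """Rule-based SWE-RL-style reward: -1 for malformed output, otherwise the
    difflib sequence-similarity ratio between predicted and oracle patches."""
    # Simplified format check (illustrative): the paper penalizes any output
    # that cannot be parsed as a patch; here we only reject empty predictions.
    if not predicted_patch:
        return -1.0
    # SequenceMatcher.ratio() returns a similarity score in [0, 1].
    return difflib.SequenceMatcher(None, predicted_patch, oracle_patch).ratio()
```

Identical patches score 1.0 and unrelated text scores near 0, so the reward provides dense, graded feedback even when a candidate patch is only partially correct.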

The resulting model, Llama3-SWE-RL-70B, was trained for 1,600 RL steps with a global batch size of 512 and a 16k token context window, processing roughly 11M unique PR instances.
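A compact sketch of a GRPO-style update consistent with the objective \mathcal{J}(\theta) above, assuming group-wise reward normalization for the advantages and sequence-level log-probabilities; the hyperparameter names and the simple KL estimate are illustrative, not the exact SWE-RL implementation:

```python
import torch

def grpo_loss(logp_new, logp_old, logp_ref, rewards, eps=0.2, beta=0.01):
    """GRPO-style loss for one group of G sampled patches.

    logp_new / logp_old / logp_ref: summed log-probabilities of each sample
    under the current, behavior, and frozen reference policies (shape [G]).
    rewards: scalar reward per sample (shape [G]).
    """
    # Group-relative advantage: normalize rewards within the sampled group.
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-8)
    # Importance ratio between current and behavior policy.
    ratio = torch.exp(logp_new - logp_old)
    clipped = torch.clamp(ratio, 1 - eps, 1 + eps)
    surrogate = torch.minimum(ratio * adv, clipped * adv)
    # Simple KL estimate against the reference policy (the beta * KL term).
    kl = (logp_new - logp_ref).mean()
    return -(surrogate.mean() - beta * kl)
```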

3. Benchmarks, Evaluation, and Performance

SWE-RL and its variants are evaluated on newly curated and established benchmarks:

  • SWE-bench Verified: A human-verified subset of real-world GitHub issues requiring multi-file, long-context reasoning. Llama3-SWE-RL-70B achieves a 41.0% solve rate, the highest reported among medium-sized (<100B) models and competitive with leading proprietary LLMs like GPT-4o (Wei et al., 25 Feb 2025).
  • Out-of-domain Transfer: Despite RL being performed solely on software evolution data, models exhibit improved performance on function coding (HumanEval+), library use (BigCodeBench-Hard), code reasoning (CRUXEval), mathematical problem-solving, and general language understanding (MMLU), in contrast to supervised baselines that may degrade on these tasks.
  • Scaling with Sampling: Increasing the number of samples per issue enhances solution rates, with performance gains saturating beyond 160 samples, underscoring the importance of efficient sampling strategies and reward shaping.
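The sample-scaling behavior described above is conventionally measured with an unbiased pass@k (here, solve@k) estimator; the following is the standard combinatorial form used across code-generation benchmarks, not code released with SWE-RL:

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased estimate of the probability that at least one of k samples,
    drawn from n generated candidates of which c are correct, solves the issue."""
    if n - c < k:
        return 1.0
    return 1.0 - float(np.prod(1.0 - k / np.arange(n - c + 1, n + 1)))

# Hypothetical example: 160 samples per issue, 12 of which resolve it, at k=10.
print(round(pass_at_k(n=160, c=12, k=10), 3))
```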

4. Extensions in Datasets and Infrastructure

Subsequent research has generalized SWE-RL along several dimensions:

  • Multi-SWE-bench and Multi-SWE-RL: To enable cross-language evaluation, Multi-SWE-bench provides 1,632 annotated instances spanning Java, TypeScript, JavaScript, Go, Rust, C, and C++, with rigorous containerization and reproducible runtime environments for RL experiments (Zan et al., 3 Apr 2025). The accompanying Multi-SWE-RL community released 4,723 containerized RL training instances and an open, fully documented data production pipeline, catalyzing large-scale, collaborative RL dataset creation for issue resolution.
  • SWE-Dev Dataset: SWE-Dev targets autonomous feature-driven development (FDD), supplying 14,000 training and 500 test samples, each coupled with a runnable environment and unit tests. This enables RL with execution-based reward signals, creating a direct path for optimizing models using grounded, test-driven feedback (Du et al., 22 May 2025). A minimal sketch of such an execution-based reward follows this list.
  • Infrastructure Advancements: RepoForge (Chen et al., 3 Aug 2025) automates end-to-end data generation, curation, labeling (with SPICE for difficulty assessment), and scalable distributed evaluation (a Ray-powered harness). Its RL integration uses a bubble-free, asynchronous scaffold capable of concurrent multi-turn rollouts, efficiently training even ≤8B models to competitive benchmark performance.
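A minimal sketch of the kind of execution-based reward these datasets enable, assuming the candidate patch has already been applied inside a prepared repository checkout with a pytest test suite; the binary pass/fail scheme, command, and timeout are illustrative assumptions rather than the released pipelines:

```python
import subprocess

def execution_reward(repo_dir: str, timeout_s: int = 600) -> float:
    """Test-driven reward: 1.0 if the repository's test suite passes after
    the candidate patch has been applied, else 0.0."""
    try:
        result = subprocess.run(
            ["python", "-m", "pytest", "-q"],
            cwd=repo_dir,
            capture_output=True,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        return 0.0  # Treat hanging test runs as failures.
    return 1.0 if result.returncode == 0 else 0.0
```

In practice such rewards are computed inside per-instance containers, which is exactly the reproducibility infrastructure Multi-SWE-RL, SWE-Dev, and RepoForge provide.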

5. Agent Architectures and Skill Priors

SWE-RL research explores both agentless (workflow-based) and agent-based (multi-turn, autonomous) paradigms:

  • Kimi-Dev: This framework first induces “skill priors” via structured, agentless (single-turn) training on tasks such as bug localization, code edit, and self-reflection, followed by RL applied to code-edit components. The resulting model achieves state-of-the-art agentless scores (60.4% on SWE-bench Verified) and, with light supervised adaptation, enables SWE-Agents to reach agentic pass@1 rates comparable to large proprietary models (Yang et al., 27 Sep 2025).
  • Evolutionary Test-Time Scaling: Satori-SWE incorporates RL with evolutionary self-improvement, letting the model iteratively refine code patch outputs guided by potential-based reward shaping. This avoids dependence on external verifiers at inference, improving sample efficiency while matching or exceeding the performance of much larger models in few-shot settings (Zeng et al., 29 May 2025).
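A generic sketch of evolutionary test-time refinement in the spirit described above, assuming a `generate(prompt, parent)` sampler and a scalar `score(patch)` reward (for example, one of the rewards sketched earlier); this is an illustrative loop, not Satori-SWE's exact EvoScale procedure:

```python
from typing import Callable, Optional

def evolve_patch(
    generate: Callable[[str, Optional[str]], str],  # samples a patch, optionally conditioned on a parent
    score: Callable[[str], float],                  # scalar reward for a candidate patch
    prompt: str,
    generations: int = 4,
    population: int = 8,
) -> Optional[str]:
    """Iteratively resample patches, conditioning each round on the best candidate so far."""
    best_patch, best_score = None, float("-inf")
    for _ in range(generations):
        candidates = [generate(prompt, best_patch) for _ in range(population)]
        for patch in candidates:
            s = score(patch)
            if s > best_score:
                best_patch, best_score = patch, s
    return best_patch
```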

6. Implications, Challenges, and Emerging Directions

SWE-RL marks a significant methodological shift in training AI software engineering agents:

  • Transferable Reasoning: RL on real-world software evolution can yield reasoning competencies that generalize across diverse problem domains.
  • Execution-based RL: Integrating runtime test feedback and environment containerization tightly grounds reward signals, moving toward “learning by doing.”
  • Skill Priors and Agent Evolution: Training with explicit workflows or structured steps produces efficient and robust initialization (“skill priors”) for subsequent agentic RL, facilitating efficient adaptation even in sparse reward, long-horizon settings.
  • Scaling and Efficiency: Innovations in sampling, reward shaping, and large-scale distributed training and evaluation infrastructure have overcome previous bottlenecks in data, labeling, and computational cost.

Key open directions include enhancing semantic-reward mechanisms, developing more robust agentic RL systems, expanding to additional programming languages and software engineering tasks (such as end-to-end project synthesis), refining multi-agent collaboration protocols, and integrating agent–environment–human interaction models.

7. Representative SWE-RL Papers and Contributions

| Paper/Asset | Research Focus | Key Contributions |
| --- | --- | --- |
| SWE-RL (Wei et al., 25 Feb 2025) | RL for LLMs on open-source SWE data | 41.0% solve rate, generalized reasoning |
| Multi-SWE-bench (Zan et al., 3 Apr 2025) | Multilingual issue resolving & RL community/data pipeline | 1,632 multilingual instances, 4,723 RL training instances |
| SWE-Dev (Du et al., 22 May 2025) | Feature-driven software development; RL with executable tests | 14,000 training samples, RL with test rewards |
| Satori-SWE (Zeng et al., 29 May 2025) | EvoScale: RL-enabled evolutionary test-time scaling | Sample-efficient refinement, Best@N metric |
| RepoForge (Chen et al., 3 Aug 2025) | At-scale SWE agent training, infrastructure, RL scaffolds | Distributed evaluation, storage/cost reduction |
| Kimi-Dev (Yang et al., 27 Sep 2025) | Agentless skill priors + RL for agentic SWE | 60.4% agentless, 48.6% agentic pass@1 |

In summary, SWE-RL designates a body of research, datasets, optimization strategies, and agent architectures dedicated to advancing the reasoning and code synthesis capabilities of LLMs for software engineering, using large-scale RL on real software evolution and test-driven feedback as its methodological pivot.
