The paper "SWE-RL: Advancing LLM Reasoning via Reinforcement Learning on Open Software Evolution" introduces SWE-RL, a reinforcement learning (RL) method designed to improve the reasoning capabilities of LLMs (LLMs) for software engineering (SE) tasks. The approach leverages software evolution data and rule-based rewards to enable LLMs to autonomously recover developer reasoning processes and solutions.
The core idea involves training an LLM to solve real-world software issues by learning from a curated dataset of GitHub pull requests (PRs). The LLM is tasked with generating code changes to address specific issues, and a reward signal is calculated based on the similarity between the predicted and the ground-truth code changes. This reward is computed using Python's difflib.SequenceMatcher.
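As a concrete illustration, below is a minimal sketch of such a similarity-based reward, assuming the model's response has already been parsed into a patch string (with None returned when parsing fails); the function name and parsing convention are illustrative, not the paper's exact implementation.

```python
import difflib
from typing import Optional


def compute_reward(predicted_patch: Optional[str], oracle_patch: str) -> float:
    """Rule-based reward: -1 for a malformed response, otherwise the
    difflib similarity ratio (in [0, 1]) between predicted and oracle patches."""
    if predicted_patch is None:  # the response could not be parsed into a valid edit format
        return -1.0
    return difflib.SequenceMatcher(None, predicted_patch, oracle_patch).ratio()
```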
The method involves several key steps:
- Raw Pull Request Data Curation: The process begins by curating a dataset of GitHub PRs. This involves collecting GitHub events and cloning repositories to recover PR details. The collected data is then aggregated, associating issues, discussions, code contents, and changes with each PR. To mitigate bias, relevant but unmodified files are also predicted and included for each PR, so the model does not learn that every file in the context must be changed. Data filtering is applied to remove noisy or potentially harmful PRs.
- Reward Modeling: A seed dataset is prepared for RL by extracting high-quality PRs. This seed dataset contains issue descriptions, code context, and oracle patches. The reward function $\mathcal{R}$ is defined based on whether the generated code changes are correctly formatted and on the similarity score between the predicted patch $\tau$ and the oracle patch $\tau^{*}$. Incorrectly formatted responses receive a negative reward. The reward function is:
$$
\mathcal{R}(\tau) =
\begin{cases}
-1, & \text{if the format of } \tau \text{ is wrong}, \\
\operatorname{compare}(\tau, \tau^{*}), & \text{otherwise},
\end{cases}
$$
where $\operatorname{compare}(\cdot, \cdot)$ is instantiated with Python's difflib.SequenceMatcher.
- Policy Optimization: Group Relative Policy Optimization (GRPO) is used for policy optimization. Given a seed RL dataset, the policy LLM tries to solve issues by generating code changes through reasoning, producing multiple outputs for each input prompt. The policy LLM aims to maximize the GRPO objective:
$$
\mathcal{J}_{\text{GRPO}}(\theta) = \mathbb{E}_{q \sim P(Q),\, \{o_i\}_{i=1}^{G} \sim \pi_{\theta_{\text{old}}}(O \mid q)} \left[ \frac{1}{G} \sum_{i=1}^{G} \frac{1}{|o_i|} \sum_{t=1}^{|o_i|} \left( \min\left( r_{i,t}(\theta)\, \hat{A}_{i,t},\ \operatorname{clip}\left(r_{i,t}(\theta),\, 1-\varepsilon,\, 1+\varepsilon\right) \hat{A}_{i,t} \right) - \beta\, \mathbb{D}_{\mathrm{KL}}\left(\pi_\theta \,\|\, \pi_{\mathrm{ref}}\right) \right) \right],
$$
where $r_{i,t}(\theta) = \frac{\pi_\theta(o_{i,t} \mid q, o_{i,<t})}{\pi_{\theta_{\text{old}}}(o_{i,t} \mid q, o_{i,<t})}$ is the token-level probability ratio, $\hat{A}_{i,t} = \frac{\mathcal{R}(o_i) - \operatorname{mean}(\{\mathcal{R}(o_1), \ldots, \mathcal{R}(o_G)\})}{\operatorname{std}(\{\mathcal{R}(o_1), \ldots, \mathcal{R}(o_G)\})}$, and $\mathbb{D}_{\mathrm{KL}}$ is approximated per token as $\frac{\pi_{\mathrm{ref}}(o_{i,t} \mid q, o_{i,<t})}{\pi_\theta(o_{i,t} \mid q, o_{i,<t})} - \log \frac{\pi_{\mathrm{ref}}(o_{i,t} \mid q, o_{i,<t})}{\pi_\theta(o_{i,t} \mid q, o_{i,<t})} - 1$.
Here, $\hat{A}_{i,t}$ represents the advantages calculated using group-normalized rewards, and $\pi_{\theta_{\text{old}}}$ and $\pi_{\mathrm{ref}}$ are the old and reference policies, respectively. The terms $\varepsilon$ and $\beta$ are hyperparameters, and $\mathbb{D}_{\mathrm{KL}}$ denotes the approximated KL-divergence. (A minimal sketch of the advantage computation appears after this list.)
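To make the group-relative advantage concrete, here is a minimal sketch of how the rewards of the $G$ sampled outputs for one prompt could be normalized; this is a generic GRPO-style computation, not the paper's training code.

```python
import statistics
from typing import List


def group_advantages(rewards: List[float]) -> List[float]:
    """Normalize each sampled output's reward by the mean and standard deviation
    of its group (the G outputs drawn for the same prompt)."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards)
    if std == 0.0:  # guard against a zero-variance group
        return [0.0 for _ in rewards]
    return [(r - mean) / std for r in rewards]
```

Every token of output $o_i$ then shares the same advantage $\hat{A}_{i,t}$, since the reward is assigned to the whole generated patch rather than to individual tokens.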
A model trained with SWE-RL, Llama3-SWE-RL-70B, based on Llama-3.3-70B-Instruct, achieved a 41.0% solve rate on SWE-bench Verified. Ablation studies showed that Llama3-SWE-RL-70B significantly outperforms the Llama baseline. A supervised fine-tuning (SFT) model, Llama3-SWE-SFT-70B, was also developed for comparison, demonstrating that Llama3-SWE-RL-70B not only surpasses the SFT model on SWE-bench but also excels in out-of-domain (OOD) tasks.
The contributions of the paper are:
- The introduction of SWE-RL, an RL approach for enhancing LLMs for SE tasks using software evolution data and rule-based rewards.
- The development of Llama3-SWE-RL-70B, which achieves a 41.0% solve rate on SWE-bench Verified.
- The demonstration that applying RL to real-world SE tasks can enhance an LLM's general reasoning abilities.
The evaluation includes:
- Experimental Setup: Llama3-SWE-RL-70B was trained on top of Llama-3.3-70B-Instruct using SWE-RL. Agentless Mini was developed as the underlying scaffold. The evaluation was conducted on SWE-bench Verified. An SFT baseline, Llama3-SWE-SFT-70B, was trained for comparison.
- Main Results: Llama3-SWE-RL-70B achieves state-of-the-art results among small and medium-sized LLMs by resolving 41.0% of the issues.
- Baseline Comparison: Compares Llama3-SWE-RL-70B with the corresponding Llama-3.3 and SFT baselines, using Agentless Mini as the underlying scaffold. The base Llama-3.3 model struggles to produce correctly formatted code edits. With SFT, most code edits generated by the LLM are correctly formatted, and repair performance improves significantly. Llama3-SWE-RL-70B demonstrates even greater gains in repair capability.
- Scaling Analysis: Increasing both the number of repair samples and test samples enhances performance on SWE-bench.
- Generalizability of RL: Llama3-SWE-RL is trained with SWE-RL only on issue-solving data. The LLMs are evaluated on five out-of-domain benchmarks: HumanEval+ for function-level code generation, BigCodeBench for practical code generation with library use, CRUXEval for code execution reasoning, MATH for mathematical reasoning, and MMLU for general language understanding.
- Reward Ablation: An ablation of SWE-RL's reward function and its training dynamics. The default continuous reward function is compared against a discrete reward (see the sketch below).
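For illustration, a discrete variant could award full credit only for an exact match while keeping the format penalty; the exact thresholding used in the paper's ablation is an assumption here.

```python
from typing import Optional


def discrete_reward(predicted_patch: Optional[str], oracle_patch: str) -> float:
    """Assumed discrete variant: -1 for a malformed response, 1 only when the
    predicted patch matches the oracle exactly, and 0 otherwise."""
    if predicted_patch is None:
        return -1.0
    return 1.0 if predicted_patch == oracle_patch else 0.0
```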
The Agentless Mini scaffold is a simplified version of Agentless, emphasizing component decomposition, parallelization, and scalability. Agentless Mini consists of localization, repair, reproduction test generation and selection, regression test selection, and reranking; a high-level sketch of these stages follows.
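The stages can be pictured as a pipeline of pluggable components, as in the sketch below; the function signatures are assumptions for illustration and do not mirror the actual Agentless Mini interfaces.

```python
from typing import Callable, List


def agentless_mini_pipeline(
    issue: str,
    localize: Callable[[str], List[str]],
    repair: Callable[[str, List[str]], List[str]],
    generate_repro_tests: Callable[[str], List[str]],
    select_regression_tests: Callable[[], List[str]],
    rerank: Callable[[List[str], List[str], List[str]], str],
) -> str:
    """Run the Agentless Mini stages in order and return the final patch."""
    files = localize(issue)                       # localization
    candidate_patches = repair(issue, files)      # repair: many samples, generated in parallel
    repro_tests = generate_repro_tests(issue)     # reproduction test generation and selection
    regression_tests = select_regression_tests()  # regression test selection
    return rerank(candidate_patches, repro_tests, regression_tests)  # reranking picks the final patch
```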
The paper also discusses synthesizing supervised fine-tuning (SFT) data. The data generation pipeline is inspired by Magicoder, whose OSS-Instruct technique generates high-quality code instruction data from open-source seed snippets.
Additionally, the paper explores orthogonal ways to improve LLMs on real-world software engineering. The raw PR collection can be directly utilized through continued pretraining (midtraining) to empower small LLMs. This involves data packing design, formatting PR data as dialogs, dynamic context adjustment, and stable training and annealing.
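As a rough illustration of the dialog-formatting idea, a PR and its discussion could be serialized into alternating turns as in the sketch below; the tags and template are hypothetical, not the paper's actual format.

```python
from typing import List


def pr_to_dialog(issue_title: str, issue_body: str, comments: List[str], patch: str) -> str:
    """Serialize an issue, its discussion, and the final patch into one
    dialog-style training document (hypothetical template)."""
    turns = [f"[ISSUE] {issue_title}\n{issue_body}"]
    turns.extend(f"[COMMENT] {comment}" for comment in comments)
    turns.append(f"[PATCH]\n{patch}")
    return "\n\n".join(turns)
```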
The limitations of the approach include the reward implementation comparing sequence similarity rather than semantic equivalence, the simplified localization process in Agentless Mini, the pipeline-based approach hindering holistic problem-solving, and the substantial sampling budget required. Future work involves integrating agentic reinforcement learning, incorporating execution, and improving sample efficiency.