The paper "SWE-RL: Advancing LLM Reasoning via Reinforcement Learning on Open Software Evolution" introduces SWE-RL, a reinforcement learning (RL) method designed to improve the reasoning capabilities of LLMs (LLMs) for software engineering (SE) tasks. The approach leverages software evolution data and rule-based rewards to enable LLMs to autonomously recover developer reasoning processes and solutions.
The core idea involves training an LLM to solve real-world software issues by learning from a curated dataset of GitHub pull requests (PRs). The LLM is tasked with generating code changes to address specific issues, and a reward signal is calculated based on the similarity between the predicted and the ground-truth code changes. This reward is computed using Python's difflib.SequenceMatcher.
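As a concrete illustration, below is a minimal sketch of such a similarity-based reward, assuming the model's response has already been parsed into a patch string (with None returned when parsing fails); the function name and parsing convention are illustrative, not the paper's exact implementation.

```python
import difflib
from typing import Optional


def compute_reward(predicted_patch: Optional[str], oracle_patch: str) -> float:
    """Rule-based reward: -1 for a malformed response, otherwise the
    difflib similarity ratio (in [0, 1]) between predicted and oracle patches."""
    if predicted_patch is None:  # the response could not be parsed into a valid edit format
        return -1.0
    return difflib.SequenceMatcher(None, predicted_patch, oracle_patch).ratio()
```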
The method involves several key steps:
- Raw Pull Request Data Curation: The process begins by curating a dataset of GitHub PRs. This involves collecting GitHub events and cloning repositories to recover PR details. The collected data is then aggregated, associating issues, discussions, code contents, and changes with each PR. To mitigate bias, relevant but unmodified files are also predicted and included for each PR, so the model does not learn that every file in the context must be changed. Data filtering is applied to remove noisy or potentially harmful PRs.
- Reward Modeling: A seed dataset is prepared for RL by extracting high-quality PRs. This seed dataset contains issue descriptions, code context, and oracle patches. The reward function $\mathcal{R}$ is defined based on whether the generated code changes are correctly formatted and on the similarity score between the predicted patch $\tau$ and the oracle patch $\tau^{*}$. Incorrectly formatted responses receive a negative reward. The reward function is:
$$
\mathcal{R}(\tau) =
\begin{cases}
-1, & \text{if the format of } \tau \text{ is wrong}, \\
\operatorname{compare}(\tau, \tau^{*}), & \text{otherwise},
\end{cases}
$$
where $\operatorname{compare}(\cdot, \cdot)$ is instantiated with Python's difflib.SequenceMatcher.
- Policy Optimization: Group Relative Policy Optimization (GRPO) is used for policy optimization. Given a seed RL dataset, the policy LLM tries to solve issues by generating code changes through reasoning, producing multiple outputs for each input prompt. The policy LLM aims to maximize the GRPO objective:
$$
\mathcal{J}_{\text{GRPO}}(\theta) = \mathbb{E}_{q \sim P(Q),\, \{o_i\}_{i=1}^{G} \sim \pi_{\theta_{\text{old}}}(O \mid q)} \left[ \frac{1}{G} \sum_{i=1}^{G} \frac{1}{|o_i|} \sum_{t=1}^{|o_i|} \left( \min\left( r_{i,t}(\theta)\, \hat{A}_{i,t},\ \operatorname{clip}\left(r_{i,t}(\theta),\, 1-\varepsilon,\, 1+\varepsilon\right) \hat{A}_{i,t} \right) - \beta\, \mathbb{D}_{\mathrm{KL}}\left(\pi_\theta \,\|\, \pi_{\mathrm{ref}}\right) \right) \right],
$$
where $r_{i,t}(\theta) = \frac{\pi_\theta(o_{i,t} \mid q, o_{i,<t})}{\pi_{\theta_{\text{old}}}(o_{i,t} \mid q, o_{i,<t})}$ is the token-level probability ratio, $\hat{A}_{i,t} = \frac{\mathcal{R}(o_i) - \operatorname{mean}(\{\mathcal{R}(o_1), \ldots, \mathcal{R}(o_G)\})}{\operatorname{std}(\{\mathcal{R}(o_1), \ldots, \mathcal{R}(o_G)\})}$, and $\mathbb{D}_{\mathrm{KL}}$ is approximated per token as $\frac{\pi_{\mathrm{ref}}(o_{i,t} \mid q, o_{i,<t})}{\pi_\theta(o_{i,t} \mid q, o_{i,<t})} - \log \frac{\pi_{\mathrm{ref}}(o_{i,t} \mid q, o_{i,<t})}{\pi_\theta(o_{i,t} \mid q, o_{i,<t})} - 1$.
Here, $\hat{A}_{i,t}$ represents the advantages calculated using group-normalized rewards, and $\pi_{\theta_{\text{old}}}$ and $\pi_{\mathrm{ref}}$ are the old and reference policies, respectively. The terms $\varepsilon$ and $\beta$ are hyperparameters, and $\mathbb{D}_{\mathrm{KL}}$ denotes the approximated KL-divergence. (A minimal sketch of the advantage computation appears after this list.)
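To make the group-relative advantage concrete, here is a minimal sketch of how the rewards of the $G$ sampled outputs for one prompt could be normalized; this is a generic GRPO-style computation, not the paper's training code.

```python
import statistics
from typing import List


def group_advantages(rewards: List[float]) -> List[float]:
    """Normalize each sampled output's reward by the mean and standard deviation
    of its group (the G outputs drawn for the same prompt)."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards)
    if std == 0.0:  # guard against a zero-variance group
        return [0.0 for _ in rewards]
    return [(r - mean) / std for r in rewards]
```

Every token of output $o_i$ then shares the same advantage $\hat{A}_{i,t}$, since the reward is assigned to the whole generated patch rather than to individual tokens.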
A model trained with SWE-RL, Llama3-SWE-RL-70B, based on Llama-3.3-70B-Instruct, achieved a 41.0% solve rate on SWE-bench Verified. Ablation studies showed that Llama3-SWE-RL-70B significantly outperforms the Llama baseline. A supervised fine-tuning (SFT) model, Llama3-SWE-SFT-70B, was also developed for comparison, demonstrating that Llama3-SWE-RL-70B not only surpasses the SFT model on SWE-bench but also excels in out-of-domain (OOD) tasks.
The contributions of the paper are:
- The introduction of SWE-RL, an RL approach for enhancing LLMs for SE tasks using software evolution data and rule-based rewards.
- The development of Llama3-SWE-RL-70B, which achieves a 41.0% solve rate on SWE-bench Verified.
- The demonstration that applying RL to real-world SE tasks can enhance an LLM's general reasoning abilities.
The evaluation includes:
- Experimental Setup: Llama3-SWE-RL-70B was trained on top of Llama-3.3-70B-Instruct using SWE-RL. Agentless Mini was developed as the underlying scaffold. The evaluation was conducted on SWE-bench Verified. An SFT baseline, Llama3-SWE-SFT-70B, was trained for comparison.
- Main Results: Llama3-SWE-RL-70B achieves state-of-the-art results among small and medium-sized LLMs by resolving 41.0% of the issues.
- Baseline Comparison: Compares Llama3-SWE-RL-70B with the corresponding Llama-3.3 and SFT baselines, using Agentless Mini as the underlying scaffold. The base Llama-3.3 model struggles to produce correctly formatted code edits. With SFT, most code edits generated by the LLM are correctly formatted, and repair performance improves significantly. Llama3-SWE-RL-70B demonstrates even greater gains in repair capability.
- Scaling Analysis: Increasing both the number of repair samples and test samples enhances performance on SWE-bench.
- Generalizability of RL: Llama3-SWE-RL is trained with SWE-RL only on issue-solving data. The LLMs are evaluated on five out-of-domain benchmarks: HumanEval+ for function-level code generation, BigCodeBench for practical code generation with library use, CRUXEval for code execution reasoning, MATH for mathematical reasoning, and MMLU for general language understanding.
- Reward Ablation: An ablation of SWE-RL's reward function and its training dynamics. The default continuous reward function is compared against a discrete reward (see the sketch below).
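For illustration, a discrete variant could award full credit only for an exact match while keeping the format penalty; the exact thresholding used in the paper's ablation is an assumption here.

```python
from typing import Optional


def discrete_reward(predicted_patch: Optional[str], oracle_patch: str) -> float:
    """Assumed discrete variant: -1 for a malformed response, 1 only when the
    predicted patch matches the oracle exactly, and 0 otherwise."""
    if predicted_patch is None:
        return -1.0
    return 1.0 if predicted_patch == oracle_patch else 0.0
```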
The Agentless Mini scaffold is a simplified version of Agentless, emphasizing component decomposition, parallelization, and scalability. Agentless Mini consists of localization, repair, reproduction test generation and selection, regression test selection, and reranking; a high-level sketch of these stages follows.
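The stages can be pictured as a pipeline of pluggable components, as in the sketch below; the function signatures are assumptions for illustration and do not mirror the actual Agentless Mini interfaces.

```python
from typing import Callable, List


def agentless_mini_pipeline(
    issue: str,
    localize: Callable[[str], List[str]],
    repair: Callable[[str, List[str]], List[str]],
    generate_repro_tests: Callable[[str], List[str]],
    select_regression_tests: Callable[[], List[str]],
    rerank: Callable[[List[str], List[str], List[str]], str],
) -> str:
    """Run the Agentless Mini stages in order and return the final patch."""
    files = localize(issue)                       # localization
    candidate_patches = repair(issue, files)      # repair: many samples, generated in parallel
    repro_tests = generate_repro_tests(issue)     # reproduction test generation and selection
    regression_tests = select_regression_tests()  # regression test selection
    return rerank(candidate_patches, repro_tests, regression_tests)  # reranking picks the final patch
```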
The paper also discusses synthesizing supervised fine-tuning (SFT) data. The data generation pipeline is inspired by Magicoder, whose OSS-Instruct technique generates high-quality code instruction data from open-source seed snippets.
Additionally, the paper explores orthogonal ways to improve LLMs on real-world software engineering. The raw PR collection can be directly utilized through continued pretraining (midtraining) to empower small LLMs. This involves data packing design, formatting PR data as dialogs, dynamic context adjustment, and stable training and annealing.
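As a rough illustration of the dialog-formatting idea, a PR and its discussion could be serialized into alternating turns as in the sketch below; the tags and template are hypothetical, not the paper's actual format.

```python
from typing import List


def pr_to_dialog(issue_title: str, issue_body: str, comments: List[str], patch: str) -> str:
    """Serialize an issue, its discussion, and the final patch into one
    dialog-style training document (hypothetical template)."""
    turns = [f"[ISSUE] {issue_title}\n{issue_body}"]
    turns.extend(f"[COMMENT] {comment}" for comment in comments)
    turns.append(f"[PATCH]\n{patch}")
    return "\n\n".join(turns)
```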
The limitations of the approach include the reward implementation comparing sequence similarity rather than semantic equivalence, the simplified localization process in Agentless Mini, the pipeline-based approach hindering holistic problem-solving, and the substantial sampling budget required. Future work involves integrating agentic reinforcement learning, incorporating execution, and improving sample efficiency.