
SWE-Llama: Automated Code Repair Models

Updated 24 November 2025
  • SWE-Llama is a suite of LLaMA-based language models designed for automated patch generation and repair in large-scale Python repositories.
  • It leverages LoRA fine-tuning and reinforcement learning to enhance performance on real-world GitHub issues with concrete success metrics.
  • The framework integrates conversational feedback and advanced context handling to address both single-file fixes and complex multi-file edits.

SWE-Llama refers to a set of LLMs specialized for software engineering tasks, focused on automated code editing and repair in large-scale Python repositories as evaluated within SWE-bench and related benchmarks. It encompasses the original open-weight fine-tuned LLaMA variants adapted for patch generation from real GitHub issues, as well as conversation-driven and reinforcement learning-augmented approaches that leverage the LLaMA and Llama 3 architectures for program repair workflows (Jimenez et al., 2023, Cheshkov et al., 6 Oct 2024, Wei et al., 25 Feb 2025).

1. Model Architectures and Fine-Tuning Strategies

SWE-Llama comprises open-weight LLaMA derivatives with 7B and 13B parameters, each inheriting the standard Transformer architecture of the LLaMA family. The 7B model uses 32 decoder layers, a hidden size of 4096, and 32 self-attention heads; the 13B model uses 40 decoder layers, a hidden size of 5120, and 40 heads. Rather than retraining all parameters, SWE-Llama employs Low-Rank Adaptation (LoRA), inserting rank-16 adapters (with α = 16 and 5% dropout) into the query, key, value, and output projection matrices of each attention sublayer. The bulk of pre-trained parameters remain frozen, and only a small number of adapter weights are trained.
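This adapter setup can be expressed compactly with a parameter-efficient fine-tuning library. The following is a minimal sketch assuming the Hugging Face peft API and a CodeLlama-Python base checkpoint; it is illustrative rather than the released SWE-Llama training code.

```python
# Minimal sketch (not the released SWE-Llama training code): rank-16 LoRA adapters
# with alpha = 16 and 5% dropout on the attention projection matrices.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# Assumed starting checkpoint; the exact base weights are an assumption here.
base = AutoModelForCausalLM.from_pretrained("codellama/CodeLlama-7b-Python-hf")

lora_cfg = LoraConfig(
    r=16,                                                     # adapter rank
    lora_alpha=16,                                            # scaling factor alpha = 16
    lora_dropout=0.05,                                        # 5% adapter dropout
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # query/key/value/output projections
    task_type="CAUSAL_LM",
)

model = get_peft_model(base, lora_cfg)
model.print_trainable_parameters()  # only the small set of adapter weights is trainable
```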

To accommodate repository-scale and long-context tasks—sometimes exceeding 100,000 tokens—SWE-Llama applies system-level optimizations such as FlashAttention and DeepSpeed Ulysses to extend input sequence handling well beyond typical context limits.
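As a rough illustration of the attention-kernel side of these optimizations, a model can be loaded with FlashAttention enabled as below; the DeepSpeed Ulysses sequence-parallel settings live in the DeepSpeed launcher configuration and are omitted. The checkpoint name and precision are assumptions, not details reported for SWE-Llama.

```python
# Hedged sketch: enable the FlashAttention kernel when loading a LLaMA-family model
# so that very long inputs remain memory-tractable. Sequence parallelism
# (DeepSpeed Ulysses) would be configured separately at launch time.
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "codellama/CodeLlama-7b-Python-hf",       # assumed checkpoint
    torch_dtype=torch.bfloat16,               # assumed precision
    attn_implementation="flash_attention_2",  # FlashAttention-backed attention
)
```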

Table: SWE-Llama Technical Specifications

Variant | Layers | Hidden Size | Attention Heads | LoRA Application
7B      | 32     | 4096        | 32              | All attention projection matrices (q, k, v, o)
13B     | 40     | 5120        | 40              | All attention projection matrices (q, k, v, o)

(Jimenez et al., 2023)

2. Training Data and Methodology

Fine-tuning SWE-Llama utilizes the SWE-bench-train corpus, consisting of approximately 19,000 issue–pull-request pairs sampled from 37 popular Python repositories (disjoint from the evaluation set). Each training instance has three components:

  • The issue description (in natural language),
  • The “gold” patch (a unified diff specifying all affected files/lines),
  • The minimal subset of the codebase encompassing all files edited in the patch.

Instances are tokenized with the CodeLlama BPE vocabulary. Examples exceeding 30,000 tokens are discarded to maintain tractability, resulting in about 10,000 effective training samples. Training minimizes next-token cross-entropy loss using AdamW (learning rate 6 × 10⁻⁴, batch size 32, up to 4 epochs), with the best checkpoint selected by validation loss.
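A sketch of the length-filtering step is given below; it assumes the CodeLlama tokenizer from Hugging Face and a hypothetical keep_instance helper, and is not the benchmark's released preprocessing code.

```python
# Sketch of the 30,000-token length filter described above (implementation details assumed).
from transformers import AutoTokenizer

MAX_TOKENS = 30_000  # examples longer than this are dropped before fine-tuning
tokenizer = AutoTokenizer.from_pretrained("codellama/CodeLlama-7b-hf")  # CodeLlama BPE vocabulary

def keep_instance(issue_text: str, code_context: str, gold_patch: str) -> bool:
    """Return True if issue + code context + gold patch fit under the token budget."""
    full_example = "\n".join([issue_text, code_context, gold_patch])
    return len(tokenizer(full_example)["input_ids"]) <= MAX_TOKENS

# Roughly 19k raw pairs reduce to about 10k effective samples after this filter, e.g.:
# filtered = [ex for ex in raw_pairs if keep_instance(ex["issue"], ex["context"], ex["patch"])]
```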

No explicit curriculum or data augmentation is used. Regularization derives from LoRA’s built-in dropout (Jimenez et al., 2023).

3. Evaluation Protocols and Quantitative Performance

SWE-Llama is primarily assessed using the SWE-bench benchmark, which poses 2,294 real-world software engineering tasks drawn from 12 high-profile Python repositories. The model is tasked with generating code edits that resolve specified GitHub issues. Key metrics include:

  • Patch Application Rate: Fraction of generated diffs that can be successfully applied to the codebase.
  • Resolution Success Rate: Fraction of cases for which all previously failing tests pass after application of the patch.

Formally, the resolution rate is given by:

$$\mathrm{SuccessRate} = \frac{1}{N}\sum_{i=1}^{N} \mathbf{1}\{\text{model solves issue}_i\}$$

Under realistic sparse retrieval (BM25, up to 13k tokens), SWE-Llama 13B resolves 0.70% of tasks (compared to 1.96% for Claude 2 and 0.17% for ChatGPT-3.5). In the oracle setting (where only reference-edited files are provided), SWE-Llama 13B achieves 3.97% resolution, approaching Claude 2's 4.80%. Across all settings, resolution rates for all models remain below 5%, highlighting the substantial difficulty of real-world cross-file patching (Jimenez et al., 2023).
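The two metrics can be illustrated with a simplified harness. The official SWE-bench evaluation builds per-instance environments and test mappings, so the helpers below (patch_applies, issue_resolved) are illustrative stand-ins only.

```python
# Simplified metric bookkeeping: does a generated diff apply, and do the
# previously failing tests pass afterwards? Environment setup is omitted.
import subprocess

def patch_applies(repo_dir: str, diff_path: str) -> bool:
    """Patch Application Rate numerator: the generated diff applies cleanly."""
    check = subprocess.run(["git", "apply", "--check", diff_path],
                           cwd=repo_dir, capture_output=True)
    return check.returncode == 0

def issue_resolved(repo_dir: str, fail_to_pass_tests: list[str]) -> bool:
    """Resolution Success Rate numerator: all previously failing tests now pass."""
    run = subprocess.run(["python", "-m", "pytest", *fail_to_pass_tests],
                         cwd=repo_dir, capture_output=True)
    return run.returncode == 0

def success_rate(outcomes: list[bool]) -> float:
    """SuccessRate = (1/N) * sum over i of 1{model solves issue_i}."""
    return sum(outcomes) / len(outcomes) if outcomes else 0.0
```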

4. Conversational and Feedback-Driven Program Repair

Recent work extends SWE-Llama to a conversational, test-suite-based pipeline in which fault localization is oracle-driven and the model iterates on feedback from test outcomes. The conversational architecture proceeds as follows (a minimal driver-loop sketch is given after the list):

  1. Initial prompt with the faulty function and issue description.
  2. LLM proposes a patch within <replace>...</replace> tags.
  3. A driver script applies the patch and runs the public test suite; the next step depends on the outcome:
    • Syntax errors trigger a “syntax error” prompt for revision.
    • Test failures trigger a “test failure” prompt including the error log for further correction.
    • Success triggers evaluation on hidden test cases for final validation.
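A minimal sketch of such a driver loop follows. The prompt wording, turn budget, and the llm and apply_patch_and_run_public_tests helpers are hypothetical stand-ins, not the authors' released pipeline.

```python
# Hedged sketch of the conversational repair loop: propose a patch in <replace> tags,
# apply it, run the public tests, and feed any error log back for revision.
import re

MAX_TURNS = 10  # assumed iteration budget

def llm(messages: list[dict]) -> str:
    """Stand-in for a chat call to the instruct-tuned model (e.g. LLaMA 3.1-70B)."""
    raise NotImplementedError("plug in a chat-completion client here")

def apply_patch_and_run_public_tests(patch: str) -> tuple[bool, str]:
    """Stand-in: apply the candidate patch and run the public test suite.
    Returns (passed, log), where log carries the syntax error or failing-test output."""
    raise NotImplementedError

def repair_loop(issue: str, faulty_function: str) -> str | None:
    messages = [{"role": "user",
                 "content": (f"Issue:\n{issue}\n\nFaulty function:\n{faulty_function}\n"
                             "Reply with a corrected version inside <replace>...</replace> tags.")}]
    for _ in range(MAX_TURNS):
        reply = llm(messages)
        match = re.search(r"<replace>(.*?)</replace>", reply, re.DOTALL)
        if match is None:
            messages += [{"role": "assistant", "content": reply},
                         {"role": "user", "content": "No <replace> block found; please try again."}]
            continue
        patch = match.group(1)
        passed, log = apply_patch_and_run_public_tests(patch)
        if passed:
            return patch  # plausible patch; final validation uses the hidden test cases
        messages += [{"role": "assistant", "content": reply},
                     {"role": "user", "content": f"The patch did not pass:\n{log}\nPlease revise."}]
    return None
```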

On 92 single-function bugs from SWE-Bench Lite, an instruct-tuned LLaMA 3.1-70B model achieves a 62% plausible patch rate (passing all public tests) and a 47% correctness rate (also passing the hidden suite). The conversational approach outperforms a repeated one-shot sampling baseline (47% vs. 34% correctness with LLaMA). Conversational error feedback and precise fault localization both contribute significantly to overall success (Cheshkov et al., 6 Oct 2024).

Table: Conversational Pipeline Results (SWE-Bench Lite, 92 Problems)

Model            | Plausible (%) | Correct (%)
LLaMA 3.1-70B    | 62            | 47
GPT-4o-mini      | 56            | 46
State-of-the-art | –             | 44

(Cheshkov et al., 6 Oct 2024)

5. Reinforcement Learning Enhancements via Software Evolution Data

An alternative approach, SWE-RL, leverages reinforcement learning (RL) with policy gradient techniques on massive aggregated software evolution data. Building on Llama-3.3-70B-Instruct, SWE-RL employs a Group Relative Policy Optimization (GRPO) framework. For each training instance, the policy generates multiple candidate patches; a rule-based reward is computed from the similarity to the oracle patch (using Python’s difflib.SequenceMatcher in [0,1]; -1 for invalid format); and the GRPO update favors trajectories with higher normalized reward.
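The reward and the group-relative normalization can be sketched as follows; the exact format checks and scaling used in SWE-RL are assumptions here.

```python
# Sketch of SWE-RL's rule-based reward: difflib similarity between the candidate
# and oracle patches in [0, 1], with -1 for output that cannot be parsed as a patch,
# followed by GRPO-style normalization of rewards within a sampled group.
import difflib
import statistics

def patch_reward(candidate_patch: str | None, oracle_patch: str) -> float:
    if candidate_patch is None:  # wrongly formatted / unparseable generation
        return -1.0
    return difflib.SequenceMatcher(None, candidate_patch, oracle_patch).ratio()

def group_relative_advantages(rewards: list[float]) -> list[float]:
    """Advantage of each candidate relative to its sampling group (assumed normalization)."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # avoid division by zero when all rewards match
    return [(r - mean) / std for r in rewards]
```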

The RL runs over a curated set of roughly 11 million PRs with corresponding issues and patches. No supervised loss is used; all updates are reward-driven.

On SWE-bench Verified (500 human-verified issues), the resulting Llama3-SWE-RL-70B model achieves a 41.0% solve rate, which is state-of-the-art among open LLMs under 100B parameters and approaches proprietary models (e.g., GPT-4o at 38.8%, Claude-3.5-Sonnet at 50.8%). Furthermore, RL-tuned models retain or improve performance on out-of-domain coding, mathematics, and reasoning tasks, in contrast to supervised fine-tuning, which often degrades generalization (Wei et al., 25 Feb 2025).

6. Qualitative Insights, Limitations, and Future Work

Qualitative analysis of SWE-Llama reveals compositional strengths and limitations:

  • For single-file, single-hunk fixes, such as whitespace corrections or simple code replacements, models can frequently produce correct or even improved patches.
  • In multi-file or structural-edit scenarios (e.g., cross-module consistency, dependency-aware algorithmic changes), current models tend to underestimate the required edit scope or introduce regressions in unrelated test cases.
  • Conversational iteration with granular feedback promotes patch diversity and error correction that naive repeated sampling does not achieve.

Two principal limitations are identified:

  1. Context Sensitivity: Performance is highly dependent on the precision of retrieved or provided context. Enlarging the window with unrelated files sharply diminishes resolution rates, and sparse retrieval settings can mismatch the model’s training distribution.
  2. Reasoning Across Structure: Models struggle with cross-file, multi-function coordination and global codebase reasoning.

Potential advances include tool-augmented agentic workflows (test-run-refine cycles), improved context retrieval, progressive curriculum training, and deeper integration with code semantics (e.g., type graphs, dependency structures, multi-modal signals like screenshots).

A plausible implication is that RL-based and conversational feedback-driven approaches represent promising steps toward more autonomous, robust, and general automated software engineering agents (Jimenez et al., 2023, Cheshkov et al., 6 Oct 2024, Wei et al., 25 Feb 2025).
