SWE-Fixer: Open-Source Code Fix Pipeline

Updated 12 March 2026

SWE-Fixer is an open-source two-stage pipeline that maps natural language GitHub issues to minimal code changes passing unit tests.
It combines a coarse-to-fine retrieval module with BM25 and a fine-tuned Qwen2.5-7B to accurately select relevant code files.
The system employs a structured code editing module with Qwen2.5-72B, achieving state-of-the-art performance among open-source models on SWE-Bench Lite and Verified with only two LLM calls per issue.

SWE-Fixer is an open-source, two-stage pipeline designed to efficiently and effectively resolve real-world GitHub issues by integrating information retrieval and code-editing LLMs trained on a large-scale corpus of issues and corresponding patches. The framework targets practical, test-driven software engineering tasks where a natural language issue is mapped to a minimal set of code changes that pass a repository’s existing unit tests. SWE-Fixer is characterized by a coarse-to-fine retrieval module and a structured code editing module, both fine-tuned on a uniquely constructed dataset of 110,000 GitHub issue–patch–test triples. The architecture achieves state-of-the-art performance among open-source models on SWE-Bench Lite and Verified benchmarks, and is notable for its efficiency—requiring only two LLM calls per issue—without reliance on proprietary models (Xie et al., 9 Jan 2025).

1. System Architecture and Workflow

SWE-Fixer is built as a two-stage pipeline:

Code File Retrieval Module: Utilizes Qwen2.5-7B, fine-tuned as a sequence-to-sequence classifier, for identifying which files in a repository are most relevant to a given GitHub issue.
- Input: Receives issue text and a list of file “skeletons”—abstractions that retain only signatures, docstrings, headers, and the first/last five lines of functions or classes.
- Processing: Applies BM25 to index every file as a document, scoring each against the issue’s textual query, and selects the top 30 files. The LLM reranker then filters this set to return those deemed necessary for modification, output as a JSON array of file paths.
- Training Data: 80K retrieval tasks with gold label “to_modify” sets, restricted such that all gold files are within the BM25 top 30 (Xie et al., 9 Jan 2025).
Code Editing Module: Uses Qwen2.5-72B, fine-tuned with a structured JSON output style, to generate executable code patches.
- Input: Receives the issue and the full content (with line numbers) of files selected by the retriever.
- Output: Produces a JSON object containing a “reasoning” array (chain-of-thought style), plus a collection of precise, line-numbered edits (“edits”), where each specifies the path, the old code snippet annotated by original line numbers, and the new patch.
- Training Data: 70K instances, each paired with a chain-of-thought and a gold patch, where the rationale is generated by prompting GPT-4o with masked diffs in a “rationalization” framework.

The entire workflow makes one call to the retriever and one call to the editor LLM for each issue.
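
The file “skeletons” fed to the retriever can be approximated with Python's standard `ast` module. The sketch below is a simplification rather than the authors' extractor: it handles only top-level functions and classes, keeps the first and last `keep` lines of each (which normally covers the signature and docstring), and replaces the middle with an ellipsis. The function name `file_skeleton` and the elision marker are assumptions made for illustration.

```python
import ast


def file_skeleton(source: str, keep: int = 5) -> str:
    """Return a reduced view of a Python file.

    For every top-level function or class longer than 2 * keep lines,
    keep the first and last `keep` lines and elide the middle.
    Illustrative sketch only; not the extractor used by SWE-Fixer.
    """
    lines = source.splitlines()
    tree = ast.parse(source)

    hidden = set()  # 1-based line numbers to elide
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            start, end = node.lineno, node.end_lineno
            if end - start + 1 > 2 * keep:
                hidden.update(range(start + keep, end - keep + 1))

    out = []
    previous_hidden = False
    for number, line in enumerate(lines, start=1):
        if number in hidden:
            # Emit a single marker per elided run.
            if not previous_hidden:
                out.append("    ...")
            previous_hidden = True
        else:
            out.append(line)
            previous_hidden = False
    return "\n".join(out)
```

Shrinking each file this way is what lets 30 candidate files fit into a single retriever prompt.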

2. Retrieval Methodology and Precision

The code file retrieval module implements a coarse-to-fine selection strategy:

Coarse Retrieval: BM25 is used for rapid scoring. BM25 for a query $Q = \{q_1,…,q_n\}$ and document $D$ is given by

$\mathrm{BM25}(Q, D) = \sum_{i=1}^{n} \mathrm{IDF}(q_i) \times \frac{f(q_i,D)\,(k_1 + 1)}{f(q_i,D) + k_1 (1 - b + b\,|D|/\mathrm{avgdl})},$

where $f(q_i, D)$ is the term frequency, $\mathrm{IDF}(q_i)$ the inverse document frequency, $|D|$ the document length, $\mathrm{avgdl}$ the average document length in the collection, and $k_1 \approx 1.2$, $b \approx 0.75$ (Xie et al., 9 Jan 2025).
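
The formula translates almost directly into code. The following is a minimal sketch over pre-tokenized documents, not the indexing code used by SWE-Fixer. The source does not state which IDF variant is used, so the smoothed form $\ln\left(1 + \frac{N - n(q_i) + 0.5}{n(q_i) + 0.5}\right)$ is an assumption, as are the function names `bm25_scores` and `top_k`.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score each tokenized document in `docs` against the tokenized `query`."""
    n_docs = len(docs)
    avgdl = sum(len(d) for d in docs) / n_docs

    # Document frequency: number of documents containing each term.
    df = Counter()
    for d in docs:
        df.update(set(d))

    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for q in query:
            f = tf[q]
            if f == 0:
                continue
            # Assumed IDF variant (smoothed, always non-negative).
            idf = math.log((n_docs - df[q] + 0.5) / (df[q] + 0.5) + 1.0)
            score += idf * f * (k1 + 1) / (f + k1 * (1 - b + b * len(d) / avgdl))
        scores.append(score)
    return scores


def top_k(scores, k=30):
    """Indices of the k highest-scoring documents, best first."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
```

With `k=30`, `top_k` corresponds to the coarse candidate set that is handed to the reranker.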

Fine Reranking: The top 30 files from BM25 are categorized by the fine-tuned Qwen2.5-7B to obtain the final set of files for editing. Skeletons used in the input limit context while preserving salient file characteristics.

Empirical retrieval performance on the SWE-Bench Lite benchmark is summarized below:

Method	Precision (%)	Recall (%)
BM25 Top-3	18.9	56.7
BM25 Top-30	2.9	86.7
Qwen2.5-7B finetuned	68.5	69.0

Ablation studies show that context window size and inclusion of the repository README have effects of a few points on precision/recall. Mixing in edit-task instances for retriever tuning yields further marginal improvements.
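
The trade-off in the table follows from how the two metrics are defined over file sets. The sketch below computes them for a single instance; whether the reported figures are averaged per instance or pooled across instances is not stated in the source, so the aggregation is left out. The function name `precision_recall` is illustrative.

```python
def precision_recall(predicted, gold):
    """Precision and recall of a predicted file set against the gold set."""
    predicted, gold = set(predicted), set(gold)
    hits = len(predicted & gold)
    precision = hits / len(predicted) if predicted else 0.0
    recall = hits / len(gold) if gold else 0.0
    return precision, recall
```

For an issue with a single gold file, returning 30 candidates that include it gives a recall of 1.0 but a precision of only 1/30 (about 3.3%), which is why BM25 Top-30 scores high on recall and very low on precision, and why a reranking step is needed.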

3. Code Editing Module and Structured Output

The code editing module uses a large code LLM, fine-tuned to produce directly usable JSON patch objects:

Architecture: Qwen2.5-72B, finetuned with a “JsonTuning” methodology, generating both reasoning (the chain-of-thought) and an array of edits.
Output Schema:
- “reasoning”: Optional array justifying changes.
- “edits”: Array of objects, each with a file path, the original code snippet (with line numbers), and the corresponding patch (without line numbers).
Training Protocol:
- Gold-standard patches are paired with GPT-4o–generated rationalizations.
- Optimization via AdamW with a learning rate of $5 \times 10^{-6}$, batch size 96, and 5% warmup.
- JSON format is enforced through structured prompting and, in case of invalid output or syntax failure, resampling with a higher temperature (up to 5 attempts).
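
The resampling step in the last bullet can be sketched as a small retry loop. Everything model-specific here is hypothetical: `sample_fn` stands in for the editor LLM call, and the starting temperature and increment are assumed values, since the source states only that the temperature is raised and that at most 5 attempts are made. The check is limited to JSON validity and the presence of an “edits” array; a syntax check of the patched code would be an additional step.

```python
import json


def generate_valid_edit(sample_fn, max_attempts=5, base_temperature=0.0, step=0.2):
    """Call `sample_fn(temperature)` until it returns a usable JSON object.

    `sample_fn` is a hypothetical stand-in for the editor model; it takes a
    sampling temperature and returns the raw model output as a string.
    Returns the parsed object, or None if every attempt fails.
    """
    for attempt in range(max_attempts):
        # Assumed schedule: raise the temperature a little on each retry.
        temperature = base_temperature + attempt * step
        raw = sample_fn(temperature)
        try:
            obj = json.loads(raw)
        except json.JSONDecodeError:
            continue
        if isinstance(obj, dict) and isinstance(obj.get("edits"), list):
            return obj
    return None
```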

Ablation experiments indicate that retaining full file context with line numbers is optimal (Fix Rate 20%), and that including class/function names as hints increases the fix rate by 2 points. Removing line numbers yields a substantial 6-point drop.
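
One plausible reason line numbers matter is that they make each edit unambiguous to locate and cheap to verify. The sketch below shows how a line-numbered snippet could be applied to a file. The `"<number> <code>"` line format and the names `number_lines` and `apply_edit` are assumptions for illustration, not the exact format used by SWE-Fixer.

```python
import re


def number_lines(text):
    """Prefix every line with its 1-based line number (assumed format)."""
    return "\n".join(f"{i} {line}" for i, line in enumerate(text.splitlines(), start=1))


def apply_edit(text, old_numbered, new_code):
    """Replace the numbered snippet `old_numbered` in `text` with `new_code`.

    Raises ValueError if the snippet does not match the file at the stated
    line numbers, so a misplaced edit is rejected instead of applied.
    """
    lines = text.splitlines()
    numbers, old_lines = [], []
    for row in old_numbered.splitlines():
        match = re.match(r"(\d+) ?(.*)", row)
        if match is None:
            raise ValueError(f"missing line number: {row!r}")
        numbers.append(int(match.group(1)))
        old_lines.append(match.group(2))

    start, end = numbers[0], numbers[-1]
    if numbers != list(range(start, end + 1)):
        raise ValueError("line numbers are not contiguous")
    if lines[start - 1:end] != old_lines:
        raise ValueError("snippet does not match file at the stated lines")

    patched = lines[:start - 1] + new_code.splitlines() + lines[end:]
    trailing = "\n" if text.endswith("\n") else ""
    return "\n".join(patched) + trailing
```

Without line numbers, the same snippet would have to be located by text search, which is ambiguous when a file contains repeated code.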

4. Dataset Curation and Chain-of-Thought Construction

Development of the SWE-Fixer system depended on a newly assembled dataset of real-world code fixes:

Collection: Utilized GitHub REST API event linkage to mine 2,300+ Python repositories (with >100 PRs, excluding SWE-Bench) for issue→PR→test triples, yielding 331,000 triples.
Filtering: Discarded entries with unparsable patches and limited train set to ≤3 edited files per instance (excluding tests), resulting in 110,000 high-quality instances (train_110K).
Statistics:
- 54.7% of issue instances modify one file, 80% affect ≤3 files.
- 73.7% edit <100 lines; 85% edit ≤200 lines.
Chain-of-Thought (CoT) Generation: For each gold patch, GPT-4o is prompted with a partially masked context to “rationalize” the change, producing both a sequence of reasoning steps and the patch. This CoT annotation significantly enhances edit model effectiveness.
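
The filtering rule (drop unparsable patches, keep at most 3 edited non-test files) can be sketched over unified diffs in git format. This is an illustration under assumptions: the source does not say how test files are identified, so the path heuristic in `is_test_path` is a guess, and the header regex does not handle paths containing spaces.

```python
import re

# Matches git-style file headers such as "diff --git a/x.py b/x.py".
DIFF_HEADER = re.compile(r"^diff --git a/(\S+) b/(\S+)$", re.MULTILINE)


def edited_files(patch):
    """Paths touched by a git-format patch, in order of appearance."""
    return [m.group(2) for m in DIFF_HEADER.finditer(patch)]


def is_test_path(path):
    # Assumed heuristic; the source does not specify how tests are detected.
    return "test" in path.lower()


def keep_instance(patch, max_files=3):
    """True if the patch parses and edits between 1 and `max_files` non-test files."""
    files = [p for p in edited_files(patch) if not is_test_path(p)]
    return 0 < len(files) <= max_files
```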

5. Benchmark Evaluation and Comparative Performance

SWE-Fixer was evaluated on industry-grade benchmarks:

SWE-Bench Lite: 300 real-world curated issues.
SWE-Bench Verified: Subset with stringently validated test coverage.

In both cases, the metric is Pass@1 (“Best@1” if only one suggestion is generated): the percentage of instances for which the generated patch passes all provided project tests.

The main results are as follows:

Method	Type	Verified (%)	Lite (%)
Agentless (GPT-4o)	Pipeline	38.8	32.0
SWE-Fixer (ours)	Pipeline	30.2	23.3

SWE-Fixer achieves best-in-class performance among open-source models, with a +1.3 point advantage on Lite and tied best accuracy on Verified. Notably, it also surpasses several pipelines built on GPT-4 or Claude-3-Opus. A variant incorporating PASS_TO_PASS (P2P) filtering achieves 24.7% (Lite) and 32.8% (Verified).
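
Because exactly one patch is produced per issue, the metric reduces to the share of resolved instances. The helper below is a minimal sketch; the name `pass_at_1` and the boolean-list input are illustrative choices.

```python
def pass_at_1(results):
    """Percentage of instances whose single generated patch passes all tests.

    `results` holds one boolean per benchmark instance.
    """
    if not results:
        raise ValueError("no instances to evaluate")
    return 100.0 * sum(results) / len(results)
```

On the 300-issue Lite set, the reported 23.3% is consistent with 70 resolved issues (70/300 ≈ 23.3%), and the 24.7% of the P2P variant with 74 (74/300 ≈ 24.7%). These counts are back-computed from the percentages and are not stated in the source.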

6. Efficiency, Hyperparameters, and Implementation

SWE-Fixer adopts a minimalist approach in computational resource utilization:

Total LLM Invocations: Two per instance—one by the retriever and one by the editor.
Hardware and Framework: Trained on 96 NVIDIA A800 GPUs (64GB each) with PyTorch/xtuner-lite; global batch size 96; 64K sub-token context.
Optimization Settings:
- Retriever: AdamW, lr = $1 \times 10^{-5}$, weight decay = 0.01, warmup 10%.
- Editor: AdamW, lr = $5 \times 10^{-6}$, weight decay = 0.01, warmup 5%.
- Checkpoints at every 1,000 steps and best scores determined on held-out validation splits.
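
To make the warmup percentages concrete, the sketch below converts a warmup ratio into a per-step learning rate. The source gives only the peak learning rates and warmup fractions; the linear ramp, the constant rate afterwards, and the `total_steps` value in the usage note are assumptions, not reported settings.

```python
def lr_at_step(step, total_steps, peak_lr, warmup_ratio):
    """Learning rate at a 0-based training step.

    Assumes a linear ramp to `peak_lr` over the warmup steps and a constant
    rate afterwards; the actual schedule shape is not given in the source.
    """
    warmup_steps = max(1, int(total_steps * warmup_ratio))
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    return peak_lr
```

For a hypothetical run of 1,000 steps, the editor setting (`peak_lr=5e-6`, `warmup_ratio=0.05`) would ramp over the first 50 steps, and the retriever setting (`peak_lr=1e-5`, `warmup_ratio=0.10`) over the first 100.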

Prevailing pipelines invoke 5–10 LLM calls per issue. At two calls per issue, SWE-Fixer makes 60–80% fewer LLM calls and uses approximately 50% less GPU time per instance, while matching or exceeding the accuracy of open-source competitors.

7. Limitations and Prospective Work

Acknowledged limitations and future trajectories for SWE-Fixer include:

Absence of Execution Feedback: Patches are never tested during training (no “test-in-the-loop” learning). Integrating even lightweight execution harnesses (such as LLM-driven test oracles) could enhance fix accuracy.
Language Scope: Currently restricted to Python; generalizing to other programming languages requires language-specific skeleton extraction and greater context budgets.
Context Scaling: File-rich repositories (>1,000 files) may cause the BM25 top-30 window to exclude optimal candidates; adaptive strategies or hierarchical file selection are proposed avenues.
Chain-of-Thought Limitations: While rationalized CoT improves model output, it may sometimes misdirect the editing process. Improving CoT sampling, filtering, or employing self-consistency techniques may yield further gains.
Integration into CI/CD Pipelines: Direct end-to-end deployment into continuous integration workflows, enhanced with test results and artifact checks, is identified as a significant area for future development (Xie et al., 9 Jan 2025).

In summary, SWE-Fixer demonstrates that a streamlined, open-source, two-module pipeline—based on BM25-LLM retrieval, structured code editing, and a large, real-world supervision corpus—can match or surpass the performance of more complex and proprietary alternatives on real-world automated code fixing tasks drawn from GitHub issue data.

References

Xie et al. (2025). SWE-Fixer: Training Open-Source LLMs for Effective and Efficient GitHub Issue Resolution.
