SWE-Fixer: Open-Source Code Fix Pipeline
- SWE-Fixer is an open-source two-stage pipeline that maps natural language GitHub issues to minimal code changes passing unit tests.
- It combines a coarse-to-fine retrieval module with BM25 and a fine-tuned Qwen2.5-7B to accurately select relevant code files.
- The system employs a structured code editing module built on Qwen2.5-72B, achieving state-of-the-art performance among open-source models on SWE-Bench benchmarks with only two LLM calls per issue.
SWE-Fixer is an open-source, two-stage pipeline for resolving real-world GitHub issues. It combines information retrieval with code-editing LLMs trained on a large-scale corpus of issues and their corresponding patches. The framework targets practical, test-driven software engineering tasks in which a natural language issue must be mapped to a minimal set of code changes that pass the repository’s existing unit tests. It consists of a coarse-to-fine retrieval module and a structured code editing module, both fine-tuned on a newly constructed dataset of 110,000 GitHub issue–patch–test triples. The architecture achieves state-of-the-art performance among open-source models on the SWE-Bench Lite and Verified benchmarks, and it is efficient, requiring only two LLM calls per issue, with no reliance on proprietary models (Xie et al., 9 Jan 2025).
1. System Architecture and Workflow
SWE-Fixer is built as a two-stage pipeline:
- Code File Retrieval Module: Utilizes Qwen2.5-7B, fine-tuned as a sequence-to-sequence classifier, for identifying which files in a repository are most relevant to a given GitHub issue.
  - Input: Receives issue text and a list of file “skeletons”: abstractions that retain only signatures, docstrings, headers, and the first/last five lines of functions or classes.
  - Processing: Applies BM25 to index every file as a document, scoring each against the issue’s textual query, and selects the top 30 files. The LLM reranker then filters this set to return those deemed necessary for modification, output as a JSON array of file paths.
  - Training Data: 80K retrieval tasks with gold label “to_modify” sets, restricted such that all gold files are within the BM25 top 30 (Xie et al., 9 Jan 2025).
- Code Editing Module: Uses Qwen2.5-72B, fine-tuned with a structured JSON output style, to generate executable code patches.
  - Input: Receives the issue and the full content (with line numbers) of files selected by the retriever.
  - Output: Produces a JSON object containing a “reasoning” array (chain-of-thought style), plus a collection of precise, line-numbered edits (“edits”), where each specifies the path, the old code snippet annotated by original line numbers, and the new patch.
  - Training Data: 70K instances, each paired with a chain-of-thought and a gold patch, where the rationale is generated by prompting GPT-4o with masked diffs in a “rationalization” framework.
The entire workflow makes one call to the retriever and one call to the editor LLM for each issue.
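The file "skeleton" input described above can be approximated with Python's standard `ast` module. The following is a minimal sketch, not the authors' implementation: the function name `file_skeleton`, the choice to keep only top-level imports, functions, and classes, and the `...` elision marker are all assumptions made for illustration.

```python
import ast


def file_skeleton(source: str, keep: int = 5) -> str:
    """Reduce a Python file to its imports plus the first and last `keep`
    lines of every top-level function or class (hypothetical helper)."""
    lines = source.splitlines()
    out = []
    for node in ast.parse(source).body:
        if isinstance(node, (ast.Import, ast.ImportFrom)):
            # Keep import statements verbatim as the file "header".
            out.extend(lines[node.lineno - 1:node.end_lineno])
        elif isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            # Start at the first decorator, if any, so it stays attached.
            start = min([d.lineno for d in node.decorator_list] + [node.lineno])
            block = lines[start - 1:node.end_lineno]
            if len(block) > 2 * keep:
                # Elide the middle of long definitions.
                block = block[:keep] + ["    ..."] + block[-keep:]
            out.extend(block)
    return "\n".join(out)
```

The first five lines of a definition usually cover its signature and the start of its docstring, which is why this sketch does not extract them separately. It relies on `end_lineno`, available from Python 3.8.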
2. Retrieval Methodology and Precision
The code file retrieval module implements a coarse-to-fine selection strategy:
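- Coarse Retrieval: BM25 is used for rapid scoring. The BM25 score for a query $q$ and document $d$ is given by
$$\mathrm{BM25}(q, d) = \sum_{t \in q} \mathrm{IDF}(t) \cdot \frac{f(t, d)\,(k_1 + 1)}{f(t, d) + k_1 \left(1 - b + b\,\frac{|d|}{\mathrm{avgdl}}\right)}$$
where $f(t, d)$ is the term frequency of $t$ in $d$, $\mathrm{IDF}(t)$ the inverse document frequency, $|d|$ the document length, $\mathrm{avgdl}$ the average document length in the collection, and $k_1$, $b$ are the saturation and length-normalization parameters (Xie et al., 9 Jan 2025).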
- Fine Reranking: The top 30 files from BM25 are then filtered by the fine-tuned Qwen2.5-7B, which selects the final set of files for editing. The skeleton inputs limit context length while preserving salient file characteristics.
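The coarse stage can be sketched in a few lines of pure Python. This is an illustrative implementation of standard BM25, not the authors' code: the function names, the defaults `k1=1.5` and `b=0.75` (common choices, not values taken from the source), the smoothed IDF variant, and the assumption of pre-tokenized input are all choices made here.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each tokenized document against a tokenized query.

    k1 and b are common defaults, assumed here for illustration."""
    n = len(docs)
    avgdl = sum(len(d) for d in docs) / n
    # Document frequency: number of documents containing each term.
    df = Counter(t for d in docs for t in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for t in query:
            if t not in tf:
                continue
            # Smoothed IDF (one common variant; stays non-negative).
            idf = math.log((n - df[t] + 0.5) / (df[t] + 0.5) + 1.0)
            norm = tf[t] + k1 * (1 - b + b * len(d) / avgdl)
            score += idf * tf[t] * (k1 + 1) / norm
        scores.append(score)
    return scores


def top_k(query, docs, k=30):
    """Return indices of the k highest-scoring documents."""
    scores = bm25_scores(query, docs, k1=1.5, b=0.75)
    return sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)[:k]
```

In the pipeline described above, each document would be one repository file and the query would be the issue text; the returned top-30 indices would then be passed to the LLM reranker.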
Empirical retrieval performance on the SWE-Bench Lite benchmark is summarized below:
| Method | Precision (%) | Recall (%) |
|---|---|---|
| BM25 Top-3 | 18.9 | 56.7 |
| BM25 Top-30 | 2.9 | 86.7 |
| Qwen2.5-7B finetuned | 68.5 | 69.0 |
Ablation studies show that context window size and inclusion of the repository README have effects of a few points on precision/recall. Mixing in edit-task instances for retriever tuning yields further marginal improvements.
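The precision and recall figures in the table can be read with a small sketch of the metric. The averaging scheme below (per-instance precision and recall, averaged over instances) is an assumption, since the source does not spell it out; the function name is hypothetical.

```python
def retrieval_precision_recall(predicted, gold):
    """Average per-instance precision and recall, in percent.

    predicted, gold: parallel lists of file-path collections, one per issue."""
    precisions, recalls = [], []
    for pred, ref in zip(predicted, gold):
        pred, ref = set(pred), set(ref)
        hits = len(pred & ref)
        precisions.append(hits / len(pred) if pred else 0.0)
        recalls.append(hits / len(ref) if ref else 0.0)
    n = len(precisions)
    return 100 * sum(precisions) / n, 100 * sum(recalls) / n
```

Under this definition, when every instance has exactly one gold file and the retriever always returns k files, precision equals recall divided by k. The BM25 rows are consistent with that reading: 56.7 / 3 = 18.9 and 86.7 / 30 ≈ 2.9.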
3. Code Editing Module and Structured Output
The code editing module uses a large code LLM, fine-tuned to produce directly usable JSON patch objects:
- Architecture: Qwen2.5-72B, fine-tuned with a “JsonTuning” methodology, generating both reasoning (the chain-of-thought) and an array of edits.
- Output Schema:
- “reasoning”: Optional array justifying changes.
- “edits”: Array of objects, each with a file path, the original code snippet (with line numbers), and the corresponding patch (without line numbers).
- Training Protocol:
- Gold-standard patches are paired with GPT-4o–generated rationalizations.
- Optimization via AdamW with a learning rate of , batch size 96, and 5% warmup.
- JSON format is enforced through structured prompting and, in case of invalid output or syntax failure, resampling with a higher temperature (up to 5 attempts).
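The validate-and-resample behaviour described in the last item can be sketched as follows. This is a hedged illustration only: the key names `path`, `original`, and `patch`, the temperature schedule (`base_temperature`, `step`), and the `generate(prompt, temperature)` callable are hypothetical stand-ins, and the sketch checks only the JSON shape, not the syntax of the patched code.

```python
import json

# Hypothetical key names; the source only says each edit carries a file
# path, a line-numbered original snippet, and a patch.
REQUIRED_EDIT_KEYS = {"path", "original", "patch"}


def parse_editor_output(text):
    """Return the parsed object if it has the expected shape, else None."""
    try:
        obj = json.loads(text)
    except json.JSONDecodeError:
        return None
    if not isinstance(obj, dict) or not isinstance(obj.get("edits"), list):
        return None
    for edit in obj["edits"]:
        if not isinstance(edit, dict) or not REQUIRED_EDIT_KEYS.issubset(edit):
            return None
    return obj


def generate_with_retries(generate, prompt, max_attempts=5,
                          base_temperature=0.0, step=0.2):
    """Call `generate` until its output parses, raising the temperature
    on every retry (the schedule is an assumption). Returns None if all
    attempts fail."""
    for attempt in range(max_attempts):
        temperature = base_temperature + attempt * step
        obj = parse_editor_output(generate(prompt, temperature))
        if obj is not None:
            return obj
    return None
```

Note that a retry counts as an additional model call, so the two-call figure holds only when the first editor output is valid.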
Ablation experiments indicate that retaining full file context with line numbers is optimal (Fix Rate 20%), and that including class/function names as hints increases the fix rate by 2 points. Removing line numbers yields a substantial 6-point drop.
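One way to see why line numbers help is that they let an edit be located unambiguously even when the same snippet occurs several times in a file. The sketch below shows a line-numbered edit being applied; the `"<number> <code>"` row format and both function names are assumptions for illustration, not the paper's exact serialization.

```python
def add_line_numbers(text):
    """Prefix each line with its 1-based line number and a space."""
    return "\n".join(f"{i} {line}" for i, line in enumerate(text.splitlines(), 1))


def apply_edit(text, numbered_original, patch):
    """Replace the span named by `numbered_original` with `patch`.

    numbered_original: rows of the form "<line number> <code>".
    patch: replacement code without line numbers."""
    lines = text.splitlines()
    numbers, bodies = [], []
    for row in numbered_original.splitlines():
        number, _, body = row.partition(" ")
        numbers.append(int(number))
        bodies.append(body)
    start, end = numbers[0], numbers[-1]
    # Refuse to apply the edit if the quoted snippet is stale or wrong.
    if lines[start - 1:end] != bodies:
        raise ValueError("original snippet does not match file content")
    return "\n".join(lines[:start - 1] + patch.splitlines() + lines[end:])
```

For example, applying the edit `("2 b = 2", "b = 20")` to a three-line file rewrites only its second line. The sketch drops a trailing newline, which a real implementation would need to preserve.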
4. Dataset Curation and Chain-of-Thought Construction
Development of the SWE-Fixer system depended on a newly assembled dataset of real-world code fixes:
- Collection: Utilized GitHub REST API event linkage to mine 2,300+ Python repositories (with >100 PRs, excluding SWE-Bench) for issue→PR→test triples, yielding 331,000 triples.
- Filtering: Discarded entries with unparsable patches and limited train set to ≤3 edited files per instance (excluding tests), resulting in 110,000 high-quality instances (train_110K).
- Statistics:
- 54.7% of issue instances modify one file, 80% affect ≤3 files.
- 73.7% edit <100 lines; 85% edit ≤200 lines.
- Chain-of-Thought (CoT) Generation: For each gold patch, GPT-4o is prompted with a partially masked context to “rationalize” the change, producing a sequence of reasoning steps alongside the patch. This CoT annotation significantly enhances edit model effectiveness.
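The filtering step above can be sketched for patches in git's unified diff format. Everything here is an illustrative assumption rather than the authors' tooling: the function names, the path-based heuristic for recognizing test files, and treating a patch with no recognizable `diff --git` header as unparsable.

```python
import re

# Matches git diff file headers such as "diff --git a/pkg/x.py b/pkg/x.py".
DIFF_HEADER = re.compile(r"^diff --git a/(\S+) b/(\S+)$", re.MULTILINE)


def edited_files(patch):
    """Return the post-change paths of all files touched by a git patch."""
    return [m.group(2) for m in DIFF_HEADER.finditer(patch)]


def is_test_path(path):
    """Heuristic test-file detector (an assumption, not from the source)."""
    parts = path.lower().split("/")
    name = parts[-1]
    in_test_dir = any(p in ("test", "tests") for p in parts[:-1])
    return in_test_dir or name.startswith("test_") or name.endswith("_test.py")


def keep_instance(patch, max_files=3):
    """Keep patches that edit between 1 and `max_files` non-test files."""
    files = [f for f in edited_files(patch) if not is_test_path(f)]
    return 0 < len(files) <= max_files
```

Paths containing spaces are not handled by this regular expression; a production filter would use a proper diff parser.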
5. Benchmark Evaluation and Comparative Performance
SWE-Fixer was evaluated on industry-grade benchmarks:
- SWE-Bench Lite: 300 real-world curated issues.
- SWE-Bench Verified: Subset with stringently validated test coverage.
In both cases, the metric is Pass@1 (“Best@1” if only one suggestion is generated): the percentage of instances for which the generated patch passes all provided project tests.
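The metric can be written down directly. This is a minimal sketch assuming one generated patch per instance and a list of per-test outcomes for each instance; the function name and the input layout are hypothetical, as is the choice to count an instance with no test outcomes as unresolved.

```python
def resolve_rate(results):
    """Percentage of instances whose patch passes every provided test.

    results: mapping from instance id to a list of booleans, one per test."""
    resolved = sum(1 for outcomes in results.values() if outcomes and all(outcomes))
    return 100.0 * resolved / len(results)
```

As a reading aid, a score of 23.3% on the 300-instance Lite set would correspond to about 70 resolved instances, since 70 / 300 ≈ 23.3%.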
The main results are as follows:
| Method | Type | Verified (%) | Lite (%) |
|---|---|---|---|
| Agentless (GPT-4o) | Pipeline | 38.8 | 32.0 |
| SWE-Fixer | Pipeline | 30.2 | 23.3 |
SWE-Fixer achieves best-in-class performance among open-source models, with a +1.3 point advantage on Lite and tied best accuracy on Verified. Notably, it also surpasses several pipelines built on GPT-4 or Claude-3-Opus. A variant incorporating PASS_TO_PASS (P2P) filtering achieves 24.7% (Lite) and 32.8% (Verified).
6. Efficiency, Hyperparameters, and Implementation
SWE-Fixer adopts a minimalist approach in computational resource utilization:
- Total LLM Invocations: Two per instance—one by the retriever and one by the editor.
- Hardware and Framework: Trained on 96 NVIDIA A800 GPUs (64GB each) with PyTorch/xtuner-lite; global batch size 96; 64K sub-token context.
- Optimization Settings:
- Retriever: AdamW, lr = , weight decay = 0.01, warmup 10%.
- Editor: AdamW, lr = , weight decay = 0.01, warmup 5%.
- Checkpoints at every 1,000 steps and best scores determined on held-out validation splits.
Prevailing pipelines invoke 5–10 LLM calls per issue, whereas SWE-Fixer needs only two. It also uses approximately 50% less GPU time on a per-instance basis, while matching or exceeding the accuracy of open-source competitors.
7. Limitations and Prospective Work
Acknowledged limitations and future trajectories for SWE-Fixer include:
- Absence of Execution Feedback: Patches are never tested during training (no “test-in-the-loop” learning). Integrating even lightweight execution harnesses (such as LLM-driven test oracles) could enhance fix accuracy.
- Language Scope: Currently restricted to Python; generalizing to other programming languages requires language-specific skeleton extraction and greater context budgets.
- Context Scaling: File-rich repositories (>1,000 files) may cause the BM25 top-30 window to exclude optimal candidates; adaptive strategies or hierarchical file selection are proposed avenues.
- Chain-of-Thought Limitations: While rationalized CoT improves model output, it may sometimes misdirect the editing process. Improving CoT sampling, filtering, or employing self-consistency techniques may yield further gains.
- Integration into CI/CD Pipelines: Direct end-to-end deployment into continuous integration workflows, enhanced with test results and artifact checks, is identified as a significant area for future development (Xie et al., 9 Jan 2025).
In summary, SWE-Fixer demonstrates that a streamlined, open-source, two-module pipeline—based on BM25-LLM retrieval, structured code editing, and a large, real-world supervision corpus—can match or surpass the performance of more complex and proprietary alternatives on real-world automated code fixing tasks drawn from GitHub issue data.