
SWE-Fixer: Open-Source Code Fix Pipeline

Updated 12 March 2026
  • SWE-Fixer is an open-source two-stage pipeline that maps natural language GitHub issues to minimal code changes passing unit tests.
  • It combines a coarse-to-fine retrieval module with BM25 and a fine-tuned Qwen2.5-7B to accurately select relevant code files.
  • The system employs a structured code editing module with Qwen2.5-72B, achieving state-of-the-art performance among open-source models on the SWE-Bench Lite and Verified benchmarks with only two LLM calls per issue.

SWE-Fixer is an open-source, two-stage pipeline for resolving real-world GitHub issues. It combines information retrieval with code-editing LLMs trained on a large-scale corpus of issues and corresponding patches. The framework targets practical, test-driven software engineering tasks in which a natural language issue is mapped to a minimal set of code changes that pass a repository’s existing unit tests. It consists of a coarse-to-fine retrieval module and a structured code editing module, both fine-tuned on a newly constructed dataset of 110,000 GitHub issue–patch–test triples. The architecture achieves state-of-the-art performance among open-source models on the SWE-Bench Lite and Verified benchmarks, and it is efficient, requiring only two LLM calls per issue without relying on proprietary models (Xie et al., 9 Jan 2025).

1. System Architecture and Workflow

SWE-Fixer is built as a two-stage pipeline:

  1. Code File Retrieval Module: Utilizes Qwen2.5-7B, fine-tuned as a sequence-to-sequence classifier, for identifying which files in a repository are most relevant to a given GitHub issue.
    • Input: Receives issue text and a list of file “skeletons”—abstractions that retain only signatures, docstrings, headers, and the first/last five lines of functions or classes.
    • Processing: Applies BM25 to index every file as a document, scoring each against the issue’s textual query, and selects the top 30 files. The LLM reranker then filters this set to return those deemed necessary for modification, output as a JSON array of file paths.
    • Training Data: 80K retrieval tasks with gold label “to_modify” sets, restricted such that all gold files are within the BM25 top 30 (Xie et al., 9 Jan 2025).
  2. Code Editing Module: Uses Qwen2.5-72B, fine-tuned with a structured JSON output style, to generate executable code patches.
    • Input: Receives the issue and the full content (with line numbers) of files selected by the retriever.
    • Output: Produces a JSON object containing a “reasoning” array (chain-of-thought style), plus a collection of precise, line-numbered edits (“edits”), where each specifies the path, the old code snippet annotated by original line numbers, and the new patch.
    • Training Data: 70K instances, each paired with a chain-of-thought and a gold patch, where the rationale is generated by prompting GPT-4o with masked diffs in a “rationalization” framework.
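
The file “skeletons” passed to the retriever can be approximated with Python's standard `ast` module. The following is a minimal sketch, not the authors' extractor: it keeps import lines and, for each top-level function or class, the first and last `keep` lines with the middle elided. The function name `file_skeleton`, the omission marker, and the choice to drop all other top-level statements are assumptions made for illustration.

```python
import ast


def file_skeleton(source: str, keep: int = 5) -> str:
    """Return a reduced view of a Python file (illustrative sketch only).

    Keeps import statements and, for every top-level function or class,
    its first and last `keep` lines. Other top-level statements are dropped.
    """
    lines = source.splitlines()
    out = []
    for node in ast.parse(source).body:
        # lineno/end_lineno are 1-indexed and inclusive (Python 3.8+).
        block = lines[node.lineno - 1:node.end_lineno]
        if isinstance(node, (ast.Import, ast.ImportFrom)):
            out.extend(block)
        elif isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            if len(block) > 2 * keep:
                omitted = len(block) - 2 * keep
                marker = f"    # ... {omitted} lines omitted ..."
                block = block[:keep] + [marker] + block[-keep:]
            out.extend(block)
    return "\n".join(out)
```

Short definitions pass through unchanged, so the skeleton only shrinks files whose functions or classes exceed `2 * keep` lines.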

The entire workflow makes one call to the retriever and one call to the editor LLM for each issue.
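
The two-call structure can be sketched as a single function that takes the BM25 ranker and both models as injected callables. Everything named here (`resolve_issue`, `number_lines`, the callable signatures, and the `"N text"` line-numbering format) is hypothetical and only illustrates the control flow described above; the actual prompts and interfaces are not given in this summary.

```python
import json
from typing import Callable, Dict, List


def number_lines(content: str) -> str:
    """Prefix each line with its 1-indexed line number (assumed format)."""
    return "\n".join(f"{i} {line}" for i, line in enumerate(content.splitlines(), 1))


def resolve_issue(
    issue: str,
    files: Dict[str, str],       # path -> full file content
    skeletons: Dict[str, str],   # path -> file skeleton
    bm25_top_k: Callable[[str, Dict[str, str], int], List[str]],
    retriever_llm: Callable[[str, Dict[str, str]], str],
    editor_llm: Callable[[str, Dict[str, str]], str],
    k: int = 30,
) -> dict:
    # Coarse stage: lexical ranking, no LLM involved.
    candidates = bm25_top_k(issue, files, k)
    # LLM call 1: the retriever returns a JSON array of file paths.
    raw_paths = retriever_llm(issue, {p: skeletons[p] for p in candidates})
    selected = [p for p in json.loads(raw_paths) if p in candidates]
    # LLM call 2: the editor sees full, line-numbered files and returns a JSON patch object.
    raw_patch = editor_llm(issue, {p: number_lines(files[p]) for p in selected})
    return json.loads(raw_patch)
```

Dropping paths that are not among the BM25 candidates is a defensive choice in this sketch, so that a hallucinated path never reaches the editor.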

2. Retrieval Methodology and Precision

The code file retrieval module implements a coarse-to-fine selection strategy:

  • Coarse Retrieval: BM25 is used for rapid scoring. BM25 for a query $Q = \{q_1, \dots, q_n\}$ and document $D$ is given by

$$\mathrm{BM25}(Q, D) = \sum_{i=1}^{n} \mathrm{IDF}(q_i) \times \frac{f(q_i, D)\,(k_1 + 1)}{f(q_i, D) + k_1 \left(1 - b + b\,|D|/\mathrm{avgdl}\right)},$$

where $f(q_i, D)$ is the term frequency, $\mathrm{IDF}(q_i)$ the inverse document frequency, $|D|$ the document length, $\mathrm{avgdl}$ the average document length in the corpus, and $k_1 \approx 1.2$, $b \approx 0.75$ (Xie et al., 9 Jan 2025).

  • Fine Reranking: The top 30 files from BM25 are filtered by the fine-tuned Qwen2.5-7B to obtain the final set of files for editing. Passing skeletons rather than full files keeps the input short while preserving each file's salient characteristics.
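
A minimal sketch of the coarse stage follows, implementing the BM25 formula with $k_1 = 1.2$ and $b = 0.75$. The source does not specify the IDF variant or the tokenizer, so the smoothed IDF used here and the pre-tokenized inputs are assumptions; `bm25_scores` and `top_k` are illustrative names.

```python
import math
from collections import Counter
from typing import List


def bm25_scores(query: List[str], docs: List[List[str]],
                k1: float = 1.2, b: float = 0.75) -> List[float]:
    """Score every tokenized document against a tokenized query."""
    n_docs = len(docs)
    avgdl = sum(len(d) for d in docs) / n_docs
    # Document frequency: number of documents containing each term.
    df = Counter()
    for d in docs:
        df.update(set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for q in query:
            f = tf[q]
            if f == 0:
                continue
            # Smoothed IDF (an assumed variant; always positive).
            idf = math.log((n_docs - df[q] + 0.5) / (df[q] + 0.5) + 1.0)
            score += idf * f * (k1 + 1) / (f + k1 * (1 - b + b * len(d) / avgdl))
        scores.append(score)
    return scores


def top_k(query: List[str], docs: List[List[str]], k: int = 30) -> List[int]:
    """Indices of the k highest-scoring documents, best first."""
    scores = bm25_scores(query, docs)
    return sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)[:k]
```

In the pipeline, each document would be one repository file and the query would be the issue text; the resulting top 30 are what the reranker sees.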

Empirical retrieval performance on the SWE-Bench Lite benchmark is summarized below:

| Method | Precision (%) | Recall (%) |
|---|---|---|
| BM25 Top-3 | 18.9 | 56.7 |
| BM25 Top-30 | 2.9 | 86.7 |
| Qwen2.5-7B fine-tuned | 68.5 | 69.0 |
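
The two BM25 rows are consistent with a single gold file per instance: in that case precision at k equals recall divided by k, and indeed 56.7 / 3 = 18.9 and 86.7 / 30 ≈ 2.9. The sketch below computes file-level precision and recall pooled over instances. The pooling (micro-averaging) is an assumption, since the aggregation used for the reported figures is not stated, and `precision_recall` is an illustrative name.

```python
from typing import List, Sequence, Tuple


def precision_recall(predicted: List[Sequence[str]],
                     gold: List[Sequence[str]]) -> Tuple[float, float]:
    """Pooled file-level precision and recall over a list of instances."""
    hits = n_pred = n_gold = 0
    for pred, ref in zip(predicted, gold):
        pred_set, ref_set = set(pred), set(ref)
        hits += len(pred_set & ref_set)   # correctly retrieved files
        n_pred += len(pred_set)           # everything that was returned
        n_gold += len(ref_set)            # everything that should have been
    precision = hits / n_pred if n_pred else 0.0
    recall = hits / n_gold if n_gold else 0.0
    return precision, recall
```

The trade-off in the table follows directly: widening the BM25 window raises recall but dilutes precision, which is what the fine-tuned reranker is meant to recover.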

Ablation studies show that context window size and inclusion of the repository README shift precision and recall by only a few points. Mixing edit-task instances into retriever tuning yields further marginal improvements.

3. Code Editing Module and Structured Output

The code editing module uses a large code LLM, fine-tuned to produce directly usable JSON patch objects:

  • Architecture: Qwen2.5-72B, finetuned with a “JsonTuning” methodology, generating both reasoning (the chain-of-thought) and an array of edits.
  • Output Schema:
    • “reasoning”: Optional array justifying changes.
    • “edits”: Array of objects, each with a file path, the original code snippet (with line numbers), and the corresponding patch (without line numbers).
  • Training Protocol:
    • Gold-standard patches are paired with GPT-4o–generated rationalizations.
    • Optimization via AdamW with a learning rate of $5 \times 10^{-6}$, batch size 96, and 5% warmup.
    • JSON format is enforced through structured prompting and, in case of invalid output or syntax failure, resampling with a higher temperature (up to 5 attempts).
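
The resampling rule can be sketched as a retry loop around a generation callable. This is an illustration under stated assumptions: `generate` is a hypothetical function that takes a temperature and returns raw model text, the starting temperature and step size are invented for the example, the edit keys `path`, `old`, and `new` are placeholder names, and only the JSON structure is validated (not the syntax of the patched code).

```python
import json
from typing import Callable, Optional


def is_valid_patch(obj: object) -> bool:
    """Check the assumed shape: a dict with a non-empty list of edit dicts."""
    if not isinstance(obj, dict):
        return False
    edits = obj.get("edits")
    if not isinstance(edits, list) or not edits:
        return False
    return all(isinstance(e, dict) and {"path", "old", "new"} <= e.keys() for e in edits)


def sample_valid_patch(generate: Callable[[float], str],
                       max_attempts: int = 5,
                       start_temperature: float = 0.0,
                       step: float = 0.2) -> Optional[dict]:
    """Resample with a higher temperature after each invalid output."""
    for attempt in range(max_attempts):
        raw = generate(start_temperature + attempt * step)
        try:
            obj = json.loads(raw)
        except json.JSONDecodeError:
            continue  # malformed JSON: try again with more randomness
        if is_valid_patch(obj):
            return obj
    return None  # no valid patch within the attempt budget
```

When the first sample is valid, the loop exits after a single editor call, so the retries only add cost for the failing cases.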

Ablation experiments indicate that retaining full file context with line numbers is optimal (Fix Rate 20%), and that including class/function names as hints increases the fix rate by 2 points. Removing line numbers yields a substantial 6-point drop.
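
One way to see why line numbers help is that they let an edit be located and verified mechanically. The sketch below applies a single edit whose original snippet carries line numbers and whose replacement does not. The `"N text"` numbering format and the names `add_line_numbers` and `apply_edit` are assumptions for illustration, not the paper's exact format.

```python
def add_line_numbers(content: str) -> str:
    """Render a file as '1 first line', '2 second line', ... (assumed format)."""
    return "\n".join(f"{i} {line}" for i, line in enumerate(content.splitlines(), 1))


def apply_edit(content: str, old_numbered: str, new_code: str) -> str:
    """Replace the numbered snippet `old_numbered` in `content` with `new_code`."""
    lines = content.splitlines()
    parts = [ln.split(" ", 1) for ln in old_numbered.splitlines()]
    start, end = int(parts[0][0]), int(parts[-1][0])  # 1-indexed, inclusive
    old_text = [p[1] if len(p) > 1 else "" for p in parts]
    # Refuse to edit if the quoted snippet does not match the file at those lines.
    if lines[start - 1:end] != old_text:
        raise ValueError("old snippet does not match the file at the stated lines")
    return "\n".join(lines[:start - 1] + new_code.splitlines() + lines[end:])
```

The mismatch check turns a wrongly located edit into an explicit failure rather than a silently corrupted file.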

4. Dataset Curation and Chain-of-Thought Construction

Development of the SWE-Fixer system depended on a newly assembled dataset of real-world code fixes:

  • Collection: Utilized GitHub REST API event linkage to mine 2,300+ Python repositories (with >100 PRs, excluding SWE-Bench) for issue→PR→test triples, yielding 331,000 triples.
  • Filtering: Discarded entries with unparsable patches and limited train set to ≤3 edited files per instance (excluding tests), resulting in 110,000 high-quality instances (train_110K).
  • Statistics:
    • 54.7% of issue instances modify one file, 80% affect ≤3 files.
    • 73.7% edit <100 lines; 85% edit ≤200 lines.
  • Chain-of-Thought (CoT) Generation: For each gold patch, GPT-4o is prompted with a partially masked context to “rationalize” the change, producing a sequence of reasoning steps alongside the patch. This CoT annotation is reported to substantially improve the effectiveness of the edit model.
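
The filtering step can be sketched as a predicate over unified diffs. This is a simplified illustration: a patch with no recognizable `diff --git` header is treated as unparsable, and the test-file heuristic (`is_test_file`) is a hypothetical rule, since the source does not say how test files were identified.

```python
import re
from typing import List


def edited_files(patch: str) -> List[str]:
    """Paths touched by a git-style unified diff (paths with spaces not handled)."""
    return re.findall(r"^diff --git a/(\S+) b/\S+$", patch, flags=re.MULTILINE)


def is_test_file(path: str) -> bool:
    """Hypothetical heuristic: lives under test(s)/ or is named like a test module."""
    parts = path.lower().split("/")
    name = parts[-1]
    in_test_dir = any(p in ("test", "tests") for p in parts[:-1])
    return in_test_dir or name.startswith("test_") or name.endswith("_test.py")


def keep_instance(patch: str, max_files: int = 3) -> bool:
    """Keep instances that edit between 1 and `max_files` non-test files."""
    source_files = [p for p in edited_files(patch) if not is_test_file(p)]
    return 0 < len(source_files) <= max_files
```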

5. Benchmark Evaluation and Comparative Performance

SWE-Fixer was evaluated on two SWE-Bench subsets:

  • SWE-Bench Lite: 300 real-world curated issues.
  • SWE-Bench Verified: Subset with stringently validated test coverage.

In both cases, the metric is Pass@1 (“Best@1” if only one suggestion is generated): the percentage of instances for which the generated patch passes all provided project tests.
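
As a small worked example of the metric, the function below turns per-instance outcomes into a percentage. The name `resolve_rate` and the input shape (instance id mapped to whether all tests passed) are assumptions. On the 300-instance Lite set, a score of 23.3% would correspond to 70 resolved issues.

```python
from typing import Dict


def resolve_rate(results: Dict[str, bool]) -> float:
    """Percentage of instances whose single generated patch passed all tests."""
    if not results:
        return 0.0
    return 100.0 * sum(results.values()) / len(results)
```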

The main results are as follows:

| Method | Type | Verified (%) | Lite (%) |
|---|---|---|---|
| Agentless (GPT-4o) | Pipeline | 38.8 | 32.0 |
| SWE-Fixer | Pipeline | 30.2 | 23.3 |

SWE-Fixer achieves best-in-class performance among open-source models, with a +1.3 point advantage on Lite and tied best accuracy on Verified. Notably, it also surpasses several pipelines built on GPT-4 or Claude-3-Opus. A variant incorporating PASS_TO_PASS (P2P) filtering achieves 24.7% (Lite) and 32.8% (Verified).

6. Efficiency, Hyperparameters, and Implementation

SWE-Fixer adopts a minimalist approach in computational resource utilization:

  • Total LLM Invocations: Two per instance—one by the retriever and one by the editor.
  • Hardware and Framework: Trained on 96 NVIDIA A800 GPUs (64GB each) with PyTorch/xtuner-lite; global batch size 96; 64K sub-token context.
  • Optimization Settings:
    • Retriever: AdamW, lr = $1 \times 10^{-5}$, weight decay = 0.01, warmup 10%.
    • Editor: AdamW, lr = $5 \times 10^{-6}$, weight decay = 0.01, warmup 5%.
    • Checkpoints are saved every 1,000 steps, and the best checkpoint is selected on held-out validation splits.
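
The warmup percentages can be read as a fraction of total training steps. The sketch below assumes a linear ramp to the peak learning rate followed by a constant rate; the source gives only the peak rates and warmup fractions, so both the ramp shape and the absence of decay are assumptions, and `lr_at` is an illustrative name.

```python
def lr_at(step: int, total_steps: int, peak_lr: float, warmup_frac: float) -> float:
    """Learning rate at a 0-indexed step: linear warmup, then constant (assumed)."""
    warmup_steps = max(1, round(total_steps * warmup_frac))
    if step < warmup_steps:
        # Ramp from peak_lr / warmup_steps up to peak_lr.
        return peak_lr * (step + 1) / warmup_steps
    return peak_lr
```

For example, with 1,000 total steps the editor setting (`peak_lr=5e-6`, `warmup_frac=0.05`) reaches its peak after 50 steps, while the retriever setting (`peak_lr=1e-5`, `warmup_frac=0.10`) takes 100.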

Compared with prevailing pipelines, which invoke 5–10 LLM calls per issue, SWE-Fixer's two calls amount to 60–80% fewer LLM invocations. It is also reported to use approximately 50% less GPU time per instance while matching or exceeding the accuracy of open-source competitors.

7. Limitations and Prospective Work

Acknowledged limitations and future trajectories for SWE-Fixer include:

  • Absence of Execution Feedback: Patches are never tested during training (no “test-in-the-loop” learning). Integrating even lightweight execution harnesses (such as LLM-driven test oracles) could enhance fix accuracy.
  • Language Scope: Currently restricted to Python; generalizing to other programming languages requires language-specific skeleton extraction and greater context budgets.
  • Context Scaling: File-rich repositories (>1,000 files) may cause the BM25 top-30 window to exclude optimal candidates; adaptive strategies or hierarchical file selection are proposed avenues.
  • Chain-of-Thought Limitations: While rationalized CoT improves model output, it may sometimes misdirect the editing process. Improving CoT sampling, filtering, or employing self-consistency techniques may yield further gains.
  • Integration into CI/CD Pipelines: Direct end-to-end deployment into continuous integration workflows, enhanced with test results and artifact checks, is identified as a significant area for future development (Xie et al., 9 Jan 2025).

In summary, SWE-Fixer demonstrates that a streamlined, open-source, two-module pipeline, built on BM25-plus-LLM retrieval, structured code editing, and a large real-world supervision corpus, can lead open-source systems and surpass several more complex pipelines built on proprietary models on automated code fixing tasks drawn from GitHub issue data.
