SWE-Bench-Lite: Python APR Benchmark
- SWE-Bench-Lite is a curated, execution-driven benchmark that evaluates automated program repair systems using 300 real-world Python bug instances.
- It is constructed by subsampling from a larger dataset, focusing on 12 high-profile Python repositories with strict filtering for unit-test-verified repairs.
- The benchmark employs clear metrics like resolve rate from pass/fail tests, enabling rapid prototyping and cross-method comparison in APR research.
SWE-Bench-Lite is a curated, execution-driven benchmark for evaluating automated program repair (APR) systems and language-model-driven software agents on real-world Python bugs. It constitutes a lightweight yet fully executable subset (300 instances) of the broader SWE-Bench benchmark, derived from GitHub issues and pull requests across 12 high-visibility Python repositories. SWE-Bench-Lite defines a tractable, unit-test-verified arena for assessing end-to-end code-fixing pipelines, both agentic and pipelined, and has emerged as the primary leaderboard for rapid prototyping and cross-method comparison in APR research.
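For orientation, the sketch below shows one common way to load the benchmark programmatically, assuming the public Hugging Face release under `princeton-nlp/SWE-bench_Lite`; the field names follow the public dataset card and may vary across versions.

```python
# Minimal sketch: loading SWE-Bench-Lite via the Hugging Face `datasets` library.
# Assumes the public release under "princeton-nlp/SWE-bench_Lite"; field names
# follow the dataset card and may differ across versions.
from datasets import load_dataset

lite = load_dataset("princeton-nlp/SWE-bench_Lite", split="test")
print(len(lite))  # 300 instances

example = lite[0]
print(example["instance_id"])        # e.g. "django__django-11099"
print(example["repo"])               # source repository
print(example["base_commit"])        # buggy snapshot to check out
print(example["problem_statement"])  # natural-language issue text
print(example["patch"][:200])        # reference developer patch (unified diff)
print(example["FAIL_TO_PASS"])       # tests that must flip from fail to pass
print(example["PASS_TO_PASS"])       # regression tests that must keep passing
```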
1. Construction and Dataset Composition
SWE-Bench-Lite is constructed by subsampling the original SWE-Bench dataset, which comprises 2,294 GitHub issues paired with corresponding bug-fixing pull requests (Pham et al., 20 Apr 2025). The selection procedure for the Lite split enforces several constraints:
- Repository Scope: 12 popular open-source Python projects, including sympy, matplotlib, scikit-learn, flask, astropy, requests, seaborn, sphinx, xarray, pylint, pytest, and Django (Aleithan et al., 9 Oct 2024, Xie et al., 9 Jan 2025).
- Task Definition: Each instance contains the repository snapshot at the buggy commit, the natural-language issue description, the reference developer patch (unified diff), and the full applicable test suite at that point.
- Filtering Criteria:
- Only bug-fixing issues (not feature requests).
- Instances with parsable, tractable (typically single-file) patches and a test suite that fails pre-patch and passes post-patch.
- Each patch must affect at most three non-test files, and instances whose patches edit more than one file are generally excluded (Pan et al., 30 Dec 2024, Reddy et al., 7 May 2025).
- The repository state and test suite must fit within standard context window constraints (e.g., ≤64K tokens).
- All instances are sourced from Python repositories.
These filters aim to preserve a diversity of bug types: algorithmic/logical errors, API misuse, off-by-one, missing imports, and test-related issues. Importantly, each instance reflects an authentic software maintenance scenario documented in GitHub issue trackers (Pham et al., 20 Apr 2025, Martinez et al., 20 Jun 2025).
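The filtering criteria above can be thought of as a predicate over candidate instances. The sketch below is purely illustrative; the instance fields and thresholds are hypothetical stand-ins, not the official curation script.

```python
# Illustrative sketch of the Lite-style filtering predicate described above.
# The instance fields and thresholds are hypothetical stand-ins, not the
# official SWE-Bench curation code.
LITE_REPOS = {
    "sympy", "matplotlib", "scikit-learn", "flask", "astropy", "requests",
    "seaborn", "sphinx", "xarray", "pylint", "pytest", "django",
}

def keep_for_lite(instance: dict, max_context_tokens: int = 64_000) -> bool:
    """Return True if an instance satisfies the Lite-style filters."""
    non_test_files = [f for f in instance["patch_files"] if "test" not in f]
    return (
        instance["repo"] in LITE_REPOS                        # repository scope
        and instance["is_bug_fix"]                            # bug fixes only, no feature requests
        and len(non_test_files) <= 1                          # tractable, single-file patches
        and len(instance["fail_to_pass"]) > 0                 # tests fail pre-patch, pass post-patch
        and instance["context_tokens"] <= max_context_tokens  # fits the context budget
    )
```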
2. Evaluation Methodology and Metrics
The central evaluation paradigm in SWE-Bench-Lite is executable patch assessment: a candidate patch is accepted as “resolved” if, when applied to the repository, it causes all unit tests to pass. The main quantitative metrics are:
- Resolve Rate/RR (Success Rate): $\mathrm{RR} = \frac{1}{N}\sum_{i=1}^{N} \mathbb{1}\big[\operatorname{apply}(b_i, \hat{p}_i)\ \text{passes}\ T_i\big]$, where $N$ is the number of instances ($N = 300$ for Lite), $b_i$ is the buggy version, $T_i$ the test suite, and $\hat{p}_i$ the candidate patch.
- Correct Patch Rate/CPR: Exact-match accuracy at AST level.
- Empty Patch Rate/EPR: Fraction of “no-op” predictions.
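Given per-instance execution results, the resolve rate and empty patch rate reduce to simple counting. The helper below is a sketch over a hypothetical `results` structure (instance ID mapped to patch text and test outcomes); it is not the official evaluation harness.

```python
# Sketch of metric computation over per-instance execution results.
# `results` is a hypothetical structure: instance_id -> dict with the candidate
# patch text and booleans for the executed test suite. Not the official harness.
def compute_metrics(results: dict[str, dict], n_instances: int = 300) -> dict[str, float]:
    resolved = sum(
        1 for r in results.values()
        if r["fail_to_pass_all_pass"] and r["pass_to_pass_all_pass"]
    )
    empty = sum(1 for r in results.values() if not r["patch"].strip())
    return {
        "resolve_rate": resolved / n_instances,   # RR: fraction of the 300 tasks resolved
        "empty_patch_rate": empty / n_instances,  # EPR: fraction of no-op predictions
    }
```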
The primary ranking criterion—often simply called “percent resolved” or “precision”—is the number of tasks for which any correct patch is found, divided by 300 (Martinez et al., 20 Jun 2025).
Additional metrics used in published work include precision/recall of file retrieval (when evaluating retrieval+edit pipelines), localization rates at various granularities (file, class/function, line), and cost-efficiency (token/currency consumption per fix) (Xia et al., 1 Jul 2024, Xie et al., 9 Jan 2025).
P2P Filtering (Pass-to-Pass): Some methods generate multiple candidate patches and filter them via test execution, e.g., discarding candidates that break previously passing tests, before re-ranking the survivors (Xie et al., 9 Jan 2025).
3. Submission Protocols and Leaderboard Practices
SWE-Bench-Lite adopts a low-bureaucracy, open submission model (Martinez et al., 20 Jun 2025):
- Submission Artifacts:
- List of resolved issues and corresponding patches.
- Metadata on repair approach and environment (metadata.yaml).
- Complete test logs and execution results.
- Leaderboard Ranking: Techniques are ranked exclusively by the number of resolved tasks (precision); no recall or F1 is computed due to the fixed set size. Unlike SWE-Bench Verified (which requires Docker images and further curation), Lite accepts submissions via pull requests to the main SWE-Bench experiments repository and emphasizes accessibility.
This structure encourages wide participation from academia, industry, and independent researchers, resulting in 79 tracked Lite submissions as of mid-2025.
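The artifacts above can be bundled with a few lines of scripting. The layout and file names in the sketch below (`predictions.jsonl`, `metadata.yaml`, `logs/`) are assumptions for illustration; the authoritative conventions are those documented in the SWE-Bench experiments repository.

```python
# Hypothetical packaging sketch for a Lite submission. The directory layout and
# file names are assumptions for illustration; consult the SWE-Bench experiments
# repository for the authoritative submission format.
import json
from pathlib import Path

import yaml  # PyYAML


def package_submission(out_dir: str, predictions: list[dict], metadata: dict) -> None:
    root = Path(out_dir)
    (root / "logs").mkdir(parents=True, exist_ok=True)  # test logs and execution results go here

    # One JSON object per attempted instance: instance ID plus the model-generated patch.
    with open(root / "predictions.jsonl", "w") as f:
        for pred in predictions:
            f.write(json.dumps(pred) + "\n")

    # Approach/environment description, as required by the leaderboard.
    with open(root / "metadata.yaml", "w") as f:
        yaml.safe_dump(metadata, f)


package_submission(
    "my_submission",
    predictions=[{"instance_id": "django__django-11099", "model_patch": "..."}],
    metadata={"name": "MyRepairAgent", "llm": "example-model", "open_source": True},
)
```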
4. Analysis of Instance Quality and Benchmark Hygiene
Detailed analyses have revealed several benchmark quality issues that affect interpretability and reliability of results (Aleithan et al., 9 Oct 2024, Pan et al., 30 Dec 2024, Yu et al., 10 Jun 2025):
- Solution Leakage: 33% of successful patches in sampled instances directly reproduce code snippets or patches already present verbatim in the issue description or comments. After excluding these, true resolution rates can drop by nearly half (e.g., 18% → 9% for SWE-Agent+GPT-4) (Aleithan et al., 9 Oct 2024).
- Weak Test Suites: 14.8% of initial “resolved” instances actually passed due to insufficient test coverage rather than correct bug fixes; model-generated patches may be semantically incorrect yet survive official tests.
- Temporal Leakage: >94% of Lite issues predate the GPT-4 and GPT-3.5 knowledge cutoff dates, suggesting LLMs may have encountered these issues and solutions in training data.
- Test Adequacy Metric (TAM): The ratio of robustly verified to weakly tested resolutions is recommended as a quality-control measure for future releases.
Tools like UTBoost and UTGenerator have demonstrated that nearly a quarter of agent-generated patches originally labeled as correct in Lite were false positives, prompting proposals for automatic test augmentation and stricter filtering (Yu et al., 10 Jun 2025).
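A minimal leakage check in the spirit of the audits above can be approximated by scanning the issue text for the patch's added lines; the heuristic below (verbatim substring matching with a hypothetical threshold) is a simplification, not the methodology of the cited studies.

```python
# Simplified solution-leakage heuristic: flag a prediction if most non-trivial
# added lines of its patch already appear verbatim in the issue text/comments.
# This approximates, but does not reproduce, the manual audits cited above.
def is_leaked(issue_text: str, model_patch: str, min_line_len: int = 20) -> bool:
    added_lines = [
        line[1:].strip()
        for line in model_patch.splitlines()
        if line.startswith("+") and not line.startswith("+++")
    ]
    substantive = [line for line in added_lines if len(line) >= min_line_len]
    if not substantive:
        return False
    hits = sum(1 for line in substantive if line in issue_text)
    return hits / len(substantive) > 0.5  # majority of added code present verbatim
```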
Table: Correction Impact of Test Augmentation
| Split | Patches Reclassified | Affected Leaderboard Entries |
|---|---|---|
| Lite | 176 | 40.9% (18 agents) |
| Verified | 169 | 24.4% (11 agents) |
5. Model and System Performance
SWE-Bench-Lite serves as a stress test for both open-source and proprietary LLM-driven repair systems, yielding highly differentiated performance across architectures and agentic depth.
- Proprietary LLMs (e.g., Claude 3.5, GPT-4o) have consistently topped the Lite leaderboard (up to 60%), especially in hybrid or dual-agent systems (Martinez et al., 20 Jun 2025).
- Open-source LLMs—including fine-tuned Qwen2.5 and Code-LLaMA models (up to 72B parameters)—achieve 19–24.7% RR, with SWE-Fixer 72B reporting 23.3% (Best@1), rising to 24.7% with P2P filtering (Best@8) (Xie et al., 9 Jan 2025).
- Agentless and Minimal Pipelines: Simpler, non-agentic pipelines (e.g., Agentless) can reach a 27.3% solve rate at minimal inference cost ($0.34 per issue), indicating many issues do not require complex tool use or agentic reasoning (Xia et al., 1 Jul 2024).
- Reinforcement and Verifier Scaling: Fine-tuning on 491 trajectory-collected successes from SWE-Gym increases open-weight RR from 3.0% (zero-shot) to 15.3% (fine-tuned); verifier reranking at inference elevates RR to 26.0%, setting the state-of-the-art for open-weight agents (Pan et al., 30 Dec 2024). A generic sketch of such candidate selection follows this list.
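Selection strategies such as Best@8 with P2P filtering or verifier re-ranking share a common shape: generate several candidate patches, discard those failing execution-based checks, and return the highest-scoring survivor. The sketch below is a generic illustration, not the SWE-Fixer or SWE-Gym implementation; `passes_regression_tests` and `verifier_score` stand in for method-specific components.

```python
# Generic Best@k selection sketch: execution-based filtering followed by score-
# based re-ranking. `passes_regression_tests` and `verifier_score` are
# hypothetical callables, not the SWE-Fixer or SWE-Gym implementations.
from typing import Callable, Optional

def select_patch(
    candidates: list[str],
    passes_regression_tests: Callable[[str], bool],  # e.g. pass-to-pass filtering
    verifier_score: Callable[[str], float],          # e.g. a learned verifier
) -> Optional[str]:
    survivors = [p for p in candidates if passes_regression_tests(p)]
    if not survivors:
        return None  # abstain; this counts toward the empty-patch rate
    return max(survivors, key=verifier_score)
```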
Table: Representative Open-Source Method Results (Lite, Best@1 unless noted) (Xie et al., 9 Jan 2025)
| Method (LLM) | RR (%) | Special Filtering |
|---|---|---|
| SWE-Gym-32B | 19.1 | |
| SWE-SynInfer-72B | 22.0 | |
| SWE-Fixer-72B | 23.3 | |
| SWE-Fixer-72B | 24.7 | Best@8(P2P) |
| Agentless (GPT-4o) | 32.0 | Best open-source arch |
Leaderboard trends indicate open-source tracks lag proprietary-model solutions by 15–35 percentage points, but architectural choices (multi-agent designs, agentic workflows, patch ensembling/re-ranking) can halve this gap (Martinez et al., 20 Jun 2025).
6. Architectural Taxonomy and Trends
Research using SWE-Bench-Lite has clarified several design patterns associated with higher patch success rates (Martinez et al., 20 Jun 2025):
- Human-Authored, Agentic Workflows: Scaffolds where agent autonomy is structured and workflow is modularized (e.g., file localizer → patch generator) tend to outperform monolithic or strictly pipelined approaches.
- Multi-Agent Pipelines: Fixed dual-agent (localizer + repairer) pipelines achieve a median 37% RR, with values up to 60% for some hybrid architectures; a schematic sketch of this pattern follows this list.
- Emergent Autonomy: Systems allowing dynamic workflow adaptation—though not yet dominant—reach up to 56% but frequently underperform compared to scaffolded alternatives.
- Open Source Reproducibility: The reproducible, open-source track is essential but remains below 40% RR for all tested approaches.
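The dual-agent pattern can be pictured as two cooperating roles with a fixed hand-off. The skeleton below is schematic; `llm_call` and the prompt texts are placeholders rather than any particular leaderboard system.

```python
# Schematic dual-agent (localizer -> repairer) pipeline with a fixed hand-off,
# in the spirit of the human-authored workflows described above. `llm_call` and
# the prompts are placeholders, not a specific leaderboard system.
from typing import Callable

def dual_agent_repair(
    issue_text: str,
    repo_files: dict[str, str],       # path -> file contents
    llm_call: Callable[[str], str],   # placeholder LLM interface
) -> str:
    # Agent 1: localizer proposes the files most likely at fault.
    loc_prompt = (
        "Issue:\n" + issue_text + "\n\nFiles:\n" + "\n".join(repo_files)
        + "\n\nList the files most likely to contain the bug."
    )
    suspect_paths = [p for p in llm_call(loc_prompt).splitlines() if p in repo_files]

    # Agent 2: repairer drafts a unified diff restricted to the suspect files.
    repair_prompt = (
        "Issue:\n" + issue_text + "\n\nRelevant code:\n"
        + "\n\n".join(repo_files[p] for p in suspect_paths)
        + "\n\nReturn a unified diff that fixes the issue."
    )
    return llm_call(repair_prompt)  # candidate patch, to be validated by the tests
```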
Distribution of architectural types among 79 submissions:
- Human-Workflow / Fixed / No Agent (23 entries, median 27%)
- Human-Workflow / Fixed / Multiple Agents (8, median 37%)
- Human-Workflow / Scaffolded / Single Agent (16, median 38%)
- Emergent Workflow / Emergent Autonomy / Single Agent (10, median 25%) (Martinez et al., 20 Jun 2025)
7. Limitations, Recommendations, and Future Improvements
SWE-Bench-Lite’s design entails several strengths and constraints:
Strengths
- End-to-end executable evaluation with real, diverse defects.
- Lightweight evaluation: the fixed set of 300 problems enables rapid model benchmarking.
- Simple, interpretable metrics (pass/fail; per-bug AST checks).
- Broad community uptake enabled by open submission and minimal computational cost barriers.
Limitations
- Python-only; exclusion of multi-language scenarios or cross-language reasoning limits generality (Pham et al., 20 Apr 2025).
- The fixed size and lack of difficulty labels reduce the capacity for fine-grained, stratified benchmarking (Pham et al., 20 Apr 2025).
- Solution and temporal leakage, as well as test suite sparsity, can inflate RR/precision.
- No systematic measurement of patch fragility (single test suite; no adversarial perturbation).
Recommendations
- Enforce leakage detection and curate only post-LLM-cutoff issues in future expansions (Aleithan et al., 9 Oct 2024, Pan et al., 30 Dec 2024).
- Augment test suites via automated LLM-based test generation (UTGenerator/UTBoost) to catch false positives and latent errors (Yu et al., 10 Jun 2025).
- Require architectural disclosure and correctness checklists in submissions.
- Support further stratification by bug type, patch size, or failed-test count to diagnose model strengths (a sketch of such stratified reporting follows this list).
- Maintain unified, forked benchmarking tracks (Lite, Verified, Lite-S, mini-sets) to avoid evaluation fragmentation.
- Promote open-sourcing of agents, scaffolds, and prompt templates (Martinez et al., 20 Jun 2025).
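As a concrete illustration of the stratification recommendation above, resolve rates can be reported per bucket of patch size, bug type, or failed-test count; the grouping keys in the sketch below are hypothetical examples.

```python
# Illustrative stratified reporting: resolve rate broken down by a per-instance
# key such as patch size, bug type, or failed-test count. The keys used here
# are hypothetical examples.
from collections import defaultdict

def stratified_resolve_rate(results: list[dict], key: str) -> dict[str, float]:
    buckets: dict[str, list[bool]] = defaultdict(list)
    for r in results:
        buckets[r[key]].append(r["resolved"])
    return {k: sum(v) / len(v) for k, v in buckets.items()}

# Example: group by the number of files touched by the reference patch.
report = stratified_resolve_rate(
    [{"files_changed": "1", "resolved": True},
     {"files_changed": "2+", "resolved": False}],
    key="files_changed",
)
```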
By addressing these recommendations, SWE-Bench-Lite will continue to serve as a foundational probe for scalable, reproducible, and practical automated program repair evaluation. Its influence is evident in the rapid convergence toward agentic software development workflows and the transparent assessment of LLM capabilities in real-world maintenance tasks.