
SWE-Bench Lite: LLM Bug-Fix Benchmark

Updated 3 February 2026
  • SWE-Bench Lite is a curated benchmark that evaluates LLM and agent-based systems on real-world single-file bug-fixing tasks drawn from 12 major Python repositories.
  • The benchmark employs rigorous evaluation protocols, using metrics like resolved rate, localization accuracy, and cost to enable fair and reproducible comparisons.
  • Recent innovations such as UTBoost and TestPrune enhance test suite quality by addressing solution leakage and weak test coverage for more reliable benchmarking.

SWE-Bench Lite is a curated benchmark for evaluating LLM and agent-based systems on end-to-end bug-fixing tasks in real-world Python repositories. It represents a subset of the broader SWE-Bench suite, with specific sampling and filtering aimed at tractability and reproducibility for rigorous research comparisons.

1. Construction, Scope, and Task Design

SWE-Bench Lite consists of 300 handpicked tasks drawn from the original 2,294-issue SWE-Bench corpus, which spans twelve prominent open-source Python repositories (astropy, django, flask, matplotlib, pylint, pytest, requests, scikit-learn, seaborn, sphinx, sympy, and xarray) (Aleithan et al., 2024, Martinez et al., 20 Jun 2025, Xia et al., 2024, Pan et al., 2024). The benchmark is strictly limited to single-file bug-fixing instances:

  • Each instance provides a real GitHub issue description corresponding to a bug, the pre-fix repository state, and a pull request containing the developer-authored patch plus associated unit tests.
  • All tasks involve only one bug and require a one-file patch in ground truth. No synthetic puzzles, toy code, or feature requests are admitted (Xia et al., 2024).
  • The Lite subset excludes ambiguous statements, multi-file edits, large/complex ground-truth patches, and tests that merely check error-message text (Pan et al., 2024).
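Each instance ships as a structured record. The field names below follow the publicly released SWE-Bench dataset schema; the values shown are illustrative placeholders, not a real instance:

```python
# Shape of a single SWE-Bench Lite instance (illustrative values only).
example_instance = {
    "instance_id": "owner__project-12345",      # hypothetical identifier
    "repo": "owner/project",                    # GitHub repository
    "base_commit": "abc123",                    # pre-fix repository state
    "problem_statement": "Text of the GitHub issue describing the bug.",
    "patch": "diff --git a/module.py b/module.py\n...",     # gold one-file fix
    "test_patch": "diff --git a/tests/test_module.py ...",  # added unit tests
    "FAIL_TO_PASS": ["tests/test_module.py::test_bug_fixed"],  # must flip to pass
    "PASS_TO_PASS": ["tests/test_module.py::test_existing"],   # must keep passing
}

# A valid submission edits the source tree at base_commit; in Lite the
# ground-truth patch always touches exactly one file.
assert example_instance["patch"].startswith("diff --git")
```

The `FAIL_TO_PASS` / `PASS_TO_PASS` split is what makes the held-out evaluation executable rather than text-matching based.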

This design prioritizes tasks suitable for rapid prototyping, evaluation, and iteration, enabling fair comparison between diverse LLM pipelines and APR (Automated Program Repair) architectures (Martinez et al., 20 Jun 2025, Aleithan et al., 2024).

2. Evaluation Protocol and Metrics

The evaluation protocol enforces black-box, repository-level assessment:

  • Primary metric: Resolved rate (percent of instances for which the submitted patch passes all human-written, unseen unit tests):

$$\text{Resolved Rate (RR)} = \frac{|\{\text{tasks for which the submitted patch passes all unit tests}\}|}{N} \times 100\%$$

where N = 300 (Pan et al., 2024, Xia et al., 2024, Martinez et al., 20 Jun 2025).
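The computation itself is straightforward; a minimal sketch, with task IDs and outcomes that are purely illustrative:

```python
def resolved_rate(results, n=300):
    """Resolved rate over the n Lite tasks: a task counts only if the
    submitted patch passes every human-written, held-out unit test."""
    resolved = sum(1 for passed_all_tests in results.values() if passed_all_tests)
    return 100.0 * resolved / n

# Hypothetical run: 82 of 300 instances fully pass (an Agentless-level score).
results = {f"task-{i}": i < 82 for i in range(300)}
print(round(resolved_rate(results), 2))  # → 27.33
```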

  • Other metrics include:
    • Localization accuracy at the file, function, and line level:

    $$\text{LocAcc}_{\text{line}} = \frac{\#\,\text{patches editing the correct lines}}{\#\,\text{total resolved}}$$

    with analogous formulae at file and function granularity (Xia et al., 2024).
    • Cost: average LLM inference cost per problem; e.g., Agentless achieves \$0.34 per instance (Xia et al., 2024).
    • Pass@k / Best@k: the probability that at least one of k model samples solves the task; Best@k applies verifier-based selection (Pan et al., 2024).
    • Auxiliary metrics: empty-patch rate (percentage of runs that propose no edits), stuck-in-loop rate (three identical edits in the last three turns), and average number of turns to solution (Pan et al., 2024).
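Pass@k is conventionally computed with the standard unbiased estimator introduced alongside the HumanEval benchmark, given n ≥ k samples per task of which c are correct:

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased pass@k estimator: probability that at least one of k
    samples, drawn without replacement from n generations of which c
    are correct, solves the task: 1 - C(n-c, k) / C(n, k)."""
    if n - c < k:
        return 1.0  # every size-k draw must contain a correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)

# With 3 correct out of 10 samples, pass@1 is simply 3/10.
print(round(pass_at_k(n=10, c=3, k=1), 2))  # → 0.3
```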

All submitted patches are validated by executing the repository's test suite; systems have no access to the ground-truth patch or to the held-out evaluation tests.
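Concretely, the harness checks two test sets per instance: FAIL_TO_PASS tests (which reproduce the bug and must pass after the patch) and PASS_TO_PASS tests (which must not regress). A minimal sketch of that resolution check, with illustrative test names:

```python
def is_resolved(test_outcomes, fail_to_pass, pass_to_pass):
    """A patch resolves an instance only if every previously failing test
    now passes (FAIL_TO_PASS) and no previously passing test regresses
    (PASS_TO_PASS). test_outcomes maps test id -> bool (passed)."""
    return (all(test_outcomes.get(t, False) for t in fail_to_pass)
            and all(test_outcomes.get(t, False) for t in pass_to_pass))

outcomes = {"test_bug": True, "test_regression": True}
print(is_resolved(outcomes, ["test_bug"], ["test_regression"]))  # → True
```

Missing tests default to failed, so a patch that breaks test collection cannot count as resolved.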

3. Data Quality, Solution Leakage, and Benchmark Limitations

SWE-Bench Lite inherits data integrity challenges from its parent set. Empirical audits and leaderboard analyses highlight two principal threats to validity:

  • Solution leakage: Approximately 33% of "solved" instances contain the exact solution—or sufficient patch details—directly in the issue body or comments, enabling trivial copy-matching rather than semantic repair (Aleithan et al., 2024, Xia et al., 2024). Explicit solution-leak patterns were annotated in (Aleithan et al., 2024), which found that among “passed” GPT-4 fixes, 33.33% were solution leaks.

  • Weak test suites: About 15% of “passes” are traced to insufficient or incomplete test coverage. Incorrect or incomplete patches can escape detection by the original unit tests, as exemplified by high-profile analysis in (Yu et al., 10 Jun 2025) and (Aleithan et al., 2024). The effective correct-fix rate for GPT-4 on Lite falls from 18.0% raw to 9.33% after filtering for test strength and leakage (Aleithan et al., 2024).

Additionally, over 94% of Lite’s issues predate major LLM knowledge cut-offs, leading to potential data contamination during model pre-training (Aleithan et al., 2024).

The Lite-S subset, proposed in (Xia et al., 2024), eliminates instances with verbatim ground-truth patches, severe information deficits, or misleading solution cues. Lite-S contains 252 rigorously filtered problems, offering a more trustworthy basis for system comparisons.
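One way to operationalize such a leak screen is to flag instances whose gold-patch additions appear verbatim in the issue text. This is a hypothetical heuristic for illustration, not the exact Lite-S filter; `min_hits` is an assumed threshold:

```python
def leaks_solution(issue_text, gold_patch, min_hits=2):
    """Hypothetical leak screen: flag an instance if several of the gold
    patch's added code lines appear verbatim in the issue body/comments."""
    added = [line[1:].strip() for line in gold_patch.splitlines()
             if line.startswith("+") and not line.startswith("+++")]
    added = [l for l in added if len(l) > 10]  # skip trivial short lines
    hits = sum(1 for l in added if l in issue_text)
    return hits >= min_hits

issue = "Workaround: change it to `result = compute(x) + 1` and then `return normalize(result)`."
gold = "--- a/f.py\n+++ b/f.py\n+result = compute(x) + 1\n+return normalize(result)\n"
print(leaks_solution(issue, gold))  # → True
```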

4. Benchmark Leaderboard, Agent Architectures, and Systematic Findings

SWE-Bench Lite is the definitive leaderboard for repository-level bug repair evaluations (Martinez et al., 20 Jun 2025). As of July 2025, it has accumulated 79 entries spanning academia, industry, and community developers:

  • Resolved Rate Distribution: Scores range from a baseline 0.33% to 60.3% (top result: ExpRepair with Claude 4 Sonnet).

    • Mean (μ): ~34.2%.
    • Standard deviation (σ): ~11.8%.
  • Open-source vs. Closed-source Models: 67% of submissions are open-source; 33% employ proprietary LLM APIs (Claude, GPT-4 families).
  • Agentic vs. Non-Agentic Approaches: Agentic systems (e.g., multi-phase autonomous agents, scaffolded workflows) dominate with higher medians (36.5% vs. 27.3% for non-agentic).
  • Submitter Types: Small companies are overrepresented at the top by both median and maximum scores; academia achieved the single highest result but a lower median.

5. Recent Advances: Test Suite Augmentation, Test Minimization, and Benchmark Corrections

Systematic benchmarking on SWE-Bench Lite has revealed vulnerabilities in evaluation protocols due to weak or narrow test coverage. Two recent frameworks address these deficits:

  • UTBoost (Yu et al., 10 Jun 2025):
    • Automated test case generation (UTGenerator): LLMs are prompted at file, function/class, and line level to synthesize targeted pytest functions that surface corner cases missed by human authors.
    • Reevaluation pipeline: Agent patches that previously passed original tests are retested under the augmented suite. 234 original “passed” patches (of 300) were revealed as incorrect or erroneously labeled.
    • Leaderboard impact: 40.9% of SWE-Bench Lite leaderboard positions changed after correction; pass@1 dropped by 1–2 points in the top-5 and agent ranks reshuffled.
  • TestPrune (Chen et al., 21 Oct 2025):
    • Issue-based test suite minimization: LLM-guided localizers select suspicious methods, and coverage heuristics identify a minimal subset (~9 tests per task) targeting buggy lines. This process enables high-precision regression validation (precision 0.63, recall 0.71) while reducing suite size 1000-fold and speeding up runs by 27×.
    • Integration: Used in both Otter (test generation) and Agentless (patch validation) pipelines, TestPrune consistently raises resolution rates (2–12.9% relative) on Lite.
    • Cost and Efficiency: TestPrune adds negligible cost ($0.02–$0.05 per task) relative to conventional approaches.
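TestPrune's coverage-guided selection can be viewed as a set-cover problem over the localizer's suspicious lines. A greedy sketch under that framing (illustrative, not the paper's implementation; test IDs and line numbers are made up):

```python
def minimize_suite(coverage, target_lines):
    """Greedy sketch of issue-based test-suite minimization: pick a small
    set of tests whose combined coverage hits all flagged lines.
    coverage: test id -> set of covered line numbers;
    target_lines: lines the localizer flagged as suspicious."""
    remaining = set(target_lines)
    chosen = []
    while remaining:
        best = max(coverage, key=lambda t: len(coverage[t] & remaining))
        gained = coverage[best] & remaining
        if not gained:
            break  # some flagged lines are covered by no test at all
        chosen.append(best)
        remaining -= gained
    return chosen

coverage = {"t1": {1, 2}, "t2": {2, 3}, "t3": {4}}
print(minimize_suite(coverage, {1, 2, 3, 4}))  # → ['t1', 't2', 't3']
```

Greedy set cover is a natural baseline here because it yields a logarithmic approximation to the optimal subset while remaining trivially cheap to run per task.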

Combined, these developments mark a shift towards “benchmarks that scrutinize benchmarks”—routine test suite augmentation and minimization are now best practice to reveal false positives and correct mislabels (Yu et al., 10 Jun 2025, Chen et al., 21 Oct 2025).

6. Comparative Performance and Agentless vs. Agentic Paradigms

Recent studies evaluated simplistic, non-agentic paradigms versus fully autonomous agents (Xia et al., 2024, Pan et al., 2024):

  • Agentless (Xia et al., 2024): A fixed 3-phase system (localization, repair, validation) reached 27.33% resolved (82/300) with GPT-4o, outperforming all contemporary open-source agent-based systems while incurring far lower cost ($0.34 per bug). Line-level localization accuracy achieved 34.3%.
  • Top open-source agents (e.g., AutoCodeRover-v2, CodeR) achieved 30.67% and 28.33% respectively, but at higher cost and comparable localization accuracy.
  • Fine-tuning data (SWE-Gym): Off-policy and self-improvement strategies yield best-of-k scores up to 26.0% in open-weight settings (Pan et al., 2024). Further training-time scaling remains promising, as resolution rates do not saturate with current data sizes.
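The Best@k protocol referenced above — sample k candidate patches, let a verifier rank them, submit only the top pick — can be sketched as follows. The verifier signal (`repro_passed`, a count of reproduction tests passed) is an assumption for illustration:

```python
def best_at_k(candidates, verifier_score, evaluate):
    """Best@k sketch: rank k candidate patches with a verifier (e.g., a
    regression-test signal or a learned scorer), then run the final
    held-out evaluation on only the top-ranked candidate."""
    best = max(candidates, key=verifier_score)
    return evaluate(best)

# Hypothetical: the verifier prefers patches passing more reproduction tests.
patches = [{"id": "a", "repro_passed": 1}, {"id": "b", "repro_passed": 3}]
picked_resolves = best_at_k(patches,
                            verifier_score=lambda p: p["repro_passed"],
                            evaluate=lambda p: p["id"] == "b")
print(picked_resolves)  # → True
```

Under this protocol, Best@k upper-bounds deployment performance only as far as the verifier's ranking correlates with true correctness.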

A plausible implication is that, while agentic decomposition and advanced workflows offer improvement, careful design of test regimes and cost-effective prompt engineering can enable competitive baselines.

7. Recommendations and Implications for Benchmark Usage

SWE-Bench Lite’s rapid adoption within the software engineering research community demonstrates its utility but also highlights the necessity for best practices (Martinez et al., 20 Jun 2025, Aleithan et al., 2024):

  • Screen for solution leaks: Exclude issues where literal patches or excessive hints are present.
  • Audit and enhance test suites: Apply frameworks like UTBoost and TestPrune to combat overestimation due to insufficient coverage.
  • Prefer robust metrics: Consider filtered resolution rates, localization accuracy, and cost efficiency—not just raw fix counts.
  • Adopt enhanced subsets: Use Lite-S for rigorous, leak-free comparisons and transition to SWE-Bench⁺ for cut-off aligned, leakage-free tasks (Aleithan et al., 2024).

Without these interventions, results may substantially overstate true LLM or agentic system capabilities—even by factors of two to three.


