SWE-bench Lite Benchmark

Updated 9 March 2026

SWE-bench Lite is a specialized program-repair benchmark featuring 300 curated Python bug-fix tasks extracted from major open-source repositories.
It employs precise evaluation protocols and metrics, including % Resolved and MRR, to assess automated program repair (APR) performance in containerized, real-world settings.
The benchmark drives methodological advances and innovations in LLM- and agent-based repair systems while revealing critical issues like solution leakage and test insufficiency.

SWE-bench Lite is a widely adopted program-repair benchmark derived as a fast, cost-efficient subset of the original SWE-bench suite. It enables fine-grained, real-world evaluation of automated program repair (APR) and software engineering (SWE) agents, with a focus on LLM- and agent-based systems. The benchmark is constructed from actual bug-fixing GitHub issues and the corresponding patches from mature open-source Python projects. SWE-bench Lite underpins a large leaderboard-driven ecosystem and has catalyzed methodological advances, rigorous evaluation practices, and critical scrutiny of APR progress in the era of large language models.

1. Dataset Construction and Task Specification

SWE-bench Lite comprises 300 curated bug-fixing tasks extracted from 11–12 major open-source Python repositories, all sampled from the 2,294 issues and pull requests of the full SWE-bench dataset. Each instance contains:

A natural-language GitHub issue description (including title and details, but omitting test files for evaluation realism).
A repository snapshot at the buggy commit.
The ground-truth (gold) patch as applied by the original developer.
The full regression test suite as present prior to the human fix (Martinez et al., 4 Feb 2026, Chen et al., 21 Oct 2025, Pan et al., 2024, Martinez et al., 20 Jun 2025, Aleithan et al., 2024).

The selection procedure ensures:

Proportional distribution of repositories and issue types (e.g., off-by-one errors, null-pointer dereferences, API misuses, logical bugs).
A preference for tasks whose tests execute deterministically in containerized environments.
Exclusion of overly large or ambiguous issues, tasks requiring edits to more than one file, and tasks whose tests merely confirm error messages (Chen et al., 21 Oct 2025, Pan et al., 2024).

Each Lite instance is a standalone task: produce a patch such that, once it is applied to the repository, all previously passing tests still pass and all tests that failed before the patch now pass. Project test suites average ~9,012 unit and integration tests, and the total evaluation workload remains feasible even at scale (Chen et al., 21 Oct 2025).
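The pass criterion above can be sketched as a small check over test outcomes. This is a minimal illustration, not the official harness: the `TaskInstance` class, its field names, and the example test IDs are assumptions made for this sketch.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet


@dataclass(frozen=True)
class TaskInstance:
    """Hypothetical container for one Lite task (field names are illustrative)."""
    instance_id: str
    fail_to_pass: FrozenSet[str]  # tests that fail on the buggy commit and must pass after the patch
    pass_to_pass: FrozenSet[str]  # tests that already pass and must keep passing


def is_resolved(task: TaskInstance, results: Dict[str, bool]) -> bool:
    """Return True if every required test passed on the patched repository.

    `results` maps test IDs to pass/fail outcomes. A required test that is
    missing from `results` is treated as a failure.
    """
    required = task.fail_to_pass | task.pass_to_pass
    return all(results.get(test_id, False) for test_id in required)


task = TaskInstance(
    instance_id="example__repo-1",  # made-up identifier
    fail_to_pass=frozenset({"test_bug_is_fixed"}),
    pass_to_pass=frozenset({"test_existing_a", "test_existing_b"}),
)

good_run = {"test_bug_is_fixed": True, "test_existing_a": True, "test_existing_b": True}
regressing_run = {"test_bug_is_fixed": True, "test_existing_a": False, "test_existing_b": True}

print(is_resolved(task, good_run))        # True
print(is_resolved(task, regressing_run))  # False
```

Treating a missing test result as a failure is a conservative choice: a patch that prevents a required test from running does not count as a fix.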

2. Evaluation Protocols and Metrics

The benchmark supports several core evaluation modalities:

Issue Resolution (Patch Generation): Systems must submit a patch that (i) compiles, (ii) passes all previously failing reproduction and regression tests, and (iii) does not regress any existing tests.
- Resolution Rate: $R = \frac{\text{\# issues correctly resolved}}{300}$
- The Lite leaderboard reports only Precision (identical to “% Resolved”), corresponding to the fraction of tasks for which at least one submitted patch is deemed successful according to the above criteria (Martinez et al., 4 Feb 2026, Chen et al., 21 Oct 2025).
Issue Reproduction: Generate a test that fails on the buggy code and passes on the fixed code.
- F→P@1: Fraction of issues for which the top-1 generated test satisfies this fail-then-pass property (Chen et al., 21 Oct 2025).
Leaderboard Metrics: Additional metrics occasionally reported across studies include Recall, Accuracy, and Mean Reciprocal Rank (MRR), but only % Resolved is enforced for leaderboard ranking (Martinez et al., 4 Feb 2026).
- $\mathrm{MRR} = \frac{1}{N} \sum_{i=1}^N \frac{1}{\textrm{rank}_i}$, where $\textrm{rank}_i$ is the rank of the first correct patch for task $i$.
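The two formulas above translate directly into code. The sketch below is illustrative only: the function names and task IDs are made up, and the convention that a task with no correct patch contributes 0 to the MRR sum is an assumption, since the formula does not say how such tasks are handled.

```python
from typing import Mapping, Optional, Set

LITE_SIZE = 300  # number of tasks in SWE-bench Lite


def resolution_rate(resolved_ids: Set[str], total: int = LITE_SIZE) -> float:
    """R = (# issues correctly resolved) / total."""
    return len(resolved_ids) / total


def mean_reciprocal_rank(first_correct_rank: Mapping[str, Optional[int]]) -> float:
    """MRR over tasks, given the 1-based rank of the first correct patch per task.

    Assumption: None means no correct patch was produced, contributing 0.
    """
    if not first_correct_rank:
        return 0.0
    total = sum(1.0 / rank for rank in first_correct_rank.values() if rank is not None)
    return total / len(first_correct_rank)


# Made-up data: 54 resolved tasks out of 300, and four ranked tasks.
resolved = {f"task-{i}" for i in range(54)}
ranks = {"task-a": 1, "task-b": 2, "task-c": None, "task-d": 4}

print(resolution_rate(resolved))    # 0.18
print(mean_reciprocal_rank(ranks))  # 0.4375
```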

Patches must be validated in an isolated, containerized environment; workflows vary from simple batch evaluation (single fix per task) to agentic multi-attempt protocols.
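The fail-then-pass property behind the F→P@1 metric can also be made concrete. In this sketch each issue is represented by the outcome of its top-1 generated test on the buggy and on the fixed code. The function name and the sample outcomes are hypothetical, and actually running the tests in a container is outside the scope of the example.

```python
from typing import Sequence, Tuple

# (passed_on_buggy_code, passed_on_fixed_code) for the top-1 generated test
Outcome = Tuple[bool, bool]


def fail_to_pass_at_1(outcomes: Sequence[Outcome]) -> float:
    """Fraction of issues whose top-1 test fails on buggy code and passes on fixed code."""
    if not outcomes:
        return 0.0
    hits = sum(1 for on_buggy, on_fixed in outcomes if not on_buggy and on_fixed)
    return hits / len(outcomes)


sample = [
    (False, True),   # reproduces the bug: counts
    (True, True),    # passes on buggy code too: does not reproduce the issue
    (False, False),  # fails everywhere: probably a broken test
    (False, True),   # reproduces the bug: counts
]
print(fail_to_pass_at_1(sample))  # 0.5
```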

3. Systematic Weaknesses: Solution Leakage, Test Adequacy, and Data Contamination

Critical analyses—including "SWE-Bench+: Enhanced Coding Benchmark for LLMs" and "UTBoost"—identified several major threats to the rigor of reported performance on SWE-bench Lite:

Solution Leakage: Approximately one third (18/54) of successful patches in top system evaluations directly regurgitated code or pseudocode that appeared verbatim in issue descriptions or comments. This directly inflates success rates (Aleithan et al., 2024).
Test Suite Insufficiency: About 15% of passing patches are “suspicious”: incorrect or incomplete, yet undetected because the tests are inadequate or weak. An additional ~10% of Lite instances have test suites too weak to distinguish gold from erroneous patches, which is exposed when augmenting with LLM-generated or intramorphic (variant-based) tests (Aleithan et al., 2024, Yu et al., 10 Jun 2025).
Timing-based Data Leakage: Over 94% of Lite issues predate the training cutoffs of frontier LLMs, leaving the set susceptible to knowledge contamination and undermining its credibility as a true generalization test (Aleithan et al., 2024).

These flaws have a pronounced quantitative effect: the effective solution rate for a top model (SWE-Agent+GPT-4) drops from 18% to 9.33% on Lite when solution-leak and weak-test patches are excluded (Aleithan et al., 2024). Automated test-injection and log-parsing improvements (UTBoost) show that 40.9% of leaderboard positions change after correcting such errors (Yu et al., 10 Jun 2025).
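The percentages above can be turned back into task counts as a quick sanity check. This sketch assumes that both rates are taken over all 300 Lite tasks, which is consistent with the 54 successful patches mentioned earlier. The helper name is made up for illustration.

```python
LITE_SIZE = 300  # number of tasks in SWE-bench Lite


def count_from_rate(rate_percent: float, total: int = LITE_SIZE) -> int:
    """Convert a reported percentage into the nearest whole number of tasks."""
    return round(rate_percent / 100 * total)


raw_resolved = count_from_rate(18.0)         # raw resolve rate
filtered_resolved = count_from_rate(9.33)    # after excluding leaked and weak-test patches
excluded = raw_resolved - filtered_resolved

print(raw_resolved, filtered_resolved, excluded)                           # 54 28 26
print(f"filtered rate: {filtered_resolved / LITE_SIZE:.2%}")               # filtered rate: 9.33%
print(f"share of raw successes excluded: {excluded / raw_resolved:.1%}")   # share of raw successes excluded: 48.1%
```

Under that assumption, roughly half of the raw successes (26 of 54) disappear once leaked and weakly tested patches are filtered out.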

4. System Architectures and Submission Ecosystem

SWE-bench Lite’s leaderboard drives a rich ecosystem of submissions, with 79 entries from 47 submitters (industry, academia, collaborations, open source, and individuals), accounting for 52 unique system architectures (Martinez et al., 4 Feb 2026, Martinez et al., 20 Jun 2025).

System patterns:

Human-crafted static pipelines (G1): e.g., Agentless, employ fixed Issue Localization → Patch Generation → Patch Validation sequences.
Scaffolded/Locally agentic pipelines (G4–G5): e.g., PatchPilot, employ dynamic intra-stage decisions with LLM submodules for context-aware branching.
Emergent agent-based workflows (G6–G7): e.g., SWE-Agent, operate via reinforcement or “ReAct”-style loops with tool use, long-horizon planning, or multi-agent orchestration.
Hybrid and ranking-based designs: Leverage combinations of static and emergent components, plus LLM/critic voting, patch majority, or custom semantic filtering.
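As an illustration of the "patch majority" idea in the last pattern, the sketch below picks the most frequent candidate after a light normalization. It does not reproduce any specific leaderboard system. The normalization rule and the function names are assumptions, and real systems may compare patches semantically instead of textually.

```python
from collections import Counter
from typing import List, Optional


def normalize_patch(patch: str) -> str:
    """Drop trailing whitespace and blank lines so trivially different patches compare equal."""
    lines = [line.rstrip() for line in patch.splitlines()]
    return "\n".join(line for line in lines if line)


def majority_patch(candidates: List[str]) -> Optional[str]:
    """Return the normalized patch proposed most often, or None if there are no candidates.

    Ties are broken in favor of the patch that appeared first.
    """
    if not candidates:
        return None
    counts = Counter(normalize_patch(p) for p in candidates)
    best, _ = counts.most_common(1)[0]
    return best


# Made-up candidates: two agree up to whitespace, one differs.
candidates = [
    "-x = 1\n+x = 2\n",
    "-x = 1\n+x = 2   \n",
    "-x = 1\n+x = 3\n",
]
print(majority_patch(candidates) == "-x = 1\n+x = 2")  # True
```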

Leaderboard insights:

Proprietary LLMs dominate, especially Anthropic Claude (3.5, 3.7, 4 Sonnet) and OpenAI GPT-4(o).
- Claude 3.7 Sonnet + o4-mini achieves up to 60% resolve rate; pure open-source LLMs remain at ≤ 24% on Lite (Martinez et al., 4 Feb 2026, Martinez et al., 20 Jun 2025).
Industry submissions, especially from small companies, comprise the majority, but both academic and open-source approaches remain competitive.
The highest-performing published architectures frequently employ agentic planning, multi-stage or multi-hypothesis branching, and “sanity checker”-style patch vetting (Martinez et al., 4 Feb 2026).

5. Experimental Results and Progression

Agentic and LLM-based Systems

Recent high-performing models evaluated directly on SWE-bench Lite include:

SWE-Adept (GPT-5.2/Claude-4.5): Two-agent framework with depth-first dependency-guided localization and structured, checkpoint-driven resolution (He et al., 1 Mar 2026):
- Localization Acc@3 (file): up to 97.0%
- Localization Acc@5 (function): up to 87.6%
- End-to-end resolution rate: 71.3% (Claude-4.5), a 2.6–6.0 pp gain over prior SOTA.
- Key technical innovations: context-window minimization via DFS, branch-backed patch exploration, and disciplined rollback mechanisms.
Open-weight Agents (with SWE-Gym): Using Qwen-2.5-Coder-Instruct fine-tuned with filtered behavior cloning, combined with reward-model-based verifier reranking, the best@8 resolve rate is 26.0%, the best among open-source models to date (Pan et al., 2024).
Regression test suite minimization (TestPrune): Reduces the ~9,000-test average per project to ~9 per issue, improving issue resolution rates by 9.4–10.7% at negligible cost overhead (<$0.05/issue) (Chen et al., 21 Oct 2025).
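The verifier-reranking idea behind the best@8 figure for open-weight agents can be sketched as follows: sample several candidate patches per task, score each with a verifier, and submit only the top-scoring one. Everything here is hypothetical. In particular, `toy_score` is a stand-in for a learned reward model, and the tasks are made up.

```python
from typing import Callable, Dict, List, Tuple

# (patch text, whether the patch truly resolves the issue). The boolean is
# ground truth used only to score the experiment; the verifier never sees it.
Candidate = Tuple[str, bool]


def select_with_verifier(candidates: List[Candidate],
                         score: Callable[[str], float],
                         k: int) -> Candidate:
    """Keep the first k sampled candidates and return the highest-scoring one."""
    pool = candidates[:k]
    if not pool:
        raise ValueError("need at least one candidate")
    return max(pool, key=lambda cand: score(cand[0]))


def best_at_k(tasks: Dict[str, List[Candidate]],
              score: Callable[[str], float],
              k: int) -> float:
    """Fraction of tasks (assumed non-empty) whose verifier-selected patch is correct."""
    chosen = [select_with_verifier(cands, score, k) for cands in tasks.values()]
    return sum(ok for _, ok in chosen) / len(chosen)


def toy_score(patch: str) -> float:
    """Stand-in for a reward model: simply prefers shorter patches."""
    return -len(patch)


tasks = {
    "task-1": [("a long, unfocused patch", False), ("small fix", True)],
    "task-2": [("tiny", False), ("correct but longer", True)],
}
print(best_at_k(tasks, toy_score, k=1))  # 0.0
print(best_at_k(tasks, toy_score, k=2))  # 0.5
```

The second task shows why verifier quality matters: a correct patch is in the pool, but the toy scorer prefers the wrong one, so sampling more candidates helps only as far as the verifier can tell them apart.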

Automated Test Augmentation and Intramorphic Evaluation

UTBoost, which combines LLM-based unit test generation with log-parsing correction, reports the following:

10.3% of agent-resolvable Lite instances have insufficient tests.
28.4% of “passing” agent patches for these instances are erroneous when exposed to new tests.
After correction, 18/44 leaderboard agents experience rank changes (40.9%) on Lite—a correction rate not observed in prior program-repair benchmarks (Yu et al., 10 Jun 2025).
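The rank-change statistic can be computed by ranking agents before and after correction and counting those whose position differs. The sketch below uses made-up agent names and scores. The tie-handling rule (competition ranking) is an assumption, since the ranking procedure is not specified here.

```python
from typing import Dict, List


def ranks(scores: Dict[str, float]) -> Dict[str, int]:
    """Competition ranking: 1 + number of agents with a strictly higher score."""
    return {
        name: 1 + sum(other > score for other in scores.values())
        for name, score in scores.items()
    }


def changed_ranks(before: Dict[str, float], after: Dict[str, float]) -> List[str]:
    """Names of agents whose rank differs between the two leaderboards (same agents in both)."""
    rank_before, rank_after = ranks(before), ranks(after)
    return sorted(name for name in rank_before if rank_before[name] != rank_after[name])


# Made-up resolve rates (%) before and after re-evaluating with stronger tests.
before = {"agent-A": 40.0, "agent-B": 38.0, "agent-C": 30.0}
after = {"agent-A": 37.0, "agent-B": 38.0, "agent-C": 30.0}

moved = changed_ranks(before, after)
print(moved)                              # ['agent-A', 'agent-B']
print(f"{len(moved) / len(before):.1%}")  # 66.7%
print(f"{18 / 44:.1%}")                   # 40.9%, the figure reported for Lite
```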

6. Limitations, Recommendations, and Evolution

SWE-bench Lite is constrained by Python-only coverage, the small number (300) of tasks, and historical task definitions that do not prevent leakage or guarantee strong test adequacy (Chen et al., 21 Oct 2025, Pan et al., 2024, Aleithan et al., 2024). Its continued use as an APR standard-bearer is contingent on methodological reforms, including:

Eliminating solution-hint content from issue texts and comments.
Retrospectively or prospectively curating tasks to postdate LLM training cutoffs.
Comprehensive strengthening and augmentation of test suites, preferably via automated means (e.g., LLM-injected corner-case generators, intramorphic oracles).
Manual spot-checks and peer review for leaks, test weakness, and semantic correctness.
Transparent reporting of both raw and filtered (weakness-corrected) resolve rates.
Mandating full submission metadata (model, prompt, patch, log) and open-sourcing for reproducibility (Aleithan et al., 2024, Martinez et al., 4 Feb 2026, Martinez et al., 20 Jun 2025, Yu et al., 10 Jun 2025).

Ongoing advances, including newer variants (e.g., Lite 2.0, open-weight agent fly-offs), automated overfitting detection, and cross-language/task generalization, reflect both the benchmark's strengths and the need for its continued evolution.

7. Broader Impact and Research Ecosystem

SWE-bench Lite’s design—repository-level, real-issue-driven, and open—has made it the central platform for APR method development, ablation benchmarking, and leaderboard-driven comparison. It supports the evaluation of modular, agent-based, and hybrid LLM systems; facilitates rapid prototyping and iterative methodology improvement; and has inspired dedicated training environments (e.g., SWE-Gym), as well as critical meta-analyses guiding best practice (Pan et al., 2024, He et al., 1 Mar 2026, Martinez et al., 4 Feb 2026, Chen et al., 21 Oct 2025, Aleithan et al., 2024, Martinez et al., 20 Jun 2025, Yu et al., 10 Jun 2025).

Simultaneously, its limitations have exposed the urgent need for benchmarks that are less vulnerable to data contamination, overfitting, test-weakness artifacts, and transparency deficits. As such, SWE-bench Lite is both a proving ground for state-of-the-art repair agents and a living case study in the experimental rigor essential to measuring progress in data-driven software engineering research.