SWE-bench Lite: Bug-Fixing Benchmark

Updated 30 December 2025
  • SWE-bench Lite is a benchmark dataset that evaluates LLM agents on real-world bug-fixing tasks using curated open-source Python repositories.
  • The dataset comprises 300 GitHub issues with repository snapshots, natural-language bug descriptions, gold patches, and categorized unit tests (PASS_TO_PASS and FAIL_TO_PASS).
  • Test augmentation via UTBoost revealed weaknesses in original test suites and shifted leaderboard rankings, highlighting the need for robust evaluation methodologies.

SWE-bench Lite is a rigorously constructed benchmark for evaluating LLM agents on real-world code generation, specifically bug-fixing tasks. It is a filtered subset of the broader SWE-bench family and is designed to present autonomous agents with authentic, non-synthetic software engineering problems extracted from open-source Python repositories. Each instance constitutes a triad: a repository snapshot, a GitHub issue description, and a corresponding human-authored patch with associated test cases. The dataset has been intensively scrutinized for issues such as insufficient test coverage, information leakage, and annotation artifacts, and has been the focus of standardized test-augmentation and sanitization methodologies in recent research.

1. Dataset Composition and Structure

SWE-bench Lite consists of 300 real-world GitHub issues exclusively focusing on bug-fixing tasks, sampled from 11 widely used Python repositories. Each issue–pull-request instance is, to the extent possible, self-contained: it includes a repository snapshot at a specified commit, a natural-language description of the software bug (the GitHub issue), a gold patch (the human-provided solution), and a curated suite of unit tests. The unit test suite for each instance is typically small, containing 1–5 test functions, and is divided into two non-overlapping categories:

  • PASS_TO_PASS: Tests which must pass both before and after patching.
  • FAIL_TO_PASS: Tests which fail on the buggy code, but pass after applying the correct patch.

All test cases are constructed using either the pytest or unittest framework and were written by the original maintainers or the SWE-bench Lite creators. However, these manually authored test cases tend to exercise only the most explicit edge cases referenced in issue descriptions. The dataset is intended as an evaluation-only resource and lacks predefined train/development/test splits; users may define custom splits for experimental purposes (Yu et al., 10 Jun 2025, Xia et al., 1 Jul 2024).
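
The sketch below shows what one instance looks like when loaded through the Hugging Face `datasets` library; the dataset ID and field names follow the public SWE-bench release, but should be verified against the version actually downloaded (in the released data, the test lists are JSON-encoded strings).

```python
# Minimal sketch: inspecting a SWE-bench Lite instance via the Hugging Face
# `datasets` library. Field names follow the public SWE-bench schema.
import json
from datasets import load_dataset

ds = load_dataset("princeton-nlp/SWE-bench_Lite", split="test")  # 300 instances
inst = ds[0]

print(inst["instance_id"])                  # e.g. "<owner>__<repo>-<PR number>"
print(inst["repo"], inst["base_commit"])    # repository snapshot to check out
print(inst["problem_statement"][:200])      # natural-language issue text

gold_patch = inst["patch"]        # human-authored fix
test_patch = inst["test_patch"]   # patch that adds/enables the curated tests

# Test categories; stored as JSON-encoded lists in the released dataset.
fail_to_pass = json.loads(inst["FAIL_TO_PASS"])   # must flip from fail to pass
pass_to_pass = json.loads(inst["PASS_TO_PASS"])   # must keep passing
print(len(fail_to_pass), len(pass_to_pass))
```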

2. Test Coverage and Benchmark Deficiencies

Subsequent research exposed several critical deficiencies in the original test suites of SWE-bench Lite. Using the UTBoost framework, an augmentation pipeline grounded in LLM-driven unit test generation (UTGenerator), systematic analysis revealed that 23 of 300 instances (7.7%) had insufficient tests. For these flagged cases, 599 agent-generated patches passed all original tests; of these, 170 were later shown to be incorrect under the expanded test suites, yielding an error rate of $E_\ell = 170/599 \approx 28.4\%$ for Lite. This outcome underscores the insufficiency of original test coverage: canonical tests often fail to disqualify spurious patches that do not actually resolve the underlying bug. A representative case (mwaskom__seaborn-3010) demonstrated how a missing test for an edge case (only $x$ is NaN) allowed an incorrect agent patch to pass undetected (Yu et al., 10 Jun 2025).

In addition to original test insufficiency, parsing-related artifacts were corrected, revealing that 54.7% of Lite instances (164/300) had at least one test log mis-parsed, leading to the identification of 64 additional erroneous patches (Yu et al., 10 Jun 2025).
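
To illustrate why such mis-parsing occurs, the following is a hypothetical log parser of the naive kind described above: it keys on per-test status lines, so custom reporters, multi-line assertion output, or library-specific formatting can silently drop or misclassify results. It is not the actual SWE-bench parsing code.

```python
# Illustrative (hypothetical) pytest log parser: maps "file::test STATUS"
# lines to outcomes. Anything that deviates from this line format is ignored,
# which is exactly the failure mode behind the mis-parsed logs noted above.
import re

STATUS_RE = re.compile(r"^(?P<test>\S+::\S+)\s+(?P<status>PASSED|FAILED|ERROR)\b")

def parse_pytest_log(log_text: str) -> dict[str, str]:
    """Map test identifiers (file::test_name) to their reported status."""
    statuses: dict[str, str] = {}
    for line in log_text.splitlines():
        m = STATUS_RE.match(line.strip())
        if m:
            statuses[m.group("test")] = m.group("status")
    return statuses

log = """
tests/test_core.py::test_nan_handling FAILED
tests/test_core.py::test_basic PASSED
"""
print(parse_pytest_log(log))
# {'tests/test_core.py::test_nan_handling': 'FAILED',
#  'tests/test_core.py::test_basic': 'PASSED'}
```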

3. Construction and Filtering Methodology

The selection of SWE-bench Lite from the full SWE-bench involved excluding feature-addition tasks and curating for functional, self-contained bug fixes. Manual analysis by follow-up studies classified benchmark items based on the sufficiency of problem descriptions, the presence or absence of solution leakage, and the quality of location information:

  • 9.3% of issues were deemed to lack sufficient information, rendering them unsolvable.
  • 4.3% embedded the exact gold patch in the description.
  • 10% included the complete solution in natural language.
  • 4.3% exhibited misleading instructions incompatible with the actual gold patch.

To mitigate the effect of such confounds, the SWE-bench Lite-S subset was defined by removing all problems with exact solution leakage, insufficient information, or misleading directives. This resulted in a cleaned corpus of 252 instances (84% of the original Lite set), recommended for reliable head-to-head comparison of repair agents (Xia et al., 1 Jul 2024).
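
A minimal sketch of this kind of filtering is shown below, assuming instances have been manually annotated with quality flags; the flag names are hypothetical (SWE-bench Lite itself ships no such annotations), and the logic stands in for the manual curation performed by Xia et al.

```python
# Sketch of Lite-S-style filtering over annotated instances.
# The flag names below are hypothetical labels for the manual classifications.
EXCLUDE_FLAGS = {
    "exact_patch_in_description",   # exact gold patch leaked in the issue text
    "insufficient_information",     # issue unsolvable from its description alone
    "misleading_instructions",      # guidance incompatible with the gold patch
}

def build_lite_s(instances: list[dict]) -> list[dict]:
    """Drop instances whose annotation flags indicate leakage or unsolvability."""
    return [
        inst for inst in instances
        if not (set(inst.get("quality_flags", [])) & EXCLUDE_FLAGS)
    ]

# With the flag rates reported above, roughly 84% of the 300 Lite instances
# (252) survive this filter.
```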

4. Evaluation Protocols and Metrics

The evaluation protocol for SWE-bench Lite requires agents to propose a patch that, when applied, causes all provided unit tests to pass. Historically, leaderboards reported the percentage of resolved problems:

$$\%\,\mathrm{Resolved} = 100 \times \frac{\text{Number of correctly fixing patches}}{\text{Total number of problems}}$$

Recent augmentation via UTBoost incorporates a two-stage evaluation (see the sketch after this list):

  1. Original-suite filtering: Only candidate patches whose behavior matches the gold patch on the original test suite are advanced.
  2. Test augmentation and re-check: A battery of LLM-generated tests (UTGenerator) is applied to both gold and candidate patches. If the candidate fails any newly synthesized test that the gold patch passes, it is demoted.
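
The following is a minimal sketch of this two-stage protocol. The `run_tests` callable is a hypothetical helper that applies a patch to the repository snapshot and returns per-test outcomes; it stands in for the actual SWE-bench/UTBoost harnesses.

```python
# Two-stage check: original-suite filtering, then augmented-test re-check.
def evaluate_candidate(candidate_patch: str,
                       gold_patch: str,
                       original_tests: list[str],
                       augmented_tests: list[str],
                       run_tests) -> bool:
    # Stage 1: the candidate must pass the full original suite
    # (all FAIL_TO_PASS and PASS_TO_PASS tests) just as the gold patch does.
    if not all(run_tests(candidate_patch, original_tests).values()):
        return False

    # Stage 2: re-check against LLM-generated tests. Only tests that the gold
    # patch itself passes serve as oracles; failing any of them demotes the
    # candidate even though it survived stage 1.
    gold_results = run_tests(gold_patch, augmented_tests)
    oracle_tests = [t for t, passed in gold_results.items() if passed]
    return all(run_tests(candidate_patch, oracle_tests).values())
```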

Leaderboard standings shifted substantially after adoption of the augmented evaluation: 40.9% of SWE-bench Lite agent rankings changed (18/44), evidencing the importance of robust, adversarial test coverage (Yu et al., 10 Jun 2025). Metrics also include the error rate under augmentation and location accuracy (the percentage of patches whose edits cover a superset of the ground-truth locations):

$$\%\,\mathrm{CorrectLocation} = 100 \times \frac{\#\,\text{patches editing a superset of ground-truth locations}}{\text{Total number of problems}}$$
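
Both metrics can be computed from per-instance evaluation records, as in the sketch below; the record fields (`resolved`, `edited_locations`, `gold_locations`) are hypothetical names for data an evaluation harness would produce.

```python
# Computing the two leaderboard metrics from per-instance records.
def percent_resolved(records: list[dict]) -> float:
    return 100 * sum(r["resolved"] for r in records) / len(records)

def percent_correct_location(records: list[dict]) -> float:
    # A patch is location-correct if the set of locations it edits is a
    # superset of the locations edited by the gold patch.
    hits = sum(
        set(r["gold_locations"]) <= set(r["edited_locations"]) for r in records
    )
    return 100 * hits / len(records)
```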

5. Identified Artifacts and Quality Risks

Analysis of SWE-bench Lite highlighted several sources of potential bias or overfitting:

  • Solution Leakage: Presence of gold patch or explicit step-by-step natural language guides in the issue text compromises benchmark integrity.
  • Insufficient Test Suites: Over-reliance on minimal, developer-authored tests allows agent-generated patches to achieve superficial correctness.
  • Location Hints: Variability in whether descriptions point precisely to file/function/line or give only vague indicators introduces uneven difficulty.
  • Misparsing: Test log mis-parsing, especially on multi-line assertions or library-specific log formatting, led to erroneous success or failure annotations.

By excising leaky or unsolvable problems (forming SWE-bench Lite-S) and augmenting tests with UTBoost, these artifacts can be largely mitigated (Xia et al., 1 Jul 2024, Yu et al., 10 Jun 2025).
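
A heuristic sketch of a solution-leakage audit is given below: it flags instances whose issue text contains added lines from the gold patch verbatim. This is a simplification for illustration, not the classification procedure actually used by Xia et al.

```python
# Heuristic leakage check: does the issue text quote the gold patch verbatim?
def has_exact_patch_leak(problem_statement: str, gold_patch: str,
                         min_line_len: int = 20) -> bool:
    added_lines = [
        line[1:].strip()
        for line in gold_patch.splitlines()
        if line.startswith("+") and not line.startswith("+++")
    ]
    return any(
        line in problem_statement
        for line in added_lines
        if len(line) >= min_line_len   # ignore short or trivial lines
    )
```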

6. Recommendations and Impact

Recent research offers several recommendations and observations to maintain a high-quality benchmark:

  • Always complement hand-written tests with LLM-generated adversarial edge cases; this is particularly vital for repositories with complex dependency graphs and abundant legacy code.
  • Adopt intramorphic testing (white-box oracle): compare agent and gold patches on identical test inputs.
  • Systematically audit for solution leakage and stratify leaderboards by difficulty or leak status.
  • Maintain resilient log-parsing infrastructure to cope with the heterogeneity of open-source test outputs.
  • Re-evaluate legacy benchmarks when new solutions or agents emerge, as retrospective test augmentation is scalable and frequently identifies additional evaluation gaps.
  • Extend the benchmark to additional programming languages and to currently unsolved or non-Python targets (Yu et al., 10 Jun 2025, Xia et al., 1 Jul 2024).

The integration of UTBoost and manual sanitization procedures has materially shifted both the empirical landscape and methodological rigor of SWE-bench Lite, with substantial changes to leaderboard standings (40.9% of entries reordered) and the exposure of a roughly 28% error rate among previously passing agent patches. These measures jointly advance the maintainability and diagnostic value of the dataset for the broader program repair and LLM-agent research community.
