SWE-bench Lite-S: High-Precision Code Repair Benchmark
- SWE-bench Lite-S is a high-precision benchmark that curates real-world Python bug fixes by eliminating patch leakage and under-specification.
- It employs detailed manual curation on description quality, solution clarity, and location information to ensure only solvable, pristine tasks are included.
- The benchmark supports robust evaluation with metrics like Fix–Accuracy, driving innovation in LLM-driven automated program repair systems.
SWE-bench Lite-S constitutes a high-precision, rigorously filtered evaluation benchmark for automated code repair, derived from real-world Python bug-fixing tasks. Designed to resolve critical deficiencies identified in the original SWE-bench Lite dataset—including patch leakage, under-specification, and misleading guidance in issue descriptions—Lite-S supports robust measurement of LLM and agent-based software engineering systems by excluding all contaminated or unsolvable tasks. Its adoption has established a new, more stringent baseline for assessing autonomous program repair agents, prompting methodological reevaluation and accelerating progress in the LLM-driven automated program repair ecosystem.
1. Motivation and Genesis
SWE-bench Lite-S emerged in direct response to weaknesses in the earlier SWE-bench Lite benchmark, which was constructed to enable rapid, lower-cost evaluation of automated program repair on real issues sourced from open-source Python repositories. Researchers discovered that approximately 20% of Lite’s 300 tasks were compromised by one or more of the following: verbatim patch leakage in the issue description, insufficient information to specify a testable repair, or misleading natural-language “solutions” incongruent with the repository’s ground-truth fix. This contamination risked inflating reported performance by allowing shortcut “copy” strategies or forcing participants to guess at under-specified requirements (Xia et al., 2024).
To address this, every Lite task underwent granular manual curation along three annotation axes: Description Quality, Solution-In-Description, and Location Information. Any task exhibiting (1) exact-patch leakage (≥80% diff overlap), (2) lack of a unique, testable spec, or (3) misleading or incorrect solution guidance was systematically excluded, resulting in the high-fidelity Lite-S set.
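The three exclusion criteria above can be sketched as a simple filter. This is a minimal illustration, not the curators' actual tooling: the `Annotation` record, its field names, and the idea that diff overlap arrives as a precomputed fraction are all assumptions, as is mapping "lack of a unique, testable spec" onto the "Insufficient" description label. Only the 80% threshold and the category labels come from the text.

```python
from dataclasses import dataclass

# Threshold taken from the text: >= 80% diff overlap counts as exact-patch leakage.
LEAKAGE_THRESHOLD = 0.80


@dataclass(frozen=True)
class Annotation:
    """Hypothetical per-task annotation record (field names are illustrative)."""
    task_id: str
    description_quality: str      # e.g. "Sufficient-NL", "Insufficient"
    solution_in_description: str  # e.g. "No-Solution", "Misleading"
    patch_overlap: float          # assumed: fraction of the ground-truth diff found in the issue text


def is_retained(a: Annotation) -> bool:
    """Return True if the task survives all three exclusion criteria."""
    leaks_patch = a.patch_overlap >= LEAKAGE_THRESHOLD
    under_specified = a.description_quality == "Insufficient"
    misleading = a.solution_in_description == "Misleading"
    return not (leaks_patch or under_specified or misleading)


def curate(annotations: list[Annotation]) -> list[Annotation]:
    """Keep only the tasks that pass the filter, preserving input order."""
    return [a for a in annotations if is_retained(a)]


tasks = [
    Annotation("t1", "Sufficient-NL", "No-Solution", 0.0),
    Annotation("t2", "Sufficient-NL", "Complete-NL-Recipe", 0.85),
    Annotation("t3", "Insufficient", "No-Solution", 0.0),
    Annotation("t4", "Reproducible-Example", "Misleading", 0.1),
]
print([a.task_id for a in curate(tasks)])  # ['t1']
```

In this toy input, `t2` is dropped for patch leakage, `t3` for under-specification, and `t4` for misleading guidance, so only `t1` is retained.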
2. Dataset Composition and Annotation
After curation, SWE-bench Lite-S comprises precisely 252 real GitHub issues across twelve major Python projects (e.g., Django, SymPy, scikit-learn, Sphinx) (Xia et al., 2024; Chen et al., 21 Oct 2025). Each task supplies:
- The full repository snapshot and executable environment.
- The original, pruned issue description (with all “cheat” or misleading solutions excised).
- A developer-authored, test-driven ground-truth patch serving as the correctness oracle.
Annotation statistics for the 252 retained instances are as follows:
| Annotation Axis | Category | Distribution (%) |
|---|---|---|
| Description Quality | Sufficient-NL | 48.0 |
| | Reproducible-Example | 21.8 |
| | Partially-Reproducible | 13.5 |
| | Insufficient | 0.0 |
| Solution-In-Description | No-Solution | 63.5 |
| | Partial-NL-Hint | 15.1 |
| | Complete-NL-Recipe | 21.4 |
| | Exact-Patch / Misleading | 0.0 |
| Location Information | Line-Level | 8.7 |
| | Function-Level | 44.4 |
| | Keyword-Searchable | 47.1 |
| | None | 0.0 |
All instances are single-file fixes, with multi-file editing and highly complex patches eliminated by design. This ensures each remaining issue both requires and rewards genuine localization and source code reasoning, invalidating trivial retrieval-or-copy approaches.
3. Evaluation Protocols and Metrics
The principal evaluation metric for SWE-bench Lite-S is "Fix–Accuracy" or "Resolved Rate," formally:
$$\text{Fix–Accuracy} = \frac{\#\,\text{tasks with a patch passing all tests}}{\#\,\text{tasks evaluated}}$$
This measures the fraction of tasks for which the system produces a patch passing all repository regression tests, using the developer test suite as the correctness standard. Secondary metrics include:
- Resolved Rate: identical to Fix–Accuracy.
- Average \$ Cost: LLM API call expenditure per task.
- Avg. #Tokens: input/output token budget per task.
- $\%\text{Correct–Location}_{\ell}$: the share of tasks whose patch edits the correct location (with $\ell$ parameterizing the granularity).
Distinctly, Lite-S’s removal of unsolvable and “leaky” tasks ensures these metrics strictly reflect substantive code understanding and do not reward superficial pattern matching or information leakage (Xia et al., 2024).
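As a rough sketch of how these metrics might be aggregated from per-task outcomes, the snippet below computes Fix–Accuracy alongside the average cost and token figures. The `TaskResult` record and its field names are hypothetical and chosen for illustration; they do not describe the official evaluation harness.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class TaskResult:
    """Hypothetical per-task outcome record (field names are illustrative)."""
    task_id: str
    resolved: bool   # True if the generated patch passed every test for the task
    cost_usd: float  # LLM API spend for this task
    tokens: int      # input + output tokens used for this task


def fix_accuracy(results: list[TaskResult]) -> float:
    """Fraction of evaluated tasks whose patch passed all tests."""
    if not results:
        raise ValueError("no results to score")
    return sum(r.resolved for r in results) / len(results)


def summarize(results: list[TaskResult]) -> dict[str, float]:
    """Primary metric plus the two per-task averages."""
    return {
        "fix_accuracy": fix_accuracy(results),
        "avg_cost_usd": mean(r.cost_usd for r in results),
        "avg_tokens": mean(r.tokens for r in results),
    }


runs = [
    TaskResult("t1", True, 0.50, 1000),
    TaskResult("t2", False, 0.25, 2000),
    TaskResult("t3", True, 0.75, 3000),
]
summary = summarize(runs)
print(summary)
```

With two of three toy tasks resolved, `fix_accuracy` comes out to 2/3; unresolved tasks still count toward the cost and token averages.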
4. Leaderboard Structure and System Diversity
The SWE-bench Lite-S leaderboard, as of mid-2025, comprised 79 submissions representing 52 unique repair architectures (Martinez et al., 20 Jun 2025). Major trends include:
- Model Source: Proprietary LLMs (notably Claude 3.5/4, GPT-4) underpin ~72% of entries; open-source LLMs (Qwen 2.5, Llama 3) form ~28%.
- Architectural Taxonomy: Entries span Human-Workflow/Fixed/No-Agent (G1), Human-Workflow/Fixed/Multi-Agent (G3), Human-Workflow/Scaffolded/Single-Agent (G4), and fully agentic (G6) groupings. The highest Resolved Rates are observed in mixed-agent and scaffolded single-agent pipelines, both of which exceed 50%.
- Submitter Type: Submissions come from small companies (~24%), large corporations (~9%), academia (~32%), academia–industry collaborations (~15%), and the open-source/individual sector (~10%). Notably, small startups display high innovation density and claim three of the top four leaderboard positions.
The leaderboard uniformly orders entrants by decreasing Fix–Accuracy, with no composite or multi-factor ranking.
5. Recent Methodological Advances and Tool Integration
SWE-bench Lite-S has catalyzed methodological advances in agent-based, LLM-driven program repair:
- Agentless Baselines: Simple, modular approaches eschewing autonomous multi-step tool usage—such as Agentless’s three-phase pipeline (localization, repair, patch validation)—set strong, interpretable baselines under the Lite-S regime (Xia et al., 2024).
- TestPrune for Efficient Test Minimization: TestPrune, an agent-agnostic regression test minimizer, achieves a >1000× reduction in test suite size per task (∼9000 to ∼9 tests) and drives up Fix–Accuracy rates by 9–11% with negligible cost overhead (\$0.02–\$0.05/task), primarily by suppressing context noise and enhancing patch validation precision (Chen et al., 21 Oct 2025).
- SWE-Gym Agent Training: Fine-tuning open-weight agents (e.g., Qwen2.5-Coder) on successful agent trajectories plus inference-time verifier selection (Best-of-K) yields up to 26% resolved rate—closing the gap to proprietary models, particularly when paired with trajectory reward modeling (Pan et al., 2024).
These approaches demonstrate that the rigor of Lite-S amplifies the value of architectural/scaffolding innovations and test-time ablation, highlighting both model and system-level bottlenecks.
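The Best-of-K selection mentioned above can be illustrated with a minimal sketch: sample K candidate patches, score each with a verifier, and keep the top-scoring one. The `verifier` callable and the toy scores below are assumptions for illustration only; in the cited work the verifier is a trained model, which is not reproduced here.

```python
from typing import Callable, Sequence


def best_of_k(candidates: Sequence[str], verifier: Callable[[str], float]) -> str:
    """Return the candidate patch with the highest verifier score.

    Ties resolve to the earliest candidate, following Python's max().
    """
    if not candidates:
        raise ValueError("need at least one candidate patch")
    return max(candidates, key=verifier)


# Toy stand-in for a learned verifier: a fixed lookup table of scores.
scores = {"patch_a": 0.2, "patch_b": 0.9, "patch_c": 0.5}
chosen = best_of_k(list(scores), scores.__getitem__)
print(chosen)  # patch_b
```

Selection quality here depends entirely on the verifier: with a perfect verifier, Best-of-K resolves a task whenever any of the K candidates is correct, while a noisy verifier can pick a failing patch even when a passing one was sampled.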
6. Impact and Community Adoption
Since its introduction, Lite-S has established itself as the de facto “clean” subset for rigorous comparison of program repair systems. By removing shortcut-prone and ambiguous issues, it enables:
- Reliable benchmarking of true code understanding and localization abilities.
- Discouragement of overfitting to task artifacts and trivial solution modes.
- Accelerated innovation, observable in the increasing accuracy ceiling (breakthroughs above 60% resolved rate from proprietary multi-agent and scaffolded architectures) (Martinez et al., 20 Jun 2025).
- Widespread adoption in academic and industry experimental pipelines, and as a default ablation dataset for investigating test regression, context pruning, and outcome-based validation strategies.
A plausible implication is that as systems saturate Lite-S performance, future evaluation will shift to even larger and more structurally diverse benchmarks (e.g., SWE-bench Verified).
7. Extensions and Limitations
SWE-bench Lite-S’s focus on single-file, clearly specified Python bugfixes enhances reproducibility and scalability. However, all experiments and most tools are limited to Python, and the benchmark’s preclusion of multi-file and highly complex scenarios means it does not stress cross-cutting code changes or large-scale architectural reasoning (Chen et al., 21 Oct 2025). Extension to multi-language, multi-file, and more open-ended “repair” tasks remains a future research direction.
Moreover, while Lite-S eliminates test and patch leakage, its remaining limitations include the natural cap imposed by single-file repair and a dependency on the completeness and fidelity of the corresponding developer-authored tests. As benchmarks evolve, integrating richer oracles and expanding language/project scope are expected to define the next phase of autonomous program repair evaluation.