SWT-Bench: Python Bug-Reproducing Test Benchmark
- SWT-Bench is a benchmark designed to evaluate the ability of large language models and code agents to generate bug-reproducing tests for actual Python code issues.
- It leverages real GitHub issue reports, minimal bug-fix patches, and corresponding golden tests, ensuring rigorous evaluation in software maintenance scenarios.
- The benchmark uses clear metrics like fail-to-pass success rate and change coverage to quantify test generation accuracy and thoroughness.
SWT-Bench is a rigorously constructed, execution-based benchmark specifically designed to evaluate the ability of LLMs and code agents to generate bug-reproducing tests (BRTs) for real-world Python code bases. Distinct from benchmarks oriented toward code synthesis or repair, SWT-Bench grounds the assessment of test-generation techniques in the context of authentic software maintenance by leveraging real GitHub issue reports, their associated bug-fix patches, and corresponding ground-truth test cases. It defines a clear metric for automated systems: whether, given only an issue report and buggy code, the system can produce a test that fails on the unpatched code and passes with the fix applied. SWT-Bench has catalyzed swift progress in automated test generation, with the best public systems now approaching a 50% resolution rate on the hardest, hand-verified subset (Mündler et al., 2024, Khatib et al., 23 Jul 2025, Xie et al., 18 Feb 2026).
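The fail-to-pass criterion can be illustrated with a toy case. The sketch below is not taken from the benchmark: `buggy_mean`, `fixed_mean`, and `check_mean` are hypothetical stand-ins for the unpatched code, the patched code, and a candidate bug-reproducing test, and a real evaluation would run a test suite against two checkouts of a repository rather than call functions directly.

```python
from typing import Callable

Mean = Callable[[list[float]], float]


def buggy_mean(values: list[float]) -> float:
    # Hypothetical bug: floor division truncates the result.
    return sum(values) // len(values)


def fixed_mean(values: list[float]) -> float:
    # Hypothetical fix: true division.
    return sum(values) / len(values)


def check_mean(mean: Mean) -> None:
    # Candidate bug-reproducing test written from an (imagined) issue report.
    assert mean([1, 2]) == 1.5


def passes(test: Callable[[Mean], None], impl: Mean) -> bool:
    """Return True if the test completes without an AssertionError."""
    try:
        test(impl)
        return True
    except AssertionError:
        return False


def is_fail_to_pass(test: Callable[[Mean], None], buggy: Mean, fixed: Mean) -> bool:
    """A test reproduces the bug if it fails before the fix and passes after."""
    return (not passes(test, buggy)) and passes(test, fixed)


print(is_fail_to_pass(check_mean, buggy_mean, fixed_mean))  # True
```

A test that passes on both versions, or fails on both, does not count: it either misses the bug or is broken for unrelated reasons.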
1. Motivation and Distinction from Prior Benchmarks
SWT-Bench was created to address gaps in the evaluation of automated program understanding and test generation. Existing benchmarks such as HumanEval, MBPP, APPS, and SWE-Bench focus predominantly on code writing or repair, not on the generation of executable tests from natural-language specifications. Prior datasets for test generation (e.g., Defects4J, symbolic-execution examples) are limited in scale or scope (often Java-centric), or lack real-world, high-quality bug reports and ground-truth fixes. There was no large, systematic Python benchmark offering: (a) authentic, natural-language bug reports, (b) the minimal bug-fix patch, and (c) corresponding “golden” tests to serve as an oracle for evaluating test generation.
SWT-Bench uniquely situates the test-generation task at the intersection of natural language processing and program analysis, requiring systems to formalize open-ended bug reports into precise, executable unit tests that are empirically validated on real source code (Mündler et al., 2024, Khatib et al., 23 Jul 2025).
2. Benchmark Construction and Dataset Composition
SWT-Bench is curated from approximately 90,000 merged pull requests across 12 widely used open-source Python repositories, including major projects such as Django, Matplotlib, Astropy, SymPy, and scikit-learn. The data pipeline operates as follows:
- Scrape PR metadata, extracting only PRs that (a) close a referenced GitHub issue, (b) modify at least one test file, and (c) introduce at least one new test that fails on the buggy (pre-merge) codebase.
- For each eligible PR:
  - Record the natural-language issue description, the original codebase, the existing tests, the golden-fix patch, and the golden tests.
  - Validate that the golden tests fail on the original codebase and pass on the patched codebase, ensuring that each instance is a bona fide bug/fix/test triple.
  - Exclude instances incompatible with the coverage toolchain or whose fixes do not resolve the issue as determined by the golden tests.
- Apply additional automated and, for some subsets, manual validation by professional developers to ensure bug report clarity and patch minimality.
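The automatic filtering steps above can be sketched as a predicate over pull-request records. This is a minimal illustration under stated assumptions: the `PullRequest` class, its field names, and the path-based test-file heuristic are hypothetical and do not reflect the benchmark's actual schema or tooling.

```python
from dataclasses import dataclass


@dataclass
class PullRequest:
    # Hypothetical record; field names are illustrative only.
    number: int
    closes_issue: bool
    changed_files: list[str]
    golden_fails_on_buggy: bool   # golden tests fail before the patch
    golden_passes_on_fixed: bool  # golden tests pass after the patch


def touches_tests(pr: PullRequest) -> bool:
    # Assumed heuristic: a changed path containing "test" is a test file.
    return any("test" in path.lower() for path in pr.changed_files)


def is_eligible(pr: PullRequest) -> bool:
    """Keep only PRs that form a validated bug/fix/test triple."""
    return (
        pr.closes_issue
        and touches_tests(pr)
        and pr.golden_fails_on_buggy
        and pr.golden_passes_on_fixed
    )


candidates = [
    PullRequest(1, True, ["pkg/core.py", "tests/test_core.py"], True, True),
    PullRequest(2, True, ["pkg/core.py"], True, True),             # no test file
    PullRequest(3, False, ["tests/test_io.py"], True, True),       # no linked issue
    PullRequest(4, True, ["tests/test_io.py"], False, True),       # test never failed
]
kept = [pr.number for pr in candidates if is_eligible(pr)]
print(kept)  # [1]
```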
Two principal subsets are provided:
- SWT-Bench-Lite: 276 issues with single-file patches and only automatic filtering.
- SWT-Bench-Verified: 433 issues vetted for unambiguous bug description and minimal, meaningful patch+test pairs (Khatib et al., 23 Jul 2025, Mündler et al., 2024).
Statistically, each full-instance codebase averages 210 files (~52,300 LOC), and golden test sets add on average 2.8 new tests per instance (split evenly between failing and passing cases). Issue descriptions average 315 words per instance (Mündler et al., 2024).
3. Evaluation Criteria and Metrics
SWT-Bench employs two complementary, formally defined metrics:
- Fail-to-Pass Success Rate ($R$):
$$R = \frac{1}{N} \sum_{i=1}^{N} s_i$$
where $N$ is the number of instances, and $s_i = 1$ if at least one generated test for instance $i$ fails on the buggy version and passes on the fixed version; otherwise $s_i = 0$ (Khatib et al., 23 Jul 2025).
- Change Coverage ($\Delta C$):
$$\Delta C = \frac{1}{N} \sum_{i=1}^{N} \frac{c_i}{m_i}$$
where $m_i$ is the number of source-code lines modified by the bug-fix in instance $i$, and $c_i$ is the number of those lines covered by generated tests when run on the fixed version. Thus, $\Delta C$ quantifies how thoroughly the generated tests exercise the relevant code changes.
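Both metrics can be computed from per-instance records, as in the sketch below. It assumes that change coverage is averaged per instance (the mean of covered-to-modified line ratios) rather than pooled over all lines; the `InstanceResult` class and the sample numbers are hypothetical and not benchmark data.

```python
from dataclasses import dataclass


@dataclass
class InstanceResult:
    # Hypothetical per-instance record.
    fail_to_pass: bool    # some generated test fails on buggy, passes on fixed
    modified_lines: int   # lines changed by the bug fix
    covered_lines: int    # changed lines executed by the generated tests


def success_rate(results: list[InstanceResult]) -> float:
    """Fraction of instances with at least one fail-to-pass test."""
    return sum(r.fail_to_pass for r in results) / len(results)


def change_coverage(results: list[InstanceResult]) -> float:
    """Mean ratio of covered to modified lines (assumed per-instance average)."""
    ratios = [r.covered_lines / r.modified_lines
              for r in results if r.modified_lines > 0]
    return sum(ratios) / len(ratios)


results = [
    InstanceResult(True, 4, 2),
    InstanceResult(False, 10, 0),
    InstanceResult(True, 2, 2),
    InstanceResult(False, 5, 5),
]
print(success_rate(results))     # 0.5
print(change_coverage(results))  # 0.625
```

The last sample record shows why the two metrics are complementary: a test can execute every changed line and still fail to reproduce the bug.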
Evaluation is performed using standard Python test runners (unittest/pytest) and the built-in trace module for coverage measurement. For each candidate test, the harness records whether it reproduces the bug (fails on the buggy code, passes on the fixed code) and which changed lines it covers. In the leaderboard and public reporting, the fail-to-pass rate $R$ serves as the principal ranking criterion, with change coverage $\Delta C$ as an auxiliary measure of test thoroughness (Mündler et al., 2024, Khatib et al., 23 Jul 2025).
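Measuring which changed lines a test executes can be sketched with a small line tracer. The example below uses `sys.settrace` directly as a simplified stand-in for the coverage step; it is not the benchmark's harness. The `clamp` function and the set of "changed" line offsets are hypothetical, and offsets are counted from the `def` line so the example does not depend on where the code sits in a file.

```python
import sys


# Hypothetical patched function; offsets 3 and 4 are treated as the changed lines.
def clamp(x, low, high):
    if x < low:
        return low
    if x > high:
        return high
    return x


def executed_offsets(func, *args):
    """Return line offsets (relative to the def line) executed by func(*args)."""
    code = func.__code__
    hit = set()

    def tracer(frame, event, arg):
        if frame.f_code is not code:
            return None  # ignore frames of other functions
        if event == "line":
            hit.add(frame.f_lineno - code.co_firstlineno)
        return tracer

    previous = sys.gettrace()
    sys.settrace(tracer)
    try:
        func(*args)
    finally:
        sys.settrace(previous)  # always restore the earlier tracer
    return hit


def changed_line_coverage(func, changed_offsets, *args):
    """Share of the changed lines that one call to func(*args) executes."""
    hit = executed_offsets(func, *args)
    return len(changed_offsets & hit) / len(changed_offsets)


print(changed_line_coverage(clamp, {3, 4}, 5, 0, 3))  # 1.0
print(changed_line_coverage(clamp, {3, 4}, 1, 0, 3))  # 0.5
```

An input above the upper bound reaches both changed lines, while an in-range input reaches only the comparison, which is the kind of difference line-level change coverage is meant to expose.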
4. Systems, Baselines, and Empirical Findings
SWT-Bench has enabled the quantitative comparison of a diverse spectrum of test-generation agents and LLM-based systems, categorized as follows:
- Pure LLM prompting approaches:
- ZeroShot: Basic zero-shot prompts with unified diff output.
- ZeroShotPlus: Employs a fault-tolerant, custom diff format (function-level insert/rewrite blocks).
- Pass@5 (oracle): Generate five candidates via ZeroShotPlus, select the most faithful test.
- LIBRO: Combines few-shot prompting with heuristic execution-trace clustering (Mündler et al., 2024).
- Code Agents adapted for test generation:
- SWE-Agent / SWE-Agent+: Tool-augmented LLM agents, repurposed to write tests with explicit fail-before/fix-after motivation.
- AutoCodeRover: Two-stage agents leveraging code search and single-shot test generation.
- Recent agentic and advanced systems benchmarked:
- AssertFlip: Pass-then-invert test generation; outperforms prior approaches at 43.6% on the Verified subset (Khatib et al., 23 Jul 2025).
- Hybrid-Gym-trained agents: Synthetic-skills pretraining yields a 7.9 percentage point gain (from 9.01% to 16.86%) on SWT-Bench Verified (Xie et al., 18 Feb 2026).
- Otter++, Issue2Test, AEGIS, Amazon Q, and OpenHands: Varied agentic and commercial strategies, with Amazon Q reaching 49.0% on the Verified set (Khatib et al., 23 Jul 2025).
Empirical highlights:
- SWE-Agent achieves 36.4% fail-to-any-test and 9.9% fail-to-pass on the reference set, with a change coverage of 15.5%; LIBRO and AutoCodeRover are comparably strong (Mündler et al., 2024).
- AssertFlip and Otter++ push fail-to-pass rates above 40%, with Amazon Q recently setting the highest mark at 49.0% (Khatib et al., 23 Jul 2025).
- Combining outputs of top methods covers 72.2% of the Verified split, greatly exceeding the coverage of any single model, indicating low overlap between systems' successful bug instances.
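The ensemble figure is a set union over the instances each system resolves. The sketch below shows the computation on made-up data: the system names, instance IDs, and split size are hypothetical and unrelated to the reported results.

```python
def union_coverage(resolved_by_system: dict[str, set[str]], total: int) -> float:
    """Fraction of instances resolved by at least one system."""
    solved = set().union(*resolved_by_system.values())
    return len(solved) / total


def best_single(resolved_by_system: dict[str, set[str]], total: int) -> float:
    """Fraction resolved by the strongest individual system."""
    return max(len(ids) for ids in resolved_by_system.values()) / total


# Hypothetical outcomes on an 8-instance split.
resolved = {
    "system_a": {"i1", "i2"},
    "system_b": {"i2", "i3"},
    "system_c": {"i4"},
}
print(best_single(resolved, total=8))     # 0.25
print(union_coverage(resolved, total=8))  # 0.5
```

The gap between the two numbers grows as the systems' solved sets overlap less, which is the complementarity the ensemble result points to.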
A summary table of principal performance results is below (Verified subset):
| System | Fail-to-Pass (%) | ΔC (%) |
|---|---|---|
| ZeroShotPlus | ~14 | N/A |
| LIBRO | ~18 | N/A |
| Otter++ | ~37 | N/A |
| AssertFlip | 43.6 | 49.1 |
| Amazon Q | 49.0 | N/A |
5. Methodological Principles and Benchmarking Protocol
To ensure rigor and reproducibility:
- All instances are validated to guarantee that the golden test fails on the buggy code and passes with the patch, eliminating spurious or trivial examples.
- For the Verified subset, manual review by professional developers ensures (a) clarity and unambiguity of the issue report, (b) minimality and correctness of the patch, and (c) that the ground-truth test faithfully exposes the underlying bug and passes once it is fixed (Khatib et al., 23 Jul 2025).
- Benchmarking is performed in a controlled Python 3.11 Docker environment, with dependencies restricted to those minimally required by the specific function for each instance (Xie et al., 18 Feb 2026).
- Only held-out (never-before-seen) instances are used; there is no supervised in-domain training.
SWT-Bench supports two primary usage scenarios: benchmarking pure test-generation systems and hybrid code-repair-and-test agents, enabling dual perspectives on LLM-mediated program robustness and correctness.
6. Impact, Observed Limitations, and Pathways for Future Work
SWT-Bench has accelerated the field’s understanding of LLMs’ test-generation capabilities by establishing baseline rates, exposing method-specific strengths, and enabling more nuanced evaluation. Noteworthy findings include:
- Code agents originally designed for repair (e.g., SWE-Agent) can outperform domain-specific test-generation methods on both fail-to-pass (R) and coverage (ΔC) metrics (Mündler et al., 2024).
- Custom output formats (function-level diffs) and explicit agent prompts materially increase applicability and success rates.
- Generated BRTs are highly effective as correctness oracles: incorporating a self-generated reproducing test increases code-repair precision from ~20% to ~45%, though at some recall cost (Mündler et al., 2024).
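Using a generated test as a filter over candidate patches trades recall for precision, as the sketch below illustrates. The `Candidate` class and the sample flags are hypothetical and chosen only to show the arithmetic; they do not reproduce the reported figures.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    # Hypothetical candidate patch.
    correct: bool   # patch actually resolves the issue (ground truth)
    accepted: bool  # self-generated test fails before and passes after the patch


def precision(patches: list[Candidate]) -> float:
    """Share of the given patches that are correct."""
    return sum(p.correct for p in patches) / len(patches)


def filtered_precision_recall(patches: list[Candidate]) -> tuple[float, float]:
    """Precision among accepted patches, and recall of correct patches."""
    accepted = [p for p in patches if p.accepted]
    kept_correct = sum(p.correct for p in accepted)
    total_correct = sum(p.correct for p in patches)
    return kept_correct / len(accepted), kept_correct / total_correct


patches = [
    Candidate(correct=True, accepted=True),
    Candidate(correct=False, accepted=False),
    Candidate(correct=False, accepted=True),
    Candidate(correct=False, accepted=False),
    Candidate(correct=True, accepted=False),
]
print(precision(patches))                  # 0.4
print(filtered_precision_recall(patches))  # (0.5, 0.5)
```

In this toy data the filter raises precision from 0.4 to 0.5 but discards one of the two correct patches, mirroring the precision gain and recall cost described above.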
Limitations detected include:
- Restriction to Python and single-file, small-patch bug fixes.
- Exclusion of configuration, performance, or multi-step bug classes.
- Potential for overfitting by closed LLMs, which may have seen some benchmark data during pretraining (Khatib et al., 23 Jul 2025).
- Current evaluation focuses on line coverage; more granular metrics (e.g., branch coverage, mutation testing) remain open for exploration.
Future directions advocated in the literature are:
- Extending the methodology to other programming languages and ecosystems (e.g., Java, JavaScript).
- Developing test-generation-specialized agent tools (e.g., coverage-guided sampling, edge-case search).
- Integrating more nuanced real-world developer feedback and branching out to more complex bug scenarios.
7. Role in the Research Ecosystem and Leaderboard-Driven Advancements
SWT-Bench’s rigor, public availability, and leaderboard structure have made it a central reference in the empirical evaluation of test-generation systems for Python code. Its two-tiered division (Lite, Verified) supports rapid iteration as well as stringent, apples-to-apples comparisons. Since its introduction, performance has increased from around 3.6% (via direct prompting) to nearly 50% thanks to specialized prompting, agentic reasoning, and synthetic skill pretraining (Mündler et al., 2024, Khatib et al., 23 Jul 2025, Xie et al., 18 Feb 2026).
A notable ecosystem effect is the diversity of solution strategies: top-performing techniques exhibit substantial complementarity, with ensemble coverage of over 70% on the Verified subset. This suggests a strong case for further method diversification and for future benchmarks to similarly incentivize ensemble robustness.
SWT-Bench’s influence is evident in the design and evaluation of new agent architectures such as AssertFlip and Hybrid-Gym, which have been directly validated against its rigorous standards and have reported quantifiable gains (Khatib et al., 23 Jul 2025, Xie et al., 18 Feb 2026). The availability of a detailed protocol, standardized Docker environments, and exhaustive test-harness validation helps make research findings robust and reproducible across teams.