
SWE-Bench Verified Benchmark

Updated 9 November 2025
  • SWE-Bench Verified is a curated benchmark that measures LLMs' ability to localize, repair, and validate bugs in real-world open-source codebases.
  • It employs a rigorous curation protocol with hand-filtered issues and robust tests, using metrics like pass@1 to assess performance.
  • The benchmark addresses challenges like solution leakage and overfitting by integrating comprehensive validation and enhanced test suites.

SWE-Bench Verified is a rigorously curated benchmark for evaluating LLMs and agent-based software engineering systems on real-world issue resolution in open-source codebases. Developed to address limitations in earlier code-generation and competitive-programming benchmarks, SWE-Bench Verified measures an agent’s ability to localize, repair, and validate bug fixes at repository scale under realistic conditions. The design and history of this benchmark, its underlying metrics, empirical outcomes, and ongoing methodological debates inform much of the contemporary research landscape in automated program repair and LLM software reasoning.

1. Benchmark Definition and Curation Protocol

SWE-Bench Verified is a 500-instance, hand-filtered subset of the broader SWE-Bench benchmark, itself derived from 2,294 real GitHub issues paired with pull requests across 12 major Python repositories. Each task presents a snapshot of the repository at a historical commit, a natural-language issue description, and a mandatory set of FAIL_TO_PASS and PASS_TO_PASS unit tests; the agent must produce a patch such that the official tests pass when applied to the buggy version (Aleithan et al., 9 Oct 2024, Wang et al., 19 Mar 2025, Xu et al., 12 May 2025).
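
For concreteness, a single task instance can be pictured as a record like the one below. The field names and values are an assumed, simplified illustration of the elements just described, not the benchmark's exact schema.

```python
# Assumed, simplified shape of one SWE-Bench Verified task instance
# (illustrative only; not the benchmark's exact schema or a real issue).
example_instance = {
    "repo": "astropy/astropy",                 # one of the 12 Python repositories
    "base_commit": "<historical commit SHA>",  # buggy snapshot the agent starts from
    "problem_statement": "<natural-language GitHub issue text>",
    "FAIL_TO_PASS": ["tests/test_io.py::test_reported_bug"],        # must flip to passing
    "PASS_TO_PASS": ["tests/test_io.py::test_existing_behaviour"],  # must not regress
}
```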

The original construction protocol applies a series of filters:

  • Attribute filter: Retain only "bug" or "feature" issues with both a linked PR and at least one newly added or updated unit test covering the issue.
  • Execution/validation filter: Clone the relevant repo at the specific commit, apply the proposed patch, and confirm all tests transition as expected (FAIL→PASS and no regressions in PASS→PASS).
  • Manual verification: Screen out issues with solution leakage (i.e., where the answer or code fix appears in the issue or comments), those with underspecified descriptions, or insufficient/overfit test suites. Only problems that are self-contained and where the gold patch passes a robust, non-trivial test suite are retained.

The result is a benchmark where, for each issue, an automated patch–test–validate pipeline mirrors the developer’s actual bug-fix workflow. The public leaderboard associated with SWE-Bench Verified requires submitters to evaluate on the full suite of 500 problems and mandates complete result and log submission (Martinez et al., 20 Jun 2025).
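
A minimal sketch of this patch–test–validate loop, assuming a local clone, a pytest-based runner, and hypothetical helper names (this is not the official evaluation harness), might look as follows.

```python
import subprocess

def tests_pass(repo_dir: str, test_ids: list[str]) -> bool:
    """Run the named tests with pytest and report whether they all pass."""
    proc = subprocess.run(
        ["python", "-m", "pytest", *test_ids],
        cwd=repo_dir, capture_output=True, text=True,
    )
    return proc.returncode == 0

def validate_patch(repo_dir, base_commit, patch_file, fail_to_pass, pass_to_pass):
    """Apply a candidate patch to the buggy snapshot and check both test sets."""
    # Reset the working tree to the historical (buggy) commit.
    subprocess.run(["git", "checkout", "-f", base_commit], cwd=repo_dir, check=True)

    # Apply the candidate patch produced by the agent.
    subprocess.run(["git", "apply", patch_file], cwd=repo_dir, check=True)

    # Resolved only if the previously failing tests now pass and nothing regresses.
    return tests_pass(repo_dir, fail_to_pass) and tests_pass(repo_dir, pass_to_pass)
```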

2. Evaluation Metrics and Methodology

The central metric is the "resolve rate" or "Verified" pass@1, defined as:

$$\mathrm{Verified} = \frac{\sum_{t \in V} \mathbf{1}[\text{task } t \text{ passes on first try}]}{|V|} \times 100\%$$

where $V$ is the set of verified tasks and an agent is credited for a task only if the patch passes all official tests on the first (deterministic) attempt (Xu et al., 12 May 2025, Pan et al., 30 Dec 2024). For more nuanced analyses, pass@$k$ is computed as the fraction of tasks for which at least one valid patch appears among $k$ attempts:

$$\text{pass@}k = 1 - \frac{\binom{N-c}{k}}{\binom{N}{k}}$$

where $N$ is the total number of generated samples and $c$ the number of correct ones (Jain et al., 9 Apr 2025, Chen et al., 31 Jul 2025).
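
This estimator translates directly into code; a minimal implementation of the formula above, with the standard guard for the case where fewer than $k$ incorrect samples exist, is:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: probability that at least one of k sampled patches,
    drawn from n generated samples of which c are correct, resolves the task."""
    if n - c < k:  # too few incorrect samples to fill k draws => success guaranteed
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 10 generated patches, 2 correct -> pass@1 = 0.2, pass@5 ≈ 0.78
```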

Beyond simple test-based validation, recent studies advocate stronger measures:

  • Running the full developer test suite, not just the files touched by the pull request, to detect regression errors (Wang et al., 19 Mar 2025).
  • Use of differentiating tests through LLM-assisted PatchDiff to expose semantic divergences between the LLM patch and the gold patch (Wang et al., 19 Mar 2025).
  • Intramorphic oracles and automated test suite augmentation (UTBoost) to systematically identify insufficient test coverages (Yu et al., 10 Jun 2025).

These enhancements are motivated by evidence that standard harnesses can substantially overstate model performance, either due to inadequate tests or solution leakage (Aleithan et al., 9 Oct 2024, Wang et al., 19 Mar 2025).
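
As a rough sketch of the first of these measures, the check below runs both the PR-scoped tests and the entire developer suite and flags patches that pass the former while breaking the latter; the helper names and pytest invocation are assumptions for illustration, not PatchDiff or UTBoost.

```python
import subprocess

def pytest_passes(repo_dir: str, args: list[str]) -> bool:
    """Run pytest with the given arguments and report overall success."""
    proc = subprocess.run(["python", "-m", "pytest", *args],
                          cwd=repo_dir, capture_output=True, text=True)
    return proc.returncode == 0

def classify_patch(repo_dir: str, official_test_ids: list[str]) -> str:
    """Distinguish genuinely resolved patches from plausible-but-regressing ones."""
    official_ok = pytest_passes(repo_dir, official_test_ids)  # tests touched by the PR
    full_ok = pytest_passes(repo_dir, [])                     # entire developer suite
    if official_ok and not full_ok:
        return "plausible-but-regressing"  # counted as resolved by the standard harness
    return "resolved" if official_ok else "unresolved"
```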

3. Data Quality: Leakage, Coverage, Memorization, and Overfitting

Explicit data quality analysis reveals substantial confounds even in SWE-Bench Verified. Key findings include:

  • Solution Leakage: Up to 33% of "passed" patches in SWE-Agent+GPT-4 runs occurred in instances where the patch was present in the issue description or comments (Aleithan et al., 9 Oct 2024).
  • Weak Test Suites: 12.5%–22% of successful patches were logically wrong or incomplete but unflagged due to insufficient tests.
  • Instance-Specific Memorization: LLMs show 5–10 point drops in performance on non-Verified or "fresh" issues from the same repositories, indicating possible memorization rather than generalizable reasoning. On unrelated repositories (e.g., pandas, pytorch), accuracy is even lower, pointing towards repository-specific overfitting (Liang et al., 14 Jun 2025).
  • Behavioral Divergence: PatchDiff analysis finds that 29.6% of plausible patches induce behavior different from the ground truth, with ~28.6% of divergences representing certainly incorrect code (Wang et al., 19 Mar 2025).

The combined effect of these issues can inflate reported resolution rates by as much as 6.2 points, with roughly 7.8% attributable to tests that are never executed and nearly 30% of "plausible" patches diverging in behavior (Wang et al., 19 Mar 2025, Yu et al., 10 Jun 2025). Public leaderboards have thus begun updating their evaluation protocols to incorporate these findings and reduce the inflation of model scores.
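
A crude heuristic in the spirit of the leakage finding above is to check whether any non-trivial line added by the gold patch already appears verbatim in the issue text; this is an assumed illustration, not the screening procedure used in the cited studies.

```python
def solution_leaked(issue_text: str, gold_patch: str, min_len: int = 20) -> bool:
    """Heuristic leakage check: does a non-trivial added line from the gold
    patch appear verbatim in the issue description or its comments?"""
    added_lines = [
        line[1:].strip()
        for line in gold_patch.splitlines()
        if line.startswith("+") and not line.startswith("+++")  # skip diff headers
    ]
    return any(len(line) >= min_len and line in issue_text for line in added_lines)
```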

4. Benchmark Comparisons and Leaderboard Results

SWE-Bench Verified is positioned as an intermediate to high-difficulty benchmark within its family:

Split                 # Tasks   Curation                  Median Fix Rate (%)   Max Fix Rate (%)
SWE-Bench Full        ~2,294    unfiltered                33.8                  65.4
SWE-Bench Lite        300       breadth, little vetting   31.5                  60.0
SWE-Bench Verified    500       manual, high quality      46.9                  75.2

Verified achieves higher pass rates than Lite owing to its strong solvability curation and the removal of spurious test noise. It saturates more slowly than code-generation tasks such as HumanEval (99.4%) or MBPP (94.2%) (Xu et al., 12 May 2025, Martinez et al., 20 Jun 2025), and it provides meaningful headroom for discriminating between strong models, agents, and orchestration approaches.

Recent leading results on the 500-task SWE-Bench Verified include:

System                          Model                      Pass@1 / Resolved Rate (%)
Claude 4/Gemini 2.5 ensemble    proprietary                75.2 (max, leaderboard)
Kimi-Dev                        Qwen 2.5 72B               60.4 (agentless, workflow)
FrogBoss                        Qwen3-32B + AllData mix    54.6
R2E-Gym (hybrid verifier)       Qwen2.5-32B                51.0 (Best@26)
Llama3-SWE-RL-70B               Llama3 70B                 41.0
SWE-Dev                         Qwen2.5-32B                36.6
SWE-Exp                         DeepSeek-V3-0324           41.6
RepoForge-8B                    Qwen3-8B                   17.4

Manual analysis indicates that even top models may overstate their true generalization ability due to these lingering methodological issues (Martinez et al., 20 Jun 2025, Wang et al., 19 Mar 2025, Sonwane et al., 22 Oct 2025).

5. Limitations, Threats to Validity, and Methodological Debates

Despite extensive data cleaning, SWE-Bench Verified is not immune to:

  • Memorization and data leakage: Many issues were created before LLM training cutoff dates. Models may recall public issue–fix pairs from pretraining, leading to instance- and repository-specific overfitting (Liang et al., 14 Jun 2025, Aleithan et al., 9 Oct 2024).
  • Test coverage gaps: Even with manual verification, 5.2% of tasks in Verified are inadequately covered, as detected by UTBoost (Yu et al., 10 Jun 2025).
  • Overspecification bias: Lengthy GitHub-issue–style prompts (with explicit reproducers or solution snippets) can inflate measured agent performance compared to real-world, concise developer queries (Garg et al., 10 Oct 2025).
  • Inadequate semantic validation: Pass@1 as measured by test success does not guarantee functional equivalence between model and developer patches, motivating the adoption of semantic differential testing (Wang et al., 19 Mar 2025).

These points have prompted a wave of research introducing diagnostic tasks (e.g., file-path prediction), synthetic “mutation” of queries to emulate real developer interactions, filtering of pretraining-overlap issues, augmented test oracles, and multi-dimensional evaluation frameworks (Liang et al., 14 Jun 2025, Garg et al., 10 Oct 2025, Yu et al., 10 Jun 2025, Bhatia et al., 12 Jul 2025).

6. Design Recommendations and Future Directions

Emerging best practices for robust benchmarking, as recommended across the literature, include:

  • Temporal and cross-repo controls: Evaluate only on issues created after the LLM’s last knowledge cutoff, and hold out entire codebases to inhibit instance/repository bias (Liang et al., 14 Jun 2025); a minimal filtering sketch follows this list.
  • Prompt and repo anonymization/synthetic mutation: Randomize real identifiers and prompt structures to decouple LLM recall from genuine reasoning (Garg et al., 10 Oct 2025).
  • Test suite augmentation: Employ automated tools (e.g., UTBoost, PatchDiff) to close coverage gaps and identify semantic divergences, integrating adversarial or edge-case tests (Yu et al., 10 Jun 2025, Wang et al., 19 Mar 2025).
  • Benchmark mixture and polyglot expansion: Blend Python and other ecosystems (C#, Java, TypeScript) to resist overfitting to a single language/domain (Zan et al., 26 Aug 2024, Garg et al., 10 Oct 2025).
  • Composite, semantics-based metrics: Report both classic pass@k and post-hoc “clean” rates after filtering contaminated, leaked, or insufficiently specified tasks; include semantic correctness assessments beyond unit-test oracles (Wang et al., 19 Mar 2025, Martinez et al., 20 Jun 2025).
  • Continual and real-world query evaluation: Introduce temporally ordered, developer-like query streams (SWE-Bench-CL, mutation tasks) for evaluating long-term agent learning and robustness (Joshi et al., 13 Jun 2025, Garg et al., 10 Oct 2025).
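
A minimal sketch of the temporal control in the first recommendation above is given below; the "created_at" field name, ISO date format, and cutoff date are assumptions for illustration.

```python
from datetime import date

# Assumed knowledge cutoff; substitute the cutoff of the model under evaluation.
MODEL_KNOWLEDGE_CUTOFF = date(2024, 10, 1)

def temporally_clean(instances: list[dict]) -> list[dict]:
    """Keep only issues created after the model's knowledge cutoff, so the
    gold fix cannot have appeared in pretraining data."""
    return [
        inst for inst in instances
        if date.fromisoformat(inst["created_at"][:10]) > MODEL_KNOWLEDGE_CUTOFF
    ]
```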

7. Significance and Role in Software Engineering LLM Evaluation

SWE-Bench Verified represents a pivotal shift away from “toy” code-generation tasks to the large-scale, ambiguous, context-crossing repairs emblematic of software engineering practice. It anchors competitive LLM/APR agent research by enforcing strict patch validation, realistic test execution, and human-centric problem framing. However, quantitative gains on this benchmark must be interpreted with caution, as methodological artifacts (memorization, test-suite weakness, leakage) remain non-trivial. The onus is now on the research community to drive next-generation benchmarks that are temporally resilient, contamination-robust, semantically rich, and better aligned with actual developer needs (Liang et al., 14 Jun 2025, Garg et al., 10 Oct 2025, Yu et al., 10 Jun 2025, Aleithan et al., 9 Oct 2024, Wang et al., 19 Mar 2025).
