SWE-bench Verified: LLM Software Engineering Benchmark
- SWE-bench Verified is a benchmark that evaluates automated repair agents on 500 human-curated GitHub issue-fix pairs in Dockerized, reproducible environments.
- It employs rigorous evaluation protocols, including manual curation, strict pass criteria, and differential testing methods, to ensure high reliability in assessing code fixes.
- Despite strong median precision metrics, challenges such as solution leakage, weak test suites, and data contamination drive ongoing improvements in benchmark design.
SWE-bench Verified is a widely adopted benchmark designed to rigorously evaluate the ability of LLMs and automated agents to resolve real-world software engineering tasks, specifically by fixing genuine GitHub issues within Python repositories. It is a human-filtered subset of the original SWE-bench, aiming for higher reliability by incorporating manual curation and more robust evaluation protocols. Its centrality in the evaluation ecosystem for automated program repair (APR) and LLM-based code reasoning has made it both a standard and a subject of extensive critical analysis.
1. Benchmark Design and Evaluation Protocols
SWE-bench Verified comprises 500 curated issue-fix pairs, each representing a real GitHub issue and its corresponding resolved pull request. Each task includes:
- An executable repository snapshot at the time of the issue,
- The issue description and associated metadata,
- The developer’s patch (gold solution),
- Modified tests (test patch) associated with the original fix,
- A Dockerized environment to ensure reproducible builds, dependency management, and automated test execution (Zeng et al., 24 Jun 2025).
The canonical evaluation metric is the resolution rate (sometimes reported as precision): the fraction of instances for which a candidate patch applies cleanly, builds, and passes all tests associated with the fix, including those added or modified by the developer's test patch. The benchmark leaderboard is used to evaluate both LLM-based repair models and agentic code reasoning frameworks (Martinez et al., 20 Jun 2025).
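The pass criterion can be made concrete with a short sketch. This is not the official swebench harness: the helpers below shell out to `git apply` and `pytest` directly, whereas the real evaluation runs inside per-instance Docker images and also applies the original fix's test patch so that the FAIL_TO_PASS tests exist (omitted here for brevity).

```python
# Minimal sketch of the resolution criterion, not the official swebench harness.
# Assumes a local checkout and pytest-style test IDs.
import subprocess
from dataclasses import dataclass

@dataclass
class Instance:
    repo_dir: str            # repository checked out at the issue's base commit
    fail_to_pass: list[str]  # tests that must flip from failing to passing
    pass_to_pass: list[str]  # tests that must keep passing (no regressions)

def run_test(repo_dir: str, test_id: str) -> bool:
    """Run a single pytest node ID and report whether it passed."""
    proc = subprocess.run(["python", "-m", "pytest", test_id],
                          cwd=repo_dir, capture_output=True)
    return proc.returncode == 0

def is_resolved(inst: Instance, candidate_patch: str) -> bool:
    """A patch resolves an instance iff it applies cleanly and every
    FAIL_TO_PASS and PASS_TO_PASS test passes afterwards."""
    applied = subprocess.run(["git", "apply", "-"], cwd=inst.repo_dir,
                             input=candidate_patch.encode(), capture_output=True)
    if applied.returncode != 0:      # patch does not apply -> not resolved
        return False
    return all(run_test(inst.repo_dir, t)
               for t in inst.fail_to_pass + inst.pass_to_pass)

def resolution_rate(instances: list[Instance], patches: list[str]) -> float:
    """The leaderboard metric: fraction of instances resolved."""
    return sum(map(is_resolved, instances, patches)) / len(instances)
```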
2. Data Quality Control, Manual Curation, and Reliability
To mitigate ambiguities and improve reliability over the original SWE-bench, SWE-bench Verified employs a multi-stage curation protocol:
- Candidate issues are selected from popular open-source Python repositories.
- Instances are manually screened and validated for clarity of issue statement, test coverage, and reproducibility.
- Only issues with unambiguous problem statements and non-trivial test transitions (i.e., there exists at least one FAIL_TO_PASS test upon applying the developer’s fix) are retained (Pan et al., 30 Dec 2024).
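The non-trivial test-transition check in the last item can be sketched in the same style, reusing the `run_test` helper from the evaluation sketch above. This is an illustrative filter, not the curation tooling used by the benchmark authors; here `gold_patch` stands for the combined code and test patch from the merged PR.

```python
# Illustrative FAIL_TO_PASS retention check; assumes a fresh checkout it may modify.
import subprocess

def has_fail_to_pass(repo_dir: str, candidate_tests: list[str],
                     gold_patch: str) -> bool:
    """Retain an instance only if at least one test fails before the
    developer's fix and passes after it (a FAIL_TO_PASS transition)."""
    before = {t: run_test(repo_dir, t) for t in candidate_tests}
    subprocess.run(["git", "apply", "-"], cwd=repo_dir,
                   input=gold_patch.encode(), check=True)
    after = {t: run_test(repo_dir, t) for t in candidate_tests}
    return any(not before[t] and after[t] for t in candidate_tests)
```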
Despite these quality control measures, subsequent studies have shown non-trivial rates of solution leakage (wherein solution code is present in the issue description or comments) and weak test suites (tests are insufficient to guarantee correctness) in SWE-bench Verified:
- Manual inspection reveals 33.04% of instances contain direct solution leaks (Aleithan et al., 9 Oct 2024).
- Approximately 31.08% of "passed" patches may be due to weak test coverage, failing to enforce semantic or behavioral correctness (Aleithan et al., 9 Oct 2024).
- When filtered for solution leakage and weak test cases, the reported resolution rate for state-of-the-art agents (e.g., SWE-Agent + GPT-4) falls sharply from 12.47% to 3.97%, and even lower on subsets that exclude all leaked instances (Aleithan et al., 9 Oct 2024); a crude leak-detection heuristic is sketched after this list.
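The leakage statistics above come from manual inspection. Purely as an illustration, an automated check might compare the lines a gold patch adds against the issue text, as in the sketch below; the length threshold and verbatim matching are arbitrary assumptions, not the cited study's methodology.

```python
# Crude leakage heuristic: flag instances whose issue text already contains
# non-trivial lines that the gold patch adds.

def leaked_lines(issue_text: str, gold_patch: str, min_len: int = 20) -> list[str]:
    """Return added patch lines that also appear verbatim in the issue."""
    added = [line[1:].strip() for line in gold_patch.splitlines()
             if line.startswith("+") and not line.startswith("+++")]
    return [line for line in added if len(line) >= min_len and line in issue_text]

def looks_leaked(issue_text: str, gold_patch: str) -> bool:
    return bool(leaked_lines(issue_text, gold_patch))
```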
3. Limitations: Overfitting, Data Contamination, and Memorization
Several recent analyses have raised concerns regarding the benchmark’s continued validity as a measure of generalizable reasoning:
- Over 94% of task instances predate the training cutoff dates of popular LLMs, creating substantial risk of data contamination: models may have seen identical or highly similar issues during pretraining, artificially inflating results (Aleithan et al., 9 Oct 2024, Liang et al., 14 Jun 2025); a simple cutoff-date filter is sketched after this list.
- Models can identify the buggy file path from the issue text alone with up to 76% accuracy on SWE-bench Verified, versus roughly 53% on novel repositories, evidence of instance- and repository-level memorization (Liang et al., 14 Jun 2025).
- Performance is markedly higher on the static SWE-bench Verified split than on decontaminated, continuously refreshed benchmarks (e.g., SWE-rebench or SWE-bench-Live), where state-of-the-art agent resolution rates drop from over 43% to roughly 19–22% (Zhang et al., 29 May 2025, Badertdinov et al., 26 May 2025).
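One mitigation implied by these findings is to evaluate only on issues created after a model's training cutoff. A minimal sketch of such a date filter follows; the `created_at` field name and the example cutoff are assumptions for illustration, not the dataset schema.

```python
# Sketch of a "fresh issues" filter: keep only instances created after the
# model's training cutoff.
from datetime import date

def decontaminated(instances: list[dict], cutoff: date) -> list[dict]:
    """Drop instances that could plausibly appear in pretraining data."""
    return [inst for inst in instances
            if date.fromisoformat(inst["created_at"][:10]) > cutoff]

# Example: evaluate only on issues opened after a hypothetical cutoff.
# fresh = decontaminated(all_instances, date(2024, 4, 1))
```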
4. Advances in Validation Techniques
To address deficiencies in the benchmark’s validation mechanism, several technical advancements have been proposed:
- PatchDiff: an LLM-based differential patch testing method that generates test cases distinguishing a candidate patch from the oracle patch. PatchDiff reveals that 7.8% of plausible (test-passing) patches on SWE-bench Verified fail the full developer-written test suite and that 29.6% induce behavioral divergences, 28.6% of which were confirmed incorrect by manual inspection; this implies an estimated inflation of ~6.4 percentage points in reported resolution rates due to specious correctness (Wang et al., 19 Mar 2025). A generic differential-testing sketch follows this list.
- UTBoost: an LLM-powered test augmentation and intramorphic testing pipeline whose generated test cases uncover numerous patches incorrectly labeled as correct, changing leaderboard rankings for 24.4% of SWE-bench Verified submissions (Yu et al., 10 Jun 2025).
- Expanded testing protocols: Recommendations include executing all developer-provided tests (not only those modified by the original fix), employing differential testing, and augmenting test suites to reduce false positives (Wang et al., 19 Mar 2025, Yu et al., 10 Jun 2025).
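The sketch below illustrates differential patch testing generically, in the spirit of PatchDiff but not the tool itself. It assumes the distinguishing probe tests are already available (PatchDiff generates them with an LLM, which is omitted here) and reuses the `run_test` helper from the evaluation sketch above.

```python
# Run probe tests under both the candidate and the oracle patch and flag
# divergent outcomes.
import shutil
import subprocess
import tempfile

def apply_in_copy(repo_dir: str, patch: str) -> str:
    """Apply a patch to a throwaway copy of the repository."""
    workdir = tempfile.mkdtemp()
    shutil.copytree(repo_dir, workdir, dirs_exist_ok=True)
    subprocess.run(["git", "apply", "-"], cwd=workdir,
                   input=patch.encode(), check=True)
    return workdir

def behavioral_divergence(repo_dir: str, candidate_patch: str,
                          oracle_patch: str, probe_tests: list[str]) -> list[str]:
    """Probe tests whose outcome differs between candidate and oracle,
    evidence that a 'plausible' patch is not behaviorally equivalent."""
    cand_dir = apply_in_copy(repo_dir, candidate_patch)
    gold_dir = apply_in_copy(repo_dir, oracle_patch)
    return [t for t in probe_tests
            if run_test(cand_dir, t) != run_test(gold_dir, t)]
```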
5. Leaderboard Analysis and Architectural Diversity
SWE-bench Verified serves as the central leaderboard for automated bug repair research, supporting a wide array of system architectures:
- The median precision on Verified is consistently higher than on less-curated splits, reaching 51–55%, with top-performing entries up to ~68.2% (Martinez et al., 20 Jun 2025).
- Systems span fixed human-authored workflows, emergent multi-agent frameworks, and hybrid approaches. Notably, high-performing designs employ scaffolded, multi-phase agent architectures with explicit planning, patch localization, and verification stages (a minimal pipeline sketch follows the table below).
- The leaderboard is dominated by industry submissions leveraging proprietary LLMs (Claude, GPT-4 variants), but open-source agent-based systems (OpenHands, SWE-Agent-LM-32B, Skywork-SWE-32B, etc.) continue to close the performance gap using advanced training protocols, inference-time scaling, and RL techniques (Pan et al., 30 Dec 2024, Zeng et al., 24 Jun 2025, Yang et al., 30 Apr 2025).
| System Category | Median Precision (%) | Example Maximum (%) |
|---|---|---|
| Human-Workflow, Scaffolded, Single Agent | ~55 | 65.4 |
| Emergent Workflow, Multi-Agent | Varies | 68.2 |
| Fixed Workflow (Baselines, Non-Agentic) | Lower | -- |
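To make the scaffolded, multi-phase pattern concrete, the sketch below shows a plan/localize/patch/verify loop. The `llm` callable, the prompts, and the `verify` helper (apply the diff and run tests, as in the earlier sketches) are placeholders for illustration, not any specific leaderboard system's implementation.

```python
# Minimal sketch of a scaffolded, multi-phase repair loop
# (plan -> localize -> patch -> verify).
from typing import Callable, Optional

def scaffolded_repair(issue: str, repo_dir: str,
                      llm: Callable[[str], str],
                      verify: Callable[[str, str], tuple[bool, str]],
                      max_attempts: int = 3) -> Optional[str]:
    """Return a patch that passes verification, or None after max_attempts."""
    plan = llm(f"Summarize the bug and outline a fix plan:\n{issue}")    # planning phase
    files = llm(f"Given this plan, list the files to modify:\n{plan}")   # localization phase
    feedback = ""
    for _ in range(max_attempts):
        patch = llm(                                                     # patch-generation phase
            f"Issue:\n{issue}\nPlan:\n{plan}\nTarget files:\n{files}\n"
            f"Previous feedback:\n{feedback}\nProduce a unified diff."
        )
        ok, feedback = verify(repo_dir, patch)                           # verification phase
        if ok:
            return patch
    return None
```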
6. Benchmark Evolution, Continual Learning, and Future Directions
SWE-bench Verified is now treated as a foundational, but not exhaustive, testbed:
- Its methodology inspired subsequent multilingual, dynamic, and agentic benchmarks such as Multi-SWE-bench (Zan et al., 3 Apr 2025), SWE-bench-Live (Zhang et al., 29 May 2025), and SWE-MERA (Adamenko et al., 15 Jul 2025), as well as continual-learning adaptations such as SWE-Bench-CL, which organizes issues into temporal and curricular sequences to reflect codebase evolution and promote robust, transfer-oriented agent evaluation (Joshi et al., 13 Jun 2025).
- Community recommendations urge routine reevaluation with “fresh” issues (postdating LLM pretraining), inclusion of repositories beyond the original set, and mutation or anonymization of tasks to combat memorization.
- The integration of synthetic, agent-generated, and RL-derived datasets (SWE-smith (Yang et al., 30 Apr 2025), R2E-Gym (Jain et al., 9 Apr 2025), Skywork-SWE (Zeng et al., 24 Jun 2025), RepoForge (Chen et al., 3 Aug 2025)) further diversifies training and evaluation modalities, enabling systematic, contamination-resistant SWE agent assessment at scale.
- Open questions remain regarding leaderboard interpretability, reproducibility (due to closed-weight model dominance), and the creation of robust, dynamic, and multilingual SWE agent benchmarks.
7. Summary and Significance
SWE-bench Verified has become the canonical benchmark for the evaluation of automated software engineering agents, driving significant advances in LLM-based repair and agentic code reasoning. However, issues of solution leakage, weak test validation, data contamination, and pervasive memorization necessitate a transition toward more rigorous, dynamic, and contamination-resistant benchmarks. Recent technical innovations and new benchmark releases have provided methodologies to improve confidence and reproducibility in future SWE agent evaluation while maintaining continuity with the field’s prevailing standards.