SWE-bench-Verified Benchmark
- SWE-bench-Verified is a benchmark that evaluates automated program repair (APR) systems using 500 human-verified Python repository issues.
- It employs rigorous test-driven validation through Pass@k metrics and a generate-test-rerank pipeline for assessing patch accuracy.
- The benchmark distinguishes itself by addressing multi-file dependencies and realistic code repair challenges from real GitHub projects.
SWE-bench-Verified is a repository-level benchmark for evaluating automated program repair (APR) systems and LLM agents on real-world software issues. Constructed as a high-quality subset of the broader SWE-bench suite, SWE-bench-Verified focuses on human-verified bug-fixing tasks within popular Python projects, emphasizing functional correctness under realistic development conditions. The benchmark now anchors state-of-the-art research trajectories in software repair, multi-turn agent reasoning, and LLM capability assessment.
1. Definition, Composition, and Formal Properties
SWE-bench-Verified consists of 500 repository-level Python issue instances, each hand-selected for reproducibility and clarity. For each instance $i$, the specification comprises:
- $R_i$: a snapshot of the repository prior to the fix,
- $T_i$: a human-vetted unit-test suite (including fail-to-pass and regression tests),
- $P_i^{*}$: a reference patch validated to resolve the issue.
A candidate patch $P$ is deemed correct if and only if all tests in $T_i$ pass after it is applied to $R_i$:

$$\text{correct}(P) \iff \forall\, t \in T_i:\ \text{pass}\big(t,\ \text{apply}(P, R_i)\big).$$
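In operational terms, the acceptance test amounts to applying the candidate diff to the repository snapshot and running the instance's designated tests. The sketch below is illustrative only: the official harness runs each instance in an isolated container, and the helper names `apply_patch` and `run_pytest` are hypothetical, not part of the SWE-bench tooling.

```python
import subprocess
from pathlib import Path

def apply_patch(repo_dir: Path, patch: str) -> bool:
    """Apply a unified diff to the repository snapshot; True on a clean apply."""
    proc = subprocess.run(
        ["git", "apply", "-"],
        cwd=repo_dir, input=patch, text=True, capture_output=True,
    )
    return proc.returncode == 0

def run_pytest(repo_dir: Path, test_ids: list[str]) -> bool:
    """Run the instance's fail-to-pass and regression tests; True iff all pass."""
    proc = subprocess.run(
        ["python", "-m", "pytest", "-q", *test_ids],
        cwd=repo_dir, capture_output=True, text=True,
    )
    return proc.returncode == 0

def is_correct(repo_dir: Path, patch: str, test_ids: list[str]) -> bool:
    """A candidate patch is accepted iff it applies and every test in T_i passes."""
    return apply_patch(repo_dir, patch) and run_pytest(repo_dir, test_ids)
```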
Issues fall into four categories: logic bugs, API misuse, configuration errors, and test failures. Approximately 40% of tasks require single-file changes, while ~60% demand multi-file, multi-component patching across diverse repository conventions (Chen et al., 31 Jul 2025). This ensures extensive coverage of cross-file semantic dependencies and nontrivial reasoning required for robust repair.
2. Evaluation Methodology and Metrics
The principal metric is Pass@$k$, denoting the probability that at least one of $k$ independent repair attempts yields a correct solution:

$$\text{Pass@}k = \mathbb{E}_{\text{issues}}\!\left[1 - \frac{\binom{n-c}{k}}{\binom{n}{k}}\right],$$

where $n$ is the number of sampled candidate patches per issue and $c$ the number of correct patches. For $k=1$ (Pass@1), this reduces to average accuracy, $\text{Pass@}1 = \mathbb{E}_{\text{issues}}[\,c/n\,]$.
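The unbiased estimator above can be computed directly; the following minimal sketch uses the standard numerically stable product form (the function name and interface are illustrative).

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased Pass@k estimator: 1 - C(n-c, k) / C(n, k),
    computed as a running product to avoid large binomial coefficients."""
    if n - c < k:
        # Fewer than k incorrect samples: every size-k draw contains a correct patch.
        return 1.0
    return float(1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1)))
```

Averaging this quantity over all 500 instances gives the benchmark-level Pass@$k$ score.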
Some experimental protocols employ a “generate–test–rerank” pipeline, sampling hundreds of solutions and extracting consensus groups that pass identical subsets of reproduction tests, typically scored by group size and coverage squared (Wei et al., 25 Feb 2025). Additional metrics may include average solve iterations and code-edit statistics, though not all are reported in published results (Chen et al., 31 Jul 2025).
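As an illustration of the reranking step, the sketch below groups candidate patches by the exact set of reproduction tests they pass and scores each consensus group by its size times squared coverage; the data layout is an assumption for illustration, and the actual pipeline of Wei et al. may differ in detail.

```python
from collections import defaultdict

def rerank_patches(patch_results: dict[str, frozenset[str]],
                   num_repro_tests: int) -> list[str]:
    """Group candidate patches by the set of reproduction tests they pass,
    score each consensus group by size * coverage**2, and return the members
    of the best-scoring group (a sketch of one generate-test-rerank step)."""
    groups: dict[frozenset[str], list[str]] = defaultdict(list)
    for patch_id, passed in patch_results.items():
        groups[passed].append(patch_id)

    def score(passed: frozenset[str], members: list[str]) -> float:
        coverage = len(passed) / max(num_repro_tests, 1)
        return len(members) * coverage ** 2

    best_tests, best_members = max(groups.items(),
                                   key=lambda kv: score(kv[0], kv[1]))
    return best_members
```

Here `patch_results` maps each candidate patch identifier to the set of reproduction-test IDs it passes; the top-ranked group's patches would then be forwarded to final validation.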
3. Distinction from Prior Benchmarks
SWE-bench-Verified achieves lower label noise and higher fidelity than earlier curated APR benchmarks such as ManyBugs or Defects4J by exclusively retaining instances with developer-confirmed, reproducible fixes (Chen et al., 31 Jul 2025). Each issue is paired with its full repository context rather than isolated functions or synthetic bugs, mandating multi-step, tool-enabled reasoning and cross-component diff construction.
Tasks are derived from actual GitHub issue-tracker data, filtered for reproducibility (all referenced tests must pass post-fix), with ambiguous or partial repairs excluded. Human verification makes the benchmark more challenging and reliable than previous automatically constructed coding datasets (Wei et al., 25 Feb 2025, Yu et al., 10 Jun 2025).
4. Repair System Performance on SWE-bench-Verified
SWE-bench-Verified defines the central public leaderboard for automated software repair. Recent open-source agent systems, all evaluated here with the DeepSeek-V3-0324 backend, report the following Pass@1 results:
| System | LLM Backend | Pass@1 (%) |
|---|---|---|
| SWE-Exp | DeepSeek-V3-0324 | 41.6 |
| SWE-Agent | DeepSeek-V3-0324 | 38.8 |
| Agentless | DeepSeek-V3-0324 | 36.6 |
| OpenHands | DeepSeek-V3-0324 | 38.8 |
| SWE-Search | DeepSeek-V3-0324 | 35.4 |
Category-wise, SWE-Exp attains 43.2% on logic bugs, 40.8% on API misuse, 39.5% on configuration, 41.0% on multi-file fixes, and 42.3% on single-file fixes (Chen et al., 31 Jul 2025). The agent outperforms both prior agentic designs (by 2.8 percentage points) and pure workflow pipelines, including closed-source baselines (AutoCodeRover, CodeAct on GPT-4o) (Chen et al., 31 Jul 2025). These results indicate that state-of-the-art performance is contingent on high-level diagnostic ability, efficient patch-strategy recall, and synergistic multi-agent collaboration.
5. Technical and Experimental Challenges
SWE-bench-Verified is characterized by cross-file dependencies, deep semantic bug sources, high repository style variance, and rigorous test-driven patch validation. Only fully passing patches are accepted, heavily penalizing shallow or symptom-level edits (Chen et al., 31 Jul 2025).
Agents must contend with:
- Implicit localization (identification of fault locations within a large codebase),
- Cross-module patching,
- Correction of intricate logic or API contracts,
- Avoidance of redundant or regressive edits.
Memoryless agents typically fail by re-exploring dead-end trajectories and misdiagnosing symptom-level issues. The SWE-Exp agent leverages a dual-agent MCTS architecture intertwined with an experience bank distilled from prior successful/failed attempts, enabling retrieval of semantically relevant repair “perspectives” and patch “modification experiences”. A reranker agent synthesizes these into strategy hints for subsequent patch design (Chen et al., 31 Jul 2025).
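The experience-reuse idea can be pictured with a minimal retrieval sketch. The `RepairExperience` record, the placeholder `embed` function, and the single-nearest-neighbor recall below are assumptions for illustration rather than the SWE-Exp implementation, which couples retrieval with its dual-agent MCTS search.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class RepairExperience:
    issue_summary: str   # condensed description of a past issue
    perspective: str     # high-level diagnosis ("repair perspective")
    modification: str    # distilled patch strategy ("modification experience")
    succeeded: bool      # whether the past attempt resolved its issue

def embed(text: str) -> np.ndarray:
    """Stand-in for a sentence-embedding model (assumption; a real system
    would call an actual embedding model or API)."""
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    return rng.standard_normal(384)

def recall_experience(bank: list[RepairExperience], issue: str) -> RepairExperience:
    """Return the single most semantically similar prior experience to the new issue."""
    query = embed(issue)
    query /= np.linalg.norm(query) + 1e-9
    best_idx, best_sim = 0, -1.0
    for idx, exp in enumerate(bank):
        vec = embed(exp.issue_summary)
        sim = float(query @ vec / (np.linalg.norm(vec) + 1e-9))
        if sim > best_sim:
            best_idx, best_sim = idx, sim
    return bank[best_idx]
```

The retrieved perspective and modification strategy would then be injected as hints into the prompt for the next patch-design attempt.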
6. Limitations, Validity Threats, and Future Directions
Several challenges remain:
- Retrieval of weakly relevant experiences can degrade performance; empirical results peak when a single prior experience is recalled per issue (Chen et al., 31 Jul 2025).
- No formal mechanism for scoring applicability of experiences exists; further work is needed on confidence estimation and retrieval supervision.
- Data leakage from benchmark instances into pretraining corpora can confound agent scores, though same-repo instance exclusion is used to minimize this risk.
- SWE-bench-Verified is restricted to Python and a narrow set of projects; generalizing to other languages and maintenance tasks is an open research direction.
A growing body of research highlights that scores on SWE-bench-Verified may, in some cases, reflect recall of benchmark tasks during model pretraining, rather than generalizable problem-solving ability. This has sparked discussion about contamination-resistant benchmarks, continual task pools, and robust evaluation protocols (Prathifkumar et al., 11 Dec 2025, Liang et al., 14 Jun 2025). Mutation-based approaches and cross-repository diagnostic tasks are being developed to mitigate overestimation and probe actual reasoning skill (Garg et al., 10 Oct 2025).
7. Impact and Benchmark Evolution
SWE-bench-Verified has become the standard for public evaluation in APR and LLM-based agent research, with over 99 leaderboard submissions spanning both open-source and proprietary systems (Martinez et al., 20 Jun 2025). Industry submissions dominate the leaderboard, achieving median resolve rates around 50% and maximums exceeding 75%. Agent-based and emergent workflow designs consistently outperform workflow-only or fixed approaches.
The benchmark has spurred advances in agent architecture (multi-agent MCTS, memory-augmented repair, dynamic tool synthesis) and evaluation methodology (Pass@k metrics, mutation frameworks, differential patch testing) (Wang et al., 19 Mar 2025, Yang et al., 27 Sep 2025, Xia et al., 17 Nov 2025). Nevertheless, as models approach or exceed 75% Pass@1, the research community is re-examining SWE-bench-Verified’s role, seeking enhanced contamination controls, broader language coverage, and more adaptive evaluation standards.
SWE-bench-Verified exemplifies rigorous, reproducible empirical evaluation for automated software repair, but its ongoing adaptation is required to ensure continued relevance and validity for state-of-the-art LLM agents and future research on scalable, generalizable program repair.