SWE-bench Pro: Contamination-Resistant LLM Benchmark

Updated 27 June 2026

SWE-bench Pro is a contamination-resistant benchmark suite designed to evaluate long-horizon, multi-file software engineering tasks and genuine generalization in LLMs.
It aggregates 1,865 problems from diverse repositories, structured in public, held-out, and commercial sets, and uses human verification with Dockerized setups to mimic real-world coding challenges.
The suite employs rigorous evaluation protocols with metrics like Pass@1 and contamination analysis, alongside diagnostic subtasks, to benchmark multi-turn reasoning and tool coordination in coding agents.

SWE-bench Pro is a contamination-resistant benchmark suite for evaluating the long-horizon, real-world software engineering abilities of LLMs and autonomous coding agents. Conceived as a successor to the widely used SWE-Bench family, it was specifically designed to address the saturation, triviality, and memorization artifacts exposed in earlier benchmarks. SWE-bench Pro consists of 1,865 issues sourced from complex, diverse codebases and focuses on tasks representative of professional software engineering, including multi-file bug fixes and reasoning under realistic environment and dependency constraints. It serves as a testbed for measuring genuine generalization, cross-repository transfer, and practical autonomy in automated software development systems (Deng et al., 21 Sep 2025, Liang et al., 14 Jun 2025, Vergopoulos et al., 10 Mar 2025).

1. Motivation, Scope, and Distinctions

SWE-bench Pro was motivated by the limitations in previous benchmarks—namely, data contamination, overrepresentation of popular repositories in pretraining corpora, and predominance of short-horizon or trivial tasks. For example, SWE-Bench Verified includes only 500 tasks from 12 repositories and has been shown to admit substantial memorization bias, with top LLMs achieving >70% Pass@1 accuracy, a level not representative of genuine real-world difficulty (Liang et al., 14 Jun 2025). Empirical studies using minimal context subtasks (file path identification and function reproduction) revealed up to 76% accuracy on in-benchmark tasks but a collapse to ≈53% accuracy for out-of-benchmark repositories, indicating significant contamination and instance memorization.

SWE-bench Pro explicitly targets enterprise-scale realism and contamination resistance through the following criteria:

Use of GPL-licensed and proprietary datasets to avoid inclusion in open-crawled corpora.
Inclusion of long-horizon tasks spanning multiple files (average 4.1 files, 107.4 LOC).
Human verification and Dockerized execution to ensure task resolvability and environmental fidelity.
Heavy emphasis on business, B2B, and developer tool domains with diverse programming languages (Python, Go, JavaScript/TypeScript) (Deng et al., 21 Sep 2025).

2. Dataset Composition, Task Structure, and Collection Protocol

SWE-bench Pro aggregates 1,865 problems from 41 actively maintained repositories, structured into three partitions:

Public set: 731 tasks from 11 GPL-licensed repositories.
Held-out set: 858 tasks from 12 repositories, reserved for contamination diagnostics and overfitting controls.
Commercial set: 276 tasks from 18 proprietary startup repositories (results released, but code not public).
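The partition sizes can be checked against the stated totals with a few lines of arithmetic. The sketch below uses only the figures quoted above; the dictionary layout and the `share` helper are illustrative and not part of any released tooling.

```python
# Partition sizes as stated in the text: (tasks, repositories).
PARTITIONS = {
    "public": (731, 11),
    "held_out": (858, 12),
    "commercial": (276, 18),
}

total_tasks = sum(tasks for tasks, _ in PARTITIONS.values())
total_repos = sum(repos for _, repos in PARTITIONS.values())


def share(name: str) -> float:
    """Fraction of all tasks that fall in the named partition."""
    return PARTITIONS[name][0] / total_tasks


if __name__ == "__main__":
    # The sums should reproduce the headline figures (1,865 tasks, 41 repositories).
    print(f"tasks={total_tasks} repositories={total_repos}")
    for name in PARTITIONS:
        print(f"{name}: {share(name):.1%} of tasks")
```

The three partitions sum to 1,865 tasks and 41 repositories, matching the totals given for the suite.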

Each problem is reconstructed from historical GitHub issues/PRs, preserving realistic issue texts but rewritten to remove implementation leakage. Task record construction includes:

Bullet-list requirements authored by human annotators to ground expected test behaviors.
Explicit function/class signatures to prevent naming mismatches in test harnesses.
Docker images specifying all runtime dependencies, environment setup, and test commands, ensuring reproducibility regardless of ecosystem changes.

Each task involves multi-turn reasoning, code search, and patching, and is anchored with a fail2pass and pass2pass test suite designed to distinguish successful regression fixes from undesired side effects (Deng et al., 21 Sep 2025, Vergopoulos et al., 10 Mar 2025).
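The role of the two test suites can be illustrated with a minimal sketch. It assumes a hypothetical `TaskRecord` layout and a plain mapping from test name to pass/fail result; neither reflects the benchmark's actual schema. A task counts as resolved only when every fail2pass test now passes and no pass2pass test has regressed.

```python
from dataclasses import dataclass, field


@dataclass
class TaskRecord:
    """Hypothetical task record; field names are illustrative only."""
    task_id: str
    fail_to_pass: list[str]  # tests that fail before the fix and must pass after it
    pass_to_pass: list[str]  # tests that pass before the fix and must keep passing
    requirements: list[str] = field(default_factory=list)


def is_resolved(task: TaskRecord, results: dict[str, bool]) -> bool:
    """Return True only if all required tests pass after the patch.

    A test that is missing from `results` is treated as a failure.
    """
    required = task.fail_to_pass + task.pass_to_pass
    return all(results.get(name, False) for name in required)


if __name__ == "__main__":
    task = TaskRecord("demo-1", ["test_fix"], ["test_existing"])
    # The fix works and nothing regressed: resolved.
    print(is_resolved(task, {"test_fix": True, "test_existing": True}))
    # The fix works but an existing test broke: not resolved.
    print(is_resolved(task, {"test_fix": True, "test_existing": False}))
```

Treating missing results as failures is a conservative choice made for this sketch: a run that crashes before reporting a test cannot be counted as a success.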

3. Evaluation Protocols, Metrics, and Contamination Controls

SWE-bench Pro employs a unified harness (SWE-Agent) that constrains agent actions to a fixed interface (bash, file viewing, patch submission), ensuring model-agnostic and reproducible trajectories. Key evaluation protocols are:

Primary metric: Pass@1 (resolve rate)—fraction of tasks for which a single agent output transitions the test suite from fail to pass across all required tests.

$\text{Pass@1} = \frac{N_{\text{resolved}}}{N_{\text{total}}}$

For $k$-sample evaluations, Pass@$k$ is defined as $1 - \frac{\binom{n-c}{k}}{\binom{n}{k}}$, where $n$ is the number of samples generated per task and $c$ is the number of correct samples.

Diagnostic subtasks: As established in (Liang et al., 14 Jun 2025), tasks include path identification and ground-truth function reproduction, with accuracy and longest common subsequence (LCS) similarity formally specified for measurement of memorization.
- Path accuracy: $\mathrm{Acc}_{\mathrm{path}} = \frac{1}{N} \sum_{i=1}^N \mathbf{1}[\hat p_i \in p_i^*]$
- Function reproduction: LCS similarity and exact match.

Contamination analysis: The contamination rate $C$ and memorization bias $\Delta$ are formally quantified:

$C = \frac{|\mathcal{T} \cap \mathcal{E}|}{|\mathcal{E}|} \qquad \Delta = \mathrm{Acc}(\mathcal{E}\,|\,\mathcal{T}\cap\mathcal{E}\neq\emptyset) - \mathrm{Acc}(\mathcal{E}'\,|\,\mathcal{T}\cap\mathcal{E}'=\emptyset)$

Additional diagnostics: Repository-anonymization, query mutation, and McNemar’s significance test to probe robustness and overfitting.
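The Pass@1 and Pass@$k$ definitions above translate directly into code. The following is a minimal sketch using only the Python standard library; the function names are illustrative and do not come from the benchmark's harness.

```python
from math import comb


def resolve_rate(resolved: list[bool]) -> float:
    """Pass@1 as a resolve rate: resolved tasks divided by total tasks."""
    if not resolved:
        raise ValueError("at least one task is required")
    return sum(resolved) / len(resolved)


def pass_at_k(n: int, c: int, k: int) -> float:
    """Pass@k = 1 - C(n - c, k) / C(n, k) for one task.

    n: number of samples drawn, c: number of correct samples, k: budget.
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # math.comb returns 0 when k > n - c, so the estimate is exactly 1.0
    # whenever every size-k subset must contain a correct sample.
    return 1.0 - comb(n - c, k) / comb(n, k)


if __name__ == "__main__":
    print(resolve_rate([True, False, False, True]))
    print(round(pass_at_k(n=10, c=3, k=1), 4))
    print(round(pass_at_k(n=10, c=3, k=5), 4))
```

With $k = 1$ the formula reduces to $c/n$, the per-task success fraction, which is why Pass@1 and resolve rate coincide when a single output is submitted per task.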

Release datasets include versioned snapshots, contamination scripts, Dockerized evaluation harness code, and exhaustive logs for baseline model configurations (Liang et al., 14 Jun 2025, Zhong et al., 10 May 2026).
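The diagnostic and contamination measures defined in this section can be sketched as follows. Two points are assumptions rather than details from the cited work: LCS similarity is normalized here by the length of the reference function, and instances are compared by identifier when computing the contamination rate. All function names are illustrative.

```python
def path_accuracy(predicted: list[str], gold: list[set[str]]) -> float:
    """Fraction of instances whose predicted path is among the gold paths."""
    if not gold or len(predicted) != len(gold):
        raise ValueError("predicted and gold must be non-empty and aligned")
    return sum(p in g for p, g in zip(predicted, gold)) / len(gold)


def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence (row-by-row dynamic programming)."""
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def lcs_similarity(candidate: str, reference: str) -> float:
    """LCS length divided by reference length (normalization is an assumption)."""
    if not reference:
        return 1.0 if not candidate else 0.0
    return lcs_length(candidate, reference) / len(reference)


def contamination_rate(train_ids: set[str], eval_ids: set[str]) -> float:
    """C = |T intersect E| / |E|, comparing instances by identifier."""
    if not eval_ids:
        raise ValueError("evaluation set must be non-empty")
    return len(train_ids & eval_ids) / len(eval_ids)


def memorization_bias(acc_in_benchmark: float, acc_out_of_benchmark: float) -> float:
    """Delta: accuracy on in-benchmark data minus accuracy on unseen data."""
    return acc_in_benchmark - acc_out_of_benchmark


if __name__ == "__main__":
    print(path_accuracy(["a.py", "b.py"], [{"a.py"}, {"c.py"}]))
    print(lcs_similarity("def f(x): return x", "def f(x): return x + 1"))
    # Accuracies quoted earlier in the article: up to 76% in-benchmark, about 53% outside.
    print(round(memorization_bias(0.76, 0.53), 2))
```

Applied to the accuracies quoted in Section 1, the gap between in-benchmark and out-of-benchmark repositories is about 23 percentage points.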

4. Empirical Results and Failure Analysis

Convergent evaluation across multiple studies (Deng et al., 21 Sep 2025, Xia et al., 17 Nov 2025, Ahmad et al., 14 Jun 2026) consistently shows that SWE-bench Pro is substantially more challenging than prior benchmarks. For the public set (731 issues), the following Pass@1 (resolve rate) figures are reported:

Model/Agent	Pass@1 (%)
GPT-5	23.3
Claude Opus 4.1	22.7
Claude Sonnet 4	17.6
Gemini 2.5 Pro	13.5
SWE-Smith-32B	6.8
Qwen3-32B	3.4
Live-SWE-agent	45.8 (on Pro v2)
Open-SWE-Agent	36.8

Live-SWE-agent, featuring runtime scaffold evolution with self-generated tools, outperforms all previously published open-source baselines (45.8%), narrowing the gap to proprietary agents (Xia et al., 17 Nov 2025). Similarly, Open-SWE-Traces' distilled Qwen3-30B model achieves 36.8%, with no-think mode outperforming think mode (Ahmad et al., 14 Jun 2026).

Failure analysis indicates the primary error clusters are wrong_solution (logical errors, patch applied in the wrong place), syntax_error, tool_error, context overflow, and endless file listing. The most capable models remain heavily bottlenecked by multi-file contextual reasoning, fragile tool use, and context window management (Deng et al., 21 Sep 2025).

5. Technical Implementation and Benchmarking Suite Architecture

The benchmarking suite is released as an "executable benchmarking suite" (Zhong et al., 10 May 2026) comprising:

Workload adapters: Support loading of SWE-bench Pro tasks as Dockerized environments with precise snapshotting.
Task manifests: Each task is uniquely identified by (family, task_id, env snapshot, etc.), ensuring cross-run traceability.
Runner and event schema: All agent actions, tool invocations, and test outcomes are logged as type-tagged events with per-step timing and provenance.
Evidence-admission gate: Only runs with complete manifest, driver declaration, outcome, schema metadata, and provenance bindings are "paper-facing." Non-admitted runs are quarantined for audit/debug.

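The formal evidence-admission contract is a conjunction of seven predicates over a run $r$, written here with generic predicate symbols: $\mathrm{Admit}(r) = P_1(r) \land P_2(r) \land \dots \land P_7(r)$, where the predicates represent manifest resolved, driver declared, trace completeness, terminal outcome, schema version, replay/freeze metadata, and exclusion of fixture-only runs, respectively.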
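The admission gate described above can be sketched as a conjunction of boolean checks, one per predicate named in the text. The `RunEvidence` class, its field names, and both helper functions are hypothetical and chosen for illustration; they are not taken from the released suite.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RunEvidence:
    """Hypothetical per-run flags, one for each predicate listed in the text."""
    manifest_resolved: bool
    driver_declared: bool
    trace_complete: bool
    terminal_outcome: bool
    schema_versioned: bool
    freeze_metadata: bool
    fixture_only: bool  # True means the run used fixtures only and must be excluded


def admission_failures(run: RunEvidence) -> list[str]:
    """Names of the predicates the run fails; an empty list means admitted."""
    checks = {
        "manifest_resolved": run.manifest_resolved,
        "driver_declared": run.driver_declared,
        "trace_complete": run.trace_complete,
        "terminal_outcome": run.terminal_outcome,
        "schema_versioned": run.schema_versioned,
        "freeze_metadata": run.freeze_metadata,
        "not_fixture_only": not run.fixture_only,
    }
    return [name for name, ok in checks.items() if not ok]


def is_admitted(run: RunEvidence) -> bool:
    """A run is admitted only if all seven predicates hold."""
    return not admission_failures(run)


if __name__ == "__main__":
    good = RunEvidence(True, True, True, True, True, True, fixture_only=False)
    bad = RunEvidence(True, True, False, True, True, True, fixture_only=True)
    print(is_admitted(good))
    print(admission_failures(bad))
```

Returning the list of failed predicates, rather than a bare boolean, gives quarantined runs a record of why they were rejected, which suits the audit and debug use described for non-admitted runs.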

All admitted runs carry a FreezeRecord tuple encoding complete provenance, enabling fully deterministic replay. Metrics such as model latency, patch cost, token counts, and invalid-action penalties are reported per-run, supporting rigorous, reproducible results analysis (Zhong et al., 10 May 2026).

6. Interpretations, Impact, and Future Directions

By enforcing repository hold-out, temporal splits, explicit contamination scoring, and multi-faceted diagnostic subtasks, SWE-bench Pro establishes a new standard for measuring the genuine, contamination-resistant reasoning ability of LLM-based coding agents. The step change in difficulty, reflected in a drop in agent accuracy from >70% on SWE-Bench Verified to <25% on Pro for the best proprietary models (Deng et al., 21 Sep 2025, Liang et al., 14 Jun 2025), underscores the need to evaluate agents on professional-grade development tasks.

Current agent architectures reveal critical deficits in long-horizon, multi-file reasoning, semantic correctness, tool coordination, and context management. Promising future avenues include hierarchical task decomposition, memory-efficient context retrieval, integration with static analysis, collaborative multi-agent workflows, and expanded evaluation criteria encompassing maintainability, security, and code quality (Deng et al., 21 Sep 2025). The release of open, contamination-audited, and reproducible testbeds such as SWE-bench Pro is shaping the direction of LLM-for-code research by exposing hidden failure modes and quantifying authentic generalization in software engineering automation.