SWE-Bench Pro: Industrial AI Coding Benchmark

Updated 25 September 2025
  • SWE-Bench Pro is a contamination-resistant benchmark that evaluates AI coding agents on long-horizon, multi-file engineering tasks reflective of enterprise development.
  • It features 1,865 curated, human-verified problems drawn from diverse public, held-out, and commercial repositories to mirror realistic business challenges.
  • The benchmark employs rigorous Pass@1 evaluation and detailed error analysis to diagnose limitations in semantic reasoning, context management, and tool integration.

SWE-Bench Pro is a contamination-resistant, industrial-scale benchmark designed to evaluate AI coding agents on complex, long-horizon software engineering tasks that mirror the demands of enterprise development. Building on the criteria and lessons of earlier benchmarks such as SWE-Bench, SWE-Bench Pro introduces a suite of human-verified, challenging problems that require large, multi-file code changes and are sourced from a broad, diverse set of public, held-out, and commercial repositories. The benchmark establishes a new baseline for measuring meaningful progress toward truly autonomous software engineering agents that can operate at a professional level (Deng et al., 21 Sep 2025).

1. Benchmark Design and Scope

SWE-Bench Pro departs fundamentally from its predecessor SWE-Bench by prioritizing industrial realism and contamination resistance. It contains 1,865 curated problems drawn from 41 actively maintained repositories spanning business applications, B2B services, and developer tools, partitioned as follows:

Partition     Repositories    Problems    Description
Public        11              731         Fully open access, GPL-licensed repositories
Held-Out      12              858         Private set for overfitting/contamination checks
Commercial    18              276         Proprietary repositories (results released, code kept confidential)

The construction methodology enforces diversity by selecting repositories with strong copyleft licenses and by integrating proprietary startup codebases through formal partnerships, reducing the likelihood that the code appears in model pretraining data. This design directly addresses prior benchmark limitations, including the saturation and memorization effects prevalent in tests built on popular or overexposed repositories.
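
A single benchmark instance bundles the issue, the repository state, and the verification tests. The sketch below shows one plausible way to represent such an instance in Python; the class and field names are hypothetical and are not taken from the released dataset:

```python
from dataclasses import dataclass, field
from enum import Enum


class Partition(Enum):
    PUBLIC = "public"          # 11 repositories, 731 problems (openly released)
    HELD_OUT = "held_out"      # 12 repositories, 858 problems (kept private)
    COMMERCIAL = "commercial"  # 18 repositories, 276 problems (results only)


@dataclass
class TaskInstance:
    """Hypothetical representation of one SWE-Bench Pro problem."""
    instance_id: str
    repo: str                   # e.g. "org/project"
    partition: Partition
    problem_statement: str      # issue text, often deliberately under-specified
    base_commit: str            # repository state the agent starts from
    fail2pass_tests: list[str] = field(default_factory=list)  # must flip from failing to passing
    pass2pass_tests: list[str] = field(default_factory=list)  # must keep passing (regression guard)
```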

2. Nature and Challenge of Problems

SWE-Bench Pro exemplifies long-horizon, professional-grade software engineering challenges. Each problem typically demands hours to days of expert engineering effort—far exceeding the complexity of previously available benchmarks. Notable properties include:

  • Multi-file, Substantial Modifications: Tasks require, on average, changes to 107.4 source lines in 4.1 files, with a strict minimum of 10 modified lines and >100 tasks involving edits of over 100 lines.
  • Business Logic and Cross-Cutting Concerns: Problems span realistic feature integrations, major refactorings, and intricate bug fixes, often necessitating understanding business logic, technical debt, and layered dependencies.
  • Contextual and Under-Specified Requirements: Issue statements reflect the ambiguity and underspecification characteristic of real software development—the agent must robustly interpret and act on requirements often only partially stated.
  • Examples: Feature requests such as integrating a new external metadata API (e.g., a “Google Books” enrichment requiring code changes across interfaces, tests, configurations, and utility modules).

Such complexity demands that AI agents accurately reason over large project graphs, coordinate multi-module changes, maintain backward compatibility, and produce test-passing, maintainable code—a substantial escalation over prior benchmarks.
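
As a rough illustration of the patch statistics behind these size criteria (files touched, source lines modified), the following sketch counts them from a unified diff; the 10-line minimum mirrors the threshold stated above, while the parsing itself is simplified and is not the benchmark's actual curation tooling:

```python
def patch_stats(diff_text: str) -> tuple[int, int]:
    """Count files touched and source lines added/removed in a unified diff."""
    files = 0
    changed_lines = 0
    for line in diff_text.splitlines():
        if line.startswith("diff --git "):
            files += 1
        elif line.startswith(("+++ ", "--- ")):
            continue  # per-file headers, not content changes
        elif line.startswith(("+", "-")):
            changed_lines += 1
    return files, changed_lines


def meets_minimum_size(diff_text: str, min_lines: int = 10) -> bool:
    """SWE-Bench Pro enforces a minimum of 10 modified lines per task."""
    _, changed = patch_stats(diff_text)
    return changed >= min_lines
```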

3. Evaluation Methodology and Results

Agents are evaluated on the Pass@1 metric, the proportion of tasks resolved correctly on the first attempt; a task counts as resolved only if all of its “fail2pass” and “pass2pass” test suites pass in a containerized, language-specific execution environment. Results are reported for well-known LLMs and agentic frameworks, including OpenAI GPT-5, Claude Opus 4.1, Claude Sonnet 4, Gemini 2.5 Pro Preview, SWE-Smith-32B, GPT-4o, and Qwen-3 32B.

Model                     Pass@1 (%)    Set
GPT-5                     23.3          Public
Claude Opus 4.1           22.7          Public
Claude Sonnet 4           17.6          Public
Gemini 2.5 Pro Preview    11.1          Public
SWE-Smith-32B              9.1          Public
GPT-4o                     7.7          Public
Qwen-3 32B                 3.4          Public

Despite frontier capabilities, top-performing models remain below 25% on this benchmark. Importantly, these rates contrast sharply with much higher pass rates (often >70%) reported on simpler or more contaminated datasets such as SWE-Bench Verified, highlighting a significant gap between research benchmarks and industrial application (Deng et al., 21 Sep 2025).
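
To make the scoring rule concrete, the following minimal sketch computes a single task outcome and the aggregate Pass@1 score; the result structure is hypothetical rather than the released evaluation harness:

```python
from dataclasses import dataclass


@dataclass
class TaskResult:
    """Outcome of running the test suites after applying an agent's patch."""
    fail2pass_passed: bool  # every previously failing test now passes
    pass2pass_passed: bool  # every previously passing test still passes


def is_resolved(result: TaskResult) -> bool:
    """A task counts as resolved only when both suites pass."""
    return result.fail2pass_passed and result.pass2pass_passed


def pass_at_1(results: list[TaskResult]) -> float:
    """Pass@1: fraction of tasks resolved on the single, first attempt."""
    if not results:
        return 0.0
    return sum(is_resolved(r) for r in results) / len(results)
```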

4. Failure Mode and Error Analysis

A trajectory-level error analysis is conducted using an LLM-based judge that classifies failures into fine-grained buckets, resulting in the following key findings:

  • Wrong Solution: The most common failure mode, where generated code is plausible but does not satisfy the acceptance criteria.
  • Syntax Error: Frequent even in advanced models, typically caused by incomplete or malformed diffs or by context truncation.
  • Tool Error: Misuse of agentic interfaces, such as incorrect invocation of commands, or corrupted patch application workflows.
  • Context Management Failure: Agents often exhaust their context windows through unbounded file reading, overflowing the context or wandering into irrelevant files instead of the ones needed for the fix.
  • Multi-file Edit Failure: Agents frequently miss instructions or fail to consistently coordinate changes across all necessary files, especially when simultaneous edits are required for correctness.

These patterns underline the limitations of current models in semantic reasoning, execution environment management, and large-context navigation. Even top models such as GPT-5 primarily fail due to wrong solution and syntax errors, while others disproportionately suffer from context management collapse.
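
A trajectory-level judge of this kind can be organized around a fixed failure taxonomy. The sketch below mirrors the buckets described above, with a hypothetical prompt template; the classification step itself is delegated to an LLM and is not shown:

```python
from enum import Enum


class FailureMode(Enum):
    WRONG_SOLUTION = "wrong_solution"          # plausible patch, acceptance criteria unmet
    SYNTAX_ERROR = "syntax_error"              # incomplete or malformed diff
    TOOL_ERROR = "tool_error"                  # misused commands or patch application
    CONTEXT_MANAGEMENT = "context_management"  # context overflow, irrelevant file reading
    MULTI_FILE_EDIT = "multi_file_edit"        # inconsistent edits across required files


JUDGE_PROMPT = """You are reviewing an agent's full trajectory on a software engineering task.
Classify the primary reason the task failed, using exactly one of these labels:
{labels}

Trajectory:
{trajectory}

Answer with the label only."""


def build_judge_prompt(trajectory: str) -> str:
    """Fill the judge prompt; the classification itself is delegated to an LLM."""
    labels = ", ".join(mode.value for mode in FailureMode)
    return JUDGE_PROMPT.format(labels=labels, trajectory=trajectory)
```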

5. Benchmark Contamination Resistance and Utility

SWE-Bench Pro is architected for contamination resistance, an explicit response to flaws in previous datasets such as SWE-Bench Verified, which has been shown to suffer from memorization and direct solution leakage as a result of repository overlap with LLM pretraining data (Liang et al., 14 Jun 2025). Key mechanisms include:

  • Repository Selection: Public set sourced exclusively from strong copyleft repositories, held-out repositories kept private to enable overfitting detection, and a commercial set under strict legal agreements.
  • Test Suite Augmentation: All tasks are equipped with human-augmented “fail2pass” and “pass2pass” tests, ensuring robust verification; only solutions that pass the full augmented suites are counted as resolved.
  • Environment Isolation: Containerized, language-specific evaluation environments are provided (e.g., Python venv, Node.js/NPM, Go modules), supporting reproducible, practical test execution.
  • Overfitting Analysis: Held-out and commercial splits facilitate explicit monitoring of model generalization and reveal signals of overfitting or contamination when performance diverges sharply across splits, as sketched below.

This design provides a rigorous and fair foundation on which the field can meaningfully measure progress toward enterprise-grade autonomous engineering.
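
As an illustration of the overfitting analysis enabled by the held-out and commercial splits, the sketch below flags a suspicious divergence between public and held-out pass rates; the gap threshold is an arbitrary placeholder, not a value from the paper:

```python
def contamination_signal(public_pass_rate: float,
                         held_out_pass_rate: float,
                         max_gap: float = 0.10) -> bool:
    """Flag possible overfitting or contamination when the public split
    outperforms the held-out split by more than max_gap (placeholder value)."""
    return (public_pass_rate - held_out_pass_rate) > max_gap


# Example: a model at 23% public vs. 9% held-out would be flagged for review.
print(contamination_signal(0.23, 0.09))  # True
```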

6. Implications for Agentic Software Engineering

SWE-Bench Pro advances the evaluation and development of AI coding agents in meaningful ways:

  • Realistic Autonomy Challenge: By requiring end-to-end resolution of complex, ambiguous, and multi-file tasks, it forces agents to close the gap between synthetic benchmarks and the professional development lifecycle.
  • Failure Mode Diagnosis: Structured error analysis enables precise identification of where models struggle—be it semantic reasoning, tool integration, or large-context management—guiding the design of more capable architectures.
  • Benchmark for Robustness: As a contamination-resistant and representative testbed, SWE-Bench Pro is positioned to become the field’s de facto benchmark for measuring progress toward practical software engineering autonomy across public and private codebases.
  • Guidance for Research: The demonstrated performance ceiling (below 25% Pass@1 for all current models) exposes the limits of current techniques and highlights the need for advances in context management, high-level reasoning, agent tooling, and multi-file coordination strategies.

7. Technical Protocol and Reporting

All agent evaluations are performed under a unified scaffold with explicit specification of the environment and acceptance criteria. Test suites encompass both transition tests (fail2pass) and regression guards (pass2pass), supporting strict Pass@1 measurement. Performance breakdowns are reported in tables (e.g., Pass@1 across subsets, error buckets per model), and trajectory-level analyses are systematized using LLM-based judges for error clustering.

Environment setup for each language is containerized, with explicit attention paid to the particular configuration needs of Python (venv), Node.js/TypeScript (npm), Go (Go mod/GOPATH), and others. All modifications, execution logs, and test outcomes are kept for detailed post-hoc analysis, and the public partition is openly released for independent benchmarking.
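
The per-language, containerized execution pattern can be sketched as follows; the image names, setup commands, and helper function are illustrative assumptions rather than the released scaffold:

```python
import subprocess

# Hypothetical per-language setup and test commands; real tasks ship their own configuration.
LANGUAGE_COMMANDS = {
    "python": "python -m venv .venv && .venv/bin/pip install -e . && .venv/bin/pytest",
    "node": "npm ci && npm test",
    "go": "go mod download && go test ./...",
}


def run_in_container(image: str, workdir: str, language: str) -> int:
    """Run a task's test suite inside an isolated container; return the exit code."""
    cmd = [
        "docker", "run", "--rm",
        "-v", f"{workdir}:/workspace",   # mount the task's working copy
        "-w", "/workspace",              # execute inside the mounted repo
        image,
        "bash", "-lc", LANGUAGE_COMMANDS[language],
    ]
    return subprocess.run(cmd).returncode


# e.g. run_in_container("swebench-pro-python:latest", "/tmp/task_0042", "python")
```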


SWE-Bench Pro—by moving beyond the limitations of trivial edits, repository saturation, and data contamination—is establishing itself as the new benchmark for rigorously charting the evolution of AI-driven, autonomous engineering agents able to tackle the long-horizon, multi-file, and semantically challenging software problems found in real enterprise environments (Deng et al., 21 Sep 2025).
