SWE-Bench-Pro Benchmark Suite

Updated 14 December 2025
  • SWE-Bench-Pro is a benchmark suite that evaluates autonomous software engineering agents on realistic, long-horizon coding tasks with nontrivial code modifications.
  • It employs rigorous methodologies including human verification, automated setup using SETUPAGENT, and semantic differential testing with PatchDiff to ensure high fidelity.
  • Agents are assessed by Pass@k metrics in diverse codebases, emphasizing real-world performance, contamination resistance, and scalable evaluation protocols.

SWE-Bench-Pro is a benchmark suite for evaluating autonomous software engineering agents on realistic, long-horizon coding tasks that require nontrivial code modifications across diverse codebases. It builds on the SWE-Bench and SWE-Bench Verified frameworks, adding rigorous contamination resistance, comprehensive test coverage, and a suite of methodological improvements designed to drive progress toward truly autonomous, professional-level software agents (Deng et al., 21 Sep 2025, Wang et al., 19 Mar 2025, Xia et al., 17 Nov 2025, Vergopoulos et al., 10 Mar 2025).

1. Benchmark Construction and Dataset Structure

SWE-Bench-Pro comprises 1,865 human-verified tasks sourced from 41 actively maintained repositories. The benchmark emphasizes enterprise and real-world diversity with domains covering business applications, B2B services, and developer tooling. All codebases are partitioned into:

  • Public set: 731 instances, 11 GPL repositories; fully released for open evaluation.
  • Held-Out set: 858 instances, 12 GPL repositories; kept private to mitigate future overfitting.
  • Commercial set: 276 instances, 18 proprietary startup repositories; code not released, only evaluation results published (Deng et al., 21 Sep 2025).

Tasks are derived from consecutive commit pairs where the later “instance” fixes a bug or adds a feature. Nontriviality is enforced: only patches spanning ≥10 lines and ≥2 files are retained (mean: 107.4 LOC, 4.1 files). Each instance undergoes automated and human verification, including:

  • Conversion of raw PR/commit/issue context into a concise “Problem Statement” (GitHub issue style).
  • Explicit human-authored “Requirements” for unambiguous specification.
  • “Interface” specification for new or modified APIs when needed.
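
The nontriviality filter above can be expressed as a short screening step. The `PatchStats` record and instance identifiers below are hypothetical illustrations; only the ≥10-line and ≥2-file thresholds come from the benchmark description:

```python
from dataclasses import dataclass

# Thresholds stated for SWE-Bench-Pro task selection: >=10 changed lines, >=2 files.
MIN_CHANGED_LINES = 10
MIN_CHANGED_FILES = 2

@dataclass
class PatchStats:
    """Summary statistics for a candidate gold patch (illustrative structure)."""
    instance_id: str
    changed_lines: int   # total added + removed lines in the patch
    changed_files: int   # number of files the patch touches

def is_nontrivial(patch: PatchStats) -> bool:
    """Keep only patches spanning >=10 lines and >=2 files."""
    return (patch.changed_lines >= MIN_CHANGED_LINES
            and patch.changed_files >= MIN_CHANGED_FILES)

candidates = [
    PatchStats("repo-a__issue-17", changed_lines=134, changed_files=5),
    PatchStats("repo-b__issue-3", changed_lines=4, changed_files=1),
]
retained = [p for p in candidates if is_nontrivial(p)]   # only the first instance survives
```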

The public and held-out sets derive from GPL/copyleft repositories, whose licenses discourage inclusion in commercial pre-training corpora, while the commercial set comes from private codebases for maximal contamination resistance.

2. Evaluation Protocol and Performance Metrics

Agents operate within a unified evaluation scaffold. Each agent is presented with the problem statement, requirements, and interface description, alongside access to a containerized, language-specific environment (e.g., Python venv, Node.js, Go modules). The interaction model is turn-based with a hard cap of 200 interactions.
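
A minimal sketch of such a turn-based scaffold, assuming hypothetical `agent` and `container` interfaces; only the prompt contents (problem statement, requirements, interface) and the 200-interaction cap are taken from the protocol described above:

```python
MAX_TURNS = 200  # interaction cap in the evaluation scaffold

def run_episode(agent, container, task):
    """Drive one agent/task episode until the agent submits or the turn cap is hit."""
    prompt = {
        "problem_statement": task["problem_statement"],
        "requirements": task["requirements"],
        "interface": task.get("interface", ""),          # present only when APIs change
    }
    observation = container.reset(task["repo_image"])     # containerized, language-specific env
    for _ in range(MAX_TURNS):
        action = agent.step(prompt, observation)           # e.g. a shell command or "submit"
        if action["type"] == "submit":
            break
        observation = container.run(action["command"])     # stdout/stderr fed back to the agent
    return container.diff()   # final patch against the base commit, scored by the full test suite
```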

The primary metric is Pass@k, the probability that at least one of k candidate solutions is correct for a given task, averaged over all tasks; Pass@1 is the most commonly reported setting:

\text{Pass@}k = 1 - \prod_{j=1}^{k}\left(1 - \frac{n_i}{N}\right) = 1 - \left(1 - \frac{n_i}{N}\right)^{k}

where N is the number of candidate solutions sampled for task i, n_i is the number of correct candidates, and the reported score is averaged over the M tasks (Deng et al., 21 Sep 2025). Success requires all test-suite assertions (not just those modified in the patch) to pass, with three test reruns per instance to root out flakiness.
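
A minimal sketch of this computation, assuming the per-task counts of correct candidates have already been obtained from scored runs (all names below are illustrative):

```python
def pass_at_k(correct_counts: list[int], num_samples: int, k: int) -> float:
    """Mean Pass@k over tasks, using the per-task estimate 1 - (1 - n_i / N) ** k."""
    assert 0 < k <= num_samples
    per_task = [1.0 - (1.0 - n / num_samples) ** k for n in correct_counts]
    return sum(per_task) / len(per_task)

# Example: M = 4 tasks, N = 5 candidate patches each, n_i correct candidates per task.
correct = [0, 1, 5, 2]
print(round(pass_at_k(correct, num_samples=5, k=1), 3))   # 0.4
```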

Sample performance (public set, Pass@1):

| Model | Resolve (%) | Notes |
|---|---|---|
| GPT-5 | 23.3 | Highest reported result on the public set |
| Claude Opus 4.1 | 22.7 | |
| Claude Sonnet 4 | 17.6 | |
| Gemini 2.5 Pro Preview | 13.5 | |
| SWE-Smith 32B | 6.8 | Open-source |
| OpenAI GPT-4o | 4.9 | |
| Qwen-3 32B | 3.4 | |

With more recent scaffolds and underlying models (see Section 5), state-of-the-art open-source agents (Live-SWE-agent: 45.8%; SWE-agent: 43.6%) close much of the gap to proprietary agents (Kimi K2-thinking: ∼51.8%) (Xia et al., 17 Nov 2025).

3. Automated Generation, Scale, and Distributional Shift

Automated benchmark generation is realized via SETUPAGENT, which reconstructs historically accurate dependency setup and test execution, supporting large-scale curation (hundreds of repositories, thousands of instances). SETUPAGENT iteratively improves install and test commands using LLM-driven repair, ensuring ≥95% test pass coverage before tasks are admitted to the benchmark (Vergopoulos et al., 10 Mar 2025).
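
The repair loop can be sketched as follows. The `sandbox` and `llm.propose_fix` interfaces, the initial command guesses, and the round budget are hypothetical; only the iterative LLM-driven repair of install/test commands and the ≥95% pass threshold come from the description above:

```python
PASS_THRESHOLD = 0.95    # admit a task only if at least 95% of its tests pass on the gold commit
MAX_REPAIR_ROUNDS = 5    # illustrative budget, not a value from the paper

def reconstruct_setup(repo, llm, sandbox):
    """Iteratively repair install/test commands until the suite runs cleanly enough."""
    commands = {"install": "pip install -e .", "test": "pytest"}   # naive first guess
    for _ in range(MAX_REPAIR_ROUNDS):
        log = sandbox.run(repo, commands)            # execute install + tests, capture output
        if log.pass_rate >= PASS_THRESHOLD:
            return commands                          # historically accurate setup recovered
        # Ask the LLM to patch the commands based on the failure log (schematic call).
        commands = llm.propose_fix(repo=repo, commands=commands, failure_log=log.text)
    return None   # setup could not be reconstructed; the candidate task is dropped
```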

Compared to prior manually curated benchmarks:

  • Repository count: 366 (SWEE-Bench, extended) vs 12 (SWE-Bench)
  • Instance count: 884 (SWEE-Bench) vs 2,294 (SWE-Bench)
  • Fix complexity: Mean patch size 169.9 lines (SWEE), 41.0 (SWE); mean files touched: 2.05 (SWEE), 1.52 (SWE).
  • Agent success rates: up to 40% lower on the automatically generated SWEE/SWA benchmarks than on the original SWE-Bench, due to lower issue-description quality, higher fix complexity, and less pre-training contamination.

This expanded coverage mitigates distributional mismatch: success rates sharply decline beyond the limited scope of popular, well-documented libraries, demonstrating that benchmarks restricted to such repositories are not representative of general agent performance (Vergopoulos et al., 10 Mar 2025).

4. Robust Validation: Semantic Correctness and PatchDiff

SWE-Bench-Pro adopts best-practice validations to address the problem of plausible-but-incorrect patches. Studies on SWE-Bench Verified reveal that limited test coverage (patch-level only) can cause up to 7.8% of “plausible” patches to count as correct while failing full developer test suites. Moreover, 29.6% of plausible patches induce functional behavior divergent from human-written ground truths; roughly 11% are certainly incorrect, inflating reported resolution rates by ≈6.2 percentage points (Wang et al., 19 Mar 2025).

PatchDiff is introduced as an automated differential patch testing pipeline:

  • Procedure: instrument both the oracle-patched and agent-patched repository versions, collect execution traces, extract the target functions, prune context, and prompt an LLM to synthesize “differentiating tests” that discriminate agent behavior from oracle behavior.
  • Detection: a patch is flagged divergent if there exists a synthesizable test that passes on one repository version but fails on the other (sketched after this list).
  • Scalability: Deployed on 500 tasks, PatchDiff differentiated 146 patches, revealing subtle semantic divergences largely missed by traditional test coverage.
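
A hedged sketch of the core differential check referenced above. The `runner` and `llm` interfaces are placeholders; only the asymmetry rule (a test that passes on one patched version but fails on the other) is taken from the pipeline described here:

```python
def is_divergent(oracle_repo, agent_repo, llm, runner) -> bool:
    """Flag an agent patch whose behavior diverges from the oracle (human) patch."""
    # 1. Collect execution traces and target functions from both patched versions;
    #    trace instrumentation and context pruning are abstracted away here.
    context = runner.extract_context(oracle_repo, agent_repo)
    # 2. Ask an LLM to synthesize candidate differentiating tests (schematic call).
    candidate_tests = llm.synthesize_tests(context)
    # 3. One test that passes on one version and fails on the other suffices.
    for test in candidate_tests:
        if runner.run_test(oracle_repo, test) != runner.run_test(agent_repo, test):
            return True    # behavioral divergence detected
    return False           # no differentiating test found within budget
```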

SWE-Bench-Pro recommends integrating PatchDiff into continuous integration pipelines, augmenting canonical test suites with community-contributed differentiators, and hardening issue specifications via executable contracts.

5. Agent Architectures and Dynamic Adaptation

Live-SWE-agent exemplifies successful agent adaptation for SWE-Bench-Pro (Xia et al., 17 Nov 2025). Rather than relying exclusively on a fixed toolset, Live-SWE-agent self-evolves its scaffolding, creating and refining tools at runtime:

  • Basic loop: proposes individual bash commands (one at a time), observes the results, and decides whether to create or modify utility tools on the fly (a minimal sketch follows this list).
  • Reflection prompt: after each command, the agent is prompted to deliberate on tool synthesis, raising the solve rate from 64% to 76% on SWE-Bench Verified, with analogous gains on Pro.
  • Cost and efficiency: maintains a low mean LLM cost ($0.73 per issue on Pro) and outperforms all known open-source baselines.
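
A minimal sketch of the basic loop plus reflection step, assuming hypothetical `llm` and `shell` interfaces; the one-command-per-turn structure and the post-command reflection on tool synthesis follow the description above, while everything else is illustrative:

```python
def live_agent_loop(llm, shell, task, max_turns=200):
    """One bash command per turn, with a reflection step that may mint reusable helper tools."""
    history, tools = [], {}            # tools: name -> script created by the agent at runtime
    for _ in range(max_turns):
        command = llm.next_command(task=task, history=history, tools=tools)
        if command.strip() == "submit":
            break                                        # agent decides it is done
        output = shell.run(command)                      # observe the result of one command
        history.append((command, output))
        # Reflection prompt: should a reusable utility tool be created or modified?
        suggestion = llm.reflect_on_tools(history=history, tools=tools)
        if suggestion is not None:
            tools[suggestion.name] = suggestion.script
            shell.write_file(f"tools/{suggestion.name}", suggestion.script)
    return shell.diff()                                  # final patch submitted for evaluation
```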

Performance comparison (SWE-Bench-Pro, public split; Table from (Xia et al., 17 Nov 2025)):

| Tool (LLM) | Solve rate | Avg. cost | Notes |
|---|---|---|---|
| SWE-agent (Claude 4.5) | 43.6% | – | Open-source leaderboard |
| Live-SWE-agent (Claude 4.5) | 45.8% | $0.73 | State-of-the-art open/non-proprietary |
| GPT-5.1 dynamic patch | ∼49.5% | – | Proprietary |
| Gemini 2.5 Pro | ∼49.4% | – | Proprietary (DeepMind) |
| Kimi K2-thinking | ∼51.8% | – | Commercial customized agent |

Agents employing flexible, context-sensitive tool synthesis demonstrate an explicit advantage over those with rigid scaffolds, particularly on long-horizon, multi-file tasks.

6. Real-World Fidelity, Contamination Resistance, and Future Directions

SWE-Bench-Pro’s design enforces contamination resistance by using GPL/copyleft repositories (legally excluded from commercial pre-training corpora) and by sourcing additional tasks from proprietary, private startup repositories, minimizing the chance of direct overlap with model training sets.

Failure mode analyses on frontier LLMs (GPT-5, Claude Opus, Gemini 2.5 Pro) reveal dominant error classes:

  • Semantic/algorithmic errors dominate (35–52% of failures)
  • Syntax errors persist (23–33%)
  • Tool mismanagement and context navigation issues (endless file reading, context overflow) afflict smaller or less capable models (Deng et al., 21 Sep 2025)

Despite >70% Pass@1 on simpler SWE-Bench Verified tasks, solve rates remain <25% for SWE-Bench-Pro’s full public set among frontier LLMs, reflecting real-world complexity and the enduring gap to autonomous engineering proficiency.

Recommended future directions include:

  • Expanded framework/language support (Java, C#, Rust).
  • Enhanced evaluation: code architecture, performance, security, and collaborative workflows.
  • Advanced agent scaffolding: dynamic planning, improved context retrieval, tool integration.
  • Continuous evolution: on-the-fly agent adaptation, persistent and transferable “skills,” and training objectives incorporating scaffold and tool refinement (Xia et al., 17 Nov 2025).

7. Significance and Benchmarking Ecosystem

SWE-Bench-Pro establishes a contamination-resistant, semantically rigorous testbed with far-reaching consequences for code-agent research. Its methodology—integrating exhaustively validated test suites, semantic differential testing, and collaborative suite augmentation—creates a self-reinforcing cycle: solvers expose new deficiencies, benchmarks are extended, and agents converge toward true developer intent.

The benchmark, together with supporting frameworks like SETUPAGENT and PatchDiff, marks a paradigmatic shift from manual, limited benchmarking to scalable, realistic evaluation protocols essential for the future of autonomous software engineering (Vergopoulos et al., 10 Mar 2025, Wang et al., 19 Mar 2025).
