SWE-Bench Pro: Can AI Agents Solve Long-Horizon Software Engineering Tasks? (2509.16941v1)

Published 21 Sep 2025 in cs.SE and cs.CL

Abstract: We introduce SWE-Bench Pro, a substantially more challenging benchmark that builds upon the best practices of SWE-BENCH [25], but is explicitly designed to capture realistic, complex, enterprise-level problems beyond the scope of SWE-BENCH. SWE-BENCH PRO contains 1,865 problems sourced from a diverse set of 41 actively maintained repositories spanning business applications, B2B services, and developer tools. The benchmark is partitioned into a public set with open access to problems sourced from 11 repositories, a held-out set of 12 repositories and a commercial set of 18 proprietary repositories where we have formal partnership agreements with early-stage startups. Problems in the held-out and the commercial set are not publicly accessible, but we release results on the commercial set. Our benchmark features long-horizon tasks that may require hours to days for a professional software engineer to complete, often involving patches across multiple files and substantial code modifications. All tasks are human-verified and augmented with sufficient context to ensure resolvability. In our evaluation of widely used coding models, under a unified scaffold, we observe that their performance on SWE-Bench PRO remains below 25% (Pass@1), with GPT-5 achieving the highest score to date at 23.3%. To better understand these limitations, we cluster the failure modes observed in the collected agent trajectories for a clearer characterization of the error patterns exhibited by current models. Overall, SWE-BENCH PRO provides a contamination-resistant testbed that more faithfully captures the complexity and diversity of real-world software development, advancing the pursuit of truly autonomous software engineering agents at a professional level.

Summary

  • The paper presents SWE-Bench Pro, a contamination-resistant benchmark for long-horizon software engineering tasks that require multi-file, large-scale code modifications.
  • It details a rigorous human-in-the-loop process and diverse dataset construction, emphasizing task complexity with concrete metrics such as an average of 107.4 lines of code changed per task.
  • The evaluation shows frontier models achieve below 25% Pass@1, highlighting major challenges in aligning AI agent capabilities with real-world software development demands.

SWE-Bench Pro: A Rigorous Benchmark for Long-Horizon Software Engineering Agents

Motivation and Context

SWE-Bench Pro addresses critical gaps in the evaluation of LLM-based software engineering agents by introducing a benchmark that is both contamination-resistant and representative of real-world, enterprise-level software development. Existing benchmarks, such as SWE-Bench and SWE-Bench Verified, have become saturated, with top models achieving over 70% Pass@1. However, these benchmarks are limited by contamination risks due to permissive licensing and by the prevalence of trivial, single-file tasks that do not reflect the complexity of professional software engineering. SWE-Bench Pro is designed to overcome these limitations by curating a diverse set of challenging, long-horizon tasks from both public GPL-licensed and proprietary commercial repositories, emphasizing multi-file, large-scale code modifications (Figure 1).

Figure 1: SWE-Bench Pro is a dataset with challenging, enterprise-level, long-horizon software engineering tasks. Frontier models, such as GPT-5 and Claude Opus 4.1, score less than 25% on SWE-Bench Pro with the SWE-Agent scaffold.

Dataset Construction and Design Principles

SWE-Bench Pro comprises 1,865 human-verified and augmented problems sourced from 41 actively maintained repositories, partitioned into public, held-out, and commercial subsets. The public set (731 instances) is derived from GPL-licensed repositories, the commercial set (276 instances) from proprietary startup codebases, and the held-out set (858 instances) is reserved for future overfitting checks. The benchmark enforces several key design principles:

  • Contamination Resistance: By selecting only GPL-licensed public repositories and private commercial codebases, SWE-Bench Pro minimizes the risk of benchmark leakage into LLM training corpora.
  • Task Complexity: All tasks require at least 10 lines of code changes, with an average of 107.4 LOC across 4.1 files per task. Over 100 tasks require more than 100 LOC modifications, and trivial edits are explicitly excluded; a minimal sketch of such a filter appears after the figures below.
  • Human Augmentation and Verification: Each problem undergoes a three-stage human-in-the-loop process to clarify ambiguity, augment requirements, and ensure robust test-based validation. This process guarantees that tasks are both challenging and resolvable.
  • Diversity: The benchmark spans multiple domains (consumer, B2B, developer tools) and languages (Python, Go, JavaScript, TypeScript), with each repository contributing a capped number of instances to prevent overfitting (Figures 2 and 3).

Figure 2: SWE-Bench Pro mimics real, challenging software engineering tasks with larger, multi-file changes. Frontier models score >70% on SWE-Bench Verified but less than 25% on SWE-Bench Pro.

Figure 3: SWE-Bench Pro public set distributions show complex, long-horizon tasks across several files and diverse task types, including feature requests and bug fixes in optimization, security, UI/UX, and backend.
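To make these selection criteria concrete, the sketch below shows one way such a filter could be implemented. It is a minimal illustration, not the authors' pipeline: the `Task` fields, the `filter_tasks` function, and the `MAX_PER_REPO` value are assumptions; only the 10-LOC minimum and the existence of a per-repository cap come from the description above.

```python
from collections import defaultdict
from dataclasses import dataclass

@dataclass
class Task:
    repo: str
    files_changed: int
    loc_changed: int

MIN_LOC = 10          # from the criterion above: at least 10 changed lines
MAX_PER_REPO = 100    # hypothetical cap; the paper's exact cap is not given here

def filter_tasks(candidates: list[Task]) -> list[Task]:
    """Drop trivial edits and cap how many instances any single repo contributes."""
    kept, per_repo = [], defaultdict(int)
    for task in candidates:
        if task.loc_changed < MIN_LOC:          # exclude trivial, small edits
            continue
        if per_repo[task.repo] >= MAX_PER_REPO:
            continue                            # enforce the per-repository cap
        per_repo[task.repo] += 1
        kept.append(task)
    return kept
```

In the actual curation process, automated filtering of this kind is only a first step; the human-in-the-loop augmentation and verification stages described above determine the final task set.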

Evaluation Protocol and Model Performance

All models are evaluated using the SWE-Agent scaffold, which provides a unified agent-computer interface for code manipulation. The evaluation is performed in a setting with minimal ambiguity: agents receive the problem statement, requirements, and interface specification. Each task is validated using a suite of human-reviewed fail2pass and pass2pass tests in containerized, language-specific environments.
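The pass/fail decision for a candidate patch can be pictured roughly as follows. This is a minimal sketch, assuming a Docker-based container and a pytest-style runner, neither of which is specified here; the function names and arguments are illustrative, not the benchmark's actual harness.

```python
import subprocess

def run_tests(image: str, repo_dir: str, test_ids: list[str]) -> bool:
    """Run the given test IDs inside a language-specific container; True if all pass."""
    # 'docker run' plus 'pytest' are stand-ins for whatever containerized runner
    # a given repository actually uses; both are assumptions of this sketch.
    cmd = [
        "docker", "run", "--rm",
        "-v", f"{repo_dir}:/repo", "-w", "/repo",
        image, "pytest", "-q", *test_ids,
    ]
    return subprocess.run(cmd, capture_output=True).returncode == 0

def is_resolved(image: str, patched_repo: str,
                fail2pass: list[str], pass2pass: list[str]) -> bool:
    """A candidate patch resolves a task when the previously failing (fail2pass)
    tests now pass and the previously passing (pass2pass) tests still pass."""
    return (run_tests(image, patched_repo, fail2pass)
            and run_tests(image, patched_repo, pass2pass))
```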

Key findings include:

  • Frontier Model Performance: The best models, GPT-5 and Claude Opus 4.1, achieve 23.3% and 22.7% Pass@1, respectively, on the public set. On the commercial set, the best performance drops to 17.8% (Opus 4.1), highlighting the increased difficulty of enterprise codebases. (A brief sketch of the Pass@1 metric follows this list and figure.)
  • Open-Source Model Gap: Open-weight models such as Qwen-3 32B and SWE-Smith-32B achieve only 3.4% and 6.8% Pass@1, respectively, indicating a substantial capability gap relative to proprietary frontier models.
  • Language and Repository Variance: Model performance is higher on Python and Go tasks, with some models exceeding 30% resolve rates, while JavaScript/TypeScript tasks are more challenging. Repository-specific factors (e.g., codebase complexity, documentation) significantly impact resolve rates (Figure 4).

Figure 4: Model performance varies across languages and repositories; models currently perform better at Python, with significant variance in resolve rates across different repos and problem sizes.
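Pass@1, the headline metric above, reduces to a simple fraction when each task receives a single attempt. The resolved-task count in the comment is illustrative arithmetic, not a figure reported in the paper.

```python
def pass_at_1(resolved: list[bool]) -> float:
    """Pass@1 with a single attempt per task: the fraction of tasks resolved."""
    return sum(resolved) / len(resolved)

# Illustrative arithmetic: resolving roughly 170 of the 731 public instances
# gives 170 / 731 ≈ 0.233, i.e. the ~23.3% reported for GPT-5.
```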

Failure Mode Analysis

A detailed LLM-as-a-judge analysis, using GPT-5 as the judge, categorizes failure modes across models; a minimal sketch of such a classification step follows the list below. The analysis reveals:

  • Frontier Models: Opus 4.1 and GPT-5 primarily fail due to semantic errors (wrong solutions, 35.9–51.7%) and syntax errors (23.2–32.7%) in large, multi-file edits. These models demonstrate strong technical execution but struggle with complex reasoning and algorithmic correctness.
  • Smaller Models: Models like Qwen-3 32B exhibit high tool error rates (42.0%) and frequent syntax/formatting issues, reflecting limitations in both code generation and tool integration.
  • Context Management: Sonnet 4 and other models often fail due to context overflow and endless file reading, indicating that context window limitations and inefficient file navigation remain open challenges.
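The sketch below illustrates how an LLM-as-a-judge classification step of this kind might be wired up. The category names, prompt wording, and `judge` callable are all assumptions for illustration; the paper's actual rubric and prompt are not reproduced here.

```python
from collections import Counter

# Hypothetical failure taxonomy mirroring the categories discussed above.
CATEGORIES = [
    "semantic_error",    # wrong solution / incorrect logic
    "syntax_error",      # malformed or badly formatted edits
    "tool_error",        # misuse of the agent's editing or search tools
    "context_overflow",  # endless file reading, exhausted context window
    "other",
]

JUDGE_PROMPT = (
    "You are grading a failed software-engineering agent trajectory. "
    "Answer with exactly one label from: {labels}.\n\nTrajectory:\n{trajectory}"
)

def classify_failures(trajectories: list[str], judge) -> Counter:
    """Label each failed trajectory with a judge model and tally the categories.

    `judge` is a stand-in callable that sends a prompt to an LLM (e.g., GPT-5)
    and returns its text response; it is not an API shown in the paper.
    """
    counts = Counter()
    for traj in trajectories:
        label = judge(JUDGE_PROMPT.format(
            labels=", ".join(CATEGORIES), trajectory=traj)).strip()
        counts[label if label in CATEGORIES else "other"] += 1
    return counts
```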

Limitations

SWE-Bench Pro, while a significant advance, has several limitations:

  • Language Coverage: The benchmark underrepresents languages such as Java, C++, and Rust, limiting its generalizability across the full software engineering landscape.
  • Task Scope: The focus is on issue resolution via code patches; broader engineering activities (e.g., system design, code review) are not captured.
  • Test Suite Dependency: Reliance on test-based verification may not fully capture solution correctness, especially for tasks with multiple valid implementations.
  • Ambiguity Reduction: Human augmentation may make tasks overly prescriptive, diverging from the inherent ambiguity of real-world engineering problems.

Implications and Future Directions

SWE-Bench Pro establishes a new, more realistic baseline for evaluating the capabilities of LLM-based software engineering agents. The substantial performance gap between SWE-Bench Verified and SWE-Bench Pro underscores the need for further advances in agent reasoning, context management, and tool integration. The benchmark's contamination resistance and task diversity make it a robust testbed for both academic and industrial research.

Future work should expand language and framework coverage, develop alternative evaluation metrics (e.g., code quality, maintainability), and introduce collaborative and multi-agent scenarios. There is also a need for evaluation protocols that go beyond test suites, incorporating human judgment and code review practices.

Conclusion

SWE-Bench Pro provides a rigorous, contamination-resistant, and industrially realistic benchmark for long-horizon software engineering tasks. The low Pass@1 rates of current frontier models highlight the significant gap between LLM agent capabilities and the demands of professional software development. SWE-Bench Pro will serve as a critical resource for measuring progress and guiding research toward the development of truly autonomous, high-performing software engineering agents.
