SWE-Bench: Real-World Software Benchmark
- SWE-Bench is a large-scale, repository-level benchmark that evaluates language models on resolving authentic GitHub issues in full codebases.
- It involves multi-file patch generation, context retrieval, and execution-based verification to ensure patches pass both new and regression tests.
- Key metrics include resolution rates and patch-apply percentages, highlighting the challenges of scaling automated software repair to real-world scenarios.
SWE-Bench is a large-scale, repository-level benchmark for evaluating LLMs on real-world software engineering tasks, specifically the resolution of actual GitHub issues in the context of full codebases. Unlike traditional program synthesis or bug-fixing tasks that operate on small, artificially constructed snippets, SWE-Bench presents a challenging, execution-verified testbed where models must generate patches for substantial open-source Python projects. Each instance comprises a full codebase snapshot and a natural language issue report, requiring the model to generate a patch that, when applied, resolves the issue: the fail-to-pass tests associated with the fix must pass, and the existing regression tests must continue to pass. This framework confronts models with the intricacies and scale of authentic software maintenance, demanding multi-file editing, context retrieval, patch formatting, and rigorous behavioral validation.
1. Benchmark Structure and Dataset Construction
SWE-Bench consists of 2,294 curated, high-quality software engineering problems, each corresponding to a real GitHub issue and its resolving pull request, drawn from 12 popular Python repositories. The construction pipeline filters thousands of pull requests through a three-stage process:
- Repository Selection and Data Scraping: Selection targets widely-used, well-tested Python packages. For each, the pipeline collects issue descriptions and codebase snapshots anchored to the relevant PR base commit.
- Attribute-based Filtering: Task instances are kept only if their corresponding PRs are merged, linked to a public issue, and modify at least one test file, indicating that the fix contributes tests capable of verifying it.
- Execution-based Filtering: Each candidate patch is replayed on the codebase and the tests are executed to ensure that at least one test switches from fail to pass (identifying "fail-to-pass" tests) and that other functionality remains uncompromised. Instances causing installation or runtime errors are excluded. A minimal sketch of this step follows.
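A possible implementation of the execution-based filter, sketched under the assumption that each candidate supplies a checked-out repository, the PR's test-only patch, the gold solution patch, and a test command; the function names and output parsing here are illustrative, not the official harness:

```python
import subprocess
from pathlib import Path

def failing_tests(repo_dir: Path, test_cmd: str) -> set[str]:
    """Run the suite and return identifiers of failing tests.

    Assumes the command prints one '<test_id> ... FAILED' line per failure;
    real harnesses parse pytest/unittest output in a framework-specific way.
    """
    proc = subprocess.run(test_cmd, shell=True, cwd=repo_dir,
                          capture_output=True, text=True)
    return {line.split()[0] for line in proc.stdout.splitlines()
            if line.rstrip().endswith("FAILED")}

def fail_to_pass(repo_dir: Path, test_patch: str, gold_patch: str,
                 test_cmd: str) -> set[str]:
    """Tests that fail before the gold patch and pass after it."""
    # Apply only the PR's test changes so the new tests exist in both runs.
    subprocess.run(["git", "apply", "-"], input=test_patch, text=True,
                   cwd=repo_dir, check=True)
    before = failing_tests(repo_dir, test_cmd)

    # Apply the gold (solution) patch and re-run the suite.
    subprocess.run(["git", "apply", "-"], input=gold_patch, text=True,
                   cwd=repo_dir, check=True)
    after = failing_tests(repo_dir, test_cmd)

    # Keep the instance only if at least one test flips from fail to pass.
    return before - after
```

An instance survives this stage only when the returned set is non-empty and neither test run hits installation or collection errors.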
The resulting dataset exhibits the following statistics:
- Issue descriptions: ~195 words on average
- Average repository size: ~3,010 non-test files, ~438,000 LOC
- Gold patches: mean of 1.7 files, 3 functions, and ~32.8 lines (added/removed)
- Each instance: ~9.1 fail-to-pass tests, with ~51 additional tests for regression checking
This configuration ensures every SWE-Bench sample is non-trivial, reproducible, and automatically executable for robust evaluation (Jimenez et al., 2023).
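The dataset is distributed on the Hugging Face Hub; the snippet below is a small sketch of loading it and recomputing a few of the statistics above. The field names (`problem_statement`, `FAIL_TO_PASS`, `instance_id`) are assumed from the public release and should be verified against the version in use.

```python
import json
import statistics
from datasets import load_dataset  # pip install datasets

# The test split holds the 2,294 evaluation instances.
ds = load_dataset("princeton-nlp/SWE-bench", split="test")

issue_words = [len(ex["problem_statement"].split()) for ex in ds]
f2p_counts = [len(json.loads(ex["FAIL_TO_PASS"])) for ex in ds]  # JSON-encoded list

print(f"instances:         {len(ds)}")
print(f"mean issue words:  {statistics.mean(issue_words):.1f}")
print(f"mean fail-to-pass: {statistics.mean(f2p_counts):.1f}")
print(f"example id:        {ds[0]['instance_id']}")
```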
2. Evaluation Protocol and Metrics
SWE-Bench employs a strictly execution-based, fully automated evaluation process:
- Input: The LLM receives the full issue description and, due to context length limitations, a retrieved subset of likely relevant files (via sparse BM25 retrieval or "oracle" retrieval of the files edited by the reference solution).
- Patch Generation: The model outputs a patch in unified diff format specifying the file changes (see the example after this list).
- Verification: The patch is applied and repository tests are executed in a controlled environment. An instance is considered “resolved” if:
- The patch applies without error; and
- All designated fail-to-pass tests now pass, and all regression (pass-to-pass) tests continue to pass.
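A per-instance check along these lines, written as a sketch rather than the official harness (the `pytest` invocation and result handling are simplified, and the example patch is purely illustrative):

```python
import subprocess
from pathlib import Path

# A hypothetical model output in unified diff format (file and fix are
# invented for illustration); this is what `git apply` consumes.
EXAMPLE_PATCH = """\
--- a/mypkg/parser.py
+++ b/mypkg/parser.py
@@ -10,3 +10,3 @@ def parse(value):
     if value is None:
-        return 0
+        raise ValueError("value must not be None")
     return int(value)
"""

def test_passes(repo_dir: Path, test_id: str) -> bool:
    """Return True if a single pytest-style test id passes."""
    proc = subprocess.run(["python", "-m", "pytest", test_id],
                          cwd=repo_dir, capture_output=True)
    return proc.returncode == 0

def evaluate_instance(repo_dir: Path, model_patch: str,
                      fail_to_pass: list[str],
                      pass_to_pass: list[str]) -> dict:
    """Apply a model patch and check the resolution criteria above."""
    applied = subprocess.run(["git", "apply", "-"], input=model_patch,
                             text=True, cwd=repo_dir).returncode == 0
    if not applied:
        return {"applied": False, "resolved": False}
    resolved = (all(test_passes(repo_dir, t) for t in fail_to_pass) and
                all(test_passes(repo_dir, t) for t in pass_to_pass))
    return {"applied": True, "resolved": resolved}
```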
The key quantitative metric is the resolution rate:

$$\%\,\text{Resolved} \;=\; \frac{|\{\text{instances whose patch applies and whose fail-to-pass and pass-to-pass tests all pass}\}|}{|\{\text{task instances}\}|} \times 100$$

In addition, the "patch apply" rate (the share of instances where the patch applies cleanly but may not resolve the underlying issue) is reported, distinguishing partial progress from full success. A minimal scoring sketch follows.
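Given per-instance outcomes such as those returned by the sketch above, both reported numbers are simple aggregates (illustrative code, not the official scoring script):

```python
def summarize(results: list[dict]) -> dict:
    """Compute % Resolved and the patch-apply rate over all task instances."""
    n = len(results)
    resolved = sum(r["resolved"] for r in results)
    applied = sum(r["applied"] for r in results)
    return {
        "% resolved": 100.0 * resolved / n,  # headline metric
        "% apply": 100.0 * applied / n,      # patch applied; issue may remain
    }
```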
3. Task and Modeling Challenges
SWE-Bench task instances present unique technical demands:
- Long-Context Retrieval and Representation: Codebases contain hundreds of thousands of lines of code, far exceeding model context windows, so context selection and windowing are critical; including irrelevant files introduces distractors that models must overcome (a retrieval sketch appears at the end of this section).
- Repository-Scale Reasoning: Bug fixes often span multiple functions, classes, and files, requiring reasoning over dependencies and project-wide invariants.
- Structured Output Generation: Patches must be syntactically valid diff files, complicating the output space relative to unconstrained code generation.
- Execution Coordination: Solutions must fix the targeted test(s) while preserving all pre-existing functionality—prematurely aggressive or incomplete changes fail this bar.
These challenges position SWE-Bench well beyond "toy" code completion and highlight the gap between LLM performance on confined snippet generation and authentic repository maintenance.
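The retrieval step referenced above is typically a sparse ranker over repository files. Below is a minimal sketch using the `rank_bm25` package; the paper's pipeline also uses BM25, but its exact implementation, tokenization, and file filtering differ:

```python
from pathlib import Path
from rank_bm25 import BM25Okapi  # pip install rank-bm25

def retrieve_files(repo_dir: str, issue_text: str, k: int = 10) -> list[str]:
    """Rank non-test Python files against the issue text with BM25."""
    paths = [p for p in Path(repo_dir).rglob("*.py")
             if not any(part.startswith("test") for part in p.parts)]
    docs = [p.read_text(errors="ignore") for p in paths]
    bm25 = BM25Okapi([d.split() for d in docs])      # naive whitespace tokens
    scores = bm25.get_scores(issue_text.split())
    ranked = sorted(zip(scores, paths), key=lambda x: x[0], reverse=True)
    return [str(p) for _, p in ranked[:k]]           # top-k candidate files
```

In the realistic setting, the top-ranked files are concatenated into the prompt until the model's context budget (e.g., 13k tokens) is exhausted.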
4. Model Performance and Baseline Results
SWE-Bench reveals the limited real-world effectiveness of contemporary LLMs:
- Under realistic BM25 retrieval and a 13k-token window, the best-performing proprietary model (Claude 2) resolves only 1.96% of issues.
- Models such as ChatGPT-3.5 and GPT-4 achieve near-zero rates on a subset.
- Open-source SWE-Llama variants (7B and 13B) resolve only 0.70% of issues, though they achieve ~51–53% "patch apply" rates, indicating many plausible but ultimately insufficient attempts.
- When provided "oracle" retrieval of exactly the gold-patched files, Claude 2 improves to 4.8% resolution, quantifying how much imperfect retrieval limits performance.
- Human-authored (gold) patches in the dataset average 32.8 changed lines over multiple files; models that succeed typically produce smaller, more localized edits, underlining the complexity gap.
Even fine-tuned models and systems with improved retrieval or patch formatting struggle; current approaches do not robustly generalize or orchestrate the multi-file reasoning required for high success rates (Jimenez et al., 2023).
5. Limitations and Known Issues
The original SWE-Bench construction introduced several critical limitations that have prompted substantial follow-up research:
- Solution Leakage: Manual review found that in 32.67% of instances marked as resolved, the solution was already present in the issue description or comments, inflating scores (the "solution leakage" problem).
- Insufficient Test Coverage: 31.08% of accepted patches passed because the underlying test suite was inadequate to reject incorrect or incomplete solutions. Filtering out these cases reduces the apparent effectiveness of state-of-the-art systems from 12.47% to 3.97% (Aleithan et al., 9 Oct 2024).
- Data Contamination: Over 94% of the issues were filed before the knowledge cutoff date for pre-trained LLMs, raising the risk of inadvertent memorization or exposure in training data.
As a result, current SWE-Bench resolution rates—when taken at face value from execution-based validation—substantially overestimate models’ true ability to generalize and repair diverse, unseen bugs.
6. Extensions, Derivatives, and Ecosystem
The influence of SWE-Bench has catalyzed the development of numerous downstream resources and improvements:
- SWE-bench Verified and Lite: Curated subsets for streamlined and higher-confidence evaluation.
- Benchmark Variants: SWE-bench-java ported the task to Java (Zan et al., 26 Aug 2024), while SWE-bench Multimodal extended the paradigm to visual, user-facing domains in JavaScript and TypeScript, with image- or screenshot-augmented tasks (Yang et al., 4 Oct 2024).
- Automated Curation Pipelines and Scaling: Tools such as SetUpAgent automate historic dependency setup, yielding broader and more representative datasets (SWEE-Bench, SWA-Bench) that expose distributional weaknesses and further reduce overfitting (Vergopoulos et al., 10 Mar 2025).
- Test Suite Augmentation: UTBoost and SPICE provide LLM-augmented verification and test coverage assessment, surfacing latent errors missed by the original test suites and enhancing leaderboard reliability (Yu et al., 10 Jun 2025, Bhatia et al., 12 Jul 2025).
- Contamination Resistance and Continual Learning: Recent dynamic and live benchmarks such as SWE-bench-Live (Zhang et al., 29 May 2025), SWE-MERA (Adamenko et al., 15 Jul 2025), and continual learning splits (e.g., SWE-Bench-CL) (Joshi et al., 13 Jun 2025) improve robustness by curating tasks post-LLM training and organizing instances by creation date. SWE-Bench Pro (Deng et al., 21 Sep 2025) targets long-horizon, multi-file and enterprise-grade problem settings where existing models fall below a 25% success rate.
- Security and Failure Modes: SWE-Bench has exposed that LLMs and LLM agent frameworks inject distinctive vulnerability classes absent from human patches, with a nearly 9× increase in new vulnerabilities under certain settings (Sajadi et al., 30 Jun 2025).
These developments collectively delineate an increasingly rigorous ecosystem around SWE-Bench, grounding LLM evaluation in both challenge diversity and contamination resistance.
7. Implications and Future Directions
SWE-Bench has redefined the realism and difficulty of code generation and automated program repair benchmarks. Nevertheless, the experience with solution leakage and weak test coverage demonstrates that even execution-verified metrics are insufficient unless complemented by rigorous manual screening, test suite strengthening, and contamination auditing. The field is moving toward dynamic, continually updated, and multi-language benchmarks, integration of automated test case generation, and agentic evaluation protocols that require transfer, continual learning, and multi-modal or tool-use capabilities.
A plausible implication is that future LLM and agent research must innovate not only on model architecture and retrieval but on the entire workflow: including robust patch generation, strong test oracle construction, cross-issue and cross-repository generalization, and predictive assessment of solution security and completeness. The SWE-Bench trajectory suggests that high-quality, diverse, contamination-resistant evaluation ecosystems will remain critical to tracking genuine progress toward practical, autonomous software engineering agents (Jimenez et al., 2023, Aleithan et al., 9 Oct 2024, Vergopoulos et al., 10 Mar 2025, Zhang et al., 29 May 2025, Yu et al., 10 Jun 2025, Deng et al., 21 Sep 2025).