SWE-bench Verified Issues Benchmark

Updated 9 December 2025
  • SWE-bench Verified Issues are a curated set of real GitHub software engineering problems validated for reproducibility and accurate fail-to-pass test transitions.
  • They employ a rigorous multi-stage pipeline, including automated reproducibility checks, Dockerized orchestration, and manual annotation to ensure high test fidelity.
  • The benchmark enables empirical evaluation of agentic systems with metrics like pass@1 and resolution rates, driving progress in automated bug repair.

SWE-bench Verified Issues are a hand-curated, high-fidelity subset of real-world GitHub software engineering problems designed to rigorously measure the bug-resolving and patch-synthesizing abilities of large code models and agentic systems. Developed to address deficiencies in prior ML evaluation datasets—namely unreproducible bugs, ambiguous requirements, and inadequate test harnesses—the SWE-bench Verified set consists of issues that have been thoroughly filtered, human-validated, and equipped with reliable “fail-to-pass” (F→P) unit tests, providing a robust testbed for empirical methodology and agentic system benchmarking (Jimenez et al., 2023, Ma et al., 1 Nov 2024, Liu et al., 17 Sep 2025).

1. Dataset Construction: Pipeline and Selection Criteria

The curation of SWE-bench Verified follows a multi-stage filtering and verification pipeline, enforcing strict reproducibility and testability standards:

  1. Repository and PR Mining: Initially, a broad pool of candidate repositories is scraped for historical pull requests (PRs) that are linked to GitHub issues and contain at least one change to a test file. For example, in SWE-bench-java-verified, repositories are drawn both from high-star public Java projects and the Defects4J corpus to ensure domain diversity (Zan et al., 26 Aug 2024).
  2. Automated Reproducibility Validation: Each candidate issue instance is subject to environment recreation at the base commit using project-native build tools (e.g., Maven/Gradle for Java; conda/pip/pytest for Python). Instances failing to compile or missing dependencies are discarded (Zan et al., 26 Aug 2024, Jimenez et al., 2023).
  3. Fail-to-Pass Test Enforcement: For each issue, two controlled environments (one with the buggy code, one with the developer patch) are subjected to the same test suite. An issue is retained only if at least one test fails pre-patch and flips to pass post-patch, with zero “pass-to-fail” regressions, enforcing the F→P invariant; this check is sketched in the example after this list (Zan et al., 26 Aug 2024, Jimenez et al., 2023).
  4. Manual Annotation and Consensus Filtering: Trained annotators independently review each remaining instance for the following criteria:
    • Clarity of issue description (ordinal: 0–3)
    • Test coverage strength (ordinal: 0–3)
    • Presence of other major flaws (binary: 0 = none, 1 = present)
    Lower scores indicate a clearer issue statement and stronger test coverage, so only issues satisfying Q1 ≤ 1, Q2 ≤ 1, and Q3 = 0 are admitted (Zan et al., 26 Aug 2024); this admission predicate is also included in the sketch after the list. The rule enforces a consensus threshold on comprehensibility and test adequacy, minimizing ambiguity and hidden requirements.
  5. Final Corpus and Statistics: For SWE-bench Verified (Python version), this process yields 500 issues across 12 major projects, and for SWE-bench-java-verified, 91 issues spanning 19 Java repositories (Zan et al., 26 Aug 2024, Chen et al., 21 Oct 2025).
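
The retention rule in step 3 and the admission rule in step 4 both reduce to simple predicates. The following is a minimal sketch in Python; the per-test result format, function names, and annotation field names are illustrative assumptions, not the official SWE-bench curation tooling.

```python
# Minimal sketch of the step-3 F->P invariant and the step-4 admission rule.
# Test-run results are assumed to be dicts mapping test id -> "pass" | "fail".

def satisfies_fail_to_pass(pre_patch: dict[str, str],
                           post_patch: dict[str, str]) -> bool:
    """True iff at least one test flips fail->pass and no test regresses."""
    shared = set(pre_patch) & set(post_patch)
    fail_to_pass = any(pre_patch[t] == "fail" and post_patch[t] == "pass"
                       for t in shared)
    pass_to_fail = any(pre_patch[t] == "pass" and post_patch[t] == "fail"
                       for t in shared)
    return fail_to_pass and not pass_to_fail


def admitted(q1_clarity: int, q2_coverage: int, q3_major_flaws: int) -> bool:
    """Step-4 consensus filter: Q1 <= 1, Q2 <= 1, and Q3 == 0."""
    return q1_clarity <= 1 and q2_coverage <= 1 and q3_major_flaws == 0


# Example: one test flips fail -> pass with no regressions, and the annotators
# rated the issue clear (Q1=1), well covered (Q2=0), with no major flaws (Q3=0).
pre = {"test_parse": "fail", "test_render": "pass"}
post = {"test_parse": "pass", "test_render": "pass"}
assert satisfies_fail_to_pass(pre, post)
assert admitted(q1_clarity=1, q2_coverage=0, q3_major_flaws=0)
```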

2. Verification Processes and Environment Design

To ensure the fidelity and reproducibility of benchmarks and agent submissions:

  • Dockerized Orchestration: For every issue, benchmark maintainers provide a Docker image that captures the precise base repository snapshot, pinned dependency versions, and correct build toolchain (Zan et al., 26 Aug 2024). This enables exact replay of the patch-application and test-execution steps, removing confounding variables such as environment drift or dependency updates.
  • Test Harness Lockdown: All “test.patch” artifacts (unit or integration tests that formalize correctness) are hidden from both the agent and the user, ensuring that only pre-existing information is leveraged during resolution (Wang et al., 10 Sep 2025). The harness checks both the fail→pass and pass→pass invariants to guard against regressions.
  • Leaderboard Reporting: Each agent or model submission is evaluated by launching these Docker containers, applying the submission as a git diff, and running the full prescribed test suite. The main metric is typically “pass@1,” i.e., the rate at which the top-ranked agent patch causes all tests to pass (Jimenez et al., 2023, Jain et al., 9 Apr 2025, Ma et al., 1 Nov 2024); a minimal evaluation-loop sketch follows this list.
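
As a concrete illustration of this evaluation loop, the sketch below mounts a candidate patch into a per-instance container, applies it as a git diff, and treats a zero exit code as “all prescribed tests pass.” The image tag scheme, mount path, and run_tests.sh entry point are hypothetical placeholders rather than the official SWE-bench harness.

```python
import subprocess


def evaluate_submission(instance_id: str, patch_path: str) -> bool:
    """Apply a candidate patch inside the instance's pinned container and
    report whether the prescribed test suite passes.

    `patch_path` is assumed to be an absolute host path; the image tag,
    mount point, and test command are illustrative placeholders.
    """
    image = f"swebench/{instance_id}:latest"            # hypothetical tag scheme
    cmd = [
        "docker", "run", "--rm",
        "-v", f"{patch_path}:/tmp/candidate.diff:ro",   # mount the patch read-only
        image,
        "bash", "-lc",
        # Apply the submission as a git diff, then run the prescribed tests
        # (assumes the image's working directory is the repository checkout).
        "git apply /tmp/candidate.diff && ./run_tests.sh",
    ]
    result = subprocess.run(cmd, capture_output=True, text=True)
    return result.returncode == 0                       # all tests passed


def pass_at_1(outcomes: list[bool]) -> float:
    """pass@1 over a split: fraction of instances whose top-ranked patch
    made the full prescribed test suite pass."""
    return sum(outcomes) / len(outcomes) if outcomes else 0.0
```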

3. Quality Controls: Manual Review, Test Adequacy, and Inter-Annotator Reliability

  • Multi-Annotator Review: All submitted issues are examined by multiple human experts, with explicit annotation guidelines. For example, up to ten Java developers participated in SWE-bench-java-verified (Zan et al., 26 Aug 2024). Disagreements are resolved by consensus exclusion.
  • Test Coverage Verification: Annotators manually rate the comprehensiveness of test coverage, and issues lacking adequate negative-path or edge-case tests are pruned from the benchmark. Although explicit inter-annotator agreement metrics (e.g., Cohen’s κ) are not always reported, the pipeline prescribes a formula for post-hoc computation: κ = (pₒ – pₑ) / (1 – pₑ), where pₒ is observed agreement and pₑ is chance agreement (Zan et al., 26 Aug 2024); a sketch of this computation follows this list.
  • Reproducibility and Fairness: Ground-truth fixes must apply cleanly and cause the defined F→P test transitions in the standardized environment. Any instance showing compilation or runtime failures, even after applying the ground-truth patch, is excluded (Zan et al., 26 Aug 2024, Jimenez et al., 2023, Wang et al., 10 Sep 2025).
  • Ongoing Maintenance: Leaderboards, Docker images, and dataset splits are continuously maintained and updated, and contributions (test augmentations, fixes) are encouraged via pull requests (Zan et al., 26 Aug 2024).
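
For two annotators assigning categorical ratings, the κ formula above can be computed post hoc from their parallel label lists. A minimal sketch:

```python
from collections import Counter


def cohens_kappa(labels_a: list, labels_b: list) -> float:
    """Cohen's kappa for two annotators: kappa = (p_o - p_e) / (1 - p_e),
    where p_o is observed agreement and p_e is chance agreement."""
    assert labels_a and len(labels_a) == len(labels_b)
    n = len(labels_a)

    # Observed agreement: fraction of items both annotators labeled identically.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Chance agreement: sum over labels of the product of marginal frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    p_e = sum((freq_a[k] / n) * (freq_b[k] / n)
              for k in set(labels_a) | set(labels_b))

    return 1.0 if p_e == 1.0 else (p_o - p_e) / (1 - p_e)


# Example: two annotators rating test-coverage strength on the 0-3 scale.
print(cohens_kappa([0, 1, 1, 3, 0], [0, 1, 2, 3, 0]))  # ~0.72
```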

4. Evaluation Methodologies and Metrics

SWE-bench Verified utilizes rigorous, transparent evaluation protocols:

  • Task Definition: Each instance supplies the agent with the issue description, full repository snapshot at issue time, and no privileged access to hidden gold patches or tests.
  • Patch Application and Testing: Agents submit candidate patches, which are automatically applied and tested in the controlled environment. A patch is accepted only if all fail→pass tests flip and no new failures are introduced.
  • Resolution Rate: The core quantitative metric is ResolutionRate(M) = (number of issues resolved by model M) / (number of issues attempted) × 100%, as in (Ma et al., 1 Nov 2024). Other metrics include pass@k (the fraction of issues resolved within k attempts) and fault-localization accuracy, computed at chunk, function, and file granularities; both headline metrics are sketched after this list.
  • Experimental Controls: For precise benchmarking, repository versions and dependencies are pinned, and agent frameworks should fix random seeds where their underlying stack allows.
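
Both metrics follow directly from per-instance outcomes. A minimal sketch, assuming each issue records the boolean results of its attempts in rank order, and using the simple “resolved within k attempts” reading of pass@k given above rather than the combinatorial unbiased estimator:

```python
def resolution_rate(resolved: int, attempted: int) -> float:
    """ResolutionRate(M) = resolved / attempted * 100%."""
    return 100.0 * resolved / attempted if attempted else 0.0


def pass_at_k(attempt_outcomes: list[list[bool]], k: int) -> float:
    """Fraction of issues resolved within the first k attempts."""
    if not attempt_outcomes:
        return 0.0
    solved = sum(any(outcomes[:k]) for outcomes in attempt_outcomes)
    return solved / len(attempt_outcomes)


# Example: 3 issues, 2 attempts each; issue 0 is solved on attempt 2, issue 2 on attempt 1.
outcomes = [[False, True], [False, False], [True, True]]
print(resolution_rate(resolved=2, attempted=3))  # ~66.7
print(pass_at_k(outcomes, k=1))                  # ~0.33
print(pass_at_k(outcomes, k=2))                  # ~0.67
```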

5. Observed Challenges, Limitations, and Recommendations

Despite stringent pipeline controls, several structural and empirical weaknesses persist:

  • Test Suite Weaknesses: Over 15% of instances in SWE-bench Verified require augmentation, as many test patches are incomplete and allow erroneous or partial model patches to pass the test harness. Frameworks such as UTBoost and PatchDiff have revealed that leaderboard success rates may be inflated by 6–7 absolute percentage points due to latent test inadequacies and behavioral divergences between model and human patches (Yu et al., 10 Jun 2025, Wang et al., 19 Mar 2025, Aleithan et al., 9 Oct 2024).
  • Memorization and Leakage: A disproportionate fraction of issues were created well before LLM knowledge cutoffs, and, absent further filtering, direct solution leakage (copying solution code from issue text or comments into the model output) appears in over 30% of successful “passes.” Such contamination can be measured empirically using instance-wise, repository-wise, and temporal controls, and diagnostic subtasks (e.g., blind file-path identification) show that LLMs can achieve up to 76% accuracy through memorization alone, not reasoning (Liang et al., 14 Jun 2025, Aleithan et al., 9 Oct 2024).
  • Human Annotation Limitations: While manual review reduces noise, many issue statements remain under-specified or ambiguous regarding semantic edge cases. Not all annotator agreement statistics are reported, though the literature prescribes Cohen’s κ as a standard measure (Zan et al., 26 Aug 2024).
  • Dynamic Maintenance Needs: Without continual refresh and temporal decontamination, benchmarks quickly become stale as LLM training cutoffs advance past the issues’ creation dates. Recent studies advocate rolling task pipelines (e.g., SWE-rebench, SWE-MERA) that admit only issues created after model cutoffs and actively filter for overlap with public model training corpora (Badertdinov et al., 26 May 2025, Adamenko et al., 15 Jul 2025); a minimal sketch of such a temporal filter appears below.
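
At its simplest, the temporal-decontamination step in such rolling pipelines admits only issues created after the latest training cutoff among the models under evaluation. A minimal sketch, with illustrative field names:

```python
from datetime import date


def temporally_clean(issues: list[dict], model_cutoffs: dict[str, date]) -> list[dict]:
    """Keep only issues created after the latest training cutoff among the
    evaluated models. The "created_at" field name is illustrative."""
    latest_cutoff = max(model_cutoffs.values())
    return [issue for issue in issues if issue["created_at"] > latest_cutoff]


# Example: with a hypothetical 2024-10-01 cutoff, only the 2025 issue survives.
issues = [
    {"id": "repo__123", "created_at": date(2024, 3, 5)},
    {"id": "repo__456", "created_at": date(2025, 2, 17)},
]
print(temporally_clean(issues, {"model-a": date(2024, 10, 1)}))
```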

6. Significance, Impact, and Future Directions

SWE-bench Verified has become the reference standard for measuring AI-driven software engineering capability due to its reproducibility, breadth, and challenges posed by real-world, multi-file codebases with nuanced specifications. Key advances enabled by the benchmark include:

  • Agentic and Process-Centric LLMs: Systems benchmarked against SWE-bench Verified have evolved from static code predictors into full agentic pipelines, leveraging dynamic execution, self-correction, and ensemble voting to approach or exceed 50% task resolution rates (Ma et al., 1 Nov 2024, Jain et al., 9 Apr 2025, Chen et al., 31 Jul 2025).
  • Augmented Verification Methodologies: The incorporation of LLM-driven test augmentation, behavioral-differencing oracles, and hybrid verifier techniques (blending execution-based and execution-free criteria) has allowed open-weight models to match or exceed proprietary closed-source pipelines (Jain et al., 9 Apr 2025, Wang et al., 19 Mar 2025, Yu et al., 10 Jun 2025).
  • Extension Beyond Python: The introduction of SWE-bench-java-verified demonstrates the pipeline’s adaptability to other languages and the feasibility of generating high-quality, multi-lingual agent benchmarks (Zan et al., 26 Aug 2024).
  • Diagnostic and Defensive Benchmarking: Methodologies such as benchmark mutation (converting formal issues to chat-style queries based on real-world IDE usage) or blind subtask probing (e.g., file-path identification, function reproduction) have exposed limitations in both test representativeness and the real-world transferability of current LLM capabilities (Garg et al., 10 Oct 2025, Liang et al., 14 Jun 2025).
  • Benchmarks for Continual and Experience-Driven Learning: Datasets such as SWE-Bench-CL and experience-bank augmentations now enable measurement of both plasticity and retention, critical for agents that must learn across evolving software histories (Joshi et al., 13 Jun 2025, Chen et al., 31 Jul 2025).

Continued advances in SWE-bench Verified and derivatives will require dynamic, contamination-resistant task updates, deeper test–specification validation, and more representative modeling of real user queries. In summary, despite open challenges, SWE-bench Verified Issues have been pivotal in establishing empirical, reproducible, and meaningful standards for software engineering agents, driving methodological rigor and enabling measurable progress in automated programming (Jimenez et al., 2023, Zan et al., 26 Aug 2024, Ma et al., 1 Nov 2024, Wang et al., 19 Mar 2025, Yu et al., 10 Jun 2025, Jain et al., 9 Apr 2025, Badertdinov et al., 26 May 2025, Liang et al., 14 Jun 2025, Chen et al., 21 Oct 2025, Garg et al., 10 Oct 2025, Joshi et al., 13 Jun 2025, Chen et al., 31 Jul 2025, Wang et al., 10 Sep 2025).
