
SWE-bench Verified Benchmark

Updated 1 July 2025
  • SWE-bench Verified is a curated benchmark of 500 real GitHub issues from 12 Python projects used to evaluate language models and agents on end-to-end software engineering tasks.
  • The benchmark evaluates systems by applying a generated code patch to the codebase and checking whether designated developer-written tests transition from failing to passing or remain passing.
  • While it has driven progress, SWE-bench Verified faces limitations such as weak test oracles and data contamination, steering research toward live, automated benchmarks for future evaluation.

The SWE-bench Verified Benchmark is a curated, realistic evaluation suite for LLMs and agent systems in the domain of automated software engineering. Designed to assess end-to-end capability in resolving real-world GitHub issues by generating functionally correct patches, it draws from the complex, multifaceted challenges found in large, actively developed open-source Python repositories.

1. Construction and Structure

SWE-bench Verified is a subset of the original SWE-bench benchmark, which consists of 2,294 software engineering tasks, each derived from actual GitHub issue reports and their corresponding merged pull requests across 12 widely adopted Python repositories (e.g., Django, Flask, scikit-learn). The Verified benchmark selects 500 instances through manual curation to increase reliability, focusing on issues with well-defined problem statements and strong, developer-written test coverage. For every task, the benchmark provides:

  • The textual issue statement describing the problem or feature request.
  • The full codebase snapshot at the time of the issue and fix.
  • The reference patch (from the merged PR) that resolved the issue.
  • The relevant tests contributed or modified in the PR, enabling automated validation of solutions.
  • Structured metadata, including test logs pre- and post-patch and contextual repository information.

The curated nature of SWE-bench Verified is intended to avoid the pitfalls of under-specified issues or inadequate test coverage that can plague automated program repair benchmarks.
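
Conceptually, each instance bundles the fields listed above into a single record. The following is a minimal sketch in Python; the field names are illustrative and do not necessarily match the released dataset schema exactly.

```python
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class SWEBenchInstance:
    """Illustrative shape of one SWE-bench Verified task.

    Field names are hypothetical; the released dataset uses its own schema.
    """
    instance_id: str        # identifier, e.g. a repo-plus-PR style string
    repo: str               # source repository, e.g. "django/django"
    base_commit: str        # codebase snapshot the patch must apply to
    problem_statement: str  # textual GitHub issue describing the bug/feature
    gold_patch: str         # reference diff from the merged pull request
    test_patch: str         # tests added or modified in that pull request
    fail_to_pass: List[str] = field(default_factory=list)   # tests that must flip to passing
    pass_to_pass: List[str] = field(default_factory=list)   # tests that must stay passing
    metadata: Dict[str, str] = field(default_factory=dict)  # env info, pre/post test logs, etc.
```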

2. Evaluation Protocol and Metrics

SWE-bench Verified adopts a primarily automatic, test-based validation mechanism. For each task, a candidate system is presented with the issue statement and access to the codebase. The system must generate a code patch. This patch is then applied to the original (buggy) codebase, and the suite of developer-written and contributed tests is executed. The patch is judged as successful if:

$$\forall t \in (T_{F \rightarrow P} \cup T_{P \rightarrow P}),\ \text{status}_{\text{after}}(t) = \text{pass}$$

Where:

  • $T_{F \rightarrow P}$: Tests that transition from fail to pass after applying the gold (reference) fix.
  • $T_{P \rightarrow P}$: Tests that must remain passing (preserved behavior).

The main quantitative metric is the resolved rate, i.e., the percentage of instances for which the generated patch leads to all designated tests passing.
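
A minimal sketch of this success criterion and the resolved-rate metric, assuming per-test outcomes have already been collected by running the designated test suite against the patched codebase (all helper names here are illustrative):

```python
from typing import Dict, Iterable, List

def instance_resolved(status_after: Dict[str, str],
                      fail_to_pass: Iterable[str],
                      pass_to_pass: Iterable[str]) -> bool:
    """True iff every designated test passes after the candidate patch is applied.

    `status_after` maps test identifiers to their post-patch outcome
    ("pass" / "fail" / "error"), mirroring the formula above.
    """
    designated = list(fail_to_pass) + list(pass_to_pass)
    return all(status_after.get(t) == "pass" for t in designated)

def resolved_rate(per_instance_results: List[bool]) -> float:
    """Main metric: percentage of instances whose patch resolves the issue."""
    if not per_instance_results:
        return 0.0
    return 100.0 * sum(per_instance_results) / len(per_instance_results)
```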

Due to the scale of codebases (averaging 3,010 files and 438,000 lines) and the nature of issues (requiring, on average, edits over 1.7 files and 3.0 functions, and changing 32.8 lines per patch), the benchmark is significantly more challenging than traditional code completion or function-level datasets.

3. Performance Trends and Baseline Results

Initial evaluations on SWE-bench and its Verified subset demonstrate the difficulty of the task for state-of-the-art models. For example, Claude 2 achieved only 1.96% resolved on the full benchmark in the original paper (2310.06770). Improvements since then have seen top open-source models such as SWE-Dev-32B reaching 36.6% (2506.07636), Skywork-SWE-32B achieving 38.0% (single run) and 47.0% with test-time scaling (2506.19290), and Llama3-SWE-RL-70B attaining 41.0% (2502.18449). Closed-source models like GPT-4o and Claude 3.7 Sonnet have reached even higher rates (typically above 50%).

A representative table of recent open-source models:

| Model | Params | Framework | Pass@1 (%) | Pass@K / Ensemble (%) |
| --- | --- | --- | --- | --- |
| SWE-Dev-32B | 32B | OpenHands | 36.6 | — |
| Skywork-SWE-32B | 32B | OpenHands | 38.0 | 47.0 (Best@8) |
| SWE-RL-70B | 70B | Pipeline | 41.0 | — |
| SWE-Gym-32B+Verifier | 32B | OpenHands | 32.0 | 42.8 (Pass@16) |
| R2E-Gym-32B (Hybrid) | 32B | R2E-Gym | 34.4 | 51.0 (Hybrid@26) |
| SWE-Fixer-72B | 72B | Pipeline | 30.2 | — |
| Lingma SWE-GPT-72B | 72B | SWESynInfer | 30.2 | 39.8 (Pass@3) |

Closed-source entries (Claude 3.7 Sonnet, GPT-4o, Amazon Q, etc.) cluster at and above the 60% range, often with more sophisticated agent and ensemble methods (2506.17208).

4. Technical and Evaluation Challenges

SWE-bench Verified targets the core challenges in full-repository code repair, such as:

  • Contextual reasoning: Models must process long, noisy issue descriptions and navigate vast codebases that far exceed context window limits, necessitating retrieval or search-based input selection (a minimal retrieval sketch appears after this list).
  • Cross-file and multi-component changes: Realistic issues demand patches touching multiple classes, files, or subsystems, complicating bug localization and repair composition.
  • Test-based validation limitations: Only code paths covered by tests can be evaluated. Weak or incomplete tests may fail to distinguish correct from incorrect solutions, leading to overestimated resolve rates.
  • Formatting and patch application: Generating patches that integrate smoothly with repository convention and that can be programmatically applied is non-trivial, with format or syntax errors common in model outputs.
  • Dependency and environment handling: Reproducible evaluation requires careful tracking of all software dependencies and historical environment state.
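
As a concrete illustration of the retrieval step referenced in the first bullet, the following dependency-free sketch ranks repository files by lexical overlap with the issue text. Production systems typically rely on BM25, embedding search, or agentic repository exploration instead; this is only a toy stand-in.

```python
import re
from pathlib import Path
from typing import List, Tuple

def _tokens(text: str) -> set:
    """Crude tokenizer: lower-cased identifiers and words."""
    return set(re.findall(r"[a-zA-Z_][a-zA-Z0-9_]+", text.lower()))

def localize_files(issue_text: str, repo_root: str, top_k: int = 5) -> List[Tuple[str, int]]:
    """Rank Python files by shared identifiers/words with the issue description.

    A toy stand-in for the retrieval/search-based input selection used to fit
    a ~438k-line repository into a bounded model context.
    """
    query = _tokens(issue_text)
    scored: List[Tuple[str, int]] = []
    for path in Path(repo_root).rglob("*.py"):
        try:
            overlap = len(query & _tokens(path.read_text(errors="ignore")))
        except OSError:
            continue  # skip unreadable files
        scored.append((str(path), overlap))
    return sorted(scored, key=lambda item: item[1], reverse=True)[:top_k]
```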

Recent research has highlighted several pitfalls in the evaluation protocol. Empirical analyses reveal that many passing patches are artifacts of solution leakage (issues or comments that directly reveal the fix) or insufficient test cases, inflating reported performance by as much as a factor of three (2410.06992). PatchDiff, an automated differential testing technique, suggests that up to 6.4 absolute percentage points of apparent accuracy gains may be illusory, caused by unsound or under-specified evaluation (2503.15223). UTBoost introduces automated test-case augmentation and a more robust evaluation parser, revising more than 24% of leaderboard entries (2506.09289).
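
The intuition behind differential patch testing can be sketched as follows: exercise both the candidate-patched and the reference-patched implementations on probe inputs and flag any behavioral divergence. This is a conceptual illustration only, not the PatchDiff implementation, which generates probes automatically at repository scale.

```python
from typing import Any, Callable, Iterable, List, Tuple

def behavioral_divergences(candidate_fn: Callable[..., Any],
                           reference_fn: Callable[..., Any],
                           probe_inputs: Iterable[tuple]) -> List[tuple]:
    """Return probe inputs on which the candidate- and reference-patched
    implementations disagree, either by return value or by exception type.
    """
    def run(fn: Callable[..., Any], args: tuple) -> Tuple[str, Any]:
        try:
            return ("value", fn(*args))
        except Exception as exc:  # an exception is also observable behavior
            return ("raises", type(exc).__name__)

    return [args for args in probe_inputs
            if run(candidate_fn, args) != run(reference_fn, args)]
```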

A leading controversy is data contamination—over 94% of the original SWE-bench instances predate the knowledge cutoff for major LLM training epochs, raising the possibility that high leaderboard scores partly result from direct or indirect memorization rather than generalizable reasoning skill (2410.06992, 2506.12286). Diagnostic experiments have shown that SOTA LLMs can guess buggy file paths from issue descriptions alone with 76% accuracy on SWE-bench Verified, far outstripping performance on other, more diverse or temporally disjoint datasets (2506.12286).
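
This diagnostic can be reproduced in spirit with a few lines: show a model only the issue text, ask it to name the most likely buggy file, and check whether that file appears among those edited by the reference patch. The `predict_buggy_path` argument below is a purely hypothetical stand-in for an LLM call.

```python
import re
from typing import Callable, Iterable, List

def files_in_patch(gold_patch: str) -> List[str]:
    """Extract file paths touched by a unified diff (lines like '+++ b/pkg/mod.py')."""
    return [m.group(1) for m in re.finditer(r"^\+\+\+ b/(\S+)", gold_patch, re.MULTILINE)]

def path_guess_accuracy(instances: Iterable,
                        predict_buggy_path: Callable[[str], str]) -> float:
    """Fraction of instances where a model, shown ONLY the issue text,
    names a file that the reference patch actually edits.

    High accuracy here is one signal of memorization / contamination.
    """
    instances = list(instances)
    hits = sum(
        1 for inst in instances
        if predict_buggy_path(inst.problem_statement).strip() in files_in_patch(inst.gold_patch)
    )
    return hits / len(instances) if instances else 0.0
```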

5. Architectures and Approaches on the Leaderboard

Submission analysis for SWE-bench Verified reveals a wide spectrum of system architectures, spanning:

  • Agent-based frameworks: Most SOTA results are achieved by agentic systems—either fixed-workflow (single or multi-agent) or scaffolded workflows allowing iterative reasoning, tool use, and multi-step correction.
  • Workflow/pipeline systems: Non-agentic designs use static, prompt-driven stepwise pipelines ("localize → patch → validate"), achieving moderate but not SOTA results.
  • Hybrid/ensemble strategies: Many teams employ multiple LLMs (often combining proprietary and open-source models) for task decomposition, patch ranking, or verification. Test-time scaling via multiple rollouts and (execution-based or LLM-based) verifiers is now standard for top performance (2504.07164); a minimal sketch of this pattern follows this list.
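
A minimal sketch of the Best@k pattern mentioned in the last bullet, assuming some candidate-patch generator and an execution-based (or LLM-based) verifier are available; both function arguments are placeholders for whatever agent framework and test harness a given team actually uses.

```python
from typing import Callable, List, Optional

def best_at_k(generate_patch: Callable[[int], str],
              verifier_score: Callable[[str], float],
              k: int = 8) -> Optional[str]:
    """Test-time scaling: sample k candidate patches and keep the one the
    verifier scores highest (e.g. fraction of designated tests passing,
    or an LLM-based ranking). Returns None if nothing usable is generated.
    """
    candidates: List[str] = [generate_patch(seed) for seed in range(k)]
    scored = [(verifier_score(patch), patch) for patch in candidates if patch]
    if not scored:
        return None
    return max(scored, key=lambda pair: pair[0])[1]
```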

The leaderboards are dominated by industry teams and companies (over 68% of Verified submissions), particularly those with access to strong proprietary LLMs, but open-source models and frameworks continue to underpin much of the innovation and serve as strong baselines (2506.17208).

6. Impact, Limitations, and Future Development

SWE-bench Verified has catalyzed rapid progress and wide adoption for benchmarking code agents and LLMs but is now recognized to have critical limitations:

  • Benchmark staleness and contamination: Public release and lack of continual update allow LLM pretrainers to absorb benchmark data, undermining generalizability assessments.
  • Underspecified or weak oracles: Many passing solutions exploit poorly specified or incomplete tests or solution leakage in issue descriptions (2410.06992, 2503.15223, 2506.09289).
  • Leaderboards may reflect memorization: As shown by file-path-prediction diagnostics, current scores overstate real-world, out-of-distribution engineering skill (2506.12286).

To address these issues, recent proposals include:

  • Developing live-updating, automated benchmarks (e.g., SWE-bench-Live (2505.23419), SWE-rebench (2505.20411)) that focus on more recent issues and avoid contamination.
  • Employing broader repository and language coverage (see Multi-SWE-bench and SWE-bench-java (2408.14354, 2504.02605)).
  • Integrating robust test augmentation pipelines, multi-resolution patch verification (e.g., PatchDiff, UTBoost), and transparent evaluation protocols (2503.15223, 2506.09289).
  • Moving toward agent training at scale using larger, diverse, verified datasets (e.g., SWE-Gym (2412.21139), R2E-Gym (2504.07164), SWE-smith (2504.21798), Skywork-SWE (2506.19290)).

A plausible implication is that future benchmarks will need to focus on contamination resistance, robust test oracles, and continual refresh rather than gold-standard curation alone to sustain their value to the research community.

7. Tabular Summary of Key Properties

| Property | SWE-bench Verified |
| --- | --- |
| # Instances | 500 |
| Selection | Manual curation from 12 Python repos |
| Eval Protocol | Issue + codebase → patch → test suite pass |
| Main Metric | % resolved (tests passed per instance) |
| Agentic support | Yes (many top entries are agent-based) |
| Code diversity | Large, real-world codebases, multi-file |
| Known limits | Test oracle weakness, data contamination |
| Update policy | Static (as of 2025), being superseded by live/auto benchmarks |

References to Key Results and Challenges

SWE-bench Verified remains a seminal benchmark for automated software engineering, but its limitations—particularly regarding contamination, test sufficiency, and representativeness—have shaped the next phase of research into live, dynamic, and fully automated benchmarks. It provides a foundation, a set of reference challenges, and a historical record against which progress and future innovation can be measured.
