
SWE-Bench+: Next-Gen Code Agent Benchmarks

Updated 4 January 2026
  • SWE-Bench+ is a framework that extends traditional benchmarks by integrating mutation-based task realism, expansive language support, and automated quality assurance.
  • It employs mutation operators and enhanced test augmentation to expose and correct performance overestimations in code agents on real-world repositories.
  • Its scalable design, covering diverse languages and repository types, enables representative, high-fidelity evaluation of software engineering agents.

SWE-Bench+ is an umbrella term for a new generation of software engineering agent benchmarks that derive from the original SWE-Bench framework and correct its shortcomings by introducing mutation-based task realism, rigorous quality assurance, and expanded linguistic, structural, and interactional coverage. These benchmarks address the systematic overestimation of agent capability and the dataset contamination seen in prior evaluations, offering more faithful, scalable, and technically challenging evaluation protocols for code agents operating on real-world repositories (Garg et al., 10 Oct 2025, Aleithan et al., 2024).

1. Foundations and Evolution of SWE-Bench+

The initial SWE-Bench benchmark measured code agent performance on 2,294 issue–pull request pairs from 12 Python repositories, requiring models to edit codebases so that GitHub-tracked issues are resolved under execution-based validation (Jimenez et al., 2023). Although SWE-Bench established critical principles for repository-level bug fixing and patch validation, analyses identified key flaws: solution leakage (agents exploiting solutions present in issue descriptions), inadequate and non-exhaustive test coverage, and potential data leakage from pre-training overlap (Aleithan et al., 2024, Wang et al., 19 Mar 2025).
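
At its core, this execution-based validation is a fail-to-pass check: the issue-reproducing tests must fail on the unpatched base commit and pass once a candidate patch is applied. The following is a minimal sketch of that loop under assumed inputs; the repository path, patch file, and test identifiers are illustrative placeholders rather than the benchmark's actual harness.

```python
# Minimal sketch of SWE-Bench-style execution-based validation (fail-to-pass check).
import subprocess

def run_tests(repo_dir: str, test_ids: list[str]) -> bool:
    """Return True if all selected tests pass in the given checkout."""
    result = subprocess.run(
        ["python", "-m", "pytest", *test_ids],
        cwd=repo_dir, capture_output=True, text=True,
    )
    return result.returncode == 0

def validate_patch(repo_dir: str, patch_file: str, fail_to_pass: list[str]) -> bool:
    """A candidate patch resolves the instance if the failing tests flip to passing."""
    # 1. The issue-reproducing tests must fail on the unpatched base commit.
    if run_tests(repo_dir, fail_to_pass):
        return False  # tests already pass, so the instance does not reproduce the bug
    # 2. Apply the model-generated patch.
    subprocess.run(["git", "apply", patch_file], cwd=repo_dir, check=True)
    # 3. The same tests must now pass for the patch to count as a resolution.
    return run_tests(repo_dir, fail_to_pass)
```

A full harness would typically also re-run the previously passing tests (a pass-to-pass check) to guard against regressions introduced by the patch.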

SWE-Bench+ extends the SWE-Bench paradigm in three directions:

  • Mutation-based task construction that rewrites formal issue queries into the terse, realistic requests developers actually send (Section 2).
  • Expanded coverage across languages, repositories, and derivative benchmarks (Section 3).
  • Automated quality assurance, including contamination filtering, test suite augmentation, and differential patch testing (Section 4).

2. Mutation-Based Task Construction and Realism

SWE-Bench+ introduces a formal benchmark mutation methodology. Let $Q_{\text{orig}}$ denote the set of idealized (formal) issue queries, and $T = \{\tau_1, \ldots, \tau_k\}$ a set of communication-pattern templates mined from IDE–chat telemetry (e.g., "Paste Error Only", "Direct Fix This") (Garg et al., 10 Oct 2025). Mutation operators $\mu_{\tau}$ transform each pair $(q_{\text{orig}}, \text{patch})$ into a mutated query $q_{\text{mut}}$, yielding a set of realistic, terse, sometimes under-specified queries. The semantic fidelity constraint

$$\mathrm{Sim}(q_{\text{orig}}, q_{\text{mut}}) = \cos\big(\phi(q_{\text{orig}}), \phi(q_{\text{mut}})\big) \geq 0.75$$

(where $\phi$ is a code-aware embedding) ensures that mutated queries remain technically aligned with the original issue in over 95% of cases.
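
This fidelity constraint can be implemented as a simple cosine-similarity filter over embeddings of the original and mutated queries. The sketch below assumes a generic embed() callable standing in for the code-aware embedding $\phi$ (the actual embedding model is not specified here); the 0.75 threshold follows the formula above.

```python
# Sketch of the semantic-fidelity filter applied to mutated queries.
import numpy as np

def cosine(u: np.ndarray, v: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def keep_mutation(q_orig: str, q_mut: str, embed, threshold: float = 0.75) -> bool:
    """Accept a mutated query only if Sim(q_orig, q_mut) >= threshold."""
    return cosine(embed(q_orig), embed(q_mut)) >= threshold

# Illustrative usage with a hypothetical "Direct Fix This"-style mutation:
# q_orig = "TypeError raised when DataFrame.merge receives an empty key list ..."
# q_mut  = "fix this: TypeError in merge"
# keep_mutation(q_orig, q_mut, embed=my_code_embedding_model.encode)
```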

Such benchmark mutation unambiguously demonstrates that formal SWE-Bench tasks can inflate agent resolve rates: for example, models on SWE-Bench Verified show up to 54% relative overestimation, with realistic-mutation success rates up to 36.5% lower. This effect is consistent across Python, TypeScript, and internal C# task sets (Δ ranging from 10–54%) (Garg et al., 10 Oct 2025).
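
The reported deltas can be read as the relative inflation of the formal-query resolve rate over the realistic-query one. The definition sketched below is an assumption consistent with the figures quoted above, and the numbers are illustrative rather than taken from any specific model.

```python
def relative_overestimation(resolve_formal: float, resolve_mutated: float) -> float:
    """Assumed definition: how much the formal-query resolve rate overstates
    the realistic (mutated-query) resolve rate, relative to the latter."""
    return (resolve_formal - resolve_mutated) / resolve_mutated

# Hypothetical example: resolving 50% of formal tasks but only 32.5% of mutated ones
# corresponds to roughly 54% relative overestimation.
print(round(relative_overestimation(0.50, 0.325), 3))  # 0.538
```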

3. Expanded Scope: Languages, Repositories, and Benchmarks

Multiple derivative and complementary benchmarks extend SWE-Bench+ coverage:

  • SWE-bench-java-verified reimplements the pipeline for Java, covering 6 repositories and 91 high-quality issues and validating the portability of the SWE-Bench methodology (Zan et al., 2024).
  • SWE-Sharp-Bench adapts the validated "pass→fail→pass" methodology for C#, with 150 complex patches from 17 repositories, enabling rigorous Python vs C# vs Java comparisons. SWE-Sharp-Bench reveals a significant performance gap, with C# tasks resolved at ~40% vs Python’s ~70%, attributable to larger, multi-hunk patches and more distributed codebases (Mhatre et al., 4 Nov 2025).
  • SWE-Bench++ scales benchmark generation to 11 languages and 3,971 repositories (11,133 instances), using a pipeline with environment synthesis, test oracle extraction, and quality assurance, along with hint-guided trajectory synthesis for failed tasks (Wang et al., 19 Dec 2025).
  • SWEE-Bench and SWA-Bench employ SetUpAgent for broad, historically accurate coverage, countering the distributional bias and issue contamination present in highly popular SWE-Bench repositories. Agent success rates drop by up to 40% on these suites, reflecting the increased realism and complexity of SWEE/SWA tasks (Vergopoulos et al., 10 Mar 2025).
| Benchmark | Language(s) | #Tasks | %Resolve (best agent) |
|---|---|---|---|
| SWE-Bench Verified | Python | 500 | ~66.6 (Claude 4) |
| SWE-Sharp-Bench | C# | 150 | ~47.3 (GPT-5) |
| SWE-bench-java | Java | 91 | ~9.89 (DeepSeek-V2) |
| SWE-Bench++ | 11 languages | 11,133 | 36.20 (Claude 4.5, @10) |

4. Robust Evaluation Protocols and Quality Assurance

SWE-Bench+ addresses prior benchmarking flaws by introducing:

  • Solution Leak and Data Contamination Filtering: Time-based filtering collects only post–LLM-cutoff issues, and automated solution-leak detection excludes queries revealing full or partial patches in issue text/comments (Aleithan et al., 2024); a minimal filtering sketch follows this list.
  • Rigorous Test Suite Augmentation: UTBoost, leveraging LLM-driven UTGenerator, synthesizes additional unit tests for instances detected as insufficiently covered by original human-written tests. For SWE-Bench Lite and Verified, test suite augmentation changed agent leaderboard ranks in 40.9% and 24.4% of cases, respectively (Yu et al., 10 Jun 2025).
  • Differential Patch Testing: PatchDiff exposes behavioral discrepancies missed by standard testing, identifying ~29.6% of plausible patches as behaviorally divergent, with up to 6.2 percentage points inflation in reported resolve rates on SWE-Bench Verified (Wang et al., 19 Mar 2025).
  • Structured Labeling (Clarity, Coverage, Effort): SPICE uses context-aware navigation and multi-pass LLM consensus strategies to annotate issue clarity, test coverage, and effort at scale, supporting cost-effective dataset creation. SPICE-Bench, with 6,802 labeled instances from 291 projects, enables explainable, reproducible benchmarking (Bhatia et al., 12 Jul 2025).
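
As referenced in the first bullet above, the contamination filter combines a training-cutoff date check with a verbatim-leak heuristic. The sketch below is a minimal version under assumed field names (created_at, issue_text, gold_patch); the published pipeline's schema and leak heuristics may differ.

```python
# Sketch of solution-leak and data-contamination filtering for benchmark instances.
from datetime import datetime, timezone

LLM_CUTOFF = datetime(2023, 4, 30, tzinfo=timezone.utc)  # illustrative training cutoff

def added_lines(gold_patch: str) -> list[str]:
    """Non-empty lines introduced by the gold patch ('+' lines in a unified diff)."""
    return [
        line[1:].strip()
        for line in gold_patch.splitlines()
        if line.startswith("+") and not line.startswith("+++") and line[1:].strip()
    ]

def is_clean(instance: dict) -> bool:
    """Keep an instance only if it post-dates the cutoff and its issue text does
    not reveal the fix verbatim."""
    if instance["created_at"] <= LLM_CUTOFF:
        return False  # potential pre-training overlap
    issue = instance["issue_text"]
    # Any patched line quoted verbatim in the issue counts as a solution leak.
    return not any(line in issue for line in added_lines(instance["gold_patch"]))

# filtered = [inst for inst in instances if is_clean(inst)]  # `instances` supplied upstream
```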

5. Quantitative Findings and Cross-Benchmark Agent Performance

Empirical analysis across SWE-Bench+ variants reveals systematic performance overestimation in traditional benchmarks and marked agent weaknesses on realistic, complex, or multilingual tasks:

  • Mutation-Induced Delta (Δ): In mutation-based evaluation, top-performing models experienced Δ up to +53.8% on TypeScript and +36.5% on Python, evidencing systematic overestimation from formal queries (Garg et al., 10 Oct 2025).
  • Leakage and Insufficient Testing: Solution leakage impacted 32.67% of successes; weak test suites caused an additional 31.08% of patches to be incorrectly labeled as passed. Revised, strictly validated success rates fell from 12.47% to 3.97% for SWE-Agent+GPT-4 (Aleithan et al., 2024).
  • Multilingual and Structural Challenge: Agents perform up to 40% worse on SWEE-Bench/SWA-Bench compared to SWE-Bench due to lower issue clarity, higher patch complexity, and reduced pretraining coverage (Vergopoulos et al., 10 Mar 2025).
  • Augmentation/Correction Impact: Automated test augmentation and parser corrections uncovered 405 erroneous patch labels across SWE-Bench Lite/Verified, substantially perturbing leaderboard positions (Yu et al., 10 Jun 2025); a relabeling sketch follows this list.
  • Tool Synthesis and Self-Evolving Agents: Live-SWE-agent demonstrates that on-the-fly tool creation boosts resolve rates by up to 22.6 percentage points for strong LLMs, outperforming static, offline self-evolving agents (Xia et al., 17 Nov 2025).
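
The augmentation-driven relabeling referenced above can be summarized as a two-stage check: a patch that satisfies the original test suite but fails newly generated tests was only plausibly correct and loses its "resolved" label. The sketch below assumes pytest-style test identifiers and is not the UTBoost implementation itself.

```python
# Sketch of relabeling plausible patches against an augmented test suite.
import subprocess

def run_tests(repo_dir: str, test_ids: list[str]) -> bool:
    """Return True if all selected tests pass in the patched checkout."""
    result = subprocess.run(["python", "-m", "pytest", *test_ids],
                            cwd=repo_dir, capture_output=True, text=True)
    return result.returncode == 0

def relabel(repo_dir: str, original_tests: list[str], augmented_tests: list[str]) -> str:
    if not run_tests(repo_dir, original_tests):
        return "not_resolved"              # never plausible in the first place
    if run_tests(repo_dir, augmented_tests):
        return "resolved"                  # survives the strengthened test oracle
    return "relabeled_not_resolved"        # plausible-but-incorrect patch exposed
```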

6. Limitations, Implications, and Strategic Recommendations

SWE-Bench+ remains limited in coverage of all programming ecosystems (notable gaps in Go, Rust, and JavaScript for some benchmarks), and suffers from residual weaknesses in test suite robustness and some persistent information loss in mutated queries (Garg et al., 10 Oct 2025, Aleithan et al., 2024). Single-turn mutation models do not yet capture multi-turn developer–agent interactions, with future work required on dialog-based mutation and more dynamic task templates.

Key implications include:

  • Evaluation Fidelity: Mutation-based realism and augmented testing collectively calibrate agent assessment to practical effectiveness, not merely compliance with idealized, leak-prone benchmarks.
  • Cross-Language Calibration: Consistent protocols and semantic similarity metrics support rigorous comparisons across Python, Java, C#, and TypeScript tasks.
  • Scalable Dataset Creation: Automated, template-driven pipelines (SWE-Bench++, SPICE, SetUpAgent) enable large-scale, representative benchmarks, mitigating overfitting and distributional mismatch.
  • Benchmark Maintenance: Leaderboards and agent rankings should be continuously re-evaluated as test suites and data augmentation protocols evolve.
  • Generalization Potential: The mutation, augmentation, and patch-differentiation strategies articulated in SWE-Bench+ can be generalized to other SE benchmarks (HumanEval, CodeContests) and integrate with active learning or human-in-the-loop protocols.

7. Future Directions and Open Research Challenges

A plausible implication is that SWE-Bench+ does not constitute a fixed benchmark but an extensible framework, adaptable to new languages, tooling (e.g., CI/CD integration), and multi-turn interaction styles, while continuously evolving through automated harvesting and task mutation. Promising directions include:

  • Multi-turn mutation simulating real developer–agent dialog.
  • Cross-lingual embedding-based calibration for true multilingual difficulty alignment.
  • Automated detection of solution leaks and test weaknesses at large scale.
  • Integration of self-evolving agent scaffolds and tool-synthesis feedback loops in agent benchmarking and evaluation (Xia et al., 17 Nov 2025).
  • Open-source, crowd-maintained repositories for new test cases, issue types, and differentiated evaluation modes.

SWE-Bench+ thus establishes a rigorous, scalable, and technically robust paradigm for evaluating software engineering agents on repository-level coding challenges, catalyzing progress toward genuinely autonomous, high-reliability code generation systems.
