
SWE-Bench+: Next-Gen Code Agent Benchmarks

Updated 4 January 2026
  • SWE-Bench+ is a framework that extends traditional benchmarks by integrating mutation-based task realism, expansive language support, and automated quality assurance.
  • It employs mutation operators and enhanced test augmentation to expose and correct performance overestimations in code agents on real-world repositories.
  • Its scalable design, covering diverse languages and repository types, enables representative, high-fidelity evaluation of software engineering agents.

SWE-Bench+ is an umbrella term for a new generation of software engineering agent benchmarks that derive from the original SWE-Bench framework and correct its shortcomings by introducing mutation-based task realism, rigorous quality assurance, and expanded linguistic, structural, and interactional coverage. These benchmarks address the systematic overestimation of agent capability and the dataset contamination seen in prior evaluations, offering more faithful, scalable, and technically challenging evaluation protocols for code agents operating on real-world repositories (Garg et al., 10 Oct 2025, Aleithan et al., 2024).

1. Foundations and Evolution of SWE-Bench+

The initial SWE-Bench benchmark measured code agent performance on 2,294 issue–pull request pairs from 12 Python repositories, requiring models to edit codebases so that GitHub-tracked issues are resolved under execution-based validation (Jimenez et al., 2023). Although SWE-Bench established critical principles for repository-level bug fixing and patch validation, analyses identified key flaws: solution leakage (agents exploiting solutions present in issue descriptions), inadequate and non-exhaustive test coverage, and potential data leakage from pre-training overlap (Aleithan et al., 2024, Wang et al., 19 Mar 2025).
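
At its core, this execution-based validation is a fail-to-pass check: the issue-reproducing tests must fail on the unpatched base commit and pass once a candidate patch is applied. The following is a minimal sketch of that loop under assumed inputs; the repository path, patch file, and test identifiers are illustrative placeholders rather than the benchmark's actual harness.

```python
# Minimal sketch of SWE-Bench-style execution-based validation (fail-to-pass check).
import subprocess

def run_tests(repo_dir: str, test_ids: list[str]) -> bool:
    """Return True if all selected tests pass in the given checkout."""
    result = subprocess.run(
        ["python", "-m", "pytest", *test_ids],
        cwd=repo_dir, capture_output=True, text=True,
    )
    return result.returncode == 0

def validate_patch(repo_dir: str, patch_file: str, fail_to_pass: list[str]) -> bool:
    """A candidate patch resolves the instance if the failing tests flip to passing."""
    # 1. The issue-reproducing tests must fail on the unpatched base commit.
    if run_tests(repo_dir, fail_to_pass):
        return False  # tests already pass, so the instance does not reproduce the bug
    # 2. Apply the model-generated patch.
    subprocess.run(["git", "apply", patch_file], cwd=repo_dir, check=True)
    # 3. The same tests must now pass for the patch to count as a resolution.
    return run_tests(repo_dir, fail_to_pass)
```

A full harness would typically also re-run the previously passing tests (a pass-to-pass check) to guard against regressions introduced by the patch.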

SWE-Bench+ extends the SWE-Bench paradigm in three directions:

  • Mutation-based task construction that rewrites formal issue queries into the terse, realistic requests developers actually send (Section 2).
  • Expanded coverage across languages, repositories, and derivative benchmarks (Section 3).
  • Automated quality assurance, including contamination filtering, test suite augmentation, and differential patch testing (Section 4).

2. Mutation-Based Task Construction and Realism

SWE-Bench+ introduces a formal benchmark mutation methodology. Let $Q_{\text{orig}}$ denote the set of idealized (formal) issue queries, and $T = \{\tau_1, \ldots, \tau_k\}$ a set of communication-pattern templates mined from IDE–chat telemetry (e.g., "Paste Error Only", "Direct Fix This") (Garg et al., 10 Oct 2025). Mutation operators $\mu_{\tau}$ transform each pair $(q_{\text{orig}}, \text{patch})$ into a mutated query $q_{\text{mut}}$, yielding a set of realistic, terse, sometimes under-specified queries. The semantic fidelity constraint

$$\mathrm{Sim}(q_{\text{orig}}, q_{\text{mut}}) = \cos\big(\phi(q_{\text{orig}}), \phi(q_{\text{mut}})\big) \geq 0.75$$

(where $\phi$ is a code-aware embedding) ensures that mutated queries remain technically aligned with the original issue in over 95% of cases.
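
This fidelity constraint can be implemented as a simple cosine-similarity filter over embeddings of the original and mutated queries. The sketch below assumes a generic embed() callable standing in for the code-aware embedding $\phi$ (the actual embedding model is not specified here); the 0.75 threshold follows the formula above.

```python
# Sketch of the semantic-fidelity filter applied to mutated queries.
import numpy as np

def cosine(u: np.ndarray, v: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def keep_mutation(q_orig: str, q_mut: str, embed, threshold: float = 0.75) -> bool:
    """Accept a mutated query only if Sim(q_orig, q_mut) >= threshold."""
    return cosine(embed(q_orig), embed(q_mut)) >= threshold

# Illustrative usage with a hypothetical "Direct Fix This"-style mutation:
# q_orig = "TypeError raised when DataFrame.merge receives an empty key list ..."
# q_mut  = "fix this: TypeError in merge"
# keep_mutation(q_orig, q_mut, embed=my_code_embedding_model.encode)
```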

Such benchmark mutation unambiguously demonstrates that formal SWE-Bench tasks can inflate agent resolve rates: for example, models on SWE-Bench Verified show up to 54% relative overestimation, with realistic-mutation success rates up to 36.5% lower. This effect is consistent across Python, TypeScript, and internal C# task sets (Δ ranging from 10–54%) (Garg et al., 10 Oct 2025).
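
The reported deltas can be read as the relative inflation of the formal-query resolve rate over the realistic-query one. The definition sketched below is an assumption consistent with the figures quoted above, and the numbers are illustrative rather than taken from any specific model.

```python
def relative_overestimation(resolve_formal: float, resolve_mutated: float) -> float:
    """Assumed definition: how much the formal-query resolve rate overstates
    the realistic (mutated-query) resolve rate, relative to the latter."""
    return (resolve_formal - resolve_mutated) / resolve_mutated

# Hypothetical example: resolving 50% of formal tasks but only 32.5% of mutated ones
# corresponds to roughly 54% relative overestimation.
print(round(relative_overestimation(0.50, 0.325), 3))  # 0.538
```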

3. Expanded Scope: Languages, Repositories, and Benchmarks

Multiple derivative and complementary benchmarks extend SWE-Bench+ coverage:

  • SWE-bench-java-verified reimplements the pipeline for Java, covering 6 repositories and 91 high-quality issues and validating the portability of the SWE-Bench methodology (Zan et al., 2024).
  • SWE-Sharp-Bench adapts the validated "pass→fail→pass" methodology for C#, with 150 complex patches from 17 repositories, enabling rigorous Python vs C# vs Java comparisons. SWE-Sharp-Bench reveals a significant performance gap, with C# tasks resolved at ~40% vs Python’s ~70%, attributable to larger, multi-hunk patches and more distributed codebases (Mhatre et al., 4 Nov 2025).
  • SWE-Bench++ scales benchmark generation to 11 languages and 3,971 repositories (11,133 instances), using a pipeline with environment synthesis, test oracle extraction, and quality assurance, along with hint-guided trajectory synthesis for failed tasks (Wang et al., 19 Dec 2025).
  • SWEE-Bench and SWA-Bench employ SetUpAgent for broad, historically accurate coverage, countering the distributional bias and issue contamination present in highly popular SWE-Bench repositories. Agent success rates drop by up to 40% on these suites, reflecting the increased realism and complexity of SWEE/SWA tasks (Vergopoulos et al., 10 Mar 2025).
| Benchmark | Language(s) | #Tasks | %Resolve (best agent) |
|---|---|---|---|
| SWE-Bench Verified | Python | 500 | ~66.6 (Claude 4) |
| SWE-Sharp-Bench | C# | 150 | ~47.3 (GPT-5) |
| SWE-bench-java | Java | 91 | ~9.89 (DeepSeek-V2) |
| SWE-Bench++ | 11 languages | 11,133 | 36.20 (Claude 4.5, @10) |

4. Robust Evaluation Protocols and Quality Assurance

SWE-Bench+ addresses prior benchmarking flaws by introducing:

  • Solution Leak and Data Contamination Filtering: Time-based filtering collects only post–LLM-cutoff issues, and automated solution-leak detection excludes queries revealing full or partial patches in issue text/comments (Aleithan et al., 2024); a minimal filtering sketch follows this list.
  • Rigorous Test Suite Augmentation: UTBoost, leveraging LLM-driven UTGenerator, synthesizes additional unit tests for instances detected as insufficiently covered by original human-written tests. For SWE-Bench Lite and Verified, test suite augmentation changed agent leaderboard ranks in 40.9% and 24.4% of cases, respectively (Yu et al., 10 Jun 2025).
  • Differential Patch Testing: PatchDiff exposes behavioral discrepancies missed by standard testing, identifying ~29.6% of plausible patches as behaviorally divergent, with up to 6.2 percentage points inflation in reported resolve rates on SWE-Bench Verified (Wang et al., 19 Mar 2025).
  • Structured Labeling (Clarity, Coverage, Effort): SPICE uses context-aware navigation and multi-pass LLM consensus strategies to annotate issue clarity, test coverage, and effort at scale, supporting cost-effective dataset creation. SPICE-Bench, with 6,802 labeled instances from 291 projects, enables explainable, reproducible benchmarking (Bhatia et al., 12 Jul 2025).
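
As referenced in the first bullet above, the contamination filter combines a training-cutoff date check with a verbatim-leak heuristic. The sketch below is a minimal version under assumed field names (created_at, issue_text, gold_patch); the published pipeline's schema and leak heuristics may differ.

```python
# Sketch of solution-leak and data-contamination filtering for benchmark instances.
from datetime import datetime, timezone

LLM_CUTOFF = datetime(2023, 4, 30, tzinfo=timezone.utc)  # illustrative training cutoff

def added_lines(gold_patch: str) -> list[str]:
    """Non-empty lines introduced by the gold patch ('+' lines in a unified diff)."""
    return [
        line[1:].strip()
        for line in gold_patch.splitlines()
        if line.startswith("+") and not line.startswith("+++") and line[1:].strip()
    ]

def is_clean(instance: dict) -> bool:
    """Keep an instance only if it post-dates the cutoff and its issue text does
    not reveal the fix verbatim."""
    if instance["created_at"] <= LLM_CUTOFF:
        return False  # potential pre-training overlap
    issue = instance["issue_text"]
    # Any patched line quoted verbatim in the issue counts as a solution leak.
    return not any(line in issue for line in added_lines(instance["gold_patch"]))

# filtered = [inst for inst in instances if is_clean(inst)]  # `instances` supplied upstream
```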

5. Quantitative Findings and Cross-Benchmark Agent Performance

Empirical analysis across SWE-Bench+ variants reveals systematic performance overestimation in traditional benchmarks and marked agent weaknesses on realistic, complex, or multilingual tasks:

  • Mutation-Induced Delta (Δ): In mutation-based evaluation, top-performing models experienced Δ up to +53.8% on TypeScript and +36.5% on Python, evidencing systematic overestimation from formal queries (Garg et al., 10 Oct 2025).
  • Leakage and Insufficient Testing: Solution leakage impacted 32.67% of successes; weak test suites caused an additional 31.08% of patches to be incorrectly labeled as passed. Revised, strictly validated success rates fell from 12.47% to 3.97% for SWE-Agent+GPT-4 (Aleithan et al., 2024).
  • Multilingual and Structural Challenge: Agents perform up to 40% worse on SWEE-Bench/SWA-Bench compared to SWE-Bench due to lower issue clarity, higher patch complexity, and reduced pretraining coverage (Vergopoulos et al., 10 Mar 2025).
  • Augmentation/Correction Impact: Automated test augmentation and parser corrections uncovered 405 erroneous patch labels across SWE-Bench Lite/Verified, substantially perturbing leaderboard positions (Yu et al., 10 Jun 2025); a relabeling sketch follows this list.
  • Tool Synthesis and Self-Evolving Agents: Live-SWE-agent demonstrates that on-the-fly tool creation boosts resolve rates by up to 22.6 percentage points for strong LLMs, outperforming static, offline self-evolving agents (Xia et al., 17 Nov 2025).
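
The augmentation-driven relabeling referenced above can be summarized as a two-stage check: a patch that satisfies the original test suite but fails newly generated tests was only plausibly correct and loses its "resolved" label. The sketch below assumes pytest-style test identifiers and is not the UTBoost implementation itself.

```python
# Sketch of relabeling plausible patches against an augmented test suite.
import subprocess

def run_tests(repo_dir: str, test_ids: list[str]) -> bool:
    """Return True if all selected tests pass in the patched checkout."""
    result = subprocess.run(["python", "-m", "pytest", *test_ids],
                            cwd=repo_dir, capture_output=True, text=True)
    return result.returncode == 0

def relabel(repo_dir: str, original_tests: list[str], augmented_tests: list[str]) -> str:
    if not run_tests(repo_dir, original_tests):
        return "not_resolved"              # never plausible in the first place
    if run_tests(repo_dir, augmented_tests):
        return "resolved"                  # survives the strengthened test oracle
    return "relabeled_not_resolved"        # plausible-but-incorrect patch exposed
```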

6. Limitations, Implications, and Strategic Recommendations

SWE-Bench+ remains limited in coverage of all programming ecosystems (notable gaps in Go, Rust, and JavaScript for some benchmarks), and suffers from residual weaknesses in test suite robustness and some persistent information loss in mutated queries (Garg et al., 10 Oct 2025, Aleithan et al., 2024). Single-turn mutation models do not yet capture multi-turn developer–agent interactions, with future work required on dialog-based mutation and more dynamic task templates.

Key implications include:

  • Evaluation Fidelity: Mutation-based realism and augmented testing collectively calibrate agent assessment to practical effectiveness, not merely compliance with idealized, leak-prone benchmarks.
  • Cross-Language Calibration: Consistent protocols and semantic similarity metrics support rigorous comparisons across Python, Java, C#, and TypeScript tasks.
  • Scalable Dataset Creation: Automated, template-driven pipelines (SWE-Bench++, SPICE, SetUpAgent) enable large-scale, representative benchmarks, mitigating overfitting and distributional mismatch.
  • Benchmark Maintenance: Leaderboards and agent rankings should be continuously re-evaluated as test suites and data augmentation protocols evolve.
  • Generalization Potential: The mutation, augmentation, and patch-differentiation strategies articulated in SWE-Bench+ can be generalized to other SE benchmarks (HumanEval, CodeContests) and integrate with active learning or human-in-the-loop protocols.

7. Future Directions and Open Research Challenges

A plausible implication is that SWE-Bench+ does not constitute a fixed benchmark but an extensible framework, adaptable to new languages, tooling (e.g., CI/CD integration), and multi-turn interaction styles, while continuously evolving through automated harvesting and task mutation. Promising directions include:

  • Multi-turn mutation simulating real developer–agent dialog.
  • Cross-lingual embedding-based calibration for true multilingual difficulty alignment.
  • Automated detection of solution leaks and test weaknesses at large scale.
  • Integration of self-evolving agent scaffolds and tool-synthesis feedback loops in agent benchmarking and evaluation (Xia et al., 17 Nov 2025).
  • Open-source, crowd-maintained repositories for new test cases, issue types, and differentiated evaluation modes.

SWE-Bench+ thus establishes a rigorous, scalable, and technically robust paradigm for evaluating software engineering agents on repository-level coding challenges, catalyzing progress toward genuinely autonomous, high-reliability code generation systems.
