SWE-Bench++: Automated Multilingual Evaluation

Updated 17 February 2026

The paper introduces SWE-Bench++, a scalable framework that mitigates solution leakage and contamination through automated, multilingual evaluation of code-generation tasks.
It employs a fully automated pipeline for programmatically sourcing GitHub PRs, synthesizing reproducible environments, and augmenting test oracles.
Benchmark analyses reveal significant performance drops on diversified, multilingual tasks, underscoring challenges in current LLM code-generation evaluation.

SWE-Bench++ is an automated, large-scale, multilingual framework and dataset for evaluating and advancing code-generation systems on repository-level software engineering tasks. Originating as a response to key limitations and contamination vulnerabilities in the original Python-only SWE-bench, SWE-Bench++ now designates a family of benchmarks and pipelines that prioritize data cleanliness, scalability, automatic environment synthesis, rigorous multi-language coverage, and robustness against memorization effects in modern LLMs. Key contributions include methodological advances in dataset construction, environment orchestration, contamination control, test oracle augmentation, and evaluation metrics, with empirical evidence demonstrating both substantial benchmark difficulty and the pitfalls of overfitting to static or narrow test suites.

1. Motivation and Evolution Beyond the Original SWE-Bench

The release of the original SWE-bench established a standard for assessing LLMs on real-world GitHub issues and their corresponding pull requests. However, systematic reviews exposed several critical flaws: solution leakage (where explicit patch solutions appear in issue text/comments), test insufficiency (where weak oracle tests let incorrect patches “pass”), and widespread data contamination (with over 94% of tasks predating major LLMs' knowledge cutoffs, risking memorization) (Aleithan et al., 2024). Large-scale empirical audits revealed that up to 32.67% of “successful” agent solutions simply copied leaked answers, and another 31.08% were accepted due to weak test coverage.

Multiple works confirmed these issues and highlighted additional distributional mismatches: the small number of curated repositories (e.g., 12 in SWE-bench) led to unrealistic overestimation of model performance, while the Python-centric focus ignored the broader polyglot context of contemporary software engineering (Vergopoulos et al., 10 Mar 2025, Wang et al., 19 Dec 2025, Zan et al., 3 Apr 2025). To address these issues, the SWE-Bench++ agenda emerged around four pillars:

Automated, scalable programmatic sourcing of GitHub PRs across many languages;
Execution-based evaluation using historical environment synthesis;
Strict post–knowledge cutoff task filtering to block contamination;
Comprehensive augmentation of oracles and metrics to support robust, multi-level evaluation.

2. Construction and Pipeline Architecture

SWE-Bench++ pipelines are fully automated and structured in four main stages (Wang et al., 19 Dec 2025):

Programmatic Sourcing: The system ingests a firehose of merged GitHub PRs, applying filters for repository popularity, recency, activity, and test modification. This yields tens of thousands of candidate PRs.
Environment Synthesis: Using a library of language-specific Dockerfile templates, LLM-driven planning, and iterative build/test feedback, the pipeline automatically constructs reproducible containerized environments, ensuring deterministic builds and test execution across “base,” “before,” and “after” states of each PR.
Test Oracle Extraction & Augmentation: The pipeline distinguishes between regression (bug-fix) and feature-request scenarios. Adaptive log parsers, including LLM-generated scripts and known regex frameworks, robustly extract per-test pass/fail judgments. To improve test thoroughness, pipelines such as UTBoost run LLM-based test synthesis (UTGenerator) to generate new test cases that increase coverage of code paths, identifying and correcting insufficiently tested benchmark instances (Yu et al., 10 Jun 2025).
Quality Assurance and Filtering: Multi-layer AutoQA modules enforce environment determinism, oracle consistency, semantic alignment (using LLM-Judge for clarity and test-issue linkage), and remove infra-related false negatives.
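The role of the "before" and "after" states in building an execution-based oracle can be illustrated with a minimal sketch. It borrows the fail-to-pass / pass-to-pass convention from the original SWE-bench; the function names, the "pass"/"fail" status strings, and the treatment of tests missing from the "before" run are assumptions made for illustration, not the pipeline's actual implementation.

```python
def build_oracle(before: dict[str, str], after: dict[str, str]) -> dict[str, list[str]]:
    """Derive a state-differential oracle from per-test outcomes.

    `before` and `after` map test names to "pass" or "fail" (hypothetical
    encoding). A test absent from `before` is treated as not passing in the
    pre-PR state, which is an assumption covering newly added tests.
    """
    # Tests that the reference patch turns from failing/missing into passing.
    fail_to_pass = sorted(
        name for name, status in after.items()
        if status == "pass" and before.get(name) != "pass"
    )
    # Tests that passed before and must keep passing (regression guard).
    pass_to_pass = sorted(
        name for name, status in after.items()
        if status == "pass" and before.get(name) == "pass"
    )
    return {"fail_to_pass": fail_to_pass, "pass_to_pass": pass_to_pass}


def is_resolved(oracle: dict[str, list[str]], candidate: dict[str, str]) -> bool:
    """A candidate patch resolves the task only if every oracle test passes."""
    if not oracle["fail_to_pass"]:
        # Without a behavior-changing test the task carries no signal.
        return False
    required = oracle["fail_to_pass"] + oracle["pass_to_pass"]
    return all(candidate.get(name) == "pass" for name in required)
```

Under this sketch, a candidate PR whose reference patch yields an empty fail-to-pass set would be discarded during filtering, since no test distinguishes the fixed state from the broken one.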

The resulting dataset comprises over 11,000 validated instances from nearly 4,000 repositories, spanning 11 major languages: Python, Java, Go, TypeScript, JavaScript, Rust, C, C++, Ruby, PHP, and C# (Wang et al., 19 Dec 2025).

3. Contamination-Mitigation and Robustness Principles

Given acute risks of LLM memorization, SWE-Bench++ integrates methods to minimize contamination and benchmark overfitting. These include:

Temporal Partitioning: New benchmark tasks are drawn strictly from PRs dated after the relevant models' training cutoffs, preventing pretraining exposure (Aleithan et al., 2024).
Repository and Task Diversity: Expansion to thousands of repositories reduces repository-bias memorization. Calibration analyses using file-path ID and function reproduction tasks show significant performance drops (up to 20–30pp) for agents on out-of-benchmark repos, confirming that prior high results were partly due to data overlap (Liang et al., 14 Jun 2025).
Oracle Augmentation: Automated LLM test case synthesis (UTGenerator, UTBoost) increases oracle rigor by injecting new test cases around uncovered or insufficiently-checked code regions (Yu et al., 10 Jun 2025).
Manual and Semi-Automatic Validation: Expert annotators vet task quality, test adequacy, and gold-patch correctness, sometimes with consensus voting, to alleviate residual ambiguities.
Active Trajectory Synthesis for Hard Cases: For tasks where all current models fail, an active pipeline guides new attempts with contextual hints and subsequently rewrites solution traces to eliminate hint-based artifacts—producing “frontier” training/evaluation instances (Wang et al., 19 Dec 2025).
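Temporal partitioning, the first principle in the list above, reduces to a date filter. The sketch below is one possible reading: the `Task` fields, the function name, and the choice to compare against the latest cutoff among all evaluated models are illustrative assumptions, not details taken from the cited papers.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Task:
    task_id: str        # hypothetical identifier
    repo: str           # hypothetical "owner/name" string
    pr_merged_on: date  # date the source PR was merged


def post_cutoff_tasks(tasks: list[Task], model_cutoffs: dict[str, date]) -> list[Task]:
    """Keep only tasks merged strictly after every model's knowledge cutoff.

    Using the latest cutoff is the conservative choice: a task retained here
    cannot have appeared in the pretraining data of any listed model.
    """
    if not model_cutoffs:
        raise ValueError("at least one model cutoff is required")
    latest_cutoff = max(model_cutoffs.values())
    return [t for t in tasks if t.pr_merged_on > latest_cutoff]
```

Filtering against the latest cutoff shrinks the task pool as newer models are added, which is one reason a continuously refreshed source of PRs matters for this design.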

4. Benchmark Composition, Task Types, and Metrics

SWE-Bench++ instances include both bug fixes and feature requests (state-differential tasks), with domain and language coverage far exceeding earlier benchmarks. Representative statistics (Wang et al., 19 Dec 2025):

11,133 validated tasks from 3,971 repositories (prior SWE-bench: 2,294 tasks from 12 repos).
Per-language yields range from Python (41%) and Go (41%) down to C# (10%) and C (9.5%).
Task types: ∼61% bug-fix, ∼31% feature, remainder refactor or performance.
Difficulty spectrum: single-file patches to multi-file, multi-hundred-line refactorings; significant coverage of application (SWA-Bench) as well as library (SWEE-Bench) scenarios (Vergopoulos et al., 10 Mar 2025).

Evaluation is performed with rigorous execution-based oracles. The canonical metric is pass@k, defined as: $\mathrm{pass@}k = 1 - \frac{\binom{n-c}{k}}{\binom{n}{k}}$ where $n$ is the number of independent samples and $c$ is the number of correct ones (Wang et al., 19 Dec 2025). Secondary metrics include exact match, solved location rate (file overlap), and semantic alignment scoring. For task validity, runs failing environment or oracle determinism are automatically filtered out.
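The pass@k formula above can be computed directly with exact binomial coefficients. The following is a minimal sketch; the function names and the choice to average per-task values into a benchmark-level score are assumptions for illustration.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k = 1 - C(n - c, k) / C(n, k).

    n: number of independent samples drawn for a task
    c: number of those samples that pass the oracle
    k: sample budget being evaluated (1 <= k <= n)
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # math.comb returns 0 when k > n - c, so the estimate is exactly 1.0
    # whenever any k-subset must contain a correct sample.
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(results: list[tuple[int, int]], k: int) -> float:
    """Average per-task pass@k over (n, c) pairs (an assumed aggregation)."""
    if not results:
        raise ValueError("results must not be empty")
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)
```

For example, with n = 10 samples of which c = 3 are correct, `pass_at_k(10, 3, 1)` evaluates to 1 − 7/10 = 0.3, and any k above 7 gives 1.0 because every such subset must include a correct sample.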

5. Empirical Analyses and Performance Baselines

Empirical evaluation on stratified subsets of SWE-Bench++ consistently reveals increased benchmark difficulty and stricter grading than prior Python-only benchmarks. For instance, pass@10 results on 1,782 tasks are as follows (Wang et al., 19 Dec 2025):

Model	pass@10 (%)
claude-sonnet-4.5	36.20
gpt-5-2025-08-07	34.57
gemini-2.5-pro	24.92
gpt-4o	16.89

Performance is highest on Python/Java and lowest on C/C++/Rust, reflecting the complexity and diversity of modern polyglot software. The expansion to less-documented, lower-popularity, and multi-language repositories produces up to 40% lower agent success rates compared to the original SWE-bench (Vergopoulos et al., 10 Mar 2025). Augmented test oracles and parser corrections further reduce inflated pass rates; for example, UTBoost identified erroneous “pass” labels in 28.4% of SWE-Bench Lite's cases, shifting 40.9% of leaderboard rankings (Yu et al., 10 Jun 2025). The introduction of multilingual test sets (Multi-SWE-bench) exposes persistent generalization gaps: SOTA agents that achieve 52% on Python drop to 4–8% on Go/Rust and <10% on JavaScript/TypeScript/C/C++ (Zan et al., 3 Apr 2025).

Fine-tuning experiments reveal that even modest additions of “hard,” high-diversity, and multilingual SWE-Bench++ trajectories can double or triple cross-lingual agent pass rates (Wang et al., 19 Dec 2025).

6. Methodological Innovations and Open Research Questions

SWE-Bench++ integrates several methodological advances:

Automated Environment Orchestration: LLM-driven planning, build/test orchestration, and log parsing (falling back to LLM code generation if regex parsing fails) (Wang et al., 19 Dec 2025).
Test Oracle Augmentation: UTGenerator and UTBoost automate new test creation based on LLM localization and synthesis, improving code path coverage and identifying label errors missed by legacy benchmarks (Yu et al., 10 Jun 2025).
Quality Assurance Layers: Determinism, oracle, semantic, and infrastructure checks enforce rigorous task validity.
Hint-Guided Trajectory Synthesis: Turning baseline failures into structured training traces with automated “hint” and “rewrite” passes, producing instances near the frontier of current model capability (Wang et al., 19 Dec 2025).
Entropy-Based Training Metrics: The Entropy Compression Hypothesis and HE-SNR metric offer a theoretically-motivated, context-robust method for mid-training model evaluation, outperforming standard perplexity especially under long-context window scaling (Wang et al., 28 Jan 2026).
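The regex-first, LLM-fallback log parsing mentioned in the first item above can be sketched as a simple parser chain. The two regular expressions target the verbose output styles of pytest (`path::test PASSED`) and `go test -v` (`--- PASS: TestName`); all function names are hypothetical, and the `fallback` callable merely stands in for an LLM-generated parser, which is not implemented here.

```python
import re
from typing import Callable, Optional

ParseResult = dict[str, str]  # test name -> "pass" | "fail"
Parser = Callable[[str], Optional[ParseResult]]

# pytest -v style line: "tests/test_x.py::test_a PASSED [ 50%]"
_PYTEST_LINE = re.compile(r"^(\S+::\S+)\s+(PASSED|FAILED|ERROR)\b", re.MULTILINE)
# go test -v style line: "--- PASS: TestAdd (0.00s)"
_GO_LINE = re.compile(r"^\s*--- (PASS|FAIL): (\S+)", re.MULTILINE)


def parse_pytest(log: str) -> Optional[ParseResult]:
    matches = _PYTEST_LINE.findall(log)
    if not matches:
        return None  # signal "this parser does not apply"
    return {name: "pass" if status == "PASSED" else "fail" for name, status in matches}


def parse_go(log: str) -> Optional[ParseResult]:
    matches = _GO_LINE.findall(log)
    if not matches:
        return None
    return {name: "pass" if status == "PASS" else "fail" for status, name in matches}


def parse_with_fallback(
    log: str,
    regex_parsers: list[Parser],
    fallback: Callable[[str], ParseResult],
) -> ParseResult:
    """Try each regex parser in order; defer to `fallback` if none matches."""
    for parser in regex_parsers:
        result = parser(log)
        if result:
            return result
    # Placeholder for the LLM-generated parser described in the text.
    return fallback(log)
```

Keeping the fallback behind a plain callable means the expensive path is only taken for logs that no known format recognizes, and its output can be checked by the same quality-assurance layers as the regex results.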

Ongoing questions involve enhancing adaptive thresholding for entropy-based metrics, cross-architecture validation (beyond MoE models), further hardening against contamination, and integrating human and LLM-in-the-loop semantic validation (Wang et al., 19 Dec 2025, Wang et al., 28 Jan 2026).

7. Impact, Limitations, and Future Directions

SWE-Bench++ establishes a new state of the art in scalable, contamination-resistant, polyglot evaluation for LLM-driven code agents. It reveals that prior benchmarks dramatically overstated practical reasoning capability due to solution leakage, contamination, test weakness, or distributional artifacts. By contrast, SWE-Bench++ exposes the persistent generalization gap, highlights the need for robust oracle augmentation, and demonstrates the necessity of massive and diverse repository/task coverage. Limitations remain: most notably, automated correctness proxies may still miss semantic or maintainability errors, and current LLM-Judge systems only approximate true gold labels (Wang et al., 19 Dec 2025).

Planned extensions include continuously refreshed “living” benchmarks, multi-modal evaluation (including UI/front-end tasks), expansion to new languages (e.g., Ruby, PHP, Kotlin), crowdsourced validation, development of more advanced RL-driven agents (Multi-SWE-RL), and integration of entropy-based training criteria (Wang et al., 19 Dec 2025, Zan et al., 3 Apr 2025, Wang et al., 28 Jan 2026). These directions aim to ensure that benchmark progress reflects robust, transferable, and semantically meaningful advances in code-generation research.