SWE-rebench V2: Scalable SE Benchmark
- SWE-rebench V2 is a language-agnostic benchmark that automates task construction, validation, and distribution for reinforcement learning-based software engineering evaluation.
- It employs a multi-stage pipeline comprising data collection, interactive setup synthesis, Dockerized validation, LLM-ensemble filtering, and rich metadata enrichment.
- The benchmark scales to tens of thousands of tasks across diverse languages, providing reproducible test environments and diagnostic metrics to assess agent performance.
SWE-rebench V2 is a large-scale, language-agnostic benchmark and automated pipeline for constructing, validating, and distributing executable software engineering tasks suitable for reinforcement learning (RL) and advanced evaluation of software engineering agents. Designed to overcome the limitations of prior benchmarks in terms of scale, diversity, reproducibility, and discriminative test oracles, SWE-rebench V2 unifies containerized task generation, reproducible test environments, and rich metadata annotation to support robust agent training and evaluation across diverse programming languages and repository types (Badertdinov et al., 27 Feb 2026, Vergopoulos et al., 10 Mar 2025).
1. Motivation and Benchmarking Challenges
Prevailing software engineering benchmarks, such as SWE-Bench, have disproportionately targeted high-resource languages (notably Python), contained a limited number of repositories, and typically relied on curated, sometimes insufficient, test oracles. As agent performance improved—with leaderboard resolve rates reaching saturation—there emerged an acute need for evaluation on more representative, real-world tasks and for test suites possessing higher semantic discriminative power (Yu et al., 28 Feb 2026). RL agent training, in particular, necessitates thousands of stable and diverse environments with reliable, reproducible reward signals. The heterogeneity of build tools, dependency managers, and test runners across languages further complicated dataset construction, rendering language-agnostic, scalable automation a key priority (Badertdinov et al., 27 Feb 2026).
2. Automated Pipeline Architecture
SWE-rebench V2 employs a fully automated pipeline structured as five primary stages under a unified “executable contract”:
- Preliminary Data Collection: Mining 29.5 million pull requests from the GitHub Archive, linking issues and PRs, and filtering for the presence of test cases and suitable licenses. Repo-level thresholds (e.g., ≥25 stars for high-resource languages) ensure quality and variety, yielding ~21,000 repositories and ~580,000 candidate PRs.
- Interactive Setup Synthesis: Leveraging a setup agent (mini-SWE-agent, Qwen3-480B) to synthesize repo-specific installation and test procedures. This stage iteratively inspects README, build files, and CI configurations to generate a self-contained `install_config.json` for each repository and programmatically produces per-language Docker images.
- Execution-based Validation: A multi-layer Docker build applies PR patches in sequence, extracting fail-to-pass tests by running test suites before and after patch application. Only instances exhibiting at least one fail→pass test are retained (see the sketch following this overview).
- LLM Ensemble Filtering: Three LLM judges (gpt-oss-120B, GLM-4.7, DeepSeek V3.2) apply consensus or average scoring criteria to retain only well-specified, soundly formulated tasks. Calibration against human “SWE-bench Verified” annotations produces an ensemble precision of ~83%.
- Metadata Enrichment: Each instance is annotated for diagnostic flags (B1–B7, e.g., test-suite coupling, implicit naming, external dependency), PR category, estimated difficulty, and extracted test interfaces via further LLM calls.
The process outputs a rigorously validated and richly annotated task set, supporting both RL training and discriminative agent evaluation (Badertdinov et al., 27 Feb 2026).
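The retention rule at the heart of the validation stage reduces to a set comparison between test outcomes observed before and after the gold patch. The following is a minimal Python sketch of that rule; the `TestStatus` representation and helper names are illustrative assumptions rather than the pipeline's actual interfaces.

```python
from typing import Dict, Set

# Assumed representation: a mapping from test identifier to its observed
# outcome inside the task's Docker environment. How outcomes are produced
# (test command, log parser) comes from the repo's synthesized install/test config.
TestStatus = Dict[str, str]  # e.g. {"tests/test_api.py::test_retry": "pass"}

def fail_to_pass(before: TestStatus, after: TestStatus) -> Set[str]:
    """Tests that fail on the base commit but pass once the gold patch is applied."""
    return {t for t, status in after.items()
            if status == "pass" and before.get(t) == "fail"}

def is_valid_instance(before: TestStatus, after: TestStatus) -> bool:
    """An instance is retained only if it has at least one fail→pass test."""
    return bool(fail_to_pass(before, after))

# Toy outcomes for illustration:
before = {"test_a": "fail", "test_b": "pass"}
after = {"test_a": "pass", "test_b": "pass"}
assert fail_to_pass(before, after) == {"test_a"}
assert is_valid_instance(before, after)
```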
3. Dataset Scale, Content, and Statistics
The resultant SWE-rebench V2 resource comprises:
- Curated, issue-based containerized set: 32,079 tasks from 3,617 repositories, spanning 20 programming languages.
- PR-derived expansion: 120,000+ additional tasks leveraging installation instructions and test recipes but omitting per-task containerization.
- Language distribution (top 5): Python (21.6%), Go (20.6%), JavaScript/TypeScript (~18%), Rust (~6%), Java (~5%).
- Task properties:
- Median diff hunk: 3 files / 34 lines; 90th percentile: 9 files / 181 lines.
- Diagnostic flags prevalent: TEST_SUITE_COUPLING ~18%, IMPLICIT_NAMING ~12%, EXTERNAL_DEPENDENCY ~9%, AMBIGUOUS_SPEC ~14%—enabling researchers to filter or target tasks according to robustness requirements.
Containerized tasks are distributed as pre-built, multi-layer Docker images, with all install/test configs and parsers included. The artifacts are made available through HuggingFace (“nebius/SWE-rebench-V2”, “nebius/SWE-rebench-V2-PRs”), and pipeline code is open-sourced for full reproducibility (Badertdinov et al., 27 Feb 2026).
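As a usage illustration, the containerized subset can be pulled from HuggingFace and filtered on its metadata. The sketch below assumes hypothetical column names (`language`, `diagnostic_flags`) and split name; the released schema may differ.

```python
from datasets import load_dataset

# Dataset id is from the release; the split and column names are assumptions.
ds = load_dataset("nebius/SWE-rebench-V2", split="test")

# Example: keep Go tasks without test-suite coupling, e.g. for RL training
# where a clean reward signal matters.
clean_go = ds.filter(
    lambda row: row["language"] == "Go"
    and "TEST_SUITE_COUPLING" not in row.get("diagnostic_flags", [])
)
print(f"{len(clean_go)} tasks retained")
```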
| Subset | Task Count | Languages | Repositories |
|---|---|---|---|
| Curated Issue-based | 32,079 | 20 | 3,617 |
| PR-derived | 120,000+ | 20 | 3,617 |
4. Formal Metrics, Diagnostics, and Evaluation
SWE-rebench V2 formalizes rigorous instance-level metrics drawing from agent evaluation and test suite informativeness:
- For each codebase $C$ with test suite $T$, candidate patch $p$, and execution environment $E$, the behavior of a test $t \in T$ is $b_E(t, p) \in \{\mathrm{pass}, \mathrm{fail}\}$. A "fail→pass" test is one where $b_E(t, \varnothing) = \mathrm{fail}$ and $b_E(t, p_{\mathrm{gold}}) = \mathrm{pass}$, i.e., it fails on the unpatched codebase and passes once the gold patch is applied.
- Patch complexity is characterized via $\Delta\mathrm{Files}(p)$ and $\Delta\mathrm{Lines}(p)$, capturing the number of files touched and the total lines added plus removed.
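As a worked illustration, both quantities can be computed directly from a unified diff; the counting convention below (file headers for ΔFiles, added plus removed lines excluding headers for ΔLines) is an assumption consistent with the definitions above.

```python
def patch_complexity(diff_text: str) -> tuple[int, int]:
    """Return (delta_files, delta_lines) for a unified diff string."""
    lines = diff_text.splitlines()
    delta_files = sum(1 for line in lines if line.startswith("diff --git "))
    delta_lines = sum(
        1 for line in lines
        if (line.startswith("+") and not line.startswith("+++"))
        or (line.startswith("-") and not line.startswith("---"))
    )
    return delta_files, delta_lines

example_diff = """diff --git a/pkg/util.go b/pkg/util.go
--- a/pkg/util.go
+++ b/pkg/util.go
@@ -10,2 +10,3 @@
-    return nil
+    if err != nil {
+        return err
"""
print(patch_complexity(example_diff))  # (1, 3): one file, 1 removed + 2 added lines
```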
A diagnostic study involving seven models (Claude Opus-4.5, GLM-4.7, MiniMax-M2.1, Gemini 3 Flash, DeepSeek V3.2, GPT-5.2, gpt-oss-120B) across five languages establishes baseline pass@k metrics and identifies principal agent failure modes (test-suite coupling, naming mismatches, hidden dependencies). Performance correlates negatively with patch complexity (e.g., Spearman's ρ ≈ –0.40 for ΔLines), substantiating that larger and more complex patches are systematically harder for agents (Vergopoulos et al., 10 Mar 2025).
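Such a correlation can be reproduced for any per-instance results table with a standard rank correlation. A minimal sketch using `scipy.stats.spearmanr` follows; the numbers are toy values, not benchmark data.

```python
from scipy.stats import spearmanr

# Per-instance patch sizes (ΔLines) and whether an agent resolved the task
# (1 = resolved, 0 = unresolved); toy values for illustration only.
delta_lines = [12, 34, 88, 181, 250, 9, 40, 300]
resolved = [1, 1, 0, 0, 0, 1, 1, 0]

rho, p_value = spearmanr(delta_lines, resolved)
print(f"Spearman rho = {rho:.2f}, p = {p_value:.3f}")  # negative rho: bigger patches, fewer resolutions
```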
Typical agent pass rates:
| Language | Model | pass@1 |
|---|---|---|
| Python | Opus-4.5 | 36.1% |
| Python | gpt-oss-120B | 8.9% |
| Go | DeepSeek | 12.2% |
5. Limitations and Extensions: Test Suite Quality and Adversarial Strengthening
Despite its diversity and scale, SWE-rebench V2 inherits the fragility of test-based oracles as proxies for semantic correctness. Subsequent adversarial evaluation shows that roughly one in five previously "solved" agent patches is in fact semantically incorrect, passing only because of weak or incomplete test coverage (Yu et al., 28 Feb 2026).
SWE-ABS, an adversarial test augmentation framework, addresses these deficiencies through a two-stage pipeline:
- Stage I: Coverage-driven augmentation—identifies patch-relevant code via program slicing (Tree-sitter, intraprocedural k-hop), generates LLM-driven targeted tests, generalizes over-specialized assertions, and iteratively maximizes coverage.
- Stage II: Mutation-driven adversarial testing—generates semantically diverse mutants using LLMs, filters by relevance/equivalence via LLM ensembles, and guides new test creation to distinguish gold from faulty patches.
Metrics such as the coverage ratio, strengthen rate, adversarial mutation rate, and agent score drop quantify test-suite improvement. On SWE-Bench Verified, SWE-ABS strengthens 50.2% of instances (vs. 2% for UTBoost), rejects ~19.8% of previously passing patches, and lowers the top system's resolve rate from 78.80% to 62.20% (Yu et al., 28 Feb 2026).
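The exact formulas for these metrics are not reproduced here; one plausible formalization of two of the aggregate quantities reported above is sketched below, with the definitions themselves being assumptions.

```python
def strengthen_rate(num_strengthened: int, total_instances: int) -> float:
    """Assumed definition: fraction of instances whose test suite gained at least
    one new test that discriminates gold from faulty patches."""
    return num_strengthened / total_instances

def agent_score_drop(resolve_rate_before: float, resolve_rate_after: float) -> float:
    """Drop in a system's resolve rate once strengthened suites replace the originals."""
    return resolve_rate_before - resolve_rate_after

# Illustrative counts reproducing the reported 50.2%, plus the reported resolve rates:
print(f"{strengthen_rate(251, 500):.1%}")              # 50.2% of instances strengthened
print(f"{agent_score_drop(78.80, 62.20):.1f} points")  # 16.6 percentage-point drop for the top system
```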
Recommendations for SWE-rebench V2 include enforcing minimum coverage thresholds and minimum adversarial mutation yields, parallelizing the pipeline for scale, containerizing tasks for reproducibility, and incrementally strengthening tests via integration with CI/CD workflows.
6. Comparative Analysis: Advances over Prior SWE Benchmarks
Earlier benchmarks such as SWE-Bench (Vergopoulos et al., 10 Mar 2025) were constrained by:
- Manual environment setup and test extraction, typically covering only 12 highly popular Python libraries.
- Distributional mismatch, as results on curated projects proved unrepresentative of real-world codebases.
- Lack of historical accuracy (patched dependency states) and updateability.
SWE-rebench V2 (also referenced as SWEE-Bench) systematically overcomes these issues:
- Scaling to thousands of repositories and tens of thousands of tasks with median star count reduced from 16,000 (SWE) to 365 (SWEE).
- Automated, historically reproducible environments for each instance via the setup agent.
- Task properties (issue text, patch complexity, test suite size) closely matched to real-world distributions.
- Substantial agent performance drop (10–40% relative) attributed to increased task difficulty and reduced curation bias—clarifying current limitations of code agents.
7. Usage, Integration, and Future Directions
SWE-rebench V2 is released as a set of containerized tasks optimized for RL training and agent evaluation, accompanied by 120,000+ PR-derived tasks for massive training without per-instance Dockerization. Best practices include filtering by diagnostic metadata to match agent capabilities, beginning RL with code-clarified tasks, and leveraging extracted interfaces for prompt or code scaffold enrichment.
Notable limitations include focus on single-container tasks (excluding multi-service architectures), lack of ablation for metadata filter effects, and initial underrepresentation of cross-component and non-functional requirements. Future extensions are outlined: increasing setup agent retry budgets, supporting complex task types, enriching reward signals with performance/memory metrics, and onboarding more long-tail languages with automated prompts (Badertdinov et al., 27 Feb 2026).
By integrating large-scale, language-agnostic collection, adversarial test strengthening, and reproducible execution, SWE-rebench V2 provides a robust, evolving substrate for next-generation software engineering agent research and evaluation (Yu et al., 28 Feb 2026, Badertdinov et al., 27 Feb 2026, Vergopoulos et al., 10 Mar 2025).