Multi-SWE-bench: Multilingual Benchmark
- The paper introduces Multi-SWE-bench, a multilingual benchmark suite assessing LLM performance on real-world software issue resolution across diverse programming languages.
- It employs rigorous annotation protocols, reproducible testing, and a unified evaluation framework to ensure task clarity and accurate performance metrics.
- Evaluation reveals significant performance drops in non-Python languages, highlighting challenges and motivating RL-driven strategies for autonomous debugging.
Multi-SWE-bench is a multilingual, repository-level benchmark suite for evaluating LLMs and agentic frameworks on real-world software issue resolution tasks. It addresses the critical deficiency of prior benchmarks—most notably SWE-bench—which focused almost exclusively on Python and thus could not measure cross-language generalization, real-world robustness, or multilingual competence in software engineering agents. Multi-SWE-bench coordinates its task collection, annotation protocols, infrastructure, and evaluation criteria to provide a reliable and reproducible standard for measuring LLM-driven bug fixing, feature addition, and code optimization across Java, TypeScript, JavaScript, Go, Rust, C, and C++. Its open-source production pipeline and rapidly growing RL training dataset ecosystem (Multi-SWE-RL) lay the groundwork for scalable, cross-paradigm autonomous software engineering research (Zan et al., 3 Apr 2025).
1. Scope, Motivation, and Design Rationale
The inception of Multi-SWE-bench is motivated by the limitations of prior benchmarks such as SWE-bench, which relies solely on Python repositories (Zan et al., 3 Apr 2025). While SWE-bench demonstrated the viability of evaluating LLMs on repository-level GitHub issue resolution, a process that simulates practical bug fixing and feature implementation, its language restriction precludes meaningful generalization to real-world software practice, which is inherently multilingual and multi-paradigm.
Multi-SWE-bench expands this paradigm to seven major programming languages (Java, TypeScript, JavaScript, Go, Rust, C, C++), each selected for ecosystem prevalence, practical impact, and technical diversity. This design enables comprehensive evaluation across language boundaries, divergent build systems, runtime constraints, and type systems (dynamic/static, low-/high-level). Importantly, the benchmark adopts strict annotation and filtering protocols. From an initial pool of 2,456 PR-based candidates, 1,632 high-quality instances were retained after dual annotator verification and expert QA, ensuring high signal in issue clarity, test coverage, and absence of severe repository-level ambiguities (Zan et al., 3 Apr 2025).
2. Task Collection, Annotation, and Dataset Composition
Task construction in Multi-SWE-bench follows a five-phase protocol:
- Repository selection: Only repositories with ≥500 GitHub stars, active maintenance, and continuous integration are considered to ensure industrial relevance and runnable builds.
- PR crawling: All merged PRs linked to issues and modifying test files are candidate tasks. Metadata such as base commits, diff patches, and issue text are extracted.
- Environment determination: Automated Dockerfile generation is performed by parsing repository CI workflows and documentation to ensure assemblable, platform-stable test harnesses for each language and project.
- Test-case filtering: Each PR undergoes a three-state test transition analysis (base, test.patch, fix.patch) to detect true fail-to-pass transitions (demonstrable bug fix or feature addition) and to exclude non-reproducible or flaky cases.
- Manual verification: 68 trained annotators apply a standardized questionnaire on issue description clarity, test coverage adequacy, and repository health, with cross-language QA over 1,632 instances.
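The test-case filtering phase can be illustrated with a minimal sketch. It assumes each of the three states (base, base plus test.patch, base plus test.patch plus fix.patch) has already been run and reduced to a mapping from test name to outcome. The function names, the "pass"/"fail" strings, and the rule that a missing test is ignored are assumptions for illustration, not the benchmark's actual implementation.

```python
from typing import Dict, List, Set

# Hypothetical representation: test name -> "pass" | "fail".
# A missing key means the test did not run in that state.
Results = Dict[str, str]


def fail_to_pass(test_run: Results, fix_run: Results) -> Set[str]:
    """Tests failing with only test.patch applied that pass once fix.patch is added."""
    return {t for t, s in fix_run.items() if s == "pass" and test_run.get(t) == "fail"}


def regressions(base_run: Results, fix_run: Results) -> Set[str]:
    """Tests passing at the base commit that no longer pass after the fix."""
    return {t for t, s in base_run.items() if s == "pass" and fix_run.get(t) != "pass"}


def is_stable(repeated_runs: List[Results]) -> bool:
    """Treat a state as flaky if repeated runs disagree on any test outcome."""
    return all(run == repeated_runs[0] for run in repeated_runs)


def keep_candidate(base_run: Results, test_run: Results, fix_run: Results) -> bool:
    """Keep a PR only if it shows at least one fail-to-pass test and no regression."""
    return bool(fail_to_pass(test_run, fix_run)) and not regressions(base_run, fix_run)


# Illustrative data only; the test names are made up.
base = {"test_parse": "pass"}
with_tests = {"test_parse": "pass", "test_issue": "fail"}
with_fix = {"test_parse": "pass", "test_issue": "pass"}
print(keep_candidate(base, with_tests, with_fix))  # True
```

Comparing the fix state against the test.patch state, rather than against the base commit, is what isolates the effect of the fix from the effect of the newly added tests.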
Instance-level statistics reveal substantial task variety. Language distribution is: Java 128, TypeScript 224, JavaScript 356, Go 428, Rust 239, C 128, C++ 129. Each entry includes an issue description, ground-truth fix and test diffs, Docker build context, and logs. Patch and test suite sizes are reported per repository, with increased length and complexity for “hard” instances (Zan et al., 3 Apr 2025).
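As a quick consistency check on the figures above, the per-language counts can be summed and compared with the 2,456 initial candidates reported earlier. The snippet below is only arithmetic over numbers stated in this article; the variable names are illustrative.

```python
# Per-language instance counts as reported in the text.
counts = {
    "Java": 128,
    "TypeScript": 224,
    "JavaScript": 356,
    "Go": 428,
    "Rust": 239,
    "C": 128,
    "C++": 129,
}

total = sum(counts.values())
candidates = 2456  # initial PR-based candidate pool reported in the text
retention = total / candidates
shares = {lang: n / total for lang, n in counts.items()}

print(total)  # 1632
print(f"retained: {retention:.1%}")
for lang, share in sorted(shares.items(), key=lambda kv: -kv[1]):
    print(f"{lang:<11}{share:6.1%}")
```

The counts sum to the 1,632 retained instances, so roughly two thirds of the candidate pool survived filtering, and Go is the largest single subset at about a quarter of the benchmark.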
3. Evaluation Methodologies and Metrics
Multi-SWE-bench supports three canonical evaluation frameworks, all ported to the multilingual setting:
- MagentLess (“Agentless”): A deterministic workflow comprising fault localization, followed by code repair without iterative patch selection.
- MSWE-agent: A multi-turn, agent-based pipeline enabling iterative exploration, error analysis, patch creation, and controlled regression testing.
- MopenHands: An open-ended agent interface integrated with multi-language prompts and “git diff” support, allowing free-form codebase manipulation.
All three methodologies support a uniform metric regime:
- Resolved Rate (RR): Percentage of tasks in which all previously failing tests pass post-patch, with no new regressions.
- Exact Match (EM): Fraction of tasks for which the generated patch is bitwise identical to the ground-truth fix.
- Success Location (SL): File-level localization accuracy, i.e., whether the model touches any ground-truth file.
- Code Execution Success Rate (CESR): Patch-execution reliability over multiple samples.
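The first and third metrics can be sketched from their definitions above. This is a minimal illustration, assuming a hypothetical per-task record (`Outcome`) whose field names are not taken from the benchmark's tooling; it is not the official evaluation harness.

```python
from dataclasses import dataclass
from typing import Callable, List, Set


@dataclass
class Outcome:
    """Hypothetical per-task record; field names are illustrative."""
    target_tests: Set[str]    # tests that failed before the model's patch
    guarded_tests: Set[str]   # tests that passed before the model's patch
    passing_after: Set[str]   # tests that pass after the model's patch
    gold_files: Set[str]      # files changed by the ground-truth fix
    touched_files: Set[str]   # files changed by the model's patch


def is_resolved(o: Outcome) -> bool:
    """All previously failing tests now pass, and nothing that passed has broken."""
    return o.target_tests <= o.passing_after and o.guarded_tests <= o.passing_after


def hits_location(o: Outcome) -> bool:
    """The patch touches at least one file from the ground-truth fix."""
    return bool(o.gold_files & o.touched_files)


def rate(outcomes: List[Outcome], predicate: Callable[[Outcome], bool]) -> float:
    """Percentage of tasks for which the predicate holds."""
    if not outcomes:
        return 0.0
    return 100.0 * sum(predicate(o) for o in outcomes) / len(outcomes)


# Illustrative data only; test and file names are made up.
outcomes = [
    Outcome({"t_bug"}, {"t_old"}, {"t_bug", "t_old"}, {"src/a.go"}, {"src/a.go"}),
    Outcome({"t_bug"}, {"t_old"}, {"t_bug"}, {"src/b.rs"}, {"src/b.rs", "src/c.rs"}),
]
print(rate(outcomes, is_resolved))    # 50.0
print(rate(outcomes, hits_location))  # 100.0
```

The second task fixes the target test but breaks a previously passing one, so it counts toward Success Location but not toward Resolved Rate.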
Resolution rates are substantially lower for new languages than for Python even when using state-of-the-art LLMs. For instance, using GPT-4o and MagentLess, RR is 36.2% for Python but only 2.2% for TypeScript and 1.4% for JavaScript (Zan et al., 3 Apr 2025).
4. Comparative Results and Empirical Findings
Performance on Multi-SWE-bench reveals several consistent trends and failure modes:
- Language dependency: All tested models and frameworks exhibit a sharp performance drop when moving from Python to other languages, with average RRs ≤15% in non-Python cases. Java scores highest among the new languages, while C, C++, TypeScript, and JavaScript remain challenging (most below 7%).
- Framework-specific strengths: MagentLess tends to yield higher file-level localization rates, while agent-based methods such as MopenHands excel in edit steps, yielding better overall RR across most languages.
- Complexity sensitivity: Success probability decreases with increased instance complexity, measured by patch size or number of files touched. For “hard” instances (estimated ≥1 h to solve), RR approaches zero in every language except Python (Zan et al., 3 Apr 2025).
- Issue type dependency: Bug fix tasks are resolved more frequently than new features or feature optimizations. This trend is observed across all languages and agent types (see Table 11 in Zan et al., 3 Apr 2025).
Resource consumption analysis shows deterministic workflows are more token- and cost-efficient, but agentic methods provide better debugging coverage at higher cost.
5. Expansion, Open-Source Community, and Multi-SWE-RL
To drive rapid data growth and foster broader agent development, the Multi-SWE-bench creators have released the full benchmark pipeline and an RL-targeted extension called Multi-SWE-RL. The latter is a collection of 4,723 containerized, reinforcement-signal–rich instances, supporting research in RL-based autonomous bug fixing and patch generation (Zan et al., 3 Apr 2025). The data schema encapsulates detailed signals such as test pass/fail, compile status, and code coverage deltas.
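To make the idea of a reinforcement signal concrete, the sketch below folds compile status and test outcomes into a scalar reward. Everything here is hypothetical: the `Signals` fields, the penalty values, and the decision to ignore coverage deltas are assumptions chosen for illustration, not the Multi-SWE-RL schema or a reward used by its authors.

```python
from dataclasses import dataclass


@dataclass
class Signals:
    """Hypothetical per-rollout signals; not the actual Multi-SWE-RL schema."""
    compiled: bool             # did the patched project build?
    target_tests_passed: int   # previously failing tests that now pass
    target_tests_total: int    # number of previously failing tests
    regressions: int           # previously passing tests that now fail


def reward(s: Signals) -> float:
    """Map execution signals to a scalar in [-1, 1] (illustrative shaping only)."""
    if not s.compiled:
        return -1.0  # assumed penalty for a patch that breaks the build
    if s.regressions > 0:
        return -0.5  # assumed penalty for breaking existing behavior
    if s.target_tests_total == 0:
        return 0.0   # nothing to fix, so no signal
    return s.target_tests_passed / s.target_tests_total


print(reward(Signals(compiled=True, target_tests_passed=3, target_tests_total=4, regressions=0)))  # 0.75
```

A graded reward of this kind gives partial credit for patches that fix some target tests, which a binary resolved/unresolved signal would not.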
Community contributions are encouraged through open-source CI/CD pipelines, HuggingFace datasets, and transparent documentation for adding new tasks. Contributor incentives include quarterly paper releases and authorship opportunities.
Related efforts, such as SWE-bench Multimodal (visual/JS-centric tasks) (Yang et al., 2024), SWE-bench-java (Zan et al., 2024), the automated dataset generation of SWE-Bench++ (Wang et al., 19 Dec 2025), and SPICE (automated clarity/test/evaluation labeling) (Bhatia et al., 12 Jul 2025), enrich the multilingual and multimodal breadth of evaluation and training resources in this ecosystem.
6. Limitations, Benchmark Mutation, and Future Directions
Despite its breadth, Multi-SWE-bench has several limitations, which in turn point to open directions:
- Overestimation on formal prompts: Mutation analysis of the TypeScript subset, using IDE telemetry to mutate GitHub issue descriptions into realistic, terse user-style queries, demonstrates that formal benchmarks systematically overestimate agent capabilities—often by >50% for leading models (Garg et al., 10 Oct 2025).
- Modality and coverage constraints: Multi-SWE-bench is primarily text/code-based. Multi-modal reasoning and vision–code integration are treated separately (e.g., SWE-bench Multimodal (Yang et al., 2024)).
- Oracle dependency: All evaluation depends on repository-supplied test suites, which may provide incomplete functional specification.
- Manual labor and scalability: Expert annotation is expensive; attempts like SPICE (Bhatia et al., 12 Jul 2025) and programmatic pipelines in SWE-Bench++ (Wang et al., 19 Dec 2025) aim to automate labeling and reduce curation cost while preserving label fidelity.
- Language and task extension: Increasing coverage of underrepresented languages (e.g., PHP, Ruby, Kotlin), expanding to infrastructure and UI tasks, and integration of security and style checks remain key open directions.
- Reinforcement learning infrastructure: The Multi-SWE-RL community and pipeline are positioned to drive research toward fully autonomous RL-driven debugging and software agent learning.
A plausible implication is that sustained, collaborative expansion—combining open-source data production, annotation automation, mutation frameworks, and RL-infrastructure—will continue to advance the empirical measurement and practical capabilities of multilingual software engineering agents.