SWE-bench Multilingual Benchmark
- SWE-bench Multilingual is a benchmark toolkit leveraging curated, real-world GitHub issue–pull request pairs to evaluate LLMs and agentic systems across diverse codebases.
- It employs dynamic task harvesting, Dockerized environments, and semantic filtering to ensure contamination-free evaluations and robust performance measurement.
- The framework supports reinforcement learning research, multi-agent orchestration, and advanced security analyses to drive innovation in multilingual software engineering.
SWE-bench Multilingual refers to a class of benchmark datasets, evaluation frameworks, and associated agentic workflows designed to measure the capacity of LLMs and code agents to resolve software engineering issues across multiple programming languages and diverse natural language contexts. This family of benchmarks has evolved to address the shortcomings of prior evaluations that were predominantly monolingual (often Python-centric), static, or prone to contamination. Modern SWE-bench Multilingual resources emphasize curated, high-quality, real-world issue–pull-request pairs, multilingual codebases, dynamic updating, robust execution-based validation, fine-grained structural and security analysis, and support for reinforcement learning at scale. The benchmarks are closely intertwined with research on agentic software engineering, unified code representation, and scalable evaluation infrastructures.
1. Motivation and Evolution of Multilingual SWE-bench
The original SWE-bench dataset was introduced as a collection of real-world GitHub issue–pull request pairs, primarily for Python repositories, to rigorously evaluate LLMs and agentic systems on downstream software engineering tasks. However, critical limitations quickly emerged: solution leakage due to answers in issue comments, weak test cases inflating agent resolution rates, heavy linguistic and structural bias toward Python, and rapid saturation due to static, publicly released benchmarks (Aleithan et al., 9 Oct 2024, Badertdinov et al., 26 May 2025, Adamenko et al., 15 Jul 2025). These limitations motivated the creation of more robust, scalable, and multilingual variants.
The principal objectives of SWE-bench Multilingual and its derivatives are:
- To measure cross-lingual generalization of LLMs on diverse engineering tasks;
- To enable robust, reproducible, and contamination-free benchmarking by continuous task collection and validation;
- To support the development and evaluation of RL-based agents and complex, multi-agent systems on non-Python codebases;
- To serve as a reliable foundation for large-scale empirical studies in code security, agentic reasoning, and multimodal context retrieval in realistic software environments.
2. Benchmark Construction and Multilingual Scope
The Multi-SWE-bench benchmark (Zan et al., 3 Apr 2025) is the most explicit instantiation of a multilingual SWE-bench. The construction pipeline consists of five key phases:
- Repository Selection: Curated selection of high-quality open-source repositories with threshold criteria (e.g., >500 stars, active maintenance, and reliable CI/CD integration).
- Pull Request Identification: Extraction of merged pull requests strictly linked to issue reports, modifying or introducing test files, and achieving merge into the main branch.
- Environment Reproducibility: Automated extraction of environment dependencies from CI/CD, documentation, and configuration, with construction of a Dockerized, executable environment pinned to the correct revision.
- Semantic Filtering: Application of strict semantic transition tests, e.g., only selecting PRs whose test suite transitions from FAIL to PASS via the test and fix patches, and eliminating cases with anomalous transitions (e.g., ANY→PASS without explicit failure recovery); a sketch of this check follows the list below.
- Human Annotation: Dual, expert-annotator review of each instance covering validity criteria (such as Q2.1, Q3.1, Q4.1 in the source), yielding a set of 1,632 high-quality, multilingual issue-resolving tasks across seven programming languages: Java, TypeScript, JavaScript, Go, Rust, C, and C++.
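For concreteness, the FAIL→PASS semantic filter can be expressed as a small predicate over two test runs. This is a minimal sketch under assumed status labels and test-report shapes, not the Multi-SWE-bench filtering code.

```python
# Illustrative FAIL -> PASS transition filter for candidate PRs; a minimal
# sketch, not the Multi-SWE-bench implementation (status labels are assumed).
from typing import Dict

PASS, FAIL = "PASS", "FAIL"

def keep_instance(before: Dict[str, str], after: Dict[str, str]) -> bool:
    """`before`: test statuses with the test patch applied but without the fix;
    `after`: statuses with both the test patch and the fix patch applied."""
    # Fail-to-pass tests: must FAIL without the fix and PASS with it.
    fail_to_pass = [t for t, s in after.items()
                    if s == PASS and before.get(t) == FAIL]
    # Reject anomalous ANY -> PASS transitions (tests that never failed) by
    # requiring at least one genuine fail-to-pass test, and reject fixes that
    # break previously passing tests.
    regressions = [t for t, s in before.items()
                   if s == PASS and after.get(t) != PASS]
    return bool(fail_to_pass) and not regressions
```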
A supporting language entropy metric, $H = -\sum_i p_i \log p_i$ (where $p_i$ is the fraction of code written in language $i$), quantifies linguistic diversity within a repository and serves as a correlational factor in agent evaluation.
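The entropy can be computed directly from per-language code volumes, such as the byte counts GitHub reports per language. The sketch below uses base-2 logarithms; whether the benchmark uses base 2 or natural log is an assumption here.

```python
import math
from typing import Dict

def language_entropy(bytes_per_language: Dict[str, int]) -> float:
    """Shannon entropy H = -sum_i p_i * log2(p_i), where p_i is the fraction
    of code written in language i. Higher values indicate a more
    linguistically diverse repository."""
    total = sum(bytes_per_language.values())
    probs = [b / total for b in bytes_per_language.values() if b > 0]
    return -sum(p * math.log2(p) for p in probs)

# Example: a mostly-TypeScript repository with some Rust and Go.
print(language_entropy({"TypeScript": 70_000, "Rust": 20_000, "Go": 10_000}))
```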
3. Agent Evaluation Frameworks and Methodological Adaptations
Multi-SWE-bench is used to comprehensively assess a suite of agentic methods, each tailored for multilingual codebase navigation:
- MagentLess (Agentless): Fixed workflow with multi-stage prompt engineering, adapted to non-Python languages by using full file contexts (via Tree-sitter; see the sketch after this list), language-specific repository pruning, and elimination of Python-specific code selection logic.
- MSWE-agent (SWE-agent): Multi-turn agent–computer interface, revised for multilingual environments by truncating oversize observations, adding build artifact ignores, and patching language-specific CLI issues.
- MopenHands (OpenHands): Handles code modification with updated prompting, artifact ignoring, and semantic diff interpretation to ensure proper handling of tokenization and indentation across languages.
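To illustrate how Tree-sitter enables language-agnostic context extraction, the sketch below pulls a file's top-level declarations regardless of language. It assumes the `tree_sitter_languages` helper package and is not MagentLess's actual tooling.

```python
# Sketch of language-agnostic file-skeleton extraction with Tree-sitter;
# assumes the `tree_sitter_languages` helper package, not MagentLess's code.
from pathlib import Path
from tree_sitter_languages import get_parser

def file_skeleton(path: str, language: str):
    """Return (node_type, first_source_line) for each top-level syntax node,
    giving a compact, language-independent view of a file for localization."""
    parser = get_parser(language)          # e.g. "java", "rust", "go", "cpp"
    source = Path(path).read_bytes()
    tree = parser.parse(source)
    skeleton = []
    for node in tree.root_node.children:
        first_line = source[node.start_byte:node.end_byte].split(b"\n")[0]
        skeleton.append((node.type, first_line.decode(errors="replace")))
    return skeleton

# Example: print a skeleton of a Rust source file.
for kind, line in file_skeleton("src/lib.rs", "rust"):
    print(f"{kind:>20}  {line}")
```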
Empirical evaluations reveal substantially diminished resolved rates for non-Python languages—agent performance is particularly sensitive to the precision of fault localization, with mislocalization being a principal failure mode. Statistical analyses (e.g., resolved rates, fault localization accuracy, token consumption) are designed to be cross-method, cross-language, and methodologically reproducible.
4. Dynamic Benchmarks and Contamination Mitigation
Static benchmarks (e.g., original SWE-bench, SWE-bench-Verified) suffer from rapid contamination and overfitting as LLM training cycles absorb public datasets (Aleithan et al., 9 Oct 2024, Badertdinov et al., 26 May 2025, Adamenko et al., 15 Jul 2025). Dynamic and contamination-free pipeline innovations are introduced in benchmarks such as SWE-MERA (Adamenko et al., 15 Jul 2025) and SWE-rebench (Badertdinov et al., 26 May 2025):
- Continuous Task Harvesting: Automated, continuously running mining procedures extract new GitHub issue–PR pairs, with timestamped release and validation, ensuring each new subset is uncontaminated and relevant to the latest LLMs.
- Automated Environment and Quality Validation: LLM-driven recipes parse and validate build/test environments, while secondary LLMs produce structured scores for issue/test correctness, completeness, and complexity, filtering low-quality or underspecified tasks.
- Multi-level Metrics: pass@1, pass@6, localization accuracy, token overflows, and granular agent trajectory tracking are reported with binomial confidence intervals (see the sketch after this list), supporting robust year-on-year model comparison and dynamic leaderboard maintenance.
- Language-Generalizable Infrastructure: Pipelines are constructed to be language-agnostic, with only dependency and test execution modules requiring adaptation per language ecosystem (e.g., npm, Maven, CMake).
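For concreteness, the per-task pass@k metric and a binomial confidence interval over the aggregate resolved rate can be computed as below. This is a generic sketch using the standard unbiased pass@k estimator and a normal-approximation interval, not the exact SWE-rebench or SWE-MERA reporting code.

```python
import math

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: probability that at least one of k samples drawn from
    n attempts (of which c succeeded) resolves the task."""
    if n - c < k:
        return 1.0
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)

def binomial_ci(successes: int, trials: int, z: float = 1.96):
    """Normal-approximation 95% confidence interval for a resolved rate."""
    p = successes / trials
    half = z * math.sqrt(p * (1 - p) / trials)
    return max(0.0, p - half), min(1.0, p + half)

# Example: 2 of 6 attempts on a task succeed -> per-task pass@1 and pass@6.
print(pass_at_k(6, 2, 1), pass_at_k(6, 2, 6))
# Example: 87 of 300 tasks resolved -> aggregate resolved rate with 95% CI.
print(binomial_ci(87, 300))
```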
5. Multimodal Repository Representation and Real-World Deployment
Prometheus (Chen et al., 26 Jul 2025) introduces a paradigm shift in multilingual repository representation. Rather than treating codebases as flat file structures, Prometheus constructs a unified knowledge graph, with heterogeneous file, AST, and text nodes and five directed edge types (HAS_FILE, HAS_AST, PARENT_OF, HAS_TEXT, NEXT_CHUNK), persisted in a Neo4j backend; a schema sketch follows the list below. This enables:
- Language-Integrated Graph Construction: Parsing with Tree-sitter overcomes language barriers by supporting a wide range of syntaxes in a single graph model.
- Agent Orchestration: Dedicated agents handle issue classification, bug reproduction (including Dockerized environment reconstruction), context retrieval via graph queries, patch generation, and patch validation.
- Universal Context Retrieval: Cross-language and multimodal capabilities allow Prometheus to retrieve and reason about code, documentation, and comments across Java, JavaScript, Rust, C/C++, Go, PHP, and Ruby repositories.
- Benchmark Results: On SWE-bench Multilingual (300 tasks), Prometheus achieves a 13.7% resolution rate with an average API cost of $0.38 per issue, resolving 10 unique tasks not addressed by prior state-of-the-art agents.
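The graph schema described above can be sketched with the official neo4j Python driver. Node labels and properties here are assumptions beyond the edge types named in the paper, HAS_TEXT and NEXT_CHUNK edges would be persisted analogously, and this is not Prometheus's actual ingestion code.

```python
# Minimal sketch of persisting file/AST nodes with the edge types named above;
# uses the official neo4j Python driver. Labels and properties are assumed.
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

def add_file_with_ast(tx, repo: str, path: str, ast_nodes: list):
    tx.run("MERGE (r:Repository {name: $repo}) "
           "MERGE (f:File {repo: $repo, path: $path}) "
           "MERGE (r)-[:HAS_FILE]->(f)", repo=repo, path=path)
    for node in ast_nodes:  # e.g. {"id": ..., "type": ..., "parent_id": ...}
        tx.run("MATCH (f:File {repo: $repo, path: $path}) "
               "MERGE (a:ASTNode {id: $id, type: $type}) "
               "MERGE (f)-[:HAS_AST]->(a)",
               repo=repo, path=path, id=node["id"], type=node["type"])
        if node.get("parent_id") is not None:
            tx.run("MATCH (p:ASTNode {id: $pid}), (c:ASTNode {id: $cid}) "
                   "MERGE (p)-[:PARENT_OF]->(c)",
                   pid=node["parent_id"], cid=node["id"])

with driver.session() as session:
    session.execute_write(add_file_with_ast, "example/repo", "src/main.rs",
                          [{"id": 1, "type": "function_item", "parent_id": None}])
driver.close()
```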
6. Reinforcement Learning, Open-Source Communities, and Scalability
The Multi-SWE-RL initiative (Zan et al., 3 Apr 2025) is established as an open-source ecosystem supporting RL research by releasing large, containerized, and reproducible training datasets for agentic learning. Its contributions include:
- RL-ready Environments: 4,723 Dockerized issue-resolving tasks across seven languages, validated for robust and scalable RL agent deployment; see the reward-evaluation sketch after this list.
- Documentation and Contribution Platforms: Full production pipelines, annotation guidelines, and public boards enable transparent dataset governance and expansion.
- Empirical Scaling Laws: Multiple works (e.g., Skywork-SWE (Zeng et al., 24 Jun 2025), SWE-Dev (Wang et al., 9 Jun 2025)) demonstrate log-linear scaling of agent resolve rate with data volume, providing quantitative motivation for continually growing RL corpora.
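As an illustration of how such RL-ready environments can be consumed, the sketch below applies a candidate patch inside a task's pinned Docker image and converts the test outcome into a scalar reward. It uses the Docker SDK for Python; the image name, working directory, and test command are placeholders and not part of the Multi-SWE-RL API.

```python
# Sketch: scoring one candidate patch inside a pinned, per-task Docker image.
# Uses the Docker SDK for Python; image names, paths, and commands are placeholders.
import docker

def rollout_reward(image: str, patch: str, test_cmd: str) -> float:
    """Return 1.0 if the task's tests pass after applying the patch, else 0.0."""
    client = docker.from_env()
    container = client.containers.run(image, command="sleep infinity", detach=True)
    try:
        # Write and apply the model-generated patch inside the container.
        container.exec_run(["sh", "-c", f"cat > /tmp/fix.patch << 'EOF'\n{patch}\nEOF"])
        apply_rc = container.exec_run(["git", "apply", "/tmp/fix.patch"],
                                      workdir="/workspace").exit_code
        if apply_rc != 0:
            return 0.0
        # Run the task's fail-to-pass test command; exit code 0 means resolved.
        test_rc = container.exec_run(["sh", "-c", test_cmd],
                                     workdir="/workspace").exit_code
        return 1.0 if test_rc == 0 else 0.0
    finally:
        container.remove(force=True)

# Example (placeholder image and test command):
# reward = rollout_reward("multi-swe-rl/rust-task-0001", patch_text, "cargo test")
```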
7. Security, Evaluation, and Future Directions
Recent studies using SWE-bench Multilingual have established agent-driven code generation as a significant source of new vulnerabilities (Sajadi et al., 30 Jun 2025). LLM patches introduce security risks at nearly 9× the rate of human developers, especially with loosely specified issues or broad context modifications. Proactive risk assessment is advocated by combining code-level metrics (file count, LOC, complexity) and issue-level completeness scores to identify high-risk agent trajectories.
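One simple way to operationalize such proactive risk assessment is a weighted score over patch-level and issue-level features; the feature weights and review threshold below are illustrative assumptions, not values from the cited study.

```python
# Illustrative risk score combining code-level and issue-level signals; the
# weights and threshold are assumptions, not values from the cited study.
from dataclasses import dataclass

@dataclass
class PatchStats:
    files_changed: int
    lines_changed: int          # total LOC added + removed
    cyclomatic_delta: int       # change in cyclomatic complexity
    issue_completeness: float   # 0.0 (underspecified) .. 1.0 (fully specified)

def risk_score(p: PatchStats) -> float:
    """Higher scores flag agent trajectories that warrant security review."""
    score = 0.0
    score += 0.3 * min(p.files_changed / 5.0, 1.0)       # broad, multi-file edits
    score += 0.3 * min(p.lines_changed / 200.0, 1.0)     # large patches
    score += 0.2 * min(max(p.cyclomatic_delta, 0) / 10.0, 1.0)
    score += 0.2 * (1.0 - p.issue_completeness)          # loosely specified issues
    return score

needs_review = risk_score(PatchStats(7, 340, 12, 0.4)) > 0.5
```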
Future directions for SWE-bench Multilingual include:
- Extension to additional languages and programming paradigms (e.g., multimodal/multilingual code–text pairs, domain-specific languages).
- Dynamic and continually updated benchmarks (e.g., SWE-MERA) for more accurate and current model appraisal.
- Advanced unified knowledge representations and agent orchestration frameworks, leveraging persistent graph backends and language-agnostic pipeline elements.
- Integration of human feedback, ELO rating leaderboards, and fine-grained, execution-based, multi-faceted performance metrics in both training and evaluation phases.
In conclusion, SWE-bench Multilingual represents a critical infrastructure for evaluating, training, and understanding agentic LLMs for software engineering across diverse programming and natural language environments. It bridges the limitations of prior monolingual, static, and contaminated benchmarks by advancing scalable, reproducible, and multilingual research in agentic reasoning, LLM evaluation, and RL-based software task automation.