Multilingual Code Generation
- Multilingual code generation is the automatic synthesis of semantically valid code from instructions in various natural and programming languages.
- It employs cross-lingual transfer, robust benchmarking, and translation pipelines to evaluate and optimize model performance.
- Innovations such as LASER-based projection and IR alignment lead to higher pass@1 rates and reduced errors, especially in low-resource settings.
Multilingual code generation is the automatic synthesis of semantically and functionally correct software artifacts across both multiple programming languages (PLs) and natural languages (NLs). This area evaluates and trains LLMs tasked with mapping instructions—often expressed in diverse human languages—into valid code in a corresponding programming language, typically under constraints of functional correctness. In contrast to the early English-centric paradigm, the current focus is on maximally parallel, cross-lingual generalization, covering low-resource NLs and PLs, fine-grained structural scopes, multimodal input encodings, and robust benchmarking protocols. State-of-the-art research encompasses benchmark design, dataset construction, architectural advances, cross-lingual transfer, performance characterization, and error analysis.
1. Problem Definition, Scope, and Historical Evolution
Multilingual code generation formally requires mapping an instruction x (where x ranges over NLs such as English, Chinese, Arabic, etc.) to a program y (a code artifact in one of many PLs: Python, Java, C++, etc.), maximizing the probability P(y | x) such that y passes a defined set of unit tests or verification procedures (Peng et al., 2024, Moumoula et al., 24 Sep 2025). Early systems were restricted to English instructions and Python outputs, but recent advances demand:
- Dozens to hundreds of NLs (e.g., mHumanEval: 204 NLs; HumanEval-XL: 23 NLs (Raihan et al., 2024, Peng et al., 2024)),
- Dozens of PLs (e.g., McEval: 40 PLs (Chai et al., 2024)),
- Cross-product combinatorics (e.g., HumanEval-XL: 23 NLs × 12 PLs of parallel prompt–code pairs),
- Fine-grained structural granularity (M2G-Eval: class, function, block, and line (Xu et al., 27 Dec 2025)),
- Input modalities beyond text (WebMMU: multimodal sketches with multilingual embedded labels (Awal et al., 22 Aug 2025)).
Benchmarks have evolved from English-only, single-PL function synthesis (HumanEval) to multilingual, multi-PL, fully parallel, execution-based testbeds (Peng et al., 2024, Raihan et al., 2024, Chai et al., 2024, Cassano et al., 2022). This transition traces the field’s shift toward evaluating true cross-lingual semantic generalization.
2. Benchmarking and Datasets
2.1 Fully and Massively Multilingual Suites
- HumanEval-XL establishes a parallel suite across 23 NLs × 12 PLs, with 22,080 independently validated prompts and an average of 8.33 test cases per problem (Peng et al., 2024). It ensures distributional balance and parallelism for cross-lingual assessment.
- mHumanEval extends HumanEval to 204 NLs (using FLORES-200 coverage) and 25 PLs, with each NL-PL combination round-tripped through multiple translation and quality controls (BERTScore, CometKiwi, expert verification). It comprises 836,400 prompts in total (33,456 NL-translated problem statements crossed with 25 PLs), with 15 languages covered by expert human translations (Raihan et al., 2024).
- McEval spans 40 PLs, including mainstream, scripting, and low-resource PLs, each problem hand-authored and annotated (Chai et al., 2024).
- MultiPL-E and xCodeEval provide rule-based, test-driven translation from canonical Python benchmarks to up to 19 and 11 PLs, respectively, emphasizing parallelism in both prompts and test suites (Cassano et al., 2022, Khan et al., 2023).
| Benchmark | NLs | PLs | # Prompts | Test Protocol |
|---|---|---|---|---|
| HumanEval-XL | 23 | 12 | 22,080 | pass@1; execution |
| mHumanEval | 204 | 25 | 836,400 | pass@1; execution |
| McEval | 1 | 40 | 16,000+ | pass@1; execution |
| MultiPL-E | 1 | 19 | 2k–20k | pass@k; execution |
| xCodeEval | 1 | 11 | 25M solutions | pass@k; execution |
All modern benchmarks require exact functional correctness under unit tests (“pass@k”), with k=1 (greedy) most commonly reported.
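The execution-based protocol can be sketched as a minimal harness: write the candidate and its tests into one script, run it in a subprocess, and count the problem as solved only on a clean exit. This is an illustrative sketch, not any benchmark's actual harness (real harnesses add sandboxing and resource limits):

```python
import subprocess
import sys
import tempfile

def passes_tests(candidate_code: str, test_code: str, timeout: float = 5.0) -> bool:
    """Execute a generated candidate against its unit tests in a subprocess.

    A problem counts as solved only if the combined script exits cleanly,
    mirroring the strict execution-based pass@k protocol described above.
    """
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(candidate_code + "\n\n" + test_code)
        script = f.name
    try:
        result = subprocess.run([sys.executable, script],
                                capture_output=True, timeout=timeout)
        return result.returncode == 0
    except subprocess.TimeoutExpired:
        return False  # hung generations count as failures
```

Under greedy decoding (k=1), reported pass@1 is simply the fraction of problems for which this check returns True.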
2.2 Dataset Creation
- Translation pipelines increasingly combine large-scale machine translation (MT) with roundtrip back-translation, BERTScore ∈ [0,1] similarity thresholds (typically ≥0.95) (Peng et al., 2024, Raihan et al., 2024), human verification, and heuristics to ensure semantic parity across NLs. Discrepancies (term ambiguity, numerics, function names) are manually resolved.
- For code, rule-based translators target function signatures, type annotation normalization, data structure adaptation, and idiomatic comment conversion. All PLs maintain shared underlying test logic to guarantee functional equivalence (Cassano et al., 2022).
- Recent benchmarks incorporate domain-general and domain-specific tasks, multi-granularity spans (classes, blocks, lines), multi-step reasoning (WebMMU), and novel difficulty stratification (Xu et al., 27 Dec 2025, Awal et al., 22 Aug 2025).
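The round-trip QC step above can be sketched as a threshold filter. Here difflib's SequenceMatcher is only a lightweight stand-in for the BERTScore/CometKiwi scorers the cited pipelines actually use; the real scorer would be passed in as the `similarity` callable:

```python
from difflib import SequenceMatcher
from typing import Callable, Optional

def keep_translation(source_en: str,
                     back_translated_en: str,
                     similarity: Optional[Callable[[str, str], float]] = None,
                     threshold: float = 0.95) -> bool:
    """Round-trip QC: keep a translated prompt only if back-translating it
    into English scores above the similarity threshold (the cited
    pipelines use BERTScore >= 0.95; SequenceMatcher is a stand-in).
    """
    sim = similarity or (lambda a, b: SequenceMatcher(None, a, b).ratio())
    return sim(source_en, back_translated_en) >= threshold
```

Prompts rejected by this filter are the ones routed to manual resolution of term ambiguity, numerics, and function names.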
3. Model Architectures and Cross-Lingual Transfer Mechanisms
3.1 Data and Objective Scaling
- Multilingual pretraining on up to 850B code tokens from 23+ PLs (CodeGeeX (Zheng et al., 2023)); joint BPE vocabularies; explicit language tags; no language-specific parameter partitioning.
- Scaling model size is essential: balanced multi-language competence in code generation and translation requires >10B parameters (Zheng et al., 2023).
3.2 Cross-Lingual Methods
- Projection-based zero-shot transfer: Encode the NL prompt in a dense, language-agnostic space (e.g., LASER), learn a linear projector to map embeddings into the LLM’s code token space (trained on English only), and inject these into the transformer input layer (Li et al., 2024). This method halves logical error rates and doubles pass-all rates on non-English prompts without in-language training data.
- Intermediate representation (IR) grounding: Align code from multiple PLs to a shared interlingua (LLVM IR or custom JSON signatures), providing a structural bridge for generalization and translation (Paul et al., 2024, Moumoula et al., 24 Sep 2025). Continued pretraining on (code, IR) pairs delivers +1–6 absolute points in pass@1 on low-resource PLs.
- Multi-agent coordination (XL-CoGen): Orchestrate agents for IR synthesis, direct code generation, validation, code translation, and automated repair. Data-driven empirical bridging languages maximize transfer efficiency (Moumoula et al., 24 Sep 2025).
- Edit-based co-evolution: For codebase co-maintenance, generate cross-lingual edit sequences (anchor-contextualized “replace/insert/delete” ops) rather than full methods, enabling high-fidelity translation of code changes (Zhang et al., 2023).
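The projection-based transfer idea can be illustrated with a least-squares linear map. In this sketch the LASER dimensionality (1024) is real, but the target width `d_model` and all embeddings are random stand-ins; in practice the targets come from the code LLM's own input embeddings of English prompts:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions: LASER sentence embeddings are 1024-d;
# d_model stands in for the code LLM's input embedding width.
d_laser, d_model = 1024, 2048

# English-only training pairs (stand-in random data for illustration):
E_nl = rng.normal(size=(500, d_laser))    # LASER embeddings of English prompts
E_code = rng.normal(size=(500, d_model))  # matching LLM-input-space targets

# Fit the linear projector W by least squares on English data only.
W, *_ = np.linalg.lstsq(E_nl, E_code, rcond=None)

# Zero-shot use: a prompt in any language maps through the same W,
# because LASER places all languages in one shared embedding space.
prompt_embedding = rng.normal(size=(1, d_laser))  # e.g. a Chinese prompt
projected = prompt_embedding @ W                  # injected at the input layer
```

The key property is that W never sees non-English data: cross-lingual transfer comes entirely from the language-agnostic encoder.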
4. Empirical Performance, Bias Characterization, and Error Analysis
4.1 Performance Metrics and Trends
- pass@1 (and higher k) remains the dominant metric; it is defined as pass@k = E[1 − C(n−c, k) / C(n, k)], where c of the n sampled generations pass the tests (Peng et al., 2024). Mean pass@1 for GPT-4 on HumanEval-XL is ~78% (Python), ~80% (JS); on mHumanEval across all NLs/PLs, SOTA models retain high pass@1 on high-resource pairs and drop to 0.6 (or below) on low-resource pairs (Peng et al., 2024, Raihan et al., 2024).
- Multilingual models trained on equalized code corpora outperform monolinguals above 2.7B parameters and generalize out-of-domain (e.g., monolingual models that never saw PHP or Ruby achieve 4–18% pass@10 on those PLs) (Athiwaratkun et al., 2022).
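The pass@k numbers above come from the standard unbiased estimator over n samples per problem, which can be computed directly:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: 1 - C(n-c, k) / C(n, k),
    where c of the n generated samples pass the unit tests."""
    if n - c < k:
        return 1.0  # every size-k draw must contain a passing sample
    return 1.0 - comb(n - c, k) / comb(n, k)
```

Benchmark-level pass@k is the mean of this quantity over all problems; with greedy decoding (n = k = 1) it reduces to the plain pass rate.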
4.2 Bias and Generalization Gaps
- Multi-NL bias: Switching from English to Chinese in X-HumanEval-X yields an average pass@1 drop from 37.3% to 31.8% (minimum 13%) across nine base LCMs (Wang et al., 2024). Closed-source models (GPT-4o, Claude3.5) retain strong pass@1 even in extreme low-resource NLs, while code-fine-tuned (English-only) LLMs collapse on non-English prompts (Raihan et al., 2024, Wang et al., 2024).
- Multi-PL bias: Python is universally the “easiest” (e.g., GPT-4, pass@1 ~78% in HumanEval-XL), whereas C++, Scala, Go, and low-resource PLs exhibit gaps up to 20.9% compared to Python (Peng et al., 2024, Wang et al., 2024, Chai et al., 2024).
- Performance correlates strongly with resource level (per Joshi taxonomy) and model size; per-language pass@1 drops 1–6pp (GPT-4) vs 7–15pp (smaller models) from Class 5→3 NLs (Peng et al., 2024).
4.3 Error Analysis
- Typical error classes: logic errors dominate, followed by syntax/semantic errors (especially in strict PLs), and incomplete generations (Zheng et al., 2023, Peng et al., 2024).
- Morphological, script, and idiomatic divergence in NLs induce sharp failures unless explicitly covered in pretraining or fine-tuning (Raihan et al., 2024, Wang et al., 2022).
- Prompt translation at inference time mitigates, but does not eliminate, multi-NL bias. Multi-step (pivoted) translation further closes the gap (Wang et al., 2024).
- Larger models benefit from few-shot prompting in unfamiliar languages, reducing syntax and compilation errors (Athiwaratkun et al., 2022).
5. Architectural Innovations and Cross-Lingual Transfer Solutions
- LASER-based zero-shot projection enables strong performance in the absence of in-language code data: compared to original performance, error rates drop by 5–12pp and completion rates rise sharply with only an English-aligned projection layer (Li et al., 2024).
- IR alignment (ICRoder) leverages a shared semantic substrate, delivering pass@1 improvements ranging from +0.41 to +5.58 in low-resource PLs and boosting prompt robustness (Paul et al., 2024).
- Multi-agent pipelines (XL-CoGen) combine IR, code generation, empirical bridging, and one-shot repair, achieving +13 to +30pp pass@1 over single-stage or monolingual agent baselines, and exceeding direct fine-tuning on low-resource PLs by up to 13pp (Moumoula et al., 24 Sep 2025).
- Edit-based code co-evolution (Codeditor) achieves 65–71% exact-match on cross-lingual codebase change translation, above all generation-based baselines (Zhang et al., 2023).
6. Fine-Grained and Multitask Evaluation
- M2G-Eval establishes a four-level scope (class, function, block, line) and shows that current models’ similarity to reference implementations drops monotonically as the structural scope widens (≈60% at line level down to ≈28% at class level) (Xu et al., 27 Dec 2025).
- High Pearson cross-language correlation coefficients (ρ≈0.8–0.95) confirm that state-of-the-art models encode language-agnostic programming concepts rather than purely memorizing syntax (Xu et al., 27 Dec 2025).
- Block/function-level performance improves with richer cross-file context, implying value of IR and retrieval-augmented architectures.
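The cross-language consistency analysis reduces to a Pearson correlation over matched per-task scores. A minimal sketch with NumPy, using made-up illustrative numbers rather than M2G-Eval's actual data:

```python
import numpy as np

# Hypothetical per-task pass@1 scores for one model on two PLs; real
# analyses correlate matched task-level scores across languages.
python_scores = np.array([0.9, 0.7, 0.4, 0.8, 0.2, 0.6])
java_scores   = np.array([0.8, 0.6, 0.5, 0.7, 0.1, 0.5])

rho = np.corrcoef(python_scores, java_scores)[0, 1]  # Pearson correlation
```

A high rho across language pairs is what supports the claim that models encode language-agnostic programming concepts rather than per-language memorization.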
7. Practical Recommendations and Future Research Directions
- Fine-tune on curated, diverse, parallel code–NL corpora spanning both high- and low-resource NLs and PLs to mitigate bias (Wang et al., 2024, Raihan et al., 2024).
- Employ instruction tuning incorporating both multiple NLs and PLs (MEIC) to halve NL and PL generalization gaps (Wang et al., 2024).
- For quick mitigation in deployment, apply multi-step professional translation of prompts; however, this is no substitute for explicit multilingual alignment in the training corpus (Wang et al., 2024).
- Pursue deeper incorporation of human-verified translations, increase test-case coverage, and expand into interactive/multimodal tasks (e.g., UI design-to-code (Awal et al., 22 Aug 2025)).
- Develop broader multilingual instruction–answer corpora, leverage synthetic code–NL pairs, and integrate dynamic and AST-based metrics for richer evaluation (Raihan et al., 2024, Chai et al., 2024).
- Hierarchical, structure-aware planning methods and cross-lingual retrieval modules are likely to address persistent compositionality and context-handling bottlenecks (Xu et al., 27 Dec 2025).
In summary, multilingual code generation has transitioned from English-centric, translation-focused paradigms to a field characterized by broad linguistic diversity, fine-grained structural benchmarks, and a suite of data-driven, architectural, and functional innovations that collectively define the capabilities and limitations of modern code LLMs (Peng et al., 2024, Raihan et al., 2024, Wang et al., 2024, Chai et al., 2024).