Multilingual Large Language Models

Updated 1 December 2025
  • Multilingual Large Language Models (mLLMs) are transformer architectures pretrained on heterogeneous multilingual corpora to enable robust cross-lingual transfer across various NLP tasks.
  • Their designs include encoder-only, decoder-only, and encoder–decoder variants optimized with objectives like MLM, CLM, and TLM to suit classification, generation, and translation tasks.
  • Challenges such as data imbalance, capacity dilution, and inherent biases call for innovative evaluation benchmarks and debiasing techniques to ensure equitable language coverage and ethical outcomes.

Multilingual LLMs (mLLMs) are parameter-sharing neural transformers pretrained on heterogeneous textual corpora spanning scores to hundreds of human languages. They are engineered to achieve cross-lingual transfer, enabling competence in generative, discriminative, and reasoning tasks across diverse linguistic typologies, often under highly imbalanced data regimes. mLLMs have become foundational in natural language processing for applications including universal translation, cross-lingual QA, and global content moderation.

1. Core Architectures and Training Objectives

State-of-the-art mLLMs are all based on the Transformer architecture but diverge into three canonical variants:

  • Encoder-only (e.g., mBERT, XLM-R): Masked Language Modeling (MLM) objective, strong on classification and sequence labelling tasks.
  • Decoder-only (e.g., BLOOM, LLaMA, GPT-3): Causal Language Modeling (CLM) with autoregressive next-token prediction, optimized for generation and large-scale prompting workflows.
  • Encoder–decoder (e.g., mT5, mBART): Denoising or translation objectives for text-to-text tasks, coupling bidirectional context modeling and autoregressive decoding.
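For experimentation, the three variants map onto standard checkpoint families. A minimal loading sketch, assuming the Hugging Face transformers library is installed (the checkpoint names are illustrative choices, not prescribed by the surveyed papers):

```python
# Minimal sketch: instantiate one representative checkpoint per variant.
# Assumes the Hugging Face `transformers` library; names are illustrative.
from transformers import (
    AutoModelForMaskedLM,    # encoder-only (MLM head), e.g. XLM-R
    AutoModelForCausalLM,    # decoder-only (next-token prediction), e.g. BLOOM
    AutoModelForSeq2SeqLM,   # encoder-decoder (text-to-text), e.g. mT5
)

variants = {
    "encoder-only":    AutoModelForMaskedLM.from_pretrained("xlm-roberta-base"),
    "decoder-only":    AutoModelForCausalLM.from_pretrained("bigscience/bloom-560m"),
    "encoder-decoder": AutoModelForSeq2SeqLM.from_pretrained("google/mt5-small"),
}

for name, model in variants.items():
    n_params = sum(p.numel() for p in model.parameters())
    print(f"{name}: {n_params / 1e6:.0f}M parameters")
```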

Principal training losses:

  • $\mathcal{L}_{\text{MLM}} = -\sum_{i\in M}\log P(x_i \mid x_{\backslash M})$
  • $\mathcal{L}_{\text{CLM}} = -\sum_{t=1}^{T} \log P(x_t \mid x_{<t})$
  • For cross-lingual modeling, auxiliary objectives include Translation Language Modeling (TLM) and contrastive InfoNCE-style alignment losses (Xu et al., 1 Apr 2024).
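Both core losses reduce to token-level cross-entropy, masked in one case and shifted in the other. A minimal PyTorch sketch with toy logits and labels (not tied to any particular model):

```python
import torch
import torch.nn.functional as F

# Toy shapes: batch of 2 sequences, length 8, vocabulary of 100 tokens.
logits = torch.randn(2, 8, 100)          # model scores for each position
labels = torch.randint(0, 100, (2, 8))   # ground-truth token ids

# MLM: cross-entropy only at masked positions; -100 marks ignored targets.
mlm_labels = labels.clone()
mask = torch.rand(2, 8) < 0.15           # ~15% of positions become prediction targets
mask[0, 0] = True                        # guarantee at least one masked target
mlm_labels[~mask] = -100
loss_mlm = F.cross_entropy(logits.view(-1, 100), mlm_labels.view(-1), ignore_index=-100)

# CLM: predict token t from tokens < t, i.e. shift predictions and targets by one.
loss_clm = F.cross_entropy(
    logits[:, :-1].reshape(-1, 100),     # predictions made at positions 0..T-2
    labels[:, 1:].reshape(-1),           # targets are the next tokens 1..T-1
)
print(loss_mlm.item(), loss_clm.item())
```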

Mixture-of-Experts (MoE) architectures and adaptive-capacity designs are prominent for scaling the parameter footprint without a proportional increase in compute, providing language-specialized routing in multilingual settings (Zhu et al., 17 Nov 2024).
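A minimal sketch of token-level top-1 routing (illustrative only; production MoE layers add load-balancing auxiliary losses, capacity limits, and expert parallelism):

```python
import torch
import torch.nn as nn

class TinyMoE(nn.Module):
    """Token-level mixture-of-experts layer with top-1 routing (illustrative)."""

    def __init__(self, d_model: int = 64, n_experts: int = 4):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts)   # gating network
        self.experts = nn.ModuleList([
            nn.Sequential(
                nn.Linear(d_model, 4 * d_model),
                nn.GELU(),
                nn.Linear(4 * d_model, d_model),
            )
            for _ in range(n_experts)
        ])

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (n_tokens, d_model); each token is routed to exactly one expert.
        gate = self.router(x).softmax(dim=-1)          # routing probabilities
        top_p, top_i = gate.max(dim=-1)                # top-1 expert per token
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            sel = top_i == e                           # tokens assigned to expert e
            if sel.any():
                out[sel] = top_p[sel].unsqueeze(-1) * expert(x[sel])
        return out

moe = TinyMoE()
print(moe(torch.randn(10, 64)).shape)  # torch.Size([10, 64])
```

Because each token activates only one expert's feed-forward block, the total parameter count grows with the number of experts while per-token compute stays roughly constant.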

2. Construction and Imbalance of Training Corpora

Training corpora for mLLMs are constructed from a mixture of web crawls (CCNet, Common Crawl, mC4), Wikipedia dumps, digitized books, code repositories, and parallel corpora (OPUS, WikiMatrix, MultiUN). Even large models suffer severe class imbalance:

| Model (corpus) | #Languages | English share | Top languages | Low-resource share |
|---|---|---|---|---|
| mBERT | 104 | 20–30% | top-5 combined > 50% | <10% |
| BLOOM (350 GB) | 100+ | 30% | Chinese 16%, French 13% | 21% |

Imbalance is quantified by the relative imbalance index:

$$I = 1 - \frac{H(N)}{\log L}$$

where $H(N)$ is the entropy of the per-language count distribution and $L$ is the number of languages. High $I$ is detrimental to coverage of low-resource languages (Xu et al., 1 Apr 2024).
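The index is straightforward to compute from per-language document or token counts. A minimal sketch with toy counts:

```python
import math

def imbalance_index(lang_counts: dict[str, int]) -> float:
    """I = 1 - H(N)/log L, where H(N) is the entropy of the language distribution."""
    total = sum(lang_counts.values())
    probs = [c / total for c in lang_counts.values() if c > 0]
    entropy = -sum(p * math.log(p) for p in probs)
    return 1.0 - entropy / math.log(len(lang_counts))

# Toy example: a heavily English-skewed corpus vs. a perfectly balanced one.
skewed = {"en": 900, "fr": 50, "sw": 30, "yo": 20}
balanced = {"en": 250, "fr": 250, "sw": 250, "yo": 250}
print(round(imbalance_index(skewed), 3))    # ~0.69: strongly skewed toward English
print(round(imbalance_index(balanced), 3))  # 0.0: uniform coverage
```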

Pretraining data size per language is the principal driver of downstream performance for seen languages; median F1 on SIB-200 increases monotonically with a language's proportional representation in the pretraining data (Nezhad et al., 29 Apr 2024). For unseen languages, script type and language family are decisive because they mediate cross-lingual transfer (Nezhad et al., 29 Apr 2024).

3. Multilingual Representation and Cross-Lingual Alignment

Cross-lingual alignment in mLLMs is achieved by creating universal latent spaces:

  • Static mapping: Orthogonal transformation of monolingual embedding spaces.
  • Contrastive alignment: InfoNCE losses on parallel data at the sentence or document level (a minimal implementation follows this list):

$$\mathcal{L}_{\text{InfoNCE}} = -\sum_{i=1}^{N} \log \frac{\exp(q_i \cdot k_i / \tau)}{\sum_{j=1}^{N} \exp(q_i \cdot k_j / \tau)}$$

  • TLM objective: Mask and jointly predict tokens across bilingual input sequences.
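A minimal implementation of the InfoNCE alignment loss over a batch of parallel sentence embeddings (random tensors stand in for encoder outputs; the batch mean replaces the sum in the formula above):

```python
import torch
import torch.nn.functional as F

def info_nce(q: torch.Tensor, k: torch.Tensor, tau: float = 0.07) -> torch.Tensor:
    """InfoNCE over a batch of parallel sentence embeddings.

    q[i] and k[i] embed the same sentence in two languages; every other
    k[j] in the batch serves as an in-batch negative.
    """
    q = F.normalize(q, dim=-1)                 # cosine-style similarities
    k = F.normalize(k, dim=-1)
    logits = q @ k.T / tau                     # (N, N) similarity matrix
    targets = torch.arange(q.size(0))          # positive pairs sit on the diagonal
    return F.cross_entropy(logits, targets)    # mean over the batch

# Toy usage: 8 parallel pairs of 32-dimensional "sentence embeddings".
loss = info_nce(torch.randn(8, 32), torch.randn(8, 32))
print(loss.item())
```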

Universal representations are evaluated by cross-lingual retrieval (Acc@1, Acc@5) and bilingual dictionary induction (BLI) scores (Xu et al., 1 Apr 2024). Internal analyses show that, with training and scale, multilingual LLMs learn a "Lingua Franca": a shared semantic subspace in which same-meaning sentences from different languages have high embedding similarity, as quantified by the Semantic Alignment Development Score (SADS), which rises with both training and scale (Zeng et al., 15 Oct 2024). The density of key linguistic neurons concentrates in the first and last layers, with training pushing the model toward more language-agnostic processing signatures and compressing the key regions for each language (Zeng et al., 15 Oct 2024).
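The retrieval-style evaluation can be reproduced in a few lines: given parallel sentence embeddings, Acc@k is the fraction of source sentences whose true translation ranks among the k nearest targets. A minimal sketch with toy embeddings (any encoder could supply them):

```python
import torch
import torch.nn.functional as F

def retrieval_acc_at_k(src: torch.Tensor, tgt: torch.Tensor, k: int = 1) -> float:
    """Fraction of source sentences whose gold translation is among the
    top-k nearest target embeddings by cosine similarity."""
    src = F.normalize(src, dim=-1)
    tgt = F.normalize(tgt, dim=-1)
    sims = src @ tgt.T                               # (N, N) cosine similarities
    topk = sims.topk(k, dim=-1).indices              # candidate translations per source
    gold = torch.arange(src.size(0)).unsqueeze(-1)   # the i-th target is the gold match
    return (topk == gold).any(dim=-1).float().mean().item()

# Random embeddings give chance-level accuracy; aligned encoders score far higher.
print(retrieval_acc_at_k(torch.randn(100, 64), torch.randn(100, 64), k=5))
```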

4. Bias, Fairness, and Representational Disparities

Bias in mLLMs manifests in several forms:

  • Language bias: Imbalanced data and English dominance yield sharp disparities—low-resource languages show higher perplexity, lower accuracy, and impaired tokenization (over-segmentation) (Xu et al., 1 Apr 2024).
  • Demographic bias and stereotype leakage: Stereotypical group–trait associations learned in high-resource (source) languages leak into low-resource (target) languages, as verified by cross-lingual regression coefficients $\alpha_{\text{src}\to\text{tgt}}$ in stereotype association tasks. For example, GPT-3.5 shows 7 significant leakage flows, with Hindi the most susceptible, and positive as well as negative stereotype transfer across scripts and families (Cao et al., 2023).
  • Evaluation bias: Metrics such as BERTScore and BLEURT may favor languages with richer pretraining.

Bias measurement uses disparity scores over group outputs, Stereotype Association Scores (SAS), and the variance of model behavior across languages and demographics (Xu et al., 1 Apr 2024). Debiasing approaches include adversarial training (removing sensitive-attribute information from representations), counterfactual data augmentation, subspace projection, and self-debiasing prompts. However, entrenched data and architectural biases are not eliminated by current techniques.
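As one simple, deliberately generic instantiation of such disparity measurement (not a specific paper's metric), the gap and variance of a per-language score can be computed directly; the accuracy values below are hypothetical:

```python
import statistics

# Hypothetical per-language accuracies for one model on a shared task.
acc = {"en": 0.86, "de": 0.82, "hi": 0.71, "sw": 0.63, "yo": 0.58}

fairness_gap = max(acc.values()) - min(acc.values())   # best-vs-worst language gap
spread = statistics.pvariance(acc.values())            # variance of behavior across languages

print(f"fairness gap = {fairness_gap:.2f}, variance = {spread:.4f}")
```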

5. Multilingual Optimization Strategies and the Curse of Multilinguality

mLLMs are inherently subject to capacity dilution as the number of supported languages increases:

  • The curse of multilinguality: Given model capacity $C$ and $L$ languages, per-language capacity decays as $C/L$. Empirically, accuracy initially rises with more languages (positive transfer), then falls (capacity dilution), especially for low-resource targets (Gurgurov et al., 15 Jun 2024).
  • Mitigation:
    • Modular approaches: Per-language adapters (e.g., X-MOD) sidestep destructive interference, restoring near-monolingual performance for new languages with small extra parameter sets.
    • Data balancing: Temperature-scaled sampling or UniMax/proportional budget allocation increases exposure for low-resource languages during batch formation (Liu et al., 23 Oct 2024); a sampling sketch follows this list.
    • Curriculum learning: Dynamic scheduling gradually increases the low-resource language share as convergence progresses (Liu et al., 23 Oct 2024).
    • Mixture-of-Experts: Token-level routing to language-aware experts allows aggregate model capacity to scale without memory or compute explosion (Zhu et al., 17 Nov 2024).
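A minimal sketch of temperature-scaled sampling, a common recipe in which per-language sampling probabilities are proportional to the raw corpus share raised to an exponent α < 1 (the exact exponent and schedule vary across papers):

```python
def temperature_sampling_probs(lang_counts: dict[str, int], alpha: float = 0.3) -> dict[str, float]:
    """p_l ∝ (n_l / N) ** alpha; alpha < 1 (often written as 1/T) up-weights
    low-resource languages relative to their raw corpus share."""
    total = sum(lang_counts.values())
    scaled = {lang: (count / total) ** alpha for lang, count in lang_counts.items()}
    z = sum(scaled.values())
    return {lang: p / z for lang, p in scaled.items()}

# Toy corpus: English dominates, Yoruba is tiny.
counts = {"en": 1_000_000, "fr": 100_000, "sw": 10_000, "yo": 1_000}
for lang, p in temperature_sampling_probs(counts).items():
    raw = counts[lang] / sum(counts.values())
    print(f"{lang}: raw share {raw:.4f} -> sampling prob {p:.4f}")
```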

Specialized approaches such as weight pruning exploit the latent “translation features” that mediate alignment, pruning weights unrelated to high-magnitude translation directions and yielding consistent gains on non-English inference tasks without retraining (Kim et al., 25 Sep 2024).

6. Evaluation Frameworks, Benchmarks, and Practical Deployment

mLLMs are evaluated by a multi-dimensional benchmark suite:

| Task type | Benchmarks |
|---|---|
| Cross-lingual classification | XNLI, XTREME, MLQA, TyDiQA |
| Machine translation | FLORES-101/200, WMT |
| Reading comprehension | Belebele, XQuAD, AfriQA |
| Societal / reasoning | HellaSwag, XStoryCloze, ARC |
| Summarization | XLSum, WikiLingua |
| Safety / fairness | XSAFETY, MultiJail, RTP-LX |

Accuracy, BLEU, COMET, MLC (Multilingual Consistency), fairness gaps, and propensity for stereotype leakage are core metrics (Han et al., 24 Jun 2025, Cao et al., 2023).

A practical pipeline includes robust document filtering (language identification, deduplication, PII removal), balanced tokenizer design (BPE, unigram LM, or byte-level fallback for robustness to rare tokens and emoji), in-batch language sampling, and inclusive downstream evaluation (Liu et al., 23 Oct 2024). Real-world deployments span customer-service bots (−20% handle time, +12 pt CSAT), conversational search, and multilingual speech recognition (Liu et al., 23 Oct 2024, Denisov et al., 16 Apr 2024).
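As a rough illustration of the filtering stage, the sketch below applies a length filter, language identification, and exact deduplication. Here detect_language is a caller-supplied, hypothetical helper (e.g. a language-ID model); real pipelines add PII scrubbing and fuzzy deduplication:

```python
import hashlib
import re

def _normalize(text: str) -> str:
    """Lightweight normalization used only for duplicate hashing."""
    return re.sub(r"\s+", " ", text.lower()).strip()

def filter_documents(docs, detect_language, allowed_langs, min_chars=200):
    """Minimal document-filtering pass: length filter, language ID, exact dedup.
    `detect_language(text) -> str` is a caller-supplied, hypothetical helper."""
    seen = set()
    for doc in docs:
        if len(doc) < min_chars:
            continue                      # drop very short / boilerplate pages
        if detect_language(doc) not in allowed_langs:
            continue                      # keep only target languages
        digest = hashlib.sha1(_normalize(doc).encode("utf-8")).hexdigest()
        if digest in seen:
            continue                      # exact duplicate already emitted
        seen.add(digest)
        yield doc

# Example usage with a trivial stand-in language detector.
docs = ["Hello world. " * 30, "Hello world. " * 30, "Bonjour le monde. " * 30]
kept = list(filter_documents(docs, lambda t: "fr" if "Bonjour" in t else "en", {"en", "fr"}))
print(len(kept))  # 2: the duplicate English document is dropped
```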

7. Open Challenges and Research Directions

Several challenges remain in achieving equitable and reliable mLLMs:

  • Low-resource coverage: Despite advances, 88.38% of languages are still low-resource, with over a billion speakers affected by inadequate coverage (Liu et al., 23 Oct 2024).
  • Effective and inclusive evaluation: Development of aligned cross-lingual benchmarks (e.g., MuBench, with 61 languages and 3.92M fully aligned items) and metrics such as MLC is enabling finer-grained analysis of both accuracy and representational consistency (Han et al., 24 Jun 2025).
  • Cultural and ethical alignment: mLLMs risk propagating Western-centric views and stereotypes. Evaluation of cultural appropriateness and norm adherence, as well as RLHF with in-culture human-in-the-loop, is required for trust and societal impact (Xu et al., 1 Apr 2024, Zhu et al., 17 Nov 2024).
  • Efficiency and sustainability: Resource-efficient adaptation routes, such as parameter-efficient fine-tuning, curriculum adaptation, and weight merging, are vital for democratizing access to under-served linguistic communities (Huang et al., 21 Dec 2024).
  • Multimodality and cross-modal transfer: Integration of speech, vision, and structured data, as in models like BLOOMZMMS, augments text representations and enables zero-shot generalization across modalities and tasks (Denisov et al., 16 Apr 2024).
  • Security and robust defense: Multilingual collaborative defense frameworks, with soft prompt optimization, are necessary to counter vulnerabilities such as cross-lingual jailbreaking and safety misalignment in rare or unseen languages (Li et al., 17 May 2025).

Future progress in mLLMs depends on dynamic, script-/family-aware subword vocabularies, parameter-efficient cross-lingual modules, inclusive datasets, and rigorous, multilayered evaluation frameworks that jointly address linguistic, cultural, and safety requirements (Xu et al., 1 Apr 2024, Zhu et al., 17 Nov 2024, Liu et al., 23 Oct 2024).
