Multilingual Large Language Models (MLLMs)
Last updated: June 13, 2025
This article synthesizes the evidence, evaluation, and documentation presented in "On the Multilingual Capabilities of Very Large-Scale English Language Models" (Armengol-Estapé et al., 2021); all statements are grounded in that paper's reported data and analysis.
Introduction
Recent advances in large language models (LLMs), such as GPT-3, have demonstrated extraordinary few-shot and zero-shot capabilities across a range of natural language processing tasks. However, most studies and benchmarks have focused on English, and little systematic evidence exists regarding the multilingual capabilities of such models, especially when their training corpora are overwhelmingly English-dominant. This article provides a technical and empirical analysis of the multilingual proficiency of GPT-3, focusing on Catalan, a language almost absent from the model's pretraining data. The findings illuminate fundamental properties of multilingual transfer, scaling laws, and practical limitations relevant to practitioners deploying LLMs in low-resource linguistic environments.
Multilingual Pretraining: Data Composition
GPT-3 was trained on roughly 197 billion words, approximately 93% of which are English text (about 181 billion words). Catalan, by contrast, accounts for only about 35 million words, less than 0.018% of the total. This extreme imbalance makes for a challenging test of zero-shot multilinguality.
For context, other multilingual models have substantially more Catalan data:
| Model | Total Words (M) | Catalan Words (M) |
|---|---|---|
| mBERT | Unclear† | ≈200 |
| XLM-R | 295,008 | 1,752 |
| GPT-3 | 196,755 | 35 |
†mBERT’s Catalan size is estimated.
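As a quick check on the imbalance, the Catalan share follows directly from the word counts reported in the table above. A minimal sketch in Python:

```python
# Catalan's share of GPT-3's reported training data, using the word counts
# (in millions) from the table above.
total_words_m = 196_755    # total GPT-3 training words, in millions
catalan_words_m = 35       # Catalan words, in millions

share = catalan_words_m / total_words_m
print(f"Catalan share of training data: {share:.4%}")  # ~0.0178%, i.e. below 0.018%
```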
Despite this limited exposure, GPT-3 exhibits surprising abilities in Catalan, which can reasonably be read as an upper bound on what to expect for other underrepresented languages.
Experimental Tasks and Methodology
Two core tasks were used to probe GPT-3’s latent multilingual skills:
1. Extractive Question Answering (QA)
- Setup: Zero-shot condition with all prompts, context passages, and questions entirely in Catalan.
- Model Variants: Ada (350M parameters), Babbage (1.3B), Curie (6.7B), and Davinci (175B).
- Dataset: Catalan XQuAD (translated from the original, aligning with SQuAD evaluation protocols), featuring 240 paragraphs and 1,060 question-answer pairs.
- Metrics: SQuAD-style F1 and Exact Match (EM), computed with the standard AllenAI implementations (a minimal sketch of these metrics follows this list).
- Baselines: mBERT and XLM-R fine-tuned for Catalan QA, representing the state of the art for supervised multilingual QA.
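For readers unfamiliar with these metrics, SQuAD-style EM and F1 reduce to string normalization plus token overlap between the predicted and gold answer spans. The sketch below mirrors the standard recipe; note that the official scripts strip English articles, a step a Catalan adaptation would need to revisit:

```python
import re
import string
from collections import Counter

def normalize(text: str) -> str:
    """Lowercase, drop punctuation and (English) articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in set(string.punctuation))
    text = re.sub(r"\b(a|an|the)\b", " ", text)  # English-specific; adapt for Catalan
    return " ".join(text.split())

def exact_match(prediction: str, reference: str) -> float:
    return float(normalize(prediction) == normalize(reference))

def f1(prediction: str, reference: str) -> float:
    pred_tokens = normalize(prediction).split()
    ref_tokens = normalize(reference).split()
    common = Counter(pred_tokens) & Counter(ref_tokens)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)
```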
2. Natural Language Generation (NLG)
- Setup: GPT-3 (Davinci) is prompted with 20 Catalan news headlines to synthesize 60 sample sentences; 60 native Catalan sentences from the same news corpus serve as controls.
- Evaluation: 9 native Catalan speakers rate sentences for fluency and correctness (1–5 scale), with 3 raters per sentence. Copying or memorization is explicitly ruled out by manual inspection.
Results
Extractive QA Performance
| Model | F1 | EM |
|---|---|---|
| GPT-3: Ada | 5.26 | 0.38 |
| GPT-3: Babbage | 10.08 | 1.13 |
| GPT-3: Curie | 16.66 | 5.00 |
| GPT-3: Davinci | 38.43 | 17.74 |
| XLM-RoBERTa | 67.10 | 46.42 |
| mBERT | 67.15 | 46.51 |
- Interpretation: GPT-3 Davinci, despite operating zero-shot and having seen minimal Catalan data, achieves over half the F1 score of fully supervised, state-of-the-art multilingual encoders. The drop-off for smaller GPT-3 variants is steep, underscoring the decisive impact of model scale.
- Scaling law: F1 and EM increase non-linearly with model size, in line with previously observed scaling laws, even though the language is nearly absent from the training data (a rough log-log fit over the four variants is sketched below).
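The scaling trend can be eyeballed with a simple least-squares fit in log-log space over the four GPT-3 variants in the table above. This is purely illustrative; four points do not establish a law:

```python
import numpy as np

# Illustrative log-log fit of zero-shot Catalan QA F1 against parameter count,
# using the four GPT-3 variants from the table above.
params = np.array([350e6, 1.3e9, 6.7e9, 175e9])   # Ada, Babbage, Curie, Davinci
f1_scores = np.array([5.26, 10.08, 16.66, 38.43])

slope, intercept = np.polyfit(np.log(params), np.log(f1_scores), 1)
print(f"fitted power-law exponent: {slope:.2f}")
# F1 ~ exp(intercept) * N**slope; on these four points the exponent comes out
# around 0.3, i.e. steady but sublinear gains with scale.
```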
Natural Language Generation
| Source | Mean Rating (1–5) | Std. Dev. | % Rated Above Human Mean |
|---|---|---|---|
| Human | 4.49 | 0.57 | 53.33 |
| GPT-3 | 3.83 | 1.05 | 33.33 |
- Statistics: The average rating for GPT-3's sentences lagged native ones by only 0.66 points (out of 5), a statistically significant difference (p = 0.00026) but a surprisingly small gap given the language's scarcity in pretraining (a sketch of such a comparison follows this list).
- Distribution: The bulk of GPT-3's outputs are rated 4–5 (good to excellent), with about a third scoring above the mean for human-written samples. However, 13% fall below a 3, usually due to code-switching (Catalan-English mixing) or outright gibberish.
- Qualitative Patterns: GPT-3 demonstrates the ability to mimic dialectal variation (e.g., Valencian Catalan). Lower-rated outputs show evidence of code-switching, ungrammaticality, or semantically odd phrasing—common issues in extremely low-resource and zero-shot contexts.
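The group comparison behind these numbers can be reproduced in outline from per-sentence ratings. The sketch below uses placeholder rating lists and a Mann-Whitney U test as a stand-in, since this article does not restate which significance test the authors applied:

```python
import numpy as np
from scipy.stats import mannwhitneyu

# Placeholder per-sentence mean ratings (1-5); in the study each sentence was
# scored by three raters, with 60 sentences per group.
human_scores = np.array([4.7, 4.3, 5.0, 4.0, 4.5])   # illustrative values only
gpt3_scores = np.array([4.3, 3.0, 4.7, 2.3, 4.0])    # illustrative values only

gap = human_scores.mean() - gpt3_scores.mean()
pct_above_human_mean = (gpt3_scores > human_scores.mean()).mean() * 100

# Non-parametric two-sample test, used here as a stand-in; the exact test the
# authors applied is not restated in this article.
stat, p_value = mannwhitneyu(human_scores, gpt3_scores, alternative="two-sided")
print(f"mean gap: {gap:.2f}, GPT-3 sentences above human mean: "
      f"{pct_above_human_mean:.0f}%, p = {p_value:.3f}")
```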
Technical Insights
- Generative Strength: GPT-3 (Davinci) can synthesize highly fluent and occasionally dialect-specific text even in a nearly unseen language, reflecting pattern extraction from a tiny sample of data.
- Understanding Limitation: While generative outputs impress, performance in tasks demanding deeper comprehension (like QA) is strictly limited compared to supervised or systematically multilingual models.
- Tokenization Inefficiency: Because the base tokenizer is English-centric, Catalan inputs require more tokens per word, increasing processing cost and potentially error rates for low-resource languages (illustrated in the sketch below).
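The tokenization overhead is easy to check empirically: GPT-3 reuses GPT-2's byte-level BPE, which was learned on mostly English text. A minimal sketch (the example sentences are arbitrary and exact counts will vary):

```python
import tiktoken

# GPT-3 reuses GPT-2's byte-level BPE, learned mostly on English text, so
# Catalan sentences typically split into more tokens per word.
enc = tiktoken.get_encoding("gpt2")

samples = {
    "English": "The library opens every morning at nine.",
    "Catalan": "La biblioteca obre cada matí a les nou.",
}
for language, text in samples.items():
    n_tokens = len(enc.encode(text))
    print(f"{language}: {len(text.split())} words -> {n_tokens} BPE tokens")
```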
Model Size and Scaling Laws
- Critical Role of Scale: Usable, meaningful capability in Catalan emerges only in the largest (175B-parameter) Davinci model. Scaling up translates directly into gains in both generation and understanding, while smaller variants fall off sharply.
- Few-shot Potential: The authors hypothesize that providing in-context examples (few-shot learning) would further improve non-English performance, though the benchmarks reported here remain purely zero-shot (a hypothetical prompt layout is sketched below).
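To make the zero-shot versus few-shot distinction concrete, the sketch below lays out one plausible prompt format. The Catalan field labels ("Pregunta"/"Resposta") and the idea of simply prepending solved examples are illustrative assumptions, not the paper's actual prompts:

```python
# Hypothetical prompt layout for Catalan extractive QA. Zero-shot uses only the
# query block; few-shot prepends solved examples in the same format. The Catalan
# labels "Pregunta"/"Resposta" are illustrative, not the paper's actual prompts.
def qa_block(context, question, answer=""):
    return f"{context}\nPregunta: {question}\nResposta: {answer}".rstrip()

def build_prompt(context, question, examples=()):
    blocks = [qa_block(c, q, a) for c, q, a in examples]   # k in-context examples
    blocks.append(qa_block(context, question))             # the actual query
    return "\n\n".join(blocks)

zero_shot = build_prompt(
    "La Sagrada Família és una basílica de Barcelona dissenyada per Antoni Gaudí.",
    "Qui va dissenyar la Sagrada Família?",
)
print(zero_shot)
```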
Implications for Other Languages
- Transfer and Relatedness: Catalan is a Romance language with substantial cognate overlap with English's Latin-derived vocabulary, and may have benefitted from those shared roots. Even so, the results suggest that any language with at least minimal corpus presence can see non-trivial zero-shot benefits, at least for generative tasks; more distant languages, with less cognate overlap, may see weaker results.
- Tokenization and Efficiency: For non-Indo-European or otherwise underrepresented languages, tokenizer optimization and targeted vocabulary strategies may be needed to realize similar gains (a minimal vocabulary-training sketch follows this list).
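As one concrete form of targeted vocabulary adaptation, a byte-level BPE vocabulary can be learned directly on text in the target language. The sketch below assumes the Hugging Face tokenizers library and uses a toy in-memory corpus; the data and vocabulary size are placeholders, not a realistic configuration:

```python
from tokenizers import ByteLevelBPETokenizer

# Sketch of learning a byte-level BPE vocabulary directly on Catalan text with
# the Hugging Face `tokenizers` library. The tiny in-memory corpus and vocab
# size are placeholders, not a realistic configuration.
corpus = [
    "La recerca en llengües minoritàries necessita més dades.",
    "El model genera frases en català amb força fluïdesa.",
]

tokenizer = ByteLevelBPETokenizer()
tokenizer.train_from_iterator(corpus, vocab_size=500, min_frequency=1)
print(tokenizer.encode("La recerca en català").tokens)
```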
Practical Recommendations for Deployment
- Only Large Models Deliver Multilingual Utility: Multilingual generation and understanding in nearly-absent languages require deploying the largest available model, with corresponding computational cost.
- Task-Specific Suitability: For comprehension-heavy applications such as QA, supervised fine-tuned multilingual encoders remain superior. GPT-3-style models are best leveraged for generative tasks, creative writing, or low-resource prototyping, provided post-generation filtering is used to screen outputs for code-switching or errors (a heuristic filter is sketched after this list).
- Fine-tuning and In-Context Learning: Extending these findings, practical deployments could benefit from light supervision or prompt tuning in the target language, if any data or annotation is available.
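As one simple realization of such post-generation filtering, the heuristic below flags outputs containing common English function words. The marker list and threshold are illustrative assumptions; a proper language-identification model would be a more robust choice:

```python
import re

# Crude post-generation filter: flag outputs containing common English function
# words as likely Catalan-English code-switching. The marker list and threshold
# are illustrative assumptions, not a vetted method.
ENGLISH_MARKERS = {"the", "and", "of", "is", "are", "was", "with", "this", "that"}

def looks_code_switched(sentence, max_hits=0):
    words = re.findall(r"[a-zàèéíïòóúüç·]+", sentence.lower())
    hits = sum(word in ENGLISH_MARKERS for word in words)
    return hits > max_hits

outputs = [
    "El govern anuncia noves mesures per a la recerca.",
    "El projecte is based on la nova llei de ciència.",
]
kept = [s for s in outputs if not looks_code_switched(s)]
print(kept)  # the second sentence is flagged and dropped
```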
Conclusion
GPT-3, despite being trained on overwhelmingly English text, demonstrates robust zero-shot generative ability and baseline language understanding in Catalan, a language virtually absent from its pretraining data. The critical factors are (1) sheer model scale and (2) the presence, however minimal, of the target language in the training corpus. Serious limitations remain, however, for language understanding tasks (extractive QA) and for languages with greater typological distance from English. For practitioners and researchers, these results underscore the importance of extreme scale for cross-lingual generalization, the need for tokenizer and data optimization for low-resource languages, and the continuing value of supervised multilingual encoders for language understanding tasks.
References
- Armengol-Estapé, J., et al. (2021). On the Multilingual Capabilities of Very Large-Scale English Language Models.
- Brown, T. B., et al. (2020). Language Models are Few-Shot Learners.
- Artetxe, M., et al. (2019). On the Cross-lingual Transferability of Monolingual Representations.
- Devlin, J., et al. (2018). BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding.
- Conneau, A., et al. (2019). Unsupervised Cross-lingual Representation Learning at Scale.
- Kaplan, J., et al. (2020). Scaling Laws for Neural Language Models.
For reproduction details, evaluation scripts, full rating tables, prompts, and code, see the supplementary documentation associated with the original paper.