Breaking the Curse of Multilinguality with Cross-lingual Expert Language Models
Abstract: Despite their popularity in non-English NLP, multilingual LLMs often underperform monolingual ones due to inter-language competition for model parameters. We propose Cross-lingual Expert LLMs (X-ELM), which mitigate this competition by independently training LLMs on subsets of the multilingual corpus. This process specializes X-ELMs to different languages while remaining effective as a multilingual ensemble. Our experiments show that when given the same compute budget, X-ELM outperforms jointly trained multilingual models across all considered languages and that these gains transfer to downstream tasks. X-ELM provides additional benefits beyond performance improvements: new experts can be iteratively added, adapting X-ELM to new languages without catastrophic forgetting. Furthermore, training is asynchronous, reducing the hardware requirements for multilingual training and democratizing multilingual modeling.
- Analyzing the mono- and cross-lingual pretraining dynamics of multilingual language models. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing.
- Terra Blevins and Luke Zettlemoyer. 2022. Language contamination helps explain the cross-lingual capabilities of English pretrained models. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing (EMNLP). Association for Computational Linguistics.
- When is multilinguality a curse? language modeling for 250 high-and low-resource languages. arXiv preprint arXiv:2311.09205.
- Parsing with multilingual bert, a small corpus, and a small treebank. In Findings of the Association for Computational Linguistics: EMNLP 2020, pages 1324–1334.
- XLM-E: Cross-lingual language model pre-training via ELECTRA. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers).
- Unimax: Fairer and more effective language sampling for large-scale multilingual pretraining. arXiv preprint arXiv:2304.09151.
- Unsupervised cross-lingual representation learning at scale. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 8440–8451, Online. Association for Computational Linguistics.
- Alexis Conneau and Guillaume Lample. 2019. Cross-lingual language model pretraining. Advances in neural information processing systems, 32.
- XNLI: Evaluating cross-lingual sentence representations. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2475–2485, Brussels, Belgium. Association for Computational Linguistics.
- Tim Dettmers and Luke Zettlemoyer. 2019. Sparse networks from scratch: Faster training without losing performance. CoRR, abs/1907.04840.
- BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers).
- Abteen Ebrahimi and Katharina Kann. 2021. How to adapt your pretrained multilingual model to 1600 languages. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 4555–4567.
- Rigging the lottery: Making all tickets winners. In Proceedings of the 37th International Conference on Machine Learning, volume 119 of Proceedings of Machine Learning Research, pages 2943–2952. PMLR.
- Fahim Faisal and Antonios Anastasopoulos. 2022. Phylogeny-inspired adaptation of multilingual models to new languages. In Proceedings of the 2nd Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 12th International Joint Conference on Natural Language Processing, pages 434–452.
- Larger-scale transformers for multilingual masked language modeling. In Proceedings of the 6th Workshop on Representation Learning for NLP (RepL4NLP-2021).
- DEMix layers: Disentangling domains for modular language modeling. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 5557–5576, Seattle, United States. Association for Computational Linguistics.
- Scaling expert language models with unsupervised domain discovery.
- Adaptive mixtures of local experts. Neural computation, 3(1):79–87.
- Exploring the benefits of training expert language models over instruction tuning. In International Conference on Machine Learning.
- Quality at a Glance: An Audit of Web-Crawled Multilingual Datasets. Transactions of the Association for Computational Linguistics, 10:50–72.
- Beyond distillation: Task-level mixture-of-experts for efficient inference. In Findings of the Association for Computational Linguistics: EMNLP 2021, pages 3577–3599, Punta Cana, Dominican Republic. Association for Computational Linguistics.
- Branch-train-merge: Embarrassingly parallel training of expert language models.
- XLM-V: Overcoming the vocabulary bottleneck in multilingual masked language models. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing.
- Few-shot learning with multilingual generative language models. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 9019–9052, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.
- Uriel and lang2vec: Representing languages as typological, geographical, and phylogenetic vectors. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers, volume 2, pages 8–14.
- Hesham Mostafa and Xin Wang. 2019. Parameter efficient training of deep convolutional neural networks by dynamic sparse reparameterization. In Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, pages 4646–4655. PMLR.
- A corpus and cloze evaluation for deeper understanding of commonsense stories. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 839–849, San Diego, California. Association for Computational Linguistics.
- Mini but mighty: Efficient multilingual pretraining with linguistically-informed data selection. In Findings of the Association for Computational Linguistics: EACL 2023, pages 1221–1236.
- Lifting the curse of multilinguality by pre-training modular transformers. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 3479–3495.
- Mad-x: An adapter-based framework for multi-task cross-lingual transfer.
- Machel Reid and Mikel Artetxe. 2022. PARADISE: Exploiting parallel data for multilingual sequence-to-sequence pretraining. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies.
- Bloom: A 176b-parameter open-access multilingual language model. arXiv preprint arXiv:2211.05100.
- On negative interference in multilingual models: Findings and a meta-learning treatment. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 4438–4450.
- Overcoming catastrophic forgetting in massively multilingual continual learning.
- Shijie Wu and Mark Dredze. 2020. Are all languages created equal in multilingual BERT? In Proceedings of the 5th Workshop on Representation Learning for NLP, pages 120–130, Online. Association for Computational Linguistics.
- mT5: A massively multilingual pre-trained text-to-text transformer. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 483–498, Online. Association for Computational Linguistics.
- Paws-x: A cross-lingual adversarial dataset for paraphrase identification. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 3687–3692.
- Learning and evaluating general linguistic intelligence. arXiv preprint arXiv:1901.11373.
Explain it Like I'm 14
Overview
This paper is about making language models work better across many different languages, especially for languages with less data. It introduces a new way to train “expert” models that each specialize in certain languages, then combine them so the overall system understands many languages well. The goal is to break the “curse of multilinguality,” which is when one big model trained on lots of languages ends up performing worse on individual languages because they compete for space inside the model.
Key Questions
The paper asks a few simple questions:
- Can a team of smaller, specialized LLMs beat one big model trained on all languages at once?
- What’s the best way to group languages so each expert model learns more effectively?
- Can we add new languages later without making the model forget the old ones?
- Do improvements in language modeling also help on real tasks like reasoning or story completion?
How They Did It (Methods)
Think of this like building a team of specialists instead of one jack-of-all-trades:
- Branch-Train-Merge (BTM): Imagine starting with a shared base model (like a common starting recipe), then “branching” into several expert chefs. Each expert is trained independently on their own cuisine (language group). Later, at “merge” time, you pick the best expert to use or mix their outputs (like a taste test) to handle a new sentence.
- Two ways to group languages (clustering):
  - TF-IDF clustering: This automatically groups documents by their word patterns. TF-IDF is basically a way to find which words are important in each text, then cluster similar texts together.
  - Linguistic typology clustering: This groups languages by how similar they are linguistically (like family trees, with Romance languages together, etc.). Here, each expert learns a small set of closely related languages.
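As a rough illustration of the TF-IDF route, the sketch below turns documents into TF-IDF vectors and groups them with a capacity-limited k-means. It is a toy in plain Python, not the paper's pipeline: the whitespace tokenizer, the smoothed IDF formula, the farthest-point initialization, and the function names `tfidf_vectors`, `sq_dist`, and `balanced_kmeans` are all assumptions made for this example.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one L2-normalized TF-IDF vector per document (toy whitespace tokenizer)."""
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    # Document frequency: in how many documents each term appears.
    df = Counter(tok for toks in tokenized for tok in set(toks))
    # Smoothed IDF (an assumption): rarer terms get larger weights.
    idf = {t: math.log((1 + len(docs)) / (1 + df[t])) + 1.0 for t in vocab}
    vectors = []
    for toks in tokenized:
        tf = Counter(toks)
        vec = [tf[t] / max(len(toks), 1) * idf[t] for t in vocab]
        norm = math.sqrt(sum(v * v for v in vec)) or 1.0
        vectors.append([v / norm for v in vec])
    return vectors


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def balanced_kmeans(vectors, k, n_iter=10):
    """Toy balanced k-means: no cluster may hold more than ceil(n / k) documents."""
    n = len(vectors)
    capacity = math.ceil(n / k)
    # Farthest-point initialization (an assumption that keeps the example deterministic).
    centroids = [list(vectors[0])]
    while len(centroids) < k:
        far = max(range(n), key=lambda i: min(sq_dist(vectors[i], c) for c in centroids))
        centroids.append(list(vectors[far]))
    assignment = [None] * n
    for _ in range(n_iter):
        # Greedy balanced assignment: visit (document, cluster) pairs from closest
        # to farthest and accept a pair only if the cluster still has room.
        pairs = sorted(
            (sq_dist(v, c), i, j)
            for i, v in enumerate(vectors)
            for j, c in enumerate(centroids)
        )
        assignment, sizes = [None] * n, [0] * k
        for _, i, j in pairs:
            if assignment[i] is None and sizes[j] < capacity:
                assignment[i] = j
                sizes[j] += 1
        # Move each centroid to the mean of its current members.
        for j in range(k):
            members = [vectors[i] for i in range(n) if assignment[i] == j]
            if members:
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return assignment, centroids


docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "el gato se sentó en la alfombra",
    "el perro se sentó en el tronco",
]
assignment, centroids = balanced_kmeans(tfidf_vectors(docs), k=2)
print(assignment)
```

On this tiny input the two English sentences share words with each other but not with the Spanish ones, so they land in one cluster and the Spanish sentences in the other. Each cluster would then supply the training data for one expert.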
- Inference (how they use the experts):
  - Top-1 expert: Pick the single best expert for the language and use it.
  - Ensemble: Combine several experts’ predictions, weighted by how similar the input text is to what each expert was trained on. This can be more accurate but also more computationally expensive.
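The two inference modes can be sketched with one weighting function. The example below assumes each expert is summarized by a centroid vector and that the weights come from a softmax over negative squared distances, scaled by a temperature. That exact formula, along with the names `ensemble_weights` and `mix_predictions` and all the numbers, is an assumption for illustration, not something this summary specifies.

```python
import math


def ensemble_weights(doc_vec, centroids, temperature=0.1, top_k=None):
    """Softmax over negative squared distances from the input to each expert's centroid."""
    logits = [
        -sum((x - c) ** 2 for x, c in zip(doc_vec, centroid)) / temperature
        for centroid in centroids
    ]
    # Optionally keep only the top_k closest experts; top_k=1 is the "Top-1 expert" mode.
    ranked = sorted(range(len(logits)), key=lambda j: logits[j], reverse=True)
    keep = set(ranked if top_k is None else ranked[:top_k])
    peak = max(logits)  # subtracting the max keeps exp() numerically stable
    exps = [math.exp(l - peak) if j in keep else 0.0 for j, l in enumerate(logits)]
    total = sum(exps)
    return [e / total for e in exps]


def mix_predictions(expert_probs, weights):
    """Weighted average of the experts' next-token probability distributions."""
    return [
        sum(w * probs[t] for w, probs in zip(weights, expert_probs))
        for t in range(len(expert_probs[0]))
    ]


# Hypothetical centroids for three experts and a toy 3-token vocabulary.
centroids = [[0.0, 0.0], [1.0, 1.0], [4.0, 4.0]]
expert_probs = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1], [0.3, 0.3, 0.4]]
doc_vec = [0.2, 0.1]

weights = ensemble_weights(doc_vec, centroids, temperature=1.0)
sharper = ensemble_weights(doc_vec, centroids, temperature=0.1)
top1 = ensemble_weights(doc_vec, centroids, temperature=1.0, top_k=1)
mixed = mix_predictions(expert_probs, weights)
print(weights, sharper, top1, mixed)
```

A lower temperature pushes more of the weight onto the closest expert, and `top_k=1` puts all of it there. The extra cost of ensembling comes from running every kept expert on the input before mixing.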
- Hierarchical Multi-Round (HMR) training:
  - First, train an expert on a cluster of related languages.
  - Later, “branch” from that expert to make even more specialized experts for sub-groups or new languages.
  - This is like teaching a general “Romance languages” expert, then creating a “Spanish-Italian” expert from it, and so on.
  - Big advantage: You can add new languages without changing (and risking forgetting in) the existing experts.
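The branching idea can be shown with a toy expert pool. In the sketch below, "training" is just a stand-in that nudges a list of numbers toward a target. The `Expert` class, the `branch` and `toy_train` helpers, and the language codes are hypothetical and chosen only to mirror the Romance and Spanish-Italian example above.

```python
import copy
from dataclasses import dataclass
from typing import Optional


@dataclass
class Expert:
    name: str
    languages: tuple
    params: list
    parent: Optional[str] = None


def toy_train(params, target, steps=3, lr=0.5):
    """Stand-in for language-model training: nudges every parameter toward `target`."""
    for _ in range(steps):
        params = [p + lr * (target - p) for p in params]
    return params


def branch(pool, parent_name, child_name, languages, target):
    """Copy the parent's weights, train only the copy, and add it to the pool."""
    parent = pool[parent_name]
    child = Expert(child_name, tuple(languages), copy.deepcopy(parent.params), parent_name)
    child.params = toy_train(child.params, target)
    pool[child_name] = child
    return child


pool = {"seed": Expert("seed", (), [0.0, 0.0])}

# Round 1: a broad expert for a cluster of related languages (hypothetical codes).
branch(pool, "seed", "romance", ("es", "it", "fr"), target=1.0)
romance_before = list(pool["romance"].params)

# Round 2: a narrower expert that starts from the round-1 expert's weights.
child = branch(pool, "romance", "es-it", ("es", "it"), target=2.0)

print(pool["romance"].params == romance_before)  # the parent expert is untouched
print(child.parent, child.params)
```

Because each new expert trains on its own copy of the weights, the earlier experts stay exactly as they were, which is why adding languages this way does not risk forgetting.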
- Measuring performance:
  - They use perplexity, a score that tells how “surprised” the model is when predicting the next word. Lower perplexity means better predictions.
  - They also test real tasks: XNLI (logic/entailment), XStoryCloze (story ending), and PAWS-X (paraphrase detection).
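To make perplexity concrete, here is a minimal sketch using its standard definition: the exponential of the average negative log-probability the model gave to each correct next token. The function name and the example probabilities are made up for illustration.

```python
import math


def perplexity(token_log_probs):
    """Perplexity = exp(average negative log-probability of the correct next tokens)."""
    if not token_log_probs:
        raise ValueError("need at least one token")
    return math.exp(-sum(token_log_probs) / len(token_log_probs))


# A model that spreads its bets evenly over 4 choices is as "surprised" as a
# 4-way guess, so its perplexity is about 4.
uniform = [math.log(0.25)] * 10
# A model that gives the right token 80% probability is far less surprised.
confident = [math.log(0.8)] * 10

print(perplexity(uniform), perplexity(confident))
```

The second value is much lower than the first, matching the rule that lower perplexity means better predictions.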
Main Findings
Here are the most important results:
- Expert models beat one big “dense” multilingual model when trained with the same compute budget. The best setup used 8 experts, each trained on small groups of related languages, and consistently achieved lower perplexity across all tested languages.
- Specializing by language families (typology) worked better than automatic TF-IDF grouping in most cases.
- These gains were balanced: both high-resource (lots of data) and low-resource languages improved, with low-resource languages often benefiting the most.
- Adding new languages is easier and safer:
  - Using HMR training, they added four new languages (Azerbaijani, Hebrew, Polish, Swedish) by branching from related existing experts.
  - This approach outperformed standard “keep training the big model” methods and avoided catastrophic forgetting (losing knowledge of earlier languages).
- Improvements transfer to real tasks:
  - The expert models also did better on cross-lingual tasks like natural language inference, story completion, and paraphrase detection, in both zero-shot (no examples) and few-shot (a few examples) settings.
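Zero-shot and few-shot evaluation both work by building a prompt and letting the model choose between candidate answers, with no weight updates. The sketch below shows only that mechanic. The prompt layout, the `build_prompt` and `predict` helpers, and especially `toy_score` (a placeholder where a real model's score for each answer would go) are assumptions, not the paper's evaluation code.

```python
def build_prompt(demonstrations, query):
    """Join k solved examples and the unsolved query into one prompt (k = 0 is zero-shot)."""
    blocks = [f"{text} {label}" for text, label in demonstrations]
    blocks.append(query)
    return "\n\n".join(blocks)


def predict(score_fn, demonstrations, query, labels):
    """Pick the label whose continuation scores highest; no weights are updated."""
    prompt = build_prompt(demonstrations, query)
    return max(labels, key=lambda label: score_fn(prompt, " " + label))


def toy_score(prompt, continuation):
    """Hypothetical scorer standing in for a model's log-probability of `continuation`."""
    return prompt.count(continuation.strip())


demos = [
    ("Sentence 1: A dog runs. Sentence 2: A dog is running. Paraphrase?", "yes"),
    ("Sentence 1: A dog runs. Sentence 2: A cat sleeps. Paraphrase?", "no"),
    ("Sentence 1: It rains. Sentence 2: It is raining. Paraphrase?", "yes"),
]
query = "Sentence 1: He left. Sentence 2: He went away. Paraphrase?"

few_shot_prompt = build_prompt(demos, query)
zero_shot_prompt = build_prompt([], query)
prediction = predict(toy_score, demos, query, ["yes", "no"])
print(few_shot_prompt)
print(prediction)
```

With a real model, `score_fn` would return how likely the model finds each answer after reading the prompt; the toy version here only counts words so that the example runs on its own.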
Why It Matters
This research shows a practical way to make multilingual LLMs:
- Fairer: It helps lower-resourced languages instead of favoring only the big ones.
- More flexible: You can add new languages later without breaking what works.
- More accessible: Training experts independently needs less synchronized hardware, which can reduce the barriers to building strong multilingual systems.
- More effective: The gains in perplexity (better language modeling) also lead to real-world task improvements.
In simple terms, instead of one model trying to speak every language equally well (and failing for many), this approach builds a team of language specialists who work together. That teamwork leads to better understanding, easier updates, and stronger performance for everyone.
Glossary
- autoregressive LM objective: A training objective where an LLM predicts the next token given previous tokens. "train for a fixed number of steps with an autoregressive LM objective."
- balanced k-means clustering: A variant of k-means that enforces roughly equal-sized clusters. "we then perform balanced k-means clustering on these representations to obtain approximately balanced subsets of the data"
- Branch-Train-Merge (BTM): A training paradigm where multiple experts are branched from a seed model, trained independently on data subsets, and merged for inference. "Branch-Train-Merge (BTM; \citealt{li2022branchtrainmerge}) alleviates this cost by dividing the total compute among smaller expert LLMs"
- c-BTM: A cluster-based generalization of BTM that scales experts with data/compute and uses clustering to define domains. "c-BTM \cite{gururangan2023scaling} generalizes the above approach with cluster-based representations of domains."
- catastrophic forgetting: Degradation on previously learned tasks/languages when a model is further trained on new data. "adapting x-elm to new languages is more efficient than continued training of a dense LM and does not risk catastrophic forgetting of previously seen languages"
- compute budget: The fixed amount of computational resources (e.g., tokens, steps, FLOPs) allocated for training. "when given the same compute budget, x-elm outperforms jointly trained multilingual models"
- Cross-lingual Expert LLMs (X-ELM): An ensemble of independently trained language experts specialized to subsets of a multilingual corpus. "We propose Cross-lingual Expert LLMs (x-elm), which mitigate this competition by independently training LLMs on subsets of the multilingual corpus."
- curse of multilinguality: Performance degradation in multilingual models due to competition among languages for limited capacity. "this phenomenon (termed the curse of multilinguality) can significantly harm low-resource languages"
- DEMix layers: Mixture-of-expert-style layers that route sequences to per-layer domain experts. "and on DEMix layers \citep{demix}, which routes sequences to per-layer feed-forward experts based on metadata."
- dense model: A single model where all parameters are updated jointly and used for every input. "Training a set of x-elms is more computationally efficient than a comparable dense model"
- ensemble routing method: A procedure to weight and combine expert outputs during inference based on input similarity. "adapting the c-BTM ensemble routing method."
- Euclidean distance: A standard distance metric in vector space used to compare TF-IDF embeddings to expert centroids. "calculating the Euclidean distance from "
- expert LLMs: Smaller models specialized to particular domains or languages, trained independently and later combined. "smaller expert LLMs that are trained independently on different domains"
- FLOP budgets: Training cost measured in floating-point operations, used to compare efficiency across methods. "a set of small expert models performs similarly to equivalently sized dense models at vastly reduced FLOP budgets."
- Hierarchical Multi-Round (HMR) training: Iterative expert training along a language hierarchy, branching from parent experts to sub-clusters. "Hierarchical Multi-Round training (HMR), an algorithm for efficiently training new experts specialized to unseen languages"
- in-context learning (ICL): Evaluating or prompting models to perform tasks using examples given in the prompt without gradient updates. "We test the performance of our x-elms on three tasks through an in-context learning (ICL) framework"
- k-means clustering: An algorithm that partitions data into k clusters by minimizing within-cluster variance. "we then perform balanced k-means clustering on these representations"
- language-adaptive pretraining (LAPT): Continuing pretraining to adapt a model to a new target language. "language-adaptive pretraining \cite[LAPT,][]{chau2020parsing}"
- lang2vec: A resource that represents languages via typological features and provides similarity metrics. "We build this hierarchy using the language similarity metrics in lang2vec \cite{littell2017uriel}"
- linguistic typology: The classification of languages based on structural features, used to group languages for expert training. "grouping languages by linguistic typology"
- Linguistic Typology Clustering: Clustering documents by language identity and typological similarity rather than surface features. "Linguistic Typology Clustering"
- mC4: A large multilingual corpus derived from CommonCrawl used for pretraining. "We train our x-elms on mC4, an open-source, multilingual pretraining corpus derived from CommonCrawl"
- Mixture-of-Experts (MoE): Architectures that route inputs to a subset of expert components to increase capacity efficiently. "Other MoE models have recently been applied to multilingual settings."
- PAWS-X: A multilingual paraphrase identification benchmark. "PAWS-X \cite{yang2019paws} is a binary classification task that requires the model to determine whether a pair of sentences are paraphrases."
- perplexity: A standard language-modeling metric measuring how well a model predicts a sample. "We separately calculate the perplexity on the mC4 validation sets of each pretraining language."
- seed LM: The initial pretrained model used to initialize experts before specialization. "We initialize (branch) experts from a seed LM"
- sparsely activated LLMs: Models where only a subset of parameters (experts) are active for a given input, improving efficiency. "Sparsely activated LLMs \citep{pmlr-v119-evci20a,pmlr-v97-mostafa19a,dettmers-sparse-from-scratch} route inputs through a subset of the total model parameters."
- temperature parameter: A scalar that controls the sharpness of a softmax distribution over expert weights. "T is a temperature parameter over the ensemble weight distribution."
- TF-IDF: Term Frequency–Inverse Document Frequency; a weighting scheme for representing documents used for clustering and routing. "either through automatic TF-IDF clustering of documents"
- Top-1 Expert: An inference strategy that selects a single best expert per example or language. "Top-1 Expert"
- World Atlas of Language Structures (WALS): A database of structural features across languages used for typological comparisons. "World Atlas of Language Structures"
- XGLM: A family of multilingual autoregressive LLMs used as seed models in this work. "we initialize our x-elms with an existing multilingual pretrained model, XGLM \cite{lin2022fewshot}"
- X-MOD: A modular multilingual architecture with language-specific components. "which proposes a new modular model architecture, X-MOD, that contains language-specific modules."
- XNLI: A multilingual natural language inference benchmark. "XNLI \cite{conneau2018xnli} is a multilingual natural language inference benchmark"
- XStoryCloze: A cross-lingual story completion benchmark extending StoryCloze to multiple languages. "XStoryCloze \cite{lin2022fewshot} is a manually translated benchmark extending StoryCloze"
- x-BTM: An extension of BTM to multilingual settings that trains experts on different language subsets and merges them for inference. "These x-elms are trained with x-BTM, a new extension of the Branch-Train-Merge paradigm"
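The lang2vec and linguistic typology entries above describe building a hierarchy of languages from similarity scores. A minimal sketch of that idea follows, using average-linkage agglomerative clustering over made-up binary feature vectors. The vectors, the Hamming distance, the linkage choice, and the function names are all assumptions for illustration; they are not real lang2vec data or its API.

```python
from itertools import combinations

# Hypothetical typological feature vectors (NOT real lang2vec values).
FEATURES = {
    "es": [1, 1, 0, 0, 1, 0],
    "it": [1, 1, 0, 0, 1, 1],
    "ru": [0, 0, 1, 1, 0, 1],
    "pl": [0, 0, 1, 1, 0, 0],
}


def hamming(a, b):
    """Fraction of feature positions where two languages differ."""
    return sum(x != y for x, y in zip(a, b)) / len(a)


def build_hierarchy(features):
    """Repeatedly merge the two most similar clusters; returns nested tuples of codes."""
    # Each cluster is (member languages, tree built so far); start with one leaf each.
    clusters = [((lang,), lang) for lang in sorted(features)]

    def linkage(a, b):
        # Average distance over all cross-cluster language pairs.
        pairs = [(x, y) for x in a for y in b]
        return sum(hamming(features[x], features[y]) for x, y in pairs) / len(pairs)

    while len(clusters) > 1:
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda ij: linkage(clusters[ij[0]][0], clusters[ij[1]][0]),
        )
        merged = (clusters[i][0] + clusters[j][0], (clusters[i][1], clusters[j][1]))
        clusters = [c for idx, c in enumerate(clusters) if idx not in (i, j)] + [merged]
    return clusters[0][1]


tree = build_hierarchy(FEATURES)
print(tree)
```

Each internal node of the resulting tree is a candidate language group for one expert, and a child node is a natural place to branch a more specialized expert during HMR training.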