
Breaking the Curse of Multilinguality with Cross-lingual Expert Language Models

Published 19 Jan 2024 in cs.CL | (2401.10440v2)

Abstract: Despite their popularity in non-English NLP, multilingual LLMs often underperform monolingual ones due to inter-language competition for model parameters. We propose Cross-lingual Expert LLMs (X-ELM), which mitigate this competition by independently training LLMs on subsets of the multilingual corpus. This process specializes X-ELMs to different languages while remaining effective as a multilingual ensemble. Our experiments show that when given the same compute budget, X-ELM outperforms jointly trained multilingual models across all considered languages and that these gains transfer to downstream tasks. X-ELM provides additional benefits over performance improvements: new experts can be iteratively added, adapting X-ELM to new languages without catastrophic forgetting. Furthermore, training is asynchronous, reducing the hardware requirements for multilingual training and democratizing multilingual modeling.

References (38)
  1. Analyzing the mono- and cross-lingual pretraining dynamics of multilingual language models. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing.
  2. Terra Blevins and Luke Zettlemoyer. 2022. Language contamination helps explain the cross-lingual capabilities of English pretrained models. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing (EMNLP). Association for Computational Linguistics.
  3. When is multilinguality a curse? language modeling for 250 high-and low-resource languages. arXiv preprint arXiv:2311.09205.
  4. Parsing with multilingual bert, a small corpus, and a small treebank. In Findings of the Association for Computational Linguistics: EMNLP 2020, pages 1324–1334.
  5. XLM-E: Cross-lingual language model pre-training via ELECTRA. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers).
  6. Unimax: Fairer and more effective language sampling for large-scale multilingual pretraining. arXiv preprint arXiv:2304.09151.
  7. Unsupervised cross-lingual representation learning at scale. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 8440–8451, Online. Association for Computational Linguistics.
  8. Alexis Conneau and Guillaume Lample. 2019. Cross-lingual language model pretraining. Advances in neural information processing systems, 32.
  9. XNLI: Evaluating cross-lingual sentence representations. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2475–2485, Brussels, Belgium. Association for Computational Linguistics.
  10. Tim Dettmers and Luke Zettlemoyer. 2019. Sparse networks from scratch: Faster training without losing performance. CoRR, abs/1907.04840.
  11. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers).
  12. Abteen Ebrahimi and Katharina Kann. 2021. How to adapt your pretrained multilingual model to 1600 languages. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 4555–4567.
  13. Rigging the lottery: Making all tickets winners. In Proceedings of the 37th International Conference on Machine Learning, volume 119 of Proceedings of Machine Learning Research, pages 2943–2952. PMLR.
  14. Fahim Faisal and Antonios Anastasopoulos. 2022. Phylogeny-inspired adaptation of multilingual models to new languages. In Proceedings of the 2nd Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 12th International Joint Conference on Natural Language Processing, pages 434–452.
  15. Larger-scale transformers for multilingual masked language modeling. In Proceedings of the 6th Workshop on Representation Learning for NLP (RepL4NLP-2021).
  16. DEMix layers: Disentangling domains for modular language modeling. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 5557–5576, Seattle, United States. Association for Computational Linguistics.
  17. Scaling expert language models with unsupervised domain discovery.
  18. Adaptive mixtures of local experts. Neural computation, 3(1):79–87.
  19. Exploring the benefits of training expert language models over instruction tuning. In International Conference on Machine Learning.
  20. Quality at a Glance: An Audit of Web-Crawled Multilingual Datasets. Transactions of the Association for Computational Linguistics, 10:50–72.
  21. Beyond distillation: Task-level mixture-of-experts for efficient inference. In Findings of the Association for Computational Linguistics: EMNLP 2021, pages 3577–3599, Punta Cana, Dominican Republic. Association for Computational Linguistics.
  22. Branch-train-merge: Embarrassingly parallel training of expert language models.
  23. XLM-V: Overcoming the vocabulary bottleneck in multilingual masked language models. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing.
  24. Few-shot learning with multilingual generative language models. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 9019–9052, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.
  25. Uriel and lang2vec: Representing languages as typological, geographical, and phylogenetic vectors. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers, volume 2, pages 8–14.
  26. Hesham Mostafa and Xin Wang. 2019. Parameter efficient training of deep convolutional neural networks by dynamic sparse reparameterization. In Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, pages 4646–4655. PMLR.
  27. A corpus and cloze evaluation for deeper understanding of commonsense stories. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 839–849, San Diego, California. Association for Computational Linguistics.
  28. Mini but mighty: Efficient multilingual pretraining with linguistically-informed data selection. In Findings of the Association for Computational Linguistics: EACL 2023, pages 1221–1236.
  29. Lifting the curse of multilinguality by pre-training modular transformers. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 3479–3495.
  30. Mad-x: An adapter-based framework for multi-task cross-lingual transfer.
  31. Machel Reid and Mikel Artetxe. 2022. PARADISE: Exploiting parallel data for multilingual sequence-to-sequence pretraining. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies.
  32. Bloom: A 176b-parameter open-access multilingual language model. arXiv preprint arXiv:2211.05100.
  33. On negative interference in multilingual models: Findings and a meta-learning treatment. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 4438–4450.
  34. Overcoming catastrophic forgetting in massively multilingual continual learning.
  35. Shijie Wu and Mark Dredze. 2020. Are all languages created equal in multilingual BERT? In Proceedings of the 5th Workshop on Representation Learning for NLP, pages 120–130, Online. Association for Computational Linguistics.
  36. mT5: A massively multilingual pre-trained text-to-text transformer. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 483–498, Online. Association for Computational Linguistics.
  37. Paws-x: A cross-lingual adversarial dataset for paraphrase identification. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 3687–3692.
  38. Learning and evaluating general linguistic intelligence. arXiv preprint arXiv:1901.11373.

Summary

  • The paper introduces x-elm, which mitigates multilingual model capacity issues by training expert models on segmented language data.
  • It uses x-BTM, an extension of Branch-Train-Merge, to partition the multilingual corpus, train one expert per partition independently, and merge the experts at inference time.
  • Comparative experiments show that x-elm outperforms standard multilingual models with balanced perplexity gains and improved adaptability to new languages.

Introduction to Cross-lingual Expert LLMs (x-elm)

Current multilingual LLMs, which handle many languages within a single model, often underperform monolingual ones because the languages compete for limited model capacity. This problem, often referred to as the "curse of multilinguality," is particularly detrimental to low-resource languages. To address it, the paper introduces Cross-lingual Expert LLMs (x-elm), a set of independently trained language experts that are combined at inference time.

The Curse of Multilinguality and x-elm Solution

The "curse of multilinguality" is the phenomenon where a multilingual model covering many languages performs worse on each of them than a comparable monolingual model would, because a fixed set of parameters has to be shared among all the languages. x-elm addresses this by training individual expert LLMs on different subsets of a multilingual corpus. Each expert specializes in specific languages, and together the experts still function as an effective multilingual ensemble.

Breaking Down x-elm and Its Training Process

The x-elm framework relies on a method called x-BTM, an extension of the Branch-Train-Merge paradigm. x-BTM partitions a multilingual corpus either by linguistic typology or by TF-IDF clustering of documents, trains one expert LLM independently on each partition, and combines the experts at inference time. Because each expert is trained separately, new experts can be added for new languages without updating the existing ones, so previously learned languages are not at risk of being forgotten. Training is also asynchronous: the experts do not have to be trained at the same time or on the same hardware, which lowers the hardware requirements and makes multilingual modeling more accessible.
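The TF-IDF representation used for the clustering option can be illustrated with a small, standard-library-only sketch. It is not the paper's pipeline: the whitespace tokenizer, the smoothed IDF formula, the toy documents, and the function names `tfidf_vectors` and `cosine` are all assumptions made for illustration, and the sketch stops at measuring document similarity rather than running balanced k-means.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one sparse TF-IDF vector (dict: term -> weight) per document."""
    # Naive whitespace tokenizer; an assumption made for this sketch only.
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append({
            # Term frequency times a smoothed inverse document frequency.
            term: (count / len(tokens))
            * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1.0)
            for term, count in counts.items()
        })
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


# Toy corpus (hypothetical): two Spanish and two German sentences.
docs = [
    "el gato come pescado",
    "el perro come carne",
    "die katze frisst fisch",
    "der hund frisst fleisch",
]
vecs = tfidf_vectors(docs)
same_language = cosine(vecs[0], vecs[1])   # shares "el" and "come"
cross_language = cosine(vecs[0], vecs[2])  # no shared terms
```

Documents in the same language share vocabulary and therefore end up closer together than documents in different languages, which is why clustering on word patterns tends to recover language-like groups without using language labels.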

Comparative Results and Future Implications

In the paper's experiments, x-elm outperforms jointly trained multilingual models on every language considered when both are given the same compute budget. The perplexity gains are balanced across languages with different resource levels, and adapting x-elm to new languages outperforms continued training of a dense model. These benefits extend to downstream tasks, which indicates that the language modeling gains carry over to cross-lingual evaluation.
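Since the comparison above is stated in terms of perplexity, a minimal sketch of how that metric is computed may help. It assumes per-token natural-log probabilities are already available from some model; the function name `perplexity` and the example values are illustrative and are not taken from the paper.

```python
import math


def perplexity(token_log_probs):
    """Perplexity = exp(average negative log-likelihood per token)."""
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    avg_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(avg_nll)


# A model that assigns probability 0.25 to every correct token is as
# uncertain as a uniform choice among 4 options.
uniform_guess = [math.log(0.25)] * 8
# A model that assigns probability 0.9 to every correct token.
confident = [math.log(0.9)] * 8

ppl_uniform = perplexity(uniform_guess)
ppl_confident = perplexity(confident)
```

Lower values mean the model assigned higher probability to the text it was evaluated on, so "perplexity gains" refers to a reduction in this number.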

This paper lays a foundation for future research on sparse modeling tailored to multilinguality and could lead to improvements in clustering methods and expert allocation techniques. With further validation in diverse settings and at larger scales, x-elm has the potential not only to mitigate the curse of multilinguality but also to level the playing field for low-resource languages in NLP applications.


Explain it Like I'm 14

Overview

This paper is about making large language models (LLMs) work better across many different languages, especially languages with less data. It introduces a new way to train “expert” models that each specialize in certain languages, then combine them so the overall system understands many languages well. The goal is to break the “curse of multilinguality,” which is when one big model trained on lots of languages ends up performing worse on individual languages because they compete for space inside the model.

Key Questions

The paper asks a few simple questions:

  • Can a team of smaller, specialized LLMs beat one big model trained on all languages at once?
  • What’s the best way to group languages so each expert model learns more effectively?
  • Can we add new languages later without making the model forget the old ones?
  • Do improvements in language modeling also help on real tasks like reasoning or story completion?

How They Did It (Methods)

Think of this like building a team of specialists instead of one jack-of-all-trades:

  • Branch-Train-Merge (BTM): Imagine starting with a shared base model (like a common starting recipe), then “branching” into several expert chefs. Each expert is trained independently on their own cuisine (language group). Later, at “merge” time, you pick the best expert to use or mix their outputs (like a taste test) to handle a new sentence.
  • Two ways to group languages (clustering):
    • TF-IDF clustering: This automatically groups documents by their word patterns. TF-IDF is basically a way to find which words are important in each text, then cluster similar texts together.
    • Linguistic typology clustering: This groups languages by how similar they are linguistically (like family trees—Romance languages together, etc.). Here, each expert learns a small set of closely related languages.
  • Inference (how they use the experts):
    • Top-1 expert: Pick the single best expert for the language and use it.
    • Ensemble: Combine several experts’ predictions, weighted by how similar the input text is to what each expert was trained on. This can be more accurate but also more computationally expensive.
  • Hierarchical Multi-Round (HMR) training:
    • First, train an expert on a cluster of related languages.
    • Later, “branch” from that expert to make even more specialized experts for sub-groups or new languages.
    • This is like teaching a general “Romance languages” expert, then creating a “Spanish-Italian” expert from it, and so on.
    • Big advantage: You can add new languages without changing the existing experts, so there is no risk of them forgetting what they already learned.
  • Measuring performance:
    • They use perplexity, a score that tells how “surprised” the model is when predicting the next word. Lower perplexity means better predictions.
    • They also test real tasks: XNLI (logic/entailment), XStoryCloze (story ending), and PAWS-X (paraphrase detection).
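The two inference strategies described in the list above (top-1 expert and weighted ensemble) can be sketched as follows. This is a simplified illustration under stated assumptions, not the paper's implementation: the input and the expert centroids are plain lists of floats standing in for TF-IDF embeddings, the weights are a softmax over negative squared Euclidean distances scaled by a temperature, and the names `ensemble_weights`, `top1_expert`, and `mix_next_token_probs` are hypothetical.

```python
import math


def squared_distance(x, centroid):
    """Squared Euclidean distance between an input embedding and a centroid."""
    return sum((a - b) ** 2 for a, b in zip(x, centroid))


def ensemble_weights(x, centroids, temperature=1.0):
    """Softmax over negative squared distances: closer experts weigh more."""
    scores = [-squared_distance(x, c) / temperature for c in centroids]
    max_score = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - max_score) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def top1_expert(x, centroids):
    """Index of the single closest expert."""
    distances = [squared_distance(x, c) for c in centroids]
    return distances.index(min(distances))


def mix_next_token_probs(weights, expert_probs):
    """Weighted average of each expert's next-token distribution."""
    vocab_size = len(expert_probs[0])
    return [
        sum(w * probs[i] for w, probs in zip(weights, expert_probs))
        for i in range(vocab_size)
    ]


# Toy setup (hypothetical): three experts, 2-d embeddings, 2-token vocabulary.
centroids = [[0.0, 0.0], [3.0, 0.0], [0.0, 4.0]]
x = [0.5, 0.0]
expert_probs = [[0.7, 0.3], [0.2, 0.8], [0.5, 0.5]]

weights = ensemble_weights(x, centroids, temperature=1.0)
sharp_weights = ensemble_weights(x, centroids, temperature=0.1)
best = top1_expert(x, centroids)
mixed = mix_next_token_probs(weights, expert_probs)
```

Lowering the temperature concentrates the weight on the nearest expert, so the ensemble approaches top-1 behaviour. The ensemble costs more at inference time because every expert with non-negligible weight has to run a forward pass.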

Main Findings

Here are the most important results:

  • Expert models beat one big “dense” multilingual model when trained with the same compute budget. The best setup used 8 experts, each trained on small groups of related languages, and consistently achieved lower perplexity across all tested languages.
  • Specializing by language families (typology) worked better than automatic TF-IDF grouping in most cases.
  • These gains were balanced: both high-resource (lots of data) and low-resource languages improved, with low-resource languages often benefiting the most.
  • Adding new languages is easier and safer:
    • Using HMR training, they added four new languages (Azerbaijani, Hebrew, Polish, Swedish) by branching from related existing experts.
    • This approach outperformed standard “keep training the big model” methods and avoided catastrophic forgetting (losing knowledge of earlier languages).
  • Improvements transfer to real tasks:
    • The expert models also did better on cross-lingual tasks like natural language inference, story completion, and paraphrase detection, in both zero-shot (no examples) and few-shot (a few examples) settings.
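The branching step used to add new languages with HMR training can be sketched as a small data-structure example. Everything here is a stand-in: `Expert`, `branch`, and `finetune` are hypothetical names, the parameter lists are toy values rather than model weights, the "training" is a placeholder update, and the choice of a Germanic parent for Swedish is an illustrative assumption rather than a detail confirmed by the summary.

```python
import copy
from dataclasses import dataclass
from typing import Optional


@dataclass
class Expert:
    name: str
    languages: tuple
    params: list                  # toy stand-in for model weights
    parent: Optional[str] = None  # name of the expert this one branched from


def branch(parent, name, languages):
    """Create a child expert initialised from a copy of the parent's weights."""
    return Expert(
        name=name,
        languages=tuple(languages),
        params=copy.deepcopy(parent.params),
        parent=parent.name,
    )


def finetune(expert, delta):
    """Placeholder for further training: only this expert's weights change."""
    expert.params = [p + delta for p in expert.params]


# Hypothetical existing expert and a new expert branched from it.
germanic = Expert("germanic", ("de", "en"), [0.1, 0.2])
swedish = branch(germanic, "swedish", ["sv"])
finetune(swedish, 0.05)
```

Because the child works on its own copy of the weights, the parent expert is left untouched, which is the property that lets this approach avoid catastrophic forgetting.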

Why It Matters

This research shows a practical way to make multilingual LLMs:

  • Fairer: It helps lower-resourced languages instead of favoring only the big ones.
  • More flexible: You can add new languages later without breaking what works.
  • More accessible: Training experts independently needs less synchronized hardware, which can reduce the barriers to building strong multilingual systems.
  • More effective: The gains in perplexity (better language modeling) also lead to real-world task improvements.

In simple terms, instead of one model trying to speak every language equally well (and failing for many), this approach builds a team of language specialists who work together. That teamwork leads to better understanding, easier updates, and stronger performance for everyone.

Glossary

  • autoregressive LM objective: A training objective where a language model predicts the next token given the previous tokens. "train for a fixed number of steps with an autoregressive LM objective."
  • balanced k-means clustering: A variant of k-means that enforces roughly equal-sized clusters. "we then perform balanced k-means clustering on these representations to obtain approximately balanced subsets of the data"
  • Branch-Train-Merge (BTM): A training paradigm where multiple experts are branched from a seed model, trained independently on data subsets, and merged for inference. "Branch-Train-Merge (BTM; \citealt{li2022branchtrainmerge}) alleviates this cost by dividing the total compute among smaller expert LLMs"
  • c-BTM: A cluster-based generalization of BTM that scales experts with data/compute and uses clustering to define domains. "c-BTM \cite{gururangan2023scaling} generalizes the above approach with cluster-based representations of domains."
  • catastrophic forgetting: Degradation on previously learned tasks/languages when a model is further trained on new data. "adapting x-elm to new languages is more efficient than continued training of a dense LM and does not risk catastrophic forgetting of previously seen languages"
  • compute budget: The fixed amount of computational resources (e.g., tokens, steps, FLOPs) allocated for training. "when given the same compute budget, x-elm outperforms jointly trained multilingual models"
  • Cross-lingual Expert LLMs (x-elm): An ensemble of independently trained language experts specialized to subsets of a multilingual corpus. "We propose Cross-lingual Expert LLMs (x-elm), which mitigate this competition by independently training LLMs on subsets of the multilingual corpus."
  • curse of multilinguality: Performance degradation in multilingual models due to competition among languages for limited capacity. "this phenomenon (termed the curse of multilinguality) can significantly harm low-resource languages"
  • DEMix layers: Mixture-of-expert-style layers that route sequences to per-layer domain experts. "and on DEMix layers \citep{demix}, which routes sequences to per-layer feed-forward experts based on metadata."
  • dense model: A single model where all parameters are updated jointly and used for every input. "Training a set of x-elms is more computationally efficient than a comparable dense model"
  • ensemble routing method: A procedure to weight and combine expert outputs during inference based on input similarity. "adapting the c-BTM ensemble routing method."
  • Euclidean distance: A standard distance metric in vector space used to compare TF-IDF embeddings to expert centroids. "calculating the Euclidean distance from $c_e$"
  • expert LLMs: Smaller models specialized to particular domains or languages, trained independently and later combined. "smaller expert LLMs that are trained independently on different domains"
  • FLOP budgets: Training cost measured in floating-point operations, used to compare efficiency across methods. "a set of small expert models performs similarly to equivalently sized dense models at vastly reduced FLOP budgets."
  • Hierarchical Multi-Round (HMR) training: Iterative expert training along a language hierarchy, branching from parent experts to sub-clusters. "Hierarchical Multi-Round training (HMR), an algorithm for efficiently training new experts specialized to unseen languages"
  • in-context learning (ICL): Evaluating or prompting models to perform tasks using examples given in the prompt without gradient updates. "We test the performance of our x-elms on three tasks through an in-context learning (ICL) framework"
  • k-means clustering: An algorithm that partitions data into k clusters by minimizing within-cluster variance. "we then perform balanced k-means clustering on these representations"
  • language-adaptive pretraining (LAPT): Continuing pretraining to adapt a model to a new target language. "language-adaptive pretraining \cite[LAPT,][]{chau2020parsing}"
  • lang2vec: A resource that represents languages via typological features and provides similarity metrics. "We build this hierarchy using the language similarity metrics in lang2vec \cite{littell2017uriel}"
  • linguistic typology: The classification of languages based on structural features, used to group languages for expert training. "grouping languages by linguistic typology"
  • Linguistic Typology Clustering: Clustering documents by language identity and typological similarity rather than surface features. "Linguistic Typology Clustering"
  • mC4: A large multilingual corpus derived from CommonCrawl used for pretraining. "We train our x-elms on mC4, an open-source, multilingual pretraining corpus derived from CommonCrawl"
  • Mixture-of-Experts (MoE): Architectures that route inputs to a subset of expert components to increase capacity efficiently. "Other MoE models have recently been applied to multilingual settings."
  • PAWS-X: A multilingual paraphrase identification benchmark. "PAWS-X \cite{yang2019paws} is a binary classification task that requires the model to determine whether a pair of sentences are paraphrases."
  • perplexity: A standard language-modeling metric measuring how well a model predicts a sample. "We separately calculate the perplexity on the mC4 validation sets of each pretraining language."
  • seed LM: The initial pretrained model used to initialize experts before specialization. "We initialize (branch) $k$ experts from a seed LM"
  • sparsely activated LLMs: Models where only a subset of parameters (experts) are active for a given input, improving efficiency. "Sparsely activated LLMs \citep{pmlr-v119-evci20a,pmlr-v97-mostafa19a,dettmers-sparse-from-scratch} route inputs through a subset of the total model parameters."
  • temperature parameter: A scalar that controls the sharpness of a softmax distribution over expert weights. "T is a temperature parameter over the ensemble weight distribution."
  • TF-IDF: Term Frequency–Inverse Document Frequency; a weighting scheme for representing documents used for clustering and routing. "either through automatic TF-IDF clustering of documents"
  • Top-1 Expert: An inference strategy that selects a single best expert per example or language. "Top-1 Expert"
  • World Atlas of Language Structures (WALS): A database of structural features across languages used for typological comparisons. "World Atlas of Language Structures"
  • XGLM: A family of multilingual autoregressive LLMs used as seed models in this work. "we initialize our x-elms with an existing multilingual pretrained model, XGLM \cite{lin2022fewshot}"
  • X-MOD: A modular multilingual architecture with language-specific components. "which proposes a new modular model architecture, X-MOD, that contains language-specific modules."
  • XNLI: A multilingual natural language inference benchmark. "XNLI \cite{conneau2018xnli} is a multilingual natural language inference benchmark"
  • XStoryCloze: A cross-lingual story completion benchmark extending StoryCloze to multiple languages. "XStoryCloze \cite{lin2022fewshot} is a manually translated benchmark extending StoryCloze"
  • x-BTM: An extension of BTM to multilingual settings that trains experts on different language subsets and merges them for inference. "These x-elms are trained with x-BTM, a new extension of the Branch-Train-Merge paradigm"
