
Breaking the Curse of Multilinguality with Cross-lingual Expert Language Models (2401.10440v2)

Published 19 Jan 2024 in cs.CL

Abstract: Despite their popularity in non-English NLP, multilingual LLMs often underperform monolingual ones due to inter-language competition for model parameters. We propose Cross-lingual Expert LLMs (X-ELM), which mitigate this competition by independently training LLMs on subsets of the multilingual corpus. This process specializes X-ELMs to different languages while remaining effective as a multilingual ensemble. Our experiments show that when given the same compute budget, X-ELM outperforms jointly trained multilingual models across all considered languages and that these gains transfer to downstream tasks. X-ELM provides additional benefits over performance improvements: new experts can be iteratively added, adapting X-ELM to new languages without catastrophic forgetting. Furthermore, training is asynchronous, reducing the hardware requirements for multilingual training and democratizing multilingual modeling.


Summary

  • The paper introduces X-ELM, which mitigates the competition for multilingual model capacity by training expert language models on different subsets of the multilingual corpus.
  • It employs X-BTM, an extension of Branch-Train-Merge, to split the corpus, train experts independently, and combine them at inference, improving specialization and performance for low-resource languages.
  • Under matched compute budgets, X-ELM outperforms jointly trained multilingual models, with balanced perplexity gains across languages and improved adaptability to new languages.

Introduction to Cross-lingual Expert LLMs (X-ELM)

Current multilingual LLMs handle many languages within a single model, but they suffer a notable performance penalty because those languages compete for limited model capacity. This challenge, often referred to as the "curse of multilinguality," is particularly detrimental to low-resource languages. To address it, the paper introduces Cross-lingual Expert LLMs (X-ELM), an approach designed to relieve this competition.

The Curse of Multilinguality and the X-ELM Solution

The "curse of multilinguality" is a phenomenon where multilingual models, which cover a wide array of languages, exhibit decreased performance due to the limited capacity of the models. These models do not perform as well as their monolingual counterparts because the parameters of the model are stretched thin in trying to accommodate numerous languages. x-elm offers a solution by training individual expert LLMs on different subsets of a multilingual corpus. This approach allows each model to specialize in specific languages while also functioning collectively as an effective multilingual ensemble.

Breaking Down X-ELM and Its Training Process

The X-ELM framework relies on X-BTM, an extension of the Branch-Train-Merge paradigm. X-BTM divides a multilingual corpus into clusters based on linguistic typology or TF-IDF features, trains an 'expert' LLM independently on each cluster, and combines the experts at inference time. Because each expert is specialized, new languages can be accommodated by adding experts without forgetting previously learned ones, unlike in a single dense model. Moreover, the experts are trained asynchronously, which lowers the hardware requirements for multilingual training and thereby helps democratize the development of multilingual models. A simplified sketch of the clustering step appears below.
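
The sketch below illustrates the corpus-splitting idea using TF-IDF features and k-means; the specific featurization (character n-grams) and library calls are illustrative assumptions rather than the paper's exact pipeline:

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

def cluster_corpus_for_experts(documents, n_experts, seed=0):
    """Partition a multilingual corpus into n_experts clusters so that each
    cluster can train one expert LM independently (the "branch" and "train"
    steps); typology-informed splits are an alternative to TF-IDF clustering."""
    # Character n-grams are a language-agnostic featurization choice here.
    vectorizer = TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4), max_features=50_000)
    features = vectorizer.fit_transform(documents)
    clusterer = KMeans(n_clusters=n_experts, random_state=seed, n_init=10)
    assignments = clusterer.fit_predict(features)
    # Expert k is then trained asynchronously on {documents[i] : assignments[i] == k}.
    return assignments

# Toy usage with a tiny bilingual corpus and two experts.
docs = ["the cat sat on the mat", "le chat est sur le tapis",
        "the dog barks loudly", "le chien aboie fort"]
print(cluster_corpus_for_experts(docs, n_experts=2))
```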

Comparative Results and Future Implications

Experiments show that X-ELM outperforms jointly trained multilingual models across all considered languages and compute budgets. Perplexity gains are balanced across differing resource levels, and adapting the ensemble to new languages outperforms standard methods. These benefits also carry over to downstream tasks, demonstrating the robust cross-lingual capabilities of X-ELMs. The sketch after this paragraph illustrates how a new expert can be added without disturbing existing ones.
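
The workflow below is a minimal, hypothetical sketch of that adaptation step: a new expert is branched from an existing one and trained on the new language while all other experts stay untouched. The helper names (pick_closest_expert, train_fn) and the branch-from-the-closest-expert heuristic are illustrative assumptions, not the paper's exact recipe:

```python
import copy

def add_language_expert(experts, new_lang_data, pick_closest_expert, train_fn):
    """Extend an X-ELM-style ensemble with an expert for a new language.

    experts:             dict of expert name -> model; existing experts are never
                         updated, so previously covered languages are not forgotten
    new_lang_data:       training corpus for the new language
    pick_closest_expert: callable choosing which existing expert to branch from
    train_fn:            callable that continues training a model copy on new data
    """
    seed_name = pick_closest_expert(experts, new_lang_data)
    new_expert = copy.deepcopy(experts[seed_name])    # branch from a related expert
    new_expert = train_fn(new_expert, new_lang_data)  # only the new expert is trained
    return {**experts, "new_language_expert": new_expert}  # merge; old experts unchanged

# Toy usage: "models" are stand-in parameter lists.
experts = {"romance": [0.1, 0.2], "germanic": [0.3, 0.4]}
updated = add_language_expert(
    experts,
    new_lang_data=["texto en una lengua nueva"],
    pick_closest_expert=lambda e, d: "romance",
    train_fn=lambda model, data: [p + 0.01 for p in model],
)
print(sorted(updated))  # existing experts plus the newly added one
```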

This paper lays a foundation for future research on sparse, modular modeling for multilinguality, including improved clustering methods and expert-allocation techniques. With further validation across diverse settings and scales, X-ELM has the potential not only to mitigate the challenges of multilinguality but also to level the playing field for low-resource languages in NLP applications.