Breaking the Curse of Multilinguality with Cross-lingual Expert Language Models (2401.10440v2)

Published 19 Jan 2024 in cs.CL

Abstract: Despite their popularity in non-English NLP, multilingual LLMs often underperform monolingual ones due to inter-language competition for model parameters. We propose Cross-lingual Expert LLMs (X-ELM), which mitigate this competition by independently training LLMs on subsets of the multilingual corpus. This process specializes X-ELMs to different languages while remaining effective as a multilingual ensemble. Our experiments show that when given the same compute budget, X-ELM outperforms jointly trained multilingual models across all considered languages and that these gains transfer to downstream tasks. X-ELM provides additional benefits over performance improvements: new experts can be iteratively added, adapting X-ELM to new languages without catastrophic forgetting. Furthermore, training is asynchronous, reducing the hardware requirements for multilingual training and democratizing multilingual modeling.

Summary

The paper introduces X-ELM, which mitigates inter-language competition for model capacity by training separate expert models on subsets of the multilingual corpus.
It uses x-BTM, an extension of Branch-Train-Merge, to partition the corpus, train the experts independently, and merge them for inference, improving specialization and performance for low-resource languages.
Comparative experiments show that X-ELM outperforms jointly trained multilingual models, with perplexity gains that are balanced across languages and better adaptation to new languages.

Introduction to Cross-lingual Expert LLMs (X-ELM)

Multilingual LLMs, which model many languages with a single set of parameters, often underperform monolingual models because the languages compete for limited model capacity. This problem, often called the "curse of multilinguality," is particularly detrimental to low-resource languages. To address it, the paper introduces Cross-lingual Expert LLMs (X-ELM), a set of independently trained expert models that together serve as a multilingual ensemble.

The Curse of Multilinguality and the X-ELM Solution

The "curse of multilinguality" is the phenomenon in which a model's per-language performance drops as it covers more languages, because a fixed parameter budget must be shared among all of them. As a result, multilingual models tend to perform worse than their monolingual counterparts. X-ELM addresses this by training individual expert LLMs on different subsets of a multilingual corpus. Each expert specializes in the languages of its subset, and the experts together still function as an effective multilingual ensemble.

Breaking Down X-ELM and Its Training Process

X-ELM is trained with x-BTM, an extension of the Branch-Train-Merge (BTM) paradigm. x-BTM partitions the multilingual corpus, either by linguistic typology or by TF-IDF clustering, trains one expert LLM independently on each partition, and then merges the experts for inference, where they act as a multilingual ensemble. Because each expert is a separate model, a new language can be supported by training and adding a new expert, so the model adapts to new languages without catastrophic forgetting of previously learned ones. Training is also asynchronous: the experts do not have to be trained at the same time, which reduces the hardware requirements for multilingual training and makes the development of multilingual models more accessible.
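To make the merge step more concrete, the following is a minimal sketch of one way an ensemble of experts could be combined at inference. It is an illustration under stated assumptions, not the paper's implementation: the weighting rule (a softmax over negative squared distances between a context vector and each expert's cluster centroid), the function names, the `temperature` parameter, and the toy vectors and distributions are all hypothetical.

```python
import math

def ensemble_weights(context_vec, centroids, temperature=1.0):
    """Softmax over negative squared distances from the context to each centroid.

    Assumption for illustration: each expert is represented by the centroid of
    the data cluster it was trained on, and closer experts get more weight.
    """
    scores = [
        -sum((c - x) ** 2 for c, x in zip(centroid, context_vec)) / temperature
        for centroid in centroids
    ]
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def ensemble_next_token(expert_dists, weights):
    """Mix per-expert next-token distributions (token -> probability)."""
    mixed = {}
    for dist, w in zip(expert_dists, weights):
        for token, p in dist.items():
            mixed[token] = mixed.get(token, 0.0) + w * p
    return mixed

# Toy data (hypothetical): two experts, a context close to the first centroid.
centroids = [[1.0, 0.0], [0.0, 1.0]]
context_vec = [0.9, 0.1]
expert_dists = [
    {"the": 0.7, "der": 0.3},  # expert 0's next-token distribution
    {"the": 0.2, "der": 0.8},  # expert 1's next-token distribution
]

weights = ensemble_weights(context_vec, centroids, temperature=0.5)
mixed = ensemble_next_token(expert_dists, weights)
print(weights, mixed)
```

In this sketch, adding a language amounts to appending one more centroid and one more expert distribution; the existing entries are not modified, which mirrors why adding an expert does not overwrite what earlier experts learned.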

Comparative Results and Future Implications

Experiments show that, at matched compute budgets, X-ELM outperforms jointly trained multilingual models across all languages considered. The perplexity gains are balanced across languages with different resource levels, and adapting X-ELM to new languages by adding experts outperforms standard adaptation methods. These gains also transfer to downstream tasks, indicating that the ensemble retains cross-lingual capabilities.
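For readers less familiar with the metric, perplexity is the exponential of the average negative log-likelihood per token, so lower is better. The short sketch below computes it from a list of per-token natural-log probabilities; the function name and the toy input are illustrative and do not come from the paper.

```python
import math

def perplexity(token_logprobs):
    """Return exp of the mean negative natural-log probability per token."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

# Toy input: a model that assigns probability 0.25 to each of 8 tokens.
uniform_logprobs = [math.log(0.25)] * 8
ppl = perplexity(uniform_logprobs)
print(ppl)
```

A model that is uniformly uncertain among four choices at every step has a perplexity of about 4, which is a useful way to read perplexity differences between models.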

This paper lays a foundation for future research on sparse modeling tailored to multilinguality and could lead to improvements in clustering methods and expert allocation techniques. With further validation in more diverse settings and at larger scales, X-ELM has the potential not only to mitigate the curse of multilinguality but also to level the playing field for low-resource languages in NLP applications.
