When Is Multilinguality a Curse? Language Modeling for 250 High- and Low-Resource Languages (2311.09205v1)

Published 15 Nov 2023 in cs.CL

Abstract: Multilingual language models are widely used to extend NLP systems to low-resource languages. However, concrete evidence for the effects of multilinguality on language modeling performance in individual languages remains scarce. Here, we pre-train over 10,000 monolingual and multilingual language models for over 250 languages, including multiple language families that are under-studied in NLP. We assess how language modeling performance in each language varies as a function of (1) monolingual dataset size, (2) added multilingual dataset size, (3) linguistic similarity of the added languages, and (4) model size (up to 45M parameters). We find that in moderation, adding multilingual data improves low-resource language modeling performance, similar to increasing low-resource dataset sizes by up to 33%. Improvements depend on the syntactic similarity of the added multilingual data, with marginal additional effects of vocabulary overlap. However, high-resource languages consistently perform worse in multilingual pre-training scenarios. As dataset sizes increase, adding multilingual data begins to hurt performance for both low-resource and high-resource languages, likely due to limited model capacity (the "curse of multilinguality"). These results suggest that massively multilingual pre-training may not be optimal for any languages involved, but that more targeted models can significantly improve performance.

Multilingual Language Modeling Performance: Evaluating the Impact of Multilingual Data

This paper examines the effects of adding multilingual data to language models, covering 252 languages that range from high- to low-resource. The authors investigate how multilingual pre-training influences language modeling performance in each individual language.

The study pre-trains and evaluates over 10,000 monolingual and multilingual language models with up to 45M parameters. Four variables are manipulated: monolingual dataset size, added multilingual dataset size, linguistic similarity of the added languages, and model size. Each model is pre-trained on a varying number of monolingual tokens in the target language combined with a controlled amount of added multilingual data.
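
For concreteness, here is a minimal sketch of how such an experimental grid could be enumerated. The token budgets below are illustrative round numbers chosen for the sketch, not the paper's exact configurations; only the 45M upper bound on model size comes from the paper.

```python
from itertools import product

# Illustrative grid of pre-training conditions (assumed values for this sketch).
mono_tokens  = [1_000_000, 10_000_000, 100_000_000]   # target-language tokens
multi_tokens = [0, 10_000_000, 100_000_000]           # added multilingual tokens
model_params = [5_000_000, 15_000_000, 45_000_000]    # parameter counts (<= 45M)

configs = [
    {"mono": m, "multi": x, "params": p}
    for m, x, p in product(mono_tokens, multi_tokens, model_params)
]
print(f"{len(configs)} pre-training conditions per target language")
```

Each configuration would then be paired with a choice of added languages (varying in linguistic similarity to the target) and evaluated on held-out text in the target language.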

Key Findings

  1. Performance Improvement for Low-Resource Languages: Low-resource languages showed performance improvements when moderate amounts of multilingual data were included. In the best case, the improvement was equivalent to increasing the monolingual dataset size by up to 33%, provided the added languages were linguistically similar (a toy calculation follows this list). Larger models benefited more from added multilingual data, suggesting that model capacity influences the extent of improvement.
  2. Syntactic Similarity as a Key Driver: The improvement from multilingual data is driven primarily by syntactic similarity between the target and added languages, with geographic proximity and vocabulary overlap playing smaller roles. This suggests that abstract linguistic structure, rather than surface lexical overlap, is what supports cross-lingual transfer in these settings.
  3. Negative Impact on High-Resource Languages: High-resource languages consistently performed worse when additional multilingual data was incorporated. These degradations were compounded by model size constraints, indicating that the "curse of multilinguality" is largely driven by limited model capacity.
  4. Model Size and Capacity: Larger models showed less degradation for high-resource languages and greater benefits for low-resource languages. This indicates that model size is a crucial factor in both the costs and benefits of multilingual pre-training.
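
To make the "effective dataset size" framing in finding 1 concrete, the toy calculation below assumes the best-case reported equivalence of a 33% boost to the monolingual budget; the token count is invented for illustration, and actual gains depend on language similarity, the amount of added data, and model size.

```python
# Toy illustration of the "up to 33% effective data increase" interpretation.
mono_tokens = 10_000_000        # hypothetical low-resource monolingual budget
best_case_boost = 0.33          # best-case equivalence reported in the paper

effective_tokens = int(mono_tokens * (1 + best_case_boost))
print(f"{mono_tokens:,} target-language tokens plus similar multilingual data "
      f"can perform like ~{effective_tokens:,} monolingual-only tokens")
```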

Practical Implications and Future Directions

This research provides practical guidance for deploying multilingual language models effectively. In particular, multilingual models should be constructed with attention to linguistic proximity, especially when the goal is to improve performance in low-resource languages. This targeted approach can optimize performance for specific languages while avoiding the inefficiencies of massively multilingual models.
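
One simple way to operationalize "linguistic proximity" when assembling multilingual pre-training data is to rank candidate donor languages by similarity over typological (URIEL-style) syntactic features. The sketch below is a hypothetical illustration with hand-made feature vectors and an invented language set; it is not the paper's actual similarity measure or data.

```python
import numpy as np

# Hypothetical binary syntactic feature vectors (URIEL-style). Real usage would
# load typological features from a database rather than hard-coding them.
syntax_features = {
    "occitan": np.array([1, 0, 1, 1, 0, 1]),   # example low-resource target
    "catalan": np.array([1, 0, 1, 1, 0, 1]),
    "spanish": np.array([1, 0, 1, 0, 0, 1]),
    "basque":  np.array([0, 1, 0, 0, 1, 0]),
    "turkish": np.array([0, 1, 0, 1, 1, 0]),
}

def cosine_sim(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

target = "occitan"
donors = sorted(
    (lang for lang in syntax_features if lang != target),
    key=lambda lang: cosine_sim(syntax_features[target], syntax_features[lang]),
    reverse=True,
)
print("Candidate donor languages, most to least syntactically similar:", donors)
```

Selecting donors this way reflects the paper's finding that syntactic similarity, more than vocabulary overlap, drives transfer benefits for low-resource targets.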

The findings point to future directions for developing language models that incorporate multilingual data without degrading per-language performance. Further research could investigate larger models and adaptive architectures that mitigate the capacity issues associated with extensive multilingual data.

In conclusion, this paper underscores the delicate balance required in multilingual language modeling: model capacity and linguistic similarity must be weighed together to achieve strong performance across diverse languages. The work provides a foundation for future advances in this area, emphasizing practical strategies for improving language model quality across languages with widely varying amounts of data.

Authors (4)
  1. Tyler A. Chang (17 papers)
  2. Catherine Arnett (13 papers)
  3. Zhuowen Tu (80 papers)
  4. Benjamin K. Bergen (31 papers)
Citations (4)