Targeted Multilingual Adaptation for Low-resource Language Families (2405.12413v1)

Published 20 May 2024 in cs.CL

Abstract: The "massively-multilingual" training of multilingual models is known to limit their utility in any one language, and they perform particularly poorly on low-resource languages. However, there is evidence that low-resource languages can benefit from targeted multilinguality, where the model is trained on closely related languages. To test this approach more rigorously, we systematically study best practices for adapting a pre-trained model to a language family. Focusing on the Uralic family as a test case, we adapt XLM-R under various configurations to model 15 languages; we then evaluate the performance of each experimental setting on two downstream tasks and 11 evaluation languages. Our adapted models significantly outperform mono- and multilingual baselines. Furthermore, a regression analysis of hyperparameter effects reveals that adapted vocabulary size is relatively unimportant for low-resource languages, and that low-resource languages can be aggressively up-sampled during training at little detriment to performance in high-resource languages. These results introduce new best practices for performing language adaptation in a targeted setting.
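The abstract's finding that low-resource languages can be aggressively up-sampled is typically realized with exponential (temperature-based) smoothing of the language sampling distribution, as used in XLM-R-style pre-training. Below is a minimal sketch of that general technique, not the authors' code; the corpus sizes and the alpha value are hypothetical, chosen only to illustrate how a smaller exponent flattens the distribution in favor of low-resource languages.

```python
def sampling_probs(corpus_sizes: dict[str, int], alpha: float = 0.3) -> dict[str, float]:
    """Return per-language sampling probabilities q_i proportional to p_i ** alpha.

    p_i is a language's empirical share of the training data; alpha < 1 flattens
    the distribution, so low-resource languages are sampled more often.
    """
    total = sum(corpus_sizes.values())
    weights = {lang: (n / total) ** alpha for lang, n in corpus_sizes.items()}
    norm = sum(weights.values())
    return {lang: w / norm for lang, w in weights.items()}


# Hypothetical token counts for a few Uralic languages (illustration only).
sizes = {"fin": 50_000_000, "est": 20_000_000, "sme": 500_000, "mdf": 100_000}
for lang, q in sampling_probs(sizes, alpha=0.3).items():
    print(f"{lang}: {q:.3f}")
```

With alpha = 1.0 the sampling rate simply mirrors corpus size; lowering alpha toward 0 pushes the distribution toward uniform, which is the "aggressive up-sampling" regime the paper reports as largely harmless to high-resource performance.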

Authors (5)
  1. C. M. Downey (6 papers)
  2. Terra Blevins (20 papers)
  3. Dhwani Serai (1 paper)
  4. Dwija Parikh (1 paper)
  5. Shane Steinert-Threlkeld (20 papers)
Citations (1)