Impact of Bilingual Data on Typologically Distant and Low-Resource Languages
Determine how including bilingual documents during pretraining affects the multilingual performance and cross-lingual capabilities of large language models for typologically distant languages and for low-resource languages, as opposed to the major Latin-script languages studied in the cited paper.
References
Second, our experiments focus on major languages within the Latin script family, leaving open questions about the impact of bilingual data on typologically distant or low-resource languages.
— The Role of Mixed-Language Documents for Multilingual Large Language Model Pretraining
(2601.00364 - Shao et al., 1 Jan 2026) in Limitations (Section 7)