Impact of Bilingual Data on Typologically Distant and Low-Resource Languages

Determine the impact of including bilingual documents during pretraining on the multilingual performance and cross-lingual capabilities of large language models, focusing on typologically distant and low-resource languages rather than the major Latin-script languages studied in the original paper.

Background

The paper investigates the role of mixed-language (bilingual) documents in pretraining multilingual LLMs by constructing a monolingual-only corpus (MonoWeb) and comparing it against an unfiltered web corpus (FineWeb). Through controlled pretraining and granular ablations that reintroduce either parallel or code-switching bilingual data, the authors show that machine translation critically depends on parallel data, while cross-lingual QA and reasoning are relatively robust without bilingual documents.
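Probing these data-type sensitivities in new language pairs first requires the kind of mixed-language filter that separates a corpus like MonoWeb from FineWeb. The sketch below shows one plausible approach, line-level language identification with fastText's lid.176 model; the model path, the `min_chars` cutoff, and the 20% minority-language threshold are illustrative assumptions, not the paper's actual pipeline.

```python
# A minimal sketch of mixed-language document filtering, assuming the
# fastText lid.176 language-ID model; names and thresholds here are
# illustrative, not the paper's actual MonoWeb construction.
from collections import Counter

import fasttext

# Assumption: lid.176.bin has been downloaded locally from fasttext.cc.
MODEL = fasttext.load_model("lid.176.bin")


def line_languages(doc: str, min_chars: int = 20) -> list[str]:
    """Predict one language label per sufficiently long line of a document."""
    labels = []
    for line in doc.splitlines():
        line = line.strip()
        if len(line) < min_chars:
            continue  # too short for reliable language identification
        (label,), _probs = MODEL.predict(line, k=1)
        labels.append(label.removeprefix("__label__"))
    return labels


def is_bilingual(doc: str, minority_frac: float = 0.2) -> bool:
    """Flag a document whose second-most-frequent language covers a
    substantial fraction of its lines (hypothetical 20% threshold)."""
    labels = line_languages(doc)
    if len(labels) < 2:
        return False
    ranked = Counter(labels).most_common()
    if len(ranked) < 2:
        return False  # only one language detected: keep for the monolingual corpus
    _, second_count = ranked[1]
    return second_count / len(labels) >= minority_frac


# Example: an English/French document that a monolingual-only corpus
# such as MonoWeb would exclude.
doc = (
    "The weather is lovely today in the old town.\n"
    "Il fait tres beau aujourd'hui dans la vieille ville.\n"
)
print(is_bilingual(doc))  # True
```

Line-level labels are a coarse granularity: they catch documents with parallel or block-wise translations, but intra-sentential code-switching would require span- or token-level detection, a distinction that matters given the paper's separate ablations for parallel and code-switching data.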

However, the experiments cover only major languages within the Latin script family (English paired with French, German, and Spanish) and models at the 1.35B-parameter scale. The authors explicitly note that it remains an open question whether their findings generalize to typologically distant or low-resource languages, and thus whether the same data-type sensitivities and alignment phenomena hold beyond well-resourced Latin-script pairs.

References

"Second, our experiments focus on major languages within the Latin script family, leaving open questions about the impact of bilingual data on typologically distant or low-resource languages."

The Role of Mixed-Language Documents for Multilingual Large Language Model Pretraining (Shao et al., arXiv:2601.00364, 1 Jan 2026), Limitations (Section 7)