Scaling Multilingual Curation and Its Interaction with General-Purpose Mixtures

Determine how multilingual data curation strategies scale to larger token budgets and model sizes, and characterize how multilingual curation interacts with general-purpose pretraining mixtures in large-scale language model training.

Background

Earlier sections present smaller-scale, controlled bilingual experiments (60B tokens, 3B-parameter models) demonstrating that targeted data curation improves cross-lingual transfer and multilingual performance.

The authors note that understanding how these multilingual curation strategies extend to larger token budgets and model sizes, and how multilingual curation interacts with a broader general-purpose mixture (including English, code, STEM, and reasoning data), is a critical question for practical large-scale pretraining. As an initial step toward answering it, they describe a 20T-token curated corpus and 1T-token training runs.

References

However, an open question is how such curation strategies scale to larger token budgets and models, and how multilingual curation interacts with general-purpose curation.

ÜberWeb: Insights from Multilingual Curation for a 20-Trillion-Token Dataset (2602.15210 - DatologyAI et al., 16 Feb 2026), in Subsection 'Integrating multilingual curation into a general pretraining mix' (Section 'Main Findings'; label: section:big_boi_exps)