Scaling Multilingual Curation and Its Interaction with General-Purpose Mixtures
Determine how multilingual data curation strategies scale to larger token budgets and model sizes, and characterize how such curation interacts with general-purpose pretraining mixtures in large-scale language model training.
References
However, an open question is how such curation strategies can scale to larger token budgets and models, and how multilingual curation interacts with general-purpose curation.
— ÜberWeb: Insights from Multilingual Curation for a 20-Trillion-Token Dataset
(2602.15210 - DatologyAI et al., 16 Feb 2026) in Subsection 'Integrating multilingual curation into a general pretraining mix' (Section 'Main Findings'; label: section:big_boi_exps)