Explain the convergence gap between finetuning from a multilingual checkpoint and pretraining from scratch
Determine the underlying causes of the observed convergence differences, across languages, between monolingual models finetuned from a Unimax multilingual checkpoint and monolingual models pretrained from scratch: specifically, why the finetuned models initially outperform but are eventually surpassed by from-scratch pretraining at larger token budgets.
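The sketch below is purely illustrative of the crossover pattern the question refers to; the saturating power-law form L(D) = E + A * D^(-alpha), the idea of a "transferred effective token" head start, and every parameter value are assumptions chosen for exposition, not estimates from the ATLAS paper.

```python
# Illustrative sketch only: the functional form and all parameter values are assumed,
# not taken from ATLAS. It shows how a finetuned run can start ahead (behaving as if it
# had extra "effective" tokens from the multilingual checkpoint) yet be overtaken once
# the from-scratch curve, with its lower assumed asymptote, catches up.

def loss_finetuned(d_billion, E=2.08, A=1.5, alpha=0.35, d_transfer=20.0):
    """Hypothetical loss for a monolingual model finetuned from a multilingual checkpoint:
    a head start of d_transfer effective billions of tokens, slightly higher asymptote E."""
    return E + A * (d_billion + d_transfer) ** (-alpha)

def loss_scratch(d_billion, E=2.00, A=1.5, alpha=0.35):
    """Hypothetical loss for monolingual pretraining from scratch: no head start,
    but a lower assumed irreducible loss E."""
    return E + A * d_billion ** (-alpha)

for d in [1, 3, 10, 30, 100, 300, 1000]:  # token budget in billions
    ft, fs = loss_finetuned(d), loss_scratch(d)
    leader = "finetuned" if ft < fs else "from-scratch"
    print(f"{d:5d}B tokens  finetuned={ft:.3f}  from-scratch={fs:.3f}  -> {leader} ahead")
```

Under these assumed parameters the lead flips somewhere between 10B and 100B tokens; whether the real crossover arises from a higher effective asymptote, a different scaling exponent, or some other mechanism is exactly what is left open.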
References
"Note that we leave the reason for these convergence differences to future work, without a clear hypothesis."
— ATLAS: Adaptive Transfer Scaling Laws for Multilingual Pretraining, Finetuning, and Decoding the Curse of Multilinguality (Longpre et al., 24 Oct 2025, arXiv:2510.22037), Section 6 ("Pretrain or Finetune?")