
Explain the convergence gap between finetuning from a multilingual checkpoint and pretraining from scratch

Determine the underlying causes of the observed convergence differences between monolingual models finetuned from a Unimax multilingual checkpoint and monolingual models pretrained from scratch across languages, specifically why finetuned models initially outperform but are eventually surpassed by from-scratch pretraining at larger token budgets.


Background

The paper compares two training strategies for optimizing performance on a target language: (1) pretraining a monolingual model from scratch and (2) finetuning a massively multilingual Unimax checkpoint on the target language. Empirically, finetuned models start with lower loss but are overtaken by from-scratch pretraining after a language-dependent number of training tokens (approximately 144B–283B in the reported experiments).

While the paper quantifies the crossover points and provides a compute-based heuristic for deciding between pretraining and finetuning, it does not identify the mechanism(s) driving the difference in convergence behavior. The authors explicitly leave the explanation for this phenomenon to future work, noting the lack of a clear hypothesis.
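As a rough illustration of how such a decision rule could look in practice, the sketch below encodes a simple token-budget threshold: finetune the multilingual checkpoint below the language-dependent crossover point, pretrain from scratch above it. This is a minimal sketch, not the paper's heuristic; the language names, threshold values, and the choose_strategy function are hypothetical placeholders, and the paper's actual rule is compute-based rather than a raw token threshold.

```python
# Illustrative sketch only: a token-budget decision rule consistent with the
# reported crossover behavior. The paper's actual heuristic is compute-based;
# the names and threshold values below are hypothetical placeholders.

# Hypothetical, language-dependent crossover points (in training tokens),
# chosen within the ~144B-283B range reported in the paper's experiments.
CROSSOVER_TOKENS = {
    "example_lang_a": 144e9,
    "example_lang_b": 283e9,
}

def choose_strategy(language: str, target_token_budget: float) -> str:
    """Return 'finetune' below the crossover point, 'pretrain' above it.

    Finetuning from the multilingual checkpoint wins at small budgets;
    from-scratch pretraining overtakes it once the budget exceeds the
    language-dependent crossover.
    """
    crossover = CROSSOVER_TOKENS.get(language, 200e9)  # fallback is a guess
    return "finetune" if target_token_budget < crossover else "pretrain"

# Example: a 50B-token budget favors finetuning the Unimax checkpoint,
# while a 400B-token budget favors pretraining from scratch.
print(choose_strategy("example_lang_a", 50e9))   # -> finetune
print(choose_strategy("example_lang_b", 400e9))  # -> pretrain
```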

References

Note that we leave the reason for these convergence differences to future work, without a clear hypothesis.

ATLAS: Adaptive Transfer Scaling Laws for Multilingual Pretraining, Finetuning, and Decoding the Curse of Multilinguality (Longpre et al., 24 Oct 2025, arXiv:2510.22037), Section 6 (Pretrain or Finetune?)