
Cross-lingual and multilingual generalization of convergence dynamics

Determine whether language models trained on non-English corpora or in multilingual settings exhibit the same convergence behavior across random seeds during training, measured by expected per-token Kullback–Leibler divergence; specifically, whether the dynamics across training steps mirror the four-phase pattern of initial uniform behavior, sharp convergence, sharp divergence, and slow reconvergence observed in English-language setups.


Background

The paper defines language-model convergence across random seeds using the expected per-token Kullback–Leibler divergence and identifies a four-phase pattern throughout training: an initial uniform phase, a sharp-convergence phase, a sharp-divergence phase, and a slow-reconvergence phase. These dynamics are demonstrated on English-language data using the (Poly)Pythia autoregressive models and corroborated in later training stages for MultiBERT masked language models.
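
The sketch below illustrates how such a convergence metric could be computed for two models trained from different random seeds. It is a minimal example using the Hugging Face transformers API, not the authors' implementation; the checkpoint identifiers and sample text are placeholders.

# Minimal sketch (not the authors' code) of the convergence metric:
# expected per-token KL divergence between the next-token distributions of
# two models trained from different random seeds.
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

def per_token_kl(model_a, model_b, tokenizer, texts, device="cpu"):
    """Average KL(p_a || p_b) over all token positions in `texts`."""
    model_a.to(device).eval()
    model_b.to(device).eval()
    total_kl, total_tokens = 0.0, 0
    with torch.no_grad():
        for text in texts:
            ids = tokenizer(text, return_tensors="pt").input_ids.to(device)
            logits_a = model_a(ids).logits  # shape (1, T, vocab)
            logits_b = model_b(ids).logits
            log_p_a = F.log_softmax(logits_a, dim=-1)
            log_p_b = F.log_softmax(logits_b, dim=-1)
            # KL(p_a || p_b) at each position, summed over the vocabulary
            kl = (log_p_a.exp() * (log_p_a - log_p_b)).sum(dim=-1)  # (1, T)
            total_kl += kl.sum().item()
            total_tokens += kl.numel()
    return total_kl / total_tokens

# Illustrative usage with two seed variants of a (Poly)Pythia-style checkpoint;
# the exact model identifiers below are assumptions, not taken from the paper.
# tok = AutoTokenizer.from_pretrained("EleutherAI/pythia-70m")
# m1 = AutoModelForCausalLM.from_pretrained("EleutherAI/pythia-70m", revision="step1000")
# m2 = AutoModelForCausalLM.from_pretrained("some-org/pythia-70m-seed2", revision="step1000")
# print(per_token_kl(m1, m2, tok, ["Example sentence for evaluation."]))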

The analysis also shows that convergence is uneven across token categories (e.g., frequent tokens and function words converge more reliably) and that larger models reconverge faster; a per-category breakdown is sketched below. However, all experiments are restricted to English, which motivates the explicit question of whether the same convergence dynamics arise in other languages or in multilingual contexts.
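
As a companion to the sketch above, a hypothetical helper for the per-category breakdown could bucket token positions by the corpus frequency of the target token and average the KL within each bucket; the cutoff value and bucket names are illustrative assumptions, not the paper's categorization.

# Hypothetical helper: average per-position KL within frequency-based buckets.
def kl_by_frequency_bucket(kl_values, token_ids, token_counts, cutoff=1000):
    """kl_values, token_ids: equal-length sequences; token_counts: id -> corpus frequency."""
    buckets = {"frequent": [], "rare": []}
    for kl, tid in zip(kl_values, token_ids):
        bucket = "frequent" if token_counts.get(tid, 0) >= cutoff else "rare"
        buckets[bucket].append(kl)
    return {name: sum(vals) / len(vals) if vals else float("nan")
            for name, vals in buckets.items()}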

References

Third, our analysis is restricted to English-language data, leaving open questions about whether similar convergence dynamics occur in other languages and in multilingual settings.

Convergence and Divergence of Language Models under Different Random Seeds (2509.26643 - Fehlauer et al., 30 Sep 2025) in Section: Limitations