Cause of missing loss-slope change at learning-rate cooldown in Apertus-70B pretraining
Determine why the Apertus-70B model's pretraining loss curve did not exhibit a significant slope change at the onset of the Warmup-Stable-Decay (WSD) learning-rate cooldown phase, reached at approximately 13.5 trillion consumed tokens, contrary to what was observed at smaller scale and for Apertus-8B. In addition, establish appropriate learning-rate scaling rules for this large-scale setting.
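For context on the schedule referenced above, the sketch below is a generic Warmup-Stable-Decay learning-rate function with linear warmup and linear cooldown; the step counts, peak learning rate, and decay shape are illustrative assumptions, not the Apertus configuration.

```python
def wsd_lr(step: int,
           warmup_steps: int,
           stable_steps: int,
           decay_steps: int,
           peak_lr: float,
           final_lr: float = 0.0) -> float:
    """Warmup-Stable-Decay (WSD): linear warmup to peak_lr, a long
    constant ("stable") phase, then a linear cooldown to final_lr."""
    if step < warmup_steps:
        # Linear warmup from 0 to the peak learning rate.
        return peak_lr * step / max(1, warmup_steps)
    if step < warmup_steps + stable_steps:
        # Stable phase: constant at the peak learning rate.
        return peak_lr
    # Cooldown: linear decay from peak_lr toward final_lr.
    progress = min(1.0, (step - warmup_steps - stable_steps) / max(1, decay_steps))
    return peak_lr + (final_lr - peak_lr) * progress

# Hypothetical step counts, not the Apertus values; the cooldown onset
# is where a visible change in the loss slope is typically expected.
for s in (0, 500, 10_000, 11_000, 12_000):
    print(s, wsd_lr(s, warmup_steps=1_000, stable_steps=10_000,
                    decay_steps=1_000, peak_lr=3e-4))
```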
References
It remains unclear why this was the case; our main hypothesis is that the peak learning rate was set too low and that the model had not yet converged on the phase 4 data mixture. Due to the tight schedule of the project, we were unable to establish proper scaling rules for learning rate or experiment with more values at scale.
— Apertus: Democratizing Open and Compliant LLMs for Global Language Environments
(arXiv:2509.14233, Hernández-Cano et al., 17 Sep 2025), Section 2.6 "Final Run Retrospective" (Model Architecture and Pretraining Recipe), Cooldown paragraph
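The quoted retrospective notes that no learning-rate scaling rules had been established at this scale. As a purely illustrative example of what such a rule can look like, the sketch below applies a common µP-style width-based heuristic; this is not a rule proposed in the Apertus paper, and the proxy values are hypothetical.

```python
def scale_peak_lr(base_lr: float,
                  base_width: int,
                  target_width: int) -> float:
    """µP-style heuristic: scale the peak learning rate inversely with
    hidden width to keep per-parameter update size roughly constant.
    Purely illustrative; not the rule used for Apertus."""
    return base_lr * base_width / target_width

# Hypothetical transfer of a peak LR tuned on a small proxy model
# (width 4096) to a larger model (width 8192); values are assumptions.
proxy_lr = 3e-4
large_lr = scale_peak_lr(proxy_lr, base_width=4096, target_width=8192)
print(f"scaled peak LR: {large_lr:.2e}")  # 1.50e-04
```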