Cause of missing loss-slope change at learning-rate cooldown in Apertus-70B pretraining
Determine why the Apertus-70B model's pretraining loss curve did not exhibit a significant slope change at the onset of the Warmup-Stable-Decay (WSD) learning-rate cooldown phase, reached at approximately 13.5 trillion consumed tokens, contrary to what was observed at smaller scale and for Apertus-8B. In addition, establish appropriate learning-rate scaling rules for this large-scale setting.
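For context on the schedule referenced above, the sketch below is a generic Warmup-Stable-Decay learning-rate function with linear warmup and linear cooldown; the step counts, peak learning rate, and decay shape are illustrative assumptions, not the Apertus configuration.

```python
def wsd_lr(step: int,
           warmup_steps: int,
           stable_steps: int,
           decay_steps: int,
           peak_lr: float,
           final_lr: float = 0.0) -> float:
    """Warmup-Stable-Decay (WSD): linear warmup to peak_lr, a long
    constant ("stable") phase, then a linear cooldown to final_lr."""
    if step < warmup_steps:
        # Linear warmup from 0 to the peak learning rate.
        return peak_lr * step / max(1, warmup_steps)
    if step < warmup_steps + stable_steps:
        # Stable phase: constant at the peak learning rate.
        return peak_lr
    # Cooldown: linear decay from peak_lr toward final_lr.
    progress = min(1.0, (step - warmup_steps - stable_steps) / max(1, decay_steps))
    return peak_lr + (final_lr - peak_lr) * progress

# Hypothetical step counts, not the Apertus values; the cooldown onset
# is where a visible change in the loss slope is typically expected.
for s in (0, 500, 10_000, 11_000, 12_000):
    print(s, wsd_lr(s, warmup_steps=1_000, stable_steps=10_000,
                    decay_steps=1_000, peak_lr=3e-4))
```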
References
It remains unclear why this was the case; our main hypothesis is that the peak learning rate was set too low and that the model had not yet converged on the phase 4 data mixture. Due to the tight schedule of the project, we were unable to establish proper scaling rules for learning rate or experiment with more values at scale.
— Apertus: Democratizing Open and Compliant LLMs for Global Language Environments
(arXiv:2509.14233, Hernández-Cano et al., 17 Sep 2025), Section 2.6 "Final Run Retrospective" (Model Architecture and Pretraining Recipe), Cooldown paragraph
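The quoted retrospective notes that no learning-rate scaling rules had been established at this scale. As a purely illustrative example of what such a rule can look like, the sketch below applies a common µP-style width-based heuristic; this is not a rule proposed in the Apertus paper, and the proxy values are hypothetical.

```python
def scale_peak_lr(base_lr: float,
                  base_width: int,
                  target_width: int) -> float:
    """µP-style heuristic: scale the peak learning rate inversely with
    hidden width to keep per-parameter update size roughly constant.
    Purely illustrative; not the rule used for Apertus."""
    return base_lr * base_width / target_width

# Hypothetical transfer of a peak LR tuned on a small proxy model
# (width 4096) to a larger model (width 8192); values are assumptions.
proxy_lr = 3e-4
large_lr = scale_peak_lr(proxy_lr, base_width=4096, target_width=8192)
print(f"scaled peak LR: {large_lr:.2e}")  # 1.50e-04
```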