Influence of learning rate scheduling on the Stage 1 vs. Stage 2 training gap

Determine how learning rate scheduling affects the observed narrowing of the bits-per-byte gap between runs that include Stage 1 subword-to-byte distillation and runs that begin directly with Stage 2 end-to-end training, and assess whether the observed trends extrapolate to larger pretraining token budgets.
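For concreteness, the gap is measured in bits per byte, which normalizes cross-entropy loss by the raw byte count so that byte-level and subword-level models are directly comparable. A minimal sketch of the conversion (the function name and example numbers are illustrative, not taken from the paper):

```python
import math

def bits_per_byte(total_nats: float, n_bytes: int) -> float:
    """Convert summed cross-entropy loss (in nats) over a corpus into
    bits per byte, the metric the Stage 1 vs. Stage 2 gap is reported in."""
    return total_nats / (n_bytes * math.log(2))

# Example: a model with mean loss 2.1 nats/token over 1,000 tokens
# spanning 4,200 bytes of raw text.
print(bits_per_byte(2.1 * 1000, 4200))  # ~0.721 bits/byte
```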

Background

Bolmo’s training uses a two-stage procedure: Stage 1 (frozen global model with subword-to-byte distillation) followed by Stage 2 (end-to-end training). Ablations show a performance benefit from Stage 1, with a gap that narrows over the course of training but persists.
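A schematic of the two stages, assuming a PyTorch-style model whose global (subword-initialized) component can be frozen independently of its byte-level components; the attribute `global_model`, the teacher interface, and the loss choices are hypothetical stand-ins, not the paper's confirmed implementation:

```python
import torch
import torch.nn.functional as F

def stage1_distillation_step(model, batch, subword_teacher, optimizer):
    """Stage 1 (sketch): freeze the global model and train only the
    byte-level components against the subword teacher's distributions."""
    for p in model.global_model.parameters():  # hypothetical attribute
        p.requires_grad_(False)
    logits = model(batch["bytes"])
    with torch.no_grad():
        # Assumed: teacher outputs already mapped onto byte-level targets.
        teacher_probs = subword_teacher(batch["bytes"])
    loss = F.kl_div(logits.log_softmax(dim=-1), teacher_probs,
                    reduction="batchmean")
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()

def stage2_end_to_end_step(model, batch, optimizer):
    """Stage 2 (sketch): unfreeze everything and train end-to-end with
    the standard next-byte cross-entropy objective."""
    for p in model.parameters():
        p.requires_grad_(True)
    logits = model(batch["bytes"][:, :-1])
    loss = F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                           batch["bytes"][:, 1:].reshape(-1))
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```

The intuition behind the staging is that Stage 1 lets the byte-level components align with the pretrained global model before full gradients flow through everything in Stage 2, reducing the risk of catastrophic forgetting from large early-training gradients.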

The authors explicitly state uncertainty about the role of learning rate scheduling in this behavior and caution against extrapolating the observed trends to larger token budgets without further investigation.
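One way to probe the scheduling question is to rerun both variants under different decay shapes at matched token counts. A cosine-with-warmup schedule is sketched below as a common default for LLM pretraining; this is an assumption, not Bolmo's confirmed configuration, and all hyperparameters are illustrative:

```python
import math

def lr_at_step(step: int, total_steps: int, peak_lr: float,
               warmup_steps: int = 2000, min_lr_ratio: float = 0.1) -> float:
    """Cosine decay with linear warmup (all values illustrative)."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return peak_lr * (min_lr_ratio + (1.0 - min_lr_ratio) * cosine)
```

Under a decaying schedule, late-training updates are small, so a narrowing gap could reflect the shrinking learning rate rather than genuine convergence of the two variants; comparing against a constant or trapezoidal schedule at the same token budget would help disentangle the two.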

References

There are two main takeaways: (i) the 1B model benefits more from Stage 1 training than the 7B model, indicating that larger models may be more robust to catastrophic forgetting caused by large gradients at the start of training when starting directly with Stage 2; and (ii) the bits-per-byte gap narrows throughout the training trajectory but remains in favor of adding Stage 1. It is not clear how this behavior is influenced by the learning rate schedule, so the trends cannot easily be extrapolated to higher token budgets.

Minixhofer et al., "Bolmo: Byteifying the Next Generation of Language Models" (arXiv:2512.15586, 17 Dec 2025), Section 6.2 (Ablations: Is Stage 1 Training Necessary?)