Learning-rate schedule as the cause of attention-rank differences at high TPP

Ascertain whether the observed differences in attention matrix pseudo-rank for OLMo-2-1B models trained at 140 tokens-per-parameter are caused by the use of a warmup-stable-decay learning rate schedule rather than a cosine decay schedule.

Background

In additional analyses, the authors compare attention matrix pseudo-ranks across training regimes and observe that, for OLMo-2-1B models trained at 140 TPP with weight decay 0.1, the pseudo-rank is generally smaller than for models trained at 20 TPP and for a fully trained reference model.
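The paper's exact pseudo-rank definition is not reproduced here; a common proxy, assumed purely for illustration, counts the singular values that exceed a relative tolerance of the largest one. A minimal sketch (the function name `pseudo_rank` and the tolerance `rel_tol` are our own choices, not the paper's):

```python
import numpy as np

def pseudo_rank(A, rel_tol=1e-3):
    """Count singular values above rel_tol * largest singular value.

    This is one common numerical-rank proxy; the paper may use a
    different definition (e.g. entropy-based effective rank).
    """
    s = np.linalg.svd(A, compute_uv=False)
    return int(np.sum(s > rel_tol * s[0]))

rng = np.random.default_rng(0)
# A 64x64 matrix of exact rank 4 (outer product of thin factors).
low_rank = rng.normal(size=(64, 4)) @ rng.normal(size=(4, 64))
# A dense Gaussian matrix, which is full rank almost surely.
full_rank = rng.normal(size=(64, 64))

print(pseudo_rank(low_rank))   # 4
print(pseudo_rank(full_rank))  # close to 64 for a dense Gaussian matrix
```

Applied to row-stochastic attention matrices, a lower count of this kind indicates that attention mass is concentrated in fewer directions, which is the quantity being compared across training regimes.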

They hypothesize that these differences are attributable to the learning rate schedule (warmup-stable-decay versus cosine) rather than training duration alone, and point to emerging evidence that these schedules lead to distinct training dynamics despite similar validation loss profiles.
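The qualitative difference between the two schedules can be sketched as follows; the warmup length, decay fraction, and learning-rate values below are illustrative placeholders, not the paper's training settings:

```python
import math

def cosine_schedule(step, total_steps, warmup, peak_lr, min_lr=0.0):
    """Linear warmup, then cosine decay from peak_lr down to min_lr."""
    if step < warmup:
        return peak_lr * step / warmup
    t = (step - warmup) / max(1, total_steps - warmup)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1 + math.cos(math.pi * t))

def wsd_schedule(step, total_steps, warmup, peak_lr,
                 decay_frac=0.1, min_lr=0.0):
    """Warmup-stable-decay: warmup, a long constant phase at peak_lr,
    then a short linear decay over the final decay_frac of training."""
    decay_start = int(total_steps * (1 - decay_frac))
    if step < warmup:
        return peak_lr * step / warmup
    if step < decay_start:
        return peak_lr
    t = (step - decay_start) / max(1, total_steps - decay_start)
    return peak_lr + t * (min_lr - peak_lr)

# At mid-training, cosine has already decayed substantially while
# WSD is still at the peak learning rate.
print(wsd_schedule(5000, 10000, 100, 1.0))     # 1.0
print(cosine_schedule(5000, 10000, 100, 1.0))  # roughly 0.5
```

The point of the comparison is that a model trained under WSD spends most of its budget at the peak learning rate, whereas cosine-trained models spend much of training at reduced rates, which is one plausible mechanism for the distinct dynamics the authors cite.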

References

"Hence, we conjecture that this is because the 140 TPP models were trained with a warmup-stable-decay learning rate schedule, whereas the 1x and 144x models were trained with a cosine learning rate schedule."

Weight Decay Improves Language Model Plasticity (Han et al., arXiv:2602.11137, 11 Feb 2026), Appendix A.2, "Additional analyses on attention matrix rank".