Learning-rate schedule as the cause of attention-rank differences at high TPP
Ascertain whether the observed differences in attention matrix pseudo-rank for OLMo-2-1B models trained at 140 tokens-per-parameter are caused by the use of a warmup-stable-decay learning rate schedule rather than a cosine decay schedule.
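For readers unfamiliar with the two schedules being compared, the following is a minimal sketch of their shapes. It is not the paper's training configuration: the function names, the linear warmup, the linear form of the final decay, and all hyperparameter values are illustrative assumptions.

```python
import math


def cosine_lr(step, total_steps, peak_lr, warmup_steps=0, min_lr=0.0):
    """Cosine schedule: linear warmup, then a half-cosine decay to min_lr."""
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1 + math.cos(math.pi * progress))


def wsd_lr(step, total_steps, peak_lr, warmup_steps=0, decay_steps=0, min_lr=0.0):
    """Warmup-stable-decay: linear warmup, constant plateau, then a final decay.

    The decay here is linear (an assumption); other decay shapes are also used.
    """
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    decay_start = total_steps - decay_steps
    if step < decay_start:
        return peak_lr  # stable phase: learning rate held at its peak
    progress = (step - decay_start) / max(1, decay_steps)
    return peak_lr + (min_lr - peak_lr) * progress


if __name__ == "__main__":
    # Hypothetical values chosen only to show the difference in shape.
    total, peak = 1000, 1.0
    for s in (0, 250, 500, 750, 950):
        print(s, round(cosine_lr(s, total, peak), 3),
              round(wsd_lr(s, total, peak, decay_steps=100), 3))
```

The practical difference is that the cosine schedule lowers the learning rate throughout training, whereas warmup-stable-decay keeps it at its peak for most of training and lowers it only in a short final phase.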
References
Hence, we conjecture that this is because the 140 TPP models were trained with a warmup-stable-decay learning rate schedule, whereas the 1x and 144x models were trained with a cosine learning rate schedule.
— Weight Decay Improves Language Model Plasticity
(2602.11137 - Han et al., 11 Feb 2026) in Appendix A.2, Additional analyses on attention matrix rank