Generalizability of findings to dense Transformer architectures

Determine whether the empirical findings on learning rate configuration for Mixture-of-Experts Transformer architectures in large-scale pre-training under the Warmup-Stable-Decay schedule generalize to dense Transformer architectures. This includes the fitted scaling law relating the optimal learning rate to model size and data size, as well as the observed performance differences between the Fitting paradigm and μTransfer.

Background

The paper investigates how to set learning rates for large-scale pre-training, focusing on two paradigms: a Fitting paradigm, which fits a scaling law for the optimal learning rate as a function of model size and data size under a Warmup-Stable-Decay schedule, and a Transfer paradigm, which extends μTransfer to Mixture-of-Experts architectures and additional dimensions.
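
To make the Fitting paradigm concrete, the sketch below fits a generic power-law form lr_opt(N, D) = a * N^b * D^c to hypothetical learning-rate sweep results in log space and extrapolates it to a larger configuration. The functional form, the symbols N (model size) and D (data size), and all numeric values are illustrative assumptions, not the paper's actual fitted scaling law.

    import numpy as np

    # Hypothetical sweep results (illustrative only): model size N in parameters,
    # data size D in tokens, and the best learning rate found for each pair.
    N = np.array([1e8, 1e8, 1e9, 1e9, 1e10, 1e10])
    D = np.array([1e10, 1e11, 1e10, 1e11, 1e11, 1e12])
    best_lr = np.array([3.0e-3, 2.7e-3, 1.9e-3, 1.7e-3, 1.1e-3, 9.5e-4])

    # Assumed power-law form lr_opt(N, D) = a * N**b * D**c, which is linear in
    # log space: log(lr) = log(a) + b*log(N) + c*log(D).
    X = np.column_stack([np.ones_like(N), np.log(N), np.log(D)])
    (log_a, b, c), *_ = np.linalg.lstsq(X, np.log(best_lr), rcond=None)

    # Extrapolate the fitted law to a larger (hypothetical) target run.
    N_target, D_target = 3e10, 3e12
    predicted_lr = np.exp(log_a) * N_target**b * D_target**c
    print(f"fitted exponents b={b:.3f}, c={c:.3f}; predicted LR = {predicted_lr:.2e}")

Fitting by least squares in log space keeps the sketch dependency-light; the paper's actual fitting procedure and functional form may differ.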

All experiments and conclusions are derived from Mixture-of-Experts Transformer models. In the Limitations section, the authors explicitly note that whether these findings carry over to dense Transformer architectures has not been verified, making cross-architecture generalization an explicit unresolved question.
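
For the Transfer paradigm, the sketch below illustrates the commonly cited μTransfer (μP) rule for Adam of scaling the learning rate of hidden, matrix-like weights inversely with width when moving from a small proxy model to a wider target, while keeping the proxy's rate for vector-like parameters. The parameter grouping, widths, and base learning rate are simplifying assumptions, and the paper's Mixture-of-Experts-specific extensions are not reproduced here.

    def transfer_lr(base_lr: float, base_width: int, target_width: int,
                    param_kind: str = "hidden") -> float:
        """Transfer a learning rate tuned on a narrow proxy model to a wider target."""
        if param_kind == "hidden":
            # Matrix-like weights (attention, MLP): LR scales as 1/width under muP.
            return base_lr * base_width / target_width
        # Vector-like parameters (biases, norms) keep the proxy's learning rate.
        return base_lr

    base_lr = 1e-2  # hypothetical value found by sweeping the small proxy model
    print(transfer_lr(base_lr, base_width=256, target_width=4096))  # 0.000625
    print(transfer_lr(base_lr, base_width=256, target_width=4096,
                      param_kind="vector"))                         # 0.01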

References

The generalizability of our findings to Dense architectures remains to be verified in future work.

How to Set the Learning Rate for Large-Scale Pre-training? (2601.05049 - Zhou et al., 8 Jan 2026) in Section: Limitations