Generalizability of findings to dense Transformer architectures
Determine whether the empirical findings on learning-rate configuration for Mixture-of-Experts (MoE) Transformer architectures in large-scale pre-training under the Warmup-Stable-Decay (WSD) schedule generalize to dense Transformer architectures. This includes the fitted scaling law for the optimal learning rate with respect to model size and data size, as well as the observed performance differences between the Fitting paradigm and μTransfer. A sketch of the Fitting idea is given below.
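The Fitting paradigm referred to here searches for the optimal peak learning rate at several small (model size, data size) configurations, fits a scaling law to those optima, and extrapolates it to the target scale. A minimal sketch of that idea, assuming a power-law form lr_opt ≈ C · N^α · D^β fitted in log space; the functional form, variable names, and all numbers below are illustrative assumptions, not values from the paper:

```python
import numpy as np

# Hypothetical observations: optimal peak learning rates found by sweeps
# at several (model size N, data size D) configurations. Placeholder values.
N = np.array([1e8, 4e8, 1.6e9, 6.4e9])      # non-embedding parameters
D = np.array([2e10, 8e10, 3.2e11, 1.3e12])  # training tokens
lr_opt = np.array([3.2e-3, 1.6e-3, 7.5e-4, 3.4e-4])

# Fit an assumed power law  lr_opt = C * N**alpha * D**beta
# by ordinary least squares in log space.
X = np.column_stack([np.ones_like(N), np.log(N), np.log(D)])
coef, *_ = np.linalg.lstsq(X, np.log(lr_opt), rcond=None)
logC, alpha, beta = coef

def predict_lr(n_params: float, n_tokens: float) -> float:
    """Extrapolate the fitted law to a larger (N, D) configuration."""
    return float(np.exp(logC) * n_params**alpha * n_tokens**beta)

print(f"alpha={alpha:.3f}, beta={beta:.3f}, C={np.exp(logC):.2e}")
print(f"predicted optimal LR at N=7e10, D=1e13: {predict_lr(7e10, 1e13):.2e}")
```

Verifying generalizability to dense Transformers would amount to repeating such a fit on dense models and checking whether the fitted exponents and the Fitting-vs-μTransfer comparison carry over.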
References
The generalizability of our findings to Dense architectures remains to be verified in future work.
— How to Set the Learning Rate for Large-Scale Pre-training?
(2601.05049 - Zhou et al., 8 Jan 2026) in Section: Limitations