
Explain the origin of the measured scaling-law exponents

Determine the theoretical mechanism by which the Scion optimizer induces the observed optimal hyperparameter scaling laws—specifically, η*(B,D) ∝ B^{0.62} · D^{-0.56}, B*(D) ∝ D^{0.45}, and η*(D) ∝ D^{-0.28}—and explain their apparent correspondence to square-root and quarter-power laws in large-scale language model pretraining.


Background

The paper empirically measures how the optimal learning rate η* and batch size B* scale with the training token horizon D when using the Scion optimizer. The authors find η*(B,D) ∝ B^{0.62} · D^{-0.56}, B*(D) ∝ D^{0.45}, and consequently η*(D) ∝ D^{-0.28}, consistent with square-root and quarter-power behaviors commonly associated with Adam-like scaling rules.
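The composed exponent can be checked directly: substituting the batch-size rule B*(D) ∝ D^{0.45} into η*(B,D) gives

η*(D) = η*(B*(D), D) ∝ (D^{0.45})^{0.62} · D^{-0.56} = D^{0.62·0.45 − 0.56} = D^{-0.281} ≈ D^{-0.28},

so the measured exponents 0.45 and −0.28 sit close to the square-root (1/2) and quarter-power (−1/4) values the authors allude to.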

Although the authors are confident that these rules hold at even larger scales, they explicitly state that they do not know why the exponents arise in this form, and they call for a unifying theoretical explanation connecting norm-based optimization principles to the empirically observed scaling exponents.
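As a minimal sketch of how such rules are used in practice (not code from the paper), the fitted power laws can extrapolate tuned hyperparameters from a reference run to a longer token horizon; the reference values below (eta_ref, b_ref, d_ref) are hypothetical placeholders:

# Hypothetical sketch, not from the paper: extrapolating hyperparameters
# with the fitted Scion rules eta*(B, D) ∝ B^0.62 · D^-0.56 and
# B*(D) ∝ D^0.45. The reference run's values are placeholders.

ETA_EXP_B, ETA_EXP_D = 0.62, -0.56  # exponents in eta*(B, D)
BATCH_EXP_D = 0.45                  # exponent in B*(D)

def scaled_hparams(d_target, eta_ref, b_ref, d_ref):
    """Scale a tuned (eta_ref, b_ref) from horizon d_ref to d_target."""
    b_target = b_ref * (d_target / d_ref) ** BATCH_EXP_D
    eta_target = eta_ref * ((b_target / b_ref) ** ETA_EXP_B
                            * (d_target / d_ref) ** ETA_EXP_D)
    return eta_target, b_target

# Example: hypothetical run tuned at 1B tokens, extrapolated to 100B.
eta, b = scaled_hparams(d_target=100e9, eta_ref=3e-3, b_ref=256, d_ref=1e9)
print(f"eta* ≈ {eta:.2e}, B* ≈ {b:.0f}")
# Net learning-rate change: 100^(0.62*0.45 - 0.56) = 100^-0.281 ≈ 0.27x,
# matching the composed rule eta*(D) ∝ D^-0.28.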

References

While we are confident that the scaling rules in Sec.~\ref{sec:optimal-lr-bs} hold at even larger scales, we still don't know why they are induced in this form, very much resembling square-root and 1/4-power laws.

Optimal Scaling Needs Optimal Norm (arXiv:2510.03871, Filatov et al., 4 Oct 2025), Section 6, Conclusion and Discussion.