Explain the origin of the measured scaling-law exponents
Determine the theoretical mechanism by which the Scion optimizer induces the observed optimal hyperparameter scaling laws—specifically, η*(B,D) ∝ B^{0.62} · D^{-0.56}, B*(D) ∝ D^{0.45}, and η*(D) ∝ D^{-0.28}—and explain their apparent correspondence to square-root and quarter-power laws in large-scale language model pretraining.
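As a quick consistency check on the three exponents quoted above, the sketch below composes the first two power laws and compares the result with the third. It rests on two assumptions not stated in the source: that η*(D) is obtained by substituting B*(D) into η*(B,D), and that the "square-root and quarter-power" reference values are B^{1/2}, D^{-1/2}, B* ∝ D^{1/2}, and η* ∝ D^{-1/4}. The function name is illustrative only.

```python
# Measured exponents quoted in the problem statement.
LR_EXP_BATCH = 0.62     # eta*(B, D) ∝ B^0.62
LR_EXP_DATA = -0.56     # eta*(B, D) ∝ D^-0.56
BATCH_EXP_DATA = 0.45   # B*(D) ∝ D^0.45


def composed_lr_exponent(lr_exp_batch, lr_exp_data, batch_exp_data):
    """Exponent of D in eta*(B*(D), D), assuming pure power laws.

    Substituting B = D^b into eta ∝ B^a * D^c gives eta ∝ D^(a*b + c).
    """
    return lr_exp_batch * batch_exp_data + lr_exp_data


# Composition of the measured exponents (assumption: eta*(D) = eta*(B*(D), D)).
measured = composed_lr_exponent(LR_EXP_BATCH, LR_EXP_DATA, BATCH_EXP_DATA)

# Idealized reference values (assumption: exact square-root laws).
idealized = composed_lr_exponent(0.5, -0.5, 0.5)

print(f"measured composition:  D^{measured:.3f}")
print(f"idealized composition: D^{idealized:.3f}")
```

Under these assumptions the measured exponents compose to roughly D^{-0.28}, matching the quoted η*(D), while exact square-root laws compose to D^{-1/4}. This shows only that the three fits are mutually consistent; it does not explain why the optimizer induces them.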
References
While we are confident that the scaling rules in Sec.~\ref{sec:optimal-lr-bs} hold at even larger scales, we still don't know why they are induced in this form, very much resembling square-root and 1/4-power laws.
— Optimal Scaling Needs Optimal Norm
(2510.03871 - Filatov et al., 4 Oct 2025) in Section 6 Conclusion and Discussion