Identify the optimal norm and assess generality beyond Scion
Ascertain which induced operator norm is truly optimal for guiding joint model and dataset scaling; determine whether the RMS-to-infinity norm of the output layer used in Scion is uniquely optimal, or whether other norms (e.g., RMS-to-RMS for output, 1-to-RMS for input) or other optimizers exhibit the same norm-transfer and optimality behavior.
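For concreteness, the three candidate norms named above have simple closed forms under the usual convention $\lVert x \rVert_{\mathrm{RMS}} = \lVert x \rVert_2 / \sqrt{d}$ for $x \in \mathbb{R}^d$ and $\lVert W \rVert_{\alpha \to \beta} = \sup_{x \neq 0} \lVert W x \rVert_\beta / \lVert x \rVert_\alpha$. The NumPy sketch below is illustrative only: the function names are hypothetical, the weight matrix is assumed to have shape `(d_out, d_in)`, and nothing here is taken from the paper's code.

```python
import numpy as np

def rms_to_rms(W):
    """Induced RMS->RMS norm: sqrt(d_in / d_out) times the spectral norm."""
    d_out, d_in = W.shape
    return np.sqrt(d_in / d_out) * np.linalg.norm(W, ord=2)

def rms_to_inf(W):
    """Induced RMS->infinity norm: sqrt(d_in) times the largest row 2-norm."""
    d_out, d_in = W.shape
    return np.sqrt(d_in) * np.max(np.linalg.norm(W, axis=1))

def one_to_rms(W):
    """Induced 1->RMS norm: the largest column RMS norm."""
    d_out, d_in = W.shape
    return np.max(np.linalg.norm(W, axis=0)) / np.sqrt(d_out)

if __name__ == "__main__":
    W = np.eye(4)  # toy example, assumed shape (d_out, d_in) = (4, 4)
    print(rms_to_rms(W), rms_to_inf(W), one_to_rms(W))  # 1.0 2.0 0.5
```

Tracking all three quantities for the same layer during a sweep is one way to probe whether the norm-transfer behavior is specific to a single choice of norm.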
References
Which norm is exactly optimal? We paid most of our attention to $\lVert \mathrm{out} \rVert_{\mathrm{RMS} \to \infty}$, but are the observed phenomena really specific to this one only? And to the Scion optimizer only? We don't yet have answers to those questions, but we believe our study scratches the surface of exciting phenomena that remain to be fully understood.
— Optimal Scaling Needs Optimal Norm
(2510.03871 - Filatov et al., 4 Oct 2025) in Section 6 Conclusion and Discussion