Identify the optimal norm and assess generality beyond Scion

Ascertain which induced operator norm is truly optimal for guiding joint model and dataset scaling; determine whether the RMS-to-infinity norm of the output layer used in Scion is uniquely optimal, or whether other norms (e.g., RMS-to-RMS for output, 1-to-RMS for input) or other optimizers exhibit the same norm-transfer and optimality behavior.

Background

Although the paper focuses primarily on the RMS-to-infinity operator norm of the output layer, ablations show similar transfer phenomena for other norms (e.g., RMS-to-RMS on the output and 1-to-RMS on the input), raising the question of which norm is genuinely optimal.

The authors explicitly ask whether the observed phenomena are specific to this particular norm or even to the Scion optimizer, indicating uncertainty about the generality of the findings.
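For concreteness, the three norms named above can be computed in closed form. The sketch below is not taken from the paper. It assumes the standard definition of an induced norm, the convention $\lVert x \rVert_{\mathrm{RMS}} = \lVert x \rVert_2 / \sqrt{d}$, and a weight matrix `W` of shape `(d_out, d_in)` acting as `y = W @ x`. The function names are hypothetical.

```python
import numpy as np

def rms_to_rms(W):
    # sup ||Wx||_RMS / ||x||_RMS = sqrt(d_in / d_out) * largest singular value
    d_out, d_in = W.shape
    return np.sqrt(d_in / d_out) * np.linalg.norm(W, ord=2)

def one_to_rms(W):
    # sup over ||x||_1 <= 1 of ||Wx||_RMS is attained at a basis vector,
    # so it equals the largest column l2 norm divided by sqrt(d_out)
    d_out, _ = W.shape
    return np.max(np.linalg.norm(W, axis=0)) / np.sqrt(d_out)

def rms_to_inf(W):
    # sup over ||x||_RMS <= 1 of ||Wx||_inf = sqrt(d_in) * largest row l2 norm
    # (Assumed convention: each row of W produces one output coordinate.)
    _, d_in = W.shape
    return np.sqrt(d_in) * np.max(np.linalg.norm(W, axis=1))
```

Under these assumptions, the 4×4 identity matrix has RMS-to-RMS norm 1, 1-to-RMS norm 0.5, and RMS-to-infinity norm 2, which shows how differently the three norms scale with width even for the same weights.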

References

Which norm is exactly optimal? We paid most of our attention to $\lVert \mathrm{out} \rVert_{\mathrm{RMS} \to \infty}$, but are the observed phenomena really specific to this one only? And to the Scion optimizer only? We don't yet have answers to those questions, but we believe our study scratches the surface of exciting phenomena that remain to be fully understood.

— Optimal Scaling Needs Optimal Norm (2510.03871 - Filatov et al., 4 Oct 2025) in Section 6 Conclusion and Discussion
