
Explain the mechanism of optimal norm transfer and characterize the constant-norm manifold

Determine the mechanism that causes optimal learning rate and batch size configurations to exhibit norm transfer—i.e., a constant RMS-to-infinity operator norm of the output layer under Scion across model and dataset scaling—and characterize the structure of the corresponding constant-norm manifold along the scaling axes.


Background

Norm transfer is observed as an invariant: at the optimal configuration, the operator norm of the output layer stays essentially constant across width and depth scaling of the model and across dataset-size scaling. This provides a necessary condition for optimality but is not sufficient, since multiple learning-rate/batch-size pairs (η, B) can reach the same constant norm.
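For concreteness, the tracked quantity can be computed directly: under the standard convention ||x||_RMS = ||x||_2 / sqrt(d_in), the RMS-to-infinity operator norm of a weight matrix reduces to sqrt(d_in) times its largest row 2-norm. The sketch below is a minimal illustration of that computation, not the authors' code; the matrix shape and initialization are hypothetical.

```python
import numpy as np

def rms_to_linf_norm(W: np.ndarray) -> float:
    """RMS -> l_inf operator norm of a weight matrix W with shape (d_out, d_in).

    With ||x||_RMS = ||x||_2 / sqrt(d_in), the induced norm reduces to
    sqrt(d_in) * max_i ||W[i, :]||_2, i.e. the largest row 2-norm, rescaled.
    """
    d_in = W.shape[1]
    row_norms = np.linalg.norm(W, ord=2, axis=1)  # per-row Euclidean norms
    return float(np.sqrt(d_in) * row_norms.max())

# Illustrative output layer (vocab x width); norm transfer predicts that this
# value stays roughly constant at the optimal (η, B) as width, depth, and
# dataset size are scaled.
W_out = np.random.randn(50257, 1024) / np.sqrt(1024)
print(rms_to_linf_norm(W_out))
```

Logging this scalar over training for several model sizes would make the constant-norm behavior described above directly visible, and would also show that matching the norm alone does not pin down a unique (η, B) pair.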

The authors explicitly ask why this transfer occurs, noting that it is puzzling both what makes the optimal scaling trajectory remain on the constant-norm manifold and what defines the manifold's structure, and they call for an explanation of the mechanism together with a characterization of that structure.

References

Why does optimal norm transfer? It is puzzling what makes the optimal scaling trajectory remain on the constant norm manifold, as well as what defines its structure. We don't yet have answers to those questions, but we believe our study scratches the surface of exciting phenomena that remain to be fully understood.

Optimal Scaling Needs Optimal Norm (Filatov et al., arXiv:2510.03871, 4 Oct 2025), Section 6: Conclusion and Discussion