Universality and theory of the 0.32 data-scaling exponent for optimal learning rate
Investigate and theoretically explain the observed power-law exponent of approximately 0.32 that relates the optimal learning rate to the number of training tokens under MuonH Frobenius-sphere optimization, and determine, through rigorous empirical verification, whether this exponent is universal across optimizers and architectures.
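A minimal sketch of the empirical verification step: assuming optimal learning rates have been measured via LR sweeps at several data budgets, the exponent can be estimated by a log-log linear fit of optimal LR against token count. The numbers, the decay direction (lr_opt ∝ tokens^(-0.32)), and the noise model below are illustrative assumptions, not values from the paper.

```python
import numpy as np

# Hypothetical (tokens, optimal learning rate) pairs, e.g. collected from LR
# sweeps at several data budgets. Synthetic data for illustration only: we
# assume lr_opt ∝ tokens**(-0.32) with small multiplicative noise.
rng = np.random.default_rng(0)
tokens = np.array([1e9, 2e9, 5e9, 1e10, 2e10, 5e10, 1e11])
assumed_exponent = -0.32
lr_opt = 0.05 * (tokens / tokens[0]) ** assumed_exponent
lr_opt *= np.exp(rng.normal(scale=0.02, size=lr_opt.shape))  # measurement noise

# Fit log(lr_opt) = exponent * log(tokens) + const by least squares;
# np.polyfit returns [slope, intercept] for deg=1.
exponent, intercept = np.polyfit(np.log(tokens), np.log(lr_opt), deg=1)
print(f"fitted exponent: {exponent:.3f}")  # ≈ -0.32 on this synthetic data
```

Repeating this fit across optimizers and architectures, and checking whether the fitted exponents agree within their confidence intervals, would be one way to probe the claimed universality.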
References
This "magic exponent" may be a universal property of gradient-based optimization in neural networks, independent of the specific optimizer. We leave a more rigorous empirical verification and the theoretical analysis of this coincidence as intriguing future work.
— Ren et al., "Rethinking Language Model Scaling under Transferable Hypersphere Optimization" (arXiv:2603.28743, 30 Mar 2026), Section 3.4, Data Scaling.