Universality and theory of the 0.32 data-scaling exponent for optimal learning rate

Theoretically explain the observed power-law exponent of approximately 0.32 relating the optimal learning rate to the number of training tokens under MuonH Frobenius-sphere optimization, and determine through rigorous empirical verification whether this exponent is universal across optimizers and architectures.

Background

Empirical sweeps show that the optimal learning rate scales with training tokens as η* ∝ T^{-0.32} for models trained with MuonH under Frobenius-sphere constraints. This matches an exponent previously reported for AdamW, suggesting a possible universal behavior.
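
The sketch below illustrates how such an exponent is typically estimated from sweep results: fit a line to log η* versus log T, so that η* = c · T^(-α). The token counts and optimal learning rates here are hypothetical placeholders consistent with α ≈ 0.32, not values from the paper.

```python
import numpy as np

# Hypothetical sweep results: (training tokens, optimal learning rate at that budget).
tokens = np.array([1e9, 2e9, 4e9, 8e9, 16e9])
opt_lr = np.array([3.2e-3, 2.6e-3, 2.1e-3, 1.7e-3, 1.35e-3])

# Fit log eta* = log c + slope * log T, so eta* = c * T^slope and alpha = -slope.
slope, intercept = np.polyfit(np.log(tokens), np.log(opt_lr), deg=1)
alpha = -slope
print(f"fitted exponent alpha ~= {alpha:.3f}")  # ~0.32 if the power law holds

# Extrapolate the optimal learning rate to a larger token budget.
T_new = 64e9
eta_new = np.exp(intercept) * T_new ** slope
print(f"predicted eta* at {T_new:.0e} tokens: {eta_new:.2e}")
```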

The authors highlight the need for both rigorous empirical validation of this universality and a theoretical account for why this exponent emerges across different optimizers.
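
One possible form such empirical validation could take (an assumption here, not the authors' protocol) is to fit the exponent separately for each optimizer and compare the estimates with bootstrap confidence intervals; overlapping intervals are consistent with, though they do not prove, a shared exponent. All sweep data below are hypothetical placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)

def fit_exponent(tokens, opt_lr):
    # Slope of log eta* vs. log T; the data-scaling exponent is its negation.
    slope, _ = np.polyfit(np.log(tokens), np.log(opt_lr), deg=1)
    return -slope

def bootstrap_ci(tokens, opt_lr, n_boot=2000):
    # Resample sweep points with replacement and refit the exponent.
    n, draws = len(tokens), []
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)
        if len(set(idx.tolist())) < 2:  # need at least two distinct budgets to fit a line
            continue
        draws.append(fit_exponent(tokens[idx], opt_lr[idx]))
    return np.percentile(draws, [2.5, 97.5])

# Hypothetical (tokens, optimal learning rate) sweeps per optimizer.
sweeps = {
    "MuonH": (np.array([1e9, 2e9, 4e9, 8e9, 16e9]),
              np.array([3.2e-3, 2.6e-3, 2.1e-3, 1.7e-3, 1.35e-3])),
    "AdamW": (np.array([1e9, 2e9, 4e9, 8e9, 16e9]),
              np.array([1.0e-3, 8.1e-4, 6.5e-4, 5.2e-4, 4.2e-4])),
}

for name, (tokens, opt_lr) in sweeps.items():
    lo, hi = bootstrap_ci(tokens, opt_lr)
    print(f"{name}: alpha = {fit_exponent(tokens, opt_lr):.3f}, 95% CI [{lo:.3f}, {hi:.3f}]")
```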

References

This "magic exponent" may be a universal property of gradient-based optimization in neural networks, independent of the specific optimizer. We leave a more rigorous empirical verification and the theoretical analysis of this coincidence as intriguing future work.

Rethinking Language Model Scaling under Transferable Hypersphere Optimization (2603.28743, Ren et al., 30 Mar 2026), Section 3.4 Data Scaling