Rigorous convergence rates for norm-based and preconditioned optimizers

Establish rigorous convergence rates for the optimization methods surveyed in this work—including architecture-aware preconditioners and norm-based optimizers such as KFAC, EKFAC, Shampoo, SOAP, SPlus, and Muon—on non-convex deep neural network objectives, and identify assumptions and step-size regimes under which these rates hold.
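As a concrete target, a guarantee of the kind called for here would typically bound the expected gradient norm of a smooth, possibly non-convex objective under bounded stochastic-gradient variance. The statement below is an illustrative template in the style of standard SGD analyses, not a result from the paper; the smoothness constant L, variance bound sigma^2, and step size eta are assumed quantities.

```latex
% Illustrative target statement (assumed setting: L-smooth f, unbiased
% stochastic gradients with variance bound \sigma^2, method-specific step size \eta):
\min_{0 \le t < T} \ \mathbb{E}\!\left[\lVert \nabla f(x_t) \rVert^2\right]
\;\le\; \mathcal{O}\!\left( \frac{f(x_0) - f^{\star}}{\eta\, T} + L\,\eta\,\sigma^2 \right),
\qquad \text{e.g. } \eta = \Theta\!\left(\tfrac{1}{\sqrt{T}}\right)
\ \Rightarrow\ \mathcal{O}\!\left(\tfrac{1}{\sqrt{T}}\right).
```

The open question is whether, and under what additional assumptions (e.g., on the preconditioner spectrum or the chosen norm), bounds of this form can be proved for the specific methods listed above.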

Background

The paper synthesizes classical and modern optimizers, emphasizing curvature-aware and norm-based designs (e.g., Kronecker-factored and spectral-geometry methods). While the empirical successes of these methods are well documented, formal convergence theory for them in the non-convex deep learning regime lags behind.
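To make the norm-based (spectral) design concrete, the sketch below shows a Muon-style update that replaces the momentum buffer with an approximate polar factor computed by a Newton-Schulz-type iteration. This is a minimal illustration, not the paper's implementation; the quintic coefficients and hyperparameters follow values commonly used in Muon implementations and are treated here as assumptions.

```python
import numpy as np

def newton_schulz_orthogonalize(G, steps=5, eps=1e-7):
    # Approximate the polar factor (nearest semi-orthogonal matrix) of G
    # with a quintic Newton-Schulz-type iteration.
    X = G / (np.linalg.norm(G) + eps)      # scale so singular values are <= 1
    a, b, c = 3.4445, -4.7750, 2.0315      # assumed quintic coefficients
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * (A @ A)) @ X
    return X

def muon_style_step(W, grad, momentum, lr=0.02, beta=0.95):
    # Norm-based update: accumulate momentum, orthogonalize it, then take a
    # step whose size is controlled in spectral norm rather than entrywise.
    momentum = beta * momentum + grad
    return W - lr * newton_schulz_orthogonalize(momentum), momentum
```

Establishing when such an orthogonalized update admits a descent inequality under smoothness assumptions, and at what rate, is precisely the kind of analysis the problem statement asks for.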

The author explicitly calls for rigorous convergence results tailored to these methods, reflecting a broader gap between practical performance and theoretical guarantees.

References

While this thesis has emphasized practical effectiveness and intuitive understanding, establishing rigorous convergence rates for the methods discussed — particularly in non-convex settings characteristic of deep learning — remains largely open.

Towards Guided Descent: Optimization Algorithms for Training Neural Networks At Scale (arXiv:2512.18373, Nagwekar, 20 Dec 2025), in Subsection “Theoretical Convergence Guarantees” within Section “Limitations of the Modular Norm Framework”.