Convergence Rate Analysis of SOAP with Arbitrary Orthogonal Projection Matrices

Published 23 Apr 2026 in math.OC | (2604.21616v2)

Abstract: In this short note, we establish, for the first time, the convergence rate of SOAP, an efficient and popular matrix-based optimizer for training deep neural networks. Our analysis extends to a more general variant of SOAP that admits arbitrary orthogonal projection matrices and requires only that these matrices be conditionally independent of the current stochastic gradient at each iteration. For example, they may be constructed from information available up to the preceding step.


Authors (2)

Summary

The paper presents formal convergence bounds for the SOAP optimizer using arbitrary orthogonal projections.
It employs advanced matrix-based stochastic optimization techniques, providing rate analysis in both ℓ₁ and nuclear norms.
The results quantify convergence trade-offs, offering practical insights for tuning hyperparameters in large-scale deep learning models.

Overview and Motivation

This work rigorously analyzes the convergence rate of SOAP (ShampoO with Adam in the Preconditioner's eigenbasis), a matrix-based optimizer increasingly used in deep neural network training, particularly for large-scale models. Unlike classical vector-based optimizers such as Adam and AdaGrad, matrix-based optimizers like Shampoo, Muon, and SOAP exploit the matrix structure of model weights. The analysis addresses a significant gap: while Muon has established convergence guarantees, theoretical understanding of more complex matrix optimizers, including SOAP, has lagged.

The paper presents, for the first time, formal convergence bounds for SOAP and its generalizations. It extends the theoretical framework to accommodate arbitrary orthogonal projection matrices, provided these matrices are conditionally independent of the current stochastic gradient at each iteration. This broadens the applicability of the results to optimizers where projections may utilize past optimization states, including Splus and ARO.

Algorithmic Structure and Generalizations

SOAP operates by adaptively preconditioning the gradient with orthogonal projections and updating parameters in the projected space. The structure leverages left and right singular vectors or eigenvectors derived from historical gradients or momentum terms. The generalized SOAP variant allows these orthogonal projections to be generated from any history-dependent procedure, subject to conditional independence from the current gradient. Formally, at each iteration, SOAP projects the gradient and momentum using $P_{k-1}$ and $Q_{k-1}$, updates a per-dimension adaptive preconditioner, and adjusts the parameter matrix in the projected space.

This generalization is significant: it not only encompasses the original SOAP algorithm but provides a unified framework for other projection-based matrix optimizers.
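The per-iteration structure described above can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the paper's exact algorithm: the function names, the hyperparameter defaults, the omission of bias correction and weight decay, and the Adam-style form of the second-moment update are all assumptions made for the example. The only property taken from the text is that `P` and `Q` are orthogonal and fixed before the current stochastic gradient `G` is drawn.

```python
import numpy as np

def random_orthogonal(d, rng):
    """Draw a d x d orthogonal matrix via QR of a Gaussian matrix."""
    q, _ = np.linalg.qr(rng.standard_normal((d, d)))
    return q

def soap_like_step(X, G, M, V, P, Q, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One simplified SOAP-style step (illustrative, hypothetical helper).

    X, G, M, V are m x n; P is m x m and Q is n x n, both orthogonal and
    chosen without looking at the current gradient G.
    """
    M = beta1 * M + (1.0 - beta1) * G            # momentum in original coordinates
    G_rot = P.T @ G @ Q                          # gradient in the projected space
    M_rot = P.T @ M @ Q                          # momentum in the projected space
    V = beta2 * V + (1.0 - beta2) * G_rot ** 2   # per-entry second moment, projected space
    step_rot = M_rot / (np.sqrt(V) + eps)        # Adam-like scaling, entry by entry
    X = X - lr * (P @ step_rot @ Q.T)            # map the step back and update
    return X, M, V

rng = np.random.default_rng(0)
m, n = 4, 3
X = rng.standard_normal((m, n))
M = np.zeros((m, n))
V = np.zeros((m, n))
P, Q = random_orthogonal(m, rng), random_orthogonal(n, rng)  # fixed before G is drawn
G = rng.standard_normal((m, n))                              # stand-in stochastic gradient
X_new, M_new, V_new = soap_like_step(X, G, M, V, P, Q)
```

With `P` and `Q` set to identity matrices, the sketch reduces to an entrywise Adam-style update without bias correction, which is one way to see the projected step as Adam run in a rotated coordinate system.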

Main Theoretical Results

The core contribution is the establishment of convergence rates in two norms: entrywise $\ell_1$ and nuclear norm (sum of singular values), with explicit dependence on matrix dimensions and variance parameters. The results are stated as follows:

Entrywise $\ell_1$ Norm Rate:

$\frac{1}{K}\sum_{k=1}^K E\left[\|P_{k-1}^T \nabla f(X_k) Q_{k-1}\|_1\right] \leq 10 \sqrt{\frac{\hat\sigma_F^2 L (f(X_1) - f^*)}{K \sigma_{op}^2} + 4 \sqrt{mn} \sqrt[4]{\frac{\hat\sigma_F^2 L (f(X_1) - f^*)}{K}}}$

Nuclear Norm Rate:

$\frac{1}{K}\sum_{k=1}^K E\left[\|\nabla f(X_k)\|_*\right] \leq 10 \sqrt{\frac{\hat\sigma_F^2 L (f(X_1) - f^*)}{K \sigma_{op}^2} + 4 \sqrt{mn} \sqrt[4]{\frac{\hat\sigma_F^2 L (f(X_1) - f^*)}{K}}}$

These results require only that the projection matrices be generated in a conditionally independent manner with respect to the current gradient, markedly relaxing requirements compared to prior analyses.
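To make the conditional-independence requirement concrete, the sketch below builds the projections from statistics of earlier steps only, then draws the current gradient. The eigenvector-based construction, the names `L_stat` and `R_stat`, the decay `beta`, and the Gaussian stand-in for the stochastic gradient are assumptions chosen for illustration; the analysis as summarized here permits any history-dependent procedure.

```python
import numpy as np

def eigenbasis(S):
    """Orthogonal eigenvector matrix of a symmetric matrix S."""
    _, vecs = np.linalg.eigh(S)
    return vecs

rng = np.random.default_rng(1)
m, n, beta = 5, 3, 0.95
L_stat = np.eye(m)   # running estimate of G G^T (hypothetical name)
R_stat = np.eye(n)   # running estimate of G^T G (hypothetical name)

for k in range(1, 4):
    # 1) Fix the projections from statistics of steps 1..k-1 only.
    P_prev, Q_prev = eigenbasis(L_stat), eigenbasis(R_stat)
    # 2) Only now draw the current stochastic gradient (Gaussian stand-in).
    G_k = rng.standard_normal((m, n))
    G_rot = P_prev.T @ G_k @ Q_prev   # projected gradient the optimizer would use
    # 3) Fold G_k into the statistics; it affects the projections from step k+1 on.
    L_stat = beta * L_stat + (1.0 - beta) * (G_k @ G_k.T)
    R_stat = beta * R_stat + (1.0 - beta) * (G_k.T @ G_k)
```

The ordering of steps 1 and 2 is the point of the example: computing the eigenbases after folding `G_k` into the statistics would make the projections depend on the current gradient, which is the situation the stated condition excludes.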

Notably, while the $\ell_1$ norm rate matches known lower bounds for nonconvex stochastic optimization under ideal scaling, the nuclear norm rate is $\sqrt{\min\{m,n\}}$ times slower than AdamW-style Shampoo and at least $\sqrt{\max\{m,n\}}$ times slower than the theoretical lower bound. These rates state explicitly the limitations and trade-offs of matrix-adaptive projection, in contrast with the often superior per-iteration complexity of vector-based Adam variants.

Technical Approach

The analysis employs advanced stochastic optimization techniques for matrix-valued parameters, adapting the proof structure recently established for AdamW and RMSProp in $\ell_1$ norm metrics [Li-2025-nips] [lihuan-rmsprop-2024]. Smoothness, unbiased gradient estimation, and bounded variance are assumed. The key innovations include:

Allowance for arbitrary orthogonal projections, proved using conditional independence and expectations;
Use of unitarily invariant norms (Frobenius and nuclear) for norm bounds, avoiding dependence on specific projection matrices;
Explicit propagation of historical momentum and preconditioner estimates into the convergence recursion, adapted for the matrix setting;
Tight estimation of moments and cross-iteration effects using technical lemmas.

Additionally, the proof highlights that the nuclear norm, being unitarily invariant, eliminates the influence of the orthogonal projections on the left-hand side of the convergence bound, while the entrywise $\ell_1$ norm retains that dependence.
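The invariance argument can be checked numerically. The sketch below rotates a random matrix `A` (a stand-in for a gradient; all names are illustrative) by random orthogonal `P` and `Q` and compares norms before and after. It also checks the standard inequality $\|A\|_* \le \|A\|_1$, which follows from writing $A$ as a sum of rank-one matrices $a_{ij} e_i e_j^T$, each with nuclear norm $|a_{ij}|$, and applying the triangle inequality. Combined with invariance, this gives $\|\nabla f\|_* = \|P^T \nabla f Q\|_* \le \|P^T \nabla f Q\|_1$, which suggests one way an entrywise $\ell_1$ bound on the projected gradient could carry over to a nuclear norm bound on the unprojected gradient.

```python
import numpy as np

rng = np.random.default_rng(2)
m, n = 6, 4
A = rng.standard_normal((m, n))                    # stand-in for a gradient matrix
P, _ = np.linalg.qr(rng.standard_normal((m, m)))   # random orthogonal m x m
Q, _ = np.linalg.qr(rng.standard_normal((n, n)))   # random orthogonal n x n
B = P.T @ A @ Q                                    # projected matrix

nuc_A = np.linalg.norm(A, "nuc")   # sum of singular values
nuc_B = np.linalg.norm(B, "nuc")
l1_A = np.abs(A).sum()             # entrywise l1 norm
l1_B = np.abs(B).sum()

print("nuclear norm:", nuc_A, "->", nuc_B)   # equal up to rounding
print("entrywise l1:", l1_A, "->", l1_B)     # generally different
```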

Practical and Theoretical Implications

From a practical standpoint, the results provide rigorous guarantees for a broad class of matrix-based optimizers with flexible projection strategies, including SOAP, Splus, and ARO. These guarantees are useful for practitioners calibrating optimizer hyperparameters, especially in settings where weight matrices grow large and classical vector optimizers may become inefficient or unstable.

Theoretically, the work establishes that matrix-adaptive projection, while broadening optimizer capabilities, incurs speed limitations relative to one-sided and two-sided vector preconditioners as in Shampoo. It clarifies the optimal scaling regimes for stochastic nonconvex training in terms of matrix dimensions and gradient variance, and sets a benchmark for future algorithmic innovations seeking to bridge the gap between flexibility and convergence speed.

In the broader context of deep learning optimization, these findings may inform design decisions in large-scale architectures and motivate new projection schemes or hybrid optimizer strategies.

Future Directions

Potential future developments include:

Extension of the analysis to adaptive projection matrices learned from richer histories or higher-order information;
Investigation into the empirical performance gap in practical neural network pre-training scenarios, especially for extremely large parameter matrices;
Development of new matrix optimizers that approach the theoretical lower bound rates while retaining projection-generalized flexibility;
Exploration of projection strategies that exploit block-structure, sparsity, or network architecture information for further gains.

Conclusion

This analysis closes a substantial gap in the theoretical understanding of matrix-based optimizers by establishing formal convergence rates for SOAP and its generalizations with arbitrary orthogonal projections. The results provide a unified framework for projection-based matrix optimization methods, delineate precise trade-offs in convergence speed, and offer a foundation for future algorithmic and theoretical advancements in large-scale deep learning optimization (2604.21616).
