SOAP Second-Order Optimization
- The paper introduces SOAP as a second-order optimizer that integrates Shampoo's Kronecker-factored preconditioning with Adam’s adaptive moment scaling.
- It rotates gradients into the principal curvature eigenspace to achieve near-Newton updates, effectively reducing gradient conflicts and accelerating convergence.
- Empirical results demonstrate lower test error and faster convergence in PINNs, learned image compression, and large language model training compared with first-order methods.
Second-order optimization with SOAP (ShampoO with Adam in the Preconditioner’s eigenbasis) addresses a central bottleneck in deep learning: the inefficiency of standard first-order optimizers when optimizing composite or multi-objective loss landscapes with ill-conditioned or conflicting gradients. It achieves scalable, near-Newton optimization by blending Kronecker-factored preconditioning (from Shampoo) with per-coordinate adaptive moment methods (from Adam), executed in the slowly-varying principal curvature eigenspace of the parameter layer. SOAP demonstrates broad efficacy in physics-informed neural networks (PINNs), large-scale LLMs, learned image compression, and geometric variational problems, providing a general framework for gradient alignment and efficient second-order adaptation.
1. Theoretical Principles and Motivation
The SOAP optimizer builds on two strands of preconditioning:
- Adaptive diagonal methods (Adam, Adafactor), which rescale gradients according to elementwise second-moment estimates but lack the ability to rotate gradients to align with the true curvature structure.
- Matrix preconditioning (Shampoo), which approximates the Hessian (or Gauss–Newton) matrix via a Kronecker product of layerwise moment accumulations, producing richer curvature adaptation at the cost of $O(m^3 + n^3)$ eigendecompositions per layer for an $m \times n$ weight matrix.
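For orientation, the two preconditioning styles can be written side by side; this is the standard textbook form of each update, stated here as context rather than quoted from the cited papers.

```latex
% Adam: diagonal (per-coordinate) adaptive scaling of the gradient
v_t = \beta_2 v_{t-1} + (1-\beta_2)\, g_t^{\odot 2}, \qquad
\theta_{t+1} = \theta_t - \eta\, \frac{m_t}{\sqrt{v_t} + \epsilon}

% Shampoo: Kronecker-factored two-sided matrix preconditioning of the layer gradient G_t
L_t = \sum_{s \le t} G_s G_s^{\top}, \quad
R_t = \sum_{s \le t} G_s^{\top} G_s, \qquad
W_{t+1} = W_t - \eta\, L_t^{-1/4}\, G_t\, R_t^{-1/4}
```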
SOAP operates by periodically constructing the eigenbasis of each layer's Kronecker moments, transforming gradients into this basis, applying an Adam-like (coordinate-wise) update, and projecting back. Formally, for a weight matrix $W \in \mathbb{R}^{m \times n}$ with gradient $G_t$, SOAP maintains Kronecker-factored accumulators

$$L_t = \beta_2 L_{t-1} + (1-\beta_2)\, G_t G_t^\top, \qquad R_t = \beta_2 R_{t-1} + (1-\beta_2)\, G_t^\top G_t.$$

Then:
- Every $f$ steps, compute eigendecompositions $L_t = Q_L \Lambda_L Q_L^\top$ and $R_t = Q_R \Lambda_R Q_R^\top$.
- Rotate: $\tilde{G}_t = Q_L^\top G_t Q_R$.
- Adam adaptation in the rotated space: $\tilde{M}_t = \beta_1 \tilde{M}_{t-1} + (1-\beta_1)\tilde{G}_t$, $\tilde{V}_t = \beta_2 \tilde{V}_{t-1} + (1-\beta_2)\tilde{G}_t^{\odot 2}$, $\tilde{N}_t = \tilde{M}_t / (\sqrt{\tilde{V}_t} + \epsilon)$.
- Reverse rotation: $N_t = Q_L \tilde{N}_t Q_R^\top$.
- Update: $W \leftarrow W - \eta\, N_t$.
Under Kronecker-Gauss–Newton assumptions, this approximates a block-diagonal Newton step, effects second-order "gradient whitening," and retains Adam's stability and hyperparameter simplicity (Vyas et al., 2024, Lu et al., 26 Sep 2025).
2. Gradient Alignment and Second-Order Effects
Multi-objective training (as in PINNs or rate-distortion learning) introduces gradient conflicts, where loss term gradients are nearly orthogonal or even point in opposing directions (type I: magnitude conflict; type II: direction conflict). SOAP mediates these via implicit gradient alignment:
- Inter-step alignment: Newton-like updates keep consecutive update directions aligned (the alignment score tending toward 1), suppressing the "zig-zag" dynamics of first-order optimizers (Wang et al., 2 Feb 2025, Zhang et al., 28 Jan 2026).
- Intra-step alignment: At optima, the preconditioner co-aligns gradients from different loss terms, as quantified by an alignment score measuring the cosine similarity between the preconditioned per-loss gradients.
SOAP maintains alignment scores close to 1, in contrast to Adam or Shampoo, which often yield near-zero or negative scores across multiple tasks (Wang et al., 2 Feb 2025, Zhang et al., 28 Jan 2026).
This alignment critically enhances convergence rates and solution quality in scenarios where gradient conflict impedes first-order methods.
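As a minimal illustration of how such an alignment score can be monitored, the numpy sketch below computes the cosine similarity between two preconditioned per-loss gradients; the function and the Adam-style diagonal preconditioner are illustrative assumptions, not code from the cited papers.

```python
import numpy as np

def alignment_score(g_i, g_j, precondition):
    """Cosine similarity between two preconditioned loss-term gradients."""
    p_i, p_j = precondition(g_i), precondition(g_j)
    return float(p_i @ p_j / (np.linalg.norm(p_i) * np.linalg.norm(p_j) + 1e-12))

# Toy example: gradients of a PDE-residual loss and a boundary loss, with an
# Adam-style diagonal preconditioner built from a second-moment estimate v.
rng = np.random.default_rng(0)
g_res, g_bc = rng.normal(size=1000), rng.normal(size=1000)
v = 0.5 * (g_res**2 + g_bc**2)
adam_like = lambda g: g / (np.sqrt(v) + 1e-8)

print(alignment_score(g_res, g_bc, adam_like))   # near 0: nearly orthogonal gradients
```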
3. Algorithmic Realizations and Pseudocode
The SOAP algorithm for a single layer can be summarized as:
```python
import numpy as np
from numpy.linalg import eigh

# Hyperparameters (lr, beta1, beta2, f, eps, T) and a gradient(W) oracle are assumed given.
# State for one m x n layer: Kronecker accumulators and rotated Adam moments.
L, R = np.zeros((m, m)), np.zeros((n, n))
M_tilde, V_tilde = np.zeros((m, n)), np.zeros((m, n))

for t in range(T):
    G = gradient(W)                                # layer gradient, shape (m, n)
    L = beta2 * L + (1 - beta2) * G @ G.T          # left Kronecker factor
    R = beta2 * R + (1 - beta2) * G.T @ G          # right Kronecker factor
    if t % f == 0:                                 # refresh the eigenbasis every f steps
        _, Q_L = eigh(L)                           # eigh returns (eigenvalues, eigenvectors)
        _, Q_R = eigh(R)
    G_tilde = Q_L.T @ G @ Q_R                      # rotate gradient into the eigenbasis
    M_tilde = beta1 * M_tilde + (1 - beta1) * G_tilde
    V_tilde = beta2 * V_tilde + (1 - beta2) * G_tilde**2
    N_tilde = M_tilde / (np.sqrt(V_tilde) + eps)   # Adam step in the rotated space
    N = Q_L @ N_tilde @ Q_R.T                      # rotate back to parameter space
    W -= lr * N
```
Key points:
- The preconditioning frequency $f$ trades off curvature "freshness" against computational overhead. Performance degrades gracefully with infrequent updates, in contrast to Shampoo, where infrequent updates lead to sharper degradation (Vyas et al., 2024).
- Memory requirements are dominated by the $O(m^2 + n^2)$ Kronecker accumulators per layer, plus an amortized $O((m^3 + n^3)/f)$ per-iteration cost for the eigendecompositions.
- Variants (e.g., one-sided rotation for very large layers, low-precision accumulators, efficient QR-based eigensolvers) further ameliorate overhead (Vyas et al., 2024, Lu et al., 26 Sep 2025).
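A minimal sketch of the one-sided variant mentioned in the last point, assuming a wide layer where only the smaller (left) dimension is rotated and the remaining scaling is handled by Adam-style coordinate-wise statistics; the names and defaults are illustrative, not the reference implementation.

```python
import numpy as np
from numpy.linalg import eigh

def one_sided_soap_step(W, G, state, lr=3e-4, beta1=0.9, beta2=0.95, f=10, eps=1e-8):
    """One-sided SOAP-style step for a wide (m x n, m << n) layer:
    rotate only with the left m x m eigenbasis, apply Adam scaling per coordinate."""
    m, _ = G.shape
    state["L"] = beta2 * state.get("L", np.zeros((m, m))) + (1 - beta2) * G @ G.T
    if state.setdefault("t", 0) % f == 0:
        _, state["Q_L"] = eigh(state["L"])        # refresh the left eigenbasis only
    Q_L = state["Q_L"]
    G_tilde = Q_L.T @ G                            # one-sided rotation
    state["M"] = beta1 * state.get("M", 0.0) + (1 - beta1) * G_tilde
    state["V"] = beta2 * state.get("V", 0.0) + (1 - beta2) * G_tilde**2
    N_tilde = state["M"] / (np.sqrt(state["V"]) + eps)
    state["t"] += 1
    return W - lr * (Q_L @ N_tilde)                # rotate back and update
```

This halves the eigendecomposition cost from $O(m^3 + n^3)$ to $O(m^3)$ for layers where $n \gg m$, at the price of coarser curvature adaptation along the unrotated dimension.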
4. Empirical Performance Across Domains
SOAP demonstrates empirical advantages in multiple regimes:
- Physics-Informed Neural Networks (PINNs): Across 10 PDE benchmarks—including Kolmogorov turbulence (Re=10,000)—SOAP achieves 2× to 10× lower test error and faster wall-clock convergence versus Adam. For the Rayleigh–Taylor instability, SOAP achieved 0.52% error, a new benchmark (Wang et al., 2 Feb 2025).
- Learned Image Compression (LIC): On top models (ELIC, TCM, LALIC, DCAE), SOAP achieves Adam’s final R–D loss in 25–35% of the steps/wall-clock, with BD-Rate improvement of –2.1% to –3.7% and up to 95% reduction in activation kurtosis (outlier suppression), benefiting post-training quantization (Zhang et al., 28 Jan 2026).
- LLM Training: Compared to AdamW and Shampoo, SOAP reduces required steps by over 40% and wall-clock time by 35% in the large-batch regime, with ∼20% iteration reduction relative to Shampoo (Vyas et al., 2024). However, full block-diagonal Gauss–Newton preconditioning outperforms SOAP by an additional factor of 3–5× in iteration complexity (Abreu et al., 10 Oct 2025).
- Geometric Variational Optimization (Gait Design): SOAP as a "soap-bubble optimizer" exploits Lie-bracket curvature alongside Riemannian "surface tension," yielding efficient cycle shapes for kinematic systems such as Purcell’s swimmer (Ramasamy et al., 2016).
Empirical discrepancies between theory and practice are minor under the Kronecker structure assumption; SOAP and idealized Shampoo demonstrate nearly identical convergence on both vision and NLP workloads (Lu et al., 26 Sep 2025).
5. Connections to Gradient Whitening and Theoretical Equivalence
From the gradient-whitening perspective, all of these second-order methods approximate the whitening (Newton-type) preconditioner $C^{-1/2}$, where $C = \mathbb{E}[g g^\top]$ is the gradient covariance (Gauss–Newton or Fisher information). Within this view:
- Adam: Diagonal coordinate-wise whitening.
- Shampoo: Kronecker-factored block-wise preconditioning.
- SOAP: Rotated basis within the Kronecker eigenspace, applying Adam’s per-coordinate scaling—mathematically equivalent to Shampoo in the idealized, full-batch Kronecker setting (Lu et al., 26 Sep 2025).
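A brief derivation sketch of this equivalence under the idealized full-batch Kronecker assumption (notation as in Section 1); this is the standard argument in outline, not a verbatim reproduction of the cited proof.

```latex
% Kronecker assumption on the gradient covariance (up to normalization):
C = \mathbb{E}\!\left[\operatorname{vec}(G)\operatorname{vec}(G)^{\top}\right] = R \otimes L

% Idealized (power-1/2) Shampoo applies the whitening step
(R \otimes L)^{-1/2}\operatorname{vec}(G) = \operatorname{vec}\!\left(L^{-1/2}\, G\, R^{-1/2}\right)

% Writing L = Q_L \Lambda_L Q_L^{\top}, R = Q_R \Lambda_R Q_R^{\top},
% and \tilde{G} = Q_L^{\top} G Q_R:
L^{-1/2} G R^{-1/2} = Q_L\,\Lambda_L^{-1/2}\,\tilde{G}\,\Lambda_R^{-1/2}\,Q_R^{\top},
\qquad
\big(\Lambda_L^{-1/2}\tilde{G}\Lambda_R^{-1/2}\big)_{ij}
  = \frac{\tilde{G}_{ij}}{\sqrt{\lambda^{L}_{i}\lambda^{R}_{j}}}

% Since \mathbb{E}[\tilde{G}_{ij}^{2}] = \lambda^{L}_{i}\lambda^{R}_{j} under the
% Kronecker assumption, this is exactly Adam's per-coordinate 1/\sqrt{v} scaling
% applied in SOAP's rotated basis.
```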
This equivalence substantiates that SOAP need not—under Kronecker-structured curvature—outperform Shampoo in steady-state, but it can offer improved stability and robustness to infrequent preconditioner updates.
6. Limitations, Practical Considerations, and Future Directions
Major considerations include:
- Computational Overhead: SOAP incurs an $O(m^3 + n^3)$ eigendecomposition cost per layer, amortized over the update interval $f$. For extremely wide or tall layers, one-sided or block-diagonal variants are recommended (Vyas et al., 2024).
- Batch Size Sensitivity: At small batches (high stochasticity), increasing the preconditioner update frequency and the moment decay rate ($\beta_2$) yields more stable estimates.
- Memory Scaling: The $O(m^2 + n^2)$ per-layer accumulators pose challenges for massive networks; the use of low-precision accumulators is suggested.
- Gap to Exact GN: Compared to layerwise Gauss–Newton or full-matrix methods, SOAP leaves 20–70% of the possible speedup on the table. Incorporating low-rank spectral corrections, hybrid Krylov inner solves, and adaptive damping is suggested to approach oracle performance (Abreu et al., 10 Oct 2025).
- Robustness: Outlier suppression and improved quantization robustness under SOAP are observed in learned compression, a unique benefit compared to Adam (Zhang et al., 28 Jan 2026).
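As a small illustration of how the outlier-suppression effect can be checked in practice, the snippet below computes per-channel activation kurtosis (the statistic referenced in Section 4); the helper is a generic sketch, not code from the cited work.

```python
import numpy as np

def channel_kurtosis(activations):
    """Pearson kurtosis per channel; large values signal heavy-tailed activation outliers.

    activations : array of shape (num_samples, num_channels)
    """
    x = activations - activations.mean(axis=0, keepdims=True)
    m2 = (x**2).mean(axis=0)
    m4 = (x**4).mean(axis=0)
    return m4 / (m2**2 + 1e-12)

# Compare activation statistics of two checkpoints (e.g., Adam- vs SOAP-trained models).
acts = np.random.default_rng(0).standard_t(df=3, size=(4096, 256))  # heavy-tailed stand-in
print(channel_kurtosis(acts).mean())   # Gaussian activations would give values near 3
```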
Potential future extensions involve application to multitask learning, adversarial training, richer curvature approximation (e.g., K-FAC, structured low-rank), and auto-tuning of computational parameters (Wang et al., 2 Feb 2025, Abreu et al., 10 Oct 2025, Vyas et al., 2024).
7. Broader Significance and Open Problems
SOAP’s success in aligning conflicting gradients with scalable second-order adaptation establishes general principles for efficient optimization of multi-term loss functions, challenging domains with high curvature or anisotropy, and tasks where feature outliers impede deployability.
Open challenges include closing the gap to full-matrix methods, developing low-memory approximations that retain alignment, analyzing generalization under second-order preconditioning, and extending Kronecker-style second-order optimizers to structured (e.g., convolutional or sparse) weight matrices.
SOAP provides a critical benchmark and design pattern for practical second-order deep learning optimizers—combining robustness, computational feasibility, and gradient conflict resolution—anchored by Kronecker structure and adaptive statistics (Wang et al., 2 Feb 2025, Vyas et al., 2024, Abreu et al., 10 Oct 2025, Lu et al., 26 Sep 2025).