
SOAP: Second-Order Optimization with Adam Eigenbasis

Updated 1 April 2026
  • The paper presents SOAP, which integrates Shampoo's curvature modeling with Adam's adaptivity in the eigenbasis, offering efficient second-order optimization.
  • SOAP employs a two-stage procedure combining Kronecker-factored preconditioning with eigen-decomposition and adaptive moment estimation, ensuring robust convergence.
  • Empirical studies show SOAP reduces iterations and wall-clock time compared to AdamW and Shampoo in tasks like language modeling and image compression.

Shampoo with Adam in the Preconditioner’s Eigenbasis (SOAP) is a second-order optimization algorithm for deep neural network training that blends Kronecker-factored curvature modeling, as in Shampoo, with Adam-style adaptivity by operating the adaptive moment logic directly in the second-order preconditioner’s eigenbasis. SOAP has been developed to exploit richer curvature information than Adam while maintaining computational efficiency and robustness across large-scale neural network applications.

1. Algorithmic Structure and Update Rule

SOAP operates through a two-stage procedure per parameter block. For a layer parameterized by weight matrix $W \in \mathbb{R}^{m \times n}$ with loss gradient $G = \nabla_W \ell$, the “vectorized” form $w \in \mathbb{R}^{mn}$ and gradient $g = \text{vec}(G)$ are preconditioned as follows:

  • Preconditioner Construction: SOAP tracks Kronecker-factored second-moment estimators

$$L \leftarrow \beta_2 L + (1-\beta_2) GG^T,\qquad R \leftarrow \beta_2 R + (1-\beta_2) G^T G$$

where $L \in \mathbb{R}^{m \times m}$ and $R \in \mathbb{R}^{n \times n}$ accumulate gradient covariances along rows and columns, respectively.

  • Eigenbasis Rotation: The matrices are (infrequently) eigendecomposed

$$L = Q_L \Lambda_L Q_L^T,\qquad R = Q_R \Lambda_R Q_R^T$$

providing orthogonal eigenbases $Q_L, Q_R$ and diagonal eigenvalue matrices $\Lambda_L, \Lambda_R$.

  • Gradient Projection and Adam Moments: The gradient is rotated into the preconditioner’s eigenbasis,

$$G' = Q_L^T G Q_R$$

vectorized as $g' = \text{vec}(G')$. Adam-style moment estimates are maintained in this basis:

$$M' \leftarrow \beta_1 M' + (1-\beta_1)\, G',\qquad V' \leftarrow \beta_2 V' + (1-\beta_2)\, G' \odot G'$$

with bias correction as in conventional Adam.

  • Adaptive Preconditioned Step: The update in the eigenbasis is

$$N' = \frac{M'}{\sqrt{V'} + \epsilon}$$

(with elementwise division and square root), and the adaptive step is rotated back to the parameter space:

$$N = Q_L N' Q_R^T$$

so the parameter update is

$$W \leftarrow W - \eta N$$

This approach can be succinctly written in vectorized form as

$$w \leftarrow w - \eta\, Q\, \frac{m'}{\sqrt{v'} + \epsilon},\qquad m' = \text{vec}(M'),\quad v' = \text{vec}(V')$$

where $Q = Q_L \otimes Q_R$ denotes the Kronecker product eigenbasis (with $\text{vec}$ stacking rows, so that $g' = Q^T g$).
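The relation between the matrix rotation and the Kronecker product eigenbasis can be checked numerically. The sketch below assumes a row-major `vec` (NumPy's default `reshape`) and the ordering $Q = Q_L \otimes Q_R$; with a column-major `vec` the factors would swap. The helper `random_orthogonal` is a hypothetical name introduced only for this illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 3, 4

def random_orthogonal(k, rng):
    # Orthogonal matrix from the QR factorization of a Gaussian matrix.
    q, _ = np.linalg.qr(rng.standard_normal((k, k)))
    return q

Q_L = random_orthogonal(m, rng)   # stand-in for the eigenbasis of L
Q_R = random_orthogonal(n, rng)   # stand-in for the eigenbasis of R
G = rng.standard_normal((m, n))   # stand-in gradient

# Matrix form of the rotation into the eigenbasis.
G_rot = Q_L.T @ G @ Q_R

# Vectorized form: with row-major vec, vec(Q_L^T G Q_R) = (Q_L kron Q_R)^T vec(G).
Q = np.kron(Q_L, Q_R)
g_rot = Q.T @ G.reshape(-1)

max_abs_diff = np.max(np.abs(G_rot.reshape(-1) - g_rot))
print("max |matrix form - vectorized form| =", max_abs_diff)
```

In practice the $mn \times mn$ matrix $Q$ is never formed; the two small matrix products in the matrix form are what keep the rotation affordable.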

SOAP introduces an additional hyperparameter, the preconditioning frequency $f$, determining how often the eigenbases $Q_L, Q_R$ are recomputed. Between recomputations, Adam moments are updated in the most recent basis (Vyas et al., 2024, Eschenhagen et al., 4 Jun 2025, Lu et al., 26 Sep 2025, Zhang et al., 28 Jan 2026).
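The steps of this section can be put together as a minimal NumPy sketch for a single 2D weight. It is an illustration under stated assumptions, not the reference implementation: the names `init_state` and `soap_step` and the default values are hypothetical, bias correction is omitted, and the moments are simply kept in whatever basis is current (they are not re-projected when the basis is refreshed).

```python
import numpy as np

def init_state(m, n):
    # Hypothetical state container for one m x n weight matrix.
    return {"step": 0,
            "L": np.zeros((m, m)), "R": np.zeros((n, n)),
            "QL": np.eye(m), "QR": np.eye(n),
            "M": np.zeros((m, n)), "V": np.zeros((m, n))}

def soap_step(W, G, state, lr=1e-2, beta1=0.9, beta2=0.95, eps=1e-8, freq=10):
    """One SOAP-style update (sketch: no bias correction, no damping)."""
    state["step"] += 1
    # Kronecker-factored second-moment estimators.
    state["L"] = beta2 * state["L"] + (1 - beta2) * (G @ G.T)
    state["R"] = beta2 * state["R"] + (1 - beta2) * (G.T @ G)
    # Refresh the eigenbases on the first step and then every `freq` steps.
    if state["step"] == 1 or state["step"] % freq == 0:
        _, state["QL"] = np.linalg.eigh(state["L"])
        _, state["QR"] = np.linalg.eigh(state["R"])
    QL, QR = state["QL"], state["QR"]
    # Rotate the gradient and update Adam moments in the eigenbasis.
    G_rot = QL.T @ G @ QR
    state["M"] = beta1 * state["M"] + (1 - beta1) * G_rot
    state["V"] = beta2 * state["V"] + (1 - beta2) * G_rot**2
    # Elementwise adaptive step, rotated back to parameter space.
    N_rot = state["M"] / (np.sqrt(state["V"]) + eps)
    return W - lr * (QL @ N_rot @ QR.T)

rng = np.random.default_rng(0)
W0 = rng.standard_normal((3, 4))
G0 = rng.standard_normal((3, 4))
state = init_state(3, 4)
W1 = soap_step(W0, G0, state)
print("step norm:", np.linalg.norm(W1 - W0))
```

A production optimizer would add bias correction, damping, and a policy for handling the moments when the basis changes; the sketch only mirrors the update equations above.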

2. Theoretical Foundations: Whitening and Curvature

SOAP is motivated by a whitening perspective. The ideal Newton step is

$$\Delta w = -\eta\, H^{-1/2} g$$

for the whitening matrix $H^{-1/2}$, where $H = \mathbb{E}[g g^T]$ is the second-moment matrix of the vectorized gradient. For computationally tractable approximations:

  • Adam is diagonal: $H \approx \text{diag}(\mathbb{E}[g \odot g])$
  • Shampoo uses the Kronecker-product structure: $H$ is approximated by a Kronecker product of two small factors

Shampoo effectively performs a Kronecker product whitening, utilizing Kronecker-factored approximations built from $L$ and $R$ whose Kronecker product approximates $H$.

This is optimal in Frobenius norm for Kronecker approximations (Lu et al., 26 Sep 2025, Eschenhagen et al., 4 Jun 2025).

SOAP further rotates the gradient to diagonalize the Kronecker preconditioner (the eigenbasis), then applies Adam’s diagonal scaling in that rotated space. Theoretical results establish that, under the exact Kronecker structure assumption, SOAP’s adaptive Adam step in the eigenbasis and Shampoo’s fixed-magnitude Kronecker step become identical, as proven in Theorem 1 of (Lu et al., 26 Sep 2025).
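The idealized equivalence can be illustrated numerically. The sketch below is not a reproduction of the cited theorem; it assumes a toy model in which gradients are drawn as $G = A^{1/2} Z B^{1/2}$ with $Z$ having i.i.d. standard normal entries, so that the second-moment structure is exactly Kronecker and the exact elementwise second moment in the eigenbasis is the outer product of the eigenvalues of $A$ and $B$. The helpers `random_spd` and `sym_power` are hypothetical names.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 3, 4

def random_spd(k, rng):
    # Well-conditioned symmetric positive definite matrix.
    X = rng.standard_normal((k, k))
    return X @ X.T + k * np.eye(k)

def sym_power(S, p):
    # Matrix power of a symmetric positive definite matrix via eigh.
    lam, Q = np.linalg.eigh(S)
    return (Q * lam**p) @ Q.T

A, B = random_spd(m, rng), random_spd(n, rng)
Z = rng.standard_normal((m, n))
G = sym_power(A, 0.5) @ Z @ sym_power(B, 0.5)   # one gradient sample

# Under this model E[G G^T] = tr(B) A and E[G^T G] = tr(A) B,
# so the eigenbases of L and R are those of A and B.
lam_A, Q_L = np.linalg.eigh(A)
lam_B, Q_R = np.linalg.eigh(B)

# Adam-style step in the eigenbasis, using the exact second moment
# E[(Q_L^T G Q_R)_{ij}^2] = lam_A[i] * lam_B[j].
V_rot = np.outer(lam_A, lam_B)
step_eigen = Q_L @ ((Q_L.T @ G @ Q_R) / np.sqrt(V_rot)) @ Q_R.T

# Kronecker whitening of the same gradient.
step_whiten = sym_power(A, -0.5) @ G @ sym_power(B, -0.5)

gap = np.max(np.abs(step_eigen - step_whiten))
print("max difference:", gap)
```

Under the same toy model, $L^{-1/2} G R^{-1/2}$ with the exact factors yields the same matrix up to the scalar $1/\sqrt{\mathrm{tr}(A)\,\mathrm{tr}(B)}$, which is one way to see why the two steps coincide when the Kronecker structure is exact.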

3. Pseudocode and Computational Properties

SOAP maintains the computational structure of Shampoo but incorporates additional matrix multiplications for basis rotations and Adam moment updates in the rotated space. Typical per-layer pseudocode (excluding bias-correction and 1D handling):

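    every f steps: (Q_L, Q_R) ← eigenvectors of (L, R)
    G' ← Q_L^T G Q_R
    M' ← β1 M' + (1 − β1) G'
    V' ← β2 V' + (1 − β2) G' ⊙ G'
    W ← W − η Q_L (M' / (√V' + ε)) Q_R^T
    L ← β2 L + (1 − β2) G G^T
    R ← β2 R + (1 − β2) G^T G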

The dominant cost per preconditioner update is $O(m^3 + n^3)$ for eigendecomposition; when the preconditioning frequency $f$ is large, this amortizes to $O((m^3 + n^3)/f)$ per step. Per-step matrix multiplications for projection/reconstruction cost $O(m^2 n + m n^2)$. Overall, for typical transformer layers on current GPUs, training with a moderate preconditioning frequency incurs a limited throughput drop compared to AdamW (Vyas et al., 2024). Extra storage for the preconditioner factors and momenta increases per-layer memory overhead only moderately.
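The cost breakdown above can be turned into a rough back-of-the-envelope calculator. This is an order-of-magnitude sketch only: the function name `soap_extra_flops_per_step` is hypothetical, matrix products are counted at two FLOPs per multiply-add, and `eig_const` (the constant in front of the cubic eigendecomposition cost) is an assumed placeholder, not a measured value.

```python
def soap_extra_flops_per_step(m, n, freq, eig_const=10.0):
    """Rough extra FLOPs per step for one m x n layer (sketch, not a benchmark)."""
    # Factor updates: G G^T costs ~2 m^2 n and G^T G costs ~2 m n^2.
    factors = 2 * m * m * n + 2 * m * n * n
    # Rotation into the eigenbasis and back: two matmuls each way.
    rotations = 2 * (2 * m * m * n + 2 * m * n * n)
    # Eigendecompositions of L and R, amortized over `freq` steps.
    eigendecomposition = eig_const * (m**3 + n**3) / freq
    return {"factors": factors,
            "rotations": rotations,
            "eigendecomposition": eigendecomposition}

every_step = soap_extra_flops_per_step(1024, 4096, freq=1)
amortized = soap_extra_flops_per_step(1024, 4096, freq=10)
print(every_step)
print(amortized)
```

Only the eigendecomposition term shrinks as the preconditioning frequency grows; the factor updates and rotations are paid on every step regardless.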

4. Empirical Behavior and Practical Implications

Empirical studies show SOAP matches or outperforms Adam and Shampoo in convergence speed and/or wall-clock efficiency across large-scale language modeling and learned image compression:

  • In large-batch language modeling (transformers with 360 M–660 M parameters, batch 2M tokens), SOAP reduces the number of iterations by over 40% and wall-clock time by over 35% compared to AdamW, and by ∼20% compared to Shampoo (Vyas et al., 2024).
  • For learned image compression, SOAP yields 65–75% fewer steps and 51–65% less wall-clock time than Adam, with 2–4% BD-Rate improvements after convergence (Zhang et al., 28 Jan 2026).
  • On vision transformers and graph workloads, SOAP (referred to as EShampoo by Eschenhagen et al., 4 Jun 2025) matches or exceeds Shampoo with Adam-grafting in both iterations and total time to target accuracy.

Across diverse tasks, as the preconditioning frequency $f$ increases (infrequent preconditioner updates), SOAP’s performance degrades much more slowly than that of naive Shampoo. For example, even with infrequent eigenbasis updates on language modeling, SOAP still outperforms AdamW by ∼30%, whereas Shampoo loses its advantage (Vyas et al., 2024).

Final validation or loss plateaus are often closely matched between SOAP, Shampoo, and Adam. When Shampoo’s raw Kronecker step is too aggressive and plateaus at higher loss, SOAP frequently recovers Adam-like final loss while maintaining Shampoo’s rapid convergence (Lu et al., 26 Sep 2025).

5. Practical Guidelines and Hyperparameter Selection

SOAP requires no learning rate grafting or unusual schedule tuning. Hyperparameters:

  • Learning rate $\eta$ and Adam’s $\beta_1$, $\beta_2$, and $\epsilon$ are chosen as in AdamW.
  • Preconditioning frequency $f$: recomputing the eigenbasis at every step corresponds to Shampoo; a larger $f$ is typical for large models, and $f$ can be increased further to reduce overhead at some accuracy loss.
  • Trace normalization and damping as in Shampoo to stabilize scale.
  • 1D parameters (biases, LayerNorm) are preconditioned with vanilla Adam.

Integration is straightforward: a single optimizer swap suffices in existing codebases (Vyas et al., 2024, Zhang et al., 28 Jan 2026). Eigenbasis updates can be accelerated using power-iteration or warm-started QR routines, and the preconditioner matrices can be stored in reduced precision (float16 or bfloat16) (Vyas et al., 2024, Eschenhagen et al., 4 Jun 2025).

SOAP also supports adaptive scheduling of eigenbasis updates via a tolerance on the off-diagonal energy in the current basis, further reducing unnecessary preconditioner computations without increasing error (Eschenhagen et al., 4 Jun 2025).
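One plausible form of such a criterion is sketched below. The exact test used in the cited work may differ; here the assumption is that the basis is refreshed when the relative Frobenius norm of the off-diagonal part of $Q^T L Q$ exceeds a tolerance. The function names and the default `tol` are hypothetical.

```python
import numpy as np

def offdiag_energy(factor, Q):
    """Relative Frobenius energy of the off-diagonal part of Q^T factor Q."""
    rotated = Q.T @ factor @ Q
    off = rotated - np.diag(np.diag(rotated))
    return np.linalg.norm(off) / max(np.linalg.norm(rotated), 1e-30)

def maybe_refresh(factor, Q, tol=1e-2):
    """Recompute the eigenbasis only if the current one is too stale."""
    if offdiag_energy(factor, Q) > tol:
        _, Q_new = np.linalg.eigh(factor)
        return Q_new, True
    return Q, False

rng = np.random.default_rng(0)
X = rng.standard_normal((4, 4))
S = X @ X.T + np.eye(4)                  # stand-in for a factor L or R

Q_stale = np.eye(4)                      # basis that does not diagonalize S
Q_fresh, refreshed = maybe_refresh(S, Q_stale)
print("refreshed:", refreshed, "energy after:", offdiag_energy(S, Q_fresh))
```

When the factor has drifted little since the last decomposition, the check costs two small matrix products and skips the eigendecomposition entirely.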

6. Theoretical Equivalence and Gradient Alignment

SOAP is theoretically equivalent to running Adam (or Adafactor) in the rotated basis defined by the Shampoo preconditioner for layerwise gradients (Lu et al., 26 Sep 2025, Vyas et al., 2024). In the idealized Kronecker setting, it is fully equivalent to Shampoo, since it performs elementwise adaptation in the approximately whitened basis. The per-step update magnitude in SOAP is provably bounded between the maximal and minimal eigenvalue scalings from full-matrix Adam, eliminating the shrinkage of the update magnitude observed in Shampoo (Eschenhagen et al., 4 Jun 2025).

From the gradient–whitening viewpoint, SOAP achieves two-stage approximate whitening: blockwise rotation (Shampoo), then diagonal whitening (Adam) in the rotated space. This construction allows SOAP to resolve intra-step and inter-step gradient conflicts, aligning rate and distortion gradients better than diagonal optimizers (Zhang et al., 28 Jan 2026). The result is more regular update directions, stabilizing deep neural network training and benefiting downstream applications such as quantization in compression models (outlier suppression by >50%) (Zhang et al., 28 Jan 2026).

7. Extensions, Limitations, and Empirical Robustness

SOAP’s main cost is preconditioner eigendecomposition; for large layer dimensions $m, n$, this can dominate per-step compute. However, with a moderate preconditioning frequency $f$, the additional wall-clock time over AdamW is limited and continues to decrease as batch sizes increase. Adaptive criteria for basis updates can reduce this further (Eschenhagen et al., 4 Jun 2025).

Empirical evidence confirms robustness of SOAP to a wide range of preconditioning frequencies $f$, whereas Shampoo degrades sharply with infrequent updates. Adaptive eigenbasis updates allow for accurate step computation while reducing the frequency of expensive decompositions. SOAP also eliminates the need for learning rate grafting and other heuristics introduced to compensate for Shampoo’s scaling errors, directly correcting scaling and staleness in the eigenvalues themselves.

SOAP is compatible with distributed training, and the preconditioner computations can be amortized over devices. The optimizer is applicable to a range of architectures, including transformers and convolutional nets, and tasks such as language modeling, vision, and compression, with no task-specific modifications required.

