SOAP: Second-Order Optimization with Adam Eigenbasis

Updated 1 April 2026

The paper presents SOAP, which integrates Shampoo's curvature modeling with Adam's adaptivity in the eigenbasis, offering efficient second-order optimization.
SOAP employs a two-stage procedure combining Kronecker-factored preconditioning with eigen-decomposition and adaptive moment estimation, ensuring robust convergence.
Empirical studies show SOAP reduces iterations and wall-clock time compared to AdamW and Shampoo in tasks like language modeling and image compression.

Shampoo with Adam in the Preconditioner’s Eigenbasis (SOAP) is a second-order optimization algorithm for deep neural network training that blends Kronecker-factored curvature modeling, as in Shampoo, with Adam-style adaptivity by operating the adaptive moment logic directly in the second-order preconditioner’s eigenbasis. SOAP has been developed to exploit richer curvature information than Adam while maintaining computational efficiency and robustness across large-scale neural network applications.

1. Algorithmic Structure and Update Rule

SOAP operates through a two-stage procedure per parameter block. For a layer parameterized by weight matrix $W \in \mathbb{R}^{m \times n}$ with loss gradient $G = \nabla_W \ell$, the “vectorized” form $w \in \mathbb{R}^{mn}$ and gradient $g = \text{vec}(G)$ are preconditioned as follows:

Preconditioner Construction: SOAP tracks Kronecker-factored second-moment estimators

$L \leftarrow \beta_2 L + (1-\beta_2) GG^T,\qquad R \leftarrow \beta_2 R + (1-\beta_2) G^T G$

where $L \in \mathbb{R}^{m \times m}$ and $R \in \mathbb{R}^{n \times n}$ accumulate gradient covariances along rows and columns, respectively.
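As a minimal sketch of the factor update above, assuming NumPy and random matrices standing in for real gradients (the function name `update_factors` and the value `beta2=0.95` are illustrative choices, not taken from the cited papers):

```python
import numpy as np

def update_factors(L, R, G, beta2=0.95):
    """One exponential-moving-average step for the Kronecker factors."""
    L = beta2 * L + (1.0 - beta2) * (G @ G.T)  # (m, m): row-space statistics
    R = beta2 * R + (1.0 - beta2) * (G.T @ G)  # (n, n): column-space statistics
    return L, R

rng = np.random.default_rng(0)
m, n = 4, 3
L, R = np.zeros((m, m)), np.zeros((n, n))
for _ in range(5):
    G = rng.standard_normal((m, n))  # stand-in for a real layer gradient
    L, R = update_factors(L, R, G)

# Both factors are symmetric and share the same trace, since tr(G G^T) = tr(G^T G).
same_trace = np.isclose(np.trace(L), np.trace(R))
```

The shared trace is a quick sanity check that both factors were accumulated from the same gradients.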

Eigenbasis Rotation: The matrices are (infrequently) eigendecomposed

$L = Q_L \Lambda_L Q_L^T,\qquad R = Q_R \Lambda_R Q_R^T$

providing orthogonal eigenbases $Q_L, Q_R$ and diagonal eigenvalue matrices $\Lambda_L, \Lambda_R$.
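The decomposition can be sketched with `numpy.linalg.eigh`, which is intended for symmetric matrices and returns eigenvalues in ascending order with orthonormal eigenvectors as columns. The matrices below are synthetic stand-ins for the accumulated factors:

```python
import numpy as np

rng = np.random.default_rng(1)
G = rng.standard_normal((4, 3))
L = G @ G.T  # symmetric PSD stand-in for the accumulated left factor
R = G.T @ G  # symmetric PSD stand-in for the accumulated right factor

lam_L, Q_L = np.linalg.eigh(L)
lam_R, Q_R = np.linalg.eigh(R)

# L = Q_L diag(lam_L) Q_L^T, and the eigenvector matrices are orthogonal.
reconstructs = np.allclose(Q_L @ np.diag(lam_L) @ Q_L.T, L)
orthogonal = np.allclose(Q_L.T @ Q_L, np.eye(4)) and np.allclose(Q_R.T @ Q_R, np.eye(3))
```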

Gradient Projection and Adam Moments: The gradient is rotated into the preconditioner’s eigenbasis,

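$G' = Q_L^T G Q_R$

vectorized as $g' = \text{vec}(G')$. Adam-style moment estimates are maintained in this basis:

$M \leftarrow \beta_1 M + (1-\beta_1) G',\qquad V \leftarrow \beta_2 V + (1-\beta_2)\, G' \odot G'$

with bias correction as in conventional Adam.

Adaptive Preconditioned Step: The update in the eigenbasis is

$N' = \hat{M} / (\sqrt{\hat{V}} + \epsilon)$ (elementwise, with $\hat{M}, \hat{V}$ the bias-corrected moments)

and the adaptive step is rotated back to the parameter space:

$N = Q_L N' Q_R^T$

so the parameter update is

$W \leftarrow W - \eta N$

This approach can be succinctly written in vectorized form as

$w \leftarrow w - \eta\, Q\, \frac{\hat{m}}{\sqrt{\hat{v}} + \epsilon}$

where $Q = Q_R \otimes Q_L$ denotes the Kronecker product eigenbasis (for column-stacking $\text{vec}$), and $\hat{m}, \hat{v}$ are the bias-corrected Adam moments of $g' = Q^T g$.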
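The link between the matrix form and the vectorized form rests on the identity $\text{vec}(AXB) = (B^T \otimes A)\,\text{vec}(X)$ for column-stacking $\text{vec}$. The following sketch checks it numerically; it assumes the column-stacking convention, under which the combined basis is $Q_R \otimes Q_L$ (a row-stacking convention would swap the factors), and uses random orthogonal matrices in place of real eigenbases:

```python
import numpy as np

rng = np.random.default_rng(2)
m, n = 4, 3
G = rng.standard_normal((m, n))
# Any orthogonal matrices satisfy the identity; QR of random matrices supplies them here.
Q_L, _ = np.linalg.qr(rng.standard_normal((m, m)))
Q_R, _ = np.linalg.qr(rng.standard_normal((n, n)))

def vec(A):
    """Column-stacking vectorization."""
    return A.reshape(-1, order="F")

G_rot = Q_L.T @ G @ Q_R          # rotation in matrix form
Q = np.kron(Q_R, Q_L)            # (mn, mn) Kronecker-product basis
identity_holds = np.allclose(vec(G_rot), Q.T @ vec(G))

# Rotating back with the same bases recovers the original matrix.
back = Q_L @ G_rot @ Q_R.T
```

In practice only the two small matrix products are computed; the $mn \times mn$ matrix `Q` is formed here purely to verify the identity.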

SOAP introduces an additional hyperparameter, the preconditioning frequency $f$, determining how often the eigenbases $Q_L, Q_R$ are recomputed (once every $f$ steps). Between recomputations, Adam moments are updated in the most recent basis (Vyas et al., 2024, Eschenhagen et al., 4 Jun 2025, Lu et al., 26 Sep 2025, Zhang et al., 28 Jan 2026).

2. Theoretical Foundations: Whitening and Curvature

SOAP is motivated by a whitening perspective. The ideal Newton step is

$\Delta w = -\eta\, H^{-1/2} g$

for the whitening matrix $H = \mathbb{E}[g g^T]$. For computationally tractable approximations:

Adam is diagonal: $H \approx \text{diag}(\mathbb{E}[g \odot g])$
Shampoo uses the Kronecker-product structure: $H \approx H_R \otimes H_L$ with $H_L \in \mathbb{R}^{m \times m}$, $H_R \in \mathbb{R}^{n \times n}$

Shampoo effectively performs a Kronecker product whitening, utilizing Kronecker-factored approximations $H_L$ and $H_R$ such that

$H^{-1/2} g \approx \text{vec}(H_L^{-1/2}\, G\, H_R^{-1/2})$

This is optimal in Frobenius norm for Kronecker approximations (Lu et al., 26 Sep 2025, Eschenhagen et al., 4 Jun 2025).

SOAP further rotates the gradient to diagonalize the Kronecker preconditioner (the eigenbasis), then applies Adam’s diagonal scaling in that rotated space. Theoretical results establish that, under the exact Kronecker structure assumption, SOAP’s adaptive Adam step in the eigenbasis and Shampoo’s fixed-magnitude Kronecker step become identical, as proven in Theorem 1 of (Lu et al., 26 Sep 2025).
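The two linear-algebra facts behind this picture can be checked numerically. The sketch below assumes an exactly Kronecker-structured whitening matrix built from synthetic positive-definite factors (the names `H_L`, `H_R` and the helper `inv_sqrt` are illustrative). It shows that full whitening then reduces to two small matrix products, and that the Kronecker product of the factor eigenbases diagonalizes the whitening matrix. It illustrates the setting of the equivalence result and does not reproduce its proof:

```python
import numpy as np

def inv_sqrt(S):
    """Inverse square root of a symmetric positive-definite matrix."""
    lam, Q = np.linalg.eigh(S)
    return Q @ np.diag(lam ** -0.5) @ Q.T

rng = np.random.default_rng(3)
m, n = 3, 2
A = rng.standard_normal((m, m))
B = rng.standard_normal((n, n))
H_L = A @ A.T + np.eye(m)        # synthetic SPD left factor
H_R = B @ B.T + np.eye(n)        # synthetic SPD right factor
H = np.kron(H_R, H_L)            # exact Kronecker structure (assumption)

G = rng.standard_normal((m, n))
g = G.reshape(-1, order="F")     # column-stacking vec

# Full whitening versus the factored form.
full = inv_sqrt(H) @ g
factored = (inv_sqrt(H_L) @ G @ inv_sqrt(H_R)).reshape(-1, order="F")
whitening_matches = np.allclose(full, factored)

# The Kronecker product of the factor eigenbases diagonalizes H.
lam_L, Q_L = np.linalg.eigh(H_L)
lam_R, Q_R = np.linalg.eigh(H_R)
Q = np.kron(Q_R, Q_L)
diagonalized = np.allclose(Q.T @ H @ Q, np.diag(np.kron(lam_R, lam_L)))
```

Because the rotated whitening matrix is diagonal in this idealized case, a purely elementwise (Adam-style) scaling in the rotated space can represent it exactly.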

3. Pseudocode and Computational Properties

SOAP maintains the computational structure of Shampoo but incorporates additional matrix multiplications for basis rotations and Adam moment updates in the rotated space. Typical per-layer pseudocode (excluding bias-correction and 1D handling):

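For each step $t$ with gradient $G$:
$L \leftarrow \beta_2 L + (1-\beta_2) GG^T,\qquad R \leftarrow \beta_2 R + (1-\beta_2) G^T G$
If $t \bmod f = 0$: recompute $Q_L, Q_R$ as the eigenvectors of $L, R$
$G' \leftarrow Q_L^T G Q_R$
$M \leftarrow \beta_1 M + (1-\beta_1) G',\qquad V \leftarrow \beta_2 V + (1-\beta_2)\, G' \odot G'$
$W \leftarrow W - \eta\, Q_L \left( M / (\sqrt{V} + \epsilon) \right) Q_R^T$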
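A minimal runnable NumPy sketch of the per-layer procedure described in Section 1 follows, under simplifying assumptions: no bias correction, no weight decay, and moments kept in the most recently computed basis without re-rotation when the basis changes. The function name `soap_step`, the state dictionary, the hyperparameter values, and the toy least-squares problem are all illustrative and not taken from the cited papers:

```python
import numpy as np

def soap_step(W, G, state, lr=0.02, beta1=0.9, beta2=0.95, eps=1e-8, freq=10):
    """One simplified SOAP-style step for a 2D weight matrix W with gradient G."""
    m, n = W.shape
    if not state:  # lazy initialization on the first call
        state.update(t=0, L=np.zeros((m, m)), R=np.zeros((n, n)),
                     Q_L=np.eye(m), Q_R=np.eye(n),
                     M=np.zeros((m, n)), V=np.zeros((m, n)))
    # Kronecker-factored second-moment estimates.
    state["L"] = beta2 * state["L"] + (1 - beta2) * (G @ G.T)
    state["R"] = beta2 * state["R"] + (1 - beta2) * (G.T @ G)
    # Recompute the eigenbases only every `freq` steps.
    if state["t"] % freq == 0:
        _, state["Q_L"] = np.linalg.eigh(state["L"])
        _, state["Q_R"] = np.linalg.eigh(state["R"])
    Q_L, Q_R = state["Q_L"], state["Q_R"]
    # Rotate the gradient and run Adam-style moments in the rotated space.
    G_rot = Q_L.T @ G @ Q_R
    state["M"] = beta1 * state["M"] + (1 - beta1) * G_rot
    state["V"] = beta2 * state["V"] + (1 - beta2) * G_rot ** 2
    N_rot = state["M"] / (np.sqrt(state["V"]) + eps)
    state["t"] += 1
    # Rotate the step back to parameter space and apply it.
    return W - lr * (Q_L @ N_rot @ Q_R.T)

# Toy problem: minimize 0.5 * ||A W - C||_F^2 over W.
rng = np.random.default_rng(0)
A = np.diag([1.0, 2.0, 3.0, 0.5])
C = rng.standard_normal((4, 3))
loss = lambda W: 0.5 * np.sum((A @ W - C) ** 2)
grad = lambda W: A.T @ (A @ W - C)

W, state = np.zeros((4, 3)), {}
initial_loss = loss(W)
for _ in range(500):
    W = soap_step(W, grad(W), state)
final_loss = loss(W)
```

Keeping all optimizer state in a plain dictionary makes the refresh schedule easy to inspect; a production implementation would also handle 1D parameters and bias correction, which this sketch omits.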

The dominant cost per preconditioner update is $O(m^3 + n^3)$ for the eigendecompositions; when the preconditioning frequency $f$ is large, this amortizes to $O((m^3 + n^3)/f)$ per step. Per-step matrix multiplications for projection and reconstruction cost $O(m^2 n + m n^2)$. Overall, for typical transformer layers on current GPUs, training with a moderate $f$ incurs a limited throughput drop compared to AdamW (Vyas et al., 2024). Extra storage for the preconditioner factors and momenta increases per-layer memory overhead only moderately.
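The amortization argument can be made concrete with a rough operation count. The sketch below is an order-of-magnitude estimate only: it counts a multiply-add as two FLOPs, and the constant `eig_const` in front of the cubic eigendecomposition cost is a hypothetical placeholder, since the true constant depends on the routine and hardware:

```python
def soap_extra_flops_per_step(m, n, freq, eig_const=10.0):
    """Rough extra FLOPs per step for one m x n layer, relative to a diagonal optimizer."""
    matmul = m * m * n + m * n * n
    eig = eig_const * (m ** 3 + n ** 3) / freq  # eigendecompositions, amortized over freq steps
    factors = 2 * matmul                        # G G^T and G^T G
    rotations = 2 * 2 * matmul                  # project and rotate back, two matmuls each
    return eig + factors + rotations

# Example: a square 1024 x 1024 layer at two refresh frequencies.
every_step = soap_extra_flops_per_step(1024, 1024, freq=1)
every_ten = soap_extra_flops_per_step(1024, 1024, freq=10)
```

As `freq` grows, the cubic term shrinks and the per-step cost approaches that of the fixed matrix multiplications.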

4. Empirical Behavior and Practical Implications

Empirical studies show SOAP matches or outperforms Adam and Shampoo in convergence speed and/or wall-clock efficiency across large-scale language modeling and learned image compression:

In large-batch language modeling (transformers with 360 M–660 M parameters, batch 2M tokens), SOAP reduces the number of iterations by over 40% and wall-clock time by over 35% compared to AdamW, and by ∼20% compared to Shampoo (Vyas et al., 2024).
For learned image compression, SOAP yields 65–75% fewer steps and 51–65% less wall-clock time than Adam, with 2–4% BD-Rate improvements after convergence (Zhang et al., 28 Jan 2026).
On vision transformers and graph workloads, SOAP (known as EShampoo in (Eschenhagen et al., 4 Jun 2025)) attains or exceeds Shampoo with Adam-grafting in both iteration and total time to target accuracy.

Across diverse tasks, as the preconditioning frequency $f$ increases (infrequent preconditioner updates), SOAP’s performance degrades much more slowly than that of naive Shampoo. For example, at large $f$ on language modeling, SOAP still outperforms AdamW by ∼30%, whereas Shampoo loses its advantage (Vyas et al., 2024).

Final validation or loss plateaus are often closely matched between SOAP, Shampoo, and Adam. When Shampoo’s raw Kronecker step is too aggressive and plateaus at higher loss, SOAP frequently recovers Adam-like final loss while maintaining Shampoo’s rapid convergence (Lu et al., 26 Sep 2025).

5. Practical Guidelines and Hyperparameter Selection

SOAP requires no learning rate grafting or unusual schedule tuning. Hyperparameters:

Learning rate $\eta$ and Adam’s $\beta_1, \beta_2, \epsilon$ are chosen as in AdamW, including the AdamW settings commonly used for large batches.
Preconditioning frequency $f$: $f = 1$ is equivalent to Shampoo; a moderate $f$ is typical for large models; $f$ can be increased to reduce overhead at some accuracy loss.
Trace normalization and damping as in Shampoo to stabilize scale.
1D parameters (biases, LayerNorm) are preconditioned with vanilla Adam.

Integration is straightforward: a single optimizer swap suffices in existing codebases (Vyas et al., 2024, Zhang et al., 28 Jan 2026). Eigenbasis updates can be accelerated using power-iteration or warm-started QR routines, and the preconditioner matrices can be stored in reduced precision (float16 or bfloat16) (Vyas et al., 2024, Eschenhagen et al., 4 Jun 2025).
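A warm-started QR refresh can be sketched as one power-iteration step applied to all columns of the previous basis, followed by QR re-orthonormalization. The sketch below assumes the columns of the previous basis are ordered by descending eigenvalue, which this kind of iteration relies on. The function names and the synthetic drifting factor are illustrative, and the loop of repeated refreshes is only there to show convergence (an optimizer would typically apply a single refresh per update):

```python
import numpy as np

def refresh_eigenbasis(P, Q_prev):
    """One power-iteration step on all columns, re-orthonormalized by QR."""
    Q_new, _ = np.linalg.qr(P @ Q_prev)
    return Q_new

def offdiag_energy(P, Q):
    """Frobenius norm of the off-diagonal part of Q^T P Q (zero for an exact eigenbasis)."""
    D = Q.T @ P @ Q
    return np.linalg.norm(D - np.diag(np.diag(D)))

rng = np.random.default_rng(4)
U, _ = np.linalg.qr(rng.standard_normal((4, 4)))
P_old = U @ np.diag([8.0, 4.0, 2.0, 1.0]) @ U.T  # U holds eigenvectors in descending order
E = 0.05 * rng.standard_normal((4, 4))
P_new = P_old + 0.5 * (E + E.T)                  # slightly drifted symmetric factor

Q = U.copy()                                     # warm start from the stale basis
before = offdiag_energy(P_new, Q)
for _ in range(100):
    Q = refresh_eigenbasis(P_new, Q)
after = offdiag_energy(P_new, Q)
```

Each refresh costs one matrix product and one QR factorization, avoiding a full eigendecomposition from scratch.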

SOAP also supports adaptive scheduling of eigenbasis updates via a tolerance on the off-diagonal energy in the current basis, further reducing unnecessary preconditioner computations without increasing error (Eschenhagen et al., 4 Jun 2025).
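One way such a tolerance test could look is sketched below. It measures the fraction of the factor's Frobenius norm that lies off the diagonal after rotating into the current basis, and requests a refresh when that fraction exceeds a threshold. The exact criterion used in the cited work may differ; the function names and the value `tol=0.01` are hypothetical:

```python
import numpy as np

def offdiag_ratio(P, Q):
    """Relative off-diagonal energy of P expressed in the basis Q."""
    D = Q.T @ P @ Q
    off = D - np.diag(np.diag(D))
    return np.linalg.norm(off) / max(np.linalg.norm(D), 1e-30)

def needs_refresh(P, Q, tol=0.01):
    """True when the current basis no longer approximately diagonalizes P."""
    return bool(offdiag_ratio(P, Q) > tol)

P = np.array([[2.0, 1.0], [1.0, 2.0]])
_, Q_exact = np.linalg.eigh(P)

stale = needs_refresh(P, np.eye(2))   # identity basis leaves off-diagonal mass
fresh = needs_refresh(P, Q_exact)     # exact eigenbasis diagonalizes P
```

A check like this costs two matrix products per factor, which is cheap relative to the decomposition it can skip.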

6. Theoretical Equivalence and Gradient Alignment

SOAP is theoretically equivalent to running Adam (or Adafactor) in the rotated basis defined by the Shampoo preconditioner for layerwise gradients (Lu et al., 26 Sep 2025, Vyas et al., 2024). Under idealized exact Kronecker structure, it is fully equivalent to Shampoo, since it performs elementwise adaptation in the approximately whitened basis. The per-step update magnitude in SOAP is provably bounded between the maximal and minimal eigenvalue scalings from full-matrix Adam, eliminating the shrinkage of the update magnitude observed in Shampoo (Eschenhagen et al., 4 Jun 2025).

From the gradient–whitening viewpoint, SOAP achieves two-stage approximate whitening: blockwise rotation (Shampoo), then diagonal whitening (Adam) in the rotated space. This construction allows SOAP to resolve intra-step and inter-step gradient conflicts, aligning rate and distortion gradients better than diagonal optimizers (Zhang et al., 28 Jan 2026). The result is more regular update directions, stabilizing deep neural network training and benefiting downstream applications such as quantization in compression models (outlier suppression by >50%) (Zhang et al., 28 Jan 2026).

7. Extensions, Limitations, and Empirical Robustness

SOAP’s main cost is preconditioner eigendecomposition; for large layer dimensions $m, n$, this can dominate per-step compute. However, with moderate $f$, the additional wall-clock time over AdamW is limited and continues to decrease as batch sizes increase. Adaptive criteria for basis updates can reduce this further (Eschenhagen et al., 4 Jun 2025).

Empirical evidence confirms robustness of SOAP to a wide range of preconditioning frequencies $f$, whereas Shampoo degrades sharply with infrequent updates. Adaptive eigenbasis updates allow for accurate step computation while reducing the frequency of expensive decompositions. SOAP also eliminates the need for learning rate grafting and other heuristics introduced to compensate for Shampoo’s scaling errors, directly correcting scaling and staleness in the eigenvalues themselves.

SOAP is compatible with distributed training, and the preconditioner computations can be amortized over devices. The optimizer is applicable to a range of architectures, including transformers and convolutional nets, and tasks such as language modeling, vision, and compression, with no task-specific modifications required.

Key References:

"SOAP: Improving and Stabilizing Shampoo using Adam" (Vyas et al., 2024)
"Understanding SOAP from the Perspective of Gradient Whitening" (Lu et al., 26 Sep 2025)
"Purifying Shampoo: Investigating Shampoo's Heuristics by Decomposing its Preconditioner" (Eschenhagen et al., 4 Jun 2025)
"Leveraging Second-Order Curvature for Efficient Learned Image Compression: Theory and Empirical Evidence" (Zhang et al., 28 Jan 2026)
