Shampoo: Structure-Aware Deep Learning Optimizer

Updated 3 July 2026

Shampoo is a structure-aware preconditioned stochastic gradient optimizer that approximates full-matrix second-order methods by leveraging the Kronecker-factored covariance structure.
It reduces both memory and computational overhead compared to full-matrix approaches while providing rigorous convergence analyses and practical efficiency on large-scale deep learning tasks.
Extensions like SPlus, SOAP, and EShampoo enhance stability and scalability through adaptive refresh strategies, block-wise preconditioning, and efficient eigenvalue computations.

Shampoo is a family of structure-aware preconditioned stochastic gradient optimizers designed to efficiently approximate full-matrix second-order methods for deep learning. By exploiting the matrix and tensor structure of neural network parameter blocks, Shampoo achieves substantially greater curvature adaptation than diagonal or element-wise methods (e.g., Adam), while maintaining much lower memory and computational requirements than full-matrix approaches. Its foundation rests on Kronecker-factored second-moment preconditioners, with rigorous convergence analyses, high empirical efficiency across large-scale deep learning tasks, and a rapidly expanding ecosystem of variants addressing stability, scaling, and hardware constraints (Gupta et al., 2018, Shi et al., 2023, Xie et al., 13 Mar 2025, Eschenhagen et al., 4 Jun 2025, Modoranu et al., 2 Feb 2026, Gratton et al., 19 Apr 2026).

1. Mathematical Formulation and Core Algorithm

Shampoo generalizes AdaGrad by constructing a Kronecker product approximation to the empirical covariance of parameter gradients, applied separately to each tensor-shaped parameter block. For a weight matrix $W\in\mathbb{R}^{m\times n}$ with gradient $G_t$, Shampoo maintains two symmetric positive semidefinite, exponentially weighted accumulators (positive definite once they are initialized or damped with a multiple of the identity): $L_t = \beta_2 L_{t-1} + (1-\beta_2) G_t G_t^\top \in\mathbb{R}^{m\times m},\quad R_t = \beta_2 R_{t-1} + (1-\beta_2) G_t^\top G_t \in\mathbb{R}^{n\times n}$, with $\beta_2\in[0,1)$ a smoothing parameter (Gupta et al., 2018, Eschenhagen et al., 4 Jun 2025).

The update computes the symmetric inverse $p$-th root (often $p=4$) of each factor and preconditions the gradient from both sides: $W_{t+1} = W_t - \eta\, L_t^{-1/4} G_t R_t^{-1/4}$. With row-major vectorization, this corresponds to applying $(L_t \otimes R_t)^{-1/4} = L_t^{-1/4} \otimes R_t^{-1/4}$ to the flattened gradient. This approximates the ideal full-matrix AdaGrad preconditioner (the inverse square root of the accumulated $mn\times mn$ gradient outer products) with only $O(m^2+n^2)$ memory and $O(m^3+n^3)$ time for the root computations, compared to $O(m^2n^2)$ and $O(m^3n^3)$ for full-matrix AdaGrad (Xie et al., 13 Mar 2025, Gratton et al., 19 Apr 2026).
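The update rule can be sketched in a few lines of NumPy. This is a minimal illustration, not a reference implementation: the helper names `inverse_pth_root` and `shampoo_step`, the damping value `eps`, and the identity-scaled initial accumulators are assumptions made for the example. The final lines check numerically that the two-sided matrix update coincides with applying $L_t^{-1/4}\otimes R_t^{-1/4}$ to the row-major flattened gradient.

```python
import numpy as np


def inverse_pth_root(mat, p, eps=1e-12):
    """Symmetric inverse p-th root of a PSD matrix via eigendecomposition."""
    evals, evecs = np.linalg.eigh(mat)
    evals = np.maximum(evals, 0.0) + eps  # clamp round-off negatives, then damp
    return (evecs * evals ** (-1.0 / p)) @ evecs.T


def shampoo_step(W, G, L, R, lr=0.1, beta2=0.99):
    """One two-sided Shampoo step for a matrix parameter W with gradient G."""
    L = beta2 * L + (1.0 - beta2) * (G @ G.T)  # m x m left accumulator
    R = beta2 * R + (1.0 - beta2) * (G.T @ G)  # n x n right accumulator
    update = inverse_pth_root(L, 4) @ G @ inverse_pth_root(R, 4)
    return W - lr * update, L, R


rng = np.random.default_rng(0)
m, n = 3, 5
W = rng.standard_normal((m, n))
G = rng.standard_normal((m, n))
# Assumption: accumulators start as a small multiple of the identity.
W_new, L, R = shampoo_step(W, G, 1e-3 * np.eye(m), 1e-3 * np.eye(n))

# Matrix form versus Kronecker form on the row-major flattened gradient.
L_root, R_root = inverse_pth_root(L, 4), inverse_pth_root(R, 4)
matrix_form = (L_root @ G @ R_root).reshape(-1)
kron_form = np.kron(L_root, R_root) @ G.reshape(-1)
print(np.allclose(matrix_form, kron_form))
```

The Kronecker matrix is formed here only to verify the identity; avoiding that $mn\times mn$ object is exactly what gives Shampoo its memory advantage.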

In practice, to amortize the cubic cost of eigendecomposition, the inverse roots are recomputed only at a fixed interval of steps rather than at every step ("stale preconditioning"), and the per-layer update norm is often grafted to match that of a reference optimizer such as AdamW (Shi et al., 2023, Eschenhagen et al., 4 Jun 2025).
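A sketch of how stale preconditioning and grafting fit together, under stated assumptions: the class name `StaleShampoo`, the parameter `precond_interval`, and the use of the raw gradient as the grafting reference are all illustrative choices. A real setup would typically pass an AdamW-style update as the reference.

```python
import numpy as np


def inverse_pth_root(mat, p, eps=1e-12):
    evals, evecs = np.linalg.eigh(mat)
    evals = np.maximum(evals, 0.0) + eps
    return (evecs * evals ** (-1.0 / p)) @ evecs.T


class StaleShampoo:
    """Shampoo for one matrix parameter with periodic root refresh and grafting."""

    def __init__(self, shape, lr=0.1, beta2=0.99, precond_interval=10, init_damping=1e-6):
        m, n = shape
        self.lr, self.beta2, self.precond_interval = lr, beta2, precond_interval
        self.L = init_damping * np.eye(m)
        self.R = init_damping * np.eye(n)
        self.L_root, self.R_root = np.eye(m), np.eye(n)
        self.step_count = 0
        self.num_refreshes = 0

    def step(self, W, G, reference_update):
        # Cheap: the accumulators are updated at every step.
        self.L = self.beta2 * self.L + (1.0 - self.beta2) * (G @ G.T)
        self.R = self.beta2 * self.R + (1.0 - self.beta2) * (G.T @ G)
        # Expensive: inverse roots are recomputed only every precond_interval steps.
        if self.step_count % self.precond_interval == 0:
            self.L_root = inverse_pth_root(self.L, 4)
            self.R_root = inverse_pth_root(self.R, 4)
            self.num_refreshes += 1
        self.step_count += 1
        direction = self.L_root @ G @ self.R_root
        # Grafting: keep Shampoo's direction, borrow the reference update's norm.
        scale = np.linalg.norm(reference_update) / (np.linalg.norm(direction) + 1e-16)
        return W - self.lr * scale * direction


# Toy problem f(W) = 0.5 * ||W||_F^2, so the gradient is W itself.
opt = StaleShampoo((3, 4))
W = np.ones((3, 4))
for _ in range(25):
    G = W
    W = opt.step(W, G, reference_update=G)  # plain gradient as stand-in reference
```

Between refreshes the stale roots are applied to fresh gradients, which is why the choice of interval trades eigendecomposition cost against preconditioner accuracy.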

2. Theoretical Guarantees and Convergence Rates

Shampoo is a specific instance in the class of adaptively preconditioned first-order methods, and its convergence in stochastic online and non-convex optimization is rigorously established. Theoretical results include:

Stochastic Convex Regret: Under convexity, the two-sided variant achieves $O(\sqrt{T})$ regret over $T$ rounds (Gupta et al., 2018).
Nonconvex Rate: In general nonconvex settings, the AdamW-style Shampoo achieves

$\frac{1}{K}\sum_{k=1}^{K}\mathbb{E}\big[\|\nabla f(W_k)\|_*\big] \le O\!\left(\frac{\sqrt{m+n}\,C}{K^{1/4}}\right)$

where $\|\cdot\|_*$ is the nuclear norm, $K$ is the iteration number, and $C$ encapsulates problem (Lipschitz, smoothness, and variance) parameters (Li et al., 12 Jan 2026). For practical $(m, n)$, the $\sqrt{m+n}$ scaling is much smaller than the parameter dimension scaling appearing in vectorized or diagonal methods.

Structure Advantage: Recent unified analyses demonstrate that one-sided Shampoo (preconditioning only the row or column dimension) can exhibit sharper regret and finite-time convergence bounds than both two-sided Shampoo and full-matrix AdaGrad, with significantly reduced per-step complexity (Xie et al., 13 Mar 2025).

Shampoo's convergence analyses do not require bounded stochastic gradients or extremely small step sizes, unlike several adaptive methods (Gratton et al., 19 Apr 2026). Convergence rates are slightly slower than SGD ($O(\sqrt{m+n}\,C/K^{1/4})$ in nuclear norm after $K$ iterations versus $O(C/K^{1/4})$ in Frobenius norm), but the practical efficiency gains (step size, curvature adaptation) offset this in high-dimensional regimes (Li et al., 12 Jan 2026).
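Bounds stated in nuclear norm and in Frobenius norm can be related through the standard inequality $\|G\|_F \le \|G\|_* \le \sqrt{\operatorname{rank}(G)}\,\|G\|_F$, where $\operatorname{rank}(G)\le\min(m,n)$ for $G\in\mathbb{R}^{m\times n}$. The short check below illustrates both ends of that inequality numerically; the matrix sizes and random data are arbitrary choices for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 64, 256
G = rng.standard_normal((m, n))

fro = np.linalg.norm(G, "fro")  # square root of the sum of squared singular values
nuc = np.linalg.norm(G, "nuc")  # sum of singular values
upper = np.sqrt(min(m, n)) * fro

# A rank-one matrix attains the lower end: both norms coincide.
rank_one = np.outer(rng.standard_normal(m), rng.standard_normal(n))
rank_one_gap = np.linalg.norm(rank_one, "nuc") - np.linalg.norm(rank_one, "fro")

print(fro <= nuc <= upper, abs(rank_one_gap) < 1e-8)
```

A nuclear-norm guarantee is therefore at least as strong as a Frobenius-norm guarantee of the same size, and the two can differ by a factor of up to $\sqrt{\min(m,n)}$ when the gradient's singular values are spread evenly.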

3. Numerical, Stability, and Scalability Considerations

Accurate computation of matrix inverse $p$-th roots is numerically delicate: small eigenvalues in $L_t$ and $R_t$ render the preconditioning operation unstable in low precision (e.g., float32). Classical implementations require FP64 for root-inverse eigendecompositions on CPUs or fallback routines on accelerators (Mei et al., 2023). To reduce wall time, Shampoo implementations often:

Amortize eigendecomposition: Recompute the roots only once per fixed interval of steps, which makes the preconditioner "stale". If the interval is too large, accumulated staleness can lead to divergence or degraded convergence (Frans et al., 8 Jun 2025, Eschenhagen et al., 4 Jun 2025).
Apply block-wise preconditioning: For very large layers, partition the matrices into blocks and batch the kernel operations to maximize hardware utilization, as in DASH, which reports faster optimizer steps than distributed Shampoo by stacking blocks into 3D tensors and leveraging batched EVD or polynomial inverse-root solvers (Modoranu et al., 2 Feb 2026).
Adaptive damping and refresh scheduling: Dynamic schemes, such as FOAM, adjust the damping and trigger re-computation adaptively, using a staleness-oriented error estimate to balance numerical stability and computational cost (Nam et al., 1 Jun 2026).
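The block-wise idea can be sketched with NumPy, whose `np.linalg.eigh` accepts a stack of matrices. This is an illustration under assumptions, not DASH's implementation: the helper names, the square block size `b` (assumed to divide both dimensions), the damping value, and the choice to run the eigendecomposition in float64 are all made for the example.

```python
import numpy as np


def to_blocks(G, b):
    """Split an (m, n) matrix into a stack of (b, b) blocks; assumes b divides m and n."""
    m, n = G.shape
    return G.reshape(m // b, b, n // b, b).transpose(0, 2, 1, 3).reshape(-1, b, b)


def from_blocks(blocks, shape, b):
    """Inverse of to_blocks: reassemble the (m, n) matrix from its block stack."""
    m, n = shape
    return blocks.reshape(m // b, n // b, b, b).transpose(0, 2, 1, 3).reshape(m, n)


def batched_inverse_root(mats, p, damping=1e-8):
    """Inverse p-th roots of a stack of PSD matrices with one batched eigh call."""
    evals, evecs = np.linalg.eigh(mats.astype(np.float64))  # shapes (B, b), (B, b, b)
    evals = np.maximum(evals, 0.0) + damping
    scaled = evecs * (evals ** (-1.0 / p))[:, None, :]
    return scaled @ np.swapaxes(evecs, -1, -2)


rng = np.random.default_rng(0)
G = rng.standard_normal((4, 6)).astype(np.float32)
b = 2
blocks = to_blocks(G, b)                        # stack of shape (6, 2, 2)
blocks64 = blocks.astype(np.float64)            # higher precision for the factors
L = blocks64 @ np.swapaxes(blocks64, -1, -2)    # per-block G G^T
R = np.swapaxes(blocks64, -1, -2) @ blocks64    # per-block G^T G
precond_blocks = batched_inverse_root(L, 4) @ blocks64 @ batched_inverse_root(R, 4)
update = from_blocks(precond_blocks, G.shape, b).astype(G.dtype)
```

Each block gets its own pair of factors, so the cubic cost applies to the block size rather than to the full layer dimensions, and a single batched call handles all blocks at once.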

Table: Comparison of Key Practical Ingredients

Feature	Shampoo	DASH	FOAM	EShampoo/SOAP
FP64 need	Yes (roots)	No (FP32/FP16)	No	No (rotated basis)
Blocked kernels	Not required	Yes	Optional	Optional
Adaptive refresh	No	No	Yes	Yes (basis criterion)
Grafting/eigenvalue corr.	Often/Yes	Optional	Optional	Yes

Key practical recommendations include block-wise application only to large matrix blocks, periodic (but not too-infrequent) root updates, and, if supported, leveraging batched GPU-friendly root solvers (Modoranu et al., 2 Feb 2026, Shi et al., 2023).

4. Algorithmic Extensions and Variants

Several empirically and theoretically motivated extensions have addressed original Shampoo's stability, scale-sensitivity, and tuning complexity (Frans et al., 8 Jun 2025, Vyas et al., 2024, Eschenhagen et al., 4 Jun 2025):

SPlus [A Stable Whitening Optimizer for Efficient Neural Network Training]: Replaces historical eigenvalue normalization with a per-step sign-based update in a fixed eigenbasis, yielding updates with hard-bounded magnitude, shape-aware scaling for width-invariance, and Polyak–Ruppert iterate averaging for noise reduction. SPlus reaches Adam's validation performance in fewer gradient steps (steps-to-Adam) and less wall-clock time (time-to-Adam) across diverse large-scale tasks, with robustness to inversion intervals up to 100 steps.
SOAP [Shampoo with Adam in the Preconditioner's Eigenbasis]: SOAP demonstrates that Shampoo with exponent $1/2$ is algebraically equivalent to running Adafactor in the preconditioner's eigenbasis and extends this to running AdamW in that basis, updating the basis vectors infrequently while adapting the per-entry scaling at every step. SOAP stabilizes performance under infrequent basis refreshes and matches well-tuned AdamW in a fraction of the wall-clock time.
EShampoo [Purifying Shampoo]: Proposes decoupling the update of preconditioner eigenvalues and eigenvectors. With direct eigenvalue correction per-step (in the rotated basis) and an adaptive criterion for eigenbasis refresh (based on relative off-diagonal error in the stale-projected covariance), it eliminates the need for layerwise norm grafting and increases wall-time efficiency.
DASH: Introduces batched block preconditioning, GPU-friendly Newton–Denman–Beavers and Chebyshev-based root estimation, and power iteration-based scaling for accelerated and scalable distributed training. Maintains Shampoo's per-iteration perplexity while drastically reducing wall-clock time.
FOAM: Dynamically controls damping and refresh frequency via an operator-norm staleness proxy, ensuring stable optimization with reduced eigendecomposition overhead (Nam et al., 1 Jun 2026).
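The SOAP idea of keeping Adam-style statistics in a slowly refreshed eigenbasis can be sketched as follows. This is a simplified illustration, not the published algorithm: the class name `SoapSketch`, the parameter `basis_interval`, and the use of a full `eigh` at every refresh are assumptions, and bias correction and weight decay are omitted.

```python
import numpy as np


class SoapSketch:
    """Adam-style moments in the eigenbasis of Shampoo's factors (one matrix parameter)."""

    def __init__(self, shape, lr=1e-2, beta1=0.9, beta2=0.99, eps=1e-8, basis_interval=10):
        m, n = shape
        self.lr, self.beta1, self.beta2 = lr, beta1, beta2
        self.eps, self.basis_interval = eps, basis_interval
        self.L, self.R = np.zeros((m, m)), np.zeros((n, n))
        self.QL, self.QR = np.eye(m), np.eye(n)
        self.M = np.zeros(shape)  # first moment, stored in the original basis
        self.V = np.zeros(shape)  # second moment, stored in the rotated basis
        self.t = 0

    def step(self, W, G):
        self.L = self.beta2 * self.L + (1.0 - self.beta2) * (G @ G.T)
        self.R = self.beta2 * self.R + (1.0 - self.beta2) * (G.T @ G)
        # Infrequent, expensive part: refresh the eigenbases.
        if self.t % self.basis_interval == 0:
            self.QL = np.linalg.eigh(self.L)[1]
            self.QR = np.linalg.eigh(self.R)[1]
        self.t += 1
        # Per-step, cheap part: Adam-style statistics in the rotated basis.
        self.M = self.beta1 * self.M + (1.0 - self.beta1) * G
        G_rot = self.QL.T @ G @ self.QR
        M_rot = self.QL.T @ self.M @ self.QR
        self.V = self.beta2 * self.V + (1.0 - self.beta2) * G_rot**2
        N_rot = M_rot / (np.sqrt(self.V) + self.eps)
        # Rotate the normalized update back to the original coordinates.
        return W - self.lr * (self.QL @ N_rot @ self.QR.T)


rng = np.random.default_rng(0)
target = rng.standard_normal((4, 3))
W = np.zeros((4, 3))
opt = SoapSketch(W.shape)
for _ in range(5):
    W = opt.step(W, W - target)  # gradient of 0.5 * ||W - target||_F^2
```

In this sketch the second moment `V` is not re-projected when the basis changes, so it is briefly expressed in stale coordinates after each refresh; that is one of the simplifications to keep in mind when reading it.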

5. Empirical Performance and Applications

Shampoo and its variants consistently outperform diagonal methods (Adam, AdaGrad) in step- and wall-time efficiency on standard benchmarks (ResNet/ImageNet, ViT, GPT-2/WikiText, Llama/C4), often reducing the number of training steps or the training time while reaching similar or higher final accuracy (Shi et al., 2023, Vlassis et al., 27 Sep 2025, Frans et al., 8 Jun 2025, Modoranu et al., 2 Feb 2026). Key empirical outcomes:

Robustness to Quantization: Shampoo-trained models show the smallest accuracy drop under 4-bit quantization-aware training, with the highest parameter efficiency and minimal zero-shot degradation versus full precision, outperforming AdamW and Muon even when the max-to-median ratio suggests the opposite (Vlassis et al., 27 Sep 2025).
Generalization and Compression: Shampoo has been shown to yield models with less activation outlier structure, lowering quantization and compression error, and supporting efficient deployment (Modoranu et al., 2 Feb 2026).
Distributed Scalability: DTensor-based sharding and AllGather primitives in distributed PyTorch implementations enable large-scale Shampoo training with limited per-step overhead while converging faster in wall-clock time than baseline adaptive methods (Shi et al., 2023, Modoranu et al., 2 Feb 2026).

6. Interpretations, Connections, and Open Problems

Shampoo's core operation is best understood as a Frobenius-norm optimal Kronecker-approximation to the full empirical Fisher (as in full-matrix AdaGrad or Adam), updated with the square-root of the factors to maintain the scaling properties of the vectorized preconditioner (Eschenhagen et al., 4 Jun 2025). The update can be decomposed as a spectral descent (polar/singular value factorization) with two adaptive matrix scalings, yielding time-averaged semi-orthogonality in expectation rather than enforcing hard orthogonality (“whitening”) or variance adaptation (Eschenhagen et al., 10 Feb 2026). This mechanism is fundamentally distinct from both classical whitening and variance-scaling narratives.
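The link to spectral descent is easiest to see in the idealized single-gradient case, with no accumulation and no damping. Writing the thin SVD $G = U\Sigma V^\top$ gives $L = GG^\top = U\Sigma^2U^\top$ and $R = G^\top G = V\Sigma^2V^\top$, so $L^{-1/4} G R^{-1/4} = U\Sigma^{-1/2}\Sigma\Sigma^{-1/2}V^\top = UV^\top$, the orthogonal polar factor of $G$. The check below verifies this numerically; the helper `pinv_root` (an inverse root restricted to the nonzero eigenspace) and its tolerance are assumptions of the example.

```python
import numpy as np


def pinv_root(mat, p, tol=1e-10):
    """Inverse p-th root restricted to the nonzero eigenspace of a PSD matrix."""
    evals, evecs = np.linalg.eigh(mat)
    keep = evals > tol * evals.max()
    scaled = np.zeros_like(evals)
    scaled[keep] = evals[keep] ** (-1.0 / p)
    return (evecs * scaled) @ evecs.T


rng = np.random.default_rng(0)
G = rng.standard_normal((3, 5))
L, R = G @ G.T, G.T @ G  # single-gradient factors, no averaging or damping

update = pinv_root(L, 4) @ G @ pinv_root(R, 4)
U, _, Vt = np.linalg.svd(G, full_matrices=False)
print(np.allclose(update, U @ Vt))
```

With accumulated factors the two roots no longer cancel the singular values exactly, which is where the additional adaptive matrix scalings described above come from.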

Recent analyses highlight:

Structure-exploitation vs. full adaptivity: Carefully chosen structure (e.g., one-sided Shampoo) can yield both improved regret/convergence constants and lower cost than less-structured, fully-adaptive methods—a challenge to the dogma that more adaptivity always confers more efficiency (Xie et al., 13 Mar 2025).
Elimination of heuristics: Adaptive eigenbasis frequency and eigenvalue correction can supplant learning rate grafting and ad hoc preconditioner reuse (Eschenhagen et al., 4 Jun 2025).
Momentum and weight decay integration: Shampoo is commonly equipped with first- and second-moment momentum, decoupled weight decay (AdamW-style), bias correction, and flexible per-layer application (matrix blocks only vs. scalars) (Shi et al., 2023, Li et al., 12 Jan 2026).
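An adaptive eigenbasis refresh of the kind mentioned in the list above can be sketched as a relative off-diagonal test: rotate the current factor into the stale eigenbasis and refresh only if too much Frobenius mass lies off the diagonal. The function names, the normalization, and the threshold value are assumptions for illustration; the published criterion may differ in its details.

```python
import numpy as np


def offdiag_ratio(factor, Q):
    """Relative Frobenius mass off the diagonal of Q^T factor Q."""
    rotated = Q.T @ factor @ Q
    off_diag = rotated - np.diag(np.diag(rotated))
    return np.linalg.norm(off_diag) / np.linalg.norm(rotated)


def maybe_refresh(factor, Q, threshold=0.1):
    """Recompute the eigenbasis only if the stale one no longer diagonalizes factor."""
    if offdiag_ratio(factor, Q) > threshold:
        return np.linalg.eigh(factor)[1], True
    return Q, False


factor = np.diag([1.0, 2.0, 3.0])
Q = np.linalg.eigh(factor)[1]

# A small drift leaves the stale basis nearly diagonalizing: no refresh.
Q, refreshed_small = maybe_refresh(factor + 0.01 * np.ones((3, 3)), Q)

# A large drift exceeds the threshold and triggers a new eigendecomposition.
drifted = factor + np.ones((3, 3))
Q, refreshed_large = maybe_refresh(drifted, Q)
```

The test itself costs only two matrix products per factor, so it can run far more often than the eigendecomposition it guards.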

Outstanding directions include tight integration of approximation quality into theoretical regret, combining adaptive basis scheduling with hardware-aware batch kernel calls, and extension to more general curvature targets or low-rank approximations (Eschenhagen et al., 4 Jun 2025).

Key references for further details: (Gupta et al., 2018, Shi et al., 2023, Xie et al., 13 Mar 2025, Gratton et al., 19 Apr 2026, Modoranu et al., 2 Feb 2026, Eschenhagen et al., 4 Jun 2025, Vlassis et al., 27 Sep 2025, Frans et al., 8 Jun 2025, Vyas et al., 2024, Eschenhagen et al., 10 Feb 2026, Nam et al., 1 Jun 2026, Li et al., 12 Jan 2026).