Shampoo Algorithms for Efficient NN Training

Updated 30 June 2025
  • Shampoo algorithms are adaptive optimization methods that employ Kronecker-factored preconditioning to enhance scalability and accelerate convergence in large-scale neural network training.
  • They use tensor-based approximations of second-order statistics to achieve provable regret bounds and robust convergence guarantees in diverse deep learning applications.
  • Variants such as Distributed Shampoo, 4-bit Shampoo, and SOAP offer memory efficiency and speed improvements, outperforming traditional first-order optimizers in practical settings.

The Shampoo family of algorithms comprises a set of adaptive optimization methods designed for efficient and effective training of large-scale neural networks through higher-order, structure-aware preconditioning. Shampoo algorithms address key limitations of traditional first-order optimizers by employing Kronecker-factored approximations of second-order statistics, yielding improved convergence properties, practical scalability, and robust empirical performance in both vision and language domains.

1. Structure-Aware Kronecker-Factored Preconditioning

The original Shampoo algorithm (Gupta et al., 2018) introduced a tensor-based preconditioning framework that replaces costly full-matrix preconditioners with sets of smaller matrices, each corresponding to a specific tensor dimension. For a parameter tensor $W \in \mathbb{R}^{n_1 \times \cdots \times n_k}$, the method maintains one symmetric positive semidefinite matrix $H^i$ of shape $n_i \times n_i$ for each dimension $i$. For a matrix-shaped $W \in \mathbb{R}^{m \times n}$, Shampoo maintains row and column preconditioners:

$$L_t = L_{t-1} + G_t G_t^\top, \qquad R_t = R_{t-1} + G_t^\top G_t$$

where $G_t$ is the gradient at iteration $t$. The parameter update uses both preconditioners:

$$W_{t+1} = W_t - \eta\, L_t^{-1/4} G_t R_t^{-1/4}$$

For tensors of order $k$, the update generalizes via contractions and products of the per-mode preconditioners, each raised to the power $-\frac{1}{2k}$, along the associated axis.

This structure-aware approach exploits intrinsic parameter organization in deep networks, providing richer adaptivity than diagonal methods while avoiding the prohibitive $O(d^2)$ memory cost of full-matrix preconditioning.
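
As a concrete reference, the following is a minimal PyTorch sketch of the matrix-case update above (function and variable names, and the epsilon damping, are illustrative rather than taken from any reference implementation); the inverse fourth roots are computed via symmetric eigendecompositions of the accumulated statistics.

```python
import torch

def shampoo_matrix_step(W, G, L, R, lr=1e-3, eps=1e-12):
    """One Shampoo step for a matrix parameter W with gradient G.

    L and R hold the accumulated row/column statistics (m x m and n x n).
    This is a sketch: practical implementations add damping, blocking,
    and amortize the root computations over many steps.
    """
    # Accumulate second-moment statistics along each mode.
    L += G @ G.T          # row (left) statistics, shape (m, m)
    R += G.T @ G          # column (right) statistics, shape (n, n)

    def inv_fourth_root(M):
        # M^{-1/4} for a symmetric PSD matrix via eigendecomposition.
        vals, vecs = torch.linalg.eigh(M)
        return vecs @ torch.diag(vals.clamp_min(eps) ** -0.25) @ vecs.T

    # W_{t+1} = W_t - eta * L_t^{-1/4} G_t R_t^{-1/4}
    W -= lr * inv_fourth_root(L) @ G @ inv_fourth_root(R)
    return W, L, R
```

For higher-order tensors, the same pattern applies mode by mode, with each factor raised to the power $-\frac{1}{2k}$.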

2. Theoretical Guarantees and Matrix Analysis

Shampoo's convergence analysis is grounded in the use of matrix trace inequalities, particularly those concerning Kronecker products and operator monotonicity. The main theoretical result provides an $O(\sqrt{T})$ regret bound in stochastic convex optimization:

$$\sum_{t=1}^T f_t(W_t) - \sum_{t=1}^T f_t(W^*) \;\leq\; \sqrt{2r}\, D\, \operatorname{Tr}(L_T^{1/4}) \operatorname{Tr}(R_T^{1/4})$$

where $r$ is the maximum rank of the gradients, $D$ is the maximal parameter distance in the Frobenius norm, and $L_T, R_T$ accumulate the preconditioner statistics. For tensors, the regret bound scales as a product of preconditioner traces across dimensions.

The analysis draws on:

  • Properties of Kronecker products, e.g. $(A \otimes B)^s = A^s \otimes B^s$ (verified numerically in the sketch after this list)
  • Operator monotonicity of matrix powers (Löwner’s theorem)
  • The geometric mean inequality for positive semidefinite matrices (Ando et al.)
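
The following sketch numerically verifies the first property for a fractional power of positive semidefinite matrices (a self-contained illustration, not code from the cited analyses):

```python
import torch

torch.manual_seed(0)
A = torch.randn(3, 3); A = A @ A.T + torch.eye(3)   # symmetric PSD
B = torch.randn(2, 2); B = B @ B.T + torch.eye(2)   # symmetric PSD

def matrix_power(M, s):
    # Fractional power of a symmetric PSD matrix via eigendecomposition.
    vals, vecs = torch.linalg.eigh(M)
    return vecs @ torch.diag(vals ** s) @ vecs.T

s = 0.25
lhs = matrix_power(torch.kron(A, B), s)
rhs = torch.kron(matrix_power(A, s), matrix_power(B, s))
print(torch.allclose(lhs, rhs, atol=1e-5))  # True
```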

This ensures that Shampoo's preconditioning is never worse than the full-matrix approach up to rank-dependent constants, with favorable dimension scaling for the high-dimensional, low-rank gradient structure typical of deep learning (Gupta et al., 2018).

3. Algorithmic Variants and Implementation Strategies

Multiple variants and implementation enhancements have been developed to adapt the Shampoo framework for distributed settings, memory efficiency, and robustness:

  • Distributed Shampoo employs block-diagonal preconditioning, assigning Kronecker-factored state per parameter or layer. In PyTorch, parameters are sharded using the DTensor structure, with an AllGather operation synchronizing updates at each iteration. Preconditioner roots are updated periodically, with blocking strategies to cap compute and memory usage (Shi et al., 2023).
  • 4-bit Shampoo leverages quantization: preconditioner eigenvector matrices are stored at 4 bits per value, with orthogonality rectified via Björck orthonormalization (see the quantization sketch after this list). This achieves roughly 7x compression of the optimizer state and enables large-model training with negligible accuracy loss relative to 32-bit Shampoo. Direct quantization of the preconditioners themselves introduces catastrophic error; quantizing only the eigenbasis preserves both numerical stability and optimizer performance (Wang et al., 28 May 2024).
  • SOAP (Shampoo with Adam in the Preconditioner's eigenbasis) establishes a principled link between the 1/2-exponent Shampoo update and running Adafactor/Adam in the eigenbasis of the Shampoo preconditioner (see the rotated-Adam sketch after this list). This insight motivates a design in which inexpensive moment tracking (Adam) is performed in a slowly changing rotated space, allowing infrequent, and thus computationally cheap, eigencomputations without accuracy loss. SOAP outperforms both AdamW and classic Shampoo in large-batch LLM pretraining in terms of both iterations and wall-clock time, while requiring minimal extra hyperparameter tuning (just the preconditioning frequency) (Vyas et al., 17 Sep 2024).
  • SPlus further enhances the Shampoo family by using bounded, sign-based normalization (instant-sign update), shape-aware scaling for learning rate consistency over width, and iterate-averaging (EMA) to reduce parameter noise at high learning rates. SPlus achieves superior stability with infrequent eigenbasis updates and enables practical deployment on large Transformer training benchmarks, consistently reaching Adam-level performance in fewer steps and less wall-clock time (Frans et al., 8 Jun 2025).
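
The following quantization sketch makes the 4-bit eigenbasis idea concrete. It uses a deliberately simplified symmetric per-tensor 4-bit scheme and a Björck iteration to restore orthogonality; all names and the quantization scheme are illustrative assumptions rather than the paper's exact block-wise codebook.

```python
import torch

def quantize_4bit(Q):
    # Symmetric per-tensor 4-bit quantization (levels -8..7); a simplification
    # of the block-wise scheme used in practice.
    scale = Q.abs().max() / 7.0
    q = torch.clamp(torch.round(Q / scale), -8, 7).to(torch.int8)
    return q, scale

def dequantize(q, scale):
    return q.float() * scale

def bjorck_orthonormalize(Q, iters=3):
    # Björck iteration: Q <- Q (3I - Q^T Q) / 2 pushes Q back toward orthogonality.
    I = torch.eye(Q.shape[1])
    for _ in range(iters):
        Q = 0.5 * Q @ (3 * I - Q.T @ Q)
    return Q

# Eigenbasis of a PSD statistic matrix.
torch.manual_seed(0)
M = torch.randn(64, 64); M = M @ M.T
_, U = torch.linalg.eigh(M)

q, s = quantize_4bit(U)   # stored optimizer state: int8 tensor holding 4-bit codes
U_hat = bjorck_orthonormalize(dequantize(q, s))
print((U_hat.T @ U_hat - torch.eye(64)).abs().max())  # small orthogonality error
```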
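
The rotated-Adam sketch below illustrates the core SOAP mechanism under simplifying assumptions (no bias correction, plain accumulation of the statistics, illustrative names such as `precondition_frequency`): Adam-style moments are maintained in the eigenbasis of the Shampoo statistics, which is refreshed only occasionally.

```python
import torch

def soap_like_step(W, G, state, lr=3e-4, betas=(0.9, 0.999),
                   eps=1e-8, precondition_frequency=10):
    """Sketch of running Adam in the eigenbasis of Shampoo's L/R statistics."""
    state["L"] += G @ G.T
    state["R"] += G.T @ G
    if state["step"] % precondition_frequency == 0:
        # Infrequent (amortized) eigendecompositions of the accumulated statistics.
        state["QL"] = torch.linalg.eigh(state["L"]).eigenvectors
        state["QR"] = torch.linalg.eigh(state["R"]).eigenvectors
    state["step"] += 1

    # Rotate the gradient into the preconditioner eigenbasis.
    G_rot = state["QL"].T @ G @ state["QR"]

    # Standard Adam moment updates, but in the rotated space.
    b1, b2 = betas
    state["m"] = b1 * state["m"] + (1 - b1) * G_rot
    state["v"] = b2 * state["v"] + (1 - b2) * G_rot**2
    update_rot = state["m"] / (state["v"].sqrt() + eps)

    # Rotate the update back to parameter space and apply it.
    W -= lr * state["QL"] @ update_rot @ state["QR"].T
    return W

# Example state initialization for a 16x8 weight matrix.
m, n = 16, 8
state = {"step": 0,
         "L": torch.zeros(m, m), "R": torch.zeros(n, n),
         "m": torch.zeros(m, n), "v": torch.zeros(m, n)}
```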

4. Heuristics, Theoretical Insights, and Structured Simplifications

Shampoo and its descendants initially relied on heuristics, such as learning rate grafting (rescaling the update's Frobenius norm to match that of Adam) and stale preconditioning (amortizing expensive root computations by updating them less frequently), to ensure robust performance at scale (Eschenhagen et al., 4 Jun 2025); a minimal grafting sketch follows the list below. Recent analyses provide theoretical justification for these heuristics and propose mechanisms to eliminate them:

  • Grafting is shown to mitigate mis-scaling caused by stale or imprecisely updated preconditioner eigenvalues.
  • Decoupling eigenvalue and eigenbasis updates, and correcting eigenvalues directly in the current basis, can remove the need for learning rate grafting.
  • An adaptive criterion for eigenbasis refresh, based on the off-diagonal Frobenius norm, allows per-layer update frequency control, striking a balance between computational overhead and approximation error.
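
A minimal grafting sketch (illustrative names; the refresh test below is one plausible form of the off-diagonal criterion described above, not the exact rule from the cited work): the Shampoo step supplies the direction, while the per-layer magnitude is borrowed from a reference optimizer such as Adam.

```python
import torch

def graft_update(shampoo_update, reference_update, eps=1e-16):
    """Learning-rate grafting: keep the Shampoo direction, borrow the
    per-layer step size (Frobenius norm) from a reference optimizer (e.g. Adam)."""
    scale = reference_update.norm() / shampoo_update.norm().clamp_min(eps)
    return shampoo_update * scale

def needs_eigenbasis_refresh(L, Q, tol=0.1):
    """Plausible refresh criterion: rotate the accumulated statistic L into the
    current eigenbasis Q and measure its relative off-diagonal Frobenius norm."""
    M = Q.T @ L @ Q
    off_diag = M - torch.diag(torch.diag(M))
    return (off_diag.norm() / M.norm()).item() > tol
```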

The family also encompasses structured variants such as one-sided Shampoo (using a single-sided preconditioner), which has been shown to achieve equal or better theoretical and empirical convergence than full-matrix AdaGrad due to better balancing of preconditioner diameter and gradient accumulations. Unified analyses demonstrate that, contrary to prevailing assumptions, more structured (less expressive, cheaper) preconditioners can outperform less-structured, more costly approaches in practice, especially when the domain geometry aligns with the structured adaptation (Xie et al., 13 Mar 2025).
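
As a sketch of the one-sided variant (keeping only the left factor, with the exponent commonly taken as $-1/2$ so that a single factor carries the full preconditioning; names are illustrative):

```python
import torch

def one_sided_shampoo_step(W, G, L, lr=1e-3, eps=1e-12):
    """One-sided Shampoo sketch: precondition only along the rows (left factor),
    raised to the power -1/2 in place of the two -1/4 factors."""
    L += G @ G.T
    vals, vecs = torch.linalg.eigh(L)
    W -= lr * (vecs @ torch.diag(vals.clamp_min(eps) ** -0.5) @ vecs.T) @ G
    return W, L
```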

5. Empirical Performance and Practical Impact

Empirical studies consistently show that Shampoo and its variants outperform diagonal adaptive optimizers (Adam, AdaGrad) and momentum SGD in several respects:

  • Faster convergence: reaches lower loss or error in fewer steps, notably in deep image classification (CIFAR, ImageNet) and large-scale language modeling tasks.
  • Efficient scalability: With distributed implementations and memory reduction via quantization or blocking, Shampoo variants enable efficient training of high-parameter-count models on multi-GPU setups with minimal per-step overhead (typically ≤ 10% above Adam or SGD per batch in distributed settings (Shi et al., 2023)).
  • Robustness: Variants such as SPlus and SOAP offer superior step and wall-clock efficiency compared to Adam, while maintaining or exceeding stability and ease of hyperparameter tuning (Frans et al., 8 Jun 2025, Vyas et al., 17 Sep 2024).

Representative results include top-1 validation accuracy improvements on ImageNet/ResNet-50; memory- and wall-clock-efficient scaling in distributed settings; and consistently faster convergence across diverse Transformer tasks. Low-bit variants enable training of model sizes that were previously intractable due to optimizer-state memory bottlenecks (Wang et al., 28 May 2024).

6. Broader Implications and Future Directions

The Shampoo family establishes Kronecker-factorization as a powerful, scalable paradigm for second-order neural network optimization. Key implications include:

  • Advanced preconditioning can be brought to large-scale deep learning with modest practical overhead and without the memory barriers of full-matrix approaches.
  • Structured preconditioners, when properly matched to problem geometry, can outperform less structured or full-matrix adaptation in both theory and practice (Xie et al., 13 Mar 2025).
  • Recent theoretical developments and practical algorithmic refinements have reduced dependence on brittle heuristics, yielding methods that are both principled and robust (e.g., eigenvalue correction, adaptive eigenbasis scheduling) (Eschenhagen et al., 4 Jun 2025).
  • Memory-efficient quantized versions democratize access to strong optimization tools on constrained hardware (Wang et al., 28 May 2024).

Ongoing research explores further reductions in computational overhead, adaptive adjustment of preconditioning granularity, and transferability to broad classes of large neural architectures, including the integration of Shampoo principles in LLM pretraining and finetuning.


Summary Table: Major Shampoo Variants

| Variant | Key Features | Practical Benefit |
|---|---|---|
| Classic Shampoo | Kronecker-factored per-dimension matrices | Near full-matrix adaptation, scalable |
| Distributed Shampoo (PyTorch) | DTensor sharding, AllGather sync | Efficient, large-scale multi-GPU training |
| 4-bit Shampoo | 4-bit eigenbasis quantization, correction | Large memory savings, minimal accuracy loss |
| SOAP | Adam in the preconditioner's eigenbasis | Fast convergence, fewer hyperparameters |
| One-sided Shampoo | Left or right preconditioner only | Lower computation, improved theory/practice |
| SPlus | Bounded sign update, shape scaling, EMA | Stability, learning rate transfer, speed |