Shampoo Family of Algorithms
Last updated: June 11, 2025
Significance and Background
The Shampoo family of algorithms represents a scalable approach to second-order optimization in deep learning, providing a middle ground between purely diagonal adaptive methods (such as Adam) and computationally expensive full-matrix preconditioners (Gupta et al., 2018). Shampoo achieves this by exploiting the tensor structure of parameter arrays, maintaining a separate matrix preconditioner for each dimension of the tensor. This design enables rich, curvature-aware updates with storage and computation requirements that are practical for large-scale models (Gupta et al., 2018).
Foundational Concepts
Structure-Aware Kronecker-Factored Preconditioning
Shampoo’s approach centers on Kronecker-factored preconditioners. For a parameter tensor $W \in \mathbb{R}^{n_1 \times \cdots \times n_k}$, the optimizer maintains a mode-specific preconditioner $H_i \in \mathbb{R}^{n_i \times n_i}$ for each dimension $i = 1, \dots, k$. The updates are (a code sketch of the matrix case follows this list):
- For order-2 tensors (matrices), with gradient $G_t$:
  - Preconditioners: $L_t = L_{t-1} + G_t G_t^{\top}$, $\quad R_t = R_{t-1} + G_t^{\top} G_t$
  - Update: $W_{t+1} = W_t - \eta\, L_t^{-1/4}\, G_t\, R_t^{-1/4}$
- For higher-order tensors:
  - Each $H_i$ is updated through the appropriate contractions of the gradient tensor over all modes except $i$, and the preconditioned gradient is constructed by tensor–matrix products along each mode with the inverse root $H_i^{-1/(2k)}$ (Gupta et al., 2018).
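To make the matrix-case recursion concrete, here is a minimal, self-contained sketch in PyTorch. It assumes a single dense weight matrix and computes the inverse fourth roots by eigendecomposition; the helper names (`inverse_pth_root`, `shampoo_step`) and the damping constant are illustrative rather than taken from any official implementation.

```python
import torch

def inverse_pth_root(mat: torch.Tensor, p: int, eps: float = 1e-6) -> torch.Tensor:
    """Return mat^(-1/p) for a symmetric PSD matrix via eigendecomposition."""
    vals, vecs = torch.linalg.eigh(mat)
    return vecs @ torch.diag(vals.clamp(min=eps) ** (-1.0 / p)) @ vecs.T

def shampoo_step(W, G, L, R, lr=1e-3):
    """One matrix-case Shampoo step: accumulate factors, precondition, update."""
    L = L + G @ G.T            # L_t = L_{t-1} + G_t G_t^T
    R = R + G.T @ G            # R_t = R_{t-1} + G_t^T G_t
    preconditioned = inverse_pth_root(L, 4) @ G @ inverse_pth_root(R, 4)
    W = W - lr * preconditioned
    return W, L, R
```

In practice the factors are initialized as $\epsilon I$ and the inverse roots are recomputed only periodically, as discussed for the distributed setting below.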
Convergence Guarantees
Shampoo achieves regret bounds in online convex optimization, matching the best first-order adaptive preconditioned methods, but with lower computational cost than full-matrix AdaGrad. These guarantees are grounded in matrix trace inequalities and hold for convex loss functions under mild gradient assumptions (Gupta et al., 2018).
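For the matrix case, the bound in Gupta et al. (2018) has roughly the following form (stated here up to constants and the exact definition of the diameter term $D$; $r$ bounds the rank of the gradients). Because each trace term grows at most as $T^{1/4}$, the regret is $O(\sqrt{T})$:

```latex
\sum_{t=1}^{T} f_t(W_t) - \min_{W \in \mathcal{K}} \sum_{t=1}^{T} f_t(W)
\;\lesssim\; \sqrt{r}\, D\, \operatorname{tr}\!\big(L_T^{1/4}\big)\,\operatorname{tr}\!\big(R_T^{1/4}\big),
\qquad
L_T = \epsilon I + \sum_{t=1}^{T} G_t G_t^{\top},\quad
R_T = \epsilon I + \sum_{t=1}^{T} G_t^{\top} G_t.
```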
Key Algorithmic and Empirical Developments
Distributed Shampoo and Large-Scale Performance
Distributed Shampoo adapts the optimizer to multi-GPU and large-model settings through block-diagonal preconditioners and block-wise assignment using DTensor (sharded) structures. Parameter blocks are distributed across GPUs, and an AllGather primitive ensures synchronized parameter updates (Shi et al., 2023).
Key empirical findings:
- Distributed Shampoo introduces only an 8–10% wall-clock overhead per iteration compared to first-order optimizers, despite the additional matrix operations.
- When training ImageNet-scale models (e.g., ResNet-50), Shampoo achieves a 1.5× reduction in optimizer steps and a 1.35× reduction in wall-clock time over SGD, without sacrificing accuracy.
- Amortizing root-inverse computations by updating them every 50 steps (instead of every iteration) incurs negligible accuracy loss; see the sketch below (Shi et al., 2023).
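The amortization mentioned in the last bullet can be sketched as follows; the cache layout, the 50-step frequency, and the function name are illustrative and not taken from the Distributed Shampoo codebase.

```python
import torch

PRECONDITION_FREQUENCY = 50  # recompute the expensive roots every 50 steps

def maybe_refresh_roots(step: int, L: torch.Tensor, R: torch.Tensor, cache: dict):
    """Recompute L^{-1/4} and R^{-1/4} only every PRECONDITION_FREQUENCY steps."""
    if "L_root" not in cache or step % PRECONDITION_FREQUENCY == 0:
        for name, mat in (("L_root", L), ("R_root", R)):
            vals, vecs = torch.linalg.eigh(mat)
            cache[name] = vecs @ torch.diag(vals.clamp(min=1e-6) ** -0.25) @ vecs.T
    return cache["L_root"], cache["R_root"]  # stale between refreshes, by design
```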
Learning Rate Grafting
Due to differences in preconditioned step norms compared to Adam or SGD, practical Shampoo implementations often employ learning rate grafting: at each layer, the Shampoo update is rescaled to match the Frobenius norm of a reference optimizer’s step (typically Adam/SGD). This alignment aids stable training and facilitates the reuse of learning-rate schedules (Shi et al., 2023, Eschenhagen et al., 4 Jun 2025).
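A minimal sketch of per-layer grafting, assuming the Shampoo direction and a reference (e.g., Adam) step for the same layer are already available; the function name is illustrative.

```python
import torch

def graft(shampoo_update: torch.Tensor, reference_update: torch.Tensor,
          eps: float = 1e-16) -> torch.Tensor:
    """Keep Shampoo's direction but borrow the reference optimizer's step size,
    by matching Frobenius norms layer-wise."""
    scale = reference_update.norm() / (shampoo_update.norm() + eps)
    return scale * shampoo_update
```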
Memory-Efficient Optimization: 4-bit Shampoo
Preconditioner storage can become a bottleneck for very large models. 4-bit Shampoo addresses this by quantizing the eigenvector matrices of the preconditioners to 4 bits, while maintaining the eigenvalues at full precision. This approach avoids the significant errors (especially under inverse roots) that result from directly quantizing the preconditioner matrix itself. Orthogonality rectification, such as Björck orthonormalization, is applied post-quantization to ensure numerical stability.
- 4-bit Shampoo achieves comparable test accuracy to standard 32-bit Shampoo while reducing the memory for its optimizer states by roughly 7×.
- On benchmarks such as Swin-Tiny on CIFAR-100, 4-bit Shampoo attained 78.63% accuracy (vs. 79.34% for 32-bit Shampoo and 76.69% for AdamW) with only 1–12% more memory than AdamW (Wang et al., 28 May 2024).
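The following sketch conveys the idea under simplifying assumptions: a plain linear 4-bit quantizer for the eigenvector matrix (eigenvalues kept in full precision elsewhere) and a few Björck/Newton–Schulz iterations to restore orthogonality after dequantization. It is not the exact quantization scheme of Wang et al. (2024).

```python
import torch

def quantize_4bit(U: torch.Tensor):
    """Linearly quantize an (approximately orthogonal) eigenvector matrix to 4 bits."""
    scale = U.abs().max().item() / 7.0                 # signed 4-bit range is [-8, 7]
    q = torch.clamp(torch.round(U / scale), -8, 7).to(torch.int8)
    return q, scale

def dequantize_and_rectify(q: torch.Tensor, scale: float, iters: int = 4) -> torch.Tensor:
    """Dequantize, then repair orthogonality with Björck iterations:
    U <- U (3I - U^T U) / 2."""
    U = q.to(torch.float32) * scale
    I = torch.eye(U.shape[1])
    for _ in range(iters):
        U = U @ (1.5 * I - 0.5 * (U.T @ U))
    return U
```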
Hybrid and Refined Algorithms: SOAP, Purifying Shampoo, and SPlus
SOAP (ShampoO with Adam in the Preconditioner's Eigenbasis)
SOAP establishes a formal connection between Shampoo preconditioning and Adam-family updates. Specifically, it shows that applying AdamW in the slowly evolving eigenbasis of the Shampoo preconditioner is functionally equivalent (under appropriate exponent choices) to a variant of Shampoo (Vyas et al., 17 Sep 2024). The main algorithmic features are:
- AdamW is run on gradients rotated into the current Shampoo eigenbasis, with the eigenbasis itself recomputed only every $k$ steps; this preconditioning frequency $k$ is the sole extra hyperparameter beyond standard Adam settings (see the sketch after this list).
- Unlike Shampoo, which holds moments fixed between basis updates, SOAP continually updates the Adam moments in the current basis.
- On large-scale language-model pretraining (360M–660M parameters), SOAP delivered over a 40% reduction in optimizer steps and a 35% reduction in wall-clock time compared to AdamW, with an additional ~20% improvement over Shampoo (Vyas et al., 17 Sep 2024).
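A highly simplified, single-matrix sketch of this idea follows: rotate the gradient into an infrequently refreshed eigenbasis of the Shampoo factors, run an Adam-style update there, and rotate back. Hyperparameter names and defaults are illustrative, moment rotation at basis refreshes is ignored, and bias correction is omitted for brevity.

```python
import torch

def soap_like_step(W, G, state, lr=3e-4, betas=(0.9, 0.95), eps=1e-8, basis_freq=10):
    # Accumulate the Shampoo factors.
    state["L"] = state.get("L", torch.zeros(G.shape[0], G.shape[0])) + G @ G.T
    state["R"] = state.get("R", torch.zeros(G.shape[1], G.shape[1])) + G.T @ G
    step = state["step"] = state.get("step", 0) + 1
    if step == 1 or step % basis_freq == 0:            # refresh eigenbases only occasionally
        state["QL"] = torch.linalg.eigh(state["L"]).eigenvectors
        state["QR"] = torch.linalg.eigh(state["R"]).eigenvectors
    QL, QR = state["QL"], state["QR"]
    G_rot = QL.T @ G @ QR                              # gradient expressed in the eigenbasis
    # Adam-style moments, maintained (and updated every step) in the rotated space.
    m = state["m"] = betas[0] * state.get("m", torch.zeros_like(G_rot)) + (1 - betas[0]) * G_rot
    v = state["v"] = betas[1] * state.get("v", torch.zeros_like(G_rot)) + (1 - betas[1]) * G_rot ** 2
    update_rot = m / (v.sqrt() + eps)                  # bias correction omitted for brevity
    W = W - lr * (QL @ update_rot @ QR.T)              # rotate the update back
    return W, state
```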
Purifying Shampoo
Recent analysis has shown that two heuristics, learning rate grafting and stale preconditioning, are crucial for practical Shampoo performance at scale, even though they lack theoretical justification (Eschenhagen et al., 4 Jun 2025). Key observations:
- Learning rate grafting corrects for mis-scaled or stale eigenvalues, which would otherwise lead to incorrect update magnitudes.
- Decoupling the frequency of eigenvalue and eigenbasis updates reveals that if eigenvalues are updated at every step (even when the basis is updated infrequently), learning rate grafting becomes unnecessary.
- An adaptive criterion for eigenbasis updates, based on the relative Frobenius norm of the off-diagonal terms in the rotated preconditioner, enables efficient and principled update scheduling without predefined frequencies (see the sketch below). This approach achieves computational savings without loss of convergence speed (Eschenhagen et al., 4 Jun 2025).
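A sketch of such an error-driven refresh test, in the spirit of the criterion above; the tolerance value and function name are illustrative.

```python
import torch

def needs_basis_refresh(L: torch.Tensor, Q: torch.Tensor, tol: float = 0.1) -> bool:
    """Refresh the eigenbasis Q of factor L only when the off-diagonal mass of L,
    expressed in the current basis, is large relative to the total norm."""
    L_rot = Q.T @ L @ Q
    off_diag = L_rot - torch.diag(torch.diagonal(L_rot))
    return (off_diag.norm() / L_rot.norm()).item() > tol
```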
SPlus: A Stable Whitening Optimizer
SPlus extends the Shampoo framework, resolving key issues related to instability, learning-rate scaling across network widths, and high parameter noise.
- Instead of normalizing with stale historical eigenvalues, SPlus applies instant sign normalization in the existing eigenbasis, i.e. an update of the form $U_L\, \mathrm{sign}(U_L^{\top} G_t\, U_R)\, U_R^{\top}$, where $U_L, U_R$ are the current eigenbases of the two Shampoo factors. This bounds the spectral norm of the update, preventing divergence due to stale normalization factors (a combined sketch follows this list) (Frans et al., 8 Jun 2025).
- A shape-aware scaling factor maintains invariant update magnitudes across varying network widths, allowing direct learning-rate transfer.
- Exponential moving average (EMA) iterate-averaging of the model parameters reduces noise from large learning rates and normalized updates.
- On Transformer benchmarks spanning language modeling, image classification, and diffusion models, SPlus reaches Adam's validation loss in about 44% of the optimizer steps and 62% of the wall-clock time, consistently outperforming Shampoo and SOAP (Frans et al., 8 Jun 2025).
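A rough single-matrix sketch combining the three ingredients above (sign normalization in the current eigenbasis, a width-dependent scale, and parameter EMA). The particular scale choice, the momentum handling, and the names are simplifications, not the exact SPlus recipe.

```python
import torch

def splus_like_step(W, G, QL, QR, W_ema, lr=1e-3, ema_decay=0.999):
    # Instant sign normalization in the existing eigenbasis.
    update = QL @ torch.sign(QL.T @ G @ QR) @ QR.T
    # Shape-aware scaling (illustrative choice) so the update magnitude does not
    # grow with layer width, allowing learning-rate transfer across widths.
    scale = 1.0 / max(W.shape)
    W = W - lr * scale * update
    # EMA iterate averaging of the parameters to damp noise from the large,
    # sign-normalized updates.
    W_ema = ema_decay * W_ema + (1 - ema_decay) * W
    return W, W_ema
```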
Sidebar: Speculative Note
Claims about the future pervasiveness of “universal” optimizers—minimally tuned, theoretically optimal methods—are plausible but not explicitly evidenced in the source literature and are thus presented as informed speculation.
Applications and Empirical Results
Shampoo and its variants are now established choices in:
- Convolutional networks: faster convergence and improved generalization compared to first-order methods on ResNet and Inception architectures for CIFAR-10/100 (Gupta et al., 2018, Shi et al., 2023).
- Transformer models and large-batch tasks: enhanced convergence and sample efficiency in attention-based language modeling, with robust distributed implementations made possible via DTensor structures (Shi et al., 2023, Vyas et al., 17 Sep 2024).
- Memory-constrained scenarios: 4-bit Shampoo provides nearly full-fidelity second-order optimization on vision and transformer benchmarks with a significant reduction in memory overhead, enabling practical deployment in resource-limited environments (Wang et al., 28 May 2024).
- Robust, adaptive pipelines: recent variants such as SOAP and SPlus demonstrate that, with fewer heuristics and minimal hyperparameter tuning, step and wall-clock efficiency can be substantially improved beyond both standard Shampoo and Adam (Vyas et al., 17 Sep 2024, Frans et al., 8 Jun 2025).
| Variant | Compute Overhead | Memory Overhead | Tuning Needs | Notable Results | Source |
|---|---|---|---|---|---|
| Shampoo | Moderate | Moderate | Step size, exponents | Faster loss reduction than Adam/SGD | (Gupta et al., 2018) |
| Distributed Shampoo (DTensor) | +8–10% | As above | Grafting, precondition frequency | 1.35× wall-time reduction on ImageNet | (Shi et al., 2023) |
| 4-bit Shampoo | +10% | +1–12% vs. AdamW | Quantization type | 78.63% acc. vs. 79.34% (32-bit) on CIFAR-100 | (Wang et al., 28 May 2024) |
| SOAP | +0–10% | As above | Only eigenbasis frequency | ≥35% wall-time saving on LM pretraining | (Vyas et al., 17 Sep 2024) |
| SPlus | Slightly lower | As above | Few (width scaling) | 44% steps-to-Adam on average (3 tasks) | (Frans et al., 8 Jun 2025) |
Emerging Trends and Future Directions
Decoupled and Adaptive Update Schedules
The most recent works decouple the schedules for updating the eigenbasis and eigenvalues of the preconditioner, introducing adaptive criteria based on explicit error metrics. For example, the eigenbasis is only recomputed when the Frobenius norm of the off-diagonal part of the projected preconditioner exceeds a threshold relative to the total norm. This error-driven scheduling reduces computation and increases robustness, without requiring manual tuning of update frequencies (Eschenhagen et al., 4 Jun 2025).
Compression and Scalability
By combining eigenvector quantization and block-wise preconditioners, current methods achieve substantial reductions in optimizer-state memory (roughly 7× for the preconditioners in 4-bit Shampoo) with negligible loss in optimizer fidelity. This enables second-order preconditioning on problems and hardware configurations where it was previously infeasible (Wang et al., 28 May 2024).
Alignment of Theory and Practice
Ongoing research seeks to minimize heuristic components (e.g., learning rate grafting) or to justify them rigorously, aligning optimizer theory with practical deployment. Strategies such as instant sign normalization, eigenvalue correction, and per-layer adaptive update schedules are moving the field toward genuinely plug-and-play second-order methods (Vyas et al., 17 Sep 2024, Eschenhagen et al., 4 Jun 2025, Frans et al., 8 Jun 2025).
| Challenge/Trend | Present Solution(s) | Open Questions |
|---|---|---|
| Preconditioner scalability | Kronecker factorization, block structure, quantization | Further compression, faster matrix roots |
| Hyperparameter complexity | Grafting, shape-aware scaling, SOAP/SPlus | Universally robust defaults |
| Stability at scale | SPlus instant sign, adaptive updates, EMA | Theoretical bounds for non-convex settings |
| Generalization beyond matrices | Primarily for 2D parameters in transformers/CNNs | Extension to non-matrix tensor structures |
Concluding Remarks
The Shampoo family of algorithms exemplifies the maturation of second-order optimization in deep learning: it spans from theoretically motivated, structure-aware preconditioners to robust, memory-efficient, and adaptively scheduled variants suitable for state-of-the-art large-scale networks. Recent developments stress adaptive frequency control, quantization for efficient memory use, and the systematic removal of tuning heuristics. Active research continues on making such optimizers more broadly applicable, theoretically justified, and computationally economical, with openly available implementations supporting their integration into modern deep learning workflows (Gupta et al., 2018, Shi et al., 2023, Wang et al., 28 May 2024, Vyas et al., 17 Sep 2024, Eschenhagen et al., 4 Jun 2025, Frans et al., 8 Jun 2025).
References
- (Gupta et al., 2018) Shampoo: Preconditioned Stochastic Tensor Optimization
- (Shi et al., 2023) A Distributed Data-Parallel PyTorch Implementation of the Distributed Shampoo Optimizer for Training Neural Networks At-Scale
- (Wang et al., 28 May 2024) 4-bit Shampoo for Memory-Efficient Network Training
- (Vyas et al., 17 Sep 2024) SOAP: Improving and Stabilizing Shampoo using Adam
- (Eschenhagen et al., 4 Jun 2025) Purifying Shampoo: Investigating Shampoo's Heuristics by Decomposing its Preconditioner
- (Frans et al., 8 Jun 2025) A Stable Whitening Optimizer for Efficient Neural Network Training
Speculative Note
The literature points to growing convergence on block-structured, adaptive second-order optimizers as practical defaults for deep learning, but fully “universal” optimizers (requiring minimal tuning, with strong guarantees across architectures) remain a research aspiration rather than a proven fact at this time.