Shampoo Family of Algorithms

Last updated: June 11, 2025

Significance and Background

The Shampoo family of algorithms represents a scalable approach to second-order optimization in deep learning, providing a middle ground between purely diagonal adaptive methods (such as Adam) and computationally expensive full-matrix preconditioners (Gupta et al., 2018). Shampoo achieves this by exploiting the tensor structure of parameter arrays, maintaining a separate matrix preconditioner for each dimension of the tensor. This design enables rich, curvature-aware updates with storage and computation requirements that are practical for large-scale models (Gupta et al., 2018).

Foundational Concepts

Structure-Aware Kronecker-Factored Preconditioning

Shampoo’s approach centers on Kronecker-factored preconditioners. For a parameter tensor $W \in \mathbb{R}^{n_1 \times \dots \times n_k}$, the optimizer maintains a mode-specific preconditioner $H^{(i)}$ for each dimension $i$. The updates are:

  • For order-2 tensors (matrices), as sketched in the code after this list:
    • Preconditioners: $L_t = L_{t-1} + G_t G_t^\top$, $R_t = R_{t-1} + G_t^\top G_t$
    • Update: $W_{t+1} = W_t - \eta\, L_t^{-1/4} G_t R_t^{-1/4}$
  • For higher-order tensors:
    • Each $H_t^{(i)}$ is updated through appropriate contractions of the gradient tensor, and the preconditioned gradient is constructed by tensor-matrix products along each mode with the inverse root of $H_t^{(i)}$ (Gupta et al., 2018).
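
To make the order-2 case concrete, here is a minimal NumPy sketch of the matrix update above. The class, helper names, and epsilon handling are illustrative assumptions, not a reference implementation.

```python
import numpy as np

def inverse_fourth_root(mat, eps=1e-6):
    """Compute mat^{-1/4} for a symmetric PSD matrix via eigendecomposition."""
    eigvals, eigvecs = np.linalg.eigh(mat)
    eigvals = np.maximum(eigvals, eps)            # guard against tiny/zero eigenvalues
    return (eigvecs * eigvals ** -0.25) @ eigvecs.T

class MatrixShampoo:
    """Sketch of Shampoo for a single weight matrix W of shape (m, n)."""

    def __init__(self, shape, lr=1e-3, eps=1e-6):
        m, n = shape
        self.L = eps * np.eye(m)                  # left (row-space) statistics
        self.R = eps * np.eye(n)                  # right (column-space) statistics
        self.lr = lr

    def step(self, W, grad):
        # Accumulate second-moment statistics along each mode of the gradient.
        self.L += grad @ grad.T
        self.R += grad.T @ grad
        # Precondition with inverse fourth roots on both sides, then take a step.
        precond_grad = inverse_fourth_root(self.L) @ grad @ inverse_fourth_root(self.R)
        return W - self.lr * precond_grad
```

In this simplified setting, a loop of the form `opt = MatrixShampoo(W.shape)` followed by `W = opt.step(W, grad)` per iteration is all that is required.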

Convergence Guarantees

Shampoo achieves $O(\sqrt{T})$ regret bounds in online convex optimization, matching the best first-order adaptive preconditioned methods, but with lower computational cost than full-matrix AdaGrad. These guarantees are grounded in matrix trace inequalities and hold for convex loss functions under mild gradient assumptions (Gupta et al., 2018).

Key Algorithmic and Empirical Developments

Distributed Shampoo and Large-Scale Performance

Distributed Shampoo adapts the optimizer to multi-GPU and large-model settings through block-diagonal preconditioners and block-wise assignment using DTensor (sharded) structures. Parameter blocks are distributed across GPUs, and an AllGather primitive ensures synchronized parameter updates (Shi et al., 2023).

Key empirical findings:

  • Distributed Shampoo introduces only an 8–10% wall-clock overhead per iteration compared to first-order optimizers, despite the additional matrix operations.
  • When training ImageNet-scale models (e.g., ResNet50), Shampoo achieves a 1.5× reduction in optimizer steps and a 1.35× reduction in wall time over SGD, without sacrificing accuracy.
  • Amortizing root-inverse computations by updating them every 50 steps (instead of every iteration) incurs negligible accuracy loss (Shi et al., 2023), as sketched after this list.
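
As a rough illustration of the amortization described above, the sketch below refreshes the inverse-root factors only every 50 steps and reuses the stale factors in between. It builds on the `MatrixShampoo` sketch from the previous section; the attribute names and the refresh constant are assumptions.

```python
PRECONDITION_FREQUENCY = 50  # illustrative refresh interval

class AmortizedMatrixShampoo(MatrixShampoo):
    """Variant of the earlier sketch that recomputes matrix roots only occasionally."""

    def step(self, W, grad, t):
        self.L += grad @ grad.T
        self.R += grad.T @ grad
        if t % PRECONDITION_FREQUENCY == 0 or not hasattr(self, "L_root"):
            # The expensive eigendecompositions happen only every 50 steps.
            self.L_root = inverse_fourth_root(self.L)
            self.R_root = inverse_fourth_root(self.R)
        return W - self.lr * self.L_root @ grad @ self.R_root
```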

Learning Rate Grafting

Due to differences in preconditioned step norms compared to Adam or SGD, practical Shampoo implementations often employ learning rate grafting: at each layer, the Shampoo update is rescaled to match the Frobenius norm of a reference optimizer’s step (typically Adam or SGD). This alignment aids stable training and facilitates the reuse of existing learning-rate schedules (Shi et al., 2023; Eschenhagen et al., 4 Jun 2025).
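
A minimal sketch of the per-layer grafting step, assuming NumPy arrays for the two candidate updates (the function name and epsilon are illustrative):

```python
import numpy as np

def graft(shampoo_update, reference_update, eps=1e-12):
    """Rescale the Shampoo direction to the Frobenius norm of a reference step."""
    ref_norm = np.linalg.norm(reference_update)            # e.g. the Adam or SGD step
    shampoo_norm = np.linalg.norm(shampoo_update) + eps     # avoid division by zero
    return shampoo_update * (ref_norm / shampoo_norm)
```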

Memory-Efficient Optimization: 4-bit Shampoo

Preconditioner storage can become a bottleneck for very large models. 4-bit Shampoo addresses this by quantizing the eigenvector matrices of the preconditioners to 4 bits while maintaining the eigenvalues at full precision. This approach avoids the significant errors (especially under inverse roots) that result from directly quantizing the preconditioner matrix itself. Orthogonality rectification, such as Björck orthonormalization, is applied post-quantization to ensure numerical stability.
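
The following hedged sketch illustrates the two ingredients described above: a simple symmetric 4-bit quantizer for an eigenvector matrix and a single Björck orthonormalization pass applied after dequantization. The linear quantizer and the one-iteration rectification are simplifying assumptions, not the exact scheme of the paper.

```python
import numpy as np

def quantize_4bit(Q):
    """Symmetric linear quantization of an eigenvector matrix to 4-bit integer codes."""
    scale = np.abs(Q).max() / 7.0                             # codes span roughly [-7, 7]
    codes = np.clip(np.round(Q / scale), -8, 7).astype(np.int8)
    return codes, scale

def dequantize_and_rectify(codes, scale, iters=1):
    """Dequantize, then re-orthonormalize with Björck steps: Q <- Q (3I - QᵀQ) / 2."""
    Q = codes.astype(np.float32) * scale
    for _ in range(iters):
        Q = Q @ (1.5 * np.eye(Q.shape[1], dtype=Q.dtype) - 0.5 * (Q.T @ Q))
    return Q
```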

Hybrid and Refined Algorithms: SOAP, Purifying Shampoo, and SPlus

SOAP (ShampoO with Adam in Preconditioner's Eigenbasis)

SOAP establishes a formal connection between Shampoo preconditioning and Adam-family updates. Specifically, it shows that applying AdamW in the slowly evolving eigenbasis of the Shampoo preconditioner is functionally equivalent (under appropriate exponent choices) to a variant of Shampoo (Vyas et al., 17 Sep 2024). The main algorithmic features are:

  • AdamW is run on gradients rotated into the current Shampoo eigenbasis, with the eigenbasis itself updated only every $f$ steps (the sole extra hyperparameter beyond standard Adam settings); a simplified sketch follows this list.
  • Unlike Shampoo, which holds moments fixed between basis updates, SOAP continually updates the Adam moments in the current basis.
  • On large-scale language-model pretraining (360M–660M parameters), SOAP delivered an over-40% reduction in optimizer steps and a 35% reduction in wall-clock time compared to AdamW, with an additional ~20% improvement over Shampoo (Vyas et al., 17 Sep 2024).
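
The single-matrix sketch below captures the structure described in this list: Adam-style moments maintained on gradients rotated into a Shampoo eigenbasis that is refreshed only every `f` steps. Weight decay, bias correction, moment rotation on basis changes, and other production details are omitted, and all names and defaults are illustrative assumptions.

```python
import numpy as np

class SOAPSketch:
    def __init__(self, shape, lr=3e-4, betas=(0.9, 0.999), eps=1e-8, f=10):
        m, n = shape
        self.L, self.R = np.zeros((m, m)), np.zeros((n, n))    # Shampoo statistics
        self.QL, self.QR = np.eye(m), np.eye(n)                # current eigenbases
        self.m1, self.m2 = np.zeros(shape), np.zeros(shape)    # Adam moments (rotated)
        self.lr, self.betas, self.eps, self.f = lr, betas, eps, f

    def step(self, W, grad, t):
        self.L += grad @ grad.T
        self.R += grad.T @ grad
        if t % self.f == 0:
            # Refresh the slowly evolving eigenbasis only every f steps.
            _, self.QL = np.linalg.eigh(self.L)
            _, self.QR = np.linalg.eigh(self.R)
        g_rot = self.QL.T @ grad @ self.QR                     # rotate into the eigenbasis
        b1, b2 = self.betas
        self.m1 = b1 * self.m1 + (1 - b1) * g_rot
        self.m2 = b2 * self.m2 + (1 - b2) * g_rot ** 2
        step_rot = self.m1 / (np.sqrt(self.m2) + self.eps)     # Adam step in that basis
        return W - self.lr * self.QL @ step_rot @ self.QR.T    # rotate back and apply
```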

Purifying Shampoo

Recent analysis has shown that two heuristics—learning rate grafting and stale preconditioning—are crucial for practical Shampoo performance at scale, even though they lack theoretical justification (Eschenhagen et al., 4 Jun 2025). Key observations:

  • Learning rate grafting corrects for mis-scaled or stale eigenvalues, which would otherwise lead to incorrect update magnitudes.
  • Decoupling the frequencies of eigenvalue and eigenbasis updates reveals that if eigenvalues are updated at every step (even when the basis is updated infrequently), learning rate grafting becomes unnecessary.
  • An adaptive criterion for eigenbasis updates, based on the relative Frobenius norm of off-diagonal terms in the rotated preconditioner, enables efficient and principled update scheduling without predefined frequencies. This approach achieves computational savings without loss of convergence speed (Eschenhagen et al., 4 Jun 2025); a sketch of the criterion follows this list.
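
A hedged sketch of such a criterion: the eigenbasis is marked stale when the off-diagonal mass of the preconditioner, expressed in the current basis, exceeds a relative threshold. The function name and threshold value are illustrative assumptions.

```python
import numpy as np

def eigenbasis_is_stale(precond, Q, tol=0.1):
    """Compare the Frobenius norm of off-diagonal terms of Qᵀ H Q to the whole matrix."""
    rotated = Q.T @ precond @ Q
    off_diag = rotated - np.diag(np.diag(rotated))
    return np.linalg.norm(off_diag) > tol * np.linalg.norm(rotated)
```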

SPlus: A Stable Whitening Optimizer

SPlus extends the Shampoo framework, resolving key issues related to instability, learning-rate scaling across network width, and high parameter noise.

  • Instead of normalizing with stale historical eigenvalues, SPlus uses instant sign normalization in the existing eigenbasis:

$$U = Q_L^T\, \mathrm{sign}(Q_L G Q_R)\, Q_R^T \times \frac{2}{m+n}$$

This bounds the spectral norm of the update, preventing divergence due to stale normalization factors (Frans et al., 8 Jun 2025).
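
A direct transcription of the formula above into NumPy, with variable names as assumptions:

```python
import numpy as np

def splus_update(grad, QL, QR):
    """Mirror of the displayed formula: U = QLᵀ sign(QL G QR) QRᵀ · 2/(m+n)."""
    m, n = grad.shape
    return QL.T @ np.sign(QL @ grad @ QR) @ QR.T * (2.0 / (m + n))
```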

Sidebar: Speculative Note

Claims about the future pervasiveness of “universal” optimizers—minimally tuned, theoretically optimal methods—are plausible but not explicitly evidenced in the source literature and are thus presented as informed speculation.

Applications and Empirical Results

Shampoo and its variants are now established choices for large-scale training; the table below summarizes their overheads, tuning needs, and notable results:

| Variant | Compute Overhead | Memory Overhead | Tuning Needs | Notable Results | Source |
|---|---|---|---|---|---|
| Shampoo | Moderate | Moderate | Step size, exponents | Faster loss reduction than Adam/SGD | (Gupta et al., 2018) |
| Distributed Shampoo (DTensor) | +8–10% | As above | Grafting, preconditioner frequency | 1.35× wall-time reduction on ImageNet | (Shi et al., 2023) |
| 4-bit Shampoo | +10% | 1–12% vs. AdamW | Quantization type | 78.63% acc. vs. 79.34% (32-bit) on CIFAR-100 | (Wang et al., 28 May 2024) |
| SOAP | +0–10% | As above | Only eigenbasis frequency | ≥35% wall-time saving on LM pretraining | (Vyas et al., 17 Sep 2024) |
| SPlus | Slightly lower | As above | Few (width scaling) | 44% steps-to-Adam on average (3 tasks) | (Frans et al., 8 Jun 2025) |

Emerging Trends and Future Directions

Decoupled and Adaptive Update Schedules

The most recent works decouple the schedules for updating the eigenbasis and eigenvalues of the preconditioner, introducing adaptive criteria based on explicit error metrics. For example, an eigenbasis is only recomputed when the off-diagonal Frobenius norm in the projected preconditioner exceeds a threshold relative to the total norm. This error-driven scheduling reduces computation and increases robustness, without requiring manual tuning of update frequencies (Eschenhagen et al., 4 Jun 2025).
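
As a toy illustration of error-driven scheduling, the loop below reuses the `eigenbasis_is_stale` check sketched earlier and recomputes the basis only when the criterion fires; random gradients stand in for a real model, and all numbers are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 64, 32
L, QL = 1e-6 * np.eye(m), np.eye(m)
refreshes = 0
for t in range(1000):
    grad = rng.standard_normal((m, n))          # placeholder for a real gradient
    L += grad @ grad.T
    if eigenbasis_is_stale(L, QL, tol=0.1):
        _, QL = np.linalg.eigh(L)               # recompute the basis only when needed
        refreshes += 1
print(f"eigenbasis recomputed {refreshes} times over 1000 steps")
```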

Compression and Scalability

By combining eigenvector quantization and block-wise preconditioners, current methods achieve orders-of-magnitude reductions in memory with negligible loss in optimizer fidelity. This enables second-order preconditioning on problems and hardware configurations where it was previously infeasible (Wang et al., 28 May 2024).

Alignment of Theory and Practice

Ongoing research seeks to minimize heuristic components (e.g., learning rate grafting) or to justify them rigorously, aligning optimizer theory with practical deployment. Strategies such as instant sign normalization, eigenvalue correction, and per-layer adaptive update schedules are moving the field toward genuinely plug-and-play second-order methods (Vyas et al., 17 Sep 2024; Eschenhagen et al., 4 Jun 2025; Frans et al., 8 Jun 2025).

| Challenge/Trend | Present Solution(s) | Open Questions |
|---|---|---|
| Preconditioner scalability | Kronecker factorization, block structure, quantization | Further compression, faster matrix roots |
| Hyperparameter complexity | Grafting, shape-aware scaling, SOAP/SPlus | Universally robust defaults |
| Stability at scale | SPlus instant sign, adaptive updates, EMA | Theoretical bounds for non-convex settings |
| Generalization beyond matrices | Primarily for 2D parameters in transformers/CNNs | Extension to non-matrix tensor structures |

Concluding Remarks

The Shampoo family of algorithms exemplifies the maturation of second-order optimization in deep learning: it spans from theoretically motivated, structure-aware preconditioners to robust, memory-efficient, and adaptively scheduled variants suitable for state-of-the-art large-scale networks. Recent developments stress adaptive frequency control, quantization for efficient memory use, and the systematic removal of tuning heuristics. Active research continues on making such optimizers more broadly applicable, theoretically justified, and computationally economical, with openly available implementations supporting their integration into modern deep learning workflows (Gupta et al., 2018; Shi et al., 2023; Wang et al., 28 May 2024; Vyas et al., 17 Sep 2024; Eschenhagen et al., 4 Jun 2025; Frans et al., 8 Jun 2025).


Speculative Note

The literature points to growing convergence on block-structured, adaptive second-order optimizers as practical defaults for deep learning, but fully “universal” optimizers—requiring minimal tuning and with strong guarantees across architectures—remain a research aspiration rather than a proven fact at this time.