
Distributed Shampoo Optimizer

Updated 31 December 2025
  • Distributed Shampoo is an optimizer for deep learning that employs block-diagonal preconditioning with Kronecker approximations to efficiently mimic full-matrix AdaGrad.
  • It leverages DTensor-based distributed data parallelism to shard preconditioner blocks across GPUs, reducing memory load and computational cost.
  • Empirical evaluations on ResNet-50/ImageNet show that Distributed Shampoo improves convergence and accuracy with modest per-step overhead.

Distributed Shampoo is an optimizer for large-scale neural network training that belongs to the AdaGrad family. It employs a block-diagonal preconditioner with each block constructed as a coarse Kronecker product approximation of full-matrix AdaGrad, balancing the trade-offs between memory, computational cost, and statistical effectiveness. Distributed Shampoo leverages advanced PyTorch primitives, specifically the DTensor infrastructure and distributed data-parallelism, enabling efficient multi-GPU training with minimal per-step overhead compared to diagonal adaptive methods, while delivering improved convergence and accuracy, as demonstrated in empirical studies on the ImageNet benchmark using ResNet-50 (Shi et al., 2023).

1. Algorithmic Foundations and Kronecker Preconditioning

Distributed Shampoo positions itself between diagonal AdaGrad and full-matrix AdaGrad by constructing a block-diagonal preconditioner, with each block corresponding to a parameter tensor. For each block (parameter), let $W^{(i)} \in \mathbb{R}^{d_i \times d_{i-1}}$. The full-matrix AdaGrad preconditioner accumulates $A_t^{(i)} = \sum_{s=0}^t \operatorname{vec}(G_s^{(i)}) \operatorname{vec}(G_s^{(i)})^T$, incurring $\mathcal{O}(d_i^2 d_{i-1}^2)$ memory and $\mathcal{O}(d_i^3 d_{i-1}^3)$ computation per update, which is prohibitive for large-scale models.

Shampoo approximates each $A_t^{(i)}$ by constructing Kronecker factors from the gradients $G_s^{(i)}$ of $W^{(i)}$,
$$L_t^{(i)} = \sum_{s=0}^t G_s^{(i)} [G_s^{(i)}]^T + \epsilon I_{d_i}, \qquad R_t^{(i)} = \sum_{s=0}^t [G_s^{(i)}]^T G_s^{(i)} + \epsilon I_{d_{i-1}},$$
and uses the approximation

$$A_t^{(i)} \approx \bar{A}_t^{(i)} = [L_t^{(i)}]^{1/2} \otimes [R_t^{(i)}]^{1/2},$$

which allows the per-block update

$$W_{t+1}^{(i)} = W_t^{(i)} - \alpha_t\, [L_t^{(i)}]^{-1/4}\, G_t^{(i)}\, [R_t^{(i)}]^{-1/4},$$

with the global update $w_{t+1} = w_t - \alpha_t \bar{A}_t^{-1/2} g_t$, where $\bar{A}_t$ is the block-diagonal concatenation of all $\bar{A}_t^{(i)}$. This structure substantially reduces resource usage while retaining much of the improved conditioning of full-matrix AdaGrad.
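
To make the per-block recursion concrete, the following is a minimal sketch of a single-block Shampoo step in plain PyTorch, assuming a 2D parameter held as an ordinary tensor (no autograd bookkeeping, distribution, or grafting); here the $\epsilon$ regularization is applied as an eigenvalue floor inside the root computation rather than as an explicit $\epsilon I$ term.

```python
# Minimal single-block Shampoo update sketch (illustrative, not the production code).
import torch

def matrix_inverse_root(A: torch.Tensor, root: int, eps: float = 1e-12) -> torch.Tensor:
    """Compute A^(-1/root) for a symmetric PSD matrix via eigendecomposition."""
    eigvals, eigvecs = torch.linalg.eigh(A)
    eigvals = torch.clamp(eigvals, min=eps)          # floor tiny/negative eigenvalues
    return eigvecs @ torch.diag(eigvals.pow(-1.0 / root)) @ eigvecs.T

def shampoo_block_step(W, G, L, R, lr: float, eps: float = 1e-12):
    """One Shampoo update for a d_i x d_{i-1} block: accumulate the Kronecker
    factors L, R and precondition the gradient on both sides."""
    L += G @ G.T                                     # L_t = sum_s G_s G_s^T
    R += G.T @ G                                     # R_t = sum_s G_s^T G_s
    L_inv_quarter = matrix_inverse_root(L, root=4, eps=eps)
    R_inv_quarter = matrix_inverse_root(R, root=4, eps=eps)
    W -= lr * (L_inv_quarter @ G @ R_inv_quarter)    # W_{t+1} = W_t - lr * L^{-1/4} G R^{-1/4}
    return W, L, R
```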

A key enhancement is learning-rate grafting: Shampoo inherits the norm of a diagonal method's update (e.g., AdaGrad), rescaling the search direction for each block as

$$P_{t,\text{final}}^{(i)} = -\,\| P_{t,\text{graft}}^{(i)} \|_F \,\frac{P_{t,\text{Shampoo}}^{(i)}}{\| P_{t,\text{Shampoo}}^{(i)} \|_F},$$

facilitating reuse of established learning-rate schedules.
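
As a small illustration, grafting reduces to a norm rescaling once both directions are available; the helper below is a hedged sketch (names are illustrative, and the descent sign is assumed to be applied in the parameter update as in the formula above).

```python
import torch

def graft_direction(p_shampoo: torch.Tensor, p_graft: torch.Tensor,
                    eps: float = 1e-16) -> torch.Tensor:
    """Rescale the Shampoo direction to carry the Frobenius norm of the grafted
    (diagonal) method's update; the sign convention is handled by the caller."""
    shampoo_norm = torch.linalg.norm(p_shampoo)   # Frobenius norm for matrices
    grafted_norm = torch.linalg.norm(p_graft)
    return p_shampoo * (grafted_norm / (shampoo_norm + eps))
```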

2. Distributed Data Parallelism and DTensor Sharding

A naïve distributed implementation of Shampoo would replicate all Kronecker-factor state and perform the expensive matrix root inversions on every GPU, incurring substantial slowdowns (50–75% relative to diagonal methods). Distributed Shampoo instead exploits PyTorch's DTensor interface to shard these preconditioners.

The DTensor-based strategy partitions the set of block preconditioners $\{L^{(i)}, R^{(i)}\}$ across $J$ GPUs using a greedy load-balancing algorithm. Each GPU (or process group) manages only its allocated subset, reducing per-GPU optimizer memory by roughly a factor of $J$ and localizing computation. Each GPU computes inverse roots for its blocks and applies them to the corresponding gradients, accumulating partial search directions.
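
The exact balancing heuristic is implementation-specific; the sketch below shows one standard greedy scheme consistent with the description above, assigning the largest blocks first to the currently least-loaded rank, assuming a per-block cost estimate such as the cubic root-inverse cost (names are illustrative).

```python
import heapq

def greedy_assign(block_costs, num_gpus):
    """Greedy load balancing: process blocks in descending cost order and
    assign each one to the rank with the smallest accumulated load."""
    heap = [(0.0, rank) for rank in range(num_gpus)]   # (accumulated load, rank)
    heapq.heapify(heap)
    assignment = {}
    for idx, cost in sorted(enumerate(block_costs), key=lambda x: -x[1]):
        load, rank = heapq.heappop(heap)
        assignment[idx] = rank
        heapq.heappush(heap, (load + cost, rank))
    return assignment

# Example: costs proportional to d^3 for four blocks, split across 2 GPUs.
# greedy_assign([2048**3, 1024**3, 512**3, 512**3], num_gpus=2)
```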

A single 1D int8 buffer aggregates local search directions across GPUs. The AllGather primitive then distributes the complete set $\{P_t^{(i)}\}_{i=1}^n$ to all participants, synchronizing updates efficiently.

To further balance computation and communication, multi-group hierarchies can be created: preconditioners are replicated across $Q_G$-sized subgroups, and AllGather operations become localized to these subgroups, minimizing congestion while maintaining synchronous parameter updates.
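
A hedged sketch of the gather step, assuming an already-initialized torch.distributed process group and equal-sized local buffers across ranks (the production implementation additionally handles per-block metadata and unequal shard sizes):

```python
import torch
import torch.distributed as dist

def allgather_search_directions(local_flat: torch.Tensor, group=None) -> torch.Tensor:
    """Gather every rank's flattened local search directions into one buffer.
    `local_flat` holds this rank's preconditioned blocks, flattened and
    concatenated; `group` may be a subgroup created with dist.new_group."""
    world = dist.get_world_size(group=group)
    out = torch.empty(world * local_flat.numel(),
                      dtype=local_flat.dtype, device=local_flat.device)
    dist.all_gather_into_tensor(out, local_flat, group=group)
    return out  # the caller slices this buffer back into per-block updates
```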

3. Performance Optimizations and Overhead Assessment

Distributed Shampoo incorporates several system-level optimizations:

  • Periodic root-inverse computation: The most expensive operation, computing matrix fourth-root inverses, is amortized by updating them only every $f$ steps ("stale roots"); e.g., $f=50$ preserves accuracy for ResNet-50/ImageNet while incurring only 5–8% wall-clock overhead.
  • Dimension and block-size heuristics: Tensors with any dimension above $D_{\max}$ are either blocked into $b \times b$ patches, diagonalized, or handled by falling back to diagonal AdaGrad. Typical settings are $D_{\max}=2048$ and $b\in\{1024, 2048\}$.
  • Fused elementwise operations: PyTorch's _foreach operators fuse elementwise work such as $\beta$-weighted accumulator updates and weight decay.
  • Guarded eigendecomposition: A retry in double precision mitigates rare decomposition failures; a minimal sketch combining this guard with the periodic root updates follows this list.
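
A minimal sketch of the stale-root schedule combined with the double-precision retry, assuming plain tensors (the function name and caching scheme are illustrative, not the library's API):

```python
import torch

def maybe_update_root_inverse(factor: torch.Tensor, cached_root_inv, step: int,
                              precondition_frequency: int = 50,
                              root: int = 4, eps: float = 1e-12) -> torch.Tensor:
    """Recompute the inverse root only every `precondition_frequency` steps
    ("stale roots"); otherwise reuse the cached value."""
    if step % precondition_frequency != 0 and cached_root_inv is not None:
        return cached_root_inv
    try:
        eigvals, eigvecs = torch.linalg.eigh(factor)
    except RuntimeError:
        # Guarded eigendecomposition: retry the rare failures in double precision.
        eigvals, eigvecs = torch.linalg.eigh(factor.double())
        eigvals, eigvecs = eigvals.to(factor.dtype), eigvecs.to(factor.dtype)
    eigvals = torch.clamp(eigvals, min=eps)
    return eigvecs @ torch.diag(eigvals.pow(-1.0 / root)) @ eigvecs.T
```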

Measured benchmarks demonstrate that, on 8×V100 GPUs with batch size 128 per GPU, per-step overhead is 8–10% compared to SGD-Nesterov at $f=50$, dropping below 2% for $f=100$ without significant loss in final accuracy.

4. Resource Complexity: Memory and Computation

For a single $d_i \times d_{i-1}$ parameter block, the complexity comparison is as follows:

| Optimizer | Memory | Per-Step Compute |
|---|---|---|
| Full-matrix AdaGrad | $\mathcal{O}(d_i^2 d_{i-1}^2)$ | $\mathcal{O}(d_i^3 d_{i-1}^3)$ |
| Diagonal AdaGrad | $\mathcal{O}(d_i d_{i-1})$ | $\mathcal{O}(d_i d_{i-1})$ |
| Shampoo | $\mathcal{O}(d_i^2 + d_{i-1}^2)$ | $\mathcal{O}(d_i^3 + d_{i-1}^3 + d_i d_{i-1}(d_i^{1/2} + d_{i-1}^{1/2}))$ |

In aggregate, Shampoo's total optimizer memory is roughly 4–7× the size of the model parameters, which is tractable in contrast to full-matrix AdaGrad, and its computational cost grows cubically in the block dimensions (due to matrix roots) yet remains well below that of full-matrix alternatives.
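
To make the table concrete, a back-of-the-envelope comparison for a hypothetical 2048×1024 weight block with fp32 optimizer state (figures are illustrative):

```python
# Worked example for a hypothetical 2048 x 1024 block (fp32, 4 bytes/entry).
d_i, d_im1, bytes_per = 2048, 1024, 4

full_matrix = (d_i * d_im1) ** 2 * bytes_per      # ~17.6 TB: infeasible
diagonal    = d_i * d_im1 * bytes_per             # ~8.4 MB
shampoo     = (d_i**2 + d_im1**2) * bytes_per     # ~21.0 MB for the L and R factors

print(f"full-matrix: {full_matrix/1e12:.1f} TB, diagonal: {diagonal/1e6:.1f} MB, "
      f"Shampoo factors: {shampoo/1e6:.1f} MB")
```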

5. Empirical Evaluation: ResNet-50/ImageNet Ablations

Comprehensive ablation experiments were conducted on ImageNet (1k classes) with ResNet-50 (25.5M parameters) on 8×V100 GPUs (batch size 128 per GPU, 1024 total), using a cosine-decay learning-rate schedule with a 5-epoch warmup.

  • Fixed 90-epoch budget: Shampoo achieves 77.44% Top-1 accuracy compared to 76.85% for SGD-Nesterov, with only +8% wall-clock overhead.
  • “Equal-time” comparison: Shampoo at 60 epochs matches SGD-Nesterov's 76.9% accuracy at 90 epochs, yielding a 1.35× time savings and requiring 1.5× fewer steps for convergence.
  • Learning-rate sensitivity: Shampoo exhibits superior robustness and consistency in accuracy across a 10× range of base learning rates, outperforming SGD-Nesterov in both accuracy and variance.

A representative subset of epoch sweeps:

| Method / Epochs | 40 | 60 | 80 | 90 |
|---|---|---|---|---|
| SGD-Nesterov | 75.2% | 76.1% | 76.6% | 76.9% |
| Shampoo | 76.4% | 77.2% | 77.3% | 77.4% |

In this setup, Shampoo reaches the 90-epoch SGD-Nesterov accuracy after only 60 epochs.

6. Deployment Recommendations

The following guidelines are advised for scalable deployment of Distributed Shampoo (a configuration sketch follows the list):

  • Utilize DTensor for state sharding across GPUs (use_dtensor=True).
  • Set max_preconditioner_dim=2048 to limit block size, and apply blocking and merging of dimensions as needed.
  • Update matrix roots every 50 steps (precondition_frequency=50) for efficiency.
  • Incorporate learning-rate grafting from SGD or Adam for schedule reuse.
  • Enable decoupled weight decay (use_decoupled_weight_decay=True) and bias correction for regularization.
  • Combine with momentum/Nesterov acceleration (momentum=0.9, use_nesterov=True).
  • Use fused elementwise operations (_foreach) and guarded eigendecomposition for kernel reliability.
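
A hypothetical configuration sketch assembling the flags quoted above; the actual DistributedShampoo constructor may group these options differently (e.g., into grafting or distributed-config objects), and the lr and weight_decay values are placeholders, so treat this purely as illustrative.

```python
# Flag names mirror those quoted in this section; grouping and defaults in the
# real implementation may differ. lr/weight_decay are placeholder values.
shampoo_kwargs = dict(
    lr=1e-3,                          # reuse a grafted SGD/Adam schedule here
    momentum=0.9,
    use_nesterov=True,                # momentum/Nesterov acceleration
    max_preconditioner_dim=2048,      # block large tensors
    precondition_frequency=50,        # stale-root interval f
    use_decoupled_weight_decay=True,
    weight_decay=1e-4,                # placeholder
    use_dtensor=True,                 # shard optimizer state via DTensor
)
# optimizer = DistributedShampoo(model.parameters(), **shampoo_kwargs)
```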

Using these practices, Distributed Shampoo delivers improved convergence speed and variance control over diagonal adaptive methods, with per-step wall-clock overhead in the single-digit percent range, validating its practicality for production-scale distributed neural network training (Shi et al., 2023).
