
Clarifying Shampoo: Adapting Spectral Descent to Stochasticity and the Parameter Trajectory

Published 10 Feb 2026 in cs.LG, cs.AI, and stat.ML | (2602.09314v1)

Abstract: Optimizers leveraging the matrix structure in neural networks, such as Shampoo and Muon, are more data-efficient than element-wise algorithms like Adam and Signum. While in specific settings, Shampoo and Muon reduce to spectral descent analogous to how Adam and Signum reduce to sign descent, their general relationship and relative data efficiency under controlled settings remain unclear. Through extensive experiments on LLMs, we demonstrate that Shampoo achieves higher token efficiency than Muon, mirroring Adam's advantage over Signum. We show that Shampoo's update applied to weight matrices can be decomposed into an adapted Muon update. Consistent with this, Shampoo's benefits can be exclusively attributed to its application to weight matrices, challenging interpretations agnostic to parameter shapes. This admits a new perspective that also avoids shortcomings of related interpretations based on variance adaptation and whitening: rather than enforcing semi-orthogonality as in spectral descent, Shampoo's updates are time-averaged semi-orthogonal in expectation.

Summary

  • The paper clarifies that Shampoo adapts spectral descent to stochastic deep learning by leveraging two-sided matrix preconditioning.
  • It shows that applying Shampoo exclusively to 2D weight matrices enhances token efficiency compared to elementwise optimizers like Adam and Signum.
  • Empirical and theoretical analyses highlight that tuning hyperparameters and utilizing time-averaged semi-orthogonality are key to improved model performance.

Structural Analysis of Shampoo: Adapting Spectral Descent for Stochastic Deep Learning

Introduction and Motivation

This work rigorously examines the mechanics of Shampoo, a class of matrix preconditioned optimizers, in the context of both stochastic optimization and parameter trajectory adaptation. The core motivation is to clarify how Shampoo relates to its non-adaptive counterparts (Spectral Descent, Muon) and elementwise approaches (Adam, Signum), and to delineate where its empirical advantages originate, focusing specifically on data efficiency and update structure.

Analogy: Adam/Signum vs. Shampoo/Muon

The authors construct a formal analogy: Adam is to Signum as Shampoo is to Muon. This analogy operates at both the algorithmic and empirical level. Adam can be written as a sign-based update (Signum) augmented by elementwise adaptation, whereas Shampoo can be viewed as a spectral sign-based update (Muon) with matrix adaptations left- and right-multiplied. This duality is visualized and motivated with the geometry of steepest descent under different norms and parameter geometries (Figure 1).

Figure 1: Structural and empirical analogy between Adam/Signum and Shampoo/Muon, exhibiting improved token efficiency by augmenting sign-based methods with adaptation matrices.
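The Adam-as-adapted-Signum factorization can be checked numerically. The following NumPy sketch (with stand-in moment estimates, not values from the paper's experiments) splits Adam's update direction into the Signum direction times a non-negative element-wise adaptation:

```python
import numpy as np

def adam_as_adapted_signum(m, v, eps=1e-8):
    """Split Adam's direction m / (sqrt(v) + eps) into the Signum
    direction sign(m) times a non-negative element-wise adaptation."""
    adaptation = np.abs(m) / (np.sqrt(v) + eps)
    return np.sign(m), adaptation

rng = np.random.default_rng(0)
m = rng.standard_normal((3, 4))   # EMA of gradients (first moment)
v = m**2 + 0.1                    # stand-in EMA of squared gradients

sign, adapt = adam_as_adapted_signum(m, v)
adam_update = m / (np.sqrt(v) + 1e-8)
assert np.allclose(adam_update, sign * adapt)
```

Dropping the adaptation factor (setting it to all-ones) recovers Signum; the paper's analogy is that dropping Shampoo's left/right adaptation matrices analogously recovers Muon.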

Empirical results robustly show that Shampoo outperforms Muon across a broad array of LLM scaling regimes and batch sizes, just as Adam outperforms Signum. All matrix-structured variants, including KL-Shampoo and Shampoo$^{1/2}$, demonstrate superior token efficiency compared to AdamW, especially at larger batch sizes.

Decomposition of the Shampoo Update

Shampoo's update for a weight matrix $W_t$ is decomposed structurally as

$P_t^{-p} \, G_t \, Q_t^{-p}$

with $P_t$ and $Q_t$ being EMAs of the left and right covariance matrices of the gradient $G_t$, and exponent $p$ typically $1/4$ or $1/2$. The key insight is that this decomposes into a Muon update (matrix-polar normalization of the EMA) sandwiched by two adaptation matrices, paralleling Adam's sign-and-adapt structure. The adaptation matrices relax the semi-orthogonality constraint (enforced by Muon) to account for both stochasticity and the optimizer's evolution along the parameter trajectory.
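In the idealized setting without averaging, where $P_t = G_t G_t^\top$ and $Q_t = G_t^\top G_t$ come from a single gradient, Shampoo with $p = 1/4$ collapses exactly to the polar factor $U_t V_t^\top$ used by Muon. A minimal NumPy check of this reduction (variable names are mine, not the paper's):

```python
import numpy as np

rng = np.random.default_rng(1)
G = rng.standard_normal((4, 4))   # a generic full-rank gradient matrix

# Shampoo factors without averaging, exponent p = 1/4.
P = G @ G.T                        # left covariance
Q = G.T @ G                        # right covariance

def inv_root(M, p, eps=1e-12):
    """Matrix inverse p-th root of a PSD matrix via eigendecomposition."""
    w, V = np.linalg.eigh(M)
    return V @ np.diag(np.clip(w, eps, None) ** -p) @ V.T

shampoo = inv_root(P, 0.25) @ G @ inv_root(Q, 0.25)

# Muon / spectral descent direction: the polar factor U V^T of G.
U, s, Vt = np.linalg.svd(G, full_matrices=False)
muon = U @ Vt

assert np.allclose(shampoo, muon, atol=1e-6)
```

With EMAs ($\beta_2 > 0$) the two no longer coincide, and the residual left/right factors are precisely the adaptation matrices the paper analyzes.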

Empirical Analysis: Localizing Shampoo's Benefits

The work presents strong evidence that the gains of Shampoo are almost entirely localized to its application on weight matrices, not on 1D parameters or embeddings. Adapting the preconditioner shape to the structural role of the parameters—rather than defaulting to shape-agnostic full-matrix variants—yields the strongest empirical results. Notably, preconditioning embeddings or 1D slices yields degraded performance compared to restricting Shampoo purely to 2D weight matrices.

Ablations: One-sided vs Two-sided Preconditioning

The empirical study on one-sided versus two-sided preconditioning further demonstrates that while left-sided preconditioning (acting on the output dimension) yields better results than right-sided, the full two-sided update used by Shampoo remains superior in terms of token efficiency. Thus, the classic two-sided structure of Shampoo is not just a legacy artifact but confers clear benefits in high-dimensional neural optimization landscapes.
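The variants being compared can be sketched as follows (my own notation, not the paper's code; the one-sided updates use exponent $1/2$ on their single factor, a common convention that keeps the overall scaling comparable):

```python
import numpy as np

def inv_root(M, p, eps=1e-8):
    """Matrix inverse p-th root of a PSD matrix via eigendecomposition."""
    w, V = np.linalg.eigh(M)
    return V @ np.diag(np.clip(w, eps, None) ** -p) @ V.T

rng = np.random.default_rng(4)
beta2 = 0.9
P = np.zeros((4, 4))
Q = np.zeros((4, 4))
for _ in range(50):                        # accumulate EMA factor matrices
    G = rng.standard_normal((4, 4))
    P = beta2 * P + (1 - beta2) * G @ G.T  # left (output-side) covariance
    Q = beta2 * Q + (1 - beta2) * G.T @ G  # right (input-side) covariance

# Preconditioned update variants applied to the latest gradient G:
two_sided = inv_root(P, 0.25) @ G @ inv_root(Q, 0.25)  # full Shampoo
left_only = inv_root(P, 0.5) @ G                       # one-sided, output side
right_only = G @ inv_root(Q, 0.5)                      # one-sided, input side
```

Once the EMAs accumulate gradient noise, the three directions genuinely differ, which is what the ablation measures.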

Adaptation to Both Stochasticity and Parameter Trajectory

A major theoretical contribution is formalizing how Shampoo's adaptation cannot be explained by standard variance adaptation or whitening alone. The EMA statistics in Shampoo induce a time-averaged semi-orthogonality constraint in expectation, generalizing spectral descent for stochastic and nonstationary settings. This result is substantiated by analyzing the limiting behavior of KL-Shampoo and relating its optimality condition to a moving-window matrix whitening problem coupled across training iterates (Figure 2).

Figure 3: In the full-batch regime, adaptive methods like RMSProp and Shampoo$^{1/2}$ still outperform non-adaptive counterparts, indicating the adaptation to parameter trajectory is effective even without stochasticity.

The extensive hyperparameter study finds that the optimal betas for Adam and Shampoo are correlated and shrink with batch size, and that the small but carefully tuned $\epsilon$ values materially affect the optimizer ranking. For matrix variants, batch size increases favor Shampoo/Muon, whereas AdamW and Signum's relative performance peaks at smaller batch sizes. These findings resolve apparent contradictions in prior optimizer benchmark studies (Figure 4).

Figure 2: AdamW validation perplexity over the hyperparameter search-space, highlighting regions of instability and optimality in $(\alpha, \beta_1, \beta_2)$.

Figure 4: Validation perplexity for Signum and Muon as a function of learning rate and momentum, showing Muon's enhanced robustness over the search space.

Figure 5: Validation perplexity across Shampoo variant hyperparameters, indicating marked improvements with proper choice of $(\alpha, \beta_1, \beta_2)$.

Theoretical Implications

The authors' formalism exposes foundational limitations in common interpretations of adaptive methods. Variance adaptation does not capture the update structure; whitening is only meaningful when defined in a matrix sense with time-averaging over the parameter path. Spectral descent emerges as the limiting case of Shampoo in the absence of time-averaging and stochasticity, and KL-Shampoo provides a tractable relaxation that interpolates between these regimes.

Practical Implications and Future Directions

Practically, these findings have direct consequences for the implementation and tuning of Shampoo and related matrix preconditioners:

  • Apply Shampoo exclusively to 2D weight matrices; avoid embeddings and 1D parameters.
  • Use two-sided preconditioning for maximal data efficiency.
  • Tune $\epsilon$ and adaptation hyperparameters carefully; defaults can invert relative optimizer performance.
  • Grafting (modulating update magnitude with AdamW) remains beneficial for robust training, especially in large-scale distributed scenarios.

Theoretically, understanding why $S_\infty$ (spectral) geometry outperforms $\ell_\infty$ and $\ell_2$ geometries in large-scale neural network optimization remains open, with possible connections to robustness under heavy-tailed loss surfaces and class imbalance. The unification of time-averaged and stochastic adaptation foreshadows the development of a more general modular duality theory for deep learning optimizers.

Conclusion

This study clarifies the empirical and theoretical basis of Shampoo, identifying its core advantage as an adaptation of spectral descent that accounts for both stochastic gradients and the evolution of the parameter trajectory. The matrix preconditioner learns a time-averaged orthogonalization structure, tailored to neural weight matrices, which cannot be reduced to simple variance adaptation or classic whitening. These insights not only rationalize prior empirical discrepancies but provide a foundation for principled design of future matrix-adaptive optimizers and more robust automated hyperparameter transfer protocols.



Explain it Like I'm 14

Easy-to-Read Summary of “Clarifying Shampoo: Adapting Spectral Descent to Stochasticity and the Parameter Trajectory”

Overview: What is this paper about?

This paper compares two kinds of training “optimizers” (the rules that decide how a neural network updates its weights to get better):

  • Adam and Signum, which look at each number (element) in the weights separately.
  • Shampoo and Muon, which look at whole weight matrices (tables of numbers) at once.

The authors show that Shampoo is to Muon what Adam is to Signum: the more “adaptive” method (Shampoo/Adam) tends to be more data-efficient than the simpler one (Muon/Signum). They also explain why this happens and where Shampoo’s benefits really come from.

“Token efficiency” here means how well an LLM learns (measured by validation perplexity; lower means “the model is less confused”) for a fixed amount of training data (tokens).


What questions were the researchers asking?

  • Are Shampoo and Muon analogous to Adam and Signum—meaning the adaptive method (Shampoo/Adam) is usually better than the non-adaptive one (Muon/Signum)?
  • Is Shampoo more token-efficient than Muon when you control for confounding factors (like step size tricks or numerical approximations)?
  • Where do Shampoo’s benefits come from? Do they mostly appear when applied to weight matrices (the big 2D tables in linear layers), or do they help everywhere?
  • Is using two-sided matrix adjustments (left and right) better than one-sided adjustments?
  • Can we explain Shampoo’s advantage better than by just saying “it adapts to noise” or “it whitens the gradient”?

How did they study it? (Methods in plain language)

  • They trained LLMs (Llama-style) on a large text dataset and compared many optimizers under the same rules.
  • They carefully tuned hyperparameters (like learning rates and moving-average factors) and tested different batch sizes and model sizes.
  • They used “grafting”: this means they let Shampoo/Muon borrow the overall step size from Adam so differences come from the shape/structure part, not from step-size scaling tricks.
  • They also ran controlled tests:
    • Swapped ways of computing Muon’s core operation (using exact SVD vs. an approximation called Newton–Schulz).
    • Tuned a small numerical constant (epsilon) that matters for matrix stability.
    • Applied Shampoo to different parameter types (2D weight matrices vs. 1D vectors, embeddings, and convolution kernels) to see where it helps most.
    • Tested one-sided vs. two-sided matrix adjustments to see which works best.
    • Ran “full-batch” experiments (no minibatch noise) to see if benefits remain when the gradient isn’t noisy.
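The Newton–Schulz approximation mentioned above can be sketched with the classical cubic iteration; Muon in practice uses tuned polynomial coefficients, so this is a simplified illustration of the idea, not the production variant:

```python
import numpy as np

def newton_schulz_orthogonalize(G, steps=25):
    """Approximate the polar factor U V^T of G using only matrix
    multiplications: X <- 0.5 * X (3 I - X^T X).
    Converges when the singular values of the initial X lie in (0, sqrt(3)),
    which the Frobenius normalization guarantees."""
    X = G / np.linalg.norm(G)          # Frobenius norm: all sigma_i <= 1
    I = np.eye(G.shape[1])
    for _ in range(steps):
        X = 0.5 * X @ (3 * I - X.T @ X)
    return X

rng = np.random.default_rng(2)
G = rng.standard_normal((6, 4))
approx = newton_schulz_orthogonalize(G)

# Reference answer via exact SVD.
U, _, Vt = np.linalg.svd(G, full_matrices=False)
assert np.allclose(approx, U @ Vt, atol=1e-5)
```

Because it uses only matrix multiplies, this routine runs efficiently on accelerators where an exact SVD would be a bottleneck, which is why the SVD-vs-Newton–Schulz swap is a meaningful control.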

Key idea explained simply:

  • Element-wise vs. matrix-wise: Adam and Signum treat each weight number separately, like fixing each pixel one by one. Shampoo and Muon treat the whole weight matrix together, like straightening a wrinkled sheet by pushing from the left and right at the same time.
  • “Left” and “right” adaptation: Think of holding a book with two hands—one hand adjusts rows, the other adjusts columns. This two-handed adjustment can steer updates more precisely.
  • “Spectral descent” (what Muon tries to imitate): Follow the main directions of the slope on a hill, not just straight down per number. It moves in the direction that respects the overall shape of the matrix.
  • “EMA” (exponential moving average): A way of remembering a smooth average of the past, like a memory that favors recent steps but still keeps older ones in mind.
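The EMA idea from the last bullet in a few lines of Python (a toy sketch with Adam-style bias correction; the numbers are illustrative, not from the paper):

```python
def ema(values, beta=0.9):
    """Exponential moving average: recent values count more, but older
    ones are never fully forgotten."""
    avg = 0.0
    out = []
    for t, v in enumerate(values, start=1):
        avg = beta * avg + (1 - beta) * v
        out.append(avg / (1 - beta**t))   # bias correction, as in Adam
    return out

print(ema([1.0, 1.0, 1.0]))  # each entry is ~1.0: a constant signal is kept
```

Larger `beta` means a longer memory; the paper's finding that optimal betas shrink with batch size says that bigger, less noisy batches need less of this smoothing.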

What did they discover, and why does it matter?

Here are the most important findings:

  • Shampoo beats Muon in token efficiency, much like Adam beats Signum.
    • Across different settings, Shampoo (especially some versions like KL‑Shampoo and Shampoo with p=1/2) matched or outperformed Muon.
    • Matrix-aware optimizers (Shampoo/Muon) often beat AdamW when run fairly.
  • The advantage comes from the “adaptation” part.
    • The authors show you can think of Shampoo as “Muon in the middle,” with extra left/right adjustments before and after. These adjustments are what give Shampoo the edge—just like Adam’s scaling gives it an edge over Signum.
  • Shampoo helps most on weight matrices in linear layers.
    • Applying Shampoo to 1D parameters (vectors), embeddings, or certain convolution kernels either didn’t help or made things worse.
    • Reshaping tensors into 2D matrices and applying Shampoo there gave the best gains.
    • This challenges the idea that Shampoo is a general replacement for “full-matrix Adam”; instead, its power is specific to matrix-shaped parameters.
  • Two-sided is better than one-sided.
    • Using both left and right adjustments usually outperformed using only one side.
    • One-sided versions couldn’t match full Shampoo.
  • Results depend on careful setup.
    • Tuning epsilon and using the same step-size grafting mattered a lot.
    • Muon’s common “layer-wise scaling” rules didn’t beat the grafted setup used here.
    • Matrix methods may prefer larger batch sizes than element-wise methods.
    • As models get bigger or see more data, the gap shrinks—but the ranking stays similar.
  • Adaptation is not just about noise.
    • Even without minibatch noise (full-batch), adaptive methods (RMSProp/Shampoo) still beat non-adaptive ones (SignGD/SpectralGD).
    • This suggests the optimizer benefits from “adapting to the path the parameters take over time,” not just from smoothing out randomness.

Why this matters:

  • If you train LLMs and can afford matrix-aware methods, Shampoo (two-sided, on weight matrices, with grafting) is a strong default.
  • Don’t assume “Shampoo is a general full-matrix Adam.” Its biggest wins are for 2D weight matrices, not everywhere.
  • If you use Muon, know that its scaling choices and numerical details can affect results; with careful control, Shampoo tends to come out ahead.

A simpler way to think about Shampoo’s “why”

Many past explanations said Shampoo works because it:

  • adapts to variance (noise), or
  • “whitens” the gradient (makes directions equally noisy).

The authors propose a clearer picture: Shampoo tries to keep the updates “well-shaped” over time and across randomness—like encouraging the update directions to stay roughly at right angles (orthogonal) on average when you look across many steps and samples. They call this “time-averaged semi‑orthogonality in expectation.” In plain terms: Shampoo nudges updates so they don’t get stretched weirdly in one direction, not just at a single step, but on average over the whole training path and the noise in the data.

This perspective:

  • connects Muon (which pushes toward “spectral” directions) with Shampoo (which adds adaptive shaping on both sides),
  • explains why two-sided adaptation helps,
  • and fits the experimental results better than only saying “variance adaptation” or “whitening.”
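The gap between per-step semi-orthogonality (what Muon enforces) and semi-orthogonality only on average can be illustrated with synthetic Gaussian updates; this is a toy diagnostic of mine, not a measurement from the paper:

```python
import numpy as np

def polar(G):
    """The closest semi-orthogonal matrix to G (Muon's core operation)."""
    U, _, Vt = np.linalg.svd(G, full_matrices=False)
    return U @ Vt

def gap(M):
    """Distance of M^T M from a rescaled identity: 0 means the columns
    of M are orthogonal with equal norms."""
    S = M.T @ M
    S = S / (np.trace(S) / S.shape[0])   # normalize overall scale
    return np.linalg.norm(S - np.eye(S.shape[0]))

rng = np.random.default_rng(3)
updates = [rng.standard_normal((8, 4)) for _ in range(200)]

# Muon enforces semi-orthogonality at every single step.
per_step_gaps = [gap(polar(G)) for G in updates]
assert max(per_step_gaps) < 1e-8

# "Semi-orthogonal in expectation" is weaker: each raw noisy update is
# far from semi-orthogonal, yet the averaged second moment E[M^T M] is
# close to a multiple of the identity.
single_gaps = [gap(G) for G in updates]
avg_second_moment = sum(G.T @ G for G in updates) / len(updates)
avg_second_moment /= np.trace(avg_second_moment) / 4
pooled_gap = np.linalg.norm(avg_second_moment - np.eye(4))
assert pooled_gap < np.mean(single_gaps) / 3
```

In this toy setting each individual update is "stretched" in some direction, but no direction is favored on average over steps and noise, which is the shape of constraint the paper attributes to Shampoo.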

Takeaways and potential impact

  • Practical guidance:
    • For LLMs, consider Shampoo (two-sided) applied to weight matrices, plus grafting from Adam for step-size magnitude.
    • Skip applying Shampoo to embeddings and 1D parameters; it may hurt.
    • Tune epsilon carefully; small numerical details matter.
    • Expect matrix methods to like somewhat larger batch sizes than element-wise methods.
  • Bigger picture:
    • The paper clarifies that Shampoo’s strength comes from how it adapts to both the noisy data and the changing path of the model’s weights—over time—in a matrix-aware way.
    • This helps researchers design better, more efficient optimizers and rethink old interpretations that treated Shampoo as a one-size-fits-all version of full-matrix Adam.

In short: Shampoo is like Muon with smart, two-handed adjustments that account for noise and the training path. That extra adaptiveness, applied where it matters most (weight matrices), makes it learn more efficiently from the same amount of data.

Knowledge Gaps

Below is a concise, actionable list of knowledge gaps, limitations, and open questions the paper leaves unresolved. Each item is framed to guide future research.

  • Formal conditions and proofs that the Shampoo decomposition (Muon plus left/right adaptations) holds under practical settings with EMA momentum, grafting, damping ε, and stale/blocked preconditioners; provide bounds quantifying divergence from the idealized decomposition.
  • Empirical and theoretical validation that training with Shampoo/Muon achieves “time-averaged semi-orthogonality in expectation” as defined; develop measurable diagnostics, prove sufficient conditions, and relate this property to convergence speed and generalization.
  • Convergence guarantees for KL-Shampoo and Shampoo variants in nonconvex stochastic optimization (with momentum and weight decay), including rates and stability under realistic noise and rank-deficient gradients.
  • A principled explanation for why two-sided preconditioning consistently outperforms one-sided variants; characterize regimes (architectures, data, layer types) where one-sided may suffice or be preferable, and propose hybrid schemes.
  • A systematic derivation of layer-wise scalings for Muon that obviate grafting while matching or exceeding grafted performance; test RMS-RMS operator norm and new scalings across tasks and scales.
  • Robust determination of the critical batch size for matrix-based optimizers vs element-wise methods across model scales, sequence lengths, and datasets; produce scaling laws and predictive heuristics.
  • Generality of findings beyond the single dataset and architecture: replicate across diverse corpora (e.g., The Pile, BookCorpus, code), longer sequence lengths, MoE transformers, and other domains (vision at ImageNet scale, speech, RL); check whether optimizer rankings persist.
  • Comprehensive sensitivity analyses for weight decay, learning-rate schedules (warmup, cosine vs other decays), and clipping; quantify how these confound optimizer comparisons and alter rankings.
  • Principled selection of Shampoo’s exponent p (e.g., 1/4 vs 1/2) per layer and architecture; develop automatic tuning or theory-guided criteria linking p to geometry, spectrum, or noise characteristics.
  • A rigorous study of ε in matrix inverse roots (both numerical stability and optimization behavior): characterize optimal ε by layer and training stage; propose adaptive ε schedules or damping strategies that improve stability without degrading performance.
  • Impact of practical approximations on conclusions: stale factor updates, blocking/partitioning, mixed precision, approximate eigensolvers, and Newton–Schulz iterations; provide compute–sample efficiency trade-off curves and conditions under which Shampoo retains advantages.
  • A thorough comparison of Shampoo vs SOAP/EShampoo under identical settings with per-iteration preconditioner updates; isolate the benefits of eigenbasis changes vs eigenvalue correction and identify regimes where each variant dominates.
  • Centered-covariance (mean-subtracted) versions of Shampoo: implement efficient forms, analyze stability, and benchmark vs uncentered/KL objectives; clarify when centered whitening helps or harms.
  • Clarify the mechanism by which Shampoo’s benefits are “exclusive to weight matrices”: derive theory explaining why 2D reshaping outperforms 1D/4D preconditioning; explore whether alternative parameterizations or reshaping strategies can make embeddings or 1D parameters benefit.
  • Deeper ablations on left vs right adaptations: quantify their individual contributions to performance, explore asymmetric powers/exponents, and test cross-coupled or data-driven adaptation matrices.
  • Extend analyses of numerical stability for Muon’s Newton–Schulz iteration: characterize failure modes, coefficient choices, iteration counts, and precision requirements; compare rigorously to SVD across scales.
  • Compute-optimal training with matrix optimizers: verify the claim that Chinchilla budgets are not compute-optimal for Shampoo/Muon; derive new scaling laws relating model size, tokens, and compute for these methods.
  • A graft-free perspective for Shampoo: assess whether Adam grafting is masking intrinsic scaling behavior; develop per-layer magnitude normalization grounded in theory (e.g., nuclear/RMS norms) and compare to grafted variants.
  • Formal links between the proposed time-averaged orthogonality perspective and existing views (variance adaptation, whitening, trust-region interpretations): establish equivalence or separation conditions and validate experimentally.
  • Behavior under distribution shift and nonstationarity (continual learning, curriculum): test resilience of Shampoo/Muon when the stationary data assumption breaks and devise adaptation strategies if needed.
  • Interactions with regularization and normalization (gradient clipping, norm constraints, dropout, normalization layers): quantify how these interact with spectral descent and Shampoo’s preconditioning.
  • Per-layer heterogeneity: identify which layers benefit most from matrix preconditioning; propose selective application strategies or adaptive layer selection policies and validate their efficiency gains.
  • Robustness to heavy-tailed gradient noise and class imbalance for matrix-based methods (analogous to Adam’s known robustness); design controlled tests and characterize failure/success regimes.
  • Standardized, reproducible benchmarks: release full hyperparameter grids (including weight decay), diverse datasets, and code paths with typical training efficiencies (stale updates, blocking) to reduce confounding factors and enable consistent comparisons.
  • Theory and practice for rank-deficient gradients: analyze how pseudoinverses, damping, and spectrum truncation affect spectral descent equivalence and Shampoo’s behavior; propose safe defaults.
  • Practical guidance for preconditioner update frequency and eigen-decomposition budgets: derive heuristics and criteria for when to update and quantify performance degradation from staleness across tasks.
  • Downstream effects: measure whether pretraining gains from Shampoo/Muon translate to improved fine-tuning, transfer, calibration, and OOD robustness; expand metrics beyond validation perplexity.
  • Explicit formalization and standardization of the “token efficiency” metric used; ensure fair cross-optimizer comparisons by normalizing for compute and detailing the measurement protocol.
  • Unify Shampoo/Muon with natural-gradient/K-FAC frameworks: compare under matched settings, analyze similarities/differences in curvature approximations, and identify opportunities for hybrid algorithms.

Glossary

  • Adafactor: An adaptive optimizer that factorizes second-moment estimates to reduce memory while retaining performance. "An alternative shape-agnostic perspective is viewing Shampoo as approximately running Adafactor/Adam in the eigenbasis of Shampoo's preconditioner"
  • AdaGrad: An adaptive gradient method that scales updates by the accumulated squared gradients, often via a (diagonal or full) preconditioner. "Shampoo was originally introduced as a Kronecker-factored upper bound for the full-matrix Adagrad preconditioner"
  • Adam: An adaptive optimizer that uses exponential moving averages of gradients and squared gradients to scale updates element-wise. "Adam, the de facto standard for neural network training, preconditions the gradient with an element-wise scaling"
  • AdamW: A variant of Adam that decouples weight decay from the gradient-based update for better regularization. "Consistent with \citet{orvieto2025search}, AdamW outperforms Signum across all settings."
  • bias correction: A technique that adjusts EMAs early in training to remove initialization bias. "We ignore bias correction for simplicity, as these terms can be absorbed into the learning rate."
  • Chinchilla token budget: A compute-optimal scaling guideline balancing model size and training tokens for LLMs. "Notably, the standard Chinchilla token budget may no longer be compute-optimal for Shampoo and Muon"
  • cosine decay schedule: A learning-rate schedule that decays the rate following a cosine curve. "After 10\% warmup, we use a cosine decay schedule for the learning rate until 0."
  • critical batch size: The batch size beyond which scaling benefits diminish for a given optimizer and model. "these results suggest, consistent with prior work, that the critical batch size is higher for matrix-based methods than for element-wise methods"
  • decoupled weight decay: Applying weight decay separately from the gradient update to avoid interference with adaptive scaling. "Decoupled weight decay is applied to all parameters and fixed to $0.1$."
  • eigendecomposition: Decomposition of a matrix into its eigenvalues and eigenvectors, often used to compute matrix roots or inverses. "we update Shampoo's preconditioner every iteration using an eigendecomposition"
  • EMA (Exponential Moving Average): A weighted average that emphasizes recent observations, used to track gradients and their statistics over time. "Recall that Adam maintains exponential moving averages (EMAs) of the stochastic gradient and its square"
  • EShampoo: A variant of Shampoo that decouples and corrects eigenvalues of the preconditioner. "we find that the closely related EShampoo does not generally outperform Shampoo"
  • Fisher (information): A curvature matrix capturing the sensitivity of model predictions to parameters, used in natural gradient methods. "Overall, our observations indicate that Shampoo should not be viewed as an approximation of full-matrix Adam or related matrices (e.g., gradient covariance \citep{yang2008principal,sohldickstein2012natural,ida2017adaptive}, empirical Fisher \citep{kunstner2019limitations}, or Fisher/generalized Gauss-Newton matrices \citep{amari1998natural,martens2014new})."
  • Frobenius norm: A matrix norm equal to the square root of the sum of squared entries, often used to measure matrix differences. "It represents the closest semi-orthogonal matrix to $G_t$ in Frobenius norm"
  • full-matrix Adam: An Adam variant that uses a dense (non-diagonal) preconditioner capturing parameter correlations. "Shampoo is commonly thought of as an approximation to (per-parameter) full-matrix Adam, with preconditioner"
  • generalized Gauss-Newton: A positive semidefinite curvature approximation derived from Jacobians, used as a second-order surrogate. "or Fisher/generalized Gauss-Newton matrices"
  • grafting: Replacing the per-layer update magnitude with that from a base optimizer while keeping the direction from another method. "the per-layer update magnitude is usually grafted from a base optimizer like Adam"
  • gradient covariance: The covariance matrix of gradients across data samples, used to approximate curvature or noise structure. "Overall, our observations indicate that Shampoo should not be viewed as an approximation of full-matrix Adam or related matrices (e.g., gradient covariance ...)"
  • KL divergence: A measure of discrepancy between probability distributions, often minimized to align preconditioners with target statistics. "minimize the KL divergence of two zero-mean Gaussian distributions,"
  • KL-Shampoo: A Shampoo variant whose factor matrices are updated to (approximately) solve a KL-divergence matching objective. "yields KL-Shampoo, whose factor matrices are defined as"
  • KL-SOAP: A method that runs Adam in KL-Shampoo’s eigenbasis, aligning updates with Shampoo’s preconditioner. "KL-SOAP runs Adam in KL-Shampoo's eigenbasis."
  • Kronecker-factored: A structured approximation using Kronecker products of smaller matrices to represent large preconditioners efficiently. "Shampoo was originally introduced as a Kronecker-factored upper bound for the full-matrix Adagrad preconditioner"
  • Kronecker product: The tensor product of two matrices, building a block matrix used in structured preconditioning. "Note that the Shampoo update is equivalent to a vectorized update with a Kronecker-factored preconditioner, $w_{t+1} = w_t - \alpha_t \left( Q_t \otimes P_t \right)^{-p} g_t$."
  • K-FAC (Kronecker-Factored Approximate Curvature): A second-order optimizer approximating curvature using Kronecker-factorized blocks. "two-sided preconditioning used in methods like K-FAC"
  • Loewner order: A partial order on symmetric matrices indicating one is larger in the positive semidefinite sense. "Shampoo is commonly thought of as an approximation to (per-parameter) full-matrix Adam... interpreted as an upper bound in Loewner order to full-matrix AdaGrad"
  • matrix whitening: Transforming a random matrix so its row/column covariances become (scaled) identities via left/right linear transforms. "We call symmetric positive-definite matrices $A \in \mathbb{R}^{m \times m}$ and $B \in \mathbb{R}^{n \times n}$ matrix whitening matrices if $\mathrm{Cov}_\mathrm{row}(\cdot) = \sigma_m^2 I_m$ and $\mathrm{Cov}_\mathrm{col}(\cdot) = \sigma_n^2 I_n$."
  • modular duality: A scaling principle relating layer-wise norms and update parameterizations across model widths and depths. "Motivated by modular duality \citep{large2024scalable}, \citet{bernstein2025modular} propose using the RMS-RMS operator norm for hidden linear layers"
  • Moonlight scaling: A layer-wise scaling rule for Muon that mimics Adam’s RMS norm. "the Moonlight scaling $c(G_t) = 0.2 \cdot \sqrt{\max(m, n)}$ \citep{liu2025muon} that mimics the RMS norm of Adam."
  • Muon: A matrix-geometry optimizer that updates along the matrix sign (polar factor) of (momentum) gradients. "Applying this operation on (Nesterov) momentum yields the popular Muon optimizer"
  • natural gradient variational inference: A variational inference approach using natural gradients with respect to the Fisher metric. "A similar coupled update was first proposed for natural gradient variational inference"
  • Nesterov momentum: An accelerated gradient technique that uses a lookahead estimate to improve convergence. "Applying this operation on (Nesterov) momentum yields the popular Muon optimizer"
  • Newton-Schulz iteration: An iterative matrix algorithm to approximate functions like the matrix inverse square root or sign using only multiplications. "estimate the matrix sign using a Newton-Schulz iteration that only relies on matrix multiplication"
  • nuclear norm: The sum of singular values of a matrix, used as a convex surrogate and in spectral steepest descent scaling. "For steepest descent under the spectral norm, we have to scale the update by the nuclear norm of the gradient"
  • operator norm (RMS-RMS): A matrix norm induced by RMS vector norms at input and output; here, the RMS-RMS operator norm. "The RMS-RMS operator norm is defined as $\|W\|_{\mathrm{RMS}\rightarrow\mathrm{RMS}} = \sqrt{\frac{n}{m}} \|W\|_2$, with $\|x\|_\mathrm{RMS} = \frac{1}{\sqrt{n}} \|x\|_2$."
  • polar decomposition: A matrix factorization into a unitary (semi-orthogonal) factor and a positive semidefinite factor. "The matrix $U_t V_t^\top$ is the unitary factor of the polar decomposition of $G_t$"
  • pseudoinverse: The Moore–Penrose generalized inverse used when matrices are rank-deficient. "If $G_t$ is rank-deficient, one can recover spectral descent by using the pseudoinverse instead."
  • RMSProp: An adaptive optimizer that scales updates by the EMA of squared gradients, similar to Adam without momentum. "RMSProp and Shampoo$^{1/2}$ w/o momentum ($\beta_1=0$)... the adaptive methods converge faster than their non-adaptive counterparts"
  • Schatten-∞ norm: The spectral norm; the largest singular value of a matrix, used to define steepest descent for matrices. "Analogous to the vector case, where the $\ell_\infty$ norm induces sign descent, one can define steepest descent for matrices using the Schatten-$\infty$ ($S_\infty$) or spectral norm."
  • semi-orthogonal: Matrices with orthonormal columns or rows; the polar factor is the closest semi-orthogonal matrix in Frobenius norm. "It represents the closest semi-orthogonal matrix to $G_t$ in Frobenius norm"
  • SignGD: Sign gradient descent; updates in the direction of the element-wise sign of the (stochastic) gradient. "Setting $\beta_1 = \beta_2 = \epsilon = 0$ recovers (stochastic) sign descent with $g_t / \sqrt{m_t^2} = \mathrm{sign}(g_t)$ \citep[SignGD]{bernstein2018signsgd}."
  • sign descent: An optimization method that uses the sign of the gradient rather than its magnitude. "Setting $\beta_1 = \beta_2 = \epsilon = 0$ recovers (stochastic) sign descent"
  • Signum: A momentum-based sign optimizer that uses the sign of the EMA of gradients. "Ignoring the element-wise adaptation, we recover Signum, which updates the weights using the sign of the EMA of the gradients."
  • SOAP: Running Adam in Shampoo’s eigenbasis so that adaptation happens in the preconditioner’s spectral space. "Alternatively, we can run Adam in Shampoo's eigenbasis \citep[SOAP]{vyas2025soap}"
  • spectral descent: Steepest descent in matrix space under the spectral norm, updating along the polar factor and scaled by the nuclear norm. "we can write (stochastic) spectral descent (SpectralGD) as"
  • spectral norm: The largest singular value of a matrix; equivalently, the Schatten-∞ norm. "using the Schatten-$\infty$ ($S_\infty$) or spectral norm."
  • SpectralGD: The spectral descent algorithm that updates using the polar factor of the gradient. "we can write (stochastic) spectral descent (SpectralGD) as"
  • SVD (Singular Value Decomposition): Factorization of a matrix into singular vectors and singular values. "Given the reduced SVD of the gradient $G_t = U_t \Sigma_t V_t^\top$"
  • time-averaged orthogonality in expectation: A property where, after EMA and expectation, left/right covariances become identities. "We call $\{ G_t \}_{t=1}^T$ time-averaged orthogonal in expectation"
  • token efficiency: Training efficiency measured by performance per token processed. "we demonstrate that Shampoo achieves higher token efficiency than Muon"
  • vectorized update: Rewriting matrix updates as vector operations, often with Kronecker-structured preconditioners. "Note that \Cref{eq:shampoo} is equivalent to a vectorized update with a Kronecker-factored preconditioner"
  • whitening: Transforming variables to have (scaled) identity covariance; used to manage stochastic gradient noise. "Whitening is another approach to handling stochasticity that is commonly cited as motivation for Shampoo"

Practical Applications

Immediate Applications

Below are actionable, deployable-now uses that translate the paper’s findings into concrete workflows, tools, and decision rules.

  • Optimizer choice and configuration for training LLMs
    • What to do: Prefer Shampoo variants (Shampoo$^{1/2}$ or KL-Shampoo) over Muon when you can afford matrix preconditioning, and prefer them over AdamW when pursuing higher token efficiency. Apply Shampoo only to hidden weight matrices; use two-sided preconditioning; exclude embeddings and 1D parameters; graft per-layer update magnitude from Adam; tune ε (epsilon); and, where possible, update preconditioners every step.
    • Sector: Software/AI, Healthcare (clinical NLP), Finance (NLP), Education (LLMs for tutoring), Robotics (policy/backbone pretraining).
    • Tools/workflows: PyTorch Distributed Shampoo; add a training-recipe toggle “Shampoo on 2D weights only”; automated ε sweeps; Adam-grafting by default; SVD-based factor updates where feasible.
    • Assumptions/dependencies: Extra compute/memory for matrix ops; availability of accurate SVD or a well-tuned Newton–Schulz routine; careful ε tuning materially affects ranking; two-sided preconditioning outperforms one-sided.
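The "Shampoo on 2D weights only" policy above can be sketched as a simple parameter-routing rule. This is an illustrative helper of our own (the name `route_params` and the embedding-name tags are assumptions, not the paper's code); it splits parameters by shape so that only hidden weight matrices go to the matrix optimizer.

```python
# Hypothetical sketch: route hidden 2D weight matrices to a matrix optimizer
# (e.g. Shampoo) and everything else (embeddings, biases, norm scales) to
# AdamW, mirroring the "Shampoo on 2D weights only" recipe described above.

def route_params(named_shapes, embedding_names=("embed", "lm_head")):
    """Split parameters into matrix-preconditioned and elementwise groups.

    named_shapes: iterable of (name, shape) tuples, e.g. derived from
    a model's named_parameters(). Returns two lists of parameter names.
    """
    shampoo_group, adamw_group = [], []
    for name, shape in named_shapes:
        is_embedding = any(tag in name for tag in embedding_names)
        # Only 2D hidden weight matrices get matrix preconditioning.
        if len(shape) == 2 and not is_embedding:
            shampoo_group.append(name)
        else:
            adamw_group.append(name)
    return shampoo_group, adamw_group
```

In a training script, the two name lists would then seed two optimizer parameter groups, with grafting and ε tuning applied only to the matrix-preconditioned group.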
  • Batch size policy for matrix-based optimizers
    • What to do: Target higher batch sizes for matrix optimizers than for element-wise methods; start at ≥256 if capacity allows and tune for the method-specific critical batch size.
    • Sector: Software/AI at scale (foundation models), Enterprise ML platforms.
    • Tools/workflows: Batch-size scaling sweeps tied to optimizer choice; autoscheduling that raises batch size when switching from AdamW to Shampoo/Muon.
    • Assumptions/dependencies: Sufficient hardware throughput and memory; distributed training stack; effects were shown on Llama-like LMs and C4—verify per domain.
  • Reshaping strategy for vision and convolutional models
    • What to do: For ViTs and ConvNets, reshape parameters into 2D matrices and apply Shampoo there; avoid 1D reshaping (which degenerates to full-matrix Adam) and avoid higher-order tensor preconditioning when a 2D reshape is available.
    • Sector: Computer Vision (industry and academia), Robotics (perception), Healthcare (imaging), Automotive.
    • Tools/workflows: Model-parameter registry that tags tensor shapes and forces 2D reshape where possible; optimizer hooks that skip embeddings and 1D parameters.
    • Assumptions/dependencies: Implementation must support per-parameter geometry policies; confirm gains on your dataset and architecture.
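The 2D-reshape rule above can be illustrated with a conv kernel. The (out, in·kh·kw) layout below is one common folding choice and is our assumption, not a prescription from the paper:

```python
import numpy as np

# Illustrative sketch: fold a 4D conv kernel (out, in, kh, kw) into a 2D
# matrix that a matrix preconditioner like Shampoo can act on, then unfold
# the preconditioned update back to the original kernel shape.

def conv_to_matrix(kernel):
    """Reshape (out, in, kh, kw) -> (out, in * kh * kw)."""
    out_ch = kernel.shape[0]
    return kernel.reshape(out_ch, -1)

def matrix_to_conv(mat, kernel_shape):
    """Inverse reshape back to the 4D kernel layout."""
    return mat.reshape(kernel_shape)
```

The round trip is lossless, so the optimizer can operate entirely in the 2D view and write updates back without touching the forward pass.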
  • Avoid one-sided preconditioning unless constrained
    • What to do: Use two-sided Shampoo. If constrained, preconditioning with the “left” factor (activation-side) is preferable to the “right” factor, but both underperform two-sided.
    • Sector: Software/AI, Robotics, Research labs.
    • Tools/workflows: Switch off one-sided K-FAC/Shampoo defaults; unit tests that fail if only one side is active in production recipes.
    • Assumptions/dependencies: Two-sided updates cost more compute; ensure stable/regularized inverses and grafting.
  • Benchmarking and reproducibility hygiene
    • What to do: When comparing optimizers, control for confounders identified in the paper: (1) parameter subsets (only weight matrices), (2) grafting vs per-layer scaling, (3) ε tuning, (4) accuracy of matrix computations (SVD vs Newton–Schulz), (5) preconditioner update frequency.
    • Sector: Academia, Industry R&D, Open-source ML.
    • Tools/workflows: Standardized optimizer-reporting template in experiment logs; CI checks for missing grafting or ε sweeps; publish which parameters were preconditioned.
    • Assumptions/dependencies: Culture/process changes; small upfront engineering to standardize pipelines.
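The five confounders above lend themselves to a small logging schema. The field names below are our own suggestion, not a standard:

```python
from dataclasses import dataclass, asdict

# Minimal experiment-log template covering the confounders listed above:
# which parameters were preconditioned, grafting, epsilon, matrix-computation
# method, and preconditioner update frequency.

@dataclass
class OptimizerReport:
    preconditioned_params: str  # e.g. "2D hidden weights only"
    grafting: str               # e.g. "Adam" or "none"
    epsilon: float              # tuned eps value
    matrix_method: str          # "SVD" or "Newton-Schulz"
    precond_update_freq: int    # steps between preconditioner updates

# Example entry for a Shampoo run with per-step SVD updates.
report = OptimizerReport("2D hidden weights only", "Adam", 1e-8, "SVD", 1)
```

Serializing `asdict(report)` into each experiment's config makes optimizer comparisons auditable after the fact.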
  • Cost and energy reductions via token efficiency
    • What to do: For fixed quality, reduce training tokens or steps when using Shampoo-on-weight-matrices; or for fixed budget, target higher quality. Include optimizer choice in cost models.
    • Sector: Any organization training medium-to-large models; Policy teams tracking AI sustainability.
    • Tools/workflows: TCO/CO2 estimators include “optimizer factor”; procurement and scheduling consider matrix-optimizer profiles.
    • Assumptions/dependencies: Gains depend on task/scale; matrix ops may slightly raise instantaneous power usage but reduce total tokens.
  • Adapter and LoRA fine-tuning
    • What to do: Apply Shampoo to LoRA/adapter matrices only, with Adam grafting and tuned ε; keep everything else on AdamW. This preserves low overhead while capturing matrix-geometry gains.
    • Sector: Startups, open-source fine-tuners, edge/consumer GPU users.
    • Tools/workflows: Fine-tuning scripts with “Shampoo-on-adapters” flag; default ε sweeps per adapter rank.
    • Assumptions/dependencies: Overhead scales with adapter size; ensure numerically stable low-precision kernels if using FP16/FP8.
  • Guidance for practical Muon usage
    • What to do: If using Muon, prefer grafting for layer-wise magnitude over “classic” or “Moonlight” scaling; consider SVD-based semi-orthogonalization when accuracy matters; otherwise ensure Newton–Schulz is carefully tuned.
    • Sector: Software/AI practitioners already on Muon.
    • Tools/workflows: Drop-in grafting; validate SVD vs Newton–Schulz on a pilot run; per-layer scaling sanity checks.
    • Assumptions/dependencies: Compute budget; numerical stability; consistent weight-decay handling.
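The grafting recipe above can be sketched in a few lines: take the update direction from the semi-orthogonalized momentum (here via exact SVD rather than Newton–Schulz) and the per-layer magnitude from a reference Adam update. The `graft` helper is our own illustration of the idea, not the paper's implementation.

```python
import numpy as np

# Hedged sketch of Adam-grafting for Muon: direction from the polar factor
# of the (momentum) gradient, magnitude from the Frobenius norm of a
# reference Adam update for the same layer.

def semi_orthogonalize(G):
    """Polar factor U V^T of the reduced SVD: the closest semi-orthogonal
    matrix to G in Frobenius norm (Muon's update direction)."""
    U, _, Vt = np.linalg.svd(G, full_matrices=False)
    return U @ Vt

def graft(muon_dir, adam_update, eps=1e-12):
    """Rescale the Muon direction to match the Adam update's norm."""
    scale = np.linalg.norm(adam_update) / (np.linalg.norm(muon_dir) + eps)
    return scale * muon_dir
```

SVD gives the exact polar factor; a well-tuned Newton–Schulz iteration approximates it with matrix multiplications only, trading accuracy for speed.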
  • Curriculum for optimizer education and internal training
    • What to do: Teach the Adam:Signum :: Shampoo:Muon analogy; emphasize “Shampoo’s benefits come from weight matrices” and “two-sided beats one-sided”; explain “time-averaged semi-orthogonality in expectation” as a practical design lens.
    • Sector: Academia, ML platform teams.
    • Tools/workflows: Internal docs, lectures; code examples showing decomposition and parameter-shape policies.
    • Assumptions/dependencies: None beyond training time; improves shared mental models and recipe quality.
  • Policy and governance recommendations for reporting
    • What to do: Require optimizer reporting in model cards and papers: parameter subsets preconditioned, grafting usage, ε values, matrix-computation method/frequency. Recognize optimizer choice as a lever for energy savings.
    • Sector: Policy, Standards bodies, Conferences/journals.
    • Tools/workflows: Model card fields and checklist updates.
    • Assumptions/dependencies: Community adoption.

Long-Term Applications

These leverage the paper’s conceptual advances and empirical guidance but will benefit from further research, scaling, or engineering.

  • Design of new “trajectory- and noise-aware” matrix optimizers
    • Idea: Use “time-averaged orthogonality in expectation” as a design principle. Create optimizers that explicitly control how semi-orthogonality is enforced across time and stochasticity, potentially interpolating between spectral descent and Shampoo/KL-Shampoo.
    • Sector: Academia, AI labs.
    • Tools/products: Next-generation matrix optimizers; convergence theory tied to trajectory averaging.
    • Assumptions/dependencies: Theoretical development; robust implementations beyond LMs and C4.
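A design lens like "time-averaged orthogonality in expectation" suggests a concrete diagnostic: average the row covariance of the updates over time and measure its deviation from a scaled identity. The estimator below is our own illustrative sketch, not the paper's procedure.

```python
import numpy as np

# Toy diagnostic for time-averaged orthogonality: if updates U_t are
# semi-orthogonal on average, the time-averaged U_t U_t^T should be close
# to a scaled identity.

def time_averaged_row_cov(updates):
    """Average U @ U.T over a list of m x n update matrices."""
    m = updates[0].shape[0]
    acc = np.zeros((m, m))
    for U in updates:
        acc += U @ U.T
    return acc / len(updates)

def deviation_from_scaled_identity(C):
    """Relative Frobenius distance of C from its best scaled-identity fit."""
    m = C.shape[0]
    sigma2 = np.trace(C) / m
    return np.linalg.norm(C - sigma2 * np.eye(m)) / np.linalg.norm(C)
```

Tracking this deviation over training would let a new optimizer interpolate between strict per-step semi-orthogonality (spectral descent) and the looser time-averaged property.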
  • Geometry-aware AutoML for per-parameter preconditioning
    • Idea: Auto-select geometry (vector vs matrix) and preconditioner type per parameter group (e.g., 2D weights vs embeddings), with ε autotuning, grafting policy, and preconditioner update cadence.
    • Sector: MLOps platforms, Cloud AI services.
    • Tools/products: “Optimizer policy engine” integrated with schedulers; Bayesian/learning-based ε tuning.
    • Assumptions/dependencies: Metadata on parameter shapes; reliable performance predictors; additional controller compute.
  • Hardware/software co-design for matrix preconditioning
    • Idea: Accelerators and libraries optimized for semi-orthogonalization (fast/low-precision SVD, stable Newton–Schulz), Kronecker-factored roots, and grafting without bottlenecks.
    • Sector: Semiconductor, Systems software (BLAS, compiler stacks), Cloud providers.
    • Tools/products: Kernel libraries with numerically robust, mixed-precision matrix-root operations; compiler passes that fuse grafting and preconditioning.
    • Assumptions/dependencies: Vendor support; demand from large-scale training users.
  • Dynamic batch sizing tied to optimizer behavior
    • Idea: Online estimation of the optimizer’s critical batch size; dynamically adjust batch size as training progresses and as the optimizer (Shampoo vs AdamW) or grafting changes.
    • Sector: Enterprise ML, Cloud training orchestration.
    • Tools/products: Controllers that modulate batch size to match optimizer regime; auto-scaling policies.
    • Assumptions/dependencies: Stable throughput when batch size varies; integration with schedulers and checkpointing.
  • Extending matrix-based adaptation beyond 2D linear layers
    • Idea: Explore principled 2D reshaping for higher-order tensors (e.g., attention projections, conv kernels) and confirm when/why 2D dominates direct higher-order preconditioning; unify with attention-specific structures.
    • Sector: Vision, Speech, Multimodal models, Robotics.
    • Tools/products: Shape-policy libraries with learned or rule-based folding strategies.
    • Assumptions/dependencies: Empirical validation; domain-specific constraints.
  • Robust low-precision training with matrix optimizers
    • Idea: Develop numerically stable FP8/bfloat8 variants of Shampoo/KL-Shampoo and Muon. Improve polynomial approximations and scaling safeguards for Newton–Schulz; error-compensation techniques for eigenvalue corrections.
    • Sector: AI at scale, Edge AI.
    • Tools/products: Mixed-precision optimizer libraries; validation suites for stability.
    • Assumptions/dependencies: New numerical schemes and theory; hardware support.
  • Cross-domain validation and domain-specific recipes
    • Idea: Systematically quantify gains in RL (noisy gradients), recommendation systems (sparse, heavy-tailed), healthcare (safety-critical), and finance (risk-sensitive). Tailor ε, grafting, and shape policies.
    • Sector: Robotics/RL, Recommenders, Healthcare, Finance.
    • Tools/products: Domain-optimized optimizer presets; regulatory-grade documentation (healthcare/finance).
    • Assumptions/dependencies: Domain datasets; safety/compliance constraints.
  • Governance and sustainability incentives
    • Idea: Policies that encourage token-efficient optimizers (where appropriate) via carbon budgeting and compute grants; standardized disclosures on optimizer configuration to enable fair comparisons and sustainability tracking.
    • Sector: Policy, Funding agencies, Standards bodies.
    • Tools/products: Reporting standards; incentive programs for adopting efficient training recipes.
    • Assumptions/dependencies: Agreement on metrics; minimal risk of perverse incentives.
  • Theoretical unification and guarantees
    • Idea: Formalize convergence and generalization under the “time-averaged orthogonality in expectation” framework; clarify when matrix whitening vs variance-adaptation narratives are predictive; connect to modular duality and µP scaling.
    • Sector: Academia, Theoretical ML.
    • Tools/products: Theory papers; reference implementations that match assumptions.
    • Assumptions/dependencies: New analytical tools; validation against large-scale experiments.
  • Practitioner-facing productization
    • Idea: Turn the paper’s takeaways into defaults in major frameworks: “Shampoo on weights only,” “two-sided by default,” “Adam-grafting on,” “ε autotune,” “matrix-optimizer-aware batch sizing,” and standardized logging of optimizer internals.
    • Sector: ML frameworks (PyTorch, JAX, DeepSpeed), Open-source ecosystems.
    • Tools/products: Optimizer plugins; recipe templates; telemetry modules.
    • Assumptions/dependencies: Community buy-in; maintenance for diverse model families.
