KL-Shampoo Optimizer

Updated 4 September 2025
  • KL-Shampoo is an adaptive optimization algorithm that reformulates gradient covariance estimation as a KL divergence minimization to enforce symmetric positive-definite preconditioners.
  • It eliminates heuristic Adam grafting and reduces memory overhead by employing two-sided preconditioner updates and moving average-based eigenvalue estimation.
  • KL-Shampoo demonstrates superior or equivalent performance in neural network training tasks, with applications ranging from language models to vision and multi-modal architectures.

KL-Shampoo is an adaptive optimization algorithm for training neural networks, designed to improve upon the Shampoo optimizer by reformulating second-moment (covariance) estimation as a Kullback-Leibler (KL) minimization problem. This alternative information-geometric perspective provides a more principled foundation for covariance estimation, removes specific heuristics from the original Shampoo approach, and enables stable training without referencing or grafting with Adam, thereby reducing memory overhead and complexity.

1. Formal Perspective: Covariance Estimation via KL Minimization

The original Shampoo algorithm uses a Kronecker-factored approximation to estimate the covariance of the gradients for each tensor-valued parameter. This second-moment estimation has traditionally been analyzed through the Frobenius norm, but that view does not guarantee that the preconditioner remains symmetric positive-definite (SPD), a property essential for producing meaningful descent directions during optimization.

KL-Shampoo reframes the problem through the Kullback-Leibler divergence between Gaussian distributions, where the second moment is interpreted as a covariance matrix. If $\Sigma$ denotes the empirical covariance and $M$ the preconditioner (Kronecker-factored as $M = A \otimes B$), the KL divergence between $\mathcal{N}(0,\Sigma)$ and $\mathcal{N}(0,M)$ is

$$\mathrm{KL}(\Sigma, M) = \frac{1}{2}\left(\log\det M + \mathrm{Tr}(M^{-1}\Sigma)\right) + \text{const}.$$

The KL objective is finite only when $M$ is SPD, so minimizing it enforces the SPD constraint implicitly, and the resulting covariance estimate carries a clear information-geometric justification rather than an ad hoc one.
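To make the objective concrete, the following NumPy sketch (illustrative only: the shapes, the row-major vectorization, and the `gaussian_kl` helper are assumptions, not code from the paper) evaluates $\mathrm{KL}(\Sigma, A \otimes B)$ and fails at the Cholesky factorization whenever the preconditioner is not SPD:

```python
import numpy as np

def gaussian_kl(Sigma, M):
    """KL(N(0, Sigma) || N(0, M)) up to an additive constant.

    M must be symmetric positive-definite; the Cholesky factorization
    raises an error otherwise, mirroring the implicit SPD constraint.
    """
    L = np.linalg.cholesky(M)                        # fails if M is not SPD
    logdet_M = 2.0 * np.sum(np.log(np.diag(L)))
    return 0.5 * (logdet_M + np.trace(np.linalg.solve(M, Sigma)))

# Toy empirical covariance of vectorized gradients, and a Kronecker-factored
# preconditioner M = A (x) B evaluated against it.
rng = np.random.default_rng(0)
d_a, d_b, n = 4, 3, 128
G = rng.standard_normal((n, d_a, d_b))               # batch of toy gradients
V = G.reshape(n, -1)                                  # row-major vec(g)
Sigma = V.T @ V / n + 1e-3 * np.eye(d_a * d_b)        # damped empirical covariance
A, B = np.eye(d_a), np.eye(d_b)
print(gaussian_kl(Sigma, np.kron(A, B)))
```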

2. Identified Limitation in Traditional Shampoo

Analysis reveals that the canonical Shampoo procedure updates only one Kronecker factor while holding the other fixed (typically to the identity), meaning it optimizes a one-sided KL minimization. This does not fully solve the joint problem for both factors. Moreover, Shampoo typically relies on "learning rate grafting," where Adam's updates stabilize training by correcting the scales in the optimizer's steps—a heuristic that introduces additional memory requirements and complexity.

Therefore, while Shampoo can stably train neural networks using these heuristics, its update schedule and covariance estimation are not jointly optimal when measured by KL divergence.

3. KL-Shampoo Design: Two-Sided Preconditioner Updates

KL-Shampoo is derived by considering the joint KL minimization for both Kronecker factors. The ideal update seeks to solve

$$\min_{A,B}\ \mathrm{KL}\!\left(\mathbb{E}[\mathrm{vec}(g)\,\mathrm{vec}(g)^\top],\, A \otimes B\right)$$

which leads to the coupled equations

$$A^* = \frac{1}{d_b}\,\mathbb{E}\!\left[g\,(B^*)^{-1} g^\top\right], \qquad B^* = \frac{1}{d_a}\,\mathbb{E}\!\left[g^\top (A^*)^{-1} g\right],$$

where $g$ is the matrix-shaped gradient (so that $\mathrm{vec}(g)$ has covariance approximated by $A \otimes B$) and $d_a$, $d_b$ are the dimensions of the factors $A$ and $B$, respectively.

Direct implementation is impractical due to coupled dependencies, so KL-Shampoo employs a moving average update:

$$A \leftarrow (1-\beta_2)\,A + \frac{\beta_2}{d_b}\,\mathbb{E}\!\left[g\,B^{-1} g^\top\right], \qquad B \leftarrow (1-\beta_2)\,B + \frac{\beta_2}{d_a}\,\mathbb{E}\!\left[g^\top A^{-1} g\right].$$

These updates ensure that the scaling of the step sizes is naturally controlled by the dimensionality of the factors, resolving the mis-scaling inherent in Shampoo and eliminating the necessity for external stabilization.
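A minimal sketch of this two-sided update for a single matrix-shaped gradient follows; the damping constant, the choice of `beta2`, and the use of the previous factors inside each expectation are illustrative assumptions rather than the paper's exact recipe.

```python
import numpy as np

def kl_factor_update(A, B, G, beta2=0.01, eps=1e-8):
    """One moving-average update of both Kronecker factors.

    Each factor is refreshed using the minibatch gradient G (shape d_a x d_b)
    whitened by the *other* factor, with the 1/d_b and 1/d_a scalings from the
    coupled fixed-point equations. Both correction terms are positive
    semi-definite, so the factors stay SPD for 0 < beta2 < 1.
    """
    d_a, d_b = G.shape
    B_inv = np.linalg.inv(B + eps * np.eye(d_b))      # damped inverse of B
    A_inv = np.linalg.inv(A + eps * np.eye(d_a))      # damped inverse of A
    A_new = (1 - beta2) * A + (beta2 / d_b) * (G @ B_inv @ G.T)
    B_new = (1 - beta2) * B + (beta2 / d_a) * (G.T @ A_inv @ G)
    return A_new, B_new
```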

Implementation is made computationally efficient by updating the eigenvalues with moving averages and refreshing the eigenbasis only infrequently via a QR decomposition. This avoids the cost of performing full eigendecompositions at every step while still maintaining high-quality preconditioners through approximate but robust matrix factorizations.
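One plausible realization of this schedule is sketched below with hypothetical helpers: the eigenbasis of each factor is nudged toward the current factor with a few QR (orthogonal-iteration) steps only occasionally, while the eigenvalue estimates are tracked every step by a moving average of the factor's diagonal in that possibly stale basis. The refresh frequency and the exact form of the eigenvalue update are assumptions of this sketch.

```python
import numpy as np

def refresh_eigenbasis(U, A, steps=1):
    """Occasional QR-based refresh of the eigenbasis of factor A.

    A few orthogonal-iteration steps move the (possibly stale) orthogonal
    basis U toward A's eigenbasis without a full eigendecomposition.
    """
    for _ in range(steps):
        U, _ = np.linalg.qr(A @ U)
    return U

def update_eigenvalues(lam, U, A, beta2=0.01):
    """Moving-average eigenvalue estimate of A, read off in the basis U."""
    diag_in_basis = np.einsum('ji,jk,ki->i', U, A, U)   # diag(U^T A U)
    return (1 - beta2) * lam + beta2 * diag_in_basis
```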

4. Performance and Empirical Improvements

Experimental results indicate that KL-Shampoo matches or exceeds SOAP (a Shampoo variant that relies on Adam for stabilization) in neural network pretraining tasks. For example, when training RWKV7 models, KL-Shampoo trains stably where Shampoo with power $p = \tfrac{1}{2}$ can fail without Adam stabilization. Because it requires no grafting, KL-Shampoo also reduces memory overhead while matching or exceeding SOAP's converged performance.

This demonstrates that principled covariance estimation grounded in KL minimization is sufficient for optimizer stabilization, and the moving-average and dimension-normalization features mitigate the issues found in prior heuristic approaches.

5. Practical Implementation and Computational Strategies

KL-Shampoo's practical scheme requires efficient matrix inverses and factor updates. Eigenvalue estimates are maintained with a moving average in a potentially stale eigenbasis, and the basis itself is refreshed occasionally using a QR-based method. This dual update schedule keeps the preconditioner close to optimal in KL divergence while keeping the computation scalable. A full eigendecomposition would provide both the eigenbasis and the eigenvalues at once, but the QR method is the lower-cost alternative when the eigenbasis changes slowly between refreshes.
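As a final illustration, a preconditioned step could be assembled from the tracked eigenbases and eigenvalues as below; the per-factor power of $1/2$ (corresponding to a step along $M^{-1/2}\,\mathrm{vec}(g)$ for $M = A \otimes B$) and the damping constant are conventions assumed for this sketch, not prescriptions from the paper.

```python
import numpy as np

def precondition(G, U_A, lam_A, U_B, lam_B, eps=1e-8):
    """Apply the Kronecker-factored preconditioner to a gradient G.

    Builds A^{-1/2} and B^{-1/2} from the tracked eigenbases (U_A, U_B) and
    eigenvalue estimates (lam_A, lam_B), then returns A^{-1/2} G B^{-1/2}.
    """
    inv_sqrt_A = U_A @ np.diag(1.0 / np.sqrt(lam_A + eps)) @ U_A.T
    inv_sqrt_B = U_B @ np.diag(1.0 / np.sqrt(lam_B + eps)) @ U_B.T
    return inv_sqrt_A @ G @ inv_sqrt_B
```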

6. Applications and Extension Potential

KL-Shampoo is particularly applicable to the pretraining of neural networks with tensor-valued parameters, where the KL-based framework generalizes naturally to higher-order tensor structures. Beyond LLM pretraining, the algorithm is suitable for large-scale vision models, multi-modal architectures, and any domain where structured second moment estimation is beneficial.

Future directions suggested by the paper include refining the moving-average update scheme, leveraging proximal-gradient methods for further efficiency, and extending the QR-based approach to more general tensor-valued preconditioners. A more in-depth theoretical analysis of KL-Shampoo's convergence under the joint KL minimization framework remains an area for further investigation.

7. Broader Context within Optimization Literature

KL-Shampoo sits at the intersection of covariance estimation, matrix/tensor factorization, and information geometry. Its design is informed by limitations uncovered through KL analysis and the desire to remove heuristic stabilization procedures such as Adam grafting. This shift toward principled optimizer parameter estimation demonstrates a growing trend in the optimization literature: leveraging information-theoretic objectives (such as KL divergence) for robust, scalable, and theoretically justified algorithmic improvements in deep learning optimizers (Lin et al., 3 Sep 2025).
