Sufficiency of eigenvalue-corrected Shampoo (EShampoo) under μP
Investigate whether eigenvalue-corrected Shampoo (EShampoo), which eliminates the need for learning-rate grafting, remains sufficient under the Maximal Update Parameterization (μP), where per-layer learning-rate scaling is already governed by width-dependent initialization rules, and determine the compatibility conditions between μP scaling and EShampoo's preconditioner corrections. A minimal sketch of the two mechanisms in combination is given below.
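To make the setup concrete, the following Python sketch combines a μP-style per-layer learning-rate rule (hidden weight matrices scaled as 1/fan_in) with a SOAP-style eigenvalue-corrected Shampoo step that tracks a diagonal second moment in the preconditioner's eigenbasis rather than grafting the update norm from another optimizer. The function and class names (`mup_lr`, `EShampooLayerState`, `eshampoo_update`) and all hyperparameters are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def mup_lr(base_lr, fan_in, layer_type="hidden"):
    # muP (assumed standard table): hidden weight matrices get a learning rate
    # scaled ~ 1/fan_in; other parameter groups keep the base learning rate.
    return base_lr / fan_in if layer_type == "hidden" else base_lr

class EShampooLayerState:
    """State for one weight matrix W (m x n) under eigenvalue-corrected Shampoo."""
    def __init__(self, m, n, eps=1e-12):
        self.L = eps * np.eye(m)   # left Kronecker factor  (EMA of G @ G.T)
        self.R = eps * np.eye(n)   # right Kronecker factor (EMA of G.T @ G)
        self.D = np.zeros((m, n))  # per-coordinate second moment in the eigenbasis
        self.eps = eps

def eshampoo_update(state, G, beta2=0.99):
    """One eigenvalue-corrected Shampoo step (hypothetical sketch).

    Instead of grafting the step size from Adam/SGD, a diagonal second moment
    is maintained directly in the preconditioner eigenbasis; this is the
    "eigenvalue correction" that removes the need for grafting.
    """
    # Accumulate Kronecker factors of the gradient covariance.
    state.L = beta2 * state.L + (1 - beta2) * (G @ G.T)
    state.R = beta2 * state.R + (1 - beta2) * (G.T @ G)

    # Eigenbases of the two factors (recomputed every step here for clarity;
    # in practice this would be amortized over many steps).
    _, QL = np.linalg.eigh(state.L)
    _, QR = np.linalg.eigh(state.R)

    # Rotate the gradient into the joint eigenbasis and apply an Adam-style
    # diagonal correction there.
    G_rot = QL.T @ G @ QR
    state.D = beta2 * state.D + (1 - beta2) * G_rot**2
    step_rot = G_rot / (np.sqrt(state.D) + state.eps)

    # Rotate back to parameter space.
    return QL @ step_rot @ QR.T

# Usage: a hidden layer of width n receives lr = base_lr / fan_in under muP.
m, n = 64, 64
state = EShampooLayerState(m, n)
W = np.random.randn(m, n) / np.sqrt(n)
G = np.random.randn(m, n)
lr = mup_lr(2e-2, fan_in=n, layer_type="hidden")
W -= lr * eshampoo_update(state, G)
```

The open question is whether the 1/fan_in scaling applied in `mup_lr` still induces width-consistent updates once the step direction is reshaped by the eigenbasis correction, or whether the μP exponents would need to be rederived for this preconditioner.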
References
Whether such corrections remain sufficient under μP, where per-layer learning rate scaling is already governed by width-dependent initialization rules, remains an open question.
— Towards Guided Descent: Optimization Algorithms for Training Neural Networks At Scale
(2512.18373 - Nagwekar, 20 Dec 2025) in Subsection “Interplay with μP and Optimizer Choice” within Section “Learning Rate Schedules”