MuonH Optimizer: Stiefel Projection for ERM

Updated 1 April 2026
  • MuonH is a variant of the Muon optimizer that projects mini-batch gradients onto the Stiefel manifold to enforce orthonormal search directions.
  • It provides rigorous convergence guarantees for nonconvex ERM by addressing heavy-tailed noise and employing Hölder-smooth gradient assumptions.
  • The method achieves faster convergence rates compared to mini-batch SGD and ensures balanced learning in deep neural networks, particularly for tail classes.

MuonH is a variant of the Muon optimizer that incorporates orthonormal search directions via projection onto the Stiefel manifold, specifically designed for nonconvex empirical risk minimization (ERM) in the presence of heavy-tailed stochastic noise and Hölder-smoothness in the objective's gradients. MuonH generalizes normalized-SGD to the matrix setting and provides rigorous convergence guarantees under conditions where traditional assumptions, such as bounded variance, do not hold. The method leverages key advances in manifold optimization and heavy-tail resilient stochastic approximation to achieve faster convergence rates than standard mini-batch SGD in terms of the gradient norm, with particular relevance for deep neural networks and associative memory structures in LLMs.

1. Algorithmic Structure of MuonH

MuonH operates on a matrix parameter $W_t \in \mathbb{R}^{m \times n}$ at iteration $t$. Each update enforces orthogonality in the search direction by projecting a stochastic (mini-batch) gradient onto the Stiefel manifold. The central loop of MuonH (without momentum, $\beta = 0$) is:

  1. Draw a mini-batch $\xi_t = (\xi_{t,1},\dots,\xi_{t,b_t})$ from the dataset.
  2. Compute the mini-batch gradient:

$$G_t = \frac{1}{b_t}\sum_{i=1}^{b_t} \nabla f_{\xi_{t,i}}(W_t)$$

  3. Project $G_t$ onto the Stiefel manifold to obtain the orthonormal direction $O_t$:

$$O_t = \arg\min_{O \in \mathbb{R}^{m \times n} : O^\top O = I_n} \|O - G_t\|_F$$

If $G_t = U_t \Sigma_t V_t^\top$ (compact SVD), then $O_t = U_t V_t^\top$.

  4. Update parameters with step size $\eta_t > 0$:

$$W_{t+1} = W_t - \eta_t O_t$$

Momentum can be incorporated by replacing $G_t$ with a weighted sum of the current and past gradients. In practice, Newton–Schulz iteration is used as an efficient alternative to the full SVD for approximating $O_t$ (Iiduka, 16 Mar 2026).
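
The loop above can be sketched in NumPy as follows. This is a minimal illustration rather than a reference implementation: the names `stiefel_projection` and `muonh_step`, the step-size argument `eta`, the plain update "weights minus `eta` times the orthonormal direction", and the exponential-moving-average form of the momentum buffer are all assumptions made for the example, and the exact SVD is used in place of Newton–Schulz.

```python
import numpy as np


def stiefel_projection(G):
    """Return U @ V^T from the compact SVD G = U S V^T.

    For m >= n and full column rank this is the closest matrix with
    orthonormal columns to G in Frobenius norm.
    """
    U, _, Vt = np.linalg.svd(G, full_matrices=False)
    return U @ Vt


def muonh_step(W, G, eta, M=None, beta=0.0):
    """One illustrative step; returns the new weights and the momentum buffer.

    Assumption: momentum is an exponential moving average of mini-batch
    gradients. With beta = 0 the buffer equals G (the momentum-free loop).
    """
    M = G if M is None else beta * M + (1.0 - beta) * G
    O = stiefel_projection(M)
    return W - eta * O, M


rng = np.random.default_rng(0)
W = rng.standard_normal((6, 4))
G = rng.standard_normal((6, 4))      # stand-in for a mini-batch gradient
W_next, M = muonh_step(W, G, eta=0.1)
O = (W - W_next) / 0.1               # recover the search direction
print(np.round(O.T @ O, 6))          # approximately the 4x4 identity
```

Because the direction has orthonormal columns, the Frobenius length of every step is `eta * sqrt(n)` regardless of how large the gradient is.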

2. Theoretical Foundations: Hölder-Smoothness and Heavy-Tailed Noise

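MuonH is analyzed for empirical risk minimization objectives

$$f(W) = \frac{1}{N}\sum_{i=1}^{N} f_i(W)$$

with two structural properties:

  • Hölder-Smoothness: The gradients of the $f_i$ satisfy, for all $W, W' \in \mathbb{R}^{m \times n}$, an exponent $\nu \in (0, 1]$, and a constant $L > 0$,

$$\|\nabla f_i(W) - \nabla f_i(W')\| \le L\,\|W - W'\|^{\nu}$$

  • Heavy-Tailed Noise: The stochastic gradient estimator $\nabla f_{\xi}(W)$ is unbiased, with $p$-variance bounded by $\sigma^p$ for $p \in (1, 2]$:

$$\mathbb{E}\,\|\nabla f_{\xi}(W) - \nabla f(W)\|^p \le \sigma^p$$

The regime $p < 2$ admits genuinely heavy-tailed noise, encountered in practical large-scale learning (Iiduka, 16 Mar 2026).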
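
As a concrete scalar illustration of the smoothness condition, assume it takes the standard form |f'(x) - f'(y)| <= L |x - y|^nu (the symbols L and nu are assumptions of this example). The function f(x) = |x|^(1+nu) / (1+nu) has derivative sign(x) |x|^nu, which satisfies this bound with L = 2^(1-nu) but is not Lipschitz at the origin when nu < 1. The sketch below checks both facts numerically on a grid.

```python
import numpy as np

nu = 0.5                                   # assumed Hölder exponent in (0, 1]


def grad(x):
    """Derivative of f(x) = |x|**(1 + nu) / (1 + nu)."""
    return np.sign(x) * np.abs(x) ** nu


x = np.linspace(-2.0, 2.0, 81)             # symmetric grid with spacing 0.05
X, Y = np.meshgrid(x, x)
mask = X != Y                              # skip identical pairs
num = np.abs(grad(X) - grad(Y))[mask]
dist = np.abs(X - Y)[mask]

holder_ratio = np.max(num / dist ** nu)    # stays at or below 2**(1 - nu)
lipschitz_ratio = np.max(num / dist)       # grows as the grid is refined near 0

print(holder_ratio, 2 ** (1 - nu))
print(lipschitz_ratio)
```

The Hölder ratio stays bounded while the Lipschitz ratio does not, which is the kind of objective that a Hölder assumption covers and a standard smoothness assumption excludes.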

3. Convergence Guarantees and Rate Improvements

The main convergence theorem establishes that, under suitable conditions on the step-size and mini-batch schedules, the iterates $(W_t)$ converge almost surely to stationary points, even under heavy-tailed noise.

Convergence Rate Comparison:

  • Mini-batch SGD's rate is established for the squared gradient norm, which translates into a rate for the gradient norm itself.
  • MuonH's rate is established directly for the gradient norm, and it improves on the SGD rate in that metric (Iiduka, 16 Mar 2026).

4. Comparative Structure: Muon, MuonH, and AdamW

The Muon family is fundamentally distinct from standard optimizers through its use of matrix manifold geometry and explicit spectral (operator-norm) constraints:

  • Muon (original) uses a normalized-momentum update direction from the SVD of the momentum buffer, enforcing search directions along the top singular vectors.
  • MuonH solves a Hessian-free trust-region subproblem, with updates involving the rank decomposition and nuclear-norm scaling, standardizing step size and directionality (Li et al., 5 Feb 2025).
  • AdamW performs elementwise adaptive scaling via second-order moments, but does not globally control the layer capacity or spectrum. AdamW can induce norm growth and spectral concentration, problematic in pathological regimes such as grokking and heavy-tailed class distributions (Tveit et al., 22 Apr 2025).

| Optimizer | Update Form | Spectral Constraint | Preconditioning |
|---|---|---|---|
| Muon | […] | […] | Momentum / SVD |
| MuonH | […] | Operator norm | Hessian-free, SVD |
| AdamW | […] | None | Diagonal, no SVD |
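
To make the "Spectral Constraint" column concrete, the sketch below compares the spectral norm of an orthogonalized direction with that of the first bias-corrected Adam direction, which reduces to roughly sign(G) elementwise. It is an illustration only: the weight-decay term of AdamW is omitted, and the matrix size and `eps` value are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 6, 4
G = rng.standard_normal((m, n))            # stand-in gradient for one layer

# Orthogonalized direction: all singular values are set to 1.
U, _, Vt = np.linalg.svd(G, full_matrices=False)
O = U @ Vt

# First Adam step with bias correction: m_hat = G and v_hat = G**2,
# so the direction is G / (|G| + eps), i.e. roughly sign(G) elementwise.
eps = 1e-8
adam_dir = G / (np.sqrt(G ** 2) + eps)

spec_O = np.linalg.norm(O, ord=2)          # spectral (operator) norm
spec_adam = np.linalg.norm(adam_dir, ord=2)

print(spec_O)      # close to 1 by construction
print(spec_adam)   # at least about sqrt(m), since every entry has magnitude near 1
```

The elementwise direction has no built-in cap on its operator norm, whereas the orthogonalized direction has operator norm 1 by construction.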

5. Empirical Behavior: Stability under Heavy-Tail and Grokking

MuonH and the Muon family demonstrate practical advantages in domains with:

  • Heavy-tailed noise: Empirically observed in large-scale training on non-uniform data distributions; MuonH's norm-based step mitigates gradient explosion and learning imbalance in rare/“tail” classes (Wang et al., 30 Sep 2025).
  • Grokking regime: On modular arithmetic and parity tasks, Muon achieves an approximately 33% reduction in mean grokking epoch (the memorization-to-generalization transition) relative to AdamW (102.89 vs 153.09), a statistically highly significant difference (Tveit et al., 22 Apr 2025).
  • Associative memory and transformers: Experiments and theory show Muon’s update yields an isotropic singular value spectrum in critical associative-memory blocks (Value/Output attention, FFNs), ensuring balanced learning even for tail classes where Adam yields high disparity (Wang et al., 30 Sep 2025).
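
The robustness to large gradients noted above follows from the direction depending only on the singular vectors of the gradient, not on its scale. The sketch below is an illustration with arbitrary sizes and a hypothetical step size `eta`, using the exact SVD: it rescales a gradient by a large factor, as an extreme outlier batch might, and compares the step lengths of a plain SGD step and an orthogonalized step.

```python
import numpy as np


def orthogonalize(G):
    """U @ V^T from the compact SVD of G (all singular values set to 1)."""
    U, _, Vt = np.linalg.svd(G, full_matrices=False)
    return U @ Vt


rng = np.random.default_rng(1)
G = rng.standard_normal((8, 3))
eta = 0.01                                  # hypothetical step size

for scale in (1.0, 1e6):                    # 1e6 mimics an extreme outlier batch
    sgd_step = eta * scale * G
    ortho_step = eta * orthogonalize(scale * G)
    print(scale, np.linalg.norm(sgd_step), np.linalg.norm(ortho_step))

# Singular values of the orthogonalized direction are all 1 (isotropic spectrum).
sv = np.linalg.svd(orthogonalize(G), compute_uv=False)
print(sv)
```

The SGD step length grows with the scale factor, while the orthogonalized step length stays at `eta * sqrt(3)` for this 8 x 3 example.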

6. Hyperparameter Regimes and Implementation Notes

Robust operation of MuonH requires:

  • Hölder exponent: Typically 1 (smooth ERM), but any value in $(0, 1]$ suffices.
  • Tail index: Empirically estimated, or set to 2 for bounded variance; a value in $(1, 2)$ otherwise.
  • Step-size schedule: Chosen to satisfy the step-size conditions of the convergence theorem (Section 3).
  • Batch size: Moderate constants (256–1024) or slow exponential increase.
  • SVD approximation: 5 Newton–Schulz steps typically suffice, incurring negligible computational overhead relative to backpropagation.
  • Momentum: Parameter $\beta$, with $\beta = 0$ recovering the momentum-free loop. With momentum, the additional terms in the convergence condition must remain summable.
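
The Newton–Schulz approximation mentioned in the list can be sketched as follows. The source does not state which polynomial iteration is used, so the classical cubic iteration X <- 1.5 X - 0.5 X X^T X is assumed here, started from the Frobenius-normalized gradient so that all singular values lie in (0, 1]. The function name `newton_schulz` and the test matrix are illustrative.

```python
import numpy as np


def newton_schulz(G, steps=5):
    """Approximate U @ V^T (from G = U S V^T) without computing an SVD.

    Assumption: classical cubic Newton-Schulz iteration. Each step maps every
    singular value s of X to 1.5*s - 0.5*s**3, which drives s towards 1.
    """
    X = G / np.linalg.norm(G)               # Frobenius normalization: s <= 1
    for _ in range(steps):
        X = 1.5 * X - 0.5 * X @ (X.T @ X)
    return X


rng = np.random.default_rng(0)
G = rng.standard_normal((6, 4))
U, _, Vt = np.linalg.svd(G, full_matrices=False)
O_exact = U @ Vt

for steps in (5, 40):
    err = np.linalg.norm(newton_schulz(G, steps) - O_exact)
    print(steps, err)                        # error shrinks as steps increase
```

Each step costs only a few matrix multiplications. How close five steps get to the exact projection depends on the conditioning of the gradient and on the polynomial used, neither of which is fixed by this sketch.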

Recommended settings for neural LLMs include a spectral-norm bound and no weight decay on attention/FFN blocks, the latter to isolate the optimizer's effects (Iiduka, 16 Mar 2026, Tveit et al., 22 Apr 2025, Wang et al., 30 Sep 2025).

7. Practical Significance and Research Directions

MuonH and related Muon optimizers target training in nonconvex, weakly smooth (Hölder), and statistically imbalanced regimes. Key advantages are:

  • Provably faster convergence in gradient norm under minimal smoothness and with heavy-tailed stochastic effects.
  • Isotropic singular-value evolution in critical network blocks, translating to improved learning for rare/“tail” data—a major advantage for long-tailed NLP and vision benchmarks.
  • Empirical acceleration of delayed generalization transitions (grokking) and balanced performance across head and tail classes.
  • Layerwise normalization preventing operator-norm blowup and aligning with implicit regularization trends seen empirically.

Ongoing research aims to integrate MuonH more closely with LLM pretraining pipelines, to further optimize SVD approximations, and to generalize the convergence analysis to settings with additional nonlinear components (e.g., batchnorm, attention) or structured noise (Iiduka, 16 Mar 2026, Li et al., 5 Feb 2025, Wang et al., 30 Sep 2025).
