MuonH Optimizer: Stiefel Projection for ERM

Updated 1 April 2026

MuonH is a variant of the Muon optimizer that projects mini-batch gradients onto the Stiefel manifold to enforce orthonormal search directions.
It provides rigorous convergence guarantees for nonconvex ERM by addressing heavy-tailed noise and employing Hölder-smooth gradient assumptions.
The method achieves faster convergence rates compared to mini-batch SGD and ensures balanced learning in deep neural networks, particularly for tail classes.

MuonH is a variant of the Muon optimizer whose search directions are made orthonormal by projection onto the Stiefel manifold. It is designed for nonconvex empirical risk minimization (ERM) when the stochastic noise is heavy-tailed and the gradients of the objective are Hölder-continuous (Hölder-smoothness). MuonH generalizes normalized SGD to the matrix setting and provides rigorous convergence guarantees in settings where traditional assumptions, such as bounded variance, do not hold. Drawing on manifold optimization and heavy-tail-resilient stochastic approximation, it achieves faster convergence rates than standard mini-batch SGD in terms of the gradient norm, with particular relevance for deep neural networks and associative-memory structures in LLMs.

1. Algorithmic Structure of MuonH

MuonH operates on a matrix parameter $W_t \in \mathbb{R}^{m \times n}$ at iteration $t$. Each update enforces orthogonality in the search direction by projecting a stochastic (mini-batch) gradient onto the Stiefel manifold. The central loop of MuonH (without momentum, $\beta = 0$) is:

Draw a mini-batch $\xi_t = (\xi_{t,1},\dots,\xi_{t,b_t})$ from the dataset.
Compute the mini-batch gradient:

$G_t = \frac{1}{b_t}\sum_{i=1}^{b_t} \nabla f_{\xi_{t,i}}(W_t)$

Project $G_t$ onto the Stiefel manifold to obtain the orthonormal direction $O_t$:

$O_t = \arg\min_{O \in \mathbb{R}^{m \times n} : O^\top O = I_n} \|O - G_t\|_F$

If $G_t = U_t \Sigma_t V_t^\top$ (compact SVD), then $O_t = U_t V_t^\top$.

Update parameters with step size $\eta_t > 0$:

$W_{t+1} = W_t - \eta_t O_t$

Momentum can be incorporated by replacing $G_t$ with a weighted sum incorporating past gradients. Newton–Schulz iteration is used in practice as an efficient alternative to the full SVD for approximating $O_t$ (Iiduka, 16 Mar 2026).
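The loop above (mini-batch gradient, projection via the compact SVD, parameter update) can be sketched in NumPy as follows. This is a minimal illustration, not the reference implementation: the least-squares toy loss, the function names, the step-size schedule, and the exponential-average form of the momentum buffer are all assumptions made for the example.

```python
import numpy as np


def stiefel_project(G):
    """Nearest matrix with orthonormal columns (Frobenius norm): U V^T from the compact SVD."""
    U, _, Vt = np.linalg.svd(G, full_matrices=False)
    return U @ Vt


def minibatch_grad(W, X, Y):
    """Mini-batch gradient of the toy loss f_i(W) = 0.5 * ||W x_i - y_i||^2.

    X has shape (b, n), Y has shape (b, m), W has shape (m, n).
    """
    residual = X @ W.T - Y              # shape (b, m)
    return residual.T @ X / X.shape[0]  # shape (m, n)


def muonh_step(W, G, lr, M=None, beta=0.0):
    """One update W <- W - lr * O, with O the projected (momentum) gradient.

    beta = 0 gives the momentum-free loop; the exponential average used for
    beta > 0 is an assumed form of the "weighted sum of past gradients".
    """
    M = G if M is None else beta * M + (1.0 - beta) * G
    O = stiefel_project(M)
    return W - lr * O, M


# Toy run on synthetic data (hypothetical sizes and step-size schedule).
rng = np.random.default_rng(0)
m, n, b = 6, 4, 32
W_true = rng.standard_normal((m, n))
W = np.zeros((m, n))
for t in range(200):
    X = rng.standard_normal((b, n))
    Y = X @ W_true.T
    G = minibatch_grad(W, X, Y)
    W, _ = muonh_step(W, G, lr=0.5 / np.sqrt(t + 1))

print("distance to W_true:", np.linalg.norm(W - W_true))
```

Because every step has the same Frobenius length `lr * sqrt(n)` regardless of the gradient's magnitude, the step size alone controls how far the parameters move, which is why a decaying schedule is used in the toy run.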

2. Theoretical Foundations: Hölder-Smoothness and Heavy-Tailed Noise

MuonH is analyzed for empirical risk minimization objectives

$f(W) = \frac{1}{N}\sum_{i=1}^{N} f_i(W)$

with two structural properties:

Hölder-Smoothness: The gradients of $f$ satisfy, for all $W, W' \in \mathbb{R}^{m \times n}$ and an exponent $\nu \in (0,1]$,

$\|\nabla f(W) - \nabla f(W')\| \le L\,\|W - W'\|^{\nu}$

Heavy-Tailed Noise: The stochastic gradient estimator $\nabla f_{\xi}(W)$ is unbiased, with $p$-variance bounded by $\sigma^p$ for $p \in (1,2]$:

$\mathbb{E}_{\xi}\left[\|\nabla f_{\xi}(W) - \nabla f(W)\|^{p}\right] \le \sigma^{p}$

The regime $p < 2$ admits genuinely heavy-tailed noise, encountered in practical large-scale learning (Iiduka, 16 Mar 2026).
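To make the heavy-tailed regime concrete, the sketch below uses a Pareto distribution as a stand-in for the noise magnitude. The Pareto choice, the tail index 1.5, and the function name are illustrative assumptions; the source does not specify a noise distribution. The point is that a moment of order below 2 can be finite while the second moment (and hence the variance) is infinite, which is exactly the case that a bounded-variance assumption excludes.

```python
import math


def pareto_moment(alpha, p):
    """E[X^p] for X ~ Pareto(scale=1, shape=alpha).

    The moment equals alpha / (alpha - p) when p < alpha and is infinite otherwise.
    """
    return alpha / (alpha - p) if p < alpha else math.inf


alpha = 1.5  # hypothetical tail index of the noise magnitude

moment_p = pareto_moment(alpha, 1.2)       # finite: a moment bound of order 1.2 can hold
second_moment = pareto_moment(alpha, 2.0)  # infinite: bounded variance fails

print("moment of order 1.2:", moment_p)
print("moment of order 2.0:", second_moment)
```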

3. Convergence Guarantees and Rate Improvements

The main convergence theorem establishes that, under suitable conditions on the step-size and mini-batch schedules, the iterates $(W_t)$ reach stationarity almost surely, i.e. MuonH converges to stationary points even under heavy-tailed noise. These conditions can be met by an explicit step-size schedule.

Convergence Rate Comparison:

Mini-batch SGD's guarantee is stated for the squared gradient norm, so the implied rate for the gradient norm itself is the square root of that rate.
MuonH's guarantee is stated directly for the gradient norm, i.e., MuonH improves the rate in the gradient norm metric.

In the regime highlighted by the analysis, SGD's best achievable rate applies to the squared norm, while MuonH's rate applies to the norm (Iiduka, 16 Mar 2026).

4. Comparative Structure: Muon, MuonH, and AdamW

The Muon family is fundamentally distinct from standard optimizers through its use of matrix manifold geometry and explicit spectral (operator-norm) constraints:

Muon (original) takes its update direction from the SVD of the momentum buffer: the singular values are replaced by ones, giving a normalized-momentum direction that weights all singular directions equally.
MuonH solves a Hessian-free trust-region subproblem, with updates involving the rank decomposition and nuclear-norm scaling, standardizing step size and directionality (Li et al., 5 Feb 2025).
AdamW performs elementwise adaptive scaling via second-moment estimates of the gradient, but does not globally control the layer capacity or spectrum. AdamW can induce norm growth and spectral concentration, which is problematic in pathological regimes such as grokking and heavy-tailed class distributions (Tveit et al., 22 Apr 2025).

Optimizer	Update Form	Spectral Constraint	Preconditioning
Muon	Orthogonalized momentum step	Operator norm	Momentum / SVD
MuonH	Trust-region step with nuclear-norm scaling	Operator norm	Hessian-free, SVD
AdamW	Elementwise adaptive step	None	Diagonal, no SVD
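The contrast between a spectral and an elementwise update can be seen on a single gradient matrix. The sketch below is illustrative only: the gradient with singular values 10, 1, 0.1 is a made-up example, and the AdamW direction shown is the first step without weight decay, where bias correction gives `m_hat = g` and `v_hat = g**2`.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical gradient with a strongly skewed spectrum (singular values 10, 1, 0.1).
U, _ = np.linalg.qr(rng.standard_normal((5, 3)))
V, _ = np.linalg.qr(rng.standard_normal((3, 3)))
G = U @ np.diag([10.0, 1.0, 0.1]) @ V.T

# Muon-style direction: keep the singular vectors, set every singular value to one.
Ug, _, Vgt = np.linalg.svd(G, full_matrices=False)
muon_dir = Ug @ Vgt

# First AdamW step (no weight decay): g / (|g| + eps), i.e. roughly sign(g) per entry.
eps = 1e-8
adam_dir = G / (np.sqrt(G**2) + eps)

muon_sv = np.linalg.svd(muon_dir, compute_uv=False)
adam_sv = np.linalg.svd(adam_dir, compute_uv=False)
print("Muon-style singular values:      ", np.round(muon_sv, 3))
print("AdamW first-step singular values:", np.round(adam_sv, 3))
```

The Muon-style direction has a flat spectrum by construction, whereas the elementwise direction has whatever spectrum the sign pattern of the gradient happens to produce.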

5. Empirical Behavior: Stability under Heavy-Tail and Grokking

MuonH and the Muon family demonstrate practical advantages in domains with:

Heavy-tailed noise: Empirically observed in large-scale training on non-uniform data distributions; MuonH's norm-based step mitigates gradient explosion and learning imbalance in rare/“tail” classes (Wang et al., 30 Sep 2025).
Grokking regime: On modular arithmetic and parity tasks, Muon achieves a roughly 33% reduction in mean grokking epoch (the memorization-to-generalization transition) vs AdamW (102.89 vs 153.09), a statistically highly significant difference (Tveit et al., 22 Apr 2025).
Associative memory and transformers: Experiments and theory show Muon’s update yields an isotropic singular value spectrum in critical associative-memory blocks (Value/Output attention, FFNs), ensuring balanced learning even for tail classes where Adam yields high disparity (Wang et al., 30 Sep 2025).
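The size of the grokking speed-up follows directly from the two mean epochs quoted above; the short check below recomputes the relative reduction (variable names are chosen for the example).

```python
muon_epoch = 102.89   # mean grokking epoch reported for Muon
adamw_epoch = 153.09  # mean grokking epoch reported for AdamW

relative_reduction = (adamw_epoch - muon_epoch) / adamw_epoch
print(f"Muon needs {100 * relative_reduction:.1f}% fewer epochs than AdamW")
```

This gives a reduction of about 32.8%, i.e. roughly one third.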

6. Hyperparameter Regimes and Implementation Notes

Robust operation of MuonH requires:

Hölder exponent $\nu$: Typically $\nu = 1$ (smooth ERM), but any $\nu \in (0,1]$ suffices.
Tail index $p$: Empirically estimated or set to $p = 2$ for bounded variance; $p \in (1,2)$ otherwise.
Step-size schedule: chosen to satisfy the schedule conditions of the convergence theorem.
Batch size: Moderate constants (256–1024) or slow exponential increase.
SVD approximation: 5 Newton–Schulz steps typically suffice, incurring negligible computational overhead relative to backpropagation.
Momentum: parameter $\beta$, with $\beta = 0$ recovering the momentum-free loop. With momentum, additional terms appear in the convergence condition, and these must remain summable.

Recommended settings for neural LLMs include a spectral-norm bound and no weight decay on attention/FFN blocks, the latter to isolate the optimizer's effect (Iiduka, 16 Mar 2026, Tveit et al., 22 Apr 2025, Wang et al., 30 Sep 2025).
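The Newton–Schulz approximation mentioned above can be sketched with the classical cubic iteration, which uses only matrix multiplications. This is an assumption-laden illustration: the function name and the cubic coefficients (1.5, -0.5) are the textbook variant, whereas practical Muon implementations may use different, tuned polynomial coefficients, so the step counts printed here should not be read as a statement about those implementations.

```python
import numpy as np


def newton_schulz_orthogonalize(G, steps=5):
    """Approximate the polar factor U V^T of G with the cubic Newton-Schulz iteration.

    X <- 1.5 * X - 0.5 * X (X^T X), started from G / ||G||_F so that all
    singular values lie in (0, 1], inside the iteration's convergence region.
    """
    X = G / (np.linalg.norm(G) + 1e-12)  # Frobenius normalization
    for _ in range(steps):
        X = 1.5 * X - 0.5 * (X @ (X.T @ X))
    return X


rng = np.random.default_rng(0)
G = rng.standard_normal((6, 4))

# Reference answer from the compact SVD.
U, _, Vt = np.linalg.svd(G, full_matrices=False)
exact = U @ Vt

for steps in (5, 10, 40):
    err = np.linalg.norm(newton_schulz_orthogonalize(G, steps) - exact)
    print(f"steps={steps:2d}  Frobenius error vs SVD: {err:.2e}")
```

Each iteration maps a singular value `s` to `1.5 * s - 0.5 * s**3`, which pushes every nonzero singular value toward one while leaving the singular vectors unchanged; small singular values grow by about a factor of 1.5 per step, so poorly conditioned inputs need more iterations with this untuned variant.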

7. Practical Significance and Research Directions

MuonH and related Muon optimizers provide algorithmic infrastructure for learning dynamics in nonconvex, nonsmooth, and statistically imbalanced regimes. Key advantages are:

Provably faster convergence in gradient norm under minimal smoothness and with heavy-tailed stochastic effects.
Isotropic singular-value evolution in critical network blocks, translating to improved learning for rare/“tail” data, a major advantage for long-tailed NLP and vision benchmarks.
Empirical acceleration of delayed generalization transitions (grokking) and balanced performance across head and tail classes.
Layerwise normalization preventing operator-norm blowup and aligning with implicit regularization trends seen empirically.

Ongoing research aims to integrate MuonH more closely with LLM pretraining pipelines, optimize SVD approximations further, and generalize convergence analysis to settings with additional nonlinear (e.g., batchnorm, attention) or structured noise (Iiduka, 16 Mar 2026, Li et al., 5 Feb 2025, Wang et al., 30 Sep 2025).

References (4)

Muon Converges under Heavy-Tailed Noise: Nonconvex Hölder-Smooth Empirical Risk Minimization (2026)

A Note on the Convergence of Muon (2025)

Muon Optimizer Accelerates Grokking (2025)

Muon Outperforms Adam in Tail-End Associative Memory Learning (2025)
