MuonH Optimizer: Stiefel Projection for ERM
- MuonH is a variant of the Muon optimizer that projects mini-batch gradients onto the Stiefel manifold to enforce orthonormal search directions.
- It provides rigorous convergence guarantees for nonconvex ERM by addressing heavy-tailed noise and employing Hölder-smooth gradient assumptions.
- The method achieves faster convergence rates compared to mini-batch SGD and ensures balanced learning in deep neural networks, particularly for tail classes.
MuonH is a variant of the Muon optimizer that incorporates orthonormal search directions via projection onto the Stiefel manifold, specifically designed for nonconvex empirical risk minimization (ERM) in the presence of heavy-tailed stochastic noise and Hölder-smoothness in the objective's gradients. MuonH generalizes normalized-SGD to the matrix setting and provides rigorous convergence guarantees under conditions where traditional assumptions, such as bounded variance, do not hold. The method leverages key advances in manifold optimization and heavy-tail resilient stochastic approximation to achieve faster convergence rates than standard mini-batch SGD in terms of the gradient norm, with particular relevance for deep neural networks and associative memory structures in LLMs.
1. Algorithmic Structure of MuonH
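MuonH operates on a matrix parameter $W_t \in \mathbb{R}^{m \times n}$ at iteration $t$. Each update enforces orthogonality in the search direction by projecting a stochastic (mini-batch) gradient onto the Stiefel manifold. The central loop of MuonH (without momentum, $\beta = 0$) is:
- Draw a mini-batch $B_t$ from the dataset.
- Compute the mini-batch gradient $G_t = \frac{1}{|B_t|} \sum_{i \in B_t} \nabla f_i(W_t)$, where $f_i$ is the loss on sample $i$.
- Project $G_t$ onto the Stiefel manifold to obtain the orthonormal direction $O_t$:
If $G_t = U \Sigma V^\top$ (compact SVD), $O_t = U V^\top$.
- Update parameters:
$$W_{t+1} = W_t - \eta_t O_t$$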
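The loop above can be sketched in a few lines of NumPy. This is a minimal illustration rather than a reference implementation: the function names, the random stand-in for a mini-batch gradient, and the step size of 0.1 are all assumptions made for the example.

```python
import numpy as np

def stiefel_projection(G):
    """Return U @ Vt from the compact SVD G = U diag(s) Vt."""
    U, _, Vt = np.linalg.svd(G, full_matrices=False)
    return U @ Vt

def muonh_step(W, G, lr):
    """One momentum-free update: move W along the orthonormal direction of G."""
    return W - lr * stiefel_projection(G)

rng = np.random.default_rng(0)
W = rng.standard_normal((4, 3))   # illustrative parameter matrix
G = rng.standard_normal((4, 3))   # stand-in for a mini-batch gradient
O = stiefel_projection(G)         # orthonormal columns: O.T @ O = I
W_next = muonh_step(W, G, lr=0.1)
```

Because the direction has orthonormal columns, the size of each step is set by the step size alone and not by the magnitude of the gradient, which is the matrix analogue of normalized SGD.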
Momentum can be incorporated by replacing the mini-batch gradient $G_t$ with a weighted sum that also incorporates past gradients. In practice, Newton–Schulz iteration is used as an efficient alternative to the full SVD for approximating the orthonormal factor $U V^\top$ (Iiduka, 16 Mar 2026).
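As a hedged sketch of the SVD-free route, the block below uses the classical cubic Newton–Schulz iteration, which is not necessarily the polynomial or coefficient set used in any particular Muon implementation. The exponential-moving-average form of the momentum buffer and the value `beta=0.9` are likewise assumptions for illustration.

```python
import numpy as np

def newton_schulz_orthogonalize(G, steps=25):
    """Approximate the orthonormal polar factor of G without an SVD.

    Classical cubic iteration X <- 1.5 X - 0.5 X X^T X, started from G scaled
    to unit Frobenius norm so that every singular value lies in (0, 1].
    """
    X = G / (np.linalg.norm(G) + 1e-12)
    for _ in range(steps):
        X = 1.5 * X - 0.5 * X @ (X.T @ X)
    return X

def momentum_buffer(M, G, beta=0.9):
    """Hypothetical momentum rule: exponential moving average of gradients."""
    return beta * M + (1.0 - beta) * G

# Build a 4x3 test gradient with known singular values 3, 2, 1.
rng = np.random.default_rng(0)
U, _ = np.linalg.qr(rng.standard_normal((4, 3)))
V, _ = np.linalg.qr(rng.standard_normal((3, 3)))
G = U @ np.diag([3.0, 2.0, 1.0]) @ V.T

O_svd = U @ V.T                             # exact factor, since G = U diag(s) V^T
O_ns = newton_schulz_orthogonalize(G)       # SVD-free approximation
M = momentum_buffer(np.zeros_like(G), G)    # first momentum step from a zero buffer
O_mom = newton_schulz_orthogonalize(M)      # direction taken from the buffer
```

The iteration only uses matrix products, which is why it maps well onto accelerators. With few steps it returns an approximation whose singular values are close to, but not exactly, one; the 25 steps here are chosen so the result matches the SVD to high precision on this small example.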
2. Theoretical Foundations: Hölder-Smoothness and Heavy-Tailed Noise
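MuonH is analyzed for empirical risk minimization objectives
$$f(W) = \frac{1}{N} \sum_{i=1}^{N} f_i(W)$$
with two structural properties:
- Hölder-Smoothness: The gradients of $f$ satisfy, for all $W, W'$ and exponent $\nu \in (0, 1]$,
$$\|\nabla f(W) - \nabla f(W')\| \le L \, \|W - W'\|^{\nu}$$
- Heavy-Tailed Noise: The stochastic gradient estimator $G_\xi(W)$ is unbiased, with $p$-variance bounded by $\sigma^p$ for $p \in (1, 2]$:
$$\mathbb{E}\,\|G_\xi(W) - \nabla f(W)\|^{p} \le \sigma^{p}$$
The regime $p < 2$ admits genuinely heavy-tailed noise, encountered in practical large-scale learning (Iiduka, 16 Mar 2026).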
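To make the heavy-tailed regime concrete, the sketch below evaluates absolute moments of a symmetric Pareto-tailed noise variable in closed form. The noise model, the tail exponent `alpha = 1.5`, and the function name are illustrative assumptions; they are not taken from the cited analysis.

```python
import math

def pareto_abs_moment(alpha, p, x_min=1.0):
    """E|xi|^p for xi = s * X, with s = +/-1 equally likely and X ~ Pareto(alpha, x_min).

    The moment equals alpha * x_min**p / (alpha - p) and is finite only when p < alpha.
    """
    if p >= alpha:
        return math.inf
    return alpha * x_min**p / (alpha - p)

alpha = 1.5                                   # hypothetical tail exponent of the noise
moment_p = pareto_abs_moment(alpha, p=1.2)    # finite p-th moment
variance = pareto_abs_moment(alpha, p=2.0)    # second moment: infinite
```

Such noise has mean zero and a finite moment of order 1.2, yet its variance is infinite, so a bounded-variance assumption fails while a bound on a lower-order moment still holds.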
3. Convergence Guarantees and Rate Improvements
The main convergence theorem establishes that, under suitable step-size and mini-batch schedules:
- 4
- 5
- 6 the iterates 7 satisfy almost surely
8
i.e. convergence to stationary points even under heavy-tailed noise. The step-size can be taken as 9 with 0.
Convergence Rate Comparison:
- Mini-batch SGD achieves 1 so 2
- MuonH achieves 3, i.e., MuonH improves the rate in the gradient norm metric.
If 4 (e.g., 5 for 6), SGD's best achievable rate is 7 (squared-norm), while MuonH attains 8 (norm) (Iiduka, 16 Mar 2026).
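When one rate is stated for the squared gradient norm and the other for the norm itself, the two have to be put in the same metric before they can be compared. A standard way to do this is Jensen's inequality; here $r_T$ and $W_T$ are placeholder notation for a generic rate and iterate, not symbols from the cited analysis.

```latex
\mathbb{E}\,\|\nabla f(W_T)\|
  \;\le\; \sqrt{\mathbb{E}\,\|\nabla f(W_T)\|^{2}}
  \;\le\; \sqrt{r_T}
\qquad\text{whenever}\qquad
\mathbb{E}\,\|\nabla f(W_T)\|^{2} \le r_T .
```

A squared-norm rate of order $T^{-c}$ therefore only guarantees a norm rate of order $T^{-c/2}$, and that square-rooted rate is the one to set against a bound proved directly for the gradient norm.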
4. Comparative Structure: Muon, MuonH, and AdamW
The Muon family differs from standard optimizers in its use of matrix manifold geometry and explicit spectral (operator-norm) constraints:
- Muon (original) takes its update direction from the SVD of the momentum buffer: the singular values are replaced by ones, so every singular direction of the momentum contributes with equal weight.
- MuonH solves a Hessian-free trust-region subproblem, with updates involving the rank decomposition and nuclear-norm scaling, standardizing step size and directionality (Li et al., 5 Feb 2025).
- AdamW performs elementwise adaptive scaling via running estimates of the gradient's second moment, but does not globally control the layer capacity or spectrum. AdamW can induce norm growth and spectral concentration, which is problematic in regimes such as grokking and heavy-tailed class distributions (Tveit et al., 22 Apr 2025).
| Optimizer | Update Form | Spectral Constraint | Preconditioning |
|---|---|---|---|
| Muon | 9 | 0 | Momentum / SVD |
| MuonH | 1 | Operator norm | Hessian-free, SVD |
| AdamW | 2 | None | Diagonal, no SVD |
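The difference in spectral control can be checked numerically on a small example. The sketch below compares an orthogonalized direction with the first bias-corrected Adam step from a zero state, which reduces to `g / (|g| + eps)`; the 4x3 gradient with positive entries is an arbitrary illustrative choice, and weight decay is left out.

```python
import numpy as np

# Illustrative full-rank gradient with strictly positive entries.
G = np.array([[1.0, 2.0, 3.0],
              [2.0, 1.0, 4.0],
              [3.0, 5.0, 1.0],
              [4.0, 1.0, 2.0]])

# Orthogonalized direction: all singular values are set to one.
U, _, Vt = np.linalg.svd(G, full_matrices=False)
O = U @ Vt

# First bias-corrected Adam step from a zero state: elementwise g / (|g| + eps).
eps = 1e-8
A = G / (np.abs(G) + eps)

sv_O = np.linalg.svd(O, compute_uv=False)   # spectrum of the orthogonalized direction
spec_A = np.linalg.norm(A, 2)               # operator norm of the elementwise direction
```

Here the orthogonalized direction has a flat spectrum with operator norm one, while the elementwise direction is essentially the all-ones matrix, whose operator norm grows like the square root of the number of entries and whose energy sits in a single singular direction.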
5. Empirical Behavior: Stability under Heavy-Tail and Grokking
MuonH and the Muon family demonstrate practical advantages in domains with:
- Heavy-tailed noise: Empirically observed in large-scale training on non-uniform data distributions; MuonH's norm-based step mitigates gradient explosion and learning imbalance in rare/“tail” classes (Wang et al., 30 Sep 2025).
- Grokking regime: On modular arithmetic and parity tasks, Muon achieves a roughly 33% reduction in mean grokking epoch (the memorization-to-generalization transition) vs AdamW (102.89 vs 153.09 epochs), reported as statistically highly significant (Tveit et al., 22 Apr 2025).
- Associative memory and transformers: Experiments and theory show Muon’s update yields an isotropic singular value spectrum in critical associative-memory blocks (Value/Output attention, FFNs), ensuring balanced learning even for tail classes where Adam yields high disparity (Wang et al., 30 Sep 2025).
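The relative speed-up implied by the two reported mean grokking epochs can be checked directly; the variable names are illustrative.

```python
adamw_epochs = 153.09   # mean grokking epoch reported for AdamW
muon_epochs = 102.89    # mean grokking epoch reported for Muon

# Relative reduction measured against the AdamW baseline.
relative_reduction = (adamw_epochs - muon_epochs) / adamw_epochs
print(f"{100 * relative_reduction:.1f}% fewer epochs")
```

The two means correspond to a reduction of about one third relative to AdamW.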
6. Hyperparameter Regimes and Implementation Notes
Robust operation of MuonH requires:
- Hölder exponent $\nu$: Typically $\nu = 1$ (smooth ERM), but any $\nu \in (0, 1]$ suffices.
- Tail index $p$: Empirically estimated or set to $p = 2$ for bounded variance; $p \in (1, 2)$ otherwise.
- Step-size schedule 2, with 3. Practical 4.
- Batch size: Moderate constants (256–1024) or slow exponential increase.
- SVD approximation: 5 Newton–Schulz steps typically suffice, incurring negligible computational overhead relative to backpropagation.
- Momentum: 5. Additional terms in the convergence condition remain summable if 6.
Recommended settings in neural LLMs: 7, 8, spectral-norm bound 9, and no weight decay on attention/FFN blocks for isolating effects (Iiduka, 16 Mar 2026, Tveit et al., 22 Apr 2025, Wang et al., 30 Sep 2025).
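One way to realize a slow exponential batch-size increase is sketched below. The growth factor, the period, and the cap are hypothetical values chosen only to stay within the 256–1024 range mentioned above; a cap is a practical assumption and is not part of any stated schedule.

```python
def batch_size_schedule(step, base=256, growth=1.05, period=1000, cap=1024):
    """Hypothetical slow exponential batch-size growth.

    The batch size is multiplied by `growth` every `period` steps, starting
    from `base` and never exceeding `cap`.
    """
    size = base * growth ** (step // period)
    return min(cap, int(round(size)))

sizes = [batch_size_schedule(t) for t in (0, 1000, 10_000, 100_000)]
```

With these values the batch size grows by 5% every 1000 steps until it reaches the cap, after which it stays constant.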
7. Practical Significance and Research Directions
MuonH and related Muon optimizers target learning in nonconvex, weakly smooth (Hölder), and statistically imbalanced regimes. Key advantages are:
- Provably faster convergence in gradient norm under weak smoothness assumptions and heavy-tailed stochastic noise.
- Isotropic singular-value evolution in critical network blocks, translating to improved learning for rare/“tail” data, which matters for long-tailed NLP and vision benchmarks.
- Empirical acceleration of delayed generalization transitions (grokking) and balanced performance across head and tail classes.
- Layerwise normalization preventing operator-norm blowup and aligning with implicit regularization trends seen empirically.
Ongoing research aims to integrate MuonH more closely with LLM pretraining pipelines, further optimize the SVD approximations, and generalize the convergence analysis to settings with additional nonlinear components (e.g., batch normalization, attention) or structured noise (Iiduka, 16 Mar 2026, Li et al., 5 Feb 2025, Wang et al., 30 Sep 2025).