Muon-based Gradient Optimizer (MuSGD)
- Muon-based Gradient Optimizer (MuSGD) is a matrix-aware stochastic method that applies polar decomposition and Newton–Schulz iterations to orthogonalize momentum updates.
- The approach enhances convergence by enforcing spectral constraints and integrating a per-row/column normalization, achieving better performance than AdamW and SGD.
- Empirical benchmarks on GPT and LLaMA models show significant improvements in validation perplexity and training efficiency with negligible computational overhead.
The Muon-based Gradient Optimizer (commonly termed MuSGD) is a matrix-aware stochastic optimization method characterized by spectral-norm orthogonalization of momentum or gradient updates. Originally developed for efficient and stable pre-training of LLMs, MuSGD proceeds via momentum accumulation and orthogonalization using polar decomposition or Newton–Schulz iterations, followed by a geometrically scaled parameter update. The approach has led to substantial performance gains over AdamW and standard SGD in both theoretical convergence and empirical efficiency, especially on large-scale transformer architectures. The Muon+ variant introduces an additional per-row or per-column normalization after the orthogonalization step, yielding further robustness and improvements in validation perplexity across GPT and LLaMA model families (Zhang et al., 25 Feb 2026).
1. Mathematical Formulation and Algorithmic Workflow
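A single MuSGD step at iteration $t$ for a matrix parameter $W_t \in \mathbb{R}^{m \times n}$ proceeds as follows:

$$M_t = \mu M_{t-1} + G_t, \qquad O_t = \operatorname{polar}(M_t) = U_t V_t^\top \ \text{ where } M_t = U_t \Sigma_t V_t^\top, \qquad W_{t+1} = W_t - \eta\, O_t,$$

where $G_t$ is the stochastic gradient, $\mu$ the momentum coefficient, and $\eta$ the learning rate. The orthogonalization ensures that $O_t^\top O_t = I$ (for $m \ge n$; $O_t O_t^\top = I$ otherwise), enforcing a unit spectral norm and equivalently projecting the update onto the Stiefel manifold (Zhang et al., 25 Feb 2026, Mehta et al., 29 Sep 2025).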
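In practical implementations, the polar factor is computed efficiently using Newton–Schulz iterations of the form:

$$X_0 = \frac{M_t}{\|M_t\|_F}, \qquad X_{k+1} = a X_k + b\,(X_k X_k^\top) X_k + c\,(X_k X_k^\top)^2 X_k, \qquad k = 0, \dots, K-1,$$

for fixed polynomial coefficients $a, b, c$, with $K$ typically fixed at 5 for stability and throughput. This procedure approximates the polar factor $U V^\top$ of the momentum matrix $M_t$ with controllable error in singular values (Zhang et al., 25 Feb 2026).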
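The momentum, orthogonalization, and update steps described above can be sketched in NumPy. This is a minimal illustration under stated assumptions, not the reference implementation: it uses the classical cubic Newton–Schulz polynomial (X ← 1.5·X − 0.5·X·XᵀX) rather than any tuned coefficients, a heavy-ball momentum form, and hypothetical names (`newton_schulz_polar`, `muon_step`) and default values (`lr=0.02`, `mu=0.95`) chosen only for the example.

```python
import numpy as np

def newton_schulz_polar(M, steps=5):
    """Approximate the polar factor U V^T of M with the cubic Newton-Schulz iteration."""
    # Frobenius normalization puts every singular value in (0, 1], which lies
    # inside the convergence region (0, sqrt(3)) of the cubic iteration.
    X = M / (np.linalg.norm(M) + 1e-12)
    transposed = X.shape[0] < X.shape[1]
    if transposed:
        X = X.T  # iterate on the tall orientation so X^T X is the smaller Gram matrix
    for _ in range(steps):
        X = 1.5 * X - 0.5 * X @ (X.T @ X)
    return X.T if transposed else X

def muon_step(W, M, G, lr=0.02, mu=0.95, ns_steps=5):
    """One Muon-style step (assumed heavy-ball form); returns the new (W, M)."""
    M = mu * M + G                          # momentum accumulation
    O = newton_schulz_polar(M, ns_steps)    # approximate orthogonalization
    return W - lr * O, M                    # parameter update
```

For small singular values the cubic map multiplies them by roughly 1.5 per step, so a handful of steps yields only an approximate polar factor on ill-conditioned inputs; raising `steps` tightens the approximation at proportional cost.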
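In Muon+, an additional normalization pass is applied:

$$\tilde{O}_t = \operatorname{Norm}(O_t),$$

where, for instance, column-wise,

$$[\operatorname{Norm}_{\mathrm{col}}(O)]_{:,j} = \frac{O_{:,j}}{\|O_{:,j}\|_2}.$$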
This stabilizes the scale of updates and controls the sensitivity to 2 (Zhang et al., 25 Feb 2026).
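As a rough sketch of the Muon+ normalization pass, the function below rescales the columns and/or rows of an already orthogonalized update to unit ℓ2 norm. The function name `normalize_update`, the mode string `"col_row"` (column pass followed by row pass), and the `eps` guard are assumptions made for illustration and are not taken from the cited paper.

```python
import numpy as np

def normalize_update(O, mode="col", eps=1e-8):
    """Rescale columns and/or rows of an orthogonalized update to unit l2 norm."""
    O = np.asarray(O, dtype=float)
    # axis=0 reduces over rows (one norm per column); axis=1 gives one norm per row.
    axes = {"col": (0,), "row": (1,), "col_row": (0, 1)}[mode]
    for axis in axes:
        O = O / (np.linalg.norm(O, axis=axis, keepdims=True) + eps)
    return O
```

In the `"col_row"` mode only the last pass (rows) is exactly unit-norm afterwards, since the row rescaling perturbs the column norms set by the first pass.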
2. Spectral Regularization and Theoretical Principles
Muon’s key innovation is the use of spectral-norm constraints at each update. This can be formalized in the Lion-$\mathcal{K}$ mirror descent family, where the nuclear norm $\mathcal{K}(X) = \|X\|_*$ defines the regularization and $\nabla \mathcal{K}(X) = \operatorname{msgn}(X)$ (matrix sign function) acts as a subgradient preconditioner. The Muon update is then equivalent to solving:

$$\min_{W} F(W) \quad \text{subject to} \quad \|W\|_{\mathrm{op}} \le \frac{1}{\lambda}$$

for weight-decay parameter $\lambda$. The optimizer enforces $\|W_t\|_{\mathrm{op}} \le 1/\lambda$ throughout training, leading to spectral regularization and improved generalization. These updates are strictly dual to enforcing spectral constraints at each iteration via Fenchel conjugacy and KKT stationarity (Chen et al., 18 Jun 2025).
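A short derivation shows why decoupled weight decay keeps the weights inside a spectral-norm ball. It assumes the update takes the form below, with an orthogonalized direction of spectral norm at most one and $0 < \eta\lambda \le 1$; these are assumptions of this sketch, not statements quoted from the cited work.

```latex
\begin{aligned}
W_{t+1} &= (1 - \eta\lambda)\, W_t - \eta\, O_t,
  \qquad \|O_t\|_{\mathrm{op}} \le 1, \quad 0 < \eta\lambda \le 1, \\
\|W_{t+1}\|_{\mathrm{op}} &\le (1 - \eta\lambda)\,\|W_t\|_{\mathrm{op}} + \eta, \\
\|W_{t+1}\|_{\mathrm{op}} - \tfrac{1}{\lambda}
  &\le (1 - \eta\lambda)\left(\|W_t\|_{\mathrm{op}} - \tfrac{1}{\lambda}\right).
\end{aligned}
```

Under these assumptions, an iterate that starts inside the ball of radius $1/\lambda$ stays inside it, and an iterate that starts outside has its excess norm shrink geometrically at rate $1 - \eta\lambda$.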
On non-square matrices—or with block-structured neural weights—this spectral flattening (enforcing all singular values to unity or near unity) yields highly controlled update directions. The explicit orthogonalization acts as a geometry-aware normalization, enhancing both convergence and step-size robustness (Shen et al., 29 May 2025).
3. Convergence Rates and Variance-Reduction Extensions
Standard stochastic MuSGD achieves a nonconvex convergence rate of 9 on the expected gradient norm. Under additional smoothness and PL conditions, the optimizer enjoys 0 or 1 (with variance reduction) rates, matching lower bounds for this problem class:
- Option EMA (standard Muon): $\tilde{\mathcal{O}}(T^{-1/4})$ on ergodic gradient norm (Chang et al., 19 Sep 2025).
- Muon-VR2 (variance reduction): $\tilde{\mathcal{O}}(T^{-1/3})$ with two-batch correction and properly coupled step/momentum schedules (Chang et al., 19 Sep 2025, Qian et al., 18 Dec 2025).
Variance-reduced momentum (MVR) techniques integrated in the Gluon-MVR-2 framework yield the optimal nonconvex rate of $\mathcal{O}(T^{-1/3})$. These refinements involve inner-outer double buffering and strong per-layer relative smoothness assumptions, amplifying stability at large batch sizes (Qian et al., 18 Dec 2025).
Table: Convergence Rates
| Method | Nonconvex Rate | Reference |
|---|---|---|
| SGD, AdamW | $\mathcal{O}(T^{-1/4})$ | Standard Theory |
| MuSGD Standard | $\tilde{\mathcal{O}}(T^{-1/4})$ | (Chang et al., 19 Sep 2025) |
| MuSGD Variance-Reduced | $\tilde{\mathcal{O}}(T^{-1/3})$ | (Qian et al., 18 Dec 2025) |
4. Empirical Performance and Robustness
Extensive pretraining benchmarks confirm that Muon and Muon+ consistently outperform AdamW across architectures and training regimes:
- GPT-style models (124M–774M): Muon+ improves validation perplexity by up to 2 points over Muon (e.g., 29.66 → 27.64, GPT-Small) (Zhang et al., 25 Feb 2026).
- LLaMA-style models (60M–1B): Similar robust improvements (e.g., LLaMA-1B, 10.68 → 10.31 PPL).
- Compute-optimal (T2P ≈ 20) and over-training (T2P ≈ 200): Perplexity improvements persist with extended tokens.
| Model | Muon PPL | Muon+ PPL | ΔPPL |
|---|---|---|---|
| GPT-Small | 29.66 | 27.64 | -2.02 |
| GPT-Base | 21.70 | 19.98 | -1.72 |
| GPT-Large | 17.82 | 16.91 | -0.91 |
| LLaMA-60M | 25.75 | 25.25 | -0.50 |
| LLaMA-1B | 10.68 | 10.31 | -0.37 |
Overhead from the Muon+ normalization step is negligible ($\mathcal{O}(mn)$) relative to the Newton–Schulz orthogonalization kernel ($\mathcal{O}(mn \min(m, n))$ per iteration). Muon+ requires no additional hyperparameter tuning relative to base Muon; the effective learning-rate window often widens due to increased scale-invariance and normalized update geometry (Zhang et al., 25 Feb 2026). Integration is straightforward: the normalization stage is a single line applied after the polar factor is computed.
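A back-of-the-envelope FLOP count illustrates why the normalization pass is cheap next to orthogonalization. The sketch assumes a quintic Newton–Schulz step built from three matrix products and about three operations per entry for the normalization; both counts are rough assumptions, and measured wall-clock ratios on real hardware will differ because the normalization is memory-bound.

```python
def newton_schulz_flops(m, n, steps=5):
    """Rough FLOPs for `steps` quintic Newton-Schulz iterations on an m x n matrix."""
    r, l = min(m, n), max(m, n)
    gram = 2 * r * r * l      # A = X X^T            : (r x l) @ (l x r)
    square = 2 * r ** 3       # A @ A                : (r x r) @ (r x r)
    apply = 2 * r * r * l     # (b A + c A^2) @ X    : (r x r) @ (r x l)
    return steps * (gram + square + apply)

def normalization_flops(m, n, passes=1):
    """Rough FLOPs for row or column l2 normalization: square, accumulate, divide."""
    return passes * 3 * m * n

# Hypothetical layer shape chosen only for illustration.
ratio = normalization_flops(1024, 4096, passes=2) / newton_schulz_flops(1024, 4096)
```

Under this model the ratio for a 1024 × 4096 matrix is far below one percent, consistent in direction with the small overhead reported in the text.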
In the compute/epoch Pareto regime, Muon reaches target loss with half the training of AdamW while maintaining or improving perplexity. Additional robustness gains are reported for large-batch training, reduced grokking latency, and alleviation of spectral collapse in deep vision transformers (ViTs) (Mehta et al., 29 Sep 2025, Southworth et al., 23 May 2026, Tveit et al., 22 Apr 2025).
5. Comparative Analysis and Theoretical Insights
Muon uniformly outperforms SGD and AdamW in settings where the Hessian or gradient covariance is low-rank, block-diagonal, or highly anisotropic—regimes typical in modern transformers and wide MLPs. The spectral-norm constraint enables learning rates up to the scale of the average singular value of the gradient (not the largest), a mechanism termed "spectral flattening" (Nguyen et al., 13 May 2026). This greatly increases both the maximal stable step and the convergence rate under Kronecker-factored or K-FAC curvature models.
Muon’s update direction coincides with steepest descent under spectral norm and implements the natural gradient on the Stiefel manifold for square matrices (Mehta et al., 29 Sep 2025). In broader terms, the Muon step falls under non-Euclidean mirror descent, interpreted as a linear minimization oracle (LMO) constrained to a spectral-norm ball, with decoupled weight decay enforcing explicit operator-norm regularization throughout training (Chen et al., 18 Jun 2025, Qian et al., 18 Dec 2025).
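The linear minimization oracle view can be checked numerically. The sketch below computes the minimizer of the inner product ⟨G, X⟩ over the unit spectral-norm ball via an exact SVD and compares it against random feasible points; the function name `spectral_lmo` and the test matrix are illustrative assumptions.

```python
import numpy as np

def spectral_lmo(G):
    """Minimizer of <G, X> over {X : ||X||_op <= 1}, attained at X = -U V^T."""
    U, _, Vt = np.linalg.svd(G, full_matrices=False)
    return -(U @ Vt)

rng = np.random.default_rng(0)
G = rng.standard_normal((5, 3))
X_star = spectral_lmo(G)

lmo_value = np.sum(G * X_star)               # inner product <G, X*>
nuclear_norm = np.linalg.norm(G, ord="nuc")  # sum of singular values of G

# Draw a random point on the boundary of the unit spectral-norm ball for comparison.
X_rand = rng.standard_normal(G.shape)
X_rand /= np.linalg.norm(X_rand, 2)
rand_value = np.sum(G * X_rand)
```

Here `lmo_value` equals the negative nuclear norm of `G`, the dual norm of the spectral norm, and no feasible point such as `X_rand` attains a lower inner product. The orthogonalized Muon direction is this oracle output applied to the momentum matrix.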
6. Practical Integration and Recommendations
- Per-layer application: Muon and Muon+ are applied to all non-embedding, non-norm matrix parameters; AdamW is recommended for scalar and embedding parameters and, optionally, for LayerNorm layers.
- Hyperparameters: Default learning rates for Muon+ are robust; a typical range is 4 to 5, with momentum 6 and 5 Newton–Schulz steps per iteration.
- Scaling considerations: The normalization factor 7 is included to match the update scale across rectangular matrices; weight decay and gradient clipping are unchanged.
- Implementation: Replace the orthogonal update step in preexisting Muon code with a normalized variant as specified. Adjust the normalization axis (“col”, “row”, or both) empirically; column-then-row is recommended for best robustness.
- System overhead: Newton–Schulz iterations dominate per-layer time; the added normalization has minimal overhead—empirically under 5% of optimizer compute (Zhang et al., 25 Feb 2026).
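The per-layer split recommended above can be sketched as a simple routing rule over parameter names and shapes. The substring conventions (`"embed"`, `"norm"`) and the example parameter names are hypothetical; real codebases name their parameters differently, so the matching rule would need adapting.

```python
def assign_optimizer(name, shape):
    """Route hidden weight matrices to Muon and everything else to AdamW."""
    is_matrix = len(shape) == 2
    is_embedding = "embed" in name   # assumed naming convention
    is_norm = "norm" in name         # assumed naming convention
    if is_matrix and not is_embedding and not is_norm:
        return "muon"
    return "adamw"

# Hypothetical parameter names and shapes for a small transformer block.
params = {
    "tok_embed.weight": (50257, 768),
    "blocks.0.attn.qkv.weight": (768, 2304),
    "blocks.0.mlp.fc.weight": (768, 3072),
    "blocks.0.norm1.weight": (768,),
    "blocks.0.mlp.fc.bias": (3072,),
}

groups = {"muon": [], "adamw": []}
for name, shape in params.items():
    groups[assign_optimizer(name, shape)].append(name)
```

The resulting `groups` dictionary can then be used to build two optimizer instances, one per group, each with its own learning rate and weight decay.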
7. Significance and Extensions
Muon+ establishes a new standard for matrix-aware optimization in deep learning, yielding pervasive improvements in convergence, loss, and stability. Its one-line normalization enhancement generalizes readily to new architectures, requires no retuning, and is robust across scales. Theoretical grounding within non-Euclidean mirror descent, spectral regularization, and blockwise adaptive trust-regions supports Muon+’s empirical success. Extensive pretraining and fine-tuning trials on both LLMs and vision transformers corroborate its efficiency and generality (Zhang et al., 25 Feb 2026, Mehta et al., 29 Sep 2025, Southworth et al., 23 May 2026).
The Muon framework continues to expand, with recent advances such as curvature-aware extensions (e.g., Mousse), mixed Muon–SGD hybrids, and fully schedule-free variants. These lines of work aim to harness Muon’s spectral geometry while further refining its adaptivity, scaling behavior, and generalization guarantees.