Muon-based Gradient Optimizer (MuSGD)

Updated 4 June 2026

Muon-based Gradient Optimizer (MuSGD) is a matrix-aware stochastic method that orthogonalizes momentum updates by computing their polar factor, in practice via Newton–Schulz iterations.
The approach constrains the spectrum of each update and, in the Muon+ variant, adds a per-row/column normalization; it is reported to outperform AdamW and SGD.
Empirical benchmarks on GPT and LLaMA models show improved validation perplexity and training efficiency with negligible computational overhead.

The Muon-based Gradient Optimizer (commonly termed MuSGD) is a matrix-aware stochastic optimization method characterized by spectral-norm orthogonalization of momentum or gradient updates. Originally developed for efficient and stable pre-training of large language models (LLMs), MuSGD accumulates momentum, orthogonalizes it using polar decomposition or Newton–Schulz iterations, and then applies a geometrically scaled parameter update. The approach is reported to yield substantial gains over AdamW and standard SGD in both theoretical convergence and empirical efficiency, especially on large-scale transformer architectures. The Muon+ variant introduces an additional per-row or per-column normalization after the orthogonalization step, yielding further robustness and improvements in validation perplexity across GPT and LLaMA model families (Zhang et al., 25 Feb 2026).

1. Mathematical Formulation and Algorithmic Workflow

A single MuSGD step at iteration $t$ for a matrix parameter $W_t \in \mathbb{R}^{m \times n}$ proceeds as follows:

$\begin{aligned} M_t &= \mu\, M_{t-1} + (1-\mu)\, G_t, \\ O_t &= \operatorname{Ortho}(M_t), \quad \operatorname{Ortho}(M) = M (M^\top M)^{-1/2} \approx U V^\top, \\ W_t &= W_{t-1} - \eta \sqrt{m/n}\, O_t, \end{aligned}$

where $G_t$ is the stochastic gradient, $\mu$ the momentum coefficient, $\eta$ the learning rate, and $U$, $V$ the singular-vector factors of the thin SVD $M = U \Sigma V^\top$. For $m \ge n$ the orthogonalization ensures that $O_t^\top O_t \approx I$, so every singular value of the update is close to one: the update has unit spectral norm and lies on (or near) the Stiefel manifold (Zhang et al., 25 Feb 2026, Mehta et al., 29 Sep 2025).
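The three-line update above can be sketched in NumPy. This is a minimal illustration, not a reference implementation: it uses an exact SVD for the polar factor instead of Newton–Schulz, and the function names (`polar_factor`, `musgd_step`) and the default values `lr=0.02`, `mu=0.95` are assumptions made for the example.

```python
import numpy as np

def polar_factor(M):
    """Exact orthogonal polar factor U V^T from the thin SVD of M."""
    U, _, Vt = np.linalg.svd(M, full_matrices=False)
    return U @ Vt

def musgd_step(W, M, G, lr=0.02, mu=0.95):
    """One step: momentum EMA, orthogonalization, scaled update.

    lr and mu are illustrative values, not taken from the source.
    Returns the new weights and the new momentum buffer.
    """
    m, n = W.shape
    M_new = mu * M + (1.0 - mu) * G          # M_t
    O = polar_factor(M_new)                  # O_t
    W_new = W - lr * np.sqrt(m / n) * O      # W_t
    return W_new, M_new

rng = np.random.default_rng(0)
W = rng.standard_normal((6, 4))
M = np.zeros_like(W)
G = rng.standard_normal((6, 4))

W_new, M_new = musgd_step(W, M, G)
O = polar_factor(M_new)
step_spectral_norm = np.linalg.norm(W_new - W, 2)
print(np.allclose(O.T @ O, np.eye(4)), step_spectral_norm)
```

Because $O_t$ has unit singular values, the spectral norm of the step is $\eta\sqrt{m/n}$ regardless of the gradient's magnitude.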

In practical implementations, the polar factor is computed efficiently using Newton–Schulz iterations of the form:

$\begin{aligned} X_0 &= \tfrac{1}{\alpha} M, \quad \alpha = \| M \|_2, \\ X_{k+1} &= \tfrac{3}{2} X_k - \tfrac{1}{2} X_k (X_k^\top X_k), \end{aligned}$

for $k = 0, \dots, K-1$, with the iteration count $K$ typically fixed at 5 for stability and throughput. The final iterate $X_K$ approximates $M (M^\top M)^{-1/2}$ with controllable error in singular values (Zhang et al., 25 Feb 2026).
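A short NumPy sketch of the cubic iteration as written above may help. It follows the formula in this section (spectral-norm scaling, coefficients 3/2 and 1/2); production Muon code may use different scaling or polynomial coefficients, which this example does not attempt to reproduce. The test matrix and its singular values are made up for the illustration.

```python
import numpy as np

def newton_schulz(M, num_iters=5):
    """Cubic Newton-Schulz iteration X <- 1.5 X - 0.5 X (X^T X), started at M / ||M||_2."""
    X = M / np.linalg.norm(M, 2)   # scale so all singular values lie in (0, 1]
    for _ in range(num_iters):
        X = 1.5 * X - 0.5 * X @ (X.T @ X)
    return X

# Build a 6x4 matrix with known, moderately spread singular values.
rng = np.random.default_rng(0)
U, _ = np.linalg.qr(rng.standard_normal((6, 4)))   # orthonormal columns
V, _ = np.linalg.qr(rng.standard_normal((4, 4)))   # orthogonal matrix
M = U @ np.diag([2.0, 1.6, 1.2, 1.0]) @ V.T

exact = U @ V.T                      # polar factor of M
X5 = newton_schulz(M, num_iters=5)
err = np.linalg.norm(X5 - exact)     # Frobenius-norm error after 5 iterations
print(err)
```

The iteration acts on each singular value $s$ through the scalar map $s \mapsto \tfrac{3}{2}s - \tfrac{1}{2}s^3$, so convergence speed depends on conditioning: a normalized singular value of 0.5 is within about $10^{-6}$ of one after five steps, whereas a value of 0.1 only reaches roughly 0.66.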

In Muon+, an additional normalization pass is applied:

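$\widehat{O}_t = \operatorname{Norm}(O_t), \qquad W_t = W_{t-1} - \eta \sqrt{m/n}\, \widehat{O}_t,$

where, for instance, column-wise,

$[\operatorname{Norm}_{\mathrm{col}}(O)]_{:,j} = O_{:,j} / \| O_{:,j} \|_2.$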

This stabilizes the scale of updates and controls the sensitivity to [symbol missing in source] (Zhang et al., 25 Feb 2026).
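As an illustration of the normalization pass, the sketch below rescales the columns or rows of an orthogonalized update to unit Euclidean norm. The function name `normalize`, the `eps` guard, and the `"col_row"` option label are assumptions for this example; only the idea of per-row or per-column normalization comes from the text.

```python
import numpy as np

def normalize(O, axis="col", eps=1e-8):
    """Rescale columns ("col"), rows ("row"), or columns then rows ("col_row") to unit L2 norm."""
    if axis == "col":
        return O / (np.linalg.norm(O, axis=0, keepdims=True) + eps)
    if axis == "row":
        return O / (np.linalg.norm(O, axis=1, keepdims=True) + eps)
    if axis == "col_row":
        return normalize(normalize(O, "col", eps), "row", eps)
    raise ValueError(f"unknown axis: {axis}")

rng = np.random.default_rng(0)
U, _, Vt = np.linalg.svd(rng.standard_normal((6, 4)), full_matrices=False)
O = U @ Vt                                   # exact polar factor of a 6x4 matrix

col_norms_before = np.linalg.norm(O, axis=0)
row_norms_before = np.linalg.norm(O, axis=1)
O_row = normalize(O, "row")
row_norms_after = np.linalg.norm(O_row, axis=1)
print(col_norms_before, row_norms_before, row_norms_after)
```

One thing the example makes visible: for a tall matrix ($m \ge n$) an exact polar factor already has unit-norm columns, so column normalization only changes the update when the orthogonalization is approximate (as with a few Newton–Schulz steps), while row normalization changes it in either case.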

2. Spectral Regularization and Theoretical Principles

Muon’s key innovation is the use of spectral-norm constraints at each update. This can be formalized in the Lion-$\mathcal{K}$ mirror descent family, where the nuclear norm $\mathcal{K}(X) = \|X\|_*$ defines the regularization and $\nabla \mathcal{K}(X) = \operatorname{msign}(X) = U V^\top$ (matrix sign function) acts as a subgradient preconditioner. The Muon update is then equivalent to solving:

$\min_{W} f(W) \quad \text{subject to} \quad \|W\|_2 \le 1/\lambda$

for weight-decay parameter $\lambda$. The optimizer enforces $\|W_t\|_2 \le 1/\lambda$ throughout training, leading to spectral regularization and improved generalization. These updates are strictly dual to enforcing spectral constraints at each iteration via Fenchel conjugacy and KKT stationarity (Chen et al., 18 Jun 2025).
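The spectral-norm bound can be checked numerically under a simplified update. The sketch below assumes a decoupled weight-decay step of the form $W \leftarrow (1 - \eta\lambda) W - \eta O$ with $\|O\|_2 = 1$; momentum and the $\sqrt{m/n}$ factor are omitted, and the values of `lr` and `lam` are arbitrary. Under that assumption, $\|W\|_2 \le (1-\eta\lambda)\|W\|_2 + \eta$, so a bound of $1/\lambda$ is preserved whenever it holds initially and $\eta\lambda < 1$.

```python
import numpy as np

def polar_factor(M):
    """Orthogonal polar factor U V^T from the thin SVD of M."""
    U, _, Vt = np.linalg.svd(M, full_matrices=False)
    return U @ Vt

lr, lam = 0.05, 0.5                      # assumed values; the bound is 1 / lam = 2
rng = np.random.default_rng(0)
W = rng.standard_normal((6, 4))
W /= np.linalg.norm(W, 2)                # start inside the ball: ||W||_2 = 1

max_norm = np.linalg.norm(W, 2)
for _ in range(200):
    G = -W                               # adversarial gradient: each step inflates every singular value
    W = (1.0 - lr * lam) * W - lr * polar_factor(G)
    max_norm = max(max_norm, np.linalg.norm(W, 2))

final_norm = np.linalg.norm(W, 2)
print(max_norm <= 1.0 / lam + 1e-9, final_norm)
```

The gradient here is chosen to push the weights outward as hard as an orthogonalized step allows; the spectral norm approaches $1/\lambda$ from below but does not cross it.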

On non-square matrices—or with block-structured neural weights—this spectral flattening (enforcing all singular values to unity or near unity) yields highly controlled update directions. The explicit orthogonalization acts as a geometry-aware normalization, enhancing both convergence and step-size robustness (Shen et al., 29 May 2025).

3. Convergence Rates and Variance-Reduction Extensions

Standard stochastic MuSGD achieves a nonconvex convergence rate of $O(T^{-1/4})$ on the expected gradient norm. Under additional smoothness and PL conditions, the optimizer enjoys $\tilde{O}(T^{-1/4})$ or $\tilde{O}(T^{-1/3})$ (with variance reduction) rates, matching lower bounds for this problem class:

Option EMA (standard Muon): $\tilde{O}(T^{-1/4})$ on ergodic gradient norm (Chang et al., 19 Sep 2025).
Muon-VR2 (variance reduction): $\tilde{O}(T^{-1/3})$ with two-batch correction and properly coupled step/momentum schedules (Chang et al., 19 Sep 2025, Qian et al., 18 Dec 2025).

Variance-reduced momentum (MVR) techniques integrated in the Gluon-MVR-2 framework yield the optimal nonconvex rate of $O(T^{-1/3})$. These refinements involve inner-outer double buffering and strong per-layer relative smoothness assumptions, amplifying stability at large batch sizes (Qian et al., 18 Dec 2025).
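To make the idea of a variance-reduced momentum with a correction term concrete, here is a toy sketch. It uses a generic STORM-style estimator, in which the same fresh sample is evaluated at the current and previous iterates; this is an assumption chosen for illustration and is not claimed to be the exact Muon-VR2 or Gluon-MVR-2 rule. The quadratic objective, noise model, and all hyperparameter values are likewise made up.

```python
import numpy as np

def polar_factor(M):
    """Orthogonal polar factor U V^T from the thin SVD of M."""
    U, _, Vt = np.linalg.svd(M, full_matrices=False)
    return U @ Vt

def run(steps=200, lr=0.05, beta=0.1, noise=0.1, seed=0):
    """Orthogonalized updates driven by a STORM-style momentum on a toy quadratic."""
    rng = np.random.default_rng(seed)
    target = rng.standard_normal((4, 3))

    def loss(W):
        return 0.5 * np.sum((W - target) ** 2)

    def grad(W, xi):
        # Stochastic gradient: exact gradient plus noise tied to the sample xi.
        return (W - target) + noise * xi

    W = np.zeros((4, 3))
    M = grad(W, rng.standard_normal(W.shape))
    initial = loss(W)
    for _ in range(steps):
        W_prev, W = W, W - lr * polar_factor(M)
        xi = rng.standard_normal(W.shape)      # one fresh sample, used at both iterates
        # Correction term: subtract the same-sample gradient at the previous iterate.
        M = grad(W, xi) + (1.0 - beta) * (M - grad(W_prev, xi))
    return initial, loss(W)

initial_loss, final_loss = run()
print(initial_loss, final_loss)
```

The difference from a plain exponential moving average is the term `M - grad(W_prev, xi)`: evaluating the same sample at both points cancels much of the noise that a stale momentum buffer would otherwise carry forward.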

Table: Convergence Rates

Method	Nonconvex Rate	Reference
SGD, AdamW	$O(T^{-1/4})$	Standard Theory
MuSGD Standard	$\tilde{O}(T^{-1/4})$	(Chang et al., 19 Sep 2025)
MuSGD Variance-Reduced	$O(T^{-1/3})$	(Qian et al., 18 Dec 2025)

4. Empirical Performance and Robustness

Extensive pretraining benchmarks confirm that Muon and Muon+ consistently outperform AdamW across architectures and training regimes:

GPT-style models (124M–774M): Muon+ improves validation perplexity by up to 2 points over Muon (e.g., 29.66 $\to$ 27.64, GPT-Small) (Zhang et al., 25 Feb 2026).
LLaMA-style models (60M–1B): Similar robust improvements (e.g., LLaMA-1B, 10.68 $\to$ 10.31 PPL).
Compute-optimal (token-to-parameter ratio, T2P $\approx$ 20) and over-training (T2P $\approx$ 200): Perplexity improvements persist with extended tokens.

Model	Muon PPL	Muon+ PPL	ΔPPL
GPT-Small	29.66	27.64	-2.02
GPT-Base	21.70	19.98	-1.72
GPT-Large	17.82	16.91	-0.91
LLaMA-60M	25.75	25.25	-0.50
LLaMA-1B	10.68	10.31	-0.37
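The ΔPPL column can be recomputed directly from the two perplexity columns. The snippet below does only that arithmetic on the numbers in the table and additionally prints the relative change, which the table does not list.

```python
# (Muon PPL, Muon+ PPL) pairs copied from the table above.
results = {
    "GPT-Small": (29.66, 27.64),
    "GPT-Base": (21.70, 19.98),
    "GPT-Large": (17.82, 16.91),
    "LLaMA-60M": (25.75, 25.25),
    "LLaMA-1B": (10.68, 10.31),
}

deltas = {}
for model, (muon_ppl, muon_plus_ppl) in results.items():
    delta = round(muon_plus_ppl - muon_ppl, 2)      # absolute change in perplexity
    relative = 100.0 * delta / muon_ppl             # change as a percentage of Muon PPL
    deltas[model] = delta
    print(f"{model:10s} delta={delta:+.2f} relative={relative:+.1f}%")
```

In relative terms the reductions span roughly 2% to 8% of the Muon baseline, with the largest relative gains on the GPT-style models.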

Overhead from the Muon+ normalization step is negligible ($O(mn)$) relative to the Newton–Schulz orthogonalization kernel ($O(mn \min(m,n))$ per iteration). Muon+ requires no additional hyperparameter tuning relative to base Muon; the effective learning-rate window often widens due to increased scale-invariance and normalized update geometry (Zhang et al., 25 Feb 2026). Integration is straightforward: the normalization stage is a direct one-liner after the polar factor.
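A back-of-the-envelope operation count gives a feel for why the normalization is cheap compared with the orthogonalization. The constants below are rough assumptions (two matrix products per cubic Newton–Schulz iteration, about three operations per entry for a single-axis normalization); this is an analytical estimate, not a measurement, and it ignores memory traffic and kernel-launch costs.

```python
def newton_schulz_flops(m, n, num_iters=5):
    """Approximate matmul FLOPs for cubic Newton-Schulz.

    Each iteration forms a Gram matrix and multiplies by it, about
    2 * large * small^2 FLOPs each, with the Gram matrix taken on the smaller side.
    """
    small, large = min(m, n), max(m, n)
    return num_iters * 4 * large * small * small

def normalization_flops(m, n):
    """Approximate FLOPs for one-axis normalization: square, accumulate, divide per entry."""
    return 3 * m * n

m, n = 4096, 1024                      # hypothetical layer shape
ratio = normalization_flops(m, n) / newton_schulz_flops(m, n)
print(f"normalization / Newton-Schulz FLOPs: {ratio:.2e}")
```

Under these assumptions the ratio scales as $3 / (4K \min(m,n))$, so it shrinks further as layers get wider.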

In the compute/epoch Pareto regime, Muon reaches the target loss with half the training budget that AdamW needs, while maintaining or improving perplexity. Additional robustness gains are reported for large-batch training, reduced grokking latency, and alleviation of spectral collapse in deep vision transformers (ViTs) (Mehta et al., 29 Sep 2025, Southworth et al., 23 May 2026, Tveit et al., 22 Apr 2025).

5. Comparative Analysis and Theoretical Insights

Muon uniformly outperforms SGD and AdamW in settings where the Hessian or gradient covariance is low-rank, block-diagonal, or highly anisotropic—regimes typical in modern transformers and wide MLPs. The spectral-norm constraint enables learning rates up to the scale of the average singular value of the gradient (not the largest), a mechanism termed "spectral flattening" (Nguyen et al., 13 May 2026). This greatly increases both the maximal stable step and the convergence rate under Kronecker-factored or K-FAC curvature models.

Muon’s update direction coincides with steepest descent under spectral norm and implements the natural gradient on the Stiefel manifold for square matrices (Mehta et al., 29 Sep 2025). In broader terms, the Muon step falls under non-Euclidean mirror descent, interpreted as a linear minimization oracle (LMO) constrained to a spectral-norm ball, with decoupled weight decay enforcing explicit operator-norm regularization throughout training (Chen et al., 18 Jun 2025, Qian et al., 18 Dec 2025).
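The linear-minimization-oracle view can be verified numerically: over the unit spectral-norm ball, the inner product $\langle G, X \rangle$ is minimized by $X = -UV^\top$, and the minimum equals the negative nuclear norm of $G$. The sketch below checks this against random feasible points; the helper name `msign` and the matrix size are chosen for the example.

```python
import numpy as np

def msign(G):
    """Matrix sign / polar factor U V^T from the thin SVD of G."""
    U, _, Vt = np.linalg.svd(G, full_matrices=False)
    return U @ Vt

rng = np.random.default_rng(0)
G = rng.standard_normal((5, 3))

X_star = -msign(G)                         # candidate LMO solution on the unit spectral-norm ball
best = np.sum(G * X_star)                  # <G, X_star>
nuclear = np.linalg.norm(G, "nuc")         # sum of singular values of G

# Compare against random points on the boundary of the ball.
smallest_gap = np.inf
for _ in range(1000):
    X = rng.standard_normal((5, 3))
    X /= np.linalg.norm(X, 2)              # rescale to spectral norm 1
    smallest_gap = min(smallest_gap, np.sum(G * X) - best)

print(best, -nuclear, smallest_gap)
```

No sampled point does better than $-UV^\top$, which is consistent with the duality between the spectral and nuclear norms.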

6. Practical Integration and Recommendations

Per-layer application: Muon and Muon+ are applied to all non-embedding, non-norm matrix parameters; AdamW is recommended for scalar parameters and token embeddings, as well as (optionally) LayerNorm layers.
Hyperparameters: Default learning rates for Muon+ are robust; a typical range is [values missing in source], with momentum [value missing in source] and 5 Newton–Schulz steps per iteration.
Scaling considerations: The normalization factor $\sqrt{m/n}$ is included to match the update scale across rectangular matrices; weight decay and gradient clipping are unchanged.
Implementation: Replace the orthogonal update step in preexisting Muon code with a normalized variant as specified. Adjust the normalization axis (“col”, “row”, or both) empirically; column-then-row is recommended for best robustness.
System overhead: Newton–Schulz iterations dominate per-layer time; the added normalization has minimal overhead—empirically under 5% of optimizer compute (Zhang et al., 25 Feb 2026).
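The per-layer routing described in the first recommendation could look like the following sketch. The parameter names, the substring tests (`"embed"`, `"norm"`), and the function `route_parameters` are all hypothetical; a real model would need rules matched to its own naming scheme.

```python
def route_parameters(shapes):
    """Split a {name: shape} mapping into a Muon-style group and an AdamW group.

    2-D parameters that are neither embeddings nor norm weights go to the
    Muon-style group; everything else (biases, norm gains, embeddings) goes to AdamW.
    """
    muon, adamw = [], []
    for name, shape in shapes.items():
        is_matrix = len(shape) == 2
        is_excluded = "embed" in name or "norm" in name
        if is_matrix and not is_excluded:
            muon.append(name)
        else:
            adamw.append(name)
    return muon, adamw

# Hypothetical parameter names and shapes for a small transformer block.
shapes = {
    "tok_embed.weight": (50257, 768),
    "blocks.0.attn.qkv.weight": (2304, 768),
    "blocks.0.mlp.fc.weight": (3072, 768),
    "blocks.0.norm1.weight": (768,),
    "blocks.0.mlp.fc.bias": (3072,),
}

muon_names, adamw_names = route_parameters(shapes)
print("Muon:", muon_names)
print("AdamW:", adamw_names)
```

Routing by shape and name keeps the optimizer split explicit and easy to audit when new layers are added.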

7. Significance and Extensions

Muon+ establishes a new standard for matrix-aware optimization in deep learning, yielding pervasive improvements in convergence, loss, and stability. Its one-line normalization enhancement generalizes readily to new architectures, requires no retuning, and is robust across scales. Theoretical grounding within non-Euclidean mirror descent, spectral regularization, and blockwise adaptive trust-regions supports Muon+’s empirical success. Extensive pretraining and fine-tuning trials on both LLMs and vision transformers corroborate its efficiency and generality (Zhang et al., 25 Feb 2026, Mehta et al., 29 Sep 2025, Southworth et al., 23 May 2026).

The Muon framework continues to expand, with recent advances such as curvature-aware extensions (e.g., Mousse), mixed Muon–SGD hybrids, and fully schedule-free variants. These lines of work aim to harness Muon’s spectral geometry while further refining its adaptivity, scaling behavior, and generalization guarantees.

References (9)

Muon+: Towards Better Muon via One Additional Normalization Step (2026)

Muon: Training and Trade-offs with Latent Attention and MoE (2025)

Muon Optimizes Under Spectral Norm Constraints (2025)

On the Convergence Analysis of Muon (2025)

On the Convergence of Muon and Beyond (2025)

Muon is Provably Faster with Momentum Variance Reduction (2025)

Muon in Vision Transformers: Optimizer-Recipe Interactions and Gradient Spectra (2026)

Muon Optimizer Accelerates Grokking (2025)

Spectral Flattening Is All Muon Needs: How Orthogonalization Controls Learning Rate and Convergence (2026)
