
Muon-based Gradient Optimizer (MuSGD)

Updated 4 June 2026
  • Muon-based Gradient Optimizer (MuSGD) is a matrix-aware stochastic method that applies polar decomposition and Newton–Schulz iterations to orthogonalize momentum updates.
  • The approach enhances convergence by enforcing spectral constraints and integrating a per-row/column normalization, achieving better performance than AdamW and SGD.
  • Empirical benchmarks on GPT and LLaMA models show significant improvements in validation perplexity and training efficiency with negligible computational overhead.

The Muon-based Gradient Optimizer (commonly termed MuSGD) is a matrix-aware stochastic optimization method characterized by spectral-norm orthogonalization of momentum or gradient updates. Originally developed for efficient and stable pre-training of LLMs, MuSGD proceeds via momentum accumulation and orthogonalization using polar decomposition or Newton–Schulz iterations, followed by a geometrically scaled parameter update. The approach has led to substantial performance gains over AdamW and standard SGD in both theoretical convergence and empirical efficiency, especially on large-scale transformer architectures. The Muon+ variant introduces an additional per-row or per-column normalization after the orthogonalization step, yielding further robustness and improvements in validation perplexity across GPT and LLaMA model families (Zhang et al., 25 Feb 2026).

1. Mathematical Formulation and Algorithmic Workflow

A single MuSGD step at iteration $t$ for a matrix parameter $W_t \in \mathbb{R}^{m \times n}$ proceeds as follows:

$$
\begin{aligned}
M_t &= \mu\, M_{t-1} + (1-\mu)\, G_t, \\
O_t &= \operatorname{Ortho}(M_t), \quad \operatorname{Ortho}(M) = M (M^\top M)^{-1/2} \approx U V^\top, \\
W_t &= W_{t-1} - \eta \sqrt{m/n}\, O_t,
\end{aligned}
$$

where $G_t$ is the stochastic gradient, $\mu$ the momentum coefficient, and $\eta$ the learning rate. The orthogonalization ensures that $O_t^\top O_t \approx I$, enforcing a unit spectral norm and equivalently projecting the update onto the Stiefel manifold (Zhang et al., 25 Feb 2026, Mehta et al., 29 Sep 2025).
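As a minimal sketch of the step above, the following NumPy code computes the polar factor with an exact thin SVD instead of the fast iterative route. The function names `ortho_svd` and `musgd_step` and the default values `lr=0.02` and `mu=0.95` are assumptions made for illustration, not values taken from the cited papers.

```python
import numpy as np

def ortho_svd(M):
    """Polar factor U V^T of M via a thin SVD (exact reference, not the fast path)."""
    U, _, Vt = np.linalg.svd(M, full_matrices=False)
    return U @ Vt

def musgd_step(W, M_prev, G, lr=0.02, mu=0.95):
    """One step: EMA momentum, orthogonalization, shape-scaled update.

    lr and mu are illustrative defaults (assumptions of this sketch).
    """
    m, n = W.shape
    M = mu * M_prev + (1.0 - mu) * G        # momentum accumulation
    O = ortho_svd(M)                        # orthogonalized direction
    W_new = W - lr * np.sqrt(m / n) * O     # geometrically scaled update
    return W_new, M, O

rng = np.random.default_rng(0)
W = rng.standard_normal((8, 4))
G = rng.standard_normal((8, 4))
W_new, M, O = musgd_step(W, np.zeros_like(W), G)

# For a full-rank tall matrix the columns of O are orthonormal.
print(np.allclose(O.T @ O, np.eye(4)))
print(np.linalg.norm(O, 2))  # spectral norm of the update direction
```

The SVD route costs more than an iterative approximation, but it is a convenient reference when checking a faster orthogonalization kernel.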

In practical implementations, the polar factor is computed efficiently using Newton–Schulz iterations of the form:

$$
\begin{aligned}
X_0 &= \tfrac{1}{\alpha} M, \quad \alpha = \| M \|_2, \\
X_{k+1} &= \tfrac{3}{2} X_k - \tfrac{1}{2} X_k (X_k^\top X_k),
\end{aligned}
$$

with the iteration count $K$ typically fixed at 5 for stability and throughput. This procedure approximates $M (M^\top M)^{-1/2}$ with controllable error in singular values (Zhang et al., 25 Feb 2026).
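The cubic Newton–Schulz iteration above can be sketched in a few lines of NumPy and compared against the exact polar factor from an SVD. The function name `newton_schulz` and the test matrix size are assumptions for illustration; production kernels may use different polynomial coefficients or a cheaper norm for the initial scaling.

```python
import numpy as np

def newton_schulz(M, steps=5):
    """Approximate the polar factor of a nonzero matrix M.

    Uses the cubic iteration X <- 1.5 X - 0.5 X (X^T X) after scaling M
    by its spectral norm so that all singular values lie in (0, 1].
    """
    X = M / np.linalg.norm(M, 2)
    for _ in range(steps):
        X = 1.5 * X - 0.5 * X @ (X.T @ X)
    return X

rng = np.random.default_rng(0)
M = rng.standard_normal((64, 32))
U, _, Vt = np.linalg.svd(M, full_matrices=False)
exact = U @ Vt  # reference polar factor

# Spectral-norm error of the approximation for several iteration counts.
for k in (5, 10, 20):
    err = np.linalg.norm(newton_schulz(M, k) - exact, 2)
    print(k, err)
```

Each iteration pushes every singular value of $X_k$ toward 1 while leaving the singular vectors unchanged, so the error after a fixed number of steps depends on how small the smallest scaled singular value is at the start.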

In Muon+, an additional normalization pass is applied:

[equation not recovered]

where, for instance, column-wise,

[equation not recovered]

This stabilizes the scale of updates and controls the sensitivity to […] (Zhang et al., 25 Feb 2026).
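A minimal sketch of such a normalization pass, assuming each column (or row) of the orthogonalized update is divided by its Euclidean norm plus a small constant. The choice of norm, the `eps` guard, and the names `normalize_update` and `mode` are assumptions for illustration rather than details taken from Zhang et al.

```python
import numpy as np

def normalize_update(O, mode="col", eps=1e-8):
    """Rescale columns and/or rows of O to unit L2 norm.

    mode: "col", "row", or "col_row" (columns first, then rows).
    This is an assumed form of the Muon+ pass, not a confirmed one.
    """
    if mode not in ("col", "row", "col_row"):
        raise ValueError(f"unknown mode: {mode!r}")
    if mode in ("col", "col_row"):
        O = O / (np.linalg.norm(O, axis=0, keepdims=True) + eps)
    if mode in ("row", "col_row"):
        O = O / (np.linalg.norm(O, axis=1, keepdims=True) + eps)
    return O

rng = np.random.default_rng(0)
# Stand-in for an approximately orthogonalized update (hypothetical data).
O = 0.3 * rng.standard_normal((4, 6))
print(np.linalg.norm(O, axis=0))                           # column norms before
print(np.linalg.norm(normalize_update(O, "col"), axis=0))  # column norms after
```

For an exact polar factor of a tall full-rank matrix the columns already have unit norm, so a column pass of this kind only changes the update when the orthogonalization is approximate or the matrix is wide.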

2. Spectral Regularization and Theoretical Principles

Muon’s key innovation is the use of spectral-norm constraints at each update. This can be formalized in the Lion-$\mathcal{K}$ mirror descent family, where the nuclear norm $\|\cdot\|_*$ defines the regularization and $\operatorname{msign}$ (matrix sign function) acts as a subgradient preconditioner. The Muon update is then equivalent to solving:

[equation not recovered]

for weight-decay parameter $\lambda$. The optimizer enforces the corresponding spectral-norm constraint ([…]) throughout training, leading to spectral regularization and improved generalization. These updates are strictly dual to enforcing spectral constraints at each iteration via Fenchel conjugacy and KKT stationarity (Chen et al., 18 Jun 2025).
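One way to see how decoupled weight decay can keep the spectral norm bounded: assume the update has the form $W \leftarrow (1 - \eta\lambda) W - \eta O$ with $\|O\|_2 \le 1$ and $\eta\lambda \le 1$ (the $\sqrt{m/n}$ shape factor is omitted, which is an assumption of this sketch). The triangle inequality then gives $\|W_{t+1}\|_2 \le (1 - \eta\lambda)\|W_t\|_2 + \eta$, so a norm that starts at or below $1/\lambda$ stays there. The simulation below checks this with random stand-in gradients; the names `polar` and `run` and all numeric values are illustrative.

```python
import numpy as np

def polar(M):
    """Exact polar factor via thin SVD (unit spectral norm for full-rank M)."""
    U, _, Vt = np.linalg.svd(M, full_matrices=False)
    return U @ Vt

def run(lam=0.5, lr=0.1, steps=200, shape=(6, 4), seed=0):
    """Track ||W||_2 under W <- (1 - lr*lam) W - lr * O with ||O||_2 = 1."""
    rng = np.random.default_rng(seed)
    W = np.zeros(shape)
    norms = []
    for _ in range(steps):
        O = polar(rng.standard_normal(shape))  # random stand-in for a gradient
        W = (1.0 - lr * lam) * W - lr * O      # decoupled weight decay + step
        norms.append(np.linalg.norm(W, 2))
    return np.array(norms)

norms = run()
print(norms.max(), "bound:", 1.0 / 0.5)
```

The bound in this sketch follows only from the assumed update form; it is not a restatement of the constrained problem in Chen et al.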

On non-square matrices—or with block-structured neural weights—this spectral flattening (enforcing all singular values to unity or near unity) yields highly controlled update directions. The explicit orthogonalization acts as a geometry-aware normalization, enhancing both convergence and step-size robustness (Shen et al., 29 May 2025).

3. Convergence Rates and Variance-Reduction Extensions

Standard stochastic MuSGD achieves a nonconvex convergence rate of […] on the expected gradient norm. Under additional smoothness and PL conditions, the optimizer enjoys […] or […] (with variance reduction) rates, matching lower bounds for this problem class:

  • Option EMA (standard Muon): […] on ergodic gradient norm (Chang et al., 19 Sep 2025).
  • Muon-VR2 (variance reduction): […] with two-batch correction and properly coupled step/momentum schedules (Chang et al., 19 Sep 2025, Qian et al., 18 Dec 2025).

Variance-reduced momentum (MVR) techniques integrated in the Gluon-MVR-2 framework yield the optimal nonconvex rate of […]. These refinements involve inner-outer double buffering and strong per-layer relative smoothness assumptions, amplifying stability at large batch sizes (Qian et al., 18 Dec 2025).

Table: Convergence Rates

| Method | Nonconvex Rate | Reference |
|---|---|---|
| SGD, AdamW | […] | Standard Theory |
| MuSGD Standard | […] | (Chang et al., 19 Sep 2025) |
| MuSGD Variance-Reduced | […] | (Qian et al., 18 Dec 2025) |

4. Empirical Performance and Robustness

Extensive pretraining benchmarks confirm that Muon and Muon+ consistently outperform AdamW across architectures and training regimes:

  • GPT-style models (124M–774M): Muon+ improves validation perplexity by up to about 2 points over Muon (e.g., 29.66 → 27.64, GPT-Small) (Zhang et al., 25 Feb 2026).
  • LLaMA-style models (60M–1B): Similar robust improvements (e.g., LLaMA-1B, 10.68 → 10.31 PPL).
  • Compute-optimal (T2P ≈ 20) and over-training (T2P ≈ 200): Perplexity improvements persist with extended tokens.

| Model | Muon PPL | Muon+ PPL | ΔPPL |
|---|---|---|---|
| GPT-Small | 29.66 | 27.64 | -2.02 |
| GPT-Base | 21.70 | 19.98 | -1.72 |
| GPT-Large | 17.82 | 16.91 | -0.91 |
| LLaMA-60M | 25.75 | 25.25 | -0.50 |
| LLaMA-1B | 10.68 | 10.31 | -0.37 |
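The ΔPPL column can be recomputed from the two perplexity columns, and the same numbers give the relative improvement per model. The short script below does only that arithmetic on the table values; the helper name `deltas` is hypothetical.

```python
# Perplexities copied from the table above: (model, Muon PPL, Muon+ PPL).
rows = [
    ("GPT-Small", 29.66, 27.64),
    ("GPT-Base", 21.70, 19.98),
    ("GPT-Large", 17.82, 16.91),
    ("LLaMA-60M", 25.75, 25.25),
    ("LLaMA-1B", 10.68, 10.31),
]

def deltas(rows):
    """Return (model, absolute change, relative reduction in percent)."""
    out = []
    for name, muon, muon_plus in rows:
        delta = round(muon_plus - muon, 2)
        rel = round(100.0 * (muon - muon_plus) / muon, 2)
        out.append((name, delta, rel))
    return out

for name, delta, rel in deltas(rows):
    print(f"{name:10s} dPPL={delta:+.2f}  relative reduction={rel:.2f}%")
```

Read this way, the absolute gain shrinks as the baseline perplexity falls, so the relative reduction is the more comparable figure across model sizes.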

Overhead from the Muon+ normalization step is negligible ([…]) relative to the Newton–Schulz orthogonalization kernel ([…]). Muon+ requires no additional hyperparameter tuning relative to base Muon; the effective learning-rate window often widens due to increased scale-invariance and normalized update geometry (Zhang et al., 25 Feb 2026). Integration is straightforward: the normalization stage is a one-line addition after the polar factor is computed.

In the compute/epoch Pareto regime, Muon reaches the target loss with about half the training budget of AdamW while maintaining or improving perplexity. Additional robustness gains are reported for large-batch training, reduced grokking latency, and alleviation of spectral collapse in deep vision transformers (ViTs) (Mehta et al., 29 Sep 2025, Southworth et al., 23 May 2026, Tveit et al., 22 Apr 2025).

5. Comparative Analysis and Theoretical Insights

Muon uniformly outperforms SGD and AdamW in settings where the Hessian or gradient covariance is low-rank, block-diagonal, or highly anisotropic—regimes typical in modern transformers and wide MLPs. The spectral-norm constraint enables learning rates up to the scale of the average singular value of the gradient (not the largest), a mechanism termed "spectral flattening" (Nguyen et al., 13 May 2026). This greatly increases both the maximal stable step and the convergence rate under Kronecker-factored or K-FAC curvature models.

Muon’s update direction coincides with steepest descent under spectral norm and implements the natural gradient on the Stiefel manifold for square matrices (Mehta et al., 29 Sep 2025). In broader terms, the Muon step falls under non-Euclidean mirror descent, interpreted as a linear minimization oracle (LMO) constrained to a spectral-norm ball, with decoupled weight decay enforcing explicit operator-norm regularization throughout training (Chen et al., 18 Jun 2025, Qian et al., 18 Dec 2025).
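The steepest-descent reading can be checked numerically. Over the spectral-norm unit ball, the inner product $\langle G, D \rangle$ is maximized by $D = U V^\top$, where it equals the nuclear norm of $G$; the descent step uses the negative of that direction. The sketch below compares this maximizer against random feasible directions; the helper `random_unit_spectral` and the matrix size are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
G = rng.standard_normal((6, 4))

U, s, Vt = np.linalg.svd(G, full_matrices=False)
D_star = U @ Vt               # candidate maximizer with ||D_star||_2 = 1
best = np.sum(G * D_star)     # <G, D_star>, the sum of singular values of G

def random_unit_spectral(rng, shape):
    """Random direction rescaled to spectral norm exactly 1."""
    D = rng.standard_normal(shape)
    return D / np.linalg.norm(D, 2)

# Inner products for random feasible directions never exceed the optimum.
others = [np.sum(G * random_unit_spectral(rng, G.shape)) for _ in range(1000)]
print(best, s.sum(), max(others))
```

This is the linear minimization oracle view in miniature: the oracle returns $-U V^\top$, and the attained value is the dual (nuclear) norm of the gradient.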

6. Practical Integration and Recommendations

  • Per-layer application: Muon and Muon+ are applied to all non-embedding, non-norm matrix parameters; AdamW is recommended for scalar parameters and token embeddings, and optionally for LayerNorm parameters.
  • Hyperparameters: Default learning rates for Muon+ are robust; a typical range is […] to […], with momentum […] and 5 Newton–Schulz steps per iteration.
  • Scaling considerations: The normalization factor $\sqrt{m/n}$ is included to match the update scale across rectangular matrices; weight decay and gradient clipping are unchanged.
  • Implementation: Replace the orthogonal update step in preexisting Muon code with a normalized variant as specified. Adjust the normalization axis (“col”, “row”, or both) empirically; column-then-row is recommended for best robustness.
  • System overhead: Newton–Schulz iterations dominate per-layer time; the added normalization has minimal overhead—empirically under 5% of optimizer compute (Zhang et al., 25 Feb 2026).
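The per-layer routing in the first recommendation can be sketched as a simple partition over parameter names and shapes. The parameter names, the keyword list, and the function `split_params` below are hypothetical and stand in for whatever naming scheme a given model uses.

```python
def split_params(shapes, adamw_keywords=("emb", "norm", "ln")):
    """Partition parameter names into Muon and AdamW groups.

    Muon group: 2-D parameters whose names match none of the keywords.
    AdamW group: everything else (embeddings, norm layers, vectors, scalars).
    The keyword matching is an assumption of this sketch.
    """
    muon, adamw = [], []
    for name, shape in shapes.items():
        is_matrix = len(shape) == 2
        excluded = any(key in name for key in adamw_keywords)
        (muon if is_matrix and not excluded else adamw).append(name)
    return muon, adamw

# Hypothetical parameter shapes for a small transformer block.
shapes = {
    "tok_emb.weight": (50257, 768),
    "blocks.0.attn.qkv.weight": (2304, 768),
    "blocks.0.mlp.fc.weight": (3072, 768),
    "blocks.0.mlp.fc.bias": (3072,),
    "blocks.0.ln1.weight": (768,),
}

muon_names, adamw_names = split_params(shapes)
print("Muon: ", muon_names)
print("AdamW:", adamw_names)
```

In a real training script the two name lists would typically feed two optimizer instances or parameter groups; substring matching is fragile, so an explicit allow-list may be safer for unfamiliar architectures.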

7. Significance and Extensions

Muon+ establishes a new standard for matrix-aware optimization in deep learning, yielding pervasive improvements in convergence, loss, and stability. Its one-line normalization enhancement generalizes readily to new architectures, requires no retuning, and is robust across scales. Theoretical grounding within non-Euclidean mirror descent, spectral regularization, and blockwise adaptive trust-regions supports Muon+’s empirical success. Extensive pretraining and fine-tuning trials on both LLMs and vision transformers corroborate its efficiency and generality (Zhang et al., 25 Feb 2026, Mehta et al., 29 Sep 2025, Southworth et al., 23 May 2026).

The Muon framework continues to expand, with recent advances such as curvature-aware extensions (e.g., Mousse), mixed Muon–SGD hybrids, and fully schedule-free variants. These lines of work aim to harness Muon’s spectral geometry while further refining its adaptivity, scaling behavior, and generalization guarantees.
