Muon Optimizer: Matrix-Aware Learning
- Muon Optimizer is a matrix-structured, geometry-aware algorithm that enhances deep learning stability and efficiency.
- It employs spectral norm-based update rules and implicit regularization to control the Lipschitz constant and improve generalization.
- Demonstrated in large-scale LLM pretraining and distributed training, it achieves competitive convergence and reduced memory overhead.
The Muon optimizer is a geometry-aware, matrix-structured optimization algorithm designed to improve the stability, efficiency, and scalability of large-scale machine learning—especially deep neural network training. Unlike traditional first-order optimizers such as Adam, which treat neural network parameters as flattened vectors and maintain per-coordinate adaptivity, Muon directly exploits the matrix structure of weight parameters, uses spectral norm-based update rules, and implements a form of implicit regularization via spectral norm constraints. It has been theoretically analyzed within the Lion-𝒦 framework and demonstrated in settings ranging from LLM pretraining to communication-efficient distributed learning.
1. Mathematical Foundation and Update Mechanism
Muon operates within the Lion-𝒦 family of optimizers by selecting the nuclear norm $\mathcal{K}(X) = \|X\|_*$ as its convex map, yielding a matrix-structured update. For a parameter matrix $W_t$ with gradient $G_t$ and momentum buffer $M_t$, Muon’s core update (with decoupled weight decay $\lambda$ and learning rate $\eta$) is
$$M_t = \beta M_{t-1} + (1-\beta)\,G_t, \qquad W_{t+1} = W_t - \eta\big(\nabla\mathcal{K}(M_t) + \lambda W_t\big),$$
where $\mathcal{K}(X) = \|X\|_*$ (the nuclear norm), and $\nabla\mathcal{K}(M_t) = \mathrm{msign}(M_t) = U_t V_t^\top$ is the matrix sign function constructed from the singular vectors of $M_t = U_t \Sigma_t V_t^\top$. In practice, Muon often replaces the exact SVD with iterative Newton-Schulz orthogonalization for efficiency in large layers.
The optimizer update corresponds to a steepest-descent step under the spectral norm:
$$U V^\top \in \arg\max_{\|A\|_2 \le 1} \langle M_t, A\rangle,$$
where $M_t = U \Sigma V^\top$ is the singular value decomposition of the gradient or momentum matrix.
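As a concrete illustration, the following PyTorch sketch implements this update with Newton-Schulz orthogonalization in place of an explicit SVD; the function names, the quintic coefficients, and the default hyperparameters (`steps`, `lr`, `wd`, `beta`) are illustrative choices rather than the reference implementation.

```python
import torch

def newton_schulz_msign(M: torch.Tensor, steps: int = 5) -> torch.Tensor:
    """Approximate msign(M) = U V^T (U, V from the SVD of M) with a
    Newton-Schulz iteration, avoiding an explicit SVD."""
    a, b, c = 3.4445, -4.7750, 2.0315   # quintic coefficients; illustrative choice
    X = M / (M.norm() + 1e-7)           # rescale so all singular values are <= 1
    transposed = X.shape[0] > X.shape[1]
    if transposed:                      # iterate on the "wide" orientation
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return X.T if transposed else X

def muon_step(W, M, G, lr=0.02, wd=0.01, beta=0.95):
    """One Muon step with decoupled weight decay:
    M <- beta*M + (1-beta)*G,  W <- W - lr*(msign(M) + wd*W)."""
    M.mul_(beta).add_(G, alpha=1 - beta)    # single first-moment buffer
    O = newton_schulz_msign(M)              # orthogonalized update direction
    W.add_(O + wd * W, alpha=-lr)           # decoupled weight decay
    return W, M
```

The Newton-Schulz loop only requires matrix multiplications, which is why it maps well onto accelerators and large layers where a per-step SVD would be prohibitive.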
2. Implicit Spectral Norm Constraint and Regularization
A central result from recent theory is that Muon, with decoupled weight decay, implicitly solves a constrained optimization problem:
$$\min_{W} f(W) \quad \text{subject to} \quad \|W\|_2 \le \tfrac{1}{\lambda}.$$
That is, Muon enforces an upper bound on the spectral norm (largest singular value) of weight matrices, with the effective weight decay $\lambda$ setting the radius of the constraint. This regularization constrains the Lipschitz constant of the network and improves stability and generalization. The connection between the nuclear norm in the update and the spectral norm in the constraint follows from convex duality: the spectral norm is the dual of the nuclear norm.
This mechanism is made rigorous through KKT conditions, with Muon’s iterates shown to converge to stationary points of the constrained problem under standard smoothness and stochasticity assumptions.
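The duality argument can be summarized as follows; this is a sketch under the assumptions above, with $\lambda$ the decoupled weight-decay coefficient and $\mathcal{K}$ the nuclear norm.

```latex
\begin{align*}
&\textbf{Dual pairing:}\quad
  \|A\|_* = \max_{\|B\|_2 \le 1} \langle A, B\rangle ,
  \qquad
  \|B\|_2 = \max_{\|A\|_* \le 1} \langle A, B\rangle . \\[4pt]
&\textbf{Update direction:}\quad
  \nabla\mathcal{K}(M) = U V^\top \in \arg\max_{\|B\|_2 \le 1} \langle M, B\rangle
  \quad\text{for } M = U \Sigma V^\top . \\[4pt]
&\textbf{Implicit constraint:}\quad
  \text{with decoupled weight decay } \lambda,\ \text{stationary points satisfy the KKT conditions of} \\
&\qquad\qquad
  \min_{W}\ f(W) \quad \text{s.t.} \quad \|W\|_2 \le \tfrac{1}{\lambda}.
\end{align*}
```

In words: the update maximizes a linear form over the spectral-norm ball, so the weight decay acts as the multiplier of an implicit spectral-norm ball constraint rather than as an explicit Frobenius penalty.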
3. Convergence and Practical Efficiency
Muon’s convergence has been established in both deterministic and stochastic settings:
- For convex and nonconvex objectives with matrix Lipschitz smoothness (measured in the spectral norm), the complexity of reaching an $\epsilon$-nuclear-norm stationary point, i.e., $\|\nabla f(W)\|_* \le \epsilon$, is:
- Deterministic: $\mathcal{O}(\epsilon^{-2})$ iterations, with the constant governed by $L$, $\Delta$, and (depending on the smoothness model) the rank $r$.
- Stochastic: $\mathcal{O}(\epsilon^{-4})$ iterations, with additional dependence on the gradient-noise variance.
- Here $L$ is the spectral-norm smoothness constant, $r$ is the matrix rank, and $\Delta = f(W_0) - f^\star$ is the initial suboptimality gap.
- Muon has been shown to achieve $\mathcal{O}(T^{-1/4})$ convergence rates for the average gradient norm, while remaining effective in practice even at batch size one, in contrast to many normalized optimizers which degrade under high gradient noise.
- When the Hessian structure is low-rank or blockwise diagonal—a common scenario in deep learning—the spectral norm smoothness relevant for Muon is much smaller than the Frobenius norm smoothness relevant in gradient descent, granting Muon a substantial computational advantage.
Experimental work supports the conclusion that Muon outperforms gradient descent, especially in early optimization phases and when neural network parameters exhibit exploitable structure.
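One way to see where the advantage comes from is the descent lemma under the spectral/nuclear norm pair; the following sketch assumes $f$ is $L_2$-smooth with respect to the spectral norm, i.e. $\|\nabla f(X) - \nabla f(Y)\|_* \le L_2\,\|X - Y\|_2$.

```latex
\begin{align*}
f\!\left(W - \eta\, U V^\top\right)
  &\le f(W) - \eta\,\langle \nabla f(W),\, U V^\top \rangle
      + \tfrac{L_2\,\eta^2}{2}\,\|U V^\top\|_2^2 \\
  &=   f(W) - \eta\,\|\nabla f(W)\|_{*} + \tfrac{L_2\,\eta^2}{2},
  \qquad \nabla f(W) = U \Sigma V^\top,\ \ \|U V^\top\|_2 = 1 .
\end{align*}
```

The per-step decrease is controlled by $L_2$ alone, so when the Hessian is low-rank or blockwise diagonal and $L_2$ is much smaller than the Frobenius smoothness constant, Muon’s step can be proportionally more aggressive than a comparable gradient-descent step.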
4. Applications in LLM Training and Distributed Frameworks
Muon has attained broad practical impact:
- In LLM pretraining, Muon expands the compute–loss Pareto frontier over AdamW, achieving target loss with less compute or fewer devices, especially at large batch sizes beyond the critical regime where AdamW deteriorates (AI et al., 4 May 2025). Muon achieves high data efficiency, stable hyperparameter transferability, and competitive wall-clock time and memory overheads.
- Empirically, Muon scales out-of-the-box to multi-billion parameter models (e.g., Moonlight, a 16B MoE model trained with 5.7T tokens) and enables efficient distributed implementations with reduced optimizer state and communication costs (Liu et al., 24 Feb 2025).
- In communication-efficient distributed training frameworks such as DiLoCo, Muon as the local (inner) optimizer allows for extreme delta quantization (down to 2 bits) when paired with error-feedback accumulators, preserving model accuracy while reducing communication volume by up to 8x, significantly outperforming AdamW (Thérien et al., 29 May 2025); a sketch of the error-feedback mechanism appears after this list.
- Within hybrid optimizers such as COSMOS, Muon is employed to precondition high-dimensional subspaces where state- or compute-intensive preconditioners like SOAP would be impractical, achieving strong token efficiency with minimal memory (Liu et al., 24 Feb 2025).
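Returning to the DiLoCo-style setup referenced above, the following is a minimal sketch of error-feedback delta quantization on the worker side; the quantizer, the 2-bit setting, and the function names are illustrative assumptions rather than the reference implementation, and the Muon inner optimizer is abstracted away.

```python
import torch

def quantize_symmetric(x: torch.Tensor, bits: int = 2):
    """Uniform symmetric quantization; returns the dequantized tensor and the scale."""
    levels = max(2 ** (bits - 1) - 1, 1)            # e.g. codes in {-1, 0, +1} for 2 bits
    scale = x.abs().max().clamp(min=1e-12) / levels
    codes = torch.clamp(torch.round(x / scale), -levels, levels)
    return codes * scale, scale

def compress_delta_with_error_feedback(local_w, global_w, error_buf, bits=2):
    """Quantize the outer delta (local minus global weights) and fold the
    quantization error into error_buf so it is re-sent in later rounds."""
    delta = (local_w - global_w) + error_buf        # add carried-over residual
    q_delta, _ = quantize_symmetric(delta, bits)    # what actually gets communicated
    error_buf.copy_(delta - q_delta)                # keep the residual locally
    return q_delta
```

Each worker runs several local (Muon) steps, transmits the quantized delta instead of the full-precision one, and the residual stored in `error_buf` is added back before the next round’s quantization, which is what preserves accuracy at such low bit widths.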
5. Algorithmic Design, Memory Efficiency, and Generalization
Muon maintains only a single first-moment buffer per parameter matrix and uses efficient matrix orthogonalization within each update, generally halving optimizer memory compared to AdamW. No second-order moments or per-coordinate learning rates are maintained. This low-state design underpins its scalability to large models and distributed setups.
A distinctive feature is layer-wise logic and adaptivity: in state-of-the-art frameworks, Muon is often deployed only on hidden/module layers, with AdamW or similar optimizers on input/output layers, or with adaptive per-layer stepsizes derived through generalized smoothness models (as formulated in the Gluon extension (Riabinin et al., 19 May 2025)).
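A common grouping heuristic for this hybrid deployment is sketched below; the name-matching rules ("embed", "lm_head") and the two-way split are illustrative assumptions for a standard PyTorch decoder-style model, not a prescribed recipe.

```python
import torch
from torch import nn

def split_params_for_hybrid_optimizer(model: nn.Module):
    """Route 2D hidden weight matrices to Muon; embeddings, output head,
    biases, and norm parameters to AdamW (a common hybrid recipe)."""
    muon_params, adamw_params = [], []
    for name, p in model.named_parameters():
        if not p.requires_grad:
            continue
        is_matrix = p.ndim == 2
        is_embedding_or_head = "embed" in name or "lm_head" in name
        if is_matrix and not is_embedding_or_head:
            muon_params.append(p)
        else:
            adamw_params.append(p)
    return muon_params, adamw_params
```

The `adamw_params` group would then be handed to `torch.optim.AdamW`, while the `muon_params` group is updated with the matrix-aware step from Section 1; per-layer stepsizes, as in the Gluon formulation, can be attached to the Muon group.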
Muon's implicit spectral norm constraint acts as powerful regularization, controlling parameter growth and enabling improved generalization and robustness to overfitting.
6. Practical Performance, Limitations, and Future Directions
- Muon has demonstrated up to 2x computational efficiency over AdamW in compute-optimal LLM pretraining (Liu et al., 24 Feb 2025), and 10–15% improved token efficiency, particularly in large or post-critical-batch regimes (AI et al., 4 May 2025).
- In minimalist and memory-constrained optimizer studies, Muon achieves performance competitive with or superior to compressed or adaptive schemes (e.g., APOLLO, SWAN, Fira). However, recent work with SCALE suggests similar or better performance can sometimes be achieved using column-wise normalization and strategic momentum allocation at even lower memory (Glentis et al., 20 Jun 2025). This suggests that while Muon sets a high standard for efficient adaptation, further minimality may be achievable in carefully engineered settings.
- Limitations include potential computational cost for full SVD or Newton-Schulz iterations at extreme scale (though typically negligible with optimized implementation), and dependence on the presence of meaningful Hessian structure (low-rankness, blockwise diagonalization) for optimal advantage.
- The Lion-𝒦 framework provides a blueprint for new optimizers: by choosing alternative convex maps , future work may develop new regularization or robustness properties for deep learning optimization, with Muon as a central, concretely analyzed example (Chen et al., 18 Jun 2025).
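As an informal sketch of this blueprint, the Lion-𝒦 update applies the (sub)gradient of a convex map $\mathcal{K}$ to the momentum, and different choices of $\mathcal{K}$ recover different optimizers and implicit constraints (written here in the simplified single-momentum form used above):

```latex
\begin{align*}
&M_t = \beta M_{t-1} + (1-\beta)\, G_t,
 \qquad
 W_{t+1} = W_t - \eta\big(\nabla \mathcal{K}(M_t) + \lambda W_t\big), \\[4pt]
&\mathcal{K}(X) = \|X\|_{*}
 \;\Rightarrow\; \nabla\mathcal{K}(M) = U V^\top
 \quad (\text{Muon; implicit constraint } \|W\|_2 \le 1/\lambda), \\
&\mathcal{K}(X) = \|X\|_{1}
 \;\Rightarrow\; \nabla\mathcal{K}(M) = \operatorname{sign}(M)
 \quad (\text{sign-based updates; implicit constraint } \|W\|_\infty \le 1/\lambda).
\end{align*}
```

Different choices of $\mathcal{K}$ therefore trade the geometry of the update against the constraint imposed through its dual norm.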
| Optimizer | State Required | Regularization/Constraint | Memory Cost (7B LLM) | Notes |
|---|---|---|---|---|
| AdamW | 1st + 2nd moment (all layers) | No explicit structural constraint | High (~40 GB) | Adaptive per-coordinate rates, high memory |
| Muon | 1st moment (all layers) | Spectral norm constraint (implicit) | Medium (~27 GB) | Matrix-aware, low memory |
| SCALE | Momentum on last layer only | Column-wise normalization (implicit, stateless) | Low (~14 GB) | Minimalist, stateless |
7. References to Key Developments
- (Li et al., 5 Feb 2025, Liu et al., 24 Feb 2025, AI et al., 4 May 2025): Practical convergence analysis, scalability, and empirical efficiency for LLMs.
- (Shen et al., 29 May 2025, Chen et al., 18 Jun 2025): Theoretical foundations, spectral norm constraint, and positioning within the Lion-𝒦 family.
- (Riabinin et al., 19 May 2025): Layer-wise smoothness, generalized theory, and bridging the gap between practice and theory via the Gluon framework.
- (Thérien et al., 29 May 2025, Liu et al., 24 Feb 2025): Communication-efficient distributed learning and memory-efficient hybrid optimizers.
- (Glentis et al., 20 Jun 2025): Minimalist, column-normalized alternatives and benchmarking.
Muon thus represents a significant advance in matrix-based optimization for deep learning, combining rigorous theoretical underpinnings with practical, scalable implementation in large-scale and distributed training environments. Its impact is marked by its broad adoption and empirical success in both industry-scale and cutting-edge research LLMs.