
Muon Optimizer Implementation

Updated 4 January 2026
  • Muon Optimizer Implementation is a matrix-oriented, orthogonality-based algorithm that projects gradients onto orthogonal matrices for stable, efficient training.
  • It leverages polar factorization via Newton–Schulz iteration and Turbo-Muon’s spectral preconditioning to reduce iterations and computational cost.
  • Empirical benchmarks demonstrate improved convergence speed and numerical conditioning in deep models compared to traditional optimizers like AdamW.

The Muon optimizer is a matrix-oriented, orthogonality-based optimization algorithm designed for large-scale deep learning, primarily in vision and language modeling domains. Muon performs weight updates by projecting gradients or momentum matrices onto the set of orthogonal (or pseudo-orthogonal) matrices, leveraging polar factorization via iterative polynomial maps such as Newton–Schulz. This approach yields improved isotropy in updates, stable training dynamics, and effective conditioning, resulting in superior data efficiency relative to traditional optimizers like AdamW. Recent developments include Turbo-Muon, which enhances the Newton–Schulz step through spectral preconditioning, leading to significant reductions in computational cost without compromising orthogonality or model quality (Boissin et al., 4 Dec 2025).

1. Principles of Orthogonality-Based Optimization

Orthogonality-based optimization proceeds by transforming each layer’s raw gradient (reshaped into a 2D matrix) into its nearest orthogonal matrix, known as the “polar factor.” For a matrix $X$ with singular value decomposition $X = U \Sigma V^{\top}$, the polar factor is $Q = U V^{\top}$. This projection stabilizes training by enforcing isotropic updates and regularizing the spectral properties of weight changes, which is especially advantageous for deep architectures and LLMs (Mehta et al., 29 Sep 2025, Boissin et al., 4 Dec 2025). Empirical evidence demonstrates an expanded Pareto frontier and reduced training FLOPs for Muon compared to AdamW (Liu et al., 24 Feb 2025, AI et al., 4 May 2025).
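
For reference, a minimal PyTorch sketch (not from the paper) of the exact SVD-based polar factor that the Newton–Schulz iterations below approximate:

import torch

def polar_factor_svd(X: torch.Tensor) -> torch.Tensor:
    """Exact polar factor Q = U V^T via full SVD (reference implementation)."""
    U, S, Vh = torch.linalg.svd(X, full_matrices=False)
    return U @ Vh  # nearest (semi-)orthogonal matrix to X in Frobenius norm

In practice the SVD is avoided because it is slow on GPUs and hard to fuse; the iterative polynomial maps of the next section approximate $Q$ using only matrix multiplications.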

2. Polar-Factor Approximation via Newton–Schulz Iteration

The Newton–Schulz (NS) map accelerates polar decomposition by avoiding explicit SVD computation, instead iteratively refining an initial guess via matrix polynomials. Two NS variants are frequently implemented:

  • Classical 2nd-order: $X_{k+1} = \frac{1}{2} X_k (3I - X_k^{\top} X_k)$
  • Quintic (Muon default): $X_{k+1} = a_k X_k + b_k X_k X_k^{\top} X_k + c_k X_k X_k^{\top} X_k X_k^{\top} X_k$

Typical Muon settings employ $K = 5$ iterations with empirically tuned coefficients $(a, b, c) = (3.4445, -4.7750, 2.0315)$ (Mehta et al., 29 Sep 2025). Each iteration involves a sequence of three fused matrix multiplications, yielding rapid convergence under spectral-norm scaling ($\|X_0\|_2 \leq 1$).
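
A minimal sketch of the quintic map (not the paper's fused kernels), applying the single fixed coefficient triple above at every iteration:

import torch

def newton_schulz_quintic(G: torch.Tensor, n_iters: int = 5) -> torch.Tensor:
    """Approximate the polar factor of G via the quintic Newton-Schulz map."""
    a, b, c = 3.4445, -4.7750, 2.0315   # empirically tuned Muon coefficients
    X = G / (G.norm() + 1e-7)           # Frobenius scaling ensures ||X||_2 <= 1
    for _ in range(n_iters):
        A = X.T @ X                      # Gram matrix
        X = a * X + X @ (b * A + c * (A @ A))  # quintic polynomial update
    return X

Since the Frobenius norm upper-bounds the spectral norm, this initial scaling guarantees the convergence condition $\|X_0\|_2 \leq 1$, at the cost of a wide initial singular value spread, which is exactly what Turbo-Muon's preconditioning addresses.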

3. Turbo-Muon Preconditioning: Accelerating NS Convergence

Turbo-Muon introduces an “Almost-Orthogonal Layer” (AOL) preconditioner that replaces the conventional Frobenius-norm scaling with a spectral norm-oriented diagonal scaling:

  • Previous scaling: $s = 1/\|X_0\|_F$, then $X_1 = s X_0$
  • AOL scaling: For $A_0 = X_0^{\top} X_0$, set $s_i = 1 / \sqrt{\sum_j |(A_0)_{ij}|}$ and $X_1 = X_0 \operatorname{diag}(s)$

This procedure tightens the singular value distribution of the initial state, substantially lowering polar error and improving numerical conditioning. Implementation leverages reuse of A0A_0 from the first NS step, incurring negligible additional cost. Subsequent NS steps revert to the standard iteration. Empirically, AOL preconditioning reduces the number of NS steps required (e.g., 5 → 4), yielding computational speedups up to 2.8× and reducing the orthogonalization overhead by ~20% (Boissin et al., 4 Dec 2025).

Table: NS Initialization Comparison

Scheme           | Scaling Method                       | Singular Value Spread | Convergence Speed
Frobenius        | $s = 1/\|X_0\|_F$                    | Wide                  | Slower; 5 iterations needed
Turbo-Muon (AOL) | $s_i = 1/\sqrt{\sum_j |(A_0)_{ij}|}$ | Tight                 | Faster; 4 iterations suffice
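
To make the table's contrast concrete, here is a small illustrative check on a synthetic ill-conditioned matrix (a hypothetical setup, not a benchmark from the paper):

import torch

# Hypothetical ill-conditioned "gradient" with singular values spanning 1e0 to 1e-3.
torch.manual_seed(0)
G = torch.randn(256, 256) @ torch.diag(torch.logspace(0, -3, 256))

# Frobenius scaling: one global scalar.
X_frob = G / G.norm()

# AOL scaling: per-column diagonal built from the Gram matrix.
A0 = G.T @ G
s = 1.0 / torch.sqrt(A0.abs().sum(dim=1))
X_aol = G * s.unsqueeze(0)

for name, X in (("Frobenius", X_frob), ("AOL", X_aol)):
    sv = torch.linalg.svdvals(X)
    print(f"{name}: sigma_max={sv.max():.3f}  sigma_min={sv.min():.2e}")

The AOL bound keeps $\sigma_{\max} \leq 1$ while lifting the small singular values relative to global Frobenius scaling, so the NS iteration starts closer to the polar factor.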

4. Algorithmic Integration and Pseudocode

A Turbo-Muon orthogonalization for a PyTorch layer’s gradient $G \in \mathbb{R}^{n \times m}$ proceeds as follows:

import torch

def turbo_muon_ortho(G, coeffs, n_iters=4):
    """Quintic Newton-Schulz orthogonalization with AOL preconditioning (Turbo-Muon)."""
    A0 = G.T @ G                                 # Gram matrix, reused for AOL scaling
    row_sum = torch.sum(torch.abs(A0), dim=1)    # AOL row sums of |A0|
    s = 1.0 / torch.sqrt(row_sum + 1e-12)        # diagonal preconditioner (eps for stability)
    X = G * s.unsqueeze(0)                       # X_1 = G diag(s)
    A = (s.unsqueeze(1) * A0) * s.unsqueeze(0)   # diag(s) A0 diag(s) = X_1^T X_1, no extra matmul
    for i in range(n_iters):
        a, b, c = coeffs[i]                      # per-iteration quintic coefficients
        B = b * A + c * (A @ A)
        X = a * X + X @ B                        # quintic Newton-Schulz update
        if i < n_iters - 1:
            A = X.T @ X                          # refresh Gram matrix for the next step
    return X

Integration is performed by overwriting gradients post-backward, before optimizer steps, without changing other hyperparameters or schedules. Turbo-Muon serves as a drop-in replacement for any Muon-based workflow (Boissin et al., 4 Dec 2025).
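
A hedged sketch of that integration point, assuming `model`, `optimizer`, `loss`, and a per-iteration coefficient list `COEFFS` are defined elsewhere (the hook placement, not the names, is the point):

import torch

# Illustrative training-loop hook: orthogonalize matrix-shaped gradients after
# backward, before the optimizer step. `model`, `optimizer`, `loss`, and
# `COEFFS` are assumed to exist in the surrounding training code.
loss.backward()
with torch.no_grad():
    for p in model.parameters():
        if p.grad is not None and p.grad.ndim == 2:   # 2D (matrix) layers only
            p.grad.copy_(turbo_muon_ortho(p.grad, COEFFS, n_iters=4))
optimizer.step()
optimizer.zero_grad()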

5. Benchmarks, Accuracy, and Resource Efficiency

Experimental results in (Boissin et al., 4 Dec 2025) establish Turbo-Muon as effectively lossless in terms of model accuracy while providing:

  • Matrix polar factor speedup: up to 2.8× per-layer at matched error
  • LLM step time improvement: 8–10% reduction for 1.3B/32K-token runs on A100-80GB
  • NanoGPT speed: 3% total runtime reduction (273.8 s → 266.0 s, identical final loss)
  • CIFAR-10 CNN: invariant accuracy, slight time reduction (2.66 s → 2.64 s per epoch)

Table: Empirical Effects of Turbo-Muon

Task                                               | Muon    | Turbo-Muon | Accuracy Change
GPT-2 (1.3B), per-step orthogonalization overhead  | ~10 ms  | ~9 ms      | None
NanoGPT (144M), total runtime                      | 273.8 s | 266.0 s    | ±0.002 loss
CIFAR-10 CNN, per-epoch time                       | 2.66 s  | 2.64 s     | ±0.01% accuracy

6. Implementation Guidelines and Best Practices

Turbo-Muon runs with the same momentum, weight decay, and learning-rate schedule as existing Muon and AdamW protocols. Critical recommendations (a sketch of the resulting update step follows this list):

  • Use four NS iterations and legacy polynomial coefficients.
  • Swap the legacy orthogonalization call in optimizer routines for Turbo-Muon.
  • Ensure CUDA ≥ 11.6 and install Triton for fused kernels.
  • The AOL preconditioning step is fused for efficiency and adds minimal extra memory overhead.
  • Monitor runtime drops and ensure final metrics remain unchanged.
  • No hyperparameter tuning required; validation can be conducted by running with the --orthonorm=turbomuon flag in supported repositories (Boissin et al., 4 Dec 2025).
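
A minimal sketch of a Muon-style update step built around `turbo_muon_ortho` from Section 4. The momentum-then-orthogonalize structure follows standard Muon practice; the buffer name and default hyperparameters here are illustrative, not prescribed by the paper:

import torch

@torch.no_grad()
def muon_style_step(param, grad, buf, coeffs, lr=0.02, beta=0.95, wd=0.0):
    """Illustrative Muon-style update with Turbo-Muon orthogonalization."""
    buf.mul_(beta).add_(grad)                 # heavy-ball momentum accumulation
    update = turbo_muon_ortho(buf, coeffs)    # project onto near-orthogonal matrices
    if wd > 0:
        param.mul_(1 - lr * wd)               # decoupled weight decay
    param.add_(update, alpha=-lr)             # apply orthogonalized update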

7. Theoretical and Practical Implications

Turbo-Muon’s AOL scaling guarantees convergence of the NS iteration and improves the singular value spectrum, making the optimizer robust to pathological gradient statistics. Its negligible overhead enables its adoption in large models with minimal engineering effort. The AOL idea is broadly applicable in any orthogonality-based optimizer relying on iterative polar factorization, unlocking the computational bottleneck typical of NS-type updates (Boissin et al., 4 Dec 2025). These advances solidify the Muon family as a principled alternative to adaptive optimizers, with direct implications for compute-limited regimes, federated settings, communication-constrained distributed training, and hyperparameter transferability (AI et al., 4 May 2025, Liu et al., 31 Oct 2025).
