
NorMuon Optimizer Overview

Updated 8 October 2025
  • NorMuon is a matrix-structured deep learning optimizer that combines orthogonal updates with neuron-wise normalization to balance update magnitudes.
  • It demonstrates a 21.74% reduction in training steps relative to AdamW at the 1.1B-parameter scale and keeps memory close to Muon's by adding only per-neuron second-moment statistics.
  • Its scalable FSDP2 implementation and modest latency increase (~3%) make NorMuon robust for distributed training of large language models.

NorMuon is a matrix-structured deep learning optimizer designed to address optimization inefficiencies in large-scale neural network training, particularly for LLMs. It extends the Muon optimizer by coupling the conditioning improvement from orthogonalized updates with neuron-wise normalization, thereby providing balanced adaptive learning rates at the level of individual neurons. This hybrid mechanism achieves substantial gains in training efficiency and stability, with minimal additional memory overhead, and is engineered for practical distributed deployment via the FSDP2 parallelism framework.

1. Optimization Scheme and Update Mechanism

NorMuon builds upon the Muon optimizer’s polar decomposition-based orthogonalization of momentum matrices. Whereas the Muon update for a parameter matrix $W$ maintains a momentum $M_t$ and applies an orthogonalization operator such that $O_t = \mathrm{NS5}(M_t)$ (with $\mathrm{NS5}$ denoting five Newton–Schulz iterations for approximate polar factorization), NorMuon introduces neuron-wise normalization post-orthogonalization. The optimizer computes first-order momentum:

$$M_t = \beta_1 M_{t-1} + (1-\beta_1)\,\nabla L(W_{t-1})$$

After orthogonalization:

$$O_t = \mathrm{NS5}(M_t)$$

NorMuon then computes per-neuron (row-wise) second-order momentum statistics:

$$v_t = \beta_2 v_{t-1} + (1-\beta_2)\,\operatorname{mean}_{\text{cols}}(O_t \odot O_t)$$

where $O_t \odot O_t$ denotes element-wise squaring and $\operatorname{mean}_{\text{cols}}$ takes the mean across the columns of each row. The normalized update is formed as:

$$\hat{O}_t = \frac{O_t}{\sqrt{V_t + \varepsilon}}$$

where $V_t$ is the vector $v_t$ broadcast to the shape of $O_t$ for row-wise scaling, and $\varepsilon$ avoids division by zero. To keep the global RMS of the update consistent with Adam-style scaling, a global rescaling factor $\hat{\eta}$ is applied to $\hat{O}_t$. The final weight update is:

$$W_{t+1} = W_t - \eta \lambda W_t - \hat{\eta}\,\hat{O}_t$$

where $\eta$ is the learning rate and $\lambda$ the decoupled weight-decay coefficient.

This dual adaptation, orthogonalization followed by neuron-wise normalization, corrects Muon’s tendency toward high variance in update norms across neurons while preserving the better-conditioned descent directions that orthogonalization provides.
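
To make the sequence of operations concrete, the following is a minimal PyTorch sketch of one NorMuon step for a single matrix parameter, following the equations above. The cubic Newton–Schulz iteration (used here in place of Muon's tuned NS5 polynomial), the hyperparameter defaults, and the exact form of the RMS rescaling are simplifying assumptions for illustration, not the paper's reference implementation.

```python
import torch


def newton_schulz_orthogonalize(M: torch.Tensor, steps: int = 5) -> torch.Tensor:
    """Approximate the orthogonal polar factor of M.

    Uses the textbook cubic Newton-Schulz iteration X <- 1.5*X - 0.5*X @ X.T @ X;
    Muon's NS5 uses a tuned quintic polynomial, but the fixed point is the same.
    """
    X = M / (M.norm() + 1e-7)          # scale so every singular value is <= 1
    for _ in range(steps):
        X = 1.5 * X - 0.5 * X @ X.T @ X
    return X


def normuon_step(W, M, v, grad, lr=0.02, beta1=0.95, beta2=0.95,
                 weight_decay=0.01, eps=1e-8, ns_steps=5):
    """One NorMuon-style step for a matrix parameter W with state (M, v), in place.

    M: first-order momentum, same shape as W.
    v: per-neuron (row-wise) second-moment estimate, shape (W.shape[0],).
    Hyperparameter defaults are illustrative placeholders, not the paper's settings.
    """
    # 1) First-order momentum: M_t = beta1 * M_{t-1} + (1 - beta1) * grad.
    M.mul_(beta1).add_(grad, alpha=1 - beta1)

    # 2) Orthogonalize the momentum (Muon's conditioning step): O_t = NS(M_t).
    O = newton_schulz_orthogonalize(M, steps=ns_steps)

    # 3) Per-neuron second moment: v_t = beta2 * v_{t-1} + (1 - beta2) * mean_cols(O_t ** 2).
    v.mul_(beta2).add_((O * O).mean(dim=1), alpha=1 - beta2)

    # 4) Row-wise normalization, broadcasting v over the columns of O.
    O_hat = O / (v + eps).sqrt().unsqueeze(1)

    # 5) Global rescaling so the applied update has an Adam-like RMS
    #    (the exact rescaling constant from the paper is not reproduced here).
    eta_hat = lr / (O_hat.pow(2).mean().sqrt() + eps)

    # 6) Decoupled weight decay, then the normalized, rescaled update:
    #    W_{t+1} = W_t - lr * weight_decay * W_t - eta_hat * O_hat.
    W.mul_(1 - lr * weight_decay)
    W.sub_(eta_hat * O_hat)
    return W
```

As with Muon, this matrix-structured rule targets 2-D weight matrices; 1-D parameters (biases, normalization scales, embeddings) are typically handled by an Adam-style update instead.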

2. Training Efficiency and Empirical Results

Extensive experiments across scales (124M, 350M, 1.1B, 5.4B parameters) establish NorMuon’s empirical advantages. For 1.1B parameter pretraining on SlimPajama, NorMuon achieves a 21.74% reduction in training steps to reach the target validation loss compared to AdamW, and an 11.31% reduction over Muon. Memory usage closely tracks Muon’s: NorMuon adds only a vector of neuron statistics per matrix, whereas Adam requires two full tensors (first- and second-order moments) for each parameter. This yields approximately 50% lower memory overhead relative to Adam. Computationally, orthogonalization and normalization raise the iteration latency by approximately 3% versus AdamW, a modest increase offset by faster convergence.
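
To make the memory comparison concrete, the following back-of-the-envelope count of optimizer-state elements for a single hypothetical 4096 x 4096 weight matrix reproduces the roughly 50% figure; the shape is an arbitrary illustrative choice.

```python
# Optimizer-state element counts for one hypothetical 4096 x 4096 weight matrix.
rows, cols = 4096, 4096
n_params = rows * cols

adam_state = 2 * n_params          # first- and second-moment tensors (AdamW)
muon_state = n_params              # momentum matrix only (Muon)
normuon_state = n_params + rows    # momentum matrix + one per-neuron statistic (NorMuon)

print(f"AdamW   : {adam_state:>12,} state elements")
print(f"Muon    : {muon_state:>12,} state elements")
print(f"NorMuon : {normuon_state:>12,} state elements "
      f"(~{100 * normuon_state / adam_state:.2f}% of AdamW's)")
```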

Wall-clock analyses confirm these findings: For the 5.4B model, NorMuon’s per-step time increases only marginally (2.9–3%) over Adam, supporting the claim of practical scalability. Ablation studies on smaller models (e.g., 350M parameters) demonstrate that universal application of neuron-wise normalization yields best results, outperforming selective or coordinate-wise normalization schemes.

3. Theoretical Justification: Conditioning and Balance

Muon’s orthogonalization reduces the condition number of the update direction, mitigating optimization stalling caused by ill-conditioned gradients. However, because the approximation (a fixed Newton–Schulz iteration count) can leave the matrix with rows of disparate norms, Muon may bias updates toward a subset of neurons. NorMuon’s per-row normalization resolves this by preventing individual neurons from dominating the update, yielding a consistent per-neuron update scale while the global rescaling preserves the overall update magnitude. This complementary arrangement, conditioning via an approximate polar factor and balanced adaptation via neuron-wise normalization, addresses two distinct axes of optimizer geometry: global update direction and fine-grained update magnitude.
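
As an illustration of this point, the short diagnostic below measures how much the per-row norms of an approximately orthogonalized matrix still vary when the Newton–Schulz iteration is truncated, and how row-wise scaling flattens that spread. The random momentum matrix and the use of instantaneous row norms (rather than the EMA statistic $v_t$) are simplifications for illustration only.

```python
import torch


def ns_orthogonalize(M, steps=5):
    # Cubic Newton-Schulz iteration toward the orthogonal polar factor
    # (same simplification as the Section 1 sketch).
    X = M / (M.norm() + 1e-7)
    for _ in range(steps):
        X = 1.5 * X - 0.5 * X @ X.T @ X
    return X


torch.manual_seed(0)
M = torch.randn(1024, 4096)              # stand-in momentum matrix
O = ns_orthogonalize(M)

row_norms = O.norm(dim=1)                # per-neuron update magnitudes
print(f"row-norm spread (std/mean) before row scaling: "
      f"{(row_norms.std() / row_norms.mean()).item():.3f}")

# Row-wise scaling drives every neuron's update toward the same magnitude.
O_hat = O / (row_norms.unsqueeze(1) + 1e-8)
post = O_hat.norm(dim=1)
print(f"row-norm spread (std/mean) after row scaling:  "
      f"{(post.std() / post.mean()).item():.3f}")
```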

4. Distributed Implementation under FSDP2

NorMuon’s scalable implementation under the FSDP2 sharding paradigm is central for LLM workloads. The optimizer sorts parameter matrices by size and assigns orthogonalization computation to devices in a round-robin fashion. Each assigned device performs an all-gather to assemble its matrix, computes the polar decomposition (NS5), and scatters the result. Row-wise sharding (i.e., partitioning across rows/neuron index) ensures that neuron-wise normalization (the computation of $v_t$ and scaling of $O_t$) is performed locally with no additional communication, maximizing parallel efficiency.
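
The round-robin assignment can be sketched independently of the actual collective communication. The helper below is a hypothetical illustration, not code from the reference implementation: it sorts 2-D parameters by element count and deals them out across ranks; the all-gather / NS5 / scatter steps that each rank would then perform on its assigned matrices are omitted.

```python
from typing import Dict, List, Tuple


def assign_orthogonalization(shapes: Dict[str, Tuple[int, int]],
                             world_size: int) -> Dict[int, List[str]]:
    """Assign each matrix parameter to a rank, largest first, round-robin,
    so the Newton-Schulz work is roughly balanced across devices."""
    ordered = sorted(shapes, key=lambda n: shapes[n][0] * shapes[n][1], reverse=True)
    plan: Dict[int, List[str]] = {rank: [] for rank in range(world_size)}
    for i, name in enumerate(ordered):
        plan[i % world_size].append(name)
    return plan


if __name__ == "__main__":
    # Illustrative transformer-block shapes (not taken from the paper).
    shapes = {
        "attn.wq": (4096, 4096), "attn.wk": (4096, 4096),
        "attn.wv": (4096, 4096), "attn.wo": (4096, 4096),
        "mlp.w_up": (11008, 4096), "mlp.w_down": (4096, 11008),
    }
    for rank, names in assign_orthogonalization(shapes, world_size=4).items():
        print(f"rank {rank}: {names}")
```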

Communication volume rises by roughly 33–50% relative to AdamW under standard FSDP, but this traffic is effectively overlapped with other computation, resulting in only minor wall-clock impact. The increased communication is justified by improved training efficiency, and the memory footprint remains competitive because storing per-row statistics adds minimal overhead.

5. Comparative Analysis with Adam and Muon

NorMuon’s design is motivated by both theoretical and empirical deficiencies observed in Adam and Muon. Adam’s element-wise updates (steepest descent under the $\ell_\infty$ norm) do not respect matrix structure and require storing two full moment tensors per parameter, incurring memory expense. Muon, via spectral-norm steepest descent, better aligns updates with matrix geometry but still allows certain neurons to receive disproportionately large updates. NorMuon’s hybrid approach mitigates both issues: its memory requirements rival Muon’s, while its convergence speed and stability are markedly higher.

| Optimizer | Conditioning | Update Balance | Memory Overhead           | Training Steps (1.1B) |
|-----------|--------------|----------------|---------------------------|-----------------------|
| AdamW     | No           | Yes            | High                      | Baseline              |
| Muon      | Yes          | No             | Moderate                  | Baseline − 10%        |
| NorMuon   | Yes          | Yes            | Moderate + per-row vector | Baseline − 21.74%     |

All data trace directly to experimental metrics and optimization mechanics as presented in (Li et al., 7 Oct 2025).

6. Practical Implications and Deployment Considerations

NorMuon provides several practical advantages for modern LLM training pipelines:

  • Convergence Speed: Substantial reduction in training iterations accelerates experimentation and deployment.
  • Memory Efficiency: Minimal additional storage over Muon makes NorMuon suitable for multi-billion parameter models.
  • Scalability: The FSDP2 implementation efficiently distributes both orthogonalization and normalization, maintaining feasible wall-clock times.
  • Robustness: Balanced neuron update magnitude reduces the risk of certain neurons overfitting or failing to generalize.
  • Algorithmic Tuning: Two momentum coefficients ($\beta_1$, $\beta_2$) and the learning-rate rescaling parameters require calibration per model and dataset, but the optimizer performs robustly across tested scales (a usage sketch follows this list).
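
As a usage sketch only, the toy loop below shows where these knobs enter when driving the `normuon_step` function from the Section 1 sketch; the objective, shapes, and hyperparameter values are placeholders, not tuned settings from the paper.

```python
import torch

torch.manual_seed(0)
W = torch.randn(512, 512, requires_grad=True)   # toy weight matrix
x, y = torch.randn(512, 64), torch.randn(512, 64)

M = torch.zeros_like(W)        # first-order momentum state
v = torch.zeros(W.shape[0])    # per-neuron (row) second-moment state

for step in range(100):
    loss = (W @ x - y).pow(2).mean()            # placeholder objective
    (grad,) = torch.autograd.grad(loss, W)
    with torch.no_grad():
        normuon_step(W, M, v, grad, lr=0.02, beta1=0.95, beta2=0.95,
                     weight_decay=0.01)         # beta1/beta2/lr are the tuning knobs
```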

Additional communication demands represent a moderate challenge, but can be managed by careful pipeline engineering as demonstrated in the reference implementations.

7. Significance and Prospects for Optimizer Design

NorMuon exemplifies a class of optimizers that combine orthogonal update alignment (conditioning) with per-neuron normalization (balance), showing that these mechanisms are complementary rather than competitive. Empirical and theoretical analyses demonstrate that such dual-adaptation approaches outperform methods focused solely on either directionality or magnitude. This opens avenues for further research into optimizers that exploit additional structural properties of neural network parameters, such as layer-wise or blockwise regularization and more general forms of adaptive normalization. The results suggest that leveraging both spectral geometry and neuron-specific adaptation is essential for scalable, efficient network optimization in deep learning.

A plausible implication is that NorMuon’s design principles may generalize beyond LLMs, informing optimizer development for architectures with rich parameter structure (e.g., vision transformers, mixtures of experts, or higher-order tensor modules). The synergy between orthogonalization and adaptive per-unit scaling may set a new template for efficient, robust optimizers in large-scale deep learning.

References

  • Li et al., 7 Oct 2025.
