
NorMuon Optimizer Overview

Updated 8 October 2025
  • NorMuon is a matrix-structured deep learning optimizer that combines orthogonal updates with neuron-wise normalization to balance update magnitudes.
  • It demonstrates a 21.74% reduction in training steps relative to AdamW at the 1.1B-parameter scale and keeps memory close to Muon's by adding only per-neuron second-moment statistics.
  • Its scalable FSDP2 implementation and modest latency increase (~3%) make NorMuon robust for distributed training of large language models.

NorMuon is a matrix-structured deep learning optimizer designed to address optimization inefficiencies in large-scale neural network training, particularly for LLMs. It extends the Muon optimizer by coupling the conditioning improvement from orthogonalized updates with neuron-wise normalization, thereby providing balanced adaptive learning rates at the level of individual neurons. This hybrid mechanism achieves substantial gains in training efficiency and stability, with minimal additional memory overhead, and is engineered for practical distributed deployment via the FSDP2 parallelism framework.

1. Optimization Scheme and Update Mechanism

NorMuon builds upon the Muon optimizer’s polar decomposition-based orthogonalization of momentum matrices. Whereas the Muon update for a parameter matrix $W$ maintains a momentum $M_t$ and applies an orthogonalization operator such that $O_t = \mathrm{NS5}(M_t)$ (with $\mathrm{NS5}$ denoting five Newton–Schulz iterations for approximate polar factorization), NorMuon introduces neuron-wise normalization post-orthogonalization. The optimizer computes first-order momentum:

$$M_t = \beta_1 M_{t-1} + (1-\beta_1)\,\nabla L(W_{t-1})$$

After orthogonalization:

$$O_t = \mathrm{NS5}(M_t)$$

NorMuon then computes per-neuron (row-wise) second-order momentum statistics:

$$v_t = \beta_2 v_{t-1} + (1-\beta_2)\,\operatorname{mean}_{\text{cols}}(O_t \odot O_t)$$

where $O_t \odot O_t$ denotes element-wise squaring and $\operatorname{mean}_{\text{cols}}$ takes the mean across the columns of each row. The normalized update is formed as:

$$\hat{O}_t = \frac{O_t}{\sqrt{V_t + \varepsilon}}$$

where $V_t$ is the vector $v_t$ broadcast to the shape of $O_t$ for row-wise scaling, and $\varepsilon$ avoids division by zero. To keep the global RMS of the update consistent with Adam-style scaling, a global rescaling factor $\hat{\eta}$ is applied to $\hat{O}_t$. The final weight update is:

$$W_{t+1} = W_t - \eta \lambda W_t - \hat{\eta}\,\hat{O}_t$$

where $\eta$ is the learning rate and $\lambda$ the decoupled weight-decay coefficient.

This dual adaptation, orthogonalization followed by neuron-wise normalization, corrects Muon’s tendency toward high variance in update norms across neurons while preserving the better-conditioned descent directions that orthogonalization provides.
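
To make the sequence of operations concrete, the following is a minimal PyTorch sketch of one NorMuon step for a single matrix parameter, following the equations above. The cubic Newton–Schulz iteration (used here in place of Muon's tuned NS5 polynomial), the hyperparameter defaults, and the exact form of the RMS rescaling are simplifying assumptions for illustration, not the paper's reference implementation.

```python
import torch


def newton_schulz_orthogonalize(M: torch.Tensor, steps: int = 5) -> torch.Tensor:
    """Approximate the orthogonal polar factor of M.

    Uses the textbook cubic Newton-Schulz iteration X <- 1.5*X - 0.5*X @ X.T @ X;
    Muon's NS5 uses a tuned quintic polynomial, but the fixed point is the same.
    """
    X = M / (M.norm() + 1e-7)          # scale so every singular value is <= 1
    for _ in range(steps):
        X = 1.5 * X - 0.5 * X @ X.T @ X
    return X


def normuon_step(W, M, v, grad, lr=0.02, beta1=0.95, beta2=0.95,
                 weight_decay=0.01, eps=1e-8, ns_steps=5):
    """One NorMuon-style step for a matrix parameter W with state (M, v), in place.

    M: first-order momentum, same shape as W.
    v: per-neuron (row-wise) second-moment estimate, shape (W.shape[0],).
    Hyperparameter defaults are illustrative placeholders, not the paper's settings.
    """
    # 1) First-order momentum: M_t = beta1 * M_{t-1} + (1 - beta1) * grad.
    M.mul_(beta1).add_(grad, alpha=1 - beta1)

    # 2) Orthogonalize the momentum (Muon's conditioning step): O_t = NS(M_t).
    O = newton_schulz_orthogonalize(M, steps=ns_steps)

    # 3) Per-neuron second moment: v_t = beta2 * v_{t-1} + (1 - beta2) * mean_cols(O_t ** 2).
    v.mul_(beta2).add_((O * O).mean(dim=1), alpha=1 - beta2)

    # 4) Row-wise normalization, broadcasting v over the columns of O.
    O_hat = O / (v + eps).sqrt().unsqueeze(1)

    # 5) Global rescaling so the applied update has an Adam-like RMS
    #    (the exact rescaling constant from the paper is not reproduced here).
    eta_hat = lr / (O_hat.pow(2).mean().sqrt() + eps)

    # 6) Decoupled weight decay, then the normalized, rescaled update:
    #    W_{t+1} = W_t - lr * weight_decay * W_t - eta_hat * O_hat.
    W.mul_(1 - lr * weight_decay)
    W.sub_(eta_hat * O_hat)
    return W
```

As with Muon, this matrix-structured rule targets 2-D weight matrices; 1-D parameters (biases, normalization scales, embeddings) are typically handled by an Adam-style update instead.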

2. Training Efficiency and Empirical Results

Extensive experiments across scales (124M, 350M, 1.1B, 5.4B parameters) establish NorMuon’s empirical advantages. For 1.1B parameter pretraining on SlimPajama, NorMuon achieves a 21.74% reduction in training steps to reach the target validation loss compared to AdamW, and an 11.31% reduction over Muon. Memory usage closely tracks Muon’s: NorMuon adds only a vector of neuron statistics per matrix, whereas Adam requires two full tensors (first- and second-order moments) for each parameter. This yields approximately 50% lower memory overhead relative to Adam. Computationally, orthogonalization and normalization raise the iteration latency by approximately 3% versus AdamW, a modest increase offset by faster convergence.
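
To make the memory comparison concrete, the following back-of-the-envelope count of optimizer-state elements for a single hypothetical 4096 x 4096 weight matrix reproduces the roughly 50% figure; the shape is an arbitrary illustrative choice.

```python
# Optimizer-state element counts for one hypothetical 4096 x 4096 weight matrix.
rows, cols = 4096, 4096
n_params = rows * cols

adam_state = 2 * n_params          # first- and second-moment tensors (AdamW)
muon_state = n_params              # momentum matrix only (Muon)
normuon_state = n_params + rows    # momentum matrix + one per-neuron statistic (NorMuon)

print(f"AdamW   : {adam_state:>12,} state elements")
print(f"Muon    : {muon_state:>12,} state elements")
print(f"NorMuon : {normuon_state:>12,} state elements "
      f"(~{100 * normuon_state / adam_state:.2f}% of AdamW's)")
```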

Wall-clock analyses confirm these findings: For the 5.4B model, NorMuon’s per-step time increases only marginally (2.9–3%) over Adam, supporting the claim of practical scalability. Ablation studies on smaller models (e.g., 350M parameters) demonstrate that universal application of neuron-wise normalization yields best results, outperforming selective or coordinate-wise normalization schemes.

3. Theoretical Justification: Conditioning and Balance

Muon’s orthogonalization reduces the condition number of the update direction, mitigating optimization stalling caused by ill-conditioned gradients. However, because the approximation (a fixed Newton–Schulz iteration count) can leave the matrix with rows of disparate norms, Muon may bias updates toward a subset of neurons. NorMuon’s per-row normalization resolves this by preventing individual neurons from dominating the update, yielding a consistent per-neuron update scale while the global rescaling preserves the overall update magnitude. This complementary arrangement, conditioning via an approximate polar factor and balanced adaptation via neuron-wise normalization, addresses two distinct axes of optimizer geometry: global update direction and fine-grained update magnitude.
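
As an illustration of this point, the short diagnostic below measures how much the per-row norms of an approximately orthogonalized matrix still vary when the Newton–Schulz iteration is truncated, and how row-wise scaling flattens that spread. The random momentum matrix and the use of instantaneous row norms (rather than the EMA statistic $v_t$) are simplifications for illustration only.

```python
import torch


def ns_orthogonalize(M, steps=5):
    # Cubic Newton-Schulz iteration toward the orthogonal polar factor
    # (same simplification as the Section 1 sketch).
    X = M / (M.norm() + 1e-7)
    for _ in range(steps):
        X = 1.5 * X - 0.5 * X @ X.T @ X
    return X


torch.manual_seed(0)
M = torch.randn(1024, 4096)              # stand-in momentum matrix
O = ns_orthogonalize(M)

row_norms = O.norm(dim=1)                # per-neuron update magnitudes
print(f"row-norm spread (std/mean) before row scaling: "
      f"{(row_norms.std() / row_norms.mean()).item():.3f}")

# Row-wise scaling drives every neuron's update toward the same magnitude.
O_hat = O / (row_norms.unsqueeze(1) + 1e-8)
post = O_hat.norm(dim=1)
print(f"row-norm spread (std/mean) after row scaling:  "
      f"{(post.std() / post.mean()).item():.3f}")
```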

4. Distributed Implementation under FSDP2

NorMuon’s scalable implementation under the FSDP2 sharding paradigm is central for LLM workloads. The optimizer sorts parameter matrices by size and assigns orthogonalization computation to devices in a round-robin fashion. Each assigned device performs an all-gather to assemble its matrix, computes the polar decomposition (NS5), and scatters the result. Row-wise sharding (i.e., partitioning across rows/neuron index) ensures that neuron-wise normalization (the computation of $v_t$ and scaling of $O_t$) is performed locally with no additional communication, maximizing parallel efficiency.
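
The round-robin assignment can be sketched independently of the actual collective communication. The helper below is a hypothetical illustration, not code from the reference implementation: it sorts 2-D parameters by element count and deals them out across ranks; the all-gather / NS5 / scatter steps that each rank would then perform on its assigned matrices are omitted.

```python
from typing import Dict, List, Tuple


def assign_orthogonalization(shapes: Dict[str, Tuple[int, int]],
                             world_size: int) -> Dict[int, List[str]]:
    """Assign each matrix parameter to a rank, largest first, round-robin,
    so the Newton-Schulz work is roughly balanced across devices."""
    ordered = sorted(shapes, key=lambda n: shapes[n][0] * shapes[n][1], reverse=True)
    plan: Dict[int, List[str]] = {rank: [] for rank in range(world_size)}
    for i, name in enumerate(ordered):
        plan[i % world_size].append(name)
    return plan


if __name__ == "__main__":
    # Illustrative transformer-block shapes (not taken from the paper).
    shapes = {
        "attn.wq": (4096, 4096), "attn.wk": (4096, 4096),
        "attn.wv": (4096, 4096), "attn.wo": (4096, 4096),
        "mlp.w_up": (11008, 4096), "mlp.w_down": (4096, 11008),
    }
    for rank, names in assign_orthogonalization(shapes, world_size=4).items():
        print(f"rank {rank}: {names}")
```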

Communication volume rises by roughly 33–50% relative to AdamW under standard FSDP, but this traffic is effectively overlapped with other computation, resulting in only minor wall-clock impact. The increased communication is justified by improved training efficiency, and the memory footprint remains competitive because storing per-row statistics adds minimal overhead.

5. Comparative Analysis with Adam and Muon

NorMuon’s design is motivated by both theoretical and empirical deficiencies observed in Adam and Muon. Adam’s element-wise updates (steepest descent under the $\ell_\infty$ norm) do not respect matrix structure and require storing two full moment tensors per parameter, incurring memory expense. Muon, via spectral-norm steepest descent, better aligns updates with matrix geometry but still allows certain neurons to receive disproportionately large updates. NorMuon’s hybrid approach mitigates both issues: its memory requirements rival Muon’s, while its convergence speed and stability are markedly higher.

| Optimizer | Conditioning | Update Balance | Memory Overhead           | Training Steps (1.1B) |
|-----------|--------------|----------------|---------------------------|-----------------------|
| AdamW     | No           | Yes            | High                      | Baseline              |
| Muon      | Yes          | No             | Moderate                  | Baseline − 10%        |
| NorMuon   | Yes          | Yes            | Moderate + per-row vector | Baseline − 21.74%     |

All data trace directly to experimental metrics and optimization mechanics as presented in (Li et al., 7 Oct 2025).

6. Practical Implications and Deployment Considerations

NorMuon provides several practical advantages for modern LLM training pipelines:

  • Convergence Speed: Substantial reduction in training iterations accelerates experimentation and deployment.
  • Memory Efficiency: Minimal additional storage over Muon makes NorMuon suitable for multi-billion parameter models.
  • Scalability: The FSDP2 implementation efficiently distributes both orthogonalization and normalization, maintaining feasible wall-clock times.
  • Robustness: Balanced neuron update magnitude reduces the risk of certain neurons overfitting or failing to generalize.
  • Algorithmic Tuning: Two momentum coefficients ($\beta_1$, $\beta_2$) and the learning-rate rescaling parameters require calibration per model and dataset, but the optimizer performs robustly across tested scales (a usage sketch follows this list).
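
As a usage sketch only, the toy loop below shows where these knobs enter when driving the `normuon_step` function from the Section 1 sketch; the objective, shapes, and hyperparameter values are placeholders, not tuned settings from the paper.

```python
import torch

torch.manual_seed(0)
W = torch.randn(512, 512, requires_grad=True)   # toy weight matrix
x, y = torch.randn(512, 64), torch.randn(512, 64)

M = torch.zeros_like(W)        # first-order momentum state
v = torch.zeros(W.shape[0])    # per-neuron (row) second-moment state

for step in range(100):
    loss = (W @ x - y).pow(2).mean()            # placeholder objective
    (grad,) = torch.autograd.grad(loss, W)
    with torch.no_grad():
        normuon_step(W, M, v, grad, lr=0.02, beta1=0.95, beta2=0.95,
                     weight_decay=0.01)         # beta1/beta2/lr are the tuning knobs
```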

Additional communication demands represent a moderate challenge, but can be managed by careful pipeline engineering as demonstrated in the reference implementations.

7. Significance and Prospects for Optimizer Design

NorMuon exemplifies a class of optimizers that combine orthogonal update alignment (conditioning) with per-neuron normalization (balance), showing that these mechanisms are complementary rather than competitive. Empirical and theoretical analyses demonstrate that such dual-adaptation approaches outperform methods focused solely on either directionality or magnitude. This opens avenues for further research into optimizers that exploit additional structural properties of neural network parameters, such as layer-wise or blockwise regularization and more general forms of adaptive normalization. The results suggest that leveraging both spectral geometry and neuron-specific adaptation is essential for scalable, efficient network optimization in deep learning.

A plausible implication is that NorMuon’s design principles may generalize beyond LLMs, informing optimizer development for architectures with rich parameter structure (e.g., vision transformers, mixtures of experts, or higher-order tensor modules). The synergy between orthogonalization and adaptive per-unit scaling may set a new template for efficient, robust optimizers in large-scale deep learning.

References

  • Li et al., 7 Oct 2025.
