NAMO: Norm-Based Adaptive Momentum Optimizer

Updated 21 February 2026

NAMO is a norm-based adaptive optimizer that unifies orthogonalized momentum updates with norm-based scaling, ensuring stable and efficient training in high-dimensional settings.
Its variants, including NAMO-D and subset-norm approaches, enhance memory efficiency and convergence speed, making them well-suited for tasks like image classification and language model pretraining.
Empirical studies show that NAMO outperforms traditional optimizers such as Adam and Muon by achieving faster convergence and improved test accuracy on both moderate and large-scale datasets.

Norm-Based Adaptive Moment Estimation with Orthogonalized Momentum (NAMO) is a class of first-order stochastic optimization algorithms for training high-dimensional models, especially those with matrix-structured parameters such as deep neural networks. NAMO combines orthogonalized momentum updates, originally featured in the Muon optimizer, with norm-based adaptive scaling mechanisms akin to those in AdaGrad and Adam. This unification preserves the implicit regularization and stability benefits of orthogonalized steps while equipping the update magnitude with adaptive control suited for non-stationary, noisy optimization landscapes. Several variants exist, notably the scalar-scaled NAMO, its diagonal extension NAMO-D, and generalized forms for large-scale or memory-efficient training. NAMO and related algorithms provide rigorous convergence guarantees and have demonstrated empirical superiority over classic optimizers in training deep architectures including GPT-2 and LLaMA (Zhang et al., 3 Sep 2025, Nguyen et al., 2024, Zhang et al., 19 Feb 2026).

1. Algorithmic Foundations

NAMO’s core mechanism combines two distinct components:

Orthogonalized momentum: The optimizer maintains an exponential moving average of gradients $M_t$ and orthogonalizes it via a matrix factorization (typically the polar decomposition, computed by SVD) to obtain an update direction $O_t$ with $O_t O_t^\top = I$ (this holds for full-rank $M_t$ with $m \le n$; for $m > n$ the analogous identity is $O_t^\top O_t = I$). This direction realizes steepest descent under the spectral norm, providing robust updates for ill-conditioned objectives.
Norm-based adaptive scaling: The step size is modulated adaptively. Original formulations use either AdaGrad-style accumulators (scalar or subset-based) or second-moment estimates as in Adam, so that the step size contracts in flat or noisy regimes and increases in sharp, informative landscapes.
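
The orthogonalization step can be illustrated with a minimal NumPy sketch. It assumes the SVD route (the polar factor $UV^\top$ of $M_t$); the helper name `orth` and the random test matrix are illustrative choices, not taken from the cited papers.

```python
import numpy as np

def orth(M: np.ndarray) -> np.ndarray:
    """Return the orthogonal polar factor U V^T of M, computed from its thin SVD."""
    U, _, Vt = np.linalg.svd(M, full_matrices=False)
    return U @ Vt

rng = np.random.default_rng(0)
M = rng.standard_normal((3, 5))      # wide matrix (m <= n), full row rank almost surely
O = orth(M)

row_gram = O @ O.T                   # close to the 3x3 identity when m <= n
inner = float(np.sum(O * M))         # trace inner product <O, M>
nuclear = float(np.linalg.svd(M, compute_uv=False).sum())  # sum of singular values
```

Here `row_gram` is numerically the identity, and `inner` matches `nuclear`: the inner product of $O_t$ with $M_t$ equals the nuclear norm of $M_t$, which is the largest value any matrix of unit spectral norm can reach.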

For parameter matrices $\Theta_t\in\mathbb{R}^{m\times n}$, the prototypical NAMO update sequence is:

$\begin{aligned} g_t &= \nabla\mathcal{L}_t(\Theta_{t-1}), \\ M_t &= \mu_1 M_{t-1} + (1-\mu_1)g_t, \\ v_t &= \mu_2 v_{t-1} + (1-\mu_2)\|g_t\|_F^2, \\ O_t &= \text{Orth}(M_t), \\ \alpha_t &= \|\hat M_t\|_F/(\sqrt{\hat v_t} + \epsilon), \\ \Theta_t &= \Theta_{t-1} - \eta\alpha_t O_t. \end{aligned}$

Here, $\hat M_t$ and $\hat v_t$ denote the bias-corrected moving averages, and $\epsilon > 0$ prevents division by zero. The diagonal extension (NAMO-D) further enables neuron-wise adaptation via per-column scaling matrices, with clamping to maintain well-conditioned updates (Zhang et al., 19 Feb 2026).
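
The update sequence above can be sketched in NumPy as follows. This is an illustration under stated assumptions, not the authors' implementation: bias correction is taken to be Adam-style (division by $1-\mu^t$), orthogonalization uses an SVD, and the function name `namo_step`, the default learning rate, and the toy quadratic objective are all hypothetical.

```python
import numpy as np

def orth(M):
    """Polar factor U V^T via thin SVD (one possible choice for Orth)."""
    U, _, Vt = np.linalg.svd(M, full_matrices=False)
    return U @ Vt

def namo_step(theta, grad, M, v, t, lr=0.02, mu1=0.95, mu2=0.99, eps=1e-6):
    """One scalar-NAMO update; t is the 1-based step counter."""
    M = mu1 * M + (1.0 - mu1) * grad                        # first-moment matrix
    v = mu2 * v + (1.0 - mu2) * float(np.sum(grad * grad))  # EMA of ||g||_F^2
    M_hat = M / (1.0 - mu1 ** t)                            # assumed Adam-style bias correction
    v_hat = v / (1.0 - mu2 ** t)
    alpha = float(np.linalg.norm(M_hat) / (np.sqrt(v_hat) + eps))
    theta = theta - lr * alpha * orth(M)
    return theta, M, v, alpha

# Toy problem (assumption): minimize 0.5 * ||theta - A||_F^2, whose gradient is theta - A.
rng = np.random.default_rng(0)
A = rng.standard_normal((4, 3))
theta = np.zeros_like(A)
M, v = np.zeros_like(A), 0.0
loss_initial = 0.5 * float(np.sum((theta - A) ** 2))
alphas = []
for t in range(1, 201):
    theta, M, v, alpha = namo_step(theta, theta - A, M, v, t)
    alphas.append(alpha)
loss_final = 0.5 * float(np.sum((theta - A) ** 2))
```

At the first step the bias-corrected quantities satisfy $\|\hat M_1\|_F = \sqrt{\hat v_1} = \|g_1\|_F$, so $\alpha_1$ is essentially 1 and the step reduces to a plain orthogonalized update of size $\eta$.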

2. Architectural Variants and Memory-Efficient Extensions

Several NAMO variants address different scaling and memory constraints:

Scalar NAMO: Uses a global norm-based step size; requires only a single additional scalar beyond the Muon baseline (Zhang et al., 3 Sep 2025, Zhang et al., 19 Feb 2026).
NAMO-D: Augments scalar scaling with a right multiplication by a diagonal matrix $D_t$, whose entries are adaptively clipped per neuron or unit. This improves alignment with block-diagonal Hessian structures and enhances stability in highly overparameterized or structured tasks (Zhang et al., 19 Feb 2026).
Subset-Norm and Subspace-Momentum NAMO: Parameters are partitioned into $c$ subsets, each with its dedicated accumulator, or momentum is restricted to a low-dimensional subspace. This reduces memory from $O(d)$ to $O(\sqrt{d} + k)$, essential for billion-parameter models (Nguyen et al., 2024).
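
The subset-norm idea can be sketched as an AdaGrad-style step that keeps one scalar accumulator per subset rather than one per coordinate. The sketch assumes contiguous, equal-size subsets and omits orthogonalization and subspace momentum; the function name `subset_norm_step` and the default values are hypothetical.

```python
import numpy as np

def subset_norm_step(x, grad, acc, lr=0.1, eps=1e-8):
    """AdaGrad-style step with one accumulator per subset of coordinates."""
    c = acc.shape[0]
    g = grad.reshape(c, -1)                   # row i holds the gradient of subset i
    acc = acc + np.sum(g * g, axis=1)         # accumulate squared subset norms
    scaled = g / (np.sqrt(acc)[:, None] + eps)
    return x - lr * scaled.reshape(x.shape), acc

d, c = 16, 4                                  # c ~ sqrt(d) subsets of equal size
x = np.ones(d)
acc = np.zeros(c)                             # optimizer state: c scalars instead of d
x_new, acc = subset_norm_step(x, grad=2.0 * x, acc=acc)
```

With $d = 16$ coordinates and $c = 4$ subsets, the adaptive state shrinks from 16 numbers to 4, which is the $O(\sqrt{d})$ part of the memory claim.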

The following table summarizes state complexity for different variants:

Variant	State Memory Complexity	Step Size Adaptivity
Muon	$O(mn)$	Fixed
Scalar NAMO	$O(mn) + 1$	Norm-based scalar
NAMO-D	$O(mn) + n$	Per-column adaptive, clamped
SN/SM NAMO	$O(\sqrt{d} + k)$	Subset-wise/adaptive
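
To make the table concrete, the following sketch evaluates its entries for one layer shape. It drops all constants hidden by the big-O notation, so the numbers only indicate orders of magnitude; the function name `state_entries` and the example sizes are hypothetical.

```python
import math

def state_entries(m: int, n: int, k: int = 0) -> dict:
    """Count optimizer-state entries per variant, with big-O constants set to 1."""
    d = m * n
    return {
        "Muon": d,                        # momentum matrix only
        "Scalar NAMO": d + 1,             # plus one scalar second-moment estimate
        "NAMO-D": d + n,                  # plus one value per column
        "SN/SM NAMO": math.isqrt(d) + k,  # order sqrt(d) + k, constants omitted
    }

counts = state_entries(m=1024, n=4096, k=256)
```

For a 1024 by 4096 layer, the extra state of scalar NAMO and NAMO-D over Muon is 1 and 4096 entries respectively, both small next to the roughly 4.2 million entries of the momentum matrix.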

3. Theoretical Guarantees

NAMO achieves optimal convergence rates under both deterministic and stochastic settings, contingent on Lipschitz-spectral smoothness and bounded-variance noise:

Deterministic (full-batch):

$\frac{1}{T} \sum_{t=1}^T \|\nabla L(\Theta_{t-1})\|_F = O(T^{-1/2})$

Stochastic (mini-batch):

$\frac{1}{T} \sum_{t=1}^T \mathbb{E}\|\nabla L(\Theta_{t-1})\|_F = O(T^{-1/4} + \sigma b^{-1/4} T^{-1/8})$

The rates interpolate to $O(T^{-1/2})$ with increasing batch size or vanishing noise. These results match established lower bounds for first-order methods in both convex and nonconvex regimes (Zhang et al., 3 Sep 2025, Nguyen et al., 2024, Zhang et al., 19 Feb 2026).

In the case of memory-efficient NAMO combining subset-norm with subspace-momentum, high-probability guarantees hold for nonconvex objectives under sub-Gaussian noise, with only slightly inflated constants relative to vanilla AdaGrad or SGD (Nguyen et al., 2024).

4. Empirical Performance and Practical Implementation

Experimental results highlight NAMO’s empirical advantages:

Small- to medium-scale tasks: On function regression and CIFAR-10, NAMO consistently outperforms Adam and Muon on both convergence speed and final test error. In particular, it achieves 2–3% higher test accuracy on CIFAR-10 than Muon/Adam and shows faster test loss decay in synthetic regression (Zhang et al., 3 Sep 2025).
Large-scale LLMs: In pretraining GPT-2 (124M and 355M) on OpenWebText, both NAMO and its diagonal variant achieve lower training and validation loss than AdamW and Muon, with NAMO-D benefiting from per-neuron scaling (final 124M validation losses: AdamW 3.0643, NAMO 3.0351, NAMO-D 3.0246) (Zhang et al., 19 Feb 2026). On LLaMA 1B, memory-optimized SN/SM NAMO attains Adam-level perplexity in half as many tokens while requiring 80% less optimizer state memory (0.84 GB vs. 5 GB for Adam) (Nguyen et al., 2024).

NAMO has negligible memory or compute overhead compared to baseline Muon and significantly reduced cost compared to Adam in large models. Orthogonalization of the momentum matrix is typically handled via Newton–Schulz iterations or SVD; for SN/SM NAMO, the subspace dimension $k$ and the number of subsets $c$ are tunable for further trade-offs.
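
A Newton–Schulz orthogonalization can be sketched as below. This uses the classical cubic iteration $X \leftarrow 1.5X - 0.5XX^\top X$, which is an assumption made for clarity; practical Muon-style implementations typically use a tuned higher-order polynomial and far fewer iterations. The function name and step count are hypothetical.

```python
import numpy as np

def newton_schulz_orth(M, steps=40):
    """Approximate the polar factor of M with the classical cubic Newton-Schulz iteration."""
    X = M / (np.linalg.norm(M) + 1e-12)    # Frobenius scaling puts singular values in (0, 1]
    for _ in range(steps):
        X = 1.5 * X - 0.5 * (X @ X.T) @ X  # pushes every singular value toward 1
    return X

rng = np.random.default_rng(0)
M = rng.standard_normal((4, 6))
U, _, Vt = np.linalg.svd(M, full_matrices=False)
max_gap = float(np.max(np.abs(newton_schulz_orth(M) - U @ Vt)))  # gap to the SVD result
```

The iteration needs only matrix products, which is why it is attractive on accelerators where a full SVD is comparatively slow.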

5. Orthogonality, Regularization, and Algorithmic Significance

NAMO’s orthogonal update directions rest on interpreting the step as steepest descent under the spectral norm. Up to sign and a positive scalar factor, $O_t$ solves the regularized linear problem

$-\|M_t\|_* \, O_t \in \arg\min_{Z} \langle Z, M_t \rangle + \tfrac{1}{2} \|Z\|_2^2,$

where $\langle\cdot,\cdot\rangle$ is the matrix inner product, $\|\cdot\|_2$ is the spectral norm, and $\|\cdot\|_*$ is the nuclear norm. This spectral regularization stabilizes training under highly ill-conditioned or degenerate settings and mitigates pathological behaviors of classic adaptive optimizers. In particular, NAMO avoids the sign-bias and step-size explosion/collapse issues known in Adam and RMSProp, ensuring null-gradient consistency (vanishing update magnitude at stationary points) (Zhang et al., 3 Sep 2025, Zhang et al., 19 Feb 2026).
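
The vanishing-update property can be illustrated with the scalar factor $\alpha_t$ from Section 1. The sketch below assumes Adam-style bias correction and a single unit-norm gradient followed by zero gradients, so that $\|M_t\|_F$ can be tracked as a scalar; the function name is hypothetical.

```python
import math

def alpha_after_zero_grads(k, mu1=0.95, mu2=0.99, eps=1e-6):
    """Scale factor alpha after one unit-norm gradient followed by k zero gradients."""
    m_norm, v = 0.0, 0.0
    for t in range(1, k + 2):
        g_norm = 1.0 if t == 1 else 0.0
        m_norm = mu1 * m_norm + (1 - mu1) * g_norm  # ||M_t||_F: only one nonzero gradient
        v = mu2 * v + (1 - mu2) * g_norm ** 2       # EMA of squared gradient norms
    m_hat = m_norm / (1 - mu1 ** t)                 # assumed Adam-style bias correction
    v_hat = v / (1 - mu2 ** t)
    return m_hat / (math.sqrt(v_hat) + eps)

alphas = [alpha_after_zero_grads(k) for k in (0, 10, 100)]
```

Once gradients vanish, $\|M_t\|_F$ decays like $\mu_1^k$ while $\sqrt{v_t}$ decays like $\mu_2^{k/2}$, so $\alpha_t$ shrinks geometrically whenever $\mu_1 < \sqrt{\mu_2}$. A fixed-step orthogonalized update would instead keep a constant magnitude as long as $M_t$ is nonzero.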

The diagonal extension (NAMO-D) aligns with the empirically observed block-diagonal structure of the Hessian, enabling fine-grained adaptation, while the clamping parameter $c$ keeps the condition number of the scaling from inflating excessively.

6. Hyperparameter Selection and Practical Guidance

Typical hyperparameters mirror those of Adam and Muon, with momentum factors $\mu_1\approx0.95$, $\mu_2\approx0.99$, learning rate $\eta$ selected by short grid search, and stepsize floor $\epsilon=10^{-6}$. For NAMO-D, the clamping parameter $c\in[0.1,1.0]$ trades adaptivity for conditioning. Subset-norm NAMO benefits from grouping parameters by rows or columns so that the number of subsets (also written $c$ in that context, and unrelated to the clamping parameter) satisfies $c\sim\sqrt{d}$ for optimal memory savings.
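
These settings could be collected in a small configuration object, sketched below. The class name `NamoConfig`, its field names, and the learning-rate grid values are hypothetical; only the defaults for $\mu_1$, $\mu_2$, $\epsilon$ and the clamping range come from the text.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class NamoConfig:
    """Hypothetical container for the hyperparameters named in the text."""
    lr: float                        # eta: no default, chosen by a short grid search
    mu1: float = 0.95                # first-moment momentum factor
    mu2: float = 0.99                # second-moment momentum factor
    eps: float = 1e-6                # stepsize floor
    clamp_c: Optional[float] = None  # NAMO-D only; None means scalar NAMO

    def __post_init__(self) -> None:
        if self.lr <= 0.0:
            raise ValueError("lr must be positive")
        for name in ("mu1", "mu2"):
            if not 0.0 <= getattr(self, name) < 1.0:
                raise ValueError(f"{name} must lie in [0, 1)")
        if self.clamp_c is not None and not 0.1 <= self.clamp_c <= 1.0:
            raise ValueError("clamp_c must lie in [0.1, 1.0]")

# Placeholder grid (assumption): candidate learning rates for a short search.
candidates = [NamoConfig(lr=lr) for lr in (1e-3, 3e-3, 1e-2)]
```

Leaving `lr` without a default makes the grid search explicit, and rejecting out-of-range `clamp_c` values keeps a NAMO-D run inside the range the text recommends.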

Implementation requires minimal change from Muon or Adam pipelines: a single line replaces the fixed scalar step size with the adaptive norm-based scale $\alpha_t$, and for subset/subspace versions, standard linear algebra operations suffice (Zhang et al., 3 Sep 2025, Nguyen et al., 2024, Zhang et al., 19 Feb 2026).

7. Applications, Limitations, and Prospective Directions

NAMO and its variants are effective in training both small and large neural network models, with improved stability, faster convergence, and reduced memory usage relative to legacy optimizers. Key use cases include LLM pretraining (e.g., GPT-2, LLaMA) and scenarios where matrix-structured updates and memory constraints are critical. The mechanistic correspondence with the spectral structure of neural network Hessians and block-wise adaptivity aligns NAMO-D with model architectures exhibiting pronounced modularity or locality.

This suggests continued relevance of NAMO-like methods as model scales and architectural heterogeneity increase. A plausible implication is that further diagonal or block-diagonal extensions, possibly integrating dynamic structure-aware clamping, may enhance stability and sample efficiency in forthcoming large-scale training regimes.

References:

(Zhang et al., 3 Sep 2025) AdaGrad Meets Muon: Adaptive Stepsizes for Orthogonal Updates
(Zhang et al., 19 Feb 2026) Adam Improves Muon: Adaptive Moment Estimation with Orthogonalized Momentum
(Nguyen et al., 2024) Lean and Mean Adaptive Optimization via Subset-Norm and Subspace-Momentum with Convergence Guarantees