Diagonal NAMO: Adaptive Orthogonal Optimization

Updated 21 February 2026

Diagonal NAMO is a diagonal adaptation of the NAMO algorithm that integrates orthogonal momentum with column-wise adaptive scaling for enhanced optimization.
It employs SVD-based orthogonalization and clamped diagonal updates to balance local signal-to-noise adaptation and numerical stability.
Empirical benchmarks demonstrate superior convergence and performance improvements over traditional optimizers in large-scale pretraining tasks.

Diagonal NAMO (NAMO-D) denotes a diagonal adaptation of the NAMO algorithm that integrates orthogonalized momentum with neuron-wise (column-wise) adaptive scaling for matrix parameter updates in stochastic optimization, most notably applied in large scale neural network (e.g., Transformer) training. The design allows fine-grained adaptation to the stochastic noise structure while maintaining well-conditioned and nearly orthogonal update directions, yielding empirically and theoretically robust convergence in modern deep learning settings (Zhang et al., 19 Feb 2026).

1. Algorithmic Structure and Update Rule

Consider matrix-parameterized models with $\Theta \in \mathbb{R}^{m \times n}$ and loss $\mathcal{L}(\Theta)$ . The NAMO-D step maintains both first- and second-order moment statistics:

First-moment momentum matrix: $M_t = \mu_1 M_{t-1} + (1-\mu_1)G_t$ , where $G_t$ is the current minibatch gradient.
Second moment vector (column-wise): $v_{t,j} = \mu_2 v_{t-1,j} + (1-\mu_2)\|G_{t,:,j}\|_2^2$ for $j=1,\ldots,n$ .

After standard bias correction, orthogonalization is achieved via SVD decomposition $M_t=U\Sigma V^\top$ and $O_t=UV^\top$ . The NAMO-D update applies this orthogonal factor, right-multiplied by a clamped diagonal matrix $D_t$ :

$\Theta_t = \Theta_{t-1} - \eta O_t D_t$

Here, $D_t = \mathrm{diag}(\tilde d_{t,1}, \ldots, \tilde d_{t,n})$ , where each $\tilde d_{t,j}$ adapts to the local signal-to-noise ratio of column $j$ but is clamped toward the average for numerical stability.

2. Construction of the Diagonal Scaling Matrix with Clamping

The unclamped scaling per column is

$s_{t,j} = \frac{\|\hat M_{t,:,j}\|_2}{\sqrt{\hat v_{t,j} + \epsilon}},$

where $\epsilon > 0$ is a regularization term. Set $d_t = (s_{t,1}, \ldots, s_{t,n})^{\top}$ and the mean $\bar d_t = \frac{1}{n}\sum_{j=1}^n s_{t,j}$ . To avoid excessive update anisotropy (which would undermine the orthogonal update design), the clamped entries are

$\tilde d_{t,j} = \min\left\{\max\{s_{t,j}, c\bar d_t\}, \bar d_t / c\right\}$

for a hyperparameter $c \in (0,1]$ . This ensures $\kappa(D_t) \le 1/c^2$ , interpolating between full neuron-wise adaptivity $(c \to 0)$ and strict uniform scaling $(c=1)$ .

3. Convergence Properties

Assuming standard smoothness and bounded gradient noise, the convergence rates are:

Deterministic regime: With $\eta = O(T^{-1/2})$ , $\epsilon = O(T^{-1/2} n^{-1})$ , fixed $c$ , after $T$ steps,

$\frac{1}{T} \sum_{t=1}^T \|\nabla \mathcal{L}(\Theta_{t-1})\|_F = O(T^{-1/2})$

Stochastic regime: With minibatch size $b$ , $\eta = O(T^{-3/4})$ , $1-\mu_1 = 1-\mu_2 = O(T^{-1/2})$ , $\epsilon=O(T^{-1/2})$ ,

$\frac{1}{T} \sum_{t=1}^T \mathbb{E}[\|\nabla \mathcal{L}(\Theta_{t-1})\|_F] = O\left(T^{-1/4} + \sigma b^{-1/4} T^{-1/8}\right)$

Optimal $O(T^{-1/4})$ stochastic rates are retained for sufficiently large $b$ (Zhang et al., 19 Feb 2026).

4. Computational Complexity and Practical Considerations

Memory and computation cost for NAMO-D are summarized as follows: | Optimizer | Extra Memory | Key Additional Cost | |------------|-----------------|-------------------------| | AdamW | $O(d)$ | $O(d)$ | | Muon | $O(d)$ | $O(mnk)$ (e.g. SVD, Newton–Schulz) | | NAMO | $O(d)$ | $O(d)$ (minor inner prod)| | NAMO-D | $O(d) + O(n)$ | $O(mnk)$ (+ $O(n)$ for $D_t$ ) |

The storage overhead for $D_t$ in NAMO-D is negligible for typical network layers ( $n$ columns), and the primary computational burden remains the orthogonalization step.

5. Hyperparameter Selection and Adaptation Behavior

The clamping parameter $c$ dictates the interpolation between global and neuron-wise adaptation. Heuristically, $c \sim 0.1$ is favored for small models (124M parameters) and $c \sim 0.9$ for medium ones (355M), with $c \in [0.1,0.5]$ typically providing the best balance of adaptivity and conditioning. This tuning strategy gives robust empirical results across scales.

6. Empirical Benchmarks in Large-Scale Pretraining

NAMO-D is evaluated in GPT-2 pretraining on OpenWebText:

For 124M parameter models (50K steps), NAMO-D achieves a validation loss of $\approx 3.025$ nats/token, outperforming AdamW (3.064), Muon (3.044), and scalar NAMO (3.035).
For 355M parameter models (10K steps), NAMO-D achieves 2.951 versus 2.991 (AdamW), 2.968 (Muon), and 2.952 (NAMO). Consistent improvements of 0.01–0.02 nats/token over scalar NAMO and 0.1–0.15 nats/token over AdamW/Muon are reported. Learning rate and clamping hyperparameters are selected via small grid sweeps, as detailed in (Zhang et al., 19 Feb 2026).

7. Theoretical and Practical Implications

NAMO-D achieves optimal deterministic and adaptive stochastic convergence rates without large additional storage or computational burden. The algorithm is justified theoretically under standard optimization assumptions and demonstrates superior empirical performance in LLM pretraining. The architectural design—combining SVD-based orthogonalization with clamped, per-column adaptation—aligns with observed near block-diagonal structure in deep network Hessians. A plausible implication is that NAMO-D provides a principled interpolation between strict orthogonal updates and fully-adaptive, coordinate-wise noise normalization, enhancing both conditioning and adaptation capacity in high-dimensional settings (Zhang et al., 19 Feb 2026).

Markdown Upgrade to Chat

References (1)

Adam Improves Muon: Adaptive Moment Estimation with Orthogonalized Momentum (2026)

Topic to Video (Beta)

No one has generated a video about this topic yet.

Whiteboard

No one has generated a whiteboard explanation for this topic yet.

Follow Topic

Get notified by email when new papers are published related to Diagonal NAMO.