
Diagonal NAMO: Adaptive Orthogonal Optimization

Updated 21 February 2026
  • Diagonal NAMO is a diagonal adaptation of the NAMO algorithm that integrates orthogonal momentum with column-wise adaptive scaling for enhanced optimization.
  • It employs SVD-based orthogonalization and clamped diagonal updates to balance local signal-to-noise adaptation and numerical stability.
  • Empirical benchmarks demonstrate superior convergence and performance improvements over traditional optimizers in large-scale pretraining tasks.

Diagonal NAMO (NAMO-D) denotes a diagonal adaptation of the NAMO algorithm that integrates orthogonalized momentum with neuron-wise (column-wise) adaptive scaling for matrix parameter updates in stochastic optimization, most notably applied in large-scale neural network (e.g., Transformer) training. The design allows fine-grained adaptation to the stochastic noise structure while maintaining well-conditioned and nearly orthogonal update directions, yielding empirically and theoretically robust convergence in modern deep learning settings (Zhang et al., 19 Feb 2026).

1. Algorithmic Structure and Update Rule

Consider matrix-parameterized models with $\Theta \in \mathbb{R}^{m \times n}$ and loss $\mathcal{L}(\Theta)$. The NAMO-D step maintains both first- and second-order moment statistics:

  • First-moment momentum matrix: $M_t = \mu_1 M_{t-1} + (1-\mu_1)G_t$, where $G_t$ is the current minibatch gradient.
  • Second-moment vector (column-wise): $v_{t,j} = \mu_2 v_{t-1,j} + (1-\mu_2)\|G_{t,:,j}\|_2^2$ for $j = 1, \ldots, n$.

After standard bias correction (yielding $\hat M_t$ and $\hat v_t$), orthogonalization is achieved via the SVD $\hat M_t = U \Sigma V^\top$, giving $O_t = U V^\top$. The NAMO-D update applies this orthogonal factor, right-multiplied by a clamped diagonal matrix $D_t$:

$$\Theta_t = \Theta_{t-1} - \eta\, O_t D_t$$

Here, $D_t = \mathrm{diag}(\tilde d_{t,1}, \ldots, \tilde d_{t,n})$, where each $\tilde d_{t,j}$ adapts to the local signal-to-noise ratio of column $j$ but is clamped toward the column mean for numerical stability.
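Under these definitions, one NAMO-D step can be sketched as follows. This is a minimal NumPy sketch, not the paper's reference implementation; the hyperparameter defaults and the `state` dictionary layout are illustrative assumptions.

```python
import numpy as np

def namod_step(theta, grad, state, lr=0.02, mu1=0.95, mu2=0.95,
               eps=1e-8, c=0.3):
    """One NAMO-D update on an m x n matrix parameter (illustrative sketch)."""
    m_mat, v, t = state["M"], state["v"], state["t"] + 1

    # First-moment momentum matrix and column-wise second-moment vector.
    m_mat = mu1 * m_mat + (1.0 - mu1) * grad
    v = mu2 * v + (1.0 - mu2) * np.sum(grad**2, axis=0)

    # Standard bias correction.
    m_hat = m_mat / (1.0 - mu1**t)
    v_hat = v / (1.0 - mu2**t)

    # Orthogonalize the momentum via SVD: O_t = U V^T.
    u, _, vt = np.linalg.svd(m_hat, full_matrices=False)
    o = u @ vt

    # Per-column signal-to-noise scales, clamped toward their mean.
    s = np.linalg.norm(m_hat, axis=0) / np.sqrt(v_hat + eps)
    s_bar = s.mean()
    d = np.clip(s, c * s_bar, s_bar / c)

    # Right-multiplication by diag(d) is column-wise broadcasting.
    theta = theta - lr * o * d
    state.update(M=m_mat, v=v, t=t)
    return theta, state
```

Note that `o * d` scales each column of $O_t$ by its clamped factor, which is exactly the product $O_t D_t$.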

2. Construction of the Diagonal Scaling Matrix with Clamping

The unclamped scaling per column is

$$s_{t,j} = \frac{\|\hat M_{t,:,j}\|_2}{\sqrt{\hat v_{t,j} + \epsilon}},$$

where $\epsilon > 0$ is a regularization term. Set $d_t = (s_{t,1}, \ldots, s_{t,n})^\top$ and the mean $\bar d_t = \frac{1}{n}\sum_{j=1}^n s_{t,j}$. To avoid excessive update anisotropy (which would undermine the orthogonal update design), the clamped entries are

$$\tilde d_{t,j} = \min\left\{\max\{s_{t,j},\, c\,\bar d_t\},\, \bar d_t / c\right\}$$

for a hyperparameter $c \in (0,1]$. This ensures $\kappa(D_t) \le 1/c^2$, interpolating between full neuron-wise adaptivity ($c \to 0$) and strict uniform scaling ($c = 1$).
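The clamping rule and its condition-number bound can be checked directly. A small sketch, with arbitrary sample values for the unclamped scales:

```python
import numpy as np

def clamp_scaling(s, c):
    """Clamp per-column scales s toward their mean; c in (0, 1]."""
    s_bar = s.mean()
    return np.clip(s, c * s_bar, s_bar / c)

s = np.array([0.01, 0.5, 1.0, 40.0])   # arbitrary unclamped scales s_{t,j}
for c in (1.0, 0.3, 0.1):
    d = clamp_scaling(s, c)
    kappa = d.max() / d.min()          # condition number of diag(d)
    assert kappa <= 1.0 / c**2 + 1e-9  # kappa(D_t) <= 1/c^2
```

With $c = 1$ every entry collapses to the mean (uniform scaling, $\kappa = 1$); smaller $c$ lets more of the raw per-column variation survive.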

3. Convergence Properties

Assuming standard smoothness and bounded gradient noise, the convergence rates are:

  • Deterministic regime: with $\eta = O(T^{-1/2})$, $\epsilon = O(T^{-1/2} n^{-1})$, and fixed $c$, after $T$ steps,

$$\frac{1}{T} \sum_{t=1}^T \|\nabla \mathcal{L}(\Theta_{t-1})\|_F = O(T^{-1/2})$$

  • Stochastic regime: with minibatch size $b$, $\eta = O(T^{-3/4})$, $1-\mu_1 = 1-\mu_2 = O(T^{-1/2})$, and $\epsilon = O(T^{-1/2})$,

$$\frac{1}{T} \sum_{t=1}^T \mathbb{E}\big[\|\nabla \mathcal{L}(\Theta_{t-1})\|_F\big] = O\!\left(T^{-1/4} + \sigma\, b^{-1/4}\, T^{-1/8}\right)$$

The optimal $O(T^{-1/4})$ stochastic rate is retained for sufficiently large $b$ (Zhang et al., 19 Feb 2026).

4. Computational Complexity and Practical Considerations

Memory and computation costs for NAMO-D are summarized as follows:

| Optimizer | Extra Memory | Key Additional Cost |
|-----------|--------------|---------------------|
| AdamW | $O(d)$ | $O(d)$ |
| Muon | $O(d)$ | $O(mnk)$ (e.g., SVD, Newton–Schulz) |
| NAMO | $O(d)$ | $O(d)$ (minor inner products) |
| NAMO-D | $O(d) + O(n)$ | $O(mnk)$ (+ $O(n)$ for $D_t$) |

The storage overhead for $D_t$ in NAMO-D is negligible for typical network layers ($n$ columns), and the primary computational burden remains the orthogonalization step.
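As a back-of-the-envelope check of this overhead, consider a single weight matrix; the dimensions below are hypothetical, not from the paper:

```python
# Illustrative memory accounting for one m x n weight matrix.
m, n = 4096, 1024            # hypothetical Transformer-scale layer
d = m * n                    # parameter count for this matrix
momentum = d                 # first-moment matrix M_t (shared with AdamW/Muon)
extra = n + n                # column-wise v_t plus the diagonal of D_t
overhead = extra / d
print(f"extra column-wise state: {overhead:.4%} of the parameter count")
```

For this layer the column-wise vectors add well under 0.1% on top of the momentum buffer, consistent with the $O(n)$ entry in the table above.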

5. Hyperparameter Selection and Adaptation Behavior

The clamping parameter $c$ dictates the interpolation between global and neuron-wise adaptation. Heuristically, $c \sim 0.1$ is favored for small models (124M parameters) and $c \sim 0.9$ for medium ones (355M), with $c \in [0.1, 0.5]$ typically providing the best balance of adaptivity and conditioning. This tuning strategy gives robust empirical results across scales.
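The small grid sweeps this tuning relies on might look like the following sketch. The grid values mirror the heuristics in this section, while `val_loss` is a placeholder surrogate standing in for a short pretraining run:

```python
import itertools

# Candidate grids following the heuristics above (values illustrative).
lrs = [1e-3, 3e-3, 1e-2]
cs = [0.1, 0.3, 0.5, 0.9]

def val_loss(lr, c):
    """Surrogate objective; in practice, replace with a short pretraining
    job that returns the measured validation loss for (lr, c)."""
    return (lr - 3e-3) ** 2 * 1e4 + (c - 0.3) ** 2

best_lr, best_c = min(itertools.product(lrs, cs),
                      key=lambda p: val_loss(*p))
```

The same pattern extends to any other hyperparameter axis at the cost of multiplying the number of runs.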

6. Empirical Benchmarks in Large-Scale Pretraining

NAMO-D is evaluated in GPT-2 pretraining on OpenWebText:

  • For 124M-parameter models (50K steps), NAMO-D achieves a validation loss of $\approx 3.025$ nats/token, outperforming AdamW (3.064), Muon (3.044), and scalar NAMO (3.035).
  • For 355M-parameter models (10K steps), NAMO-D achieves 2.951 versus 2.991 (AdamW), 2.968 (Muon), and 2.952 (NAMO). Across both scales this amounts to improvements of up to 0.01 nats/token over scalar NAMO and roughly 0.02–0.04 nats/token over AdamW and Muon. Learning-rate and clamping hyperparameters are selected via small grid sweeps, as detailed in (Zhang et al., 19 Feb 2026).

7. Theoretical and Practical Implications

NAMO-D achieves optimal deterministic and adaptive stochastic convergence rates without large additional storage or computational burden. The algorithm is justified theoretically under standard optimization assumptions and demonstrates superior empirical performance in LLM pretraining. The architectural design—combining SVD-based orthogonalization with clamped, per-column adaptation—aligns with observed near block-diagonal structure in deep network Hessians. A plausible implication is that NAMO-D provides a principled interpolation between strict orthogonal updates and fully-adaptive, coordinate-wise noise normalization, enhancing both conditioning and adaptation capacity in high-dimensional settings (Zhang et al., 19 Feb 2026).
