Diagonal NAMO: Adaptive Orthogonal Optimization
- Diagonal NAMO is a diagonal adaptation of the NAMO algorithm that integrates orthogonal momentum with column-wise adaptive scaling for enhanced optimization.
- It employs SVD-based orthogonalization and clamped diagonal updates to balance local signal-to-noise adaptation and numerical stability.
- Empirical benchmarks demonstrate superior convergence and performance improvements over traditional optimizers in large-scale pretraining tasks.
Diagonal NAMO (NAMO-D) denotes a diagonal adaptation of the NAMO algorithm that integrates orthogonalized momentum with neuron-wise (column-wise) adaptive scaling for matrix parameter updates in stochastic optimization, most notably applied in large-scale neural network (e.g., Transformer) training. The design allows fine-grained adaptation to the stochastic noise structure while maintaining well-conditioned and nearly orthogonal update directions, yielding empirically and theoretically robust convergence in modern deep learning settings (Zhang et al., 19 Feb 2026).
1. Algorithmic Structure and Update Rule
Consider matrix-parameterized models with weights $W \in \mathbb{R}^{m \times n}$ and loss $f(W)$. The NAMO-D step maintains both first- and second-order moment statistics:
- First-moment momentum matrix: $M_t = \beta_1 M_{t-1} + (1-\beta_1)\,G_t$, where $G_t$ is the current minibatch gradient.
- Second-moment vector (column-wise): $v_{t,j} = \beta_2 v_{t-1,j} + (1-\beta_2)\,\lVert G_t[:,j]\rVert_2^2$ for $j = 1, \dots, n$.
After standard bias correction (yielding $\hat M_t$ and $\hat v_t$), orthogonalization is achieved via the SVD $\hat M_t = U \Sigma V^\top$ and $O_t = U V^\top$. The NAMO-D update applies this orthogonal factor, right-multiplied by a clamped diagonal matrix $D_t$:
$$W_{t+1} = W_t - \eta\, O_t D_t.$$
Here $D_t = \mathrm{diag}(d_1, \dots, d_n)$, where each $d_j$ adapts to the local signal-to-noise ratio of column $j$ but is clamped toward the column average $\bar d$ for numerical stability.
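The orthogonalization step above can be sketched in NumPy; the shapes are illustrative, and `M_hat` stands in for the bias-corrected momentum:

```python
import numpy as np

# Minimal sketch of the orthogonalization step: the bias-corrected
# momentum matrix M_hat is replaced by its nearest (semi-)orthogonal
# factor O = U V^T from the SVD M_hat = U S V^T.
rng = np.random.default_rng(0)
M_hat = rng.normal(size=(6, 3))          # stand-in for corrected momentum
U, S, Vt = np.linalg.svd(M_hat, full_matrices=False)
O = U @ Vt                                # semi-orthogonal update direction
# Columns of O are orthonormal: O^T O = I_n.
assert np.allclose(O.T @ O, np.eye(3))
# The NAMO-D update then right-multiplies O by the clamped diagonal D
# described in Section 2.
```

Right-multiplying `O` by a diagonal matrix rescales its columns, which is why the adaptive statistics in NAMO-D are kept per column rather than per entry.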
2. Construction of the Diagonal Scaling Matrix with Clamping
The unclamped scaling per column is
$$\tilde d_j = \frac{\lVert \hat M_t[:,j]\rVert_2}{\sqrt{\hat v_{t,j}} + \epsilon},$$
where $\epsilon > 0$ is a regularization term. Set $\tilde d = (\tilde d_1, \dots, \tilde d_n)$ and the mean $\bar d = \frac{1}{n}\sum_{j=1}^{n} \tilde d_j$. To avoid excessive update anisotropy (which would undermine the orthogonal update design), the clamped entries are
$$d_j = \mathrm{clip}\!\left(\tilde d_j,\ (1-\rho)\,\bar d,\ (1+\rho)\,\bar d\right)$$
for a hyperparameter $\rho \in [0,1]$. This ensures $d_j \in [(1-\rho)\bar d, (1+\rho)\bar d]$, interpolating between full neuron-wise adaptivity ($\rho \to 1$) and strict uniform scaling ($\rho = 0$).
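A small sketch of the clamped diagonal construction, assuming an SNR-style unclamped scale and a symmetric clamp parameter `rho` (both assumptions consistent with the description above, not the paper's exact parameterization):

```python
import numpy as np

def clamped_scales(M_hat, v_hat, eps=1e-8, rho=0.5):
    """Build the clamped diagonal entries d_j (sketch, not the
    paper's exact formulation)."""
    # Unclamped per-column signal-to-noise scale.
    d_raw = np.linalg.norm(M_hat, axis=0) / (np.sqrt(v_hat) + eps)
    d_bar = d_raw.mean()
    # rho = 0 -> strictly uniform scaling d_j = d_bar;
    # larger rho -> more neuron-wise adaptivity.
    return np.clip(d_raw, (1 - rho) * d_bar, (1 + rho) * d_bar)
```

With `rho=0` every column receives the same scale, recovering a purely orthogonal (Muon-style) step; intermediate values retain per-neuron adaptation while bounding the condition number of the diagonal factor.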
3. Convergence Properties
Assuming standard smoothness and bounded gradient noise, the convergence rates are:
- Deterministic regime: with fixed momentum constants $\beta_1, \beta_2$ and a suitably chosen fixed step size $\eta$, after $T$ steps the best iterate satisfies $\min_{t \le T} \lVert \nabla f(W_t)\rVert = O(1/\sqrt{T})$.
- Stochastic regime: with minibatch size $B$ and appropriately scaled $\eta$, $\beta_1$, $\beta_2$, the expected gradient norm of the best iterate decays at the standard nonconvex stochastic rate $O(T^{-1/4})$, with the noise term shrinking as $B$ grows.
Optimal stochastic rates are retained for sufficiently large minibatch size $B$ (Zhang et al., 19 Feb 2026).
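The deterministic behavior can be illustrated on a toy matrix quadratic $f(W) = \tfrac{1}{2}\lVert W - W_\star\rVert_F^2$ with exact gradients; the moment constants, step size, and clamp value below are illustrative, not the paper's prescribed schedule:

```python
import numpy as np

# Toy check of the deterministic regime: full (noiseless) gradients,
# NAMO-D-style steps as described in Sections 1-2 (sketch).
rng = np.random.default_rng(0)
W_star = rng.normal(size=(8, 4))
W = np.zeros((8, 4))
M, v = np.zeros_like(W), np.zeros(4)
beta1, beta2, lr, eps, rho = 0.9, 0.95, 0.1, 1e-8, 0.5
losses = []
for t in range(1, 201):
    G = W - W_star                                  # exact gradient
    M = beta1 * M + (1 - beta1) * G                 # first moment (matrix)
    v = beta2 * v + (1 - beta2) * np.sum(G * G, axis=0)  # per-column 2nd moment
    M_hat = M / (1 - beta1 ** t)                    # bias correction
    v_hat = v / (1 - beta2 ** t)
    U, _, Vt = np.linalg.svd(M_hat, full_matrices=False)
    d = np.linalg.norm(M_hat, axis=0) / (np.sqrt(v_hat) + eps)
    d = np.clip(d, (1 - rho) * d.mean(), (1 + rho) * d.mean())
    W -= lr * ((U @ Vt) * d)                        # O_t right-scaled by diag(d)
    losses.append(0.5 * np.sum((W - W_star) ** 2))
```

Because the orthogonalized step has roughly fixed magnitude, the loss drops rapidly and then oscillates near the optimum at a scale set by the (here constant) step size, as expected for normalized-update methods.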
4. Computational Complexity and Practical Considerations
Memory and computation cost for NAMO-D are summarized as follows, for an $m \times n$ weight matrix:

| Optimizer | Extra Memory | Key Additional Cost |
|-----------|--------------|---------------------|
| AdamW | $2mn$ | $O(mn)$ elementwise moment updates |
| Muon | $mn$ | orthogonalization (e.g. SVD, Newton–Schulz) |
| NAMO | $mn$ | orthogonalization (minor inner products) |
| NAMO-D | $mn + n$ | orthogonalization (+ $O(mn)$ for $v$) |
The storage overhead for the column-wise second-moment vector $v$ in NAMO-D is negligible for typical network layers ($n$ columns versus $mn$ matrix entries), and the primary computational burden remains the orthogonalization step.
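The negligible-overhead claim can be checked with quick arithmetic; the layer shape below is illustrative, not taken from the paper:

```python
# Back-of-envelope overhead of the per-column vector v relative to the
# momentum matrix it accompanies, for a typical Transformer MLP layer.
m, n = 4096, 11008           # weight matrix dims (illustrative)
momentum_floats = m * n      # first-moment matrix, as in Muon/NAMO
extra_floats = n             # NAMO-D's column-wise second-moment vector
overhead = extra_floats / momentum_floats
assert overhead == 1 / m     # n / (m*n) = 1/m, i.e. about 0.02% here
```

The relative overhead is $1/m$ regardless of $n$, so wider layers do not change the conclusion.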
5. Hyperparameter Selection and Adaptation Behavior
The clamping parameter dictates the interpolation between global and neuron-wise adaptation. Heuristically, the preferred clamping strength differs with scale: one setting is favored for small models (124M parameters) and another for medium ones (355M), with moderate clamping typically providing the best balance of adaptivity and conditioning. This tuning strategy gives robust empirical results across scales.
6. Empirical Benchmarks in Large-Scale Pretraining
NAMO-D is evaluated in GPT-2 pretraining on OpenWebText:
- For 124M parameter models (50K steps), NAMO-D achieves the lowest validation loss (in nats/token), outperforming AdamW (3.064), Muon (3.044), and scalar NAMO (3.035).
- For 355M parameter models (10K steps), NAMO-D achieves 2.951 versus 2.991 (AdamW), 2.968 (Muon), and 2.952 (NAMO). Consistent improvements of 0.01–0.02 nats/token over scalar NAMO, and larger margins over AdamW and Muon, are reported. Learning rate and clamping hyperparameters are selected via small grid sweeps, as detailed in (Zhang et al., 19 Feb 2026).
7. Theoretical and Practical Implications
NAMO-D achieves optimal deterministic and adaptive stochastic convergence rates without large additional storage or computational burden. The algorithm is justified theoretically under standard optimization assumptions and demonstrates superior empirical performance in LLM pretraining. The architectural design—combining SVD-based orthogonalization with clamped, per-column adaptation—aligns with observed near block-diagonal structure in deep network Hessians. A plausible implication is that NAMO-D provides a principled interpolation between strict orthogonal updates and fully-adaptive, coordinate-wise noise normalization, enhancing both conditioning and adaptation capacity in high-dimensional settings (Zhang et al., 19 Feb 2026).