AdaGO Algorithm
- AdaGO is an adaptive optimization algorithm that integrates AdaGrad’s norm-based stepsize adaptation with Muon’s orthogonal momentum update for matrix-structured parameters.
- It achieves optimal convergence rates in both stochastic and deterministic nonconvex settings by dynamically scaling updates according to accumulated gradient norms.
- Empirical benchmarks on regression and classification tasks show AdaGO’s superior performance over Adam and Muon, with minimal computational and memory overhead.
AdaGO is an adaptive optimization algorithm for matrix‐structured parameters that synthesizes the norm‐based stepsize adaptation of AdaGrad with the spectral update geometry of the Muon optimizer. Distinctively, AdaGO preserves the orthogonalization of Muon’s update direction while utilizing an AdaGrad‐type scaling governed by accumulated gradient norms. The algorithm is constructed to be computationally and memory efficient, requiring only a single additional scalar variable compared to Muon. AdaGO achieves optimal theoretical convergence rates for nonconvex stochastic and deterministic optimization under standard smoothness and noise assumptions and empirically demonstrates superior performance relative to Muon and Adam on regression and classification benchmarks.
1. Algorithm Structure and Update Rules
AdaGO’s mechanism fuses AdaGrad’s adaptive scaling with Muon’s orthogonal momentum averaging. Iterative updates are performed as follows:
- For each iteration $t$, a minibatch is sampled, yielding the stochastic gradient $G_t$.
- Momentum is maintained via $M_t = \mu M_{t-1} + (1-\mu)\,G_t$, analogous to Muon.
- A scalar accumulator tracks the history of (clamped) gradient norms: $b_t^2 = b_{t-1}^2 + \min(\lVert G_t \rVert_F, \gamma)^2$, where $\gamma$ controls the clamping.
- Momentum is orthogonalized: if $M_t = U_t \Sigma_t V_t^\top$ (SVD), then $O_t = U_t V_t^\top$ preserves Muon's spectral descent direction.
- Parameter updates are performed using the adaptive stepsize $\eta_t$: $X_{t+1} = X_t - \eta_t\, O_t$, with $\eta_t = \eta\,\min(\lVert G_t \rVert_F, \gamma)/(b_t + \epsilon)$, where $\eta$ is the base learning rate and $\epsilon$ ensures numerical stability.
This scheme generalizes AdaGrad-Norm to the matrix-valued context while retaining Muon's update geometry. The "min" operator in $\eta_t$ and $b_t$ ensures that outlier gradient norms do not destabilize the learning rate.
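The following NumPy sketch illustrates one AdaGO iteration under this reconstruction. It is a minimal illustration, not the authors' reference implementation: the hyperparameter names (`eta`, `mu`, `gamma`, `eps`) are assumed, and an exact SVD stands in for the orthogonalization (a Newton–Schulz approximation, as in Muon, would also work).

```python
import numpy as np

def adago_step(X, M, b, G, eta=0.1, mu=0.9, gamma=1.0, eps=1e-8):
    """One AdaGO iteration (sketch).

    X: parameter matrix, M: momentum buffer, b: scalar accumulator of
    clamped gradient norms, G: stochastic gradient for this minibatch.
    """
    # Muon-style momentum averaging of the raw gradients.
    M = mu * M + (1 - mu) * G
    # Clamp the gradient norm at gamma and accumulate it (AdaGrad-Norm style).
    clamped = min(np.linalg.norm(G, "fro"), gamma)
    b = np.sqrt(b**2 + clamped**2)
    # Orthogonalize the momentum: keep only the spectral directions U V^T.
    U, _, Vt = np.linalg.svd(M, full_matrices=False)
    O = U @ Vt
    # Adaptive stepsize: current clamped norm over the accumulated history.
    step = eta * clamped / (b + eps)
    return X - step * O, M, b

# Usage: carry M (initialized to zeros_like(X)) and b (initialized to 0.0)
# across iterations, exactly as Muon carries its momentum buffer.
```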
2. Mathematical Formulation
AdaGO is defined by several critical formulas:
Quantity | Formula | Role |
---|---|---|
Orthogonalized Momentum | $O_t = U_t V_t^\top$, with $M_t = U_t \Sigma_t V_t^\top$ (SVD) | Update direction |
Stepsize | $\eta_t = \eta\,\min(\lVert G_t \rVert_F, \gamma)/(b_t + \epsilon)$ | AdaGrad-type scaling |
Accumulator | $b_t^2 = b_{t-1}^2 + \min(\lVert G_t \rVert_F, \gamma)^2$ | Tracks clamped-norm history |
Update Rule | $X_{t+1} = X_t - \eta_t\, O_t$ | Parameter update |
Orthogonalization ensures that the update is spectral: it preserves the principal directions of $M_t$ while equalizing its singular values. The adaptive stepsize scales the update by the running history of (clamped) gradient norms, generalizing AdaGrad-Norm's adaptive mechanism to the spectral (matrix) setting.
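As a quick numerical illustration (not taken from the paper), the snippet below checks that the orthogonalized factor $U V^\top$ of a random momentum matrix has all singular values equal to one, so the scalar stepsize $\eta_t$ alone controls the magnitude of the update:

```python
import numpy as np

rng = np.random.default_rng(0)
M = rng.normal(size=(6, 4))        # stand-in for a momentum matrix M_t

U, _, Vt = np.linalg.svd(M, full_matrices=False)
O = U @ Vt                         # orthogonalized update direction O_t

# All singular values of O are 1: every principal direction of M is kept
# with equal weight, and only eta_t sets the overall scale of the step.
print(np.linalg.svd(O, compute_uv=False))  # -> [1. 1. 1. 1.]
print(np.linalg.norm(O, "fro") ** 2)       # -> min(m, n) = 4.0
```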
3. Theoretical Convergence Properties
AdaGO achieves optimal convergence rates for nonconvex objectives under standard smoothness and noise conditions:
- In the stochastic setting, with momentum and unit mini-batch size, and with appropriately chosen base learning rate $\eta$, clamping threshold $\gamma$, and momentum parameter, AdaGO attains the $\mathcal{O}(\epsilon^{-4})$ convergence rate (equivalently, $\mathcal{O}(T^{-1/4})$ in expected gradient norm), which matches the lower bound for first-order stochastic nonconvex optimization.
- In the deterministic setting (exact gradients, $\sigma = 0$), AdaGO reaches the $\mathcal{O}(\epsilon^{-2})$ rate (equivalently, $\mathcal{O}(T^{-1/2})$), the known optimum for deterministic nonconvex first-order methods.
- When momentum is turned off and batch sizes increase, AdaGO naturally adjusts stepsizes to account for noise, demonstrating additional robustness.
These rates are attained entirely via norm-based adaptation; the adaptive mechanism enables the optimizer to take more aggressive steps when gradient norms are large, then anneal step lengths near stationary points. The use of spectral (orthogonalized) updates ensures efficacy for matrix parameters.
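The annealing behavior can be seen in a toy simulation of the reconstructed stepsize on a synthetic sequence of decaying gradient norms (an illustration only, not an experiment from the paper):

```python
import numpy as np

eta, gamma, eps = 0.5, 1.0, 1e-8
b = 0.0
norms = 5.0 * 0.9 ** np.arange(50)   # synthetic gradient norms decaying to ~0

for t, g_norm in enumerate(norms):
    clamped = min(g_norm, gamma)     # large early norms are clamped at gamma
    b = np.sqrt(b**2 + clamped**2)   # accumulated clamped-norm history b_t
    step = eta * clamped / (b + eps) # adaptive stepsize eta_t
    if t % 10 == 0:
        print(f"t={t:2d}  ||G||_F={g_norm:6.3f}  eta_t={step:.4f}")
# Early iterations take comparatively aggressive steps; as the norms decay
# toward a stationary point, eta_t anneals automatically.
```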
4. Empirical Benchmarking and Comparative Performance
Empirical evaluations comprised regression and classification tasks:
- Function Regression: Training two-layer MLPs on data sampled from Gaussian random fields, AdaGO consistently reached lower training and test loss than both Adam and Muon. The adaptive scaling permitted AdaGO to exploit large initial steps but converge tightly near optima.
- CIFAR-10 Classification: On convolutional neural networks trained for CIFAR-10, AdaGO realized both lower training loss and higher test accuracy across epochs. Adam occasionally exhibited optimization oscillation, while Muon was more stable but converged to higher loss. AdaGO’s orthogonally scaled steps yielded improved generalization and optimization speed.
This suggests AdaGO may be especially effective for tasks where matrix structure and gradient norm variability are pronounced, such as neural architecture optimization.
5. Implementation Aspects
AdaGO extends Muon with minimal complexity:
- Only one additional scalar variable, namely the norm accumulator $b_t$, must be stored and updated per parameter matrix.
- Accumulation and clamping of gradient norms rely on the Frobenius norm, incurring negligible computational and memory overhead.
- Orthogonalization uses a single SVD per update, identical to Muon; the update direction remains purely orthogonal, not elementwise-adapted as in other Muon variants.
- The design allows AdaGO to be incorporated in deep learning libraries by modifying only the stepsize calculation and the update vector.
A plausible implication is that AdaGO’s efficient adaptation and spectral descent can be leveraged in settings with huge weight matrices—such as LLMs and other parameter-rich architectures—without undue resource cost.
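A minimal sketch of such an integration, written against PyTorch's `torch.optim.Optimizer` interface and based on the reconstruction above (a hypothetical implementation, not the authors' reference code): only the scalar accumulator, the clamped-norm stepsize, and the orthogonalized update differ from a plain momentum optimizer. An exact SVD is used where a production version might substitute Muon's Newton–Schulz iteration, and only 2-D (matrix) parameters are handled.

```python
import torch
from torch.optim import Optimizer

class AdaGO(Optimizer):
    """Hypothetical AdaGO sketch: Muon-style orthogonalized momentum with an
    AdaGrad-Norm stepsize driven by accumulated, clamped gradient norms."""

    def __init__(self, params, lr=1e-2, momentum=0.9, gamma=1.0, eps=1e-8):
        defaults = dict(lr=lr, momentum=momentum, gamma=gamma, eps=eps)
        super().__init__(params, defaults)

    @torch.no_grad()
    def step(self, closure=None):
        loss = None
        if closure is not None:
            with torch.enable_grad():
                loss = closure()
        for group in self.param_groups:
            lr, mu = group["lr"], group["momentum"]
            gamma, eps = group["gamma"], group["eps"]
            for p in group["params"]:
                if p.grad is None or p.ndim != 2:  # matrix parameters only
                    continue
                state = self.state[p]
                if not state:
                    state["M"] = torch.zeros_like(p)               # momentum buffer
                    state["b"] = torch.zeros((), device=p.device)  # norm accumulator
                G, M, b = p.grad, state["M"], state["b"]
                # Muon-style momentum averaging.
                M.mul_(mu).add_(G, alpha=1 - mu)
                # Accumulate the clamped gradient norm (the extra scalar vs. Muon).
                clamped = torch.clamp(torch.linalg.matrix_norm(G), max=gamma)
                b.copy_(torch.sqrt(b**2 + clamped**2))
                # Orthogonalize the momentum (SVD here; Newton-Schulz also works).
                U, _, Vh = torch.linalg.svd(M, full_matrices=False)
                # AdaGrad-Norm-type stepsize, then the spectral update.
                step = (lr * clamped / (b + eps)).item()
                p.add_(U @ Vh, alpha=-step)
        return loss
```

As with Muon in practice, non-matrix parameters (biases, normalization weights, embeddings) would typically be handed to a standard optimizer such as AdamW alongside this one.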
6. Practical Significance and Applications
AdaGO’s combination of AdaGrad-type adaptive scaling and orthogonalized descent directions enables several practical advantages:
- Stepsizes are adjusted automatically to the optimization landscape: large gradient norms lead to rapid parameter motion, while in flatter regions and near stationary points the updates shrink, stabilizing convergence and preventing overshooting.
- The orthogonalization inherits the proven empirical robustness of Muon for matrix-valued parameters, preserving favorable principal directions of descent.
- Applicable to training tasks with challenging geometry, e.g., LLMs, deep CNNs, or architectures where nonuniform gradient scaling is prevalent.
- Minimal modification and overhead position AdaGO as a suitable optimizer for resource-constrained but large-scale scenarios.
A plausible implication is that AdaGO can mitigate problems of slow convergence and poor generalization arising from mis-scaled updates in matrix architectures, especially where both landscape adaptivity and stable update geometry are critical.
7. Relation to Prior Methods and Outlook
AdaGO is distinguished by retaining Muon's spectral update direction while importing the stepsize adaptation of AdaGrad-Norm. Prior adaptive Muon variants alter the update direction via element-wise learning rates, potentially disrupting orthogonality. By contrast, AdaGO preserves the orthogonal structure, using norm-based adaptation only to set a scalar stepsize for the matrix-valued update. This approach results in:
- Computational and memory efficiency
- Improved convergence stability
- Theoretical optimality for nonconvex optimization
A plausible extension is further empirical validation in even larger-scale deep learning regimes, as well as exploration of alternative norms and accumulation strategies compatible with orthogonal update schemes. The integration of adaptive scalar stepsizes with the matrix-aware geometry embodied in AdaGO may inform new lines of research in large-scale first-order optimization and spectral descent algorithms.