AdaGO Algorithm

Updated 16 September 2025
  • AdaGO is an adaptive optimization algorithm that integrates AdaGrad’s norm-based stepsize adaptation with Muon’s orthogonal momentum update for matrix-structured parameters.
  • It achieves optimal convergence rates in both stochastic and deterministic nonconvex settings by dynamically scaling updates according to accumulated gradient norms.
  • Empirical benchmarks on regression and classification tasks show AdaGO’s superior performance over Adam and Muon, with minimal computational and memory overhead.

AdaGO is an adaptive optimization algorithm for matrix‐structured parameters that synthesizes the norm‐based stepsize adaptation of AdaGrad with the spectral update geometry of the Muon optimizer. Distinctively, AdaGO preserves the orthogonalization of Muon’s update direction while utilizing an AdaGrad‐type scaling governed by accumulated gradient norms. The algorithm is constructed to be computationally and memory efficient, requiring only a single additional scalar variable compared to Muon. AdaGO achieves optimal theoretical convergence rates for nonconvex stochastic and deterministic optimization under standard smoothness and noise assumptions and empirically demonstrates superior performance relative to Muon and Adam on regression and classification benchmarks.

1. Algorithm Structure and Update Rules

AdaGO’s mechanism fuses AdaGrad’s adaptive scaling with Muon’s orthogonal momentum averaging. Iterative updates are performed as follows:

  • For each iteration $t$, a minibatch is sampled, yielding the stochastic gradient $G_t$.
  • Momentum is maintained via $M_t \leftarrow \mu M_{t-1} + (1-\mu) G_t$, analogous to Muon.
  • An accumulator $v_t$ tracks the history of (clamped) gradient norms: $v_t^2 \leftarrow v_{t-1}^2 + \min\{\|G_t\|^2, \gamma^2\}$, where $\gamma$ controls the clamping.
  • The momentum $M_t$ is orthogonalized: if $M_t = U \Sigma V^T$ (SVD), then $O_t = U V^T$ preserves Muon's spectral descent direction.
  • Parameter updates are performed using the adaptive stepsize $\alpha_t$:

$$\Theta_t = \Theta_{t-1} - \alpha_t O_t,$$

with

$$\alpha_t = \max\{\epsilon,\, \eta \cdot (\min\{\|G_t\|, \gamma\} / v_t)\},$$

where $\eta$ is the base learning rate and $\epsilon$ ensures numerical stability.

This scheme generalizes AdaGrad-Norm to the matrix-valued setting while retaining Muon's update geometry. Use of the "min" operator in $v_t$ and $\alpha_t$ ensures that outlier gradient norms do not destabilize the learning rate.
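
As a concrete illustration, the following is a minimal NumPy sketch of a single AdaGO iteration for one matrix parameter, assuming the Frobenius norm for $\|G_t\|$ and a full SVD for orthogonalization; the function and variable names are illustrative rather than taken from a reference implementation.

```python
import numpy as np

def adago_step(theta, grad, M, v_sq, mu=0.95, eta=0.01, gamma=1.0, eps=1e-8):
    """One AdaGO iteration for a single matrix parameter (illustrative sketch)."""
    # Momentum average of stochastic gradients, as in Muon.
    M = mu * M + (1.0 - mu) * grad

    # Clamped Frobenius norm of the current gradient.
    g_norm = min(np.linalg.norm(grad), gamma)

    # AdaGrad-Norm style accumulator of clamped squared norms.
    v_sq = v_sq + g_norm ** 2
    v = np.sqrt(v_sq)

    # Orthogonalize the momentum: O_t = U V^T from the SVD of M_t.
    U, _, Vt = np.linalg.svd(M, full_matrices=False)
    O = U @ Vt

    # Adaptive stepsize with a numerical-stability floor; the inner max
    # guards against division by zero on a first step with zero gradient.
    alpha = max(eps, eta * g_norm / max(v, 1e-12))

    theta = theta - alpha * O
    return theta, M, v_sq
```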

2. Mathematical Formulation

AdaGO is defined by several critical formulas:

  • Orthogonalized momentum (update direction): $O_t = U V^T$, with $M_t = U \Sigma V^T$ (SVD)
  • Stepsize (AdaGrad-type scaling): $\alpha_t = \max\left\{\epsilon,\, \eta\cdot\frac{\min\{\|G_t\|, \gamma\}}{v_t}\right\}$
  • Accumulator (tracks clamped gradient-norm history): $v_t^2 = v_{t-1}^2 + \min\{\|G_t\|^2, \gamma^2\}$
  • Update rule (parameter update): $\Theta_t = \Theta_{t-1} - \alpha_t O_t$

Orthogonalization ensures that the update is spectral (preserving the principal components of $M_t$). The adaptive stepsize scales the update inversely with the accumulated norm of past gradients, generalizing AdaGrad-Norm's adaptive mechanism to the spectral (matrix) setting.
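
To see how the clamping and accumulation interact, the short sketch below evaluates the stepsize formula on a few hypothetical gradient norms (the values of $\eta$, $\gamma$, and the norms are illustrative): the initial outlier norm is capped at $\gamma$, and $\alpha_t$ then decays as the accumulator grows.

```python
import math

eta, gamma, eps = 0.02, 1.0, 1e-8
grad_norms = [5.0, 0.8, 0.3, 0.05]   # hypothetical Frobenius norms ||G_t||

v_sq = 0.0
for g in grad_norms:
    clamped = min(g, gamma)           # the outlier 5.0 is capped at gamma
    v_sq += clamped ** 2              # v_t^2 accumulates clamped squared norms
    alpha = max(eps, eta * clamped / math.sqrt(v_sq))
    print(f"||G_t|| = {g:4.2f}  ->  alpha_t = {alpha:.4f}")
```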

3. Theoretical Convergence Properties

AdaGO achieves optimal convergence rates for nonconvex objectives under standard smoothness and noise conditions:

  • In the stochastic setting, with momentum and unit mini-batch size, setting $\epsilon = T^{-3/4}$, $1-\mu = T^{-1/2}$, and $\eta = T^{-3/8-q}$ ($q>0$), AdaGO attains an $O(T^{-1/4})$ convergence rate, which matches the lower bound for first-order stochastic nonconvex optimization.
  • Deterministically ($\mu=0$), AdaGO reaches the $O(1/\sqrt{T})$ rate, the known optimum for deterministic nonconvex first-order methods.
  • When momentum is turned off and batch sizes increase, AdaGO naturally adjusts stepsizes to account for noise, demonstrating additional robustness.

These rates are attained entirely via norm-based adaptation; the adaptive mechanism enables the optimizer to take more aggressive steps when gradient norms are large, then anneal step lengths near stationary points. The use of spectral (orthogonalized) updates ensures efficacy for matrix parameters.
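
For concreteness, the small helper below computes the theory-suggested settings from the stochastic analysis for a given iteration budget $T$; the value of $q$ and the horizon used in the example are illustrative choices, not prescriptions from the source.

```python
def stochastic_schedule(T: int, q: float = 0.05):
    """Parameter settings from the stochastic, unit-batch analysis (illustrative)."""
    eps = T ** (-3 / 4)             # epsilon = T^{-3/4}
    mu = 1.0 - T ** (-1 / 2)        # 1 - mu = T^{-1/2}
    eta = T ** (-(3 / 8 + q))       # eta = T^{-3/8 - q}, with q > 0
    return eps, mu, eta

print(stochastic_schedule(T=10_000))  # e.g., a 10k-iteration run
```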

4. Empirical Benchmarking and Comparative Performance

Empirical evaluations comprised regression and classification tasks:

  • Function Regression: Training two-layer MLPs on data sampled from Gaussian random fields, AdaGO consistently reached lower training and test loss than both Adam and Muon. The adaptive scaling permitted AdaGO to take large initial steps while converging tightly near optima.
  • CIFAR-10 Classification: On convolutional neural networks trained for CIFAR-10, AdaGO realized both lower training loss and higher test accuracy across epochs. Adam occasionally exhibited optimization oscillation, while Muon was more stable but converged to higher loss. AdaGO’s orthogonally scaled steps yielded improved generalization and optimization speed.

This suggests AdaGO may be especially effective for tasks where matrix structure and gradient norm variability are pronounced, such as neural architecture optimization.

5. Implementation Aspects

AdaGO extends Muon with minimal complexity:

  • Only one additional scalar variable, namely $v_t$, must be stored and updated per parameter matrix.
  • Accumulation and clamping of gradient norms rely on the Frobenius norm, incurring negligible computational and memory overhead.
  • Orthogonalization uses a single SVD per update, identical to Muon; the update direction remains purely orthogonal, not elementwise-adapted as in other Muon variants.
  • The design allows AdaGO to be incorporated into deep learning libraries by modifying only the stepsize calculation and the update direction; see the sketch after this list.
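
A minimal PyTorch-style sketch of such an integration is given below, assuming a full SVD for orthogonalization and handling only 2-D (matrix) parameters. It illustrates the update structure and the single extra scalar of state; it is not the authors' reference implementation, and the class and argument names are assumptions.

```python
import torch
from torch.optim import Optimizer

class AdaGO(Optimizer):
    """Sketch of AdaGO for matrix-shaped parameters (illustrative only)."""

    def __init__(self, params, lr=1e-2, momentum=0.95, gamma=1.0, eps=1e-8):
        defaults = dict(lr=lr, momentum=momentum, gamma=gamma, eps=eps)
        super().__init__(params, defaults)

    @torch.no_grad()
    def step(self, closure=None):
        loss = None
        if closure is not None:
            with torch.enable_grad():
                loss = closure()
        for group in self.param_groups:
            lr, mu = group["lr"], group["momentum"]
            gamma, eps = group["gamma"], group["eps"]
            for p in group["params"]:
                if p.grad is None or p.ndim != 2:
                    continue  # this sketch only handles matrix parameters
                state = self.state[p]
                if len(state) == 0:
                    state["M"] = torch.zeros_like(p)                   # momentum buffer
                    state["v_sq"] = torch.zeros((), device=p.device)   # the single extra scalar, v_t^2

                G = p.grad
                M = state["M"].mul_(mu).add_(G, alpha=1.0 - mu)        # M_t = mu M_{t-1} + (1-mu) G_t

                g_norm = torch.clamp(torch.linalg.norm(G), max=gamma)  # min{||G_t||_F, gamma}
                state["v_sq"] += g_norm ** 2                           # accumulate clamped squared norms

                U, _, Vh = torch.linalg.svd(M, full_matrices=False)    # orthogonalize the momentum
                O = U @ Vh

                v = state["v_sq"].sqrt().clamp_min(1e-12)
                alpha = max(eps, (lr * g_norm / v).item())             # adaptive stepsize with floor
                p.add_(O, alpha=-alpha)
        return loss
```

In practice, non-matrix parameters such as biases would typically be routed to a separate update rule, as is common when using Muon; the sketch above simply skips them.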

A plausible implication is that AdaGO’s efficient adaptation and spectral descent can be leveraged in settings with huge weight matrices—such as LLMs and other parameter-rich architectures—without undue resource cost.

6. Practical Significance and Applications

AdaGO’s combination of AdaGrad-type adaptive scaling and orthogonalized descent directions enables several practical advantages:

  • Stepsizes adjust automatically to the optimization landscape: high-norm gradients lead to rapid parameter motion, while in flatter regions updates shrink to prevent divergence.
  • The orthogonalization inherits the proven empirical robustness of Muon for matrix-valued parameters, preserving favorable principal directions of descent.
  • Applicable to training tasks with challenging geometry, e.g., LLMs, deep CNNs, or architectures where nonuniform gradient scaling is prevalent.
  • Minimal modification and overhead position AdaGO as a suitable optimizer for resource-constrained but large-scale scenarios.

A plausible implication is that AdaGO can mitigate problems of slow convergence and poor generalization arising from mis-scaled updates in matrix architectures, especially where both landscape adaptivity and stable update geometry are critical.

7. Relation to Prior Methods and Outlook

AdaGO is distinguished by retaining Muon's spectral update direction while importing the stepsize adaptation of AdaGrad-Norm. Prior adaptive Muon variants alter the update direction via element-wise learning rates, potentially disrupting orthogonality. By contrast, AdaGO preserves the orthogonal structure, using norm-based adaptation only to scale the magnitude of the matrix update. This approach results in:

  • Computational and memory efficiency
  • Improved convergence stability
  • Theoretical optimality for nonconvex optimization

A plausible extension is further empirical validation in even larger-scale deep learning regimes, as well as exploration of alternative norm and accumulation strategies compatible with orthogonal update schemes. The integration of adaptive scalar stepsizes with matrix-aware geometry, as embodied in AdaGO, may inform new lines of research in large-scale first-order optimization and spectral descent algorithms.
