A unified convergence theory for adaptive first-order methods in the nonconvex case, including AdaNorm, full and diagonal AdaGrad, Shampoo and Muo

Published 19 Apr 2026 in cs.LG | (2604.17423v1)

Abstract: A unified framework for first-order optimization algorithms for nonconvex unconstrained optimization is proposed that uses adaptively preconditioned gradients and includes popular methods such as full and diagonal AdaGrad, AdaNorm, as well as adaptive variants of Shampoo and Muon. This framework also allows combining heterogeneous geometries across different groups of variables while preserving a unified convergence analysis. A fully stochastic global rate-of-convergence analysis is conducted for all methods in the framework, with and without two types of momentum, using reasonable assumptions on the variance of the gradient oracle and without assuming bounded stochastic gradients or small enough stepsize.

Authors (2)

Summary

  • The paper introduces a unified framework that establishes sublinear convergence rates for adaptive methods in nonconvex optimization settings.
  • It shows that blockwise adaptive preconditioning, including techniques like AdaGrad, Shampoo, and Muon, retains these convergence guarantees without bounded-gradient or small-stepsize assumptions.
  • The analysis leverages operator inequalities and momentum perturbation techniques to manage stochastic noise and ensure safe composability in heterogeneous architectures.

Unified Convergence Theory for Adaptive First-Order Methods in Nonconvex Optimization

Introduction

The paper "A unified convergence theory for adaptive first-order methods in the nonconvex case, including AdaNorm, full and diagonal AdaGrad, Shampoo and Muo" (2604.17423) addresses the absence of a comprehensive convergence analysis for a broad class of adaptive first-order optimization algorithms applied to nonconvex stochastic objectives. The work situates itself within the context of unconstrained minimization of objectives of the form $\min_X \mathbb{E}[f(X, \xi)]$, where $f$ is smooth but potentially nonconvex, and $\xi$ models randomness such as data sampling noise.

Adaptive first-order methods such as AdaNorm, AdaGrad (in both diagonal and dense forms), Shampoo, and Muon are foundational for large-scale machine learning. Whereas previous analyses often focused on specific methods and required restrictive assumptions (e.g., uniformly bounded gradients, convex losses, or small stepsizes), this paper develops a general framework that integrates these diverse algorithmic instances and unifies their convergence analysis even in the presence of nonconvexity, stochastic noise, and momentum.

Framework and Algorithmic Generality

The core of the analysis is a flexible block-structured parameterization. The optimization variables are partitioned into blocks, each of which can correspond to a different geometry (e.g., vectors, matrices). Each block is assigned its own norm and dual norm. The updates proceed by forming an adaptive preconditioner—potentially block-wise or with more intricate geometrical information—used to scale the gradient estimates.

The proposed general algorithm, denoted ADPREC, encompasses:

  • Stochastic update: At each iteration, a stochastic gradient $G_{k,\ell}$ is estimated for each block.
  • Adaptive preconditioning: Each block maintains a preconditioner $\Gamma_{k,\ell}$ (e.g., accumulating gradient outer products, diagonal squares), which may differ between blocks.
  • Update step: Variables are updated using a preconditioned step scaled by the dual norm of the preconditioned gradient.

The approach naturally covers isotropic geometries (AdaNorm), diagonal preconditioning (diagonal AdaGrad), dense matrix preconditioning (full AdaGrad), Kronecker-factored matrix preconditioning (Shampoo), and spectral normalization (adaptive Muon). Block-wise combinations support practical deep learning implementations where different layers/parameters may require distinct preconditioning schemes.
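The blockwise idea can be illustrated with a minimal NumPy sketch. It uses the commonly cited textbook forms of AdaNorm, diagonal AdaGrad, and Shampoo, not the paper's ADPREC update (in particular, the dual-norm scaling of the step is not reproduced here). The toy quadratic objective, the block names `W`, `b`, `s`, the helper names, and all hyperparameter values are assumptions made for illustration only.

```python
import numpy as np

def adanorm_step(x, g, state, lr, eps=1e-8):
    # Isotropic scaling: a single scalar accumulates squared gradient norms.
    state["s"] = state.get("s", 0.0) + float(np.sum(g * g))
    return x - lr * g / (np.sqrt(state["s"]) + eps)

def diag_adagrad_step(x, g, state, lr, eps=1e-8):
    # Diagonal scaling: one accumulator per coordinate.
    state["v"] = state.get("v", 0.0) + g * g
    return x - lr * g / (np.sqrt(state["v"]) + eps)

def _inv_root(sym, p, eps):
    # (sym + eps * I)^(-1/p) for a symmetric PSD matrix, via eigendecomposition.
    w, q = np.linalg.eigh(sym)
    w = np.maximum(w, 0.0) + eps
    return (q * w ** (-1.0 / p)) @ q.T

def shampoo_step(x, g, state, lr, eps=1e-8):
    # Kronecker-factored scaling: left and right second-moment accumulators.
    rows, cols = g.shape
    state["L"] = state.get("L", np.zeros((rows, rows))) + g @ g.T
    state["R"] = state.get("R", np.zeros((cols, cols))) + g.T @ g
    direction = _inv_root(state["L"], 4, eps) @ g @ _inv_root(state["R"], 4, eps)
    return x - lr * direction

def run_demo(steps=500, lr=0.5, noise=0.01, seed=0):
    # Hypothetical toy problem: pull each block toward a fixed target
    # using noisy gradients, with a different rule per block.
    rng = np.random.default_rng(seed)
    targets = {
        "W": rng.normal(size=(4, 3)),
        "b": rng.normal(size=3),
        "s": rng.normal(size=2),
    }
    rules = {"W": shampoo_step, "b": diag_adagrad_step, "s": adanorm_step}
    params = {name: np.zeros_like(t) for name, t in targets.items()}
    states = {name: {} for name in targets}

    def loss():
        return 0.5 * sum(
            float(np.sum((params[n] - targets[n]) ** 2)) for n in targets
        )

    initial = loss()
    for _ in range(steps):
        for name, rule in rules.items():
            shape = targets[name].shape
            g = params[name] - targets[name] + noise * rng.normal(size=shape)
            params[name] = rule(params[name], g, states[name], lr)
    return initial, loss()

initial_loss, final_loss = run_demo()
print(initial_loss, final_loss)
```

Each block keeps its own state dictionary, so swapping the rule assigned to a block changes only one entry of `rules`. That per-block independence is the property the framework's heterogeneous-geometry claim relies on.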

Convergence Analysis and Main Results

The convergence proofs introduce several technical innovations:

  • The analysis leverages trace inequalities and operator monotone function theory to control potentially heterogeneous preconditioners while avoiding reliance on supremum norm bounds.
  • The proofs require neither bounded gradients nor restrictively small stepsizes.
  • Two momentum mechanisms are considered: (i) updating preconditioners and computing search directions with the gradient momentum, and (ii) momentum applied only to the update, not the preconditioner.
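One possible reading of the two momentum mechanisms, sketched with diagonal AdaGrad for concreteness. The exponential-moving-average form of the momentum, the constant `beta`, the function name, and the toy quadratic are assumptions for illustration and are not taken from the paper.

```python
import numpy as np

def adagrad_with_momentum(grad_fn, x0, steps, lr=0.5, beta=0.9,
                          variant="i", eps=1e-8):
    x = np.array(x0, dtype=float)
    m = np.zeros_like(x)  # gradient momentum (exponential moving average)
    v = np.zeros_like(x)  # diagonal preconditioner accumulator
    for _ in range(steps):
        g = grad_fn(x)
        m = beta * m + (1.0 - beta) * g
        # Variant (i): the preconditioner accumulates the momentum.
        # Variant (ii): the preconditioner accumulates the raw gradient,
        # and momentum enters only through the search direction.
        source = m if variant == "i" else g
        v = v + source * source
        x = x - lr * m / (np.sqrt(v) + eps)
    return x

# Hypothetical noiseless quadratic: gradient of 0.5 * ||x - target||^2.
target = np.array([1.0, -2.0, 3.0])
grad = lambda x: x - target

x_i = adagrad_with_momentum(grad, np.zeros(3), 500, variant="i")
x_ii = adagrad_with_momentum(grad, np.zeros(3), 500, variant="ii")
print(np.linalg.norm(x_i - target), np.linalg.norm(x_ii - target))
```

The only difference between the variants is which quantity feeds the accumulator `v`. The sketch keeps `beta` constant for simplicity; a decaying schedule would replace the fixed value inside the loop.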

The principal results are:

  1. Unified Sublinear Rate: For all the encompassed algorithms under standard smoothness (Lipschitz gradient) and lower-boundedness, and under mild cumulative variance conditions on the unbiased stochastic gradient oracle, the methods achieve uniform convergence rates. Specifically, the minimal expected dual norm of the (preconditioned) gradient up to iteration $k$ decreases as $\mathcal{O}(1/\sqrt{k+1})$, up to logarithmic terms, i.e.,

$$\min_{j \leq k}\mathbb{E}\|G_j\|_* \leq \frac{C}{\sqrt{k+1}}$$

where $C$ depends explicitly on the initial suboptimality, geometry constants, and the structure of the preconditioners.

  2. Variance and Momentum Effects: Under polynomially decaying per-iteration variance (of the form $\mathcal{O}(1/(k+1)^\alpha)$), the rate interpolates between the fastest rate, attained in low-variance regimes (sufficiently large $\alpha$), and slower rates dictated by noise accumulation for smaller $\alpha$. Momentum does not affect the asymptotic rate unless it is non-decaying, in which case the rate may degrade, but decaying momentum can preserve optimal rates.
  3. No Stepsize Restriction: The analysis holds for arbitrary constant stepsizes, provided only that the variance and smoothness constants are controlled, contrasting sharply with many prior proofs that require the stepsize to stay below a problem-dependent threshold.
  4. Blockwise Heterogeneity: Importantly, convergence guarantees are uniform across blocks, implying that mixing different adaptive methods (e.g., Shampoo on matrix weights, AdaGrad on biases) does not harm overall convergence, a nontrivial claim in modular deep architectures.
  5. Coverage of Major Methods: The paper rigorously maps full/diagonal AdaGrad, AdaNorm, Shampoo, and Adaptive Muon (as well as any blockwise mixture) into its abstract framework, explicitly verifying the structural conditions needed for the general theorems.
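To see why the decay exponent of the variance matters, it can help to look at how the accumulated variance grows. The sketch below assumes the per-iteration variance equals exactly $\sigma^2/(j+1)^\alpha$ (the text above only gives an upper bound of this order); the function name and the chosen values of $\alpha$ and $k$ are illustrative, and the way these sums enter the paper's rate is not reproduced here.

```python
import math

def cumulative_variance(alpha, k, sigma2=1.0):
    # Partial sum of sigma2 / (j + 1)**alpha for j = 0, ..., k.
    return sum(sigma2 / (j + 1) ** alpha for j in range(k + 1))

for alpha in (0.5, 1.0, 2.0):
    row = [cumulative_variance(alpha, k) for k in (10, 1_000, 100_000)]
    print(alpha, [round(value, 3) for value in row])
```

For $\alpha > 1$ the partial sums stay bounded, for $\alpha = 1$ they grow logarithmically in $k$, and for $\alpha < 1$ they grow polynomially, on the order of $(k+1)^{1-\alpha}$. This is the sense in which noise accumulates more slowly when the variance decays faster.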

Numerical and Analytical Highlights

The paper presents explicit rates (with logarithmic and constant dependencies) and provides structural inequalities backed by operator-theoretic arguments. Notably, the bounds for quantities such as a potential function over the preconditioners are derived using spectral analysis and Jensen's inequality, yielding compact and interpretable rate expressions.

When momentum is included, a perturbation argument is introduced, bounding the additional errors due to deviation between the true gradient and the momentum-averaged surrogate. The perturbation is shown to be negligible if the momentum decays appropriately, cementing the robustness of the analysis.

The claims regarding removal of stepsize constraints and avoidance of gradient norm upper bounds are prominently highlighted and supported with general arguments, in contrast to much of the nonconvex stochastic optimization literature.

Practical and Theoretical Implications

Practical Implications:

  • Machine learning practitioners can confidently compose blockwise adaptive methods in heterogeneous architectures (e.g., simultaneously using Shampoo and AdaGrad across layers) without risking loss of theoretical guarantees.
  • The removal of bounded-gradient and restrictive stepsize assumptions simplifies implementation; practitioners can set stepsizes based on empirical performance rather than stringent theory-driven thresholds.
  • The analysis supports stochastic training in high-variance settings, relevant to large-batch and distributed learning scenarios.

Theoretical Implications and Future Directions:

  • The operator-theoretic insights open avenues for yet broader classes of preconditioning, including those leveraging higher-order curvature or spectral properties.
  • Extension to constrained problems, approximate block decompositions, or more general geometries (e.g., Riemannian manifolds) appears feasible within this trace-inequality-based framework.
  • The possibility of using different preconditioners per iteration (full generality in the preconditioner mapping) and the role of approximate normalization functions are noted as ripe topics for further study.
  • The convergence rates are essentially optimal (up to logarithmic factors) for the class of problems considered, yet understanding practical performance and stability under severe stochasticity or adversarial conditions remains a key empirical question.

Conclusion

This paper establishes the first rigorous, unified convergence rate analysis for a broad class of adaptive gradient methods (including AdaNorm, full and diagonal AdaGrad, Shampoo, and Muon) applied to nonconvex stochastic objectives. The results apply to methods both with and without momentum, across blockwise heterogeneous settings, and without common restrictive assumptions. The framework delivers quantifiable, sublinear rates and demonstrates the safe composability of adaptive preconditioning mechanisms, an essential theoretical foundation for modern deep learning optimization. The operator-theoretic methodology suggests fertile ground for further algorithmic innovation and theoretical generalization within optimization for machine learning.
