
AdaGrad–Norm: Scalar Adaptive Gradient

Updated 22 April 2026
  • AdaGrad–Norm is a scalar adaptive gradient method that uses a single ℓ₂-norm accumulator to tune global stepsizes, enabling efficient optimization without per-coordinate overhead.
  • The algorithm achieves rigorous convergence bounds across convex, nonconvex, and Riemannian settings, adapting dynamically to problem curvature and noise.
  • Its empirical performance in deep learning and large-scale optimization shows strong robustness to hyperparameter tuning and significant memory efficiency compared to traditional methods.

AdaGrad–Norm is a scalar adaptive gradient method that adapts a single global stepsize based on the cumulative ℓ₂-norm of observed gradients. It is a computationally and memory-efficient specialization of AdaGrad, and has become a canonical choice for the robust adaptation of stepsizes in stochastic first-order optimization, particularly when problem curvature and noise are difficult to estimate a priori. The method is widely used in nonconvex optimization, deep learning, and large-scale settings. AdaGrad–Norm, also known as AdaNorm or “scalar AdaGrad,” is rigorously analyzed in convex, nonconvex, and even Riemannian settings, and under both deterministic and stochastic oracles.

1. Algorithmic Formulation

At each iteration, AdaGrad–Norm tracks a single scalar accumulator of the squared gradient norms and sets the current stepsize proportional to the inverse of its square root. Given iterates $x_t\in\mathbb{R}^d$ and stochastic gradients $g_t$, the canonical update is:

$$b_t^2 = b_{t-1}^2 + \|g_t\|_2^2,\qquad x_{t+1} = x_t - \frac{\eta}{b_t}\,g_t,$$

where $\eta > 0$ is a global stepsize parameter and $b_0 > 0$ initializes the accumulator to avoid division by zero. The per-iteration cost and memory are minimal: only a single scalar needs to be stored, in contrast to the $O(d)$ memory required by per-coordinate AdaGrad or $O(d^2)$ for full-matrix adaptive methods (Ward et al., 2018, Gratton et al., 19 Apr 2026).
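The update rule above can be sketched in a few lines of NumPy. This is a minimal illustration, not a reference implementation: the function name `adagrad_norm`, the callable `grad_fn`, and the diagonal quadratic test problem are all assumptions made for the example.

```python
import numpy as np

def adagrad_norm(grad_fn, x0, eta=1.0, b0=1e-2, num_steps=500):
    """Sketch of AdaGrad-Norm:
    b_t^2 = b_{t-1}^2 + ||g_t||^2,  x_{t+1} = x_t - (eta / b_t) * g_t.
    """
    x = np.asarray(x0, dtype=float).copy()
    b_sq = b0 ** 2  # the only optimizer state: one scalar
    for _ in range(num_steps):
        g = grad_fn(x)
        b_sq += float(np.dot(g, g))         # accumulate the squared l2-norm
        x = x - (eta / np.sqrt(b_sq)) * g   # one global stepsize for all coordinates
    return x, np.sqrt(b_sq)

# Hypothetical test problem: f(x) = 0.5 * sum_i h_i * x_i^2 (deterministic gradients).
h = np.array([1.0, 2.0, 5.0, 10.0])
x_final, b_final = adagrad_norm(lambda x: h * x, np.ones(4))
print(np.linalg.norm(x_final), b_final)
```

The accumulator `b_sq` never decreases, so the effective stepsize `eta / sqrt(b_sq)` is non-increasing over the run.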

Extensions of this rule incorporate momentum, clamping (for very large gradients in unstable regimes), and minor floors to ensure monotonically vanishing stepsizes do not completely stall progress. For matrix-parameterized models (e.g., neural networks), the same scalar adaptation can be combined with structured update directions, notably in the AdaGO/Muon framework (Zhang et al., 3 Sep 2025).

2. Convergence Theory in Convex, Nonconvex, and Stochastic Regimes

AdaGrad–Norm is distinguished by rigorous convergence bounds that match or nearly match the best possible rates in various regimes without the need for manually tuning the learning rate based on the problem’s smoothness or noise variance.

  • Smooth Nonconvex Setting: For $L$-smooth $f$ and an unbiased stochastic gradient oracle with affine variance, AdaGrad–Norm achieves $\mathcal{O}(\log T / \sqrt{T})$ convergence to first-order stationarity, both in expectation and with high probability, with no boundedness assumptions required on the gradients themselves (Ward et al., 2018, Faw et al., 2022, Gratton et al., 19 Apr 2026). In low-noise settings, improved rates can be attained.
  • Strongly Convex / PL Setting: Under $\mu$-strong convexity or the Polyak–Łojasiewicz condition, AdaGrad–Norm exhibits $\mathcal{O}(\log(1/\epsilon))$-type linear convergence, with iteration complexity that closely tracks that of optimally tuned constant-stepsize SGD. This robustness is achieved without knowing $\mu$ or $L$ a priori (Xie et al., 2019).
  • Deterministic Smooth Setting: In the absence of stochastic noise, AdaGrad–Norm achieves the optimal $\mathcal{O}(1/T)$ rate for gradient norm decay or function suboptimality (Ward et al., 2018, Liu et al., 2022).
  • Last Iterate Suboptimality: For convex but potentially nonsmooth problems, the last-iterate suboptimality of AdaGrad–Norm is provably slower than the $\mathcal{O}(1/\sqrt{T})$ rate available for the average of iterates, and this separation is established to be tight (Preobrazhenskaia et al., 12 Apr 2026).

Convergence theorems typically place minimal restrictions on the initialization $b_0$ and base step $\eta$, with the algorithm self-correcting to the appropriate stepsize scale via online adaptation (Faw et al., 2022, Ward et al., 2018).
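One way to see this self-correction in the deterministic $L$-smooth case, where $g_t = \nabla f(x_t)$, is through the standard descent lemma applied with stepsize $\eta / b_t$. The following is a textbook sketch under those assumptions, not a reproduction of the cited proofs:

```latex
f(x_{t+1}) \;\le\; f(x_t) - \frac{\eta}{b_t}\left(1 - \frac{\eta L}{2\, b_t}\right)\|\nabla f(x_t)\|_2^2 .
```

Because $b_t$ is nondecreasing, one of two things happens. Either $b_t$ stays below $\eta L$ forever, in which case $\sum_{s \le t}\|\nabla f(x_s)\|_2^2 = b_t^2 - b_0^2 < \eta^2 L^2$ is bounded, or $b_t$ eventually reaches $\eta L$, after which every step decreases $f$ by at least $\frac{\eta}{2 b_t}\|\nabla f(x_t)\|_2^2$. In both cases no prior knowledge of $L$ is needed to choose $\eta$ or $b_0$.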

3. Practical Variants, Extensions, and Memory-Adaptation Tradeoffs

AdaGrad–Norm provides a baseline upon which more elaborate adaptive schemes are constructed:

| Variant | Stepsize Adaptation | Memory Complexity |
|---|---|---|
| AdaGrad–Norm (scalar) | $\eta / b_t$ | $O(1)$ |
| AdaGrad–Coordinate | $\eta / b_{t,i}$ (per-dim) | $O(d)$ |
| Subset-Norm (SN) | $\eta / b_{t,j}$ (groups) | $O(k)$, $k \le d$ groups |
| Full-Matrix AdaGrad | $\eta\, G_t^{-1/2}$ | $O(d^2)$ |

  • Subset-Norm partitions coordinates into groups, permitting a tradeoff between memory consumption and adaptation granularity ($O(k)$ memory with $k$ groups). It interpolates between AdaGrad–Norm and AdaGrad–Coordinate and can leverage inhomogeneous noise structure for improved high-probability convergence (Nguyen et al., 2024).
  • Matrix-Structured Directions: AdaGO (AdaGrad Orthogonalized) combines AdaGrad–Norm-style stepsize adaptation with Muon’s orthogonalized momentum, preserving spectral-descent properties for matrix learning problems, and recovers optimal rates for nonconvex and full-batch settings (Zhang et al., 3 Sep 2025).
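The interpolation that Subset-Norm performs between the scalar and per-coordinate rules can be illustrated with a single step that keeps one accumulator per coordinate group. This is a simplified sketch: the function `subset_norm_step`, the integer array `groups` mapping each coordinate to a group index, and the toy inputs are assumptions for illustration and do not reproduce the partitioning scheme of Nguyen et al.

```python
import numpy as np

def subset_norm_step(x, g, acc_sq, groups, eta):
    """One grouped-accumulator step.

    acc_sq holds one squared-norm accumulator per group (length k);
    groups[i] is the group index of coordinate i (hypothetical layout).
    """
    # Sum squared gradient entries within each group.
    g_sq = np.bincount(groups, weights=g * g, minlength=acc_sq.size)
    acc_sq = acc_sq + g_sq
    step = eta / np.sqrt(acc_sq)        # one stepsize per group
    return x - step[groups] * g, acc_sq

d, b0, eta = 6, 1e-2, 0.5
x = np.ones(d)
g = np.arange(1.0, d + 1.0)

# k = 1: a single group recovers the scalar AdaGrad-Norm step.
x_scalar, _ = subset_norm_step(x, g, np.full(1, b0**2), np.zeros(d, dtype=int), eta)
# k = d: one group per coordinate recovers per-coordinate AdaGrad.
x_coord, _ = subset_norm_step(x, g, np.full(d, b0**2), np.arange(d), eta)
```

Choosing the number of groups between these two extremes trades optimizer memory against adaptation granularity.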

4. Geometry-Awareness and Riemannian Generalizations

The core construction of AdaGrad–Norm extends to manifold optimization. MAdaGrad employs a Riemannian gradient and exponential map to perform updates of the form $x_{t+1} = \exp_{x_t}\!\left(-\frac{\eta}{b_t}\,\operatorname{grad} f(x_t)\right)$ with $b_t^2 = b_{t-1}^2 + \|\operatorname{grad} f(x_t)\|_{x_t}^2$. Complexity results mirror the Euclidean case, with separate bounds for nonconvex objectives, geodesically convex objectives, and objectives satisfying a Riemannian PL condition (Bento et al., 24 Sep 2025).

On SPD manifolds, this yields sharp complexity bounds and empirically outperforms Riemannian gradient descent with line search.
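To make the manifold version concrete, here is a small sketch on the unit sphere rather than the SPD manifold, chosen only because its exponential map has a simple closed form. The objective (a Rayleigh quotient $x^\top A x$), the matrix `A`, and the function names are assumptions for illustration and are not taken from Bento et al.

```python
import numpy as np

def sphere_exp(x, v):
    """Exponential map on the unit sphere at x, applied to tangent vector v."""
    nv = np.linalg.norm(v)
    if nv < 1e-16:
        return x
    return np.cos(nv) * x + np.sin(nv) * (v / nv)

def riemannian_adagrad_norm(A, x0, eta=1.0, b0=1e-2, num_steps=500):
    """Scalar-accumulator steps for minimizing x^T A x over the unit sphere."""
    x = x0 / np.linalg.norm(x0)
    b_sq = b0 ** 2
    for _ in range(num_steps):
        egrad = 2.0 * (A @ x)                  # Euclidean gradient
        rgrad = egrad - np.dot(x, egrad) * x   # projection onto the tangent space at x
        b_sq += float(np.dot(rgrad, rgrad))    # accumulate the Riemannian gradient norm
        x = sphere_exp(x, -(eta / np.sqrt(b_sq)) * rgrad)
        x = x / np.linalg.norm(x)              # guard against round-off drift
    return x

A = np.diag([1.0, 2.0, 3.0, 4.0])
x_min = riemannian_adagrad_norm(A, np.ones(4))
print(x_min @ A @ x_min)  # approaches the smallest eigenvalue of A
```

On the sphere the Riemannian gradient is the tangential component of the Euclidean gradient, so the only change from the Euclidean rule is the projection and the exponential-map step.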

5. Stability, Asymptotics, and Advanced Analysis

Recent work establishes rigorous stability results for AdaGrad–Norm in smooth nonconvex optimization: under mild coercivity and smoothness conditions, the method maintains bounded iterates and function values almost surely, and both almost sure and mean-square convergence of the gradient norm to zero are guaranteed (Jin et al., 5 Jan 2026).

Analysis frameworks demonstrate that the adaptation mechanism (via the $\ell_2$-norm accumulator) restores crucial summability conditions for stepsizes that are absent in per-coordinate AdaGrad, enabling ODE-method and Lyapunov techniques to establish stability and convergence results without extra boundedness assumptions (Jin et al., 5 Jan 2026, Gratton et al., 19 Apr 2026).

In online convex optimization, AdaGrad–Norm and its matrix versions are shown to match the corresponding adaptive regret bounds by designing varying-norm FTRL algorithms, and full-matrix AdaGrad can even be “auto-tuned” without offline oracle choices of learning rate via a norm-adaptive reduction (Cutkosky, 2020).

6. Empirical Performance, Robustness, and Use-Cases

Empirical studies validate AdaGrad–Norm’s practical advantages over standard SGD and per-coordinate AdaGrad:

  • Robustness to Hyperparameters: Performance is insensitive to initialization and the stepsize parameter over orders of magnitude, with self-tuning towards optimal regimes; in contrast, vanilla SGD may diverge with mis-tuned rates (Ward et al., 2018, Xie et al., 2019).
  • Deep Learning: On standard benchmarks (MNIST, CIFAR-10, ImageNet, various architectures), AdaGrad–Norm matches the performance of best-tuned SGD without hand-tuning the learning rate (Ward et al., 2018).
  • Large-Scale, Resource-Constrained Training: Subset-Norm and memory-adapted variants of AdaGrad–Norm are leveraged in modern LLM pretraining and fine-tuning for substantial memory reductions without loss of optimizer performance (Nguyen et al., 2024).
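The hyperparameter robustness described above can be reproduced on a toy problem. The sketch below compares plain gradient descent with the scalar adaptive rule on a deterministic quadratic $f(x) = \tfrac{L}{2}\|x\|_2^2$ for stepsize parameters well above the stability threshold $2/L$. The problem, the constant `L_SMOOTH`, and the function names are assumptions chosen for illustration, not experiments from the cited papers.

```python
import numpy as np

L_SMOOTH = 10.0  # hypothetical curvature; plain GD is unstable for eta > 2 / L_SMOOTH

def grad(x):
    return L_SMOOTH * x

def run_gd(eta, x0, num_steps):
    """Fixed-stepsize gradient descent; returns the final iterate norm."""
    x = x0.copy()
    for _ in range(num_steps):
        x = x - eta * grad(x)
    return np.linalg.norm(x)

def run_adagrad_norm(eta, x0, num_steps, b0=0.1):
    """Scalar-accumulator adaptive rule; returns the final iterate norm."""
    x, b_sq = x0.copy(), b0 ** 2
    for _ in range(num_steps):
        g = grad(x)
        b_sq += float(np.dot(g, g))
        x = x - (eta / np.sqrt(b_sq)) * g
    return np.linalg.norm(x)

x0 = np.ones(5)
results = {eta: (run_gd(eta, x0, 50), run_adagrad_norm(eta, x0, 50))
           for eta in (1.0, 10.0, 100.0)}
for eta, (gd_norm, ada_norm) in results.items():
    print(f"eta={eta:g}  GD ||x||={gd_norm:.3e}  AdaGrad-Norm ||x||={ada_norm:.3e}")
```

With an oversized `eta`, the adaptive rule may overshoot in its first few steps, but the large gradients this produces inflate the accumulator and pull the effective stepsize back into the stable range. Note that this toy only probes the large-stepsize side; a very small `eta` still yields slow progress.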

7. Limitations and Relationship to Other Adaptive Methods

While AdaGrad–Norm provides sharp convergence guarantees for stationary-point finding, its last-iterate suboptimality rate in convex nonsmooth optimization is strictly slower than the rate classical subgradient methods attain for the averaged iterate. The bound for the final point is tight, revealing a gap between adaptive and non-adaptive methods in this metric (Preobrazhenskaia et al., 12 Apr 2026).

AdaGrad–Norm can be further enhanced via acceleration, hybridization with momentum, or as a component in frameworks enabling groupwise or full-matrix adaptation (Gratton et al., 19 Apr 2026, Zhang et al., 3 Sep 2025). The method’s adaptability is highly robust, with less reliance on problem-dependent knowledge compared to classical, non-adaptive first-order methods.


References (arXiv IDs):

(Ward et al., 2018, Xie et al., 2019, Cutkosky, 2020, Faw et al., 2022, Liu et al., 2022, Nguyen et al., 2024, Zhang et al., 3 Sep 2025, Bento et al., 24 Sep 2025, Jin et al., 5 Jan 2026, Preobrazhenskaia et al., 12 Apr 2026, Gratton et al., 19 Apr 2026)
