AdaGrad–Norm: Scalar Adaptive Gradient
- AdaGrad–Norm is a scalar adaptive gradient method that uses a single ℓ₂-norm accumulator to tune global stepsizes, enabling efficient optimization without per-coordinate overhead.
- The algorithm achieves rigorous convergence bounds across convex, nonconvex, and Riemannian settings, adapting dynamically to problem curvature and noise.
- Its empirical performance in deep learning and large-scale optimization shows strong robustness to hyperparameter tuning and significant memory efficiency compared to traditional methods.
AdaGrad–Norm is a scalar adaptive gradient method that adapts a single global stepsize based on the cumulative ℓ₂-norm of observed gradients. It is a computationally and memory-efficient specialization of AdaGrad, and has become a canonical choice for the robust adaptation of stepsizes in stochastic first-order optimization, particularly when problem curvature and noise are difficult to estimate a priori. The method is widely used in nonconvex optimization, deep learning, and large-scale settings. AdaGrad–Norm, also known as AdaNorm or “scalar AdaGrad,” is rigorously analyzed in convex, nonconvex, and even Riemannian settings, and under both deterministic and stochastic oracles.
1. Algorithmic Formulation
At each iteration, AdaGrad–Norm tracks a single scalar accumulator of the squared gradient norms and sets the current stepsize proportional to the inverse of its square root. Given iterates $x_t$ and stochastic gradients $g_t$, the canonical update is $b_{t+1}^2 = b_t^2 + \|g_t\|^2$, $x_{t+1} = x_t - \frac{\eta}{b_{t+1}}\, g_t$, where $\eta > 0$ is a global stepsize parameter and $b_0 > 0$ initializes the accumulator to avoid division by zero. The per-iteration cost and memory are minimal: only a single scalar needs to be stored, in contrast to the $O(d)$ memory required by per-coordinate AdaGrad or $O(d^2)$ for full-matrix adaptive methods (Ward et al., 2018, Gratton et al., 19 Apr 2026).
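The update described above can be sketched in a few lines of plain Python. This is a minimal illustration, not a reference implementation: the function name `adagrad_norm`, its default values, and the quadratic test problem are assumptions made for the example. The accumulator is updated before the step, so the stepsize is the global parameter divided by the square root of the running sum of squared gradient norms.

```python
import math

def adagrad_norm(grad, x0, eta=1.0, b0=1e-2, steps=100):
    """Run AdaGrad-Norm; grad(x) returns a (possibly stochastic) gradient as a list."""
    x = list(x0)
    b_sq = b0 * b0  # the single scalar accumulator, initialised to b0^2 > 0
    for _ in range(steps):
        g = grad(x)
        b_sq += sum(gi * gi for gi in g)   # add the squared l2-norm of the gradient
        step = eta / math.sqrt(b_sq)       # one global stepsize for all coordinates
        x = [xi - step * gi for xi, gi in zip(x, g)]
    return x, math.sqrt(b_sq)

# Hypothetical test problem: f(x) = 0.5 * ||x||^2, whose gradient is x itself.
x_final, b_final = adagrad_norm(lambda x: list(x), x0=[3.0, -4.0], eta=1.0, steps=200)
```

Only `b_sq` persists between iterations, which is the constant-memory property the text refers to.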
Extensions of this rule incorporate momentum, clamping (for very large gradients in unstable regimes), and minor floors to ensure monotonically vanishing stepsizes do not completely stall progress. For matrix-parameterized models (e.g., neural networks), the same scalar adaptation can be combined with structured update directions, notably in the AdaGO/Muon framework (Zhang et al., 3 Sep 2025).
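One way such extensions might be combined is sketched below. The specific choices here (an exponential-moving-average momentum buffer, gradient-norm clipping at `clip`, and a stepsize floor `min_step`) are assumptions made for illustration; the cited works may define these components differently.

```python
import math

def adagrad_norm_variant(grad, x0, eta=1.0, b0=1e-2, beta=0.9,
                         clip=10.0, min_step=1e-4, steps=100):
    """AdaGrad-Norm with illustrative momentum, gradient clamping and a stepsize floor."""
    x = list(x0)
    m = [0.0] * len(x)          # momentum buffer (assumed EMA form)
    b_sq = b0 * b0
    for _ in range(steps):
        g = grad(x)
        g_norm = math.sqrt(sum(gi * gi for gi in g))
        if g_norm > clip:       # clamp very large gradients to norm `clip`
            g = [gi * clip / g_norm for gi in g]
            g_norm = clip
        b_sq += g_norm * g_norm
        step = max(eta / math.sqrt(b_sq), min_step)   # floor keeps the stepsize from vanishing
        m = [beta * mi + (1.0 - beta) * gi for mi, gi in zip(m, g)]
        x = [xi - step * mi for xi, mi in zip(x, m)]
    return x

# Hypothetical usage on f(x) = 0.5 * ||x||^2.
x_variant = adagrad_norm_variant(lambda x: list(x), [3.0, -4.0], steps=500)
```

Note that a stepsize floor changes the method's behaviour in the limit, so convergence guarantees stated for the plain rule need not carry over to this sketch.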
2. Convergence Theory in Convex, Nonconvex, and Stochastic Regimes
AdaGrad–Norm is distinguished by rigorous convergence bounds that match or nearly match the best possible rates in various regimes without the need for manually tuning the learning rate based on the problem’s smoothness or noise variance.
- Smooth Nonconvex Setting: For an $L$-smooth objective and an unbiased stochastic gradient oracle with affine variance, AdaGrad–Norm achieves $\tilde{O}(1/\sqrt{T})$ convergence to first-order stationarity in expectation and with high probability, with no boundedness assumptions required on the gradients themselves (Ward et al., 2018, Faw et al., 2022, Gratton et al., 19 Apr 2026). In low-noise settings, improved $\tilde{O}(1/T)$ rates can be attained.
- Strongly Convex / PL Setting: Under $\mu$-strong convexity or the Polyak–Łojasiewicz condition, AdaGrad–Norm exhibits $O(\log(1/\epsilon))$-type linear convergence, with iteration complexity that closely tracks that of constant-stepsize SGD with an optimally tuned stepsize. This robustness is achieved without knowing $\mu$ or $L$ a priori (Xie et al., 2019).
- Deterministic Smooth Setting: In the absence of stochastic noise, AdaGrad–Norm achieves the optimal $O(1/T)$ rate for the decay of the squared gradient norm or of the function suboptimality (Ward et al., 2018, Liu et al., 2022).
- Last Iterate Suboptimality: For convex but potentially nonsmooth problems, the last-iterate suboptimality of AdaGrad–Norm provably decays at a strictly slower rate than the one available for the average of iterates, and this separation is established to be tight (Preobrazhenskaia et al., 12 Apr 2026).
Convergence theorems typically place minimal restrictions on the initialization $b_0$ and base stepsize $\eta$, with the algorithm self-correcting to the appropriate stepsize scale via online adaptation (Faw et al., 2022, Ward et al., 2018).
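The self-correction can be made concrete in the noiseless case with the standard descent lemma. The following is a sketch under stated assumptions: $f$ is $L$-smooth, the gradient is exact, and the update is $x_{t+1} = x_t - \frac{\eta}{b_{t+1}} \nabla f(x_t)$ with $b_{t+1}^2 = b_t^2 + \|\nabla f(x_t)\|^2$.

```latex
\begin{align*}
f(x_{t+1})
  &\le f(x_t) - \frac{\eta}{b_{t+1}} \|\nabla f(x_t)\|^2
       + \frac{L}{2} \cdot \frac{\eta^2}{b_{t+1}^2} \|\nabla f(x_t)\|^2 \\
  &=   f(x_t) - \frac{\eta}{b_{t+1}}
       \Bigl( 1 - \frac{\eta L}{2\, b_{t+1}} \Bigr) \|\nabla f(x_t)\|^2 .
\end{align*}
```

Once $b_{t+1} \ge \eta L$, the bracket is at least $1/2$, so each step decreases $f$ by at least $\frac{\eta}{2 b_{t+1}} \|\nabla f(x_t)\|^2$. While $b_{t+1} < \eta L$, the accumulator keeps growing by $\|\nabla f(x_t)\|^2$ per iteration, so an overly aggressive $\eta$ is corrected without knowing $L$.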
3. Practical Variants, Extensions, and Memory-Adaptation Tradeoffs
AdaGrad–Norm provides a baseline upon which more elaborate adaptive schemes are constructed:
| Variant | Stepsize Adaptation | Memory Complexity |
|---|---|---|
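| AdaGrad–Norm (scalar) | $\eta / \sqrt{b_0^2 + \sum_{s \le t} \lVert g_s \rVert^2}$ | $O(1)$ |
| AdaGrad–Coordinate | $\eta / \sqrt{b_0^2 + \sum_{s \le t} g_{s,i}^2}$ (per-dim) | $O(d)$ |
| Subset-Norm (SN) | $\eta / \sqrt{b_0^2 + \sum_{s \le t} \lVert g_{s,\Psi_j} \rVert^2}$ (groups $\Psi_j$) | $O(k)$, $k$ groups |
| Full-Matrix AdaGrad | $\eta \bigl(\sum_{s \le t} g_s g_s^\top\bigr)^{-1/2}$ | $O(d^2)$ |
- Subset-Norm partitions coordinates into groups, permitting a tradeoff between memory consumption and adaptation granularity ($O(\sqrt{d})$ memory with $\sqrt{d}$ groups). It interpolates between AdaGrad–Norm and AdaGrad–Coordinate and can leverage inhomogeneous noise structure for improved high-probability convergence (Nguyen et al., 2024).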
- Matrix-Structured Directions: AdaGO (AdaGrad Orthogonalized) combines AdaGrad–Norm-style stepsize adaptation with Muon’s orthogonalized momentum, preserving spectral-descent properties for matrix learning problems, and recovers optimal rates for nonconvex and full-batch settings (Zhang et al., 3 Sep 2025).
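The memory and granularity tradeoff of the grouped variant can be illustrated with a short sketch. The helper names `subset_norm_step` and `contiguous_groups`, the contiguous partition, and the toy gradient are assumptions for the example; the cited paper may partition coordinates differently.

```python
import math

def subset_norm_step(x, g, acc_sq, groups, eta=1.0):
    """One grouped step: each coordinate group keeps its own squared-norm accumulator."""
    x_new = list(x)
    for j, idx in enumerate(groups):
        acc_sq[j] += sum(g[i] * g[i] for i in idx)   # accumulate this group's squared norm
        step = eta / math.sqrt(acc_sq[j])            # stepsize shared within the group
        for i in idx:
            x_new[i] = x[i] - step * g[i]
    return x_new

def contiguous_groups(d, k):
    """Split range(d) into at most k contiguous groups of near-equal size."""
    size = math.ceil(d / k)
    return [list(range(s, min(s + size, d))) for s in range(0, d, size)]

d = 16
groups = contiguous_groups(d, k=4)      # sqrt(d) groups, hence 4 accumulators
acc_sq = [1e-4] * len(groups)           # initial accumulator value for every group
x = [1.0] * d
g = [float(i) for i in range(d)]        # hypothetical gradient
x = subset_norm_step(x, g, acc_sq, groups)
```

With a single group this reduces to the scalar rule, and with one group per coordinate it reduces to the per-coordinate rule, which is the interpolation described above.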
4. Geometry-Awareness and Riemannian Generalizations
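The core construction of AdaGrad–Norm extends to manifold optimization. MAdaGrad employs a Riemannian gradient and exponential map to perform updates: $x_{t+1} = \exp_{x_t}\bigl(-\tfrac{\eta}{b_{t+1}} \operatorname{grad} f(x_t)\bigr)$ with $b_{t+1}^2 = b_t^2 + \lVert \operatorname{grad} f(x_t) \rVert_{x_t}^2$. Complexity results mirror the Euclidean case: $O(\epsilon^{-2})$ for nonconvex, $O(\epsilon^{-1})$ for geodesically convex, and $O(\log(1/\epsilon))$ under a Riemannian PL condition (Bento et al., 24 Sep 2025).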
On manifolds of symmetric positive definite (SPD) matrices, the method retains sharp complexity bounds and empirically outperforms Riemannian gradient descent with line search.
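To make the Riemannian update concrete, here is a minimal sketch on the unit sphere rather than on an SPD manifold; the sphere is chosen only because its exponential map has a simple closed form. The function names and the diagonal quadratic objective are assumptions for illustration.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def sphere_exp(x, v):
    """Exponential map on the unit sphere: follow the geodesic from x along tangent v."""
    n = math.sqrt(dot(v, v))
    if n == 0.0:
        return list(x)
    y = [math.cos(n) * xi + math.sin(n) * vi / n for xi, vi in zip(x, v)]
    norm_y = math.sqrt(dot(y, y))
    return [yi / norm_y for yi in y]   # renormalise to counter round-off

def riemannian_adagrad_norm(diag, x0, eta=1.0, b0=1e-2, steps=500):
    """Minimise f(x) = sum_i diag[i] * x_i^2 over the unit sphere."""
    x = list(x0)
    b_sq = b0 * b0
    for _ in range(steps):
        egrad = [2.0 * a * xi for a, xi in zip(diag, x)]         # Euclidean gradient
        radial = dot(egrad, x)
        rgrad = [gi - radial * xi for gi, xi in zip(egrad, x)]   # project onto tangent space
        b_sq += dot(rgrad, rgrad)                                # accumulate Riemannian norm
        step = eta / math.sqrt(b_sq)
        x = sphere_exp(x, [-step * gi for gi in rgrad])
    return x

diag = [1.0, 2.0, 3.0]
s = 1.0 / math.sqrt(3.0)
x = riemannian_adagrad_norm(diag, [s, s, s])
value = sum(a * xi * xi for a, xi in zip(diag, x))   # approaches the smallest entry of diag
```

The only changes relative to the Euclidean rule are the tangent-space projection of the gradient and the replacement of the straight-line step by the exponential map.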
5. Stability, Asymptotics, and Advanced Analysis
Recent work establishes rigorous stability results for AdaGrad–Norm in smooth nonconvex optimization: under mild coercivity and smoothness conditions, the method maintains bounded iterates and function values almost surely, and both almost-sure and mean-square convergence of the gradient norm to zero are guaranteed (Jin et al., 5 Jan 2026).
Analysis frameworks demonstrate that the adaptation mechanism (via the ℓ₂-norm accumulator) restores crucial summability conditions for the stepsizes that are absent in per-coordinate AdaGrad, enabling ODE-method and Lyapunov techniques to establish stability and convergence results without extra boundedness assumptions (Jin et al., 5 Jan 2026, Gratton et al., 19 Apr 2026).
In online convex optimization, AdaGrad–Norm and its matrix versions are shown to match AdaGrad-style regret bounds, which scale as $\sqrt{\sum_{t} \lVert g_t \rVert^2}$ in the scalar case, by designing varying-norm FTRL algorithms. Full-matrix AdaGrad can even be “auto-tuned” without offline oracle choices of learning rate via a norm-adaptive reduction (Cutkosky, 2020).
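As a small illustration of the online setting, the sketch below runs projected online gradient descent with AdaGrad–Norm stepsizes on one-dimensional linear losses and compares the regret with the classical bound $\sqrt{2}\, D \sqrt{\sum_t g_t^2}$ for a domain of diameter $D$. This is the textbook projected-gradient scheme, not the FTRL construction of the cited work; the function name and the loss sequence are assumptions.

```python
import math

def ogd_adagrad_norm_regret(grads, radius=1.0):
    """Projected online gradient descent on [-radius, radius] with AdaGrad-Norm stepsizes.

    Losses are linear, loss_t(x) = grads[t] * x. Returns (regret, bound)."""
    diameter = 2.0 * radius
    x, acc_sq, total_loss = 0.0, 0.0, 0.0
    for g in grads:
        total_loss += g * x                    # suffer the loss at the current point
        acc_sq += g * g                        # scalar accumulator of squared gradients
        if acc_sq > 0.0:
            step = diameter / math.sqrt(2.0 * acc_sq)
            x = max(-radius, min(radius, x - step * g))   # gradient step, then projection
    best_fixed = -radius * abs(sum(grads))     # loss of the best fixed point in hindsight
    regret = total_loss - best_fixed
    bound = math.sqrt(2.0) * diameter * math.sqrt(acc_sq)
    return regret, bound

# Hypothetical loss sequence: blocks of three gradients alternating between 1.0 and -0.5.
grads = [1.0 if (t // 3) % 2 == 0 else -0.5 for t in range(300)]
regret, bound = ogd_adagrad_norm_regret(grads)
```

The bound depends on the observed gradients rather than on a worst-case Lipschitz constant, which is the sense in which the regret is adaptive.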
6. Empirical Performance, Robustness, and Use-Cases
Empirical studies validate AdaGrad–Norm’s practical advantages over standard SGD and per-coordinate AdaGrad:
- Robustness to Hyperparameters: Performance is insensitive to initialization and the stepsize parameter over orders of magnitude, with self-tuning towards optimal regimes; in contrast, vanilla SGD may diverge with mis-tuned rates (Ward et al., 2018, Xie et al., 2019).
- Deep Learning: On standard benchmarks (MNIST, CIFAR-10, ImageNet, various architectures), AdaGrad–Norm matches the performance of best-tuned SGD without hand-tuning of the learning rate (Ward et al., 2018).
- Large-Scale, Resource-Constrained Training: Subset-Norm and memory-adapted variants of AdaGrad–Norm are leveraged in modern LLM pretraining and fine-tuning for substantial memory reductions without loss of optimizer performance (Nguyen et al., 2024).
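The robustness claim can be checked on a toy problem. The sketch below compares fixed-stepsize gradient descent with AdaGrad–Norm on a hypothetical ill-conditioned quadratic, using stepsize parameters that are far too large for plain gradient descent. The problem, the constants, and the function names are assumptions for illustration and say nothing about the benchmarks cited above.

```python
import math

CURVATURES = [100.0, 1.0]   # hypothetical quadratic f(x) = 0.5 * sum(c_i * x_i^2)

def f(x):
    return 0.5 * sum(c * xi * xi for c, xi in zip(CURVATURES, x))

def grad(x):
    return [c * xi for c, xi in zip(CURVATURES, x)]

def run_gd(eta, steps):
    """Plain gradient descent with a fixed stepsize eta; returns the final objective."""
    x = [1.0, 1.0]
    for _ in range(steps):
        x = [xi - eta * gi for xi, gi in zip(x, grad(x))]
    return f(x)

def run_adagrad_norm(eta, steps, b0=0.1):
    """AdaGrad-Norm with stepsize parameter eta; returns the final objective."""
    x, b_sq = [1.0, 1.0], b0 * b0
    for _ in range(steps):
        g = grad(x)
        b_sq += sum(gi * gi for gi in g)
        step = eta / math.sqrt(b_sq)
        x = [xi - step * gi for xi, gi in zip(x, g)]
    return f(x)

for eta in (1.0, 10.0, 100.0):
    print(f"eta={eta:6.1f}  GD(20 steps)={run_gd(eta, 20):.3e}  "
          f"AdaGrad-Norm(3000 steps)={run_adagrad_norm(eta, 3000):.3e}")
```

In this setup gradient descent blows up for every listed value, while the accumulator absorbs the first large gradients and brings the effective stepsize into the stable range. Very small values of the stepsize parameter are a different matter: the method remains stable but can make slow progress along low-curvature directions.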
7. Limitations and Relationship to Other Adaptive Methods
While AdaGrad–Norm provides sharp convergence guarantees for stationary-point finding, its last-iterate suboptimality rate in convex nonsmooth optimization is strictly slower than the rate classical subgradient methods attain for the averaged iterate. The corresponding bound for the final point is tight, revealing a gap between adaptive and non-adaptive methods in this metric (Preobrazhenskaia et al., 12 Apr 2026).
AdaGrad–Norm can be further enhanced via acceleration or hybridization with momentum, or used as a component in frameworks enabling groupwise or full-matrix adaptation (Gratton et al., 19 Apr 2026, Zhang et al., 3 Sep 2025). Its adaptation relies less on problem-dependent knowledge, such as smoothness constants or noise levels, than classical non-adaptive first-order methods do.
References (arXiv IDs):
(Ward et al., 2018, Xie et al., 2019, Cutkosky, 2020, Faw et al., 2022, Liu et al., 2022, Nguyen et al., 2024, Zhang et al., 3 Sep 2025, Bento et al., 24 Sep 2025, Jin et al., 5 Jan 2026, Preobrazhenskaia et al., 12 Apr 2026, Gratton et al., 19 Apr 2026)