AdaGrad Preconditioning Overview

Updated 22 April 2026
  • AdaGrad preconditioning is an adaptive optimization technique that scales gradient updates based on accumulated squared gradients for improved convergence.
  • Its diagonal and structured variants efficiently handle high-dimensional data in large-scale machine learning while balancing computational cost.
  • Recent advancements extend AdaGrad to low-rank and Kronecker-factored preconditioners, offering enhanced performance and robust regret bounds.

AdaGrad Preconditioning is a foundational mechanism in the class of adaptive gradient-based optimization algorithms, designed to dynamically adapt learning rates along each parameter direction during high-dimensional stochastic optimization. The core idea is to accelerate convergence by reducing the effective condition number of the problem via data-dependent, per-coordinate or matrix-valued scaling, closely related to preconditioning in numerical linear algebra. AdaGrad preconditioning arises from maintaining accumulators of past squared gradients and using these statistics to normalize the current gradient step. Its diagonal form is widely used in large-scale machine learning, while recent advancements extend AdaGrad preconditioning to structured and low-rank forms to efficiently approximate full-matrix curvature while controlling computation and memory costs.

1. Mathematical Formulation and Connection to Curvature

Let $x_t\in\mathbb{R}^d$ denote the parameters at iteration $t$ and $g_t=\nabla f_t(x_t)$ the observed gradient. The generic preconditioned gradient descent update is

$$x_{t+1} = x_t - \eta\,M_t\,g_t$$

where $\eta>0$ is the global stepsize and $M_t$ is a positive definite preconditioning matrix.

Diagonal AdaGrad Preconditioner

Classical AdaGrad maintains entrywise accumulators:

$$r_t = r_{t-1} + g_t \odot g_t, \qquad M_t = \operatorname{diag}\big((r_t+\varepsilon)^{-1/2}\big)$$

The update reads $x_{t+1} = x_t - \eta\,M_t\,g_t$, where $\varepsilon>0$ stabilizes the inversion.

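Near a minimizer $x^\star$ with local Hessian $H$, $g_t \approx H\,(x_t - x^\star)$, so the running sum $r_t$ approximates the local Hessian row norms, making $M_t$ an online, data-driven preconditioner that approximates the inverse of this per-coordinate curvature scale (Ye, 2024).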
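The diagonal recursion above can be sketched in a few lines of NumPy. This is a minimal illustration, not a reference implementation: the function name `adagrad_diag`, the default hyperparameters, and the badly scaled quadratic used as a test problem are all assumptions made for the example.

```python
import numpy as np

def adagrad_diag(grad_fn, x0, eta=0.5, eps=1e-8, steps=200):
    """Diagonal AdaGrad: r_t = r_{t-1} + g_t * g_t, M_t = diag((r_t + eps)^(-1/2))."""
    x = np.asarray(x0, dtype=float).copy()
    r = np.zeros_like(x)                  # entrywise accumulator r_t
    for _ in range(steps):
        g = grad_fn(x)
        r += g * g                        # accumulate squared gradients
        x -= eta * g / np.sqrt(r + eps)   # x_{t+1} = x_t - eta * M_t g_t
    return x

# Hypothetical test problem: f(x) = 0.5 * sum(a_i * x_i^2) with very different scales.
a = np.array([100.0, 1.0, 0.01])
x_first = adagrad_diag(lambda x: a * x, x0=np.ones(3), steps=1)
x_final = adagrad_diag(lambda x: a * x, x0=np.ones(3))
```

On this problem the per-coordinate normalization cancels the scales `a_i` almost exactly, so every coordinate takes a first step of size close to `eta` regardless of how steep its direction is.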

Full-matrix Preconditioning

Full-matrix AdaGrad accumulates

$$G_t = G_{t-1} + g_t g_t^\top$$

and updates

$$x_{t+1} = x_t - \eta\,(G_t + \varepsilon I)^{-1/2}\,g_t$$

This scheme directly approximates the inverse Fisher or Hessian in expectation, but at prohibitive cost for large $d$ (Matveeva et al., 28 Aug 2025).
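A minimal sketch of the full-matrix scheme, assuming the standard form in which the accumulated outer products are regularized by $\varepsilon I$ and the inverse square root is taken by eigendecomposition. The helper names `inv_sqrt_psd` and `adagrad_full` and the 2-by-2 test matrix are hypothetical. The eigendecomposition in every step is what makes the cost cubic in the dimension.

```python
import numpy as np

def inv_sqrt_psd(A):
    """Inverse square root of a symmetric positive definite matrix."""
    w, V = np.linalg.eigh(A)              # A = V diag(w) V^T
    return (V / np.sqrt(w)) @ V.T         # V diag(w^(-1/2)) V^T

def adagrad_full(grad_fn, x0, eta=0.5, eps=1e-8, steps=50):
    x = np.asarray(x0, dtype=float).copy()
    d = x.size
    G = np.zeros((d, d))                  # d x d accumulator: O(d^2) memory
    for _ in range(steps):
        g = grad_fn(x)
        G += np.outer(g, g)               # accumulate outer products g g^T
        M = inv_sqrt_psd(G + eps * np.eye(d))   # O(d^3) work per step
        x -= eta * M @ g
    return x

# Hypothetical ill-conditioned quadratic f(x) = 0.5 * x^T A x (eigenvalues 3.9 and 0.1).
A = np.array([[2.0, 1.9], [1.9, 2.0]])
x_run = adagrad_full(lambda x: A @ x, x0=np.array([1.0, 0.5]))
```

Because the accumulator always contains the current outer product, each step has Euclidean length at most `eta`, which keeps the iterates bounded even in directions that have not been explored before.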

2. Structured and Low-rank Variants

To overcome the inefficiency of full-matrix schemes, various structured preconditioners interpolate between full-matrix and diagonal forms.

Key Structured Forms

| Variant | Preconditioner $M_t$ | Memory | Time per iteration |
|---|---|---|---|
| AdaGrad-Norm | $\big(\sum_{s\le t}\lVert g_s\rVert^2+\varepsilon\big)^{-1/2} I$ (scalar) | $O(1)$ | $O(d)$ |
| Diagonal AdaGrad | $\operatorname{diag}\big((r_t+\varepsilon)^{-1/2}\big)$ | $O(d)$ | $O(d)$ |
| Full-matrix AdaGrad | $(G_t+\varepsilon I)^{-1/2}$ | $O(d^2)$ | $O(d^2)$--$O(d^3)$ |
| Shampoo, KrADagrad* | Kronecker-factored matrices | two (or more) small factor matrices | matches or slightly exceeds diagonal |
| AdaGram (low-rank) | Low-rank factorization | rank-$r$ factors | low for $r \ll d$ |

AdaGram (Matveeva et al., 28 Aug 2025) maintains a rank-$r$ correction of the inverse Cholesky factor of the full Gram matrix, using projector-splitting integrators to update the preconditioner at low per-iteration cost when $r \ll d$, and achieves performance close to full-matrix AdaGrad.

KrADagrad (Mei et al., 2023) and Shampoo use Kronecker factorization, maintaining two (or more) small matrices whose product dominates the empirical Fisher. This ensures AdaGrad-style regret at much lower computational cost.
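To make the Kronecker idea concrete, here is a hedged sketch of one Shampoo-style step for a matrix-shaped parameter, following the commonly described form with left and right statistics and exponent $-1/4$ on each side. The names `inv_root_psd` and `shampoo_step` and the default values are assumptions for illustration. KrADagrad's own update rule differs and is not reproduced here.

```python
import numpy as np

def inv_root_psd(A, p):
    """A^(-1/p) for a symmetric positive definite matrix A."""
    w, V = np.linalg.eigh(A)
    return (V * w ** (-1.0 / p)) @ V.T

def shampoo_step(W, G, L, R, eta=0.1, eps=1e-6):
    """One Shampoo-style update for an m x n parameter W with gradient G."""
    m, n = W.shape
    L = L + G @ G.T                       # m x m left statistic
    R = R + G.T @ G                       # n x n right statistic
    left = inv_root_psd(L + eps * np.eye(m), 4)
    right = inv_root_psd(R + eps * np.eye(n), 4)
    return W - eta * left @ G @ right, L, R

# Hypothetical 3 x 2 parameter with a rank-one gradient of magnitude 5.
u, v = np.array([1.0, 0.0, 0.0]), np.array([0.6, 0.8])
W, L, R = shampoo_step(np.zeros((3, 2)), 5.0 * np.outer(u, v),
                       np.zeros((3, 3)), np.zeros((2, 2)))
```

The two factors hold $m^2 + n^2$ numbers, whereas a full-matrix accumulator over the flattened parameter would hold $(mn)^2$. In the example the step comes out close to `-eta * outer(u, v)`, so the gradient's magnitude is normalized away much as in diagonal AdaGrad.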

Unified analyses (Xie et al., 13 Mar 2025, Kovalev, 30 Jun 2025) show that, for some problem classes, structured or more "aggressive" preconditioners (diagonal or one-sided Kronecker) can yield lower regret than full-matrix schemes due to a favorable diameter--gradient tradeoff in adaptive regret bounds, challenging the intuition that increased adaptivity uniformly improves performance.

3. Convergence Behavior and Regret Bounds

AdaGrad preconditioning provably accelerates both local and global convergence rates in convex and (recently) non-convex settings.

  • Local linear theory (for well-conditioned Hessians): preconditioned updates with a well-matched $M_t$ dramatically reduce the effective condition number, leading to rapid local contraction rates (Ye, 2024).
  • Global sublinear rate (convex): classic analysis yields an $O(1/\sqrt{T})$ rate after $T$ rounds (Ye, 2024), and in structured settings unified regret bounds hold (Xie et al., 13 Mar 2025, Kovalev, 30 Jun 2025).
  • Non-convex objectives: Under affine noise variance and mild smoothness, AdaGrad achieves $\tilde{O}(1/\sqrt{T})$ rates for the squared gradient norm; in the over-parameterized regime the rate improves to $O(1/T)$, matching SGD (Wang et al., 2023). Even under relaxed "local" or $(L_0, L_1)$-smoothness, AdaGrad converges for sufficiently small stepsizes where vanilla SGD can diverge.

For heavy-tailed stochastic noise, unclipped AdaGrad-Norm can exhibit catastrophic high-probability convergence failure: its iteration complexity scales inverse-algebraically with the confidence parameter $\delta$ rather than polylogarithmically. Gradient clipping restores the optimal order, yielding high-probability excess risk bounds with only logarithmic dependence on the confidence level (Chezhegov et al., 2024).
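The effect of clipping can be sketched as follows. This is an illustrative variant in which the clipped gradient feeds both the scalar accumulator and the update. The clipping threshold `lam`, the initial accumulator `b0`, and the Student-t noise model are assumptions for the example and are not taken from Chezhegov et al.

```python
import numpy as np

def clip(g, lam):
    """Rescale g so that its Euclidean norm is at most lam."""
    norm = np.linalg.norm(g)
    return g if norm <= lam else g * (lam / norm)

def clipped_adagrad_norm(grad_fn, x0, eta=1.0, lam=1.0, b0=1.0, steps=500):
    x = np.asarray(x0, dtype=float).copy()
    b2 = b0 ** 2                          # scalar accumulator b_t^2
    for _ in range(steps):
        g = clip(grad_fn(x), lam)         # clipped gradient is used everywhere
        b2 += g @ g
        x -= eta * g / np.sqrt(b2)        # step length is at most eta * lam / b0
    return x

# Noise-free check on f(x) = 0.5 * ||x||^2.
x_clean = clipped_adagrad_norm(lambda x: x, x0=np.array([10.0, 0.0]))

# Heavy-tailed check: Student-t noise with df=1.5 has infinite variance.
rng = np.random.default_rng(0)
x_noisy = clipped_adagrad_norm(lambda x: x + rng.standard_t(df=1.5, size=x.shape),
                               x0=np.array([10.0, 0.0]))
```

Since every step is bounded by `eta * lam / b0`, a single extreme noise sample can neither throw the iterate far away nor inflate the accumulator by more than `lam ** 2`.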

4. Preconditioning, Regularization, and Algorithmic Integration

AdaGrad preconditioning interacts nontrivially with regularization and advanced optimizer infrastructure:

  • Weight decay: There is a distinction between "coupled" (inside the preconditioner) and "decoupled" (outside the preconditioner) schemes. The decoupled form (applying weight decay in the preconditioned $M_t$-coordinates) is often optimal and underlies AdamW/AdaGradW (Ye, 2024).
  • Gradient norm regularization: Proper preconditioned forms require the regularization term to be expressed in the coordinates defined by $M_t$ for optimal conditioning.
  • Polyak step-size integration: Diagonal AdaGrad preconditioners can be naturally combined with Polyak-style step-size selection, yielding robust, scale-invariant updates that require minimal global learning rate tuning (Abdukhakimov et al., 2023).
  • Momentum and exponential moving average: AdaGrad-type diagonal preconditioning can be unified with Nesterov acceleration, yielding provably faster rates (accelerated $O(1/T^2)$ for smooth convex objectives), and this interplay helps explain the practical efficiency of Adam (Kovalev, 30 Jun 2025).
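The coupled versus decoupled distinction for weight decay can be illustrated with a single diagonal AdaGrad step. The sketch below follows the AdamW-style convention, in which the decoupled decay term bypasses both the accumulator and the preconditioner. The function name `adagrad_wd_step` and its defaults are hypothetical.

```python
import numpy as np

def adagrad_wd_step(x, g, r, eta=0.1, wd=0.01, eps=1e-8, decoupled=True):
    """One diagonal AdaGrad step with coupled or decoupled weight decay."""
    if not decoupled:
        g = g + wd * x                    # coupled: decay passes through the preconditioner
    r = r + g * g
    x_new = x - eta * g / np.sqrt(r + eps)
    if decoupled:
        x_new = x_new - eta * wd * x      # decoupled: plain shrinkage outside the preconditioner
    return x_new, r

# With a zero loss gradient, the two schemes behave very differently.
x = np.array([2.0, -4.0])
zero = np.zeros(2)
x_dec, r_dec = adagrad_wd_step(x, zero, zero, decoupled=True)
x_cpl, r_cpl = adagrad_wd_step(x, zero, zero, decoupled=False)
```

In the coupled case the decay term is normalized by its own accumulated magnitude, so the first step moves each coordinate by roughly `eta` no matter how small `wd` is. The decoupled case shrinks the weights by the intended factor `1 - eta * wd`.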

5. Practical Complexity, Implementation, and Empirical Results

The computational trade-offs for AdaGrad preconditioning depend on the degree of structure enforced:

  • Diagonal and AdaGrad-Norm: $O(d)$ per step, nearly free memory-wise, and highly effective for sparse gradients or high-dimensional applications. They remain popular in deep learning workflows (Xie et al., 13 Mar 2025, Kovalev, 30 Jun 2025, Abdukhakimov et al., 2023).
  • Full-matrix AdaGrad: Dominates on poorly scaled loss surfaces but scales as $O(d^2)$--$O(d^3)$ per step, limiting its use to modest dimensions.
  • Structured (Kronecker/low-rank) forms: Cost matches or slightly exceeds diagonal but achieves near full-matrix performance for low-enough rank or when the parameter naturally factorizes (e.g., for matrix-valued parameters) (Matveeva et al., 28 Aug 2025, Mei et al., 2023).

Empirical studies on GLM benchmarks, synthetic correlated features, and diverse deep-learning tasks demonstrate that low-rank or Kronecker factorizations (AdaGram, KrADagrad*) can match or outperform diagonal AdaGrad and Shampoo in both convergence speed and final accuracy, often at a small computational overhead relative to diagonal forms (Matveeva et al., 28 Aug 2025, Mei et al., 2023).

For badly scaled or ill-conditioned problems, preconditioned Polyak step-size rules using AdaGrad matrices outperform classical AdaGrad and Adam, eliminating the need for hand-tuning global stepsizes (Abdukhakimov et al., 2023). Under heavy-tailed noise, clipped AdaGrad-Norm yields robust and optimal high-probability convergence (Chezhegov et al., 2024).
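A minimal sketch of a Polyak step combined with a diagonal AdaGrad preconditioner, assuming the optimal value $f^\star$ is known and using the standard preconditioned Polyak form $\gamma_t = (f(x_t) - f^\star) / (g_t^\top M_t g_t)$. This may differ in detail from the rule analyzed by Abdukhakimov et al.; the function name and the test problem are hypothetical.

```python
import numpy as np

def polyak_adagrad(f, grad_fn, x0, f_star=0.0, eps=1e-8, steps=100):
    x = np.asarray(x0, dtype=float).copy()
    r = np.zeros_like(x)
    for _ in range(steps):
        g = grad_fn(x)
        gap = f(x) - f_star
        if gap <= 0.0 or not np.any(g):   # already optimal, nothing to do
            break
        r += g * g
        m = 1.0 / np.sqrt(r + eps)        # diagonal of M_t
        gamma = gap / (g @ (m * g))       # Polyak step measured in the M_t geometry
        x -= gamma * m * g                # no global learning rate appears
    return x

# Hypothetical badly scaled quadratic with known minimum value 0.
a = np.array([100.0, 1.0])
f = lambda x: 0.5 * np.sum(a * x ** 2)
grad = lambda x: a * x
x_one = polyak_adagrad(f, grad, x0=np.ones(2), steps=1)
x_end = polyak_adagrad(f, grad, x0=np.ones(2))
```

No stepsize is passed in: the function-value gap sets the scale of each move, and the preconditioner equalizes the two coordinates despite the factor of 100 between their curvatures.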

6. Limitations, Trade-offs, and Theoretical Caveats

The adaptive nature of AdaGrad preconditioning introduces subtleties in convergence and optimization dynamics:

  • Overly aggressive or high-dimensional preconditioning (unstructured full-matrix) can increase the domain-diameter term in regret bounds, negating potential gains from improved gradient scaling (Xie et al., 13 Mar 2025).
  • Structured approximations (diagonal, one-sided Kronecker) can, in some regimes, outperform less structured (full-matrix) schemes, given favorable diameter--gradient trade-offs (Xie et al., 13 Mar 2025).
  • In non-convex or locally non-smooth regimes, convergence guarantees may require additional tuning or bounds on stepsize depending on realized local curvature and noise structure (Wang et al., 2023).
  • Standard AdaGrad can fail, in a high-probability sense, under heavy-tailed noise if unclipped; simple clipping of updates and accumulators eliminates this pathology (Chezhegov et al., 2024).

7. Summary of Impact and Current Research Directions

AdaGrad preconditioning, through dynamic adaptation of step sizes and leveraging of accumulated curvature information, underpins modern adaptive optimizers in large-scale machine learning. Its diagonal, full-matrix, and structured low-rank or factored variants enable deployment across a broad spectrum of optimization regimes, balancing adaptation, computational efficiency, and scalability.

Recent advances establish refined regret and generalization bounds under both convex and non-convex settings, reveal the role of preconditioning in regularized and accelerated optimization, and emphasize the importance of algorithmic modifications, such as clipping or structure-aware factorization, to ensure robust and efficient convergence in practical contexts (Matveeva et al., 28 Aug 2025, Xie et al., 13 Mar 2025, Kovalev, 30 Jun 2025, Ye, 2024, Chezhegov et al., 2024, Mei et al., 2023, Abdukhakimov et al., 2023).

Ongoing research focuses on sharpening the theory of preconditioning under realistic noise and curvature models, further optimizing computational trade-offs via new structured approximators, and developing variants specifically tailored for large-scale, ill-conditioned, or heavy-tailed data regimes.
