Preconditioned Gradient Descent (PGD)
- Preconditioned Gradient Descent (PGD) is an optimization method that uses a symmetric positive-definite matrix to rescale gradient updates for faster convergence and improved conditioning.
- PGD leverages online, first-order preconditioning and curvature estimators to adapt to anisotropy and complex parameter correlations in high-dimensional landscapes.
- PGD is backed by strong theoretical guarantees and empirical success across deep learning, RNNs, distributed optimization, and inverse problem settings.
Preconditioned Gradient Descent (PGD) is a generalization of standard gradient descent methodologies in which the raw gradient direction is modified by a symmetric positive-definite matrix, termed the preconditioner, to accelerate convergence and ameliorate the effects of anisotropy, ill-conditioning, or nontrivial parameter correlations in the objective landscape. By acting to “whiten” or rescale the local geometry of the problem, PGD plays a foundational role in scalable optimization for high-dimensional estimation, machine learning, PDE-constrained optimization, and nonconvex matrix recovery.
1. Mathematical Formalism and Algorithmic Structure
Let $f:\mathbb{R}^d \to \mathbb{R}$ be a differentiable objective, with update at step $t$ given by

$$x_{t+1} = x_t - \eta\, P_t\, \nabla f(x_t),$$

where $\eta > 0$ is a step size and $P_t$ is a symmetric positive (semi-)definite preconditioning matrix. The preconditioner serves two core purposes:
- Directional scaling: the eigenvalues of $P_t$ adjust per-coordinate step sizes.
- Rotation: a non-diagonal $P_t$ realigns updates to address off-diagonal parameter correlations.
Classical choices for $P_t$ include the inverse (approximate) Hessian (Newton’s method), the inverse Fisher information matrix (natural gradient), or diagonal/quasi-Newton approximations. In high dimensions, direct Hessian inversion is too expensive, motivating first-order, block-wise, or online learning of $P_t$ (Moskovitz et al., 2019).
PGD generalizes to composite objectives, constrained spaces, projected algorithms, infinite-dimensional Hilbert spaces, and even optimization on Wasserstein space via suitable generalizations of the preconditioning norm and proximal/Bregman operators (Park et al., 2020, Bonet et al., 2024, Guo et al., 4 Jun 2025, Park et al., 22 Dec 2025).
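To make the basic update concrete, here is a minimal sketch of PGD on an ill-conditioned quadratic; the objective, the choice $P = A^{-1}$, and the step size are illustrative assumptions, not prescriptions from the cited works.

```python
import numpy as np

# Ill-conditioned quadratic f(x) = 0.5 * x^T A x with anisotropic curvature.
A = np.diag([100.0, 1.0])
grad = lambda x: A @ x

# Preconditioner chosen here as the exact inverse Hessian (Newton-like);
# in practice P is an approximation that is learned or updated online.
P = np.linalg.inv(A)

x = np.array([1.0, 1.0])
eta = 1.0   # with P = A^{-1}, a unit step solves this quadratic exactly
for t in range(10):
    x = x - eta * P @ grad(x)

print(x)   # converges to the minimizer at the origin
```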
2. Construction and Online Learning of the Preconditioner
The ideal preconditioner adapts to local curvature, mimicking the inverse of the Hessian $\nabla^2 f(x_t)$. Since the Hessian is often intractable to form and invert, modern PGD variants employ:
- First-order preconditioning (e.g., FOP): Parameterize the preconditioner as $P = P(\theta)$ with learnable parameters $\theta$, and learn $\theta$ via an outer-loop hypergradient (meta-gradient) method using only first-order derivatives. Writing the inner step as $x_{t+1} = x_t - \eta\, P(\theta)\, \nabla f(x_t)$, the chain rule yields an efficient, online outer-loop update $\theta \leftarrow \theta - \beta\, \nabla_\theta f(x_{t+1})$ with the meta-gradient
$$\nabla_\theta f(x_{t+1}) = \Big(\tfrac{\partial x_{t+1}}{\partial \theta}\Big)^{\!\top} \nabla f(x_{t+1}), \qquad \tfrac{\partial x_{t+1}}{\partial \theta} = -\eta\, \tfrac{\partial}{\partial \theta}\big[P(\theta)\,\nabla f(x_t)\big],$$
where $\beta$ is an outer learning rate, so no second derivatives of $f$ are required (Moskovitz et al., 2019); a minimal sketch follows this list.
- Stochastic estimators: Estimators of the Fisher matrix (in latent-variable models) or “two-point” finite-difference curvature matching (for neural nets) enable online, low-complexity updates to $P$. A notable instance is Preconditioned Stochastic Gradient Descent (PSGD), which adapts to noisy gradients in RNNs and other deep architectures (Li, 2016, Baey et al., 2023).
- Block or Lie-group factorization: For scalability, $P$ may be structured (block-diagonal, Kronecker-factored, low-rank, or constrained to a Lie group). Curvature-informed PSGD with Lie-group preconditioners exploits Hessian-vector products and group geometry to fit $P$ robustly to stochastic curvature samples (Pooladzandi et al., 2024).
- Problem-structured preconditioning: In matrix factorization, the (right) preconditioner often takes the form $(X_t^\top X_t + \lambda_t I)^{-1}$, with an adaptive regularizer $\lambda_t$ determined by the scale of deviation from the rank-deficient minimizer or by the observed error (Zhang et al., 13 Apr 2025, Zhang et al., 2023, Zhang et al., 2022). In distributed settings, preconditioners are iteratively learned via decentralized updates (Chakrabarti et al., 2020).
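The sketch below illustrates the hypergradient idea from the first bullet for a simple diagonal parameterization $P(\theta) = \mathrm{diag}(e^{\theta})$; the parameterization, the quadratic test objective, and both learning rates are illustrative assumptions, not the specific construction of FOP.

```python
import numpy as np

# Anisotropic quadratic test objective f(x) = 0.5 * x^T A x (illustrative).
A = np.diag([100.0, 1.0])
grad = lambda x: A @ x

eta, beta = 1e-2, 1e-1        # inner step size and outer (hypergradient) learning rate
x = np.array([1.0, 1.0])
theta = np.zeros(2)           # P(theta) = diag(exp(theta)) is positive by construction

for t in range(200):
    g = grad(x)
    P_diag = np.exp(theta)
    x_next = x - eta * P_diag * g                 # preconditioned inner step
    # Chain rule, first-order only:
    #   d f(x_next)/d theta_i = grad_f(x_next)_i * (-eta * exp(theta_i) * g_i)
    meta_grad = -eta * P_diag * g * grad(x_next)
    theta = theta - beta * meta_grad              # outer-loop hypergradient update
    x = x_next

# The learned diagonal preconditioner up-weights the flat (slowly converging)
# coordinate relative to the stiff one.
print(np.exp(theta))
```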
3. Theoretical Guarantees: Convergence Rates and Conditioning
PGD is analyzed under generalized smoothness and convexity assumptions, most commonly strong convexity and local Lipschitz or Polyak–Łojasiewicz (PL) conditions:
- For a fixed symmetric positive-definite $P$ with eigenvalues in $[\lambda_{\min}, \lambda_{\max}]$ and an $L$-smooth objective satisfying the PL inequality $\|\nabla f(x)\|^2 \ge 2\mu\,(f(x) - f^\star)$, PGD with an appropriate step size (e.g., $\eta = \lambda_{\min}/(L\,\lambda_{\max}^2)$) achieves global linear convergence
$$f(x_{t+1}) - f^\star \le \big(1 - \mu\,\eta\,\lambda_{\min}\big)\big(f(x_t) - f^\star\big).$$
A numerical illustration of this rate is given after this list.
- When the preconditioner is learned online (as in FOP or adaptive schemes) and its spectral norm stabilizes, similar guarantees follow under suitable step-size and decay schedules (Moskovitz et al., 2019, Li, 2016).
- For nonconvex matrix factorization, right-preconditioned updates restore a PL inequality on the clean (noiseless) loss, and under suitable regularization, convergence remains linear to the minimax-optimal statistical error, regardless of over-parameterization or ground-truth ill-conditioning (Zhang et al., 13 Apr 2025, Zhang et al., 2023, Zhang et al., 2022).
- In infinite dimensions or PDE settings, preconditioning is achieved by switching to a norm induced by a (possibly non-Euclidean) inner product. The construction of invariant sets, Lyapunov functions, and step-size restrictions extends geometric (linear) convergence guarantees to objectives that are strongly convex and smooth in the preconditioned metric (Park et al., 2020, Guo et al., 4 Jun 2025, Park et al., 22 Dec 2025).
- In the Rayleigh–Ritz (preconditioned steepest descent, PSD) approach, the optimal line search and preconditioner scaling are resolved in a projected space, yielding a sharp, non-asymptotic contraction factor that strictly outperforms fixed-step preconditioned methods (Neymeyr, 2011).
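As a sanity check of the fixed-preconditioner rate above, the script below runs PGD on a strongly convex (hence PL) quadratic and compares it with plain gradient descent; the matrices and the practical step sizes (rather than the conservative bound) are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Strongly convex (hence PL) quadratic f(x) = 0.5 * x^T A x, so f* = 0, deliberately ill-conditioned.
d = 20
U, _ = np.linalg.qr(rng.standard_normal((d, d)))
A = U @ np.diag(np.logspace(0, 3, d)) @ U.T      # condition number 1e3

f = lambda x: 0.5 * x @ A @ x
grad = lambda x: A @ x

def run(P, eta, steps=50):
    x = np.ones(d)
    gaps = []
    for _ in range(steps):
        x = x - eta * P @ grad(x)
        gaps.append(f(x))
    return np.array(gaps)

L = np.linalg.eigvalsh(A).max()                  # smoothness constant of f
gd = run(np.eye(d), 1.0 / L)                     # plain gradient descent

P = np.linalg.inv(A + np.eye(d))                 # fixed, damped inverse-Hessian preconditioner
pgd = run(P, 1.0)

print("GD  suboptimality gap after 50 steps:", gd[-1])
print("PGD suboptimality gap after 50 steps:", pgd[-1])
print("PGD gap contracts every step:", bool(np.all(pgd[1:] < pgd[:-1])))
```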
4. Empirical Evidence and Applications Across Domains
PGD is widely employed and evaluated in large-scale machine learning, inverse problems, distributed systems, and scientific computing. Representative use cases include:
- Deep learning: FOP accelerates training and improves test accuracy for CNNs and ResNets on CIFAR-10 and ImageNet with minimal computational overhead (typically <2% extra time) over SGD with momentum. It also widens the range of hyperparameters yielding high accuracy. In RL (PPO, BipedalWalker), PGD hybridized with momentum halves the number of updates needed and outperforms Adam (Moskovitz et al., 2019).
- RNNs: Preconditioned SGD (PSGD) addresses exploding and vanishing gradients, succeeding on long-memory synthetic tasks and MNIST with minimal tuning, outperforming both classical SGD and Hessian-free approaches in challenging regimes (Li, 2016).
- Distributed optimization: Iterative PGD (IPGD) provably beats distributed gradient descent by learning a dynamic preconditioner, reducing iteration count by factors >2 and achieving mesh/agent independence in iteration complexity (Chakrabarti et al., 2020).
- Nonconvex low-rank recovery: In noisy symmetric matrix sensing, right-preconditioned updates with a decaying regularizer $\lambda_t$ yield condition-number-agnostic linear convergence to the minimax error floor, supporting applications such as high-resolution medical image denoising (Zhang et al., 13 Apr 2025, Zhang et al., 2023); a simplified sketch of this update follows the list.
- Optimization in Wasserstein space: PGD with Bregman divergences as regularizers achieves faster rates and better statistical fit for high-dimensional measure alignment than Euclidean-geometric methods (Bonet et al., 2024).
- Transformer in-context learning: Trained transformers implement PGD directly; global minima of the in-context risk exactly simulate one or more iterations of preconditioned gradient descent, and the learned preconditioner optimally adapts to the data covariance and sampling variance structure (Ahn et al., 2023).
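The right-preconditioned update referenced in the low-rank recovery bullet can be sketched, under simplifying assumptions, on a fully observed symmetric factorization problem; the residual-based damping schedule, dimensions, and step size below are illustrative choices, not the tuned schedules of the cited works.

```python
import numpy as np

rng = np.random.default_rng(1)

# Ground-truth rank-r PSD matrix M = Z Z^T, recovered with an over-parameterized factor (k > r).
n, r, k = 50, 3, 6
Z = rng.standard_normal((n, r))
M = Z @ Z.T

X = 0.1 * rng.standard_normal((n, k))     # small over-parameterized initialization
eta = 0.2

for t in range(300):
    R = X @ X.T - M                       # residual
    G = 2.0 * R @ X                       # gradient of 0.5 * ||X X^T - M||_F^2
    lam = np.linalg.norm(R, "fro")        # adaptive damping tied to the current error (illustrative)
    P = np.linalg.inv(X.T @ X + lam * np.eye(k))   # right preconditioner (X^T X + lambda I)^{-1}
    X = X - eta * G @ P                   # right-preconditioned update

print("relative recovery error:", np.linalg.norm(X @ X.T - M) / np.linalg.norm(M))
```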
5. Spectral Bias, Generalization, and Advanced Regimes
PGD plays a central corrective role in regimes characterized by spectral bias or “lazy” neural tangent kernel (NTK)-like dynamics. For neural network training:
- PGD with Gauss–Newton or Levenberg–Marquardt preconditioning eliminates the eigenvalue-dependent spread in mode convergence rates, thus uniformly accelerating both low- and high-frequency mode learning. This can mitigate or compress grokking delays (the phase where networks generalize only after training loss is near zero) (Jiang et al., 6 Jan 2026); a damped Gauss–Newton sketch follows this list.
- In overparameterized two-layer networks, preconditioned learning (with early stopping) yields faster generalization rates than the standard NTK analysis, reflecting a transition to a lower-complexity kernel regime. This arises because the preconditioner approximates the squared integral operator of the NTK, doubling the spectral decay exponent and thereby improving statistical efficiency (Yang, 2024).
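To make the Gauss–Newton/Levenberg–Marquardt preconditioning in the first bullet concrete, here is a damped Gauss–Newton step on a toy nonlinear least-squares model; the model, damping constant, and step size are illustrative assumptions rather than the schemes analyzed in the cited work.

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy nonlinear least-squares model y ≈ a * tanh(b * x) with parameters theta = (a, b).
x = np.linspace(-2.0, 2.0, 64)
y = 1.5 * np.tanh(0.8 * x) + 0.05 * rng.standard_normal(x.size)

def residuals(theta):
    a, b = theta
    return a * np.tanh(b * x) - y

def jacobian(theta):
    a, b = theta
    t = np.tanh(b * x)
    return np.stack([t, a * x * (1.0 - t**2)], axis=1)   # d(residual)/d(a, b)

theta = np.array([1.0, 1.0])
eta, damping = 0.5, 1e-2
for _ in range(200):
    r = residuals(theta)
    J = jacobian(theta)
    g = J.T @ r                                           # gradient of 0.5 * ||r||^2
    P = np.linalg.inv(J.T @ J + damping * np.eye(2))      # Levenberg–Marquardt preconditioner
    theta = theta - eta * P @ g                           # damped Gauss–Newton (preconditioned) step

print(theta)   # approaches the generating parameters, roughly (1.5, 0.8)
```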
6. Extensions: Constrained and Projected Settings, Infinite-Dimensional Variants
PGD has robust extensions beyond unconstrained, Euclidean spaces:
- Projected/constrained PGD: Inexact projected PGD with variable metrics (IPPGD) handles PDE-constrained or conservation-law-constrained optimization, using relaxed projections and inner-product-induced metrics. A Lyapunov-function formalism secures linear rates even under inexact projection solvers and dynamic preconditioner/metric updates (Guo et al., 4 Jun 2025); a toy projected step is sketched after this list.
- Composite objectives and perturbed oracles: When working with composite functionals with only approximate oracles for a component, perturbed PGD (PPGD) admits explicit error floors in convergence theorems, with rigorous guarantees provided perturbations vanish on bounded sets (Park et al., 22 Dec 2025).
- Wasserstein space: Lifting PGD to Wasserstein space endows it with Bregman-type divergences as preconditioners, which permit geometry-aware, condition-number-reducing algorithms with sublinear or linear rates under adapted relative smoothness/convexity assumptions (Bonet et al., 2024).
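The toy sketch below shows a projected preconditioned step with a diagonal preconditioner and box constraints, a setting where the metric projection reduces to coordinate-wise clipping; the objective, bounds, and preconditioner are illustrative assumptions, far simpler than the PDE-constrained problems of the cited works.

```python
import numpy as np

# Separable quadratic objective 0.5 * x^T A x - b^T x with box constraint x in [0, 1]^2.
A = np.diag([50.0, 1.0])
b = np.array([10.0, 2.0])
grad = lambda x: A @ x - b

# Diagonal preconditioner (inverse of the diagonal curvature).
P = np.diag(1.0 / np.diag(A))

def project_box(z, lo=0.0, hi=1.0):
    # For a diagonal metric, the metric projection onto a box separates
    # across coordinates and reduces to ordinary clipping.
    return np.clip(z, lo, hi)

x = np.array([0.5, 0.5])
eta = 1.0
for _ in range(50):
    x = project_box(x - eta * P @ grad(x))   # projected preconditioned step

# Converges to the constrained minimizer [0.2, 1.0]:
# the second coordinate is clipped at the upper bound, the first is interior.
print(x)
```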
7. Practical Considerations and Limitations
- Preconditioner update frequency and structure must be controlled to balance computational overhead and conditioning; diagonal, block, or Lie-group constraints are common for scalability (Pooladzandi et al., 2024).
- Adaptive or learned regularization parameters (as in low-rank matrix sensing and factorization) are critical for tracking proximity to singularities or noise floors.
- Damping (e.g., Levenberg–Marquardt) and hybrid schedules (combined GN/PGD and first-order methods) can be required to recover optimal generalization, particularly as preconditioning may trap optimization in the NTK subspace (Jiang et al., 6 Jan 2026).
- Open questions remain on extending the robust theoretical analysis beyond PL- or convex-type regimes, scaling to extreme model sizes (block-wise or distributed PGD), and principled selection of preconditioners under stochastic and adversarial settings.
A comprehensive synthesis of PGD and its variants, including theoretical, algorithmic, and empirical dimensions, demonstrates its centrality in modern numerical optimization and machine learning. Among the key advances are online and structure-aware preconditioner learning, robust convergence guarantees under minimal assumptions, successful deployment across high-dimensional and ill-conditioned landscapes, and deep connections to generalization theory and spectral geometry (Moskovitz et al., 2019, Neymeyr, 2011, Li, 2016, Park et al., 2020, Baey et al., 2023, Pooladzandi et al., 2024, Jiang et al., 6 Jan 2026, Yang, 2024, Zhang et al., 13 Apr 2025, Zhang et al., 2023, Zhang et al., 2022, Liu et al., 2023, Park et al., 22 Dec 2025, Guo et al., 4 Jun 2025, Bonet et al., 2024, Ahn et al., 2023, Chakrabarti et al., 2020).