Nonlinearly Preconditioned Gradient Methods
- Nonlinearly preconditioned gradient methods are iterative optimization techniques that use adaptive nonlinear operators to transform gradients and enhance convergence.
- These methods generalize classical linear preconditioning by incorporating domain-specific iterative solvers, significantly boosting performance on ill-conditioned and complex problems.
- Techniques like ALS preconditioning and nonlinear GMRES demonstrate robust convergence properties and modular integration across applications such as tensor decomposition, PDEs, and neural network training.
Nonlinearly preconditioned gradient methods are iterative algorithms that incorporate nonlinear transformations or iterative updates—rather than fixed, linear operators—as preconditioners to enhance the convergence properties of gradient-based optimization. This approach generalizes classical linear preconditioning and enables substantially improved performance on highly nonlinear, ill-conditioned, or domain-structured problems. The nonlinear preconditioning paradigm underpins a range of modern methods, including nonlinear GMRES, nonlinearly preconditioned (P)NCG and quasi-Newton methods, adaptive and stochastic preconditioned gradient algorithms, and domain-specific strategies in fields such as tensor decomposition, PDE-constrained optimization, and large-scale neural network training. The following sections synthesize foundational theory, algorithmic instantiations, convergence properties, relation to classical methods, and practical applications of nonlinearly preconditioned gradient techniques, as supported by contemporary research literature.
1. Fundamentals of Nonlinear Preconditioning
Nonlinear preconditioning replaces the classical linear mapping (e.g., multiplying by an SPD matrix or its inverse) with a problem-dependent nonlinear operator or iterative solver that produces a new search direction or an improved iterate. If $P$ denotes a nonlinear map (often itself an iterative process), the preconditioned gradient direction at the current iterate $x_k$ can be formalized as
$$d_k = -P\big(x_k, \partial f(x_k)\big),$$
where $\partial f(x_k)$ denotes the gradient or a generalized subdifferential. Rather than simply "scaling" the gradient, nonlinear preconditioners can adaptively transform steps based on local curvature, domain structure, or fixed-point properties.
Notable instantiations include:
- ALS as Preconditioner: In canonical tensor approximation, stand-alone ALS converges slowly when factors are highly collinear, but employed as a preconditioner (e.g., within nonlinear conjugate gradient or L-BFGS frameworks) it dramatically accelerates convergence (Sterck et al., 2014, Sterck et al., 2018).
- Domain Decomposition: Nonlinear Restricted Additive Schwarz (RAS) operators solve nonlinear subproblems in overlapping subdomains and then aggregate updates, accelerating and stabilizing convergence in PDEs and large-scale systems (Kothari, 2022).
- Gradient Clipping and Isotropic Preconditioning: Nonlinear preconditioning through reference functions and their convex duals arises in methods based on generalized smoothness (e.g., $(L_0, L_1)$-smooth functions), including algorithms with anisotropic or componentwise scaling (Oikonomidis et al., 12 Feb 2025, Bodard et al., 19 Sep 2025).
Nonlinear preconditioning can thus be interpreted as embedding a robust, structure-respecting transformation and acceleration step within iterative optimization.
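As a minimal illustration of this pattern, the sketch below applies a generic inner solver `precond` (one pass of ALS, a Jacobi sweep, a Schwarz subdomain solve, etc.) as a nonlinear preconditioner inside a gradient method; the function names, safeguard, and line-search constants are illustrative choices, not taken from any cited paper.

```python
import numpy as np

def npgd(f, grad, precond, x0, max_iter=100, tol=1e-8, c=1e-4):
    """Nonlinearly preconditioned gradient descent (illustrative sketch).

    precond: callable x -> x_bar, one pass of an inner solver; the
             preconditioned direction is d = x_bar - x instead of the
             raw negative gradient.
    """
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) < tol:
            break
        d = precond(x) - x              # nonlinearly preconditioned direction
        if g @ d >= 0:                  # safeguard: fall back to steepest descent
            d = -g
        t, fx = 1.0, f(x)
        while f(x + t * d) > fx + c * t * (g @ d):   # Armijo backtracking
            t *= 0.5
            if t < 1e-12:
                break
        x = x + t * d
    return x

# Example: ill-conditioned quadratic with a one-sweep Jacobi inner solver.
A = np.diag([1.0, 100.0]); b = np.array([1.0, 1.0])
f = lambda x: 0.5 * x @ A @ x - b @ x
grad = lambda x: A @ x - b
precond = lambda x: x - (A @ x - b) / np.diag(A)    # one Jacobi sweep
x_star = npgd(f, grad, precond, np.zeros(2))
```

On this diagonal quadratic the Jacobi-preconditioned iteration converges in a couple of steps, whereas an unpreconditioned gradient step length would be dictated by the largest curvature.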
2. Algorithmic Frameworks and Methodologies
A wide spectrum of algorithms employ nonlinear preconditioning, which can be deployed in simple gradient methods as well as higher-order and stochastic frameworks.
Method Class | Nonlinear Preconditioner Example | Application / Effect |
---|---|---|
Nonlinear GMRES (N-GMRES) | Steepest Descent, ALS, domain solvers | Krylov subspace acceleration (Sterck, 2011) |
Nonlinear Conjugate Gradient | ALS, custom domain routines | Canonical tensor decomposition (Sterck et al., 2014) |
Nonlinear Quasi-Newton/ L-BFGS | ALS, Schwarz-type RAS | Tensor methods, PDEs (Sterck et al., 2018, Kothari, 2022) |
Stochastic Algorithms | Diagonal, sketchy Hessian preconditioning | Robustness for ill-conditioned ML (Sadiev et al., 2022, Frangella et al., 2023) |
Constraint / Geometry Driven | Riemannian/ natural gradients, Hessian-Riemannian | Geometry-adapted steepest descent (Liu et al., 2023) |
Each methodology includes modifications to update directions, secant pairs, or trust region subproblems to account for the nonlinear preconditioning step, with corresponding global or local convergence analysis.
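To make the acceleration-wrapper pattern concrete, the following sketch implements an Anderson-style windowed acceleration around an arbitrary nonlinear preconditioner, the same structural idea that underlies N-GMRES (one preconditioner step followed by a least-squares recombination of recent iterates). The window size, regularization constant, and interface are illustrative assumptions, not the exact formulation of any cited method.

```python
import numpy as np

def nonlinear_acceleration(g, x0, m=5, max_iter=200, tol=1e-10, reg=1e-10):
    """Anderson/N-GMRES-style acceleration of a nonlinear preconditioner g.

    g: callable x -> x_bar, one application of the inner solver, viewed as
       a fixed-point map.  A window of the last m residuals is kept and the
       preconditioner outputs are recombined by least squares.
    """
    x = np.asarray(x0, dtype=float)
    X_hist, R_hist = [], []                 # preconditioner outputs and residuals
    for _ in range(max_iter):
        gx = g(x)
        r = gx - x                          # fixed-point residual
        if np.linalg.norm(r) < tol:
            return gx
        X_hist.append(gx); R_hist.append(r)
        if len(R_hist) > m + 1:
            X_hist.pop(0); R_hist.pop(0)
        if len(R_hist) == 1:
            x = gx                          # plain preconditioner step
            continue
        # Differences of residuals and of preconditioner outputs.
        dR = np.stack([R_hist[i + 1] - R_hist[i] for i in range(len(R_hist) - 1)], axis=1)
        dX = np.stack([X_hist[i + 1] - X_hist[i] for i in range(len(X_hist) - 1)], axis=1)
        # Regularized least-squares coefficients minimizing the linearized residual.
        gamma = np.linalg.solve(dR.T @ dR + reg * np.eye(dR.shape[1]), dR.T @ r)
        x = gx - dX @ gamma                 # accelerated iterate
    return x
```

Here `g` can be a steepest-descent sweep, an ALS pass, or a Schwarz subdomain solve; only the fixed-point interface x -> g(x) matters, which is what makes the preconditioner modular.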
3. Convergence Theory and Generalized Smoothness
Convergence of nonlinearly preconditioned gradient methods is established via a variety of analytical frameworks:
- Majorant Conditions and Local Analysis: For Newton-like and inexact Newton methods under majorant conditions, convergence (including superlinear/quadratic rates) extends to analytic and Hölder-type functions (Goncalves et al., 2016, Goncalves et al., 2017).
- Generalized Smoothness Framework: $(L_0, L_1)$-smoothness and anisotropic smoothness relative to a kernel (reference) function enable convergence even when the Hessian norm is not globally bounded. Sufficient conditions such as
$$\|\nabla^2 f(x)\| \le L_0 + L_1 \|\nabla f(x)\|,$$
together with an anisotropic second-order bound involving the Hessian of the reference function, yield guarantees for robust methods such as gradient clipping and ensure saddle-escaping dynamics (Oikonomidis et al., 12 Feb 2025, Bodard et al., 19 Sep 2025).
- Lyapunov and Energy Analysis: In settings with inexact projections and variable metrics, energy-based Lyapunov functions support global exponential convergence of projected preconditioned flows, both at continuous and discrete levels (Guo et al., 4 Jun 2025, Liu et al., 2023, Park et al., 2020).
- Stochastic and Sketchy Analysis: For stochastic and parallelized algorithms, regularity conditions such as quadratic regularity and Hessian dissimilarity serve as effective substitutes to classical condition number analyses, leading to robust linear convergence even with infrequent preconditioner updates (Frangella et al., 2023).
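As a concrete instance of the generalized-smoothness viewpoint, clipped gradient descent can be read as applying an iterate-dependent scalar preconditioner to the gradient. The sketch below uses plain norm clipping with arbitrary illustrative constants (step size, clipping threshold), not the tuned rules of the cited works.

```python
import numpy as np

def clipped_gradient_descent(grad, x0, lr=0.1, clip=1.0, max_iter=500, tol=1e-8):
    """Gradient descent with norm clipping, viewed as a nonlinear preconditioner:
    the raw gradient g is multiplied by the iterate-dependent scalar
    min(1, clip / ||g||) rather than by a fixed matrix."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = grad(x)
        gnorm = np.linalg.norm(g)
        if gnorm < tol:
            break
        scale = min(1.0, clip / gnorm)      # nonlinear, gradient-dependent scaling
        x = x - lr * scale * g
    return x

# Example: f(x) = x^4 has unbounded Hessian (not globally L-smooth) but is
# (L0, L1)-smooth; clipping keeps the step length bounded far from the minimizer.
x_min = clipped_gradient_descent(lambda x: 4 * x**3, np.array([10.0]))
```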
4. Relation to Classical Preconditioning and General GMRES
Nonlinear preconditioning generalizes classical linear (matrix-based) preconditioning by allowing the transformation to depend nonlinearly on the current iterate. In the linear case and for quadratic objectives, the structure reduces to standard preconditioned GMRES or preconditioned CG (PCG):
- For quadratic functions $f(x) = \tfrac{1}{2} x^\top A x - b^\top x$, steepest descent as the preconditioner in N-GMRES advances along directions in $\mathcal{K}_k(A, r_0) = \operatorname{span}\{r_0, A r_0, \dots, A^{k-1} r_0\}$, the standard Krylov subspace (Sterck, 2011).
- In nonlinear settings, Krylov-like acceleration via nonlinear GMRES generalizes to subspaces spanned by nonlinear iterates.
Similarly, left- and right-nonlinear preconditioning in quasi-Newton or L-BFGS frameworks requires constructing secant pairs from the preconditioned residuals, which in turn necessitates explicit modification of the memory and update formulas (Kothari, 2022, Sterck et al., 2018).
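The quadratic-case reduction described above can be checked numerically: with steepest descent as the inner step, the displacements $x_j - x_0$ stay in the Krylov subspace $\mathcal{K}_k(A, r_0)$. The snippet below is a small verification sketch; the matrix, random seed, and dimensions are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 8, 4
M = rng.standard_normal((n, n))
A = M @ M.T + n * np.eye(n)           # SPD Hessian of f(x) = 0.5 x'Ax - b'x
b = rng.standard_normal(n)

# Steepest descent with exact line search on the quadratic.
x = np.zeros(n)
iterates = [x.copy()]
for _ in range(k):
    r = b - A @ x                     # negative gradient = residual
    alpha = (r @ r) / (r @ A @ r)     # exact line-search step
    x = x + alpha * r
    iterates.append(x.copy())

# Krylov basis K_k(A, r0) = span{r0, A r0, ..., A^{k-1} r0}.
r0 = b - A @ iterates[0]
K = np.column_stack([np.linalg.matrix_power(A, j) @ r0 for j in range(k)])
Q, _ = np.linalg.qr(K)

# Each displacement x_j - x_0 should lie in the Krylov subspace (projection error ~ 0).
for j, xj in enumerate(iterates[1:], start=1):
    d = xj - iterates[0]
    err = np.linalg.norm(d - Q @ (Q.T @ d))
    print(f"iterate {j}: distance to K_{k}(A, r0) = {err:.2e}")
```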
5. Practical Applications and Empirical Observations
Nonlinearly preconditioned gradient methods have demonstrated substantial empirical impact across diverse application domains:
- Tensor Decomposition: Nonlinear preconditioning (using ALS or HOOI as preconditioners) within NCG, GMRES, or L-BFGS drastically accelerates convergence, especially in problems with high collinearity or ill-conditioning. NP-L-BFGS and NP-CG outperform both ALS and their unpreconditioned counterparts (Sterck et al., 2014, Sterck et al., 2018).
- PDEs and Large-Scale Inverse Problems: Nonlinear Schwarz/RAS preconditioners and inexact projected preconditioners lower the computational cost for enforcing constraints in PDE-discretizations (e.g., for divergence constraints or conservation laws), with near mesh-independent convergence and reduced sensitivity to ill-conditioning (Kothari, 2022, Guo et al., 4 Jun 2025).
- Machine Learning and Deep Learning: Adaptive, data-driven, or sketchy curvature-based nonlinear preconditioners (in stochastic/variance-reduced settings or parallel trust-region methods) yield robustness to feature scaling and allow for hyperparameter-insensitive training of large neural networks (Sadiev et al., 2022, Frangella et al., 2023, Alegría et al., 7 Feb 2025).
- Optimization with Generalized Smoothness: Methods based on anisotropic preconditioners (e.g., induced by gradient clipping or kernel-based duality) maintain efficiency and guarantee saddle point avoidance even when classical smoothness fails—relevant in phase retrieval, matrix factorization, and deep learning landscapes (Bodard et al., 19 Sep 2025, Oikonomidis et al., 12 Feb 2025).
Numerical studies report that the best nonlinearly preconditioned methods achieve iteration and time-to-solution reductions by factors of 2–4 or more compared to both stand-alone preconditioners and classical competitors.
6. Flexibility, Modularity, and Future Directions
The principal strength of nonlinear preconditioning is its modularity: any problem-adapted iterative solver—exact or approximate, domain or data-driven—can be exploited as a drop-in preconditioner for a higher-order method or acceleration wrapper. This universality enables:
- Rapid incorporation of domain-specific knowledge (e.g., alternating minimization, domain decomposition, or splitting routines).
- Robustness to ill-conditioning and poor scaling, as verified empirically in ill-posed or data-corrupted problems.
- Escaping strict saddle points and attaining higher-order stationarity under nonclassical smoothness.
- Extension to stochastic and distributed environments, where preconditioners can be “sketchy,” data-parallel, or lazily updated yet still ensure fast global convergence.
Strong numerical and theoretical evidence points to further exploitation of nonlinear preconditioners in large-scale, structured, and nonconvex optimization as a route to overcoming the limitations of conventional (linear) acceleration and scaling strategies.
7. Representative Methods, Formulas, and Theoretical Guarantees
Key representative algorithms, update formulas, and convergence guarantees include:
- N-GMRES update for unconstrained minimization: a preliminary iterate is generated by one preconditioner step, e.g. steepest descent,
$$\bar{x}_{k+1} = x_k - \beta_k \nabla f(x_k),$$
and then recombined with a window of previous iterates via a least-squares acceleration step, with $\beta_k$ determined by a Wolfe-satisfying line search or with $\beta_k = \beta$ fixed and small (Sterck, 2011).
- PNCG direction (ALS as preconditioner):
$$p_k = -\bar{g}_k + \beta_k\, p_{k-1}, \qquad \bar{g}_k = x_k - P_{\mathrm{ALS}}(x_k),$$
with tailored update formulas for the preconditioned gradient $\bar{g}_k$ and the conjugacy parameter $\beta_k$ (Sterck et al., 2014); a minimal code sketch follows after this list.
- Global convergence guarantees for nonlinearly preconditioned (sub)gradient methods, established under generalized smoothness assumptions on the objective.
- Energy-adaptive update (AEPG): the gradient step is rescaled by an auxiliary energy variable that is updated together with the iterate and is monotonically non-increasing by construction, yielding unconditional energy dissipation (Liu et al., 2023).
- Polynomial preconditioner in first-order methods: the gradient is premultiplied by $\pi(A)$, with $\pi$ a symmetric polynomial of the Hessian (or system matrix) $A$ chosen to reduce the spread of the spectrum of the preconditioned operator (Doikov et al., 2023).
Convergence rates, for example, can be exponential in energy/functional gap under Lyapunov analysis or sublinear in the absence of strong convexity, with saddle-escape guarantees under suitable anisotropic smoothness conditions.
The theory and practice of nonlinearly preconditioned gradient methods thus encompass a spectrum of sophisticated strategies for accelerating and stabilizing optimization in complex, large-scale, and ill-conditioned regimes. Rigorous recent developments in this area provide both practical algorithms and broad generalizations to classical convergence theory, with rapidly increasing impact in numerical linear algebra, machine learning, PDEs, and scientific computing.