
Clipped Preconditioned Updates in Optimization

Updated 16 May 2026
  • Clipped preconditioned updates are a class of optimization algorithms that use nonlinear preconditioning and clipping to regulate gradient magnitudes.
  • They are analyzed under generalized anisotropic smoothness conditions, providing robust convergence guarantees even without traditional Lipschitz gradient regularity.
  • Empirical evaluations show significant improvements in neural network training and matrix factorization, offering enhanced stability and faster convergence compared to classic methods.

Clipped preconditioned updates are a principled class of optimization algorithms that apply nonlinear preconditioning—specifically, smooth or hard vector-norm clipping—to control the magnitude of descent directions in first-order methods. Originating in the broader framework of nonlinearly preconditioned gradient methods, this approach interprets gradient clipping as a particular dual-preconditioned update with natural convergence guarantees and extended applicability, especially in problems lacking traditional Lipschitz gradient regularity. The technique has been analyzed rigorously under generalized, anisotropic smoothness conditions, showing both stability and convergence benefits in stochastic and deterministic settings, with demonstrable performance gains on neural networks and matrix factorization tasks (Oikonomidis et al., 13 Oct 2025, Oikonomidis et al., 12 Feb 2025).

1. Nonlinear Preconditioning and Clipping Mechanisms

The core of clipped preconditioned updates is the application of a dual mapping derived from a convex “reference” function $\phi:\mathbb{R}^n\to\mathbb{R}$, where the update takes the form:

$$x^{k+1} = x^k - \gamma\, \nabla \phi^*(\lambda\, \nabla f(x^k)),$$

with $\phi^*$ the Fenchel conjugate of $\phi$, $\gamma$ the step size, and $\lambda>0$ a scaling parameter. Choosing $\phi$ so that $\nabla\phi^*$ saturates for large norm arguments yields a smooth clipping effect. For example, with

$$\phi(x) = \varepsilon\left(-\|x\| - \ln(1-\|x\|)\right),\quad x\in\{z:\|z\|<1\},$$

the dual gradient becomes

$$\nabla\phi^*(y) = \frac{y}{\varepsilon+\|y\|},$$

which smoothly caps the descent direction magnitude. Hard clipping is also recovered by other choices of $\phi$, such as $\phi(x) = \tfrac{1}{2}\|x\|^2 + \delta_{\overline{B}(0,1)}(x)$ with $\delta_{\overline{B}(0,1)}$ the indicator function of the closed unit ball, leading to

$$x^{k+1} = x^k - \gamma \min\left\{1, \frac{1}{\|\lambda\, \nabla f(x^k)\|}\right\} \lambda\, \nabla f(x^k).$$

Thus, the preconditioning step generalizes and unifies classical gradient clipping within a convex analysis framework (Oikonomidis et al., 13 Oct 2025, Oikonomidis et al., 12 Feb 2025).
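
The two clipping maps can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the cited authors' code: the helper names (`grad_phi`, `smooth_clip`, `hard_clip`, `preconditioned_step`) are hypothetical, the smooth map is taken as the inverse of $\nabla\phi(x) = \varepsilon x/(1-\|x\|)$, that is $y/(\varepsilon+\|y\|)$, and the hard map is assumed to be the projection onto the closed unit ball. The round-trip check at the end confirms numerically that `smooth_clip` inverts `grad_phi`.

```python
import math

def norm(v):
    """Euclidean norm of a list of floats."""
    return math.sqrt(sum(c * c for c in v))

def grad_phi(x, eps):
    """Gradient of phi(x) = eps * (-||x|| - ln(1 - ||x||)), valid for ||x|| < 1."""
    r = norm(x)
    return [eps * c / (1.0 - r) for c in x]

def smooth_clip(y, eps):
    """Assumed conjugate gradient y / (eps + ||y||); its norm is always below 1."""
    scale = 1.0 / (eps + norm(y))
    return [scale * c for c in y]

def hard_clip(y):
    """Projection onto the closed unit ball: y / max(1, ||y||)."""
    scale = 1.0 / max(1.0, norm(y))
    return [scale * c for c in y]

def preconditioned_step(x, grad, gamma, lam, clip):
    """One update x - gamma * clip(lam * grad), with clip a one-argument map."""
    direction = clip([lam * g for g in grad])
    return [xi - gamma * di for xi, di in zip(x, direction)]

# Round trip: applying smooth_clip to grad_phi(x) should reproduce x.
x = [0.3, -0.4]
roundtrip = smooth_clip(grad_phi(x, eps=2.0), eps=2.0)

# A huge input is squashed to norm below 1 by the smooth map.
big = smooth_clip([1e6, 0.0], eps=2.0)

# One hard-clipped step with a gradient of norm 5.
step = preconditioned_step([1.0, 1.0], [3.0, 4.0], gamma=0.5, lam=1.0, clip=hard_clip)
```

To use the smooth map inside `preconditioned_step`, wrap it so that it takes a single argument, for example `clip=lambda y: smooth_clip(y, 2.0)`.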

2. Generalized Anisotropic Smoothness

Standard analyses of first-order methods typically assume a global Lipschitz condition on the gradient,

$$\|\nabla f(x) - \nabla f(y)\| \le L\,\|x - y\| \quad \text{for all } x, y,$$

which can be restrictive for many practical objectives. Clipped preconditioned updates, however, exploit a weaker anisotropic (or “[...]-convex”) smoothness condition: for convex $\phi$ and [...],

[formula missing]

This formulation captures broader function classes such as those with $(L_0, L_1)$-smoothness, and allows for growth modes in the gradient and Hessian that violate Lipschitz conditions but are still manageable through clipping (Oikonomidis et al., 13 Oct 2025, Oikonomidis et al., 12 Feb 2025).
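
A small numerical experiment illustrates why clipping helps when no global Lipschitz constant exists. The sketch below is an assumption-laden toy, not an experiment from the cited papers: the objective $f(x) = x^4/4$ (whose second derivative $3x^2$ is unbounded), the smooth map $y/(\varepsilon + |y|)$, the parameter values, and the function names are all chosen here for illustration. With the same step size, plain gradient descent blows up from $x^0 = 10$, while the clipped update never moves more than $\gamma$ per step and decreases monotonically toward the minimizer.

```python
def grad_f(x):
    """Derivative of the toy objective f(x) = x**4 / 4."""
    return x * x * x

def plain_gd(x, gamma, steps, blowup=1e6):
    """Plain gradient descent; stops early once |x| exceeds `blowup`."""
    for _ in range(steps):
        x = x - gamma * grad_f(x)
        if abs(x) > blowup:
            break
    return x

def clipped_gd(x, gamma, lam, eps, steps):
    """Clipped preconditioned descent with the smooth map y / (eps + |y|)."""
    for _ in range(steps):
        y = lam * grad_f(x)
        x = x - gamma * y / (eps + abs(y))
    return x

x_plain = plain_gd(10.0, gamma=0.1, steps=50)
x_clip = clipped_gd(10.0, gamma=0.1, lam=1.0, eps=1.0, steps=2000)
```

Here `x_plain` ends far from the origin because the first step already overshoots to $-90$, whereas `x_clip` stays positive and approaches zero slowly, as expected for an objective that is flat near its minimizer.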

A second-order characterization is also available: $f$ is [...]-anisotropically smooth if and only if

[formula missing]

This ensures the (strong) monotonicity of the update map, supporting global convergence arguments.

3. Deterministic and Stochastic Convergence Guarantees

Sublinear convergence rates are established for clipped preconditioned updates under the anisotropic descent inequality. For momentum parameter [...] in the heavy-ball momentum variant, the following bound holds:

[formula missing]

where [...], [...] (Oikonomidis et al., 13 Oct 2025). With a generalized Polyak-Łojasiewicz (PL) condition,

[formula missing]

and if [...] is 2-subhomogeneous, linear convergence is achieved:

[formula missing]

for some [...].

In the stochastic setting, replacing the true gradient with a noisy oracle [...] leads to expected sublinear convergence under suitable “[...]-variance” or standard unbiasedness and variance assumptions. Under strong PL-type conditions and bounded [...]-variance, the expected suboptimality satisfies:

[formula missing]

where [...] quantifies the gradient noise (Oikonomidis et al., 13 Oct 2025).
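
The stochastic variant can be illustrated by feeding a noisy gradient into the same clipped update. The following is only a sketch under assumptions introduced here: the objective $f(x) = x^2/2$, additive Gaussian noise with standard deviation `sigma` (an unbiased oracle), the smooth map $y/(\varepsilon + |y|)$, and all names and parameter values are hypothetical and not taken from the cited analysis.

```python
import random

def noisy_grad(x, sigma, rng):
    """Unbiased oracle for f(x) = x**2 / 2: true gradient x plus Gaussian noise."""
    return x + sigma * rng.gauss(0.0, 1.0)

def stochastic_clipped_gd(x, gamma, lam, eps, sigma, steps, seed=0):
    """Clipped preconditioned update driven by the noisy oracle above."""
    rng = random.Random(seed)  # fixed seed so the run is reproducible
    for _ in range(steps):
        y = lam * noisy_grad(x, sigma, rng)
        x = x - gamma * y / (eps + abs(y))
    return x

x_noisy = stochastic_clipped_gd(5.0, gamma=0.05, lam=1.0, eps=1.0, sigma=0.1, steps=2000)
x_exact = stochastic_clipped_gd(5.0, gamma=0.05, lam=1.0, eps=1.0, sigma=0.0, steps=2000)
```

With `sigma=0.0` the iterate converges to the minimizer, while with noise it settles into a small neighborhood of it whose size shrinks with the step size and the noise level, in line with the qualitative picture of a noise-dependent term in the bound.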

4. Momentum Augmentation and Heavy-Ball Schemes

Momentum can be incorporated into clipped preconditioned updates through a heavy-ball-type scheme. With [...], the update reads:

[formula missing]

Unrolling this recursion, the scheme becomes

[formula missing]

applying heavy-ball momentum directly to the preconditioned gradient mapping. Theoretical guarantees require [...] for sublinear convergence, with empirical results suggesting that larger values can be effective when greater smoothness is present (Oikonomidis et al., 13 Oct 2025).
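
A heavy-ball variant can be sketched as follows. The recursion used here, $x^{k+1} = x^k - \gamma\, \nabla\phi^*(\lambda \nabla f(x^k)) + \beta\,(x^k - x^{k-1})$, is the textbook heavy-ball form applied to the clipped direction and is an assumption of this sketch, as are the symbol $\beta$, the toy objective $f(x) = x^4/4$, the smooth map $y/(\varepsilon + |y|)$, and the parameter values; the cited scheme may differ in its details.

```python
def grad_f(x):
    """Derivative of the toy objective f(x) = x**4 / 4."""
    return x * x * x

def smooth_clip(y, eps):
    """Scalar smooth clipping map y / (eps + |y|)."""
    return y / (eps + abs(y))

def heavy_ball_clipped(x, gamma, lam, eps, beta, steps):
    """Assumed heavy-ball recursion on the clipped, preconditioned gradient."""
    x_prev = x  # no initial velocity
    for _ in range(steps):
        direction = smooth_clip(lam * grad_f(x), eps)
        x_next = x - gamma * direction + beta * (x - x_prev)
        x_prev, x = x, x_next
    return x

x_hb = heavy_ball_clipped(10.0, gamma=0.1, lam=1.0, eps=1.0, beta=0.5, steps=2000)
x_gd = heavy_ball_clipped(10.0, gamma=0.1, lam=1.0, eps=1.0, beta=0.0, steps=2000)
```

Setting `beta=0.0` recovers the momentum-free update, which makes the two runs directly comparable; on this toy problem the momentum run ends closer to the minimizer after the same number of steps.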

5. Parameter Selection and Practical Guidelines

Parameter tuning in clipped preconditioned updates balances the trade-off between step-size, momentum, and clipping strength:

  • Step size $\gamma$: For convex objectives, [...] yields standard [...] rates; for nonconvex cases, [...] for any [...].
  • Momentum parameter: Theoretical sublinear guarantees require [...]; larger values up to 0.9 are possible with stronger smoothness assumptions.
  • Clipping threshold ([...] or [...]): Adjusts the maximum allowable update magnitude. Empirical selection is recommended in [...] by validation; the threshold should be large enough to accommodate the problem’s growth but small enough to prevent instability.

For PL-type linear rates, [...] and [...] should minimize [...] (Oikonomidis et al., 13 Oct 2025).

6. Empirical Performance and Applications

Clipped preconditioned methods have been evaluated on several machine learning tasks:

  • Neural networks: Experiments on MLPs (MNIST), ResNet-18 (CIFAR-10), and ResNet-34 (CIFAR-100) demonstrate faster attainment of low training loss compared to SGD, SGD with momentum, Adam, and other momentum-accelerated baselines.
  • Matrix factorization: On low-rank matrix factorization (MovieLens 100K), heavy-ball clipped preconditioned methods are markedly more stable and converge orders of magnitude faster than classical methods.

Key observations include rapid loss decay, robust performance in nonsmooth regimes, and enhanced stability—features directly stemming from the controlled step magnitudes provided by the dual-preconditioned update mechanism (Oikonomidis et al., 13 Oct 2025).

7. Role of Clipping in Modern Optimization Frameworks

Clipped preconditioned updates generalize and formalize the heuristic practice of gradient clipping. Their benefits include:

  • Adaptive magnitude control: By saturating the preconditioner, the methods prevent explosive steps, especially for pathological or rapidly growing gradients.
  • Built-in majorization and monotonicity: The anisotropic descent inequality provides a surrogate that these methods minimize at each step; associated second-order bounds ensure that the update map is monotone, yielding desirable convergence properties.
  • Extension to non-Lipschitz and unbounded-Hessian problems: The framework is applicable to objectives exhibiting nonstandard growth in gradient or Hessian, where traditional conditions for first-order methods are not met.

A plausible implication is increased robustness in large-scale or ill-conditioned optimization problems typical of deep learning and factorization, mitigating the risks associated with unbounded steps and facilitating application to broader classes of objective functions (Oikonomidis et al., 13 Oct 2025, Oikonomidis et al., 12 Feb 2025).