Generalized Gradient Descent Recursion

Updated 5 December 2025
  • Generalized Gradient Descent Recursion is an iterative numerical optimization framework that replaces the standard update rule with adaptive, geometry- and regularization-sensitive mappings.
  • It leverages generalized smoothness conditions and q-functions to design context-sensitive step-size rules, effectively recovering classical methods like mirror descent and Newton-type updates.
  • The approach offers improved convergence guarantees in both convex and nonconvex settings and extends applicability to complex scenarios including manifold and Bregman-regularized optimization.

A generalized gradient descent recursion refers to any systematic iterative scheme for numerical optimization based on generalized notions of smoothness, geometry, or underlying algebraic structure, wherein the classic update $x_{k+1} = x_k - \gamma_k \nabla f(x_k)$ is replaced by a more intricate mapping that adapts to problem-specific curvature, metric, or regularization. The general framework seeks to unify, extend, and improve upon classical gradient descent by replacing the step-size or direction with context-sensitive, theoretically motivated recursions that may recover mirror descent, natural gradient, Newton-type, or other non-Euclidean flows as special cases. The formalism is applicable to a broad range of settings, including nonconvex, nonsmooth, manifold-valued, composite, and Bregman-regularized optimization.

1. Generalized Smoothness and ℓ-Gradient Descent

Classical gradient descent assumes $L$-smoothness, i.e., $\|\nabla^2 f(x)\|\le L$ for some $L>0$, leading to the canonical step $\gamma_k=1/L$ (Tyurin, 2024). The generalized theory introduces an ℓ-smoothness condition: $\|\nabla^2 f(x)\| \leq \ell(\|\nabla f(x)\|)$ with $\ell:\mathbb{R}_{\ge0}\to(0,\infty)$ nondecreasing, positive, and locally Lipschitz. Choices include

  • $\ell(s)=L$: classical case
  • $\ell(s)=L_0+L_1 s$: $(L_0,L_1)$-smoothness
  • $\ell(s)=L_0+L_1 s^p$ for $p\ge0$: polynomial growth

This assumption allows for adaptive, data-driven adjustment of the step size depending on the local gradient norm.
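The condition is easy to check numerically in one dimension. A minimal sketch (illustrative, not from the paper), using the example $f(x) = e^x + e^{1-x}$ with $\ell(r)=3.3+r$ that appears in Section 6 below:

```python
import math

# f(x) = e^x + e^(1-x) satisfies |f''(x)| <= ell(|f'(x)|) with ell(r) = 3.3 + r,
# since f''(x) - |f'(x)| = 2*min(e^x, e^(1-x)) <= 2*sqrt(e) ~= 3.2974 < 3.3.

def f1(x):   # first derivative
    return math.exp(x) - math.exp(1 - x)

def f2(x):   # second derivative
    return math.exp(x) + math.exp(1 - x)

def ell(r):  # smoothness function with (L0, L1) = (3.3, 1)
    return 3.3 + r

# Spot-check the ell-smoothness condition on a grid of points in [-5, 5].
ok = all(abs(f2(x)) <= ell(abs(f1(x))) for x in
         [i / 10 for i in range(-50, 51)])
print(ok)  # True
```

A grid check of course only probes finitely many points; here the inequality also holds globally by the `min` identity noted in the comment.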

2. One-Dimensional q-Function and Nonquadratic Taylor Bounds

For adaptive step size rules, the key technical device is the "q-function," defined as

$q(s;a) := \int_0^s \frac{dt}{\ell(a+t)}$

with $a\ge0$ and $s\in[0,q_{\max}(a))$, where $q_{\max}(a) := \int_0^\infty \frac{dt}{\ell(a+t)}$. Its inverse $q^{-1}(\cdot;a)$ is strictly increasing and $C^1$.
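For the $(L_0,L_1)$ case $\ell(s)=L_0+L_1 s$, both $q$ and its inverse have elementary closed forms; the following sketch (with illustrative constants) checks a numerical quadrature against them:

```python
import math

L0, L1 = 1.0, 2.0            # hypothetical (L0, L1)-smoothness constants
ell = lambda s: L0 + L1 * s

def q(s, a, n=100_000):
    """q(s; a) = integral_0^s dt / ell(a + t), by midpoint quadrature."""
    h = s / n
    return h * sum(1.0 / ell(a + (i + 0.5) * h) for i in range(n))

def q_closed(s, a):
    """Closed form for ell(s) = L0 + L1*s."""
    return math.log1p(L1 * s / (L0 + L1 * a)) / L1

def q_inv(r, a):
    """Inverse of q(.; a): solve q(s; a) = r for s."""
    return (L0 + L1 * a) * math.expm1(L1 * r) / L1

s, a = 3.0, 0.5
assert abs(q(s, a) - q_closed(s, a)) < 1e-8   # quadrature matches closed form
assert abs(q_inv(q_closed(s, a), a) - s) < 1e-12  # inverse round-trips
```

Note that for this $\ell$, $q_{\max}(a)=\infty$ (the integral diverges logarithmically), so $q^{-1}(\cdot;a)$ is defined for all nonnegative arguments.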

Central consequences:

  • Generalized Lipschitz bound on the gradient difference: $\|\nabla f(y) - \nabla f(x)\| \leq q^{-1}(\|y-x\|;\|\nabla f(x)\|)$ for $\|y-x\|\le q_{\max}(\|\nabla f(x)\|)$.
  • Generalized upper bound for function values: $f(y) \leq f(x) + \langle \nabla f(x), y-x\rangle + \|y-x\|\int_0^1 q^{-1}(t\|y-x\|;\|\nabla f(x)\|)\,dt$

These bounds reduce to the standard quadratic model when $\ell$ is constant (Tyurin, 2024).

3. Derivation of Generalized Gradient Descent Recursion

At iteration $k$, the optimal update in the direction $h^*=-\nabla f(x_k)/\|\nabla f(x_k)\|$ with step length $t^* = q(\|\nabla f(x_k)\|; \|\nabla f(x_k)\|)$ yields the general update $x_{k+1} = x_k - \gamma_k \nabla f(x_k)$ with

$\gamma_k = \int_0^1 \frac{dv}{\ell(\|\nabla f(x_k)\| + v\|\nabla f(x_k)\|)}$

Bounding $\ell$ shows $1/\ell(2\|\nabla f\|)\leq \gamma_k \leq 1/\ell(\|\nabla f\|)$, ensuring the method interpolates between aggressive and conservative step sizes depending on the local gradient scale (Tyurin, 2024).
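A minimal sketch of the step-size computation, assuming $\ell(s)=L_0+L_1 s$ with illustrative constants: the quadrature is checked against the closed form of the integral and against the sandwich bounds above.

```python
import math

L0, L1 = 1.0, 2.0                 # hypothetical smoothness constants
ell = lambda s: L0 + L1 * s

def gamma(g, n=100_000):
    """gamma_k = integral_0^1 dv / ell(g + v*g), by midpoint quadrature."""
    h = 1.0 / n
    return h * sum(1.0 / ell(g * (1 + (i + 0.5) * h)) for i in range(n))

g = 4.0                           # example gradient norm ||grad f(x_k)||
gk = gamma(g)

# Closed form for ell(s) = L0 + L1*s (substitute u = L0 + L1*g*(1+v)):
closed = math.log((L0 + 2 * L1 * g) / (L0 + L1 * g)) / (L1 * g)
assert abs(gk - closed) < 1e-8

# Sandwich bounds 1/ell(2g) <= gamma_k <= 1/ell(g): the integrand lies
# between those values for v in [0, 1] since ell is nondecreasing.
assert 1 / ell(2 * g) <= gk <= 1 / ell(g)
```

For constant $\ell(s)=L$ the same integral collapses to $\gamma_k = 1/L$, recovering the classical step.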

4. Convergence Theory: Nonconvex and Convex Settings

Nonconvex scenario: If $f$ is bounded below, one obtains (Theorem 7.1) the per-step decrease $f(x_{k+1}) \leq f(x_k) - \gamma_k \|\nabla f(x_k)\|^2$. Summed over $T$ steps, this yields

$\min_{0\leq k < T} \frac{\|\nabla f(x_k)\|^2}{\ell(2\|\nabla f(x_k)\|)} \leq \frac{4\Delta}{T}$

with $\Delta = f(x_0)-f^*$. For invertible $s\mapsto s/\ell(2s)$ this recovers the $O(1/T)$ rate in squared gradient norm (Tyurin, 2024).

Convex scenario: For a minimizer $x^*$ and $R=\|x_0-x^*\|$, one obtains $\min_{0\leq k \leq T} \frac{f(x_k) - f(x^*)}{\ell(2\|\nabla f(x_k)\|)} \leq \frac{R^2}{T+1}$; a second, independent two-phase proof, which exploits monotonicity of $\|\nabla f(x_k)\|$ and requires no invertibility of $\ell$, attains the same optimal rates (Tyurin, 2024).

5. Special Cases and Recovery of Classical Schemes

The generalized recursion specializes to classical first-order step-size schemes:

  • $\ell(s)=L$: recovers standard GD with step size $1/L$
  • $\ell(s)=L_0+L_1 s$: clipped step size
  • $\ell(s)=L_0+L_1 s^p$ with $0\le p<2$: $O(1/T)$ rates even in previously intractable "superquadratic" smoothness regimes

Thus, the methodology smoothly interpolates between established methods according to the growth of the Hessian, offering new guarantees where previous approaches failed, e.g., in the case $p=2$ (Tyurin, 2024).

6. Illustrative Examples

Numerically, the generalized recursion outperforms classical schemes in settings where $\ell$ grows rapidly (Tyurin, 2024):

  • $f(x) = -\log x$ with $\ell(r) = L_0 + L_1 r^2$: the $\ell$-GD step converges in tens of iterations where the step $1/(L_0+L_1 r)$ diverges.
  • $f(x) = e^x + e^{1-x}$, $\ell(r)=3.3+r$: $\ell$-GD converges in $\leq 20$ steps versus $>200$ for classical rules.
  • For $\ell(r)=L_0+L_1 r^p$ ($0\leq p\leq2$): improved rates, plus an extension to the otherwise pathological $p>2$ case when gradients are bounded.
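The second example can be reproduced with a short script (a sketch, not the paper's code, using the closed-form step size that the integral admits for $\ell(r)=3.3+r$):

```python
import math

# ell-GD on f(x) = e^x + e^(1-x) with ell(r) = 3.3 + r; minimizer x* = 0.5.
# For this ell the step-size integral has a closed form:
#   gamma(g) = integral_0^1 dv / (3.3 + g*(1+v)) = ln((3.3 + 2g)/(3.3 + g)) / g

def grad(x):
    return math.exp(x) - math.exp(1 - x)

def gamma(g):
    return math.log((3.3 + 2 * g) / (3.3 + g)) / g

x, iters = -5.0, 0                # start far from the minimizer
while iters < 100:
    d = grad(x)
    if abs(d) < 1e-10:            # stop when the gradient is tiny
        break
    x -= gamma(abs(d)) * d        # generalized gradient descent update
    iters += 1

assert abs(x - 0.5) < 1e-8        # converged, well under the iteration cap
```

For large gradients the step length $\gamma\|\nabla f\|$ here saturates near $\ln 2$, which is why the iterate marches steadily toward the minimizer instead of diverging the way a fixed-step rule tuned to the flat region would.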

7. Context in General Optimization and Relation to Other Frameworks

The generalized recursion fits within broader optimization frameworks:

  • General cost-geometry (optimal transport, mirror descent): Surrogate minimization schemes in which a generic "cost" $C(x, y)$ replaces the quadratic proximity term, with the next iterate chosen as the minimizer of the linearized model plus $C(x_k, y)$ (Léger et al., 2023).
  • Natural and Riemannian gradient descent: The special case where $C$ corresponds to geodesic distance or a Hessian-induced metric, yielding updates in the pullback metric or local manifold geometry (Dong et al., 2022).
  • Bregman distance: The update can be viewed as minimization of a first-order model plus a Bregman divergence, generalizing the Euclidean metric and linking to mirror descent and entropic methods (Benning et al., 2016; Benning et al., 2017).
  • Discrete Hamilton–Jacobi dynamics: Certain preconditioners (e.g., Laplacian smoothing) correspond exactly to GD on a more convex surrogate functional that shares the same minima but has improved optimization geometry (Osher et al., 2018).
  • High-level algebraic frameworks: Abstract categorical approaches model gradient descent as a functor on categories of optimization problems, enabling parallel and distributed generalized recursions (Hanks et al., 2024).

Table: Step-Size Rules in Generalized Gradient Descent

$\ell(s)$ choice | Generalized step $\gamma_k$ | Classical limit / method
$\ell(s)=L$ | $1/L$ | Vanilla GD
$\ell(s)=L_0+L_1 s$ | $\int_0^1 \frac{dv}{L_0+L_1 [\|\nabla f(x_k)\|+v\|\nabla f(x_k)\|]}$ | Clipped/variable step, $(L_0,L_1)$-smooth (Tyurin, 2024)
$\ell(s)=L_0+L_1 s^p$ | $\int_0^1 \frac{dv}{L_0+L_1 [\|\nabla f(x_k)\|+v\|\nabla f(x_k)\|]^p}$ | $O(1/T)$ for $p<2$, new results for $p\geq2$

This table highlights that the generalized update mechanism provides a structured, theoretically sound means of adapting first-order optimization recursions to local problem geometry.
