
Generalized Gradient Descent Recursion

Updated 5 December 2025
  • Generalized Gradient Descent Recursion is an iterative numerical optimization framework that replaces the standard update rule with adaptive, geometry- and regularization-sensitive mappings.
  • It leverages generalized smoothness conditions and q-functions to design context-sensitive step-size rules, effectively recovering classical methods like mirror descent and Newton-type updates.
  • The approach offers improved convergence guarantees in both convex and nonconvex settings and extends applicability to complex scenarios including manifold and Bregman-regularized optimization.

A generalized gradient descent recursion refers to any systematic iterative scheme for numerical optimization based on generalized notions of smoothness, geometry, or underlying algebraic structure, wherein the classic update $x_{k+1} = x_k - \gamma_k \nabla f(x_k)$ is replaced by a more intricate mapping that adapts to problem-specific curvature, metric, or regularization. The general framework seeks to unify, extend, and improve upon classical gradient descent by replacing the step size or direction with context-sensitive, theoretically motivated recursions that may recover mirror descent, natural gradient, Newton-type, or other non-Euclidean flows as special cases. The formalism is applicable to a broad range of settings, including nonconvex, nonsmooth, manifold-valued, composite, and Bregman-regularized optimization.

1. Generalized Smoothness and ℓ-Gradient Descent

Classical gradient descent assumes $L$-smoothness, i.e., $\|\nabla^2 f(x)\| \le L$ for some $L>0$, leading to the canonical step $\gamma_k = 1/L$ (Tyurin, 16 Dec 2024). The generalized theory introduces an $\ell$-smoothness condition, $\|\nabla^2 f(x)\| \leq \ell(\|\nabla f(x)\|)$, with $\ell:\mathbb{R}_{\ge 0}\to(0,\infty)$ nondecreasing, positive, and locally Lipschitz. Choices include

  • $\ell(s)=L$: classical case
  • $\ell(s)=L_0+L_1 s$: $(L_0,L_1)$-smoothness
  • $\ell(s)=L_0+L_1 s^p$ for $p\ge 0$: polynomial growth

This assumption allows for adaptive, data-driven adjustment of the step size depending on the local gradient norm.
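
As a concrete illustration (not code from the source), these choices of $\ell$ can be encoded as simple callables; the constants $L_0$, $L_1$, and $p$ below are placeholder values.

```python
# Illustrative choices of the generalized smoothness function ell(s),
# each mapping a gradient norm s >= 0 to a positive local curvature bound.
# The constants below are placeholders, not values from the source.

def ell_classical(s, L=1.0):
    """Classical L-smoothness: ell(s) = L (constant)."""
    return L

def ell_affine(s, L0=1.0, L1=0.5):
    """(L0, L1)-smoothness: ell(s) = L0 + L1 * s."""
    return L0 + L1 * s

def ell_polynomial(s, L0=1.0, L1=0.5, p=2.0):
    """Polynomial growth: ell(s) = L0 + L1 * s**p, p >= 0."""
    return L0 + L1 * s ** p
```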

2. One-Dimensional q-Function and Nonquadratic Taylor Bounds

For adaptive step size rules, the key technical device is the "q-function," defined as

$q(s;a) := \int_0^s \frac{dt}{\ell(a+t)}$

with $a\ge 0$ and $s\in[0,q_{\max}(a))$, where $q_{\max}(a) := \int_0^\infty \frac{dt}{\ell(a+t)}$. Its inverse $q^{-1}(\cdot\,;a)$ is strictly increasing and $C^1$.
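
Numerically, $q(\cdot\,;a)$ and its inverse can be evaluated with standard quadrature and bracketing root-finding. The sketch below is illustrative only: the choice of $\ell$, the constants, and the bracket bound are assumptions, not part of the source.

```python
from scipy.integrate import quad
from scipy.optimize import brentq

def ell(s, L0=1.0, L1=0.5):
    """Assumed generalized smoothness function ell(s) = L0 + L1 * s."""
    return L0 + L1 * s

def q(s, a):
    """q(s; a) = integral_0^s dt / ell(a + t), evaluated by adaptive quadrature."""
    val, _ = quad(lambda t: 1.0 / ell(a + t), 0.0, s)
    return val

def q_inv(r, a, upper=1e3):
    """Inverse of s -> q(s; a): solve q(s; a) = r by bracketing on [0, upper].
    The bracket bound `upper` is assumed large enough for this example."""
    return brentq(lambda s: q(s, a) - r, 0.0, upper)

# Sanity check: q_inv undoes q (here a = 2, s = 3).
a, s = 2.0, 3.0
assert abs(q_inv(q(s, a), a) - s) < 1e-6
```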

Central consequences:

  • Generalized Lipschitz bound on the gradient difference: $\|\nabla f(y) - \nabla f(x)\| \leq q^{-1}(\|y-x\|;\|\nabla f(x)\|)$ for $\|y-x\| \le q_{\max}(\|\nabla f(x)\|)$.
  • Generalized upper bound for function values: $f(y) \leq f(x) + \langle \nabla f(x), y-x\rangle + \|y-x\|\int_0^1 q^{-1}(t\|y-x\|;\|\nabla f(x)\|)\,dt$.

These bounds reduce to the standard quadratic model when $\ell$ is constant (Tyurin, 16 Dec 2024).
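
As a quick consistency check, the constant case $\ell \equiv L$ gives $q(s;a) = s/L$ and $q^{-1}(r;a) = Lr$, so the generalized function-value bound collapses to the familiar descent lemma:

```latex
f(y) \le f(x) + \langle \nabla f(x), y-x\rangle
      + \|y-x\| \int_0^1 q^{-1}\!\bigl(t\|y-x\|;\|\nabla f(x)\|\bigr)\,dt
    = f(x) + \langle \nabla f(x), y-x\rangle
      + \|y-x\| \int_0^1 L\,t\,\|y-x\|\,dt
    = f(x) + \langle \nabla f(x), y-x\rangle + \frac{L}{2}\|y-x\|^2 .
```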

3. Derivation of Generalized Gradient Descent Recursion

At iteration $k$, the optimal update in the direction $h^* = -\nabla f(x_k)/\|\nabla f(x_k)\|$ with step length $t^* = q(\|\nabla f(x_k)\|; \|\nabla f(x_k)\|)$ yields the general update $x_{k+1} = x_k - \gamma_k \nabla f(x_k)$ with

$\gamma_k = \int_0^1 \frac{dv}{\ell(\|\nabla f(x_k)\| + v\|\nabla f(x_k)\|)}$

Bounding $\ell$ shows $1/\ell(2\|\nabla f(x_k)\|) \leq \gamma_k \leq 1/\ell(\|\nabla f(x_k)\|)$, ensuring the method interpolates between aggressive and conservative step sizes depending on the local gradient scale (Tyurin, 16 Dec 2024).
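
A minimal sketch of a single $\ell$-GD update, with $\gamma_k$ evaluated by numerical quadrature; the quadratic objective and the affine choice of $\ell$ are illustrative assumptions, not fixed by the source.

```python
import numpy as np
from scipy.integrate import quad

def ell(s, L0=1.0, L1=0.5):
    """Assumed generalized smoothness function ell(s) = L0 + L1 * s."""
    return L0 + L1 * s

def ell_gd_step(x, grad_f):
    """One step x_{k+1} = x_k - gamma_k * grad_f(x_k), with
    gamma_k = int_0^1 dv / ell(||g|| + v * ||g||) and g = grad_f(x_k)."""
    g = grad_f(x)
    gnorm = np.linalg.norm(g)
    if gnorm == 0.0:
        return x  # already stationary
    gamma, _ = quad(lambda v: 1.0 / ell(gnorm + v * gnorm), 0.0, 1.0)
    return x - gamma * g

# Toy example: f(x) = 0.5 * ||x||^2, so grad_f(x) = x and the minimizer is 0.
x = np.array([3.0, -4.0])
for _ in range(50):
    x = ell_gd_step(x, lambda z: z)
print(x)  # approaches the origin
```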

4. Convergence Theory: Nonconvex and Convex Settings

Nonconvex scenario: If $f$ is bounded below, one obtains (Theorem 7.1) the descent inequality $f(x_{k+1}) \leq f(x_k) - \gamma_k \|\nabla f(x_k)\|^2$. Summed over $T$ steps, this yields

$\min_{0\leq k < T} \frac{\|\nabla f(x_k)\|^2}{\ell(2\|\nabla f(x_k)\|)} \leq \frac{4\Delta}{T}$

with $\Delta = f(x_0)-f^*$. For invertible $s\mapsto s/\ell(2s)$ this recovers the $O(1/T)$ rate in squared gradient norm (Tyurin, 16 Dec 2024).
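
The summation step is the usual telescoping argument. Combining the descent inequality with the lower bound $\gamma_k \ge 1/\ell(2\|\nabla f(x_k)\|)$ from Section 3 gives a $\Delta/T$ bound, which in particular implies the stated $4\Delta/T$ estimate (whose looser constant reflects the constants used in the source theorem):

```latex
\sum_{k=0}^{T-1} \frac{\|\nabla f(x_k)\|^2}{\ell(2\|\nabla f(x_k)\|)}
  \;\le\; \sum_{k=0}^{T-1} \gamma_k \|\nabla f(x_k)\|^2
  \;\le\; \sum_{k=0}^{T-1} \bigl( f(x_k) - f(x_{k+1}) \bigr)
  \;=\; f(x_0) - f(x_T) \;\le\; \Delta ,
\quad\Longrightarrow\quad
\min_{0 \le k < T} \frac{\|\nabla f(x_k)\|^2}{\ell(2\|\nabla f(x_k)\|)} \;\le\; \frac{\Delta}{T} .
```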

Convex scenario: For a minimizer $x^*$ and $R=\|x_0-x^*\|$, two independent proofs establish $\min_{0\leq k \leq T} \frac{f(x_k) - f(x^*)}{\ell(2\|\nabla f(x_k)\|)} \leq \frac{R^2}{T+1}$: one direct, and one via a two-phase argument (without invertibility of $\ell$) that exploits monotonicity of $\|\nabla f(x_k)\|$; both yield optimal rates (Tyurin, 16 Dec 2024).

5. Special Cases and Recovery of Classical Schemes

The generalized recursion specializes to the classical first-order step-size schemes:

  • $\ell(s)=L$: Recovers standard GD with step size $1/L$
  • $\ell(s)=L_0+L_1 s$: Clipped step size
  • $\ell(s)=L_0+L_1 s^p$ with $0\le p<2$: $O(1/T)$ rates even in previously intractable "superquadratic" smoothness regimes

Thus, the methodology smoothly interpolates between established methods according to the growth of the Hessian, offering new guarantees where previous approaches failed, e.g., in the case $p=2$ (Tyurin, 16 Dec 2024).

6. Illustrative Examples

Numerically, the generalized recursion outperforms classical schemes in settings where $\ell$ grows rapidly (Tyurin, 16 Dec 2024):

  • $f(x) = -\log x$ with $\ell(r) = L_0 + L_1 r^2$: the $\ell$-GD step converges in tens of iterations where the step $1/(L_0+L_1 r)$ diverges.
  • $f(x) = e^x + e^{1-x}$ with $\ell(r)=3.3+r$: $\ell$-GD converges in $\leq 20$ steps versus $>200$ for classical rules (see the sketch after this list).
  • For $\ell(r)=L_0+L_1 r^p$ ($0\leq p\leq 2$): improved rates, and an extension to the otherwise pathological $p>2$ case when gradients are bounded.
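
The exponential example is easy to reproduce. The sketch below runs $\ell$-GD on $f(x) = e^x + e^{1-x}$ with $\ell(r) = 3.3 + r$, using the closed form $\gamma_k = \ln\bigl((3.3 + 2g)/(3.3 + g)\bigr)/g$ of the step-size integral, where $g = |f'(x_k)|$; the starting point $x_0 = 10$ is an assumption for illustration, not taken from the source.

```python
import math

def f(x):
    return math.exp(x) + math.exp(1.0 - x)

def grad(x):
    return math.exp(x) - math.exp(1.0 - x)

def ell_gd_step(x):
    """One ell-GD step for ell(r) = 3.3 + r, using the closed-form gamma_k."""
    g = abs(grad(x))
    if g == 0.0:
        return x
    gamma = math.log((3.3 + 2.0 * g) / (3.3 + g)) / g
    return x - gamma * grad(x)

x = 10.0  # assumed starting point (illustrative)
for _ in range(20):
    x = ell_gd_step(x)
print(x, f(x))  # x approaches the minimizer 0.5, f(x) approaches 2*sqrt(e) ~ 3.297
```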

7. Context in General Optimization and Relation to Other Frameworks

The generalized recursion fits within broader optimization frameworks:

  • General cost-geometry (optimal transport, mirror descent): Surrogate minimization schemes where a generic "cost" $C(x, y)$ replaces the quadratic proximity, with the next iterate chosen as the minimizer of the linearized model plus $C(x_k, y)$ (Léger et al., 2023); see the sketch after this list.
  • Natural and Riemannian gradient descent: The special case where $C$ corresponds to geodesic distance or a Hessian-induced metric, yielding updates in the pullback metric or local manifold geometry (Dong et al., 2022).
  • Bregman distance: The update can be viewed as minimization of a first-order model plus a Bregman divergence, generalizing the Euclidean metric and linking to mirror descent and entropic methods (Benning et al., 2016; Benning et al., 2017).
  • Discrete Hamilton–Jacobi dynamics: Certain preconditioners (e.g., Laplacian smoothing) correspond exactly to GD on a more convex surrogate functional, sharing the same minima but with improved optimization geometry (Osher et al., 2018).
  • High-level algebraic frameworks: Abstract categorical approaches model gradient descent as a functor on categories of optimization problems, enabling parallel and distributed generalised recursions (Hanks et al., 28 Mar 2024).
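
As one concrete instance of the cost-surrogate viewpoint in the first bullet, taking $C(x_k, y)$ to be the entropic Bregman (KL) divergence on the probability simplex turns the surrogate minimization into the familiar multiplicative mirror-descent update. The sketch below is illustrative; the linear objective and step size are assumptions.

```python
import numpy as np

def mirror_descent_step(x, grad, gamma=0.5):
    """Minimizer of <grad, y> + (1/gamma) * KL(y || x) over the probability
    simplex, i.e. the entropic mirror-descent (multiplicative) update."""
    y = x * np.exp(-gamma * grad)
    return y / y.sum()

# Illustrative objective on the simplex: f(x) = <c, x> with a fixed cost vector c.
c = np.array([3.0, 1.0, 2.0])
x = np.ones(3) / 3.0
for _ in range(100):
    x = mirror_descent_step(x, c)
print(x)  # mass concentrates on the cheapest coordinate (index 1)
```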

Table: Step-Size Rules in Generalized Gradient Descent (writing $g_k := \|\nabla f(x_k)\|$)

| $\ell(s)$ choice | Generalized step $\gamma_k$ | Classical limit / method |
|---|---|---|
| $\ell(s)=L$ | $1/L$ | Vanilla GD |
| $\ell(s)=L_0+L_1 s$ | $\int_0^1 \frac{dv}{L_0+L_1 (g_k+v g_k)}$ | Clipped/variable step, $(L_0,L_1)$-smooth (Tyurin, 16 Dec 2024) |
| $\ell(s)=L_0+L_1 s^p$ | $\int_0^1 \frac{dv}{L_0+L_1 (g_k+v g_k)^p}$ | $O(1/T)$ for $p<2$, new results for $p\geq 2$ |

This table highlights that the generalized update mechanism provides a structured, theoretically sound means of adapting first-order optimization recursions to local problem geometry.
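
For the affine row, the step-size integral has a simple closed form, $\gamma_k = \frac{1}{L_1 g_k}\ln\frac{L_0 + 2L_1 g_k}{L_0 + L_1 g_k}$; the short check below (with illustrative constants) verifies this against numerical quadrature.

```python
import math
from scipy.integrate import quad

def gamma_affine_closed_form(g, L0=1.0, L1=0.5):
    """Closed form of int_0^1 dv / (L0 + L1*(g + v*g)) for ell(s) = L0 + L1*s."""
    return math.log((L0 + 2.0 * L1 * g) / (L0 + L1 * g)) / (L1 * g)

def gamma_affine_quadrature(g, L0=1.0, L1=0.5):
    """Same quantity evaluated by adaptive quadrature."""
    val, _ = quad(lambda v: 1.0 / (L0 + L1 * (g + v * g)), 0.0, 1.0)
    return val

g = 4.0  # illustrative gradient norm
assert abs(gamma_affine_closed_form(g) - gamma_affine_quadrature(g)) < 1e-8
```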
