
Proximal Gradient Descent: Theory & Applications

Updated 30 June 2026
  • Proximal Gradient Descent is an optimization framework that decomposes problems into a smooth part handled by gradient descent and a nonsmooth part managed via a proximal operator.
  • It provides provable convergence guarantees, including O(1/k) sublinear rates for convex functions and linear convergence under strong convexity, with accelerated variants available.
  • Practical extensions such as adaptive stepsizes, plug-and-play techniques, and unrolled deep networks enhance its applicability in signal processing, imaging, and machine learning.

Proximal Gradient Descent (PGD) is a fundamental algorithmic framework for the numerical solution of optimization problems that decompose into the sum of a differentiable (often smooth) function and a possibly nonsmooth regularizer. This paradigm encompasses a wide range of problems in signal processing, statistics, machine learning, inverse problems, and more, accommodating both convex and certain classes of nonconvex objectives. PGD operates by alternating between an explicit gradient step on the smooth part and a proximal (implicit) step on the nonsmooth component, effectively generalizing classical gradient descent to the composite setting and enabling tractable treatment of structured regularization.

1. Mathematical Formulation and Algorithmic Structure

PGD targets optimization problems of the form

$$\min_{x \in \mathbb{R}^n} \; F(x) := f(x) + g(x),$$

where $f$ is differentiable with $L$-Lipschitz continuous gradient, and $g$ is a proper, closed, and possibly nonsmooth function. The PGD update at iteration $k$ is

$$x_{k+1} = \mathrm{prox}_{\gamma_k g}\big(x_k - \gamma_k \nabla f(x_k)\big), \qquad \mathrm{prox}_{\lambda g}(z) := \arg\min_u \left\{ g(u) + \frac{1}{2\lambda} \|u - z\|^2 \right\},$$

where $\gamma_k > 0$ is a stepsize, often chosen with $0 < \gamma_k \le 1/L$ to ensure descent and stability (Pong, 2013, Nikolovski et al., 2024). The proximal operator enables handling constraints and regularizers such as $\ell_1$, group sparsity, total variation, indicator functions of convex sets, or separable nonconvex penalties.
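The update above can be sketched in a few lines. The following is a minimal illustration, not a reference implementation: it assumes a toy lasso problem with $f(x) = \tfrac12\|Ax - b\|^2$ and $g(x) = \lambda\|x\|_1$, where $A$ is diagonal so that $L$ is known exactly. The names `proximal_gradient`, `soft_threshold`, `grad_f`, and `prox_g`, and all numerical values, are hypothetical choices made for this example.

```python
import math


def soft_threshold(z, tau):
    """Prox of tau * ||.||_1: shrink each coordinate toward zero by tau."""
    return [math.copysign(max(abs(zi) - tau, 0.0), zi) for zi in z]


def proximal_gradient(grad_f, prox_g, x0, step, iters=200):
    """Iterate x <- prox_{step * g}(x - step * grad_f(x)) with a constant step."""
    x = list(x0)
    for _ in range(iters):
        g = grad_f(x)
        x = prox_g([xi - step * gi for xi, gi in zip(x, g)], step)
    return x


# Toy lasso with diagonal A (assumed data): L = max(a_i^2) = 4.
a = [1.0, 2.0]
b = [3.0, 0.1]
lam = 0.5


def grad_f(x):
    """Gradient of 0.5 * sum((a_i * x_i - b_i)^2)."""
    return [ai * (ai * xi - bi) for ai, xi, bi in zip(a, x, b)]


def prox_g(z, step):
    """Prox of step * lam * ||.||_1."""
    return soft_threshold(z, step * lam)


x_hat = proximal_gradient(grad_f, prox_g, [0.0, 0.0], step=1.0 / 4.0)
print(x_hat)
```

Because the toy problem is separable, its minimizer can be checked by hand: the first coordinate is shrunk from 3 to 2.5, and the second is set exactly to zero because its gradient at zero is smaller in magnitude than $\lambda$.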

The method extends to the composite case $f(x) + P(Ax+b)$, where $P$ admits a simple prox but the composition $x \mapsto P(Ax+b)$ does not; there, specialized dual or primal-dual methods such as Proximal-Proximal Gradient (PPG) or other splitting methods may be required (Pong, 2013).

2. Theoretical Guarantees and Convergence Rates

For convex $f$ and $g$, and constant stepsize $\gamma = 1/L$, classical PGD ensures

$$F(x_k) - F(x^\star) \le \frac{L \|x_0 - x^\star\|^2}{2k},$$

where $x^\star$ is any minimizer of $F$, implying an $O(1/k)$ rate in objective (Nikolovski et al., 2024, Pong, 2013, Salim et al., 2020). Under additional $\mu$-strong convexity of $f$, PGD achieves linear convergence, for instance $\|x_k - x^\star\|^2 \le (1 - \mu/L)^k \|x_0 - x^\star\|^2$. Accelerated variants (FISTA) provably attain $O(1/k^2)$. Recent work further advances the rates achievable by variable stepsize methods; the “silver stepsize schedule” yields an improved rate $O(1/k^{\log_2 \rho})$, where $\rho = 1 + \sqrt{2}$ is the silver ratio (so $\log_2 \rho \approx 1.27$), outperforming classical constant-step PGD but not reaching the $O(1/k^2)$ rate attained with Nesterov acceleration (Bok et al., 2024).
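To make the accelerated variant concrete, here is a hedged sketch of FISTA in its standard form (a proximal gradient step taken at an extrapolated point, followed by a momentum update). It assumes a small separable lasso problem with known $L = 4$; the function names and data are hypothetical and chosen only so the result can be checked by hand.

```python
import math


def soft_threshold(z, tau):
    """Prox of tau * ||.||_1, applied coordinate-wise."""
    return [math.copysign(max(abs(zi) - tau, 0.0), zi) for zi in z]


def fista(grad_f, prox_g, x0, step, iters=200):
    """FISTA: proximal gradient step at an extrapolated point y, then momentum."""
    x_prev, y, t = list(x0), list(x0), 1.0
    for _ in range(iters):
        g = grad_f(y)
        x = prox_g([yi - step * gi for yi, gi in zip(y, g)], step)
        t_next = 0.5 * (1.0 + math.sqrt(1.0 + 4.0 * t * t))
        beta = (t - 1.0) / t_next  # momentum weight, grows toward 1
        y = [xi + beta * (xi - xpi) for xi, xpi in zip(x, x_prev)]
        x_prev, t = x, t_next
    return x_prev


# Assumed toy data: f(x) = 0.5 * sum((a_i x_i - b_i)^2), g(x) = lam * ||x||_1.
a, b, lam = [1.0, 2.0], [3.0, 0.1], 0.5
L = 4.0  # max(a_i^2)


def grad_f(x):
    return [ai * (ai * xi - bi) for ai, xi, bi in zip(a, x, b)]


def prox_g(z, step):
    return soft_threshold(z, step * lam)


def objective(x):
    smooth = 0.5 * sum((ai * xi - bi) ** 2 for ai, xi, bi in zip(a, x, b))
    return smooth + lam * sum(abs(xi) for xi in x)


x_fista = fista(grad_f, prox_g, [0.0, 0.0], step=1.0 / L, iters=200)
print(x_fista, objective(x_fista))
```

For this problem the minimizer is $(2.5, 0)$ with optimal value $1.38$, so the objective gap after $k$ iterations can be compared directly against the classical FISTA bound $2L\|x_0 - x^\star\|^2/(k+1)^2$.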

For weakly convex or nonconvex objectives, convergence is to stationary points, with explicit sublinear rate guarantees for suitable parameter regimes (Hurault et al., 2023, Rotaru et al., 6 Mar 2025). The tightest one-step decrease characterizations are now available via DCA-based analysis, which refines constants and parameter selection beyond the classical analysis (Rotaru et al., 6 Mar 2025).

PGD has also been analyzed in infinite-dimensional Wasserstein spaces for measure optimization, where the analogous forward–backward scheme retains $O(1/k)$ suboptimality and linear metric convergence under strong convexity (Salim et al., 2020).

3. Proximal Mapping, Implementation, and Extensions

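For many regularizers, $\mathrm{prox}_{\lambda g}$ is computable in closed form or admits fast routines. For $\ell_1$-norm regularization, the proximal operator is the soft-thresholding mapping

$$\big[\mathrm{prox}_{\lambda \|\cdot\|_1}(z)\big]_i = \mathrm{sign}(z_i)\,\max\{|z_i| - \lambda,\, 0\}.$$

For indicator functions of closed convex sets, the proximal operator reduces to the Euclidean projection onto the feasible set.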
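The two closed-form cases above can be written down directly. This is a small sketch with hypothetical helper names (`prox_l1`, `project_box`, `prox_objective`); the box constraint is only one assumed example of a convex feasible set, and `prox_objective` evaluates the scalar objective $|u| + \frac{1}{2\lambda}(u - z)^2$ from the prox definition so the closed form can be sanity-checked numerically.

```python
import math


def prox_l1(z, lam):
    """Soft-thresholding: prox of lam * ||.||_1 at z (coordinate-wise)."""
    return [math.copysign(max(abs(zi) - lam, 0.0), zi) for zi in z]


def project_box(z, lo, hi):
    """Prox of the indicator of the box [lo, hi]^n, i.e. Euclidean projection."""
    return [min(max(zi, lo), hi) for zi in z]


def prox_objective(u, z, lam):
    """Scalar objective |u| + (u - z)^2 / (2 * lam) minimized by the l1 prox."""
    return abs(u) + (u - z) ** 2 / (2.0 * lam)


z = [3.0, -0.2, 0.5]
print(prox_l1(z, 0.5))           # entries below the threshold collapse to zero
print(project_box(z, 0.0, 1.0))  # entries are clipped to [0, 1]

# Sanity check: the soft-thresholded value beats nearby candidates.
u_star = prox_l1([3.0], 0.5)[0]
print(prox_objective(u_star, 3.0, 0.5),
      prox_objective(u_star - 0.1, 3.0, 0.5),
      prox_objective(u_star + 0.1, 3.0, 0.5))
```

Both operators cost one pass over the vector, which is why these regularizers pair so naturally with the gradient step.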

Modern implementations often adapt the basic PGD scheme:

  • Variable and adaptive stepsizes: Per-iteration local curvature estimates can yield step selections superior to the global $1/L$, often halving iteration counts and wall times (Malitsky et al., 2023, Nikolovski et al., 2024).
  • Plug-and-play and learned prox: Replacing the explicit proximal operator with a deep or learned denoiser (one that takes the form of a proximal operator of an implicit regularizer) yields effective “PnP-PGD” methods, with provable sublinear convergence (even under prior mismatch and for nonconvex implicit regularizers) under weak contractivity assumptions (Hurault et al., 2023, Xu et al., 14 Jan 2026).
  • Inexact prox and plug-and-play: When the prox is intractable or absent in closed form, inexact schemes such as Cadzow plug-and-play gradient descent (CPGD) employ alternating projections or approximate denoisers in lieu of the proximal operator, while retaining convergence to local minimizers under mild additional structure (e.g., locally nonexpansive alternating Toeplitz-SVD projections) (Simeoni et al., 2020).
  • Bregman and mirror-prox frameworks: For problems with non-Euclidean geometry or relative smoothness, the prox term is replaced with a Bregman divergence, with correspondingly generalized convergence guarantees (Elshiaty et al., 4 Jun 2025).
  • Primal-dual schemes and adversarial optimization: PGD underlies primal-dual strategies for constrained min-max or variational problems (e.g., adversarial robustness), with extensions to nonsmooth norm penalties and dual stepsizes (Matyasko et al., 2021).
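As a simple illustration of choosing the stepsize without knowing $L$, the sketch below uses classical backtracking on the quadratic upper-bound condition. This is a standard textbook rule swapped in for illustration; it is not the adaptive curvature-tracking rule of the cited works. The function names, the slack constant `1e-12`, and the toy data are assumptions of this example.

```python
import math


def soft_threshold(z, tau):
    return [math.copysign(max(abs(zi) - tau, 0.0), zi) for zi in z]


def pgd_backtracking(f, grad_f, prox_g, x0, step0=10.0, shrink=0.5, iters=100):
    """PGD where the step is halved until the quadratic upper bound holds."""
    x, step = list(x0), step0
    for _ in range(iters):
        fx, g = f(x), grad_f(x)
        while True:
            x_new = prox_g([xi - step * gi for xi, gi in zip(x, g)], step)
            d = [xn - xi for xn, xi in zip(x_new, x)]
            # Model: f(x) + <grad, d> + ||d||^2 / (2 * step)
            model = (fx + sum(gi * di for gi, di in zip(g, d))
                     + sum(di * di for di in d) / (2.0 * step))
            if f(x_new) <= model + 1e-12:  # small slack against rounding
                break
            step *= shrink
        x = x_new
    return x, step


# Assumed toy lasso data (same structure as a diagonal least-squares problem).
a, b, lam = [1.0, 2.0], [3.0, 0.1], 0.5


def f(x):
    return 0.5 * sum((ai * xi - bi) ** 2 for ai, xi, bi in zip(a, x, b))


def grad_f(x):
    return [ai * (ai * xi - bi) for ai, xi, bi in zip(a, x, b)]


def prox_g(z, step):
    return soft_threshold(z, step * lam)


x_bt, final_step = pgd_backtracking(f, grad_f, prox_g, [0.0, 0.0])
print(x_bt, final_step)
```

On this example the accepted step ends up larger than the global $1/L = 0.25$, because only the low-curvature coordinate actually moves; this is the kind of local behaviour that adaptive schemes aim to exploit.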

4. Advanced Variants: Accelerated, Adaptive, and Bregman Extensions

The landscape of PGD variants encompasses several axes of enhancement:

  • Adaptive Proximal Gradient: Secant-type curvature tracking (local $L$ estimates) allows stepsizes exceeding the “safe” $1/L$, without added computation, yielding provable $O(1/k)$ convergence under local Lipschitz continuity (Malitsky et al., 2023, Nikolovski et al., 2024).
  • Silver Stepsize and Non-monotonic Schedules: Non-monotonic, fractal-like stepsize schedules outperform constant steps in constrained and composite problems, with provably optimal rates among all momentum-free schemes (Bok et al., 2024).
  • Bregman Proximal Gradient and Multilevel BPGD: In high-dimensional or constrained settings, leveraging Bregman divergence steps and multilevel hierarchy can yield global linear convergence and substantial acceleration for structured inverse problems, as in ML-BPGD (Elshiaty et al., 4 Jun 2025).
  • Unrolled and Deep Proximal Networks: Finite PGD iterations reparameterized and learned as layers in a neural architecture (“deep unfolding”) with explicit step size and gradient-transform parameterization, often with end-to-end AutoML-driven hyperparameter selection, provide data-efficient, interpretable, and high-speed solvers for structured waveform and inverse imaging tasks (Kaplan, 18 Mar 2026, Chen et al., 2020).
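To illustrate the Bregman idea in its simplest form, the sketch below replaces the squared Euclidean distance in the prox step with the Kullback-Leibler divergence on the probability simplex, which gives a closed-form multiplicative update. This is a minimal single-level example under assumed data, not the multilevel ML-BPGD scheme cited above; `entropic_prox_grad_step` and the target vector `p` are hypothetical.

```python
import math


def entropic_prox_grad_step(x, grad, step):
    """Bregman proximal gradient step on the simplex with the KL divergence.

    Solves argmin_u <grad, u> + KL(u, x) / step over the simplex, whose
    solution is the multiplicative update u_i ~ x_i * exp(-step * grad_i).
    """
    w = [xi * math.exp(-step * gi) for xi, gi in zip(x, grad)]
    total = sum(w)
    return [wi / total for wi in w]


# Assumed toy problem: minimize 0.5 * ||x - p||^2 over the simplex, with p
# itself in the simplex so the minimizer is p.
p = [0.2, 0.3, 0.5]
x = [1.0 / 3.0] * 3
for _ in range(500):
    grad = [xi - pi for xi, pi in zip(x, p)]
    x = entropic_prox_grad_step(x, grad, step=1.0)

print(x)
```

The iterates stay strictly positive and sum to one by construction, so no separate projection is needed; a unit step is admissible here because, by Pinsker's inequality, the quadratic is 1-smooth relative to the negative entropy on the simplex.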

5. Nonconvexity, Nonsmoothness, and Beyond: Weak Convexity, Plug-and-Play, and Piecewise Convex Regularization

PGD generalizes beyond the classical convex case:

  • For $g$ weakly convex (i.e., $g + \tfrac{\rho}{2}\|\cdot\|^2$ convex for some $\rho > 0$), PGD and relaxed updates converge to stationary points with explicit rates and parameter regimes (Hurault et al., 2023, Rotaru et al., 6 Mar 2025).
  • For composite objectives with piecewise convex, possibly nonconvex regularizers (e.g., capped-$\ell_1$ and related penalties), projective PGD (PPGD) with momentum and piecewise projection achieves $O(1/k^2)$ locally after finitely many piece transitions, without reliance on the KL property (Yang et al., 2023, Yang et al., 2017).
  • For adversarial optimization and norm-constrained problems, primal–dual PGD efficiently solves min-max or min-norm perturbation tasks for arbitrary prox-friendly norms (Matyasko et al., 2021).

6. Applications, Empirical Performance, and Practical Recommendations

PGD underpins numerous applications, often as the backbone of sparse and low-rank recovery, large-scale regularized estimation, imaging, wireless waveform optimization, structured prediction with optimization-in-the-loop, and learned iterative schemes:

  • In ill-conditioned inverse problems, PGD with back-projection (BP) objectives converges substantially faster than with least-squares fidelity terms, as the BP Hessian is perfectly conditioned over the row-space (Tirer et al., 2020).
  • Plug-and-play PGD with learned denoisers achieves state-of-the-art image restoration and inference under both matched and mismatched priors, with convergence guarantees scaling gracefully with denoiser suboptimality (Hurault et al., 2023, Xu et al., 14 Jan 2026).
  • For compressive MRI and signal processing, unrolled PGD architectures (e.g., ProxNet and AutoPGD) yield high accuracy with drastic reductions in computation, data requirements, and network size relative to black-box learning (Chen et al., 2020, Kaplan, 18 Mar 2026).
  • In large-scale, ill-conditioned, or structurally constrained domains, multilevel Bregman PGD and adaptive variants deliver accelerated convergence and computational tractability (Elshiaty et al., 4 Jun 2025, Nikolovski et al., 2024, Malitsky et al., 2023).

Typical practical recommendations include using variable/adaptive stepsizes when the global $L$ is unknown, Bregman geometry when natural, and plug-and-play or inexact prox when explicit regularizer structure is unknown or not easily proximable. For nonconvex settings, exploiting local piecewise convexity or DCA equivalence affords both theoretical and empirical acceleration (Rotaru et al., 6 Mar 2025, Yang et al., 2023).

7. Connections, Limitations, and Ongoing Research Directions

PGD’s conceptual universality is reflected in its deep connections:

  • Equivalence to DCA: The fundamental PGD step coincides with DCA for natural curvature splittings, enabling tight convergence analysis and parameter selection beyond classical theory (Rotaru et al., 6 Mar 2025).
  • Generalization to Wasserstein and other geometries: Extension to measure spaces and geodesic convexity underpins modern approaches to learning on distributions and infinite-dimensional spaces (Salim et al., 2020).
  • Implicit layers and end-to-end learning: PGD forms the computational and conceptual backbone of optimization-in-the-loop and differentiable programming frameworks; advanced backward passes such as LPGD abstract the envelope and smoothing view for automatic differentiation through optimization layers (Paulus et al., 2024).
  • Limits and open questions: While classical PGD fails in some highly nonconvex and nonsmooth regimes, recent advances in piecewise convexity, PL conditions, and plug-and-play analysis have dramatically extended its reach. Optimality of adaptive stepsizes, structure-exploiting parameterizations, and unifying analysis frameworks remain active research areas (Bok et al., 2024, Rotaru et al., 6 Mar 2025, Hurault et al., 2023).

In sum, Proximal Gradient Descent is a foundational instrument of modern algorithmic optimization, continuously advancing as new theoretical tools, adaptivity schemes, and application-driven innovations emerge across the mathematical, engineering, and data sciences (Pong, 2013, Nikolovski et al., 2024, Bok et al., 2024, Hurault et al., 2023, Rotaru et al., 6 Mar 2025, Elshiaty et al., 4 Jun 2025, Simeoni et al., 2020, Malitsky et al., 2023, Salim et al., 2020, Kaplan, 18 Mar 2026, Chen et al., 2020, Matyasko et al., 2021, Yang et al., 2023, Yang et al., 2017, Paulus et al., 2024, Tirer et al., 2020, Xu et al., 14 Jan 2026).
