Proximal Gradient Descent: Theory & Applications
- Proximal Gradient Descent is an optimization framework that decomposes problems into a smooth part handled by gradient descent and a nonsmooth part managed via a proximal operator.
- It provides provable convergence guarantees, including O(1/k) sublinear rates for convex functions and linear convergence under strong convexity, with accelerated variants available.
- Practical extensions such as adaptive stepsizes, plug-and-play techniques, and unrolled deep networks enhance its applicability in signal processing, imaging, and machine learning.
Proximal Gradient Descent (PGD) is a fundamental algorithmic framework for the numerical solution of optimization problems that decompose into the sum of a differentiable (often smooth) function and a possibly nonsmooth regularizer. This paradigm encompasses a wide range of problems in signal processing, statistics, machine learning, inverse problems, and more, accommodating both convex and certain classes of nonconvex objectives. PGD operates by alternating between an explicit gradient step on the smooth part and a proximal (implicit) step on the nonsmooth component, effectively generalizing classical gradient descent to the composite setting and enabling tractable treatment of structured regularization.
1. Mathematical Formulation and Algorithmic Structure
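PGD targets optimization problems of the form
$$\min_{x \in \mathbb{R}^n} \; F(x) = f(x) + g(x),$$
where $f$ is differentiable with $L$-Lipschitz continuous gradient, and $g$ is a proper, closed, and possibly nonsmooth function. The PGD update at iteration $k$ is
$$x^{k+1} = \operatorname{prox}_{\gamma g}\bigl(x^k - \gamma \nabla f(x^k)\bigr), \qquad \operatorname{prox}_{\gamma g}(v) = \arg\min_{u} \Bigl\{ g(u) + \tfrac{1}{2\gamma}\|u - v\|^2 \Bigr\},$$
where $\gamma > 0$ is a stepsize chosen (often $\gamma \le 1/L$) to ensure descent and stability (Pong, 2013, Nikolovski et al., 2024). The proximal operator enables handling constraints and regularizers such as the $\ell_1$ norm, group sparsity, total variation, indicator functions of convex sets, or separable nonconvex penalties.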
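The alternation between an explicit gradient step and a proximal step can be sketched on a small lasso problem, with smooth part 0.5*||Ax - b||^2 and regularizer lam*||x||_1. This is a minimal illustration under stated assumptions, not a reference implementation: the helper names (`soft_threshold`, `matvec`, `rmatvec`, `pgd_lasso`), the toy data, and the conservative stepsize 1/||A||_F^2 (which never exceeds 1/L, since the squared Frobenius norm upper-bounds the largest eigenvalue of A^T A) are all choices made here.

```python
import math

def soft_threshold(v, t):
    """Prox of t * ||.||_1: shrink each entry toward zero by t."""
    return [math.copysign(max(abs(vi) - t, 0.0), vi) for vi in v]

def matvec(A, x):
    """Compute A @ x for a matrix stored as a list of rows."""
    return [sum(a * xi for a, xi in zip(row, x)) for row in A]

def rmatvec(A, r):
    """Compute A^T @ r."""
    return [sum(row[j] * ri for row, ri in zip(A, r)) for j in range(len(A[0]))]

def pgd_lasso(A, b, lam, iters=500):
    # Assumption: gamma = 1 / ||A||_F^2, a safe stepsize because ||A||_F^2 >= L.
    gamma = 1.0 / sum(a * a for row in A for a in row)
    x = [0.0] * len(A[0])
    for _ in range(iters):
        residual = [ri - bi for ri, bi in zip(matvec(A, x), b)]
        grad = rmatvec(A, residual)                              # gradient of the smooth part
        forward = [xi - gamma * gi for xi, gi in zip(x, grad)]   # explicit gradient step
        x = soft_threshold(forward, gamma * lam)                 # proximal (implicit) step
    return x

# Hypothetical toy data: a diagonal A makes the problem separable and easy to check by hand.
A = [[1.0, 0.0], [0.0, 2.0]]
b = [3.0, 0.2]
x_hat = pgd_lasso(A, b, lam=0.5)
```

For this separable toy problem the minimizer can be worked out coordinate by coordinate as (2.5, 0): the second coordinate is set exactly to zero by the thresholding, which is the sparsity-inducing effect of the $\ell_1$ prox.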
The method extends to composite settings in which one nonsmooth component admits a simple prox but another does not; in that case the plain PGD step is no longer directly computable, and specialized dual or primal-dual methods such as Proximal-Proximal Gradient (PPG) or other splitting methods may be required (Pong, 2013).
2. Theoretical Guarantees and Convergence Rates
For convex $f$ and $g$, and constant stepsize $\gamma = 1/L$, classical PGD ensures
$$F(x^k) - F^\star \le \frac{L\,\|x^0 - x^\star\|^2}{2k},$$
implying an $O(1/k)$ sublinear rate in objective value (Nikolovski et al., 2024, Pong, 2013, Salim et al., 2020). Under additional $\mu$-strong convexity of $f$, PGD achieves linear convergence: $\|x^k - x^\star\|^2 \le (1 - \mu/L)^k \|x^0 - x^\star\|^2$. Accelerated variants (FISTA) provably attain $O(1/k^2)$. Recent work further advances the rates achievable by variable stepsize methods; the “silver stepsize schedule” yields an improved rate $O(1/k^{\log_2 \rho})$, where $\rho = 1 + \sqrt{2}$ is the silver ratio (so $\log_2 \rho \approx 1.27$), outperforming classical constant-step PGD but not reaching the $O(1/k^2)$ optimum attained with Nesterov acceleration (Bok et al., 2024).
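The accelerated variant mentioned above can be sketched as follows. This is a minimal FISTA loop with a constant stepsize, written against generic `grad_f` and `prox_g` callables; the toy problem (a diagonal quadratic plus an $\ell_1$ penalty), the constants `d`, `c`, `lam`, and all function names are assumptions introduced for illustration.

```python
import math

def fista(grad_f, prox_g, x0, gamma, iters):
    """FISTA with constant stepsize gamma (assumed to satisfy gamma <= 1/L)."""
    x_prev, y, t = list(x0), list(x0), 1.0
    for _ in range(iters):
        g = grad_f(y)
        # Proximal gradient step taken from the extrapolated point y, not from x.
        x = prox_g([yi - gamma * gi for yi, gi in zip(y, g)], gamma)
        t_next = 0.5 * (1.0 + math.sqrt(1.0 + 4.0 * t * t))
        beta = (t - 1.0) / t_next                      # extrapolation (momentum) weight
        y = [xi + beta * (xi - xp) for xi, xp in zip(x, x_prev)]
        x_prev, t = x, t_next
    return x_prev

# Hypothetical toy problem: f(x) = 0.5 * sum_i d_i (x_i - c_i)^2, g(x) = lam * ||x||_1.
d, c, lam = [1.0, 100.0], [3.0, -1.0], 0.5
L = max(d)  # Lipschitz constant of grad f for a diagonal quadratic

def grad_f(x):
    return [di * (xi - ci) for di, xi, ci in zip(d, x, c)]

def prox_g(v, gamma):
    t = gamma * lam
    return [math.copysign(max(abs(vi) - t, 0.0), vi) for vi in v]

def objective(x):
    smooth = 0.5 * sum(di * (xi - ci) ** 2 for di, xi, ci in zip(d, x, c))
    return smooth + lam * sum(abs(xi) for xi in x)

iters = 500
x_fista = fista(grad_f, prox_g, [0.0, 0.0], 1.0 / L, iters)
```

The only change relative to plain PGD is the extrapolated point `y`; each iteration still costs one gradient and one prox evaluation. Note that FISTA iterates are not guaranteed to decrease the objective monotonically, so the objective may oscillate along the way.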
For weakly convex or nonconvex objectives, convergence is to stationary points, with explicit rate guarantees for suitable parameter regimes (Hurault et al., 2023, Rotaru et al., 6 Mar 2025). The tightest one-step decrease characterizations are now available via DCA-based analysis, which refines constants and parameter selection beyond the classical analysis (Rotaru et al., 6 Mar 2025).
PGD has also been analyzed in infinite-dimensional Wasserstein spaces for measure optimization, where the analogous forward–backward scheme retains $O(1/k)$ suboptimality and linear metric convergence under strong convexity (Salim et al., 2020).
3. Proximal Mapping, Implementation, and Extensions
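For many regularizers, $\operatorname{prox}_{\gamma g}$ is computable in closed form or admits fast routines. For $\ell_1$-norm regularization, $g = \lambda\|\cdot\|_1$, the proximal operator is the soft-thresholding mapping
$$\bigl[\operatorname{prox}_{\gamma\lambda\|\cdot\|_1}(v)\bigr]_i = \operatorname{sign}(v_i)\,\max\{|v_i| - \gamma\lambda,\; 0\}.$$
For indicator functions of closed convex constraint sets, the proximal operator reduces to the Euclidean projection onto the feasible set.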
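The soft-thresholding mapping follows from the first-order optimality condition of the scalar prox problem. The short derivation below is a worked sketch; the symbols $u$, $v$, and the threshold $\tau > 0$ (stepsize times regularization weight) are notation introduced here for illustration.

```latex
\begin{align*}
\operatorname{prox}_{\tau|\cdot|}(v)
  &= \arg\min_{u \in \mathbb{R}} \Bigl\{ \tau |u| + \tfrac{1}{2}(u - v)^2 \Bigr\},
  \qquad 0 \in \tau\,\partial|u| + (u - v). \\[4pt]
u > 0 &: \quad u = v - \tau, \quad \text{consistent iff } v > \tau, \\
u < 0 &: \quad u = v + \tau, \quad \text{consistent iff } v < -\tau, \\
u = 0 &: \quad v \in [-\tau, \tau] \quad (\text{since } \partial|0| = [-1, 1]). \\[4pt]
\operatorname{prox}_{\tau|\cdot|}(v)
  &= \operatorname{sign}(v)\,\max\{|v| - \tau,\; 0\}.
\end{align*}
```

Because the $\ell_1$ norm is separable, the vector prox applies this scalar rule to each coordinate independently.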
Modern implementations often adapt the basic PGD scheme:
- Variable and adaptive stepsizes: Per-iteration local curvature estimates can yield step selections superior to the global $1/L$ stepsize, often halving iteration counts and wall times (Malitsky et al., 2023, Nikolovski et al., 2024).
- Plug-and-play and learned prox: Replacing the explicit proximal operator with a deep or learned denoiser (constructed so that it takes the form of a proximal operator of an implicit regularizer) yields effective “PnP-PGD” methods, with provable sublinear convergence (even under prior mismatch and for nonconvex implicit regularizers) under weak contractivity assumptions (Hurault et al., 2023, Xu et al., 14 Jan 2026).
- Inexact prox and plug-and-play: When the prox is intractable or absent in closed form, inexact schemes such as Cadzow plug-and-play gradient descent (CPGD) employ alternating projections or approximate denoisers in lieu of the proximal operator, while retaining convergence to local minimizers under mild additional structure (e.g., locally nonexpansive alternating Toeplitz-SVD projections) (Simeoni et al., 2020).
- Bregman and mirror-prox frameworks: For problems with non-Euclidean geometry or relative smoothness, the prox term is replaced with a Bregman divergence, with correspondingly generalized convergence guarantees (Elshiaty et al., 4 Jun 2025).
- Primal-dual schemes and adversarial optimization: PGD underlies primal-dual strategies for constrained min-max or variational problems (e.g., adversarial robustness), with extensions to nonsmooth norm penalties and dual stepsizes (Matyasko et al., 2021).
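When the global Lipschitz constant is unknown, one simple way to choose the stepsize is classical backtracking on the quadratic upper model. The sketch below uses that standard rule; it is not the adaptive curvature-tracking scheme of the cited works, and the function names, toy problem, initial stepsize, halving factor, and numerical slack are all assumptions made for illustration.

```python
import math

def pgd_backtracking(f, grad_f, prox_g, x0, gamma0=1.0, iters=200, shrink=0.5):
    """Proximal gradient with backtracking; gamma is carried over and never increased."""
    x, gamma = list(x0), gamma0
    for _ in range(iters):
        fx, g = f(x), grad_f(x)
        for _ in range(60):  # cap on the number of shrink steps, a numerical safeguard
            x_new = prox_g([xi - gamma * gi for xi, gi in zip(x, g)], gamma)
            dx = [a - b for a, b in zip(x_new, x)]
            # Quadratic upper model of f around x with curvature 1/gamma.
            model = (fx + sum(gi * di for gi, di in zip(g, dx))
                     + sum(di * di for di in dx) / (2.0 * gamma))
            if f(x_new) <= model + 1e-12:  # small slack absorbs rounding error
                break
            gamma *= shrink
        x = x_new
    return x, gamma

# Hypothetical toy problem: f(x) = 0.5 * sum_i d_i (x_i - c_i)^2, g(x) = lam * ||x||_1.
d, c, lam = [1.0, 4.0], [3.0, -1.0], 0.5

def f(x):
    return 0.5 * sum(di * (xi - ci) ** 2 for di, xi, ci in zip(d, x, c))

def grad_f(x):
    return [di * (xi - ci) for di, xi, ci in zip(d, x, c)]

def prox_g(v, gamma):
    t = gamma * lam
    return [math.copysign(max(abs(vi) - t, 0.0), vi) for vi in v]

x_bt, gamma_final = pgd_backtracking(f, grad_f, prox_g, [0.0, 0.0])
```

On this toy problem the largest curvature is 4, and the search starting from 1.0 settles at a stepsize of 0.25 without the constant ever being supplied. Each rejected trial costs one extra prox and one extra function evaluation, which is the overhead that adaptive schemes aim to avoid.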
4. Advanced Variants: Accelerated, Adaptive, and Bregman Extensions
The landscape of PGD variants encompasses several axes of enhancement:
- Adaptive Proximal Gradient: Secant-type curvature tracking (local estimates of $L$) allows stepsizes exceeding the “safe” $1/L$, without added computation, yielding provable $O(1/k)$ convergence under local Lipschitz continuity of the gradient (Malitsky et al., 2023, Nikolovski et al., 2024).
- Silver Stepsize and Non-monotonic Schedules: Non-monotonic, fractal-like stepsize schedules outperform constant steps in constrained and composite problems, with provably optimal rates among all momentum-free schemes (Bok et al., 2024).
- Bregman Proximal Gradient and Multilevel BPGD: In high-dimensional or constrained settings, leveraging Bregman divergence steps and multilevel hierarchy can yield global linear convergence and substantial acceleration for structured inverse problems, as in ML-BPGD (Elshiaty et al., 4 Jun 2025).
- Unrolled and Deep Proximal Networks: Finite PGD iterations reparameterized and learned as layers in a neural architecture (“deep unfolding”) with explicit step size and gradient-transform parameterization, often with end-to-end AutoML-driven hyperparameter selection, provide data-efficient, interpretable, and high-speed solvers for structured waveform and inverse imaging tasks (Kaplan, 18 Mar 2026, Chen et al., 2020).
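To make the Bregman variant concrete, the following sketch covers one well-known special case: minimizing a smooth function over the probability simplex with the negative-entropy kernel, where the Bregman proximal gradient step has a closed multiplicative form. It is not the multilevel ML-BPGD method of the cited work; the function names, the toy objective 0.5*||x - p||^2, and the stepsize are assumptions chosen for illustration.

```python
import math

def bregman_pg_simplex(grad_f, x0, gamma, iters):
    """Bregman proximal gradient on the simplex with kernel h(x) = sum_i x_i log x_i.

    Each step solves argmin_{x in simplex} <grad f(x_k), x> + (1/gamma) * KL(x || x_k),
    whose solution is x_i proportional to x_k[i] * exp(-gamma * grad_i).
    """
    x = list(x0)
    for _ in range(iters):
        g = grad_f(x)
        w = [xi * math.exp(-gamma * gi) for xi, gi in zip(x, g)]
        total = sum(w)
        x = [wi / total for wi in w]  # normalization enforces the simplex constraint
    return x

# Hypothetical toy objective: f(x) = 0.5 * ||x - p||^2 with p inside the simplex.
p = [0.2, 0.3, 0.5]

def grad_f(x):
    return [xi - pi for xi, pi in zip(x, p)]

# gamma = 1 is admissible here: on the simplex the Hessian of h, diag(1/x_i),
# dominates the identity Hessian of f, so f is 1-smooth relative to h.
x_hat = bregman_pg_simplex(grad_f, [1.0 / 3.0] * 3, gamma=1.0, iters=500)
```

The iterates stay strictly positive and sum to one by construction, so no explicit Euclidean projection is needed; this is the practical appeal of matching the divergence to the geometry of the constraint set.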
5. Nonconvexity, Nonsmoothness, and Beyond: Weak Convexity, Plug-and-Play, and Piecewise Convex Regularization
PGD generalizes beyond the classical convex case:
- For $g$ weakly convex (i.e., $g + \tfrac{\rho}{2}\|\cdot\|^2$ convex for some $\rho > 0$), PGD and relaxed updates converge to stationary points with explicit rates and parameter regimes (Hurault et al., 2023, Rotaru et al., 6 Mar 2025).
- For composite objectives with piecewise convex, possibly nonconvex regularizers (e.g., capped-$\ell_1$ or $\ell_0$ penalties), projective PGD (PPGD) with momentum and piecewise projection achieves an $O(1/k^2)$ rate locally after finitely many piece transitions, without reliance on the KL property (Yang et al., 2023, Yang et al., 2017).
- For adversarial optimization and norm-constrained problems, primal–dual PGD efficiently solves min-max or min-norm perturbation tasks for arbitrary prox-friendly norms (Matyasko et al., 2021).
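Nonconvex regularizers can still have cheap proximal maps. As a hedged illustration (chosen here as a familiar example, not taken from the cited methods), the prox of the $\ell_0$ penalty tau*||x||_0 is hard thresholding: per coordinate, keeping v costs tau while zeroing it costs 0.5*v^2, so an entry is kept only when |v| exceeds sqrt(2*tau). The function name `prox_l0` and the sample values are hypothetical.

```python
import math

def prox_l0(v, tau):
    """Hard thresholding: a minimizer of tau * ||u||_0 + 0.5 * ||u - v||^2.

    Ties at |v_i| == sqrt(2 * tau) are resolved toward zero; either choice is a minimizer.
    """
    thresh = math.sqrt(2.0 * tau)
    return [vi if abs(vi) > thresh else 0.0 for vi in v]

# With tau = 0.5 the threshold is exactly 1.0.
out = prox_l0([3.0, 0.5, -0.1, -2.0], tau=0.5)
```

Unlike soft thresholding, surviving entries are not shrunk, and the map is discontinuous at the threshold. This is one reason convergence for such penalties is typically stated in terms of stationary points rather than global minimizers.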
6. Applications, Empirical Performance, and Practical Recommendations
PGD underpins numerous applications, often as the backbone of sparse and low-rank recovery, large-scale regularized estimation, imaging, wireless waveform optimization, structured prediction with optimization-in-the-loop, and learned iterative schemes:
- In ill-conditioned inverse problems, PGD with back-projection (BP) objectives converges substantially faster than with least-squares fidelity terms, as the BP Hessian is perfectly conditioned over the row-space (Tirer et al., 2020).
- Plug-and-play PGD with learned denoisers achieves state-of-the-art image restoration and inference under both matched and mismatched priors, with convergence guarantees scaling gracefully with denoiser suboptimality (Hurault et al., 2023, Xu et al., 14 Jan 2026).
- For compressive MRI and signal processing, unrolled PGD architectures (e.g., ProxNet and AutoPGD) yield high accuracy with drastic reductions in computation, data requirements, and network size relative to black-box learning (Chen et al., 2020, Kaplan, 18 Mar 2026).
- In large-scale, ill-conditioned, or structurally constrained domains, multilevel Bregman PGD and adaptive variants deliver accelerated convergence and computational tractability (Elshiaty et al., 4 Jun 2025, Nikolovski et al., 2024, Malitsky et al., 2023).
Typical practical recommendations include using variable/adaptive stepsizes when the global Lipschitz constant $L$ is unknown, Bregman geometry when natural, and plug-and-play or inexact prox when explicit regularizer structure is unknown or not easily proximable. For nonconvex settings, exploiting local piecewise convexity or DCA equivalence affords both theoretical and empirical acceleration (Rotaru et al., 6 Mar 2025, Yang et al., 2023).
7. Connections, Limitations, and Ongoing Research Directions
PGD’s conceptual universality is reflected in its deep connections:
- Equivalence to DCA: The fundamental PGD step coincides with DCA for natural curvature splittings, enabling tight convergence analysis and parameter selection beyond classical theory (Rotaru et al., 6 Mar 2025).
- Generalization to Wasserstein and other geometries: Extension to measure spaces and geodesic convexity underpins modern approaches to learning on distributions and infinite-dimensional spaces (Salim et al., 2020).
- Implicit layers and end-to-end learning: PGD forms the computational and conceptual backbone of optimization-in-the-loop and differentiable programming frameworks; advanced backward passes such as LPGD abstract the envelope and smoothing view for automatic differentiation through optimization layers (Paulus et al., 2024).
- Limits and open questions: While classical PGD fails in some highly nonconvex and nonsmooth regimes, recent advances in piecewise convexity, PL conditions, and plug-and-play analysis have dramatically extended its reach. Optimality of adaptive stepsizes, structure-exploiting parameterizations, and unifying analysis frameworks remain active research areas (Bok et al., 2024, Rotaru et al., 6 Mar 2025, Hurault et al., 2023).
In sum, Proximal Gradient Descent is a foundational instrument of modern algorithmic optimization, continuously advancing as new theoretical tools, adaptivity schemes, and application-driven innovations emerge across the mathematical, engineering, and data sciences (Pong, 2013, Nikolovski et al., 2024, Bok et al., 2024, Hurault et al., 2023, Rotaru et al., 6 Mar 2025, Elshiaty et al., 4 Jun 2025, Simeoni et al., 2020, Malitsky et al., 2023, Salim et al., 2020, Kaplan, 18 Mar 2026, Chen et al., 2020, Matyasko et al., 2021, Yang et al., 2023, Yang et al., 2017, Paulus et al., 2024, Tirer et al., 2020, Xu et al., 14 Jan 2026).