Nonsmooth Proximal Terms in Optimization
- Nonsmooth proximal terms are (typically convex) functions that enable regularization in optimization via tractable proximal mappings.
- They leverage splitting methods like the proximal gradient and its variants to handle complex composite structures and ensure convergence.
- Applications include sparse recovery, signal processing, and optimal control, where efficient proximal computations enhance performance.
Nonsmooth proximal terms refer to the inclusion of nonsmooth, often convex, regularization or penalty functions within optimization algorithms via their proximal operators. These terms arise in variational formulations across signal processing, statistical learning, inverse problems, and control, and their efficient algorithmic treatment has driven much of the progress in large-scale and structured optimization. This article surveys the theoretical foundations, algorithmic developments, and applications of nonsmooth proximal terms, synthesizing major results from recent research.
1. Definition and Role of Nonsmooth Proximal Terms
A nonsmooth proximal term is a (typically proper, closed, convex) function $g$ whose possible lack of differentiability is offset by tractable computation of its proximal mapping,
$$\operatorname{prox}_{\lambda g}(x) = \arg\min_{u}\Big\{ g(u) + \tfrac{1}{2\lambda}\|u - x\|^{2} \Big\} \quad \text{for } \lambda > 0.$$
The "prox-friendliness" of $g$ generalizes smooth regularization by enabling efficient splitting and composite optimization. In composite minimization problems of the form
$$\min_{x}\; f(x) + g(x),$$
where $f$ is smooth (Lipschitz-gradient, possibly nonconvex) and $g$ is nonsmooth, the proximal operator $\operatorname{prox}_{\lambda g}$ allows algorithms to handle the nonsmoothness explicitly within iterative update steps, bypassing the pitfalls associated with subgradient approaches.
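As a concrete illustration (a standard closed form, not specific to any cited paper), the proximal mapping of the scaled $\ell_1$ norm $g = \lambda\|\cdot\|_1$ is componentwise soft-thresholding; a minimal NumPy sketch:

```python
import numpy as np

def prox_l1(x, lam):
    """Proximal mapping of g(u) = lam * ||u||_1 (soft-thresholding).

    Solves argmin_u { lam * ||u||_1 + 0.5 * ||u - x||^2 } componentwise.
    """
    return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)

# Example: entries with magnitude below lam are set to zero, the rest shrink toward zero.
print(prox_l1(np.array([1.5, -0.3, 0.0, 2.0]), lam=0.5))  # [ 1. -0.  0.  1.5]
```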
Beyond additive nonsmooth penalties, the concept has been extended to compositions, e.g., $g(Ax)$ for a linear operator $A$, where the composite $g \circ A$ (particularly when $g$ is nonsmooth but "proximable" and $A$ is nontrivial) poses extra computational complexity addressed by advanced algorithms.
2. Algorithmic Treatment: Proximal Gradient, Proximal-Proximal Gradient, and Variants
Central to exploiting nonsmooth proximal terms are splitting algorithms such as the proximal gradient method and its variants.
Standard Proximal Gradient: For $\min_x f(x) + g(x)$, updates are of the form
$$x^{k+1} = \operatorname{prox}_{\gamma g}\big(x^{k} - \gamma \nabla f(x^{k})\big),$$
where $\gamma > 0$ is the step size (typically $\gamma \le 1/L$ for an $L$-Lipschitz $\nabla f$).
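A minimal sketch of this update for the LASSO instance $\min_x \tfrac{1}{2}\|Ax - b\|^2 + \lambda\|x\|_1$ (an illustrative example; the data, iteration count, and step-size rule below are assumptions, with `prox_l1` as defined above):

```python
import numpy as np

def prox_l1(x, lam):
    return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)

def proximal_gradient_lasso(A, b, lam, n_iter=500):
    """Proximal gradient (ISTA) for 0.5*||Ax - b||^2 + lam*||x||_1."""
    L = np.linalg.norm(A, 2) ** 2                   # Lipschitz constant of x -> A^T (A x - b)
    gamma = 1.0 / L                                 # step size gamma <= 1/L
    x = np.zeros(A.shape[1])
    for _ in range(n_iter):
        grad = A.T @ (A @ x - b)                    # gradient step on the smooth part f
        x = prox_l1(x - gamma * grad, gamma * lam)  # prox step on the nonsmooth part g
    return x

# Usage on a random sparse-recovery instance
rng = np.random.default_rng(0)
A = rng.standard_normal((50, 100))
x_true = np.zeros(100); x_true[:5] = 1.0
b = A @ x_true
print(np.round(proximal_gradient_lasso(A, b, lam=0.1)[:8], 2))
```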
Composite Case and Proximal–Proximal Gradient (PPG): When the nonsmooth term appears composed with an affine map, as in $g(Ax)$, direct application of the prox-gradient method requires computing $\operatorname{prox}_{\gamma(g \circ A)}$, which is generally difficult. The PPG algorithm (Pong, 2013) addresses this by formulating a dual problem and employing alternating minimization on the dual, adding an explicit proximal term to one of the alternating updates. Each iteration involves only easy-to-compute proximal mappings (of $g$ or its conjugate), circumventing the coupling induced by $A$. The PPG algorithm reduces to the standard proximal gradient method when $A$ is the identity, thus generalizing the basic approach to a significantly broader class of problems.
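As a hedged side note (a standard fact of proximal calculus, not a step of PPG itself): when $AA^{\top} = I$ the composite prox admits the closed form $\operatorname{prox}_{g \circ A}(x) = x + A^{\top}(\operatorname{prox}_{g}(Ax) - Ax)$, whereas for a general $A$ no such formula exists, which is exactly what dual or PPG-type splittings work around. A minimal sketch:

```python
import numpy as np

def prox_l1(x, lam):
    return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)

def prox_l1_after_linear_map(x, A, lam):
    """Prox of g(A x) with g = lam*||.||_1, valid ONLY when A @ A.T == I.

    Uses prox_{g o A}(x) = x + A^T (prox_g(Ax) - Ax); for a general A this
    identity fails and one must resort to splitting (dual, primal-dual, PPG).
    """
    Ax = A @ x
    return x + A.T @ (prox_l1(Ax, lam) - Ax)
```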
Full-Splitting and Alternating Minimization: Many modern proximal methods extend splitting strategies to handle additional structure, e.g., block variables, linear operators, or nonconvexity, by introducing auxiliary variables and dual updates (e.g., (Bot et al., 2018)).
Smoothing and Generalized Prox Distance: In cases where the proximal mapping for a composition is not easy to evaluate, smoothing or majorization techniques (e.g., Nesterov's smoothing or surrogate majorants) are used, combined with variable metrics or generalized distances, to better match the structure of $g$ (or $g \circ A$) and enhance algorithmic performance (Nguyen et al., 2017, Yashtini, 2022, Yagishita et al., 1 May 2025).
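For instance (a standard construction shown for illustration, not a result from the cited works), the Moreau envelope of the absolute value is the Huber function, a smooth surrogate with a $1/\mu$-Lipschitz gradient that can replace $|\cdot|$ inside a composition:

```python
import numpy as np

def huber(x, mu):
    """Moreau envelope of |.| with parameter mu (the Huber function).

    Equals x**2 / (2*mu) for |x| <= mu and |x| - mu/2 otherwise.
    """
    ax = np.abs(x)
    return np.where(ax <= mu, x**2 / (2 * mu), ax - mu / 2)

def huber_grad(x, mu):
    """Gradient of the Huber smoothing: x/mu clipped to [-1, 1], hence 1/mu-Lipschitz."""
    return np.clip(x / mu, -1.0, 1.0)
```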
3. Convergence Theory and Complexity Results
The convergence properties of proximal methods with nonsmooth terms have been comprehensively analyzed under various conditions.
- Global Convergence: For convex or weakly convex objectives, proximal gradient-based methods (and their PPG or inexact variants) generate sequences globally converging to minimizers or critical points, provided mild regularity of $f$ and $g$ and suitable step sizes (Pong, 2013, Davis et al., 2017, Khanh et al., 2023).
- Complexity Guarantees: Under Lipschitz-gradient assumptions, convexity, and error bounds, sharp iteration complexity results have been shown. For example, in the absence of strong convexity but under quadratic growth or related error-bound properties, proximal methods converge at a global linear rate in objective value and iterates (Drusvyatskiy et al., 2016). Without such conditions, the worst-case complexity is typically $\mathcal{O}(\varepsilon^{-2})$ iterations to reach $\varepsilon$-stationarity, measured for instance by the prox-gradient residual (see the sketch after this list) (Aravkin et al., 2021, Liu et al., 7 Jan 2024).
- Nonconvex and Riemannian Settings: The Kurdyka–Łojasiewicz (KL) property has been systematically used to derive finite length, linear, or sublinear convergence rates in nonconvex settings, including conic or manifold domains (Bot et al., 2017, Bot et al., 2018, Yashtini, 2022, Yagishita et al., 1 May 2025, Jiang et al., 10 Sep 2025). When KL holds at a cluster point, global convergence (whole sequence convergence) and variable metric stability (in quasi-Newton or variable metric schemes) are established (Jia et al., 24 Jul 2025).
- Inexact Proximal Mapping: Allowing errors in solving the proximal subproblem does not fundamentally alter convergence when the error control is appropriately linked to the iteration (for instance, through Moreau envelope smoothness and suitable desingularization), with explicit complexity bounds derived even for weakly convex objectives (Khanh et al., 2023, Jiang et al., 10 Sep 2025).
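A common way to quantify approximate stationarity for $f + g$ (a standard measure used here for illustration; the cited papers may employ related quantities such as the gradient of the Moreau envelope) is the prox-gradient residual, which vanishes exactly at points satisfying $0 \in \nabla f(x) + \partial g(x)$:

```python
import numpy as np

def prox_gradient_residual(x, grad_f_x, prox_g, gamma):
    """Residual ||x - prox_{gamma*g}(x - gamma*grad_f_x)|| / gamma.

    prox_g(v, t) is assumed to compute prox_{t*g}(v).  A point with residual
    at most eps is commonly called eps-stationary for f + g.
    """
    return np.linalg.norm(x - prox_g(x - gamma * grad_f_x, gamma)) / gamma
```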
4. Practical Challenges: Computation of Proximal Mappings and Algorithm Design
The practical effectiveness of proximal methods in the presence of nonsmooth terms is highly contingent on the tractability of the proximal mapping of $g$ (or of $g \circ A$, in the compositional case). Several substantial algorithmic themes address this:
- Duality and Decoupling: Leveraging Fenchel duality or Lagrangian duality to separate the complexity contributed by linear operators or coupling terms, and recasting updates so that the proximal operator applies directly to (a scaled) $g$ or to its conjugate $g^*$ (Pong, 2013, Takeuchi, 2020, Bot et al., 2018); a sketch of the conjugate-prox identity underlying this decoupling appears after this list.
- Adaptive and Variable Metric Proximal Terms: Employing variable metric (preconditioned) proximal operators, possibly adaptively tuned, both to accelerate convergence and to efficiently handle constraints or geometry (e.g., via Bregman divergences or weighted norms $\|\cdot\|_M$) (Yashtini, 2022, Yagishita et al., 1 May 2025, Liu et al., 7 Jan 2024).
- Stochastic and Incremental Methods: In large-scale regimes or with data partitioned across multiple machines, only subsets of nonsmooth terms may be updated at each iteration. The stochastic multi-proximal method (SMPM) extends variance reduction and arbitrary sampling schemes to the nonsmooth setting, enabling communication-efficient distributed optimization (Condat et al., 18 May 2025).
- Smoothing and Majorization: When direct computation of the proximal mapping of $g \circ A$ is not feasible, algorithms utilize smoothed surrogates or tangent majorant functions for composite penalties of this form (Nguyen et al., 2017, Yashtini, 2022).
- Trust-Region and Quasi-Newton Algorithms: For problems with challenging nonconvexity or variable curvature, algorithms based on trust regions (e.g., (Dao et al., 9 Jan 2025, Aravkin et al., 2021)) or model-based quasi-Newton updates (Jia et al., 24 Jul 2025) have been developed, leveraging the proximal operator of $g$ (regardless of nonsmoothness) while incorporating curvature information for efficient step calculation and robust globalization.
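The decoupling mentioned above often rests on the Moreau decomposition, which makes the prox of the conjugate $g^*$ available at the cost of the prox of $g$ (a general identity, shown as an illustrative sketch rather than the specific update of any cited method):

```python
import numpy as np

def prox_conjugate(x, gamma, prox_g):
    """Moreau decomposition: prox_{gamma * g*}(x) = x - gamma * prox_{g/gamma}(x / gamma).

    prox_g(v, t) is assumed to compute prox_{t*g}(v), so the conjugate prox costs one
    evaluation of the original prox -- the workhorse of dual and primal-dual splittings.
    """
    return x - gamma * prox_g(x / gamma, 1.0 / gamma)
```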
5. Applications and Empirical Performance
Nonsmooth proximal terms are ubiquitous in modern optimization-based applications:
- Structured Regularization: Sparse recovery (LASSO, elastic net), total variation denoising (TV penalties), and nuclear norm minimization (for low-rank matrices in system identification and matrix completion) are all formulated with nonsmooth proximal terms, often composed with linear operators; a prox sketch for the nuclear norm appears after this list.
- Large-Scale Learning: Problems such as the logistic fused lasso, basis pursuit, and robust regression employ nonsmooth regularizers or constraints that permit a simple proximal operator, facilitating splitting or stochastic incremental updates (Pong, 2013, Latafat et al., 2019, Condat et al., 18 May 2025).
- Signal Processing and Inverse Problems: Nonconvex regularization (e.g., log-sum or $\ell_p$ with $0 < p < 1$) and composite nonconvex-nonsmooth structures arise in high-resolution imaging and MRI reconstruction, addressed by variable metric and majorization-based CPALM and related algorithms (Yashtini, 2022).
- Optimal Control and Engineering: Convex and nonconvex constraints (such as box, simplex, or semidefinite cone projections) are handled via nonsmooth indicator functions and designed proximal terms, with applications in PDE control and feasibility problems (Dao et al., 9 Jan 2025, Jia et al., 24 Jul 2025).
- Riemannian Optimization: Nonsmooth DC objectives on manifolds, such as for sparsity-enforcing principal component analysis, exploit nonsmooth proximal terms that extend $\ell_1$-type penalties and difference-of-convex formulations (Jiang et al., 10 Sep 2025).
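As referenced in the structured-regularization item, the proximal mapping of the nuclear norm is singular value thresholding, a standard closed form (illustrative sketch):

```python
import numpy as np

def prox_nuclear_norm(X, lam):
    """Prox of lam * ||X||_* (nuclear norm): soft-threshold the singular values.

    Returns U diag(max(s - lam, 0)) V^T, the singular value thresholding operator
    used in low-rank matrix recovery and matrix completion.
    """
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return U @ np.diag(np.maximum(s - lam, 0.0)) @ Vt
```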
Empirical evaluations typically show that algorithms designed to exploit the specific structure of the nonsmooth proximal term (via duality, variable metrics, stochastic sampling, or majorization) substantially outperform generic subgradient or interior-point schemes, especially in large-scale and high-accuracy regimes.
6. Impact on Theory and Future Research Directions
The development and analysis of algorithms tailored to nonsmooth proximal terms have dramatically expanded the scope of tractable convex and nonconvex optimization. Critical theoretical advances include:
- Decoupling via Proximal Duality: Mapping challenging composite problems to dual or alternate primal-dual iterations that only require easy-to-compute proximal mappings.
- Convergence Beyond Global Descent Lemma: Proving convergence for generalized prox distances and Bregman divergences even in the absence of global descent properties (Yagishita et al., 1 May 2025).
- Extension to Riemannian, Stochastic, and Inexact Regimes: Adapting the concept of a nonsmooth proximal term to manifold-valued problems, as well as to online/zeroth-order and inexact subproblem settings, thereby expanding the algorithmic and applicability horizon (Jiang et al., 10 Sep 2025, Khanh et al., 2023, Liu et al., 7 Jan 2024).
- Unification and Generalization of Incremental Algorithms: SMPM unifies several recent advances in stochastic and distributed nonsmooth optimization, bridging variance-reduction techniques with arbitrary operator sampling (Condat et al., 18 May 2025).
Ongoing research interests include accelerated and adaptive variants, better exploitation of nonsmooth structure via stochastic and distributed computing, advanced regularization (difference-of-convex, rank constraints, etc.), and refined complexity guarantees under minimal assumptions or randomization. The interplay between local model fidelity, inexactness of prox computations, and adaptive metric selection remains a particularly active and promising area.
In summary, nonsmooth proximal terms form the algorithmic backbone of much of contemporary optimization, enabling efficient, flexible, and robust treatment of regularization, structure, and constraints in both convex and nonconvex regimes. Advances in their mathematical and algorithmic handling continue to have broad and deep impact across optimization theory and practical applications.