
Zeroth-Order Projected Stochastic Subgradient Method

Updated 18 August 2025
  • The paper introduces a zeroth-order method that uses Gaussian smoothing to approximate Clarke subgradients in constrained, nonsmooth, nonconvex optimization settings.
  • It employs a two-timescale iterative scheme where fast gradient tracking and slow projected updates ensure feasibility over compact convex sets.
  • The approach guarantees almost sure convergence to a neighborhood of Clarke stationary points with explicit bias control, advancing classical stochastic methods.

A zeroth-order projected stochastic subgradient method is an algorithmic framework for solving constrained stochastic optimization problems when gradients or subgradients of the objective function are unavailable or inaccessible, and only noisy function evaluations can be queried. These methods approximate generalized (in particular, Clarke) subgradients by using randomized smoothing, and combine stochastic gradient tracking with projection steps to handle convex constraints. This framework is motivated by optimizing Lipschitz continuous, nonsmooth, nonconvex objectives over compact convex sets, a setting for which classical gradient-based techniques are infeasible or insufficiently robust.

1. Smoothing-Based Zeroth-Order Subgradient Approximation

The main technical challenge addressed is the lack of a Taylor-like expansion or analytical handle on the Clarke subdifferential for nonsmooth functions, which impedes both subgradient approximation and theoretical analysis. To overcome this, the method utilizes Gaussian smoothing: for a given λ > 0, a smoothed version of the objective is defined as

f_\lambda(x) = \mathbb{E}_{u \sim \mathcal{N}(0, I)} [f(x + \lambda u)],

which is differentiable even if f is nondifferentiable. The gradient of the smoothed function can be written as

\nabla f_\lambda(x) = \frac{1}{\lambda} \mathbb{E}_u \left[ (f(x + \lambda u) - f(x)) u \right].

A key structural result is that, under mild regularity (Lipschitz continuity) conditions, for every x,

\nabla f_\lambda(x) \in \partial f(x) + B(0, r(\lambda)),

where B(0, r(λ)) is a ball centered at zero with vanishing radius r(λ) → 0 as λ → 0 (see (Paul et al., 14 Aug 2025)). Thus the expectation of the Gaussian-smoothed subgradient lies within an explicitly bounded distance of the Clarke subdifferential.
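As a concrete sketch, the one-point estimator above is straightforward to implement by Monte Carlo; the toy objective f(x) = ‖x‖₁ and all names below are illustrative, not from the paper:

```python
import numpy as np

def smoothed_grad(f, x, lam, n_samples=2000, seed=None):
    """Monte Carlo estimate of the gradient of the Gaussian-smoothed
    surrogate f_lam(x) = E[f(x + lam * u)], u ~ N(0, I), via the
    one-point formula (f(x + lam * u) - f(x)) * u / lam."""
    rng = np.random.default_rng(seed)
    u = rng.standard_normal((n_samples, x.shape[0]))
    diffs = np.array([f(x + lam * ui) for ui in u]) - f(x)
    return (diffs[:, None] * u).mean(axis=0) / lam

f = lambda x: np.abs(x).sum()      # nonsmooth, Lipschitz test objective
g = smoothed_grad(f, np.array([2.0, -1.5]), lam=0.1, seed=0)
# Away from the kink at 0, the estimate approaches sign(x) = [1, -1].
```

Note that f itself is never differentiated; only (noisy) function values enter the estimator, which is precisely the zeroth-order setting.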

2. Two-Timescale Coupled Iterative Scheme

The algorithm employs a two-timescale stochastic approximation architecture:

  • The fast timescale recursively tracks the (randomized, noisy) smoothed subgradient. At iteration n, given x_n and an independent standard Gaussian U_n, the algorithm draws two function evaluations F(x_n + λU_n, ζ_n^1) and F(x_n − λU_n, ζ_n^2), with independent noise samples ζ_n^1, ζ_n^2, and computes

\widetilde{g}(n) = \frac{ F(x_n + \lambda U_n, \zeta_n^1) - F(x_n - \lambda U_n, \zeta_n^2) }{2\lambda} U_n.

The auxiliary variable y_n is updated by

y_{n+1} = y_n + \beta(n) (\widetilde{g}(n) - y_n),

with step-sizes β(n) satisfying ∑ₙ β(n) = ∞ and ∑ₙ β(n)² < ∞.

  • The slow timescale performs the projected update:

x_{n+1} = \mathcal{P}_{\mathcal{X}}( x_n - \alpha(n) y_n ),

where P_X denotes orthogonal projection onto the compact convex set X, and α(n) is a step-size sequence such that α(n)/β(n) → 0 (ensuring the timescales are well separated).

This two-timescale design ensures that y_n closely tracks the expected smoothed subgradient at the current x_n, while the projected descent step, which uses y_n as the search direction, keeps all iterates feasible.
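A minimal sketch of the two coupled recursions, assuming a box constraint set (so the projection is a componentwise clip) and a toy noisy oracle F(x, ζ) = ‖x‖₁ + ζ; the step-size exponents are one admissible choice, not the paper's:

```python
import numpy as np

def zo_projected(F, project, x0, lam=0.05, n_iter=20000, seed=None):
    """Two-timescale zeroth-order projected scheme: y_n tracks the
    smoothed subgradient (fast), x_n performs projected descent (slow)."""
    rng = np.random.default_rng(seed)
    x = x0.astype(float)
    y = np.zeros_like(x)
    for n in range(1, n_iter + 1):
        beta = n ** -0.6   # fast step: sum beta = inf, sum beta^2 < inf
        alpha = n ** -0.9  # slow step: alpha / beta -> 0
        u = rng.standard_normal(x.shape)
        # Two-point estimate of the smoothed subgradient at x_n.
        g = (F(x + lam * u, rng) - F(x - lam * u, rng)) / (2 * lam) * u
        y += beta * (g - y)             # fast timescale: tracking
        x = project(x - alpha * y)      # slow timescale: projected descent
    return x

F = lambda z, rng: np.abs(z).sum() + 0.01 * rng.standard_normal()  # noisy oracle
project = lambda z: np.clip(z, -1.0, 1.0)                          # box X = [-1, 1]^2
x_final = zo_projected(F, project, np.array([0.9, -0.7]), seed=0)
# Every iterate stays in X; x_final should lie near 0, the constrained minimizer.
```

The clip-based projection is specific to the assumed box; any compact convex X with a computable Euclidean projection fits the same template.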

3. Convergence Properties and Neighborhood Characterization

By leveraging continuous-time dynamical systems theory and robust perturbation analysis (specifically, Lyapunov-based arguments and properties of set-valued Marchaud maps), the analysis establishes almost sure convergence of the iterates to a neighborhood of Clarke stationary points of the original, nonsmooth, nonconvex problem. The critical points of the limiting projected dynamical system satisfy

0 \in \partial f(x) + N_{\mathcal{X}}(x),

where N_X(x) denotes the normal cone to X at x ∈ X. Due to smoothing, the neighborhood size is controlled explicitly by the smoothing parameter λ via r(λ). As λ → 0, the bias in the subgradient approximation vanishes, and the iterates become arbitrarily close (in limit) to the Clarke stationary set. This result yields the first almost sure convergence for zeroth-order methods with projections in the constrained, nonsmooth, nonconvex stochastic optimization regime (Paul et al., 14 Aug 2025).
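In computations, proximity to this condition is often monitored through the projected-gradient residual ‖x − P_X(x − g)‖, which is zero exactly when −g lies in the normal cone N_X(x). A small sketch with an assumed box constraint (illustrative, not the paper's diagnostic):

```python
import numpy as np

def stationarity_residual(x, g, lo=-1.0, hi=1.0):
    """||x - P_X(x - g)|| for the box X = [lo, hi]^d.
    Zero exactly when -g lies in the normal cone N_X(x)."""
    return float(np.linalg.norm(x - np.clip(x - g, lo, hi)))

# Interior point with a nonzero subgradient: not stationary.
r1 = stationarity_residual(np.array([0.3, 0.0]), np.array([1.0, 0.0]))
# Boundary point whose descent direction points out of the box: stationary.
r2 = stationarity_residual(np.array([1.0, 0.0]), np.array([-2.0, 0.0]))
# r1 == 1.0, r2 == 0.0
```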

4. Role and Adaptation of Gaussian Smoothing for Clarke Subdifferentials

Gaussian smoothing regularizes the nonsmooth objective without requiring an explicit subdifferential oracle. For a function ff that is merely Lipschitz, the smoothed version fλf_\lambda is always differentiable (by convolution with the Gaussian kernel), and its gradient can be efficiently and unbiasedly estimated by finite differences and random sampling. Importantly, while standard zeroth-order methods for smooth/nonconvex objectives approximate classical gradients, the approach here rigorously approximates elements of the Clarke subdifferential, which is fundamental for nonsmooth nonconvex analysis.

The explicit control of the bias r(λ), quantified as the error between the smoothed gradient and the true Clarke subgradient, permits a tradeoff: making λ small improves accuracy but increases variance and possibly the number of function evaluations required.
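This tradeoff is easy to observe numerically with the two-point estimator; the noisy oracle, query point, noise level, and λ grid below are all illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(0)
x, sigma = 0.2, 0.1   # query point and evaluation-noise level (assumed)
F = lambda z: abs(z) + sigma * rng.standard_normal()  # noisy oracle for f = |.|

def two_point_samples(lam, n=20000):
    """n independent two-point estimates of the smoothed subgradient at x."""
    u = rng.standard_normal(n)
    return np.array([(F(x + lam * ui) - F(x - lam * ui)) / (2 * lam) * ui
                     for ui in u])

for lam in (0.5, 0.05, 0.005):
    g = two_point_samples(lam)
    # Bias is measured against the Clarke subgradient f'(0.2) = 1.
    print(f"lam={lam:5.3f}  bias={g.mean() - 1.0:+.3f}  std={g.std():.3f}")
# Shrinking lam drives the bias toward 0, while the evaluation-noise term
# (zeta_1 - zeta_2) / (2 lam) inflates the spread roughly like sigma / lam.
```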

5. Comparisons to Classical and Contemporary Methodologies

Earlier zeroth-order stochastic methods have established guaranteed convergence only for unconstrained smooth problems or have provided non-almost-sure statements (e.g., convergence in L^1 only). Traditional techniques for nonsmooth, nonconvex problems presuppose access to subgradient oracles, which is implausible in many simulation optimization or black-box contexts.

Distinctive features of this method (Paul et al., 14 Aug 2025):

  • Generalizes stochastic projected subgradient methods from subgradient-available settings to pure black-box (function value only) contexts.
  • Handles constraints exactly via Euclidean projections, rather than through penalization.
  • Achieves almost sure convergence to a quantified neighborhood, an advancement over prior results limited to asymptotic gaps or expectation guarantees.

This approach is complementary to smoothing-based zeroth-order approaches for unconstrained problems (Marrinan et al., 2023), but specifically overcomes additional technical obstacles in the analysis of constrained, nonsmooth landscapes (notably, the lack of a Taylor expansion for Clarke subdifferentials).

6. Practical Applicability and Further Implications

The method is designed for, and directly applicable to, scenarios where gradient or subgradient information is unavailable, such as simulation-based optimization, black-box machine learning, and other settings where only noisy function evaluations are available. The guaranteed feasibility of iterates (via projection), the ability to handle nonconvexity and nonsmoothness simultaneously, and the rigorous convergence characterization to Clarke stationary neighborhoods provide robustness for practical deployments. The separation of timescales and the explicit bias-variance tradeoff (via the smoothing parameter) allow practitioners to tailor algorithmic performance to problem requirements and noise regimes.

Potential extensions suggested by the methodology include accelerated two-timescale schemes, adaptivity in the selection of smoothing and step-size parameters, and application to constraints beyond compact convex sets using more general projection or proximal operators. This methodology enables a novel class of zeroth-order projected methods for challenging nonsmooth stochastic optimization problems in high-dimensional black-box settings.