Randomized Smoothing Gradient Algorithms

Updated 20 August 2025

Randomized smoothing gradient algorithms are methods that use controlled random perturbations to create smooth surrogates for optimizing nonsmooth, nonconvex functions.
They leverage probabilistic convolution with Gaussian or uniform distributions to estimate gradients via Monte Carlo or finite differences, enhancing robustness in noisy, high-dimensional settings.
Integrating normalization and variance reduction techniques, these algorithms achieve near-optimal sample complexity under weak subgradient growth, broadening practical optimization applications.

Randomized smoothing gradient algorithms constitute a versatile family of gradient-free and gradient-based methods for optimizing nonsmooth and/or nonconvex functions by leveraging probabilistic convolution to yield smooth surrogates amenable to stochastic or zero-order optimization. These methods replace the objective or its gradients with expectations under controlled random perturbations—most commonly uniform or Gaussian distributions—enabling both theoretical guarantees on convergence and practical, robust performance in high-dimensional, noisy, and even non-differentiable regimes. Recent developments have extended randomized smoothing far beyond the traditional global Lipschitz setting, allowing convergence on a broad class of locally Lipschitz or even subgradient-growth-constrained functions, and accelerating algorithmic rates via normalization and variance reduction.

1. Subgradient Growth Conditions Beyond Global Lipschitz

Classical randomized smoothing analyses are predicated on global Lipschitz continuity: the existence of a global L such that $|f(x) - f(y)| \leq L\|x-y\|$ for all $x, y \in \mathbb{R}^d$ . However, many modern nonsmooth or composite objectives violate this assumption due to local irregularities or unbounded subgradients. The (α, β) subgradient growth condition (Xia et al., 19 Aug 2025) generalizes the landscape:

For every $x$ , the norm of any Clarke subgradient is bounded by a function $\alpha(x)$ : $\sup_{\zeta \in \partial f(x)}\|\zeta\| \leq \alpha(x)$ ;
The variation of $\alpha$ is itself controlled: $|\alpha(x) - \alpha(y)| \leq \beta(x, \|y - x\|)$ for some function $\beta$ .

This leads to a generalized local Lipschitz property,

$|f(x) - f(y)| \leq (\alpha(x) + \beta(x, \|y-x\|)) \|x - y\|.$

The standard Lipschitz case is recovered when $\alpha(x) = L$ , $\beta(x,r) = 0$ . This framework encompasses locally Lipschitz, subgradient-bounded, and certain non-polynomial growth scenarios, providing a rigorous baseline for algorithmic analysis in practical, non-globally-smooth optimization.

2. Smoothing Mechanisms and Algorithmic Schemes

The core randomized smoothing operator replaces $f(x)$ with a smoothed surrogate:

$f_\delta(x) = \mathbb{E}_{w \sim U(S^{d-1})} f(x + \delta w)$

for $w$ uniformly distributed on the unit sphere and smoothing parameter $\delta > 0$ . The smoothed surrogate $f_\delta$ is differentiable, with gradient given by the expectation

$\nabla f_\delta(x) = \mathbb{E}_w [ \nabla f(x + \delta w) ]$

and $\nabla f_\delta(x) \in \partial_\delta f(x)$ , the Goldstein $\delta$ -subdifferential of $f$ . Practically, the gradient is estimated via Monte Carlo or central finite difference estimators, for example,

$\hat{g} = \frac{d}{2\delta} (f(x + \delta u) - f(x - \delta u)) u,$

with $u$ sampled uniformly on the sphere. Normalization ( $\hat{g}/\|\hat{g}\|$ ) may be applied.

Three main algorithmic structures are supported by this smoothing framework (Xia et al., 19 Aug 2025):

RS-GF: Standard randomized smoothing gradient-free with step-size adapted to local $\alpha$ , $\beta$ .
RS-NGF: Normalized variant, dividing by the estimator norm to reduce dimension dependence.
RS-NVRGF: Incorporates variance reduction; a "heavy-ball" strategy updates the gradient estimate incrementally via larger and smaller batches, substantially lowering the sample complexity.

The smoothing mechanism yields surrogates for which generalized descent lemmas hold, allowing stochastic or batch-based update rules analogous to smooth nonconvex optimization but valid for non-globally-Lipschitz domains.

3. Convergence Rates and Sample Complexity

The target notion of optimality is a $(\delta, \epsilon)$ -Goldstein stationary point: $x$ such that $\min\{\|g\|: g \in \partial_\delta f(x)\} \leq \epsilon$ . Under the (α, β) subgradient growth condition, the following high-probability complexity guarantees are established:

RS-GF: $\tilde{\mathcal{O}}(d^{5/2} \delta^{-1} \epsilon^{-4})$ total function queries to find $(\delta, \epsilon)$ -Goldstein stationary points.
RS-NGF: Incorporating normalization, this is improved to $\tilde{\mathcal{O}}(d^{3/2} \delta^{-1} \epsilon^{-4})$ .
RS-NVRGF: With variance reduction, the logarithmic term is preserved and the $\epsilon$ -dependence matches the known optimal for global-Lipschitz: $\tilde{\mathcal{O}}(d^{3/2} \delta^{-1} \epsilon^{-3})$ .

These complexities mirror those for classic stochastic zeroth-order methods, up to dimension and logarithmic factors, but crucially no longer require global Lipschitzness, only the milder growth condition. Key technical developments include new descent lemmas and local smoothness constants of the form

$\ell(x, r) = \frac{c\sqrt{d} (2\alpha(x) + \beta(x, \delta) + \beta(x, r + \delta))}{2\delta}.$

This suggests a precise tradeoff between the smoothing radius, the local growth of subgradients, and the sample complexity.

4. Variance Reduction and Normalization

Variance reduction techniques are critical to reach the $\epsilon^{-3}$ sample complexity and to ensure stability in the presence of stochastic noise inherent to zeroth-order or Monte Carlo gradient estimation. The RS-NVRGF method (Xia et al., 19 Aug 2025) utilizes a two-tier batch scheduling inspired by SPIDER:

Every $q = \mathcal{O}(\epsilon^{-1})$ iterations, a large-batch full estimator is computed.
Between such epochs, gradient estimates are updated recursively via mini-batch differences,

$m_t = m_{t-1} + g(x_t, S_\text{small}) - g(x_{t-1}, S_\text{small}).$

Normalization of gradient steps (unit vector steps) further improves dimension-dependence, ensuring convergence even as $d$ grows. The result is a more efficient and stable convergence, matching minimax lower bounds in $\epsilon$ up to log factors for nonsmooth nonconvex optimization.

5. Experimental Validation and Applications

Empirical evaluation (Xia et al., 19 Aug 2025) comprises synthetic unconstrained problems and black-box adversarial attacks:

On synthetic distance estimation tasks with polynomial and exponential (nonsmooth) losses, normalized and variance-reduced algorithms (RS-NGF, RS-NVRGF) outperform classic smoothing-based methods, achieving lower objective values and smoother, more stable convergence trajectories.
In adversarial attack generation on CIFAR-10/ResNet architectures (nonsmooth, nonconvex, box-constrained), RS-NGF and RS-NVRGF achieve higher attack success rates and lower post-attack accuracy relative to VRGF and non-variance-reduced smoothing baselines. Normalization and variance reduction are necessary to obtain practical speed and solution quality.

The applicability extends to a broad class of derivative-free or simulation-based optimization tasks where cheap, unbiased derivative information is unavailable, and where only mild local smoothness or growth can be assumed.

6. Broader Impact and Future Directions

Randomized smoothing under weak subgradient growth has generalized the algorithmic toolkit for nonsmooth, nonconvex optimization beyond classical convex or globally smooth settings. This has several significant ramifications:

The method is robust to local non-Lipschitz irregularities that are prevalent in practical machine learning, signal processing, adversarial robustness, and engineering design.
Normalization and variance reduction are not algorithmic artefacts, but are theoretically necessary to achieve minimax sample complexity; further refinements are expected as tightness characterizations mature.
The approach opens the door to design of new smoothing distributions and to adaptation to problem structures (e.g., exploiting effective dimension reduction, as in random smoothing regularization (Ding et al., 2023), anisotropy (Starnes et al., 18 Nov 2024), or kernel-based adaptivity).

A plausible implication is that as more general local growth and structure-aware techniques are incorporated, randomized smoothing gradient algorithms will remain a foundation for reliable nonsmooth nonconvex optimization under realistic modeling assumptions.

Table: Key Algorithmic Elements

Variant	Update Rule	Complexity
RS-GF	$\hat{g}_t$ via central difference, scaled step	$\tilde{O}(d^{5/2}\delta^{-1}\epsilon^{-4})$
RS-NGF	Normalized $\hat{g}_t$ , fixed-length step	$\tilde{O}(d^{3/2}\delta^{-1}\epsilon^{-4})$
RS-NVRGF	Normalized, variance-reduced updates	$\tilde{O}(d^{3/2}\delta^{-1}\epsilon^{-3})$

Conclusion

Randomized smoothing gradient algorithms under generalized subgradient growth present a theoretically sound and empirically validated approach to zeroth-order optimization of nonsmooth, nonconvex, and locally irregular functions. The relaxation from global Lipschitz allows this class to subsume a vastly larger set of practical objectives, while the incorporation of normalization and variance reduction ensures near-optimal sample complexity. These features solidify the framework as a key tool in modern stochastic optimization and derivative-free algorithmics (Xia et al., 19 Aug 2025).