Randomized Stochastic Gradient-Free Method
- The method leverages randomized smoothing and finite-difference estimators to approximate gradients, enabling optimization without direct gradient access.
- It comes with proven convergence rates, and recursive variance reduction and momentum acceleration improve its oracle complexity.
- RSGF targets noisy, nonconvex objectives in high-dimensional settings and has been applied to training small neural networks.
The randomized stochastic gradient-free (RSGF) method is a class of zeroth-order stochastic optimization algorithms for nonconvex (and possibly nonsmooth) objective functions, which operate in settings where only noisy function evaluations are available. RSGF leverages randomized smoothing and finite-difference gradient estimators to enable stochastic optimization without direct access to gradients. The method achieves established convergence rates to approximate Goldstein stationary points, with practical extensions that yield high-probability guarantees and improved complexity using recursive variance reduction and momentum techniques (Ghadimi et al., 2013, Luo et al., 2019, Lin et al., 2022, Chen et al., 2023).
1. Mathematical Model and Problem Setting
RSGF targets optimization of objectives of the form
$$\min_{x \in \mathbb{R}^d} f(x) := \mathbb{E}_{\xi}\big[F(x, \xi)\big],$$
where $f$ is a possibly nonconvex, stochastic black-box function, and access is limited to (possibly noisy) function evaluations $F(x, \xi)$. Depending on the setting:
- $f$ may be smooth or nonsmooth but is typically Lipschitz or mean-squared Lipschitz.
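- The accessible oracle satisfies bounded variance, for example $\mathbb{E}_{\xi}\big[\|\nabla F(x, \xi) - \nabla f(x)\|^2\big] \le \sigma^2$ in the smooth setting.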
- Direct gradient, subgradient, or higher-order information is not available.
The central computational goal is to reach an $\epsilon$-stationary point, characterized by small gradient norm for smooth $f$, or a $(\delta, \epsilon)$-Goldstein stationary point for nonsmooth functions (Ghadimi et al., 2013, Lin et al., 2022).
2. Randomized Smoothing and Gradient-Free Estimation
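Without gradient information, RSGF uses randomized smoothing to make $f$ more amenable to finite-difference estimation. The standard smoothing constructs a function $f_\delta$ or $f_\mu$ via convolution:
$$f_\delta(x) = \mathbb{E}_{u \sim \mathrm{Unif}(\mathbb{B})}\big[f(x + \delta u)\big], \qquad f_\mu(x) = \mathbb{E}_{u \sim \mathcal{N}(0, I_d)}\big[f(x + \mu u)\big],$$
where $\mathbb{B}$ is the unit Euclidean ball in $\mathbb{R}^d$. This operation ensures that:
- $f_\delta$ and $f_\mu$ are differentiable even if $f$ is merely Lipschitz.
- $\nabla f_\delta(x) \in \partial_\delta f(x)$, i.e., the gradient of the uniformly smoothed function lies in the Goldstein subdifferential of $f$ at $x$ (Lin et al., 2022, Chen et al., 2023).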
- Smoothing and two-point finite-difference estimation underpin all major RSGF variants.
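The RSGF gradient estimator is generally:
$$g_\delta(x; w, \xi) = \frac{d}{2\delta}\big(F(x + \delta w, \xi) - F(x - \delta w, \xi)\big)\, w, \qquad w \sim \mathrm{Unif}(\mathbb{S}^{d-1}),$$
or the Gaussian variant:
$$G_\mu(x; u, \xi) = \frac{F(x + \mu u, \xi) - F(x, \xi)}{\mu}\, u, \qquad u \sim \mathcal{N}(0, I_d).$$
These estimators are unbiased for $\nabla f_\delta(x)$ (or $\nabla f_\mu(x)$), with variance that grows linearly in the dimension, e.g. $O(d L^2)$ for the uniform estimator when $F(\cdot, \xi)$ is $L$-Lipschitz (Ghadimi et al., 2013, Chen et al., 2023).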
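The two estimators can be sketched in plain Python as follows. This is a minimal illustration rather than a reference implementation: the function names (`sample_unit_sphere`, `uniform_two_point`, `gaussian_forward`) and the convention that the oracle is a callable `F(x, xi)` on Python lists are assumptions made for this example.

```python
import math
import random


def sample_unit_sphere(d, rng):
    """Draw a direction uniformly from the unit sphere in R^d."""
    v = [rng.gauss(0.0, 1.0) for _ in range(d)]
    norm = math.sqrt(sum(c * c for c in v))
    return [c / norm for c in v]


def uniform_two_point(F, x, xi, delta, rng):
    """Two-point estimator: d/(2*delta) * (F(x+delta*w) - F(x-delta*w)) * w."""
    d = len(x)
    w = sample_unit_sphere(d, rng)
    x_plus = [xj + delta * wj for xj, wj in zip(x, w)]
    x_minus = [xj - delta * wj for xj, wj in zip(x, w)]
    # Both evaluations share the same data sample xi.
    scale = d * (F(x_plus, xi) - F(x_minus, xi)) / (2.0 * delta)
    return [scale * wj for wj in w]


def gaussian_forward(F, x, xi, mu, rng):
    """Gaussian variant: (F(x+mu*u) - F(x)) / mu * u with u ~ N(0, I)."""
    u = [rng.gauss(0.0, 1.0) for _ in range(len(x))]
    x_plus = [xj + mu * uj for xj, uj in zip(x, u)]
    scale = (F(x_plus, xi) - F(x, xi)) / mu
    return [scale * uj for uj in u]
```

For a linear oracle `F(x, xi) = a·x + xi`, averaging many draws of either estimator should approach `a`, which gives a quick sanity check of unbiasedness.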
3. Baseline RSGF Algorithms
All RSGF algorithms iterate as follows:
- Sample a random direction $w_t$ (or $u_t$), and a data sample $\xi_t$.
- Compute the stochastic finite-difference estimator $g_t$.
- Update the iterate: $x_{t+1} = x_t - \eta g_t$, for some stepsize $\eta > 0$.
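A widely analyzed version (uniform smoothing, two-point estimator) is (Lin et al., 2022, Chen et al., 2023):
$$g_t = \frac{d}{2\delta}\big(F(x_t + \delta w_t, \xi_t) - F(x_t - \delta w_t, \xi_t)\big)\, w_t, \qquad x_{t+1} = x_t - \eta g_t, \qquad t = 0, \dots, T-1.$$
The output is $x_R$ for an index $R$ drawn uniformly at random from $\{0, \dots, T-1\}$.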
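The iteration above can be sketched as a short loop. The names `rsgf`, `sample_xi`, and `noisy_quadratic` are hypothetical, and returning the final iterate alongside the randomly selected one is a convenience added here for inspection, not part of the analyzed method.

```python
import math
import random


def sample_unit_sphere(d, rng):
    """Draw a direction uniformly from the unit sphere in R^d."""
    v = [rng.gauss(0.0, 1.0) for _ in range(d)]
    norm = math.sqrt(sum(c * c for c in v))
    return [c / norm for c in v]


def rsgf(F, sample_xi, x0, eta, delta, T, seed=0):
    """Run T zeroth-order steps; return (x_R, x_T), R uniform on {0,...,T-1}."""
    rng = random.Random(seed)
    d = len(x0)
    x = list(x0)
    R = rng.randrange(T)  # index of the iterate that is reported
    x_R = None
    for t in range(T):
        if t == R:
            x_R = list(x)
        w = sample_unit_sphere(d, rng)
        xi = sample_xi(rng)
        x_plus = [xj + delta * wj for xj, wj in zip(x, w)]
        x_minus = [xj - delta * wj for xj, wj in zip(x, w)]
        scale = d * (F(x_plus, xi) - F(x_minus, xi)) / (2.0 * delta)
        # Gradient-free step: x <- x - eta * g_t with g_t = scale * w.
        x = [xj - eta * scale * wj for xj, wj in zip(x, w)]
    return x_R, x


def noisy_quadratic(x, xi):
    """Toy oracle (an assumption for the demo): noisy scaling of ||x||^2."""
    return (1.0 + 0.1 * xi) * sum(c * c for c in x)
```

A call such as `rsgf(noisy_quadratic, lambda rng: rng.uniform(-1.0, 1.0), [2.0, -2.0, 1.0], eta=0.05, delta=0.01, T=500)` exercises the loop on a smooth toy problem; nonsmooth objectives use the same code path, since only function values are queried.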
In the setting where $f$ is smooth, the estimator is tailored so that the bias due to smoothing, which is of order $O(\mu L d^{3/2})$ for the Gaussian variant, can be matched or dominated by the variance by choosing $\mu$ sufficiently small, delivering the optimal bias-variance tradeoff (Ghadimi et al., 2013).
4. Convergence Theory and Complexity
Expectation bounds:
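Under Lipschitz assumptions, the smoothed RSGF algorithm guarantees:
$$\mathbb{E}\big[\min\{\|g\| : g \in \partial_\delta f(x_R)\}\big] \le \epsilon$$
with sample complexity (number of zeroth-order calls):
$$O\big(d^{3/2}\, \delta^{-1} \epsilon^{-4}\big),$$
where $d$ is the dimension, $\delta$ is the smoothing parameter, and $\epsilon$ the stationarity target (Lin et al., 2022, Chen et al., 2023).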
For the smooth case, if $f$ is $L$-smooth and the stepsize and smoothing parameters are appropriately chosen,
$$\mathbb{E}\big[\|\nabla f(x_R)\|^2\big] \le \epsilon$$
with required number of calls $O(d / \epsilon^2)$ (Ghadimi et al., 2013).
High-probability guarantees (Two-phase RSGF):
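Using $S$ independent runs (with $S$ of order $\log(1/\Lambda)$ for a failure probability $\Lambda \in (0, 1)$), and validating candidate solutions with a mini-batch estimator, the two-phase RSGF returns a point $\bar{x}$ that achieves:
$$\mathrm{Prob}\big(\min\{\|g\| : g \in \partial_\delta f(\bar{x})\} \le \epsilon\big) \ge 1 - \Lambda$$
with total cost
$$O\left(\frac{d^{3/2}}{\delta \epsilon^{4}} \log\frac{1}{\Lambda} + \frac{d}{\Lambda \epsilon^{2}} \log\frac{1}{\Lambda}\right)$$
(Lin et al., 2022, Chen et al., 2023).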
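The validation phase can be sketched as follows: given candidate points from independent runs, estimate the smoothed gradient at each with a mini-batch of two-point estimates and keep the candidate with the smallest estimated norm. The names `minibatch_estimate` and `select_candidate`, and the parameter `batch`, are illustrative assumptions; the candidates are taken as input rather than produced here.

```python
import math
import random


def sample_unit_sphere(d, rng):
    """Draw a direction uniformly from the unit sphere in R^d."""
    v = [rng.gauss(0.0, 1.0) for _ in range(d)]
    norm = math.sqrt(sum(c * c for c in v))
    return [c / norm for c in v]


def two_point(F, x, w, xi, delta):
    """Two-point estimate along direction w using data sample xi."""
    d = len(x)
    x_plus = [xj + delta * wj for xj, wj in zip(x, w)]
    x_minus = [xj - delta * wj for xj, wj in zip(x, w)]
    scale = d * (F(x_plus, xi) - F(x_minus, xi)) / (2.0 * delta)
    return [scale * wj for wj in w]


def minibatch_estimate(F, sample_xi, x, delta, batch, rng):
    """Average `batch` independent two-point estimates at x."""
    d = len(x)
    total = [0.0] * d
    for _ in range(batch):
        g = two_point(F, x, sample_unit_sphere(d, rng), sample_xi(rng), delta)
        total = [tj + gj for tj, gj in zip(total, g)]
    return [tj / batch for tj in total]


def select_candidate(F, sample_xi, candidates, delta, batch, seed=0):
    """Return the candidate with the smallest estimated gradient norm."""
    rng = random.Random(seed)
    best, best_norm = None, math.inf
    for x in candidates:
        g = minibatch_estimate(F, sample_xi, x, delta, batch, rng)
        norm = math.sqrt(sum(c * c for c in g))
        if norm < best_norm:
            best, best_norm = x, norm
    return best, best_norm
```

Larger values of `batch` make the selection more reliable at the cost of extra function evaluations, which is the source of the additional validation term in the total cost.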
5. Advanced Extensions: Acceleration and Variance Reduction
Momentum/acceleration
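Accelerated RSGF algorithms incorporate a momentum term into the update, combining the current gradient-free estimate with previous search directions through a weighting parameter and a normalization constant. This approach yields faster convergence rates for strongly convex objectives, with bias and variance controlled at the same order (Luo et al., 2019).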
Recursive variance reduction (SPIDER/SARAH):
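Utilizing recursive gradient estimators $v_t$, complexity with respect to $\epsilon$ can be improved from $O(\epsilon^{-4})$ to $O(\epsilon^{-3})$. Specifically, the GFM+ variant forms
$$v_t = \begin{cases} \dfrac{1}{b'} \sum_{i=1}^{b'} g(x_t; w_i, \xi_i), & t \bmod m = 0, \\[6pt] v_{t-1} + \dfrac{1}{b} \sum_{i=1}^{b} \big(g(x_t; w_i, \xi_i) - g(x_{t-1}; w_i, \xi_i)\big), & \text{otherwise}, \end{cases}$$
where $g$ is the two-point estimator, with suitable choice of epoch length $m$ and batch sizes $b$, $b'$ (Chen et al., 2023).
The total zeroth-order oracle complexity becomes:
$$O\big(d^{3/2}\, \delta^{-1} \epsilon^{-3}\big),$$
which is still dimension-dependent but tighter in $\epsilon$ than vanilla RSGF.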
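A SPIDER-style recursive estimator of this kind can be sketched as below. This is an illustration under assumptions, not the authors' implementation: the names `variance_reduced_zo`, `b_large`, and `b_small` are hypothetical, and the sketch returns the last iterate for simplicity. The essential detail is that each correction term evaluates the estimator at the current and previous iterate with the same direction and data sample.

```python
import math
import random


def sample_unit_sphere(d, rng):
    """Draw a direction uniformly from the unit sphere in R^d."""
    v = [rng.gauss(0.0, 1.0) for _ in range(d)]
    norm = math.sqrt(sum(c * c for c in v))
    return [c / norm for c in v]


def two_point(F, x, w, xi, delta):
    """Two-point estimate along direction w using data sample xi."""
    d = len(x)
    x_plus = [xj + delta * wj for xj, wj in zip(x, w)]
    x_minus = [xj - delta * wj for xj, wj in zip(x, w)]
    scale = d * (F(x_plus, xi) - F(x_minus, xi)) / (2.0 * delta)
    return [scale * wj for wj in w]


def mean_vectors(vectors):
    """Coordinate-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def variance_reduced_zo(F, sample_xi, x0, eta, delta, T, m,
                        b_large, b_small, seed=0):
    """Recursive (SPIDER-style) zeroth-order method; returns the last iterate."""
    rng = random.Random(seed)
    d = len(x0)
    x, x_prev, v = list(x0), None, None
    for t in range(T):
        if t % m == 0:
            # Epoch start: fresh large-batch estimate.
            draws = [(sample_unit_sphere(d, rng), sample_xi(rng))
                     for _ in range(b_large)]
            v = mean_vectors([two_point(F, x, w, xi, delta) for w, xi in draws])
        else:
            # Recursive correction with shared (w, xi) at x and x_prev.
            diffs = []
            for _ in range(b_small):
                w, xi = sample_unit_sphere(d, rng), sample_xi(rng)
                g_new = two_point(F, x, w, xi, delta)
                g_old = two_point(F, x_prev, w, xi, delta)
                diffs.append([a - b for a, b in zip(g_new, g_old)])
            v = [vj + cj for vj, cj in zip(v, mean_vectors(diffs))]
        x_prev = x
        x = [xj - eta * vj for xj, vj in zip(x, v)]
    return x
```

Sharing `(w, xi)` across the two evaluations makes the variance of each correction scale with the distance between consecutive iterates, which is what allows small batches between epoch restarts.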
6. Stationarity Concepts and Smoothing-Subdifferential Mapping
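For nonsmooth objectives, stationarity is formalized using the Goldstein subdifferential:
$$\partial_\delta f(x) = \mathrm{conv}\Big(\bigcup_{y \in \mathbb{B}_\delta(x)} \partial f(y)\Big),$$
where $\partial f$ is the Clarke subdifferential and $\mathbb{B}_\delta(x)$ is the closed ball of radius $\delta$ around $x$. $(\delta, \epsilon)$-Goldstein stationarity is achieved when
$$\min\{\|g\| : g \in \partial_\delta f(x)\} \le \epsilon.$$
Uniform smoothing ensures $\nabla f_\delta(x) \in \partial_\delta f(x)$, and RSGF constructs its stationarity guarantees using this mapping. This inclusion is central to both theoretical convergence and complexity analysis (Lin et al., 2022, Chen et al., 2023).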
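A one-dimensional worked example may help make the inclusion concrete. The choice $f(x) = |x|$ is an assumption made purely for illustration; for $|x| \le \delta$ the smoothed function, its gradient, and the Goldstein subdifferential can be computed in closed form:

```latex
% Illustrative case: f(x) = |x| on the real line, u ~ Unif[-1, 1], |x| <= delta.
\begin{align*}
f_\delta(x) &= \frac{1}{2}\int_{-1}^{1} |x + \delta u|\, du
             = \frac{1}{2\delta}\int_{x-\delta}^{x+\delta} |t|\, dt
             = \frac{x^2 + \delta^2}{2\delta}, \\
\nabla f_\delta(x) &= \frac{x}{\delta} \in [-1, 1], \\
\partial_\delta f(x) &= \mathrm{conv}\Big(\bigcup_{|y - x| \le \delta} \partial f(y)\Big) = [-1, 1]
  \quad \text{since } 0 \in [x - \delta,\, x + \delta] \text{ and } \partial f(0) = [-1, 1].
\end{align*}
```

Here $\nabla f_\delta(x) \in \partial_\delta f(x)$ as claimed. Because $0 \in \partial_\delta f(x)$, every such $x$ is $(\delta, \epsilon)$-Goldstein stationary for any $\epsilon \ge 0$, even though $|\nabla f_\delta(x)| = |x| / \delta$ vanishes only at $x = 0$; a small smoothed gradient is sufficient for Goldstein stationarity but not necessary.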
7. Practical Implementation and Applications
Parameter selection:
Step-size choices (which differ between the nonsmooth and smooth analyses) are critical for balancing estimation bias and variance. The smoothing parameter ($\delta$ or $\mu$) is usually tied to the target accuracy $\epsilon$ (Ghadimi et al., 2013, Lin et al., 2022, Chen et al., 2023).
Two-phase validation:
Employing multiple independent runs and selecting via post hoc validation using mini-batch gradient estimators yields strong large-deviation bounds.
Applications:
Two-phase RSGF (2-SGFM) has been used to train small-scale convolutional neural networks on MNIST, reaching accuracy competitive with classical gradient-based methods on this task even when only function-value queries are available. Batch size and validation sample size influence empirical stability in a way that is consistent with the theory (Lin et al., 2022).
Complexity comparison and guidelines:
| Algorithm | Complexity (oracle calls) | Key features |
|---|---|---|
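| RSGF (basic) | $O(d^{3/2} \delta^{-1} \epsilon^{-4})$ | Two-point, smoothing |
| RSGF (two-phase) | $O\big(d^{3/2} \delta^{-1} \epsilon^{-4} \log(1/\Lambda) + d \Lambda^{-1} \epsilon^{-2} \log(1/\Lambda)\big)$ | High-prob confidence |
| Variance-reduced RSGF (GFM+) | $O(d^{3/2} \delta^{-1} \epsilon^{-3})$ | Recursive variance red. |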
| Classic gradient-based | 2 (for reference) | First-order only |
Summary:
The RSGF framework and its accelerations provide general, robust zeroth-order algorithms for high-dimensional, nonsmooth, nonconvex stochastic optimization, with theoretically grounded oracle complexity and demonstrated practical feasibility (Ghadimi et al., 2013, Luo et al., 2019, Lin et al., 2022, Chen et al., 2023).