Recursive Variance-Reduced ZO Methods

Updated 24 April 2026

The paper presents a novel formulation extending variance reduction frameworks like SVRG, SPIDER, and SAGA to noisy zeroth-order settings for improved convergence in high-dimensional, nonconvex optimization.
It details strategies such as SPIDER/SARAH-type recursion and incremental Jacobian updates that effectively suppress the intrinsic variance of gradient estimators.
Empirical and theoretical analyses demonstrate that these recursive methods achieve convergence rates comparable to first-order techniques while significantly reducing query complexity.

Recursive variance-reduced zeroth-order methods are a class of algorithms for stochastic optimization when only noisy function evaluations are available. Their explicit aim is to reduce the intrinsically high variance of zeroth-order (ZO) gradient estimators through recursive or incremental control-variate schemes. These methods extend variance reduction frameworks such as SVRG, SPIDER, and SAGA to the ZO domain, using recursive update structures and control variates to suppress estimator noise without the prohibitive cost of full-batch or high-query-complexity estimators. This article surveys core formulations, variance reduction mechanisms, complexity bounds, and representative algorithms from the theoretical and applied literature.

1. Fundamental Problem Setup and Smoothing

Recursive variance-reduced zeroth-order methods address the problem of minimizing a (possibly nonsmooth, nonconvex) stochastic objective where direct gradient access is unavailable: $\min_{x\in\mathcal{X}} f(x) = \mathbb{E}_\xi[\tilde{f}(x,\xi)]$ with $\mathcal{X}$ a closed convex subset of $\mathbb{R}^n$ and $\tilde{f}$ only available via noisy queries. The classical zeroth-order approach is to replace $f$ with a smoothed version, e.g., via spherical or Gaussian smoothing. In the spherical case, $f_\eta(x) = \mathbb{E}_u[f(x + \eta u)], \qquad u \sim \text{Uniform}(\mathbb{B}^n)$, where $\mathbb{B}^n$ is the unit ball in $\mathbb{R}^n$ and $\eta > 0$ is the smoothing radius. The gradient of $f_\eta$, which is smooth, can then be written as an expectation over the sphere $\eta S^{n-1}$ and estimated without bias from function evaluations of $f$ at perturbed points. This smoothing is crucial both for theory (to guarantee stationarity in a generalized Clarke sense) and for keeping the estimator variance bounded when $f$ is only Lipschitz (Marrinan et al., 2023).
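
As a hedged one-dimensional illustration (an example constructed here, not taken from the cited works), take $f(x) = |x|$ and assume $u$ is uniform on the unit ball, which for $n = 1$ is the interval $[-1, 1]$. The smoothed function and its derivative then have closed forms:

$f_\eta(x) = \frac{1}{2\eta}\int_{x-\eta}^{x+\eta} |t|\,dt = \begin{cases} \dfrac{x^2 + \eta^2}{2\eta}, & |x| \le \eta \\ |x|, & |x| > \eta \end{cases} \qquad f_\eta'(x) = \begin{cases} x/\eta, & |x| \le \eta \\ \operatorname{sign}(x), & |x| > \eta \end{cases}$

In this example $f_\eta$ is differentiable everywhere, its derivative is Lipschitz with constant $1/\eta$, and the smoothing bias $f_\eta(x) - f(x) = (|x| - \eta)^2 / (2\eta)$ on $|x| \le \eta$ is at most $\eta/2$. For $|x| \le \eta$ the symmetric difference $\frac{1}{2\eta}\left[|x + \eta| - |x - \eta|\right]$ equals $x/\eta = f_\eta'(x)$, so two function values recover the smoothed derivative exactly.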

2. Zeroth-Order Gradient Estimators and Their Variance

Key ZO estimators include the two-point symmetric estimator: $g_\eta(x; v, \xi) = \frac{n}{2\eta}\left[\tilde{f}(x + v, \xi) - \tilde{f}(x - v, \xi)\right]\frac{v}{\|v\|}$ with $v$ drawn uniformly from the sphere $\eta S^{n-1}$ (so that $\|v\| = \eta$), and various single-directional estimators (random Gaussian, coordinate-finite-difference, etc.). The estimator

$\mathcal{X}$ 1

is unbiased for $\mathcal{X}$ 2: $\mathcal{X}$ 3 with variance scaling as $\mathcal{X}$ 4. Importantly, in standard ZO regimes this variance remains non-vanishing as $\mathcal{X}$ 5 and $\mathcal{X}$ 6 grow, motivating the development of recursive variance-reduction techniques that can efficiently drive estimator variance toward zero (Marrinan et al., 2023, Ji et al., 2019).
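
The two-point symmetric estimator above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not code from the cited papers: the direction $v$ is assumed uniform on the sphere of radius $\eta$, function evaluations are noise-free (no $\xi$), and the names `sample_sphere`, `two_point_estimate`, `averaged_estimate`, and the linear test function are hypothetical. A linear function is used because its smoothed gradient is known exactly, so the Monte Carlo average can be compared against it.

```python
import math
import random


def sample_sphere(n, radius, rng):
    """Draw a point uniformly from the sphere of the given radius in R^n."""
    z = [rng.gauss(0.0, 1.0) for _ in range(n)]
    norm = math.sqrt(sum(c * c for c in z))
    return [radius * c / norm for c in z]


def two_point_estimate(f, x, eta, rng):
    """g = n/(2*eta) * [f(x+v) - f(x-v)] * v/||v||, with ||v|| = eta."""
    n = len(x)
    v = sample_sphere(n, eta, rng)
    f_plus = f([xi + vi for xi, vi in zip(x, v)])
    f_minus = f([xi - vi for xi, vi in zip(x, v)])
    scale = n * (f_plus - f_minus) / (2.0 * eta)
    return [scale * vi / eta for vi in v]


def averaged_estimate(f, x, eta, num_samples, rng):
    """Plain Monte Carlo average of independent two-point estimates."""
    total = [0.0] * len(x)
    for _ in range(num_samples):
        g = two_point_estimate(f, x, eta, rng)
        total = [t + gi for t, gi in zip(total, g)]
    return [t / num_samples for t in total]


# Hypothetical test problem: f(x) = a.x, whose (smoothed) gradient is a.
a = [1.0, -2.0, 3.0]


def linear(x):
    return sum(ai * xi for ai, xi in zip(a, x))


rng = random.Random(0)
x0 = [0.5, 0.5, 0.5]
single = two_point_estimate(linear, x0, 0.1, rng)
mean_est = averaged_estimate(linear, x0, 0.1, 20000, rng)
print("one sample:", single)
print("average of 20000 samples:", mean_est)
```

A single sample is typically far from `a`, while the average over many samples is close to it. Plain averaging pays for this with a large number of queries, which is the cost that recursive variance reduction is designed to avoid.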

3. Recursive Variance-Reduction Strategies

Recursive variance-reduction in ZO optimization leverages control variates and recursions over historical gradient estimates, typically in one of two paradigms:

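SPIDER/SARAH-Type Recursion: At iteration $k$, the estimator follows a recursion of the form

$v_k = \begin{cases} \hat{g}(x_k; \mathcal{S}_k^{\text{large}}), & k \bmod q = 0 \\ v_{k-1} + \hat{g}(x_k; \mathcal{S}_k) - \hat{g}(x_{k-1}; \mathcal{S}_k), & \text{otherwise} \end{cases}$

where $\hat{g}(x; \mathcal{S})$ denotes a ZO gradient estimate averaged over the sample batch $\mathcal{S}$ and $q$ is the epoch length. Large-batch gradient snapshots are taken only every $q$ iterations, while inner iterations incur only low-variance updates via differences evaluated on a shared mini-batch, and thus avoid full-batch cost (Ji et al., 2019, Zhang et al., 2022).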

Incremental Jacobian or Control-Variate Updating: For finite-sum or composite problems, maintain a Jacobian estimate $\mathcal{X}$ 9 and update only a fraction of its entries at each step,

$\mathbb{R}^n$ 0

then use

$\mathbb{R}^n$ 1

to achieve both variance reduction and computational scalability (Zhang et al., 8 Jan 2026).
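
The idea of keeping per-component estimates and refreshing only part of them at each step can be illustrated with a SAGA-style control variate. The sketch below is an assumption-laden stand-in and not the ZIVR update itself, whose exact form is not reproduced here: it stores one coordinate-wise central-difference gradient estimate per component, refreshes a single row per iteration, and uses the stored average as a control variate. The least-squares components and all function names (`coord_grad`, `make_problem`, `zo_saga`) are hypothetical.

```python
import math
import random


def coord_grad(f, x, delta=1e-4):
    """Coordinate-wise central-difference gradient estimate (2n queries)."""
    g = []
    for j in range(len(x)):
        xp = list(x)
        xm = list(x)
        xp[j] += delta
        xm[j] -= delta
        g.append((f(xp) - f(xm)) / (2.0 * delta))
    return g


def make_problem(m, n, rng):
    """Hypothetical components f_i(x) = 0.5 * (a_i.x - b_i)^2 with unit-norm a_i."""
    comps = []
    for _ in range(m):
        a = [rng.gauss(0.0, 1.0) for _ in range(n)]
        norm = math.sqrt(sum(c * c for c in a))
        a = [c / norm for c in a]
        b = rng.gauss(0.0, 1.0)
        comps.append(
            lambda x, a=a, b=b: 0.5 * (sum(p * q for p, q in zip(a, x)) - b) ** 2
        )
    return comps


def zo_saga(comps, x0, step, iters, rng):
    """SAGA-style loop: one stored estimate per component, one refresh per step."""
    m, n = len(comps), len(x0)
    x = list(x0)
    table = [coord_grad(fi, x) for fi in comps]  # stored per-component estimates
    avg = [sum(row[j] for row in table) / m for j in range(n)]
    for _ in range(iters):
        i = rng.randrange(m)
        g_new = coord_grad(comps[i], x)
        # Control-variate direction: fresh estimate minus stale one, plus stored mean.
        v = [g_new[j] - table[i][j] + avg[j] for j in range(n)]
        # Refresh only row i and update the running mean incrementally.
        avg = [avg[j] + (g_new[j] - table[i][j]) / m for j in range(n)]
        table[i] = g_new
        x = [x[j] - step * v[j] for j in range(n)]
    return x


rng = random.Random(1)
comps = make_problem(m=20, n=3, rng=rng)


def full_loss(x):
    return sum(fi(x) for fi in comps) / len(comps)


def grad_norm(x):
    return math.sqrt(sum(c * c for c in coord_grad(full_loss, x)))


x0 = [3.0, -3.0, 3.0]
x_final = zo_saga(comps, x0, step=0.1, iters=4000, rng=rng)
print("gradient norm before:", grad_norm(x0))
print("gradient norm after: ", grad_norm(x_final))
```

Each iteration queries only one component (six function values for $n = 3$) yet uses a constant step size, which is possible because the stored estimates cancel most of the sampling noise. The price is the memory for the table, one row per component.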

The recursive structure ensures that estimator variance diminishes as historical information is incrementally and adaptively reused or updated, ultimately matching the theoretical rates of first-order variance-reduced methods up to a dimension factor.
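
The SPIDER/SARAH-type recursion described above can be sketched as follows. This is a minimal illustration under assumptions, not a reference implementation of ZO-SPIDER-Coord: it uses a coordinate-wise central-difference estimator with noise-free component evaluations, takes a full-batch snapshot every `period` iterations, and in between adds the difference of two estimates computed on the same mini-batch. The test problem and the names `coord_grad`, `make_problem`, `batch_loss`, and `zo_spider` are hypothetical.

```python
import math
import random


def coord_grad(f, x, delta=1e-4):
    """Coordinate-wise central-difference gradient estimate (2n queries)."""
    g = []
    for j in range(len(x)):
        xp = list(x)
        xm = list(x)
        xp[j] += delta
        xm[j] -= delta
        g.append((f(xp) - f(xm)) / (2.0 * delta))
    return g


def make_problem(m, n, rng):
    """Hypothetical components f_i(x) = 0.5 * (a_i.x - b_i)^2 with unit-norm a_i."""
    comps = []
    for _ in range(m):
        a = [rng.gauss(0.0, 1.0) for _ in range(n)]
        norm = math.sqrt(sum(c * c for c in a))
        a = [c / norm for c in a]
        b = rng.gauss(0.0, 1.0)
        comps.append(
            lambda x, a=a, b=b: 0.5 * (sum(p * q for p, q in zip(a, x)) - b) ** 2
        )
    return comps


def batch_loss(comps, batch):
    """Average of the selected components, as a function of x."""
    return lambda x: sum(comps[i](x) for i in batch) / len(batch)


def zo_spider(comps, x0, step, iters, period, batch_size, rng):
    m = len(comps)
    x_prev, x, v = None, list(x0), None
    for k in range(iters):
        if k % period == 0:
            # Snapshot: restart the recursion from a full-batch estimate.
            v = coord_grad(batch_loss(comps, range(m)), x)
        else:
            # Inner step: same mini-batch at the current and previous iterate.
            batch = [rng.randrange(m) for _ in range(batch_size)]
            fb = batch_loss(comps, batch)
            g_new = coord_grad(fb, x)
            g_old = coord_grad(fb, x_prev)
            v = [vj + gn - go for vj, gn, go in zip(v, g_new, g_old)]
        x_prev, x = x, [xj - step * vj for xj, vj in zip(x, v)]
    return x


rng = random.Random(2)
comps = make_problem(m=20, n=3, rng=rng)
full_loss = batch_loss(comps, range(len(comps)))


def grad_norm(x):
    return math.sqrt(sum(c * c for c in coord_grad(full_loss, x)))


x0 = [3.0, -3.0, 3.0]
x_final = zo_spider(comps, x0, step=0.1, iters=2000, period=5, batch_size=5, rng=rng)
print("gradient norm before:", grad_norm(x0))
print("gradient norm after: ", grad_norm(x_final))
```

Because consecutive iterates are close, the difference `g_new - g_old` is small and contributes little variance, so the step size can stay constant. The `period` and `batch_size` values here are arbitrary choices for the toy problem, not tuned recommendations.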

4. Convergence Rates and Complexity Bounds

Recursive variance-reduced zeroth-order methods achieve complexity bounds that interpolate between standard zeroth-order SGD and first-order SVRG or SPIDER, with a critical dependence on batch schedule and recursion depth. Representative results:

Algorithm Class	Stationarity Target	Oracle Complexity	Batch Cost Per Step	Reference
ZO-GD (two-point)	$\mathbb{R}^n$ 2	$\mathbb{R}^n$ 3	$\mathbb{R}^n$ 4	(Ji et al., 2019)
ZO-SGD (random direction)	$\mathbb{R}^n$ 5	$\mathbb{R}^n$ 6	$\mathbb{R}^n$ 7	(Ji et al., 2019)
ZO-SPIDER-Coord (recursive)	$\mathbb{R}^n$ 8	$\mathbb{R}^n$ 9	$\tilde{f}$ 0	(Ji et al., 2019, Zhang et al., 2022)
VRG-ZO (spherical, one-loop)	$\tilde{f}$ 1	$\tilde{f}$ 2	$\tilde{f}$ 3	(Marrinan et al., 2023)
ZIVR (incremental composite)	$\tilde{f}$ 4	$\tilde{f}$ 5 (nonconvex), $\tilde{f}$ 6 (convex)	$\tilde{f}$ 7	(Zhang et al., 8 Jan 2026)

These rates are obtained because recursive variance reduction makes the estimator error contract fast enough to permit non-diminishing step sizes and low total sample complexity, even in high-dimensional or constrained domains (Marrinan et al., 2023, Zhang et al., 8 Jan 2026, Ji et al., 2019).

5. Algorithmic Instantiations and Extensions

Prominent algorithmic frameworks include:

VRG-ZO (spherical smoothing, projection-based) (Marrinan et al., 2023): Implements one-loop variance-reduced two-point smoothing with growing batch size $\tilde{f}$ 8 and spherical perturbations, achieving $\tilde{f}$ 9 projections and $f$ 0 function calls for guaranteed $f$ 1-Clarke stationarity.
ZO-SPIDER-Coord (Ji et al., 2019, Zhang et al., 2022): Alternates full and mini-batch central-difference gradient estimators in a recursive control variate fashion, with provable $f$ 2 complexity and no need for diminishing step sizes.
Incremental ZIVR (Zhang et al., 8 Jan 2026): Maintains an explicit Jacobian approximation for finite-sum composite problems, incrementally randomizing updates for scalability, and supporting proximal structures.
Networked/DZOVR (Chen et al., 2023): Combines two-point estimators with momentum and gradient tracking for distributed nonconvex optimization, maintaining $f$ 3 per-node sample complexity regardless of network topology.

These algorithms may employ randomization over directions, direction sampling schedules, adaptive batch sizes, and low-memory blockwise Jacobian storage to achieve practical scalability in high-dimensional or distributed settings.

6. Connections, Variations, and Practical Considerations

Recursive variance-reduced ZO methods are closely related to their first-order progenitors (SVRG, SARAH, SPIDER, SAGA), but must contend with the unique high-variance characteristics of finite-difference estimators. Key connections and considerations include:

Smoothing Parameter ($\eta$) Tuning: The smoothing radius directly controls bias-variance tradeoffs, with smaller $\eta$ reducing the distance to Clarke stationarity but increasing the gradient Lipschitz constant of $f_\eta$ and tightening step-size constraints (Marrinan et al., 2023).
Memory–Variance–Query Complexity Tradeoff: Algorithms such as ZIVR reduce variance with minimal memory overhead, compared to classical multi-loop methods that store full batch gradients (Zhang et al., 8 Jan 2026).
Distributed and Composite Optimization: Extensions to distributed (decentralized) and composite regularized settings (e.g., $\ell_1$ regularization, indicator constraints) have been achieved by carefully splitting estimator updates across blocks or network nodes (Chen et al., 2023, Zhang et al., 8 Jan 2026).
Negative Curvature Finding: In recent work, recursive ZO variance-reduced frameworks have been merged with negative curvature discovery heuristics to guarantee second-order stationarity efficiently (Zhang et al., 2022).
Adaptive Query Reuse: LAZO and related paradigms employ instance-adaptive reuse of queries, further reducing effective estimator variance and total query complexity (Xiao et al., 2022).

7. Practical Impact and Empirical Performance

Empirical studies report that recursive variance-reduced ZO frameworks outperform classical ZO-SGD and non-recursive methods in query complexity, convergence rate, and stability, across applications including robust regression, decentralized learning, bandit settings, composite regularization, and diffusion model fine-tuning (Marrinan et al., 2023, Chen et al., 2023, Zhang et al., 8 Jan 2026, Ren et al., 2 Feb 2025). For instance, VRG-ZO matches the best complexity of Gaussian-smoothed approaches for nonconvex and nonsmooth problems while delivering genuine $f$ 9-Clarke stationarity, and ZIVR demonstrates faster convergence and lower oracle count in regularized learning tasks.

In summary, recursive variance-reduced zeroth-order methods are a principal advance in making zeroth-order stochastic optimization efficient and scalable for nonsmooth, nonconvex, distributed, and constrained problems, achieving convergence rates comparable to those of first-order variance-reduced methods up to a dimension factor while relying solely on function-value information. The representative algorithms, their analyses, and their practical instantiations support the effectiveness of recursive variance reduction in zeroth-order regimes (Marrinan et al., 2023, Ji et al., 2019, Chen et al., 2023, Zhang et al., 8 Jan 2026).