Accelerated ZO-SVRG Methods
- The paper’s main contribution is the development of accelerated ZO-SVRG variants that reduce query complexity while enhancing convergence rates.
- These methods employ refined gradient estimators such as coordinate-wise and random-direction averaging to balance accuracy and computational cost.
- Empirical results show significant efficiency gains in memory-limited, large-scale tasks like fine-tuning language models and black-box adversarial training.
Zeroth-order stochastic variance-reduced gradient methods (ZO-SVRG) constitute a prominent class of algorithms designed for stochastic optimization when gradient access is unavailable, but function evaluations are permitted. Accelerated variants of ZO-SVRG deploy refined gradient estimators and improved variance-control strategies, enabling faster convergence—both theoretically and empirically—while reducing function query complexity. Such algorithms are of increasing importance in large-scale learning, black-box model tuning, and regimes where backpropagation is prohibitively expensive or unsupported, particularly in fine-tuning LLMs, black-box adversarial applications, and chemistry/material tasks.
1. Background: Zeroth-Order SVRG and Acceleration Principles
The classic ZO-SVRG paradigm (Liu et al., 2018) replaces first-order gradients in SVRG (Stochastic Variance-Reduced Gradient) with finite-difference or random-direction gradient approximations, leveraging only function values. "Plain" ZO-SVRG employs a two-point random gradient estimator per sample:

$$\hat{\nabla} f_i(x) = \frac{d}{2\mu}\left[f_i(x + \mu u) - f_i(x - \mu u)\right] u,$$

where $\mu > 0$ is a smoothing parameter and $u$ is a random direction drawn uniformly from the unit sphere. The SVRG update structure is preserved: periodic full-gradient ("snapshot") estimation alternates with inner-loop variance-reduced updates using mini-batches.
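As a concrete illustration, the two-point estimator can be sketched in a few lines of NumPy (a minimal sketch; the function and variable names are ours, not from the cited papers):

```python
import numpy as np

_rng = np.random.default_rng(0)

def two_point_estimate(f, x, mu=1e-3, rng=_rng):
    """Two-point random gradient estimator:
    (d / (2*mu)) * [f(x + mu*u) - f(x - mu*u)] * u,
    with u uniform on the unit sphere; costs 2 function queries."""
    d = x.shape[0]
    u = rng.standard_normal(d)
    u /= np.linalg.norm(u)          # uniform random direction on the sphere
    return d * (f(x + mu * u) - f(x - mu * u)) / (2 * mu) * u

# For f(x) = ||x||^2 the estimator is unbiased for the true gradient 2x,
# but any single draw is noisy; averaging many draws recovers 2x.
f = lambda x: float(x @ x)
x = np.ones(4)
g = two_point_estimate(f, x)        # one noisy estimate of 2x
```

A single estimate has variance growing with the dimension $d$; this is precisely the variance that the accelerated variants below attack.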
All accelerated variants target the statistical and computational bottleneck in ZO-SVRG's error bound: an $O(1/b)$ bias term (with $b$ the mini-batch size), introduced by the high variance of the naive two-point estimator. Averaging over multiple directions or switching to coordinate-based estimators mitigates this issue. The most prominent acceleration approaches are:
- Random-direction averaging (ZO-SVRG-Ave)
- Coordinate-wise finite-difference estimators (ZO-SVRG-Coord)
- Seeded, data-parallel SPSA with snapshot-driven variance reduction (e.g., MeZO-SVRG (Gautam et al., 2024))
- Mixed coordinate/random estimators (ZO-SVRG-Coord-Rand (Ji et al., 2019))
2. Core Algorithms and Their Update Rules
Reference ZO-SVRG Skeleton
All ZO-SVRG-type algorithms use a common SVRG structure:
- Epochs: a periodic "snapshot" at a reference point $\tilde{x}$, where an anchored full zeroth-order gradient $\hat{\nabla} f(\tilde{x})$ is estimated.
- Inner loop: update $x_{t+1} = x_t - \eta \hat{v}_t$, where $\hat{v}_t$ is a variance-reduced ZO estimator.
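This skeleton can be sketched end-to-end in NumPy (a hedged illustration under simplifying assumptions: `fs` is a list of per-sample objectives, and all names are ours):

```python
import numpy as np

def zo_grad(f, x, mu, u):
    """Two-point estimate along a fixed direction u (2 function queries)."""
    d = x.shape[0]
    return d * (f(x + mu * u) - f(x - mu * u)) / (2 * mu) * u

def zo_svrg(fs, x0, epochs=15, inner=25, batch=4, eta=0.05, mu=1e-4, seed=0):
    """Minimal ZO-SVRG sketch. Each epoch takes a snapshot x~ and estimates a
    full ZO gradient there; inner steps use the variance-reduced direction
    v = g_i(x) - g_i(x~) + g(x~), with the SAME direction u for both minibatch
    terms so their noise cancels as x approaches x~."""
    rng = np.random.default_rng(seed)
    n, d = len(fs), x0.shape[0]
    x = x0.astype(float).copy()
    for _ in range(epochs):
        x_snap = x.copy()
        g_snap = np.zeros(d)
        for fi in fs:                      # snapshot: full ZO gradient
            u = rng.standard_normal(d); u /= np.linalg.norm(u)
            g_snap += zo_grad(fi, x_snap, mu, u)
        g_snap /= n
        for _ in range(inner):
            idx = rng.integers(0, n, size=batch)
            v = np.zeros(d)
            for i in idx:
                u = rng.standard_normal(d); u /= np.linalg.norm(u)
                # control variate: same u at x and at the snapshot
                v += zo_grad(fs[i], x, mu, u) - zo_grad(fs[i], x_snap, mu, u)
            x -= eta * (v / batch + g_snap)
    return x
```

On a toy finite-sum of quadratics $f_i(x) = \|x - c_i\|^2$, this loop drives $x$ toward the minimizer $\bar{c}$ using only function evaluations.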
Accelerated Variants
| Variant | Estimator Type | Gradient Estimator (per sample $i$) | Queries per Estimate |
|---|---|---|---|
| ZO-SVRG (plain) | 2-point random | $\frac{d}{2\mu}[f_i(x+\mu u) - f_i(x-\mu u)]u$ | $2$ |
| ZO-SVRG-Ave | random avg. | $\frac{1}{q}\sum_{j=1}^{q}\frac{d}{2\mu}[f_i(x+\mu u_j) - f_i(x-\mu u_j)]u_j$ | $2q$ |
| ZO-SVRG-Coord | coordinate-wise | $\sum_{\ell=1}^{d}\frac{f_i(x+\mu e_\ell) - f_i(x-\mu e_\ell)}{2\mu}e_\ell$ | $2d$ |
| ZO-SVRG-Coord-Rand | coord. ref. / rand. inner | coordinate estimator at the reference; 2-point random in the inner loop (Ji et al., 2019) | $2d$ (ref.); $2$ (inner) |
| MeZO-SVRG | batchwise SPSA + seed | shared $u$ for each batch; 2-point SPSA estimation per batch (Gautam et al., 2024) | $2b$ (minibatch size $b$) |
ZO-SVRG-Ave and ZO-SVRG-Coord reduce variance at an increased per-estimate query cost. MeZO-SVRG achieves acceleration through variance reduction, a shared noise direction per batch, and minimal memory overhead.
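The two variance-reducing estimators from the table can be sketched as follows (an illustrative sketch; names are ours, not from the papers):

```python
import numpy as np

_rng = np.random.default_rng(0)

def avg_random_grad(f, x, mu=1e-4, q=10, rng=_rng):
    """ZO-SVRG-Ave-style estimator: average q independent two-point
    estimates. Costs 2q queries; direction-induced variance shrinks ~1/q."""
    d = x.shape[0]
    g = np.zeros(d)
    for _ in range(q):
        u = rng.standard_normal(d)
        u /= np.linalg.norm(u)
        g += d * (f(x + mu * u) - f(x - mu * u)) / (2 * mu) * u
    return g / q

def coord_grad(f, x, mu=1e-4):
    """ZO-SVRG-Coord-style estimator: deterministic central differences
    along every coordinate. Costs 2d queries; no direction variance."""
    d = x.shape[0]
    g = np.empty(d)
    for j in range(d):
        e = np.zeros(d)
        e[j] = mu
        g[j] = (f(x + e) - f(x - e)) / (2 * mu)
    return g
```

For a quadratic such as $f(x) = \|x\|^2$, `coord_grad` recovers the gradient $2x$ essentially exactly (central differences are exact on quadratics), while `avg_random_grad` approaches it at rate $1/\sqrt{q}$: the query/variance trade-off in miniature.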
3. Variance Reduction Mechanisms and Estimator Analysis
Variance reduction in these methods hinges on a control variate:

$$\hat{v}_t = \hat{\nabla} f_{\mathcal{I}_t}(x_t) - \hat{\nabla} f_{\mathcal{I}_t}(\tilde{x}) + \hat{\nabla} f(\tilde{x}),$$

where $\hat{\nabla} f(\tilde{x})$ is the full-batch ZO estimator at the snapshot point, $\hat{\nabla} f_{\mathcal{I}_t}(x_t)$ the minibatch ZO estimator at the current iterate, and $\hat{\nabla} f_{\mathcal{I}_t}(\tilde{x})$ the minibatch estimator at the snapshot. This structure keeps $\hat{v}_t$ an (approximately) unbiased estimate of the smoothed gradient, with variance substantially smaller than if the terms were estimated independently: the two minibatch terms are correlated and largely cancel as $x_t$ approaches $\tilde{x}$.
In ZO-SVRG-Ave, averaging $q$ direction vectors per estimate damps the offending variance blowup by a factor of $q$, replacing the $O(1/b)$ error with $O(1/(bq))$. In ZO-SVRG-Coord, deterministic coordinate estimators eliminate this variance source and the extra error term entirely.
MeZO-SVRG applies variance reduction using a shared perturbation vector for the entire batch, attaining the benefits of data-parallel SPSA and the SVRG control variate (Gautam et al., 2024).
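The memory side of this trick can be illustrated with a short sketch (our own simplification, assuming `batch_loss` evaluates the minibatch loss at given parameters): instead of storing the $d$-dimensional perturbation, only its RNG seed is kept and the vector is regenerated on demand.

```python
import numpy as np

def mezo_spsa_step(batch_loss, params, mu=1e-3, lr=0.02, seed=0):
    """One MeZO-style SPSA step: a single perturbation u, shared by the whole
    minibatch, is regenerated from `seed` each time it is needed instead of
    being stored, so the extra memory is O(1) rather than O(d)."""
    def perturbed(scale):
        rng = np.random.default_rng(seed)          # regenerate u from the seed
        return params + scale * rng.standard_normal(params.shape)

    loss_plus = batch_loss(perturbed(+mu))         # forward pass 1 (whole batch)
    loss_minus = batch_loss(perturbed(-mu))        # forward pass 2 (whole batch)
    g = (loss_plus - loss_minus) / (2 * mu)        # scalar projected derivative
    rng = np.random.default_rng(seed)              # regenerate u for the update
    return params - lr * g * rng.standard_normal(params.shape)
```

In MeZO-SVRG this step is combined with the SVRG control variate: the same seed can be replayed at the snapshot point, so the two minibatch estimates share their perturbation and their noise cancels.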
4. Theoretical Convergence Guarantees
Accelerated ZO-SVRG variants achieve significantly improved convergence rates compared to ZO-SGD and classic zeroth-order methods. Key theoretical results include:
- ZO-SVRG (plain): convergence rate $O(d/T + 1/b)$ to a stationary point over $T$ total iterations ($S$ epochs of $m$ inner steps, $T = Sm$) (Liu et al., 2018).
- ZO-SVRG-Coord: rate $O(d/T)$ with no $O(1/b)$ term, at $O(d)$ function queries per estimate (Liu et al., 2018, Ji et al., 2019).
- ZO-SVRG-Ave: rate $O(d/T + 1/(bq))$, at a $q$-times higher query cost per estimator (Liu et al., 2018).
- MeZO-SVRG: attains $\epsilon$-stationarity in strictly fewer iterations than ZO-SGD, whose bound retains a non-vanishing variance-induced error term (Gautam et al., 2024).
Further, Ji et al. (2019) establish for ZO-SVRG-Coord(-Rand) an improved function query complexity that dominates ZO-GD and ZO-SGD across all regimes of sample size $n$ and dimension $d$.
5. Practical Implementation: Memory, Hyperparameters, and Use Cases
Accelerated ZO-SVRG methods are particularly suited to applications with tight memory budgets and expensive backpropagation. Key implementation details from (Gautam et al., 2024):
- Memory Footprint: MeZO-SVRG requires only one extra copy of the parameters and the reference gradient; the overhead is $O(d)$ and remains constant in the batch size $b$.
- For large autoregressive models, MeZO-SVRG roughly halves GPU memory relative to FO-SGD (e.g., 19 GB vs. 38 GB for GPT2-XL).
- For large batch sizes, MeZO-SVRG reduces memory $3$–$4\times$ versus FO-SGD (e.g., 4.7 GB vs. 18.6 GB for RoBERTa-large).
- Computation Cost: MeZO-SVRG reaches comparable or higher test accuracy than ZO-SGD (MeZO) in half the GPU-hours or less.
- Hyperparameters: best empirical performance is reported with a small perturbation scale $\mu$; two learning rates, one for full-batch and one for minibatch steps; moderate batch sizes (e.g., $64$); and a snapshot every few inner steps (e.g., $5$).
- Applications: Large-scale LLM fine-tuning, black-box adversarial training, hyperparameter tuning, settings where only function value access exists, and memory-limited optimization.
6. Comparative Summary and Trade-offs
Accelerated ZO-SVRG variants offer a flexible spectrum of variance reduction and function query cost trade-offs, as summarized below:
| Algorithm | Convergence Behavior | Pros | Cons |
|---|---|---|---|
| ZO-SVRG (plain) | $O(d/T + 1/b)$ | Minimal queries per step | Slower convergence, $O(1/b)$ error |
| ZO-SVRG-Ave | $O(d/T + 1/(bq))$ | Lower variance, smaller error | Higher function query cost |
| ZO-SVRG-Coord | $O(d/T)$ | No $O(1/b)$ error, matches best-known rate | $d$-fold increase in function queries |
| MeZO-SVRG | Improved over ZO-SGD (Gautam et al., 2024) | Low memory overhead, fast convergence | Not fully coordinate-wise |
Selection among these variants depends on the allowable computational resources (specifically, number of function evaluations per iteration) and the degree of variance reduction required for the application at hand.
7. Empirical Performance and Future Directions
Empirical results from Gautam et al. (2024) demonstrate that MeZO-SVRG outperforms basic MeZO (plain ZO-SGD) with test-accuracy gains of up to 20 points across multiple LLMs and standard GLUE tasks. GPU-hours to target accuracy are halved on large LMs, and memory savings reach roughly $2\times$ compared to FO-SGD. The methods close the convergence gap with first-order SGD/Adam in the large-batch, large-model regime and are especially robust in non-prompted fine-tuning scenarios.
Recent theoretical advances (Ji et al., 2019) further suggest continued improvements via tighter analysis and hybrid estimators, enabling constant stepsizes and improving query complexity beyond all earlier ZO-GD/SGD variants. A plausible implication is expanding deployment in large, black-box, or memory-challenged environments, especially where function-only or low-level API access is available.
References
- "Variance-reduced Zeroth-Order Methods for Fine-Tuning LLMs" (Gautam et al., 2024)
- "Zeroth-Order Stochastic Variance Reduction for Nonconvex Optimization" (Liu et al., 2018)
- "Improved Zeroth-Order Variance Reduced Algorithms and Analysis for Nonconvex Optimization" (Ji et al., 2019)