
Accelerated ZO-SVRG Methods

Updated 8 February 2026
  • The main contribution of this line of work is the development of accelerated ZO-SVRG variants that reduce query complexity while improving convergence rates.
  • These methods employ refined gradient estimators such as coordinate-wise and random-direction averaging to balance accuracy and computational cost.
  • Empirical results show significant efficiency gains in memory-limited, large-scale tasks like fine-tuning language models and black-box adversarial training.

Zeroth-order stochastic variance-reduced gradient methods (ZO-SVRG) constitute a prominent class of algorithms designed for stochastic optimization when gradient access is unavailable, but function evaluations are permitted. Accelerated variants of ZO-SVRG deploy refined gradient estimators and improved variance-control strategies, enabling faster convergence—both theoretically and empirically—while reducing function query complexity. Such algorithms are of increasing importance in large-scale learning, black-box model tuning, and regimes where backpropagation is prohibitively expensive or unsupported, particularly in fine-tuning LLMs, black-box adversarial applications, and chemistry/material tasks.

1. Background: Zeroth-Order SVRG and Acceleration Principles

The classic ZO-SVRG paradigm (Liu et al., 2018) replaces first-order gradients in SVRG (Stochastic Variance-Reduced Gradient) with finite-difference or random-direction gradient approximations, leveraging only function values. “Plain” ZO-SVRG employs a two-point random gradient estimator per sample: $\widehat\nabla f_i(x)=\frac{d}{\mu}\left[f_i(x+\mu u_i)-f_i(x)\right]u_i,\quad u_i\sim\mathrm{Uniform}(\mathbb{S}^{d-1})$, where $\mu$ is a smoothing parameter and $u_i$ is a random direction. The SVRG update structure is preserved: periodic full-gradient (“snapshot”) estimation alternates with inner-loop variance-reduced updates using mini-batches.
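As a concrete illustration, the two-point estimator above can be sketched in a few lines of NumPy (the function name and the quadratic test objective are ours, not from the paper):

```python
import numpy as np

def two_point_estimator(f_i, x, mu=1e-3, rng=None):
    """Two-point random-direction gradient estimate of f_i at x.

    Returns (d / mu) * [f_i(x + mu * u) - f_i(x)] * u, where u is drawn
    uniformly from the unit sphere S^{d-1}.
    """
    rng = rng or np.random.default_rng()
    d = x.size
    u = rng.standard_normal(d)
    u /= np.linalg.norm(u)  # normalized Gaussian = uniform direction on the sphere
    return (d / mu) * (f_i(x + mu * u) - f_i(x)) * u
```

A single estimate is very noisy (its variance scales with $d$), but averaging many estimates recovers the true gradient up to the $O(\mu)$ smoothing bias.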

All accelerated variants target the statistical and computational bottleneck in ZO-SVRG’s error bound: an $O(1/b)$ bias term (with $b$ the mini-batch size), introduced by the high variance of the naive two-point estimator. Averaging over directions or switching to coordinate-based estimators mitigates this issue. The most prominent acceleration approaches are:

  • Random-direction averaging (ZO-SVRG-Ave)
  • Coordinate-wise finite-difference estimators (ZO-SVRG-Coord)
  • Seeded, data-parallel SPSA with snapshot-driven variance reduction (e.g., MeZO-SVRG (Gautam et al., 2024))
  • Mixed coordinate/random estimators (ZO-SVRG-Coord-Rand (Ji et al., 2019))

2. Core Algorithms and Their Update Rules

Reference ZO-SVRG Skeleton

All ZO-SVRG-type algorithms use a common SVRG structure:

  • Epochs: take a periodic “snapshot” at a reference point $\tilde x$ and estimate an anchored full (zeroth-order) gradient there.
  • Inner loop: update $x_{k+1} = x_k - \eta v_k$, where $v_k$ is a variance-reduced ZO estimator.
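A minimal runnable sketch of this skeleton, written here in the ZO-SVRG-Coord-Rand style described below (coordinate-wise snapshot gradient, random two-point inner-loop estimates; all function names and hyperparameter values are illustrative):

```python
import numpy as np

def coord_grad(f, x, mu=1e-5):
    """Deterministic coordinate-wise central-difference gradient (2d queries)."""
    g = np.zeros_like(x, dtype=float)
    for l in range(x.size):
        e = np.zeros_like(x, dtype=float)
        e[l] = mu
        g[l] = (f(x + e) - f(x - e)) / (2.0 * mu)
    return g

def zo_svrg_coord_rand(fs, x0, eta=0.05, epochs=15, inner_steps=50, mu=1e-3, seed=0):
    """Minimal ZO-SVRG sketch: fs is a list of per-sample losses f_i."""
    rng = np.random.default_rng(seed)
    x = x0.astype(float).copy()
    n, d = len(fs), x0.size
    for _ in range(epochs):
        x_snap = x.copy()                                             # snapshot point
        g_snap = np.mean([coord_grad(f, x_snap) for f in fs], axis=0) # anchored full gradient
        for _ in range(inner_steps):
            i = rng.integers(n)                 # sample one component loss
            u = rng.standard_normal(d)
            u /= np.linalg.norm(u)
            # shared direction u at the current iterate and the snapshot,
            # so the two estimates are correlated and their noise cancels
            g_cur = (d / mu) * (fs[i](x + mu * u) - fs[i](x)) * u
            g_ref = (d / mu) * (fs[i](x_snap + mu * u) - fs[i](x_snap)) * u
            x -= eta * (g_cur - g_ref + g_snap)  # variance-reduced update
    return x
```

On a toy finite-sum quadratic, this loop converges to the minimizer using only function evaluations.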

Accelerated Variants

| Variant | Estimator Type | Update Equation (per sample $f_i$) | Query Complexity per Estimate |
|---|---|---|---|
| ZO-SVRG (plain) | two-point random | $\frac{d}{\mu}[f_i(x+\mu u_i)-f_i(x)]u_i$ | $1$ |
| ZO-SVRG-Ave | average over $q$ random directions | $\frac{d}{\mu q}\sum_{j=1}^q [f_i(x+\mu u_j)-f_i(x)]u_j$ | $q$ |
| ZO-SVRG-Coord | coordinate-wise | $\sum_{\ell=1}^d \frac{f_i(x+\mu e_\ell)-f_i(x-\mu e_\ell)}{2\mu}\, e_\ell$ | $d$ |
| ZO-SVRG-Coord-Rand | coordinate reference / random inner | reference: coordinate-wise; inner loop: random two-point (Ji et al., 2019) | $d$ (ref.) or $q$ (inner) |
| MeZO-SVRG | batchwise SPSA with shared seed | shared $z_t\sim N(0,I_d)$ per batch; two-point SPSA estimate per batch (Gautam et al., 2024) | $2b$ (minibatch size $b$) |

ZO-SVRG-Ave and ZO-SVRG-Coord reduce variance at increased per-gradient query cost. MeZO-SVRG achieves acceleration through variance-reduction, shared noise direction, and minimal memory overhead.
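The coordinate-wise estimator used by ZO-SVRG-Coord is deterministic given $x$: its only error is the $O(\mu^2)$ finite-difference bias, which a short check makes visible (function names are illustrative):

```python
import numpy as np

def coord_grad(f, x, mu=1e-5):
    """Coordinate-wise central-difference gradient estimate (2d queries).

    Deterministic given x: there is no random-direction variance, only the
    O(mu^2) truncation bias of the central difference.
    """
    g = np.zeros_like(x, dtype=float)
    for l in range(x.size):
        e = np.zeros_like(x, dtype=float)
        e[l] = mu
        g[l] = (f(x + e) - f(x - e)) / (2.0 * mu)
    return g
```

For a smooth test function the estimate matches the analytic gradient to many digits, in contrast to the $O(d)$ variance of a single two-point random estimate.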

3. Variance Reduction Mechanisms and Estimator Analysis

Variance reduction in these methods hinges on a “control variate”: $\Delta = v_{\text{current}} - v_{\text{ref}} + g_{\text{ref}}$, where $g_{\text{ref}}$ is the full-batch ZO estimator at the snapshot point, $v_{\text{current}}$ the minibatch ZO estimator at the current iterate, and $v_{\text{ref}}$ the minibatch estimator at the snapshot. This structure ensures $\mathbb{E}[\Delta \mid x, x_{\text{ref}}] = \nabla f(x)$ (up to the smoothing bias), with variance substantially smaller than that of an independent minibatch estimator.
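A toy numerical check of this identity (the quadratic objective and all names are ours): with a shared direction $u$ at the current iterate and the snapshot, the control variate stays unbiased while its variance collapses once $x$ is close to $x_{\text{ref}}$:

```python
import numpy as np

rng = np.random.default_rng(0)
d, mu = 20, 1e-4
A = rng.standard_normal((d, d))
Q = A.T @ A / d                                  # toy quadratic f(x) = 0.5 x^T Q x
f = lambda z: 0.5 * z @ Q @ z
x_ref = rng.standard_normal(d)                   # snapshot point
x = x_ref + 0.01 * rng.standard_normal(d)        # current iterate, near the snapshot
g_ref = Q @ x_ref                                # assume an exact snapshot gradient

def two_point(z, u):
    """Two-point random-direction estimate of grad f at z along direction u."""
    return (d / mu) * (f(z + mu * u) - f(z)) * u

raw, cv = [], []
for _ in range(2000):
    u = rng.standard_normal(d)
    u /= np.linalg.norm(u)
    raw.append(two_point(x, u))                               # plain estimator
    cv.append(two_point(x, u) - two_point(x_ref, u) + g_ref)  # control variate, shared u
raw, cv = np.array(raw), np.array(cv)
# Both are (nearly) unbiased for Q @ x, but cv has far smaller variance.
```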

In ZO-SVRG-Ave, averaging $q$ direction vectors per estimate reduces the offending $O(d)$ variance blowup to $O\bigl((d+q)/q\bigr)$, replacing the $O(1/b)$ error with $O(1/(b\min\{d,q\}))$. In ZO-SVRG-Coord, deterministic coordinate estimators eliminate this variance source and the “extra error” entirely.
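The variance decay from direction averaging can be observed directly on a toy linear function (an illustrative sketch; averaging $q$ directions cuts the variance roughly by $q$):

```python
import numpy as np

rng = np.random.default_rng(0)
d, mu = 50, 1e-4
a = rng.standard_normal(d)          # true gradient of the linear test function
f = lambda z: a @ z

def avg_estimate(x, q):
    """Average of q two-point random-direction estimates at x."""
    g = np.zeros(d)
    for _ in range(q):
        u = rng.standard_normal(d)
        u /= np.linalg.norm(u)
        g += (d / mu) * (f(x + mu * u) - f(x)) * u
    return g / q

x = np.zeros(d)
var_q1 = np.array([avg_estimate(x, 1) for _ in range(500)]).var(axis=0).sum()
var_q10 = np.array([avg_estimate(x, 10) for _ in range(500)]).var(axis=0).sum()
# var_q10 is roughly var_q1 / 10, matching the averaged-estimator analysis.
```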

MeZO-SVRG applies variance reduction using a shared perturbation vector $z_t$ for the entire batch, attaining the benefits of data-parallel SPSA and the SVRG control variate (Gautam et al., 2024).
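The memory trick behind MeZO-style methods is to store only the integer seed of the perturbation and regenerate $z_t$ on demand; a hedged sketch (function and variable names are ours):

```python
import numpy as np

def spsa_grad_from_seed(f_batch, x, seed, mu=1e-3):
    """Batchwise two-point SPSA gradient estimate, MeZO-style (a sketch).

    The perturbation z is regenerated from an integer seed rather than stored,
    so the optimizer state beyond the parameters stays O(1) in memory.
    """
    z = np.random.default_rng(seed).standard_normal(x.size)  # regenerated, not stored
    return (f_batch(x + mu * z) - f_batch(x - mu * z)) / (2.0 * mu) * z
```

In a variance-reduced variant, this estimate would then be combined with the snapshot control variate of Section 3, using one shared $z_t$ per batch.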

4. Theoretical Convergence Guarantees

Accelerated ZO-SVRG variants achieve significantly improved convergence rates compared to ZO-SGD and classic zeroth-order methods. Key theoretical results include:

  • ZO-SVRG (plain):

$$\mathbb{E}\|\nabla f(\bar x)\|^2 \le O\left(\frac{d}{T} + \frac{1}{b}\right)$$

Query complexity: $O(nS + bT)$ ($S$ epochs, $T$ steps) (Liu et al., 2018).

  • ZO-SVRG-Coord:

$$\mathbb{E}\|\nabla f(\bar x)\|^2 \le O\left(\frac{d}{T}\right)$$

with no $O(1/b)$ term; query complexity $O\bigl(d(nS+bT)\bigr)$ (Liu et al., 2018, Ji et al., 2019).

  • ZO-SVRG-Ave:

$$\mathbb{E}\|\nabla f(\bar x)\|^2 \le O\left(\frac{d}{T} + \frac{1}{b\min\{d,q\}}\right)$$

at $q$-times higher query cost per estimator (Liu et al., 2018).

  • MeZO-SVRG:

$$\mathbb{E}\|\nabla f(\theta_T)\|^2 \le \frac{f(\theta_0)-f^*}{T\eta}+C_1 L\mu^2 d + C_2\,\frac{\eta^2 d\sigma^2}{bq}$$

This yields $\epsilon$-stationarity in $O\left(\frac{d}{b\epsilon^2}\right)$ iterations, strictly outperforming ZO-SGD’s $O\left(\frac{d}{\epsilon^4}\right)$ (Gautam et al., 2024).

Further, (Ji et al., 2019) establishes for ZO-SVRG-Coord(-Rand) a query complexity of $O(\min\{n^{2/3}d\epsilon^{-1},\,d\epsilon^{-5/3}\})$, improving on ZO-GD and ZO-SGD in all settings with $n>1$.

5. Practical Implementation: Memory, Hyperparameters, and Use Cases

Accelerated ZO-SVRG methods are particularly suited to applications with tight memory budgets and expensive backpropagation. Key implementation details from (Gautam et al., 2024):

  • Memory footprint: MeZO-SVRG requires only one extra copy of the parameters and the reference gradient; the overhead is $O(d)$ and remains constant in batch size $b$.
    • For large autoregressive models, MeZO-SVRG achieves $\sim 2\times$ GPU memory savings relative to FO-SGD (e.g., 19 GB vs. 38 GB for GPT2-XL).
    • For large batch sizes, MeZO-SVRG reduces memory $3$–$4\times$ versus FO-SGD (e.g., 4.7 GB vs. 18.6 GB for RoBERTa-large at $b=64$).
  • Computation cost: MeZO-SVRG reaches comparable or higher test accuracy than ZO-SGD (MeZO) in half the GPU-hours or less.
  • Hyperparameters: best empirical performance with perturbation $\mu\approx 10^{-3}$; two learning rates $\eta_1 > \eta_2$, typically $\eta_1 = 10\,\eta_2$; batch sizes $b=32$ or $64$; snapshot frequency $q=2$ or $5$.
  • Applications: large-scale LLM fine-tuning, black-box adversarial training, hyperparameter tuning, settings where only function-value access exists, and memory-limited optimization.
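These settings can be collected into a small configuration sketch (the dictionary keys and the absolute learning-rate values are illustrative; the source fixes only the ratio $\eta_1 = 10\,\eta_2$):

```python
# Illustrative MeZO-SVRG hyperparameters mirroring the ranges reported above.
mezo_svrg_config = {
    "mu": 1e-3,           # SPSA perturbation scale
    "eta1": 1e-3,         # larger learning rate (absolute value is an assumption)
    "eta2": 1e-4,         # smaller learning rate; eta1 = 10 * eta2
    "batch_size": 64,     # b = 32 or 64
    "snapshot_freq": 2,   # q = 2 or 5
}
```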

6. Comparative Summary and Trade-offs

Accelerated ZO-SVRG variants offer a flexible spectrum of variance reduction and function query cost trade-offs, as summarized below:

| Algorithm | Query Complexity (for $\epsilon$-stationarity) | Pros | Cons |
|---|---|---|---|
| ZO-SVRG (plain) | $O(n/\epsilon + d/\epsilon^2)$ | minimal queries per step | slower convergence, $O(1/b)$ error |
| ZO-SVRG-Ave | $O(q(n/\epsilon + d/\epsilon^2))$ | lower variance, smaller $O(1/b)$ error | higher function-query cost |
| ZO-SVRG-Coord | $O(\min\{n^{2/3}d\,\epsilon^{-1},\,d\epsilon^{-5/3}\})$ | no $O(1/b)$ error; matches best-known rate | $d$-fold increase in function queries |
| MeZO-SVRG | $O(d/(b\epsilon^2))$ | low memory overhead, fast convergence | not fully coordinate-wise |

Selection among these variants depends on the allowable computational resources (specifically, number of function evaluations per iteration) and the degree of variance reduction required for the application at hand.

7. Empirical Performance and Future Directions

Empirical results from (Gautam et al., 2024) demonstrate that MeZO-SVRG outperforms basic MeZO (plain ZO-SGD) with up to 20 test-accuracy point gains across multiple LLMs and standard GLUE tasks. GPU-hours to target accuracy are halved on large LMs, and memory savings are up to $4\times$ compared to FO-SGD. The methods close the convergence gap with first-order SGD/Adam in the large-batch, large-parameter regimes, and are especially robust for non-prompted fine-tuning scenarios.

Recent theoretical advances (Ji et al., 2019) further suggest continued improvements via tighter analysis and hybrid estimators, enabling constant stepsizes and improving query complexity beyond all earlier ZO-GD/SGD variants. A plausible implication is expanding deployment in large, black-box, or memory-challenged environments, especially where function-only or low-level API access is available.

References

  • "Variance-reduced Zeroth-Order Methods for Fine-Tuning LLMs" (Gautam et al., 2024)
  • "Zeroth-Order Stochastic Variance Reduction for Nonconvex Optimization" (Liu et al., 2018)
  • "Improved Zeroth-Order Variance Reduced Algorithms and Analysis for Nonconvex Optimization" (Ji et al., 2019)
