Zeroth-Order Optimization Methods

Updated 12 November 2025
  • Zeroth-order methods are gradient-free optimization techniques that use only function evaluations without gradients to approximate search directions.
  • They employ finite-difference schemes using coordinate, sphere, or Gaussian directions to estimate smoothed gradients in various problem settings.
  • Empirical implementations demonstrate robust performance in high-dimensional, non-differentiable tasks, aiding simulation-based tuning and adversarial optimization.

Zeroth-order methods, also termed gradient-free or derivative-free optimization, are a broad class of algorithms that optimize an objective function using only function value queries, without relying on explicit gradient or Hessian information. These methods are essential when gradients are unavailable or unreliable, as is common in simulation-based optimization, black-box tuning, policy optimization in reinforcement learning, or adversarial attacks where only output evaluations are available. Modern research rigorously characterizes the theoretical foundations, estimator constructions, convergence complexities, saddle-point escape properties, and practical performance of zeroth-order methods across convex, nonconvex, smooth, nonsmooth, constrained, and distributed settings.

1. Core Problem Formulation and Oracle Model

Zeroth-order optimization focuses on finding approximate stationary (or optimal) points of an objective function $F(x)$ given only access to a function-value oracle. A canonical nonconvex stochastic formulation is

$$\min_{x \in \mathbb{R}^d} F(x) := \mathbb{E}_{\xi \sim D(x)}[f(x, \xi)]$$

where $x \in \mathbb{R}^d$ is the decision variable, $\xi$ is a random variable whose law $D(x)$ may be unknown and may depend on $x$, and $f(x, \xi)$ is a smooth per-sample loss. The solver can query $f(x, \xi)$ at any $x$, drawing fresh samples $\xi \sim D(x)$, but cannot directly evaluate or compute gradients with respect to $x$.

The goal is typically to find an $\epsilon$-stationary point $\bar{x}$, i.e., $\mathbb{E}\Vert \nabla F(\bar{x})\Vert^2 \leq \epsilon^2$, using the fewest possible function evaluations.
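As a concrete illustration of this oracle model, the following minimal Python sketch wraps a toy stochastic loss whose sample distribution depends on the query point; the quadratic loss, the noise model, and the class name `StochasticOracle` are illustrative assumptions, not from the source. An algorithm interacts with the problem only through `query`, never through gradients.

```python
import numpy as np

class StochasticOracle:
    """Function-value oracle for f(x, xi) with a decision-dependent law D(x).

    Toy example (assumed for illustration): f(x, xi) = ||x - xi||^2 / 2 with
    xi ~ N(0.1 * x, I), so the sampling distribution shifts with the query x.
    """

    def __init__(self, dim, rng=None):
        self.dim = dim
        self.rng = rng or np.random.default_rng(0)

    def query(self, x, batch_size=1):
        """Average f(x, xi) over `batch_size` fresh samples xi ~ D(x); no gradients exposed."""
        xi = self.rng.normal(loc=0.1 * x, scale=1.0, size=(batch_size, self.dim))
        values = 0.5 * np.sum((x[None, :] - xi) ** 2, axis=1)
        return values.mean()
```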

2. Gradient Estimator Constructions and Variants

All gradient-free methods rely on finite-difference schemes that probe the objective at carefully chosen points to estimate search directions. A general “two-point” estimator has the form

$$g(x) = \frac{1}{2\mu} \sum_{i=1}^N \left[\widetilde{F}(x+\mu v_i) - \widetilde{F}(x-\mu v_i)\right] v_i$$

where $\{v_i\}$ is a set of search directions, $\mu > 0$ is a smoothing parameter, and each $\widetilde{F}$ is a mini-batch average over i.i.d. samples from $D(x \pm \mu v_i)$.

The established choices for $v_i$ include:

  • Coordinate directions: $v_i = e_i$ for $i = 1, \ldots, d$, yielding $N = d$; the classical coordinate finite-difference estimator.
  • Uniform sphere: $v_i = \frac{d}{N} s_i$ with $s_i \sim \mathrm{Unif}(\Vert s \Vert = 1)$.
  • Gaussian: $v_i = \frac{1}{N} u_i$ with $u_i \sim \mathcal{N}(0, I)$.

Special cases include:

  • One-point estimator (bandit setting): $g_1(x; v) = \frac{f(x+\mu v, \xi) - f(x, \xi)}{\mu}\, v$.
  • Two-point, single-direction estimator: $g_2(x; v) = \frac{f(x+\mu v, \xi^1) - f(x-\mu v, \xi^2)}{2\mu}\, v$.
  • Multi-direction (averaged) estimator, as in the general form above.

When directions are sampled from the sphere or Gaussian, these estimators are unbiased for $\nabla F_\mu(x)$, the gradient of a smoothed version of $F$. Variance can be controlled in terms of the number of directions $N$, the mini-batch size $m$, the function's smoothness, and the oracle noise.
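As a sketch of these constructions, the Python snippet below implements the direction choices and the multi-direction two-point estimator exactly as written above; `F_tilde` stands for any mini-batch-averaged function-value oracle, and all names are illustrative rather than a reference implementation.

```python
import numpy as np

def sample_directions(kind, d, N, rng):
    """Return an array of search directions v_i with the scalings given in the text."""
    if kind == "coordinate":
        return np.eye(d)                                   # v_i = e_i, so N = d
    if kind == "sphere":
        s = rng.normal(size=(N, d))
        s /= np.linalg.norm(s, axis=1, keepdims=True)      # s_i uniform on the unit sphere
        return (d / N) * s                                 # v_i = (d/N) s_i
    if kind == "gaussian":
        return rng.normal(size=(N, d)) / N                 # v_i = (1/N) u_i, u_i ~ N(0, I)
    raise ValueError(f"unknown direction type: {kind}")

def zo_gradient(F_tilde, x, mu, directions):
    """Multi-direction two-point estimate g(x) = (1/(2 mu)) * sum_i [F(x+mu v_i) - F(x-mu v_i)] v_i."""
    g = np.zeros_like(x, dtype=float)
    for v in directions:
        g += (F_tilde(x + mu * v) - F_tilde(x - mu * v)) * v
    return g / (2.0 * mu)
```

The single-direction two-point estimator is the special case $N = 1$; the one-point (bandit) variant differs in using $f(x, \xi)$ as the reference point and a $1/\mu$ scaling.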

3. Sample Complexity, Theoretical Properties, and Parameter Tuning

Given assumptions of finite variance (A1), gradient smoothness (A2: $\Vert\nabla F(x) - \nabla F(y)\Vert \leq M \Vert x - y\Vert$), and optionally a Hessian-Lipschitz condition (A3), the query complexity for reaching $\mathbb{E}\Vert\nabla F(\bar{x})\Vert^2 \leq \epsilon^2$ is summarized as follows:

| Gradient estimator | Lipschitz gradient | Gradient + Hessian Lipschitz |
| --- | --- | --- |
| Coordinate-wise FD | $O(d^3 \epsilon^{-6})$ | $O(d^{5/2} \epsilon^{-5})$ |
| Sphere / Gaussian smoothing | $O(d^2 \epsilon^{-6})$ | $O(d^2 \epsilon^{-5})$ |

Multi-direction estimators (sphere or Gaussian) with $N \approx d^2/\epsilon^4$ directions achieve the best rates. Typical parameter choices are:

  • Number of directions: $N \approx d^2/\epsilon^4$ (or $d^2/\epsilon^3$ under the Hessian-Lipschitz assumption).
  • Mini-batch size: $m = 1$ when $N \gg 1$; $m \approx d^2/\epsilon^4$ when $N = 1$.
  • Smoothing radius $\mu$: $\mu \approx \epsilon/(dM)$ (sphere), $\mu \approx \epsilon/(\sqrt{d}\, M)$ (Gaussian).
  • Step size $\eta$: any constant $\leq 1/(4M)$.

These settings optimally balance sample/exploration variance, smoothing bias, and estimation error.
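For reference, the small helper below encodes these recommended choices; the function name and the decision to return a dictionary are assumptions, and the outputs should be read as order-of-magnitude guidance rather than exact settings.

```python
import math

def zo_parameters(d, eps, M, hessian_lipschitz=False):
    """Order-of-magnitude parameter choices for multi-direction sphere/Gaussian smoothing."""
    N = math.ceil(d**2 / eps**3) if hessian_lipschitz else math.ceil(d**2 / eps**4)
    m = 1 if N > 1 else math.ceil(d**2 / eps**4)   # large N allows a mini-batch of one
    return {
        "N": N,                                    # number of search directions
        "m": m,                                    # mini-batch size per query point
        "mu_sphere": eps / (d * M),                # smoothing radius, sphere directions
        "mu_gaussian": eps / (math.sqrt(d) * M),   # smoothing radius, Gaussian directions
        "eta": 1.0 / (4.0 * M),                    # constant step size <= 1/(4M)
    }
```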

4. Algorithmic Frameworks and Practical Implementations

A generic zeroth-order stochastic descent algorithm is as follows:

```
Input: x₀ ∈ ℝᵈ, step size η, smoothing radius μ, #directions N, batch size m, #iterations T
For t = 0, ..., T−1:
    Sample directions vⁱ, i = 1…N (coordinate, sphere, or Gaussian)
    For i = 1…N:
        Draw ξ^{1,i,·} ∼ D(x_t + μvⁱ); average f(x_t + μvⁱ, ·) over m samples → Ŝ⁺ᶦ
        Draw ξ^{2,i,·} ∼ D(x_t − μvⁱ); average f(x_t − μvⁱ, ·) over m samples → Ŝ⁻ᶦ
    Form g_t = (1/(2μ)) · Σᵢ (Ŝ⁺ᶦ − Ŝ⁻ᶦ) · vⁱ
    Update x_{t+1} = x_t − η g_t
Output: a random iterate from {x₀, …, x_T}
```

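A minimal Python rendering of this loop is sketched below; it reuses the hypothetical `sample_directions` helper and oracle interface from the earlier snippets, and is an illustration of the scheme above rather than a tuned reference implementation.

```python
import numpy as np

def zo_sgd(oracle, x0, eta, mu, N, m, T, kind="sphere", rng=None):
    """Zeroth-order stochastic descent: two-point estimates along N directions per iteration."""
    rng = rng or np.random.default_rng(0)
    x = x0.astype(float).copy()
    iterates = [x.copy()]
    for _ in range(T):
        V = sample_directions(kind, x.size, N, rng)            # (N, d) search directions
        g = np.zeros_like(x)
        for v in V:
            s_plus = oracle.query(x + mu * v, batch_size=m)    # average over m fresh samples
            s_minus = oracle.query(x - mu * v, batch_size=m)
            g += (s_plus - s_minus) * v
        g /= 2.0 * mu
        x = x - eta * g                                        # descent step
        iterates.append(x.copy())
    return iterates[rng.integers(len(iterates))]               # random iterate, as in the pseudocode

# Example call on the toy oracle sketched earlier (all values illustrative):
# x_bar = zo_sgd(StochasticOracle(30), np.zeros(30), eta=0.05, mu=1e-3, N=64, m=1, T=500)
```
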
High-level variants include:

  • Descent with coordinate-wise vs random-direction differences.
  • Smoothing-based iterative schemes that adjust $\mu$ or $N$, or that perform variance reduction (see Chen et al., 6 Oct 2025).
  • Algorithms exploiting multi-direction averaging for variance reduction and more reliable performance.

Empirical guidance recommends, for moderate $d$ and stationarity tolerance $\epsilon \sim 10^{-2}$ to $10^{-3}$, using sphere or Gaussian smoothing with large $N$, a small mini-batch ($m = 1$), and a step size tuned based on $M$.

5. Empirical Validation and Robustness

Representative experiments encompass:

  • Multi-product pricing ($d = 30$): sphere and Gaussian smoothing methods consistently yielded lower final objectives (up to $5\%$ better) and more rapid loss reduction compared to coordinate and single-point estimators, in alignment with the $d^2/\epsilon^6$ theoretical scaling.
  • Strategic classification ($d = 12$): the sphere estimator exhibited greater robustness to mild violations of smoothness, outperforming other methods in both train and test AUC, suggesting increased reliability in settings with some model misspecification.

This robust empirical advantage of random-direction (sphere, Gaussian) smoothing over coordinate-based methods is well supported for $d$ up to the hundreds.

6. Limitations, Open Questions, and Practical Recommendations

Although multi-direction sphere/Gaussian schemes are theoretically superior for large $d$ (scaling as $d^2$ versus $d^3$ for coordinate schemes), sample complexity remains polynomially high in both $d$ and $\epsilon^{-1}$. Improvements may be possible for problems with special structure, via variance reduction, or with combined first-order access.

Practical guidelines include:

  • Prefer sphere or Gaussian smoothing estimators unless $d$ is tiny.
  • Uniform-sphere smoothing may offer slightly better constants and simpler implementation for constrained or structured feasible sets.
  • Avoid coordinate differences except in low-dimensional regimes.
  • A step size of $\eta = 1/(4M)$ works well in practice and can be further tuned.
  • If the Hessian-Lipschitz constant $H$ is available, set $\mu \sim \epsilon^{2/3}/H^{1/3}$ for improved rates.

Open questions include further reductions in sample complexity via adaptive schemes, robustness to heavy-tailed or heteroscedastic noise, and extending current theory to broader settings (e.g., composite nonsmooth objectives, complex constraints, or highly nonstationary distributions).

7. Broader Context and Impact

Zeroth-order methods are foundational for optimization under limited information. Their estimator design and query-efficient implementation is now well understood for smooth (and, via extensions, nonsmooth and constrained) settings. Advanced variance-reduction, block-coordinate strategies, and adaptive parameter selection mark present research frontiers. The impact of these methods extends across nonconvex learning, black-box adversarial robustness, simulation-based model tuning, and distributed optimization—enabling effective search in high-dimensional, non-transparent, and non-differentiable environments.

Recent analyses establish that under nonconvexity and decision- or data-dependent distributions, state-of-the-art multi-direction random smoothing methods enjoy strictly superior sample complexities, practical performance, and robustness to model misspecification, supporting their application in large-scale, real-world machine learning and operations research tasks (Hikima et al., 28 Oct 2025).
