Zeroth Order Optimization: Techniques and Applications

Updated 15 April 2026

Zeroth Order Optimization (ZOO) is a class of algorithms that optimize functions using only function evaluations when gradients are unavailable.
It employs finite-difference estimators, such as one-point and two-point methods, to approximate gradients, and must contend with query costs that grow with the problem dimension.
Advanced ZOO techniques use adaptive step-sizes, block perturbations, and distributed consensus to boost performance in adversarial attacks, reinforcement learning, and large-scale training.

Zeroth Order Optimization (ZOO) refers to a class of algorithms for optimizing functions when analytic gradients are unavailable, unreliable, or expensive to compute, and only function evaluations (potentially noisy) are accessible. ZOO methods estimate gradient information via finite differences or surrogate constructions, enabling the application of optimization principles to black-box, non-differentiable, or simulation-based objectives. These methods underpin applications including black-box adversarial machine learning, simulation-based reinforcement learning, memory-efficient deep network training, distributed multi-agent optimization, and high-dimensional hyperparameter tuning.

1. Foundations and Core Methodology

The canonical ZOO paradigm addresses the unconstrained or constrained minimization of $f:\mathbb{R}^d\to\mathbb{R}$:

$\min_{x\in X} f(x)$

where $X\subseteq\mathbb{R}^d$ is typically a convex set ($X=\mathbb{R}^d$ in the unconstrained case), and only queries of $f(x)$ are permitted (Liu et al., 2020).

ZOO methods approximate gradient information by evaluating $f$ along various directions. The two most common finite-difference estimators are:

One-point random-direction estimator:

$\hat{g}_1(x) = \frac{\phi(d)}{\mu}f(x+\mu u)u$

where $u$ is sampled uniformly from the unit sphere or from a standard Gaussian, $\mu>0$ is a smoothing parameter, and $\phi(d)=d$ (sphere) or $\phi(d)=1$ (Gaussian).

Two-point random-direction estimator:

$\hat{g}_2(x) = \frac{\phi(d)}{\mu}\,\big[f(x+\mu u)-f(x)\big]\,u$

This estimator, which is unbiased for the (Gaussian-)smoothed gradient $\nabla f_\mu(x)$, exhibits lower variance and is standard in modern ZOO (Liu et al., 2020, Ye et al., 2 Feb 2026).
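A minimal NumPy sketch of a two-point random-direction estimator with Gaussian directions (so $\phi(d)=1$) is given below. The function name `two_point_estimate`, the averaging over `num_dirs` directions, and the toy quadratic are illustrative assumptions, not taken from the cited works.

```python
import numpy as np

def two_point_estimate(f, x, mu=1e-4, num_dirs=1, rng=None):
    """Average `num_dirs` two-point Gaussian-direction estimates of grad f(x)."""
    rng = np.random.default_rng() if rng is None else rng
    fx = f(x)  # baseline value, queried once and shared by all directions
    g = np.zeros_like(x, dtype=float)
    for _ in range(num_dirs):
        u = rng.standard_normal(x.shape)        # Gaussian direction
        g += (f(x + mu * u) - fx) / mu * u      # forward difference along u
    return g / num_dirs

def quadratic(x):
    # Toy objective (assumption): gradient is x itself.
    return 0.5 * float(x @ x)

x0 = np.array([1.0, -2.0, 0.5])
g_hat = two_point_estimate(quadratic, x0, num_dirs=20000,
                           rng=np.random.default_rng(0))
```

A single direction yields a noisy estimate; averaging many directions, as done here only for illustration, trades extra queries for lower variance.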

For high-dimensional problems, coordinate-wise finite-differences or block-wise sampling are also standard to reduce query cost (Chen et al., 2017, Jin et al., 22 Oct 2025).

Key theoretical properties:

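Bias: $\|\nabla f_\mu(x)-\nabla f(x)\| = O(\mu)$ for smooth $f$, with a constant that depends on $d$ and the smoothness of $f$, where $f_\mu(x)=\mathbb{E}_u[f(x+\mu u)]$.
Variance: For $L$-smooth $f$, $\mathbb{E}\|\hat{g}_2(x)\|^2 = O(d)\,\|\nabla f(x)\|^2 + O(\mu^2)$, where the $O(\mu^2)$ term hides factors polynomial in $d$ and $L$ (Liu et al., 2020, Zhang et al., 5 Jun 2025).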

Standard ZOO schemes lift any first-order method by substituting analytical gradients with ZO estimators, including SGD, Adam, or coordinate descent (Sharma et al., 2020, Shu et al., 3 Feb 2025).
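As a sketch of this substitution, plain SGD becomes ZO-SGD once the gradient is replaced by a two-point Gaussian-direction estimate. The function name `zo_sgd`, the step size, and the test objective are assumptions chosen for illustration.

```python
import numpy as np

def zo_sgd(f, x0, steps=2000, lr=0.02, mu=1e-4, seed=0):
    """SGD in which the gradient is replaced by a two-point ZO estimate."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float).copy()
    for _ in range(steps):
        u = rng.standard_normal(x.shape)
        g_hat = (f(x + mu * u) - f(x)) / mu * u   # two queries per step
        x -= lr * g_hat                           # same update rule as SGD
    return x

def shifted_quadratic(x):
    # Toy objective (assumption) with minimizer at the all-ones vector.
    return float(np.sum((x - 1.0) ** 2))

x_final = zo_sgd(shifted_quadratic, np.zeros(5))
```

The same pattern applies to other first-order updates: only the line computing `g_hat` changes, and the step size typically has to shrink as the dimension grows because the estimator variance grows with it.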

2. Query Complexity, Dimension Dependence, and Regularization

The convergence behavior and query complexity of ZOO methods are strongly affected by the estimator variance and dimension $d$.

Classical rates: ZO-SGD with the two-point estimator provides $O(\sqrt{d/T})$ convergence after $T$ iterations in nonconvex settings and $O(d/T)$ in the strongly convex case (Liu et al., 2020, Jin et al., 22 Oct 2025).
Dimension dependence: The convergence rates and query complexity scale linearly with $d$ unless further structure is exploited.

Recent theoretical advances replace the worst-case $d$ dependence by more refined spectral or effective-dimension metrics (Yue et al., 2023, Ye et al., 13 Oct 2025):

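Effective dimension ($\mathrm{ED}_\alpha$), defined through the singular values $\sigma_i(\cdot)$ of the Hessian:

$\mathrm{ED}_\alpha = \sup_{x\in\mathbb{R}^d}\sum_{i=1}^{d}\sigma_i^{\alpha}\big(\nabla^2 f(x)\big)$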

whereby fast Hessian decay (e.g., in shallow nets or certain ML objectives) enables logarithmic or even dimension-independent rates (Yue et al., 2023, Ye et al., 13 Oct 2025).

Implicit regularization and flat minima: ZOO with two-point estimators exhibits implicit bias towards flat minima, minimizing the trace of the Hessian among global optima (Zhang et al., 5 Jun 2025). This is formalized as:

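$f_\mu(x) = \mathbb{E}_u[f(x+\mu u)] = f(x) + \frac{\mu^2}{2}\,\mathrm{Tr}\big(\nabla^2 f(x)\big) + o(\mu^2)$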

so that stochastic optimization with ZOO provably favors solutions with reduced sharpness (lower Hessian trace), a critical property for generalization in deep learning.
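The link between smoothing and the Hessian trace can be checked numerically. The sketch below assumes Gaussian perturbations and a toy quadratic with Hessian `A` (both illustrative choices), for which the Gaussian-smoothed value equals $f(x)+\frac{\mu^2}{2}\mathrm{Tr}(A)$ exactly, and compares a Monte Carlo estimate against that prediction.

```python
import numpy as np

rng = np.random.default_rng(0)
A = np.diag([4.0, 1.0, 0.25])     # Hessian of the toy quadratic (assumption)
x = np.array([0.3, -0.7, 1.1])
mu = 0.1

def f(z):
    # 0.5 * z^T A z, evaluated row-wise when z is a batch of points
    return 0.5 * np.sum((z @ A) * z, axis=-1)

u = rng.standard_normal((200_000, 3))
smoothed = f(x + mu * u).mean()                  # Monte Carlo estimate of E[f(x + mu u)]
predicted = f(x) + 0.5 * mu**2 * np.trace(A)     # f(x) + (mu^2 / 2) Tr(Hessian)
```

Because the smoothed value exceeds $f(x)$ by an amount proportional to the Hessian trace, minimizing it penalizes sharp solutions; this is only a numerical illustration of the mechanism, not the analysis of the cited paper.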

3. High-Dimensional, Structured, and Adaptive ZOO

Subspace and Block Perturbations

High dimensionality presents a practical bottleneck in ZOO (variance $O(d)$). Structured perturbations reduce this scaling:

Subspace perturbations: Restricting perturbations to sparse, low-rank, or block-coordinated subspaces of stable rank $r$ makes the variance scale with $r$ rather than $d$ (Park et al., 31 Jan 2025). The overall convergence rate then depends on $r$ in place of $d$, and in the case of good subspace alignment with the objective's Hessian, dimension-independent rates are attainable.
Block coordinate and structured descent: In large-scale settings (e.g., LLM fine-tuning), block coordinate ZOO (e.g., MeZO-BCD) updates single architectural blocks, enabling both computational efficiency and wall-clock speedups over standard MeZO, as reported on OPT-13B (Park et al., 31 Jan 2025).
Compressed sensing for sparse gradients: For objectives with $s$-sparse gradients, new estimators such as GraCe reduce the required function queries to $O(s\log\log(d/s))$ per step, the first double-logarithmic bound, outperforming previous ZORO methods ($O(s\log(d/s))$) (Qiu et al., 2024).
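The block-perturbation idea can be sketched as follows: each step draws a random direction supported on one block of coordinates, so the update and its variance involve only that block. This is a generic round-robin illustration under assumed names (`block_zo_step`, `sphere`, fixed blocks of size 2); it is not the MeZO-BCD or GraCe procedure.

```python
import numpy as np

def block_zo_step(f, x, block, lr=0.03, mu=1e-4, rng=None):
    """One ZO-SGD step that perturbs only the coordinates listed in `block`."""
    rng = np.random.default_rng() if rng is None else rng
    u = np.zeros_like(x)
    u[block] = rng.standard_normal(len(block))   # direction supported on the block
    g_hat = (f(x + mu * u) - f(x)) / mu * u      # zero outside the block
    return x - lr * g_hat

def sphere(x):
    # Toy objective (assumption) with minimizer at the all-ones vector.
    return float(np.sum((x - 1.0) ** 2))

rng = np.random.default_rng(0)
x = np.zeros(8)
blocks = [np.arange(i, i + 2) for i in range(0, 8, 2)]   # four blocks of size 2
for t in range(2000):
    x = block_zo_step(sphere, x, blocks[t % len(blocks)], rng=rng)
```

Since the direction lives in a 2-dimensional block, a larger step size is tolerable than for a full 8-dimensional perturbation, which is the practical benefit the structured schemes exploit.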

Adaptive and Variance-Reduced ZOO

Adaptive ZOO methods, including R-AdaZO and adaptive step-size scaling based on the local empirical standard deviation of function values, achieve robust convergence across ill-conditioned or non-stationary regimes (Shu et al., 3 Feb 2025, Ye et al., 2 Feb 2026). The empirical spread of sampled function values is shown to track the gradient norm closely, enabling principled step normalization (Ye et al., 2 Feb 2026).
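The connection between function-value spread and gradient norm follows from a first-order expansion: for Gaussian $u$, $f(x+\mu u)\approx f(x)+\mu\,\nabla f(x)^\top u$, whose standard deviation is $\mu\|\nabla f(x)\|$. The sketch below uses that observation to estimate the gradient norm from function values alone; the helper name `grad_norm_from_values` and the sample count are assumptions, and the estimator in the cited work may differ in detail.

```python
import numpy as np

def grad_norm_from_values(f, x, mu=1e-3, num_samples=4000, rng=None):
    """Estimate ||grad f(x)|| as std of f(x + mu u) divided by mu, u Gaussian."""
    rng = np.random.default_rng() if rng is None else rng
    u = rng.standard_normal((num_samples,) + x.shape)
    values = np.array([f(x + mu * ui) for ui in u])   # function queries only
    return values.std() / mu

x = np.array([3.0, -4.0, 0.0, 0.0, 0.0])              # gradient of 0.5*||x||^2 has norm 5
est = grad_norm_from_values(lambda z: 0.5 * float(z @ z), x,
                            rng=np.random.default_rng(0))
```

A normalized update could then divide the step size by `est` (plus a small constant for stability), so that steps shrink automatically in steep regions.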

Refined moment estimation (first and second moments) in adaptive updates can provably reduce ZO variance and speed up convergence. For instance, R-AdaZO provably reduces the bias of the second-moment estimate relative to the earlier ZO-AdaMM (Shu et al., 3 Feb 2025).

4. Distributed and Decentralized ZOO Algorithms

Distributed ZOO methods enable optimization over networks or decentralized data splits. Techniques such as gradient tracking, block coordinate communication, and consensus protocols are used to ensure convergence to a global optimum with minimal communication and ZO query cost (Mhanna et al., 2024, Zhang et al., 2022).

In distributed, nonconvex settings, single-point gradient-tracking ZOO algorithms can achieve convergence rates that surpass those of their centralized counterparts, even with a single function evaluation per node per iteration (Mhanna et al., 2024). Decentralized coordinate ZOO with consensus averaging and powerball acceleration, such as ZOOM-PB, provides further variance control and convergence improvements for multi-agent and federated architectures (Zhang et al., 2022).
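A minimal sketch of the consensus idea, assuming four agents on a ring that each hold a private quadratic objective: every iteration averages neighbouring iterates through a doubly stochastic mixing matrix `W` and then takes a local step using coordinate-wise finite differences. The setup, names, and constants are hypothetical; gradient tracking and powerball acceleration are deliberately omitted, so this is not the single-point or ZOOM-PB algorithm from the cited papers.

```python
import numpy as np

def coordinate_fd_grad(f, x, mu=1e-5):
    """Coordinate-wise forward-difference gradient estimate (d + 1 queries)."""
    fx = f(x)
    g = np.zeros_like(x)
    for k in range(x.size):
        e = np.zeros_like(x)
        e[k] = 1.0
        g[k] = (f(x + mu * e) - fx) / mu
    return g

# Hypothetical setup: agent i privately holds f_i(x) = 0.5 * ||x - targets[i]||^2.
targets = np.array([[2.0, 0.0], [0.0, 2.0], [-2.0, 0.0], [0.0, -1.0]])
local_objectives = [lambda x, c=c: 0.5 * float(np.sum((x - c) ** 2)) for c in targets]

# Doubly stochastic mixing matrix for a 4-node ring (self weight 0.5).
W = np.array([[0.50, 0.25, 0.00, 0.25],
              [0.25, 0.50, 0.25, 0.00],
              [0.00, 0.25, 0.50, 0.25],
              [0.25, 0.00, 0.25, 0.50]])

X = np.zeros((4, 2))          # row i is agent i's current iterate
lr = 0.05
for _ in range(500):
    grads = np.stack([coordinate_fd_grad(f_i, X[i])
                      for i, f_i in enumerate(local_objectives)])
    X = W @ X - lr * grads    # consensus averaging, then local ZO step
```

With a constant step size the network average settles at the minimizer of the summed objective (the mean of `targets`), while individual agents retain a disagreement proportional to `lr`; gradient tracking is one standard way to remove that residual.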

5. Specializations: Evolution Strategies, RL, and Black-Box Attacks

Evolution Strategies and Ancestral RL

ZOO forms the basis of Evolution Strategies (ES), in which a distribution over parameters is perturbed, batch-evaluated, and updated by estimating gradients via aggregate returns. The standard ES gradient estimator is (Nakashima et al., 2024):

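$\nabla_\theta J(\theta) \approx \frac{1}{N\sigma}\sum_{i=1}^{N} F(\theta+\sigma\epsilon_i)\,\epsilon_i$

where $\epsilon_i\sim\mathcal{N}(0,I)$ are parameter perturbations, $\sigma>0$ is the perturbation scale, $N$ is the population size, and $F$ is the return being maximized.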
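A short sketch of a standard ES-style estimator, assuming Gaussian perturbations of scale `sigma` and a population of returns weighted by their perturbations. The names `es_gradient` and `linear_return` are illustrative, and the sanity check uses a linear return so that the true gradient is known.

```python
import numpy as np

def es_gradient(F, theta, sigma=0.1, population=64, rng=None):
    """ES-style estimate: (1 / (N * sigma)) * sum_i F(theta + sigma * eps_i) * eps_i."""
    rng = np.random.default_rng() if rng is None else rng
    eps = rng.standard_normal((population,) + theta.shape)
    returns = np.array([F(theta + sigma * e) for e in eps])   # batch evaluation
    return np.tensordot(returns, eps, axes=1) / (population * sigma)

a = np.array([1.0, -2.0, 0.5])

def linear_return(theta):
    # Toy return (assumption): gradient with respect to theta is the vector a.
    return float(a @ theta)

g_es = es_gradient(linear_return, np.zeros(3), population=20000,
                   rng=np.random.default_rng(0))
```

This form has no baseline, so its variance grows with the magnitude of the returns; practical implementations usually subtract a baseline or normalize returns before weighting the perturbations.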

Ancestral Reinforcement Learning (ARL) combines ZOO with genetic algorithms, maintaining populations with ancestry-based updates and KL-regularization induced by evolutionary selection. The population-fitness objective and its variational form fundamentally endow ZOO RL with exploration-enhancing entropy regularization (Nakashima et al., 2024).

Adversarial Black-Box Attacks

ZOO methods underpin modern attacks that generate adversarial examples for DNNs where no gradient access is available. Coordinate-wise ZOO, stochastic coordinate descent with importance sampling, dimension reduction (e.g., noise upscaling), and hierarchical attack schemes have been shown to match or surpass the efficacy of white-box attacks under practical query budgets (Chen et al., 2017).

Policy Optimization Equivalence

Recent theory shows that ZOO with Gaussian-smoothing is equivalent to single-step policy optimization—i.e., REINFORCE with a Gaussian policy and baseline subtraction (Qiu et al., 17 Jun 2025). The finite-difference estimator precisely matches the REINFORCE gradient, with baseline subtraction reducing variance via the policy gradient paradigm. This connection justifies query reuse and baseline-averaged estimators for further variance reduction.
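The correspondence can be verified sample by sample. Assuming a Gaussian policy $\mathcal{N}(x,\mu^2 I)$ over actions $a=x+\mu u$, reward $f(a)$, and baseline $f(x)$, the score function is $(a-x)/\mu^2$, and the REINFORCE sample coincides with the two-point Gaussian-smoothing estimate. The objective and constants below are illustrative assumptions.

```python
import numpy as np

def f(x):
    # Arbitrary smooth test objective (assumption).
    return float(np.sum(np.sin(x)) + 0.5 * x @ x)

rng = np.random.default_rng(0)
x = np.array([0.4, -1.2, 2.0])
mu = 0.05
u = rng.standard_normal(x.shape)

# Two-point Gaussian-smoothing estimate along direction u.
g_zo = (f(x + mu * u) - f(x)) / mu * u

# Single-step REINFORCE with action a ~ N(x, mu^2 I) and baseline f(x).
a = x + mu * u
score = (a - x) / mu**2            # gradient of log N(a; x, mu^2 I) w.r.t. x
g_pg = (f(a) - f(x)) * score
```

The two vectors agree up to floating-point rounding, which is why variance-reduction tools from policy gradients, such as averaged baselines, carry over to ZO estimators.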

6. Extensions, Best Practices, and Open Problems

Extensions and Advanced Estimators

Beyond classical finite differences, advanced estimators include:

Regression-based one-point estimators: RESZO constructs a local linear or quadratic surrogate from historical evaluations, achieving two-point-rate convergence with only one function query per iteration (Chen et al., 6 Jul 2025).
Unbiased estimators via telescoping series: Optimal, unbiased gradient estimators based on telescoping directional derivatives that sample over step-scales have been constructed, achieving variance bounds that match the optimal $O(d/\epsilon^4)$ query complexity for nonconvex smooth objectives (Ma et al., 22 Oct 2025).
Complex-step differentiation: In the presence of analytic objectives and complex-code support, imaginary-step differentiation avoids catastrophic cancellation and attains optimal rates in strongly convex/quadratic problems (Jongeneel, 2021).
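Of the estimators listed above, complex-step differentiation is the simplest to illustrate. The sketch below assumes an objective that is analytic and implemented with operations that accept complex inputs; the helper name `complex_step_grad` and the test function are illustrative. Because no subtraction of nearby values occurs, the step `h` can be made extremely small without cancellation error.

```python
import numpy as np

def complex_step_grad(f, x, h=1e-20):
    """Gradient of an analytic f at a 1-D real point x via imaginary steps."""
    x = np.asarray(x, dtype=float)
    g = np.zeros_like(x)
    for k in range(x.size):
        z = x.astype(complex)
        z[k] += 1j * h                 # perturb coordinate k along the imaginary axis
        g[k] = np.imag(f(z)) / h       # Im f(x + i h e_k) / h approximates df/dx_k
    return g

def analytic_f(z):
    # Analytic test objective (assumption); works for real or complex arrays.
    return np.sum(np.exp(z) * np.sin(z))

x = np.array([0.5, 1.0, -0.3])
g_cs = complex_step_grad(analytic_f, x)
g_true = np.exp(x) * (np.sin(x) + np.cos(x))   # closed-form gradient for comparison
```

The approach fails silently if the objective uses non-analytic operations such as `abs` or comparisons on complex values, so it applies only when the evaluation code is known to be complex-safe.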

Best Practices

Employ subspace/block perturbations or compressed sensing approaches in high dimensions to reduce variance.
Apply adaptive step-size and variance reduction (e.g., using first/second moment estimates, empirical function-value variance normalization) to mitigate anisotropy and heterogeneity in landscape geometry.
Use regression or history-based estimators for single-query scenarios when function queries are costly.
When available, leverage domain structure (e.g., sparsity, block architecture, Hessian spectrum) for sketching and perturbation alignment.
For distributed systems, apply coordinate descent, gradient tracking, and consensus schemes for efficient scaling.

Open Questions

Characterizing tight lower bounds for query complexity in terms of Hessian spectrum or local landscape statistics (Yue et al., 2023, Ye et al., 13 Oct 2025).
Extending ZOO theory and practice to nonsmooth, constrained, or composite-objective settings with complex structure.
Automating subspace adaptation and perturbation scheduling based on local sensitivity.
Exploiting the connections to policy optimization for sequential and reinforcement learning tasks (Qiu et al., 17 Jun 2025, Nakashima et al., 2024).

7. Impact and Representative Applications

ZOO algorithms are foundational for:

Black-box adversarial attack generation and robustness evaluation in neural networks (Chen et al., 2017, Liu et al., 2020).
Simulation-based policy search and reinforcement learning, especially in non-differentiable or model-free environments (Nakashima et al., 2024).
Memory-efficient and hardware-constrained training of deep networks, including LLMs, where backpropagation is impractical (Park et al., 31 Jan 2025, Cao et al., 11 Feb 2026).
Distributed/federated optimization over constrained or nonconvex objectives (Mhanna et al., 2024, Zhang et al., 2022).

The evolution of ZOO continues to address statistical, computational, and system-scaling challenges by leveraging modern advances in variance reduction, compressed sensing, distributed consensus, and cross-fertilization with reinforcement learning and policy-gradient frameworks.