Zeroth-Order Optimization (ZO) in High Dimensions

Last updated: June 12, 2025

This article surveys zeroth-order optimization in high dimensions, strictly grounded in the contents of (Wang et al., 2017).


Zeroth-Order Optimization in High Dimensions: Algorithms, Guarantees, and Practical Applications

Zeroth-order optimization (ZO) techniques address optimization problems where only function evaluations, not gradients, are accessible. This paradigm is vital in areas such as black-box machine learning hyperparameter tuning, experiment design, neural stimulation, and stochastic optimization over large search spaces where analytic gradients are unavailable or unreliable.

Optimizing in high dimensions poses fundamental challenges: the computational cost of naive ZO approaches scales linearly with the number of variables $d$, leading to the so-called "curse of dimensionality." However, many practical problems are inherently sparse: only a small number $s \ll d$ of variables are truly influential. By exploiting such sparsity, it is possible to design ZO algorithms whose convergence rates depend only logarithmically on $d$, making high-dimensional black-box optimization practically viable.

This article synthesizes the methodology, theory, and application of dimension-robust stochastic zeroth-order algorithms under sparsity assumptions, as proposed by Wang et al. in "Stochastic Zeroth-order Optimization in High Dimensions."


1. Algorithmic Frameworks

Two main algorithms are introduced to leverage sparsity in high-dimensional convex ZO problems:

A. Successive Component/Feature Selection

This approach identifies and focuses on a sparse set of active variables for optimization, using the following workflow:

  1. Sparse Gradient Estimation via Lasso: At each iteration, collect a batch of function values at random sign (Rademacher) perturbations of the current point. Use the Lasso (ℓ1-regularized regression) to fit a sparse gradient estimate:

$$(\widehat{g}_t, \widehat{\mu}_t) = \arg\min_{g, \mu} \frac{1}{n} \sum_{i=1}^n \left(\tilde{y}_i - g^\top z_i - \mu\right)^2 + \lambda \left(\|g\|_1 + |\mu|\right)$$

where $\tilde{y}_i = [f(x_t + \delta z_i) + \xi_i]/\delta$.

  2. Support Selection: Choose variables with $|[\widehat{g}_t]_i| \geq \eta$ for some threshold $\eta$ to construct a candidate active set $\widehat{S}$.
  3. Low-Dimensional ZO Optimization: Restrict further ZO search and function evaluations to $\widehat{S}$ using any classical method (e.g., finite differences).

Pseudocode:

# Successive component selection (helper names are illustrative)
S_hat = set()                                                    # running estimate of the active support
for t in range(s):                                               # up to the maximum allowed sparsity s
    g_hat = lasso_gradient_estimate(f, x, T_prime)               # sparse Lasso gradient fit from T_prime evaluations
    S_hat |= indices_with_large_gradient(g_hat, eta)             # keep coordinates with |g_hat[i]| >= eta
    x = zeroth_order_optimize_on_support(f, x, S_hat, T_prime)   # classical ZO restricted to S_hat
return x
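
For concreteness, here is a minimal sketch of the Lasso gradient estimation and support-selection helpers used above. This is an illustrative implementation (not the authors' code), assuming NumPy, scikit-learn's `Lasso`, and a caller-supplied black-box objective `f`; note that scikit-learn does not penalize the intercept, a minor departure from the objective in Step 1.

```python
import numpy as np
from sklearn.linear_model import Lasso

def lasso_gradient_estimate(f, x, n, delta=0.01, lam=0.1, rng=None):
    """Estimate the gradient of a black-box f at x from n noisy evaluations
    along Rademacher (random sign) perturbation directions."""
    rng = rng if rng is not None else np.random.default_rng()
    d = x.shape[0]
    Z = rng.choice([-1.0, 1.0], size=(n, d))              # Rademacher perturbations z_i
    y = np.array([f(x + delta * z) for z in Z]) / delta   # scaled (noisy) function values
    model = Lasso(alpha=lam, fit_intercept=True)          # l1-regularized regression of y on Z
    model.fit(Z, y)
    return model.coef_                                    # sparse gradient estimate g_hat

def indices_with_large_gradient(g_hat, eta):
    """Support selection: coordinates whose estimated gradient magnitude is at least eta."""
    return set(np.flatnonzero(np.abs(g_hat) >= eta))
```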

B. Noisy Mirror Descent with Lasso (De-biased) Gradient Estimates

This approach generalizes the idea to stochastic mirror descent, a versatile method for convex optimization:

  • At each iteration, use Lasso to estimate the gradient as before.
  • De-bias the gradient estimate to improve its accuracy:

$$\tilde{g}_t = \widehat{g}_t + \frac{1}{n} Z_t^\top\left(\tilde{Y}_t - Z_t \widehat{g}_t - \widehat{\mu}_t \cdot \mathbf{1}_n\right)$$

  • Then take a mirror descent step with the de-biased estimate:

$$x_{t+1} = \arg\min_{x \in \mathcal{X}} \left\{ \eta\, \tilde{g}_t^\top (x - x_t) + \Delta_\psi(x, x_t) \right\}$$

Pseudocode:

# Noisy mirror descent with de-biased Lasso gradients (helper names are illustrative)
for t in range(T):
    g_tilde = debiased_lasso_gradient_estimate(f, x, n)   # Lasso fit from n evaluations, then de-biased
    x = mirror_descent_update(x, g_tilde, eta, psi)       # proximal step w.r.t. the Bregman divergence of psi
return x
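
A minimal sketch of the de-biasing step and the mirror descent update follows. This is my illustration under the assumption $\psi(x) = \tfrac{1}{2}\|x\|_2^2$, for which $\Delta_\psi(x, x_t) = \tfrac{1}{2}\|x - x_t\|_2^2$ and the update reduces to a gradient step projected onto the feasible set, taken here to be a Euclidean ball of radius $B$ (the `psi` argument from the pseudocode is specialized away accordingly).

```python
import numpy as np

def debias(Z, y, g_hat, mu_hat):
    """De-biased Lasso gradient: g_tilde = g_hat + (1/n) Z^T (y - Z g_hat - mu_hat * 1_n)."""
    n = Z.shape[0]
    residual = y - Z @ g_hat - mu_hat
    return g_hat + (Z.T @ residual) / n

def mirror_descent_update(x, g_tilde, eta, radius):
    """Mirror descent step for psi(x) = 0.5*||x||^2: a gradient step followed by
    projection onto the Euclidean ball {x : ||x||_2 <= radius} standing in for X."""
    x_new = x - eta * g_tilde
    norm = np.linalg.norm(x_new)
    return x_new if norm <= radius else x_new * (radius / norm)
```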


2. Convergence Rates and Dimensional Dependency

These algorithms achieve convergence rates that are logarithmic in the ambient dimension $d$ thanks to the sparsity-exploiting gradient estimation.

Component Selection Algorithm:

  • Regret bound:

$$R^S_{\mathcal{A}}(T) \lesssim B \left( \frac{\sigma^2 H^2 s \log d}{T} \right)^{1/4} + \widetilde{O}(T^{-1/3})$$

  • $B$: bound on the solution norm
  • $s$: sparsity level
  • $H$: $\ell_1$-norm bound on the gradient
  • $\sigma$: noise level

Noisy Mirror Descent (De-biased Lasso):

  • Cumulative regret:

$$R^C_{\mathcal{A}}(T) \lesssim \xi_{\sigma, s}\, B \sqrt{\log d} \left(\frac{(1+H)^2 s}{T}\right)^{1/4} + \widetilde{O}(T^{-1/2})$$

  • Under Hessian smoothness, the rate further improves by incorporating higher-order bias corrections.

Takeaway: The $\log d$ factor replaces the typical $d$ in mean-squared-error and regret bounds, making these algorithms practical in very high-dimensional settings so long as $s$ is modest.
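
As a rough back-of-the-envelope illustration (my numbers, not the paper's), suppose $d = 10^5$ and $s = 10$: the dimension-dependent quantity entering the bounds is then $s \log d \approx 10 \times 11.5 \approx 115$ rather than a polynomial factor in $d = 10^5$,

$$\frac{d}{s \log d} = \frac{10^5}{10 \cdot \log(10^5)} \approx \frac{10^5}{115} \approx 870,$$

so the effective dimensionality entering the rate shrinks by roughly three orders of magnitude whenever the sparsity assumption holds.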


3. Sparsity Assumptions

These gains rest on explicit sparsity structures:

  • Gradient Sparsity (A3): At every $x$, the gradient satisfies $\|\nabla f(x)\|_0 \leq s$ and $\|\nabla f(x)\|_1 \leq H$.
  • Function Sparsity (A5): The function depends only on a subset $S$ of variables with $|S| \leq s$ (a stronger assumption).
  • Weak Hessian Sparsity (A4): The Hessian is sparse in the $\ell_1$ norm.

Under these assumptions, Lasso regression can accurately recover the support of the gradient (i.e., the truly "active" components), enabling the efficiency gains above.


4. Empirical Validation

Synthetic experiments demonstrate substantial practical gains:

  • Settings: Quadratic and fourth-degree sparse polynomials in $d = 100$ dimensions with sparsity $s = 10, 20$.
  • Baselines & Comparisons:
    • (1) Baseline ZO method (Flaxman et al. 2005)
    • (2) Lasso-GD (component selection)
    • (3) Mirror Descent with de-biased Lasso (MD)
  • Findings:
    • Both Lasso-GD and MD outperform standard ZO in cumulative regret, especially as $d$ increases.
    • MD is the most robust and the least sensitive to parameter tuning.
    • Empirical regret curves confirm the theoretical logarithmic dependence on $d$.
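
A minimal sketch of a synthetic objective in the spirit of these experiments (my construction, not the paper's exact test functions): a noisy sparse quadratic in $d = 100$ dimensions whose value depends on only $s = 10$ coordinates, satisfying the function-sparsity assumption (A5).

```python
import numpy as np

def make_sparse_quadratic(d=100, s=10, noise_sigma=0.1, seed=0):
    """Return a noisy black-box objective f(x) = ||x_S - x*_S||^2 + noise that
    depends only on a random subset S of s coordinates (function sparsity, A5)."""
    rng = np.random.default_rng(seed)
    support = rng.choice(d, size=s, replace=False)   # the s truly influential coordinates
    x_star = rng.normal(size=s)                      # hidden minimizer on the support

    def f(x):
        diff = x[support] - x_star
        return float(diff @ diff) + noise_sigma * rng.normal()

    return f, support
```

Running either algorithm above on such an objective and checking how much of `support` the thresholded Lasso estimates recover is a quick way to reproduce the qualitative behavior reported here.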

5. Practical Applications and Implications

Applications:

  • Black-box hyperparameter tuning: When only a handful of settings impact model accuracy.
  • Experimental and resource allocation design: Where relevant control variables are few.
  • Neuroscience/biomedical stimulus optimization: Where only select features trigger a response.
  • General black-box simulation optimization: If a sparse parameter subset truly matters.

Implications for Practice:

  • Exploiting sparsity structure allows high-dimensional ZO optimization with costs comparable to low-dimensional problems.
  • Can be an enabling technology in settings where high costs or access restrictions make traditional gradient-based optimization infeasible.

Limitations/Open Questions:

  • Theory currently assumes exact or strong sparsity; extension to approximate ($\ell_1$-bounded) sparsity is open.
  • Extension beyond convexity (e.g., to non-convex settings) remains an active research frontier.

Summary Table: Key Contributions

| Algorithm | Key Idea | Convergence Rate | Dimension in Rate | Sparsity Used | Empirical Benefit |
|---|---|---|---|---|---|
| Successive Component Selection | Identify & optimize a sparse active set | $O(\sqrt{\log d}\, T^{-1/4})$ | $\log d$ | Strong (A5) | Outperforms baseline |
| Mirror Descent + De-biased Lasso | Sparse gradient estimate + mirror descent | $O(\sqrt{\log d}\, s^{1/4} T^{-1/4})$; $T^{-1/3}$ with A6 | $\log d$ | Moderate to strong | Most robust, fastest |

Conclusion

This framework delivers the first dimension-robust, sparsity-exploiting convergence guarantees for zeroth-order optimization, using Lasso-based gradient estimation, thresholding, and mirror descent. The result is a set of deployable algorithms for practitioners facing high-dimensional black-box optimization, with strong theoretical, empirical, and practical support under realistic sparsity assumptions. This advances the frontier of gradient-free learning and optimization in modern machine learning and scientific experimentation.


References:

  • Wang et al. (2017). "Stochastic Zeroth-order Optimization in High Dimensions."

For detailed algorithms, parameter selection, and proofs, consult the full text and appendices of the original paper.