Zeroth-Order Optimization
- Zeroth-order optimization is a derivative-free method that approximates gradients through function evaluations to address black-box and simulation-based challenges.
- Key techniques include two-point estimators, coordinatewise finite-differences, and advanced variance reduction methods that improve convergence and query efficiency.
- Practical applications span adversarial attacks, molecule design, on-device tuning, and large model fine-tuning, with meta-learning further enhancing adaptability.
Zeroth-order optimization (ZO) refers to a class of algorithms for solving optimization problems without requiring access to explicit gradient information. Instead, ZO algorithms approximate gradients using only function evaluations, thereby enabling derivative-free solutions for black-box, simulation-based, or non-differentiable problems encountered across machine learning, signal processing, control, and engineering. ZO optimization is essential in scenarios where gradients are unavailable, costly to compute, or subject to privacy or operational constraints.
1. Principles and Core Algorithms
ZO optimization methods approximate gradients through finite-difference schemes or randomized smoothing. The basic procedure involves three steps:
- Gradient Estimation: The most common approach is the two-point estimator. For an objective $f:\mathbb{R}^d \to \mathbb{R}$ and a random direction $u$, the estimator is
  $$\hat{\nabla} f(x) = \frac{f(x + \mu u) - f(x - \mu u)}{2\mu}\, u,$$
  where $\mu > 0$ is a smoothing parameter and $u$ is typically drawn from a uniform distribution on the sphere (in which case the estimator is scaled by $d$) or a standard Gaussian (Liu et al., 2020). Averaging over minibatches of random directions reduces variance.
- Descent Direction Computation: The estimated gradient is used directly, or after a transformation (e.g., a sign operation or adaptive scaling), to form the update direction.
- Solution Update: A typical update is
  $$x_{t+1} = x_t - \eta_t\, g_t,$$
  with learning rate $\eta_t$ and descent direction $g_t$ (Liu et al., 2020).
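The three-step procedure above can be sketched in a few lines of NumPy (an illustrative toy, not any specific published algorithm; the quadratic objective, step size, and smoothing parameter are arbitrary choices):

```python
import numpy as np

def zo_two_point_grad(f, x, mu, num_dirs, rng):
    """Two-point zeroth-order gradient estimate, averaged over random directions."""
    g = np.zeros_like(x)
    for _ in range(num_dirs):
        u = rng.standard_normal(x.size)                      # Gaussian direction
        g += (f(x + mu * u) - f(x - mu * u)) / (2 * mu) * u  # two queries per direction
    return g / num_dirs

def zo_sgd(f, x0, lr=0.1, steps=300, mu=1e-3, num_dirs=10, seed=0):
    """Plain ZO descent: x_{t+1} = x_t - eta * g_t, with g_t built from queries only."""
    rng = np.random.default_rng(seed)
    x = x0.copy()
    for _ in range(steps):
        x -= lr * zo_two_point_grad(f, x, mu, num_dirs, rng)
    return x

# Toy black-box objective: f(x) = ||x - 1||^2, minimized at the all-ones vector.
f = lambda x: float(np.sum((x - 1.0) ** 2))
x_star = zo_sgd(f, np.zeros(5))
```

Note that the optimizer never touches a derivative of `f`; it only evaluates it, which is exactly what makes the scheme applicable to black-box objectives.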
Alternatives include coordinatewise finite-difference estimation, random or structured subspace perturbations, block coordinate sampling, and surrogate regression-based updates. Recent works have refined gradient estimators for unbiasedness (Ma et al., 22 Oct 2025), reduced their query dependence on the dimension (Yue et al., 2023), or applied variance reduction and surrogate modeling (Xiao et al., 2022, Chen et al., 6 Jul 2025).
Key Algorithms
| Method | Gradient Estimation | Query Complexity (per iter) | Remarks |
|---|---|---|---|
| Two-point ZO | Random directions | $2$ per direction ($2q$ for $q$ directions) | Standard; high variance in high dimension $d$; bias vanishes as $\mu \to 0$ |
| Coordinatewise | Finite difference per coordinate | $2d$ | More accurate; requires $O(d)$ queries per full gradient |
| Hybrid/ZOB-GDA | Block subset per iter | $O(b)$ ($b \le d$) | Adaptive tradeoff; can reach $O(1)$ per step (Jin et al., 22 Oct 2025) |
| Regression-based | Surrogate from history | $1$ | Uses past function values; matches two-point in query efficiency |
| Unbiased ZOO | Telescoping estimators | Randomized (small constant in expectation) | Constructs unbiased estimators even for finite $\mu$ (Ma et al., 22 Oct 2025) |
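For comparison with the two-point scheme, a minimal coordinatewise central-difference estimator (the $2d$-query row of the table) might look like the following sketch:

```python
import numpy as np

def zo_coordinate_grad(f, x, mu=1e-5):
    """Coordinatewise central differences: 2*d function queries, low variance."""
    d = x.size
    g = np.zeros(d)
    for i in range(d):
        e = np.zeros(d)
        e[i] = 1.0                                   # i-th standard basis vector
        g[i] = (f(x + mu * e) - f(x - mu * e)) / (2 * mu)
    return g

# For this smooth test function the true gradient is 2x + [3, 0, 0].
f = lambda x: float(np.sum(x ** 2) + 3 * x[0])
x = np.array([1.0, -2.0, 0.5])
g = zo_coordinate_grad(f, x)
```

The estimate is deterministic and accurate to finite-difference error, but the query cost grows linearly with the dimension, which is why randomized estimators dominate in high-dimensional settings.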
2. Variance Reduction, Bias–Variance Tradeoff, and Query Complexity
A central challenge in ZO optimization is the high variance in gradient estimators, especially in high dimensions. Several works have addressed this via:
- Averaging/mini-batching: Reduces estimator variance, but increases query complexity.
- Coordinate importance sampling and hybrid estimators: Convex combination of random and coordinatewise estimates with importance sampling on coordinates minimizes variance for fixed query budget (Sharma et al., 2020).
- Lazy query reuse: Techniques such as LAZO (Xiao et al., 2022) adaptively reuse past function values for variance reduction and query saving, achieving regret bounds matching the symmetric two-point scheme.
- Learned sampling distributions: Adaptive learning of the perturbation distribution or direct reinforcement learning over sampling policies can yield significant variance reduction and accelerate convergence (Ruan et al., 2019, Zhai et al., 2021).
- Subspace and block perturbations: As high dimensionality is the bottleneck, restricting perturbations and updates to structured low-dimensional subspaces (e.g., block coordinate updates, sparsity patterns, or low-rank projections) reduces the variance term from order $d$ to order $k$ for a subspace of dimension $k \ll d$, provided the subspace aligns well with curvature directions (Park et al., 31 Jan 2025).
- Unbiased estimators: Recent developments formalize telescoping series for the directional derivative to construct unbiased gradient estimators for finite smoothing parameters $\mu$, eliminating the systematic bias present in classic ZO estimators, and optimize the sampling schedule for minimum variance (Ma et al., 22 Oct 2025).
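As an illustration of the block-perturbation idea, here is a sketch (not the ZOB-GDA algorithm itself) that perturbs only a chosen coordinate block, so estimator noise scales with the block size rather than the full dimension:

```python
import numpy as np

def zo_block_grad(f, x, block, mu=1e-3, num_dirs=5, rng=None):
    """Two-point estimate restricted to a coordinate block.

    Only the coordinates in `block` are perturbed, so the estimate is
    supported on the block and its variance scales with len(block), not x.size.
    """
    rng = rng or np.random.default_rng(0)
    g = np.zeros_like(x)
    for _ in range(num_dirs):
        u = np.zeros_like(x)
        u[block] = rng.standard_normal(len(block))   # perturb only the block
        g += (f(x + mu * u) - f(x - mu * u)) / (2 * mu) * u
    return g / num_dirs

f = lambda x: float(np.sum((x - 2.0) ** 2))
x = np.zeros(100)
block = np.arange(10)            # update only the first 10 coordinates this step
g = zo_block_grad(f, x, block)   # zero outside the block by construction
```

A full optimizer would cycle or sample blocks across iterations; the sketch shows only the per-step estimator.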
Theoretical bounds indicate that the mean squared error of classic two-point estimators scales linearly with the dimension $d$, so achieving convergence rates similar to first-order methods requires on the order of $d$ more queries per iteration. By exploiting problem structure, subspace alignment, and learned sampling, several algorithms reduce this dependence, sometimes achieving substantially milder dimension dependence in the overall query complexity for finding $\epsilon$-stationary points (Jin et al., 22 Oct 2025, Ma et al., 22 Oct 2025).
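The dimension dependence of the estimator's mean squared error can be checked empirically on a toy quadratic whose gradient has unit norm at the test point (an illustrative experiment, assuming Gaussian directions; for this setup the exact MSE is $d + 1$):

```python
import numpy as np

def two_point_mse(d, mu=1e-4, trials=2000, seed=0):
    """Empirical MSE of a single two-point estimate for f(x) = ||x||^2 / 2,
    evaluated at a point where ||grad f|| = 1 regardless of dimension."""
    rng = np.random.default_rng(seed)
    x = np.zeros(d)
    x[0] = 1.0                               # grad f(x) = x, so ||grad f(x)|| = 1
    f = lambda z: np.sum(z ** 2) / 2
    err = 0.0
    for _ in range(trials):
        u = rng.standard_normal(d)
        g = (f(x + mu * u) - f(x - mu * u)) / (2 * mu) * u
        err += np.sum((g - x) ** 2)
    return err / trials

# MSE grows roughly linearly with the dimension d for a single-sample estimator.
mse_10, mse_100 = two_point_mse(10), two_point_mse(100)
```

Running this shows the estimate in 100 dimensions is roughly ten times noisier than in 10 dimensions, matching the linear-in-$d$ theory.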
3. Advances in Adaptive and Meta-learning Approaches
Recent ZO research integrates meta-learning and adaptivity into the optimization pipeline:
- Learning update rules: Framing optimizer design as a learning problem, ZO meta-learners use RNNs (e.g., ZO-LSTM) to parameterize and learn both update policies and adaptive variance reduction schemes (Ruan et al., 2019).
- UpdateRNN: LSTM-based update rule for historical smoothing.
- QueryRNN: LSTM-adaptive Gaussian sampling for perturbation, balancing variance-bias at test time.
- Adaptive moment estimation: Extensions of adaptive optimizers to ZO (e.g., ZO-AdaMM and R-AdaZO) use moment averaging over estimated gradients for variance reduction, and refine the estimate of the second moment based on smoothed first moments, achieving improved stability and faster convergence (Shu et al., 3 Feb 2025).
- Policy optimization for sampling: Employing reinforcement learning (e.g., DDPG) or neural networks to learn the sampling policy, allowing the system to focus ZO queries in more informative parameter subspaces (Zhai et al., 2021).
- Regression-based single-point ZO (RESZO): Fits local linear/quadratic surrogates using historical evaluations to provide more accurate gradients using only a single query per iteration, achieving convergence rates comparable to (and sometimes outperforming) two-point methods (Chen et al., 6 Jul 2025).
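A schematic of the regression-based surrogate idea (a deliberately simplified sketch, not the exact RESZO method of Chen et al.): fit a local linear model to recent (point, value) pairs and read the gradient estimate off the slope coefficients.

```python
import numpy as np

def surrogate_grad(X, y):
    """Estimate a gradient by fitting a local linear model f(x) ≈ a + g·x
    to recent (query point, function value) history via least squares."""
    A = np.hstack([np.ones((len(X), 1)), X])   # design matrix [1, x]
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef[1:]                            # slope coefficients ≈ gradient

# Toy history: 8 recent query points near the origin of an (exactly) linear f,
# so the fitted slope recovers the true gradient [3, -2].
rng = np.random.default_rng(0)
f = lambda x: 3 * x[0] - 2 * x[1] + 1
X = rng.standard_normal((8, 2)) * 0.1
y = np.array([f(x) for x in X])
g = surrogate_grad(X, y)
```

In an actual single-point scheme the history buffer is filled with one new evaluation per iteration, so the marginal cost per step is a single query; the least-squares fit amortizes information across past queries.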
These approaches extend ZO methods beyond hand-designed heuristics, allowing adaptive exploitation of optimization history and problem-specific structure.
4. Practical Applications and Impact
ZO optimization has enabled practical advances in several fields:
- Black-Box Adversarial Attacks: Learned ZO optimizers consistently outperform hand-designed baselines (e.g., ZO-SGD, ZO-signSGD, ZO-ADAM) in query efficiency and convergence speed when attacking black-box neural network classifiers (Ruan et al., 2019, Sharma et al., 2020).
- Molecule Optimization: When used to optimize latent code vectors of molecular autoencoders, ZO methods—especially sign-based updates—demonstrate robustness in challenging non-smooth, discontinuous, and flat landscapes encountered in molecule generation (Lo et al., 2022).
- On-Device and Memory-Constrained Training: Lightweight ZO protocols (e.g., ElasticZO, PeZO) enable fine-tuning and learning on memory-constrained devices by avoiding backpropagation and storing only minimal state, making ZO approaches viable for edge/IoT scenarios (Sugiura et al., 8 Jan 2025, Tan et al., 28 Apr 2025).
- LLM Fine-Tuning: ZO block coordinate descent and memory-efficient subspace perturbation techniques make zeroth-order fine-tuning of models with billions of parameters tractable, reducing both wall-clock time and memory requirements (Park et al., 31 Jan 2025).
- Constrained Optimization: ZO algorithms have been extended to composite constrained problems (smooth + nonsmooth, functional constraints), achieving near-optimal query complexity for $\epsilon$-stationary points and supporting resource allocation or adversarial example generation with hard constraints (Li et al., 2021, Jin et al., 22 Oct 2025).
- Distributed and Federated Settings: One-point distributed ZO with gradient tracking delivers consensus and convergence in multi-agent, nonconvex, and stochastic environments, even with highly restricted communication and query budgets (Mhanna et al., 8 Oct 2024).
5. Theoretical Foundations: Convergence, Optimality, and Implicit Regularization
A robust ZO literature provides:
- Convergence rates and query complexity: For smooth nonconvex objectives, optimal query complexity (in the dimension $d$ and target accuracy $\epsilon$) is achievable using block coordinate or unbiased estimators (Jin et al., 22 Oct 2025, Ma et al., 22 Oct 2025). Effective dimension (ED) analysis shows that empirical complexity can be far better than the worst-case linear-in-$d$ scaling, especially when problem structure limits the number of high-curvature directions (Yue et al., 2023).
- Stationary points and implicit regularization: ZO methods using random sampling and two-point estimators implicitly bias toward flat minima, i.e., solutions with a small trace of the Hessian. This is formally connected to minimizing a smoothed objective whose leading-order correction is proportional to $\mu^2 \operatorname{tr}(\nabla^2 f(x))$, explaining the empirically observed preference for "flat" (more generalizable) minima, even in overparameterized ML models (Zhang et al., 5 Jun 2025).
- Block versus full-dimension estimation: The multi-query paradox is resolved: for a fixed total query budget, simple averaging over $q$ queries per step is sub-optimal compared with the single-query approach, unless projection alignment (ZO-Align) is used, in which case full-subspace updates become optimal (Lin et al., 19 Sep 2025).
- Constraint handling and control-theoretic frameworks: ZO algorithms leveraging feedback linearization and augmented Lagrangian schemes achieve finite-time feasibility without requiring convex or white-box constraints (Li et al., 2021, Zhang et al., 28 Sep 2025).
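The implicit flat-minima bias noted above can be made concrete with a standard Taylor-expansion calculation for the Gaussian-smoothed objective (assuming $u \sim \mathcal{N}(0, I_d)$ and sufficiently smooth $f$; the cited work may use a different normalization):

```latex
f_\mu(x) \;=\; \mathbb{E}_{u \sim \mathcal{N}(0, I_d)}\!\left[f(x + \mu u)\right]
        \;=\; f(x) \;+\; \frac{\mu^2}{2}\,\operatorname{tr}\!\big(\nabla^2 f(x)\big) \;+\; O(\mu^4),
```

since the first-order term vanishes in expectation and $\mathbb{E}[u u^\top] = I_d$. Randomized-smoothing ZO methods effectively descend on $f_\mu$ rather than $f$, so they penalize the Hessian trace and are drawn toward flatter minima.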
6. Implementation and Scaling Considerations
Efficient ZO optimization in real-world systems demands special attention to computational resources and hardware efficiency:
- Hardware-Friendly ZO Training: The cost of random number generation for perturbations in ZO is a primary bottleneck in FPGA/ASIC deployment. Approaches such as random number reuse and hardware-friendly adaptive scaling (PeZO) address this by sharing and scaling uniform random vectors, reducing LUT, FF, and power consumption while maintaining convergence (Tan et al., 28 Apr 2025).
- Sparse and Blockwise Methods: Only a subset of coordinates or subnetworks are updated per step (block coordinate or structured pruning), dramatically lowering per-iteration computational and memory cost while maintaining convergence and generalization (Chen et al., 2023, Park et al., 31 Jan 2025).
- Parallelism and feature reuse: Efficient multi-GPU and distributed implementations leverage the decoupled nature of coordinatewise finite-difference estimation for fast, scalable training (Chen et al., 2023).
7. Open Problems and Frontier Directions
The current landscape of ZO research highlights several promising directions:
- Adaptive and learned substructure selection: Extending block and subspace sampling by learning which directions/subspaces to perturb in an online or data-driven manner for maximal informational gain (Park et al., 31 Jan 2025).
- Unbiased, adaptive, and low-variance estimators: Systematic exploration of telescoping series and adaptive sampling to achieve near-optimal variance/bias tradeoffs in high-dimensional, nonconvex, or constraint-rich settings (Ma et al., 22 Oct 2025).
- Beyond smooth domains: Generalizing ZO theory and practice to non-smooth, highly nonconvex, or discrete-structured objectives using, e.g., trust-region or double-randomization methods (Liu et al., 2020).
- Distributed and federated ZO methods: Addressing scalability, privacy, and communication efficiency in multi-agent and federated optimization (Mhanna et al., 8 Oct 2024).
- On-device training and hardware–algorithm co-design: Integrating ZO solvers directly onto hardware accelerators and developing algorithms/architectures jointly for optimal power, speed, and learning performance (Tan et al., 28 Apr 2025).
ZO optimization thus serves as a critical technology for tackling black-box, simulation-based, and hardware-constrained learning tasks, balancing theoretical guarantees with practical efficiency and robustness to problem structure and operational constraints.