Zeroth-Order Gradient Estimation
- Zeroth-order gradient estimation is a derivative-free technique that approximates gradients using only function evaluations and perturbation methods.
- It employs methods like random-direction smoothing and finite differences to balance bias–variance tradeoffs and improve convergence.
- Advanced approaches such as hybrid estimators, subspace projections, and adaptive normalization enhance performance in distributed and high-dimensional scenarios.
Zeroth-order gradient estimation refers to the family of techniques for approximating gradients of a function using only function evaluations, without any access to analytic derivative information. Such estimators, also called derivative-free or black-box gradient estimators, are central to a wide range of optimization, learning, and control problems where gradients are unavailable, expensive, or ill-defined. These estimators underpin much of modern theory and practice in black-box optimization, reinforcement learning, model extraction, and privacy-sensitive, distributed, or constrained settings.
1. Principles and Mathematical Foundations
Let $f:\mathbb{R}^d\to\mathbb{R}$ be a function accessible only via queries, i.e., one can evaluate $f(x)$ at any point $x$, but not the gradient $\nabla f(x)$. Zeroth-order (ZO) gradient estimators approximate $\nabla f(x)$ via randomized or deterministic combinations of function values at or near $x$.
The prototypical class of ZO estimators is based on the classical finite-difference formula $\hat g(x) = \frac{f(x+\mu u)-f(x)}{\mu}\,u$, where $u$ is a direction (often random or coordinate-wise) and $\mu > 0$ is a smoothing or perturbation scale. Averaging over directions yields an estimator whose expectation approaches the true gradient as $\mu \to 0$ under smoothness assumptions.
Typical estimators include:
- Random directions (Gaussian/sphere smoothing): $\hat g(x) = \frac{d}{\mu}\big(f(x+\mu u)-f(x)\big)\,u$ with $u \sim \mathrm{Unif}(\mathbb{S}^{d-1})$, or the analogous Gaussian form with $u \sim \mathcal{N}(0, I_d)$ and no factor of $d$.
This is unbiased for the gradient of a $\mu$-smoothed version $f_\mu$ of $f$ (Sharma et al., 2020).
- Two-point (central difference) version: $\hat g(x) = \frac{d}{2\mu}\big(f(x+\mu u)-f(x-\mu u)\big)\,u$.
This admits lower variance and improved numerical stability (Sharma et al., 2020).
- Coordinate-wise finite differences: $\hat g_i(x) = \frac{f(x+\mu e_i)-f(x-\mu e_i)}{2\mu}$ for $i = 1, \dots, d$,
used when function queries along axes are cheap and high variance is problematic.
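The three estimator families above can be sketched in a few lines of NumPy (a minimal illustration on a quadratic test function; the function names and test setup are ours, not taken from any cited paper):

```python
import numpy as np

def fd_random(f, x, mu, q, rng):
    """One-point forward-difference estimator averaged over q random
    unit directions u: (d/mu) * (f(x + mu*u) - f(x)) * u."""
    d = x.size
    g = np.zeros(d)
    for _ in range(q):
        u = rng.standard_normal(d)
        u /= np.linalg.norm(u)
        g += d * (f(x + mu * u) - f(x)) / mu * u
    return g / q

def fd_central(f, x, mu, q, rng):
    """Two-point (central-difference) random-direction estimator."""
    d = x.size
    g = np.zeros(d)
    for _ in range(q):
        u = rng.standard_normal(d)
        u /= np.linalg.norm(u)
        g += d * (f(x + mu * u) - f(x - mu * u)) / (2 * mu) * u
    return g / q

def fd_coordinate(f, x, mu):
    """Deterministic coordinate-wise central differences (2d queries)."""
    d = x.size
    g = np.zeros(d)
    for i in range(d):
        e = np.zeros(d); e[i] = mu
        g[i] = (f(x + e) - f(x - e)) / (2 * mu)
    return g

# Smoke test on f(x) = ||x||^2, whose true gradient is 2x.
rng = np.random.default_rng(0)
f = lambda x: float(x @ x)
x = np.array([1.0, -2.0, 0.5])
print(fd_coordinate(f, x, 1e-5))          # ≈ [2., -4., 1.]
print(fd_central(f, x, 1e-5, 2000, rng))  # noisy but close to [2., -4., 1.]
```

Note the tradeoff the code makes visible: the coordinate scheme is exact up to $O(\mu^2)$ but needs $2d$ queries, while the random-direction schemes use far fewer queries per estimate at the cost of sampling variance.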
Variance and bias analysis for these estimators (assuming $f$ is $L$-smooth) shows that the mean squared error of $\hat g$ with $q$ sampled directions decomposes into a variance term scaling as $d/q$ and a squared bias term that vanishes as $O(\mu^2)$ (Ma et al., 8 Mar 2025, Hikima et al., 28 Oct 2025, Sharma et al., 2020).
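The $d/q$ variance scaling can be checked empirically (an illustrative script with an assumed quadratic objective): averaging ten times as many directions shrinks the mean squared error by roughly a factor of ten.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 50
x = rng.standard_normal(d)
f = lambda x: float(x @ x)     # true gradient is 2x
g_true = 2 * x
mu = 1e-6

def zo_estimate(q):
    """Average of q two-point random-direction probes."""
    g = np.zeros(d)
    for _ in range(q):
        u = rng.standard_normal(d); u /= np.linalg.norm(u)
        g += d * (f(x + mu * u) - f(x - mu * u)) / (2 * mu) * u
    return g / q

def mse(q, reps=20):
    """Empirical mean squared error of the q-sample estimator."""
    return np.mean([np.linalg.norm(zo_estimate(q) - g_true) ** 2
                    for _ in range(reps)])

for q in (10, 100, 1000):
    print(q, mse(q))   # shrinks roughly tenfold per tenfold increase in q
```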
2. Core Estimator Variants: Unbiasedness, Bias-Variance, Complexity
Several refinements and rigorous analyses have sharpened understanding of ZO estimators:
- Unbiased (variance-optimized) constructions: By treating the directional derivative as a telescoping series over perturbation scales, one can construct unbiased estimators that match or beat two-point schemes in variance while fully eliminating bias. Explicit formulae for families of such estimators (e.g., 2-, 3-, and 4-point) are provided in (Ma et al., 22 Oct 2025). With optimal sampling distributions over perturbation scales and directions (e.g., geometric or Zipf laws), the resulting estimator matches the minimax rate for smooth nonconvex problems, achieving $O(d\,\varepsilon^{-2})$ oracle complexity for reaching an $\varepsilon$-stationary point.
- Finite-difference variants (vanilla vs. coordinate vs. random-direction): Comparative analyses show that coordinate-wise finite differences, used alone, incur query complexity cubic in the dimension $d$ for reaching stationarity, while random-direction (sphere or Gaussian) methods achieve the optimal linear dependence on $d$ in the nonconvex, Lipschitz-smooth case (Hikima et al., 28 Oct 2025). Averaging across multiple (uniformly sampled) directions reduces both variance and sample complexity.
- Complex-step estimation: Extending $f$ to the complex domain and using the imaginary part of $f(x + i\mu u)$ to probe the directional derivative leads to single-query, bias-free estimators with bounded variance as $\mu \to 0$, resolving the canonical tradeoff between variance blow-up and cancellation error that afflicts real-domain finite differences (Jongeneel et al., 2021).
- Hybrid estimators: Combining coordinate-wise and random-direction techniques in an adaptive blend (Hybrid Gradient Estimator, HGE) delivers a variance-minimizing, query-efficient approach that strictly generalizes both and allows for automatic tuning of mixing coefficients and per-coordinate sampling probabilities (Sharma et al., 2020).
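The complex-step idea above can be demonstrated directly on an analytic test function (a coordinate-wise sketch of our own; the single-query random-direction variant replaces the basis vector $e_j$ with a random direction). Because there is no subtraction of nearly equal values, the step $h$ can be taken essentially to machine zero:

```python
import numpy as np

def complex_step_grad(f, x, h=1e-30):
    """Complex-step derivative: for real-analytic f,
    Im f(x + i*h*e_j) / h = df/dx_j + O(h^2), with no subtractive
    cancellation, so h can be made extremely small."""
    d = x.size
    g = np.empty(d)
    for j in range(d):
        z = x.astype(complex)
        z[j] += 1j * h
        g[j] = f(z).imag / h
    return g

f = lambda z: np.sin(z[0]) * np.exp(z[1])   # analytic test function
x = np.array([0.7, -0.3])
print(complex_step_grad(f, x))
# matches [cos(x0)*exp(x1), sin(x0)*exp(x1)] to near machine precision
```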
3. Subspace, Adaptive, and High-Dimensional Innovations
ZO methods face amplified variance and sample-complexity challenges as the dimension $d$ increases. Recent advances leverage structural properties and adaptive normalization:
- Subspace projection and low-rank structure: By projecting optimization into a learned or random low-dimensional subspace (often inspired by the empirical observation that neural network gradients are concentrated in few directions), variance scales linearly with the subspace dimension $k$ rather than the ambient dimension $d$. Techniques such as P-GAP and ZO-Muon utilize low-rank SVD or random projections to concentrate gradient estimation, achieving a substantial reduction in per-step variance and query cost for large-scale models (notably LLMs and ViTs) (Mi et al., 21 Oct 2025, Lang et al., 19 Feb 2026).
- Spectral gradient orthogonalization: Orthogonalizing multiple subspace ZO estimates via singular value decomposition and applying “msign” transforms further filters out noisy, spurious directions and amplifies directions of descent, enhancing convergence over standard mean-aggregation (Lang et al., 19 Feb 2026).
- Bayesian aggregation: Rather than aggregating ZO gradient probes by simple averaging, Bayesian Subspace ZO methods (BSZO) treat probes as noisy, linear observations of the true subspace gradient and apply sequential inference (Kalman filtering) to obtain consistent, adaptive low-variance estimates (Feng et al., 4 Jan 2026). Adaptive shrinkage and noise estimation provide improved convergence speed, especially in large-scale noisy settings.
- Adaptive normalization: Empirical and theoretical results show that normalizing the gradient estimate at each step by its empirical standard deviation—which, with high probability, is closely proportional to the local gradient magnitude—delivers improved convergence and query efficiency, especially across regions of varying gradient magnitude. This adaptive step-size adjustment yields better complexity and practical performance than fixed-step ZO methods (Ye et al., 2 Feb 2026).
- Sparse/compressed ZO estimation: In scenarios where gradients are known or hypothesized to be approximately sparse, compressed sensing–inspired algorithms (e.g., GraCe) identify and refine active coordinates adaptively, achieving per-step query complexity that scales with the gradient sparsity $s$ (up to logarithmic factors in $d$) and dimension-insensitive convergence guarantees in nonconvex settings (Qiu et al., 2024).
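The subspace-projection idea can be sketched as follows (a minimal random-subspace estimator of our own; the $d/k$ rescaling and the quadratic test function are illustrative choices, not the P-GAP or ZO-Muon algorithms themselves):

```python
import numpy as np

def subspace_zo_grad(f, x, k, mu, rng):
    """ZO gradient estimate restricted to a random k-dim subspace:
    draw an orthonormal basis P (d x k), estimate the k partial
    derivatives of y -> f(x + P y) at y = 0 by central differences,
    and lift back with P. Costs 2k queries instead of 2d."""
    d = x.size
    P, _ = np.linalg.qr(rng.standard_normal((d, k)))  # orthonormal columns
    g_sub = np.empty(k)
    for j in range(k):
        step = mu * P[:, j]
        g_sub[j] = (f(x + step) - f(x - step)) / (2 * mu)
    # Estimates P P^T grad f(x); since E[P P^T] = (k/d) I, rescale
    # by d/k for an unbiased estimate of grad f(x).
    return P @ g_sub

rng = np.random.default_rng(0)
d, k = 1000, 20
f = lambda x: float(x @ x)                 # true gradient is 2x
x = rng.standard_normal(d)
g_hat = (d / k) * subspace_zo_grad(f, x, k, 1e-5, rng)
cos = np.dot(g_hat, 2 * x) / (np.linalg.norm(g_hat) * np.linalg.norm(2 * x))
print(cos)   # positive cosine similarity with the true gradient
```

With only $2k = 40$ queries in $d = 1000$ dimensions, the estimate captures a descent direction (cosine similarity on the order of $\sqrt{k/d}$), which is the point of subspace methods: trade full-gradient accuracy for a drastic query reduction.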
4. Distributed, Constrained, and Application-Specific ZO Estimators
ZO estimators are central to distributed, federated, and constrained optimization, where communication costs or privacy constraints further complicate access to (or sharing of) gradient information:
- Distributed/federated ZO tracking: In decentralized setups (e.g., over-the-air FL or edge learning), gradient communication is replaced with compressed scalar projections (inner products) onto random directions, exploiting the superposition property for aggregation and reconstructing unbiased estimates with drastically reduced communication (Jang et al., 2024). This approach, directly inspired by the randomized gradient estimator, preserves convergence rates under minimal overhead.
- Variance-reduced and gradient-tracking estimators: To break the per-iteration $O(d)$ query cost per agent (e.g., of $2d$-point coordinate estimators), variance-reduced methods store snapshots and incrementally update partial coordinate estimates, combining them with consensus-based gradient tracking. This enables $O(1)$ cost per agent per round while attaining $O(1/k)$ convergence with respect to total queries, even for nonconvex objectives (Mu et al., 2024).
- Constraint handling and extra-gradient frameworks: ZO extra-gradient schemes for convex-concave min-max or generally constrained black-box optimization can be constructed using either efficient 2-point random-direction estimators or higher-cost $2d$-point coordinate schemes. Tight oracle complexity guarantees, scaling linearly in the dimension $d$ for both variants, are obtained, and extensions exist for block-coordinate and dual updates (Zhou et al., 25 Jun 2025).
- Discrete and structured input domains: Innovative ZO constructions, such as those used for LLM fingerprinting, adapt third-order tensors, Jacobian estimation, or function perturbation to the combinatorial and high-dimensional structure of text data, typically via semantic-preserving word substitutions or embedding-space finite differences (Shao et al., 8 Oct 2025).
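The communication saving behind the distributed scalar-projection idea can be illustrated with a toy simulation (the agents, local objectives, and variable names are assumptions for illustration, not the protocol of any single cited paper): each agent transmits one scalar directional finite difference instead of a full $d$-dimensional gradient.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, mu = 100, 5, 1e-5
# Each agent i holds a local quadratic objective f_i(x) = ||x - t_i||^2.
targets = [rng.standard_normal(d) for _ in range(n)]
# Default argument t=t pins each target to its lambda (avoids late binding).
locals_f = [lambda x, t=t: float((x - t) @ (x - t)) for t in targets]

x = np.zeros(d)
u = rng.standard_normal(d); u /= np.linalg.norm(u)  # shared random direction

# Each agent sends ONE scalar instead of a d-vector.
scalars = [(fi(x + mu * u) - fi(x - mu * u)) / (2 * mu) for fi in locals_f]

# Server: average the scalars (over-the-air superposition does this for
# free), then reconstruct an estimate of the global gradient
# (1/n) * sum_i grad f_i(x).
g_hat = d * np.mean(scalars) * u
g_true = np.mean([2 * (x - t) for t in targets], axis=0)
print(np.dot(g_hat, g_true))   # positively correlated with the true gradient
```

Communication per round drops from $n \cdot d$ floats to $n$ floats, while the reconstructed estimate remains unbiased over the random direction $u$.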
5. Sample Complexity, Convergence Rates, and Practical Performance
The theoretical guarantees for ZO optimization have advanced to match first-order (FO) rates up to known lower bounds dictated by the limitations of black-box access and statistical noise:
- Unconstrained nonconvex: Averaged multi-direction ZO estimators (sphere or Gaussian) attain the best known sample complexity, linear in the dimension $d$, for reaching $\varepsilon$-stationarity in general smooth nonconvex objectives (Hikima et al., 28 Oct 2025). Stronger regularity (e.g., a Lipschitz Hessian) improves the exponent on $\varepsilon$.
- Simplex-constrained: Dirichlet-perturbed ZO estimators sample strictly within the constraint set and enable projected gradient and exponential-weights updates (PGD/EW) at the best attainable convergence rate for smooth functions accessible only on the simplex (Zrnic et al., 2022).
- Variance-bias tradeoff: High variance at small perturbation scales and bias at large scales is universal in ZO gradient schemes; balancing the two by shrinking $\mu$ with the target accuracy is critical for optimal convergence.
- Empirical validation: Across FNN/LLM fine-tuning, federated learning, black-box RL, and adversarial attack generation, state-of-the-art ZO estimators—especially those exploiting subspace, spectral, or compressed/variance-reduced designs—reliably outperform traditional ZO-SGD or coordinate-wise finite differences in both speed and end-task accuracy (Mi et al., 21 Oct 2025, Feng et al., 4 Jan 2026, Ma et al., 8 Mar 2025, Lang et al., 19 Feb 2026).
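The variance-bias tradeoff discussed above can be reproduced in a few lines (an illustrative experiment with an assumed additive-noise model on the function evaluations): a tiny $\mu$ amplifies evaluation noise, a large $\mu$ incurs smoothing bias, and an intermediate $\mu$ minimizes the error.

```python
import numpy as np

rng = np.random.default_rng(0)
d, sigma = 20, 1e-5                      # sigma: assumed evaluation-noise std
f = lambda x: float(np.sum(np.cos(x)))   # smooth test function
x = rng.standard_normal(d)
g_true = -np.sin(x)

def noisy(z):
    """Function oracle corrupted by additive Gaussian noise."""
    return f(z) + sigma * rng.standard_normal()

def mse(mu, trials=100):
    """Empirical MSE of coordinate-wise central differences at scale mu."""
    err = 0.0
    for _ in range(trials):
        g = np.empty(d)
        for i in range(d):
            e = np.zeros(d); e[i] = mu
            g[i] = (noisy(x + e) - noisy(x - e)) / (2 * mu)
        err += np.linalg.norm(g - g_true) ** 2
    return err / trials

for mu in (1e-6, 1e-3, 1.0):
    print(f"mu={mu:g}  MSE={mse(mu):.3g}")
# small mu: noise-dominated; large mu: bias-dominated; mu ~ 1e-3 wins here
```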
6. Limitations, Open Challenges, and Future Directions
- Variance under noisy function evaluation: Robust extensions to stochastic function values, potentially with noise correlation or heteroscedasticity, remain an active area for estimator and step-size scheme design (Hikima et al., 28 Oct 2025).
- Structure-exploiting ZO estimation: Incorporating task, model, or domain structure (e.g., block/low-rank, sparsity, group structure) to further decrease dimension dependence or exploit prior knowledge is an emerging research direction (Lang et al., 19 Feb 2026, Qiu et al., 2024).
- Beyond smoothness and convexity: Many applications (e.g., adversarial attacks, federated learning with decision-dependent distributions) violate classical smoothness or convexity assumptions. ZO theory for these generalized settings, including decision-dependent distributions and discrete or high-variance output spaces, is under active development (Hikima et al., 28 Oct 2025).
- Optimum bias–variance scheduling and adaptivity: Adaptive schemes for step size, perturbation scale, and number/direction of function queries per iteration (potentially as a function of real-time variance estimates) offer further potential for query-to-convergence efficiency (Ye et al., 2 Feb 2026).
- Algorithmic complexity in extremely high dimensions: Further diminishing dependence on ambient dimension, and tight lower bounds or impossibility results under information and oracle constraints, are important for the next generation of ZO large-scale optimization frameworks (Qiu et al., 2024).
In summary, zeroth-order gradient estimation constitutes a mathematically rigorous and practically vital toolkit for optimization when derivative information is unavailable, expensive, or strategically withheld. Advances in estimator design, variance reduction, structural exploitation, adaptive normalization, and theoretical complexity bounds have positioned ZO methods as central players in modern large-scale, distributed, and privacy-sensitive optimization and learning (Hikima et al., 28 Oct 2025, Mi et al., 21 Oct 2025, Lang et al., 19 Feb 2026, Zrnic et al., 2022, Ma et al., 8 Mar 2025, Qiu et al., 2024, Ye et al., 2 Feb 2026, Sharma et al., 2020).