- The paper shows that optimizing a smoothed zeroth-order objective is equivalent to single-step policy optimization using a specific stochastic policy.
- It proves that Gaussian finite-difference gradient estimators match the REINFORCE estimator with a baseline, clarifying variance reduction techniques.
- The proposed ZoAR algorithm leverages averaged baselines and query reuse, achieving notable improvements in convergence and sample efficiency.
Zeroth-Order Optimization is Secretly Single-Step Policy Optimization
The paper presents a rigorous theoretical framework that explicitly connects Zeroth-Order Optimization (ZOO) with finite differences to a specific instance of single-step Policy Optimization (PO) from Reinforcement Learning (RL). It not only reveals a surprising mathematical equivalence between gradient estimation in ZOO and the REINFORCE policy gradient estimator with baseline, but also leverages this connection to develop enhanced ZOO algorithms with PO-inspired variance reduction methods.
Theoretical Contributions
ZOO and PO Objective Equivalence
The authors formally show that optimizing the commonly used smoothed objective function in ZOO, F_μ(θ), is equivalent to optimizing a single-step PO objective J(θ) with a particular stochastic policy and reward definition (Theorem 1). Specifically, in this single-step PO problem the action is the perturbed parameter vector and the reward is the negative function value, so maximizing the expected reward is the same as minimizing the smoothed ZOO objective. This alignment is nontrivial: it demystifies the implicit regularization performed by smoothing in ZOO and allows ZOO to be analyzed and improved using RL theory.
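The correspondence can be written out in a few lines. The following is a sketch under assumed notation (Gaussian smoothing with radius μ, a deterministic f with the noise variable ξ dropped for brevity, and the symbols π_θ, R, and u chosen here for illustration); the exact statement of Theorem 1 in the paper may be phrased differently.

```latex
\begin{align*}
F_\mu(\theta) &= \mathbb{E}_{u \sim \mathcal{N}(0, I)}\big[ f(\theta + \mu u) \big], \\
\pi_\theta(x) &= \mathcal{N}\big(x;\, \theta,\, \mu^2 I\big), \qquad R(x) = -f(x), \\
J(\theta) &= \mathbb{E}_{x \sim \pi_\theta}\big[ R(x) \big]
           = -\,\mathbb{E}_{u \sim \mathcal{N}(0, I)}\big[ f(\theta + \mu u) \big]
           = -F_\mu(\theta).
\end{align*}
```

The middle equality uses the reparameterization x = θ + μu. Under these assumptions, maximizing J(θ) and minimizing F_μ(θ) have the same solutions.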
Gradient Estimator Equivalence
Theorem 2 further proves that the standard Gaussian-smoothed finite-difference gradient estimator in ZOO, widely used for black-box optimization and adversarial attack generation, is mathematically equivalent to the single-step REINFORCE gradient estimator with a specific baseline, f(θ;ξ), under the same stochastic policy. The subtraction of the function value at the unperturbed point in classical ZOO can therefore be interpreted as a baseline for variance reduction, which gives this design a clearer theoretical grounding and motivation. The framework also generalizes to other sampling distributions via importance sampling (Theorem 3), yielding a unified interpretation of various ZOO gradient approximations as REINFORCE-style estimators with suitable scalings.
Implications
This equivalence has several practical and theoretical implications:
- Algorithm Design: ZOO algorithms can now borrow advanced variance reduction and experience reuse methods from PO/PG research for improved sample efficiency and convergence.
- Unified Analysis: Many ZOO techniques (e.g., through different smoothing distributions or baseline choices) can now be analyzed through RL theory, enabling new insights into their behavior and trade-offs.
- Learning Rate Scheduling: The equivalence yields explicit scaling coefficients and learning-rate corrections for ZOO with non-Gaussian perturbations.
Practical Algorithmic Advances
Building on the PO-inspired view, the authors propose a new ZOO algorithm: Zeroth-Order Optimization with Averaged Baseline and Query Reuse (ZoAR). Its design leverages two main ideas from policy optimization and reinforcement learning:
- Averaged Baseline: Rather than using a single-point evaluation as the baseline, ZoAR maintains a buffer of recent function evaluations and uses their average as a baseline for gradient estimation. This is directly motivated by the variance-minimizing baseline in REINFORCE, and is shown to provide lower-variance, more stable estimates.
- Query Reuse: ZoAR stores historical queries and reuses them for multiple gradient estimations, akin to experience replay in RL. This increases effective batch size and further reduces the variance of the gradient estimator without additional function evaluations.
Theoretical analysis (see the bias-variance decomposition in the supplementary material) demonstrates that these techniques reduce variance at the cost of a controllable bias proportional to the length of the history buffer. The authors provide bounds quantifying this trade-off, showing that the variance reduction can dominate for appropriate settings.
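As background for this trade-off, the standard argument for why a baseline does not change the expected gradient can be sketched as follows. It assumes Gaussian perturbations u ~ N(0, I), a deterministic f, and a baseline b that does not depend on u; the symbol ĝ_b is introduced here for illustration.

```latex
\begin{align*}
\hat{g}_b(\theta) &= \frac{f(\theta + \mu u) - b}{\mu}\, u, \\
\mathbb{E}_u\big[\hat{g}_b(\theta)\big]
  &= \frac{1}{\mu}\,\mathbb{E}_u\big[ f(\theta + \mu u)\, u \big]
     - \frac{b}{\mu}\,\mathbb{E}_u[u]
   = \nabla F_\mu(\theta),
\end{align*}
```

since E[u] = 0 and, for Gaussian smoothing, ∇F_μ(θ) = (1/μ) E[f(θ + μu) u]. The argument relies on b being independent of u. An average over a buffer that contains the current samples and evaluations made at earlier iterates does not satisfy that condition exactly, which is consistent with the bias term described above.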
Implementation Details
- The method is straightforward to implement as a modification to typical ZOO procedures (code provided in Algorithm 1). It requires maintaining and updating a bounded-size history buffer, then computing batch-averaged function values and gradient surrogates.
- The approach is compatible with adaptive optimizers (Adam-like) and standard first-order routines used with ZOO.
- Computational and memory overheads are linear in the buffer size and thus tunable.
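To make the buffer mechanics concrete, here is a minimal sketch of an averaged-baseline, query-reuse loop. It is one possible reading of the description above and not a reproduction of the paper's Algorithm 1: the function name `zoar_sketch`, all hyperparameter names and values, the plain gradient-descent update, and the choice to reuse stored perturbations as-is are assumptions made for illustration.

```python
from collections import deque

import numpy as np


def sphere(x):
    # Simple test objective, f(x) = ||x||^2, used only for illustration.
    return float(np.dot(x, x))


def zoar_sketch(f, theta0, mu=0.1, lr=0.01, queries_per_step=8,
                history_steps=4, steps=300, seed=0):
    """Hypothetical averaged-baseline + query-reuse loop (not Algorithm 1)."""
    rng = np.random.default_rng(seed)
    theta = np.array(theta0, dtype=float)
    # Bounded history of (perturbation, function value) pairs; the oldest
    # entries are dropped automatically once the buffer is full.
    buffer = deque(maxlen=queries_per_step * history_steps)

    for _ in range(steps):
        # Fresh queries at the current iterate.
        for _ in range(queries_per_step):
            u = rng.standard_normal(theta.shape)
            buffer.append((u, f(theta + mu * u)))

        us = np.stack([u for u, _ in buffer])
        ys = np.array([y for _, y in buffer])

        # Averaged baseline: mean of all stored function values.
        baseline = ys.mean()

        # Query reuse: every stored pair contributes to the estimate, even
        # though older pairs were evaluated at earlier iterates (assumed
        # here to be the source of the bias discussed in the text).
        grad = ((ys - baseline)[:, None] * us).mean(axis=0) / mu

        theta -= lr * grad
    return theta


theta0 = np.ones(10)
theta_final = zoar_sketch(sphere, theta0)
print("initial loss:", sphere(theta0), "final loss:", sphere(theta_final))
```

Memory and per-step cost both scale with `queries_per_step * history_steps`, matching the linear overhead noted above. Swapping the plain update for an Adam-style step would only change the last line of the loop.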
Empirical Results
Across multiple domains (high-dimensional synthetic benchmarks, black-box adversarial attacks on neural networks, and memory-efficient fine-tuning of LLMs), the proposed method demonstrates substantially faster convergence and improved final performance compared to vanilla ZOO and recent competitive baselines. Reported empirical gains include:
- Up to 16× speedup on Ackley functions in 10,000 dimensions.
- 5.9× reduction in attack queries for adversarial attacks versus standard ZOO.
- Pronounced improvement in sample efficiency in LLM fine-tuning tasks.
- The averaged baseline alone yields significant improvements, and query reuse amplifies these effects.
See the summarized results in Table 1 (adversarial attack) and Figure 1 (synthetic function optimization).
Broader Implications and Future Directions
- Generalization & Transfer: The equivalence allows the transfer of tools (e.g., baselines, replay, control variates) and theoretical guarantees from RL/PG to ZOO, including in areas such as hyperparameter optimization, robotics, federated learning, black-box attacks, and prompt optimization for LLMs.
- Variance Reduction: Many RL advances in variance reduction for policy gradients may be repurposed for ZOO, allowing for more efficient, scalable derivative-free optimization, particularly in very high-dimensional regimes or when function evaluations are expensive.
- Non-Smooth/Quantized Objectives: Since the approach works with large smoothing parameters, it is suitable for challenging objectives such as quantized networks and ultra-low-precision training, potentially replacing or supplementing straight-through estimators.
- Bias-Variance Scheduling: Dynamic scheduling and bias correction methods from off-policy RL may further mitigate or eliminate bias from query reuse, supporting longer and more aggressive history buffers.
Conclusions
By establishing a deep mathematical correspondence between zeroth-order gradient estimation and single-step policy optimization, and leveraging it for principled variance reduction, the paper provides both new theoretical understanding and practical algorithms that deliver strong empirical gains. This unified theory aligns previously separate research traditions, opens opportunities for cross-pollination between black-box optimization and RL, and offers a foundation for new classes of robust, sample-efficient optimization methods across machine learning domains.