Zeroth-Order Policy Gradient (ZPG)

Updated 18 March 2026

Zeroth-Order Policy Gradient (ZPG) is a reinforcement learning approach that estimates gradients using randomized finite-difference techniques without relying on explicit derivatives.
It leverages unbiased gradient estimators of smoothed objectives, ensuring convergence guarantees and controlled bias-variance tradeoffs in non-differentiable, black-box settings.
ZPG finds applications in control, meta-learning, and RL from human feedback, providing a robust, derivative-free alternative for policy optimization.

Zeroth-Order Policy Gradient (ZPG) methods are a class of reinforcement learning (RL) and control optimization algorithms that estimate policy gradients exclusively via function (or cost) evaluations, leveraging randomized finite-difference techniques without access to explicit gradients. ZPG algorithms are particularly suited to settings where policy or value functions are non-differentiable, black-box, or too complex to admit efficient analytical or automatic differentiation. Recent advances have rigorously established the theoretical and practical viability of ZPG—unifying it with policy optimization concepts, deriving convergence guarantees in both standard control (LQR, output feedback) and RL settings (actor-critic, meta-learning, RL from human feedback), and benchmarking its performance against contemporary RL algorithms (Kumar et al., 2020, Saglam et al., 2024, Pan et al., 1 Mar 2025, Saglam et al., 2024, Zhang et al., 27 Jan 2026, Qiu et al., 17 Jun 2025, Song et al., 21 Feb 2026, Pan et al., 2024).

1. Fundamental Principles and Motivations

ZPG is defined by its reliance on derivative-free gradient estimates based on random perturbations of policy parameters or actions, using only cost/reward function evaluations. The approach is motivated by the following observations:

Many real-world systems involve non-differentiable or black-box policies (e.g., rule-based controllers, simulators).
Differentiability assumptions in standard policy gradient algorithms are sometimes violated, leading to biased updates or the necessity of unreliable function approximation.
Finite-difference smoothing and perturbation approaches, central in black-box optimization, can be interpreted as stochastic policy gradient methods under a locally averaged objective (Qiu et al., 17 Jun 2025).
ZPG enables provable policy improvement even when explicit gradients of the cost, reward, or $Q$-function are unavailable or inaccurate.

The key theoretical insight is that the expectation of the finite-difference gradient estimator, under suitable randomization, equals the gradient of a smoothed objective. This parallels the REINFORCE/score-function gradient and admits well-controlled bias-variance tradeoffs, tunable by smoothing radius and sampling parameters (Qiu et al., 17 Jun 2025, Pan et al., 2024, Kumar et al., 2020).
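As a worked illustration of this parallel, consider Gaussian rather than spherical perturbations (an assumption made here only because it gives the shortest derivation). Let $f$ be the cost and $\mu > 0$ the smoothing radius, and treat $x \sim \mathcal{N}(\theta, \mu^2 I)$ as a "policy" over perturbed parameters with density $p_\theta$. The score-function identity then gives

$\nabla_\theta\, \mathbb{E}_{x \sim \mathcal{N}(\theta, \mu^2 I)}[f(x)] = \mathbb{E}_{x}\big[f(x)\, \nabla_\theta \log p_\theta(x)\big] = \mathbb{E}_{\varepsilon \sim \mathcal{N}(0, I)}\Big[\frac{f(\theta + \mu \varepsilon)}{\mu}\, \varepsilon\Big] = \mathbb{E}_{\varepsilon \sim \mathcal{N}(0, I)}\Big[\frac{f(\theta + \mu \varepsilon) - f(\theta)}{\mu}\, \varepsilon\Big]$

The middle step uses $\nabla_\theta \log p_\theta(x) = (x - \theta)/\mu^2 = \varepsilon/\mu$ with $x = \theta + \mu \varepsilon$. The last step uses $\mathbb{E}[\varepsilon] = 0$, so subtracting $f(\theta)$ plays the role of a REINFORCE baseline: it leaves the mean unchanged and only affects the variance.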

2. Mathematical Foundations and Estimator Construction

The generic ZPG update arises from random smoothing identities. For a parameterized policy $\pi_\theta$, with objective $J(\theta) = \mathbb{E}_{s_0}[Q^{\pi_\theta}(s_0, \pi_\theta(s_0))]$, and finite-difference smoothing radius $\mu$, consider

$\widehat{g} = \frac{1}{m} \sum_{j=1}^m \frac{f(\theta+\mu u_j) - f(\theta)}{\mu}\, u_j, \quad u_j \sim \mathrm{Unif}(\mathbb{S}^{d-1})$

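where $f(\cdot)$ denotes the cost/reward functional, and $u_j$ are random directions. Up to a dimension factor, this estimator is unbiased for the gradient of the mollified (smoothed) objective $J_\mu(\theta) = \mathbb{E}_v[f(\theta + \mu v)]$, with $v$ uniform on the unit ball: for unit-sphere directions its expectation is $\nabla J_\mu(\theta)/d$, and the factor $d$ is either restored explicitly or absorbed into the step size. The smoothed gradient in turn differs from the true gradient by a bias scaling as $O(\mu d)$ under $L$-smoothness (Qiu et al., 17 Jun 2025, Zhang et al., 2024, Pan et al., 2024).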
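A minimal sketch of the forward-difference estimator above, written with the Python standard library only. The helper names (`unit_direction`, `forward_difference_estimate`) and the quadratic test cost are illustrative assumptions, not taken from any of the cited works. Directions are drawn uniformly on the unit sphere by normalizing Gaussian vectors.

```python
import math
import random


def unit_direction(d, rng):
    """Sample a direction uniformly on the unit sphere by normalizing a Gaussian vector."""
    z = [rng.gauss(0.0, 1.0) for _ in range(d)]
    norm = math.sqrt(sum(x * x for x in z))
    return [x / norm for x in z]


def forward_difference_estimate(f, theta, mu, m, rng):
    """Average of (f(theta + mu*u_j) - f(theta)) / mu * u_j over m random unit directions."""
    d = len(theta)
    base = f(theta)  # the unperturbed value is evaluated once and reused
    g_hat = [0.0] * d
    for _ in range(m):
        u = unit_direction(d, rng)
        shifted = [t + mu * ui for t, ui in zip(theta, u)]
        coeff = (f(shifted) - base) / mu
        for i in range(d):
            g_hat[i] += coeff * u[i] / m
    return g_hat


def quadratic(theta):
    # Hypothetical smooth test cost with gradient 2 * theta.
    return sum(t * t for t in theta)


rng = random.Random(0)
theta = [1.0, -2.0, 0.5]
g_hat = forward_difference_estimate(quadratic, theta, mu=1e-3, m=20000, rng=rng)
# For unit-sphere directions E[u u^T] = I/d, so d * g_hat is what gets
# compared with the true gradient 2 * theta.
print([len(theta) * g for g in g_hat], [2 * t for t in theta])
```

The sketch uses `m + 1` function evaluations per estimate. Rescaling by `d` at the end is one convention; folding it into the step size is equivalent.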

A widely used alternative is the two-point estimator:

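$\widehat{g} = \frac{1}{m} \sum_{j=1}^m \frac{f(\theta+\mu u_j) - f(\theta-\mu u_j)}{2\mu}\, u_j, \quad u_j \sim \mathrm{Unif}(\mathbb{S}^{d-1})$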

which reduces estimator bias for symmetric objective functions (Kumar et al., 2020, Pan et al., 2024). In deterministic policy optimization, ZPG is often applied in action-space, yielding compatible updates for actor-critic architectures (Saglam et al., 2024, Kumar et al., 2020).
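To make the bias reduction concrete, the sketch below compares the forward slope $(f(\theta+\mu u) - f(\theta))/\mu$ with the symmetric slope $(f(\theta+\mu u) - f(\theta-\mu u))/(2\mu)$ along one fixed unit direction. The quadratic cost and the chosen direction are assumptions picked so the result can be checked by hand. A Taylor expansion shows that the forward slope carries an error of $\frac{\mu}{2} u^\top \nabla^2 f\, u$, while the symmetric difference cancels all even-order terms and is therefore exact for a quadratic.

```python
def directional_differences(f, theta, u, mu):
    """Return the forward and central finite-difference slopes of f along direction u."""
    plus = [t + mu * ui for t, ui in zip(theta, u)]
    minus = [t - mu * ui for t, ui in zip(theta, u)]
    forward = (f(plus) - f(theta)) / mu
    central = (f(plus) - f(minus)) / (2.0 * mu)
    return forward, central


def cost(theta):
    # Hypothetical quadratic cost: f(theta) = theta_0^2 + 3 * theta_1^2.
    return theta[0] ** 2 + 3.0 * theta[1] ** 2


theta = [1.0, 2.0]
u = [0.6, 0.8]  # a fixed unit direction
# Exact directional derivative: grad f . u = 2*1*0.6 + 6*2*0.8 = 10.8
true_slope = 2 * theta[0] * u[0] + 6 * theta[1] * u[1]

for mu in (0.5, 0.1, 0.01):
    forward, central = directional_differences(cost, theta, u, mu)
    # The forward error shrinks linearly in mu; the central error is rounding noise.
    print(mu, forward - true_slope, central - true_slope)
```

For non-quadratic costs the symmetric slope is no longer exact, but its leading error term is of order $\mu^2$ rather than $\mu$.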

In meta-learning and multi-task control, ZPG naturally extends to hierarchically smoothed objectives, using nested perturbations and Monte-Carlo outer/inner loops to provide unbiased estimators for meta-policy gradients (Pan et al., 1 Mar 2025, Pan et al., 2024).

3. Algorithmic Instantiations and Pseudocode

Prototypical ZPG algorithms follow a stochastic gradient descent (SGD) paradigm, employing sample-based finite-difference policy updates. Generic pseudocode for the canonical case is as follows (Pan et al., 2024, Pan et al., 1 Mar 2025, Zhang et al., 27 Jan 2026, Song et al., 21 Feb 2026, Kumar et al., 2020):

Sample Perturbations: Draw $m$ random unit directions $u_1, \dots, u_m$ (in parameter or action space).
Evaluate Costs: For each $u_j$, evaluate the cost/reward at $\theta + \mu u_j$ (and optionally, $\theta - \mu u_j$).
Estimate Gradient: Form the averaged gradient estimate $\widehat{g}$ as above.
Parameter Update: Perform a stochastic descent/ascent step: $\theta \leftarrow \theta \mp \eta\, \widehat{g}$ with step size $\eta > 0$ (minus for cost minimization, plus for reward maximization).
Repeat until a convergence criterion (e.g., a small norm of the gradient estimator) is satisfied.
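The loop above can be sketched in a few lines of dependency-free Python. This is an illustrative minimal version, not the algorithm of any specific cited paper: the function name `zpg_minimize`, the default hyperparameters, and the choice to multiply the estimate by the dimension `d` (instead of absorbing that factor into the step size) are all assumptions.

```python
import math
import random


def zpg_minimize(f, theta0, mu=0.05, m=8, step=0.1, iters=300, tol=1e-3, seed=0):
    """Minimal two-point zeroth-order descent loop (sample, evaluate, estimate, update)."""
    rng = random.Random(seed)
    theta = list(theta0)
    d = len(theta)
    for _ in range(iters):
        g_hat = [0.0] * d
        for _ in range(m):
            # 1. Sample a random unit direction.
            z = [rng.gauss(0.0, 1.0) for _ in range(d)]
            norm = math.sqrt(sum(x * x for x in z))
            u = [x / norm for x in z]
            # 2. Evaluate the cost at both perturbed points.
            f_plus = f([t + mu * ui for t, ui in zip(theta, u)])
            f_minus = f([t - mu * ui for t, ui in zip(theta, u)])
            # 3. Accumulate the averaged two-point estimate.
            coeff = (f_plus - f_minus) / (2.0 * mu)
            for i in range(d):
                g_hat[i] += coeff * u[i] / m
        # Stop when the estimator norm is small (a heuristic, since g_hat is random).
        if math.sqrt(sum(g * g for g in g_hat)) < tol:
            break
        # 4. Descent step; the factor d compensates for E[u u^T] = I/d.
        theta = [t - step * d * g for t, g in zip(theta, g_hat)]
    return theta


def bowl(theta):
    # Hypothetical test cost with its minimum at (1, -2).
    return (theta[0] - 1.0) ** 2 + (theta[1] + 2.0) ** 2


print(zpg_minimize(bowl, [0.0, 0.0]))
```

Each iteration costs `2 * m` evaluations of `f`. In an RL setting each evaluation would be one or more rollouts of the perturbed policy.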

Specific applications build on this prototype:

LQR and Output Feedback Stabilization: Perturb linear feedback gains, simulate rollouts under perturbed policies, estimate the cost, and compute the gradient estimator (Zhang et al., 27 Jan 2026, Song et al., 21 Feb 2026).
Meta-Learning (MAML style): For each task, sample perturbations, take inner ZPG step(s), aggregate outer-loop meta-gradient estimator over tasks (Pan et al., 1 Mar 2025, Pan et al., 2024).
Actor-Critic with ZPG (oCPG): Apply two-point ZPG to the action argument of the $Q$-function, integrate within delayed policy updates and replay buffer optimization (Saglam et al., 2024).
RL from Human Feedback without Reward Model: Estimate value differences via collected human preferences on pairs of trajectories, invert a known preference-link function, and use zeroth-order policy update (Zhang et al., 2024).
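As an illustration of the LQR item, the sketch below tunes the feedback gain of a scalar linear system using rollout costs only. The system parameters (`a`, `b`, `q`, `r`), the horizon, the step size, and the initial gain are all assumed values chosen for this example. A scalar system is used so that the result can be compared with the gain from the Riccati equation. Note that in one dimension the unit sphere is just $\{-1, +1\}$.

```python
import random


def rollout_cost(k, a=1.2, b=1.0, q=1.0, r=1.0, x0=1.0, horizon=200):
    """Simulate x_{t+1} = a x_t + b u_t with u_t = -k x_t and return the quadratic cost."""
    x, total = x0, 0.0
    for _ in range(horizon):
        u = -k * x
        total += q * x * x + r * u * u
        x = a * x + b * u
    return total


def zpg_lqr(k0, mu=0.05, step=0.05, iters=200, seed=0):
    """Two-point zeroth-order descent on the feedback gain, using rollouts only."""
    rng = random.Random(seed)
    k = k0
    for _ in range(iters):
        u = rng.choice((-1.0, 1.0))  # the unit sphere in one dimension is {-1, +1}
        g_hat = (rollout_cost(k + mu * u) - rollout_cost(k - mu * u)) / (2.0 * mu) * u
        k -= step * g_hat
    return k


def riccati_gain(a=1.2, b=1.0, q=1.0, r=1.0, iters=500):
    """Reference gain from fixed-point iteration of the scalar Riccati equation."""
    p = q
    for _ in range(iters):
        p = q + a * a * p - (a * b * p) ** 2 / (r + b * b * p)
    return a * b * p / (r + b * b * p)


# k0 = 1.5 is stabilizing for this system since |1.2 - 1.5| < 1.
k_zpg = zpg_lqr(k0=1.5)
print(k_zpg, riccati_gain())
```

The optimizer never reads `a` or `b`; it only queries `rollout_cost`, which stands in for the black-box system. The sketch has no safeguard against leaving the stabilizing set and relies on a stabilizing initial gain and a small step.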

4. Theoretical Properties and Convergence Guarantees

Rigorous upper bounds are established for estimator bias, variance, and sample complexity under regularity conditions (smoothness, boundedness, stability):

Bias: For $L$-smooth objectives, the smoothing bias scales as $O(\mu d)$; symmetric estimators can reduce this.
Variance: Grows with the parameter dimension $d$, and is controlled by the perturbation batch size $m$ and smoothness.
Convergence Rates: Under Polyak–Łojasiewicz (PL) or gradient domination, ZPG methods attain approximate stationary points in a number of iterations polynomial in the inverse accuracy for $L$-smooth objectives with bounded variance (Pan et al., 2024, Pan et al., 1 Mar 2025, Song et al., 21 Feb 2026), and exhibit corresponding sample complexity scaling for LQR, meta-learning, and RLHF tasks (Zhang et al., 2024).

In the RL context, ZPG can circumvent the incompatibility of gradient estimation under function approximation, providing provable convergence to stationary policies in nonconvex, black-box MDPs (Saglam et al., 2024, Kumar et al., 2020).

For RL from human feedback, ZPG achieves the first polynomial query/sample complexity guarantees for stochastic MDPs without reward inference, despite high constants and slow scaling in parameter dimension and horizon (Zhang et al., 2024).

5. Practical Considerations and Variance Reduction

Key algorithmic choices impact ZPG’s efficiency, robustness, and applicability:

Smoothing Radius $\mu$: Trades off bias (decreases as $\mu$ shrinks) and variance (increases as $\mu$ shrinks, because noise in the function evaluations is divided by $\mu$). Adaptive or schedule-based tuning is common.
Perturbation Batch Size $m$: Larger $m$ reduces variance; the batch size needed in high action- or parameter-dimensions grows as the gradient-norm tolerance $\epsilon$ tightens.
Two-Point vs. One-Point Estimators: Two-point estimators generally exhibit lower bias; one-point estimators are more query-efficient in certain settings (Qiu et al., 17 Jun 2025, Kumar et al., 2020).
Variance Reduction: Theoretical analysis reveals that symmetric baselines (central function value subtraction) act as optimal variance-reducing baselines in the policy gradient interpretation. Algorithms such as ZoAR further improve variance via averaged baselines and query reuse (experience replay), with provable gains in convergence and empirical performance (Qiu et al., 17 Jun 2025).
Stability and Projection: For control tasks (LQR), explicit construction of projections or initialization in stabilizing sets ensures that all iterates remain stabilizing, a property absent in pure first-order approaches (Pan et al., 1 Mar 2025, Zhang et al., 27 Jan 2026).
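The effect of subtracting the central function value can be checked numerically. The sketch below compares the empirical variance of the raw single-direction estimate $f(\theta+\mu u)\,u/\mu$ with the baseline-subtracted version $(f(\theta+\mu u) - f(\theta))\,u/\mu$. The cost function, its constant offset, and the sample size are assumptions chosen to make the gap visible; this is not an implementation of ZoAR.

```python
import math
import random


def sample_estimates(f, theta, mu, n, use_baseline, seed=0):
    """Draw n single-direction estimates, with or without subtracting f(theta)."""
    rng = random.Random(seed)
    d = len(theta)
    base = f(theta) if use_baseline else 0.0
    out = []
    for _ in range(n):
        z = [rng.gauss(0.0, 1.0) for _ in range(d)]
        norm = math.sqrt(sum(x * x for x in z))
        u = [x / norm for x in z]
        value = f([t + mu * ui for t, ui in zip(theta, u)])
        out.append([(value - base) / mu * ui for ui in u])
    return out


def total_variance(samples):
    """Sum over coordinates of the empirical variance of the estimates."""
    n, d = len(samples), len(samples[0])
    means = [sum(s[i] for s in samples) / n for i in range(d)]
    return sum(sum((s[i] - means[i]) ** 2 for s in samples) / n for i in range(d))


def cost(theta):
    # Hypothetical cost with a large constant offset, which inflates the raw estimator.
    return 10.0 + sum(t * t for t in theta)


theta, mu = [1.0, -1.0], 0.1
var_raw = total_variance(sample_estimates(cost, theta, mu, 2000, use_baseline=False))
var_base = total_variance(sample_estimates(cost, theta, mu, 2000, use_baseline=True))
print(var_raw, var_base)
```

Both runs use the same seed and therefore the same directions, so the difference in variance comes from the baseline alone. Both estimators have the same mean because $\mathbb{E}[u] = 0$.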

6. Applications and Benchmarks

ZPG approaches are applied in areas where gradient access is restricted or unreliable:

Model-free Control: Stabilization of unknown linear and partially observed dynamical systems without requiring system identification (Zhang et al., 27 Jan 2026, Song et al., 21 Feb 2026).
Meta-Policy Optimization: Model-agnostic meta-policy learning across ensembles of ergodic LQRs with stability and sample complexity guarantees (Pan et al., 1 Mar 2025, Pan et al., 2024).
Deterministic Policy Optimization in RL: ZPG integrated into actor-critic frameworks improves compatibility and robustness, and matches or outperforms TD3 and SAC on MuJoCo benchmarks under standard settings (Saglam et al., 2024).
RL from Human Feedback: Direct policy optimization using preference-based ZPG sidesteps reward modeling, supporting more general MDPs and preference-link functions with quantifiable sample/query efficiency (Zhang et al., 2024).
Black-Box Optimization/Adversarial Attacks: Finite-difference-based ZPG yields state-of-the-art query complexity when paired with variance-reduced schemes (Qiu et al., 17 Jun 2025).
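For the preference-based item above, the step of inverting a known preference-link function can be sketched as follows. The logistic (Bradley-Terry style) link, the simulated annotator, and the clipping constant `eps` are assumptions made for illustration; the cited work allows more general link functions and this is not its algorithm. The recovered value difference would take the place of the cost difference in a zeroth-order estimator.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def estimate_value_difference(preferences, eps=1e-3):
    """Invert the logistic link: map the empirical win rate to a value difference."""
    p_hat = sum(preferences) / len(preferences)
    p_hat = min(max(p_hat, eps), 1.0 - eps)  # clip so the logit stays finite
    return math.log(p_hat / (1.0 - p_hat))


def simulate_preferences(true_difference, n, rng):
    """Hypothetical annotator: prefers the perturbed trajectory with probability sigmoid(diff)."""
    p = sigmoid(true_difference)
    return [1 if rng.random() < p else 0 for _ in range(n)]


rng = random.Random(0)
prefs = simulate_preferences(true_difference=0.8, n=5000, rng=rng)
diff_hat = estimate_value_difference(prefs)
print(diff_hat)
```

The clipping matters in practice: when every comparison favors one trajectory, the unclipped logit is infinite, and the estimate near the clip boundary is highly sensitive to a few votes.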

7. Limitations and Future Directions

Despite broad applicability, ZPG has inherent limitations:

Sample Complexity: Higher than first-order methods, especially for small smoothing radii and high-dimensional parameter spaces.
Variance Scaling: Increases with ambient dimension; advanced variance-reduction and block-coordinate techniques are proposed to mitigate this (Zhang et al., 2024, Qiu et al., 17 Jun 2025).
Bias-Versus-Variance Tradeoff: Requires problem-specific tuning of smoothing and batch size parameters.
Empirical vs. Theoretical Scaling: Theoretical guarantees may be pessimistic compared to observed practical performance, particularly in the RLHF and high-dimensional continuous control regimes.
Convergence Rates: Known rates for stationary-point finding with ZPG are typically slower than the rates attainable in convex or first-order smooth problems.
Hybrid Methods: Research is ongoing into hybrid actor-critic/zeroth-order methods, off-policy variance reduction, and adaptive smoothing (Kumar et al., 2020, Saglam et al., 2024).

Future research directions include integrating variance reduction via control variates or antithetic sampling, combining ZPG with off-policy data reuse, and extending convergence analysis to more general nonconvex RL objective landscapes (Saglam et al., 2024, Qiu et al., 17 Jun 2025, Zhang et al., 2024).