
Policy Gradient: Theory and Applications

Updated 25 July 2025
  • Policy Gradient is a class of reinforcement learning methods that directly optimize parametrized policies by maximizing expected cumulative rewards.
  • These methods employ techniques such as baselines, control variates, and natural gradients to mitigate variance and boost sample efficiency.
  • Applications span robotics, control, and decision-making, with extensions addressing safety, exploration, and deterministic policy optimization.

Policy Gradient (PG) algorithms are a central class of reinforcement learning (RL) methods that directly optimize a parameterized policy via gradient ascent on expected cumulative reward. PG algorithms have become prominent tools for solving complex control, robotics, and decision-making problems, especially where policies must be represented with high-dimensional or continuous parameterizations. Unlike value-based methods that learn a value function and derive policies indirectly, PG methods seek the policy parameters that maximize the expected return by ascending (stochastic) estimates of the policy gradient.

1. Foundations and Mathematical Formulation

Policy gradient methods optimize the objective

$$J(\theta) = \mathbb{E}_{\tau \sim p(\cdot \mid \theta)} \left[ \sum_{t=0}^{T-1} \gamma^t\, r(s_t, a_t) \right]$$

where $\theta$ parameterizes a stochastic (or deterministic) policy $\pi_\theta$, $\tau$ denotes trajectories generated by $\pi_\theta$, $\gamma$ is a discount factor, and $r$ is the reward.

A key result is the policy gradient theorem, which states

$$\nabla_\theta J(\theta) = \mathbb{E}_{s \sim d^{\pi_\theta},\, a \sim \pi_\theta} \big[ \nabla_\theta \log \pi_\theta(a \mid s)\, Q^{\pi_\theta}(s, a) \big]$$

where $d^{\pi_\theta}$ is the discounted state visitation distribution and $Q^{\pi_\theta}(s,a)$ is the expected return from $(s, a)$. This forms the basis for stochastic gradient ascent, commonly known as REINFORCE:

$$\theta_{k+1} = \theta_k + \alpha_k\, \hat{g}(\theta_k)$$

where $\hat{g}(\theta_k)$ is an unbiased estimate of the policy gradient, computed from sampled trajectories (Kämmerer, 2019).
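For concreteness, the following is a minimal sketch of a REINFORCE-style update for a linear softmax policy over discrete actions. The trajectory format, feature layout, and hyperparameters are assumptions made for this example rather than part of any cited method, and, as is common in practice, the extra $\gamma^t$ weighting on each term is dropped.

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def reinforce_update(theta, trajectory, alpha=0.01, gamma=0.99):
    """One REINFORCE step: theta <- theta + alpha * g_hat(theta).

    trajectory: list of (phi, a, r) tuples, where phi stacks one policy
    feature row per available action in the visited state (assumed layout).
    """
    rewards = [r for (_, _, r) in trajectory]
    # Discounted returns-to-go: G_t = sum_{k >= t} gamma^(k-t) r_k
    G, returns = 0.0, []
    for r in reversed(rewards):
        G = r + gamma * G
        returns.append(G)
    returns.reverse()

    g_hat = np.zeros_like(theta)
    for (phi, a, _), G_t in zip(trajectory, returns):
        pi = softmax(phi @ theta)            # action probabilities pi_theta(.|s_t)
        grad_log_pi = phi[a] - pi @ phi      # grad_theta log pi_theta(a_t|s_t)
        g_hat += grad_log_pi * G_t           # score function times return-to-go
    return theta + alpha * g_hat
```

Averaging $\hat{g}$ over a batch of sampled trajectories gives the stochastic gradient estimate used in the update above.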

2. Variance Reduction and Sample Efficiency

Sample efficiency and variance reduction are central research topics for PG. The policy gradient estimate's high variance can slow learning and demand many environment interactions. To address this, several approaches are used:

  • Baselines: Subtracting a (state-dependent) baseline from the return lowers variance without introducing bias. The actor-critic architecture extends this by learning an estimated value function as the baseline (Kämmerer, 2019); a minimal sketch follows this list.
  • Control Variates: The doubly robust (DR) estimator uses a learned Q-function as a control variate, combining importance sampling with function approximation and theoretically minimizing variance when the side information is accurate. Differentiating the DR off-policy estimator yields a doubly robust policy gradient (DR-PG) that subsumes state-action-dependent baselines and trajectory-wise variance reduction as special cases (Huang et al., 2019).
  • Recursive Variance Reduction: Momentum-based and recursive gradient methods (e.g., STORM-PG) blend classic PG with exponential gradient averaging, avoiding periodic restarts and reducing variance over time, with provable $O(1/\epsilon^{3})$ sample complexity (Yuan et al., 2020). Advanced methods further lower the complexity by incorporating momentum, curvature, or second-order information without resorting to importance sampling (Fatkhullin et al., 2023).
  • Truncation and Safe Gradient Steps: Truncated updates control changes in policy parameters, preventing the explosion of importance sampling weights and stabilizing variance-reduced incremental schemes (Zhang et al., 2021).
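As a sketch of the first bullet, the snippet below subtracts a learned linear state-value baseline from the return-to-go before forming the score-function estimate. The transition format, linear critic, and learning rates are assumptions made for illustration.

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def pg_step_with_baseline(theta, w, trajectory, alpha=0.01, beta=0.1, gamma=0.99):
    """Policy gradient step using a learned state-value baseline V_w(s) = x @ w.

    trajectory: list of (phi, x, a, r) tuples, where phi stacks per-action
    policy features and x is the state feature vector for the baseline
    (both layouts are assumptions of this sketch).
    """
    # Discounted returns-to-go
    G, returns = 0.0, []
    for (_, _, _, r) in reversed(trajectory):
        G = r + gamma * G
        returns.append(G)
    returns.reverse()

    g_theta = np.zeros_like(theta)
    g_w = np.zeros_like(w)
    for (phi, x, a, _), G_t in zip(trajectory, returns):
        baseline = x @ w                       # V_w(s_t)
        advantage = G_t - baseline             # variance-reduced learning signal
        pi = softmax(phi @ theta)
        grad_log_pi = phi[a] - pi @ phi
        g_theta += grad_log_pi * advantage     # baseline leaves the estimate unbiased
        g_w += (G_t - baseline) * x            # regress the baseline toward returns
    return theta + alpha * g_theta, w + beta * g_w
```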

3. Policy Parameterizations: Softmax, Natural, and Deterministic PG

Most PG methods adopt one of several parameterizations:

  • Softmax Parameterization: $\pi_\theta(a \mid s) \propto \exp(\langle \theta, \phi(s,a) \rangle)$. While broadly used and enabling smooth policy updates, softmax PG may require exponentially many iterations to converge in worst-case MDPs with large effective horizon or state space (Li et al., 2021). Convergence can be accelerated by natural gradient approaches.
  • Natural Policy Gradient (NPG): The natural gradient preconditions the update by the Fisher Information Matrix, yielding steps that are invariant to the parameterization and adapt to the geometry of the policy space. The update

$$\tilde{\nabla}_\theta J(\theta) = F_\theta^{-1} \nabla_\theta J(\theta)$$

improves convergence rates and is robust to ill-conditioning (Kämmerer, 2019; Liu et al., 2022). NPG enjoys global convergence under milder assumptions than vanilla PG, provided the Fisher matrix remains non-degenerate; a minimal sketch of the natural-gradient step follows this list.

  • Deterministic Policy Gradient (DPG): Removes intrinsic policy stochasticity and optimizes policies that deterministically map states to actions. Zeroth-Order DPG (ZDPG) can approximate gradients through two-point Q-function evaluations, bypassing the need for a critic network and regaining model-free operation (Kumar et al., 2020).
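The following is a minimal sketch of the natural-gradient step described above, using a damped empirical Fisher matrix built from sampled score vectors. The input format and damping constant are assumptions, and large-scale implementations typically solve the linear system with conjugate gradients rather than a direct solve.

```python
import numpy as np

def natural_gradient_step(theta, score_vectors, returns, alpha=0.05, damping=1e-3):
    """Natural policy gradient step: theta <- theta + alpha * F^{-1} grad J.

    score_vectors: array of shape (N, d) with grad_theta log pi(a_i|s_i)
    returns:       array of shape (N,) with return estimates Q(s_i, a_i)
    Both are assumed to come from sampled trajectories.
    """
    scores = np.asarray(score_vectors)
    q = np.asarray(returns)
    # Vanilla policy gradient estimate (policy gradient theorem)
    grad = (scores * q[:, None]).mean(axis=0)
    # Damped empirical Fisher information matrix F = E[score score^T]
    fisher = scores.T @ scores / len(scores) + damping * np.eye(len(theta))
    nat_grad = np.linalg.solve(fisher, grad)       # F^{-1} grad J
    return theta + alpha * nat_grad
```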

4. Practical Implementation: Safety, Adaptivity, and Exploration

  • Safety and Monotonic Improvement: For applications where safety is paramount (e.g., robotics), safe PG algorithms enforce monotonic improvement constraints of the form $J(\theta_{k+1}) - J(\theta_k) \geq 0$. Adaptive selection of step size and batch size, guided by novel variance bounds, ensures that policy updates are statistically guaranteed not to degrade performance (Papini et al., 2019).
  • Exploration: PG algorithms are local by nature and risk getting stuck in poor optima, especially in sparse-reward or misspecified environments. The Policy Cover PG (PC-PG) algorithm overcomes this by maintaining a policy ensemble (a "cover") and systematically adding exploration bonuses for under-visited state-actions, enabling polynomial sample complexity and strong guarantees under broad forms of misspecification (Agarwal et al., 2020).
  • Adaptive Step-Size Selection: Recent work replaces fixed or oracle-dependent step sizes with line-search procedures (Armijo line search), which adjust to local geometry and observed progress, yielding robust and sometimes faster convergence without requiring knowledge of problem-dependent parameters (Lu et al., 21 May 2024). In practice, exponentially decreasing step sizes in the stochastic setting enable parameter-free convergence at optimal rates; a minimal backtracking line-search sketch follows this list.
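Below is a minimal backtracking (Armijo-type) line search for choosing the step size along an estimated policy gradient, under the assumption that a (possibly noisy) estimate of $J(\theta)$ can be queried. This is a generic sketch, not the exact procedure of the cited work.

```python
import numpy as np

def armijo_step(theta, grad, J, alpha0=1.0, c=1e-4, shrink=0.5, max_backtracks=20):
    """Backtracking line search along the estimated policy gradient.

    J:    callable returning an estimate of the objective J(theta)
    grad: estimated policy gradient at theta
    Accepts the largest tested step satisfying the ascent condition
        J(theta + alpha * grad) >= J(theta) + c * alpha * ||grad||^2.
    """
    j0 = J(theta)
    g_norm_sq = float(grad @ grad)
    alpha = alpha0
    for _ in range(max_backtracks):
        candidate = theta + alpha * grad
        if J(candidate) >= j0 + c * alpha * g_norm_sq:   # sufficient increase
            return candidate, alpha
        alpha *= shrink                                  # otherwise shrink the step
    return theta, 0.0                                    # no acceptable step found
```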

5. Function Approximation and Global Convergence

The introduction of linear or nonlinear function approximation in PG methods, whether for scalability or generalization, raises new theoretical questions. Classic analyses prioritized approximation error (i.e., realizability), but recent studies show that for softmax PG and NPG, global convergence depends on the ordering induced by the feature representation rather than the raw approximation error (Mei et al., 2 Apr 2025). Specifically:

  • For NPG, global convergence is achieved if the projection of the reward onto the feature span preserves the rank of the optimal action (illustrated in the sketch after this list).
  • For softmax PG, convergence depends on a non-domination condition that ensures the existence of a representation-preserving order between actions' scores and their true rewards. These ordering-based conditions imply that careful feature or representation design is crucial for convergence, even when expressivity is limited.
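The rank-preservation condition for NPG can be illustrated in the simplest single-state (bandit) case: project the reward vector onto the span of the features and check whether the optimal action still attains the highest score. The function and example values below are constructed purely for illustration.

```python
import numpy as np

def npg_order_condition_holds(Phi, r):
    """Check the rank-preservation condition in a single-state (bandit) setting.

    Phi: (num_actions, d) feature matrix; r: (num_actions,) true rewards.
    Projects r onto span(Phi) and tests whether the optimal action keeps
    the highest projected score.
    """
    # Orthogonal projection of r onto the column span of Phi
    proj = Phi @ np.linalg.pinv(Phi) @ r
    return int(np.argmax(proj)) == int(np.argmax(r))

# Example: 3 actions, 1-dimensional features that happen to preserve the ordering
Phi = np.array([[1.0], [2.0], [3.0]])
r = np.array([0.1, 0.2, 0.9])
print(npg_order_condition_holds(Phi, r))   # True for this choice of features
```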

6. Extensions and Broader Implications

  • Continuous-Time PG: Extensions to continuous-time and space RL have been developed by reformulating PG as a policy evaluation problem with martingale representations, enabling both offline (trajectory-based) and online (orthogonality-based) algorithms for control scenarios in finance and ergodic LQ problems (Jia et al., 2021).
  • Batch, Deep, and Population-Based Methods: Practical improvements for deep PG include value function search via perturbation populations, which facilitates better escape from local optima and more accurate variance reduction, enhancing gradient reliability and sample efficiency (Marchesini et al., 2023); a rough sketch of the perturbation-population idea follows.
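A rough sketch of the perturbation-population idea behind value function search: perturb the critic's weights, score each candidate on a batch of transitions, and keep the best. The linear critic and TD-error selection criterion are assumptions of this sketch and differ in detail from the cited method.

```python
import numpy as np

def value_function_search(w, batch, num_perturbations=10, sigma=0.05, gamma=0.99):
    """Population-based search over value-function weights (sketch of the idea).

    batch: list of (x, r, x_next, done) transitions with state feature vectors x.
    Generates Gaussian perturbations of the current critic weights w and keeps
    the candidate with the lowest mean squared TD error on the batch.
    """
    def td_error(weights):
        errs = []
        for (x, r, x_next, done) in batch:
            target = r + (0.0 if done else gamma * (x_next @ w))  # target uses current w
            errs.append((target - x @ weights) ** 2)
        return float(np.mean(errs))

    candidates = [w] + [w + sigma * np.random.randn(*w.shape)
                        for _ in range(num_perturbations)]
    return min(candidates, key=td_error)
```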

7. Open Problems and Future Directions

Key challenges for PG methods remain:

  • Efficient Exploration and Credit Assignment: Incorporating global planning or planning-inspired modules (e.g., Monte Carlo Tree Learning) in PG can help overcome learning plateaus and peakiness in partial observation or long-horizon tasks (Morimura et al., 2022).
  • Sample Complexity Optimality: Recent momentum- and Hessian-aided algorithms have improved sample complexity from $O(1/\epsilon^4)$ to $O(1/\epsilon^2)$ under non-degeneracy and smoothness assumptions, with single-loop and memory-efficient implementations (Fatkhullin et al., 2023; Ding et al., 2021).
  • Bridging Actor-Critic and PG: An explicit characterization of the bias between standard actor-critic (AC) updates and the true PG has enabled bias-corrected actor-critic schemes that combine the benefits of both paradigms, improving sample efficiency and final performance (Wen et al., 2021).
  • Learning Deterministic Controllers via Stochastic PG: Under weak gradient domination and with controlled white-noise-based exploration, learning can proceed via standard PG techniques and, at deployment, yield robust deterministic policies with performance and sample complexity guarantees (Montenegro et al., 3 May 2024).

This body of work consolidates policy gradient methods as a flexible theoretical and algorithmic paradigm for direct policy optimization in RL, underpinning robust, efficient, and safe learning in a broad range of modern sequential decision problems.

References (18)