Policy Gradient Methods

Updated 15 April 2026

Policy gradient methods are a class of optimization algorithms that directly adjust parameterized policies in Markov decision processes (MDPs) by gradient ascent on the expected return.
Algorithms such as projected policy gradient (PPG), softmax policy gradient (PG), and natural policy gradient (NPG) admit global linear or sublinear convergence guarantees, depending on the parameterization.
Recent advances incorporate entropy regularization, risk-sensitive extensions, and variance reduction methods to improve sample efficiency and stability.

Policy gradient methods are a class of optimization algorithms designed to optimize parameterized policies in Markov decision processes (MDPs) and related sequential decision problems by gradient ascent on the expected return. They exploit the differentiability of a policy $\pi_\theta$ and compute, or estimate, the gradient of the expected performance $J(\theta)$ with respect to $\theta$, which makes them compatible with high-dimensional and continuous action spaces. Modern analysis provides global convergence guarantees for exact-gradient policy gradient algorithms in tabular and certain structured domains, and establishes foundational results for both classical and risk-sensitive or robust extensions.

1. Fundamental Definitions and Formulation

Consider a finite discounted MDP $(\mathcal S, \mathcal A, P, r, \gamma)$, where $\mathcal S$ is the state set, $\mathcal A$ the action set, $P(s'|s,a)$ the transition kernel, $r(s,a)$ the bounded immediate reward, and $\gamma\in[0,1)$ the discount factor. A stationary Markov policy $\pi$ is a mapping $\pi: \mathcal S \to \Delta(\mathcal A)$, where $\Delta(\mathcal A)$ denotes the probability simplex over actions. The policy performance objective is

$J(\theta) = V^{\pi_\theta}(\mu) = \mathbb{E}_{s_0\sim\mu}\left[V^{\pi_\theta}(s_0)\right],$

where $V^{\pi}(s) = \mathbb{E}\left[\sum_{t=0}^{\infty}\gamma^t r(s_t,a_t) \mid s_0 = s,\ \pi\right]$ is the value function, and $\mu$ is the initial state distribution. The canonical goal is

$\max_{\theta} J(\theta),$

for which the policy gradient approach seeks to compute or estimate $\nabla_\theta J(\theta)$, the gradient of the expected return with respect to the differentiable parameters $\theta$ underlying the policy $\pi_\theta$.

Two widely used policy parameterizations are:

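Direct (simplex): For each $s\in\mathcal S$, the action probabilities $\pi(\cdot|s)$, constrained to the probability simplex $\Delta(\mathcal A)$, are treated as free variables.
Softmax: With $\theta\in\mathbb R^{|\mathcal S|\times|\mathcal A|}$,

$\pi_\theta(a|s) = \frac{\exp(\theta_{s,a})}{\sum_{a'\in\mathcal A}\exp(\theta_{s,a'})}.$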

Updates and theoretical results differ according to parameterization (Liu et al., 2024).
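The softmax parameterization can be illustrated with a few lines of Python. This is a minimal sketch for a single state; the function name `softmax_policy` and the logit values are illustrative assumptions, not taken from the cited work.

```python
import math

def softmax_policy(theta_s):
    """Map one state's logits theta[s, :] to action probabilities."""
    m = max(theta_s)  # subtracting the max avoids overflow, result is unchanged
    exps = [math.exp(t - m) for t in theta_s]
    z = sum(exps)
    return [e / z for e in exps]

# Hypothetical logits for one state with three actions.
probs = softmax_policy([1.0, 2.0, 3.0])
# Adding a constant to every logit leaves the policy unchanged.
shifted = softmax_policy([1001.0, 1002.0, 1003.0])
```

Every softmax policy lies in the interior of the simplex, whereas the direct parameterization can also represent deterministic policies on its boundary.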

2. Core Policy Gradient Algorithms and Update Rules

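The standard policy gradient theorem states: $\nabla_\theta J(\theta) = \frac{1}{1-\gamma}\,\mathbb{E}_{s\sim d_\mu^{\pi_\theta}}\,\mathbb{E}_{a\sim\pi_\theta(\cdot|s)}\left[\nabla_\theta \log\pi_\theta(a|s)\,A^{\pi_\theta}(s,a)\right]$, where $d_\mu^{\pi_\theta}$ is the $\gamma$-discounted state-occupancy measure and $A^{\pi_\theta}(s,a) = Q^{\pi_\theta}(s,a) - V^{\pi_\theta}(s)$ is the (state, action) advantage function.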
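For the tabular softmax parameterization, the theorem reduces to the closed form $\partial J/\partial\theta_{s,a} = \frac{1}{1-\gamma}\, d_\mu^{\pi_\theta}(s)\,\pi_\theta(a|s)\,A^{\pi_\theta}(s,a)$. The sketch below evaluates this expression exactly on a small MDP. The two-state, two-action MDP, its numbers, and all function names are hypothetical and chosen only for illustration.

```python
import math

GAMMA = 0.9
# Hypothetical MDP: P[s][a] is a distribution over next states, R[s][a] a reward.
P = [[[0.9, 0.1], [0.2, 0.8]],
     [[0.7, 0.3], [0.1, 0.9]]]
R = [[1.0, 0.0],
     [0.0, 2.0]]
MU = [0.5, 0.5]  # initial state distribution

def softmax(row):
    m = max(row)
    e = [math.exp(x - m) for x in row]
    z = sum(e)
    return [x / z for x in e]

def evaluate(pi, iters=2000):
    """Iterative policy evaluation; returns (V, Q) for policy pi."""
    n = len(P)
    V = [0.0] * n
    for _ in range(iters):
        Q = [[R[s][a] + GAMMA * sum(P[s][a][t] * V[t] for t in range(n))
              for a in range(len(R[s]))] for s in range(n)]
        V = [sum(pi[s][a] * Q[s][a] for a in range(len(R[s]))) for s in range(n)]
    return V, Q

def occupancy(pi, iters=2000):
    """Discounted state occupancy d(s) = (1 - gamma) * sum_t gamma^t Pr(s_t = s)."""
    n = len(P)
    dist, d, w = MU[:], [0.0] * n, 1.0 - GAMMA
    for _ in range(iters):
        d = [d[s] + w * dist[s] for s in range(n)]
        dist = [sum(dist[s] * pi[s][a] * P[s][a][t]
                    for s in range(n) for a in range(len(R[s])))
                for t in range(n)]
        w *= GAMMA
    return d

def J(theta):
    """Expected return J(theta) = sum_s mu(s) V(s)."""
    pi = [softmax(row) for row in theta]
    V, _ = evaluate(pi)
    return sum(MU[s] * V[s] for s in range(len(MU)))

def policy_gradient(theta):
    """Exact gradient dJ/dtheta[s][a] = d(s) * pi(a|s) * A(s,a) / (1 - gamma)."""
    pi = [softmax(row) for row in theta]
    V, Q = evaluate(pi)
    d = occupancy(pi)
    return [[d[s] * pi[s][a] * (Q[s][a] - V[s]) / (1.0 - GAMMA)
             for a in range(len(theta[s]))] for s in range(len(theta))]

theta = [[0.3, -0.2], [0.1, 0.4]]
grad = policy_gradient(theta)
```

A central finite difference of `J` is a convenient sanity check for an implementation like this. Each row of the gradient sums to zero because the advantage has zero mean under the policy.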

Key instantiated algorithms:

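Projected Policy Gradient (PPG) (direct/simplex): $\pi^{k+1}(\cdot|s) = \mathrm{Proj}_{\Delta(\mathcal A)}\left(\pi^k(\cdot|s) + \eta_s^k\,Q^{k}(s,\cdot)\right)$ with $\eta_s^k = \eta\, d_\mu^{k}(s)/(1-\gamma)$, step size $\eta>0$, where $Q^k = Q^{\pi^k}$ and $d_\mu^k = d_\mu^{\pi^k}$.
Softmax Policy Gradient (PG): $\pi^{k+1}(a|s) = \pi^k(a|s)\exp\left(\eta_s^k\,\pi^k(a|s)\,A^{k}(s,a)\right)/Z_s^k$, where $A^{k} = A^{\pi^k}$ is the current advantage function, and $Z_s^k$ normalizes.
Natural Policy Gradient (NPG) (softmax): $\pi^{k+1}(a|s) \propto \pi^k(a|s)\exp\left(\frac{\eta}{1-\gamma}A^{k}(s,a)\right)$, arising from a mirror-ascent update in the Kullback-Leibler divergence geometry.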
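The three updates can be compared on a single state. The sketch below assumes the commonly used tabular forms: PPG projects $\pi(\cdot|s) + \eta_s Q(s,\cdot)$ onto the simplex, softmax PG multiplies $\pi(a|s)$ by $\exp(\eta_s\,\pi(a|s)\,A(s,a))$ and renormalizes, and NPG multiplies by $\exp(\eta A(s,a)/(1-\gamma))$ and renormalizes. The function names, the per-state step size `eta_s`, and the numerical values are illustrative assumptions.

```python
import math

def project_simplex(v):
    """Euclidean projection of v onto the probability simplex (sort-based)."""
    u = sorted(v, reverse=True)
    cumsum, tau = 0.0, 0.0
    for j, uj in enumerate(u, start=1):
        cumsum += uj
        t = (cumsum - 1.0) / j
        if uj - t > 0:
            tau = t  # threshold from the largest index that still satisfies the test
    return [max(x - tau, 0.0) for x in v]

def normalize(w):
    z = sum(w)
    return [x / z for x in w]

def ppg_step(pi_s, q_s, eta_s):
    """Projected policy gradient step for one state (direct parameterization)."""
    return project_simplex([p + eta_s * q for p, q in zip(pi_s, q_s)])

def softmax_pg_step(pi_s, adv_s, eta_s):
    """Softmax PG step written in policy space for one state."""
    return normalize([p * math.exp(eta_s * p * a) for p, a in zip(pi_s, adv_s)])

def npg_step(pi_s, adv_s, eta, gamma):
    """Natural policy gradient step (multiplicative-weights form) for one state."""
    return normalize([p * math.exp(eta * a / (1.0 - gamma))
                      for p, a in zip(pi_s, adv_s)])

# Hypothetical current policy and action values at one state.
pi_s = [0.5, 0.3, 0.2]
q_s = [1.0, 2.0, 0.5]
v_s = sum(p * q for p, q in zip(pi_s, q_s))
adv_s = [q - v_s for q in q_s]

ppg = ppg_step(pi_s, q_s, eta_s=0.1)
spg = softmax_pg_step(pi_s, adv_s, eta_s=1.0)
npg = npg_step(pi_s, adv_s, eta=0.1, gamma=0.9)
```

In this example all three steps move probability mass toward the action with the largest advantage. The softmax PG step is damped by the factor $\pi(a|s)$ in the exponent, which the NPG step does not have.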

Entropy-regularized variants introduce explicit entropy terms, modifying both objective and update rules, e.g.,

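$V_\tau^{\pi}(\mu) = \mathbb{E}\left[\sum_{t=0}^{\infty}\gamma^t\left(r(s_t,a_t) - \tau\log\pi(a_t|s_t)\right) \mid s_0\sim\mu,\ \pi\right], \qquad \tau>0,$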

leading to softmax-PG and softmax-NPG steps with entropy terms in the exponent (Liu et al., 2024).

3. Non-Asymptotic Convergence Results

Theoretical understanding of policy gradient methods has progressed to include sharp iteration complexity and convergence rates for both direct and softmax parameterizations, with or without regularization.

PPG (direct-simplex): Global linear convergence for any constant step size:

$V^{*}(\mu) - V^{\pi^k}(\mu) \le C\,\rho^{k}, \qquad \rho\in(0,1),\ C>0.$

Explicit expressions for the contraction constant $\rho$ are given in terms of problem and algorithm parameters.

Softmax PG: Sublinear, $O(1/k)$ rate for constant step size; matching lower bounds confirm tightness.
Softmax NPG: Global linear convergence in $\ell_\infty$ norm, depending on policy support and step size, and sublinear $O(1/k)$ bounds.
Entropy-regularized PG/NPG: Global linear rates over a broad step size regime.
Soft Policy Iteration: Local quadratic convergence for entropy-regularized setting without requiring optimal-policy stationary-distribution assumptions.

Global convergence and rate statements are grounded in direct performance-difference lemmas, state-wise improvement guarantees, and contraction analyses for the Bellman operator and its regularized variants.
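Soft policy iteration in the entropy-regularized setting can be sketched as follows. The code assumes the standard form: evaluate the regularized values $Q_\tau^{\pi}$ and $V_\tau^{\pi}$, then set $\pi(a|s) \propto \exp(Q_\tau^{\pi}(s,a)/\tau)$. The MDP, the temperature `TAU`, and the function names are hypothetical.

```python
import math

GAMMA, TAU = 0.8, 0.5
# Hypothetical 2-state, 2-action MDP.
P = [[[0.9, 0.1], [0.2, 0.8]],
     [[0.7, 0.3], [0.1, 0.9]]]
R = [[1.0, 0.0],
     [0.0, 2.0]]

def soft_evaluate(pi, iters=200):
    """Entropy-regularized policy evaluation; returns (V_tau, Q_tau)."""
    n = len(P)
    V = [0.0] * n
    for _ in range(iters):
        Q = [[R[s][a] + GAMMA * sum(P[s][a][t] * V[t] for t in range(n))
              for a in range(2)] for s in range(n)]
        V = [sum(pi[s][a] * (Q[s][a] - TAU * math.log(pi[s][a]))
                 for a in range(2)) for s in range(n)]
    return V, Q

def soft_improve(Q):
    """Soft greedy step: pi(a|s) proportional to exp(Q(s,a) / tau)."""
    pi = []
    for q in Q:
        m = max(q)
        e = [math.exp((x - m) / TAU) for x in q]
        z = sum(e)
        pi.append([x / z for x in e])
    return pi

def soft_policy_iteration(outer=100):
    pi = [[0.5, 0.5], [0.5, 0.5]]
    for _ in range(outer):
        _, Q = soft_evaluate(pi)
        pi = soft_improve(Q)
    V, Q = soft_evaluate(pi)
    return pi, V, Q

pi, V, Q = soft_policy_iteration()
# Residual of the soft Bellman optimality equation V(s) = tau * log sum_a exp(Q(s,a)/tau).
residual = max(abs(V[s] - TAU * math.log(sum(math.exp(q / TAU) for q in Q[s])))
               for s in range(2))
```

The residual measures how far the returned policy is from the regularized optimum. This sketch does not attempt to measure the local quadratic rate itself.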

4. Extensions: Structured Policies, Risk, and Robustness

Standard softmax parametrization fails to exploit order structure in action spaces. An ordinal regression-based parametrization, with monotonic cutpoints and a single state scoring function, yields

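$\pi_\theta(a = k \mid s) = \sigma\left(c_k - f_\theta(s)\right) - \sigma\left(c_{k-1} - f_\theta(s)\right), \qquad k = 1,\dots,K,$

with cutpoints $c_1 < c_2 < \dots < c_{K-1}$ enforced to be strictly increasing (and $c_0 = -\infty$, $c_K = +\infty$), where $\sigma$ is the logistic function and $f_\theta$ the state scoring function. Policy gradient updates split across neural network parameters and cutpoint increments, with closed-form derivations for each (Weinberger et al., 23 Jun 2025).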
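One common way to realize such a policy is the cumulative-link (ordered logit) model, in which the probability of action $k$ is the difference of two logistic functions evaluated at adjacent cutpoints. The sketch below assumes that form. Building the cutpoints from exponentiated increments is one possible way to keep them strictly increasing; the function names and that choice are assumptions rather than the cited paper's construction.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def cutpoints_from_increments(c1, raw_increments):
    """Strictly increasing cutpoints via c_{k+1} = c_k + exp(raw_k) (assumed scheme)."""
    cuts = [c1]
    for d in raw_increments:
        cuts.append(cuts[-1] + math.exp(d))
    return cuts

def ordinal_policy(score, cuts):
    """Action probabilities from a scalar state score and K-1 ordered cutpoints."""
    # Cumulative probabilities P(a <= k | s), padded with 0 and 1 at the ends.
    cdf = [0.0] + [sigmoid(c - score) for c in cuts] + [1.0]
    return [cdf[k + 1] - cdf[k] for k in range(len(cdf) - 1)]

# Hypothetical example: four ordered actions, score f(s) = 0.2.
cuts = cutpoints_from_increments(-1.0, [0.0, 0.5])
probs = ordinal_policy(0.2, cuts)
```

Raising the score shifts probability mass monotonically toward higher-ranked actions, which is the order structure that a plain softmax over independent logits does not encode.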

Experiments demonstrate faster, lower-variance, and more stable learning in ordered discrete or discretized continuous domains, compared to softmax policies. These approaches inherit both global convergence guarantees and sample efficiency improvements attributable to the structural prior.

Robust Policy Gradients

For robust MDPs with rectangular uncertainty in transitions and/or rewards, closed-form policy gradient algorithms have been developed that efficiently compute gradients with respect to the worst-case occupation measure and value functions (Kumar et al., 2023, Wang et al., 2024). Explicit algorithms such as Double-Loop Robust Policy Mirror Descent combine an inner adversarial optimization over transitions (via mirror ascent) with an outer policy optimization (mirror descent), achieving provable global convergence for both direct and softmax policy classes.

Risk-Sensitive and Distributional Extensions

Policy gradients have been generalized to coherent risk measures (Tamar et al., 2015) and distortion risk measures (Vijayan et al., 2021). These extensions require only modest modifications to standard likelihood-ratio gradient estimators; in the dynamic case, they add a critic that evaluates a robust Bellman equation within an actor-critic structure. Non-asymptotic convergence to stationary points has been proven for distortion risk objectives.

5. Sample Efficiency, Off-Policy, and Variance Reduction

A longstanding challenge for policy gradient methods is the high variance of gradient estimators and the poor sample efficiency, especially under long horizons (Kämmerer, 2019). Recent advances include:

Parameter-based Exploration (PGPE): Policy randomness is transferred to parameter draws at the episode level, rendering within-episode gradient estimates deterministic and reducing variance accumulation. Efficient off-policy reuse is enabled by importance sampling over hyperparameters, with analytically optimal baselines to minimize estimator variance (Zhao et al., 2013).
Off-policy Control via PG: Incorporation of policy-gradient terms into off-policy mean-squared projected Bellman error (MSPBE) optimization enables provably convergent incremental updates for adapting policies in control (not just evaluation) settings (Lehnert et al., 2015).
Reward Profiling: Adaptively accepting candidate policy updates only if performance improvement is statistically significant, as measured by Monte-Carlo rollouts, stabilizes learning and yields up to $1.5\times$ faster convergence and $1.75\times$ variance reduction across continuous-control benchmarks (Ahmed et al., 20 Nov 2025).
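The parameter-based exploration idea from the list above can be sketched in a few lines. This is a simplified illustration, not the estimator of the cited work: it uses a scalar controller parameter, a fixed exploration width `sigma`, a plain batch-mean baseline instead of the analytically optimal one, and a hypothetical deterministic return function.

```python
import random

def episode_return(theta):
    """Hypothetical deterministic return of one rollout with controller parameter theta."""
    return -(theta - 3.0) ** 2

def pgpe(mu=0.0, sigma=0.5, lr=0.05, batch=20, iters=200, seed=0):
    """Gradient ascent on the mean of a Gaussian search distribution over parameters."""
    rng = random.Random(seed)
    for _ in range(iters):
        # Randomness enters once per episode, through the parameter draw.
        thetas = [rng.gauss(mu, sigma) for _ in range(batch)]
        returns = [episode_return(t) for t in thetas]
        baseline = sum(returns) / batch  # simple mean baseline (assumption)
        # Likelihood-ratio gradient with respect to mu: (theta - mu) / sigma^2.
        grad = sum((t - mu) / sigma ** 2 * (r - baseline)
                   for t, r in zip(thetas, returns)) / batch
        mu += lr * grad
    return mu

mu_final = pgpe()
```

Because each rollout is deterministic given its sampled parameter, the only sampling noise in the gradient estimate comes from the parameter draws. In this toy problem the mean `mu` drifts toward the maximizer of the return at 3.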

6. Policy Gradients Beyond Standard MDPs

Recent developments extend policy gradient algorithms to broader and more challenging domains:

Average-reward and bi-level gain-bias optimization: Nearly-Blackwell-optimal policy gradients achieve fine control over both steady-state ("gain") and transient ("bias") performance in continuing tasks, with associated natural gradient preconditioning and logarithmic-barrier optimization for constraint enforcement (Dewanto et al., 2021).
Structured POMDPs and partial observability: Novel identification and min-max estimation techniques enable offline, nonparametric policy gradient optimization over history-dependent stochastic policies in confounded POMDPs, with uniform finite-sample error bounds and global convergence (Hong et al., 2023).
Binary and combinatorial optimization: Policy gradient approaches employing mean-field and MCMC-based variance reduction are now directly applied to large-scale binary optimization via KL divergence minimization and filter-based local search, with convergence rates linked to MCMC mixing and variance of gradient estimators (Chen et al., 2023).
Multi-agent games and Nash equilibria: Regularized and iteratively referenced policy gradient schemes (NashPG) provide monotonic, robust, and scalable convergence in two-player zero-sum games, attaining state-of-the-art exploitability and Elo benchmarks on extensive-form domains (Yu et al., 21 Oct 2025).

7. Analysis Techniques and Implications

The convergence theory for policy gradient methods depends critically on several foundational analytic tools:

Performance difference lemmas enable precise control of incremental improvement.
State-wise improvement bounds yield strong contraction and rate results even when policies or value functions are not updated globally.
Covariance and variational identities clarify the geometry of the update steps and connect standard and natural gradient algorithms.
Natural gradient preconditioning (via Fisher information) is both theoretically beneficial (ensuring mirror ascent geometry) and practically effective (yielding larger step-size stability without sacrificing convergence).
Comparisons to policy iteration and soft policy iteration yield quantifiable performance gaps, and entropy regularization widens the regime of step sizes supporting monotone improvement (Liu et al., 2024).
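As a concrete instance of the first tool, the standard performance difference lemma relates the values of two policies $\pi'$ and $\pi$. It is stated here in this article's notation, with $d_\mu^{\pi'}$ the discounted state-occupancy measure of $\pi'$ and $A^{\pi}$ the advantage function of $\pi$:

$V^{\pi'}(\mu) - V^{\pi}(\mu) = \frac{1}{1-\gamma}\,\mathbb{E}_{s\sim d_\mu^{\pi'}}\,\mathbb{E}_{a\sim\pi'(\cdot|s)}\left[A^{\pi}(s,a)\right].$

Because the outer expectation is taken under the occupancy measure of the new policy, an update whose expected advantage is non-negative in every state cannot decrease the value. This is the mechanism behind the state-wise improvement bounds listed above.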

Collectively, these advances establish policy gradient methods as a principal tool for exact and approximate solution of sequential decision problems across standard, robust, risk-sensitive, and multi-agent settings, underpinned by non-asymptotic rate, sample complexity, and robustness guarantees.