Policy Gradient Optimization

Updated 11 June 2026

Policy gradient optimization is a family of reinforcement learning methods that improve a parameterized stochastic policy by following the gradient of its expected return.
It employs techniques such as variance reduction, actor–critic methods, and trust-region strategies to control the variance of gradient estimates and improve sample efficiency.
Recent advancements include natural gradients, surrogate loss functions, and adaptations for non-Markovian and risk-sensitive decision processes to enhance convergence and stability.

Policy gradient optimization encompasses a foundational family of algorithms in reinforcement learning (RL) that seek to maximize expected return by ascending the gradient of the objective function defined over parameterized stochastic policies. Theoretical, algorithmic, and empirical advancements span from classical on-policy estimators and variance reduction, through trust-region and natural gradients, to sample-efficient and risk-sensitive extensions, as well as applications to non-Markovian environments and structured optimization (Kämmerer, 2019, Kar et al., 11 May 2026, Duan et al., 2023, Lorberbom et al., 2019). This entry systematically covers the principal methodological and theoretical aspects of policy gradient optimization, referencing contemporary analyses and algorithms.

1. Policy-Gradient Objective and Theorems

Consider an episodic Markov Decision Process (MDP) with finite horizon $T$, state space $\mathcal{S}$, action space $\mathcal{A}$, transition kernel $p(s_{t+1}|s_t,a_t)$, and initial state distribution $\mu_0(s_0)$. A parameterized stochastic policy $\pi_\theta(a|s)$ induces a trajectory distribution

$p_\theta(\tau) = \mu_0(s_0) \prod_{t=0}^{T-1} \pi_\theta(a_t|s_t)p(s_{t+1}|s_t,a_t)$

and the canonical RL objective is the expected return

$J(\theta) = \mathbb{E}_{\tau\sim p_\theta} \left[ G_0(\tau) \right], \quad G_0 = \sum_{t=0}^{T-1} \gamma^t r(s_t, a_t)$

The policy-gradient theorem yields

$\nabla_\theta J(\theta) = \mathbb{E}_{\tau\sim p_\theta}\left[ \sum_{t=0}^{T-1} \nabla_\theta\log\pi_\theta(a_t|s_t) G_t \right]$

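with $G_t = \sum_{k=t}^{T-1} \gamma^{k-t}r(s_k,a_k)$. The advantage form,

$\nabla_\theta J(\theta) = \mathbb{E}_{\tau\sim p_\theta}\left[ \sum_{t=0}^{T-1} \nabla_\theta\log\pi_\theta(a_t|s_t) \left( G_t - b(s_t) \right) \right]$

holds for any baseline $b(s_t)$ independent of $a_t$:

$\mathbb{E}_{a_t\sim\pi_\theta(\cdot|s_t)}\left[ \nabla_\theta\log\pi_\theta(a_t|s_t)\, b(s_t) \right] = 0$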
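The zero-mean property of the baseline term follows from the normalization of the policy. As a short derivation, for a fixed state $s_t$, a baseline $b(s_t)$ that does not depend on the action, and a discrete action space (integrals replace sums in the continuous case):

$\mathbb{E}_{a_t\sim\pi_\theta(\cdot|s_t)}\left[ \nabla_\theta\log\pi_\theta(a_t|s_t)\, b(s_t) \right] = b(s_t)\sum_{a}\pi_\theta(a|s_t)\frac{\nabla_\theta\pi_\theta(a|s_t)}{\pi_\theta(a|s_t)} = b(s_t)\,\nabla_\theta\sum_{a}\pi_\theta(a|s_t) = b(s_t)\,\nabla_\theta 1 = 0$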

These identities generalize to non-Markovian settings by incorporating an agent-state recursion and joint differentiation over history-dependent state summarization (Kar et al., 11 May 2026).
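The score-function identity of the policy-gradient theorem can be checked numerically on a problem small enough to enumerate. The sketch below is an illustrative toy, not an algorithm from the cited works: it assumes a hypothetical two-state, two-action MDP with deterministic transitions, horizon $T = 2$, a tabular softmax policy, and $\gamma = 1$ (for which the reward-to-go form is the exact gradient of the undiscounted return). All names (`REWARD`, `policy`, `policy_gradient`, `finite_difference`) are assumptions made for the example.

```python
import math
from itertools import product

N_STATES, N_ACTIONS, HORIZON = 2, 2, 2
REWARD = [[0.0, 1.0], [2.0, -1.0]]  # REWARD[s][a]; hypothetical values


def policy(theta, s):
    """Tabular softmax policy pi_theta(.|s) over the logits theta[s]."""
    m = max(theta[s])
    exps = [math.exp(x - m) for x in theta[s]]
    z = sum(exps)
    return [e / z for e in exps]


def trajectories(theta):
    """Yield (probability, [(s_t, a_t)], [r_t]) for every action sequence from s_0 = 0."""
    for actions in product(range(N_ACTIONS), repeat=HORIZON):
        s, prob, visited, rewards = 0, 1.0, [], []
        for a in actions:
            prob *= policy(theta, s)[a]
            visited.append((s, a))
            rewards.append(REWARD[s][a])
            s = a  # toy deterministic transition: next state equals the action
        yield prob, visited, rewards


def expected_return(theta):
    """J(theta) with gamma = 1, computed by exact enumeration."""
    return sum(prob * sum(rewards) for prob, _, rewards in trajectories(theta))


def policy_gradient(theta):
    """Exact E[ sum_t grad log pi(a_t|s_t) * G_t ] with gamma = 1."""
    grad = [[0.0] * N_ACTIONS for _ in range(N_STATES)]
    for prob, visited, rewards in trajectories(theta):
        for t, (s, a) in enumerate(visited):
            reward_to_go = sum(rewards[t:])
            pi = policy(theta, s)
            for b in range(N_ACTIONS):
                # d log pi(a|s) / d theta[s][b] = 1[a == b] - pi(b|s)
                indicator = 1.0 if a == b else 0.0
                grad[s][b] += prob * (indicator - pi[b]) * reward_to_go
    return grad


def finite_difference(theta, eps=1e-6):
    """Central-difference gradient of J(theta), used only as a reference."""
    grad = [[0.0] * N_ACTIONS for _ in range(N_STATES)]
    for s in range(N_STATES):
        for b in range(N_ACTIONS):
            up = [row[:] for row in theta]
            down = [row[:] for row in theta]
            up[s][b] += eps
            down[s][b] -= eps
            grad[s][b] = (expected_return(up) - expected_return(down)) / (2 * eps)
    return grad


theta = [[0.3, -0.2], [0.1, 0.5]]
pg, fd = policy_gradient(theta), finite_difference(theta)
max_gap = max(abs(pg[s][b] - fd[s][b]) for s in range(N_STATES) for b in range(N_ACTIONS))
print(f"max |score-function - finite-difference| = {max_gap:.2e}")
```

In practice the expectation is replaced by an average over sampled trajectories; enumeration is used here only so that the comparison is deterministic.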

2. Variance Reduction, Baselines, and Critic Augmentation

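The raw policy-gradient estimator typically exhibits high variance. Subtracting any action-independent function $b(s_t)$ from the return leaves the estimator unbiased, and a well-chosen baseline reduces its variance; the optimal (variance-minimizing) baseline equals

$b^*(s) = \frac{\mathbb{E}\left[ \|\nabla_\theta\log\pi_\theta(a_t|s_t)\|^2 \, G_t \mid s_t = s \right]}{\mathbb{E}\left[ \|\nabla_\theta\log\pi_\theta(a_t|s_t)\|^2 \mid s_t = s \right]}$

Practically, $b(s) \approx V^{\pi_\theta}(s)$ (a value function or learned critic) is nearly always used (Kämmerer, 2019).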
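The effect of a baseline on the estimator can be computed exactly in a bandit. The following sketch is an illustration under stated assumptions, not code from the cited works: a hypothetical three-armed bandit with deterministic rewards and a softmax policy, for which the mean and total variance of the single-sample estimator $\nabla_\theta\log\pi_\theta(a)\,(R(a) - b)$ are obtained by enumeration. It compares no baseline, the value baseline $\mathbb{E}[R]$, and the baseline weighted by the squared score norm, which minimizes the total variance among constant baselines. All function names are assumptions.

```python
import math


def softmax(theta):
    m = max(theta)
    exps = [math.exp(x - m) for x in theta]
    z = sum(exps)
    return [e / z for e in exps]


def score(pi, a):
    """Gradient of log pi(a) with respect to the softmax logits: one_hot(a) - pi."""
    return [(1.0 if i == a else 0.0) - p for i, p in enumerate(pi)]


def estimator_stats(theta, rewards, baseline):
    """Exact mean and total variance of g = score(a) * (R(a) - baseline), a ~ pi."""
    pi = softmax(theta)
    mean = [0.0] * len(theta)
    second_moment = 0.0
    for a, p in enumerate(pi):
        g = [s * (rewards[a] - baseline) for s in score(pi, a)]
        mean = [m + p * gi for m, gi in zip(mean, g)]
        second_moment += p * sum(gi * gi for gi in g)
    variance = second_moment - sum(m * m for m in mean)
    return mean, variance


def value_baseline(theta, rewards):
    """b = E[R(a)], the bandit analogue of a state-value baseline."""
    pi = softmax(theta)
    return sum(p * r for p, r in zip(pi, rewards))


def optimal_baseline(theta, rewards):
    """b* = E[||score||^2 R] / E[||score||^2], the variance-minimizing constant."""
    pi = softmax(theta)
    weights = [sum(s * s for s in score(pi, a)) for a in range(len(theta))]
    numerator = sum(p * w * r for p, w, r in zip(pi, weights, rewards))
    denominator = sum(p * w for p, w in zip(pi, weights))
    return numerator / denominator


theta = [0.5, 0.0, -0.5]      # hypothetical logits
rewards = [10.0, 11.0, 12.0]  # hypothetical deterministic rewards
mean_none, var_none = estimator_stats(theta, rewards, 0.0)
mean_value, var_value = estimator_stats(theta, rewards, value_baseline(theta, rewards))
mean_opt, var_opt = estimator_stats(theta, rewards, optimal_baseline(theta, rewards))
print(f"variance: none={var_none:.3f} value={var_value:.3f} optimal={var_opt:.3f}")
```

All three estimators share the same mean, so the comparison isolates the variance; in this example the value baseline already removes most of it, which is consistent with its near-universal use.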

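Actor–critic methods further replace the empirical return $G_t$ by a bootstrapped, parameterized value function or action-value estimator. A canonical temporal-difference (TD) actor–critic update is

$\delta_t = r(s_t,a_t) + \gamma V_\phi(s_{t+1}) - V_\phi(s_t), \quad \theta \leftarrow \theta + \alpha\, \delta_t\, \nabla_\theta\log\pi_\theta(a_t|s_t)$

which, for a critic $V_\phi$ and step size $\alpha$, implements an ascent step along $\nabla_\theta\log\pi_\theta(a_t|s_t)\,\hat{A}_t$ for a critic-estimated advantage $\hat{A}_t = \delta_t$. Critic parameter tuning and stability are significant concerns in deep and online settings (Kämmerer, 2019).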
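A one-step TD actor–critic update can be sketched for the tabular case. This is a minimal illustration under assumptions, not the implementation from any cited work: the policy is a tabular softmax over logits `theta[s]`, the critic is a table `values[s]`, and the TD error serves as the advantage estimate. The function name `actor_critic_step` and the default step sizes are hypothetical.

```python
import math


def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]


def actor_critic_step(theta, values, s, a, r, s_next, done,
                      gamma=0.99, lr_actor=0.1, lr_critic=0.5):
    """Apply one in-place TD(0) actor-critic update and return the TD error."""
    # Bootstrapped target; no bootstrap from a terminal state.
    target = r + (0.0 if done else gamma * values[s_next])
    delta = target - values[s]

    # Actor: ascend delta * grad log pi(a|s), using the pre-update policy.
    pi = softmax(theta[s])
    for b in range(len(pi)):
        indicator = 1.0 if b == a else 0.0
        theta[s][b] += lr_actor * delta * (indicator - pi[b])

    # Critic: move V(s) toward the TD target.
    values[s] += lr_critic * delta
    return delta


theta = [[0.0, 0.0], [0.0, 0.0]]
values = [0.0, 0.0]
delta = actor_critic_step(theta, values, s=0, a=1, r=1.0, s_next=1, done=False, gamma=0.9)
print(delta, theta[0], values[0])
```

A positive TD error raises the logit of the action taken and lowers the others; a negative one does the opposite.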

Recent work has established that a value baseline can guarantee almost-sure global convergence of natural policy gradient in bandit and general MDPs, not due to finite variance (which can remain unbounded), but because the baseline damps overly aggressive parameter updates, preserving sufficient exploration (Mei et al., 2023).

3. Sample Complexity and Advanced Algorithms

Vanilla policy gradient (e.g., REINFORCE) requires many complete-episode samples per update and suffers from high variance. Table 1 summarizes the effect of optimization approach on sample complexity (Kämmerer, 2019):

Method	Typical Episodes per Update	Notes
REINFORCE (Monte Carlo)	$\mathcal{A}$ 1	High variance, unbiased
Actor–Critic (TD, A2C)	$\mathcal{A}$ 2	Biased, lower variance
Natural PG, TRPO, PPO	$\mathcal{A}$ 3	Trust-region stabilization

Sample efficiency is further improved by approaches such as TRPO (Trust-Region Policy Optimization), which solves a constrained optimization to keep KL-divergence from the current policy within a fixed trust region, and PPO (Proximal Policy Optimization), which uses a clipped surrogate objective to limit policy changes per batch (Kämmerer, 2019, Markowitz et al., 2023):

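$L^{\mathrm{CLIP}}(\theta) = \mathbb{E}_t\left[ \min\left( \rho_t(\theta)\hat{A}_t,\ \mathrm{clip}\left(\rho_t(\theta), 1-\epsilon, 1+\epsilon\right)\hat{A}_t \right) \right], \quad \rho_t(\theta) = \frac{\pi_\theta(a_t|s_t)}{\pi_{\theta_{\mathrm{old}}}(a_t|s_t)}$

where $\hat{A}_t$ is an advantage estimate and $\epsilon$ is the clipping range.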

PPO permits multiple SGD passes over each batch, further enhancing efficiency.
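The clipped surrogate can be written in a few lines. The sketch below assumes per-sample log-probabilities under the new and old policies and precomputed advantage estimates; the function name `ppo_clip_objective` and the default `clip_eps=0.2` are assumptions made for the example, and the objective is to be maximized.

```python
import math


def ppo_clip_objective(logp_new, logp_old, advantages, clip_eps=0.2):
    """Mean of min(ratio * A, clip(ratio, 1 - eps, 1 + eps) * A) over a batch."""
    total = 0.0
    for lp_new, lp_old, adv in zip(logp_new, logp_old, advantages):
        ratio = math.exp(lp_new - lp_old)  # pi_new(a|s) / pi_old(a|s)
        clipped = min(max(ratio, 1.0 - clip_eps), 1.0 + clip_eps)
        total += min(ratio * adv, clipped * adv)
    return total / len(advantages)


# Ratio 1.5 with positive advantage: the gain is capped at 1 + eps.
capped = ppo_clip_objective([math.log(1.5)], [0.0], [1.0])
# Ratio 1.5 with negative advantage: the unclipped (more pessimistic) term is kept.
uncapped = ppo_clip_objective([math.log(1.5)], [0.0], [-1.0])
print(capped, uncapped)
```

The asymmetry is the design choice: the objective stops rewarding moves beyond the clipping range, but never hides a move that makes the surrogate worse.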

Recent algorithmic innovations include group relative policy optimization (GRPO), which structures the policy gradient as a U-statistic over groups of samples and achieves asymptotic equivalence to oracle baseline approaches, admitting optimal variance scaling and explicit rules for group/batch sizing (Zhou et al., 1 Mar 2026).
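As commonly described, GRPO scores each sample relative to the other samples drawn for the same prompt or state, so the group itself plays the role of the baseline. The sketch below shows only this group-relative normalization and is not taken from Zhou et al.; the function name, the use of the population standard deviation, and the `eps` stabilizer are assumptions.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one group: (r_i - mean) / (std + eps)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std; an assumption of this sketch
    return [(r - mean) / (std + eps) for r in rewards]


group = [1.0, 0.0, 0.0, 1.0]  # hypothetical rewards for four samples of one prompt
advantages = group_relative_advantages(group)
print(advantages)
```

When every sample in a group receives the same reward, all advantages are zero and the group contributes no gradient signal, which is one reason group size matters.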

Gradient extrapolation-based techniques (GXPO) approximate multi-step lookahead directions efficiently via a small number of backward passes, achieving significant pass@1 and wall-clock speedups in large-model settings (Swapnil et al., 7 May 2026).

4. Natural Gradient, Geometry, and Mirror Descent Generalizations

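The standard (vanilla) policy-gradient step is not geometry-aware. The natural policy gradient, which applies Amari's natural gradient to policy parameters, preconditions the update by the inverse of the Fisher information matrix

$F(\theta) = \mathbb{E}_{s,a\sim\pi_\theta}\left[ \nabla_\theta\log\pi_\theta(a|s)\, \nabla_\theta\log\pi_\theta(a|s)^\top \right]$

$\theta \leftarrow \theta + \alpha\, F(\theta)^{-1} \nabla_\theta J(\theta)$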

Natural gradients correspond to steepest ascent under a local KL-divergence trust region; they improve update stability and often greatly reduce the required number of policy updates (Kämmerer, 2019, Duan et al., 2023). In practice, the Fisher matrix is inverted approximately, using conjugate gradient or Kronecker-factored approximations.
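The conjugate-gradient route avoids forming or inverting the Fisher matrix: only Fisher-vector products are needed. The sketch below is a minimal illustration under assumptions, not the procedure of any cited work: the Fisher matrix is estimated as the average outer product of sampled score vectors plus a small damping term, and the natural-gradient direction solves $F x = g$. The score vectors, the gradient `g`, and all function names are hypothetical.

```python
def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))


def fisher_vector_product(scores, v, damping=1e-3):
    """Compute (mean_i s_i s_i^T + damping * I) v without building the matrix."""
    out = [damping * vi for vi in v]
    for s in scores:
        coeff = dot(s, v) / len(scores)
        out = [oi + coeff * si for oi, si in zip(out, s)]
    return out


def conjugate_gradient(matvec, b, iters=10, tol=1e-10):
    """Solve A x = b for a symmetric positive-definite A given only x -> A x."""
    x = [0.0] * len(b)
    r = list(b)  # residual b - A x for x = 0
    p = list(b)  # search direction
    rs = dot(r, r)
    for _ in range(iters):
        if rs < tol:
            break
        Ap = matvec(p)
        alpha = rs / dot(p, Ap)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, Ap)]
        rs_new = dot(r, r)
        p = [ri + (rs_new / rs) * pi for ri, pi in zip(r, p)]
        rs = rs_new
    return x


scores = [[1.0, 0.0], [0.0, 2.0], [1.0, 1.0]]  # hypothetical sampled score vectors
g = [1.0, 1.0]                                  # hypothetical policy-gradient estimate
natural_direction = conjugate_gradient(lambda v: fisher_vector_product(scores, v), g)
residual = [fi - gi for fi, gi in zip(fisher_vector_product(scores, natural_direction), g)]
print(natural_direction, residual)
```

The damping term keeps the system positive definite when the sampled scores do not span the parameter space; a handful of iterations is typically used rather than an exact solve.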

Generalizations such as Bregman gradient policy optimization (BGPO) use more general divergences (e.g., KL, Wasserstein) to define the trust region, unifying variance-reduced, natural, and mirror-descent-based methods in a principled geometry-aware framework. Sample complexity provably improves to $\tilde{O}(\epsilon^{-3})$ for accelerated variants (Huang et al., 2021, Zhang et al., 2018).

5. Non-Markovian, Adversarial, and Structured Generalizations

Recent developments extend policy gradient optimization to non-Markovian decision processes. For NMDPs, reward depends on the entire interaction history; the Agent State-Markov (ASM) policy class introduces a recursively updated agent state, generalizing the gradient theorem to this setting (Kar et al., 11 May 2026):

$\mathcal{A}$ 8

The Agent State-Markov Policy Gradient (ASMPG) algorithm leverages the recursive agent state for efficient optimization and demonstrates strong empirical performance versus predictive-objective baselines.

In competitive and zero-sum settings, competitive policy optimization (CoPO) employs bilinear surrogates in place of standard linear approximations, attaining stable convergence to Nash equilibria via updates that account for player interactions and trust-region constraints (Prajapat et al., 2020).

Policy gradient optimization has also been extended to broader structured and risk-sensitive objectives, such as robust Bayesian risk via dual representations and risk measures (Wang et al., 19 Sep 2025), and to combinatorial binary optimization by defining RL-style policy updates over mean-field product distributions and employing parallel MCMC sampling (Chen et al., 2023).

6. Surrogate Losses, Unified Perspectives, and Practical Implementation

Modern approaches frequently optimize surrogate objectives, which unify a large class of approximate and regularized updates. For instance, TRPO employs a KL divergence trust region, PPO applies clipping, and clipped-objective policy gradient (COPG) uses log-policy clipping, which is "pessimistic," promoting enhanced exploration and stable learning (Markowitz et al., 2023).

A generalized framework parameterizes gradient updates along "form" and "scale" axes, enabling recovery and interpolation between classic RL updates (PG, Q-learning), likelihood maximization, self-imitation, and several new variants (Gummadi et al., 2022). This approach exploits functional forms for scaling temporal difference errors and policy ratios, supporting improved empirical and theoretical properties.

In bandit and Bayesian settings, the meta-parameters of a sampling policy (e.g., in Thompson sampling) can be tuned by policy gradient, which can significantly improve cumulative regret when combined with variance-reduced score-function estimators and Rao-Blackwellization (Min et al., 2020).

7. Theoretical Properties and Convergence Guarantees

For smooth $J(\theta)$, global convergence to a stationary point is achieved with appropriately decaying learning rates, and finite-time convergence rates exist for the mean-square gradient norm (Kar et al., 11 May 2026). For convex/quadratic (LQR/SOF) control problems, local and global convergence rates are established, with nearly dimension-free iteration complexity (Duan et al., 2023, Han et al., 2023). The role of baselines in ensuring $O(1/t)$ convergence and sufficient exploration is theoretically established in both bandit and MDP contexts (Mei et al., 2023).

Risk-sensitive objectives, including mean-variance and smooth coherent risk measures, have stationary-point convergence guarantees using smoothed functional estimators, with on- and off-policy template algorithms (Vijayan et al., 2022). For Blackwell-optimal policy gradient, bi-level optimization (maximize average reward, then bias) is achieved by log-barrier methods with natural-gradient preconditioning (Dewanto et al., 2021).

Optimization by continuation provides a new conceptual viewpoint: stochastic policies with entropy regularization or Gaussian noise perform implicit smoothing of the underlying deterministic return landscape, analogous to graduated (continuation) optimization in non-convex problems, justifying the effectiveness of standard exploration and schedule heuristics in policy-gradient updates (Bolland et al., 2023).
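The smoothing effect can be made concrete with a toy example (an assumed illustration, not taken from Bolland et al.). Consider a one-step problem with deterministic return $f(a) = -a^2 + 2\cos(5a)$ and a Gaussian policy $a = \mu + \sigma\varepsilon$ with $\varepsilon\sim\mathcal{N}(0,1)$. Using $\mathbb{E}[(\mu+\sigma\varepsilon)^2] = \mu^2 + \sigma^2$ and $\mathbb{E}[\cos(k(\mu+\sigma\varepsilon))] = e^{-k^2\sigma^2/2}\cos(k\mu)$, the expected return has the closed form

$J(\mu,\sigma) = \mathbb{E}_{\varepsilon}\left[ f(\mu+\sigma\varepsilon) \right] = -\mu^2 - \sigma^2 + 2e^{-12.5\sigma^2}\cos(5\mu)$

For $\sigma = 1$ the oscillatory term is scaled by $e^{-12.5}\approx 3.7\times 10^{-6}$, so $J(\mu,1)$ is essentially the concave quadratic $-\mu^2 - 1$ with a single maximizer near $\mu = 0$. As $\sigma\to 0$ the local optima of $f$ reappear, so gradient ascent on $\mu$ while $\sigma$ is annealed from large to small behaves like a continuation method in this example.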

References:

(Kämmerer, 2019) On Policy Gradients
(Kar et al., 11 May 2026) Policy Gradient Methods for Non-Markovian Reinforcement Learning
(Duan et al., 2023) Optimization Landscape of Policy Gradient Methods for Discrete-time Static Output Feedback
(Lorberbom et al., 2019) Direct Policy Gradients: Direct Optimization of Policies in Discrete Action Spaces
(Markowitz et al., 2023) Clipped-Objective Policy Gradients for Pessimistic Policy Optimization
(Mei et al., 2023) The Role of Baselines in Policy Gradient Optimization
(Zhou et al., 1 Mar 2026) Demystifying Group Relative Policy Optimization: Its Policy Gradient is a U-Statistic
(Swapnil et al., 7 May 2026) Gradient Extrapolation-Based Policy Optimization
(Vijayan et al., 2022) A policy gradient approach for optimization of smooth risk measures
(Huang et al., 2021) Bregman Gradient Policy Optimization
(Wang et al., 19 Sep 2025) Policy Gradient Optimzation for Bayesian-Risk MDPs with General Convex Losses
(Han et al., 2023) Policy Gradient Converges to the Globally Optimal Policy for Nearly Linear-Quadratic Regulators
(Bolland et al., 2023) Policy Gradient Algorithms Implicitly Optimize by Continuation
(Prajapat et al., 2020) Competitive Policy Optimization
(Min et al., 2020) Policy Gradient Optimization of Thompson Sampling Policies
(Gummadi et al., 2022) A Parametric Class of Approximate Gradient Updates for Policy Optimization
(Chen et al., 2023) Monte Carlo Policy Gradient Method for Binary Optimization
(Zhang et al., 2018) Policy Optimization as Wasserstein Gradient Flows
(Dewanto et al., 2021) A nearly Blackwell-optimal policy gradient method