Policy Gradient Methods

Updated 7 June 2026

Policy gradient methods are algorithms that directly optimize stochastic policies by computing gradients to maximize cumulative rewards in decision-making tasks.
They employ techniques like baseline subtraction, actor-critic frameworks, and natural gradients to improve sample efficiency and reduce variance.
Extensions include operator-based views, robust and risk-sensitive formulations, and second-order updates, addressing challenges in continuous and non-Markovian environments.

Policy gradient methods are a class of algorithms for solving sequential decision-making problems by directly optimizing the parameters of a stochastic policy with respect to a measure of cumulative expected return. Distinct from value-based or indirect policy improvement schemes, policy gradient methods employ first-order or higher-order information to perform gradient ascent in the space of policy parameters, enabling their application to high-dimensional or continuous action spaces and to scenarios where policies must remain stochastic for optimality or efficient exploration.

1. Formal Foundations and Policy Gradient Theorem

Let $(\mathcal{S}, \mathcal{A}, P, r, \gamma, \rho)$ denote a Markov decision process (MDP) with initial state distribution $\rho$, and let $\pi_\theta(a\mid s)$, $\theta\in\mathbb{R}^d$, be a parameterized stochastic policy. The goal is to maximize the (discounted or undiscounted) expected return,

$J(\theta) = \mathbb{E}_{\pi_\theta} \left[ \sum_{t = 0}^\infty \gamma^t r(s_t, a_t) \right].$

The classical Policy Gradient Theorem states that under mild regularity conditions,

$\nabla_\theta J(\theta) = \mathbb{E}_{s \sim d_\rho^{\pi_\theta},\,a \sim \pi_\theta(\cdot|s)}\left[ \nabla_\theta \log \pi_\theta(a|s)\,Q^{\pi_\theta}(s, a) \right],$

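where $d_\rho^{\pi_\theta}(s)$ is the discounted state visitation distribution induced by $\pi_\theta$ from the initial state distribution $\rho$, and $Q^{\pi_\theta}$ is the action-value function under $\pi_\theta$ (Cen et al., 2023, Kämmerer, 2019). When $d_\rho^{\pi_\theta}$ is normalized to a probability distribution, the right-hand side carries an additional constant factor $1/(1-\gamma)$, which does not change the ascent direction.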

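In practice, $Q^{\pi_\theta}(s, a)$ is replaced by Monte Carlo estimates from rollouts or by bootstrapped approximations; the latter lead to actor–critic variants with auxiliary value functions. Subtracting a state-dependent baseline $b(s)$ from $Q^{\pi_\theta}(s, a)$ is ubiquitous for variance reduction and introduces no bias, because $\mathbb{E}_{a \sim \pi_\theta(\cdot|s)}\left[\nabla_\theta \log \pi_\theta(a|s)\right] = 0$ for every state $s$ (Kämmerer, 2019).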
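The score-function form of the gradient and the effect of a baseline can be checked numerically on a one-step problem. The sketch below is only an illustration under stated assumptions: a three-armed bandit with a softmax policy, where the names `logits`, `rewards`, and all helper functions are hypothetical and not taken from the cited works. It computes the exact expectation of the score-function estimator with and without a baseline and compares both to a finite-difference gradient of the expected return.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def expected_return(logits, rewards):
    """J(theta) = sum_a pi_theta(a) * r(a) for a one-step (bandit) problem."""
    return sum(p * r for p, r in zip(softmax(logits), rewards))

def score_function_gradient(logits, rewards, baseline=0.0):
    """Exact E_a[ grad log pi(a) * (r(a) - baseline) ] for a softmax policy."""
    probs = softmax(logits)
    grad = [0.0] * len(logits)
    for a, (p_a, r_a) in enumerate(zip(probs, rewards)):
        for k in range(len(logits)):
            # For softmax: d/d theta_k log pi(a) = 1[k == a] - pi(k).
            dlog = (1.0 if k == a else 0.0) - probs[k]
            grad[k] += p_a * dlog * (r_a - baseline)
    return grad

def finite_difference_gradient(logits, rewards, eps=1e-6):
    """Central finite differences of J, used only as a reference."""
    grad = []
    for k in range(len(logits)):
        up = list(logits)
        up[k] += eps
        down = list(logits)
        down[k] -= eps
        diff = expected_return(up, rewards) - expected_return(down, rewards)
        grad.append(diff / (2 * eps))
    return grad

# Hypothetical parameters and rewards, chosen only for illustration.
logits = [0.2, -0.5, 1.0]
rewards = [1.0, 0.0, 0.5]

g_plain = score_function_gradient(logits, rewards)
g_base = score_function_gradient(
    logits, rewards, baseline=expected_return(logits, rewards)
)
g_fd = finite_difference_gradient(logits, rewards)
```

All three gradients agree up to finite-difference error. With sampled actions instead of exact expectations, the baseline version would typically have lower variance while keeping the same mean.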

2. Algorithmic Methodologies and Operator Perspectives

Policy gradient algorithms can be interpreted through operator-theoretic lenses:

Policy improvement operator $\mathcal{I}$ (acting to increase performance via local or global updates in policy space).
Projection operator $\mathcal{P}$ (mapping improved policies back into the parameterized or feasible set) (Ghosh et al., 2020).

The core policy update alternates between improvement (e.g., by weighting the policy by its returns or advantage) and projection (e.g., via KL-regularization, softmax, or trust-region constraints). REINFORCE and PPO are recoverable as operator compositions, where PPO involves additional clipping or entropy-regularized surrogates (Ghosh et al., 2020).

The operator view bridges policy-based and value-based methods by exposing $\alpha$-divergence interpolations. For instance, classical REINFORCE corresponds to a smooth ($\alpha = 1$) improvement path, while value-based greedy methods correspond to the limit $\alpha \to 0$, forming a continuum encompassing both paradigms (Ghosh et al., 2020).
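The alternation of improvement and projection can be sketched for a single state with a tabular softmax policy. Everything in this example is an assumption made for illustration: the improvement step reweights the policy by $Q^{1/\alpha}$ (which requires positive action values), the projection is the exact KL projection onto a full softmax table, and the names `improve`, `project_to_softmax`, and `alpha` are hypothetical rather than the notation of the cited work.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def improve(policy, q_values, alpha=1.0):
    """Assumed improvement step: weight pi(a) by Q(a)^(1/alpha), then renormalize.

    Requires strictly positive q_values. Smaller alpha sharpens the result.
    """
    weights = [p * (q ** (1.0 / alpha)) for p, q in zip(policy, q_values)]
    z = sum(weights)
    return [w / z for w in weights]

def project_to_softmax(target):
    """KL projection onto a full tabular softmax class.

    A full softmax table can represent any strictly positive distribution,
    so the projection is exact: logits = log(target).
    """
    return [math.log(max(p, 1e-300)) for p in target]

# Hypothetical single-state problem with three actions.
policy = [1.0 / 3.0] * 3
q_values = [1.0, 2.0, 4.0]

# alpha = 1: smooth, return-weighted update.
smooth = softmax(project_to_softmax(improve(policy, q_values, alpha=1.0)))
# Small alpha: the update concentrates on the action with the largest Q.
sharp = softmax(project_to_softmax(improve(policy, q_values, alpha=0.05)))
```

With a restricted policy class the projection would no longer be exact and would have to be computed by minimizing a divergence, which is where the choice of projection starts to matter.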

3. Smoothing, Bias-Variance Trade-offs, and Exploration

In continuous control, the prevalent use of Gaussian policies introduces explicit smoothing (mollification) of the optimization landscape. The surrogate objective is a convolution of the original objective with the policy kernel, equivalent to solving a forward heat equation in the policy mean and variance (Wang et al., 2024); the smoothing kernel is the heat kernel, whose width is set by the policy variance.

Mollification effect: As the policy variance grows, the landscape is smoothed, reducing gradient estimator variance but increasing bias relative to the true landscape. As the policy variance vanishes, bias decreases but estimator variance and the number of spurious local optima increase.
The uncertainty principle in harmonic analysis dictates a lower bound on the product of exploration (noise in action space) and bias (smoothing in frequency), implying the existence of an optimal exploration/smoothing level for practical performance (Wang et al., 2024).

Empirical work confirms these theoretical insights:

Moderate noise often yields robust and stable learning, but excessive noise obliterates sharp optima critical for tasks with "needle-in-a-haystack" reward structures (e.g., quadrotor balance).
Too little noise causes fractal, high-variance surrogates and training instability (Wang et al., 2024).
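The trade-off between smoothing and sharp optima can be illustrated in one dimension. The sketch below assumes a hypothetical reward with a broad bump and a narrow, taller spike, and evaluates the Gaussian-smoothed objective $\mathbb{E}_{a \sim \mathcal{N}(\mu, \sigma^2)}[r(a)]$ by simple quadrature. The reward shape, the function names, and the grid settings are illustrative choices, not taken from the cited work.

```python
import math

def reward(a):
    """Hypothetical 1-D reward: broad bump at a = -1, narrow taller spike at a = 2."""
    broad = math.exp(-0.5 * ((a + 1.0) / 1.0) ** 2)
    spike = 2.0 * math.exp(-0.5 * ((a - 2.0) / 0.05) ** 2)
    return broad + spike

def smoothed_objective(mu, sigma, n=4001, width=8.0):
    """Approximate E_{a ~ N(mu, sigma^2)}[reward(a)] with a Riemann sum.

    The grid spans mu +/- width * sigma, which captures essentially all
    of the Gaussian mass.
    """
    lo = mu - width * sigma
    hi = mu + width * sigma
    step = (hi - lo) / (n - 1)
    norm = 1.0 / (sigma * math.sqrt(2.0 * math.pi))
    total = 0.0
    for i in range(n):
        a = lo + i * step
        density = norm * math.exp(-0.5 * ((a - mu) / sigma) ** 2)
        total += density * reward(a) * step
    return total

# Low noise: the narrow spike is still the best place for the policy mean.
low_noise_spike = smoothed_objective(2.0, 0.05)
low_noise_broad = smoothed_objective(-1.0, 0.05)

# High noise: the spike is averaged away and the broad bump wins.
high_noise_spike = smoothed_objective(2.0, 1.0)
high_noise_broad = smoothed_objective(-1.0, 1.0)
```

In this toy setting the preferred policy mean switches from the spike to the broad bump as the policy standard deviation grows, which is the one-dimensional analogue of excessive noise erasing a sharp optimum.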

4. Convergence Theory and Sample Complexity

Finite-time convergence results for policy gradient methods are founded upon properties such as smoothness and Polyak–Łojasiewicz-type gradient domination inequalities.

Tabular parameterizations: Policy gradient (with direct or softmax parameterization) under suitable step-size conditions achieves global convergence to the optimal policy, at either a sublinear or a linear rate, despite nonconvexity of the objective. Natural policy gradient and entropy-regularized variants further accelerate convergence, often allowing larger step sizes and guaranteeing contraction to the unique regularized optimum (Liu et al., 2024, Cen et al., 2023).
Function approximation and variance reduction: For policies parameterized by general smooth function classes, convergence to stationary points is attainable, and variance-reduction techniques (e.g., SRVR-PG, SRVR-NPG) can improve sample complexity significantly. The number of samples required for $\epsilon$-optimality is largest for standard policy gradient, smaller for natural policy gradient, and smaller still for variance-reduced methods in certain regimes; the explicit rates are given in (Liu et al., 2022).
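The gap between vanilla and natural policy gradient in the tabular softmax setting can be seen on a small bandit with exact gradients. The sketch below is an illustration under assumptions: the rewards, step size, and iteration count are hypothetical, and the natural-gradient step uses the known tabular softmax simplification in which the advantage is added to the logits directly, whereas the vanilla step scales the advantage by the action probability.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def suboptimality_after(rewards, steps, eta, natural):
    """Run exact-gradient updates from uniform logits; return max reward minus J."""
    logits = [0.0] * len(rewards)
    for _ in range(steps):
        probs = softmax(logits)
        value = sum(p * r for p, r in zip(probs, rewards))
        for a, r in enumerate(rewards):
            advantage = r - value
            # Vanilla softmax PG: dJ/dtheta_a = pi(a) * advantage(a).
            # Tabular natural PG: the Fisher preconditioner removes the pi(a) factor.
            scale = 1.0 if natural else probs[a]
            logits[a] += eta * scale * advantage
    probs = softmax(logits)
    return max(rewards) - sum(p * r for p, r in zip(probs, rewards))

# Hypothetical three-armed bandit with a close second-best arm.
rewards = [1.0, 0.8, 0.1]
pg_gap = suboptimality_after(rewards, steps=200, eta=0.4, natural=False)
npg_gap = suboptimality_after(rewards, steps=200, eta=0.4, natural=True)
```

In this setting the natural-gradient logit gap between the two best arms grows linearly in the iteration count, so its suboptimality decays geometrically, while the vanilla update slows down as the probability of the second-best arm shrinks.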

Off-policy policy gradient (PGQ) methods extend convergence guarantees to control under behavior-target policy mismatch, provided appropriate corrections for the stationary distribution and Bellman operator are included (Lehnert et al., 2015).

Impact of Distribution Mismatch and Discounting

In practice, trajectory sampling often induces a distribution mismatch relative to the theoretical discounted visitation distribution. Rigorous analysis reveals that, for tabular policies, global optimality is preserved even under mismatched sampling, provided sufficient ergodicity/exploration and a discount factor $\gamma$ close to $1$ (Wang et al., 28 Mar 2025). If $\gamma$ is scheduled appropriately (e.g., $\gamma \to 1$ as the step size tends to zero), convergence to stationary points of the undiscounted objective is recovered (Nota, 2022).
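The size of the mismatch can be made concrete on a small Markov chain. The sketch below assumes a hypothetical two-state chain (the transition matrix `P` stands for the dynamics under a fixed policy) and compares the normalized discounted visitation distribution with the plain time-averaged state frequencies that unweighted trajectory sampling produces. All names and numbers are illustrative.

```python
def step_distribution(p_t, P):
    """One step of the state distribution: p_{t+1}(s') = sum_s p_t(s) P[s][s']."""
    n = len(p_t)
    return [sum(p_t[s] * P[s][s2] for s in range(n)) for s2 in range(n)]

def discounted_visitation(P, rho, gamma, horizon):
    """Normalized discounted visitation: (1 - gamma) * sum_t gamma^t P(s_t = s)."""
    p_t = list(rho)
    d = [0.0] * len(rho)
    weight = 1.0 - gamma
    for _ in range(horizon):
        for s in range(len(rho)):
            d[s] += weight * p_t[s]
        p_t = step_distribution(p_t, P)
        weight *= gamma
    return d

def time_averaged_visitation(P, rho, horizon):
    """Unweighted state frequencies over a rollout of the given length."""
    p_t = list(rho)
    d = [0.0] * len(rho)
    for _ in range(horizon):
        for s in range(len(rho)):
            d[s] += p_t[s] / horizon
        p_t = step_distribution(p_t, P)
    return d

# Hypothetical ergodic chain under a fixed policy, started in state 1.
P = [[0.9, 0.1],
     [0.5, 0.5]]
rho = [0.0, 1.0]

sampled = time_averaged_visitation(P, rho, horizon=200)
d_low = discounted_visitation(P, rho, gamma=0.5, horizon=20000)
d_high = discounted_visitation(P, rho, gamma=0.999, horizon=20000)
```

For a small discount factor the discounted visitation stays concentrated near the initial state and differs strongly from the sampled frequencies; as the discount factor approaches one, the two nearly coincide in this ergodic example.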

5. Extensions, Risk Sensitivity, and Robustness

Non-Markovian Environments and Policy Classes

Policy gradient methods extend to non-Markovian settings (NMDPs) by optimizing over "agent state-Markov" (ASM) policies, which combine a parameterized internal state update and a Markov policy on that internal state. The policy gradient theorem is generalized to this scenario, and finite-time convergence bounds are established. Empirical results suggest clear superiority of reward-centric, end-to-end representation learning over predictive-state or auxiliary approaches in non-Markovian domains (Kar et al., 11 May 2026).
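The structure of an agent-state policy can be sketched as a small class with two parts: a parameterized internal state update and a softmax policy that depends only on that internal state. The class below is a hypothetical illustration, not the architecture of the cited work: the scalar state, the `tanh` update, the weights, and the greedy action choice in `rollout` are all assumptions chosen to keep the example deterministic.

```python
import math

class AgentStatePolicy:
    """Hypothetical agent-state policy with a scalar internal state z."""

    def __init__(self, w_state, w_obs, w_act, logit_weights):
        self.w_state = w_state              # weight on the previous internal state
        self.w_obs = w_obs                  # weight on the current observation
        self.w_act = w_act                  # weight on the previous action
        self.logit_weights = logit_weights  # one logit slope per action

    def update_state(self, z, observation, prev_action):
        """Internal state update z' = f(z, observation, previous action)."""
        pre = self.w_state * z + self.w_obs * observation + self.w_act * prev_action
        return math.tanh(pre)

    def action_probs(self, z):
        """Softmax policy that is Markov in the internal state z."""
        logits = [w * z for w in self.logit_weights]
        m = max(logits)
        exps = [math.exp(x - m) for x in logits]
        total = sum(exps)
        return [e / total for e in exps]

def rollout(policy, observations):
    """Feed a sequence of observations; return the final action distribution."""
    z, prev_action = 0.0, 0
    probs = policy.action_probs(z)
    for obs in observations:
        z = policy.update_state(z, obs, prev_action)
        probs = policy.action_probs(z)
        # Greedy action choice keeps this sketch deterministic.
        prev_action = max(range(len(probs)), key=lambda a: probs[a])
    return probs

policy = AgentStatePolicy(w_state=0.9, w_obs=1.0, w_act=0.1, logit_weights=[-1.0, 1.0])

# Two histories that end with the same observation but differ earlier.
probs_a = rollout(policy, [1.0, 0.0])
probs_b = rollout(policy, [-1.0, 0.0])
```

The two histories end with the same observation but yield different action distributions, which a memoryless policy on the current observation could not do. In a gradient-based method, the state-update weights and the logit weights would be trained jointly.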

Robust and Risk-Sensitive Policy Gradients

Robust Policy Gradient (RPG): For robust MDPs with rectangular uncertainty, closed-form adversarial corrections to the occupation measure and Q-function yield a robustified policy gradient, enabling efficient gradient-based learning without reliance on expensive inner convex optimization. Scalability and practical equivalence in computational complexity to standard PG are demonstrated (Kumar et al., 2023).
Risk-sensitive Policy Gradient: Distortion risk measures, represented as Choquet integrals over the return CDF, allow direct optimization of risk-sensitive objectives (e.g., CVaR, Wang measures). Through likelihood-ratio gradient estimators and non-asymptotic convergence proofs, PG methods are extended to arbitrary distortion functions, supporting both on- and off-policy updates (Vijayan et al., 2021).
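A distortion risk measure can be estimated from sampled returns as a weighted sum of order statistics. The sketch below is a minimal illustration under assumptions: it uses the empirical Choquet integral with weights $g\big(\tfrac{n-i+1}{n}\big) - g\big(\tfrac{n-i}{n}\big)$ on the $i$-th smallest return, and a distortion function chosen so that the result is the mean of the worst $\alpha$ fraction of returns (a lower-tail CVaR). The function names and the sign convention are illustrative and may differ from those in the cited work; the sketch covers only the objective, not its gradient.

```python
def distortion_risk(returns, distortion):
    """Empirical distortion risk measure of a sample of returns.

    `distortion` is a nondecreasing function g on [0, 1] with g(0) = 0 and
    g(1) = 1, applied to the empirical survival function.
    """
    xs = sorted(returns)
    n = len(xs)
    total = 0.0
    for i, x in enumerate(xs, start=1):
        # Weight on the i-th smallest return.
        weight = distortion((n - i + 1) / n) - distortion((n - i) / n)
        total += weight * x
    return total

def identity(u):
    """g(u) = u recovers the ordinary sample mean."""
    return u

def lower_tail_cvar_distortion(alpha):
    """Distortion that puts all weight on the worst alpha fraction of returns."""
    def g(u):
        return min(1.0, max(0.0, (u - (1.0 - alpha)) / alpha))
    return g

# Hypothetical sample of returns.
returns = [float(v) for v in range(1, 11)]

mean_value = distortion_risk(returns, identity)
cvar_value = distortion_risk(returns, lower_tail_cvar_distortion(0.2))
```

With ten samples and $\alpha = 0.2$, the estimator averages the two smallest returns, so a policy that maximizes it is pushed to improve its worst outcomes rather than its average.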

Second-Order Policy Gradient

Second-order PG algorithms (Newton and quasi-Newton variants) use explicit or approximate curvature information, such as the Gauss–Newton Hessian in LQR control, to obtain superlinear or quadratic local convergence rates. They are computationally feasible in domains that admit closed-form derivatives, LQR being the standard example. Gauss–Newton updates are particularly effective in moderate- to high-dimensional linear control (Valaei et al., 3 Nov 2025).
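A scalar LQR problem gives a compact illustration of a Gauss–Newton style update. The sketch below assumes dynamics $x_{t+1} = a x_t + b u_t$, stage cost $q x_t^2 + r u_t^2$, and linear feedback $u_t = -k x_t$; the update is the Gauss–Newton form commonly used in the LQR policy-gradient literature, which is not necessarily the exact algorithm of the cited work. The system parameters and the initial gain are hypothetical.

```python
import math

def cost_to_go(k, a, b, q, r):
    """Quadratic cost coefficient P_k for the gain k (cost is P_k * x0^2)."""
    closed_loop = a - b * k
    if abs(closed_loop) >= 1.0:
        raise ValueError("gain is not stabilizing")
    return (q + r * k * k) / (1.0 - closed_loop ** 2)

def gauss_newton_step(k, a, b, q, r, eta=1.0):
    """Scalar Gauss-Newton update k <- k - eta * E_k / (r + b^2 P_k)."""
    p = cost_to_go(k, a, b, q, r)
    residual = (r + b * b * p) * k - b * p * a  # E_k, proportional to the gradient
    return k - eta * residual / (r + b * b * p)

def riccati_solution(a, b, q, r):
    """Positive root of the scalar discrete-time algebraic Riccati equation."""
    c1 = r - q * b * b - a * a * r
    return (-c1 + math.sqrt(c1 * c1 + 4.0 * b * b * q * r)) / (2.0 * b * b)

# Hypothetical unstable open-loop system with a stabilizing initial gain.
a, b, q, r = 1.2, 1.0, 1.0, 1.0
k = 1.0
for _ in range(8):
    k = gauss_newton_step(k, a, b, q, r)

p_final = cost_to_go(k, a, b, q, r)
p_star = riccati_solution(a, b, q, r)
```

With `eta = 1.0` the scalar update reduces to $k \leftarrow a b P_k / (r + b^2 P_k)$, the classical policy-iteration step, and the cost coefficient reaches the Riccati solution within a handful of iterations. A plain gradient step on the same problem would need a carefully tuned step size to stay in the stabilizing region.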

6. Practical Stability and Empirical Diagnostics

Monotonic improvement and reliable convergence in policy gradient methods can be fragile in practice, especially in high-variance or high-dimensional settings. Reward profiling, a lightweight empirical performance-estimation wrapper, has recently been proposed to address this: at each training step, a candidate policy update is accepted only if it clears a reward-improvement threshold on fresh rollouts. This yields high-probability guarantees of monotonic improvement, and the authors report faster convergence and lower return variance on continuous control benchmarks (Ahmed et al., 20 Nov 2025).
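The accept-or-reject logic of such a wrapper can be sketched in a few lines. The example below is a hypothetical illustration rather than the procedure of the cited work: the policy is a single scalar parameter, the return estimate averages noisy evaluations of a toy objective, and the names `profiled_update`, `make_estimator`, and `threshold` are assumptions.

```python
import random

def true_return(theta):
    """Hypothetical noise-free objective, maximized at theta = 3."""
    return -(theta - 3.0) ** 2

def make_estimator(rng, n_rollouts=32, noise_std=0.1):
    """Build a return estimator that averages fresh noisy rollouts."""
    def estimate_return(theta):
        samples = [
            true_return(theta) + rng.gauss(0.0, noise_std)
            for _ in range(n_rollouts)
        ]
        return sum(samples) / n_rollouts
    return estimate_return

def profiled_update(theta, candidate, estimate_return, threshold=0.0):
    """Accept the candidate only if its estimated gain clears the threshold."""
    gain = estimate_return(candidate) - estimate_return(theta)
    if gain >= threshold:
        return candidate, True
    return theta, False

rng = random.Random(0)
estimate = make_estimator(rng)

# A candidate that moves toward the optimum is accepted.
theta_after_good, accepted_good = profiled_update(0.0, 0.5, estimate, threshold=0.1)
# A candidate that moves away from the optimum is rejected.
theta_after_bad, accepted_bad = profiled_update(0.0, -0.5, estimate, threshold=0.1)
```

In a full training loop the candidate would come from the underlying policy gradient step, and the threshold and number of evaluation rollouts would control how often a harmful update slips through because of estimation noise.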

Further empirical insights include:

Stochasticity in policy gradient acts as a beneficial mollifier in rough landscapes but may erase narrow optimality basins.
Variance-reduced and natural gradient methods show clear practical gains in sample efficiency and return stability in empirical benchmarks.

7. Unified View and Future Directions

Policy gradient methodology encompasses a wide landscape: direct and natural gradients, operator-based perspectives, smoothing/entropy regularization, robust/risk-sensitive/generalized objectives, and algorithmic variants spanning stochastic, mini-batch, off-policy, and higher-order update regimes.

Open directions include:

Extending finite-sample guarantees to deep, nonlinear function approximation in fully off-policy and partially observable regimes.
Integrating reward-centric state representation learning in high-memory, non-Markovian domains.
Developing efficient and stable higher-order optimization and model-free second-order estimators for nonlinear and adaptive control.
Designing adaptive smoothing, trust-region, or reward-profiling wrappers for stability and monotonicity in challenging RL environments.

The operator and variational perspectives unify policy and value-based paradigms and motivate a spectrum of algorithmic innovations for scalable, theoretically grounded policy search (Ghosh et al., 2020, Liu et al., 2022, Cen et al., 2023, Liu et al., 2024).