Policy Gradient Objective
- Policy Gradient Objective is a formulation in reinforcement learning that defines a differentiable function whose gradient is used to update policy parameters for maximizing cumulative rewards.
- It encompasses methods that leverage on-policy, off-policy, and counterfactual formulations to balance exploration, risk sensitivity, and multi-objective requirements.
- Advanced algorithms employ variance reduction and adaptive strategies to improve convergence and scalability in high-dimensional and complex reinforcement learning settings.
A policy gradient objective is a central construct in reinforcement learning (RL) that defines the function whose gradient with respect to policy parameters is optimized in policy search methods. It formalizes how an agent’s policy should be updated to maximize a given return—whether this return is the expected cumulative reward, a risk-sensitive evaluation, or a general scalarization encompassing multiple objectives. Over the past several years, extensive research has critically analyzed, extended, and unified policy gradient objectives for on-policy, off-policy, risk-sensitive, and multi-objective settings, with attention to theoretical guarantees, sampling efficiency, and practical tractability.
1. Foundational Formulation and Principles
The basic policy gradient objective in RL is to find parameters θ of a differentiable policy π(a|s; θ) that maximize expected return, typically expressed as

$$J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\!\left[\sum_{t=0}^{\infty} \gamma^{t}\, r(s_t, a_t)\right],$$

where returns are summed over trajectories τ generated by π_θ and discounted by γ (Kämmerer, 2019).
The classical policy gradient theorem expresses the gradient as

$$\nabla_\theta J(\theta) = \mathbb{E}_{s \sim d^{\pi},\, a \sim \pi_\theta}\!\left[\nabla_\theta \log \pi_\theta(a \mid s)\, Q^{\pi}(s, a)\right],$$

where d^π is a (discounted or stationary) state distribution and Q^π is the state-action value function (Kämmerer, 2019). In this form, policy gradient methods reinforce actions that yield high expected return by increasing their probability under the policy parameterization.
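As a concrete illustration of this score-function form, the following minimal NumPy sketch estimates the gradient for a tabular softmax policy on a synthetic one-step (contextual-bandit) problem; the reward table, problem sizes, and step size are illustrative assumptions rather than details taken from any cited work.

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions = 4, 3
rewards = rng.normal(size=(n_states, n_actions))       # illustrative reward table (assumption)

def softmax_policy(theta, s):
    """pi(a | s; theta) for a tabular softmax parameterization."""
    logits = theta[s] - theta[s].max()
    p = np.exp(logits)
    return p / p.sum()

def reinforce_gradient(theta, n_samples=2_000):
    """Monte Carlo estimate of grad_theta E[r] via the score-function (REINFORCE) estimator."""
    grad = np.zeros_like(theta)
    for _ in range(n_samples):
        s = rng.integers(n_states)                      # uniform start-state distribution (assumption)
        p = softmax_policy(theta, s)
        a = rng.choice(n_actions, p=p)
        score = -p                                      # grad_theta[s] log pi(a|s) = one_hot(a) - pi(.|s)
        score[a] += 1.0
        grad[s] += score * rewards[s, a]
    return grad / n_samples

theta = np.zeros((n_states, n_actions))
for _ in range(100):                                    # plain stochastic gradient ascent
    theta += 0.5 * reinforce_gradient(theta)
print(softmax_policy(theta, 0).argmax(), rewards[0].argmax())   # should agree after training
```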
The strength of this formulation lies in its generality: it admits extensions to risk-sensitive measures (Vijayan et al., 2021), general utility functions (Kumar et al., 2022), and settings with multiple reward signals via scalarization (Bai et al., 2021, Guidobene et al., 14 Aug 2025).
2. Objective Variants: On-Policy, Off-Policy, and Counterfactual Formulations
On-Policy and Excursion Objectives
Standard on-policy methods estimate expected return and its gradient from trajectories drawn from the same policy π being optimized. However, sample inefficiency motivates off-policy variants that leverage data from a behavior policy μ.
The excursion objective,

$$J_{\mathrm{exc}}(\pi) = \mathbb{E}_{s \sim d_\mu}\!\left[V^{\pi}(s)\right],$$

measures target policy π performance using the state distribution d_μ of the behavior policy, allowing off-policy evaluation but suffering from objective mismatch when π and μ are dissimilar (Zhang et al., 2019).
The alternative life (stationary) objective,

$$J_{\mathrm{alt}}(\pi) = \mathbb{E}_{s \sim d_\pi}\!\left[V^{\pi}(s)\right],$$

instead weights by the stationary distribution d_π of π, directly reflecting deployment performance but not being directly estimable from off-policy data.
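To make the excursion setup concrete, the sketch below weights the score function by the per-decision importance ratio π/μ while drawing states and actions from the behavior side, again in a one-step tabular setting; the uniform behavior policy and synthetic rewards are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
n_states, n_actions = 4, 3
rewards = rng.normal(size=(n_states, n_actions))        # illustrative reward table (assumption)

def softmax(logits):
    z = np.exp(logits - logits.max())
    return z / z.sum()

def excursion_gradient(theta, behavior_logits, n_samples=2_000):
    """Off-policy score-function gradient: states/actions are sampled under the
    behavior policy mu, and each update is importance-weighted by pi(a|s)/mu(a|s)."""
    grad = np.zeros_like(theta)
    for _ in range(n_samples):
        s = rng.integers(n_states)                       # state drawn from d_mu (assumed uniform here)
        mu = softmax(behavior_logits[s])
        pi = softmax(theta[s])
        a = rng.choice(n_actions, p=mu)                  # action from the behavior policy
        rho = pi[a] / mu[a]                              # importance ratio pi / mu
        score = -pi
        score[a] += 1.0                                  # grad log pi(a|s) for a softmax policy
        grad[s] += rho * score * rewards[s, a]
    return grad / n_samples

theta = np.zeros((n_states, n_actions))
behavior_logits = np.zeros((n_states, n_actions))        # uniform behavior policy (assumption)
for _ in range(100):
    theta += 0.5 * excursion_gradient(theta, behavior_logits)
print(softmax(theta[0]).round(3), rewards[0].argmax())
```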
Counterfactual Objectives and Unification
(Zhang et al., 2019) introduces a parameterized family of counterfactual objectives,

$$J_{\hat{\gamma}}(\pi) = \mathbb{E}_{s \sim d_{\hat{\gamma}}}\!\left[V^{\pi}(s)\right],$$

where d_{\hat{\gamma}} interpolates between d_μ (at \hat{\gamma} = 0) and d_π (as \hat{\gamma} → 1). By tuning \hat{\gamma}, practitioners navigate the bias-variance tradeoff and closely approximate deployed policy performance while still leveraging off-policy data.
The Generalized Off-Policy Policy Gradient (GOPPG) Theorem provides a general formula for the gradient of J_{\hat{\gamma}}, which decomposes into a standard actor term and an additional correction term involving the gradient of the density ratio between d_{\hat{\gamma}} and d_μ.
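The interpolation can be made explicit in the tabular case. The sketch below assumes the discounted-excursion form d_{\hat{\gamma}} = (1 - \hat{\gamma})(I - \hat{\gamma} P_π^⊤)^{-1} d_μ, which recovers d_μ at \hat{\gamma} = 0 and approaches the stationary distribution d_π as \hat{\gamma} → 1; this closed form and the random transition matrix are assumptions made for illustration, not a restatement of the exact construction in (Zhang et al., 2019).

```python
import numpy as np

rng = np.random.default_rng(2)
n = 5
P_pi = rng.random((n, n))
P_pi /= P_pi.sum(axis=1, keepdims=True)                  # row-stochastic transition matrix under pi
d_mu = rng.random(n)
d_mu /= d_mu.sum()                                       # behavior policy's state distribution

def d_gamma_hat(gamma_hat):
    """Assumed interpolation: (1 - g) * (I - g * P_pi^T)^{-1} d_mu."""
    return (1 - gamma_hat) * np.linalg.solve(np.eye(n) - gamma_hat * P_pi.T, d_mu)

# Stationary distribution of P_pi: the left eigenvector for eigenvalue 1, normalized.
eigvals, eigvecs = np.linalg.eig(P_pi.T)
d_pi = np.real(eigvecs[:, np.argmin(np.abs(eigvals - 1))])
d_pi /= d_pi.sum()

print(np.allclose(d_gamma_hat(0.0), d_mu))               # gamma_hat = 0 recovers d_mu
print(np.abs(d_gamma_hat(0.999) - d_pi).max())           # gamma_hat -> 1 approaches d_pi
```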
Risk-Sensitive, Nonlinear, and Multi-Objective Extensions
For risk-sensitive RL, the policy gradient objective is generalized to optimize distortion risk measures or other functionals of the full return distribution, leading to gradient representations such as

$$\nabla_\theta\, \rho_g\!\left(R^{\theta}\right) = -\int g'\!\big(1 - F_{R^{\theta}}(x)\big)\, \nabla_\theta F_{R^{\theta}}(x)\, dx,$$

where g is the distortion function and F_{R^θ}(x) is the CDF of returns under π_θ (Vijayan et al., 2021).
For general utilities f(μ_π), the generalized policy gradient theorem becomes

$$\nabla_\theta f\!\left(\mu_{\pi_\theta}\right) = \mathbb{E}_{s \sim d^{\pi},\, a \sim \pi_\theta}\!\left[\nabla_\theta \log \pi_\theta(a \mid s)\, Q^{\pi}_{R_\pi}(s, a)\right],$$

with R_π(s, a) = ∂f/∂μ_π(s, a) acting as an adaptive reward in place of the environment reward (Kumar et al., 2022).
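As an illustration of the adaptive-reward idea, the sketch below computes a tabular discounted occupancy measure and differentiates an entropy-style exploration utility f(μ) = -Σ μ log μ analytically; the synthetic MDP and this particular choice of f are assumptions made purely for demonstration.

```python
import numpy as np

rng = np.random.default_rng(3)
n_states, n_actions, gamma = 4, 3, 0.9

def occupancy_measure(theta, P, d0):
    """Discounted state-action occupancy mu(s, a) for a tabular softmax policy."""
    pi = np.exp(theta - theta.max(axis=1, keepdims=True))
    pi /= pi.sum(axis=1, keepdims=True)
    P_pi = np.einsum('sa,sap->sp', pi, P)                # state-to-state transitions under pi
    d = (1 - gamma) * np.linalg.solve(np.eye(n_states) - gamma * P_pi.T, d0)
    return d[:, None] * pi                               # mu(s, a) = d(s) * pi(a | s)

# Synthetic MDP (assumption): random transitions, uniform start distribution.
P = rng.random((n_states, n_actions, n_states))
P /= P.sum(axis=2, keepdims=True)
d0 = np.full(n_states, 1.0 / n_states)

mu = occupancy_measure(np.zeros((n_states, n_actions)), P, d0)
# For f(mu) = -sum mu log mu, the adaptive reward is df/dmu(s, a) = -(log mu(s, a) + 1).
adaptive_reward = -(np.log(mu) + 1.0)
print(adaptive_reward.round(2))
# Feeding adaptive_reward into any standard policy gradient estimator then optimizes
# the nonlinear utility f rather than a fixed environment reward.
```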
In multi-objective RL, scalarization functions f(J_1, ..., J_M) enable policy gradient updates that target nonlinear preferences over multiple objectives (Bai et al., 2021, Guidobene et al., 14 Aug 2025). The gradient typically takes the form

$$\nabla_\theta f\!\left(J_1^{\pi}, \ldots, J_M^{\pi}\right) = \sum_{m=1}^{M} \frac{\partial f}{\partial J_m}\, \nabla_\theta J_m^{\pi},$$

with each ∇_θ J_m^π estimated via policy gradient methods.
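Combining per-objective gradient estimates through this chain rule is straightforward once ∂f/∂J_m is available. The sketch below uses a smooth worst-case (soft-min) scalarization, which is an illustrative choice of f rather than one prescribed by the cited papers, and stands in the per-objective gradients with random vectors.

```python
import numpy as np

rng = np.random.default_rng(4)
M, dim = 3, 8                                    # number of objectives and policy parameters (illustrative)
J = np.array([1.0, 0.2, 0.7])                    # stand-ins for estimated per-objective returns J_m
grad_J = rng.normal(size=(M, dim))               # stand-ins for estimated gradients grad_theta J_m

def scalarization_weights(J, tau=0.5):
    """df/dJ_m for the smooth worst-case scalarization f(J) = -tau * log(sum_m exp(-J_m / tau))."""
    w = np.exp(-J / tau)
    return w / w.sum()

# Chain rule: grad_theta f(J_1, ..., J_M) = sum_m (df/dJ_m) * grad_theta J_m
weights = scalarization_weights(J)
grad_theta_f = weights @ grad_J
print(weights.round(3), grad_theta_f.shape)      # the worst-performing objective gets the largest weight
```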
3. Theoretical Properties: Convergence, Optimality, and Objective Alignment
Policy gradient objectives, despite being non-convex in general, possess structural properties that guarantee the global optimality of stationary points under certain conditions. In particular, when the policy class is closed under one-step policy improvement and the single-period Bellman objective is convex or gradient dominated (satisfies a Polyak–Łojasiewicz (PL) condition), then all stationary points of the multi-step objective are (near-)global optima (Bhandari et al., 2019).
This property carries over to a range of problems, including finite state-action MDPs, linear–quadratic control, and optimal stopping. In these cases, policy gradient methods admit global or PL-constrained convergence guarantees analogous to those in convex optimization.
Research has also demonstrated that, under gradual annealing of the discount factor γ in biased approximations, e.g. when increasing γ along with a decreasing step size, convergence to a stationary point of the undiscounted objective can be recovered, provided the error decays at a rate commensurate with the learning rate (Nota, 2022).
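A minimal sketch of such a coupled schedule follows; the specific rates are assumptions made for demonstration, not the rates analyzed in (Nota, 2022).

```python
import numpy as np

def annealed_schedule(num_iters, c=1.0, p=0.5, alpha0=0.1, q=0.75):
    """Illustrative coupled schedules: the discount factor gamma_k is annealed toward 1
    while the step size alpha_k decays; the exponents here are arbitrary assumptions."""
    k = np.arange(1, num_iters + 1)
    gammas = np.clip(1.0 - c / k**p, 0.0, 0.999)
    alphas = alpha0 / k**q
    return gammas, alphas

gammas, alphas = annealed_schedule(5)
print(gammas.round(3), alphas.round(3))
```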
For off-policy methods, alignment of objectives and their gradients (the “on–off gap”) occurs when the chain induced by π is ergodic and γ is chosen sufficiently close to 1, ensuring the off-policy state distribution approximates the stationary distribution (Mambelli et al., 19 Feb 2024).
4. Sample Efficiency and Variance Reduction
Sample inefficiency and high variance of gradient estimates are key challenges in policy gradient methods, especially acute in multi-objective or risk-sensitive formulations. To address this:
- Variance reduction strategies (baselines, actor-critic architectures, natural gradients) are standard in single-objective RL (Kämmerer, 2019); a minimal baseline comparison is sketched after this list.
- In the multi-objective setting, recent advances introduce specialized variance reduction techniques, such as sampling independent batches for reward and gradient estimation and leveraging importance sampling to better control variance scaling with the number of objectives (Guidobene et al., 14 Aug 2025). For example, in the MO-TSIVR-PG algorithm (Guidobene et al., 14 Aug 2025), the sample complexity for stationary convergence is improved from Õ(M⁴) to Õ(M²), enhancing practicality for high-dimensional multi-objective problems.
- Risk-sensitive objectives and non-linear utilities typically require bias-controlled, rather than unbiased, gradient estimators to ensure tractable sample efficiency; careful analysis bounds the bias and sample complexity (Bai et al., 2021, Vijayan et al., 2021).
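The following sketch, referenced in the first bullet above, compares the empirical variance of the score-function gradient with and without a value baseline on a toy softmax bandit; the action values, noise level, and sample sizes are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(5)
n_actions = 3
theta = rng.normal(size=n_actions)               # arbitrary fixed policy parameters
q_values = np.array([1.0, 1.5, 0.5])             # illustrative action values (assumption)

def gradient_samples(use_baseline, n_samples=20_000):
    """Per-sample score-function gradients for a softmax bandit policy,
    optionally subtracting the state-value baseline V = E_pi[Q]."""
    p = np.exp(theta - theta.max())
    p /= p.sum()
    baseline = p @ q_values if use_baseline else 0.0
    grads = np.zeros((n_samples, n_actions))
    for i in range(n_samples):
        a = rng.choice(n_actions, p=p)
        r = q_values[a] + rng.normal(scale=0.5)   # noisy reward around Q(a)
        score = -p.copy()
        score[a] += 1.0                           # grad log pi(a)
        grads[i] = score * (r - baseline)
    return grads

for flag in (False, True):
    g = gradient_samples(flag)
    print("baseline" if flag else "no baseline",
          "| mean:", g.mean(axis=0).round(3),
          "| total variance:", g.var(axis=0).sum().round(3))
```

Both estimators share the same mean (the baseline does not bias the gradient), but the baselined version shows markedly lower total variance.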
5. Practical Algorithmic Realizations
Practical algorithms deploy stochastic gradient ascent updates driven by sample-based estimates of the chosen objective's gradient:
- Actor-critic methods use bootstrapped value estimates and compatible function approximators for reduced variance (Kämmerer, 2019).
- Proximal algorithms, such as PPO and its relatives (PPG, COPG), introduce explicit clipping in either the probability ratio or the log-probability term to stabilize and regularize policy updates (Byun et al., 2020, Markowitz et al., 2023). These methods maintain implementation simplicity while controlling variance and overfitting, and can be adapted directly to multi-objective or risk-sensitive variants; the clipped surrogate is sketched after this list.
- Emphatic approaches for off-policy learning use long-term traces and weighting to obtain unbiased estimates for otherwise intractable correction terms (Zhang et al., 2019).
- Estimators of the log density gradient supply correction terms that account for the mismatch between discounted Q-function-based gradient estimates and the exact stationary-distribution requirement, improving sample complexity and convergence in both tabular and function approximation settings (Katdare et al., 3 Mar 2024).
- Adaptive step size selection via Armijo line-search (Lu et al., 21 May 2024) or multi-stage temperature annealing for entropy-regularized objectives (Lu et al., 21 May 2024) further enhances robustness and removes dependence on unknown environment parameters.
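As referenced in the proximal-methods bullet above, the standard PPO-style clipped surrogate can be written in a few lines; COPG and PPG modify which term is clipped, so this sketch should be read as the generic clipped-ratio form with illustrative inputs.

```python
import numpy as np

def clipped_surrogate(log_prob_new, log_prob_old, advantages, eps=0.2):
    """PPO-style clipped surrogate (to be maximized):
    mean_t min(rho_t * A_t, clip(rho_t, 1 - eps, 1 + eps) * A_t),
    where rho_t = pi_new(a_t | s_t) / pi_old(a_t | s_t)."""
    rho = np.exp(log_prob_new - log_prob_old)
    unclipped = rho * advantages
    clipped = np.clip(rho, 1.0 - eps, 1.0 + eps) * advantages
    return np.minimum(unclipped, clipped).mean()

# Illustrative batch (assumption): log-probabilities under the new and old policies
# and advantage estimates for 256 sampled state-action pairs.
rng = np.random.default_rng(6)
lp_old = rng.normal(-1.0, 0.3, size=256)
lp_new = lp_old + rng.normal(0.0, 0.1, size=256)
advantages = rng.normal(size=256)
print(clipped_surrogate(lp_new, lp_old, advantages))
```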
Table: Variance Reduction in Multi-Objective Policy Gradient Algorithms
| Method | Key Idea | Sample Complexity Scaling |
|---|---|---|
| Standard MO-PG | Baseline scalarization; no VR | Õ(M⁴) |
| MO-TSIVR-PG (Guidobene et al., 14 Aug 2025) | Trajectory splitting + IS weights | Õ(M²) (stationary conv.) |
| MO-PG+VR (Zhang 2021) | Variance reduction, stricter requirements | Õ(M³)–Õ(M⁴) |
VR = variance reduction; IS = importance sampling; M = number of objectives.
6. Extensions: Nonlinear Scalarization, Constraints, and General Utilities
Increasingly, policy gradient objectives are defined with nonlinear or constraint-induced structure:
- For non-linear utility functions f(μπ), the policy gradient theorem is generalized to use R_π(s, a) = ∂f/∂μπ(s, a) as an “adaptive reward” (Kumar et al., 2022), bridging classical RL and structured objectives from apprenticeship learning, pure exploration, and intrinsic control.
- For constraints among objectives, as in topological Markov decision processes (TMDPs), policy gradients are combined with Lagrangian penalties or slack-based reweightings to enforce preferences or requirements among objectives (Wray et al., 2022); a generic primal-dual sketch follows this list.
- Risk-distorted policy gradients involve integration over quantile-weighted gradients to optimize distortion risk measures, with provable sample complexity and non-asymptotic control (Vijayan et al., 2021).
- For transient and steady-state performance (nearly Blackwell-optimality), bi-level objectives are tackled with logarithmic barriers and specialized gradient/Fisher matrix estimators (Dewanto et al., 2021).
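As referenced in the constraints bullet above, a generic primal-dual Lagrangian step can be sketched as follows; the update rules, thresholds, and stand-in gradients are illustrative assumptions and not the specific TMDP algorithm of (Wray et al., 2022).

```python
import numpy as np

def primal_dual_step(theta, lambdas, grad_J, J_values, thresholds,
                     lr_theta=0.05, lr_lambda=0.1):
    """One step of a generic Lagrangian scheme for max J_0 subject to J_i >= c_i:
    ascend theta on L = J_0 + sum_i lambda_i * (J_i - c_i),
    then update the multipliers (projected to stay nonnegative)."""
    grad_L = grad_J[0] + np.tensordot(lambdas, grad_J[1:], axes=1)
    theta = theta + lr_theta * grad_L
    lambdas = np.maximum(0.0, lambdas - lr_lambda * (J_values[1:] - thresholds))
    return theta, lambdas

rng = np.random.default_rng(7)
theta = np.zeros(8)                              # illustrative policy parameter vector
lambdas = np.zeros(2)                            # one multiplier per constrained objective
grad_J = rng.normal(size=(3, 8))                 # stand-ins for policy gradient estimates of J_0, J_1, J_2
J_values = np.array([0.5, 0.1, 0.4])             # stand-ins for estimated objective values
thresholds = np.array([0.3, 0.3])                # constraint levels c_1, c_2 (assumptions)
theta, lambdas = primal_dual_step(theta, lambdas, grad_J, J_values, thresholds)
print(lambdas)                                   # the multiplier for the violated constraint (J_1 < c_1) increases
```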
7. Theoretical and Practical Impact
Theoretical advances in the structure and analysis of policy gradient objectives have underpinned modern empirical successes in continuous control, multi-task learning, safe RL, and RL with complex utility structures. Notably, mollification via stochastic policies smooths fractal or irregular landscapes and admits interpretation via partial differential equations and the uncertainty principle, illuminating the exploration–exploitation and bias–variance trade-offs inherent to RL (Wang et al., 28 May 2024).
Practical implementation of these advanced objectives yields algorithms that are robust, theoretically grounded, and scalable to large state–action spaces and high-dimensional multi-objective problems. Nevertheless, the alignment between practical estimators and the true objective, especially in off-policy and risk-sensitive settings, remains a subtle issue, necessitating continued investigation.
References
- Generalized Off-Policy Actor-Critic (Zhang et al., 2019)
- Global Optimality Guarantees For Policy Gradient Methods (Bhandari et al., 2019)
- Classical Policy Gradient: Preserving Bellman's Principle of Optimality (Thomas et al., 2019)
- Is the Policy Gradient a Gradient? (Nota et al., 2019)
- On Policy Gradients (Kämmerer, 2019)
- Variational Policy Gradient Method for RL with General Utilities (Zhang et al., 2020)
- Proximal Policy Gradient: PPO with Policy Gradient (Byun et al., 2020)
- A nearly Blackwell-optimal policy gradient method (Dewanto et al., 2021)
- Joint Optimization of MORL with Policy Gradient (Bai et al., 2021)
- Policy Gradient Methods for Distortion Risk Measures (Vijayan et al., 2021)
- Multi-Objective Policy Gradients with Topological Constraints (Wray et al., 2022)
- Policy Gradient for RL with General Utilities (Kumar et al., 2022)
- On the Convergence of Discounted Policy Gradient Methods (Nota, 2022)
- When Do Off-Policy and On-Policy Policy Gradient Methods Align? (Mambelli et al., 19 Feb 2024)
- Towards Provable Log Density Policy Gradient (Katdare et al., 3 Mar 2024)
- Towards Principled, Practical Policy Gradient for Bandits and Tabular MDPs (Lu et al., 21 May 2024)
- Mollification Effects of Policy Gradient Methods (Wang et al., 28 May 2024)
- Variance Reduced Policy Gradient Method for Multi-Objective RL (Guidobene et al., 14 Aug 2025)