
Negative Policy Gradients in RL

Updated 5 November 2025
  • Negative Policy Gradients are update directions in reinforcement learning that decrease performance due to discrepancies like discount omission and estimation bias.
  • They illustrate how adverse mechanisms—such as worst-case risk measures and improper token penalization—can drive policies toward globally pessimal outcomes.
  • Mitigation strategies like restoring theoretical discounts, employing KL regularization, and selective gradient attenuation help realign updates with intended performance.

Negative Policy Gradients represent a class of phenomena and mathematical structures in reinforcement learning (RL) where policy gradient-based optimization may yield update directions that lead to decreasing, rather than increasing, performance with respect to intended objectives. This concept encompasses several strands of both theoretical and empirical research, including policy updates that are not the true gradient of any objective, policy gradients designed to minimize risk (such as in adversarial or worst-case settings), pathological convergence to globally pessimal policies, and phenomena such as "Lazy Likelihood Displacement" caused by the naive penalization of semantically similar negative training instances.

1. Mathematical and Theoretical Foundations

The classical Policy Gradient Theorem for a parametric policy $\pi_\theta$ targeting the discounted sum of rewards yields

$$\nabla J_\gamma(\theta) = \mathbb{E}\left[ \sum_{t=0}^{\infty} \gamma^t \, \nabla_\theta \log \pi_\theta(S_t, A_t) \, Q^\pi(S_t, A_t) \right].$$

However, in practice most RL algorithms employ an update that omits the outer $\gamma^t$ state discount, producing

$$\nabla J_?(\theta) = \mathbb{E}\left[ \sum_{t=0}^{\infty} \nabla_\theta \log \pi_\theta(S_t, A_t) \, Q^\pi(S_t, A_t) \right].$$

Nota and Thomas prove that this practical update is not the gradient of any scalar objective for general MDPs with $\gamma < 1$ (Nota et al., 2019). By the Clairaut-Schwarz theorem on mixed second derivatives, the empirical policy gradient can have nonzero curl, which precludes the existence of any function $J_?$ whose gradient equals the update direction.

This failure of conservative vector field structure has direct consequences:

  • No guarantee that steps are ascent: The policy update may reduce the intended performance metric (e.g., expected discounted return).
  • Pathological fixed points: RL agents can converge to policies that are globally pessimal with respect to both the discounted and undiscounted objectives, as constructions in (Nota et al., 2019) show.

Table: Theoretical Guarantees of Policy Gradient Updates

| Property | Theoretical Gradient ($\nabla J_\gamma$) | Practical PG ($\nabla J_?$) |
|---|---|---|
| Outer Discount Present | Yes ($\gamma^t$) | No |
| Proven Gradient of Objective | Yes | No |
| Stationary Point Optimality | Yes (for $J_\gamma$) | Can be globally pessimal |
| SGD Convergence Guarantees | Yes | No |
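
For concreteness, the following minimal sketch (illustrative only, not code from the cited papers) computes both update directions from the table above on a single trajectory; only the first is an unbiased sample of $\nabla J_\gamma$, and the two generally point in different directions.

```python
import numpy as np

def pg_estimates(grad_logp, returns, gamma=0.99):
    """Single-trajectory policy gradient estimates, with and without the
    outer gamma^t state weighting.

    grad_logp : (T, d) array of grad_theta log pi_theta(S_t, A_t)
    returns   : (T,) array of discounted returns G_t (stand-in for Q^pi(S_t, A_t))
    """
    T = len(returns)
    state_weights = gamma ** np.arange(T)  # the gamma^t factor required by the theorem
    # Theoretical estimator: keeps gamma^t, so it samples grad J_gamma.
    grad_theoretical = (state_weights[:, None] * grad_logp * returns[:, None]).sum(axis=0)
    # Common practical estimator: drops gamma^t, over-weighting late states;
    # this direction is provably not the gradient of any scalar objective.
    grad_practical = (grad_logp * returns[:, None]).sum(axis=0)
    return grad_theoretical, grad_practical

rng = np.random.default_rng(0)
g_theory, g_practice = pg_estimates(rng.normal(size=(5, 2)), rng.normal(size=5))
print(g_theory, g_practice)  # the two directions generally differ
```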

2. Negative Policy Gradients: Pathological and Adversarial Directions

Negative policy gradients can manifest in at least three distinct senses:

A. Policy Gradients That Reduce Performance

Empirical and constructive counterexamples (Nota et al., 2019) demonstrate that RL policy updates may:

  • Steadily increase the probability of actions leading to the lowest possible reward, if the policy gradient is not computed as the true gradient.
  • Move the agent away from all optima of reasonable objectives, potentially converging to globally pessimal policies.

This is a direct effect of algorithms following update directions that are not ascent directions for any relevant or desired objective.

B. Adversarial or Worst-Case Optimization

Adversarial RL and risk-sensitive RL formalize updates that intentionally incorporate negative gradients:

  • The "Worst Cases Policy Gradients" framework (Tang et al., 2019) optimizes Conditional Value-at-Risk (CVaR) objectives:

$$\mathrm{CVaR}_\alpha(s,a) = \mathbb{E}\left[R \mid R \leq \mathrm{pcntl}(\alpha)\right]$$

The policy gradient is then with respect to this risk-averse measure:

$$\nabla_\theta J_\alpha = \mathbb{E}_{s,a}\left[\nabla_\theta \log \pi_\theta(a \mid s)\, \Gamma^\pi(s, a, \alpha)\right]$$

where $\Gamma^\pi$ penalizes the left tail of the return distribution, producing updates that "repel" policies from risky actions, even at the cost of mean reward (a schematic Monte Carlo version of this update is sketched after this list).

  • In adversarial multi-agent or robust RL, negative gradients represent minimizing the opponent's expected value (i.e., minimax solutions).
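
A schematic Monte Carlo version of such a risk-averse update is sketched below. The tail weighting `episode_returns - threshold` is an illustrative stand-in and is not the exact $\Gamma^\pi$ estimator of Tang et al. (2019); it only shows how restricting the update to the worst $\alpha$-fraction of episodes pushes the policy away from trajectories with catastrophic returns.

```python
import numpy as np

def cvar_policy_gradient(grad_logp_episode, episode_returns, alpha=0.1):
    """Schematic CVaR-style REINFORCE update over a batch of episodes.

    grad_logp_episode : (N, d) array, sum_t grad_theta log pi(a_t|s_t) per episode
    episode_returns   : (N,) array, total return R of each episode
    alpha             : tail level; only the worst alpha-fraction of episodes
                        contributes, so ascent along this direction lowers the
                        probability of trajectories in the left tail.
    """
    threshold = np.quantile(episode_returns, alpha)   # stand-in for pcntl(alpha)
    tail = episode_returns <= threshold               # left tail of the return distribution
    # Negative weights for tail episodes (illustrative choice of weighting).
    weights = np.where(tail, episode_returns - threshold, 0.0)
    return (weights[:, None] * grad_logp_episode).mean(axis=0)

rng = np.random.default_rng(1)
grad = cvar_policy_gradient(rng.normal(size=(64, 4)), rng.normal(size=64))
print(grad.shape)  # (4,)
```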

C. Negative Gradients from Token- or Response-Level Penalization in LLM RL

In LLM RL training using preference optimization (e.g., Group Relative Policy Optimization, GRPO), negative policy gradients arise from penalizing all tokens in incorrect completions. Recent work (Deng et al., 24 May 2025) identifies "Lazy Likelihood Displacement" (LLD): negative gradients on tokens with high semantic similarity to correct answers can reduce, rather than increase, the likelihood of correct responses—a phenomenon not predicted by the standard framework. Penalization strategies that ignore these token-wise dependencies yield negative transfer effects and are mitigated by selective downweighting (e.g., NTHR).
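
A schematic version of this kind of selective attenuation is sketched below. The names `similarity_to_correct` and `tau` are assumptions introduced for illustration, and the simple thresholding rule is not the exact NTHR criterion of Deng et al. (24 May 2025); the point is only that negative gradients can be downweighted token-wise rather than applied uniformly to an incorrect response.

```python
import numpy as np

def attenuated_negative_loss(token_logp, advantage, similarity_to_correct, tau=0.7):
    """Schematic GRPO-style token loss with selective attenuation of negative gradients.

    token_logp            : (T,) log-probabilities of tokens in one sampled response
    advantage             : scalar group-relative advantage (negative for bad responses)
    similarity_to_correct : (T,) assumed per-token similarity score to correct responses
    tau                   : assumed threshold above which penalization is downweighted
    """
    weights = np.ones_like(token_logp)
    if advantage < 0:
        # Attenuate the penalty on tokens that closely resemble tokens of correct
        # responses, so their likelihood is not dragged down alongside the bad response.
        weights = np.where(similarity_to_correct > tau, 0.1, 1.0)
    # Policy-gradient-style loss: minimizing it raises logp when advantage > 0
    # and lowers it (selectively) when advantage < 0.
    return -(advantage * weights * token_logp).sum()

loss = attenuated_negative_loss(
    token_logp=np.log(np.array([0.4, 0.2, 0.6])),
    advantage=-1.0,
    similarity_to_correct=np.array([0.9, 0.1, 0.8]),
)
print(loss)
```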

3. Root Causes: Discount Omission, Surrogate Objectives, and Estimation Bias

The foundational source of most negative policy gradient phenomena is the deviation between the actual update rule and the true functional gradient:

  • Discount Omission: Common RL implementations drop the outer $\gamma^t$ in the state weighting, violating the assumptions of the Policy Gradient Theorem (Nota et al., 2019, Pan et al., 2023).
  • Estimation Bias: Due to state distribution shifts or off-policy sampling, the gradient estimate may be biased away from the direction that increases the desired return (Pan et al., 2023, Ilyas et al., 2018).
  • Surrogate Objectives: Optimization of proxy objectives, such as PPO's clipped loss or GRPO's group preference loss, can produce misalignment between surrogate ascent and real return improvement (Ilyas et al., 2018, Markowitz et al., 2023, Deng et al., 24 May 2025).

Table: Sources and Mitigations of Negative Policy Gradient Effects

| Source | Manifestation | Possible Mitigation |
|---|---|---|
| Discount Omission | Updates not ascent for $J_\gamma$ | Restore full discount |
| State Dist. Bias | Updates from off-policy samples | Small/adaptive learning rate, KL regularization |
| Surrogate Loss | Surrogate ≠ real return direction | Align objective, control bias |
| Token Penalization | LLD in LLM optimization | Selective penalty attenuation |
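
As a concrete instance of the "Surrogate Loss" row, the sketch below computes the standard PPO clipped surrogate; ascending this proxy does not by itself guarantee improvement of the real return once its sampling and clipping assumptions are violated. This is the textbook form of the objective, not any particular implementation.

```python
import numpy as np

def ppo_clipped_surrogate(logp_new, logp_old, advantages, eps=0.2):
    """Standard PPO clipped surrogate objective (to be maximized).

    logp_new, logp_old : (N,) log-probs of sampled actions under the new / behavior policy
    advantages         : (N,) advantage estimates computed from the behavior policy's data
    """
    ratio = np.exp(logp_new - logp_old)                 # importance weight
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps)
    # Pessimistic min: discards improvements that rely on pushing the ratio
    # outside the clip range, so the surrogate is not the true return.
    return np.minimum(ratio * advantages, clipped * advantages).mean()

rng = np.random.default_rng(2)
lp_old = rng.normal(size=128) - 1.0
print(ppo_clipped_surrogate(lp_old + 0.1 * rng.normal(size=128), lp_old,
                            rng.normal(size=128)))
```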

4. Empirical Pathologies and Case Studies

Empirical studies and constructed MDPs reveal a range of pathologies traceable to negative policy gradients:

  • Globally Pessimal Convergence: The constructed environment in (Nota et al., 2019) provides an explicit MDP where the practical policy gradient leads all initializations to the worst-possible fixed point.
  • Long-Horizon Instability: Sequence modeling with history-based world models can cause gradient explosion, leading to negative or meaningless policy gradients as the planning horizon increases (Ma et al., 7 Feb 2024).
  • Token-Level Interference in LLMs: In math QA tuning, GRPO with naive negative gradient assignment produces LLD, observable as a stagnation or drop in log-likelihood for correct solutions (Deng et al., 24 May 2025).

5. Negative Policy Gradients and Robust Optimization

While frequently a pathology, negative policy gradients are methodologically central in robust or risk-averse RL:

  • CVaR and Risk-Constrained Policy Gradients: The "Worst Cases Policy Gradients" objective (Tang et al., 2019) explicitly moves policy parameters to improve under the worst-case quantile of the return distribution, employing negative gradient components to avoid catastrophic events.
  • Exploration-Driven Pessimism: The clipped-objective policy gradient (COPG) (Markowitz et al., 2023) applies clipping to the surrogate objective itself, yielding pessimistic updates that enhance exploration and reduce the risk of premature convergence.

6. Mitigation Strategies and Corrective Techniques

Addressing unwanted negative policy gradients and their side effects requires methodological alignment with the true objective or the careful design of update protocols:

  • Restoration of Theoretical Discounts: Algorithms that correctly implement the $\gamma^t$ state weighting in the policy gradient maintain correspondence with the discounted expected return and recover standard stochastic gradient ascent guarantees for $J_\gamma$ (Nota et al., 2019).
  • KL Regularization and Small/Adaptive Learning Rates: These reduce state-distribution mismatch, lowering the risk that parameter updates move away from the intended objective (Pan et al., 2023); a minimal sketch of a KL-regularized surrogate follows this list.
  • Selective Negative Gradient Attenuation: In LLM RL via GRPO, the NTHR method identifies tokens in incorrect responses whose penalization would unduly displace probability from correct responses, and downweights negative gradients for these tokens (Deng et al., 24 May 2025).
  • Architectural Reparameterization: In long-horizon model-based RL, using action-conditioned world models (AWMs) rather than history-based ones can prevent the accumulation of circuitous or adversarial gradient paths (Ma et al., 7 Feb 2024).
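
The KL-regularization strategy above can be pictured as adding a penalty toward a reference policy to the surrogate being ascended. The sketch below is a minimal illustration for discrete action distributions with an assumed coefficient `beta`; it shows the shape of the penalty, not the specific scheme of Pan et al. (2023).

```python
import numpy as np

def kl_regularized_objective(logp_new_actions, advantages, pi_new, pi_ref, beta=0.05):
    """Policy-gradient surrogate with a KL penalty toward a reference policy.

    logp_new_actions : (N,) log pi_new(a_i | s_i) for the sampled actions
    advantages       : (N,) advantage estimates
    pi_new, pi_ref   : (N, A) action distributions of the new / reference policy
                       at the sampled states (A = number of discrete actions)
    beta             : assumed penalty coefficient; larger values keep updates
                       closer to the reference and limit state-distribution drift
    """
    pg_term = (logp_new_actions * advantages).mean()
    kl = (pi_new * (np.log(pi_new) - np.log(pi_ref))).sum(axis=1).mean()
    return pg_term - beta * kl   # maximize: ascend the return proxy, stay near pi_ref

rng = np.random.default_rng(3)
p_ref = rng.dirichlet(np.ones(4), size=32)
p_new = 0.9 * p_ref + 0.1 * rng.dirichlet(np.ones(4), size=32)
acts = p_new.argmax(axis=1)  # greedy actions, just for the demo
print(kl_regularized_objective(np.log(p_new[np.arange(32), acts]),
                               rng.normal(size=32), p_new, p_ref))
```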

7. Practical Implications, Open Problems, and Future Directions

Negative policy gradients—both as a theoretical pathology and as a robust optimization mechanism—continue to shape research on safe, interpretable, and reliable reinforcement learning algorithms:

  • Empirical Success Despite Pathologies: In practical RL, empirical performance is often strong in spite of the existence of negative policy gradient vectors that can lead to pathological behaviors in theory; the origins of this discrepancy are not yet fully understood (Nota et al., 2019).
  • Alignment in LLM Tuning: Instructing LLMs using RL from human feedback or preference data has made negative policy gradient effects particularly visible and actionable, as in the development and deployment of techniques like NTHR (Deng et al., 24 May 2025).
  • Risk-Aware Policy Optimization: The formal use of negative policy gradients for CVaR or other risk measures provides a principled route to safety-critical and distribution-shift robust policy optimization (Tang et al., 2019).

A plausible implication is that developing RL algorithms with both strong theoretical guarantees and robust empirical performance will require techniques that explicitly account for and, where appropriate, exploit negative policy gradients by aligning update directions with user-specified risk profiles or objective semantics.

