
Negative Policy Gradients in RL

Updated 5 November 2025
  • Negative Policy Gradients are update directions in reinforcement learning that decrease performance due to discrepancies like discount omission and estimation bias.
  • They illustrate how adverse mechanisms—such as worst-case risk measures and improper token penalization—can drive policies toward globally pessimal outcomes.
  • Mitigation strategies like restoring theoretical discounts, employing KL regularization, and selective gradient attenuation help realign updates with intended performance.

Negative Policy Gradients represent a class of phenomena and mathematical structures in reinforcement learning (RL) where policy gradient-based optimization may yield update directions that lead to decreasing, rather than increasing, performance with respect to intended objectives. This concept encompasses several strands of both theoretical and empirical research, including policy updates that are not the true gradient of any objective, policy gradients designed to minimize risk (such as in adversarial or worst-case settings), pathological convergence to globally pessimal policies, and phenomena such as "Lazy Likelihood Displacement" caused by the naive penalization of semantically similar negative training instances.

1. Mathematical and Theoretical Foundations

The classical Policy Gradient Theorem for a parametric policy $\pi_\theta$ targeting the discounted sum of rewards yields

$$\nabla J_\gamma(\theta) = \mathbb{E}\left[ \sum_{t=0}^{\infty} \gamma^t \, \nabla_\theta \log \pi_\theta(S_t, A_t) \, Q^\pi(S_t, A_t) \right].$$

However, in practice most RL algorithms employ an update that omits the outer $\gamma^t$ state discount, producing

$$\nabla J_?(\theta) = \mathbb{E}\left[ \sum_{t=0}^{\infty} \nabla_\theta \log \pi_\theta(S_t, A_t) \, Q^\pi(S_t, A_t) \right].$$

Nota and Thomas prove that this practical update is not the gradient of any scalar objective for general MDPs with $\gamma < 1$ (Nota et al., 2019). By the Clairaut-Schwarz theorem on mixed second derivatives, the empirical policy gradient can have nonzero curl, which precludes the existence of any function $J_?$ whose gradient equals the update direction.

This failure of conservative vector field structure has direct consequences:

  • No guarantee that steps are ascent: The policy update may reduce the intended performance metric (e.g., expected discounted return).
  • Pathological fixed points: RL agents can converge to policies that are globally pessimal with respect to both the discounted and undiscounted objectives, as constructions in (Nota et al., 2019) show.

Table: Theoretical Guarantees of Policy Gradient Updates

| Property | Theoretical Gradient ($\nabla J_\gamma$) | Practical PG ($\nabla J_?$) |
|---|---|---|
| Outer Discount Present | Yes ($\gamma^t$) | No |
| Proven Gradient of Objective | Yes | No |
| Stationary Point Optimality | Yes (for $J_\gamma$) | Can be globally pessimal |
| SGD Convergence Guarantees | Yes | No |
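
For concreteness, the following minimal sketch (illustrative only, not code from the cited papers) computes both update directions from the table above on a single trajectory; only the first is an unbiased sample of $\nabla J_\gamma$, and the two generally point in different directions.

```python
import numpy as np

def pg_estimates(grad_logp, returns, gamma=0.99):
    """Single-trajectory policy gradient estimates, with and without the
    outer gamma^t state weighting.

    grad_logp : (T, d) array of grad_theta log pi_theta(S_t, A_t)
    returns   : (T,) array of discounted returns G_t (stand-in for Q^pi(S_t, A_t))
    """
    T = len(returns)
    state_weights = gamma ** np.arange(T)  # the gamma^t factor required by the theorem
    # Theoretical estimator: keeps gamma^t, so it samples grad J_gamma.
    grad_theoretical = (state_weights[:, None] * grad_logp * returns[:, None]).sum(axis=0)
    # Common practical estimator: drops gamma^t, over-weighting late states;
    # this direction is provably not the gradient of any scalar objective.
    grad_practical = (grad_logp * returns[:, None]).sum(axis=0)
    return grad_theoretical, grad_practical

rng = np.random.default_rng(0)
g_theory, g_practice = pg_estimates(rng.normal(size=(5, 2)), rng.normal(size=5))
print(g_theory, g_practice)  # the two directions generally differ
```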

2. Negative Policy Gradients: Pathological and Adversarial Directions

Negative policy gradients can manifest in at least three distinct senses:

A. Policy Gradients That Reduce Performance

Empirical and constructive counterexamples (Nota et al., 2019) demonstrate that RL policy updates may:

  • Steadily increase the probability of actions leading to the lowest possible reward, if the policy gradient is not computed as the true gradient.
  • Move the agent away from all optima of reasonable objectives, potentially converging to globally pessimal policies.

This is a direct effect of algorithms following update directions that are not ascent directions for any relevant or desired objective.

B. Adversarial or Worst-Case Optimization

Adversarial RL and risk-sensitive RL formalize updates that intentionally incorporate negative gradients:

  • The "Worst Cases Policy Gradients" framework (Tang et al., 2019) optimizes Conditional Value-at-Risk (CVaR) objectives:

$$\mathrm{CVaR}_\alpha(s,a) = \mathbb{E}\left[R \mid R \leq \mathrm{pcntl}(\alpha)\right]$$

The policy gradient is then with respect to this risk-averse measure:

$$\nabla_\theta J_\alpha = \mathbb{E}_{s,a}\left[\nabla_\theta \log \pi_\theta(a \mid s)\, \Gamma^\pi(s, a, \alpha)\right]$$

where $\Gamma^\pi$ penalizes the left tail of the return distribution, producing updates that "repel" policies from risky actions, even at the cost of mean reward (a schematic Monte Carlo version of this update is sketched after this list).

  • In adversarial multi-agent or robust RL, negative gradients represent minimizing the opponent's expected value (i.e., minimax solutions).
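
A schematic Monte Carlo version of such a risk-averse update is sketched below. The tail weighting `episode_returns - threshold` is an illustrative stand-in and is not the exact $\Gamma^\pi$ estimator of Tang et al. (2019); it only shows how restricting the update to the worst $\alpha$-fraction of episodes pushes the policy away from trajectories with catastrophic returns.

```python
import numpy as np

def cvar_policy_gradient(grad_logp_episode, episode_returns, alpha=0.1):
    """Schematic CVaR-style REINFORCE update over a batch of episodes.

    grad_logp_episode : (N, d) array, sum_t grad_theta log pi(a_t|s_t) per episode
    episode_returns   : (N,) array, total return R of each episode
    alpha             : tail level; only the worst alpha-fraction of episodes
                        contributes, so ascent along this direction lowers the
                        probability of trajectories in the left tail.
    """
    threshold = np.quantile(episode_returns, alpha)   # stand-in for pcntl(alpha)
    tail = episode_returns <= threshold               # left tail of the return distribution
    # Negative weights for tail episodes (illustrative choice of weighting).
    weights = np.where(tail, episode_returns - threshold, 0.0)
    return (weights[:, None] * grad_logp_episode).mean(axis=0)

rng = np.random.default_rng(1)
grad = cvar_policy_gradient(rng.normal(size=(64, 4)), rng.normal(size=64))
print(grad.shape)  # (4,)
```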

C. Negative Gradients from Token- or Response-Level Penalization in LLM RL

In LLM RL training using preference optimization (e.g., Group Relative Policy Optimization, GRPO), negative policy gradients arise from penalizing all tokens in incorrect completions. Recent work (Deng et al., 24 May 2025) identifies "Lazy Likelihood Displacement" (LLD): negative gradients on tokens with high semantic similarity to correct answers can reduce, rather than increase, the likelihood of correct responses—a phenomenon not predicted by the standard framework. Penalization strategies that ignore these token-wise dependencies yield negative transfer effects and are mitigated by selective downweighting (e.g., NTHR).
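
A schematic version of this kind of selective attenuation is sketched below. The names `similarity_to_correct` and `tau` are assumptions introduced for illustration, and the simple thresholding rule is not the exact NTHR criterion of Deng et al. (24 May 2025); the point is only that negative gradients can be downweighted token-wise rather than applied uniformly to an incorrect response.

```python
import numpy as np

def attenuated_negative_loss(token_logp, advantage, similarity_to_correct, tau=0.7):
    """Schematic GRPO-style token loss with selective attenuation of negative gradients.

    token_logp            : (T,) log-probabilities of tokens in one sampled response
    advantage             : scalar group-relative advantage (negative for bad responses)
    similarity_to_correct : (T,) assumed per-token similarity score to correct responses
    tau                   : assumed threshold above which penalization is downweighted
    """
    weights = np.ones_like(token_logp)
    if advantage < 0:
        # Attenuate the penalty on tokens that closely resemble tokens of correct
        # responses, so their likelihood is not dragged down alongside the bad response.
        weights = np.where(similarity_to_correct > tau, 0.1, 1.0)
    # Policy-gradient-style loss: minimizing it raises logp when advantage > 0
    # and lowers it (selectively) when advantage < 0.
    return -(advantage * weights * token_logp).sum()

loss = attenuated_negative_loss(
    token_logp=np.log(np.array([0.4, 0.2, 0.6])),
    advantage=-1.0,
    similarity_to_correct=np.array([0.9, 0.1, 0.8]),
)
print(loss)
```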

3. Root Causes: Discount Omission, Surrogate Objectives, and Estimation Bias

The foundational source of most negative policy gradient phenomena is the deviation between the actual update rule and the true functional gradient:

  • Discount Omission: Common RL implementations drop the outer $\gamma^t$ in the state weighting, violating the assumptions of the Policy Gradient Theorem (Nota et al., 2019, Pan et al., 2023).
  • Estimation Bias: Due to state distribution shifts or off-policy sampling, the gradient estimate may be biased away from the direction that increases the desired return (Pan et al., 2023, Ilyas et al., 2018).
  • Surrogate Objectives: Optimization of proxy objectives, such as PPO's clipped loss or GRPO's group preference loss, can produce misalignment between surrogate ascent and real return improvement (Ilyas et al., 2018, Markowitz et al., 2023, Deng et al., 24 May 2025).

Table: Sources and Mitigations of Negative Policy Gradient Effects

| Source | Manifestation | Possible Mitigation |
|---|---|---|
| Discount Omission | Updates not ascent for $J_\gamma$ | Restore full discount |
| State Dist. Bias | Updates from off-policy samples | Small/adaptive learning rate, KL regularization |
| Surrogate Loss | Surrogate ≠ real return direction | Align objective, control bias |
| Token Penalization | LLD in LLM optimization | Selective penalty attenuation |
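
As a concrete instance of the "Surrogate Loss" row, the sketch below computes the standard PPO clipped surrogate; ascending this proxy does not by itself guarantee improvement of the real return once its sampling and clipping assumptions are violated. This is the textbook form of the objective, not any particular implementation.

```python
import numpy as np

def ppo_clipped_surrogate(logp_new, logp_old, advantages, eps=0.2):
    """Standard PPO clipped surrogate objective (to be maximized).

    logp_new, logp_old : (N,) log-probs of sampled actions under the new / behavior policy
    advantages         : (N,) advantage estimates computed from the behavior policy's data
    """
    ratio = np.exp(logp_new - logp_old)                 # importance weight
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps)
    # Pessimistic min: discards improvements that rely on pushing the ratio
    # outside the clip range, so the surrogate is not the true return.
    return np.minimum(ratio * advantages, clipped * advantages).mean()

rng = np.random.default_rng(2)
lp_old = rng.normal(size=128) - 1.0
print(ppo_clipped_surrogate(lp_old + 0.1 * rng.normal(size=128), lp_old,
                            rng.normal(size=128)))
```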

4. Empirical Pathologies and Case Studies

Empirical studies and constructed MDPs reveal a range of pathologies traceable to negative policy gradients:

  • Globally Pessimal Convergence: The constructed environment in (Nota et al., 2019) provides an explicit MDP where the practical policy gradient leads all initializations to the worst-possible fixed point.
  • Long-Horizon Instability: Sequence modeling with history-based world models can cause gradient explosion, leading to negative or meaningless policy gradients as the planning horizon increases (Ma et al., 7 Feb 2024).
  • Token-Level Interference in LLMs: In math QA tuning, GRPO with naive negative gradient assignment produces LLD, observable as a stagnation or drop in log-likelihood for correct solutions (Deng et al., 24 May 2025).

5. Negative Policy Gradients and Robust Optimization

While frequently a pathology, negative policy gradients are methodologically central in robust or risk-averse RL:

  • CVaR and Risk-Constrained Policy Gradients: The "Worst Cases Policy Gradients" objective (Tang et al., 2019) explicitly moves policy parameters to improve under the worst-case quantile of the return distribution, employing negative gradient components to avoid catastrophic events.
  • Exploration-Driven Pessimism: The clipped-objective policy gradient (COPG) (Markowitz et al., 2023) applies clipping to the surrogate objective itself, yielding pessimistic updates that enhance exploration and reduce the risk of premature convergence.

6. Mitigation Strategies and Corrective Techniques

Addressing unwanted negative policy gradients and their side effects requires methodological alignment with the true objective or the careful design of update protocols:

  • Restoration of Theoretical Discounts: Algorithms that correctly implement the $\gamma^t$ state weighting in the policy gradient maintain correspondence with the discounted expected return and recover standard stochastic gradient ascent guarantees for $J_\gamma$ (Nota et al., 2019).
  • KL Regularization and Small/Adaptive Learning Rates: These reduce state-distribution mismatch, lowering the risk that parameter updates move away from the intended objective (Pan et al., 2023); a minimal sketch of a KL-regularized surrogate follows this list.
  • Selective Negative Gradient Attenuation: In LLM RL via GRPO, the NTHR method identifies tokens in incorrect responses whose penalization would unduly displace probability from correct responses, and downweights negative gradients for these tokens (Deng et al., 24 May 2025).
  • Architectural Reparameterization: In long-horizon model-based RL, using action-conditioned world models (AWMs) rather than history-based ones can prevent the accumulation of circuitous or adversarial gradient paths (Ma et al., 7 Feb 2024).
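
The KL-regularization strategy above can be pictured as adding a penalty toward a reference policy to the surrogate being ascended. The sketch below is a minimal illustration for discrete action distributions with an assumed coefficient `beta`; it shows the shape of the penalty, not the specific scheme of Pan et al. (2023).

```python
import numpy as np

def kl_regularized_objective(logp_new_actions, advantages, pi_new, pi_ref, beta=0.05):
    """Policy-gradient surrogate with a KL penalty toward a reference policy.

    logp_new_actions : (N,) log pi_new(a_i | s_i) for the sampled actions
    advantages       : (N,) advantage estimates
    pi_new, pi_ref   : (N, A) action distributions of the new / reference policy
                       at the sampled states (A = number of discrete actions)
    beta             : assumed penalty coefficient; larger values keep updates
                       closer to the reference and limit state-distribution drift
    """
    pg_term = (logp_new_actions * advantages).mean()
    kl = (pi_new * (np.log(pi_new) - np.log(pi_ref))).sum(axis=1).mean()
    return pg_term - beta * kl   # maximize: ascend the return proxy, stay near pi_ref

rng = np.random.default_rng(3)
p_ref = rng.dirichlet(np.ones(4), size=32)
p_new = 0.9 * p_ref + 0.1 * rng.dirichlet(np.ones(4), size=32)
acts = p_new.argmax(axis=1)  # greedy actions, just for the demo
print(kl_regularized_objective(np.log(p_new[np.arange(32), acts]),
                               rng.normal(size=32), p_new, p_ref))
```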

7. Practical Implications, Open Problems, and Future Directions

Negative policy gradients—both as a theoretical pathology and as a robust optimization mechanism—continue to shape research on safe, interpretable, and reliable reinforcement learning algorithms:

  • Empirical Success Despite Pathologies: In practical RL, empirical performance is often strong in spite of the existence of negative policy gradient vectors that can lead to pathological behaviors in theory; the origins of this discrepancy are not yet fully understood (Nota et al., 2019).
  • Alignment in LLM Tuning: Instructing LLMs using RL from human feedback or preference data has made negative policy gradient effects particularly visible and actionable, as in the development and deployment of techniques like NTHR (Deng et al., 24 May 2025).
  • Risk-Aware Policy Optimization: The formal use of negative policy gradients for CVaR or other risk measures provides a principled route to safety-critical and distribution-shift robust policy optimization (Tang et al., 2019).

A plausible implication is that developing RL algorithms with both strong theoretical guarantees and robust empirical performance will require techniques that explicitly account for and, where appropriate, exploit negative policy gradients by aligning update directions with user-specified risk profiles or objective semantics.

