- The paper proves that the updates used by common policy gradient algorithms are not true gradients: they fail the conditions (symmetric mixed second derivatives) that any genuine gradient must satisfy, so they are not the gradient of any objective function.
- A counterexample shows that following these biased updates can cause convergence to the worst possible (globally pessimal) policy.
- The authors highlight persistent misconceptions in the literature, urging a reexamination of policy optimization methods to improve RL performance.
Analysis of "Is the Policy Gradient a Gradient?"
The paper "Is the Policy Gradient a Gradient?" by Chris Nota and Philip S. Thomas tackles a notable theoretical question concerning reinforcement learning (RL), specifically the optimization mechanisms employed by policy gradient algorithms. This paper offers critical insights into whether these algorithms are optimizing the objectives they claim to target, namely the expected discounted return in RL environments. The principal contribution of this work is the demonstration that the well-accepted practice of removing the discount factor from the state distribution in policy gradient methods results in an update direction that is not the gradient of any objective function.
Summary of Findings
At the core of the paper is an analysis of the policy gradient theorem, which is traditionally used to derive updates to policy parameters that maximize the expected discounted return. The authors argue that most commonly used policy gradient methods deviate from this theoretical foundation by omitting the discount factor from the state distribution, that is, by dropping the γ^t weighting on each visited state while keeping the discount inside the return. This omission raises the question of what, if anything, these methods are actually optimizing.
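The omission is easiest to see in a REINFORCE-style episodic update. The sketch below is a minimal illustration, assuming a hypothetical `log_prob_grad(state, action)` helper that returns the gradient of log π(a|s) with respect to the policy parameters; it is not the authors' code. The only point of interest is the `gamma ** t` factor that the unbiased estimator carries and that typical implementations drop.

```python
import numpy as np

def reinforce_updates(trajectory, log_prob_grad, gamma=0.99):
    """Compare the true discounted-gradient estimate with the common biased update.

    trajectory:    list of (state, action, reward) tuples from one episode
    log_prob_grad: hypothetical helper (state, action) -> gradient of
                   log pi(action | state) with respect to the policy parameters
    """
    rewards = [r for (_, _, r) in trajectory]
    T = len(trajectory)

    # Discounted return from each step: G_t = sum_{k >= t} gamma^(k - t) * R_k
    returns = np.zeros(T)
    running = 0.0
    for t in reversed(range(T)):
        running = rewards[t] + gamma * running
        returns[t] = running

    true_grad = 0.0      # unbiased estimate of the gradient of E[sum_t gamma^t R_t]
    common_update = 0.0  # what most implementations compute instead
    for t, (s, a, _) in enumerate(trajectory):
        g = log_prob_grad(s, a)
        # Unbiased estimator: each step is weighted by gamma^t, matching the
        # discounted state distribution in the policy gradient theorem.
        true_grad = true_grad + (gamma ** t) * returns[t] * g
        # Common practice: the gamma^t state weighting is dropped, but gamma is
        # kept inside the return. The paper shows this hybrid is not the
        # gradient of any objective function.
        common_update = common_update + returns[t] * g
    return true_grad, common_update
```

The second quantity corresponds to the update the paper analyzes.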
- Non-Gradient Nature of Policy Updates: The paper shows mathematically that these adjusted updates are not true gradients: their mixed partial derivatives are not symmetric, so by the Clairaut-Schwarz theorem there is no underlying objective function of which they are the gradient (a toy numerical version of this symmetry test appears after this list). This finding highlights the gap between practical implementations and the theoretical guarantees, such as convergence to a local optimum of the objective, that usually come with gradient-based optimization.
- Counterexample of Convergence to a Pessimal Policy: A striking part of the analysis is the construction of a counterexample in which policy gradient methods, as typically implemented, converge to a globally pessimal policy, the worst possible policy with respect to both the discounted and the undiscounted objective. This starkly illustrates what can go wrong when a non-gradient update is treated as if it optimized a well-defined objective.
- Misunderstandings in the Literature: The authors review a collection of well-cited papers and show that errors and ambiguities about these update rules persist in the literature. Many papers either mistakenly assert that the biased updates optimize a specific objective or fail to acknowledge the theoretical implications of their algorithmic choices.
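To illustrate the symmetry argument behind the first point above: a smooth vector field can be the gradient of some function only if its Jacobian is symmetric everywhere (Clairaut-Schwarz). The check below is a generic numerical sketch on a hypothetical two-dimensional field, not the paper's construction; asymmetry at even one point rules out any underlying objective.

```python
import numpy as np

def jacobian(field, theta, eps=1e-6):
    """Central-difference Jacobian of a vector field R^n -> R^n at theta."""
    n = len(theta)
    J = np.zeros((n, n))
    for j in range(n):
        step = np.zeros(n)
        step[j] = eps
        J[:, j] = (field(theta + step) - field(theta - step)) / (2 * eps)
    return J

def could_be_a_gradient(field, theta, tol=1e-4):
    """A gradient field must have a symmetric Jacobian (Clairaut-Schwarz).

    Asymmetry at any single point proves the field is not the gradient of any
    twice continuously differentiable function.
    """
    J = jacobian(field, theta)
    return bool(np.allclose(J, J.T, atol=tol))

# Hypothetical rotational field: its Jacobian is [[0, -1], [1, 0]], which is
# not symmetric, so it cannot be the gradient of any function.
rotational = lambda th: np.array([-th[1], th[0]])
print(could_be_a_gradient(rotational, np.array([0.3, -0.7])))  # -> False
```

The paper applies the same principle to the policy update itself to conclude that no objective function underlies it.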
Implications
This research has significant implications for both the theory and the practice of reinforcement learning. Theoretically, it calls into question the validity of assumptions and conclusions in prior work that relied on a flawed interpretation of policy gradients. Practically, it suggests that designers of RL algorithms may inadvertently guide policies toward suboptimal or even harmful solutions because of these misunderstandings.
Speculation on Future Developments
This paper lays the groundwork for several avenues of further research. Future work might explore alternative formulations of the policy gradient that realign practical implementations with their theoretical foundations, perhaps by correcting existing algorithms or by developing novel frameworks that avoid these biases altogether. Another avenue is to revisit the empirical results of state-of-the-art RL algorithms, assess the impact of these theoretical insights, and test whether the corresponding adjustments improve performance across tasks.
Overall, the findings of Nota and Thomas emphasize the necessity of rigorous theoretical scrutiny in the design and evaluation of machine learning algorithms, highlighting the discrepancies between theory and practice that can arise as more complex systems are developed. By grounding its critique in a fundamental analysis, the paper invites a reassessment of widely used computational strategies and urges the community to adopt more reliable optimization procedures in RL research and applications.