Correcting discount-factor mismatch in on-policy policy gradient methods (2306.13284v1)
Abstract: The policy gradient theorem gives a convenient form of the policy gradient in terms of three factors: an action value, the gradient of the action likelihood, and a state distribution involving discounting called the \emph{discounted stationary distribution}. But commonly used on-policy methods based on the policy gradient theorem ignore the discount factor in the state distribution, which is technically incorrect and may even cause degenerate learning behavior in some environments. An existing solution corrects this discrepancy by using $\gamma^t$ as a factor in the gradient estimate. However, this solution is not widely adopted and does not work well in tasks where later states are similar to earlier states. We introduce a novel distribution correction to account for the discounted stationary distribution that can be plugged into many existing gradient estimators. Our correction circumvents the performance degradation associated with the $\gamma^t$ correction while having lower variance. Importantly, compared to the uncorrected estimators, our algorithm provides improved state emphasis to evade suboptimal policies in certain environments, and it consistently matches or exceeds the original performance on several OpenAI Gym and DeepMind Control Suite benchmarks.
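For reference, the mismatch described in the abstract can be made explicit. The sketch below uses standard notation ($d^{\pi}_{\gamma}$ for the discounted stationary distribution, $q^{\pi}$ for the action value, $\hat{g}$ for a gradient estimate) rather than the paper's exact symbols. The policy gradient theorem weights states by the \emph{discounted} stationary distribution:

$$\nabla_\theta J(\theta) \;\propto\; \sum_{s} d^{\pi}_{\gamma}(s) \sum_{a} q^{\pi}(s,a)\, \nabla_\theta \pi_\theta(a \mid s), \qquad d^{\pi}_{\gamma}(s) = (1-\gamma) \sum_{t=0}^{\infty} \gamma^{t}\, \Pr(S_t = s \mid \pi).$$

Common on-policy estimators sample states from the undiscounted visitation distribution and drop the $\gamma^t$ weighting, which is biased with respect to the discounted objective; the classical fix reweights the $t$-th term by $\gamma^t$, restoring the correct state weighting but letting the weights, and hence the effective sample size for later states, decay toward zero:

$$\hat{g}_{\text{uncorrected}} = \sum_{t \ge 0} q^{\pi}(S_t, A_t)\, \nabla_\theta \log \pi_\theta(A_t \mid S_t), \qquad \hat{g}_{\gamma^t} = \sum_{t \ge 0} \gamma^{t}\, q^{\pi}(S_t, A_t)\, \nabla_\theta \log \pi_\theta(A_t \mid S_t).$$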
- Achiam, J. Spinning Up in Deep Reinforcement Learning. 2018.
- Random search for hyper-parameter optimization. Journal of Machine Learning Research, 13(2), 2012.
- OpenAI Gym, 2016.
- Bayesian Q-learning with imperfect expert demonstrations, 2022.
- Phasic policy gradient. In International Conference on Machine Learning, pp. 2020–2027. PMLR, 2021.
- Addressing function approximation error in actor-critic methods. In International Conference on Machine Learning, pp. 1587–1596. PMLR, 2018.
- Off-policy deep reinforcement learning by bootstrapping the covariate shift. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pp. 3647–3655, 2019.
- Probability and random processes. Oxford University Press, 2020.
- Consistent on-line off-policy evaluation. In International Conference on Machine Learning, pp. 1372–1383. PMLR, 2017.
- An off-policy policy gradient theorem using emphatic weightings. Advances in Neural Information Processing Systems, 31, 2018.
- Learning expected emphatic traces for deep RL. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 36, pp. 7015–7023, 2022.
- Kakade, S. M. A natural policy gradient. Advances in Neural Information Processing Systems, 14, 2001.
- Actor-critic algorithms. Advances in Neural Information Processing Systems, 12, 1999.
- Breaking the curse of horizon: Infinite-horizon off-policy estimation. Advances in Neural Information Processing Systems, 31, 2018.
- Off-policy policy gradient with state distribution correction. arXiv preprint arXiv:1904.08473, 2019.
- Emphatic temporal-difference learning. arXiv preprint arXiv:1507.01569, 2015.
- Asynchronous methods for deep reinforcement learning. In International Conference on Machine Learning, pp. 1928–1937. PMLR, 2016.
- Norris, J. R. Markov chains. Number 2. Cambridge University Press, 1998.
- Is the policy gradient a gradient? arXiv preprint arXiv:1906.07073, 2019.
- Trust region policy optimization. In International Conference on Machine Learning, pp. 1889–1897. PMLR, 2015.
- Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.
- Reinforcement learning: An introduction. MIT Press, 2018.
- Policy gradient methods for reinforcement learning with function approximation. Advances in Neural Information Processing Systems, 12, 1999.
- DeepMind Control Suite. arXiv preprint arXiv:1801.00690, 2018.
- Thomas, P. Bias in natural actor-critic algorithms. In International Conference on Machine Learning, pp. 441–448. PMLR, 2014.
- MuJoCo: A physics engine for model-based control. In 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems, pp. 5026–5033. IEEE, 2012. doi: 10.1109/IROS.2012.6386109.
- Williams, R. J. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine learning, 8(3):229–256, 1992.
- Scalable trust-region method for deep reinforcement learning using Kronecker-factored approximation. Advances in Neural Information Processing Systems, 30, 2017.
- A deeper look at discounting mismatch in actor-critic algorithms. In Proceedings of the 21st International Conference on Autonomous Agents and Multiagent Systems, AAMAS ’22, pp. 1491–1499. International Foundation for Autonomous Agents and Multiagent Systems, 2022. ISBN 9781450392136.