Towards Provable Log Density Policy Gradient (2403.01605v1)
Abstract: Policy gradient methods are a vital ingredient behind the success of modern reinforcement learning. Although successful, modern policy gradient methods introduce a residual error in gradient estimation. In this work, we argue that this residual term is significant and that correcting for it could improve the sample complexity of reinforcement learning methods. To that end, we propose the log density gradient, an estimator of the policy gradient that corrects for this residual error term. The log density gradient method computes the policy gradient using the discounted state-action distribution formulation. We first present the equations needed to compute the log density gradient exactly for tabular Markov Decision Processes (MDPs). For more complex environments, we propose a temporal difference (TD) method that approximates the log density gradient using backward on-policy samples. Since backward sampling from a Markov chain is highly restrictive, we also propose a min-max optimization that approximates the log density gradient using only on-policy samples. We prove uniqueness of the solution and convergence under linear function approximation for this min-max optimization. Finally, we show that the sample complexity of our min-max optimization is of order $m^{-1/2}$, where $m$ is the number of on-policy samples. We also demonstrate a proof of concept for our log density gradient method on a gridworld environment and observe that it improves upon the classical policy gradient method by a clear margin, indicating a promising direction for developing reinforcement learning algorithms that require fewer samples.
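As a minimal sketch of the identity underlying this approach (written in standard notation, which may differ from the paper's), let $d^{\pi_\theta}_\gamma(s,a) = (1-\gamma)\sum_{t\geq 0}\gamma^t \Pr(s_t = s, a_t = a \mid \pi_\theta, d_0)$ denote the discounted state-action distribution. The objective can then be written as an expectation under this distribution, and differentiating it gives the log density gradient form of the policy gradient:

$$J(\theta) = \frac{1}{1-\gamma}\,\mathbb{E}_{(s,a)\sim d^{\pi_\theta}_\gamma}\big[r(s,a)\big], \qquad \nabla_\theta J(\theta) = \frac{1}{1-\gamma}\,\mathbb{E}_{(s,a)\sim d^{\pi_\theta}_\gamma}\big[r(s,a)\,\nabla_\theta \log d^{\pi_\theta}_\gamma(s,a)\big].$$

Under this view, estimating $\nabla_\theta \log d^{\pi_\theta}_\gamma$ (exactly in the tabular case, or approximately via the TD and min-max procedures described above) yields the policy gradient without the residual term introduced by standard estimators.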