On Bellman's principle of optimality and Reinforcement learning for safety-constrained Markov decision process (2302.13152v3)
Abstract: We study optimality for the safety-constrained Markov decision process, which is the underlying framework for safe reinforcement learning. Specifically, we consider a constrained Markov decision process (with finite states and finite actions) where the goal of the decision maker is to reach a target set while avoiding an unsafe set (or sets) with certain probabilistic guarantees. The underlying Markov chain for any control policy is therefore multichain, since by definition there exist a target set and an unsafe set. The decision maker also has to be optimal (with respect to a cost function) while navigating to the target set, which gives rise to a multi-objective optimization problem. We highlight the fact that Bellman's principle of optimality may not hold for constrained Markov decision problems with an underlying multichain structure (as shown by the counterexample due to Haviv). We resolve the counterexample by formulating the aforementioned multi-objective optimization problem as a zero-sum game and thereafter construct an asynchronous value iteration scheme for the Lagrangian (similar to Shapley's algorithm). Finally, we consider the reinforcement learning problem for the same setting and construct a modified $Q$-learning algorithm for learning the Lagrangian from data. We also provide a lower bound on the number of iterations required for learning the Lagrangian, together with the corresponding error bounds.
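The abstract outlines a Lagrangian route around the failure of Bellman's principle: combine the navigation cost and the safety-violation cost through a multiplier and run an asynchronous value iteration or $Q$-learning scheme on the combined signal. The sketch below is only a minimal illustration of that idea, not the paper's algorithm: tabular Q-learning on a Lagrangian $c + \lambda d$ for an assumed toy chain MDP with a target state and an unsafe state. The chain, cost values, safety budget, and the simple dual-ascent update for $\lambda$ are all assumptions made for illustration.

```python
# Minimal, illustrative sketch (not the paper's exact algorithm): tabular
# Q-learning on a Lagrangian L = cost + lambda * unsafe_cost for a small
# finite constrained MDP with a target state and an unsafe state. The toy
# chain, cost values, budget, and the dual-ascent rule for lambda are all
# assumptions for illustration.
import numpy as np

rng = np.random.default_rng(0)

# Toy 1-D chain: states 0..4; state 4 is the target, state 0 is unsafe.
n_states, n_actions = 5, 2          # actions: 0 = move left, 1 = move right
TARGET, UNSAFE = 4, 0

def step(s, a):
    """One transition: the intended move succeeds w.p. 0.8, else stay put."""
    move = -1 if a == 0 else 1
    s_next = np.clip(s + move, 0, n_states - 1) if rng.random() < 0.8 else s
    cost = 1.0                                      # per-step navigation cost
    unsafe_cost = 1.0 if s_next == UNSAFE else 0.0  # safety-violation signal
    done = s_next in (TARGET, UNSAFE)
    return int(s_next), cost, unsafe_cost, done

gamma, alpha, lam, lam_lr = 0.95, 0.1, 1.0, 0.01
budget = 0.05                        # allowed discounted unsafe cost (assumed)
Q = np.zeros((n_states, n_actions))  # Q-values of the combined Lagrangian cost

for episode in range(5000):
    s, disc_unsafe, discount = 2, 0.0, 1.0
    for _ in range(50):
        # Epsilon-greedy over costs (greedy = argmin).
        a = rng.integers(n_actions) if rng.random() < 0.1 else int(np.argmin(Q[s]))
        s_next, cost, unsafe_cost, done = step(s, a)
        lagrangian = cost + lam * unsafe_cost
        target = lagrangian + (0.0 if done else gamma * Q[s_next].min())
        Q[s, a] += alpha * (target - Q[s, a])   # asynchronous, one pair at a time
        disc_unsafe += discount * unsafe_cost
        discount *= gamma
        s = s_next
        if done:
            break
    # Dual-ascent style multiplier update (a simplified, assumed rule).
    lam = max(0.0, lam + lam_lr * (disc_unsafe - budget))

print("Greedy policy (0=left, 1=right):", Q.argmin(axis=1))
print("Final multiplier lambda:", round(lam, 3))
```

Running the sketch drives the multiplier up whenever the observed discounted unsafe cost exceeds the assumed budget, which pushes the greedy policy away from the unsafe end of the chain; this mirrors, in a crude way, the role the Lagrangian plays in the zero-sum formulation described above.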
- R. Bellman, “Dynamic programming,” Princeton University Press, New Jersey, 1957.
- M. Haviv, “On constrained Markov decision processes,” Operations Research Letters, vol. 19, no. 1, pp. 25–28, 1996.
- L. S. Shapley, “Stochastic games,” Proceedings of the National Academy of Sciences, vol. 39, no. 10, pp. 1095–1100, 1953.
- J. Kober, J. A. Bagnell, and J. Peters, “Reinforcement learning in robotics: A survey,” The International Journal of Robotics Research, vol. 32, no. 11, pp. 1238–1274, 2013.
- B. Lütjens, M. Everett, and J. P. How, “Safe reinforcement learning with model uncertainty estimates,” in 2019 International Conference on Robotics and Automation (ICRA). IEEE, 2019, pp. 8662–8668.
- M. Tejedor, A. Z. Woldaregay, and F. Godtliebsen, “Reinforcement learning application in diabetes blood glucose control: A systematic review,” Artificial Intelligence in Medicine, vol. 104, p. 101836, 2020.
- J. Ding and C. J. Tomlin, “Robust reach-avoid controller synthesis for switched nonlinear systems,” in 49th IEEE Conference on Decision and Control (CDC). IEEE, 2010, pp. 6481–6486.
- S. Summers, M. Kamgarpour, J. Lygeros, and C. Tomlin, “A stochastic reach-avoid problem with random obstacles,” in Proceedings of the 14th International Conference on Hybrid Systems: Computation and Control, 2011, pp. 251–260.
- P. M. Esfahani, D. Chatterjee, and J. Lygeros, “The stochastic reach-avoid problem and set characterization for diffusions,” Automatica, vol. 70, pp. 43–56, 2016.
- R. Wisniewski and M. L. Bujorianu, “Safe dynamic programming,” arXiv preprint arXiv:2109.03307, 2021.
- M. L. Bujorianu, R. Wisniewski, and E. Boulougouris, “Stochastic safety for Markov chains,” IEEE Control Systems Letters, vol. 5, no. 2, pp. 427–432, 2020.
- M. El Chamie, Y. Yu, B. Açıkmeşe, and M. Ono, “Controlled Markov processes with safety state constraints,” IEEE Transactions on Automatic Control, vol. 64, no. 3, pp. 1003–1018, 2018.
- E. K. Chong, S. A. Miller, and J. Adaska, “On Bellman’s principle with inequality constraints,” Operations Research Letters, vol. 40, no. 2, pp. 108–113, 2012.
- Y. Chow and M. Pavone, “A time consistent formulation of risk constrained stochastic optimal control,” arXiv preprint arXiv:1503.07461, 2015.
- R. C. Chen and G. L. Blankenship, “Dynamic programming equations for discounted constrained stochastic control,” IEEE Transactions on Automatic Control, vol. 49, no. 5, pp. 699–709, 2004.
- V. S. Borkar, “An actor-critic algorithm for constrained Markov decision processes,” Systems & Control Letters, vol. 54, no. 3, pp. 207–213, 2005.
- D. Ding, K. Zhang, T. Basar, and M. Jovanovic, “Natural policy gradient primal-dual method for constrained Markov decision processes,” Advances in Neural Information Processing Systems, vol. 33, pp. 8378–8390, 2020.
- Y. Efroni, S. Mannor, and M. Pirotta, “Exploration-exploitation in constrained MDPs,” arXiv preprint arXiv:2003.02189, 2020.
- S. Mannor and N. Shimkin, “A geometric approach to multi-criterion reinforcement learning,” The Journal of Machine Learning Research, vol. 5, pp. 325–360, 2004.
- S. Boyd and L. Vandenberghe, “Convex optimization,” Cambridge University Press, UK, 2004.
- H. J. Kushner, “The Gauss–Seidel numerical procedure for Markov stochastic games,” IEEE Transactions on Automatic Control, vol. 49, no. 10, pp. 1779–1784, 2004.
- E. A. Feinberg, “Constrained discounted Markov decision processes and Hamiltonian cycles,” Mathematics of Operations Research, vol. 25, no. 1, pp. 130–140, 2000.
- M. Kearns and S. Singh, “Near-optimal reinforcement learning in polynomial time,” Machine Learning, vol. 49, pp. 209–232, 2002.