
On Bellman's principle of optimality and Reinforcement learning for safety-constrained Markov decision process (2302.13152v3)

Published 25 Feb 2023 in eess.SY, cs.LG, cs.SY, math.OC, and stat.ML

Abstract: We study optimality for the safety-constrained Markov decision process, which is the underlying framework for safe reinforcement learning. Specifically, we consider a constrained Markov decision process (with finite states and finite actions) where the goal of the decision maker is to reach a target set while avoiding unsafe set(s) with certain probabilistic guarantees. Therefore, the underlying Markov chain for any control policy will be multichain, since by definition there exist a target set and an unsafe set. The decision maker also has to be optimal (with respect to a cost function) while navigating to the target set. This gives rise to a multi-objective optimization problem. We highlight the fact that Bellman's principle of optimality may not hold for constrained Markov decision problems with an underlying multichain structure (as shown by the counterexample due to Haviv). We resolve the counterexample by formulating the aforementioned multi-objective optimization problem as a zero-sum game and thereafter construct an asynchronous value iteration scheme for the Lagrangian (similar to Shapley's algorithm). Finally, we consider the corresponding reinforcement learning problem and construct a modified $Q$-learning algorithm for learning the Lagrangian from data. We also provide a lower bound on the number of iterations required for learning the Lagrangian and corresponding error bounds.
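
To make the Lagrangian idea in the abstract concrete, here is a minimal, illustrative sketch (not the paper's modified $Q$-learning algorithm or its asynchronous value iteration scheme) of tabular Q-learning on a Lagrangian-scalarized objective for a toy reach-avoid chain MDP. The environment, penalty values, safety budget, and step sizes below are all illustrative assumptions, not quantities taken from the paper.

```python
import numpy as np

# Illustrative sketch only: tabular Q-learning on the Lagrangian of a small
# constrained MDP. States 0..4 form a chain; state 4 is the target (absorbing),
# state 0 is unsafe (absorbing). Actions: 0 = left, 1 = right, with a small
# chance of slipping. The scalarized per-step objective is
# cost + lambda * safety_violation, and lambda is updated by projected dual
# ascent. All constants are assumptions chosen for illustration.

rng = np.random.default_rng(0)
n_states, n_actions = 5, 2
TARGET, UNSAFE = 4, 0
SLIP = 0.1            # probability the chosen move is reversed
GAMMA = 0.95
STEP_COST = 1.0       # per-step cost to be minimized
SAFETY_BUDGET = 0.2   # allowed per-episode budget of unsafe-set entries

def step(s, a):
    """One transition of the toy chain MDP."""
    if s in (TARGET, UNSAFE):
        return s, 0.0, 0.0, True                   # absorbing states
    move = 1 if a == 1 else -1
    if rng.random() < SLIP:
        move = -move
    s2 = int(np.clip(s + move, 0, n_states - 1))
    cost = STEP_COST
    violation = 1.0 if s2 == UNSAFE else 0.0       # indicator of unsafe entry
    return s2, cost, violation, s2 in (TARGET, UNSAFE)

Q = np.zeros((n_states, n_actions))   # Q-values of the Lagrangian (to be minimized)
lam = 0.0                             # Lagrange multiplier for the safety constraint
alpha, eta, eps = 0.1, 0.01, 0.1      # learning rate, dual step size, exploration

for episode in range(5000):
    s, done, viol_total = 2, False, 0.0
    while not done:
        a = rng.integers(n_actions) if rng.random() < eps else int(np.argmin(Q[s]))
        s2, cost, viol, done = step(s, a)
        target = cost + lam * viol + (0.0 if done else GAMMA * Q[s2].min())
        Q[s, a] += alpha * (target - Q[s, a])      # TD update on the Lagrangian
        viol_total += viol
        s = s2
    # Dual ascent: raise lambda when the safety budget is exceeded, projected to >= 0
    lam = max(0.0, lam + eta * (viol_total - SAFETY_BUDGET))

print("learned multiplier:", round(lam, 3))
print("greedy policy (0 = left, 1 = right):", np.argmin(Q, axis=1))
```

The intuition mirrors the abstract's game-theoretic view: the learner minimizes the scalarized cost for a fixed multiplier, while the dual update pushes the multiplier up whenever the safety constraint is violated, so the two updates play against each other until the constraint is (approximately) respected.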

References (23)
  1. R. Bellman, "Dynamic Programming," Princeton University Press, Princeton, NJ, 1957.
  2. M. Haviv, "On constrained Markov decision processes," Operations Research Letters, vol. 19, no. 1, pp. 25–28, 1996.
  3. L. S. Shapley, "Stochastic games," Proceedings of the National Academy of Sciences, vol. 39, no. 10, pp. 1095–1100, 1953.
  4. J. Kober, J. A. Bagnell, and J. Peters, "Reinforcement learning in robotics: A survey," The International Journal of Robotics Research, vol. 32, no. 11, pp. 1238–1274, 2013.
  5. B. Lütjens, M. Everett, and J. P. How, "Safe reinforcement learning with model uncertainty estimates," in 2019 International Conference on Robotics and Automation (ICRA). IEEE, 2019, pp. 8662–8668.
  6. M. Tejedor, A. Z. Woldaregay, and F. Godtliebsen, "Reinforcement learning application in diabetes blood glucose control: A systematic review," Artificial Intelligence in Medicine, vol. 104, p. 101836, 2020.
  7. J. Ding and C. J. Tomlin, "Robust reach-avoid controller synthesis for switched nonlinear systems," in 49th IEEE Conference on Decision and Control (CDC). IEEE, 2010, pp. 6481–6486.
  8. S. Summers, M. Kamgarpour, J. Lygeros, and C. Tomlin, "A stochastic reach-avoid problem with random obstacles," in Proceedings of the 14th International Conference on Hybrid Systems: Computation and Control, 2011, pp. 251–260.
  9. P. M. Esfahani, D. Chatterjee, and J. Lygeros, "The stochastic reach-avoid problem and set characterization for diffusions," Automatica, vol. 70, pp. 43–56, 2016.
  10. R. Wisniewski and M. L. Bujorianu, "Safe dynamic programming," arXiv preprint arXiv:2109.03307, 2021.
  11. M. L. Bujorianu, R. Wisniewski, and E. Boulougouris, "Stochastic safety for Markov chains," IEEE Control Systems Letters, vol. 5, no. 2, pp. 427–432, 2020.
  12. M. El Chamie, Y. Yu, B. Açıkmeşe, and M. Ono, "Controlled Markov processes with safety state constraints," IEEE Transactions on Automatic Control, vol. 64, no. 3, pp. 1003–1018, 2018.
  13. E. K. Chong, S. A. Miller, and J. Adaska, "On Bellman's principle with inequality constraints," Operations Research Letters, vol. 40, no. 2, pp. 108–113, 2012.
  14. Y. Chow and M. Pavone, "A time consistent formulation of risk constrained stochastic optimal control," arXiv preprint arXiv:1503.07461, 2015.
  15. R. C. Chen and G. L. Blankenship, "Dynamic programming equations for discounted constrained stochastic control," IEEE Transactions on Automatic Control, vol. 49, no. 5, pp. 699–709, 2004.
  16. V. S. Borkar, "An actor-critic algorithm for constrained Markov decision processes," Systems & Control Letters, vol. 54, no. 3, pp. 207–213, 2005.
  17. D. Ding, K. Zhang, T. Basar, and M. Jovanovic, "Natural policy gradient primal-dual method for constrained Markov decision processes," Advances in Neural Information Processing Systems, vol. 33, pp. 8378–8390, 2020.
  18. Y. Efroni, S. Mannor, and M. Pirotta, "Exploration-exploitation in constrained MDPs," arXiv preprint arXiv:2003.02189, 2020.
  19. S. Mannor and N. Shimkin, "A geometric approach to multi-criterion reinforcement learning," The Journal of Machine Learning Research, vol. 5, pp. 325–360, 2004.
  20. S. Boyd and L. Vandenberghe, "Convex Optimization," Cambridge University Press, Cambridge, UK, 2004.
  21. H. J. Kushner, "The Gauss-Seidel numerical procedure for Markov stochastic games," IEEE Transactions on Automatic Control, vol. 49, no. 10, pp. 1779–1784, 2004.
  22. E. A. Feinberg, "Constrained discounted Markov decision processes and Hamiltonian cycles," Mathematics of Operations Research, vol. 25, no. 1, pp. 130–140, 2000.
  23. M. Kearns and S. Singh, "Near-optimal reinforcement learning in polynomial time," Machine Learning, vol. 49, pp. 209–232, 2002.
