Online Reinforcement Learning in Markov Decision Process Using Linear Programming (2304.00155v3)
Abstract: We consider online reinforcement learning in an episodic Markov decision process (MDP) with an unknown transition function and stochastic rewards drawn from a fixed but unknown distribution. The learner aims to learn the optimal policy and minimize its regret over a finite time horizon by interacting with the environment. We devise a simple and efficient model-based algorithm that achieves $\widetilde{O}(LX\sqrt{TA})$ regret with high probability, where $L$ is the episode length, $T$ is the number of episodes, and $X$ and $A$ are the cardinalities of the state space and the action space, respectively. The proposed algorithm, which is based on the principle of "optimism in the face of uncertainty", maintains confidence sets for the transition and reward functions and uses occupancy measures to connect the online MDP with linear programming. It achieves a tighter regret bound than existing works that use a similar confidence-set framework, and it requires less computational effort than those that use a different framework but attain a slightly tighter regret bound.
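To make the occupancy-measure connection mentioned in the abstract concrete, the following is a minimal sketch in generic notation (layer sets $X_0,\dots,X_L$, occupancy measure $q$, initial state $x_0$); these symbols are illustrative and not taken verbatim from the paper. For a loop-free episodic MDP with a known transition function $P$ and a reward estimate $\hat r$, an optimal policy can be computed from the linear program

$$
\begin{aligned}
\max_{q \,\ge\, 0}\quad & \sum_{k=0}^{L-1}\sum_{x \in X_k}\sum_{a \in A} q(x,a)\,\hat r(x,a) \\
\text{s.t.}\quad & \sum_{a \in A} q(x_0,a) = 1, \\
& \sum_{a \in A} q(x',a) = \sum_{x \in X_{k-1}}\sum_{a \in A} q(x,a)\,P(x' \mid x,a), \qquad \forall\, x' \in X_k,\ k = 1,\dots,L-1,
\end{aligned}
$$

where $q(x,a)$ is the probability of visiting state $x$ and taking action $a$; a policy is recovered by normalization, $\pi(a \mid x) \propto q(x,a)$. When $P$ and the rewards are unknown, "optimism in the face of uncertainty" amounts to optimizing over occupancy measures that are consistent with some transition function in the current confidence set (for example, an $\ell_1$ ball $\|\tilde P(\cdot \mid x,a) - \bar P_t(\cdot \mid x,a)\|_1 \le \epsilon_t(x,a)$ around the empirical estimate, a standard choice in this line of work) together with an optimistic reward estimate; with such constraints the problem remains a linear program over extended occupancy measures.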
Authors: Vincent Leon and S. Rasoul Etesami