Optimal Sample Complexity for Average Reward Markov Decision Processes (2310.08833v2)
Abstract: We resolve the open question regarding the sample complexity of policy learning for maximizing the long-run average reward associated with a uniformly ergodic Markov decision process (MDP), assuming a generative model. In this context, the existing literature provides a sample complexity upper bound of $\widetilde O(|S||A|t_{\text{mix}}^2 \epsilon^{-2})$ and a lower bound of $\Omega(|S||A|t_{\text{mix}} \epsilon^{-2})$. In these expressions, $|S|$ and $|A|$ denote the cardinalities of the state and action spaces respectively, $t_{\text{mix}}$ serves as a uniform upper limit for the total variation mixing times, and $\epsilon$ signifies the error tolerance. Therefore, a notable gap of $t_{\text{mix}}$ still remains to be bridged. Our primary contribution is the development of an estimator for the optimal policy of average reward MDPs with a sample complexity of $\widetilde O(|S||A|t_{\text{mix}}\epsilon^{-2})$. This marks the first algorithm and analysis to reach the literature's lower bound. Our new algorithm draws inspiration from ideas in Li et al. (2020), Jin and Sidford (2021), and Wang et al. (2023). Additionally, we conduct numerical experiments to validate our theoretical findings.
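For concreteness, the sketch below illustrates the generic model-based plug-in template that this line of work builds on: draw a fixed number of transitions from the generative model at every state-action pair, form the empirical MDP, and run value iteration on a discounted surrogate whose effective horizon $1/(1-\gamma)$ scales with $t_{\text{mix}}/\epsilon$, in the spirit of the average-to-discounted reduction used by Jin and Sidford (2021). The helper names (`sample_next_state`, `reward`, `n_samples_per_pair`) are illustrative placeholders, and this is not the paper's estimator; it only shows the template that the paper's algorithm and analysis refine to reach the $\widetilde O(|S||A|t_{\text{mix}}\epsilon^{-2})$ bound.

```python
import numpy as np

def plug_in_policy(sample_next_state, reward, n_states, n_actions,
                   n_samples_per_pair, gamma, n_vi_iters=1000):
    """Generic model-based plug-in sketch (not the paper's algorithm):
    estimate transition probabilities from generative-model samples,
    then run value iteration on the empirical discounted MDP and
    return the greedy deterministic policy."""
    # Empirical transition model P_hat[s, a, s'] from i.i.d. samples.
    P_hat = np.zeros((n_states, n_actions, n_states))
    for s in range(n_states):
        for a in range(n_actions):
            for _ in range(n_samples_per_pair):
                P_hat[s, a, sample_next_state(s, a)] += 1.0
    P_hat /= n_samples_per_pair

    # Deterministic reward table r[s, a].
    r = np.array([[reward(s, a) for a in range(n_actions)]
                  for s in range(n_states)])

    # Value iteration on the empirical discounted MDP.
    V = np.zeros(n_states)
    for _ in range(n_vi_iters):
        Q = r + gamma * (P_hat @ V)   # Bellman backup, shape (S, A)
        V = Q.max(axis=1)
    return Q.argmax(axis=1)           # greedy policy, one action per state
```

Under the reduction mentioned above, one would take `gamma` on the order of `1 - eps / t_mix` (up to constants and logarithmic factors) and choose `n_samples_per_pair` so that the total budget `n_states * n_actions * n_samples_per_pair` matches the target sample complexity.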
- Agarwal, A., Kakade, S., and Yang, L. F. (2020). Model-based reinforcement learning with a generative model is minimax optimal. In Abernethy, J. and Agarwal, S., editors, Proceedings of Thirty Third Conference on Learning Theory, volume 125 of Proceedings of Machine Learning Research, pages 67–83. PMLR.
- Azar, M. G., Munos, R., and Kappen, H. J. (2013). Minimax PAC bounds on the sample complexity of reinforcement learning with a generative model. Mach. Learn., 91(3):325–349.
- Bramson, M. (2008). Stability of Queueing Networks: École d’Été de Probabilités de Saint-Flour XXXVI - 2006. Springer Berlin Heidelberg, Berlin, Heidelberg.
- Deep direct reinforcement learning for financial signal representation and trading. IEEE Transactions on Neural Networks and Learning Systems, 28(3):653–664.
- A modified form of the iterative method of dynamic programming. The Annals of Statistics, 3(1):203–208.
- Efficiently solving MDPs with stochastic mirror descent.
- Jin, Y. and Sidford, A. (2021). Towards tight bounds on the sample complexity of average-reward MDPs.
- Instance-optimality in optimal value estimation: Adaptivity via variance-reduced Q-learning.
- Reinforcement learning in robotics: A survey. The International Journal of Robotics Research, 32(11):1238–1274.
- Is Q-learning minimax optimal? A tight sample complexity analysis. arXiv preprint arXiv:2102.06548.
- Settling the sample complexity of model-based offline reinforcement learning.
- Li, G., Wei, Y., Chi, Y., Gu, Y., and Chen, Y. (2020). Breaking the sample size barrier in model-based reinforcement learning with a generative model. In Larochelle, H., Ranzato, M., Hadsell, R., Balcan, M., and Lin, H., editors, Advances in Neural Information Processing Systems, volume 33, pages 12861–12872. Curran Associates, Inc.
- Meyn, S. and Tweedie, R. L. (2009). Markov Chains and Stochastic Stability. Cambridge Mathematical Library. Cambridge University Press, 2nd edition.
- Puterman, M. (1994). Average Reward and Related Criteria. In Markov Decision Processes: Discrete Stochastic Dynamic Programming, chapter 8, pages 331–440. John Wiley & Sons, Ltd.
- CAD2RL: Real single-image flight without a single real image. arXiv preprint arXiv:1611.04201.
- Near-optimal time and sample complexities for solving Markov decision processes with a generative model. In Bengio, S., Wallach, H., Larochelle, H., Grauman, K., Cesa-Bianchi, N., and Garnett, R., editors, Advances in Neural Information Processing Systems, volume 31. Curran Associates, Inc.
- Wainwright, M. J. (2019). Variance-reduced Q-learning is minimax optimal.
- Near sample-optimal reduction-based policy learning for average reward MDP.
- Wang, M. (2017). Primal-dual $\pi$ learning: Sample complexity and sublinear run time for ergodic Markov decision problems.
- Wang, S., Blanchet, J., and Glynn, P. (2023). Optimal sample complexity of reinforcement learning for uniformly ergodic discounted Markov decision processes.
- Watkins, C. J. C. H. and Dayan, P. (1992). Q-learning. Machine Learning, 8(3):279–292.
- Sharper model-free reinforcement learning for average-reward Markov decision processes.