Square-root regret bounds for continuous-time episodic Markov decision processes (2210.00832v2)
Published 3 Oct 2022 in cs.LG and math.OC
Abstract: We study reinforcement learning for continuous-time Markov decision processes (MDPs) in the finite-horizon episodic setting. In contrast to discrete-time MDPs, the inter-transition times of a continuous-time MDP are exponentially distributed, with rate parameters depending on the state-action pair at each transition. We present a learning algorithm based on the methods of value iteration and upper confidence bound. We derive an upper bound on the worst-case expected regret of the proposed algorithm and establish a worst-case lower bound; both bounds are of the order of the square root of the number of episodes. Finally, we conduct simulation experiments to illustrate the performance of our algorithm.
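
The key distinction from the discrete-time setting is the state-action-dependent exponential holding time between transitions. Below is a minimal sketch, not the paper's algorithm, of rolling out one finite-horizon episode of such a continuous-time MDP under a fixed policy; all names (`rates`, `transition_probs`, `rewards`, `policy`, `horizon`) are illustrative assumptions rather than quantities from the paper.

```python
# Illustrative sketch only: simulate one finite-horizon episode of a
# continuous-time MDP where holding times are Exponential(rates[s, a]).
import numpy as np

def simulate_episode(rates, transition_probs, rewards, policy, horizon, rng):
    """Roll out one episode until the horizon is reached.

    rates[s, a]            -- exponential rate of leaving state s under action a
    transition_probs[s, a] -- distribution over next states given (s, a)
    rewards[s, a]          -- reward rate accrued while in state s using action a
    policy[s]              -- action chosen in state s (a fixed deterministic policy)
    """
    t, state, total_reward = 0.0, 0, 0.0
    while t < horizon:
        action = policy[state]
        holding = rng.exponential(1.0 / rates[state, action])  # inter-transition time
        dwell = min(holding, horizon - t)                       # truncate at the horizon
        total_reward += rewards[state, action] * dwell          # reward accrues over time
        t += holding
        if t < horizon:
            state = rng.choice(len(rates), p=transition_probs[state, action])
    return total_reward

# Hypothetical small instance for demonstration.
rng = np.random.default_rng(0)
S, A = 3, 2
rates = rng.uniform(0.5, 2.0, size=(S, A))
transition_probs = rng.dirichlet(np.ones(S), size=(S, A))
rewards = rng.uniform(0.0, 1.0, size=(S, A))
policy = np.zeros(S, dtype=int)  # always take action 0
print(simulate_episode(rates, transition_probs, rewards, policy, horizon=5.0, rng=rng))
```

In the learning problem studied by the paper, the rate and transition parameters above are unknown and must be estimated from episodes, with optimistic (upper-confidence-bound) estimates driving exploration; this sketch only shows the episode dynamics, not that estimation step.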