Square-root regret bounds for continuous-time episodic Markov decision processes (2210.00832v2)

Published 3 Oct 2022 in cs.LG and math.OC

Abstract: We study reinforcement learning for continuous-time Markov decision processes (MDPs) in the finite-horizon episodic setting. In contrast to discrete-time MDPs, the inter-transition times of a continuous-time MDP are exponentially distributed with rate parameters depending on the state–action pair at each transition. We present a learning algorithm based on the methods of value iteration and upper confidence bound. We derive an upper bound on the worst-case expected regret for the proposed algorithm and establish a worst-case lower bound; both bounds are of the order of the square root of the number of episodes. Finally, we conduct simulation experiments to illustrate the performance of our algorithm.
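
The abstract's key structural point is that, unlike discrete-time MDPs, the time spent in each state before a transition is exponentially distributed with a rate that depends on the current state–action pair. Below is a minimal Python sketch of simulating one finite-horizon episode under a fixed policy to illustrate that dynamic; the two-state model, the rate function `rate`, the kernel `transition`, and the reward-rate convention `reward_rate` are hypothetical placeholders, not the paper's algorithm or experimental setup.

```python
import numpy as np

# Hypothetical two-state, two-action continuous-time MDP (illustration only).
rng = np.random.default_rng(0)

def rate(s, a):
    """Exponential holding-time rate lambda(s, a); illustrative values."""
    return 1.0 + 0.5 * s + 0.25 * a

def transition(s, a):
    """Sample the next state from a toy kernel p(.|s, a)."""
    p_stay = 0.3 + 0.2 * a
    return s if rng.random() < p_stay else 1 - s

def reward_rate(s, a):
    """Reward accrued per unit time while occupying (s, a); illustrative."""
    return float(s == a)

def run_episode(policy, horizon=10.0, s0=0):
    """Simulate one episode of length `horizon` under a stationary policy."""
    t, s, total_reward = 0.0, s0, 0.0
    while t < horizon:
        a = policy(s)
        # Inter-transition (holding) time is Exp(rate(s, a)).
        sojourn = rng.exponential(1.0 / rate(s, a))
        dwell = min(sojourn, horizon - t)  # truncate the last holding period
        total_reward += reward_rate(s, a) * dwell
        t += sojourn
        if t < horizon:
            s = transition(s, a)
    return total_reward

print(run_episode(policy=lambda s: 1 - s))
```

The paper's learning algorithm would wrap repeated episodes of this kind with optimistic (upper-confidence-bound) estimates of the rates and transition probabilities fed into value iteration; that estimation loop is not sketched here.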
