Q-Learning for Continuous State and Action MDPs under Average Cost Criteria (2308.07591v3)

Published 15 Aug 2023 in math.OC, cs.SY, and eess.SY

Abstract: For infinite-horizon average-cost criterion problems, there exist relatively few rigorous approximation and reinforcement learning results. In this paper, for Markov Decision Processes (MDPs) with standard Borel spaces, (i) we first provide a discretization-based approximation method for MDPs with continuous spaces under average cost criteria, and provide error bounds for the approximations when the dynamics are only weakly continuous (yielding asymptotic convergence of the errors as the grid sizes vanish) or Wasserstein continuous (yielding a rate of convergence as the grid sizes vanish) under certain ergodicity assumptions. In particular, we relax the total variation condition given in prior work to weak continuity or Wasserstein continuity. (ii) We provide synchronous and asynchronous (quantized) Q-learning algorithms for continuous spaces via quantization (where the quantized state is taken to be the actual state in the corresponding Q-learning algorithms presented in the paper), and establish their convergence. (iii) We finally show that the convergence is to the optimal Q-values of a finite approximate model constructed via quantization, which implies near optimality of the solution obtained. Our Q-learning convergence results and their convergence to near optimality are new for continuous spaces, and the proof method is new even for finite spaces, to our knowledge.
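
Item (ii) of the abstract describes the core algorithmic idea: run ordinary Q-learning over a finite set of quantization cells while the state itself evolves in the continuous space, treating the quantized state as if it were the actual state in the update. Below is a minimal sketch of how such an iteration can look under the average-cost criterion, using a relative (RVI-style) update so the undiscounted iterates stay bounded. The environment interface step(x, u), the one-dimensional grids, the uniform exploration rule, the 1/visit-count step sizes, and the reference pair ref are illustrative assumptions for this sketch, not the authors' exact synchronous or asynchronous constructions.

import numpy as np

def quantize(x, grid):
    # Index of the grid point nearest to the continuous point x (1-D grid assumed).
    return int(np.argmin(np.abs(grid - x)))

def quantized_q_learning(step, x0, state_grid, action_grid,
                         n_iters=200_000, ref=(0, 0)):
    # step(x, u) is assumed to sample one transition (x_next, stage_cost) of the
    # true continuous-space model; grids, exploration and step sizes are illustrative.
    state_grid = np.asarray(state_grid, dtype=float)
    action_grid = np.asarray(action_grid, dtype=float)
    n_x, n_u = len(state_grid), len(action_grid)
    Q = np.zeros((n_x, n_u))
    visits = np.zeros((n_x, n_u))
    x = x0
    for _ in range(n_iters):
        i = quantize(x, state_grid)
        j = np.random.randint(n_u)              # explore actions uniformly at random
        x_next, cost = step(x, action_grid[j])  # simulate the true continuous dynamics
        i_next = quantize(x_next, state_grid)
        visits[i, j] += 1
        alpha = 1.0 / visits[i, j]              # diminishing step size along visited pairs
        # Relative (RVI-style) update: subtracting a fixed reference entry keeps the
        # undiscounted average-cost iterates from drifting off to infinity.
        target = cost + Q[i_next].min() - Q[ref]
        Q[i, j] += alpha * (target - Q[i, j])
        x = x_next
    gain_estimate = Q[ref]                      # rough estimate of the optimal average cost
    policy = [action_grid[int(np.argmin(Q[i]))] for i in range(n_x)]
    return Q, gain_estimate, policy

In the paper's setting, the point of item (iii) is that an iteration of this type converges to the optimal Q-values of the finite model induced by the quantization, which in turn is near optimal for the original continuous model; the sketch above only illustrates the mechanics of updating quantized entries while simulating the true dynamics.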

Authors (2)
  1. Ali Devran Kara
  2. Serdar Yüksel
