Q-Learning for Stochastic Control under General Information Structures and Non-Markovian Environments (2311.00123v2)

Published 31 Oct 2023 in math.OC, cs.AI, cs.SY, and eess.SY

Abstract: As a primary contribution, we present a convergence theorem for stochastic iterations, and in particular Q-learning iterates, under a general, possibly non-Markovian, stochastic environment. Our conditions for convergence involve an ergodicity and a positivity criterion. We provide a precise characterization of the limit of the iterates and conditions on the environment and initializations for convergence. As our second contribution, we discuss the implications and applications of this theorem to a variety of stochastic control problems with non-Markovian environments involving (i) quantized approximations of fully observed Markov Decision Processes (MDPs) with continuous spaces (where quantization breaks down the Markovian structure), (ii) quantized approximations of belief-MDP reduced partially observable MDPs (POMDPs) with weak Feller continuity and a mild version of filter stability (which requires knowledge of the model by the controller), (iii) finite window approximations of POMDPs under uniform controlled filter stability (which does not require knowledge of the model), and (iv) multi-agent models, where convergence of learning dynamics to a new class of equilibria, subjective Q-learning equilibria, is studied. Some of these implications are new to the literature, while others are interpreted as applications of the convergence theorem. Some open problems are noted.
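
To make the iteration concrete, below is a minimal, self-contained sketch (in Python) of tabular Q-learning run on a uniformly quantized approximation of a continuous-state MDP, in the spirit of setting (i). The toy dynamics, cost, quantizer, and step-size schedule are illustrative assumptions, not the paper's construction; the sketch only shows the form of the iterate, which updates a finite Q-table along a single trajectory whose quantized state process is, in general, non-Markovian.

```python
# Minimal sketch: Q-learning over a quantized (finite) approximation of a
# continuous-state MDP. The toy 1-D dynamics, cost, uniform quantizer, and
# 1/visit-count step sizes are illustrative assumptions, not the paper's model.
import numpy as np

rng = np.random.default_rng(0)

N_BINS = 20            # size of the quantized state space
ACTIONS = [-1.0, 1.0]  # finite action set
GAMMA = 0.95           # discount factor

def quantize(x):
    """Map the continuous state x in [0, 1] to one of N_BINS cells."""
    return min(int(x * N_BINS), N_BINS - 1)

def step(x, a):
    """Toy controlled dynamics on [0, 1] with additive noise; cost = distance to 0.5."""
    x_next = np.clip(x + 0.05 * a + 0.02 * rng.standard_normal(), 0.0, 1.0)
    cost = (x_next - 0.5) ** 2
    return x_next, cost

Q = np.zeros((N_BINS, len(ACTIONS)))   # Q-table indexed by quantized cell and action
visits = np.zeros_like(Q)              # per-(cell, action) visit counts

x = rng.uniform()
for t in range(200_000):
    s = quantize(x)
    a_idx = rng.integers(len(ACTIONS))        # exploration: uniformly random actions
    x_next, cost = step(x, ACTIONS[a_idx])
    s_next = quantize(x_next)

    visits[s, a_idx] += 1
    alpha = 1.0 / visits[s, a_idx]            # vanishing step size along visits
    target = cost + GAMMA * Q[s_next].min()   # cost-minimization formulation
    Q[s, a_idx] += alpha * (target - Q[s, a_idx])

    x = x_next

greedy = Q.argmin(axis=1)   # greedy (cost-minimizing) policy on the quantized model
print("Greedy action index per quantized cell:", greedy)
```

Note that the learner treats the quantized cell index as if it were a state, even though the cell process is not Markovian once the continuous state is discretized; establishing when, and to what limit, such iterates converge is precisely what the paper's ergodicity and positivity conditions address.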

Authors (2)
  1. Ali Devran Kara
  2. Serdar Yüksel