
Multi-Bellman operator for convergence of $Q$-learning with linear function approximation (2309.16819v1)

Published 28 Sep 2023 in cs.LG and cs.AI

Abstract: We study the convergence of $Q$-learning with linear function approximation. Our key contribution is the introduction of a novel multi-Bellman operator that extends the traditional Bellman operator. By exploring the properties of this operator, we identify conditions under which the projected multi-Bellman operator becomes contractive, providing improved fixed-point guarantees compared to the Bellman operator. To leverage these insights, we propose the multi $Q$-learning algorithm with linear function approximation. We demonstrate that this algorithm converges to the fixed point of the projected multi-Bellman operator, yielding solutions of arbitrary accuracy. Finally, we validate our approach by applying it to well-known environments, showcasing the effectiveness and applicability of our findings.

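The abstract does not spell out the operator or the algorithm. A natural reading, stated here as an assumption rather than the paper's exact construction, is that the multi-Bellman operator is the $m$-fold composition $\mathcal{H}^m = \mathcal{H} \circ \cdots \circ \mathcal{H}$ of the standard Bellman optimality operator $\mathcal{H}$, so that the projected operator $\Pi \mathcal{H}^m$ can be made contractive by choosing $m$ large enough. The sketch below illustrates one way such an update could look with linear function approximation: a sampled $m$-step greedy lookahead target driving a semi-gradient update of the weight vector. The names `env`, `features`, `sample`, and `sample_state` are hypothetical placeholders; this is an illustrative sketch under those assumptions, not the authors' implementation.

```python
import numpy as np

def greedy_action(w, features, s, n_actions):
    """Action maximizing the current linear Q estimate phi(s, a)^T w."""
    return max(range(n_actions), key=lambda a: features(s, a) @ w)

def m_step_target(w, env, features, s, a, m, gamma, n_actions):
    """Sampled m-step lookahead target: unroll m transitions greedily,
    then bootstrap with the linear Q estimate at the final state."""
    g, discount = 0.0, 1.0
    for _ in range(m):
        r, s = env.sample(s, a)          # hypothetical generative model: (reward, next state)
        g += discount * r
        discount *= gamma
        a = greedy_action(w, features, s, n_actions)
    return g + discount * (features(s, a) @ w)

def multi_q_learning(env, features, d, n_actions, m=3, gamma=0.99,
                     alpha=0.05, n_updates=10_000, rng=None):
    """Semi-gradient stochastic-approximation updates toward the m-step target."""
    rng = rng or np.random.default_rng(0)
    w = np.zeros(d)
    for _ in range(n_updates):
        s = env.sample_state(rng)        # hypothetical sampling distribution over states
        a = rng.integers(n_actions)      # exploratory action
        phi = features(s, a)
        target = m_step_target(w, env, features, s, a, m, gamma, n_actions)
        w += alpha * (target - phi @ w) * phi
    return w
```

With $m = 1$ this reduces to standard $Q$-learning with linear function approximation; larger $m$ spends more simulation per update in exchange for the stronger contraction properties the abstract attributes to the projected multi-Bellman operator.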
