Multi-Bellman operator for convergence of $Q$-learning with linear function approximation (2309.16819v1)
Abstract: We study the convergence of $Q$-learning with linear function approximation. Our key contribution is the introduction of a novel multi-Bellman operator that extends the traditional Bellman operator. By exploring the properties of this operator, we identify conditions under which the projected multi-Bellman operator becomes contractive, providing improved fixed-point guarantees compared to the Bellman operator. To leverage these insights, we propose the multi $Q$-learning algorithm with linear function approximation. We demonstrate that this algorithm converges to the fixed point of the projected multi-Bellman operator, yielding solutions of arbitrary accuracy. Finally, we validate our approach by applying it to well-known environments, showcasing the effectiveness and applicability of our findings.
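To make the abstract's central object concrete, below is a minimal numerical sketch in Python. It builds a small random MDP, applies the Bellman optimality operator $m$ times (the multi-Bellman operator $H^m$), projects the result onto the span of a linear feature matrix, and iterates this projected operator $\Pi H^m$ to a fixed point. The toy MDP, the feature matrix, and the choice $m = 5$ are illustrative assumptions, not taken from the paper; the intuition for the contraction condition is that composing $m$ Bellman steps shrinks the effective discount to $\gamma^m$, which for large enough $m$ can offset any expansion introduced by the projection.

```python
import numpy as np

# Minimal sketch (not the paper's exact algorithm): on a small finite MDP
# with linear features, iterate the projected multi-Bellman operator
# Pi H^m to a fixed point. The MDP, features, and m are illustrative.

rng = np.random.default_rng(0)
nS, nA, d, gamma, m = 5, 2, 3, 0.9, 5  # states, actions, feature dim, discount, Bellman steps

P = rng.dirichlet(np.ones(nS), size=(nS, nA))  # P[s, a] is a distribution over next states
R = rng.uniform(size=(nS, nA))                 # rewards r(s, a)
Phi = rng.normal(size=(nS * nA, d))            # feature matrix; row s*nA + a is phi(s, a)

def bellman(q):
    """One application of the Bellman optimality operator H to a flat Q of shape (nS*nA,)."""
    v = q.reshape(nS, nA).max(axis=1)          # greedy state values: max_a Q(s, a)
    return (R + gamma * P @ v).reshape(-1)     # (HQ)(s, a) = r(s, a) + gamma * E[max_a' Q(s', a')]

def project(q):
    """Orthogonal projection onto span(Phi): least-squares fit, then re-expand."""
    w, *_ = np.linalg.lstsq(Phi, q, rcond=None)
    return Phi @ w

# Fixed-point iteration of Pi H^m. Convergence relies on m being large
# enough that Pi H^m is a contraction (the paper's condition).
q = np.zeros(nS * nA)
for _ in range(200):
    hq = q
    for _ in range(m):                         # H^m: compose the Bellman operator m times
        hq = bellman(hq)
    q_next = project(hq)                       # Pi H^m Q
    if np.max(np.abs(q_next - q)) < 1e-10:
        break
    q = q_next

print("fixed point of Pi H^m (first entries):", q[:4])
```

This sketch uses the exact operator, so it needs the transition kernel $P$ and reward $R$. The paper's multi $Q$-learning algorithm is, per the abstract, a sample-based method that converges to this same fixed point of $\Pi H^m$; see the full text for its precise update rule.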