Kernelized Reinforcement Learning with Order Optimal Regret Bounds (2306.07745v3)

Published 13 Jun 2023 in cs.LG, cs.AI, and stat.ML

Abstract: Reinforcement learning (RL) has shown empirical success in various real-world settings with complex models and large state-action spaces. The existing analytical results, however, typically focus on settings with a small number of state-action pairs or on simple models such as linearly parameterized state-action value functions. To derive RL policies that efficiently handle large state-action spaces with more general value functions, some recent works have considered nonlinear function approximation using kernel ridge regression. We propose $\pi$-KRVI, an optimistic modification of least-squares value iteration, for the setting where the state-action value function belongs to a reproducing kernel Hilbert space (RKHS). We prove the first order-optimal regret guarantees under a general setting. Our results show a significant improvement, polynomial in the number of episodes, over the state of the art. In particular, with highly non-smooth kernels (such as the Neural Tangent kernel or some Matérn kernels), the existing results lead to trivial regret bounds that are superlinear in the number of episodes. We show a sublinear regret bound that is order optimal in the case of Matérn kernels, where a lower bound on regret is known.
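
The core ingredient of this line of work is a kernel ridge regression estimate of the state-action value function, inflated by an optimism bonus proportional to the predictive uncertainty. The sketch below illustrates one such optimistic value estimate under stated assumptions: the Matérn-1/2 kernel, the bonus coefficient `beta`, and all function and variable names are illustrative choices for exposition, not the paper's exact construction of $\pi$-KRVI.

```python
import numpy as np

def matern_half_kernel(X, Y, length_scale=1.0):
    """Matern-1/2 kernel: k(x, y) = exp(-||x - y|| / length_scale)."""
    dists = np.linalg.norm(X[:, None, :] - Y[None, :, :], axis=-1)
    return np.exp(-dists / length_scale)

def optimistic_q(Z, y, Z_query, lam=1.0, beta=1.0):
    """Kernel ridge regression mean at Z_query plus an optimism bonus
    proportional to the Gaussian-process posterior standard deviation.

    Z       : (n, d) array of visited state-action pairs
    y       : (n,)   regression targets (reward + next-step value estimate)
    Z_query : (m, d) array of state-action pairs to evaluate
    lam     : ridge regularization parameter
    beta    : width of the optimism bonus (a hypothetical tuning constant)
    """
    K = matern_half_kernel(Z, Z)                  # (n, n) Gram matrix
    k_q = matern_half_kernel(Z_query, Z)          # (m, n) cross-kernel
    A = K + lam * np.eye(len(Z))
    mean = k_q @ np.linalg.solve(A, y)            # ridge-regression estimate
    # Posterior variance: k(z, z) - k_q(z) A^{-1} k_q(z)^T per query point;
    # k(z, z) = 1 for this kernel.
    var = 1.0 - np.einsum("ij,ji->i", k_q, np.linalg.solve(A, k_q.T))
    bonus = beta * np.sqrt(np.maximum(var, 0.0))  # optimism for exploration
    return mean + bonus

# Toy usage on random data (purely illustrative).
rng = np.random.default_rng(0)
Z = rng.uniform(size=(50, 3))   # 50 visited state-action pairs in R^3
y = rng.normal(size=50)         # placeholder regression targets
q_ucb = optimistic_q(Z, y, rng.uniform(size=(5, 3)))
print(q_ucb)                    # optimistic value estimates, shape (5,)
```

In a least-squares value iteration loop, the targets `y` would be observed rewards plus the maximum of the next step's optimistic value estimate, recomputed for each step of each episode.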
