Adversarial Contextual Bandits Go Kernelized (2310.01609v1)
Abstract: We study a generalization of the problem of online learning in adversarial linear contextual bandits by incorporating loss functions that belong to a reproducing kernel Hilbert space, which allows for more flexible modeling of complex decision-making scenarios. We propose a computationally efficient algorithm that makes use of a new optimistically biased estimator for the loss functions and achieves near-optimal regret guarantees under a variety of eigenvalue decay assumptions on the underlying kernel. Specifically, under the assumption of polynomial eigendecay with exponent $c>1$, the regret is $\widetilde{O}\big(KT^{\frac{1}{2}(1+\frac{1}{c})}\big)$, where $T$ denotes the number of rounds and $K$ the number of actions. Furthermore, when the eigendecay follows an exponential pattern, we achieve an even tighter regret bound of $\widetilde{O}(\sqrt{T})$. These rates match the lower bounds in all special cases where lower bounds are known, and match the best known upper bounds available for the more well-studied stochastic counterpart of our problem.
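To make the polynomial-decay rate concrete, the exponent of $T$ can be evaluated for a particular decay parameter (the value $c=2$ below is an illustrative choice, not one singled out in the abstract):

```latex
% Polynomial eigendecay with exponent c = 2 (illustrative choice):
% the exponent of T is (1/2)(1 + 1/c) = (1/2)(1 + 1/2) = 3/4,
% so the stated regret bound specializes to
R_T = \widetilde{O}\!\left(K\, T^{3/4}\right).
% As c -> infinity the exponent tends to 1/2, consistent with the
% \widetilde{O}(\sqrt{T}) rate stated for exponential eigendecay.
```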
- Gergely Neu
- Julia Olkhovskaya
- Sattar Vakili