Understanding the Role of Feedback in Online Learning with Switching Costs (2306.09588v1)
Abstract: In this paper, we study the role of feedback in online learning with switching costs. It has been shown that the minimax regret is $\widetilde{\Theta}(T^{2/3})$ under bandit feedback and improves to $\widetilde{\Theta}(\sqrt{T})$ under full-information feedback, where $T$ is the length of the time horizon. However, it remains largely unknown how the amount and type of feedback generally impact regret. To this end, we first consider the setting of bandit learning with extra observations; that is, in addition to the typical bandit feedback, the learner can freely make a total of $B_{\mathrm{ex}}$ extra observations. We fully characterize the minimax regret in this setting, which exhibits an interesting phase-transition phenomenon: when $B_{\mathrm{ex}} = O(T^{2/3})$, the regret remains $\widetilde{\Theta}(T^{2/3})$, but when $B_{\mathrm{ex}} = \Omega(T^{2/3})$, it becomes $\widetilde{\Theta}(T/\sqrt{B_{\mathrm{ex}}})$, which improves as the budget $B_{\mathrm{ex}}$ increases. To design algorithms that can achieve the minimax regret, it is instructive to consider a more general setting where the learner has a budget of $B$ total observations. We fully characterize the minimax regret in this setting as well and show that it is $\widetilde{\Theta}(T/\sqrt{B})$, which scales smoothly with the total budget $B$. Furthermore, we propose a generic algorithmic framework, which enables us to design different learning algorithms that can achieve matching upper bounds for both settings based on the amount and type of feedback. One interesting finding is that while bandit feedback can still guarantee optimal regret when the budget is relatively limited, it no longer suffices to achieve optimal regret when the budget is relatively large.
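For quick reference, the regret characterizations stated in the abstract can be collected into a single display; the symbols $R_{\mathrm{ex}}$ and $R_{\mathrm{tot}}$ are introduced here purely as shorthand for the minimax regret in the extra-observation and total-budget settings, respectively, and do not appear in the abstract itself:

\[
R_{\mathrm{ex}}(T, B_{\mathrm{ex}}) =
\begin{cases}
\widetilde{\Theta}\!\left(T^{2/3}\right), & B_{\mathrm{ex}} = O\!\left(T^{2/3}\right),\\[4pt]
\widetilde{\Theta}\!\left(T/\sqrt{B_{\mathrm{ex}}}\right), & B_{\mathrm{ex}} = \Omega\!\left(T^{2/3}\right),
\end{cases}
\qquad
R_{\mathrm{tot}}(T, B) = \widetilde{\Theta}\!\left(T/\sqrt{B}\right).
\]

As a consistency check, the two branches of the phase transition agree at the boundary: when $B_{\mathrm{ex}} = \Theta(T^{2/3})$, we have $T/\sqrt{B_{\mathrm{ex}}} = T/T^{1/3} = T^{2/3}$. Likewise, at $B = T$ (one observation per round), the total-budget bound $T/\sqrt{B}$ already matches the $\widetilde{\Theta}(\sqrt{T})$ full-information rate quoted above.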
Authors: Duo Cheng, Xingyu Zhou, Bo Ji