Incentive-compatible Bandits: Importance Weighting No More (2405.06480v1)

Published 10 May 2024 in cs.LG and cs.GT

Abstract: We study the problem of incentive-compatible online learning with bandit feedback. In this class of problems, the experts are self-interested agents who might misrepresent their preferences with the goal of being selected most often. The goal is to devise algorithms which are simultaneously incentive-compatible, that is, the experts are incentivised to report their true preferences, and have no regret with respect to the preferences of the best fixed expert in hindsight. Freeman et al. (2020) propose an algorithm with optimal $O(\sqrt{T \log(K)})$ regret in the full-information setting and $O(T^{2/3}(K\log(K))^{1/3})$ regret in the bandit setting. In this work we propose the first incentive-compatible algorithms that enjoy $O(\sqrt{KT})$ regret bounds. We further demonstrate how simple loss-biasing allows the algorithm proposed in Freeman et al. (2020) to enjoy $\tilde O(\sqrt{KT})$ regret. As a byproduct of our approach, we obtain the first bandit algorithm with nearly optimal regret bounds in the adversarial setting that works entirely on the observed loss sequence, without the need for importance-weighted estimators. Finally, we provide an incentive-compatible algorithm that enjoys asymptotically optimal best-of-both-worlds regret guarantees, i.e., logarithmic regret in the stochastic regime as well as worst-case $O(\sqrt{KT})$ regret.
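For background on the "importance-weighted estimators" the abstract contrasts against: standard adversarial bandit algorithms such as EXP3 observe only the loss of the pulled arm $A_t$ and form the unbiased estimate $\hat\ell_t(a) = \ell_t(a)\,\mathbf{1}\{A_t = a\}/p_t(a)$, whose variance can blow up when $p_t(a)$ is small. The sketch below is this textbook EXP3-style update, not the paper's incentive-compatible algorithm (which avoids importance weighting); the loss function, learning rate, and horizon are illustrative assumptions.

```python
import numpy as np


def exp3(K, T, eta, loss_fn, rng=np.random.default_rng(0)):
    """Textbook EXP3 with importance-weighted loss estimates (background sketch).

    loss_fn(t, arm) -> observed loss in [0, 1] of the pulled arm at round t.
    """
    cum_est = np.zeros(K)                      # cumulative importance-weighted losses
    total_loss = 0.0
    for t in range(T):
        logits = -eta * cum_est
        logits -= logits.max()                 # numerical stabilization
        p = np.exp(logits)
        p /= p.sum()                           # sampling distribution over arms
        arm = rng.choice(K, p=p)               # pull one arm
        loss = loss_fn(t, arm)                 # only this loss is observed
        # Importance-weighted estimate: unbiased for the full loss vector,
        # but its magnitude scales with 1/p[arm] on the pulled arm.
        cum_est[arm] += loss / p[arm]
        total_loss += loss
    return total_loss


# Illustrative usage: arm 0 has the smallest mean loss.
if __name__ == "__main__":
    K, T = 5, 10_000
    rng = np.random.default_rng(1)
    loss_fn = lambda t, a: float(rng.binomial(1, 0.3 if a == 0 else 0.5))
    eta = np.sqrt(np.log(K) / (K * T))         # standard EXP3 tuning
    print(exp3(K, T, eta, loss_fn))
```

The division by $p_t(a)$ is what the paper's observed-loss approach dispenses with while still achieving nearly optimal $\tilde O(\sqrt{KT})$ adversarial regret.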

References (15)
  1. Corralling a band of bandit algorithms. In Conference on Learning Theory, pages 12–38. PMLR, 2017.
  2. The nonstochastic multiarmed bandit problem. SIAM Journal on Computing, 32(1), 2002.
  3. Online mirror descent and dual averaging: keeping pace in the dynamic case. The Journal of Machine Learning Research, 23(1):5271–5308, 2022.
  4. Adapting to misspecification in contextual bandits. Advances in Neural Information Processing Systems, 33:11478–11489, 2020.
  5. No-regret and incentive-compatible online learning. In International Conference on Machine Learning, pages 3270–3279. PMLR, 2020.
  6. A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences, 55(1):119–139, 1997.
  7. Efficient learning by implicit exploration in bandit problems with side observations. Advances in Neural Information Processing Systems, 27, 2014.
  8. Self-financed wagering mechanisms for forecasting. In Proceedings of the 9th ACM Conference on Electronic Commerce, pages 170–179, 2008.
  9. An axiomatic characterization of wagering mechanisms. Journal of Economic Theory, 156:389–416, 2015.
  10. Bias no more: High-probability data-dependent regret bounds for adversarial bandits and MDPs. Advances in Neural Information Processing Systems, 33:15522–15533, 2020a.
  11. A closer look at small-loss bounds for bandits with graph feedback. In Conference on Learning Theory, pages 2516–2564. PMLR, 2020b.
  12. The weighted majority algorithm. Information and Computation, 108(2):212–261, 1994.
  13. Vladimir G. Vovk. A game of prediction with expert advice. In Proceedings of the Eighth Annual Conference on Computational Learning Theory, pages 51–60, 1995.
  14. More adaptive algorithms for adversarial bandits. In Conference on Learning Theory. PMLR, 2018.
  15. Tsallis-INF: An optimal algorithm for stochastic and adversarial bandits. Journal of Machine Learning Research, 22(28):1–49, 2021.
Authors (2)
  1. Julian Zimmert (30 papers)
  2. Teodor V. Marinov (14 papers)