Stochastic contextual bandits with graph feedback: from independence number to MAS number (2402.18591v2)

Published 12 Feb 2024 in cs.LG, cs.GT, math.ST, and stat.TH

Abstract: We consider contextual bandits with graph feedback, a class of interactive learning problems with richer structures than vanilla contextual bandits, where taking an action reveals the rewards for all neighboring actions in the feedback graph under all contexts. Unlike the multi-armed bandits setting where a growing literature has painted a near-complete understanding of graph feedback, much remains unexplored in the contextual bandits counterpart. In this paper, we make inroads into this inquiry by establishing a regret lower bound $\Omega(\sqrt{\beta_M(G) T})$, where $M$ is the number of contexts, $G$ is the feedback graph, and $\beta_M(G)$ is our proposed graph-theoretic quantity that characterizes the fundamental learning limit for this class of problems. Interestingly, $\beta_M(G)$ interpolates between $\alpha(G)$ (the independence number of the graph) and $\mathsf{m}(G)$ (the maximum acyclic subgraph (MAS) number of the graph) as the number of contexts $M$ varies. We also provide algorithms that achieve near-optimal regret for important classes of context sequences and/or feedback graphs, such as transitively closed graphs that find applications in auctions and inventory control. In particular, with many contexts, our results show that the MAS number essentially characterizes the statistical complexity for contextual bandits, as opposed to the independence number in multi-armed bandits.
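To make the two endpoints of the interpolation concrete, below is a minimal brute-force sketch (not from the paper) that computes the independence number alpha(G) and the maximum acyclic subgraph (MAS) number m(G) on a small, hypothetical directed feedback graph. The adjacency structure, function names, and the specific graph are illustrative assumptions chosen for this example, not the authors' code; both quantities are NP-hard in general, so exhaustive search is used only because the graph is tiny.

    # Illustrative sketch (assumed example, not from the paper): brute-force
    # computation of alpha(G) and m(G) on a toy directed feedback graph.
    from itertools import combinations

    # Directed feedback graph on 4 actions: playing u also reveals the reward
    # of every action v in adj[u]; self-loops (each action observes itself)
    # are left implicit.
    adj = {0: {1, 2}, 1: {0}, 2: {3}, 3: set()}
    nodes = list(adj)

    def has_cycle(S):
        """Return True if the subgraph induced by vertex set S has a directed cycle."""
        S = set(S)
        indeg = {v: 0 for v in S}
        for u in S:
            for v in adj[u] & S:
                indeg[v] += 1
        stack = [v for v in S if indeg[v] == 0]  # Kahn's algorithm
        seen = 0
        while stack:
            u = stack.pop()
            seen += 1
            for v in adj[u] & S:
                indeg[v] -= 1
                if indeg[v] == 0:
                    stack.append(v)
        return seen < len(S)  # vertices left over imply a cycle

    def independence_number():
        """alpha(G): largest vertex set with no edge between any pair, in either direction."""
        for r in range(len(nodes), 0, -1):
            for S in combinations(nodes, r):
                if all(v not in adj[u] and u not in adj[v] for u, v in combinations(S, 2)):
                    return r
        return 0

    def mas_number():
        """m(G): largest vertex set whose induced subgraph is acyclic."""
        for r in range(len(nodes), 0, -1):
            for S in combinations(nodes, r):
                if not has_cycle(S):
                    return r
        return 0

    print(independence_number(), mas_number())  # 2 and 3 on this toy graph

On this toy graph the two quantities differ (alpha(G) = 2, m(G) = 3), which is exactly the gap the paper's quantity beta_M(G) is meant to track: with few contexts the lower bound scales with the independence number, while with many contexts it scales with the MAS number.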

Authors (3)
  1. Yuxiao Wen
  2. Yanjun Han
  3. Zhengyuan Zhou