Optimistic Information Directed Sampling (2402.15411v2)

Published 23 Feb 2024 in cs.LG

Abstract: We study the problem of online learning in contextual bandit problems where the loss function is assumed to belong to a known parametric function class. We propose a new analytic framework for this setting that bridges the Bayesian theory of information-directed sampling due to Russo and Van Roy (2018) and the worst-case theory of Foster, Kakade, Qian, and Rakhlin (2021) based on the decision-estimation coefficient. Drawing from both lines of work, we propose an algorithmic template called Optimistic Information-Directed Sampling and show that it can achieve instance-dependent regret guarantees similar to those achievable by the classic Bayesian IDS method, but with the major advantage of not requiring any Bayesian assumptions. The key technical innovation of our analysis is introducing an optimistic surrogate model for the regret and using it to define a frequentist version of the Information Ratio of Russo and Van Roy (2018) and a less conservative version of the Decision-Estimation Coefficient of Foster et al. (2021).

Keywords: Contextual bandits, information-directed sampling, decision-estimation coefficient, first-order regret bounds.
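
The abstract invokes two quantities by name; for orientation, the display below recalls their standard forms from the cited works. This is a paraphrase in notation of our own choosing (Δ for expected regret, g for information gain, D_H for Hellinger distance), not the paper's frequentist definitions, which are built on the optimistic surrogate model described above.

% Standard Bayesian IDS rule of Russo and Van Roy (2018): play the
% action distribution p minimizing the information ratio, i.e. the
% squared expected regret divided by the expected information gain
% about the optimal action.
\pi_t \in \operatorname*{arg\,min}_{p \in \Delta(\mathcal{A})} \Psi_t(p),
\qquad
\Psi_t(p) = \frac{\Delta_t(p)^2}{g_t(p)}.

% Decision-estimation coefficient of Foster et al. (2021): the optimal
% trade-off, over decision distributions p, between the regret against
% the worst-case model M in the class and the squared Hellinger
% estimation error relative to a reference model \widehat{M}, at scale
% \gamma > 0.
\mathrm{dec}_{\gamma}(\mathcal{M}, \widehat{M})
= \inf_{p \in \Delta(\Pi)} \sup_{M \in \mathcal{M}}
\mathbb{E}_{\pi \sim p}\Big[ \Delta^{M}(\pi)
- \gamma\, D_{\mathrm{H}}^{2}\big(M(\pi), \widehat{M}(\pi)\big) \Big].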

References (39)
  1. N. Abe and P. M. Long. Associative reinforcement learning using linear probabilistic concepts. In International Conference on Machine Learning, pages 3–11, 1999.
  2. Open problem: First-order regret bounds for contextual bandits. In Conference on Learning Theory, pages 4–7, 2017.
  3. S. Agrawal and N. Goyal. Further optimal regret bounds for Thompson sampling. In Artificial Intelligence and Statistics, pages 99–107, 2013.
  4. Make the minority great again: First-order regret bound for contextual bandits. In International Conference on Machine Learning, pages 186–194, 2018.
  5. Efficient online bandit multiclass learning with Õ(√T) regret. In International Conference on Machine Learning, pages 488–497, 2017.
  6. Bandit multiclass linear classification: Efficient algorithms for the separable case. In International Conference on Machine Learning, pages 624–633, 2019.
  7. Concentration Inequalities: A Nonasymptotic Theory of Independence. Oxford University Press, 2013. ISBN 978-0-19-953525-5. doi:10.1093/acprof:oso/9780199535255.001.0001.
  8. S. Bubeck and M. Sellke. First-order Bayesian regret analysis of Thompson sampling. In Algorithmic Learning Theory, pages 196–233, 2020.
  9. Improved second-order bounds for prediction with expert advice. In Conference on Learning Theory, volume 3559 of Lecture Notes in Computer Science, pages 217–232. Springer, 2005.
  10. Unified Algorithms for RL with Decision-Estimation Coefficients: No-Regret, PAC, and Reward-Free Learning, 2022.
  11. Safe-Bayesian generalized linear regression. In Artificial Intelligence and Statistics, pages 2623–2633, 2020.
  12. Is a good representation sufficient for sample efficient reinforcement learning? In International Conference on Learning Representations, 2019.
  13. D. Foster and A. Rakhlin. Beyond UCB: Optimal and efficient contextual bandits with regression oracles. In International Conference on Machine Learning, pages 3199–3210, 2020.
  14. D. J. Foster and A. Krishnamurthy. Efficient first-order contextual bandits: Prediction, allocation, and triangular discrimination. Advances in Neural Information Processing Systems, 34:18907–18919, 2021.
  15. D. J. Foster, S. M. Kakade, J. Qian, and A. Rakhlin. The Statistical Complexity of Interactive Decision Making, 2021.
  16. Tight guarantees for interactive decision making with the decision-estimation coefficient. arXiv preprint arXiv:2301.08215, 2023a.
  17. Model-free reinforcement learning with the decision-estimation coefficient. In Thirty-seventh Conference on Neural Information Processing Systems, 2023b.
  18. P. Grünwald. The safe Bayesian: learning the learning rate via the mixability gap. In Algorithmic Learning Theory, pages 169–183, 2012.
  19. B. Hao and T. Lattimore. Regret bounds for information-directed reinforcement learning. Advances in Neural Information Processing Systems, 35:28575–28587, 2022.
  20. Contextual information-directed sampling. In International Conference on Machine Learning, pages 8446–8464, 2022.
  21. Efficient bandit algorithms for online multiclass prediction. In International Conference on Machine Learning, pages 440–447, 2008.
  22. J. Kirschner and A. Krause. Information directed sampling and bandits with heteroscedastic noise. In Conference on Learning Theory, pages 358–384, 2018.
  23. Information directed sampling for linear partial monitoring. In Conference on Learning Theory, pages 2328–2369, 2020.
  24. Asymptotically optimal information-directed sampling. In Conference on Learning Theory, pages 2777–2821, 2021.
  25. Regret minimization via saddle point optimization. In Thirty-seventh Conference on Neural Information Processing Systems, 2023.
  26. T. L. Lai and H. Robbins. Asymptotically efficient adaptive allocation rules. Advances in Applied Mathematics, 6:4–22, 1985.
  27. T. Lattimore and A. György. Mirror Descent and the Information Ratio. In Conference on Learning Theory, volume 134, pages 2965–2992, 2021.
  28. Learning with good feature representations in bandits and in RL with a generative model. In International Conference on Machine Learning, pages 5662–5670, 2020.
  29. S. Min and D. Russo. An information-theoretic analysis of nonstationary bandit learning. In International Conference on Machine Learning, 2023.
  30. Lifting the information ratio: An information-theoretic analysis of Thompson sampling for contextual bandits. Advances in Neural Information Processing Systems, 35:9486–9498, 2022.
  31. First- and second-order bounds for adversarial linear contextual bandits. In Thirty-seventh Conference on Neural Information Processing Systems, 2023.
  32. D. Russo and B. Van Roy. An information-theoretic analysis of Thompson sampling. Journal of Machine Learning Research, 17(68):1–30, 2016.
  33. D. Russo and B. Van Roy. Learning to optimize via information-directed sampling. Operations Research, 66(1):230–252, 2018.
  34. S. Shalev-Shwartz. Online learning: Theory, algorithms, and applications. PhD thesis, The Hebrew University of Jerusalem, 2007.
  35. W. R. Thompson. On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika, 25:285–294, 1933.
  36. Exponential lower bounds for planning in MDPs with linearly-realizable optimal action-value functions. In Algorithmic Learning Theory, pages 1237–1264, 2021.
  37. T. Zhang. From ε-entropy to KL-entropy: Analysis of minimum information complexity density estimation. The Annals of Statistics, 34(5), 2006a.
  38. T. Zhang. Information-theoretic upper and lower bounds for statistical estimation. IEEE Transactions on Information Theory, 52(4):1307–1321, 2006b.
  39. T. Zhang. Feel-good Thompson sampling for contextual bandits and reinforcement learning. SIAM Journal on Mathematics of Data Science, 4(2):834–857, 2022.
