
Context-lumpable stochastic bandits (2306.13053v2)

Published 22 Jun 2023 in cs.LG

Abstract: We consider a contextual bandit problem with $S$ contexts and $K$ actions. In each round $t=1,2,\dots$, the learner observes a random context and chooses an action based on its past experience. The learner then observes a random reward whose mean is a function of the context and the action for the round. Under the assumption that the contexts can be lumped into $r\le \min\{S,K\}$ groups such that the mean reward for the various actions is the same for any two contexts that are in the same group, we give an algorithm that outputs an $\epsilon$-optimal policy after using at most $\widetilde O(r(S+K)/\epsilon^2)$ samples with high probability and provide a matching $\Omega(r(S+K)/\epsilon^2)$ lower bound. In the regret minimization setting, we give an algorithm whose cumulative regret up to time $T$ is bounded by $\widetilde O(\sqrt{r^3(S+K)T})$. To the best of our knowledge, we are the first to show the near-optimal sample complexity in the PAC setting and $\widetilde O(\sqrt{\mathrm{poly}(r)(S+K)T})$ minimax regret in the online setting for this problem. We also show our algorithms can be applied to more general low-rank bandits and get improved regret bounds in some scenarios.
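
To make the setting concrete, here is a minimal sketch (not from the paper; sizes, names, and the reward distribution are illustrative assumptions) of a context-lumpable environment in Python: each of the $S$ contexts is assigned to one of $r$ latent groups, and the $S\times K$ mean-reward matrix is constant within each group, so its rank is at most $r$.

```python
import numpy as np

rng = np.random.default_rng(0)

S, K, r = 12, 8, 3  # contexts, actions, latent groups (illustrative sizes)

# Each context belongs to one of r groups; two contexts in the same group
# share the same mean reward for every action.
group_of_context = rng.integers(0, r, size=S)    # shape (S,)
group_means = rng.uniform(size=(r, K))           # shape (r, K)
mean_reward = group_means[group_of_context]      # shape (S, K), rank <= r

def step(context: int, action: int) -> float:
    """Bernoulli reward whose mean depends only on the context's
    latent group and the chosen action."""
    return float(rng.random() < mean_reward[context, action])

# One round of interaction: a random context arrives, the learner acts.
context_t = int(rng.integers(S))
action_t = int(rng.integers(K))   # placeholder policy: uniform exploration
reward_t = step(context_t, action_t)

# Sanity check: lumpability forces the mean-reward matrix to have rank <= r.
assert np.linalg.matrix_rank(mean_reward) <= r
```

Intuitively, a learner that recovers the grouping only needs to estimate about $rK$ group-action means plus $S$ group memberships rather than all $SK$ context-action means, which is the scaling behind the $\widetilde O(r(S+K)/\epsilon^2)$ sample complexity quoted above.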

Authors (6)
  1. Chung-Wei Lee (19 papers)
  2. Qinghua Liu (33 papers)
  3. Yasin Abbasi-Yadkori (35 papers)
  4. Chi Jin (90 papers)
  5. Tor Lattimore (74 papers)
  6. Csaba Szepesvári (75 papers)
Citations (2)