Context-lumpable stochastic bandits (2306.13053v2)
Abstract: We consider a contextual bandit problem with $S$ contexts and $K$ actions. In each round $t=1,2,\dots$, the learner observes a random context and chooses an action based on its past experience. The learner then observes a random reward whose mean is a function of the context and the action for the round. Under the assumption that the contexts can be lumped into $r\le \min\{S,K\}$ groups such that the mean reward for the various actions is the same for any two contexts that are in the same group, we give an algorithm that outputs an $\epsilon$-optimal policy after using at most $\widetilde O(r(S+K)/\epsilon^2)$ samples with high probability, and we provide a matching $\Omega(r(S+K)/\epsilon^2)$ lower bound. In the regret minimization setting, we give an algorithm whose cumulative regret up to time $T$ is bounded by $\widetilde O(\sqrt{r^3(S+K)T})$. To the best of our knowledge, we are the first to show near-optimal sample complexity in the PAC setting and $\widetilde O(\sqrt{\mathrm{poly}(r)(S+K)T})$ minimax regret in the online setting for this problem. We also show that our algorithms can be applied to more general low-rank bandits and obtain improved regret bounds in some scenarios.
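To make the setting concrete, here is a minimal Python sketch of the context-lumpable bandit model described in the abstract. It is not the paper's algorithm; the Bernoulli rewards, the uniform context distribution, and the random group assignment are illustrative assumptions, while the paper only requires that any two contexts in the same group share the same mean-reward vector over actions.

```python
# Minimal sketch of a context-lumpable contextual bandit environment.
# Assumptions for illustration: uniform context arrivals, Bernoulli rewards.
import numpy as np


class LumpableContextualBandit:
    def __init__(self, S, K, r, seed=0):
        assert r <= min(S, K), "number of groups must satisfy r <= min(S, K)"
        self.rng = np.random.default_rng(seed)
        self.S, self.K, self.r = S, K, r
        # Each context belongs to one of r latent groups (unknown to the learner).
        self.group_of = self.rng.integers(r, size=S)
        # Mean rewards depend only on the (group, action) pair, not on the context itself.
        self.group_means = self.rng.uniform(size=(r, K))

    def sample_context(self):
        # Contexts arrive i.i.d.; uniform is an illustrative choice.
        return int(self.rng.integers(self.S))

    def mean_reward(self, context, action):
        return self.group_means[self.group_of[context], action]

    def pull(self, context, action):
        # Bernoulli reward whose mean depends only on the context's group.
        return float(self.rng.random() < self.mean_reward(context, action))


# Usage: a naive baseline that ignores the group structure and estimates all
# S*K means directly needs on the order of SK/eps^2 samples; the paper's result
# shows that roughly r(S+K)/eps^2 samples suffice, matching the lower bound.
env = LumpableContextualBandit(S=20, K=10, r=3)
context = env.sample_context()
reward = env.pull(context, action=0)
```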
Authors: Chung-Wei Lee, Qinghua Liu, Yasin Abbasi-Yadkori, Chi Jin, Tor Lattimore, Csaba Szepesvári