Low-Rank Bandits via Tight Two-to-Infinity Singular Subspace Recovery (2402.15739v2)
Abstract: We study contextual bandits with low-rank structure where, in each round, if the (context, arm) pair $(i,j)\in [m]\times [n]$ is selected, the learner observes a noisy sample of the $(i,j)$-th entry of an unknown low-rank reward matrix. Successive contexts are generated i.i.d. at random and are revealed to the learner. For such bandits, we present efficient algorithms for policy evaluation, best policy identification, and regret minimization. For policy evaluation and best policy identification, we show that our algorithms are nearly minimax optimal. For instance, the number of samples required to return an $\varepsilon$-optimal policy with probability at least $1-\delta$ typically scales as $\frac{r(m+n)}{\varepsilon^2}\log(1/\delta)$. Our regret minimization algorithm enjoys minimax guarantees typically scaling as $r^{7/4}(m+n)^{3/4}\sqrt{T}$, which improves over existing algorithms. All the proposed algorithms consist of two phases: they first leverage spectral methods to estimate the left and right singular subspaces of the low-rank reward matrix. We show that these estimates enjoy tight error guarantees in the two-to-infinity norm. This in turn allows us to reformulate our problems as misspecified linear bandit problems with dimension roughly $r(m+n)$ and misspecification controlled by the subspace recovery error, and to design the second phase of our algorithms efficiently.
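The two-phase scheme described in the abstract can be sketched as follows. This is a minimal illustration, not the paper's exact construction: the helper names (`estimate_subspaces`, `linear_features`), the entrywise-averaging estimator, and the $r^2$-dimensional feature map used here are illustrative assumptions. Phase 1 averages the noisy entry observations and takes a rank-$r$ truncated SVD to recover the singular subspaces; phase 2 uses the estimated subspaces to build features for a (misspecified) linear bandit.

```python
import numpy as np

def estimate_subspaces(samples, m, n, r):
    """Phase-1 sketch: estimate the left/right singular subspaces of the
    low-rank reward matrix from noisy entry observations.

    `samples` is a list of (context, arm, reward) triples; reasonably
    uniform coverage of the entries is assumed.  Returns orthonormal
    bases U_hat (m x r) and V_hat (n x r).
    """
    sums = np.zeros((m, n))
    counts = np.zeros((m, n))
    for i, j, x in samples:
        sums[i, j] += x
        counts[i, j] += 1
    # Empirical mean matrix (zero where an entry was never observed).
    M_hat = np.divide(sums, counts, out=np.zeros_like(sums), where=counts > 0)
    # Rank-r truncated SVD yields the estimated singular subspaces.
    U, _, Vt = np.linalg.svd(M_hat, full_matrices=False)
    return U[:, :r], Vt[:r, :].T

def linear_features(U_hat, V_hat, i, j):
    """Phase-2 sketch: feature map for the induced linear bandit.

    Writing M_ij ~= U_hat[i]^T (U_hat^T M V_hat) V_hat[j], the reward is
    approximately linear in vec(U_hat[i] V_hat[j]^T), with the subspace
    recovery error acting as misspecification.  (This r^2-dimensional
    map is one simple choice; the paper's reduction has dimension
    roughly r(m+n).)
    """
    return np.outer(U_hat[i], V_hat[j]).ravel()
```

On a synthetic rank-$r$ matrix with many noisy samples per entry, the column space of `U_hat` closely captures that of the true matrix, which is what makes the second-phase linear reformulation viable.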
- Yassir Jedra
- William Réveillard
- Stefan Stojanovic
- Alexandre Proutiere