Multiarmed Bandits Problem Under the Mean-Variance Setting (2212.09192v4)
Abstract: The classical multi-armed bandit (MAB) problem involves a learner and a collection of K independent arms, each with its own reward distribution that is unknown ex ante. At each of a finite number of rounds, the learner selects one arm and receives new information. The learner often faces an exploration-exploitation dilemma: exploiting the current information by playing the arm with the highest estimated reward, versus exploring all arms to gather more reward information. The design objective is to maximize the expected cumulative reward over all rounds. However, such an objective does not account for a risk-reward tradeoff, which is a fundamental precept in many areas of application, most notably in finance and economics. In this paper, we build upon Sani et al. (2012) and extend the classical MAB problem to a mean-variance setting. Specifically, we relax the assumptions of independent arms and bounded rewards made in Sani et al. (2012) by considering sub-Gaussian arms. We introduce the Risk Aware Lower Confidence Bound (RALCB) algorithm to solve the problem, and study some of its properties. Finally, we perform a number of numerical simulations to demonstrate that, in both the independent and the dependent scenario, our proposed approach outperforms the algorithm of Sani et al. (2012).
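To make the mean-variance setting concrete: following Sani et al. (2012), each arm i can be scored by its mean-variance MV_i = σ_i² − ρμ_i, where ρ ≥ 0 encodes the learner's risk tolerance, and a risk-averse learner prefers the arm with the smallest MV_i. The sketch below illustrates a lower-confidence-bound rule in that spirit; the confidence width c·sqrt(log t / n_i), the constant c, the function name `mean_variance_lcb`, and the Gaussian test arms are illustrative assumptions, not the paper's exact RALCB specification.

```python
import numpy as np

def mean_variance_lcb(arms, horizon, rho=1.0, c=2.0, seed=0):
    """Sketch of a lower-confidence-bound rule for mean-variance bandits.

    Each arm's empirical mean-variance is MV_i = var_i - rho * mean_i
    (Sani et al., 2012); lower MV is better for a risk-averse learner.
    The confidence width c * sqrt(log(t) / n_i) is a generic choice,
    not the paper's exact RALCB radius.
    """
    rng = np.random.default_rng(seed)
    K = len(arms)
    rewards = [[] for _ in range(K)]

    # Pull each arm once so the empirical estimates are well defined.
    for i in range(K):
        rewards[i].append(arms[i](rng))

    for t in range(K, horizon):
        lcb = np.empty(K)
        for i in range(K):
            n_i = len(rewards[i])
            mu = np.mean(rewards[i])
            var = np.var(rewards[i])
            width = c * np.sqrt(np.log(t + 1) / n_i)
            # Optimistic (low) estimate of the arm's mean-variance.
            lcb[i] = var - rho * mu - width
        i_star = int(np.argmin(lcb))
        rewards[i_star].append(arms[i_star](rng))

    return [len(r) for r in rewards]

# Two hypothetical Gaussian (hence sub-Gaussian) arms: arm 0 has the
# higher mean but a much larger variance, so a risk-averse learner
# should come to prefer arm 1.
arms = [lambda g: g.normal(1.0, 2.0), lambda g: g.normal(0.8, 0.1)]
print(mean_variance_lcb(arms, horizon=2000, rho=0.5))
```

With ρ = 0.5 the second arm has the smaller mean-variance (0.01 − 0.4 versus 4 − 0.5), so the pull counts should concentrate on it as the confidence widths shrink.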
- Agrawal, R. (1995). Sample mean based index policies by O(log n) regret for the multi-armed bandit problem. Advances in Applied Probability, 27(4):1054–1078.
- Audibert, J.-Y., Munos, R., and Szepesvári, C. (2009). Exploration-exploitation tradeoff using variance estimates in multi-armed bandits. Theoretical Computer Science, 410(19):1876–1902.
- Auer, P., Cesa-Bianchi, N., and Fischer, P. (2002). Finite-time analysis of the multiarmed bandit problem. Machine Learning, 47(2):235–256.
- Baudry, D., Gautron, R., Kaufmann, E., and Maillard, O.-A. (2021). Optimal Thompson sampling strategies for support-aware CVaR bandits. In International Conference on Machine Learning, pages 716–726. PMLR.
- Boldrini, S., De Nardis, L., Caso, G., Le, M. T. P., Fiorina, J., and Di Benedetto, M.-G. (2018). muMAB: A multi-armed bandit model for wireless network selection. Algorithms, 11(2).
- Chang, J. Q. L. and Tan, V. Y. F. (2022). A unifying theory of Thompson sampling for continuous risk-averse bandits. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 36, pages 6159–6166.
- Chapelle, O. and Li, L. (2011). An empirical evaluation of Thompson sampling. In Advances in Neural Information Processing Systems, volume 24. Curran Associates, Inc.
- Durand, A., Achilleos, C., Iacovides, D., Strati, K., Mitsis, G. D., and Pineau, J. (2018). Contextual bandits for adapting treatment in a mouse model of de novo carcinogenesis. In Proceedings of the 3rd Machine Learning for Healthcare Conference, volume 85 of Proceedings of Machine Learning Research, pages 67–82. PMLR.
- Galichet, N., Sebag, M., and Teytaud, O. (2014). Exploration vs exploitation vs safety: Risk-averse multi-armed bandits. CoRR, abs/1401.1123.
- Gupta, S., Chaudhari, S., Joshi, G., and Yağan, O. (2021). Multi-armed bandits with correlated arms. IEEE Transactions on Information Theory, 67(10):6711–6732.
- Kagrecha, A., Nair, J., and Jagannathan, K. (2019). Distribution oblivious, risk-aware algorithms for multi-armed bandits with unbounded rewards. CoRR, abs/1906.00569.
- Lai, T. L. and Robbins, H. (1985). Asymptotically efficient adaptive allocation rules. Advances in Applied Mathematics, 6(1):4–22.
- Liu, X., Derakhshani, M., Lambotharan, S., and van der Schaar, M. (2021). Risk-aware multi-armed bandits with refined upper confidence bounds. IEEE Signal Processing Letters, 28:269–273.
- Markowitz, H. M. (1968). Portfolio Selection: Efficient Diversification of Investments. Yale University Press.
- Mary, J., Gaudel, R., and Preux, P. (2015). Bandits and recommender systems. In Machine Learning, Optimization, and Big Data, pages 325–336, Cham. Springer International Publishing.
- von Neumann, J. and Morgenstern, O. (1944). Theory of Games and Economic Behavior. Princeton University Press.
- Prashanth, L. A., Jagannathan, K., and Kolla, R. K. (2020). Concentration bounds for CVaR estimation: The cases of light-tailed and heavy-tailed distributions. In International Conference on Machine Learning, pages 5577–5586. PMLR.
- Rigollet, P. (2015). High dimensional statistics. Lecture notes for course 18.S997. MIT OpenCourseWare.
- Rivasplata, O. (2012). Subgaussian random variables: An expository note. Internet publication, 5 pages.
- Robbins, H. (1952). Some aspects of the sequential design of experiments. Bulletin of the American Mathematical Society, 58(5):527–535.
- Sani, A., Lazaric, A., and Munos, R. (2012). Risk-aversion in multi-armed bandits. In Advances in Neural Information Processing Systems, volume 25. Curran Associates, Inc.
- Shen, W., Wang, J., Jiang, Y.-G., and Zha, H. (2015). Portfolio choices with orthogonal bandit learning. In IJCAI.
- Thompson, W. R. (1933). On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika, 25(3/4):285–294.
- Vakili, S. and Zhao, Q. (2015). Mean-variance and value at risk in multi-armed bandit problems. In 2015 53rd Annual Allerton Conference on Communication, Control, and Computing (Allerton), pages 1330–1335.
- Vakili, S. and Zhao, Q. (2016). Risk-averse multi-armed bandit problems under mean-variance measure. IEEE Journal of Selected Topics in Signal Processing, 10(6):1093–1111.
- Wainwright, M. J. (2019). High-Dimensional Statistics: A Non-Asymptotic Viewpoint, volume 48. Cambridge University Press.
- Zhao, Q. (2019). Multi-armed bandits: Theory and applications to online learning in networks. Synthesis Lectures on Communication Networks, 12(1):1–165.
- Zhu, M., Zheng, X., Wang, Y., Li, Y., and Liang, Q. (2019). Adaptive portfolio by solving multi-armed bandit via Thompson sampling. arXiv preprint arXiv:1911.05309.
- Zhu, Q. and Tan, V. Y. F. (2020). Thompson sampling algorithms for mean-variance bandits. In Proceedings of the 37th International Conference on Machine Learning, volume 119 of Proceedings of Machine Learning Research, pages 11599–11608. PMLR.