A Tractable Online Learning Algorithm for the Multinomial Logit Contextual Bandit (2011.14033v7)
Abstract: In this paper, we consider the contextual variant of the MNL-Bandit problem. More specifically, we consider a dynamic set optimization problem in which a decision-maker offers a subset (assortment) of products to a consumer and observes the response in every round. Consumers purchase products to maximize their utility. We assume that a set of attributes describes the products and that the mean utility of a product is linear in the values of these attributes. We model consumer choice behavior using the widely used Multinomial Logit (MNL) model and consider the decision-maker's problem of dynamically learning the model parameters while optimizing cumulative revenue over the selling horizon $T$. Though this problem has attracted considerable attention in recent times, many existing methods involve solving an intractable non-convex optimization problem, and their theoretical performance guarantees depend on a problem-dependent parameter that could be prohibitively large. In particular, existing algorithms for this problem have regret bounded by $O(\sqrt{\kappa d T})$, where $\kappa$ is a problem-dependent constant that can have an exponential dependency on the number of attributes $d$. In this paper, we propose an optimistic algorithm and show that its regret is bounded by $O(\sqrt{dT} + \kappa)$, significantly improving the performance over existing methods. Further, we propose a convex relaxation of the optimization step, which allows for tractable decision-making while retaining the favourable regret guarantee.
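To make the choice model in the abstract concrete, the following is a minimal sketch of MNL purchase probabilities when the mean utility is linear in product attributes. It is an illustration only, not the paper's algorithm. The function names, the `revenues` vector, and the convention that the no-purchase option has utility 0 are assumptions introduced here.

```python
import numpy as np


def mnl_choice_probabilities(features, theta):
    """Return MNL choice probabilities for one offered assortment.

    features: (K, d) array, one row of attributes per offered product.
    theta:    (d,) parameter vector; the mean utility of product i is
              features[i] @ theta (linear in the attributes).
    The result has length K + 1. Index 0 is the no-purchase option, whose
    utility is fixed at 0 (an assumption of this sketch).
    """
    utilities = np.concatenate(([0.0], features @ theta))
    utilities = utilities - utilities.max()  # shift for numerical stability
    weights = np.exp(utilities)
    return weights / weights.sum()


def expected_revenue(features, theta, revenues):
    """Expected one-round revenue of the assortment (hypothetical helper)."""
    probs = mnl_choice_probabilities(features, theta)
    return float(probs[1:] @ revenues)  # no-purchase contributes zero revenue
```

A learning algorithm for this setting would typically replace the unknown `theta` with an estimate (or an optimistic surrogate) and pick the assortment that maximizes a quantity like `expected_revenue` in each round.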
Authors: Priyank Agrawal, Theja Tulabandhula, Vashist Avadhanula