Batched Nonparametric Contextual Bandits (2402.17732v2)
Published 27 Feb 2024 in math.ST, cs.LG, stat.ML, and stat.TH
Abstract: We study nonparametric contextual bandits under batch constraints, where the expected reward for each action is modeled as a smooth function of covariates, and the policy is updated only at the end of each batch of observations. We establish a minimax regret lower bound for this setting and propose a novel batch learning algorithm that achieves the optimal regret (up to logarithmic factors). In essence, our procedure dynamically splits the covariate space into smaller bins, carefully aligning their widths with the batch size. Our theoretical results suggest that for nonparametric contextual bandits, a nearly constant number of policy updates suffices to attain the optimal regret of the fully online setting.
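The core idea in the abstract — partitioning the covariate space into bins whose widths are matched to the batch schedule, and eliminating clearly suboptimal arms within each bin only between batches — can be illustrated with a minimal sketch. The code below is an assumption-laden illustration, not the paper's actual algorithm: the function name `batched_binned_bandit`, the specific confidence radius, the batch-size and bin-width schedules, and the Gaussian noise model are all hypothetical choices made only to show the binning-plus-batched-elimination mechanism on one-dimensional covariates.

```python
import numpy as np

def batched_binned_bandit(X, reward_fn, K, batch_sizes, bin_widths, rng=None):
    """Schematic batched nonparametric contextual bandit on covariates in [0, 1].

    At the start of each batch the covariate space is split into bins of the
    prescribed width; within each bin, arms whose upper confidence bound falls
    below the best lower confidence bound (computed from all earlier batches)
    are eliminated, and the surviving arms are played uniformly for the batch.
    """
    rng = np.random.default_rng() if rng is None else rng
    T = len(X)
    history = []  # (context, arm, reward) triples observed so far
    t = 0
    for n_b, h_b in zip(batch_sizes, bin_widths):
        n_bins = max(1, int(np.ceil(1.0 / h_b)))
        # Rebuild per-bin arm statistics from all past observations.
        sums = np.zeros((n_bins, K))
        counts = np.zeros((n_bins, K))
        for x, a, r in history:
            j = min(int(x * n_bins), n_bins - 1)
            counts[j, a] += 1
            sums[j, a] += r
        means = sums / np.maximum(counts, 1)
        # Schematic confidence radius; the real algorithm ties this to the
        # smoothness of the reward functions and the batch schedule.
        radius = np.sqrt(2.0 * np.log(max(T, 2)) / np.maximum(counts, 1))
        active = []
        for j in range(n_bins):
            ucb, lcb = means[j] + radius[j], means[j] - radius[j]
            active.append(np.flatnonzero(ucb >= lcb.max()))  # keep plausible arms
        # Play the frozen policy for one batch; no updates until the batch ends.
        for _ in range(n_b):
            if t >= T:
                return history
            x = X[t]
            j = min(int(x * n_bins), n_bins - 1)
            a = int(rng.choice(active[j]))
            history.append((x, a, reward_fn(x, a) + rng.normal(scale=0.1)))
            t += 1
    return history


# Toy usage: three arms with smooth mean rewards, four batches of growing size
# and shrinking bin width (these schedules are illustrative, not the optimal ones).
if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.uniform(size=2000)
    reward_fn = lambda x, a: [np.sin(3 * x), x ** 2, 0.5][a]
    hist = batched_binned_bandit(X, reward_fn, K=3,
                                 batch_sizes=[100, 300, 600, 1000],
                                 bin_widths=[0.5, 0.25, 0.1, 0.05],
                                 rng=rng)
    print(f"played {len(hist)} rounds across 4 batches")
```

The defining design choice is that the policy is frozen within each batch: bins and active arm sets are recomputed only at batch boundaries, which is why only a small number of policy updates is needed when the bin widths shrink in step with the growing batch sizes.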