An Experimental Design for Anytime-Valid Causal Inference on Multi-Armed Bandits

Published 9 Nov 2023 in stat.ME and cs.LG | arXiv:2311.05794v4

Abstract: Experimentation is crucial for managers to rigorously quantify the value of a change and determine whether it yields a statistically significant improvement over the status quo. As companies increasingly mandate that all changes undergo experimentation before widespread release, two challenges arise: (1) minimizing the proportion of customers assigned to the inferior treatment and (2) increasing experimentation velocity by enabling data-dependent stopping. This paper addresses both challenges by introducing the Mixture Adaptive Design (MAD), a new experimental design for multi-armed bandit (MAB) algorithms that enables anytime-valid inference on the Average Treatment Effect (ATE) for any MAB algorithm. Intuitively, the MAD "mixes" any bandit algorithm with a Bernoulli design: at each time step, the probability of assigning a unit via the Bernoulli design is determined by a user-specified deterministic sequence that can converge to zero. This sequence lets managers directly control the trade-off between regret minimization and inferential precision. Under mild conditions on the rate at which the sequence converges to zero, we provide a confidence sequence that is asymptotically anytime-valid and guaranteed to shrink around the true ATE. Hence, when the true ATE converges to a non-zero value, the MAD confidence sequence is guaranteed to exclude zero in finite time. The MAD therefore enables managers to stop experiments early while ensuring valid inference, enhancing both the efficiency and reliability of adaptive experiments. Empirically, we demonstrate that the MAD achieves finite-sample anytime-validity while accurately and precisely estimating the ATE, all without incurring significant losses in reward compared to standard bandit designs.
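The mixture rule described in the abstract is concrete enough to sketch. Below is a minimal Python illustration of a MAD-style assignment step for a two-armed experiment: at time t, with probability delta(t) the unit is assigned by a fair coin, and otherwise by whatever propensity the underlying bandit algorithm proposes. The function names, the example schedule delta(t) = t^(-0.24), and the placeholder interval width are illustrative assumptions for this sketch, not the paper's exact construction.

```python
import numpy as np

def mad_assign(t, bandit_p1, delta, rng):
    """One MAD-style assignment at step t (two arms, 0 and 1).

    bandit_p1: probability the underlying bandit algorithm would
               assign arm 1 at this step (any MAB algorithm works here).
    delta:     user-chosen deterministic sequence with delta(t) -> 0.
    Returns the assigned arm and the realized propensity for arm 1,
    which the IPW-style ATE estimator needs downstream.
    """
    d = delta(t)
    # Mixture of a Bernoulli(1/2) design and the bandit's own propensity.
    p1 = d * 0.5 + (1.0 - d) * bandit_p1
    arm = int(rng.random() < p1)
    return arm, p1

# Illustrative schedule: converges to zero, assumed slow enough to
# satisfy the paper's rate condition (the exact rate is in the paper).
delta = lambda t: t ** -0.24

rng = np.random.default_rng(0)
true_means = (0.5, 0.6)   # hypothetical arm means; true ATE = 0.1
ipw_terms = []

for t in range(1, 5001):
    bandit_p1 = 0.9       # stand-in for, e.g., a Thompson sampling propensity
    arm, p1 = mad_assign(t, bandit_p1, delta, rng)
    y = rng.normal(true_means[arm], 1.0)
    # Horvitz-Thompson-style unbiased term for the ATE at step t.
    ipw_terms.append(y / p1 if arm == 1 else -y / (1.0 - p1))

est = np.mean(ipw_terms)
# Placeholder fixed-sample width shown only for scale; the paper's
# asymptotic confidence sequence replaces this with a time-uniform bound.
width = 1.96 * np.std(ipw_terms) / np.sqrt(len(ipw_terms))
print(f"ATE estimate {est:.3f} +/- {width:.3f}")
```

In this sketch the IPW estimate concentrates near the hypothetical ATE of 0.1 even though the stand-in bandit heavily favors arm 1; the diminishing Bernoulli share keeps the early propensities non-degenerate, which is what makes the inverse-propensity terms well-behaved and is the intuition behind the design's inferential guarantees.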
