Weak Signal Asymptotics for Sequentially Randomized Experiments
Abstract: We use the lens of weak signal asymptotics to study a class of sequentially randomized experiments, including those that arise in solving multi-armed bandit problems. In an experiment with $n$ time steps, we let the mean reward gaps between actions scale as $1/\sqrt{n}$ so as to preserve the difficulty of the learning task as $n$ grows. In this regime, we show that the sample paths of a class of sequentially randomized experiments -- adapted to this scaling regime and with arm-selection probabilities that vary continuously with state -- converge weakly to a diffusion limit, given as the solution to a stochastic differential equation. The diffusion limit enables us to derive a refined, instance-specific characterization of the stochastic dynamics, and to obtain several insights into the regret and belief evolution of a number of sequential experiments, including Thompson sampling (but not UCB, which does not satisfy our continuity assumption). We show that all sequential experiments whose randomization probabilities have a Lipschitz-continuous dependence on the observed data suffer from sub-optimal regret when the reward gaps are relatively large. Conversely, we find that a version of Thompson sampling with an asymptotically uninformative prior variance achieves near-optimal instance-specific regret scaling, including for large reward gaps, but these good regret properties come at the cost of highly unstable posterior beliefs.
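To make the scaling regime concrete, below is a minimal simulation sketch (not code from the paper) of two-armed Gaussian Thompson sampling with the mean reward gap set to $\delta/\sqrt{n}$. The function `thompson_weak_signal` and its parameters (`delta`, `sigma`, `prior_var`) are illustrative assumptions, not the paper's notation.

```python
# A minimal sketch (assumption: not the paper's code) of two-armed Gaussian
# Thompson sampling under the weak-signal scaling described above: with
# horizon n, the mean reward gap is delta / sqrt(n), so the learning task
# remains comparably hard as n grows.
import numpy as np


def thompson_weak_signal(n, delta=1.0, sigma=1.0, prior_var=1.0, seed=None):
    """Run one sample path; return the cumulative-regret trajectory."""
    rng = np.random.default_rng(seed)
    means = np.array([delta / np.sqrt(n), 0.0])  # gap shrinks like 1/sqrt(n)
    reward_sums = np.zeros(2)  # sufficient statistics of the
    pulls = np.zeros(2)        # Gaussian posterior for each arm
    regret, cum = np.zeros(n), 0.0
    for t in range(n):
        # Conjugate update: N(0, prior_var) prior, N(mean, sigma^2) rewards.
        post_var = 1.0 / (1.0 / prior_var + pulls / sigma**2)
        post_mean = post_var * reward_sums / sigma**2
        # The arm-selection probability varies continuously with the
        # posterior state -- the key assumption behind the diffusion limit.
        arm = int(np.argmax(rng.normal(post_mean, np.sqrt(post_var))))
        reward = rng.normal(means[arm], sigma)
        reward_sums[arm] += reward
        pulls[arm] += 1
        cum += means.max() - means[arm]
        regret[t] = cum
    return regret


if __name__ == "__main__":
    n = 10_000
    paths = np.stack([thompson_weak_signal(n, seed=s) for s in range(50)])
    # Under this scaling, cumulative regret is of order sqrt(n), which is
    # why rescaling regret by sqrt(n) yields a nondegenerate limit.
    print("mean terminal regret / sqrt(n):", paths[:, -1].mean() / np.sqrt(n))
```

An asymptotically uninformative prior, as in the near-optimal variant mentioned above, could be emulated here by letting `prior_var` grow with $n$; the resulting instability of posterior beliefs would appear as widely fluctuating `post_mean` trajectories.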