Weak Signal Asymptotics for Sequentially Randomized Experiments (2101.09855v7)

Published 25 Jan 2021 in math.ST, cs.LG, and stat.TH

Abstract: We use the lens of weak signal asymptotics to study a class of sequentially randomized experiments, including those that arise in solving multi-armed bandit problems. In an experiment with $n$ time steps, we let the mean reward gaps between actions scale on the order of $1/\sqrt{n}$ so as to preserve the difficulty of the learning task as $n$ grows. In this regime, we show that the sample paths of a class of sequentially randomized experiments -- adapted to this scaling regime and with arm selection probabilities that vary continuously with state -- converge weakly to a diffusion limit, given as the solution to a stochastic differential equation. The diffusion limit enables us to derive a refined, instance-specific characterization of the stochastic dynamics, and to obtain several insights into the regret and belief evolution of a number of sequential experiments, including Thompson sampling (but not UCB, which does not satisfy our continuity assumption). We show that all sequential experiments whose randomization probabilities have a Lipschitz-continuous dependence on the observed data suffer from sub-optimal regret performance when the reward gaps are relatively large. Conversely, we find that a version of Thompson sampling with an asymptotically uninformative prior variance achieves near-optimal instance-specific regret scaling, even with large reward gaps, but these good regret properties come at the cost of highly unstable posterior beliefs.
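
A minimal simulation sketch of the scaling regime described in the abstract, assuming a two-armed Gaussian bandit run under Gaussian-prior Thompson sampling. This is illustrative code, not from the paper: the function name run_weak_signal_ts, the Gaussian reward model, and all parameter defaults are assumptions made for the example. The mean reward gap is set to gap_const / sqrt(n), so the learning task stays comparably hard as the horizon n grows, and the arm-selection probability induced by the posterior draws varies continuously with the posterior state, in the spirit of the paper's continuity condition.

import numpy as np

def run_weak_signal_ts(n, gap_const=1.0, sigma=1.0, prior_var=1.0, seed=0):
    """Two-armed Gaussian Thompson sampling with arm means 0 and
    gap_const / sqrt(n); returns the realized cumulative regret."""
    rng = np.random.default_rng(seed)
    delta = gap_const / np.sqrt(n)   # weak-signal scaling of the reward gap
    means = np.array([0.0, delta])   # arm 1 is better by delta

    # Gaussian posterior state per arm (conjugate to the Gaussian rewards).
    post_mean = np.zeros(2)
    post_var = np.full(2, prior_var)
    regret = 0.0

    for _ in range(n):
        # Thompson sampling step: sample each posterior, play the argmax.
        # The induced selection probability is a continuous function of
        # the posterior state (post_mean, post_var).
        arm = int(np.argmax(rng.normal(post_mean, np.sqrt(post_var))))
        reward = rng.normal(means[arm], sigma)
        regret += means.max() - means[arm]

        # Conjugate Gaussian posterior update for the played arm.
        precision = 1.0 / post_var[arm] + 1.0 / sigma**2
        post_mean[arm] = (post_mean[arm] / post_var[arm]
                          + reward / sigma**2) / precision
        post_var[arm] = 1.0 / precision

    return regret

# With delta = gap_const / sqrt(n), the worst possible cumulative regret
# is n * delta = gap_const * sqrt(n); the diffusion limit in the paper
# gives a finer, instance-specific description of this quantity.
for n in (1_000, 10_000, 100_000):
    print(n, run_weak_signal_ts(n))

Here prior_var is the knob that the abstract's "asymptotically uninformative prior variance" result concerns: how this variance is scaled with n governs the trade-off the paper identifies between near-optimal instance-specific regret and unstable posterior beliefs.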
