An Analysis of Switchback Designs in Reinforcement Learning (2403.17285v2)

Published 26 Mar 2024 in stat.ML and cs.LG

Abstract: This paper offers a detailed investigation of switchback designs in A/B testing, which alternate between the baseline and new policies over time. Our aim is to thoroughly evaluate the effects of these designs on the accuracy of the resulting average treatment effect (ATE) estimators. We propose a novel "weak signal analysis" framework, which substantially simplifies the calculation of the mean squared errors (MSEs) of these ATE estimators in Markov decision process environments. Our findings suggest that (i) when the majority of reward errors are positively correlated, the switchback design is more efficient than the alternating-day design, which switches policies on a daily basis; additionally, increasing the frequency of policy switches tends to reduce the MSE of the ATE estimator. (ii) When the errors are uncorrelated, however, all these designs become asymptotically equivalent. (iii) When the majority of errors are negatively correlated, the alternating-day design becomes the optimal choice. These insights offer practical guidelines for practitioners designing A/B testing experiments. Our analysis accommodates a variety of policy value estimators, including model-based estimators, least squares temporal difference learning estimators, and double reinforcement learning estimators, thereby offering a comprehensive understanding of optimal design strategies for policy evaluation in reinforcement learning.
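
To make the design comparison concrete, below is a minimal Python sketch (not the paper's method) contrasting a switchback design, which alternates the baseline and new policies within each day, with an alternating-day design, which runs a single policy per day. It assumes a toy reward model with AR(1) errors and uses a plain difference-in-means ATE estimator, ignoring the Markovian carryover effects the paper analyzes; all function names, parameters, and simulation settings are illustrative only.

```python
# Toy comparison of switchback vs. alternating-day designs (illustrative only).
# The paper studies model-based, LSTD, and double RL estimators in MDPs;
# here we use a simple difference-in-means ATE estimate with AR(1) errors.
import numpy as np

rng = np.random.default_rng(0)

def assign_policies(n_days, intervals_per_day, design):
    """Return a 0/1 policy indicator for each time interval."""
    if design == "switchback":
        # Alternate baseline (0) and new policy (1) every interval within a day.
        per_day = np.arange(intervals_per_day) % 2
        return np.tile(per_day, n_days)
    if design == "alternating-day":
        # Run one policy for a whole day, then switch the next day.
        return np.repeat(np.arange(n_days) % 2, intervals_per_day)
    raise ValueError(design)

def simulate_rewards(policy, effect=0.3, rho=0.5):
    """Rewards with an AR(1) error term; rho > 0 gives positively
    correlated errors, rho < 0 negatively correlated errors."""
    errors = np.empty(len(policy))
    errors[0] = rng.normal()
    for t in range(1, len(policy)):
        errors[t] = rho * errors[t - 1] + rng.normal(scale=np.sqrt(1 - rho**2))
    return effect * policy + errors

def ate_hat(policy, rewards):
    """Naive difference-in-means ATE estimate."""
    return rewards[policy == 1].mean() - rewards[policy == 0].mean()

for design in ("switchback", "alternating-day"):
    estimates = []
    for _ in range(2000):
        policy = assign_policies(n_days=10, intervals_per_day=24, design=design)
        estimates.append(ate_hat(policy, simulate_rewards(policy, rho=0.5)))
    mse = np.mean((np.array(estimates) - 0.3) ** 2)
    print(f"{design:16s} MSE of ATE estimate: {mse:.4f}")
```

In this toy setup, a positive error correlation (rho > 0) typically favors the switchback design, while a negative rho favors the alternating-day design, loosely mirroring findings (i) and (iii) above; the paper's weak signal analysis makes these comparisons precise for MDP environments and for the estimators it considers.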
