Data-driven optimal stopping: A pure exploration analysis (2312.05880v1)
Abstract: The standard theory of optimal stopping rests on the idealised assumption that the underlying process is essentially known. In this paper, we drop this restriction and study data-driven optimal stopping for a general diffusion process, focusing on the statistical performance of the proposed estimator of the optimal stopping barrier. More specifically, we derive non-asymptotic upper bounds on the simple regret, along with uniform, non-asymptotic PAC bounds. Minimax optimality is verified by complementing the upper bounds with matching lower bounds on the simple regret. All results are established both under general conditions on the payoff functions and under more refined assumptions that mimic the margin condition used in binary classification, which yields an improved rate of convergence. Additionally, we investigate how our results on the simple regret transfer to the cumulative regret for a specific exploration-exploitation strategy, with respect to both upper and lower bounds.
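To fix ideas, here is a minimal sketch of the two regret notions the abstract refers to, written for a one-sided threshold (barrier) rule. The payoff $g$, discount rate $r$, hitting time $\tau_b$, estimated barrier $\hat b_n$, and strategy $(b_t)$ are illustrative notation assumed for this sketch, not taken from the paper itself.

```latex
% Hedged sketch: value of a barrier rule, simple regret, cumulative regret.
% All notation (g, r, tau_b, \hat{b}_n, b_t) is illustrative, not the paper's.
\[
  V(b) \;=\; \mathbb{E}_x\!\bigl[\, e^{-r\,\tau_b}\, g(X_{\tau_b}) \,\bigr],
  \qquad
  \tau_b \;=\; \inf\{\, t \ge 0 : X_t \ge b \,\},
\]
\[
  \underbrace{\;\sup_{b}\, V(b) \,-\, V(\hat b_n)\;}_{\text{simple regret of the final estimate } \hat b_n}
  \qquad\text{vs.}\qquad
  \underbrace{\;\sum_{t=1}^{T} \Bigl( \sup_{b}\, V(b) \,-\, V(b_t) \Bigr)\;}_{\text{cumulative regret of a strategy } (b_t)_{t \le T}}.
\]
```

Under this reading, the simple regret measures pure-exploration performance (the quality of the final barrier estimate only), while the cumulative regret also charges the learner for every suboptimal barrier used along the way, which is where the exploration-exploitation trade-off enters.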