The Fallacy of Minimizing Cumulative Regret in the Sequential Task Setting (2403.10946v2)
Abstract: Online Reinforcement Learning (RL) is typically framed as the process of minimizing cumulative regret (CR) through interactions with an unknown environment. However, real-world RL applications usually involve a sequence of tasks, where the data collected in the first task is used to warm-start the second task. The performance of the warm-start policy is measured by simple regret (SR). While minimizing CR and minimizing SR are generally conflicting objectives, previous research has shown that in stationary environments, both can be optimized in terms of the duration of the task, $T$. In practice, however, human-in-the-loop decisions between tasks often result in non-stationarity. For instance, in clinical trials, scientists may adjust target health outcomes between implementations. Our results show that task non-stationarity leads to a more restrictive trade-off between CR and SR. To balance these competing goals, the algorithm must explore excessively, leading to a CR bound worse than the typical optimal rate of $T^{1/2}$. These findings are practically significant, indicating that increased exploration is necessary in non-stationary environments to accommodate task changes, impacting the design of RL algorithms in fields such as healthcare and beyond.
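As a point of reference (these are the standard textbook definitions, not necessarily the paper's formal setting), in a $K$-armed bandit with mean rewards $\mu_1, \dots, \mu_K$, horizon $T$, arm $A_t$ pulled at round $t$, and recommended arm $\hat{k}_T$ returned at the end of the task, the two objectives are typically written as
$$\mathrm{CR}(T) = \sum_{t=1}^{T} \left(\mu^{*} - \mu_{A_t}\right), \qquad \mathrm{SR}(T) = \mu^{*} - \mu_{\hat{k}_T}, \qquad \mu^{*} = \max_{k} \mu_k.$$
CR penalizes every suboptimal pull made while learning, whereas SR measures only the quality of the policy handed to the next task, which is why optimizing one can require sacrificing the other.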