
The Fallacy of Minimizing Cumulative Regret in the Sequential Task Setting (2403.10946v2)

Published 16 Mar 2024 in stat.ML and cs.LG

Abstract: Online Reinforcement Learning (RL) is typically framed as the process of minimizing cumulative regret (CR) through interactions with an unknown environment. However, real-world RL applications usually involve a sequence of tasks, and the data collected in the first task is used to warm-start the second task. The performance of the warm-start policy is measured by simple regret (SR). While minimizing CR and minimizing SR are generally conflicting objectives, previous research has shown that in stationary environments both can be optimized in terms of the duration of the task, $T$. In practice, however, human-in-the-loop decisions between tasks often result in non-stationarity. For instance, in clinical trials, scientists may adjust target health outcomes between implementations. Our results show that task non-stationarity leads to a more restrictive trade-off between CR and SR. To balance these competing goals, the algorithm must explore excessively, leading to a CR bound worse than the typical optimal rate of $T^{1/2}$. These findings are practically significant, indicating that increased exploration is necessary in non-stationary environments to accommodate task changes, impacting the design of RL algorithms in fields such as healthcare and beyond.
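The CR/SR trade-off described in the abstract can be made concrete with a toy experiment. The sketch below is not from the paper; it is a minimal two-armed Bernoulli bandit in Python, with the arm means, horizon, and epsilon-greedy exploration rule chosen purely for illustration. More in-task exploration inflates cumulative regret during the task but tends to lower the simple regret of the greedy policy recommended afterwards, i.e. the warm-start policy for the next task.

```python
import numpy as np

rng = np.random.default_rng(0)

def run_task(eps, T=2000, means=(0.5, 0.6)):
    """Epsilon-greedy on a two-armed Bernoulli bandit.

    Returns the cumulative regret incurred during the task and the
    simple regret of the greedy arm recommended afterwards
    (the warm-start policy for the next task).
    """
    means = np.asarray(means)
    best = means.max()
    counts = np.zeros(2)
    values = np.zeros(2)
    cum_regret = 0.0
    for _ in range(T):
        if rng.random() < eps:
            a = int(rng.integers(2))        # explore uniformly
        else:
            a = int(np.argmax(values))      # exploit current estimates
        reward = float(rng.random() < means[a])
        counts[a] += 1
        values[a] += (reward - values[a]) / counts[a]
        cum_regret += best - means[a]
    recommended = int(np.argmax(values))    # warm-start policy
    simple_regret = best - means[recommended]
    return cum_regret, simple_regret

# Averaged over repetitions: larger eps -> larger CR, smaller SR.
for eps in (0.01, 0.1, 0.3):
    crs, srs = zip(*(run_task(eps) for _ in range(200)))
    print(f"eps={eps:.2f}  CR={np.mean(crs):7.1f}  SR={np.mean(srs):.4f}")
```

In a stationary setting, both quantities can be controlled simultaneously by tuning exploration with $T$; the paper's point is that when the task changes between deployments (non-stationarity), balancing the two forces excess exploration and a CR bound worse than $T^{1/2}$.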
