Model-Based Epistemic Variance of Values for Risk-Aware Policy Optimization (2312.04386v3)

Published 7 Dec 2023 in cs.LG and cs.AI

Abstract: We consider the problem of quantifying uncertainty over expected cumulative rewards in model-based reinforcement learning. In particular, we focus on characterizing the variance over values induced by a distribution over Markov decision processes (MDPs). Previous work upper bounds the posterior variance over values by solving a so-called uncertainty Bellman equation (UBE), but the over-approximation may result in inefficient exploration. We propose a new UBE whose solution converges to the true posterior variance over values and leads to lower regret in tabular exploration problems. We identify challenges in applying the UBE theory beyond tabular problems and propose a suitable approximation. Based on this approximation, we introduce a general-purpose policy optimization algorithm, Q-Uncertainty Soft Actor-Critic (QU-SAC), that can be applied to either risk-seeking or risk-averse policy optimization with minimal changes. Experiments in both online and offline RL demonstrate improved performance compared to other uncertainty estimation methods.
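To make the central quantity concrete: the posterior variance over values described in the abstract can be estimated by brute force by sampling MDPs from a posterior, evaluating the policy exactly on each sample, and taking the empirical variance of the resulting value functions. The NumPy sketch below illustrates this Monte Carlo baseline for tabular MDPs. The function names (`policy_value`, `mc_posterior_value_variance`) and the Dirichlet/Gaussian posterior in the demo are illustrative assumptions, not code or modeling choices from the paper, whose UBE is designed to recover this variance without enumerating samples.

```python
import numpy as np

def policy_value(P, R, policy, gamma=0.99):
    """Exact policy evaluation for one sampled tabular MDP.

    P: (S, A, S) transition probabilities, R: (S, A) rewards,
    policy: (S, A) action probabilities.
    Solves the linear system V = R_pi + gamma * P_pi @ V.
    """
    S = P.shape[0]
    P_pi = np.einsum("sa,sap->sp", policy, P)   # policy-induced transitions, (S, S)
    R_pi = np.einsum("sa,sa->s", policy, R)     # policy-induced rewards, (S,)
    return np.linalg.solve(np.eye(S) - gamma * P_pi, R_pi)

def mc_posterior_value_variance(mdp_samples, policy, gamma=0.99):
    """Monte Carlo estimate of the per-state mean and variance of V^pi
    induced by a distribution over MDPs (the quantity the UBE characterizes)."""
    values = np.stack([policy_value(P, R, policy, gamma)
                       for P, R in mdp_samples])  # (num_samples, S)
    return values.mean(axis=0), values.var(axis=0)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    S, A, N = 5, 2, 200
    # Hypothetical posterior: Dirichlet transitions, Gaussian mean rewards.
    samples = [(rng.dirichlet(np.ones(S), size=(S, A)),
                rng.normal(0.0, 1.0, size=(S, A)))
               for _ in range(N)]
    policy = np.full((S, A), 1.0 / A)  # uniform policy
    mean_v, var_v = mc_posterior_value_variance(samples, policy)
    print("posterior mean of V^pi:", np.round(mean_v, 2))
    print("posterior variance of V^pi:", np.round(var_v, 2))
```

In tabular problems this baseline is exact up to sampling error, which makes it a useful sanity check for UBE-style estimators; the paper's contribution is a UBE whose solution converges to this posterior variance, together with an approximation that scales beyond the tabular setting via QU-SAC.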
