
Bayesian Exploration Networks (2308.13049v4)

Published 24 Aug 2023 in cs.LG

Abstract: Bayesian reinforcement learning (RL) offers a principled and elegant approach for sequential decision making under uncertainty. Most notably, Bayesian agents do not face an exploration/exploitation dilemma, a major pathology of frequentist methods. However, theoretical understanding of model-free approaches is lacking. In this paper, we introduce a novel Bayesian model-free formulation and the first analysis showing that model-free approaches can yield Bayes-optimal policies. We show that all existing model-free approaches make approximations that yield policies that can be arbitrarily Bayes-suboptimal. As a first step towards model-free Bayes optimality, we introduce the Bayesian exploration network (BEN), which uses normalising flows to model both the aleatoric uncertainty (via density estimation) and epistemic uncertainty (via variational inference) in the Bellman operator. In the limit of complete optimisation, BEN learns true Bayes-optimal policies, but, as in variational expectation-maximisation, partial optimisation renders our approach tractable. Empirical results demonstrate that BEN can learn true Bayes-optimal policies in tasks where existing model-free approaches fail.
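To make the mechanism described in the abstract concrete, below is a minimal, illustrative PyTorch sketch (not the authors' implementation) of the density-estimation half of the idea: a conditional normalising flow that models the distribution of Bellman targets y = r + gamma * max_a' Q(s', a') given state-action features, so the learned flow density captures aleatoric uncertainty. Epistemic uncertainty would be layered on top by placing a variational posterior over the flow's parameters, which is omitted here. All class and variable names (BellmanTargetFlow, sa_dim, etc.) are hypothetical.

    # Hypothetical sketch; names are illustrative, not taken from the paper's codebase.
    import torch
    import torch.nn as nn

    class ConditionalAffineFlow(nn.Module):
        """One affine (scale-and-shift) flow layer conditioned on state-action features."""
        def __init__(self, cond_dim, hidden=64):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(cond_dim, hidden), nn.ReLU(),
                nn.Linear(hidden, 2),  # predicts log-scale and shift
            )

        def inverse(self, y, cond):
            # Map a Bellman target y back to the base space; return the log-det term.
            log_s, t = self.net(cond).chunk(2, dim=-1)
            z = (y - t) * torch.exp(-log_s)
            return z, -log_s.squeeze(-1)

    class BellmanTargetFlow(nn.Module):
        """Stack of conditional affine flows giving log p(y | s, a)."""
        def __init__(self, sa_dim, n_layers=3):
            super().__init__()
            self.layers = nn.ModuleList(
                [ConditionalAffineFlow(sa_dim) for _ in range(n_layers)]
            )
            self.base = torch.distributions.Normal(0.0, 1.0)

        def log_prob(self, y, sa):
            z, log_det = y, torch.zeros(y.shape[0])
            for layer in reversed(self.layers):
                z, ld = layer.inverse(z, sa)
                log_det = log_det + ld
            return self.base.log_prob(z.squeeze(-1)) + log_det

    # Toy training step: fit the flow by maximum likelihood on observed Bellman targets.
    flow = BellmanTargetFlow(sa_dim=8)
    opt = torch.optim.Adam(flow.parameters(), lr=1e-3)
    sa = torch.randn(32, 8)   # stand-in state-action features
    y = torch.randn(32, 1)    # stand-in Bellman targets r + gamma * max_a' Q(s', a')
    loss = -flow.log_prob(y, sa).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()

Note that stacked affine layers conditioned only on (s, a) reduce to a conditional Gaussian; a practical version would use more expressive coupling or autoregressive flows, and BEN additionally treats the flow's weights as uncertain via variational inference, which is where the epistemic component enters.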

Authors (4)
  1. Mattie Fellows (7 papers)
  2. Brandon Kaplowitz (2 papers)
  3. Christian Schroeder de Witt (49 papers)
  4. Shimon Whiteson (122 papers)
Citations (3)
