Q-Star Meets Scalable Posterior Sampling: Bridging Theory and Practice via HyperAgent (2402.10228v5)

Published 5 Feb 2024 in cs.LG, cs.AI, and stat.ML

Abstract: We propose HyperAgent, a reinforcement learning (RL) algorithm based on the hypermodel framework for exploration in RL. HyperAgent allows for the efficient incremental approximation of posteriors associated with an optimal action-value function ($Q^\star$) without the need for conjugacy and follows the greedy policies w.r.t. these approximate posterior samples. We demonstrate that HyperAgent offers robust performance in large-scale deep RL benchmarks. It can solve Deep Sea hard exploration problems with episodes that optimally scale with problem size and exhibits significant efficiency gains in the Atari suite. Implementing HyperAgent requires minimal code addition to well-established deep RL frameworks like DQN. We theoretically prove that, under tabular assumptions, HyperAgent achieves logarithmic per-step computational complexity while attaining sublinear regret, matching the best known randomized tabular RL algorithm.
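The abstract describes the core mechanism: a hypermodel maps a random index to a plausible Q-function, and the agent acts greedily with respect to that sampled Q-function, giving Thompson-sampling-style deep exploration without conjugate posteriors. Below is a minimal illustrative sketch of that idea, not the authors' implementation: it assumes a linear-in-features hypermodel and per-episode index resampling, and all names (HyperQ, feature_dim, index_dim) are hypothetical.

```python
# Minimal sketch of hypermodel-based posterior sampling for Q-values.
# Assumption: Q(s, a | z) = phi(s, a) @ (mu + A @ z), where z is a random
# "index" drawn once per episode; different z give different plausible
# Q-functions, i.e. approximate posterior samples.

import numpy as np


class HyperQ:
    def __init__(self, feature_dim: int, num_actions: int, index_dim: int = 8, seed: int = 0):
        rng = np.random.default_rng(seed)
        self.num_actions = num_actions
        self.index_dim = index_dim
        # Mean parameters and the hypermodel tensor mapping index -> parameter perturbation.
        self.mu = np.zeros((num_actions, feature_dim))
        self.A = 0.1 * rng.standard_normal((num_actions, feature_dim, index_dim))
        self.rng = rng

    def sample_index(self) -> np.ndarray:
        # Drawing one index per episode emulates Thompson-sampling-style exploration.
        return self.rng.standard_normal(self.index_dim)

    def q_values(self, features: np.ndarray, z: np.ndarray) -> np.ndarray:
        # Q(s, a | z) for all actions; features has shape (feature_dim,).
        theta = self.mu + self.A @ z            # (num_actions, feature_dim)
        return theta @ features                  # (num_actions,)

    def act_greedy(self, features: np.ndarray, z: np.ndarray) -> int:
        # Greedy policy w.r.t. the sampled Q-function.
        return int(np.argmax(self.q_values(features, z)))


if __name__ == "__main__":
    agent = HyperQ(feature_dim=4, num_actions=3)
    z = agent.sample_index()                     # held fixed for the whole episode
    state_features = np.array([1.0, 0.5, -0.2, 0.3])
    print(agent.act_greedy(state_features, z))
```

In the deep RL setting the paper targets, the linear map above would be replaced by a hypermodel head attached to a DQN-style network and trained incrementally from replayed transitions; this sketch only shows the sample-then-act-greedily loop.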
