Learning Off-policy with Model-based Intrinsic Motivation For Active Online Exploration (2404.00651v1)

Published 31 Mar 2024 in cs.LG and cs.AI

Abstract: Recent advances in deep reinforcement learning (RL) have demonstrated notable progress in sample efficiency, spanning both model-based and model-free paradigms. Although prior works have identified and mitigated specific bottlenecks, the agent's exploration ability remains under-emphasized in sample-efficient RL. This paper investigates how to achieve sample-efficient exploration in continuous control tasks. We introduce an RL algorithm that combines a predictive model with off-policy learning, where an online planner enhanced by a novelty-aware terminal value function is used for sample collection. Leveraging the forward predictive error within a latent state space, we derive an intrinsic reward without incurring parameter overhead. This reward is closely tied to model uncertainty, allowing the agent to effectively close the asymptotic performance gap. Extensive experiments show that our method achieves competitive or superior performance compared to prior works, especially in sparse-reward cases.
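The abstract describes an intrinsic reward derived from the forward predictive error in a latent state space. The sketch below is a minimal illustration of that general idea, not the authors' implementation: it scores each transition by how poorly a latent dynamics model predicts the encoded next observation. The names `intrinsic_reward`, `encoder`, `dynamics`, and the coefficient `beta` are hypothetical stand-ins, assuming the encoder and dynamics model are ordinary PyTorch modules.

```python
import torch
import torch.nn.functional as F


def intrinsic_reward(encoder, dynamics, obs, action, next_obs, beta=0.1):
    """Illustrative sketch: curiosity bonus from latent forward-prediction error.

    encoder  -- hypothetical module mapping observations to latent states z
    dynamics -- hypothetical module predicting the next latent state from (z, action)
    beta     -- illustrative scaling coefficient for the bonus
    """
    with torch.no_grad():
        z = encoder(obs)                   # encode the current observation
        z_next_pred = dynamics(z, action)  # one-step prediction in latent space
        z_next = encoder(next_obs)         # encode the observed next observation
        # Per-sample prediction error in latent space serves as the novelty signal.
        error = F.mse_loss(z_next_pred, z_next, reduction="none").mean(dim=-1)
    return beta * error                    # bonus added to the task reward
```

Because such a bonus reuses the predictive model already needed for planning, it adds no extra parameters, which is consistent with the abstract's claim of deriving the reward without parameter overhead; how the bonus feeds into the paper's novelty-aware terminal value function is specific to the method and not shown here.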

