
COPlanner: Plan to Roll Out Conservatively but to Explore Optimistically for Model-Based RL (2310.07220v2)

Published 11 Oct 2023 in cs.LG

Abstract: Dyna-style model-based reinforcement learning contains two phases: model rollouts to generate samples for policy learning, and real-environment exploration with the current policy to gather data for dynamics model learning. However, because real-world environments are complex, the learned dynamics model inevitably contains prediction error, which can mislead policy learning and result in sub-optimal solutions. In this paper, we propose $\texttt{COPlanner}$, a planning-driven framework for model-based methods that addresses the problem of an inaccurately learned dynamics model through conservative model rollouts and optimistic environment exploration. $\texttt{COPlanner}$ leverages an uncertainty-aware policy-guided model predictive control (UP-MPC) component to plan multi-step uncertainty estimates. The estimated uncertainty then serves as a penalty during model rollouts and as a bonus during real-environment exploration when choosing actions. Consequently, $\texttt{COPlanner}$ avoids model-uncertain regions through conservative model rollouts, alleviating the influence of model error, while actively reducing that error by exploring high-reward uncertain regions through optimistic real-environment exploration. $\texttt{COPlanner}$ is a plug-and-play framework that can be applied to any Dyna-style model-based method. Experimental results on a series of proprioceptive and visual continuous control tasks demonstrate that both the sample efficiency and the asymptotic performance of strong model-based methods improve significantly when combined with $\texttt{COPlanner}$.
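
The abstract's action-selection idea (plan with the policy, estimate multi-step uncertainty, then subtract it as a penalty for model rollouts or add it as a bonus for real-environment exploration) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration, not the paper's actual UP-MPC implementation: the function names, the use of ensemble disagreement as the uncertainty proxy, and the simple candidate-ranking loop are all hypothetical.

```python
import numpy as np

def up_mpc_select_action(policy, ensemble, reward_fn, state,
                         horizon=3, n_candidates=8, alpha=1.0, mode="rollout"):
    """Hypothetical sketch of uncertainty-aware policy-guided MPC (UP-MPC).

    For each candidate first action sampled from the policy, the action
    sequence is rolled forward in every member of a learned dynamics-model
    ensemble; predicted reward is accumulated and multi-step uncertainty is
    approximated by the disagreement (variance) across ensemble predictions.
    In "rollout" mode uncertainty is a penalty (conservative model rollouts);
    in "explore" mode it is a bonus (optimistic real-environment exploration).
    """
    best_action, best_score = None, -np.inf
    for _ in range(n_candidates):
        first_action = policy(state)        # candidate drawn from the current policy
        states = [state for _ in ensemble]  # one imagined trajectory per ensemble member
        ret, uncertainty = 0.0, 0.0
        action = first_action
        for _ in range(horizon):
            next_states = [model(s, action) for model, s in zip(ensemble, states)]
            # Ensemble disagreement as a proxy for multi-step model uncertainty.
            uncertainty += np.mean(np.var(np.stack(next_states), axis=0))
            ret += np.mean([reward_fn(s, action) for s in states])
            states = next_states
            # Policy-guided planning: the policy proposes the next action in the plan.
            action = policy(np.mean(np.stack(states), axis=0))
        sign = -1.0 if mode == "rollout" else +1.0  # penalty vs. bonus
        score = ret + sign * alpha * uncertainty
        if score > best_score:
            best_action, best_score = first_action, score
    return best_action
```

Under these assumptions, the same routine would be called with `mode="rollout"` when generating imagined samples from the learned model and with `mode="explore"` when acting in the real environment, which is how a single planner can be both conservative and optimistic depending on where the chosen action is executed.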
