COPlanner: Plan to Roll Out Conservatively but to Explore Optimistically for Model-Based RL (2310.07220v2)
Abstract: Dyna-style model-based reinforcement learning alternates between two phases: model rollouts to generate samples for policy learning, and real-environment exploration with the current policy to collect data for dynamics model learning. However, because real-world environments are complex, the learned dynamics model is inevitably imperfect, and its prediction errors can mislead policy learning and result in sub-optimal solutions. In this paper, we propose $\texttt{COPlanner}$, a planning-driven framework for model-based methods that addresses the inaccurately learned dynamics model problem with conservative model rollouts and optimistic environment exploration. $\texttt{COPlanner}$ leverages an uncertainty-aware policy-guided model predictive control (UP-MPC) component to plan multi-step uncertainty estimates. The estimated uncertainty then serves as a penalty during model rollouts and as a bonus during real-environment exploration, respectively, to choose actions. Consequently, $\texttt{COPlanner}$ avoids regions where the model is uncertain through conservative model rollouts, thereby alleviating the influence of model error, while simultaneously exploring high-reward uncertain regions through optimistic real-environment exploration to actively reduce model error. $\texttt{COPlanner}$ is a plug-and-play framework that can be applied to any Dyna-style model-based method. Experimental results on a series of proprioceptive and visual continuous control tasks demonstrate that both the sample efficiency and the asymptotic performance of strong model-based methods improve significantly when combined with $\texttt{COPlanner}$.
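To make the mechanism concrete, the following is a minimal sketch of the kind of uncertainty-aware, policy-guided planning step the abstract describes, assuming an ensemble of learned one-step dynamics models whose disagreement is used as the multi-step uncertainty estimate. All names and signatures here (`up_mpc_action`, `policy`, `ensemble`, `reward_fn`, `uncertainty_weight`, `optimistic`) are illustrative assumptions rather than the paper's implementation; the same uncertainty term is subtracted from the planned return when selecting actions for conservative model rollouts and added to it when selecting actions for optimistic real-environment exploration.

```python
# Hypothetical sketch of a UP-MPC-style action selection step, not the
# authors' code. Uncertainty is approximated as ensemble disagreement.
import numpy as np


def up_mpc_action(state, policy, ensemble, reward_fn, horizon=5,
                  n_candidates=64, noise_std=0.1, uncertainty_weight=1.0,
                  optimistic=False, rng=None):
    """Score policy-guided candidate plans by an uncertainty-adjusted return.

    optimistic=False -> uncertainty is a penalty (conservative model rollouts)
    optimistic=True  -> uncertainty is a bonus (optimistic real-env exploration)
    """
    rng = rng or np.random.default_rng()
    sign = 1.0 if optimistic else -1.0
    best_score, best_action = -np.inf, None

    for _ in range(n_candidates):
        # One imagined state per ensemble member, all starting from `state`.
        states = np.stack([np.asarray(state, dtype=float) for _ in ensemble])
        total_reward, total_uncertainty = 0.0, 0.0
        first_action = None

        for t in range(horizon):
            # Policy-guided candidate action: policy output plus Gaussian noise.
            mean_state = states.mean(axis=0)
            policy_action = policy(mean_state)
            action = policy_action + noise_std * rng.standard_normal(
                np.shape(policy_action))
            if t == 0:
                first_action = action

            # Each ensemble member predicts its own next state.
            states = np.stack([model(s, action)
                               for s, model in zip(states, ensemble)])

            # Multi-step uncertainty: disagreement (std) across the ensemble.
            total_uncertainty += float(states.std(axis=0).mean())
            total_reward += float(reward_fn(mean_state, action))

        score = total_reward + sign * uncertainty_weight * total_uncertainty
        if score > best_score:
            best_score, best_action = score, first_action

    # Only the first action of the best-scoring plan is executed (MPC style).
    return best_action
```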