Simplifying Model-based RL: Learning Representations, Latent-space Models, and Policies with One Objective (2209.08466v3)
Abstract: While reinforcement learning (RL) methods that learn an internal model of the environment have the potential to be more sample efficient than their model-free counterparts, learning to model raw observations from high-dimensional sensors can be challenging. Prior work has addressed this challenge by learning low-dimensional representations of observations through auxiliary objectives, such as reconstruction or value prediction. However, the alignment between these auxiliary objectives and the RL objective is often unclear. In this work, we propose a single objective that jointly optimizes a latent-space model and a policy to achieve high returns while remaining self-consistent. This objective is a lower bound on expected returns. Unlike prior bounds for model-based RL, which concern policy exploration or model guarantees, our bound applies directly to the overall RL objective. We demonstrate that the resulting algorithm matches or improves upon the sample efficiency of the best prior model-based and model-free RL methods. While sample-efficient methods are typically computationally demanding, our method attains the performance of SAC in about 50% less wall-clock time.
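The abstract's central idea, optimizing the representation, the latent-space model, and the policy with one shared objective instead of separate auxiliary losses, can be illustrated with a toy sketch. This is purely illustrative and is not the paper's algorithm: the paper's objective is a lower bound on expected returns, whereas `objective` below is a hypothetical stand-in combining a predicted-return term with a self-consistency penalty.

```python
# Illustrative sketch (NOT the paper's method): one scalar objective J(m, p)
# is ascended jointly by a toy "model" parameter m and "policy" parameter p,
# rather than training m with a separate auxiliary loss (e.g. reconstruction).

def objective(m, p, m_true=1.5, p_opt=2.0):
    # Hypothetical stand-in for a lower bound on returns:
    # a return term for the policy, penalized when the model is inconsistent.
    predicted_return = -(p - p_opt) ** 2 * (1.0 + 0.1 * (m - m_true) ** 2)
    self_consistency_penalty = (m - m_true) ** 2
    return predicted_return - self_consistency_penalty

def grads(f, m, p, eps=1e-5):
    # Numerical gradient of the SINGLE objective w.r.t. both parameter groups.
    dm = (f(m + eps, p) - f(m - eps, p)) / (2 * eps)
    dp = (f(m, p + eps) - f(m, p - eps)) / (2 * eps)
    return dm, dp

m, p = 0.0, 0.0
before = objective(m, p)
for _ in range(200):
    dm, dp = grads(objective, m, p)
    m += 0.05 * dm  # the model ascends the same objective ...
    p += 0.05 * dp  # ... as the policy: no separate auxiliary loss
after = objective(m, p)
```

Because both parameter groups follow gradients of the same scalar, the model is only improved in ways that also improve the (toy) return bound, which is the qualitative point the abstract makes about alignment between model learning and the RL objective.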
- Raj Ghugare
- Homanga Bharadhwaj
- Benjamin Eysenbach
- Sergey Levine
- Ruslan Salakhutdinov