Efficient Preference-Based Reinforcement Learning Using Learned Dynamics Models (2301.04741v2)
Abstract: Preference-based reinforcement learning (PbRL) can enable robots to learn to perform tasks based on an individual's preferences without requiring a hand-crafted reward function. However, existing approaches either assume access to a high-fidelity simulator or analytic model, or take a model-free approach that requires extensive, possibly unsafe online environment interactions. In this paper, we study the benefits and challenges of using a learned dynamics model when performing PbRL. In particular, we provide evidence that a learned dynamics model offers the following benefits when performing PbRL: (1) preference elicitation and policy optimization require significantly fewer environment interactions than model-free PbRL, (2) diverse preference queries can be synthesized safely and efficiently as a byproduct of standard model-based RL, and (3) reward pre-training based on suboptimal demonstrations can be performed without any environment interaction. Our paper provides empirical evidence that learned dynamics models enable robots to learn customized policies based on user preferences in ways that are safer and more sample-efficient than prior preference learning approaches. Supplementary materials and code are available at https://sites.google.com/berkeley.edu/mop-rl.
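To make the approach described in the abstract concrete, the sketch below illustrates the two ingredients it emphasizes: synthesizing candidate trajectory segments by rolling action sequences through a learned dynamics model (so preference queries require no real environment steps), and fitting a reward network to pairwise human preferences with the standard Bradley-Terry cross-entropy loss commonly used in PbRL. This is a minimal illustrative sketch under those assumptions, not the authors' implementation; the class and function names (`DynamicsModel`, `RewardModel`, `rollout`, `preference_loss`) and the PyTorch framing are hypothetical and not taken from the paper.

```python
# Illustrative sketch only (not the paper's code): reward learning from
# pairwise preferences over trajectory segments generated by a learned
# dynamics model rather than by real environment interaction.
import torch
import torch.nn as nn


class DynamicsModel(nn.Module):
    """Predicts the next state from (state, action); trained on logged transitions."""
    def __init__(self, state_dim, action_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, state_dim),
        )

    def forward(self, state, action):
        return self.net(torch.cat([state, action], dim=-1))


class RewardModel(nn.Module):
    """Maps (state, action) to a scalar reward; trained from preference labels."""
    def __init__(self, state_dim, action_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, state, action):
        return self.net(torch.cat([state, action], dim=-1)).squeeze(-1)


def rollout(dynamics, start_state, actions):
    """Roll an action sequence through the learned model; no real env steps."""
    states, s = [], start_state
    for a in actions:
        states.append(s)
        s = dynamics(s, a)
    return torch.stack(states), torch.stack(list(actions))


def preference_loss(reward_model, seg_a, seg_b, label):
    """Bradley-Terry cross-entropy loss; label = 1 if segment A is preferred."""
    ret_a = reward_model(*seg_a).sum()   # predicted return of segment A
    ret_b = reward_model(*seg_b).sum()   # predicted return of segment B
    logits = torch.stack([ret_a, ret_b]).unsqueeze(0)
    target = torch.tensor([0 if label == 1 else 1])
    return nn.functional.cross_entropy(logits, target)
```

In the full method, the learned reward would then drive planning in the same learned model (e.g., sampling-based MPC), so that, as the abstract indicates, both preference elicitation and policy optimization reuse model rollouts instead of additional real-world interaction.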
Authors: Yi Liu, Gaurav Datta, Ellen Novoseller, Daniel S. Brown