
MOTO: Offline Pre-training to Online Fine-tuning for Model-based Robot Learning (2401.03306v1)

Published 6 Jan 2024 in cs.LG, cs.AI, and cs.RO

Abstract: We study the problem of offline pre-training and online fine-tuning for reinforcement learning from high-dimensional observations in the context of realistic robot tasks. Recent offline model-free approaches successfully use online fine-tuning to either improve the performance of the agent over the data collection policy or adapt to novel tasks. At the same time, model-based RL algorithms have achieved significant progress in sample efficiency and the complexity of the tasks they can solve, yet remain under-utilized in the fine-tuning setting. In this work, we argue that existing model-based offline RL methods are not suitable for offline-to-online fine-tuning in high-dimensional domains due to issues with distribution shifts, off-dynamics data, and non-stationary rewards. We propose an on-policy model-based method that can efficiently reuse prior data through model-based value expansion and policy regularization, while preventing model exploitation by controlling epistemic uncertainty. We find that our approach successfully solves tasks from the MetaWorld benchmark, as well as the Franka Kitchen robot manipulation environment completely from images. To the best of our knowledge, MOTO is the first method to solve this environment from pixels.
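The abstract's two core mechanisms, model-based value expansion and an epistemic-uncertainty penalty, can be made concrete with a short sketch. Below is a minimal, hypothetical Python illustration of an H-step value-expansion target computed under an ensemble of learned dynamics models, with ensemble disagreement standing in for epistemic uncertainty; all names (`policy`, `models`, `reward_fn`, `value_fn`) and the penalty form are illustrative assumptions, not the paper's actual implementation.

```python
# Hypothetical sketch of H-step model-based value expansion (MVE) with an
# ensemble-disagreement penalty as a proxy for epistemic uncertainty.
# Not the paper's code; names and the penalty form are assumptions.
import numpy as np

def mve_target(s0, policy, models, reward_fn, value_fn,
               horizon=5, gamma=0.99, penalty_coef=1.0):
    """H-step value-expansion target starting from state s0.

    models    : list of learned dynamics functions f(s, a) -> s'
    policy    : maps a state to an action
    reward_fn : learned reward model r(s, a) -> float
    value_fn  : terminal value estimate V(s) -> float
    """
    s, target, discount = s0, 0.0, 1.0
    for _ in range(horizon):
        a = policy(s)
        # Each ensemble member predicts the next state; their spread is a
        # crude signal of epistemic uncertainty about the dynamics.
        preds = np.stack([f(s, a) for f in models])
        disagreement = preds.std(axis=0).mean()
        # Penalizing reward by disagreement discourages the policy from
        # exploiting regions where the learned model is unreliable.
        target += discount * (reward_fn(s, a) - penalty_coef * disagreement)
        discount *= gamma
        s = preds.mean(axis=0)  # roll forward with the ensemble mean
    return target + discount * value_fn(s)
```

This sketch omits the policy-regularization term the abstract also mentions; in an actor-critic setup that would typically appear as an extra penalty keeping the policy close to the behavior data, and in MOTO's pixel-based setting the rollouts themselves happen inside a learned latent world model rather than raw observation space.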

Authors (6)
  1. Rafael Rafailov
  2. Kyle Hatch
  3. Victor Kolev
  4. John D. Martin
  5. Mariano Phielipp
  6. Chelsea Finn
Citations (7)
