
MimicPlay: Long-Horizon Imitation Learning by Watching Human Play (2302.12422v2)

Published 24 Feb 2023 in cs.RO

Abstract: Imitation learning from human demonstrations is a promising paradigm for teaching robots manipulation skills in the real world. However, learning complex long-horizon tasks often requires an unattainable amount of demonstrations. To reduce the high data requirement, we resort to human play data - video sequences of people freely interacting with the environment using their hands. Even with different morphologies, we hypothesize that human play data contain rich and salient information about physical interactions that can readily facilitate robot policy learning. Motivated by this, we introduce a hierarchical learning framework named MimicPlay that learns latent plans from human play data to guide low-level visuomotor control trained on a small number of teleoperated demonstrations. With systematic evaluations of 14 long-horizon manipulation tasks in the real world, we show that MimicPlay outperforms state-of-the-art imitation learning methods in task success rate, generalization ability, and robustness to disturbances. Code and videos are available at https://mimic-play.github.io

The paper "MimicPlay: Long-Horizon Imitation Learning by Watching Human Play" addresses the challenge of teaching robots long-horizon manipulation tasks through imitation learning (IL). Traditional IL methods typically require a large number of robot demonstrations to learn a task effectively, particularly for complex, multi-stage operations. The human effort and time required for such data collection are prohibitive, limiting the scalability of these approaches. MimicPlay alleviates this constraint by leveraging human play data.

Human play data, which consists of video sequences capturing human interactions with the environment, is positioned as a resource-efficient alternative for learning high-level task plans. Even though humans and robots have different physical embodiments, the paper posits that human interactions can encapsulate valuable information about task dynamics that robots can utilize.

MimicPlay introduces a two-tier learning framework: a high-level planner and a low-level visuomotor controller. Human play data informs the high-level planner, enabling it to generate latent task plans. These plans encapsulate sequences of human actions, which can guide a low-level controller trained on a small dataset of robot demonstrations. Through a sequence of systematic evaluations involving 14 distinct manipulation tasks across multiple environments, the paper illustrates that MimicPlay significantly surpasses existing IL methods in terms of sample efficiency, task success rate, generalization capabilities, and resistance to disturbances.
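The two-tier structure described above can be illustrated with a minimal sketch. The class and function names, feature dimensions, and random-projection "networks" here are all hypothetical stand-ins for illustration; the paper's actual models are learned neural networks (a GMM-based latent planner and a transformer-based visuomotor controller):

```python
import numpy as np

class LatentPlanner:
    """High-level planner (trained on human play video in the paper):
    maps current-observation and goal features to a latent plan vector."""
    def __init__(self, latent_dim=32, seed=0):
        rng = np.random.default_rng(seed)
        # Stand-in for learned weights; purely illustrative.
        self.W = rng.standard_normal((latent_dim, 64))

    def plan(self, obs_feat, goal_feat):
        x = np.concatenate([obs_feat, goal_feat])  # (64,)
        return np.tanh(self.W @ x)                 # latent plan, (32,)

class VisuomotorController:
    """Low-level controller (trained on a small set of teleoperated
    demonstrations in the paper); conditions on the latent plan."""
    def __init__(self, latent_dim=32, action_dim=7, seed=1):
        rng = np.random.default_rng(seed)
        self.A = rng.standard_normal((action_dim, latent_dim + 32))

    def act(self, obs_feat, latent_plan):
        x = np.concatenate([obs_feat[:32], latent_plan])
        # e.g. a 6-DoF end-effector delta plus a gripper command
        return np.tanh(self.A @ x)

def rollout(planner, controller, obs_feat, goal_feat,
            horizon=5, replan_every=2):
    """Hierarchical control loop: replan at a coarse interval, act
    at every step under the current latent plan."""
    actions = []
    plan = planner.plan(obs_feat, goal_feat)
    for t in range(horizon):
        if t > 0 and t % replan_every == 0:
            plan = planner.plan(obs_feat, goal_feat)  # periodic replanning
        actions.append(controller.act(obs_feat, plan))
    return actions
```

The key design point the sketch captures is the asymmetry of the two tiers: the planner can be trained on cheap, embodiment-mismatched human video because it only has to produce a coarse latent plan, while the robot-specific controller needs far less (teleoperated) data because the plan has already narrowed down what to do.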

The results are quantitatively compelling; MimicPlay dramatically improves sample efficiency. Unlike prior methods such as C-BeT and LMP, which require hours of teleoperated robot play data, MimicPlay achieves superior outcomes from just 10 minutes of human play data, paired with a small number of teleoperated demonstrations for the low-level controller. In kitchen and desk environments specifically, MimicPlay outperformed other models by notable margins in both task completion and generalization to novel subgoal compositions.

The hierarchical framework of MimicPlay decomposes complex tasks into a series of smaller, manageable steps, in the spirit of plan-and-control architectures that prior research has found promising. The latent plan generated by the high-level planner is crucial for guiding the low-level controller through fine-grained actions such as grasping or object placement. The use of human play videos as prompts is particularly innovative: an interpretable human demonstration video can directly specify a complex robotic manipulation task.
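The video-prompting idea can be sketched as follows. This is a hypothetical illustration, not the paper's implementation: `encode_frame` stands in for a learned visual encoder, and the fixed-stride subgoal selection is an assumption made for simplicity:

```python
import numpy as np

def encode_frame(frame, feat_dim=32, seed=42):
    """Stand-in for a learned visual encoder: a fixed random projection
    mapping a frame to a goal feature vector (illustrative only)."""
    rng = np.random.default_rng(seed)
    W = rng.standard_normal((feat_dim, frame.size))
    return np.tanh(W @ frame.ravel())

def subgoals_from_prompt(video, stride=10):
    """Slice a human prompt video into subgoal frames at a fixed stride
    and encode each; the high-level planner would then be conditioned
    on these goal features in sequence."""
    return [encode_frame(f) for f in video[::stride]]

# Toy 50-frame prompt video of 8x8 grayscale frames.
video = np.zeros((50, 8, 8))
goals = subgoals_from_prompt(video)
```

Under this view, "prompting" the robot with a human video reduces to producing a sequence of goal features that the planner consumes one at a time, which is what lets a single human demonstration specify a long-horizon task.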

Looking ahead, MimicPlay's framework could be extended to accommodate more diverse and scalable human interaction datasets, including internet-scale video sources. Additionally, expanding the variety of tasks beyond static environments, such as desk spaces, to dynamic and mobile settings could further enhance the practical applicability and robustness of such systems.

MimicPlay's contribution to the domain of imitation learning is strategically significant, providing a pathway to revolutionizing how robots learn complex, multifaceted tasks. By harnessing the efficiency of human play, the presented framework offers a roadmap that can significantly reduce the costs associated with training state-of-the-art robotic systems in diverse task settings.

References (64)
  1. D. A. Pomerleau. Alvinn: An autonomous land vehicle in a neural network. Advances in neural information processing systems, 1, 1988.
  2. Deep imitation learning for complex manipulation tasks from virtual reality teleoperation. In 2018 IEEE International Conference on Robotics and Automation (ICRA), pages 5628–5635. IEEE, 2018.
  3. Learning to generalize across long-horizon tasks from human demonstrations. arXiv preprint arXiv:2003.06085, 2020.
  4. Taco: Learning task decomposition via temporal alignment for control. In International Conference on Machine Learning, pages 4654–4663. PMLR, 2018.
  5. Learning latent plans from play. In Conference on robot learning, pages 1113–1132. PMLR, 2020.
  6. From play to policy: Conditional behavior generation from uncurated robot data. arXiv preprint arXiv:2210.10047, 2022.
  7. Latent plans for task-agnostic offline reinforcement learning. arXiv preprint arXiv:2209.08959, 2022.
  8. Learning and reproduction of gestures by imitation. IEEE Robotics & Automation Magazine, 17(2):44–54, 2010.
  9. Movement imitation with nonlinear dynamical systems in humanoid robots. In Proceedings 2002 IEEE International Conference on Robotics and Automation (Cat. No.02CH37292), volume 2, pages 1398–1403 vol.2, 2002. doi:10.1109/ROBOT.2002.1014739.
  10. S. Schaal. Is imitation learning the route to humanoid robots? Trends in cognitive sciences, 3(6):233–242, 1999.
  11. J. Kober and J. Peters. Imitation and reinforcement learning. IEEE Robotics & Automation Magazine, 17(2):55–62, 2010.
  12. P. Englert and M. Toussaint. Learning manipulation skills from a single demonstration. The International Journal of Robotics Research, 37(1):137–154, 2018.
  13. One-shot visual imitation learning via meta-learning. In Conference on robot learning, pages 357–368. PMLR, 2017.
  14. Robot programming by demonstration. In Springer handbook of robotics, pages 1371–1394. Springer, 2008.
  15. A survey of robot learning from demonstration. Robotics and autonomous systems, 57(5):469–483, 2009.
  16. S. Schaal. Dynamic movement primitives-a framework for motor control in humans and humanoid robotics. In Adaptive motion of animals and machines, pages 261–280. Springer, 2006.
  17. J. Kober and J. Peters. Learning motor primitives for robotics. In 2009 IEEE International Conference on Robotics and Automation, pages 2112–2118. IEEE, 2009.
  18. Probabilistic movement primitives. In C. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K. Weinberger, editors, Advances in Neural Information Processing Systems, volume 26. Curran Associates, Inc., 2013. URL https://proceedings.neurips.cc/paper/2013/file/e53a0a2978c28872a4505bdb51db06dc-Paper.pdf.
  19. Using probabilistic movement primitives in robotics. Autonomous Robots, 42(3):529–551, 2018.
  20. What matters in learning from offline human demonstrations for robot manipulation. In 5th Annual Conference on Robot Learning, 2021. URL https://openreview.net/forum?id=JrsfBJtDFdI.
  21. Self-supervised correspondence in visuomotor policy learning. IEEE Robotics and Automation Letters, 5(2):492–499, 2019.
  22. VIOLA: Object-centric imitation learning for vision-based robot manipulation. In 6th Annual Conference on Robot Learning, 2022. URL https://openreview.net/forum?id=L8hCfhPbFho.
  23. Generalization through hand-eye coordination: An action space for learning spatially-invariant visuomotor control. In 2021 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 8913–8920. IEEE, 2021.
  24. Implicit behavioral cloning. Conference on Robot Learning (CoRL), 2021.
  25. Rt-1: Robotics transformer for real-world control at scale. In arXiv preprint arXiv:2212.06817, 2022.
  26. Do as i can, not as i say: Grounding language in robotic affordances. In 6th Annual Conference on Robot Learning, 2022.
  27. Iris: Implicit reinforcement without interaction at scale for learning control from offline robot manipulation data. In 2020 IEEE International Conference on Robotics and Automation (ICRA), pages 4414–4420. IEEE, 2020.
  28. Neural task programming: Learning to generalize across hierarchical tasks. In 2018 IEEE International Conference on Robotics and Automation (ICRA), pages 3795–3802. IEEE, 2018.
  29. Learning by watching: Physical imitation of manipulation skills from human videos. In 2021 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 7827–7834. IEEE, 2021.
  30. Model-based inverse reinforcement learning from visual demonstrations. In Conference on Robot Learning, pages 1930–1942. PMLR, 2021.
  31. Xirl: Cross-embodiment inverse reinforcement learning. In Conference on Robot Learning, pages 537–546. PMLR, 2022.
  32. Concept2robot: Learning manipulation concepts from instructions and human demonstrations. The International Journal of Robotics Research, 40(12-14):1419–1434, 2021.
  33. Learning generalizable robotic reward functions from "in-the-wild" human videos. Robotics: Science and Systems (RSS), 2021.
  34. Third-person visual imitation learning via decoupled hierarchical controller. Advances in Neural Information Processing Systems, 32, 2019.
  35. Avid: Learning multi-stage tasks via pixel-level translation of human videos. arXiv preprint arXiv:1912.04443, 2019.
  36. Learning predictive models from observation and interaction. In European Conference on Computer Vision, pages 708–725. Springer, 2020.
  37. Perceptual values from observation. arXiv preprint arXiv:1905.07861, 2019.
  38. Reinforcement learning with videos: Combining offline observations with interaction. In J. Kober, F. Ramos, and C. Tomlin, editors, Proceedings of the 2020 Conference on Robot Learning, volume 155 of Proceedings of Machine Learning Research, pages 339–354. PMLR, 16–18 Nov 2021. URL https://proceedings.mlr.press/v155/schmeckpeper21a.html.
  39. Videodex: Learning dexterity from internet videos. CoRL, 2022.
  40. R3m: A universal visual representation for robot manipulation. In 6th Annual Conference on Robot Learning, 2022. URL https://openreview.net/forum?id=tGbpgz6yOrI.
  41. Masked visual pre-training for motor control. arXiv preprint arXiv:2203.06173, 2022.
  42. Ego4d: Around the world in 3,000 hours of egocentric video. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18995–19012, 2022.
  43. On pre-training for visuo-motor control: Revisiting a learning-from-scratch baseline. arXiv preprint arXiv:2212.05749, 2022.
  44. Imitation from observation: Learning to imitate behaviors from raw video via context translation. In 2018 IEEE International Conference on Robotics and Automation (ICRA), pages 1118–1125. IEEE, 2018.
  45. Graph inverse reinforcement learning from diverse videos. Conference on Robot Learning (CoRL), 2022.
  46. Graph-structured visual imitation. In Conference on Robot Learning, pages 979–989. PMLR, 2020.
  47. Human-to-robot imitation in the wild. arXiv preprint arXiv:2207.09450, 2022.
  48. Latent plans for task agnostic offline reinforcement learning. In Proceedings of the 6th Conference on Robot Learning (CoRL), 2022.
  49. Understanding human hands in contact at internet scale. In CVPR, 2020.
  50. C. M. Bishop. Mixture density networks. 1994.
  51. Attention is all you need. Advances in neural information processing systems, 30, 2017.
  52. Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901, 2020.
  53. Behavior transformers: Cloning k modes with one stone. arXiv preprint arXiv:2206.11251, 2022.
  54. Tclr: Temporal contrastive learning for video representation. Computer Vision and Image Understanding, 219:103406, 2022.
  55. Temporal cycle-consistency learning. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 1801–1810, 2019.
  56. Roboturk: A crowdsourcing platform for robotic skill learning through imitation. In Conference on Robot Learning, pages 879–893. PMLR, 2018.
  57. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
  58. O. Khatib. A unified approach for motion and force control of robot manipulators: The operational space formulation. IEEE Journal on Robotics and Automation, 3(1):43–53, 1987. doi:10.1109/JRA.1987.1087068.
  59. A. Graves. Long short-term memory. Supervised sequence labelling with recurrent neural networks, pages 37–45, 2012.
  60. Libero: Benchmarking knowledge transfer for lifelong robot learning, 2023.
  61. robosuite: A modular simulation framework and benchmark for robot learning. arXiv preprint arXiv:2009.12293, 2020.
  62. Mujoco: A physics engine for model-based control. In 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems, pages 5026–5033. IEEE, 2012. doi:10.1109/IROS.2012.6386109.
  63. Behavior-1k: A benchmark for embodied ai with 1,000 everyday activities and realistic simulation. In Conference on Robot Learning, pages 80–93. PMLR, 2023.
  64. L. van der Maaten and G. Hinton. Visualizing data using t-SNE. Journal of Machine Learning Research, 9:2579–2605, 2008. URL http://www.jmlr.org/papers/v9/vandermaaten08a.html.
Authors (8)
  1. Chen Wang (599 papers)
  2. Linxi Fan (33 papers)
  3. Jiankai Sun (53 papers)
  4. Ruohan Zhang (34 papers)
  5. Li Fei-Fei (199 papers)
  6. Danfei Xu (59 papers)
  7. Yuke Zhu (134 papers)
  8. Anima Anandkumar (236 papers)
Citations (130)