MimicPlay: Long-Horizon Imitation Learning by Watching Human Play
The paper "MimicPlay: Long-Horizon Imitation Learning by Watching Human Play" addresses the challenge of teaching robots long-horizon manipulation tasks through imitation learning (IL). Traditional IL methodologies often depend on a large number of robot demonstrations to effectively learn tasks, particularly when dealing with complex operations. The considerable human and time resources required for such data collection are prohibitive, limiting the scalability of such approaches. MimicPlay innovatively leverages human play data to alleviate these constraints.
Human play data, consisting of video sequences of humans freely interacting with the environment, is positioned as a resource-efficient source for learning high-level task plans. Although humans and robots have different physical embodiments, the paper posits that human interactions still capture information about task structure and dynamics that robots can exploit.
MimicPlay introduces a two-tier learning framework: a high-level planner and a low-level visuomotor controller. Human play data trains the high-level planner to generate latent task plans. These plans, distilled from sequences of human actions, guide a low-level controller trained on a small dataset of robot demonstrations. Through systematic evaluations on 14 distinct manipulation tasks across multiple environments, the paper shows that MimicPlay significantly surpasses existing IL methods in sample efficiency, task success rate, generalization, and robustness to disturbances. A schematic sketch of the two-tier structure follows.
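To make the division of labor concrete, below is a minimal PyTorch sketch of the two-tier idea. The module names, MLP encoders, and feature dimensions are illustrative assumptions, not the paper's actual architecture: the intent is only to show a planner trained on human play producing a latent plan that conditions a controller trained on robot demonstrations.

```python
import torch
import torch.nn as nn

class HighLevelPlanner(nn.Module):
    """Trained on human play video: maps current visual features and a
    goal frame's features to a latent plan for the next task segment.
    (Hypothetical module; dimensions and layers are assumptions.)"""
    def __init__(self, obs_dim=512, plan_dim=128):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(obs_dim * 2, 256), nn.ReLU(), nn.Linear(256, plan_dim)
        )

    def forward(self, obs_feat, goal_feat):
        # Latent plan conditioned on the current observation and the goal.
        return self.encoder(torch.cat([obs_feat, goal_feat], dim=-1))

class LowLevelController(nn.Module):
    """Trained on a small set of robot demonstrations: maps the latent
    plan plus proprioceptive state to a motor command."""
    def __init__(self, plan_dim=128, state_dim=16, act_dim=7):
        super().__init__()
        self.policy = nn.Sequential(
            nn.Linear(plan_dim + state_dim, 256), nn.ReLU(),
            nn.Linear(256, act_dim),
        )

    def forward(self, latent_plan, robot_state):
        return self.policy(torch.cat([latent_plan, robot_state], dim=-1))
```

The key design point is the asymmetry in data requirements: the planner consumes cheap, plentiful human play video, while only the smaller controller needs costly robot demonstrations.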
The results are quantitatively compelling. Unlike prior methods such as C-BeT or LMP, which require hours of teleoperated robot play data, MimicPlay achieves superior outcomes with just 10 minutes of human play data. In kitchen and desk environments specifically, MimicPlay outperformed baselines by notable margins in both task completion and generalization to novel subgoal compositions.
The hierarchical framework of MimicPlay decomposes complex tasks into a series of smaller, manageable actions, in the spirit of plan-and-control architectures that prior research has found promising. The latent plan generated by the high-level planner is crucial for guiding the low-level controller through fine-grained behaviors such as grasping or object placement. The use of human play videos as prompts is particularly innovative, allowing a human demonstration video to specify a complex robotic manipulation task at test time; a usage sketch follows.
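The hypothetical rollout below reuses the modules from the earlier sketch to illustrate video prompting. The dummy feature tensors and the 10:1 replan-to-control ratio are assumptions chosen for illustration, not values taken from the paper.

```python
import torch

# Reuses the hypothetical HighLevelPlanner / LowLevelController above.
planner, controller = HighLevelPlanner(), LowLevelController()
obs_feat = torch.randn(1, 512)    # features of the current camera frame
goal_feat = torch.randn(1, 512)   # features of a frame from the human prompt video
robot_state = torch.randn(1, 16)  # proprioceptive robot state

with torch.no_grad():
    for t in range(200):
        if t % 10 == 0:  # replan at a coarser rate than control (assumed ratio)
            latent_plan = planner(obs_feat, goal_feat)
        action = controller(latent_plan, robot_state)
        # `action` would be sent to the robot; obs_feat and robot_state
        # would be refreshed from sensors before the next step.
```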
Looking ahead, MimicPlay's framework could be extended to more diverse and scalable sources of human interaction data, including internet-scale video. Expanding beyond static environments such as desk spaces to dynamic and mobile settings could further enhance the practical applicability and robustness of such systems.
MimicPlay's contribution to imitation learning is significant, providing a pathway toward changing how robots learn complex, multifaceted tasks. By harnessing inexpensive human play data, the framework offers a roadmap for substantially reducing the cost of training state-of-the-art robotic systems across diverse task settings.