Habitat-Web: Learning Embodied Object-Search Strategies from Human Demonstrations at Scale (2204.03514v2)

Published 7 Apr 2022 in cs.AI, cs.CV, and cs.RO

Abstract: We present a large-scale study of imitating human demonstrations on tasks that require a virtual robot to search for objects in new environments -- (1) ObjectGoal Navigation (e.g. 'find & go to a chair') and (2) Pick&Place (e.g. 'find mug, pick mug, find counter, place mug on counter'). First, we develop a virtual teleoperation data-collection infrastructure -- connecting Habitat simulator running in a web browser to Amazon Mechanical Turk, allowing remote users to teleoperate virtual robots, safely and at scale. We collect 80k demonstrations for ObjectNav and 12k demonstrations for Pick&Place, which is an order of magnitude larger than existing human demonstration datasets in simulation or on real robots. Second, we attempt to answer the question -- how does large-scale imitation learning (IL) (which hasn't been hitherto possible) compare to reinforcement learning (RL) (which is the status quo)? On ObjectNav, we find that IL (with no bells or whistles) using 70k human demonstrations outperforms RL using 240k agent-gathered trajectories. The IL-trained agent demonstrates efficient object-search behavior -- it peeks into rooms, checks corners for small objects, turns in place to get a panoramic view -- none of these are exhibited as prominently by the RL agent, and to induce these behaviors via RL would require tedious reward engineering. Finally, accuracy vs. training data size plots show promising scaling behavior, suggesting that simply collecting more demonstrations is likely to advance the state of art further. On Pick&Place, the comparison is starker -- IL agents achieve ~18% success on episodes with new object-receptacle locations when trained with 9.5k human demonstrations, while RL agents fail to get beyond 0%. Overall, our work provides compelling evidence for investing in large-scale imitation learning. Project page: https://ram81.github.io/projects/habitat-web.

Overview of "Habitat-Web: Learning Embodied Object-Search Strategies from Human Demonstrations at Scale"

The paper "Habitat-Web: Learning Embodied Object-Search Strategies from Human Demonstrations at Scale" presents a comprehensive research paper focused on enhancing object-search capabilities in virtual robots. This paper leverages human demonstrations to develop imitation learning (IL) models that are applied to two primary tasks: Object Navigation (\objnav) and Pick-and-Place (\pickplace).

Methodological Development

The authors introduce a novel teleoperation data-collection infrastructure that allows scalable, remote collection of human demonstrations by running the Habitat simulator in a web browser and interfacing it with users via Amazon Mechanical Turk (AMT). This setup enabled the collection of a significant number of human demonstrations, with 80,217 episodes for ObjectNav and 11,955 for Pick&Place, far surpassing existing datasets in both scale and diversity.
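To make the data flow concrete, here is a minimal sketch of recording one teleoperated episode as an (observation, action) sequence. The `env`, `get_user_action`, and `STOP` names are illustrative assumptions for this sketch, not Habitat-Web's actual interface.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List

@dataclass
class Demonstration:
    """One teleoperated episode: the user's actions and the observations they saw."""
    episode_id: str
    actions: List[str] = field(default_factory=list)
    observations: List[Dict[str, Any]] = field(default_factory=list)

def record_demo(env, episode_id: str, get_user_action) -> Demonstration:
    """Log (observation, action) pairs until the user ends the episode."""
    demo = Demonstration(episode_id=episode_id)
    obs = env.reset()
    while True:
        action = get_user_action()   # e.g. a keystroke relayed from the browser client
        demo.observations.append(obs)
        demo.actions.append(action)
        if action == "STOP":         # hypothetical episode-terminating action
            break
        obs = env.step(action)       # assumed to return the next observation
    return demo
```

Logs of this form can later be replayed against the simulator to produce supervised training data for the IL models.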

IL vs. RL Performance

A central aspect of the research is comparing IL models trained on human demonstrations against conventional reinforcement learning (RL) models. The IL models outperform RL in both efficiency and success rate: on ObjectNav, IL reaches 35.4% success, compared to a maximum of 34.6% for RL trained on 240k agent-gathered trajectories. On Pick&Place the gap is starker: the IL model achieves 18% success on episodes with new object-receptacle locations, while RL models fail to get beyond 0%. Importantly, the paper establishes an "exchange rate," finding that a single human demonstration is worth approximately four agent-gathered RL trajectories.
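At its core, the IL recipe here is behavior cloning: supervised learning of the human's action at each step of a demonstration. Below is a minimal PyTorch sketch of one such update, assuming a `policy` network that maps a batch of observations to logits over a discrete action space; the paper's actual architecture (e.g., recurrence over the trajectory) and any loss weighting may differ.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def bc_step(policy: nn.Module,
            optimizer: torch.optim.Optimizer,
            obs: torch.Tensor,              # (B, ...) batch of observations
            expert_actions: torch.Tensor    # (B,) human-chosen action indices
            ) -> float:
    """One behavior-cloning update: maximize the likelihood of the human's actions."""
    logits = policy(obs)                          # (B, num_actions)
    loss = F.cross_entropy(logits, expert_actions)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Because the supervision is dense (an expert action at every step), behavior cloning avoids the exploration burden that RL faces on long-horizon tasks like Pick&Place.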

Implications and Future Directions

The findings advocate a paradigm shift towards large-scale imitation learning, highlighting its potential to encode sophisticated, human-like search behaviors in embodied agents. This approach mitigates challenges inherent to RL, such as the tedious reward engineering needed to induce desired behaviors like thorough exploration and interaction. The dataset scaling behavior indicates that simply collecting more human demonstrations is likely to further advance the state of the art on embodied AI tasks.

Theoretical and Practical Contributions

The research contributes both theoretically, by establishing imitation learning as a viable and often preferable alternative to RL for complex object-search tasks, and practically, by providing a scalable data-collection infrastructure that can be reused for numerous tasks within the Habitat ecosystem. Notably, the IL-trained agents exhibit efficient exploratory strategies, such as peeking into rooms, scanning panoramically, and checking corners for small objects, underscoring the sophistication of the underlying human demonstrations.

In conclusion, the paper underscores the efficacy of imitation learning fueled by extensive human demonstration datasets in advancing object-search strategies in embodied AI agents. This has significant implications for future developments in AI, offering valuable insight into how human-like exploration strategies translate into task performance in robotic systems.

Authors (4)
  1. Ram Ramrakhya (13 papers)
  2. Eric Undersander (11 papers)
  3. Dhruv Batra (160 papers)
  4. Abhishek Das (61 papers)
Citations (92)