ZSON: Zero-Shot Object-Goal Navigation using Multimodal Goal Embeddings (2206.12403v2)
Abstract: We present a scalable approach for learning open-world object-goal navigation (ObjectNav) -- the task of asking a virtual robot (agent) to find any instance of an object in an unexplored environment (e.g., "find a sink"). Our approach is entirely zero-shot -- i.e., it does not require ObjectNav rewards or demonstrations of any kind. Instead, we train on the image-goal navigation (ImageNav) task, in which agents find the location where a picture (i.e., goal image) was captured. Specifically, we encode goal images into a multimodal, semantic embedding space to enable training semantic-goal navigation (SemanticNav) agents at scale in unannotated 3D environments (e.g., HM3D). After training, SemanticNav agents can be instructed to find objects described in free-form natural language (e.g., "sink", "bathroom sink", etc.) by projecting language goals into the same multimodal, semantic embedding space. As a result, our approach enables open-world ObjectNav. We extensively evaluate our agents on three ObjectNav datasets (Gibson, HM3D, and MP3D) and observe absolute improvements in success of 4.2%–20.0% over existing zero-shot methods. For reference, these gains are similar to or better than the 5% improvement in success between the Habitat 2020 and 2021 ObjectNav challenge winners. In an open-world setting, we discover that our agents can generalize to compound instructions with a room explicitly mentioned (e.g., "Find a kitchen sink") and when the target room can be inferred (e.g., "Find a sink and a stove").
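The mechanism that makes the zero-shot transfer possible is that image goals (used during ImageNav training) and language goals (used at ObjectNav test time) are projected into one shared semantic embedding space, so a single policy consumes either goal type unchanged. Below is a minimal sketch of that idea, assuming OpenAI's open-source `clip` package (`pip install git+https://github.com/openai/CLIP`); the file path `goal_view.png` and the commented-out `policy(...)` call are hypothetical placeholders, not the paper's released code.

```python
# Minimal sketch: embed an image goal and a free-form language goal into
# the same CLIP semantic space, so one navigation policy handles both.
import torch
import clip  # OpenAI CLIP: https://github.com/openai/CLIP
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("RN50", device=device)  # CLIP-style encoder; 1024-d embeddings

# Training time (ImageNav -> SemanticNav): embed the *goal image*.
goal_image = preprocess(Image.open("goal_view.png")).unsqueeze(0).to(device)  # hypothetical file
with torch.no_grad():
    image_goal = model.encode_image(goal_image)                  # shape: (1, 1024)
    image_goal = image_goal / image_goal.norm(dim=-1, keepdim=True)

# Test time (zero-shot ObjectNav): embed a *language goal* into the same space.
tokens = clip.tokenize(["a kitchen sink"]).to(device)
with torch.no_grad():
    text_goal = model.encode_text(tokens)                        # shape: (1, 1024)
    text_goal = text_goal / text_goal.norm(dim=-1, keepdim=True)

# Because both goals are unit vectors in one space, the same (hypothetical)
# policy works for either, with no retraining:
#   action = policy(observation, goal_embedding=image_goal)   # training
#   action = policy(observation, goal_embedding=text_goal)    # zero-shot eval
```

Since both goal types are L2-normalized vectors of the same dimensionality, swapping an image goal for a text goal at evaluation time requires no change to the policy architecture; that interchangeability is what lets an agent trained purely on ImageNav be evaluated zero-shot on ObjectNav.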
References:
- ObjectNav Revisited: On Evaluation of Embodied Agents Navigating to Objects. arXiv preprint arXiv:2006.13171, 2020.
- Habitat: A Platform for Embodied AI Research. In ICCV, 2019.
- Habitat 2.0: Training Home Assistants to Rearrange Their Habitat. In NeurIPS, 2021.
- Gibson Env: Real-World Perception for Embodied Agents. In CVPR, 2018.
- AI2-THOR: An Interactive 3D Environment for Visual AI. arXiv preprint, 2017.
- BenchBot: Evaluating Robotics Research in Photorealistic 3D Simulation and on Real Robots, 2020.
- ShapeNet: An Information-Rich 3D Model Repository. Technical Report arXiv:1512.03012 [cs.GR], Stanford University — Princeton University — Toyota Technological Institute at Chicago, 2015.
- Matterport3D: Learning from RGB-D Data in Indoor Environments. In 3DV, 2017. Matterport3D dataset license: http://kaldir.vc.in.tum.de/matterport/MP_TOS.pdf.
- 3D Scene Graph: A Structure for Unified Semantics, 3D Space, and Camera. In ICCV, 2019.
- Habitat-Web: Learning Embodied Object-Search Strategies from Human Demonstrations at Scale. In CVPR, 2022.
- Object Goal Navigation Using Goal-Oriented Semantic Exploration. In NeurIPS, 2020.
- Auxiliary Tasks and Exploration Enable ObjectNav. In ICCV, 2021.
- THDA: Treasure Hunt Data Augmentation for Semantic Navigation. In ICCV, 2021.
- SSCNav: Confidence-Aware Semantic Scene Completion for Visual Semantic Navigation. In ICRA, 2021.
- Stubborn: A Strong Baseline for Indoor Object Navigation. arXiv preprint arXiv:2203.07359, 2022.
- Offline Visual Representation Learning for Embodied Navigation. arXiv preprint arXiv:2204.13226, 2022.
- Habitat Challenge 2022. https://aihabitat.org/challenge/2022/, 2022.
- Zero Experience Required: Plug & Play Modular Transfer Learning for Semantic Visual Navigation. arXiv preprint arXiv:2202.02440, 2022.
- Learning Transferable Visual Models from Natural Language Supervision. In ICML, 2021.
- Habitat-Matterport 3D Dataset (HM3D): 1000 Large-Scale 3D Environments for Embodied AI. In NeurIPS Datasets and Benchmarks Track, 2021.
- CLIP on Wheels: Zero-Shot Object Navigation as Object Localization and Exploration. arXiv preprint arXiv:2203.10421, 2022.
- Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision. In ICML, 2021.
- Combined Scaling for Zero-shot Transfer Learning. arXiv preprint arXiv:2111.10050, 2021.
- ImageNet: A Large-Scale Hierarchical Image Database. In CVPR, 2009.
- Simple but Effective: CLIP Embeddings for Embodied AI. arXiv preprint arXiv:2111.09888, 2021.
- Grad-CAM: Visual Explanations from Deep Networks via Gradient-Based Localization. In ICCV, 2017.
- Representation Learning with Contrastive Predictive Coding. arXiv preprint arXiv:1807.03748, 2018.
- Target-Driven Visual Navigation in Indoor Scenes Using Deep Reinforcement Learning. In ICRA, 2017.
- Deep Residual Learning for Image Recognition. In CVPR, 2016.
- Omnidata: A Scalable Pipeline for Making Multi-Task Mid-Level Vision Datasets From 3D Scans. In ICCV, 2021.
- Emerging Properties in Self-Supervised Vision Transformers. In ICCV, 2021.
- DD-PPO: Learning Near-Perfect PointGoal Navigators from 2.5 Billion Frames. In ICLR, 2020.
- Memory-Augmented Reinforcement Learning for Image-Goal Navigation. arXiv preprint arXiv:2101.05181, 2021.
- On Evaluation of Embodied Navigation Agents. arXiv preprint arXiv:1807.06757, 2018.
- PyTorch: An Imperative Style, High-Performance Deep Learning Library. In NeurIPS, 2019.
- A Frontier-Based Approach for Autonomous Exploration. In CIRA, 1997.
- Not All Demonstrations Are Created Equal: An ObjectNav Case Study for Effectively Combining Imitation and Reinforcement Learning. https://github.com/Ram81/il_rl_baselines, 2022.
Authors: Arjun Majumdar, Gunjan Aggarwal, Bhavika Devnani, Judy Hoffman, Dhruv Batra