ZSON: Zero-Shot Object-Goal Navigation using Multimodal Goal Embeddings (2206.12403v2)

Published 24 Jun 2022 in cs.CV, cs.LG, and cs.RO

Abstract: We present a scalable approach for learning open-world object-goal navigation (ObjectNav) -- the task of asking a virtual robot (agent) to find any instance of an object in an unexplored environment (e.g., "find a sink"). Our approach is entirely zero-shot -- i.e., it does not require ObjectNav rewards or demonstrations of any kind. Instead, we train on the image-goal navigation (ImageNav) task, in which agents find the location where a picture (i.e., goal image) was captured. Specifically, we encode goal images into a multimodal, semantic embedding space to enable training semantic-goal navigation (SemanticNav) agents at scale in unannotated 3D environments (e.g., HM3D). After training, SemanticNav agents can be instructed to find objects described in free-form natural language (e.g., "sink", "bathroom sink", etc.) by projecting language goals into the same multimodal, semantic embedding space. As a result, our approach enables open-world ObjectNav. We extensively evaluate our agents on three ObjectNav datasets (Gibson, HM3D, and MP3D) and observe absolute improvements in success of 4.2%–20.0% over existing zero-shot methods. For reference, these gains are similar to or better than the 5% improvement in success between the Habitat 2020 and 2021 ObjectNav challenge winners. In an open-world setting, we discover that our agents can generalize to compound instructions with a room explicitly mentioned (e.g., "Find a kitchen sink") and when the target room can be inferred (e.g., "Find a sink and a stove").
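The key idea in the abstract is that image goals (used for training) and language goals (used at test time) live in one shared embedding space, so the navigation policy never needs to be retrained for text instructions. Below is a minimal sketch of that mechanism, not the authors' code: it assumes a CLIP-style encoder accessed through the Hugging Face transformers API, with the openai/clip-vit-base-patch32 checkpoint chosen purely for illustration.

# Minimal sketch (assumption: a CLIP-style dual encoder stands in for the
# paper's multimodal goal encoder; checkpoint and API are illustrative).
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

@torch.no_grad()
def embed_image_goal(image: Image.Image) -> torch.Tensor:
    # Training time: encode a goal image into the shared semantic space.
    inputs = processor(images=image, return_tensors="pt")
    feats = model.get_image_features(**inputs)
    return feats / feats.norm(dim=-1, keepdim=True)

@torch.no_grad()
def embed_text_goal(instruction: str) -> torch.Tensor:
    # Test time: encode a free-form language goal into the same space.
    inputs = processor(text=[instruction], return_tensors="pt", padding=True)
    feats = model.get_text_features(**inputs)
    return feats / feats.norm(dim=-1, keepdim=True)

# A policy trained to reach image-goal embeddings can be handed a
# text-goal embedding at inference with no retraining:
goal = embed_text_goal("find a kitchen sink")  # shape: (1, 512)

Because both encoders project into the same normalized space, swapping the goal modality at test time is a drop-in change: the policy only ever sees a goal vector, never the raw image or text.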

Authors (5)
  1. Arjun Majumdar (16 papers)
  2. Gunjan Aggarwal (5 papers)
  3. Bhavika Devnani (2 papers)
  4. Judy Hoffman (75 papers)
  5. Dhruv Batra (160 papers)
Citations (118)