ZSON: Zero-Shot Object-Goal Navigation using Multimodal Goal Embeddings (2206.12403v2)

Published 24 Jun 2022 in cs.CV, cs.LG, and cs.RO

Abstract: We present a scalable approach for learning open-world object-goal navigation (ObjectNav) -- the task of asking a virtual robot (agent) to find any instance of an object in an unexplored environment (e.g., "find a sink"). Our approach is entirely zero-shot -- i.e., it does not require ObjectNav rewards or demonstrations of any kind. Instead, we train on the image-goal navigation (ImageNav) task, in which agents find the location where a picture (i.e., goal image) was captured. Specifically, we encode goal images into a multimodal, semantic embedding space to enable training semantic-goal navigation (SemanticNav) agents at scale in unannotated 3D environments (e.g., HM3D). After training, SemanticNav agents can be instructed to find objects described in free-form natural language (e.g., "sink", "bathroom sink", etc.) by projecting language goals into the same multimodal, semantic embedding space. As a result, our approach enables open-world ObjectNav. We extensively evaluate our agents on three ObjectNav datasets (Gibson, HM3D, and MP3D) and observe absolute improvements in success of 4.2% - 20.0% over existing zero-shot methods. For reference, these gains are similar to or better than the 5% improvement in success between the Habitat 2020 and 2021 ObjectNav challenge winners. In an open-world setting, we discover that our agents can generalize to compound instructions with a room explicitly mentioned (e.g., "Find a kitchen sink") and when the target room can be inferred (e.g., "Find a sink and a stove").
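The central mechanism described above (encoding image goals at training time and language goals at evaluation time into one shared semantic space) can be sketched with an off-the-shelf CLIP-style encoder. The snippet below is a minimal illustration using the Hugging Face transformers CLIP API; the checkpoint, the file name, and the cosine-similarity check are illustrative assumptions, not the authors' exact ZSON configuration, which additionally trains a navigation policy conditioned on these embeddings.

```python
# Minimal sketch: project an image goal and a language goal into the same
# CLIP embedding space, as in semantic-goal navigation. Checkpoint choice,
# file path, and similarity check are assumptions, not the paper's setup.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
model.eval()

# Training-time goal: an image captured at the target location (ImageNav).
goal_image = Image.open("goal_view.png").convert("RGB")  # placeholder path
image_inputs = processor(images=goal_image, return_tensors="pt")
with torch.no_grad():
    image_goal = model.get_image_features(**image_inputs)   # shape (1, 512)

# Evaluation-time goal: a free-form object description (open-world ObjectNav).
text_inputs = processor(text=["a bathroom sink"], return_tensors="pt", padding=True)
with torch.no_grad():
    text_goal = model.get_text_features(**text_inputs)      # shape (1, 512)

# Both goals live in the same space, so a policy conditioned on one kind of
# embedding during training can be handed the other kind at test time.
image_goal = image_goal / image_goal.norm(dim=-1, keepdim=True)
text_goal = text_goal / text_goal.norm(dim=-1, keepdim=True)
print("cosine similarity:", (image_goal @ text_goal.T).item())
```

Because the image and text encoders are aligned, a policy trained only on image-goal embeddings can be conditioned on a text embedding such as "kitchen sink" at test time, which is what enables the zero-shot transfer from ImageNav to ObjectNav.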

Authors (5)
  1. Arjun Majumdar (16 papers)
  2. Gunjan Aggarwal (5 papers)
  3. Bhavika Devnani (2 papers)
  4. Judy Hoffman (75 papers)
  5. Dhruv Batra (160 papers)
Citations (118)
