SG-Nav: Online 3D Scene Graph Prompting for LLM-based Zero-shot Object Navigation (2410.08189v1)

Published 10 Oct 2024 in cs.CV and cs.RO

Abstract: In this paper, we propose a new framework for zero-shot object navigation. Existing zero-shot object navigation methods prompt the LLM with the text of spatially close objects, which lacks sufficient scene context for in-depth reasoning. To better preserve the information of the environment and fully exploit the reasoning ability of the LLM, we propose to represent the observed scene with a 3D scene graph. The scene graph encodes the relationships between objects, groups and rooms with an LLM-friendly structure, for which we design a hierarchical chain-of-thought prompt to help the LLM reason about the goal location according to the scene context by traversing the nodes and edges. Moreover, benefiting from the scene graph representation, we further design a re-perception mechanism that empowers the object navigation framework with the ability to correct perception errors. We conduct extensive experiments on the MP3D, HM3D and RoboTHOR environments, where SG-Nav surpasses previous state-of-the-art zero-shot methods by more than 10% SR on all benchmarks, while keeping the decision process explainable. To the best of our knowledge, SG-Nav is the first zero-shot method that achieves even higher performance than supervised object navigation methods on the challenging MP3D benchmark.


Summary

  • The paper presents an innovative framework, SG-Nav, that integrates online 3D scene graphs with hierarchical LLM prompting to enable zero-shot object navigation.
  • The method constructs a real-time, hierarchical scene graph that enhances spatial reasoning and re-perception, reducing false positives and improving navigation decisions.
  • Experimental results show SG-Nav surpassing prior state-of-the-art zero-shot methods by over 10% in success rate on the MP3D, HM3D, and RoboTHOR benchmarks.

SG-Nav: Framework for Zero-Shot Object Navigation

The paper "SG-Nav: Online 3D Scene Graph Prompting for LLM-based Zero-shot Object Navigation" presents a novel approach for zero-shot object navigation by integrating 3D scene graph construction with LLMs to offer a comprehensive and robust navigation strategy. Unlike prior efforts that rely solely on text prompts for spatial object categories, SG-Nav leverages a rich hierarchical scene graph to model environments and improve decision-making processes.

Motivation and Methodology

Limitations of Existing Methods

Traditional zero-shot object navigation techniques give the LLM little scene context, since they only describe nearby objects by their category names. This leaves much of the LLM's reasoning capability unexploited. Furthermore, their perception errors go uncorrected during navigation, while supervised alternatives are constrained by the specific datasets used for training.

SG-Nav Framework

SG-Nav addresses these limitations with an online 3D scene graph representation that captures the relationships between objects, groups, and rooms. Within this framework, SG-Nav builds a hierarchical 3D scene graph that is updated in real time as the agent explores the environment.

The scene graph is constructed incrementally to remain feasible online: newly observed nodes are connected only to relevant existing nodes, keeping the per-step update cost low. A dedicated prompting strategy then lets the agent reason over the spatial, hierarchical, and relational structure of the scene (Figure 1).

Figure 1: Pipeline of SG-Nav. We construct a hierarchical 3D scene graph as well as an occupancy map online...
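To make the graph representation concrete, here is a minimal Python sketch of an online hierarchical scene graph with objects, groups, and rooms. The class names, fields, and the distance-based "near" heuristic are illustrative assumptions, not the authors' implementation.

```python
import math
from dataclasses import dataclass, field

@dataclass
class Node:
    node_id: int
    level: str             # "object", "group", or "room"
    label: str              # e.g. "chair", "dining area", "kitchen"
    centroid: tuple          # (x, y, z) in the world frame
    children: list = field(default_factory=list)   # lower-level nodes grouped under this one

class SceneGraph:
    def __init__(self):
        self.nodes = {}      # node_id -> Node
        self.edges = []      # (src_id, dst_id, relation), e.g. "near", "inside"
        self._next_id = 0

    def add_node(self, level, label, centroid):
        node = Node(self._next_id, level, label, centroid)
        self.nodes[node.node_id] = node
        self._next_id += 1
        return node

    def add_edge(self, src, dst, relation):
        self.edges.append((src.node_id, dst.node_id, relation))

    def update(self, detections):
        """Incrementally insert newly observed objects and relate them only to
        nearby existing object nodes, keeping the per-frame update cost low.
        Group and room nodes would be formed by clustering object nodes; that
        step is omitted here for brevity."""
        for det in detections:   # det: {"label": str, "centroid": (x, y, z)}
            obj = self.add_node("object", det["label"], det["centroid"])
            for other in list(self.nodes.values()):
                if other.level == "object" and other.node_id != obj.node_id:
                    if math.dist(obj.centroid, other.centroid) < 1.5:   # assumed 1.5 m radius
                        self.add_edge(obj, other, "near")
```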

Hierarchical Reasoning and Re-Perception

Hierarchical Chain-of-Thought Prompting

The core innovation lies in prompting the LLM with the scene graph through a hierarchical chain-of-thought mechanism. This breaks the decision-making process into a sequence of guided sub-prompts: predicting relationships and distances among objects, posing contextual questions, and iteratively refining the model's understanding of the scene structure.
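As a hedged illustration, the sketch below (reusing the SceneGraph and Node structure from the earlier example) serializes the graph into a room-to-group-to-object prompt. The prompt wording, the traversal order, and the hypothetical query_llm helper are assumptions, not the paper's released prompts.

```python
def build_hierarchical_prompt(graph, goal_category):
    """Serialize the scene graph top-down so the LLM can reason room -> group -> object."""
    lines = [f"You are searching for a {goal_category}."]
    rooms = [n for n in graph.nodes.values() if n.level == "room"]
    for room in rooms:
        lines.append(f"Room: {room.label}")
        for group in room.children:
            lines.append(f"  Group: {group.label}")
            for obj in group.children:
                lines.append(f"    Object: {obj.label}")
    # Edges between objects give the LLM spatial context for its reasoning.
    for src, dst, rel in graph.edges:
        lines.append(f"{graph.nodes[src].label} is {rel} {graph.nodes[dst].label}.")
    lines.append(
        "Reason step by step: which room, then which group, then which object "
        f"is most likely to be close to a {goal_category}? "
        "Finally, give a probability between 0 and 1 for each candidate subgraph."
    )
    return "\n".join(lines)

# scores = query_llm(build_hierarchical_prompt(graph, "sofa"))  # hypothetical LLM call
```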

Graph-based Re-Perception

SG-Nav also introduces a graph-based re-perception mechanism that enhances the agent's ability to distinguish false positives in detected objects. Through repeated observation and credibility judgment based on cumulative probability scores, SG-Nav can dynamically adjust its navigation strategy, avoiding the pitfalls of incorrect object identification inherent in prior methods.
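The following minimal sketch shows one way such a credibility check could be implemented. The running-average scoring rule, the thresholds, and the class name GoalCredibility are illustrative assumptions rather than the authors' exact formulation.

```python
class GoalCredibility:
    """Accumulate a credibility score per candidate goal from repeated observations."""

    def __init__(self, accept_threshold=0.7, reject_threshold=0.2):
        self.scores = {}    # candidate_id -> running credibility in [0, 1]
        self.counts = {}    # candidate_id -> number of observations
        self.accept_threshold = accept_threshold
        self.reject_threshold = reject_threshold

    def observe(self, candidate_id, detection_confidence):
        """Fold a new detection confidence into the running average for this candidate."""
        n = self.counts.get(candidate_id, 0)
        prev = self.scores.get(candidate_id, 0.0)
        self.scores[candidate_id] = (prev * n + detection_confidence) / (n + 1)
        self.counts[candidate_id] = n + 1

    def decide(self, candidate_id):
        """Return 'accept', 'reject' (treat as a false positive and keep exploring),
        or 'uncertain' (approach the candidate again to re-perceive it)."""
        score = self.scores.get(candidate_id, 0.0)
        if score >= self.accept_threshold:
            return "accept"
        if score <= self.reject_threshold and self.counts.get(candidate_id, 0) >= 3:
            return "reject"
        return "uncertain"
```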

Experimental Evaluation and Results

The paper reports strong numerical results, with SG-Nav surpassing existing state-of-the-art zero-shot methods by a margin of over 10% in success rate (SR) across the tested benchmarks, including MP3D, HM3D, and RoboTHOR. Notably, SG-Nav even exceeds the performance of supervised methods on the MP3D dataset, highlighting its robust generalization capabilities (Figure 2).

Figure 2: Visualization of the navigation process of SG-Nav.

Implications and Future Directions

Practical and Theoretical Implications

Practically, SG-Nav advances the field of robotic autonomous navigation by offering a scalable and explainable zero-shot solution that does not rely on extensive dataset-specific training. Theoretically, it sets a precedent for leveraging hierarchical representations and LLMs' reasoning strengths in navigation tasks.

Future Prospects

Future developments could involve integrating more sophisticated online 3D instance segmentation techniques to further enhance real-time scene graph construction. Additionally, exploring the application of this framework to other navigation-related tasks, such as vision-and-language navigation, could broaden its utility and adaptability in AI applications.

Conclusion

SG-Nav represents a significant shift in zero-shot object navigation by uniting 3D scene graph structures with LLM capabilities, achieving state-of-the-art performance while remaining explainable and adaptable across environments. The hierarchical chain-of-thought prompting and graph-based re-perception stand out as innovative components, pointing to a promising direction for autonomous navigation systems (Figure 3).

Figure 3: Different from previous zero-shot object navigation methods, SG-Nav constructs a hierarchical 3D scene graph for improved structural understanding.
