SG-Nav: Online 3D Scene Graph Prompting for LLM-based Zero-shot Object Navigation (2410.08189v1)

Published 10 Oct 2024 in cs.CV and cs.RO

Abstract: In this paper, we propose a new framework for zero-shot object navigation. Existing zero-shot object navigation methods prompt the LLM with the text of spatially close objects, which lacks sufficient scene context for in-depth reasoning. To better preserve the information of the environment and fully exploit the reasoning ability of the LLM, we propose to represent the observed scene with a 3D scene graph. The scene graph encodes the relationships between objects, groups and rooms with an LLM-friendly structure, for which we design a hierarchical chain-of-thought prompt to help the LLM reason about the goal location according to the scene context by traversing the nodes and edges. Moreover, benefiting from the scene graph representation, we further design a re-perception mechanism that empowers the object navigation framework with the ability to correct perception errors. We conduct extensive experiments on the MP3D, HM3D and RoboTHOR environments, where SG-Nav surpasses previous state-of-the-art zero-shot methods by more than 10% SR on all benchmarks, while keeping the decision process explainable. To the best of our knowledge, SG-Nav is the first zero-shot method that achieves even higher performance than supervised object navigation methods on the challenging MP3D benchmark.


Summary

  • The paper presents an innovative framework, SG-Nav, that integrates online 3D scene graphs with hierarchical LLM prompting to enable zero-shot object navigation.
  • The method constructs a real-time, hierarchical scene graph that enhances spatial reasoning and re-perception, reducing false positives and improving navigation decisions.
  • Experimental results show SG-Nav surpassing prior state-of-the-art zero-shot methods by over 10% in success rate on the MP3D, HM3D, and RoboTHOR benchmarks.

SG-Nav: Framework for Zero-Shot Object Navigation

The paper "SG-Nav: Online 3D Scene Graph Prompting for LLM-based Zero-shot Object Navigation" presents a novel approach for zero-shot object navigation by integrating 3D scene graph construction with LLMs to offer a comprehensive and robust navigation strategy. Unlike prior efforts that rely solely on text prompts for spatial object categories, SG-Nav leverages a rich hierarchical scene graph to model environments and improve decision-making processes.

Motivation and Methodology

Limitations of Existing Methods

Traditional zero-shot object navigation techniques give the LLM little scene context, since they only describe nearby objects by their category names. This leaves much of the LLM's reasoning capability unexploited. Furthermore, their perception errors go uncorrected during navigation, while supervised alternatives are constrained by the specific datasets used for training.

SG-Nav Framework

SG-Nav addresses these limitations with an online 3D scene graph representation that captures the relationships between objects, groups, and rooms. Within this framework, SG-Nav builds a hierarchical 3D scene graph that is updated in real time as the agent explores the environment.

The scene graph is constructed incrementally to remain feasible online: newly observed nodes are connected only to relevant existing nodes, keeping the per-step update cost low. A dedicated prompting strategy then lets the agent reason over the spatial, hierarchical, and relational structure of the scene (Figure 1).

Figure 1: Pipeline of SG-Nav. We construct a hierarchical 3D scene graph as well as an occupancy map online...
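To make the graph representation concrete, here is a minimal Python sketch of an online hierarchical scene graph with objects, groups, and rooms. The class names, fields, and the distance-based "near" heuristic are illustrative assumptions, not the authors' implementation.

```python
import math
from dataclasses import dataclass, field

@dataclass
class Node:
    node_id: int
    level: str             # "object", "group", or "room"
    label: str              # e.g. "chair", "dining area", "kitchen"
    centroid: tuple          # (x, y, z) in the world frame
    children: list = field(default_factory=list)   # lower-level nodes grouped under this one

class SceneGraph:
    def __init__(self):
        self.nodes = {}      # node_id -> Node
        self.edges = []      # (src_id, dst_id, relation), e.g. "near", "inside"
        self._next_id = 0

    def add_node(self, level, label, centroid):
        node = Node(self._next_id, level, label, centroid)
        self.nodes[node.node_id] = node
        self._next_id += 1
        return node

    def add_edge(self, src, dst, relation):
        self.edges.append((src.node_id, dst.node_id, relation))

    def update(self, detections):
        """Incrementally insert newly observed objects and relate them only to
        nearby existing object nodes, keeping the per-frame update cost low.
        Group and room nodes would be formed by clustering object nodes; that
        step is omitted here for brevity."""
        for det in detections:   # det: {"label": str, "centroid": (x, y, z)}
            obj = self.add_node("object", det["label"], det["centroid"])
            for other in list(self.nodes.values()):
                if other.level == "object" and other.node_id != obj.node_id:
                    if math.dist(obj.centroid, other.centroid) < 1.5:   # assumed 1.5 m radius
                        self.add_edge(obj, other, "near")
```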

Hierarchical Reasoning and Re-Perception

Hierarchical Chain-of-Thought Prompting

The core innovation lies in prompting the LLM with the scene graph through a hierarchical chain-of-thought mechanism. This breaks the decision-making process into a sequence of guided sub-prompts: predicting relationships and distances among objects, posing contextual questions, and iteratively refining the model's understanding of the scene structure.
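As a hedged illustration, the sketch below (reusing the SceneGraph and Node structure from the earlier example) serializes the graph into a room-to-group-to-object prompt. The prompt wording, the traversal order, and the hypothetical query_llm helper are assumptions, not the paper's released prompts.

```python
def build_hierarchical_prompt(graph, goal_category):
    """Serialize the scene graph top-down so the LLM can reason room -> group -> object."""
    lines = [f"You are searching for a {goal_category}."]
    rooms = [n for n in graph.nodes.values() if n.level == "room"]
    for room in rooms:
        lines.append(f"Room: {room.label}")
        for group in room.children:
            lines.append(f"  Group: {group.label}")
            for obj in group.children:
                lines.append(f"    Object: {obj.label}")
    # Edges between objects give the LLM spatial context for its reasoning.
    for src, dst, rel in graph.edges:
        lines.append(f"{graph.nodes[src].label} is {rel} {graph.nodes[dst].label}.")
    lines.append(
        "Reason step by step: which room, then which group, then which object "
        f"is most likely to be close to a {goal_category}? "
        "Finally, give a probability between 0 and 1 for each candidate subgraph."
    )
    return "\n".join(lines)

# scores = query_llm(build_hierarchical_prompt(graph, "sofa"))  # hypothetical LLM call
```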

Graph-based Re-Perception

SG-Nav also introduces a graph-based re-perception mechanism that enhances the agent's ability to distinguish false positives in detected objects. Through repeated observation and credibility judgment based on cumulative probability scores, SG-Nav can dynamically adjust its navigation strategy, avoiding the pitfalls of incorrect object identification inherent in prior methods.
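The following minimal sketch shows one way such a credibility check could be implemented. The running-average scoring rule, the thresholds, and the class name GoalCredibility are illustrative assumptions rather than the authors' exact formulation.

```python
class GoalCredibility:
    """Accumulate a credibility score per candidate goal from repeated observations."""

    def __init__(self, accept_threshold=0.7, reject_threshold=0.2):
        self.scores = {}    # candidate_id -> running credibility in [0, 1]
        self.counts = {}    # candidate_id -> number of observations
        self.accept_threshold = accept_threshold
        self.reject_threshold = reject_threshold

    def observe(self, candidate_id, detection_confidence):
        """Fold a new detection confidence into the running average for this candidate."""
        n = self.counts.get(candidate_id, 0)
        prev = self.scores.get(candidate_id, 0.0)
        self.scores[candidate_id] = (prev * n + detection_confidence) / (n + 1)
        self.counts[candidate_id] = n + 1

    def decide(self, candidate_id):
        """Return 'accept', 'reject' (treat as a false positive and keep exploring),
        or 'uncertain' (approach the candidate again to re-perceive it)."""
        score = self.scores.get(candidate_id, 0.0)
        if score >= self.accept_threshold:
            return "accept"
        if score <= self.reject_threshold and self.counts.get(candidate_id, 0) >= 3:
            return "reject"
        return "uncertain"
```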

Experimental Evaluation and Results

The paper reports strong numerical results, with SG-Nav surpassing existing state-of-the-art zero-shot methods by a margin of over 10% in success rate (SR) across the tested benchmarks, including MP3D, HM3D, and RoboTHOR. Notably, SG-Nav even exceeds the performance of supervised methods on the MP3D dataset, highlighting its robust generalization capabilities (Figure 2).

Figure 2: Visualization of the navigation process of SG-Nav.

Implications and Future Directions

Practical and Theoretical Implications

Practically, SG-Nav advances the field of robotic autonomous navigation by offering a scalable and explainable zero-shot solution that does not rely on extensive dataset-specific training. Theoretically, it sets a precedent for leveraging hierarchical representations and LLMs' reasoning strengths in navigation tasks.

Future Prospects

Future developments could involve integrating more sophisticated online 3D instance segmentation techniques to further enhance real-time scene graph construction. Additionally, exploring the application of this framework to other navigation-related tasks, such as vision-and-language navigation, could broaden its utility and adaptability in AI applications.

Conclusion

SG-Nav represents a significant shift in zero-shot object navigation by uniting 3D scene graph structures with LLM capabilities, achieving state-of-the-art performance while remaining explainable and adaptable across environments. The hierarchical chain-of-thought prompting and graph-based re-perception stand out as innovative components, pointing to a promising direction for autonomous navigation systems (Figure 3).

Figure 3: Different from previous zero-shot object navigation methods, SG-Nav constructs a hierarchical 3D scene graph for improved structural understanding.
