
Hierarchical Open-Vocabulary 3D Scene Graphs for Language-Grounded Robot Navigation (2403.17846v2)

Published 26 Mar 2024 in cs.RO, cs.AI, cs.CL, cs.CV, and cs.LG

Abstract: Recent open-vocabulary robot mapping methods enrich dense geometric maps with pre-trained visual-language features. While these maps allow for the prediction of point-wise saliency maps when queried for a certain language concept, large-scale environments and abstract queries beyond the object level still pose a considerable hurdle, ultimately limiting language-grounded robotic navigation. In this work, we present HOV-SG, a hierarchical open-vocabulary 3D scene graph mapping approach for language-grounded robot navigation. Leveraging open-vocabulary vision foundation models, we first obtain state-of-the-art open-vocabulary segment-level maps in 3D and subsequently construct a 3D scene graph hierarchy consisting of floor, room, and object concepts, each enriched with open-vocabulary features. Our approach is able to represent multi-story buildings and allows robotic traversal of those using a cross-floor Voronoi graph. HOV-SG is evaluated on three distinct datasets and surpasses previous baselines in open-vocabulary semantic accuracy on the object, room, and floor level while producing a 75% reduction in representation size compared to dense open-vocabulary maps. In order to prove the efficacy and generalization capabilities of HOV-SG, we showcase successful long-horizon language-conditioned robot navigation within real-world multi-story environments. We provide code and trial video data at http://hovsg.github.io/.

Hierarchical Open-Vocabulary 3D Scene Graphs for Language-Grounded Robot Navigation

The paper presents Hierarchical Open-Vocabulary Scene Graphs (HOV-SG), a framework designed to enhance robot navigation through language grounding. The approach combines 3D scene graph construction with open-vocabulary segmentation, aiming to improve robotic traversal of complex multi-story environments.

Core Contributions

  1. Hierarchical Scene Graph Construction: The system constructs hierarchical 3D scene graphs that span the object, room, and floor levels by leveraging vision-language models. It integrates state-of-the-art open-vocabulary mapping, enriching each level with open-vocabulary features to index semantic concepts efficiently.
  2. Semantic Mapping and Compactness: The proposed representation reduces storage overhead by 75% compared to conventional dense open-vocabulary maps while retaining comprehensive semantic information. Producing compact yet information-rich maps is a significant advance for practical applications where resources are limited.
  3. 3D Semantic Segmentation Enhancements: HOV-SG substantially surpasses previous baselines in open-vocabulary 3D semantic segmentation across multiple datasets. The combination of CLIP embeddings with novel feature clustering strategies improves segmentation accuracy on this complex task.
  4. Language-Guided Navigation: Through the formulation of actionable graphs, the system facilitates complex object retrieval and long-horizon multi-story navigation from language queries. The hierarchical querying utilizes LLMs to break down queries effectively from broad scene concepts to fine-grained object details.
  5. Novel Semantic Evaluation Metric: The paper introduces AUC_k^top, the area under the top-k accuracy curve, as a new metric for assessing open-vocabulary semantic similarity that scales to extensive label sets. This metric ensures robust evaluation of semantic accuracy over large and variably sized category sets.
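The hierarchical querying idea in points 1 and 4 can be sketched as a tree of floor, room, and object nodes, each carrying an open-vocabulary embedding, with a decomposed query resolved top-down by cosine similarity. This is a minimal illustration, not the paper's implementation: the `Node` class, `best_match`, and `hierarchical_query` names are hypothetical, and real node embeddings would come from a model such as CLIP rather than the toy vectors used here.

```python
# Minimal sketch of a hierarchical open-vocabulary scene graph.
# Assumption: every node stores a pre-computed embedding (e.g. a CLIP
# feature); a language query decomposed into floor/room/object parts is
# resolved top-down by picking the most similar child at each level.
import numpy as np

class Node:
    def __init__(self, name, embedding, children=None):
        self.name = name
        # unit-normalize so a dot product equals cosine similarity
        self.embedding = embedding / np.linalg.norm(embedding)
        self.children = children or []

def best_match(nodes, query_embedding):
    """Return the node whose embedding is most cosine-similar to the query."""
    q = query_embedding / np.linalg.norm(query_embedding)
    return max(nodes, key=lambda n: float(n.embedding @ q))

def hierarchical_query(root, floor_q, room_q, object_q):
    """Resolve a decomposed query top-down: floor -> room -> object."""
    floor = best_match(root.children, floor_q)
    room = best_match(floor.children, room_q)
    obj = best_match(room.children, object_q)
    return floor, room, obj
```

In the actual system, an LLM would perform the query decomposition ("the cup in the kitchen on the first floor" into floor, room, and object sub-queries) before this top-down matching.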
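The evaluation metric in point 5 can be understood as averaging top-k accuracy over all possible k: rank the ground-truth label among all candidate labels by predicted similarity, compute top-k accuracy for each k, and take the (normalized) area under that curve. The sketch below illustrates this idea under those assumptions; the paper's exact normalization of AUC_k^top may differ, and the function name is hypothetical.

```python
# Hedged sketch of an area-under-top-k-accuracy metric in the spirit of
# the paper's AUC_k^top. Assumption: the metric reduces to the mean of
# top-k accuracy over k = 1..n_labels, which yields a score in [0, 1].
import numpy as np

def auc_top_k(similarities, gt_indices):
    """similarities: (n_samples, n_labels) predicted scores.
    gt_indices: (n_samples,) ground-truth label index per sample.
    Returns the normalized area under the top-k accuracy curve."""
    sims = np.asarray(similarities, dtype=float)
    n_samples, n_labels = sims.shape
    # rank (0 = best) of the ground-truth label for each sample
    order = np.argsort(-sims, axis=1)
    ranks = np.array(
        [int(np.where(order[i] == gt_indices[i])[0][0]) for i in range(n_samples)]
    )
    # top-k accuracy for k = 1..n_labels, then average over k
    topk_acc = np.array([(ranks < k).mean() for k in range(1, n_labels + 1)])
    return float(topk_acc.mean())
```

A perfect predictor scores 1.0, while a predictor that always ranks the true label last scores 1/n_labels, which makes the metric meaningful even when the label set is large and the single top-1 choice among hundreds of open-vocabulary categories is ambiguous.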

Implications and Potential Developments

By segmenting environments into hierarchical graphs and utilizing open-vocabulary features, this work expands the scope of language-conditioned tasks in robotic navigation. Its practical impact lies in enabling seamless interaction between humans and intelligent robotic systems in multifaceted settings such as smart homes or public service areas, advancing robots' capacity to follow intricate natural-language commands within their environments.

Moreover, the storage efficiency allows deployment in resource-constrained scenarios, pushing the boundaries of autonomous navigation systems in real-world applications, including search-and-rescue missions where computational resources are critically limited.

Future Directions

The paper paves the way for future research on dynamic environment mapping and the integration of reactive, embodied agents. Given the static nature of current mapping systems, representing dynamic and changing environments presents an exciting challenge. Improving real-time performance is another crucial step toward broadening the system's applicability.

Conclusion

HOV-SG brings notable enhancements to scene understanding and robotic navigation by integrating hierarchical, open-vocabulary scene graph methodologies with innovative language interaction frameworks. While it faces limitations in real-time applications and handling dynamic settings, its current contributions in efficiency, accuracy, and practical robotic navigation are commendable and provide fertile ground for further advancements in cognitive robotics and automation systems.

Authors (5)
  1. Abdelrhman Werby (1 paper)
  2. Chenguang Huang (8 papers)
  3. Martin Büchner (16 papers)
  4. Abhinav Valada (117 papers)
  5. Wolfram Burgard (149 papers)
Citations (26)