SG-Reg: Generalizable and Efficient Scene Graph Registration (2504.14440v2)

Published 20 Apr 2025 in cs.RO and cs.CV

Abstract: This paper addresses the challenges of registering two rigid semantic scene graphs, an essential capability when an autonomous agent needs to register its map against a remote agent, or against a prior map. The hand-crafted descriptors in classical semantic-aided registration, or the ground-truth annotation reliance in learning-based scene graph registration, impede their application in practical real-world environments. To address the challenges, we design a scene graph network to encode multiple modalities of semantic nodes: open-set semantic feature, local topology with spatial awareness, and shape feature. These modalities are fused to create compact semantic node features. The matching layers then search for correspondences in a coarse-to-fine manner. In the back-end, we employ a robust pose estimator to decide transformation according to the correspondences. We manage to maintain a sparse and hierarchical scene representation. Our approach demands fewer GPU resources and less communication bandwidth in multi-agent tasks. Moreover, we design a new data generation approach using vision foundation models and a semantic mapping module to reconstruct semantic scene graphs. It differs significantly from previous works, which rely on ground-truth semantic annotations to generate data. We validate our method in a two-agent SLAM benchmark. It significantly outperforms the hand-crafted baseline in terms of registration success rate. Compared to visual loop closure networks, our method achieves a slightly higher registration recall while requiring only 52 KB of communication bandwidth for each query frame. Code available at: http://github.com/HKUST-Aerial-Robotics/SG-Reg

Summary

Generalizable and Efficient Scene Graph Registration for Robotics

This paper focuses on Scene Graph Registration (SGR): aligning two rigid semantic scene graphs so that an autonomous agent can register its map against a remote agent's map or against a prior map. The authors introduce SG-Reg, a generalizable and efficient approach to this problem. Existing methods are hard to deploy in real-world settings for two different reasons: classical semantic-aided registration depends on hand-crafted descriptors, while learning-based scene graph registration relies on ground-truth semantic annotations. SG-Reg addresses both shortcomings by combining a scene graph network with vision foundation models, and it does so with reduced GPU and bandwidth demands.

Methodology

The proposed method uses a scene graph network that encodes three modalities for each semantic node: open-set semantic features, local topology with spatial awareness, and shape features. These modalities are fused into a compact node feature, and the matching layers then search for correspondences in a coarse-to-fine manner. In the back-end, a robust pose estimator computes the transformation from these correspondences. Throughout, the scene is kept as a sparse, hierarchical representation, which lowers GPU usage and communication bandwidth in multi-agent tasks. The authors also propose a new data generation approach: vision foundation models and a semantic mapping module reconstruct the semantic scene graphs, so training data no longer depends on ground-truth semantic annotations.
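The match-then-estimate pipeline described above can be illustrated with a minimal sketch. This is not the authors' implementation: the summary does not specify the matching rule or the pose estimator, so mutual nearest-neighbor matching on cosine similarity and a RANSAC loop around the Kabsch (SVD) solver are used here as stand-ins. All function names, the inlier threshold, and the iteration count are assumptions chosen for illustration.

```python
import numpy as np


def mutual_nearest_matches(feat_a, feat_b):
    """Return index pairs (i, j) that are each other's best cosine match.

    Assumption: a stand-in for the learned matching layers.
    """
    a = feat_a / np.linalg.norm(feat_a, axis=1, keepdims=True)
    b = feat_b / np.linalg.norm(feat_b, axis=1, keepdims=True)
    sim = a @ b.T
    best_b = sim.argmax(axis=1)  # best node in B for each node in A
    best_a = sim.argmax(axis=0)  # best node in A for each node in B
    return [(i, int(j)) for i, j in enumerate(best_b) if best_a[j] == i]


def kabsch(src, dst):
    """Least-squares rigid transform (R, t) such that dst ~ R @ src + t."""
    src_c, dst_c = src.mean(axis=0), dst.mean(axis=0)
    H = (src - src_c).T @ (dst - dst_c)  # 3x3 cross-covariance
    U, _, Vt = np.linalg.svd(H)
    # Flip the last axis if needed so R is a rotation, not a reflection.
    d = 1.0 if np.linalg.det(Vt.T @ U.T) >= 0 else -1.0
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = dst_c - R @ src_c
    return R, t


def ransac_rigid(src, dst, threshold=0.2, iterations=200, seed=0):
    """Estimate (R, t) from correspondences that may contain outliers.

    Assumption: threshold (in map units) and iterations are illustrative.
    """
    rng = np.random.default_rng(seed)
    best_inliers = np.zeros(len(src), dtype=bool)
    for _ in range(iterations):
        sample = rng.choice(len(src), size=3, replace=False)
        R, t = kabsch(src[sample], dst[sample])
        residuals = np.linalg.norm(src @ R.T + t - dst, axis=1)
        inliers = residuals < threshold
        if inliers.sum() > best_inliers.sum():
            best_inliers = inliers
    if best_inliers.sum() < 3:
        raise ValueError("too few inliers to estimate a rigid transform")
    # Refit on every inlier of the best hypothesis.
    R, t = kabsch(src[best_inliers], dst[best_inliers])
    return R, t, best_inliers
```

In use, `mutual_nearest_matches` would be applied to the fused node features of the two graphs, and the 3D centroids of the matched nodes would be passed to `ransac_rigid` as `src` and `dst`. Matching sparse node-level features rather than dense points is what keeps the amount of exchanged data small in this kind of design.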

Key Results

SG-Reg was evaluated on a two-agent Simultaneous Localization and Mapping (SLAM) benchmark. It significantly outperforms the hand-crafted baseline in registration success rate. Compared to visual loop closure networks, it achieves a slightly higher registration recall (0.7%) while requiring only 52 KB of communication bandwidth per query frame.

Implications and Future Directions

Practically, SG-Reg provides an enhanced backbone for multi-agent systems, reducing overhead while increasing reliability and scalability in map registration tasks used in robotics. Theoretically, it extends the application of scene graph learning methodologies beyond supervised datasets into real-world scenarios exhibiting semantic noise.

Looking forward, one can speculate that more refined feature extraction, for example through transformer architectures, could provide additional robustness. Evaluation on a wider variety of datasets would test how well the method transfers across diverse environments. Additionally, the method's low GPU and bandwidth requirements make it a candidate for edge computing and real-time autonomous navigation in cluttered environments.

In conclusion, the research presents a significant step in enhancing registration capabilities for autonomous systems, addressing bandwidth constraints, improving generalization, and offering robust solutions in the face of data variance often seen in real-world applications.