
SGAligner : 3D Scene Alignment with Scene Graphs (2304.14880v2)

Published 28 Apr 2023 in cs.CV

Abstract: Building 3D scene graphs has recently emerged as a topic in scene representation for several embodied AI applications to represent the world in a structured and rich manner. With their increased use in solving downstream tasks (e.g., navigation and room rearrangement), can we leverage and recycle them for creating 3D maps of environments, a pivotal step in agent operation? We focus on the fundamental problem of aligning pairs of 3D scene graphs whose overlap can range from zero to partial and can contain arbitrary changes. We propose SGAligner, the first method for aligning pairs of 3D scene graphs that is robust to in-the-wild scenarios (i.e., unknown overlap, if any, and changes in the environment). We draw inspiration from multi-modality knowledge graphs and use contrastive learning to learn a joint, multi-modal embedding space. We evaluate on the 3RScan dataset and further showcase that our method can be used for estimating the transformation between pairs of 3D scenes. Since benchmarks for these tasks are missing, we create them on this dataset. The code, benchmark, and trained models are available on the project website.

References (53)
  1. Taskography: Evaluating robot task planning over large 3d scene graphs. In Conference on Robot Learning, pages 46–58. PMLR, 2022.
  2. 3d scene graph: A structure for unified semantics, 3d space, and camera. In Proceedings of the IEEE/CVF international conference on computer vision, pages 5664–5673, 2019.
  3. PointNet on FPGA for real-time LiDAR point cloud processing. October 2020.
  4. D3feat: Joint learning of dense detection and description of 3d local features. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 6359–6367, 2020.
  5. Graph-cut ransac. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 6733–6741, 2018.
  6. Magsac++, a fast, reliable and accurate robust estimator. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 1304–1312, 2020.
  7. D-lite: Navigation-oriented compression of 3d scene graphs under communication constraints. arXiv preprint arXiv:2209.06111, 2022.
  8. Mmea: entity alignment for multi-modal knowledge graph. In Knowledge Science, Engineering and Management: 13th International Conference, KSEM 2020, Hangzhou, China, August 28–30, 2020, Proceedings, Part I 13, pages 134–147. Springer, 2020.
  9. Multi-modal siamese network for entity alignment. In Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, pages 118–126, 2022.
  10. Multijaf: Multi-modal joint entity alignment framework for multi-modal knowledge graph. Neurocomputing, 500:581–591, 2022.
  11. Graph-to-3d: End-to-end generation and manipulation of 3d scenes using scene graphs. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 16352–16361, 2021.
  12. Random sample consensus: A paradigm for model fitting with applications to image analysis and automated cartography. Communications of the ACM, 1981.
  13. Continuous scene representations for embodied ai. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 14849–14859, June 2022.
  14. Multi-modal entity alignment in hyperbolic space. Neurocomputing, 461:598–607, 2021.
  15. Pct: Point cloud transformer. Computational Visual Media, 7(2):187–199, Apr 2021.
  16. Predator: Registration of 3d point clouds with low overlap. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2021.
  17. Hydra: A real-time spatial perception system for 3D scene graph construction and optimization. 2022.
  18. Aggregating local descriptors into a compact image representation. In 2010 IEEE computer society conference on computer vision and pattern recognition, pages 3304–3311. IEEE, 2010.
  19. Sequential manipulation planning on scene graph. In 2022 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 8203–8210. IEEE, 2022.
  20. 3-d scene graph: A sparse and semantic representation of physical environments for intelligent agents. IEEE transactions on cybernetics, 50(12):4921–4933, 2019.
  21. Visual genome: Connecting language and vision using crowdsourced dense image annotations. International journal of computer vision, 123:32–73, 2017.
  22. Embodied semantic scene graph generation. In Conference on Robot Learning, pages 1585–1594. PMLR, 2022.
  23. Graph matching networks for learning the similarity of graph structured objects. In International conference on machine learning, pages 3835–3845. PMLR, 2019.
  24. Remote object navigation for service robots using hierarchical knowledge graph in human-centered environments. Intelligent Service Robotics, 15(4):459–473, 2022.
  25. Multi-modal contrastive representation learning for entity alignment. arXiv preprint arXiv:2209.00891, 2022.
  26. Visual pivoting for (unsupervised) entity alignment. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 35, pages 4257–4266, 2021.
  27. Hal: Improved text-image matching by mitigating visual semantic hubs. Proceedings of the AAAI Conference on Artificial Intelligence, 34:11563–11571, 04 2020.
  28. Mmkg: multi-modal knowledge graphs. In The Semantic Web: 16th International Conference, ESWC 2019, Portorož, Slovenia, June 2–6, 2019, Proceedings 16, pages 459–474. Springer, 2019.
  29. 3d vsg: Long-term semantic scene change prediction through 3d variable scene graphs. arXiv preprint arXiv:2209.07896, 2022.
  30. Decoupled weight decay regularization. In International Conference on Learning Representations, 2019.
  31. Pointnet: Deep learning on point sets for 3d classification and segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 652–660, 2017.
  32. Geometric transformer for fast and robust point cloud registration. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2022.
  33. Usac: A universal framework for random sample consensus. IEEE transactions on pattern analysis and machine intelligence, 35(8):2022–2038, 2012.
  34. Bridging scene understanding and task execution with flexible simulation environments. arXiv preprint arXiv:2011.10452, 2020.
  35. Hierarchical representations and explicit memory: Learning effective navigation policies on 3d scene graphs using graph neural networks. In 2022 International Conference on Robotics and Automation (ICRA), pages 9272–9279, 2022.
  36. 3d dynamic scene graphs: Actionable spatial perception with places, objects, and humans. arXiv preprint arXiv:2002.06289, 2020.
  37. Kimera: From slam to spatial perception with 3d dynamic scene graphs. The International Journal of Robotics Research, 40(12-14):1510–1546, 2021.
  38. Fast point feature histograms (fpfh) for 3d registration. In 2009 IEEE international conference on robotics and automation, pages 3212–3217. IEEE, 2009.
  39. A deep learning based behavioral approach to indoor autonomous navigation. In 2018 IEEE international conference on robotics and automation (ICRA), pages 4646–4653. IEEE, 2018.
  40. Probing the impacts of visual context in multimodal entity alignment. In Web and Big Data: 6th International Joint Conference, APWeb-WAIM 2022, Nanjing, China, November 25–27, 2022, Proceedings, Part II, pages 255–270. Springer, 2023.
  41. NeuralRecon: Real-time coherent 3D reconstruction from monocular video. CVPR, 2021.
  42. Graph attention networks. In International Conference on Learning Representations, 2018.
  43. Rio: 3d object instance re-localization in changing indoor environments. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 7658–7667, 2019.
  44. Learning 3d semantic scene graphs from 3d indoor reconstructions. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3961–3970, 2020.
  45. Scenegraphfusion: Incremental 3d scene graph prediction from rgb-d sequences. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7515–7525, 2021.
  46. Rpm-net: Robust point matching using learned features. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 11824–11833, 2020.
  47. Regtr: End-to-end point cloud correspondences with transformers. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2022.
  48. 3dmatch: Learning local geometric descriptors from rgb-d reconstructions. In CVPR, 2017.
  49. Exploiting edge-oriented reasoning for 3d point-based scene graph analysis. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 9705–9715, June 2021.
  50. Multi-view knowledge graph embedding for entity alignment. arXiv preprint arXiv:1906.02390, 2019.
  51. A dual representation framework for robot learning with human guidance. In 6th Annual Conference on Robot Learning, 2022.
  52. Knowledge-inspired 3d scene graph prediction in point cloud. In Marc’Aurelio Ranzato, Alina Beygelzimer, Yann N. Dauphin, Percy Liang, and Jennifer Wortman Vaughan, editors, Advances in Neural Information Processing Systems 34: Annual Conference on Neural Information Processing Systems 2021, NeurIPS 2021, December 6-14, 2021, virtual, pages 18620–18632, 2021.
  53. Yu Zhong. Intrinsic shape signatures: A shape descriptor for 3d object recognition. In 2009 IEEE 12th international conference on computer vision workshops, ICCV workshops, pages 689–696. IEEE, 2009.

Summary

  • The paper introduces SGAligner, a novel method that aligns pairs of 3D scene graphs robustly, even with unknown overlap and environmental changes.
  • SGAligner utilizes a multi-modal architecture combining object, structure, attribute, and relationship embeddings, trained with contrastive learning to match corresponding entities.
  • Evaluations on the 3RScan dataset show that SGAligner significantly improves 3D point cloud registration accuracy and recall over state-of-the-art methods; the authors also release a new scene graph alignment benchmark.

This paper introduces SGAligner, a novel method for aligning 3D scene graphs. The key innovation is the method's robustness to real-world scenarios, specifically its ability to handle unknown overlap and changes within the environment. The method is inspired by multi-modality knowledge graphs and uses contrastive learning to learn a joint, multi-modal embedding space.

The paper makes the following claims:

  • It introduces SGAligner, the first method for aligning pairs of 3D scene graphs whose overlap can range from zero to partial and which may contain changes.
  • It demonstrates the potential of SGAligner on tasks such as 3D point cloud registration and 3D point cloud mosaicking, as well as 3D alignment of a point cloud in a larger map that contains changes.
  • It creates a scene graph alignment and 3D point cloud registration benchmark on the 3RScan dataset, providing data, metrics, and an evaluation procedure.

Here's a breakdown of the approach and results:

  • Problem Formulation: The paper addresses the challenge of aligning two 3D scene graphs, $\mathcal{G}_1 = (\mathcal{N}_1, \mathcal{R}_1)$ and $\mathcal{G}_2 = (\mathcal{N}_2, \mathcal{R}_2)$, representing scenes $s_1$ and $s_2$, respectively. The objective is to identify corresponding objects in overlapping regions, denoted as entity pairs $\mathcal{F} = \{(n_1, n_2) \;|\; n_1 \equiv n_2,\; n_1 \in \mathcal{N}_1,\; n_2 \in \mathcal{N}_2\}$, even when the overlap between the graphs is partial or non-existent and the scenes contain changes.
  • SGAligner Architecture: The SGAligner architecture draws inspiration from entity alignment methods in multi-modality knowledge graphs. It encodes semantic entities, their attributes, and relationships between entities using separate modalities. The core idea is to encode each of these modalities independently and then learn a joint embedding that can effectively determine the similarity between any two nodes. The architecture is composed of the following uni-modal embeddings:
    • Object Embedding: Employs a point cloud feature extractor backbone (e.g., PointNet) to extract visual features $\phi_i^{\mathcal{P}}$ from the point cloud $\mathcal{P}_i$ of each object instance $\mathcal{O}_i$.
    • Structure Embedding: Uses a Graph Attention Network (GAT) to model structural information in $\mathcal{G}_1$ and $\mathcal{G}_2$. Node features represent the relative translation between object instances, and edges represent the relationships between the nodes. The resulting neighborhood structure embedding is denoted $\phi_i^{\mathcal{S}}$.
    • Meta Embeddings: Encodes the attributes $\mathcal{A}$ and relationships $\mathcal{R}$ of each object $\mathcal{O}_i$ using one-hot encoded feature vectors passed through a single-layer MLP, yielding embeddings $\phi_i^{\mathcal{A}}$ and $\phi_i^{\mathcal{R}}$, respectively.
    • Joint Embedding: Concatenates the uni-modal features into a single compact representation $\hat{\phi}_i$ for each object $\mathcal{O}_i$ (a code sketch of this fusion appears after this list):

      $\hat{\phi}_i = \bigoplus_{k \in \mathcal{K}}\left[\frac{\exp(w_k)}{\sum_{j \in \mathcal{K}} \exp(w_j)}\, \phi_i^{k}\right]$

      where:

      • $\oplus$ denotes concatenation,
      • $\mathcal{K} = \{\mathcal{P}, \mathcal{S}, \mathcal{R}, \mathcal{A}\}$ is the set of modalities,
      • $\phi_i^{k}$ is the uni-modal embedding and $w_k$ a trainable attention weight for modality $k$.
  • Contrastive Learning: Employs a contrastive loss to bring aligned entities closer together and push dissimilar samples farther apart in the learned representation space. It uses an Intra-Modal Contrastive Loss (ICL) and an Inter-Modal Alignment Loss (IAL), which are formulated analogously (an illustrative loss sketch follows this list).
    • A subset $E \subset \mathcal{F}$ of seed-aligned entity pairs is available for training. Formally, for the $i^{th}$ object node $n_1^i \in \mathcal{N}_1$, $E = \{n_1^i \;|\; n_2^i \in \mathcal{N}_2\}$, where $(n_1^i, n_2^i)$ is an aligned pair.
  • 3D Point Cloud Registration: The aligned entity pairs $(n_1^i, n_2^i)$ from the scene graphs $\mathcal{G}_1$ and $\mathcal{G}_2$ are used to register the corresponding 3D point clouds. For each entity pair, 3D point correspondences are extracted from $\mathcal{P}_1^i$ and $\mathcal{P}_2^i$ with an off-the-shelf correspondence extraction algorithm, and the rigid transformation $T \in \mathrm{SE}(3)$ between the point clouds of the two scenes is estimated with a robust estimator such as RANSAC (see the registration sketch after this list).
  • Dataset and Experimental Setup:
    • Evaluated on the 3RScan dataset, which contains 3D point clouds captured over time along with 3D scene graph annotations.
    • The dataset was augmented by creating sub-scene graphs to simulate partial overlaps.
    • PointNet was used as the object encoder.
  • Evaluation Metrics:
    • Node Alignment: Mean Reciprocal Rank (MRR) and Hits@K (K = 1, 2, 3, 4, 5) were used to evaluate node alignment performance (a short sketch of these retrieval metrics follows this list).
    • 3D Point Cloud Registration: Chamfer distance (CD), relative rotation error (RRE), relative translation error (RTE), feature match recall (FMR), and registration recall (RR) were used to evaluate the performance of 3D point cloud registration.
    • Scene Graph Alignment: Scene Graph Alignment Recall (SGAR) was used to evaluate alignment at the level of whole scene graphs.
  • Results:
    • SGAligner outperforms other methods and modality combinations in node matching, achieving an MRR of 0.950 and Hits@K values of 0.923, 0.957, 0.974, 0.982, and 0.987 for K = 1, 2, 3, 4, and 5, respectively.
    • In 3D point cloud registration, SGAligner reduces the relative translation error of state-of-the-art GeoTransformer by 40% and improves Chamfer distance by 49.4%. It also runs approximately three times faster than GeoTransformer during the overlap check. The paper reports a CD of 0.01111, RRE of 1.012, RTE of 1.67, FMR of 99.85, and RR of 99.40 (with ground truth 3D scene graphs and K=2).
    • SGAligner achieves a Scene Graph Alignment Recall of 0.964 with ground truth scene graphs using the top-2 matches.
  • Ablation Studies: The paper includes ablation studies to evaluate the impact of different modalities, overlap percentages, and the use of predicted scene graphs. The results indicate that each modality contributes to improved performance and that the method is robust to varying degrees of overlap. The paper also reports results under controlled semantic noise, showing the impact of incorrect semantic labels.
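
To make the fusion step concrete, the following is a minimal PyTorch sketch of the softmax-weighted concatenation in the joint-embedding equation above. The module name, the embedding dimensions, and the use of a single `nn.Parameter` vector for the per-modality weights $w_k$ are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class JointEmbedding(nn.Module):
    """Softmax-weighted concatenation of uni-modal embeddings (a sketch).

    Each modality embedding phi_i^k is scaled by softmax(w)_k, then the
    scaled embeddings are concatenated, mirroring the equation above.
    """

    def __init__(self, num_modalities: int = 4):
        super().__init__()
        # One trainable attention weight w_k per modality k in {P, S, R, A}.
        self.w = nn.Parameter(torch.zeros(num_modalities))

    def forward(self, phis: list[torch.Tensor]) -> torch.Tensor:
        # phis: uni-modal embeddings for N objects, each of shape (N, D_k).
        alpha = torch.softmax(self.w, dim=0)         # softmax over modalities
        scaled = [a * phi for a, phi in zip(alpha, phis)]
        return torch.cat(scaled, dim=-1)             # oplus: concatenation

# Usage: four modality embeddings for 8 objects, each 128-d (assumed sizes).
phis = [torch.randn(8, 128) for _ in range(4)]
joint = JointEmbedding()(phis)                       # shape: (8, 512)
```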
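
The summary does not spell out the exact form of the ICL and IAL losses, so the sketch below shows only the generic InfoNCE-style objective that such contrastive alignment losses typically instantiate: seed-aligned pairs $(n_1^i, n_2^i)$ are pulled together while other in-batch entities act as negatives. The temperature value and the symmetric two-direction form are assumptions.

```python
import torch
import torch.nn.functional as F

def contrastive_alignment_loss(z1: torch.Tensor,
                               z2: torch.Tensor,
                               temperature: float = 0.1) -> torch.Tensor:
    """InfoNCE-style sketch: row i of z1 and row i of z2 are the embeddings
    of a seed-aligned entity pair; all other rows serve as negatives."""
    z1 = F.normalize(z1, dim=-1)
    z2 = F.normalize(z2, dim=-1)
    logits = z1 @ z2.t() / temperature               # (B, B) similarities
    targets = torch.arange(z1.size(0), device=z1.device)
    # Symmetric: align G1 -> G2 and G2 -> G1.
    return 0.5 * (F.cross_entropy(logits, targets)
                  + F.cross_entropy(logits.t(), targets))

# Usage: a batch of 16 aligned pairs of 512-d joint embeddings (assumed).
loss = contrastive_alignment_loss(torch.randn(16, 512), torch.randn(16, 512))
```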
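
For the registration step, the rigid transform is fit to point correspondences pooled over aligned entity pairs, inside a robust estimator such as RANSAC. As a self-contained illustration of the inner model fit only, here is a closed-form Kabsch solve in NumPy; the correspondence extractor and the RANSAC loop around it are omitted, and this is not the paper's exact pipeline.

```python
import numpy as np

def estimate_rigid_transform(src: np.ndarray, dst: np.ndarray) -> np.ndarray:
    """Closed-form Kabsch fit of T in SE(3) such that dst ~ R @ src + t.

    src, dst: (N, 3) corresponding points pooled over aligned entity pairs.
    In practice this would be the model step inside a robust estimator
    like RANSAC, not a replacement for it.
    """
    c_src, c_dst = src.mean(axis=0), dst.mean(axis=0)
    H = (src - c_src).T @ (dst - c_dst)              # 3x3 cross-covariance
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))           # guard against reflection
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = c_dst - R @ c_src
    T = np.eye(4)
    T[:3, :3], T[:3, 3] = R, t
    return T

# Usage: recover a known 90-degree rotation plus translation (synthetic).
rng = np.random.default_rng(0)
src = rng.normal(size=(100, 3))
R_true = np.array([[0.0, -1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]])
dst = src @ R_true.T + np.array([0.5, -0.2, 1.0])
T = estimate_rigid_transform(src, dst)               # T[:3, :3] ~ R_true
```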
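
Finally, since node alignment is scored with MRR and Hits@K, the sketch below shows how these retrieval metrics are conventionally computed from a similarity matrix between the nodes of the two graphs. Placing the ground-truth matches on the diagonal is an assumption for illustration; the paper's exact evaluation protocol may differ.

```python
import numpy as np

def mrr_and_hits(sim: np.ndarray, ks=(1, 2, 3, 4, 5)):
    """Compute Mean Reciprocal Rank and Hits@K from a similarity matrix.

    sim[i, j]: similarity of node i in G1 to node j in G2; the true match
    of node i is assumed (for this sketch) to be node i.
    """
    order = np.argsort(-sim, axis=1)                 # candidates, best first
    # 1-based rank of the true match for every query node.
    ranks = np.argmax(order == np.arange(len(sim))[:, None], axis=1) + 1
    mrr = float(np.mean(1.0 / ranks))
    hits = {k: float(np.mean(ranks <= k)) for k in ks}
    return mrr, hits

# Usage: a dominant-diagonal similarity matrix for 10 nodes (synthetic).
sim = np.eye(10) + 0.1 * np.random.default_rng(0).random((10, 10))
mrr, hits = mrr_and_hits(sim)  # dominant diagonal -> MRR = 1.0, Hits@1 = 1.0
```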

In summary, the paper proposes a novel approach, SGAligner, for aligning 3D scene graphs. It demonstrates its effectiveness in various tasks, including node alignment, point cloud registration, and scene alignment with changes. The use of contrastive learning and multi-modal embeddings enables the method to handle real-world scenarios with unknown overlap and environmental changes. The authors also contribute a new benchmark on the 3RScan dataset to facilitate further research in this area.
