
SGAligner : 3D Scene Alignment with Scene Graphs (2304.14880v2)

Published 28 Apr 2023 in cs.CV

Abstract: Building 3D scene graphs has recently emerged as a topic in scene representation for several embodied AI applications to represent the world in a structured and rich manner. With their increased use in solving downstream tasks (e.g., navigation and room rearrangement), can we leverage and recycle them for creating 3D maps of environments, a pivotal step in agent operation? We focus on the fundamental problem of aligning pairs of 3D scene graphs whose overlap can range from zero to partial and can contain arbitrary changes. We propose SGAligner, the first method for aligning pairs of 3D scene graphs that is robust to in-the-wild scenarios (i.e., unknown overlap, if any, and changes in the environment). We draw inspiration from multi-modality knowledge graphs and use contrastive learning to learn a joint, multi-modal embedding space. We evaluate on the 3RScan dataset and further showcase that our method can be used for estimating the transformation between pairs of 3D scenes. Since benchmarks for these tasks are missing, we create them on this dataset. The code, benchmark, and trained models are available on the project website.

References (53)
  1. Taskography: Evaluating robot task planning over large 3d scene graphs. In Conference on Robot Learning, pages 46–58. PMLR, 2022.
  2. 3d scene graph: A structure for unified semantics, 3d space, and camera. In Proceedings of the IEEE/CVF international conference on computer vision, pages 5664–5673, 2019.
  3. PointNet on FPGA for real-time LiDAR point cloud processing. October 2020.
  4. D3feat: Joint learning of dense detection and description of 3d local features. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 6359–6367, 2020.
  5. Graph-cut ransac. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 6733–6741, 2018.
  6. Magsac++, a fast, reliable and accurate robust estimator. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 1304–1312, 2020.
  7. D-lite: Navigation-oriented compression of 3d scene graphs under communication constraints. arXiv preprint arXiv:2209.06111, 2022.
  8. Mmea: entity alignment for multi-modal knowledge graph. In Knowledge Science, Engineering and Management: 13th International Conference, KSEM 2020, Hangzhou, China, August 28–30, 2020, Proceedings, Part I 13, pages 134–147. Springer, 2020.
  9. Multi-modal siamese network for entity alignment. In Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, pages 118–126, 2022.
  10. Multijaf: Multi-modal joint entity alignment framework for multi-modal knowledge graph. Neurocomputing, 500:581–591, 2022.
  11. Graph-to-3d: End-to-end generation and manipulation of 3d scenes using scene graphs. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 16352–16361, 2021.
  12. Random sample consensus: A paradigm for model fitting with applications to image analysis and automated cartography. Communications of the ACM, 1981.
  13. Continuous scene representations for embodied ai. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 14849–14859, June 2022.
  14. Multi-modal entity alignment in hyperbolic space. Neurocomputing, 461:598–607, 2021.
  15. Pct: Point cloud transformer. Computational Visual Media, 7(2):187–199, Apr 2021.
  16. Predator: Registration of 3d point clouds with low overlap. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2021.
  17. Hydra: A real-time spatial perception system for 3D scene graph construction and optimization. 2022.
  18. Aggregating local descriptors into a compact image representation. In 2010 IEEE computer society conference on computer vision and pattern recognition, pages 3304–3311. IEEE, 2010.
  19. Sequential manipulation planning on scene graph. In 2022 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 8203–8210. IEEE, 2022.
  20. 3-d scene graph: A sparse and semantic representation of physical environments for intelligent agents. IEEE transactions on cybernetics, 50(12):4921–4933, 2019.
  21. Visual genome: Connecting language and vision using crowdsourced dense image annotations. International journal of computer vision, 123:32–73, 2017.
  22. Embodied semantic scene graph generation. In Conference on Robot Learning, pages 1585–1594. PMLR, 2022.
  23. Graph matching networks for learning the similarity of graph structured objects. In International conference on machine learning, pages 3835–3845. PMLR, 2019.
  24. Remote object navigation for service robots using hierarchical knowledge graph in human-centered environments. Intelligent Service Robotics, 15(4):459–473, 2022.
  25. Multi-modal contrastive representation learning for entity alignment. arXiv preprint arXiv:2209.00891, 2022.
  26. Visual pivoting for (unsupervised) entity alignment. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 35, pages 4257–4266, 2021.
  27. Hal: Improved text-image matching by mitigating visual semantic hubs. Proceedings of the AAAI Conference on Artificial Intelligence, 34:11563–11571, 04 2020.
  28. Mmkg: multi-modal knowledge graphs. In The Semantic Web: 16th International Conference, ESWC 2019, Portorož, Slovenia, June 2–6, 2019, Proceedings 16, pages 459–474. Springer, 2019.
  29. 3d vsg: Long-term semantic scene change prediction through 3d variable scene graphs. arXiv preprint arXiv:2209.07896, 2022.
  30. Decoupled weight decay regularization. In International Conference on Learning Representations, 2019.
  31. Pointnet: Deep learning on point sets for 3d classification and segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 652–660, 2017.
  32. Geometric transformer for fast and robust point cloud registration. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2022.
  33. Usac: A universal framework for random sample consensus. IEEE transactions on pattern analysis and machine intelligence, 35(8):2022–2038, 2012.
  34. Bridging scene understanding and task execution with flexible simulation environments. arXiv preprint arXiv:2011.10452, 2020.
  35. Hierarchical representations and explicit memory: Learning effective navigation policies on 3d scene graphs using graph neural networks. In 2022 International Conference on Robotics and Automation (ICRA), pages 9272–9279, 2022.
  36. 3d dynamic scene graphs: Actionable spatial perception with places, objects, and humans. arXiv preprint arXiv:2002.06289, 2020.
  37. Kimera: From slam to spatial perception with 3d dynamic scene graphs. The International Journal of Robotics Research, 40(12-14):1510–1546, 2021.
  38. Fast point feature histograms (fpfh) for 3d registration. In 2009 IEEE international conference on robotics and automation, pages 3212–3217. IEEE, 2009.
  39. A deep learning based behavioral approach to indoor autonomous navigation. In 2018 IEEE international conference on robotics and automation (ICRA), pages 4646–4653. IEEE, 2018.
  40. Probing the impacts of visual context in multimodal entity alignment. In Web and Big Data: 6th International Joint Conference, APWeb-WAIM 2022, Nanjing, China, November 25–27, 2022, Proceedings, Part II, pages 255–270. Springer, 2023.
  41. NeuralRecon: Real-time coherent 3D reconstruction from monocular video. CVPR, 2021.
  42. Graph attention networks. In International Conference on Learning Representations, 2018.
  43. Rio: 3d object instance re-localization in changing indoor environments. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 7658–7667, 2019.
  44. Learning 3d semantic scene graphs from 3d indoor reconstructions. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3961–3970, 2020.
  45. Scenegraphfusion: Incremental 3d scene graph prediction from rgb-d sequences. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7515–7525, 2021.
  46. Rpm-net: Robust point matching using learned features. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 11824–11833, 2020.
  47. Regtr: End-to-end point cloud correspondences with transformers. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2022.
  48. 3dmatch: Learning local geometric descriptors from rgb-d reconstructions. In CVPR, 2017.
  49. Exploiting edge-oriented reasoning for 3d point-based scene graph analysis. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 9705–9715, June 2021.
  50. Multi-view knowledge graph embedding for entity alignment. arXiv preprint arXiv:1906.02390, 2019.
  51. A dual representation framework for robot learning with human guidance. In 6th Annual Conference on Robot Learning, 2022.
  52. Knowledge-inspired 3d scene graph prediction in point cloud. In Marc’Aurelio Ranzato, Alina Beygelzimer, Yann N. Dauphin, Percy Liang, and Jennifer Wortman Vaughan, editors, Advances in Neural Information Processing Systems 34: Annual Conference on Neural Information Processing Systems 2021, NeurIPS 2021, December 6-14, 2021, virtual, pages 18620–18632, 2021.
  53. Yu Zhong. Intrinsic shape signatures: A shape descriptor for 3d object recognition. In 2009 IEEE 12th international conference on computer vision workshops, ICCV workshops, pages 689–696. IEEE, 2009.

Summary

  • The paper introduces SGAligner, a novel method that aligns pairs of 3D scene graphs robustly, even with unknown overlap and environmental changes.
  • SGAligner utilizes a multi-modal architecture combining object, structure, attribute, and relationship embeddings, trained with contrastive learning to match corresponding entities.
  • Evaluations on the 3RScan dataset show that SGAligner significantly improves 3D point cloud registration accuracy and recall over state-of-the-art methods; the authors also release a new scene graph alignment benchmark.

This paper introduces SGAligner, a novel method for aligning 3D scene graphs. The key innovation is the method's robustness to real-world scenarios, specifically its ability to handle unknown overlap and changes within the environment. The method is inspired by multi-modality knowledge graphs and uses contrastive learning to learn a joint, multi-modal embedding space.

The paper makes the following claims:

  • It introduces SGAligner, the first method for aligning pairs of 3D scene graphs whose overlap can range from zero to partial and which may contain changes.
  • It demonstrates the potential of SGAligner on tasks such as 3D point cloud registration and 3D point cloud mosaicking, as well as 3D alignment of a point cloud in a larger map that contains changes.
  • It creates a scene graph alignment and 3D point cloud registration benchmark on the 3RScan dataset, providing data, metrics, and an evaluation procedure.

Here's a breakdown of the approach and results:

  • Problem Formulation: The paper addresses the challenge of aligning two 3D scene graphs, $\mathcal{G}_1 = (\mathcal{N}_1, \mathcal{R}_1)$ and $\mathcal{G}_2 = (\mathcal{N}_2, \mathcal{R}_2)$, representing scenes $s_1$ and $s_2$, respectively. The objective is to identify corresponding objects in overlapping regions, denoted as entity pairs $\mathcal{F} = \{(n_1, n_2) \;|\; n_1 \equiv n_2,\; n_1 \in \mathcal{N}_1,\; n_2 \in \mathcal{N}_2\}$, even when the overlap between the graphs is partial or non-existent and the scenes contain changes.
  • SGAligner Architecture: The SGAligner architecture draws inspiration from entity alignment methods in multi-modality knowledge graphs. It encodes semantic entities, their attributes, and relationships between entities using separate modalities. The core idea is to encode each of these modalities independently and then learn a joint embedding that can effectively determine the similarity between any two nodes. The architecture is composed of the following uni-modal embeddings:
    • Object Embedding: Employs a point cloud feature extractor backbone (e.g., PointNet) to extract visual features $\phi_i^{\mathcal{P}}$ from the point cloud $\mathcal{P}_i$ of each object instance $\mathcal{O}_i$.
    • Structure Embedding: Uses a Graph Attention Network (GAT) to model structural information in $\mathcal{G}_1$ and $\mathcal{G}_2$. Node features represent the relative translation between object instances, and edges represent the relationships between the nodes. The resulting neighborhood structure embedding is denoted $\phi_i^{\mathcal{S}}$.
    • Meta Embeddings: Encodes the attributes $\mathcal{A}$ and relationships $\mathcal{R}$ of each object $\mathcal{O}_i$ using one-hot encoded feature vectors passed through a single-layer MLP, yielding embeddings $\phi_i^{\mathcal{A}}$ and $\phi_i^{\mathcal{R}}$, respectively.
    • Joint Embedding: Concatenates the uni-modal features into a single compact representation $\hat{\phi}_i$ for each object $\mathcal{O}_i$ (a code sketch of this fusion appears after this list):

      $\hat{\phi}_i = \bigoplus_{k \in \mathcal{K}}\left[\frac{\exp(w_k)}{\sum_{j \in \mathcal{K}} \exp(w_j)}\, \phi_i^{k}\right]$

      where:

      • $\oplus$ denotes concatenation,
      • $\mathcal{K} = \{\mathcal{P}, \mathcal{S}, \mathcal{R}, \mathcal{A}\}$ is the set of modalities,
      • $\phi_i^{k}$ is the uni-modal embedding and $w_k$ a trainable attention weight for modality $k$.
  • Contrastive Learning: Employs a contrastive loss to bring aligned entities closer together and push dissimilar samples farther apart in the learned representation space. It uses an Intra-Modal Contrastive Loss (ICL) and an Inter-Modal Alignment Loss (IAL), which are formulated analogously (an illustrative loss sketch follows this list).
    • A subset $E \subset \mathcal{F}$ of seed-aligned entity pairs is available for training. Formally, for the $i^{th}$ object node $n_1^i \in \mathcal{N}_1$, $E = \{n_1^i \;|\; n_2^i \in \mathcal{N}_2\}$, where $(n_1^i, n_2^i)$ is an aligned pair.
  • 3D Point Cloud Registration: The aligned entity pairs $(n_1^i, n_2^i)$ from the scene graphs $\mathcal{G}_1$ and $\mathcal{G}_2$ are used to register the corresponding 3D point clouds. For each entity pair, 3D point correspondences are extracted from $\mathcal{P}_1^i$ and $\mathcal{P}_2^i$ with an off-the-shelf correspondence extraction algorithm, and the rigid transformation $T \in \mathrm{SE}(3)$ between the point clouds of the two scenes is estimated with a robust estimator such as RANSAC (see the registration sketch after this list).
  • Dataset and Experimental Setup:
    • Evaluated on the 3RScan dataset, which contains 3D point clouds captured over time along with 3D scene graph annotations.
    • The dataset was augmented by creating sub-scene graphs to simulate partial overlaps.
    • PointNet was used as the object encoder.
  • Evaluation Metrics:
    • Node Alignment: Mean Reciprocal Rank (MRR) and Hits@K (K = 1, 2, 3, 4, 5) were used to evaluate node alignment performance (a short sketch of these retrieval metrics follows this list).
    • 3D Point Cloud Registration: Chamfer distance (CD), relative rotation error (RRE), relative translation error (RTE), feature match recall (FMR), and registration recall (RR) were used to evaluate the performance of 3D point cloud registration.
    • Scene Graph Alignment: Scene Graph Alignment Recall (SGAR) was used to evaluate alignment at the level of whole scene graphs.
  • Results:
    • SGAligner outperforms other methods and modality combinations in node matching, achieving an MRR of 0.950 and Hits@K values of 0.923, 0.957, 0.974, 0.982, and 0.987 for K = 1, 2, 3, 4, and 5, respectively.
    • In 3D point cloud registration, SGAligner reduces the relative translation error of state-of-the-art GeoTransformer by 40% and improves Chamfer distance by 49.4%. It also runs approximately three times faster than GeoTransformer during the overlap check. The paper reports a CD of 0.01111, RRE of 1.012, RTE of 1.67, FMR of 99.85, and RR of 99.40 (with ground truth 3D scene graphs and K=2).
    • SGAligner achieves a Scene Graph Alignment Recall of 0.964 with ground truth scene graphs using the top-2 matches.
  • Ablation Studies: The paper includes ablation studies to evaluate the impact of different modalities, overlap percentages, and the use of predicted scene graphs. The results indicate that each modality contributes to improved performance and that the method is robust to varying degrees of overlap. The paper also reports results under controlled semantic noise, showing the impact of incorrect semantic labels.
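
To make the fusion step concrete, the following is a minimal PyTorch sketch of the softmax-weighted concatenation in the joint-embedding equation above. The module name, the embedding dimensions, and the use of a single `nn.Parameter` vector for the per-modality weights $w_k$ are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class JointEmbedding(nn.Module):
    """Softmax-weighted concatenation of uni-modal embeddings (a sketch).

    Each modality embedding phi_i^k is scaled by softmax(w)_k, then the
    scaled embeddings are concatenated, mirroring the equation above.
    """

    def __init__(self, num_modalities: int = 4):
        super().__init__()
        # One trainable attention weight w_k per modality k in {P, S, R, A}.
        self.w = nn.Parameter(torch.zeros(num_modalities))

    def forward(self, phis: list[torch.Tensor]) -> torch.Tensor:
        # phis: uni-modal embeddings for N objects, each of shape (N, D_k).
        alpha = torch.softmax(self.w, dim=0)         # softmax over modalities
        scaled = [a * phi for a, phi in zip(alpha, phis)]
        return torch.cat(scaled, dim=-1)             # oplus: concatenation

# Usage: four modality embeddings for 8 objects, each 128-d (assumed sizes).
phis = [torch.randn(8, 128) for _ in range(4)]
joint = JointEmbedding()(phis)                       # shape: (8, 512)
```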
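
The summary does not spell out the exact form of the ICL and IAL losses, so the sketch below shows only the generic InfoNCE-style objective that such contrastive alignment losses typically instantiate: seed-aligned pairs $(n_1^i, n_2^i)$ are pulled together while other in-batch entities act as negatives. The temperature value and the symmetric two-direction form are assumptions.

```python
import torch
import torch.nn.functional as F

def contrastive_alignment_loss(z1: torch.Tensor,
                               z2: torch.Tensor,
                               temperature: float = 0.1) -> torch.Tensor:
    """InfoNCE-style sketch: row i of z1 and row i of z2 are the embeddings
    of a seed-aligned entity pair; all other rows serve as negatives."""
    z1 = F.normalize(z1, dim=-1)
    z2 = F.normalize(z2, dim=-1)
    logits = z1 @ z2.t() / temperature               # (B, B) similarities
    targets = torch.arange(z1.size(0), device=z1.device)
    # Symmetric: align G1 -> G2 and G2 -> G1.
    return 0.5 * (F.cross_entropy(logits, targets)
                  + F.cross_entropy(logits.t(), targets))

# Usage: a batch of 16 aligned pairs of 512-d joint embeddings (assumed).
loss = contrastive_alignment_loss(torch.randn(16, 512), torch.randn(16, 512))
```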
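
For the registration step, the rigid transform is fit to point correspondences pooled over aligned entity pairs, inside a robust estimator such as RANSAC. As a self-contained illustration of the inner model fit only, here is a closed-form Kabsch solve in NumPy; the correspondence extractor and the RANSAC loop around it are omitted, and this is not the paper's exact pipeline.

```python
import numpy as np

def estimate_rigid_transform(src: np.ndarray, dst: np.ndarray) -> np.ndarray:
    """Closed-form Kabsch fit of T in SE(3) such that dst ~ R @ src + t.

    src, dst: (N, 3) corresponding points pooled over aligned entity pairs.
    In practice this would be the model step inside a robust estimator
    like RANSAC, not a replacement for it.
    """
    c_src, c_dst = src.mean(axis=0), dst.mean(axis=0)
    H = (src - c_src).T @ (dst - c_dst)              # 3x3 cross-covariance
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))           # guard against reflection
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = c_dst - R @ c_src
    T = np.eye(4)
    T[:3, :3], T[:3, 3] = R, t
    return T

# Usage: recover a known 90-degree rotation plus translation (synthetic).
rng = np.random.default_rng(0)
src = rng.normal(size=(100, 3))
R_true = np.array([[0.0, -1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]])
dst = src @ R_true.T + np.array([0.5, -0.2, 1.0])
T = estimate_rigid_transform(src, dst)               # T[:3, :3] ~ R_true
```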
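
Finally, since node alignment is scored with MRR and Hits@K, the sketch below shows how these retrieval metrics are conventionally computed from a similarity matrix between the nodes of the two graphs. Placing the ground-truth matches on the diagonal is an assumption for illustration; the paper's exact evaluation protocol may differ.

```python
import numpy as np

def mrr_and_hits(sim: np.ndarray, ks=(1, 2, 3, 4, 5)):
    """Compute Mean Reciprocal Rank and Hits@K from a similarity matrix.

    sim[i, j]: similarity of node i in G1 to node j in G2; the true match
    of node i is assumed (for this sketch) to be node i.
    """
    order = np.argsort(-sim, axis=1)                 # candidates, best first
    # 1-based rank of the true match for every query node.
    ranks = np.argmax(order == np.arange(len(sim))[:, None], axis=1) + 1
    mrr = float(np.mean(1.0 / ranks))
    hits = {k: float(np.mean(ranks <= k)) for k in ks}
    return mrr, hits

# Usage: a dominant-diagonal similarity matrix for 10 nodes (synthetic).
sim = np.eye(10) + 0.1 * np.random.default_rng(0).random((10, 10))
mrr, hits = mrr_and_hits(sim)  # dominant diagonal -> MRR = 1.0, Hits@1 = 1.0
```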

In summary, the paper proposes a novel approach, SGAligner, for aligning 3D scene graphs. It demonstrates its effectiveness in various tasks, including node alignment, point cloud registration, and scene alignment with changes. The use of contrastive learning and multi-modal embeddings enables the method to handle real-world scenarios with unknown overlap and environmental changes. The authors also contribute a new benchmark on the 3RScan dataset to facilitate further research in this area.
