
SGFormer: Semantic Graph Transformer for Point Cloud-based 3D Scene Graph Generation (2303.11048v3)

Published 20 Mar 2023 in cs.CV

Abstract: In this paper, we propose SGFormer, a Semantic Graph TransFormer for point cloud-based 3D scene graph generation. The task aims to parse a point cloud-based scene into a semantic structural graph, with the core challenge of modeling the complex global structure. Existing methods based on graph convolutional networks (GCNs) suffer from over-smoothing and can only propagate information from a limited set of neighboring nodes. In contrast, SGFormer uses Transformer layers as the base building block to allow global information passing, with two types of newly designed layers tailored to the 3D scene graph generation task. Specifically, we introduce the graph embedding layer to best utilize the global information in graph edges while maintaining comparable computation costs. Furthermore, we propose the semantic injection layer, which leverages linguistic knowledge from a large language model (LLM), i.e., ChatGPT, to enhance objects' visual features. We benchmark SGFormer on the established 3DSSG dataset and achieve a 40.94% absolute improvement in R@50 for relationship prediction and an 88.36% boost on the subset with complex scenes over the state of the art. Our analyses further show SGFormer's superiority in the long-tail and zero-shot scenarios. Our source code is available at https://github.com/Andy20178/SGFormer.
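
The contrast the abstract draws between neighbor-limited GCN propagation and global information passing can be illustrated with a minimal sketch. The code below is not the SGFormer implementation: it is a single-head self-attention step over all scene-graph nodes, written under simplifying assumptions (no learned query/key/value projections, and a hypothetical `edge_bias` matrix standing in for edge information added to the attention logits). The names `softmax`, `global_attention`, `node_feats`, and `edge_bias` are illustrative only.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def global_attention(node_feats, edge_bias):
    """One attention step in which every node attends to every other node.

    node_feats: list of N feature vectors, each of length d.
    edge_bias:  N x N list of scalars added to the attention logits
                (assumed stand-in for edge information; hypothetical).
    Returns a list of N updated feature vectors.
    """
    n = len(node_feats)
    d = len(node_feats[0])
    updated = []
    for i in range(n):
        # Scaled dot-product logits against all nodes, not only graph neighbors.
        logits = [
            sum(a * b for a, b in zip(node_feats[i], node_feats[j])) / math.sqrt(d)
            + edge_bias[i][j]
            for j in range(n)
        ]
        weights = softmax(logits)
        # Each updated feature is a convex combination of all node features.
        updated.append(
            [sum(weights[j] * node_feats[j][k] for j in range(n)) for k in range(d)]
        )
    return updated


if __name__ == "__main__":
    feats = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    no_bias = [[0.0] * 3 for _ in range(3)]
    for row in global_attention(feats, no_bias):
        print(row)
```

In this sketch a node can draw on any other node in a single step, whereas a GCN layer would restrict the sum to adjacent nodes; a large `edge_bias[i][j]` pushes node `i` to attend almost entirely to node `j`.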

Authors (5)
  1. Changsheng Lv (10 papers)
  2. Mengshi Qi (32 papers)
  3. Xia Li (101 papers)
  4. Zhengyuan Yang (86 papers)
  5. Huadong Ma (52 papers)