Weakly-Supervised 3D Scene Graph Generation via Visual-Linguistic Assisted Pseudo-labeling (2404.02527v1)
Abstract: Learning to build 3D scene graphs is essential for structured, rich perception of real-world environments. However, previous 3D scene graph generation methods adopt a fully supervised learning paradigm and require a large amount of entity-level annotations for objects and relations, which are extremely resource-consuming and tedious to obtain. To tackle this problem, we propose 3D-VLAP, a weakly-supervised 3D scene graph generation method via Visual-Linguistic Assisted Pseudo-labeling. Specifically, 3D-VLAP exploits the strong ability of current large-scale visual-linguistic models to align the semantics of text and 2D images, together with the naturally existing correspondences between 2D images and 3D point clouds, and thus implicitly constructs correspondences between text and 3D point clouds. First, we establish the positional correspondence from 3D point clouds to 2D images via camera intrinsic and extrinsic parameters, thereby aligning 3D point clouds with 2D images. Subsequently, a large-scale cross-modal visual-linguistic model is employed to indirectly align 3D instances with textual object category labels by matching 2D images against those labels. Pseudo labels for objects and relations are then produced for training 3D-VLAP by computing the similarity between the visual embeddings and the textual category embeddings of objects and relations, respectively, as encoded by the visual-linguistic model. Finally, we design an edge self-attention based graph neural network to generate scene graphs of 3D point cloud scenes. Extensive experiments demonstrate that 3D-VLAP achieves results comparable to current advanced fully supervised methods while significantly alleviating the data annotation burden.
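Two steps in the abstract lend themselves to a concrete illustration: projecting 3D instance points into a 2D frame using camera intrinsics and extrinsics, and scoring the projected image region against textual category prompts with a visual-linguistic model such as CLIP to obtain object pseudo-labels. The Python sketch below shows one way this could look; it is not the authors' implementation, and the function names, the prompt template, and the bounding-box cropping heuristic are illustrative assumptions.

```python
# Minimal sketch (assumed, not the paper's code) of visual-linguistic pseudo-labeling:
# project 3D points to a 2D image, crop the covered region, and score it against
# textual category prompts with OpenAI CLIP.

import numpy as np
import torch
import clip  # OpenAI CLIP, used as a stand-in visual-linguistic model
from PIL import Image

def project_points(points_3d, K, T_world_to_cam):
    """Project Nx3 world-space points to pixel coordinates.

    K: 3x3 camera intrinsic matrix.
    T_world_to_cam: 4x4 extrinsic matrix (world -> camera).
    Returns Nx2 pixel coordinates and a mask of points in front of the camera.
    """
    n = points_3d.shape[0]
    homo = np.hstack([points_3d, np.ones((n, 1))])      # Nx4 homogeneous coords
    cam = (T_world_to_cam @ homo.T).T[:, :3]            # Nx3 camera-space coords
    in_front = cam[:, 2] > 1e-6
    pix = (K @ cam.T).T                                 # Nx3 projected coords
    pix = pix[:, :2] / pix[:, 2:3]                      # perspective divide
    return pix, in_front

def pseudo_label_instance(image, pix, categories, model, preprocess, device="cpu"):
    """Assign a soft category pseudo-label to one 3D instance.

    image: PIL image that the instance projects into.
    pix: Nx2 projected pixel coordinates of the instance's points.
    categories: list of candidate object category names.
    """
    # Crop the 2D region covered by the projected points (simple bounding box).
    x0, y0 = pix.min(axis=0)
    x1, y1 = pix.max(axis=0)
    crop = image.crop((int(x0), int(y0), int(x1), int(y1)))

    image_input = preprocess(crop).unsqueeze(0).to(device)
    text_input = clip.tokenize([f"a photo of a {c}" for c in categories]).to(device)
    with torch.no_grad():
        img_emb = model.encode_image(image_input)
        txt_emb = model.encode_text(text_input)
    # Cosine similarity between the image embedding and each category embedding.
    img_emb = img_emb / img_emb.norm(dim=-1, keepdim=True)
    txt_emb = txt_emb / txt_emb.norm(dim=-1, keepdim=True)
    sims = (img_emb @ txt_emb.T).squeeze(0)
    return categories[int(sims.argmax())], sims.softmax(dim=-1)

# Usage (paths, matrices, and category list are placeholders):
# model, preprocess = clip.load("ViT-B/32", device="cpu")
# pix, mask = project_points(instance_points, K, T)
# label, probs = pseudo_label_instance(Image.open("frame.jpg"), pix[mask],
#                                      ["chair", "table", "sofa"], model, preprocess)
```

The same similarity-based assignment could in principle be applied to relation prompts between pairs of projected instances, which is how the abstract describes obtaining relation pseudo-labels; that part is omitted here for brevity.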
Authors: Xu Wang, Yifan Li, Qiudan Zhang, Wenhui Wu, Mark Junjie Li, Jianmin Jiang