Spatial-Temporal Knowledge-Embedded Transformer for Video Scene Graph Generation (2309.13237v3)
Abstract: Video scene graph generation (VidSGG) aims to identify objects in visual scenes and infer their relationships for a given video. It requires not only a comprehensive understanding of each object scattered throughout the scene but also a close analysis of their motions and interactions over time. Inherently, object pairs and their relationships exhibit spatial co-occurrence correlations within each image and temporal consistency/transition correlations across different images, which can serve as prior knowledge to facilitate VidSGG model learning and inference. In this work, we propose a spatial-temporal knowledge-embedded transformer (STKET) that incorporates this prior spatial-temporal knowledge into the multi-head cross-attention mechanism to learn more representative relationship representations. Specifically, we first learn spatial co-occurrence and temporal transition correlations in a statistical manner. Then, we design spatial and temporal knowledge-embedded layers that use multi-head cross-attention to fully explore the interaction between visual representations and the learned knowledge, generating spatial- and temporal-embedded representations, respectively. Finally, we aggregate these representations for each subject-object pair to predict the final semantic labels and their relationships. Extensive experiments show that STKET outperforms current competing algorithms by a large margin, e.g., improving mR@50 by 8.1%, 4.7%, and 2.1% under different settings.
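To make the knowledge-embedded attention idea concrete, below is a minimal PyTorch sketch of how visual subject-object pair features might query knowledge embeddings derived from statistical priors via multi-head cross-attention. The module structure, dimensions, and the `num_predicates` parameter are illustrative assumptions, not the authors' exact implementation.

```python
import torch
import torch.nn as nn

class KnowledgeEmbeddedLayer(nn.Module):
    """Sketch of a knowledge-embedded cross-attention layer (assumed design).

    Visual relationship features act as queries over a bank of knowledge
    embeddings (keys/values) that would be derived from spatial co-occurrence
    or temporal transition statistics over predicate classes.
    """

    def __init__(self, dim=512, num_heads=8, num_predicates=26):
        super().__init__()
        # Knowledge bank; in practice it could be initialized or conditioned
        # on the statistically learned co-occurrence/transition priors.
        self.knowledge = nn.Embedding(num_predicates, dim)
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(dim, dim * 4), nn.ReLU(), nn.Linear(dim * 4, dim))
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, rel_feats):
        # rel_feats: (batch, num_pairs, dim) visual features of subject-object pairs
        b = rel_feats.size(0)
        kv = self.knowledge.weight.unsqueeze(0).expand(b, -1, -1)  # (batch, num_predicates, dim)
        attn_out, _ = self.cross_attn(query=rel_feats, key=kv, value=kv)
        x = self.norm1(rel_feats + attn_out)   # residual + norm around cross-attention
        x = self.norm2(x + self.ffn(x))        # feed-forward refinement
        return x                               # knowledge-embedded relationship representations
```

Stacking one such layer with spatial co-occurrence knowledge and another with temporal transition knowledge, then aggregating their outputs per subject-object pair, would approximate the spatial- and temporal-embedded representations described in the abstract.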
Authors: Tao Pu, Tianshui Chen, Hefeng Wu, Yongyi Lu, Liang Lin