
Spatial-Temporal Knowledge-Embedded Transformer for Video Scene Graph Generation (2309.13237v3)

Published 23 Sep 2023 in cs.CV

Abstract: Video scene graph generation (VidSGG) aims to identify objects in visual scenes and infer their relationships for a given video. It requires not only a comprehensive understanding of each object scattered across the whole scene but also a deep dive into their temporal motions and interactions. Inherently, object pairs and their relationships exhibit spatial co-occurrence correlations within each image and temporal consistency/transition correlations across different images, which can serve as prior knowledge to facilitate VidSGG model learning and inference. In this work, we propose a spatial-temporal knowledge-embedded transformer (STKET) that incorporates this prior spatial-temporal knowledge into the multi-head cross-attention mechanism to learn more representative relationship representations. Specifically, we first learn spatial co-occurrence and temporal transition correlations in a statistical manner. Then, we design spatial and temporal knowledge-embedded layers that introduce the multi-head cross-attention mechanism to fully explore the interaction between the visual representations and the knowledge, generating spatial- and temporal-embedded representations, respectively. Finally, we aggregate these representations for each subject-object pair to predict the final semantic labels and relationships. Extensive experiments show that STKET outperforms current competing algorithms by a large margin, e.g., improving mR@50 by 8.1%, 4.7%, and 2.1% over current algorithms under different settings.
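To make the described mechanism concrete, below is a minimal PyTorch sketch of the two ideas from the abstract: a count-based spatial co-occurrence prior over predicates, and a knowledge-embedded layer that fuses visual pair features with an embedding of that prior via multi-head cross-attention. This is not the authors' released implementation; all module names, dimensions, and the exact fusion scheme (residual + feed-forward) are illustrative assumptions.

```python
# Hedged sketch of a spatial knowledge-embedded layer in the spirit of STKET.
# All names and hyperparameters here are assumptions, not the paper's code.
import torch
import torch.nn as nn


def cooccurrence_prior(triplets, num_objects, num_predicates):
    """Count-based prior P(predicate | subject, object) from training triplets.

    `triplets` is assumed to be an iterable of (subject_id, object_id, predicate_id).
    """
    counts = torch.zeros(num_objects, num_objects, num_predicates)
    for s, o, p in triplets:
        counts[s, o, p] += 1.0
    # Normalize per subject-object pair; epsilon avoids division by zero.
    return counts / (counts.sum(dim=-1, keepdim=True) + 1e-6)


class SpatialKnowledgeEmbeddedLayer(nn.Module):
    """Fuse visual relationship features with an embedding of the statistical
    spatial prior through multi-head cross-attention."""

    def __init__(self, dim=512, num_heads=8, num_predicates=26):
        super().__init__()
        # Project the predicate distribution into the visual feature space.
        self.knowledge_proj = nn.Linear(num_predicates, dim)
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.ReLU(), nn.Linear(4 * dim, dim))
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, rel_feats, prior_dist):
        # rel_feats:  [batch, num_pairs, dim]             visual subject-object pair features
        # prior_dist: [batch, num_pairs, num_predicates]  spatial co-occurrence prior
        knowledge = self.knowledge_proj(prior_dist)           # knowledge embeddings
        attended, _ = self.cross_attn(rel_feats, knowledge, knowledge)
        x = self.norm1(rel_feats + attended)                  # residual + norm
        return self.norm2(x + self.ffn(x))                    # spatial-embedded representation
```

A temporal knowledge-embedded layer would presumably follow the same pattern, but with the prior built from predicate transitions between adjacent frames rather than within-frame co-occurrences; the resulting spatial- and temporal-embedded representations are then aggregated per subject-object pair for the final predictions.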

Authors (5)
  1. Tao Pu (13 papers)
  2. Tianshui Chen (51 papers)
  3. Hefeng Wu (35 papers)
  4. Yongyi Lu (27 papers)
  5. Liang Lin (318 papers)
Citations (8)
