Contrastive Video Question Answering via Video Graph Transformer (2302.13668v2)

Published 27 Feb 2023 in cs.CV and cs.MM

Abstract: We propose to perform video question answering (VideoQA) in a contrastive manner via a Video Graph Transformer model (CoVGT). CoVGT's uniqueness and superiority are three-fold: 1) It proposes a dynamic graph transformer module that encodes video by explicitly capturing the visual objects, their relations, and their dynamics for complex spatio-temporal reasoning. 2) It designs separate video and text transformers for contrastive learning between the video and text to perform QA, instead of a multi-modal transformer for answer classification; fine-grained video-text communication is handled by additional cross-modal interaction modules. 3) It is optimized by joint fully- and self-supervised contrastive objectives that contrast correct against incorrect answers and relevant against irrelevant questions, respectively. With its superior video encoding and QA solution, we show that CoVGT achieves substantially better performance than prior art on video reasoning tasks, even surpassing models pretrained on millions of external data samples. We further show that CoVGT can also benefit from cross-modal pretraining, yet with orders of magnitude less data. These results demonstrate the effectiveness and superiority of CoVGT and additionally reveal its potential for more data-efficient pretraining. We hope our success can advance VideoQA beyond coarse recognition/description towards fine-grained relational reasoning over video content. Our code is available at https://github.com/doc-doc/CoVGT.
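
To make the contrastive answer-selection idea in point 3) concrete, here is a minimal sketch of an InfoNCE-style objective over candidate answers. It is not the authors' implementation: the function name contrastive_qa_loss, the tensor shapes, and the temperature value are assumptions for illustration only.

```python
import torch
import torch.nn.functional as F

def contrastive_qa_loss(video_question_emb, answer_embs, correct_idx, temperature=0.07):
    """
    Illustrative contrastive objective over candidate answers.

    video_question_emb: (B, D)    fused video+question representation
    answer_embs:        (B, K, D) embeddings of K candidate answers per sample
    correct_idx:        (B,)      index of the correct answer among the K candidates
    """
    # Cosine similarity between each query and its candidate answers.
    q = F.normalize(video_question_emb, dim=-1)               # (B, D)
    a = F.normalize(answer_embs, dim=-1)                       # (B, K, D)
    logits = torch.einsum("bd,bkd->bk", q, a) / temperature    # (B, K)

    # Softmax cross-entropy pulls the correct answer close to the query
    # and pushes incorrect answers away (an InfoNCE-style contrast).
    return F.cross_entropy(logits, correct_idx)
```

A symmetric self-supervised term contrasting each question against irrelevant questions could be written the same way, mirroring the joint objective the abstract describes.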

Authors (7)
  1. Junbin Xiao (23 papers)
  2. Pan Zhou (220 papers)
  3. Angela Yao (101 papers)
  4. Yicong Li (34 papers)
  5. Richang Hong (117 papers)
  6. Shuicheng Yan (275 papers)
  7. Tat-Seng Chua (360 papers)
Citations (29)

