Answering Diverse Questions via Text Attached with Key Audio-Visual Clues (2403.06679v1)
Abstract: Audio-visual question answering (AVQA) requires reasoning over both video content and auditory information, and correlating them with the question to predict the most precise answer. Although mining deeper layers of audio-visual information to interact with questions facilitates multimodal fusion, the redundancy of audio-visual parameters tends to reduce the generalization of the inference engine across the multiple question-answer pairs attached to a single video. Moreover, the naturally heterogeneous relationship between audio-visual content and text makes perfect fusion difficult to achieve. To prevent high-level audio-visual semantics from weakening the network's adaptability to diverse question types, we propose a framework that performs mutual correlation distillation (MCD) to aid question inference. MCD consists of three main steps: 1) a residual structure enhances the audio-visual soft associations produced by self-attention, after which key local audio-visual features relevant to the question context are captured hierarchically by shared aggregators and coupled, in the form of clues, with specific question vectors; 2) knowledge distillation aligns audio-visual-text pairs in a shared latent space to narrow the cross-modal semantic gap; 3) the audio-visual dependencies are decoupled by discarding the decision-level integrations. We evaluate the proposed method on two publicly available datasets containing multiple question-answer pairs, i.e., Music-AVQA and AVQA. Experiments show that our method outperforms other state-of-the-art methods, and an interesting finding is that removing deep audio-visual features during inference effectively mitigates overfitting. The source code is released at http://github.com/rikeilong/MCD-forAVQA.
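The minimal PyTorch sketch below illustrates the three steps named in the abstract. It is not the authors' implementation: the module names, feature dimensions, answer vocabulary size, and the contrastive form of the alignment term are all illustrative assumptions; only the overall structure (residual self-attention, question-coupled clues, latent-space alignment, and dropping the deep audio-visual branch at decision level) follows the abstract.

```python
# Illustrative sketch only; not the released MCD code.
import torch
import torch.nn as nn
import torch.nn.functional as F


class ClueAggregator(nn.Module):
    """Step 1 (sketch): residual self-attention over audio-visual features,
    followed by a question-conditioned aggregator that extracts key local clues."""

    def __init__(self, dim: int = 512, heads: int = 8):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.aggregator = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, av_feats: torch.Tensor, question: torch.Tensor) -> torch.Tensor:
        # Residual connection strengthens the audio-visual "soft associations".
        attn_out, _ = self.self_attn(av_feats, av_feats, av_feats)
        av_feats = av_feats + attn_out
        # The question vector queries the enhanced features; the pooled output
        # serves as a question-coupled clue vector.
        clue, _ = self.aggregator(question.unsqueeze(1), av_feats, av_feats)
        return clue.squeeze(1)


def alignment_loss(av_clue: torch.Tensor, text_emb: torch.Tensor, tau: float = 0.07) -> torch.Tensor:
    """Step 2 (sketch): align audio-visual clues with text embeddings in a shared
    latent space; a contrastive stand-in for the paper's distillation objective."""
    av = F.normalize(av_clue, dim=-1)
    txt = F.normalize(text_emb, dim=-1)
    logits = av @ txt.t() / tau                        # pairwise cosine similarities
    targets = torch.arange(av.size(0), device=av.device)
    return F.cross_entropy(logits, targets)


# Step 3 (sketch): at inference the deep audio-visual branch is discarded and the
# answer head consumes only the question-coupled clue vector.
if __name__ == "__main__":
    encoder = ClueAggregator()
    av_feats = torch.randn(4, 60, 512)                 # batch of 4, 60 audio-visual tokens (assumed shape)
    question = torch.randn(4, 512)                     # pooled question embeddings (assumed shape)
    clue = encoder(av_feats, question)                 # (4, 512)
    loss = alignment_loss(clue, torch.randn(4, 512))   # training-time alignment term
    answer_head = nn.Linear(512, 42)                   # 42 = hypothetical answer vocabulary size
    print(answer_head(clue).shape, loss.item())        # torch.Size([4, 42])
```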
- G. Li, Y. Wei, Y. Tian, C. Xu, J. Wen, and D. Hu, “Learning to answer questions in dynamic audio-visual scenarios,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19086–19096, 2022.
- P. Yang, X. Wang, X. Duan, H. Chen, R. Hou, C. Jin, and W. Zhu. “AVQA: A dataset for audio-visual question answering on videos,” in Proceedings of the 30th ACM International Conference on Multimedia, pp. 3480–3491, 2022.
- H. Yun, Y. Yu, W. Yang, K. Lee, and G. Kim. “Pano-AVQA: Grounded audio-visual question answering on 360° videos,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 2031–2041, 2021.
- W. Hou, G. Li, Y. Tian, and D. Hu. “Towards long form audio-visual video understanding,” arXiv:2306.09431, 2023.
- D. Hu, Y. Wei, R. Qian, W. Lin, R. Song, and J. Wen. “Class-aware sounding objects localization via audiovisual correspondence,” in IEEE Transactions on Pattern Analysis and Machine Intelligence, pp. 9844-9859, 2021.
- X. Jiang, X. Xu, Z. Chen, J. Zhang, J. Song, F. Shen, H. Lu, and H. Shen. “DHHN: Dual hierarchical hybrid network for weakly-supervised audio-visual video parsing,” in Proceedings of the 30th ACM International Conference on Multimedia, pp. 719–727, 2022.
- X. Peng, Y. Wei, A. Deng, D. Wang, and D. Hu. “Balanced multimodal learning via on-the-fly gradient modulation,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 8238–8247, 2022.
- Y. Tian, D. Li, and C. Xu. “Unified multisensory perception: Weakly-supervised audio-visual video parsing,” in European Conference on Computer Vision, pp. 436–454, 2020.
- Y. Tian, J. Shi, B. Li, Z. Duan, and C. Xu. “Audio-visual event localization in unconstrained videos,” in European Conference on Computer Vision, pp. 252-268, 2018.
- M.-I. Georgescu, E. Fonseca, R. Ionescu, M. Lucic, C. Schmid, and A. Arnab. “Audiovisual masked autoencoders,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16144-16154, 2023.
- W. Pian, S. Mo, Y. Guo, and Y. Tian. “Audio-visual class-incremental learning,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 7799-7811, 2023.
- A. Guzhov, F. Raue, J. Hees, and A. Dengel. “AudioCLIP: Extending CLIP to image, text and audio,” in IEEE International Conference on Acoustics, Speech, and Signal Processing, pp. 976–980, 2022.
- Y. Gong, A. Rouditchenko, A. H. Liu, D. Harwath, L. Karlinsky, H. Kuehne, and J. Glass. “Contrastive audio-visual masked autoencoder,” arXiv:2210.07839, 2023.
- G. Li, W. Hou, and D. Hu. “Progressive spatio-temporal perception for audio-visual question answering,” in Proceedings of the 31st ACM International Conference on Multimedia, 2023.
- S. Antol, A. Agrawal, J. Lu, M. Mitchell, D. Batra, C. L. Zitnick, and D. Parikh. “VQA: Visual question answering,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 2425–2433, 2015.
- M. Tapaswi, Y. Zhu, R. Stiefelhagen, A. Torralba, R. Urtasun, and S. Fidler. “MovieQA: Understanding stories in movies through question-answering,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 4631-4640, 2016.
- Y. Jang, Y. Song, Y. Yu, Y. Kim, and G. Kim. “TGIF-QA: Toward spatio-temporal reasoning in visual question answering,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1359-1367, 2017.
- Y. Ye, Z. Zhao, Y. Li, L. Chen, J. Xiao, and Y. Zhuang. “Video question answering via attribute augmented attention network learning,” in International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 829–832, 2017.
- K. Kim, M. Heo, S. Choi, and B. Zhang. “DeepStory: Video story QA by deep embedded memory networks,” in International Joint Conference on Artificial Intelligence, 2017.
- J. Lei, L. Yu, M. Bansal, and T. Berg. “TVQA: Localized, compositional video question answering,” in Conference on Empirical Methods in Natural Language Processing, pp. 1369-1379, 2018.
- Z. Yu, D. Xu, J. Yu, T. Yu, Z. Zhao, Y. Zhuang, and D. Tao. “ActivityNet-QA: A dataset for understanding complex web videos via question answering,” in AAAI Conference on Artificial Intelligence, pp. 9127-9134, 2019.
- N. Garcia, M. Otani, C. Chu, and Y. Nakashima. “KnowIT VQA: Answering knowledge-based questions about videos,” in AAAI Conference on Artificial Intelligence, pp. 10826–10834, 2020.
- L. Xu, H. Huang, and J. Liu. “SUTD-TrafficQA: A question answering benchmark and an efficient network for video reasoning over traffic events,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9878–9888, 2021.
- D. Hu, Z. Wang, F. Nie, R. Wang, and X. Li. “Self-supervised learning for heterogeneous audiovisual scene analysis,” in IEEE Transactions on Multimedia, 2022.
- A. Senocak, T. Oh, J. Kim, M. Yang, and I. S. Kweon. “Learning to localize sound source in visual scenes,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 4358–4366, 2018.
- H. Zhao, C. Gan, A. Rouditchenko, C. Vondrick, J. McDermott, and A. Torralba. “The sound of pixels,” in European Conference on Computer Vision, pp. 587–604, 2018.
- R. Gao and K. Grauman. “2.5D visual sound,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 324–333, 2019.
- J. Ngiam, A. Khosla, M. Kim, J. Nam, H. Lee, and A. Y. Ng. “Multimodal deep learning,” in International Conference on Machine Learning, pp. 689–696, 2011.
- N. Srivastava and R. Salakhutdinov. “Multimodal learning with deep Boltzmann machines,” in Advances in neural information processing systems, pp. 2949–2980, 2014.
- A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin. “Attention is all you need,” in Advances in neural information processing systems, pp. 5998-6008, 2017.
- A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, J. Uszkoreit, and N. Houlsby. “An image is worth 16x16 words: Transformers for image recognition at scale,” arXiv:2010.11929, 2020.
- C. Chen, Q. Fan, and R. Panda. “CrossViT: Cross-attention multi-scale vision transformer for image classification,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 347-356, 2021.
- A. Yang, A. Miech, J. Sivic, I. Laptev, and C. Schmid. “Zero-shot video question answering via frozen bidirectional language models,” in Conference on Neural Information Processing Systems, pp. 124-141, 2022.
- A. Yang, A. Miech, J. Sivic, I. Laptev, and C. Schmid. “Just ask: Learning to answer questions from millions of narrated videos,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 1686–1697, 2021.
- R. Zellers, J. Lu, X. Lu, Y. Yu, Y. Zhao, M. Salehi, A. Kusupati, J. Hessel, A. Farhadi, and Y. Choi. “MERLOT RESERVE: Neural script knowledge through vision and language and sound,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16354-16366, 2022.
- J. Lei, L. Li, L. Zhou, Z. Gan, T. L. Berg, M. Bansal, and J. Liu. “Less is more: ClipBERT for video-and-language learning via sparse sampling,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 7331–7341, 2021.
- H. Xu, G. Ghosh, P. Huang, D. Okhonko, A. Aghajanyan, F. Metze, L. Zettlemoyer, and C. Feichtenhofer. “VideoCLIP: Contrastive pre-training for zero-shot video-text understanding,” arXiv:2109.14084, 2021.
- P. H. Seo, A. Nagrani, and C. Schmid. “Look Before you Speak: Visually contextualized utterances,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16877-16887, 2021.
- S. Kim, S. Jeong, E. Kim, I. Kang, and N. Kwak. “Self-supervised pre-training and contrastive representation learning for multiple-choice video QA,” in AAAI Conference on Artificial Intelligence, pp. 13171–13179, 2021.
- Y. Liu, W. Wei, D. Peng, X.-L. Mao, Z. He, and P. Zhou, “Depth-aware and semantic guided relational attention network for visual question answering,” in IEEE Transactions on Multimedia, pp. 1–14, 2022.
- T. Qian, R. Cui, J. Chen, P. Peng, X. Guo, and Y.-G. Jiang, “Locate before answering: answer guided question localization for video question answering,” in IEEE Transactions on Multimedia, pp. 1-10, 2022.
- Y. Song, X. Yang, Y. Wang, and C. Xu, “Recovering generalization via pre-training-like knowledge distillation for out-of-distribution visual question answering,” in IEEE Transactions on Multimedia, pp. 1–15, 2023.
- J. Wang, B.-K. Bao, and C. Xu, “DualVGR: a dual-visual graph reasoning unit for video question answering,” in IEEE Transactions on Multimedia, pp. 3369–3380, 2022.
- F. Liu, J. Liu, Z. Fang, R. Hong, and H. Lu, “Visual question answering with dense inter- and intra-modality interactions,” in IEEE Transactions on Multimedia, pp. 3518–3529, 2021.
- T. Yu, J. Yu, Z. Yu, Q. Huang, and Q. Tian. “Long-term video question answering via multimodal hierarchical memory attentive networks,” in IEEE Transactions on Circuits and Systems for Video Technology, pp. 931-944, 2021.
- J. Liu, G. Wang, J. Xie, F. Zhou, and H. Xu, “Video question answering with semantic disentanglement and reasoning,” in IEEE Transactions on Circuits and Systems for Video Technology, pp. 1–1, 2023.
- Y. Lin, Y. Sung, J. Lei, M. Bansal, and G. Bertasius. “Vision transformers are parameter-efficient audio-visual learners,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2299–2309, 2023.
- K. He, X. Zhang, S. Ren, and J. Sun. “Deep residual learning for image recognition,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 770–778, 2016.
- J. F. Gemmeke, D. P. Ellis, D. Freedman, A. Jansen, W. Lawrence, R. C. Moore, M. Plakal, and M. Ritter. “Audio Set: An ontology and human-labeled dataset for audio events,” in IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 776–780, 2017.
- L. Li, T. Jin, W. Lin, H. Jiang, W. Pan, J. Wang, S. Xiao, Y. Xia, W. Jiang, and Z. Zhao. “Multi-granularity relational attention network for audio-visual question answering,” in IEEE Transactions on Circuits and Systems for Video Technology, pp. 1–1, 2023.
- H. M. Fayek and J. Johnson. “Temporal reasoning via audio question answering,” in IEEE/ACM Transactions on Audio, Speech, and Language Processing, pp. 2283–2294, 2020.
- J. Lu, J. Yang, D. Batra, and D. Parikh. “Hierarchical question-image co-attention for visual question answering,” arXiv:1606.00061, 2016.
- Z. Yu, J. Yu, Y. Cui, D. Tao, and Q. Tian. “Deep modular co-attention networks for visual question answering,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 6281–6290, 2019.
- X. Li, J. Song, L. Gao, X. Liu, W. Huang, X. He, and C. Gan. “Beyond RNNs: Positional self-attention with co-attention for video question answering,” in AAAI Conference on Artificial Intelligence, pp. 8658–8665, 2019.
- C. Fan, X. Zhang, S. Zhang, W. Wang, C. Zhang, and H. Huang. “Heterogeneous memory enhanced multimodal attention model for video question answering,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1999–2007, 2019.
- T. M. Le, V. Le, S. Venkatesh, and T. Tran. “Hierarchical conditional relation networks for video question answering,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9972–9981, 2020.
- I. Schwartz, A. G. Schwing, and T. Hazan. “A simple baseline for audio-visual scene-aware dialog,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 12548–12558, 2019.
- X. Li, L. Gao, X. Wang, W. Liu, X. Xu, H. Shen, and J. Song. “Learnable aggregating net with diversity learning for video question answering,” in Proceedings of the 27th ACM International Conference on Multimedia, pp. 1166–1174, 2019.
- J. Zhang, J. Shao, R. Cao, L. Gao, X. Xu, and H. Shen. “Action-centric relation transformer network for video question answering,” in IEEE Transactions on Circuits and Systems for Video Technology, pp. 63–74, 2020.
- P. Jiang and Y. Han. “Reasoning with heterogeneous graph alignment for video question answering,” in AAAI Conference on Artificial Intelligence, pp. 11109–11116, 2020.
- Q. Kong, Y. Cao, T. Iqbal, Y. Wang, W. Wang, and M. D. Plumbley. “PANNs: Large-scale pretrained audio neural networks for audio pattern recognition,” in IEEE/ACM Transactions on Audio, Speech, and Language Processing, pp. 2880–2894, 2020.
- X. Shi, Z. Chen, H. Wang, D.-Y. Yeung, W.-K. Wong, and W.-C. Woo. “Convolutional LSTM network: A machine learning approach for precipitation nowcasting,” in Advances in neural information processing systems, pp. 802–810, 2015.
- Z. Huang, W. Xu, and K. Yu. “Bidirectional LSTM-CRF models for sequence tagging,” arXiv:1508.01991, 2015.
- Z. Liu, Y. Lin, Y. Cao, H. Hu, Y. Wei, Z. Zhang, S. Lin, and B. Guo. “Swin Transformer: Hierarchical vision transformer using shifted windows,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9992–10002, 2021.