CoAVT: A Cognition-Inspired Unified Audio-Visual-Text Pre-Training Model for Multimodal Processing (2401.12264v2)
Abstract: There has been a long-standing quest for a unified audio-visual-text model that mimics the listening, seeing and reading processes of human beings and enables various multimodal understanding tasks. Humans tend to represent knowledge using two separate systems: one for verbal (textual) information and one for non-verbal (visual and auditory) information. These two systems can operate independently but can also interact with each other. Motivated by this understanding of human cognition, in this paper we introduce CoAVT -- a novel cognition-inspired Correlated Audio-Visual-Text pre-training model that connects the three modalities. It contains a joint audio-visual encoder that learns to encode audio-visual synchronization information together with the audio and visual content for non-verbal information, and a text encoder that handles textual input for verbal information. To bridge the gap between modalities, CoAVT employs a query encoder, containing a set of learnable query embeddings, that extracts the audiovisual features most informative with respect to the corresponding text. Additionally, to leverage the respective correspondences of audio and vision with language, we establish audio-text and visual-text bi-modal alignments on top of the foundational audiovisual-text tri-modal alignment to enhance multimodal representation learning. Finally, we jointly optimize the CoAVT model with three multimodal objectives: a contrastive loss, a matching loss and a language modeling loss. Extensive experiments show that CoAVT learns strong multimodal correlations and generalizes to various downstream tasks. CoAVT establishes new state-of-the-art performance on the text-video retrieval task on AudioCaps in both zero-shot and fine-tuning settings, and on audio-visual event classification and audio-visual retrieval tasks on AudioSet and VGGSound.
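To make the query-encoder idea from the abstract concrete, below is a minimal PyTorch-style sketch of how a set of learnable query embeddings can cross-attend to fused audio-visual features and be aligned with a text embedding through a contrastive (InfoNCE-style) loss. The module names, dimensions, single-layer design, and the max-over-queries similarity are illustrative assumptions, not the authors' implementation; the matching and language modeling objectives are omitted for brevity.

```python
# Sketch: learnable queries attend to audio-visual tokens, then a contrastive
# loss aligns the resulting query features with text embeddings.
import torch
import torch.nn as nn
import torch.nn.functional as F


class QueryEncoderSketch(nn.Module):
    def __init__(self, num_queries=32, dim=768, num_heads=12):
        super().__init__()
        # Learnable query embeddings that "read out" audiovisual content.
        self.queries = nn.Parameter(torch.randn(num_queries, dim) * 0.02)
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.proj = nn.Linear(dim, dim)

    def forward(self, av_feats):
        # av_feats: (B, T, dim) tokens from the joint audio-visual encoder.
        B = av_feats.size(0)
        q = self.queries.unsqueeze(0).expand(B, -1, -1)   # (B, Q, dim)
        out, _ = self.cross_attn(q, av_feats, av_feats)   # queries attend to AV tokens
        return self.proj(out)                             # (B, Q, dim)


def contrastive_loss(query_feats, text_feats, temperature=0.07):
    # query_feats: (B, Q, dim); text_feats: (B, dim), e.g. a text [CLS] embedding.
    q = F.normalize(query_feats, dim=-1)
    t = F.normalize(text_feats, dim=-1)
    # Similarity of each text to every sample's queries; keep the best-matching query.
    sim = torch.einsum("bqd,cd->bcq", q, t).max(dim=-1).values / temperature  # (B, B)
    targets = torch.arange(sim.size(0), device=sim.device)
    return 0.5 * (F.cross_entropy(sim, targets) + F.cross_entropy(sim.t(), targets))


if __name__ == "__main__":
    av = torch.randn(4, 196, 768)   # dummy audio-visual tokens
    txt = torch.randn(4, 768)       # dummy text embeddings
    loss = contrastive_loss(QueryEncoderSketch()(av), txt)
    print(loss.item())
```

In the paper's setup, analogous bi-modal alignments (audio-text and visual-text) would reuse the same contrastive formulation with the corresponding unimodal features.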
Authors: Xianghu Yue, Xiaohai Tian, Malu Zhang, Zhizheng Wu, Haizhou Li, and Lu Lu