Cross-modal Prompts: Adapting Large Pre-trained Models for Audio-Visual Downstream Tasks (2311.05152v2)
Abstract: In recent years, deploying large-scale pre-trained models for audio-visual downstream tasks has yielded remarkable results. However, these models, trained primarily on single-modality unconstrained datasets, still struggle to extract features for multi-modal tasks: because each encoder sees only its own modality, it introduces irrelevant modality-specific information during encoding, which degrades downstream performance. To address this challenge, this paper proposes a novel Dual-Guided Spatial-Channel-Temporal (DG-SCT) attention mechanism, which leverages the audio and visual modalities as soft prompts to dynamically adjust the parameters of pre-trained models based on the current multi-modal input features. Specifically, the DG-SCT module inserts trainable cross-modal interaction layers into pre-trained audio-visual encoders, allowing adaptive extraction of crucial information from the current modality across the spatial, channel, and temporal dimensions, while keeping the parameters of the large-scale pre-trained models frozen. Experimental evaluations demonstrate that our model achieves state-of-the-art results across multiple downstream tasks, including audio-visual event localization (AVE), audio-visual video parsing (AVVP), audio-visual segmentation (AVS), and audio-visual question answering (AVQA), and it shows promising performance in challenging few-shot and zero-shot scenarios. The source code and pre-trained models are available at https://github.com/haoyi-duan/DG-SCT.
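To make the mechanism concrete, here is a minimal PyTorch sketch of one direction of the cross-modal prompting described above: audio features acting as soft prompts that gate frozen visual features along the channel, spatial, and temporal dimensions. This is an illustrative reconstruction from the abstract alone, not the authors' implementation (see the linked repository for that); the module and parameter names (`AudioGuidedSCT`, `d_v`, `d_a`) and the exact attention forms are assumptions, and the full DG-SCT is dual-guided, so a symmetric visual-to-audio branch would exist alongside this one.

```python
# Hypothetical sketch of audio-guided spatial-channel-temporal attention
# in the spirit of DG-SCT; names and attention forms are assumptions.
import torch
import torch.nn as nn


class AudioGuidedSCT(nn.Module):
    """Gates visual features with audio-derived attention along the
    channel, spatial, and temporal dimensions (one guidance direction)."""

    def __init__(self, d_v: int, d_a: int):
        super().__init__()
        self.chan_fc = nn.Linear(d_a, d_v)  # audio -> per-channel gates
        self.spat_fc = nn.Linear(d_a, d_v)  # audio -> spatial query
        self.temp_fc = nn.Linear(d_a, d_v)  # audio -> temporal query

    def forward(self, v: torch.Tensor, a: torch.Tensor) -> torch.Tensor:
        # v: [B, T, d_v, H, W] visual features from a frozen encoder block
        # a: [B, T, d_a] audio features from a frozen encoder block
        B, T, C, H, W = v.shape
        # Channel attention: per-frame channel gates predicted from audio.
        c_gate = torch.sigmoid(self.chan_fc(a))                      # [B, T, C]
        v = v * c_gate.view(B, T, C, 1, 1)
        # Spatial attention: an audio query attends over the H*W locations.
        q_s = self.spat_fc(a).view(B, T, C, 1)
        s_logits = (v.flatten(3).transpose(2, 3) @ q_s).squeeze(-1)  # [B, T, HW]
        s_attn = torch.softmax(s_logits / C ** 0.5, dim=-1).view(B, T, 1, H, W)
        v = v * (1.0 + s_attn)  # residual gating keeps feature magnitudes stable
        # Temporal attention: weight frames by audio-visual agreement over time.
        v_pool = v.flatten(3).mean(-1)                               # [B, T, C]
        t_logits = (v_pool * self.temp_fc(a)).sum(-1) / C ** 0.5     # [B, T]
        t_attn = torch.softmax(t_logits, dim=1).view(B, T, 1, 1, 1)
        return v * (1.0 + t_attn)  # prompted visual features, same shape as input


# Usage with illustrative feature shapes (e.g., from a frozen Swin visual
# block and a frozen audio transformer block):
sct = AudioGuidedSCT(d_v=768, d_a=512)
v = torch.randn(2, 10, 768, 7, 7)  # [batch, frames, channels, height, width]
a = torch.randn(2, 10, 512)        # [batch, frames, audio feature dim]
out = sct(v, a)                    # [2, 10, 768, 7, 7]
```

In a setup like this, only the interaction layers receive gradients; the surrounding pre-trained encoder blocks stay frozen, which is what makes the approach parameter-efficient.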