Cross-modal Prompts: Adapting Large Pre-trained Models for Audio-Visual Downstream Tasks (2311.05152v2)

Published 9 Nov 2023 in cs.LG, cs.AI, cs.CV, and cs.MM

Abstract: In recent years, the deployment of large-scale pre-trained models in audio-visual downstream tasks has yielded remarkable outcomes. However, these models, primarily trained on single-modality unconstrained datasets, still encounter challenges in feature extraction for multi-modal tasks, leading to suboptimal performance. This limitation arises due to the introduction of irrelevant modality-specific information during encoding, which adversely affects the performance of downstream tasks. To address this challenge, this paper proposes a novel Dual-Guided Spatial-Channel-Temporal (DG-SCT) attention mechanism. This mechanism leverages audio and visual modalities as soft prompts to dynamically adjust the parameters of pre-trained models based on the current multi-modal input features. Specifically, the DG-SCT module incorporates trainable cross-modal interaction layers into pre-trained audio-visual encoders, allowing adaptive extraction of crucial information from the current modality across spatial, channel, and temporal dimensions, while preserving the frozen parameters of large-scale pre-trained models. Experimental evaluations demonstrate that our proposed model achieves state-of-the-art results across multiple downstream tasks, including AVE, AVVP, AVS, and AVQA. Furthermore, our model exhibits promising performance in challenging few-shot and zero-shot scenarios. The source code and pre-trained models are available at https://github.com/haoyi-duan/DG-SCT.
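
The abstract describes the mechanism only at a high level: trainable cross-modal interaction layers are inserted into frozen pre-trained encoders, and one modality gates the other's features along the spatial, channel, and temporal dimensions. Below is a minimal PyTorch sketch of one such audio-to-visual interaction layer, assuming a channel gate in the squeeze-and-excitation style, a single-query spatial attention, and a scalar per-frame temporal weight. All names (DGSCTLayer, ChannelGate, v_feat, a_feat), the reduction ratio, and the choice to show only the audio-to-visual direction are illustrative assumptions, not taken from the official DG-SCT code at the linked repository.

```python
# Illustrative sketch only; module and tensor names are hypothetical,
# not the official DG-SCT implementation.
import torch
import torch.nn as nn


class ChannelGate(nn.Module):
    """Squeeze-and-excitation-style channel attention, driven by the other modality."""

    def __init__(self, dim: int, reduction: int = 4):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(dim, dim // reduction), nn.ReLU(), nn.Linear(dim // reduction, dim)
        )

    def forward(self, x, guide):
        # x: (B, T, C, HW) visual tokens; guide: (B, T, C) per-frame audio summary
        gate = torch.sigmoid(self.fc(guide)).unsqueeze(-1)  # (B, T, C, 1)
        return x * gate


class SpatialGate(nn.Module):
    """Audio-guided attention over the visual spatial positions."""

    def __init__(self, dim: int):
        super().__init__()
        self.proj = nn.Linear(dim, dim)

    def forward(self, x, guide):
        q = self.proj(guide).unsqueeze(-1)                  # (B, T, C, 1) audio query
        # scaled dot-product between the audio query and every spatial position
        logits = (x * q).sum(dim=2, keepdim=True) / x.shape[2] ** 0.5  # (B, T, 1, HW)
        return x * torch.softmax(logits, dim=-1)


class TemporalGate(nn.Module):
    """Audio-guided scalar weighting of each video frame along time."""

    def __init__(self, dim: int):
        super().__init__()
        self.fc = nn.Linear(dim, 1)

    def forward(self, x, guide):
        w = torch.sigmoid(self.fc(guide))                   # (B, T, 1)
        return x * w.unsqueeze(-1)                          # broadcast over C and HW


class DGSCTLayer(nn.Module):
    """One trainable cross-modal interaction layer inserted between frozen
    encoder blocks; only the audio-to-visual direction is sketched here."""

    def __init__(self, dim: int):
        super().__init__()
        self.channel = ChannelGate(dim)
        self.spatial = SpatialGate(dim)
        self.temporal = TemporalGate(dim)

    def forward(self, v_feat, a_feat):
        # v_feat: (B, T, C, HW) visual features; a_feat: (B, T, C) audio prompt
        v = self.channel(v_feat, a_feat)
        v = self.spatial(v, a_feat)
        v = self.temporal(v, a_feat)
        return v_feat + v  # residual: the frozen backbone's features pass through


# Usage sketch with made-up shapes: 2 videos, 10 frames, 7x7 visual patches
layer = DGSCTLayer(dim=768)
v = torch.randn(2, 10, 768, 49)
a = torch.randn(2, 10, 768)
out = layer(v, a)  # (2, 10, 768, 49)
```

In the paper's dual-guided design the gating is symmetric, i.e. visual features would also modulate the audio encoder, and only these interaction layers receive gradients; a training setup would freeze the backbone encoders (e.g. `for p in backbone.parameters(): p.requires_grad = False`) and optimize the DG-SCT layers alone.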

References (46)
  1. “Language models are few-shot learners” In Advances in neural information processing systems 33, 2020, pp. 1877–1901
  2. “Vggsound: A large-scale audio-visual dataset” In ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2020, pp. 721–725 IEEE
  3. “HTS-AT: A hierarchical token-semantic audio transformer for sound classification and detection” In ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2022, pp. 646–650 IEEE
  4. “An image is worth 16x16 words: Transformers for image recognition at scale” In arXiv preprint arXiv:2010.11929, 2020
  5. “Clap: Learning audio concepts from natural language supervision” In arXiv preprint arXiv:2206.04769, 2022
  6. “The pascal visual object classes (voc) challenge” In International journal of computer vision 88 Springer, 2010, pp. 303–338
  7. “Clip-adapter: Better vision-language models with feature adapters” In arXiv preprint arXiv:2110.04544, 2021
  8. “Visualvoice: Audio-visual speech separation with cross-modal consistency” In 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2021, pp. 15490–15500 IEEE
  9. “Audio set: An ontology and human-labeled dataset for audio events” In 2017 IEEE international conference on acoustics, speech and signal processing (ICASSP), 2017, pp. 776–780 IEEE
  10. “Audioclip: Extending clip to image, text and audio” In ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2022, pp. 976–980 IEEE
  11. Jie Hu, Li Shen and Gang Sun “Squeeze-and-excitation networks” In Proceedings of the IEEE conference on computer vision and pattern recognition, 2018, pp. 7132–7141
  12. “Maple: Multi-modal prompt learning” In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 19113–19122
  13. Brian Lester, Rami Al-Rfou and Noah Constant “The power of scale for parameter-efficient prompt tuning” In arXiv preprint arXiv:2104.08691, 2021
  14. “Learning to answer questions in dynamic audio-visual scenarios” In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 19108–19118
  15. Xiang Lisa Li and Percy Liang “Prefix-tuning: Optimizing continuous prompts for generation” In arXiv preprint arXiv:2101.00190, 2021
  16. Yan-Bo Lin, Yu-Jhe Li and Yu-Chiang Frank Wang “Dual-modality seq2seq network for audio-visual event localization” In ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2019, pp. 2002–2006 IEEE
  17. “Vision Transformers Are Parameter-Efficient Audio-Visual Learners” In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023, pp. 2299–2309
  18. “GPT understands, too” In arXiv preprint arXiv:2103.10385, 2021
  19. “P-tuning v2: Prompt tuning can be comparable to fine-tuning universally across scales and tasks” In arXiv preprint arXiv:2110.07602, 2021
  20. “Swin transformer: Hierarchical vision transformer using shifted windows” In Proceedings of the IEEE/CVF international conference on computer vision, 2021, pp. 10012–10022
  21. “Timestamps as Prompts for Geography-Aware Location Recommendation” In arXiv preprint arXiv:2304.04151, 2023
  22. “Active contrastive learning of audio-visual video representations” In arXiv preprint arXiv:2009.09805, 2020
  23. “AVE-CLIP: AudioCLIP-based Multi-window Temporal Transformer for Audio Visual Event Localization” In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, 2023, pp. 5158–5167
  24. “Multi-modal Grouping Network for Weakly-Supervised Audio-Visual Video Parsing” In Advances in Neural Information Processing Systems, 2022
  25. “Language models are unsupervised multitask learners” In OpenAI blog 1.8, 2019, pp. 9
  26. “Learning transferable visual models from natural language supervision” In International conference on machine learning, 2021, pp. 8748–8763 PMLR
  27. “Denseclip: Language-guided dense prediction with context-aware prompting” In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 18082–18091
  28. Idan Schwartz, Alexander G Schwing and Tamir Hazan “A simple baseline for audio-visual scene-aware dialog” In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 12548–12558
  29. “Beyond Two-Tower Matching: Learning Sparse Retrievable Cross-Interactions for Recommendation” In Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval, 2023, pp. 548–557
  30. “Parameter-efficient prompt tuning makes generalized and calibrated neural text retrievers” In arXiv preprint arXiv:2207.07087, 2022
  31. Yapeng Tian, Di Hu and Chenliang Xu “Cyclic co-learning of sounding object visual grounding and sound separation” In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 2745–2754
  32. Yapeng Tian, Dingzeyu Li and Chenliang Xu “Unified multisensory perception: Weakly-supervised audio-visual video parsing” In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part III 16, 2020, pp. 436–454 Springer
  33. “Audio-visual event localization in unconstrained videos” In Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 247–263
  34. Laurens Van der Maaten and Geoffrey Hinton “Visualizing data using t-SNE.” In Journal of machine learning research 9.11, 2008
  35. “Dual attention matching for audio-visual event localization” In Proceedings of the IEEE/CVF international conference on computer vision, 2019, pp. 6292–6300
  36. “Cross-modal background suppression for audio-visual event localization” In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 19989–19998
  37. “Cross-modal relation-aware networks for audio-visual event localization” In Proceedings of the 28th ACM International Conference on Multimedia, 2020, pp. 3893–3901
  38. “Cross-modal attention network for temporal inconsistent audio-visual event localization” In Proceedings of the AAAI Conference on Artificial Intelligence 34.01, 2020, pp. 279–286
  39. “LaPE: Layer-adaptive Position Embedding for Vision Transformers with Independent Layer Normalization” In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2023, pp. 5886–5896
  40. “Pano-avqa: Grounded audio-visual question answering on 360deg videos” In Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 2031–2041
  41. Zhaoyang Zeng, Daniel McDuff and Yale Song “Contrastive learning of global and local video representations” In Advances in Neural Information Processing Systems 34, 2021, pp. 7025–7040
  42. “Audio–Visual Segmentation” In Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXVII, 2022, pp. 386–403 Springer
  43. “AVSBench: A Pixel-level Audio- Visual Segmentation Benchmark”
  44. “Positive sample propagation along the audio-visual event line” In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 8436–8444
  45. “Conditional prompt learning for vision-language models” In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 16816–16825
  46. “Learning to prompt for vision-language models” In International Journal of Computer Vision 130.9 Springer, 2022, pp. 2337–2348
