
Acoustic Prompt Tuning: Empowering Large Language Models with Audition Capabilities (2312.00249v1)

Published 30 Nov 2023 in eess.AS

Abstract: The auditory system plays a substantial role in shaping the overall human perceptual experience. While prevailing LLMs and visual LLMs (VLMs) have shown their promise in solving a wide variety of vision and language understanding tasks, only a few of them can be generalised to the audio domain without compromising their domain-specific capacity. In this work, we introduce Acoustic Prompt Tuning (APT), a new adapter extending LLMs and VLMs to the audio domain by soft prompting only. Specifically, APT applies an instruction-aware audio aligner to generate soft prompts, conditioned on both input text and sounds, as LLM inputs. To mitigate the data scarcity in the audio domain, a multi-task learning strategy is proposed by formulating diverse audio tasks in a sequence-to-sequence manner. Moreover, we improve the framework of audio LLMs by using interleaved audio-text embeddings as the input sequence. This improved framework imposes zero constraints on the input format and thus is capable of tackling more understanding tasks, such as few-shot audio classification and audio reasoning. To further evaluate the reasoning ability of audio networks, we propose natural language audio reasoning (NLAR), a new task that analyses across two audio clips by comparison and summarization. Experiments show that APT-enhanced LLMs (namely APT-LLMs) achieve competitive results compared to the expert models (i.e., the networks trained on the targeted datasets) across various tasks. We finally demonstrate APT's ability in extending frozen VLMs to the audio domain without finetuning, achieving promising results in the audio-visual question answering task. Our code and model weights are released at https://github.com/JinhuaLiang/APT.
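The core mechanism the abstract describes, an aligner that cross-attends learnable queries over audio features and instruction embeddings to produce soft prompts, which are then interleaved freely with text embeddings, can be sketched as follows. This is a minimal illustrative sketch, not the authors' implementation: all function names, shapes, and the single-head attention simplification are assumptions.

```python
import numpy as np

def audio_aligner(audio_feats, instr_embeds, w_proj, soft_queries):
    """Hypothetical instruction-aware aligner (sketch).

    A fixed set of learnable queries attends over the concatenation of
    projected audio features and instruction-text embeddings, yielding a
    fixed number of soft prompts in the LLM's embedding space.

    audio_feats:  (T_audio, d_audio)  frame-level audio encoder output
    instr_embeds: (T_text, d_llm)     embedded instruction tokens
    w_proj:       (d_audio, d_llm)    audio-to-LLM projection
    soft_queries: (P, d_llm)          learnable prompt queries
    """
    # Keys/values: projected audio frames followed by instruction tokens.
    keys = np.concatenate([audio_feats @ w_proj, instr_embeds], axis=0)
    # Scaled dot-product attention of queries over keys.
    scores = (soft_queries @ keys.T) / np.sqrt(keys.shape[1])
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)
    return weights @ keys  # (P, d_llm) soft prompts

def interleave(*segments):
    """Interleave soft audio prompts and text embeddings in any order,
    imposing no constraint on the input format (e.g. few-shot demos)."""
    return np.concatenate(segments, axis=0)
```

Because the soft prompts live in the same embedding space as text tokens, a frozen LLM (or VLM) can consume sequences such as `interleave(text_a, prompts_clip1, text_b, prompts_clip2)`, which is what enables few-shot audio classification and the two-clip NLAR comparisons described above.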

Authors (6)
  1. Jinhua Liang (15 papers)
  2. Xubo Liu (66 papers)
  3. Wenwu Wang (148 papers)
  4. Mark D. Plumbley (114 papers)
  5. Huy Phan (75 papers)
  6. Emmanouil Benetos (89 papers)
Citations (9)