
Acoustic Prompt Tuning: Empowering Large Language Models with Audition Capabilities (2312.00249v1)

Published 30 Nov 2023 in eess.AS

Abstract: The auditory system plays a substantial role in shaping the overall human perceptual experience. While prevailing LLMs and visual LLMs (VLMs) have shown their promise in solving a wide variety of vision and language understanding tasks, only a few of them can be generalised to the audio domain without compromising their domain-specific capacity. In this work, we introduce Acoustic Prompt Tuning (APT), a new adapter extending LLMs and VLMs to the audio domain by soft prompting only. Specifically, APT applies an instruction-aware audio aligner to generate soft prompts, conditioned on both input text and sounds, as LLM inputs. To mitigate the data scarcity in the audio domain, a multi-task learning strategy is proposed by formulating diverse audio tasks in a sequence-to-sequence manner. Moreover, we improve the framework of audio LLMs by using interleaved audio-text embeddings as the input sequence. This improved framework imposes zero constraints on the input format and thus is capable of tackling more understanding tasks, such as few-shot audio classification and audio reasoning. To further evaluate the reasoning ability of audio networks, we propose natural language audio reasoning (NLAR), a new task that analyses across two audio clips by comparison and summarization. Experiments show that APT-enhanced LLMs (namely APT-LLMs) achieve competitive results compared to the expert models (i.e., the networks trained on the targeted datasets) across various tasks. We finally demonstrate APT's ability in extending frozen VLMs to the audio domain without finetuning, achieving promising results in the audio-visual question answering task. Our code and model weights are released at https://github.com/JinhuaLiang/APT.
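The core mechanism the abstract describes, an aligner that cross-attends learnable queries over audio features and instruction embeddings to produce soft prompts, which are then interleaved freely with text embeddings, can be sketched as follows. This is a minimal illustrative sketch, not the authors' implementation: all function names, shapes, and the single-head attention simplification are assumptions.

```python
import numpy as np

def audio_aligner(audio_feats, instr_embeds, w_proj, soft_queries):
    """Hypothetical instruction-aware aligner (sketch).

    A fixed set of learnable queries attends over the concatenation of
    projected audio features and instruction-text embeddings, yielding a
    fixed number of soft prompts in the LLM's embedding space.

    audio_feats:  (T_audio, d_audio)  frame-level audio encoder output
    instr_embeds: (T_text, d_llm)     embedded instruction tokens
    w_proj:       (d_audio, d_llm)    audio-to-LLM projection
    soft_queries: (P, d_llm)          learnable prompt queries
    """
    # Keys/values: projected audio frames followed by instruction tokens.
    keys = np.concatenate([audio_feats @ w_proj, instr_embeds], axis=0)
    # Scaled dot-product attention of queries over keys.
    scores = (soft_queries @ keys.T) / np.sqrt(keys.shape[1])
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)
    return weights @ keys  # (P, d_llm) soft prompts

def interleave(*segments):
    """Interleave soft audio prompts and text embeddings in any order,
    imposing no constraint on the input format (e.g. few-shot demos)."""
    return np.concatenate(segments, axis=0)
```

Because the soft prompts live in the same embedding space as text tokens, a frozen LLM (or VLM) can consume sequences such as `interleave(text_a, prompts_clip1, text_b, prompts_clip2)`, which is what enables few-shot audio classification and the two-clip NLAR comparisons described above.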

Authors (6)
  1. Jinhua Liang (15 papers)
  2. Xubo Liu (66 papers)
  3. Wenwu Wang (148 papers)
  4. Mark D. Plumbley (114 papers)
  5. Huy Phan (75 papers)
  6. Emmanouil Benetos (89 papers)
Citations (9)