MoWE-Audio: Multitask AudioLLMs with Mixture of Weak Encoders (2409.06635v2)

Published 10 Sep 2024 in cs.SD, cs.AI, cs.CL, and eess.AS

Abstract: The rapid advancements in LLMs have significantly enhanced natural language processing capabilities, facilitating the development of AudioLLMs that process and understand speech and audio inputs alongside text. Existing AudioLLMs typically combine a pre-trained audio encoder with a pre-trained LLM, which are subsequently finetuned on specific audio tasks. However, the pre-trained audio encoder has constrained capacity to capture features for new tasks and datasets. To address this, we propose to incorporate mixtures of `weak' encoders (MoWE) into the AudioLLM framework. MoWE supplements a base encoder with a pool of relatively lightweight encoders, selectively activated based on the audio input to enhance feature extraction without significantly increasing model size. Our empirical results demonstrate that MoWE effectively improves multi-task performance, broadening the applicability of AudioLLMs to more diverse audio tasks.
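The abstract's core idea — a base encoder supplemented by a pool of lightweight "weak" encoders, selectively activated per audio input — can be sketched as a top-k routed mixture. The sketch below is illustrative only: the feature dimension, pool size, router, and additive top-k fusion are assumptions for demonstration, not the paper's exact architecture.

```python
import math
import random

random.seed(0)

DIM = 8       # feature dimension (hypothetical)
NUM_WEAK = 4  # size of the weak-encoder pool (hypothetical)
TOP_K = 2     # weak encoders activated per input (hypothetical)

def rand_matrix(rows, cols):
    """Random linear map standing in for a trained encoder."""
    return [[random.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

def matvec(W, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def softmax(xs):
    m = max(xs)
    exps = [math.exp(v - m) for v in xs]
    s = sum(exps)
    return [e / s for e in exps]

base_encoder = rand_matrix(DIM, DIM)   # stands in for a large pre-trained encoder
weak_encoders = [rand_matrix(DIM, DIM) for _ in range(NUM_WEAK)]
router = rand_matrix(NUM_WEAK, DIM)    # scores each weak encoder for a given input

def mowe_forward(audio_features):
    """Base features plus a weighted sum of the top-k selected weak encoders."""
    base_out = matvec(base_encoder, audio_features)
    scores = softmax(matvec(router, audio_features))
    top = sorted(range(NUM_WEAK), key=lambda i: scores[i], reverse=True)[:TOP_K]
    mix = [0.0] * DIM
    for i in top:
        out = matvec(weak_encoders[i], audio_features)
        mix = [m + scores[i] * o for m, o in zip(mix, out)]
    # Only TOP_K of NUM_WEAK pool encoders run, so inference cost grows little.
    return [b + m for b, m in zip(base_out, mix)]

x = [random.uniform(-1, 1) for _ in range(DIM)]
fused = mowe_forward(x)
print(len(fused))  # fused features keep the base dimensionality
```

Because only the router and the selected top-k encoders are evaluated per input, the pool can grow without a proportional increase in per-input compute, which matches the abstract's claim of enhanced feature extraction "without significantly increasing model size."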

Authors (9)
  1. Wenyu Zhang (47 papers)
  2. Shuo Sun (91 papers)
  3. Bin Wang (750 papers)
  4. Xunlong Zou (6 papers)
  5. Zhuohan Liu (6 papers)
  6. Yingxu He (5 papers)
  7. Geyu Lin (10 papers)
  8. Nancy F. Chen (97 papers)
  9. Ai Ti Aw (18 papers)
Citations (1)