MoWE-Audio: Multitask AudioLLMs with Mixture of Weak Encoders (2409.06635v2)

Published 10 Sep 2024 in cs.SD, cs.AI, cs.CL, and eess.AS

Abstract: The rapid advancements in LLMs have significantly enhanced natural language processing capabilities, facilitating the development of AudioLLMs that process and understand speech and audio inputs alongside text. Existing AudioLLMs typically combine a pre-trained audio encoder with a pre-trained LLM, which are subsequently finetuned on specific audio tasks. However, the pre-trained audio encoder has constrained capacity to capture features for new tasks and datasets. To address this, we propose to incorporate mixtures of `weak' encoders (MoWE) into the AudioLLM framework. MoWE supplements a base encoder with a pool of relatively lightweight encoders, selectively activated based on the audio input to enhance feature extraction without significantly increasing model size. Our empirical results demonstrate that MoWE effectively improves multi-task performance, broadening the applicability of AudioLLMs to more diverse audio tasks.
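The abstract's core idea — a base encoder supplemented by a pool of lightweight "weak" encoders, selectively activated per audio input — can be sketched as a top-k routed mixture. The sketch below is illustrative only: the feature dimension, pool size, router, and additive top-k fusion are assumptions for demonstration, not the paper's exact architecture.

```python
import math
import random

random.seed(0)

DIM = 8       # feature dimension (hypothetical)
NUM_WEAK = 4  # size of the weak-encoder pool (hypothetical)
TOP_K = 2     # weak encoders activated per input (hypothetical)

def rand_matrix(rows, cols):
    """Random linear map standing in for a trained encoder."""
    return [[random.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

def matvec(W, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def softmax(xs):
    m = max(xs)
    exps = [math.exp(v - m) for v in xs]
    s = sum(exps)
    return [e / s for e in exps]

base_encoder = rand_matrix(DIM, DIM)   # stands in for a large pre-trained encoder
weak_encoders = [rand_matrix(DIM, DIM) for _ in range(NUM_WEAK)]
router = rand_matrix(NUM_WEAK, DIM)    # scores each weak encoder for a given input

def mowe_forward(audio_features):
    """Base features plus a weighted sum of the top-k selected weak encoders."""
    base_out = matvec(base_encoder, audio_features)
    scores = softmax(matvec(router, audio_features))
    top = sorted(range(NUM_WEAK), key=lambda i: scores[i], reverse=True)[:TOP_K]
    mix = [0.0] * DIM
    for i in top:
        out = matvec(weak_encoders[i], audio_features)
        mix = [m + scores[i] * o for m, o in zip(mix, out)]
    # Only TOP_K of NUM_WEAK pool encoders run, so inference cost grows little.
    return [b + m for b, m in zip(base_out, mix)]

x = [random.uniform(-1, 1) for _ in range(DIM)]
fused = mowe_forward(x)
print(len(fused))  # fused features keep the base dimensionality
```

Because only the router and the selected top-k encoders are evaluated per input, the pool can grow without a proportional increase in per-input compute, which matches the abstract's claim of enhanced feature extraction "without significantly increasing model size."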

Authors (9)
  1. Wenyu Zhang (47 papers)
  2. Shuo Sun (91 papers)
  3. Bin Wang (750 papers)
  4. Xunlong Zou (6 papers)
  5. Zhuohan Liu (6 papers)
  6. Yingxu He (5 papers)
  7. Geyu Lin (10 papers)
  8. Nancy F. Chen (97 papers)
  9. Ai Ti Aw (18 papers)
Citations (1)