WavLLM: Towards Robust and Adaptive Speech Large Language Model (2404.00656v3)

Published 31 Mar 2024 in cs.CL, cs.AI, cs.SD, and eess.AS

Abstract: The recent advancements in LLMs have revolutionized the field of natural language processing, progressively broadening their scope to multimodal perception and generation. However, effectively integrating listening capabilities into LLMs poses significant challenges, particularly with respect to generalizing across varied contexts and executing complex auditory tasks. In this work, we introduce WavLLM, a robust and adaptive speech LLM with dual encoders, and a prompt-aware LoRA weight adapter, optimized by a two-stage curriculum learning approach. Leveraging dual encoders, we decouple different types of speech information, utilizing a Whisper encoder to process the semantic content of speech, and a WavLM encoder to capture the unique characteristics of the speaker's identity. Within the curriculum learning framework, WavLLM first builds its foundational capabilities by optimizing on mixed elementary single tasks, followed by advanced multi-task training on more complex tasks such as combinations of the elementary tasks. To enhance the flexibility and adherence to different tasks and instructions, a prompt-aware LoRA weight adapter is introduced in the second advanced multi-task training stage. We validate the proposed model on universal speech benchmarks including tasks such as ASR, ST, SV, ER, and also apply it to specialized datasets like Gaokao English listening comprehension set for SQA, and speech Chain-of-Thought (CoT) evaluation set. Experiments demonstrate that the proposed model achieves state-of-the-art performance across a range of speech tasks on the same model size, exhibiting robust generalization capabilities in executing complex tasks using CoT approach. Furthermore, our model successfully completes Gaokao tasks without specialized training. The codes, models, audio, and Gaokao evaluation set can be accessed at \url{aka.ms/wavLLM}.

Overview of "WavLLM: Towards Robust and Adaptive Speech LLM"

The paper introduces WavLLM, a speech-enabled LLM designed to bring listening capabilities to LLMs for improved multimodal understanding and task execution. The authors present a novel architecture that combines dual speech encoders with a prompt-aware Low-Rank Adaptation (LoRA) weight adapter, optimized through a two-stage curriculum learning approach, to support robust and versatile auditory task processing.

Methodology

Model Architecture:

  • Dual Encoders: WavLLM integrates the Whisper and WavLM encoders to process different facets of the speech signal separately. Whisper primarily captures semantic content, while WavLM captures acoustic attributes such as speaker identity; a minimal fusion sketch follows this list.
  • Prompt-aware LoRA Adapter: This component modulates the LoRA weights in response to the instruction prompt, enhancing adaptability and performance across diverse tasks and input types.
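
The following is a minimal sketch, in PyTorch, of how such a dual-encoder front end could feed a decoder-only LLM. The module and dimension names (WavLLMFusion, SEMANTIC_DIM, ACOUSTIC_DIM, LLM_DIM) are illustrative assumptions rather than the released implementation, and concatenating along the time axis is only one simple fusion choice.

```python
import torch
import torch.nn as nn

SEMANTIC_DIM = 1280   # e.g. a Whisper encoder hidden size (assumption)
ACOUSTIC_DIM = 1024   # e.g. WavLM Large hidden size (assumption)
LLM_DIM = 4096        # e.g. LLaMA-2-7B embedding size (assumption)


class WavLLMFusion(nn.Module):
    """Project semantic (Whisper) and acoustic (WavLM) features into the
    LLM embedding space and join them into one audio-prompt sequence."""

    def __init__(self):
        super().__init__()
        self.semantic_adapter = nn.Linear(SEMANTIC_DIM, LLM_DIM)
        self.acoustic_adapter = nn.Linear(ACOUSTIC_DIM, LLM_DIM)

    def forward(self, whisper_feats: torch.Tensor, wavlm_feats: torch.Tensor) -> torch.Tensor:
        # whisper_feats: (batch, T_sem, SEMANTIC_DIM) -- semantic content
        # wavlm_feats:   (batch, T_aco, ACOUSTIC_DIM) -- speaker/acoustic cues
        sem = self.semantic_adapter(whisper_feats)
        aco = self.acoustic_adapter(wavlm_feats)
        # One simple fusion choice: concatenate along the time axis; the
        # result would be prepended to the embedded text prompt before the
        # (largely frozen) LLM decoder.
        return torch.cat([sem, aco], dim=1)
```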

Training Approach:

  • Two-Stage Curriculum Learning: The training regimen begins with simpler tasks, aiming to build foundational capabilities by leveraging synthesized spoken question-answering datasets and other elementary speech tasks such as automatic speech recognition (ASR), speech translation (ST), speaker verification (SV), and emotion recognition (ER).
  • The second stage involves advanced multi-task learning with combinations of the elementary tasks and introduces the prompt-aware LoRA adapter (sketched after this list) to further refine execution across mixed tasks and instructions.
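
Below is a minimal sketch of what a prompt-aware LoRA layer could look like; the gating network, its dimensions, and the scalar-gate design are hypothetical, and the paper's actual adapter may condition the low-rank update differently.

```python
import torch
import torch.nn as nn


class PromptAwareLoRALinear(nn.Module):
    """A frozen linear layer plus a low-rank (LoRA) update whose strength
    is gated by an embedding of the instruction prompt."""

    def __init__(self, in_dim: int, out_dim: int, rank: int = 8, prompt_dim: int = 4096):
        super().__init__()
        self.base = nn.Linear(in_dim, out_dim, bias=False)  # pretrained LLM weight, kept frozen
        self.base.weight.requires_grad_(False)
        self.lora_a = nn.Linear(in_dim, rank, bias=False)    # low-rank down-projection
        self.lora_b = nn.Linear(rank, out_dim, bias=False)   # low-rank up-projection
        # Hypothetical prompt adapter: pooled prompt embedding -> scalar gate in (0, 1).
        self.prompt_gate = nn.Sequential(
            nn.Linear(prompt_dim, 256), nn.ReLU(), nn.Linear(256, 1), nn.Sigmoid()
        )

    def forward(self, x: torch.Tensor, prompt_embedding: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, in_dim); prompt_embedding: (batch, prompt_dim)
        gate = self.prompt_gate(prompt_embedding).unsqueeze(1)  # (batch, 1, 1)
        return self.base(x) + gate * self.lora_b(self.lora_a(x))
```

In this sketch only the low-rank matrices and the gate are trainable while the base weight stays frozen, in keeping with the parameter-efficient spirit of LoRA-based adaptation.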

Experimental Evaluation

The model is evaluated on a range of single-task and multi-task benchmarks:

  • Single-Task Performance: WavLLM demonstrates state-of-the-art results on tasks like ASR, with a WER of 2.0% on the LibriSpeech test-clean dataset, outperforming comparable models.
  • Task Flexibility and Chain-of-Thought (CoT) Processing: The model exhibits strong performance in handling tasks requiring CoT reasoning, leveraging the capability to decompose complex tasks into manageable sub-tasks, fostering efficient problem-solving.
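
For context, ASR results such as the 2.0% figure are word error rates (WER): the number of insertions, deletions, and substitutions needed to turn the hypothesis into the reference, divided by the number of reference words. A toy computation, assuming the third-party jiwer package, is shown below; the reported score comes from the full LibriSpeech test-clean set, not these strings.

```python
import jiwer

reference = "he hoped there would be stew for dinner"
hypothesis = "he hoped there would be stew for dinner turnips"

# One inserted word over 8 reference words -> 12.5% WER.
error_rate = jiwer.wer(reference, hypothesis)
print(f"WER: {error_rate:.1%}")
```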

Discussion and Implications

WavLLM showcases a compelling advancement in merging speech and language understanding, offering a more nuanced capability to generalize across various auditory and textual tasks. Its adaptable architecture presents potential use cases extending beyond speech transcription to include complex dialogues and multilingual task executions, underscoring its applicability in real-world voice-based AI applications.

The separation of speech encoding into semantic and acoustic content via dual encoders might signal a trend towards more compartmentalized processing architectures within multimodal models, paving the way for custom-tailored solutions in diverse application areas. The introduction of prompt-aware tuning mechanisms also highlights a growing recognition of the importance of contextual adaptation in enhancing model performance and reliability.

Future Directions

Future research could further optimize the interplay between the semantic and acoustic encoders, perhaps through more advanced fusion techniques. Extending such models to more diverse languages and dialects, and incorporating real-time adaptability in dynamic environments, will also be important. Integrating audio generation might additionally offer richer interactivity between users and models, creating a more seamless multimodal communication interface.

Overall, WavLLM highlights the ongoing evolution and potential of integrating speech processing capabilities into LLMs, with promising implications for both theoretical advancements and practical applications in the AI domain.

Authors (12)
  1. Shujie Hu
  2. Long Zhou
  3. Shujie Liu
  4. Sanyuan Chen
  5. Hongkun Hao
  6. Jing Pan
  7. Xunying Liu
  8. Jinyu Li
  9. Sunit Sivasankaran
  10. Linquan Liu
  11. Furu Wei
  12. Lingwei Meng