WavLLM: Towards Robust and Adaptive Speech Large Language Model (2404.00656v3)

Published 31 Mar 2024 in cs.CL, cs.AI, cs.SD, and eess.AS

Abstract: The recent advancements in LLMs have revolutionized the field of natural language processing, progressively broadening their scope to multimodal perception and generation. However, effectively integrating listening capabilities into LLMs poses significant challenges, particularly with respect to generalizing across varied contexts and executing complex auditory tasks. In this work, we introduce WavLLM, a robust and adaptive speech LLM with dual encoders, and a prompt-aware LoRA weight adapter, optimized by a two-stage curriculum learning approach. Leveraging dual encoders, we decouple different types of speech information, utilizing a Whisper encoder to process the semantic content of speech, and a WavLM encoder to capture the unique characteristics of the speaker's identity. Within the curriculum learning framework, WavLLM first builds its foundational capabilities by optimizing on mixed elementary single tasks, followed by advanced multi-task training on more complex tasks such as combinations of the elementary tasks. To enhance the flexibility and adherence to different tasks and instructions, a prompt-aware LoRA weight adapter is introduced in the second advanced multi-task training stage. We validate the proposed model on universal speech benchmarks including tasks such as ASR, ST, SV, ER, and also apply it to specialized datasets like Gaokao English listening comprehension set for SQA, and speech Chain-of-Thought (CoT) evaluation set. Experiments demonstrate that the proposed model achieves state-of-the-art performance across a range of speech tasks on the same model size, exhibiting robust generalization capabilities in executing complex tasks using CoT approach. Furthermore, our model successfully completes Gaokao tasks without specialized training. The codes, models, audio, and Gaokao evaluation set can be accessed at \url{aka.ms/wavLLM}.

Overview of "WavLLM: Towards Robust and Adaptive Speech LLM"

The paper introduces WavLLM, a speech-enabled LLM designed to bring listening capabilities to LLMs for improved multimodal understanding and task execution. The authors present a novel architecture that combines dual speech encoders with a prompt-aware Low-Rank Adaptation (LoRA) weight adapter, optimized through a two-stage curriculum learning approach, to support robust and versatile auditory task processing.

Methodology

Model Architecture:

  • Dual Encoders: WavLLM integrates the Whisper and WavLM encoders to process different facets of the speech signal separately. Whisper primarily captures semantic content, while WavLM captures acoustic attributes such as speaker identity; a minimal fusion sketch follows this list.
  • Prompt-aware LoRA Adapter: This component modulates the LoRA weights in response to the instruction prompt, enhancing adaptability and performance across diverse tasks and input types.
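
The following is a minimal sketch, in PyTorch, of how such a dual-encoder front end could feed a decoder-only LLM. The module and dimension names (WavLLMFusion, SEMANTIC_DIM, ACOUSTIC_DIM, LLM_DIM) are illustrative assumptions rather than the released implementation, and concatenating along the time axis is only one simple fusion choice.

```python
import torch
import torch.nn as nn

SEMANTIC_DIM = 1280   # e.g. a Whisper encoder hidden size (assumption)
ACOUSTIC_DIM = 1024   # e.g. WavLM Large hidden size (assumption)
LLM_DIM = 4096        # e.g. LLaMA-2-7B embedding size (assumption)


class WavLLMFusion(nn.Module):
    """Project semantic (Whisper) and acoustic (WavLM) features into the
    LLM embedding space and join them into one audio-prompt sequence."""

    def __init__(self):
        super().__init__()
        self.semantic_adapter = nn.Linear(SEMANTIC_DIM, LLM_DIM)
        self.acoustic_adapter = nn.Linear(ACOUSTIC_DIM, LLM_DIM)

    def forward(self, whisper_feats: torch.Tensor, wavlm_feats: torch.Tensor) -> torch.Tensor:
        # whisper_feats: (batch, T_sem, SEMANTIC_DIM) -- semantic content
        # wavlm_feats:   (batch, T_aco, ACOUSTIC_DIM) -- speaker/acoustic cues
        sem = self.semantic_adapter(whisper_feats)
        aco = self.acoustic_adapter(wavlm_feats)
        # One simple fusion choice: concatenate along the time axis; the
        # result would be prepended to the embedded text prompt before the
        # (largely frozen) LLM decoder.
        return torch.cat([sem, aco], dim=1)
```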

Training Approach:

  • Two-Stage Curriculum Learning: The training regimen begins with simpler tasks, aiming to build foundational capabilities by leveraging synthesized spoken question-answering datasets and other elementary speech tasks such as automatic speech recognition (ASR), speech translation (ST), speaker verification (SV), and emotion recognition (ER).
  • The second stage involves advanced multi-task learning with combinations of the elementary tasks and introduces the prompt-aware LoRA adapter (sketched after this list) to further refine execution across mixed tasks and instructions.
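
Below is a minimal sketch of what a prompt-aware LoRA layer could look like; the gating network, its dimensions, and the scalar-gate design are hypothetical, and the paper's actual adapter may condition the low-rank update differently.

```python
import torch
import torch.nn as nn


class PromptAwareLoRALinear(nn.Module):
    """A frozen linear layer plus a low-rank (LoRA) update whose strength
    is gated by an embedding of the instruction prompt."""

    def __init__(self, in_dim: int, out_dim: int, rank: int = 8, prompt_dim: int = 4096):
        super().__init__()
        self.base = nn.Linear(in_dim, out_dim, bias=False)  # pretrained LLM weight, kept frozen
        self.base.weight.requires_grad_(False)
        self.lora_a = nn.Linear(in_dim, rank, bias=False)    # low-rank down-projection
        self.lora_b = nn.Linear(rank, out_dim, bias=False)   # low-rank up-projection
        # Hypothetical prompt adapter: pooled prompt embedding -> scalar gate in (0, 1).
        self.prompt_gate = nn.Sequential(
            nn.Linear(prompt_dim, 256), nn.ReLU(), nn.Linear(256, 1), nn.Sigmoid()
        )

    def forward(self, x: torch.Tensor, prompt_embedding: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, in_dim); prompt_embedding: (batch, prompt_dim)
        gate = self.prompt_gate(prompt_embedding).unsqueeze(1)  # (batch, 1, 1)
        return self.base(x) + gate * self.lora_b(self.lora_a(x))
```

In this sketch only the low-rank matrices and the gate are trainable while the base weight stays frozen, in keeping with the parameter-efficient spirit of LoRA-based adaptation.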

Experimental Evaluation

The model is evaluated on a range of single-task and multi-task benchmarks:

  • Single-Task Performance: WavLLM demonstrates state-of-the-art results on tasks like ASR, with a WER of 2.0% on the LibriSpeech test-clean dataset, outperforming comparable models.
  • Task Flexibility and Chain-of-Thought (CoT) Processing: The model exhibits strong performance in handling tasks requiring CoT reasoning, leveraging the capability to decompose complex tasks into manageable sub-tasks, fostering efficient problem-solving.
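
For context, ASR results such as the 2.0% figure are word error rates (WER): the number of insertions, deletions, and substitutions needed to turn the hypothesis into the reference, divided by the number of reference words. A toy computation, assuming the third-party jiwer package, is shown below; the reported score comes from the full LibriSpeech test-clean set, not these strings.

```python
import jiwer

reference = "he hoped there would be stew for dinner"
hypothesis = "he hoped there would be stew for dinner turnips"

# One inserted word over 8 reference words -> 12.5% WER.
error_rate = jiwer.wer(reference, hypothesis)
print(f"WER: {error_rate:.1%}")
```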

Discussion and Implications

WavLLM showcases a compelling advancement in merging speech and language understanding, offering a more nuanced capability to generalize across various auditory and textual tasks. Its adaptable architecture presents potential use cases extending beyond speech transcription to include complex dialogues and multilingual task executions, underscoring its applicability in real-world voice-based AI applications.

The separation of speech encoding into semantic and acoustic content via dual encoders might signal a trend towards more compartmentalized processing architectures within multimodal models, paving the way for custom-tailored solutions in diverse application areas. The introduction of prompt-aware tuning mechanisms also highlights a growing recognition of the importance of contextual adaptation in enhancing model performance and reliability.

Future Directions

Future research could further optimize the interplay between the semantic and acoustic encoders, perhaps through more advanced fusion techniques. Extending such models to more diverse languages and dialects, and incorporating real-time adaptability in dynamic environments, will also be important. Integrating audio generation might additionally offer richer interactivity between users and models, creating a more seamless multimodal communication interface.

Overall, WavLLM highlights the ongoing evolution and potential of integrating speech processing capabilities into LLMs, with promising implications for both theoretical advancements and practical applications in the AI domain.

Authors (12)
  1. Shujie Hu
  2. Long Zhou
  3. Shujie Liu
  4. Sanyuan Chen
  5. Hongkun Hao
  6. Jing Pan
  7. Xunying Liu
  8. Jinyu Li
  9. Sunit Sivasankaran
  10. Linquan Liu
  11. Furu Wei
  12. Lingwei Meng