
A Survey on Speech Large Language Models (2410.18908v2)

Published 24 Oct 2024 in eess.AS

Abstract: LLMs exhibit strong contextual understanding and remarkable multi-task performance. Therefore, researchers have been seeking to integrate LLMs into the broader field of Spoken Language Understanding (SLU). Unlike the traditional method of cascading LLMs to process text generated by Automatic Speech Recognition (ASR), new efforts have focused on designing architectures centered around Audio Feature Extraction - Multimodal Information Fusion - LLM Inference (Speech LLMs). This approach enables richer audio feature extraction while simultaneously facilitating end-to-end fusion of audio and text modalities, thereby achieving deeper understanding and reasoning from audio data. This paper elucidates the development of Speech LLMs, offering an in-depth analysis of system architectures and training strategies. Through extensive research and a series of targeted experiments, the paper assesses Speech LLMs' advancements in Rich Audio Transcription and their potential for Cross-task Integration within the SLU field. Additionally, it identifies key challenges uncovered through experimentation, such as the dormancy of LLMs under certain conditions. The paper further delves into training strategies for Speech LLMs, proposing potential solutions based on these findings and offering valuable insights and references for future research in this domain, as well as for LLM applications in multimodal contexts.

A Survey on Speech LLMs

This paper provides a detailed examination of the integration of LLMs within the domain of Spoken Language Understanding (SLU). The authors present a comprehensive analysis of Speech LLMs, delineating their evolution, architecture, training strategies, and performance on various SLU tasks. They focus on advancing SLU capabilities by leveraging LLMs for enhanced audio feature extraction and multimodal fusion.

Architectural Developments

The paper decomposes Speech LLM architectures into three stages: Audio Feature Extraction, Multimodal Information Fusion, and LLM Inference. These models integrate the audio and text modalities to achieve comprehensive understanding. Moving beyond traditional approaches that simply cascade Automatic Speech Recognition (ASR) output into an LLM, Speech LLMs adopt direct processing architectures, realized through methods such as Direct Projection and Token Mapping, which fold audio features seamlessly into the LLM's text feature space, as sketched below.
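
To make the Direct Projection idea concrete, here is a minimal sketch (not the paper's implementation): hidden states from a pretrained audio encoder are mapped by a small trainable adapter into the LLM's embedding space and concatenated with the text prompt embeddings. Dimensions and module names are illustrative assumptions.

```python
# Direct Projection sketch: audio encoder features -> LLM embedding space.
import torch
import torch.nn as nn

class AudioProjector(nn.Module):
    """Trainable adapter that projects audio features into the LLM embedding space."""
    def __init__(self, audio_dim: int = 1024, llm_dim: int = 4096):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(audio_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, audio_feats: torch.Tensor) -> torch.Tensor:
        # audio_feats: (batch, n_frames, audio_dim), e.g. from a Whisper/HuBERT encoder
        return self.proj(audio_feats)  # (batch, n_frames, llm_dim)

# Fusion: treat projected audio frames as soft "tokens" and prepend them to the
# text prompt embeddings; the combined sequence then feeds the LLM as usual.
projector = AudioProjector()
audio_feats = torch.randn(2, 150, 1024)   # stand-in encoder output
text_embeds = torch.randn(2, 32, 4096)    # stand-in prompt embeddings
fused = torch.cat([projector(audio_feats), text_embeds], dim=1)
print(fused.shape)  # torch.Size([2, 182, 4096])
```

The adapter is typically the only newly initialized component, which keeps modality alignment cheap relative to retraining either backbone.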

Training Strategies

Three primary training strategies are outlined: pretraining, supervised fine-tuning (SFT), and reinforcement learning (RL). Pretraining instills broad knowledge from unlabeled data, while SFT hones models on specific tasks. Reinforcement learning is noted as a promising but underexplored direction for optimizing cross-task performance in SLU. A typical SFT setup is sketched after this paragraph.
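
As a hedged illustration of how these strategies are commonly applied in Speech LLMs, the sketch below freezes the pretrained audio encoder and LLM and updates only the projection adapter during SFT. The helper names and the Hugging-Face-style `inputs_embeds`/`labels` interface are assumptions for illustration, not the paper's recipe.

```python
# SFT sketch: train only the modality adapter; keep pretrained backbones frozen.
import torch

def configure_sft(encoder, projector, llm, lr: float = 1e-4):
    """Freeze pretrained components; make only the adapter trainable."""
    for module, trainable in ((encoder, False), (llm, False), (projector, True)):
        for p in module.parameters():
            p.requires_grad = trainable
    return torch.optim.AdamW(projector.parameters(), lr=lr)

def sft_step(batch, encoder, projector, llm, optimizer):
    """One supervised step: next-token loss on the transcript given fused inputs."""
    with torch.no_grad():                    # encoder is frozen
        feats = encoder(batch["audio"])      # (batch, n_frames, audio_dim)
    fused = torch.cat([projector(feats), batch["text_embeds"]], dim=1)
    # labels are set to -100 over the audio span so the loss covers only text tokens
    loss = llm(inputs_embeds=fused, labels=batch["labels"]).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()
```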

Performance and Challenges

The paper highlights significant achievements on traditional tasks such as ASR and speech translation, where Speech LLMs often outperform existing models. It also underscores open challenges, notably high computational costs and LLM dormancy: when applied to speech tasks, the LLM can underutilize its reasoning and knowledge, which the authors attribute to imperfect audio-text alignment and flag as a target for further optimization.

Future Directions

The survey concludes by proposing future directions that emphasize refining modality alignment and expanding the application scope of Speech LLMs. The authors suggest augmenting the LLM token space with discrete audio units and introducing more dynamic reinforcement learning techniques. They foresee Speech LLMs deployed as integral components of complex multimodal systems, leveraging their robust contextual reasoning for richer interaction and processing. A sketch of token-space augmentation follows.
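
As one hedged illustration of token-space augmentation, discrete audio units (e.g. from a speech tokenizer or neural codec) could be registered as new vocabulary entries so that speech and text share a single autoregressive token stream. The base model, tokenizer, and unit count below are illustrative assumptions, not the paper's configuration.

```python
# Token-space augmentation sketch: extend an LLM vocabulary with discrete audio units.
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

# Suppose a speech tokenizer yields 1024 discrete units; register them as new tokens.
audio_tokens = [f"<audio_{i}>" for i in range(1024)]
tokenizer.add_tokens(audio_tokens)
model.resize_token_embeddings(len(tokenizer))  # adds fresh embedding rows

# Speech can now be interleaved with text as ordinary token ids.
ids = tokenizer("Transcribe: <audio_3><audio_17><audio_3>", return_tensors="pt")
print(ids["input_ids"].shape)
```

The newly added embedding rows start untrained and must be learned during subsequent fine-tuning on paired speech-text data.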

Implications

The paper's findings have practical implications, suggesting pathways for developing more efficient and effective SLU systems. Theoretical implications include deepening our understanding of multimodal fusion in LLMs and exploring novel architectural adaptations. The proposed solutions aim to mitigate current limitations, enhancing model performance and utility across various applications.

Overall, this survey provides a well-rounded analysis of Speech LLM advancements and sets a foundation for future research within SLU and multimodal contexts.

Authors (6)
  1. Jing Peng
  2. Yucheng Wang
  3. Yu Xi
  4. Kai Yu
  5. Xu Li
  6. Xizhuo Zhang