
BLSP-KD: Bootstrapping Language-Speech Pre-training via Knowledge Distillation (2405.19041v1)

Published 29 May 2024 in cs.CL, cs.SD, and eess.AS

Abstract: Recent end-to-end approaches have shown promise in extending LLMs to speech inputs, but face limitations in directly assessing and optimizing alignment quality and fail to achieve fine-grained alignment due to speech-text length mismatch. We introduce BLSP-KD, a novel approach for Bootstrapping Language-Speech Pre-training via Knowledge Distillation, which addresses these limitations through two key techniques. First, it optimizes speech-text alignment by minimizing the divergence between the LLM's next-token prediction distributions for speech and text inputs using knowledge distillation. Second, it employs a continuous integrate-and-fire strategy to segment speech into tokens that correspond one-to-one with text tokens, enabling fine-grained alignment. We also introduce Partial LoRA (PLoRA), a new adaptation method supporting LLM finetuning for speech inputs under knowledge distillation. Quantitative evaluation shows that BLSP-KD outperforms previous end-to-end baselines and cascaded systems of comparable parameter scale, facilitating general instruction-following capabilities for LLMs with speech inputs. This approach provides new possibilities for extending LLMs to spoken language interactions.
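The first key technique can be sketched concretely. Below is a minimal, illustrative NumPy implementation of a token-level distillation loss of the kind the abstract describes: the KL divergence between the LLM's next-token distributions for text input (teacher) and speech input (student), averaged over positions. The function name, shapes, and the plain KL formulation are assumptions for illustration, not the paper's exact loss; it also assumes the continuous integrate-and-fire step has already produced one speech token per text token, so the two logit sequences align one-to-one.

```python
import numpy as np

def softmax(logits, axis=-1):
    # Numerically stable softmax over the vocabulary axis.
    z = logits - logits.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def kd_alignment_loss(text_logits, speech_logits):
    """Mean per-token KL(teacher || student).

    text_logits, speech_logits: (seq_len, vocab) arrays of the LLM's
    next-token logits for the text (teacher) and speech (student)
    inputs. Assumes CIF segmentation has aligned the sequences
    one-to-one; names and shapes are illustrative.
    """
    p = softmax(text_logits)                      # teacher distribution
    log_p = np.log(p + 1e-12)                     # small epsilon for stability
    log_q = np.log(softmax(speech_logits) + 1e-12)
    kl = (p * (log_p - log_q)).sum(axis=-1)       # KL divergence per position
    return kl.mean()
```

When the speech-input logits match the text-input logits exactly, the loss is zero; any mismatch yields a positive value, giving a direct, differentiable measure of alignment quality that training can minimize.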

Authors (4)
  1. Chen Wang (599 papers)
  2. Minpeng Liao (11 papers)
  3. Zhongqiang Huang (20 papers)
  4. Jiajun Zhang (176 papers)
Citations (4)