BLSP-KD: Bootstrapping Language-Speech Pre-training via Knowledge Distillation (2405.19041v1)
Abstract: Recent end-to-end approaches have shown promise in extending LLMs to speech inputs, but they face limitations in directly assessing and optimizing alignment quality and fail to achieve fine-grained alignment due to the speech-text length mismatch. We introduce BLSP-KD, a novel approach for Bootstrapping Language-Speech Pre-training via Knowledge Distillation, which addresses these limitations through two key techniques. First, it optimizes speech-text alignment by minimizing the divergence between the LLM's next-token prediction distributions for speech and text inputs using knowledge distillation. Second, it employs a continuous integrate-and-fire strategy to segment speech into tokens that correspond one-to-one with text tokens, enabling fine-grained alignment. We also introduce Partial LoRA (PLoRA), a new adaptation method supporting LLM fine-tuning for speech inputs under knowledge distillation. Quantitative evaluation shows that BLSP-KD outperforms previous end-to-end baselines and cascaded systems of comparable parameter scale, facilitating general instruction-following capabilities for LLMs with speech inputs. This approach provides new possibilities for extending LLMs to spoken language interactions.
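The two core ideas in the abstract can be illustrated with a minimal NumPy sketch: a continuous integrate-and-fire (CIF) step that accumulates per-frame weights and emits one integrated vector per text token, and a token-level knowledge-distillation loss that measures the KL divergence between the LLM's next-token distributions for text (teacher) and speech (student) inputs. This is a simplified illustration under stated assumptions, not the authors' implementation; all function names, shapes, and the threshold value are illustrative.

```python
import numpy as np

def softmax(logits, axis=-1):
    """Numerically stable softmax over the vocabulary axis."""
    z = logits - logits.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def cif_segment(frames, alphas, threshold=1.0):
    """Continuous integrate-and-fire (sketch): accumulate per-frame
    weights `alphas` and fire one integrated vector each time the
    accumulator crosses `threshold`, producing token-rate features
    that can align one-to-one with text tokens."""
    tokens, acc_w = [], 0.0
    acc_v = np.zeros(frames.shape[1])
    for h, a in zip(frames, alphas):
        if acc_w + a >= threshold:
            r = threshold - acc_w            # weight that completes this token
            tokens.append(acc_v + r * h)     # fire an integrated token vector
            acc_w = a - r                    # leftover weight starts the next token
            acc_v = (a - r) * h
        else:
            acc_w += a
            acc_v = acc_v + a * h
    return np.stack(tokens) if tokens else np.zeros((0, frames.shape[1]))

def kd_alignment_loss(text_logits, speech_logits):
    """Token-level KD objective (sketch): mean KL(p_text || p_speech)
    over aligned positions, where both logit arrays have shape
    (num_tokens, vocab_size) thanks to the CIF segmentation above."""
    p = softmax(text_logits)                       # teacher distributions
    log_p = np.log(p + 1e-12)
    log_q = np.log(softmax(speech_logits) + 1e-12) # student log-probs
    return float((p * (log_p - log_q)).sum(axis=-1).mean())
```

The loss is zero when the speech-conditioned distributions exactly match the text-conditioned ones and positive otherwise, giving a directly optimizable measure of alignment quality.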
- Chen Wang
- Minpeng Liao
- Zhongqiang Huang
- Jiajun Zhang