Freeze-Omni: A Smart and Low Latency Speech-to-speech Dialogue Model with Frozen LLM (2411.00774v5)
Abstract: Rapidly developing LLMs have enabled a wave of intelligent applications. In particular, GPT-4o's duplex speech interaction ability has delivered an impressive experience to users, and researchers have recently proposed several multi-modal LLMs that can achieve user-agent speech-to-speech conversation. This paper proposes a novel speech-text multimodal LLM architecture called Freeze-Omni. Our main contribution is that the speech input and output modalities can be easily connected to a textual LLM while keeping the LLM's parameters frozen throughout the training process. We design a three-stage training strategy for modeling both speech input and output, enabling Freeze-Omni to acquire speech-to-speech conversation ability using text-speech paired data (such as ASR and TTS data) and only 60,000 multi-round text Q&A samples on 8 GPUs. Moreover, this design effectively keeps the intelligence of Freeze-Omni in the speech modality at the same level as that of its backbone LLM in the text modality, while achieving low-latency end-to-end spoken responses. In addition, we design a method that achieves duplex dialogue ability through multi-task training, giving conversations between users and Freeze-Omni a more natural style. In summary, Freeze-Omni holds great potential for speech-to-speech dialogue based on a multimodal LLM with a frozen backbone, avoiding the catastrophic forgetting otherwise caused by limited data and training resources.
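The central trick the abstract describes, optimizing lightweight speech modules around a backbone whose weights never change, is easy to illustrate. Below is a minimal PyTorch sketch of that setup; the module names, dimensions, and loss are illustrative assumptions rather than the paper's implementation, and a small Transformer encoder stands in for the actual pretrained LLM.

```python
import torch
import torch.nn as nn

class SpeechAdapter(nn.Module):
    """Maps acoustic features (e.g., 80-dim filterbanks) into the LLM embedding space.
    Hypothetical stand-in for Freeze-Omni's speech encoder + adapter."""
    def __init__(self, feat_dim: int, llm_dim: int):
        super().__init__()
        self.rnn = nn.GRU(feat_dim, llm_dim, batch_first=True)
        self.proj = nn.Linear(llm_dim, llm_dim)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        hidden, _ = self.rnn(feats)   # (batch, frames, llm_dim)
        return self.proj(hidden)      # embeddings fed to the frozen LLM

llm_dim = 512
backbone = nn.TransformerEncoder(     # placeholder for the pretrained textual LLM
    nn.TransformerEncoderLayer(d_model=llm_dim, nhead=8, batch_first=True),
    num_layers=4,
)

# Core idea: freeze every backbone parameter so speech training cannot
# overwrite its text abilities (avoiding catastrophic forgetting).
for p in backbone.parameters():
    p.requires_grad = False
backbone.eval()

adapter = SpeechAdapter(feat_dim=80, llm_dim=llm_dim)
optimizer = torch.optim.AdamW(adapter.parameters(), lr=1e-4)

# One toy training step: gradients flow *through* the frozen backbone
# back into the adapter, but the backbone's weights never change.
feats = torch.randn(2, 100, 80)       # (batch, frames, feat_dim)
out = backbone(adapter(feats))
loss = out.pow(2).mean()              # placeholder loss for illustration
loss.backward()
optimizer.step()
```

Because the frozen weights receive no gradient updates, the backbone's text-domain behavior is preserved exactly, while gradients still propagate through it to train the speech-side modules; the paper applies the same principle to the speech output path.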
- GPT-4 technical report. arXiv preprint arXiv:2303.08774, 2023.
- wav2vec 2.0: A framework for self-supervised learning of speech representations. Advances in Neural Information Processing Systems, 33:12449–12460, 2020.
- Semantic parsing on Freebase from question-answer pairs. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 1533–1544, Seattle, Washington, USA, October 2013. Association for Computational Linguistics.
- AISHELL-1: An open-source Mandarin speech corpus and a speech recognition baseline. In Conference of the Oriental Chapter of the International Coordinating Committee on Speech Databases and Speech I/O Systems and Assessment, O-COCOSDA 2017, pages 1–5. IEEE, 2017.
- VALL-E 2: Neural codec language models are human parity zero-shot text-to-speech synthesizers. arXiv preprint arXiv:2406.05370, 2024.
- Qwen2-Audio technical report. arXiv preprint arXiv:2407.10759, 2024.
- Moshi: A speech-text foundation model for real-time dialogue. arXiv preprint arXiv:2410.00037, 2024.
- The Llama 3 herd of models. arXiv preprint arXiv:2407.21783, 2024.
- LLaMA-Omni: Seamless speech interaction with large language models. arXiv preprint arXiv:2409.06666, 2024.
- GPT-3: Its nature, scope, limits, and consequences. Minds and Machines, 30:681–694, 2020.
- VITA: Towards open-source interactive omni multimodal LLM. arXiv preprint arXiv:2408.05211, 2024.
- Paraformer: Fast and accurate parallel Transformer for non-autoregressive end-to-end speech recognition. In Proc. INTERSPEECH, pages 5079–5083. ISCA, 2022.
- Connectionist temporal classification: Labelling unsegmented sequence data with recurrent neural networks. In William W. Cohen and Andrew W. Moore, editors, International Conference on Machine Learning, ICML 2006, volume 148 of ACM International Conference Proceeding Series, pages 369–376. ACM, 2006.
- Mixtral of experts. arXiv preprint arXiv:2401.04088, 2024.
- TriviaQA: A large-scale distantly supervised challenge dataset for reading comprehension. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, Vancouver, Canada, July 2017. Association for Computational Linguistics.
- Fixing weight decay regularization in Adam. arXiv preprint arXiv:1711.05101, 2017.
- Spoken question answering and speech continuation using spectrogram-powered LLM. arXiv preprint arXiv:2305.15255, 2023.
- OpenAI. Hello GPT-4o. https://openai.com/index/hello-gpt-4o/, 2024.
- LibriSpeech: An ASR corpus based on public domain audio books. In International Conference on Acoustics, Speech and Signal Processing, ICASSP 2015, pages 5206–5210. IEEE, 2015.
- Fewer-token neural speech codec with time-invariant codes. arXiv preprint arXiv:2310.00014, 2023.
- Qwen Team. Qwen2.5: A party of foundation models, September 2024.
- Attention is all you need. In Isabelle Guyon, Ulrike von Luxburg, Samy Bengio, Hanna M. Wallach, Rob Fergus, S. V. N. Vishwanathan, and Roman Garnett, editors, Annual Conference on Neural Information Processing Systems, NeurIPS 2017, pages 5998–6008, 2017.
- Mini-Omni: Language models can hear, talk while thinking in streaming. arXiv preprint arXiv:2408.16725, 2024.
- Mini-Omni2: Towards open-source GPT-4o model with vision, speech and duplex. arXiv preprint arXiv:2410.11190, 2024.
- WeNet: Production oriented streaming and non-streaming end-to-end speech recognition toolkit. arXiv preprint arXiv:2102.01547, 2021.
- WenetSpeech: A 10000+ hours multi-domain Mandarin corpus for speech recognition. In International Conference on Acoustics, Speech and Signal Processing, ICASSP 2022, pages 6182–6186. IEEE, 2022.
- SpeechGPT: Empowering large language models with intrinsic cross-modal conversational abilities. arXiv preprint arXiv:2305.11000, 2023.