OmniFlatten: An End-to-end GPT Model for Seamless Voice Conversation (2410.17799v2)

Published 23 Oct 2024 in cs.CL, cs.AI, cs.SD, and eess.AS

Abstract: Full-duplex spoken dialogue systems significantly surpass traditional turn-based dialogue systems, as they allow simultaneous bidirectional communication, closely mirroring human-human interactions. However, achieving low latency and natural interactions in full-duplex dialogue systems remains a significant challenge, especially given human conversation dynamics such as interruptions, backchannels, and overlapping speech. In this paper, we introduce OmniFlatten, a novel end-to-end GPT-based model for full-duplex conversation, capable of effectively modeling the complex behaviors inherent to natural conversations with low latency. To achieve full-duplex conversation capabilities, we propose a multi-stage post-training scheme that progressively adapts a text LLM backbone into a speech-text dialogue LLM capable of generating text and speech in real time, without modifying the architecture of the backbone LLM. The training process comprises three stages: modality alignment, half-duplex dialogue learning, and full-duplex dialogue learning. In all training stages, we standardize the data using a flattening operation, which unifies the training methods and the GPT backbone across different modalities and tasks. Our approach offers a simple modeling technique and a promising research direction for developing efficient and natural end-to-end full-duplex spoken dialogue systems. Audio samples of dialogues generated by OmniFlatten are available at https://omniflatten.github.io/.
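The abstract names the flattening operation but does not spell it out, so a minimal sketch may help fix the idea. The Python snippet below shows one plausible reading: time-aligned chunks of each stream (incoming user speech tokens, outgoing assistant text and speech tokens) are interleaved into a single flat token sequence that an unmodified autoregressive GPT backbone can be trained on. Every identifier, special token, and chunk size here is a hypothetical illustration, not the paper's actual scheme.

```python
# Hypothetical sketch of the "flattening" idea from the abstract: parallel
# token streams (user speech in, assistant text and speech out) are
# serialized into one flat sequence so an unmodified GPT backbone can model
# them autoregressively. All token names, chunk sizes, and stream labels
# below are illustrative assumptions, not the paper's actual vocabulary.

from itertools import zip_longest

# Illustrative special tokens marking which stream a chunk belongs to.
USER_SPEECH = "<user_speech>"
ASSISTANT_TEXT = "<assistant_text>"
ASSISTANT_SPEECH = "<assistant_speech>"

def flatten_streams(user_speech, assistant_text, assistant_speech,
                    speech_chunk=5, text_chunk=2):
    """Interleave time-aligned chunks of each stream into one flat sequence.

    Chunk sizes are arbitrary here; in practice they would be chosen so
    chunks of each modality cover comparable spans of time, which is what
    lets a single decoder consume input and emit output in real time.
    """
    def chunks(tokens, size):
        return [tokens[i:i + size] for i in range(0, len(tokens), size)]

    flat = []
    for us, at, asp in zip_longest(chunks(user_speech, speech_chunk),
                                   chunks(assistant_text, text_chunk),
                                   chunks(assistant_speech, speech_chunk),
                                   fillvalue=[]):
        if us:
            flat += [USER_SPEECH, *us]
        if at:
            flat += [ASSISTANT_TEXT, *at]
        if asp:
            flat += [ASSISTANT_SPEECH, *asp]
    return flat

# Toy usage: speech tokens stand in for discrete codec/semantic token ids,
# text tokens for ordinary subwords.
print(flatten_streams(
    user_speech=[101, 102, 103, 104, 105, 106, 107],
    assistant_text=["Hel", "lo", "!"],
    assistant_speech=[201, 202, 203, 204, 205, 206],
))
```

Interleaving chunk by chunk, rather than concatenating whole utterances, is what would let the model emit output tokens while input speech is still arriving; this matches the low-latency, full-duplex behavior the abstract describes, including interruptions and overlapping speech.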
