- The paper introduces Moshi, a model that reimagines conversational AI by generating speech as tokens with a dual-stream LLM backbone.
- It achieves low latency, 160 ms theoretical and roughly 200 ms in practice, while delivering state-of-the-art performance in speech intelligibility, audio quality, and dialogue consistency.
- Moshi pioneers the 'Inner Monologue' approach by predicting time-aligned text alongside audio, setting the stage for future research in natural, interactive dialogue systems.
Moshi: A Speech-Text Foundation Model for Real-Time Dialogue
Spoken dialogue systems have traditionally relied on a sequential pipeline of independent components: voice activity detection, speech recognition, textual dialogue processing, and text-to-speech synthesis. This architecture introduces several critical inefficiencies: latency that accumulates across stages, loss of non-linguistic information such as emotion and prosody, and rigid speaker turn-taking. The paper "Moshi: a speech-text foundation model for real-time dialogue" addresses these issues by proposing a framework for real-time, full-duplex spoken dialogue.
Key Contributions
The authors introduce Moshi, a foundation model that casts spoken dialogue as speech-to-speech generation. Extending a text LLM backbone, Moshi generates speech as discrete tokens produced by a neural audio codec's residual vector quantizer. Modeling the user's audio stream and Moshi's own audio stream in parallel removes the need for explicit speaker turns and supports conversational dynamics such as interruptions and overlapping speech, as sketched below.
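To make the dual-stream idea concrete, here is a minimal sketch of how one modeling step might be laid out. The codebook count, frame rate, function names, and token values are illustrative assumptions, not the paper's exact configuration.

```python
# Hypothetical sketch of a dual-stream token layout: at each time step the
# backbone attends, in parallel, to the RVQ codebook tokens of Moshi's own
# audio and of the user's audio. Sizes and values below are assumptions.

NUM_CODEBOOKS = 8          # residual quantizer depth (assumed)
FRAME_RATE_HZ = 12.5       # codec frame rate (assumed)

def build_step(moshi_codes, user_codes):
    """Stack one time step: Moshi's and the user's RVQ tokens side by side.

    Both arguments are lists of NUM_CODEBOOKS integer codebook indices.
    Returns the flat token group the backbone would see for this step.
    """
    assert len(moshi_codes) == len(user_codes) == NUM_CODEBOOKS
    return moshi_codes + user_codes  # two parallel streams, no turn markers

# Example: one frame of dialogue where both sides are speaking at once.
step = build_step(moshi_codes=[17, 3, 250, 9, 88, 41, 5, 102],
                  user_codes=[64, 7, 12, 199, 23, 5, 77, 140])
print(len(step), "tokens for this step")  # 16 tokens, no explicit turn-taking
```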
Significantly, Moshi goes beyond existing models by predicting not only audio tokens but also time-aligned text tokens for its own speech. This method, termed "Inner Monologue," improves the linguistic quality of the generated speech and, as a by-product, enables streaming ASR and TTS.
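The following is a minimal, hypothetical sketch of what a time-aligned text-plus-audio layout could look like. Field names, the padding convention, and token values are assumptions made for illustration rather than the paper's exact format.

```python
# "Inner Monologue" sketch: at each codec frame, a time-aligned text token for
# Moshi's own speech is predicted before that frame's audio tokens, so the text
# acts as a scaffold for the audio. All values below are illustrative.

from dataclasses import dataclass

PAD = "<pad>"  # assumed padding symbol for frames where no new word piece starts

@dataclass
class Frame:
    text_token: str        # time-aligned word piece for this frame, or PAD
    audio_tokens: list     # RVQ codebook indices for this frame

moshi_stream = [
    Frame("Hel", [17, 3, 250, 9]),
    Frame("lo",  [64, 7, 12, 199]),
    Frame(PAD,   [23, 5, 77, 140]),   # the word keeps sounding, no new text
]

# Reading the same time-aligned layout in different directions gives the two
# by-products: forcing the audio and predicting the text behaves like streaming
# ASR, forcing the text and predicting the audio like streaming TTS.
transcript = [f.text_token for f in moshi_stream if f.text_token != PAD]
print(transcript)  # ['Hel', 'lo']
```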
Numerical Results and Claims
The paper reports a theoretical model latency of 160 ms and an observed practical latency of 200 ms, comparable to the gaps in natural human conversation and far below the multi-second delays typical of cascaded pipelines. The authors demonstrate the model's capabilities through various benchmarks, reporting state-of-the-art results across multiple dimensions, including speech intelligibility, audio quality, dialogue consistency, and spoken question answering.
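A quick back-of-the-envelope check of the 160 ms figure, assuming an 80 ms codec frame and a one-frame delay before acoustic tokens; both numbers are assumptions stated here for illustration, not results reproduced from the paper.

```python
# Rough latency arithmetic under assumed values: one 80 ms codec frame
# (12.5 Hz) plus a one-frame acoustic delay gives the 160 ms theoretical figure.
frame_ms = 80
acoustic_delay_frames = 1
theoretical_ms = frame_ms * (1 + acoustic_delay_frames)
print(theoretical_ms)  # 160; the ~40 ms gap to the practical figure is compute time
```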
Implications and Future Directions
Moshi’s approach of integrating audio and text into a hierarchical architecture opens up new avenues in AI-driven human-computer interactions. The model's potential to operate in real time while maintaining high-quality interactions represents a notable advancement toward more naturalistic conversational AI.
The paper suggests several promising directions for future research, such as refining the model's ability to handle diverse emotional expressions, adapting to different acoustic environments, and potentially expanding to multilingual settings.
Conclusion
Moshi's full-duplex speech generation presents a compelling solution to the limitations of traditional spoken dialogue systems. By addressing latency, preserving the richness of audio interactions, and integrating textual comprehension, the research contributes significantly to the field. The availability of the model on GitHub highlights the authors' commitment to fostering further advancements in this area. As spoken dialogue gains prominence across applications, Moshi's groundwork offers a robust foundation for next-generation conversational agents.