
Moshi: a speech-text foundation model for real-time dialogue

Published 17 Sep 2024 in eess.AS, cs.AI, cs.CL, cs.LG, and cs.SD | (arXiv:2410.00037v2)

Abstract: We introduce Moshi, a speech-text foundation model and full-duplex spoken dialogue framework. Current systems for spoken dialogue rely on pipelines of independent components, namely voice activity detection, speech recognition, textual dialogue and text-to-speech. Such frameworks cannot emulate the experience of real conversations. First, their complexity induces a latency of several seconds between interactions. Second, text being the intermediate modality for dialogue, non-linguistic information that modifies meaning -- such as emotion or non-speech sounds -- is lost in the interaction. Finally, they rely on a segmentation into speaker turns, which does not take into account overlapping speech, interruptions and interjections. Moshi solves these independent issues altogether by casting spoken dialogue as speech-to-speech generation. Starting from a text LLM backbone, Moshi generates speech as tokens from the residual quantizer of a neural audio codec, while modeling separately its own speech and that of the user into parallel streams. This allows for the removal of explicit speaker turns, and the modeling of arbitrary conversational dynamics. We moreover extend the hierarchical semantic-to-acoustic token generation of previous work to first predict time-aligned text tokens as a prefix to audio tokens. Not only this "Inner Monologue" method significantly improves the linguistic quality of generated speech, but we also illustrate how it can provide streaming speech recognition and text-to-speech. Our resulting model is the first real-time full-duplex spoken LLM, with a theoretical latency of 160ms, 200ms in practice, and is available at https://github.com/kyutai-labs/moshi.

Summary

  • The paper introduces Moshi, a unified speech-to-speech model that significantly reduces dialogue latency by integrating linguistic, semantic, and acoustic cues.
  • It leverages a 7-billion-parameter Helium language model, the Mimi neural audio codec, and a hierarchical RQ-Transformer for efficient multi-stream audio processing.
  • The model demonstrates state-of-the-art performance in spoken question-answering and real-time dialogue applications, paving the way for next-gen conversational AI.

Moshi: A Speech-Text Foundation Model for Real-Time Dialogue

Abstract

The introduction of Moshi marks a significant advancement in spoken dialogue systems. Moshi is designed as a full-duplex spoken dialogue framework that addresses the inherent limitations of existing systems, such as latency and the loss of non-linguistic information. By casting spoken dialogue as a speech-to-speech generation task, Moshi builds on a text LLM backbone to generate speech directly as tokens of a neural audio codec, enabling real-time interaction without explicit segmentation into speaker turns.

Introduction

Moshi rethinks voice interaction by moving beyond traditional pipelined architectures built from independent components such as ASR, NLU, and TTS, which introduce latency and information loss. Instead, Moshi treats dialogue as a unified speech-to-speech generation process, integrating non-verbal cues such as emotion and environmental sounds into the interaction model. This capability is crucial for achieving natural conversation flow and for handling interruptions and overlapping speech, which are common in real dialogues but inadequately handled by current systems (Figure 1).

Figure 1: Overview of Moshi. Moshi is a speech-text foundation model which enables real-time spoken dialogue.

Key Components

Helium LLM

The core of Moshi's architecture is Helium, a 7-billion-parameter text LLM pre-trained on a large corpus of English text. Helium follows a standard decoder-only transformer design, employing RMSNorm for training stability and RoPE for position encoding over a 4,096-token context. This model serves as the backbone of Moshi, providing its linguistic capabilities.
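The two architectural choices the summary names, RMSNorm and RoPE, can be illustrated with a short sketch. This is not the official Helium implementation; the dimensions below are illustrative placeholders rather than the published 7B configuration.

```python
# Minimal sketch (not the official code) of RMSNorm and rotary position
# embeddings (RoPE), the two components attributed to Helium above.
import torch
import torch.nn as nn


class RMSNorm(nn.Module):
    """Rescale activations by their root-mean-square instead of mean/variance."""

    def __init__(self, dim: int, eps: float = 1e-5):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        rms = x.pow(2).mean(dim=-1, keepdim=True).add(self.eps).rsqrt()
        return x * rms * self.weight


def apply_rope(q: torch.Tensor, k: torch.Tensor, base: float = 10000.0):
    """Apply rotary position embeddings to query/key tensors of shape
    (batch, seq_len, heads, head_dim), using the rotate-half convention."""
    _, seq_len, _, head_dim = q.shape
    half = head_dim // 2
    inv_freq = 1.0 / (base ** (torch.arange(0, half, dtype=torch.float32) / half))
    angles = torch.arange(seq_len, dtype=torch.float32)[:, None] * inv_freq[None, :]
    cos, sin = angles.cos(), angles.sin()  # (seq_len, half)

    def rotate(x):
        x1, x2 = x[..., :half], x[..., half:]
        c = cos[None, :, None, :]
        s = sin[None, :, None, :]
        return torch.cat([x1 * c - x2 * s, x1 * s + x2 * c], dim=-1)

    return rotate(q), rotate(k)
```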

Mimi Neural Audio Codec

Mimi is a novel neural audio codec that uses residual vector quantization to transform audio into discrete tokens capturing both semantic and acoustic information. During training, Mimi distills semantic information from a non-causal self-supervised speech model into its first quantization level, while the codec itself remains causal, so audio can be encoded and decoded in a streaming, real-time fashion (Figure 2).

Figure 2: Architecture and training of Mimi, our neural audio codec, with its split residual vector quantization.
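The core mechanism here, residual vector quantization, can be sketched in a few lines: each quantizer encodes the residual left by the previous one, producing one token per codebook per frame. The codebook sizes and the number of quantizers below are illustrative, not Mimi's actual configuration, and the semantic-distillation objective on the first level is only noted as a comment.

```python
# Toy residual vector quantization (RVQ) encoder, for illustration only.
import torch


def rvq_encode(frames: torch.Tensor, codebooks: list[torch.Tensor]):
    """frames: (num_frames, dim); codebooks: list of (codebook_size, dim).
    Returns token indices of shape (num_frames, num_codebooks) and the
    quantized reconstruction."""
    residual = frames
    recon = torch.zeros_like(frames)
    tokens = []
    for codebook in codebooks:
        # Nearest codeword to the current residual (Euclidean distance).
        dists = torch.cdist(residual, codebook)      # (num_frames, codebook_size)
        idx = dists.argmin(dim=-1)                   # (num_frames,)
        chosen = codebook[idx]                       # (num_frames, dim)
        tokens.append(idx)
        recon = recon + chosen
        residual = residual - chosen                 # next level refines what is left
    # Per the paper, Mimi additionally distills the first level toward embeddings
    # from a non-causal self-supervised model so that it carries semantic content.
    return torch.stack(tokens, dim=-1), recon


# Toy usage: 10 frames of 8-dim features, 4 codebooks of 16 entries each.
frames = torch.randn(10, 8)
codebooks = [torch.randn(16, 8) for _ in range(4)]
tokens, recon = rvq_encode(frames, codebooks)
print(tokens.shape)  # torch.Size([10, 4])
```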

RQ-Transformer for Hierarchical Audio Modeling

Moshi generates audio hierarchically with the RQ-Transformer: a large Temporal Transformer models the sequence across timesteps, while a small Depth Transformer, conditioned on the temporal context, generates the multiple codebook tokens within each timestep. This hierarchical design avoids flattening all codebooks into one long sequence, allowing Moshi to model long audio sequences efficiently and to sustain real-time dialogue without compromising linguistic coherence (Figure 3).

Figure 3: Architecture of the RQ-Transformer. The RQ-Transformer breaks down a flattened sequence into manageable steps for multi-scale processing.
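The decoding loop this implies can be sketched as follows: one temporal step produces a context vector, and the Depth Transformer then emits the codebook tokens for that step one after another. The module sizes, vocabularies, and greedy sampling below are placeholders for illustration, not the paper's configuration.

```python
# Illustrative hierarchical decoding loop in the spirit of the RQ-Transformer.
import torch
import torch.nn as nn

DIM, NUM_CODEBOOKS, VOCAB = 64, 4, 128

temporal = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=DIM, nhead=4, batch_first=True), num_layers=2
)
depth = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=DIM, nhead=4, batch_first=True), num_layers=1
)
embed = nn.Embedding(VOCAB, DIM)   # shared token embedding (illustrative)
to_logits = nn.Linear(DIM, VOCAB)


def generate(num_steps: int):
    """Greedy hierarchical decoding: one temporal step, then K depth steps."""
    timestep_inputs = [torch.zeros(1, 1, DIM)]  # start-of-sequence embedding
    all_tokens = []
    for _ in range(num_steps):
        # The Temporal Transformer attends over all previous timesteps.
        context = temporal(torch.cat(timestep_inputs, dim=1))[:, -1:, :]  # (1, 1, DIM)
        # The Depth Transformer generates this timestep's codebook tokens in turn.
        depth_input, step_tokens = context, []
        for _ in range(NUM_CODEBOOKS):
            h = depth(depth_input)[:, -1, :]                    # (1, DIM)
            token = to_logits(h).argmax(dim=-1)                 # greedy pick
            step_tokens.append(token.item())
            depth_input = torch.cat([depth_input, embed(token)[:, None, :]], dim=1)
        all_tokens.append(step_tokens)
        # Feed a summary of this timestep's tokens back to the temporal model.
        step_embed = embed(torch.tensor(step_tokens)).mean(dim=0)[None, None, :]
        timestep_inputs.append(step_embed)
    return all_tokens


print(generate(num_steps=3))  # three timesteps, NUM_CODEBOOKS tokens each
```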

Inner Monologue and Multi-Stream Audio Modeling

Inner Monologue gives Moshi a training and inference setup in which time-aligned text tokens are predicted as a prefix to the audio tokens at each step. This not only improves the linguistic quality of generated speech but also combines with multi-stream modeling, which lets Moshi represent its own voice and the user's input as parallel audio streams within a single fluid conversational exchange (Figure 4).

Figure 4: Representation of the joint sequence modeled by Moshi.
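A small sketch of the per-timestep token layout this implies: at each audio frame the model jointly handles one time-aligned text token for Moshi, Moshi's own codec tokens, and the user's codec tokens as a parallel stream. The exact ordering, per-stream delays, and padding token below are assumptions for illustration, not the paper's precise layout.

```python
# Illustrative layout of the joint sequence combining Inner Monologue text
# tokens with two parallel audio streams (Moshi and the user).
from dataclasses import dataclass

TEXT_PAD = -1  # placeholder id when no text token is aligned to this frame


@dataclass
class Frame:
    """Tokens the model predicts jointly for one audio step (~80 ms at 12.5 Hz)."""
    moshi_text: int          # time-aligned text token (Inner Monologue prefix)
    moshi_audio: list[int]   # Moshi's codec tokens, semantic level first
    user_audio: list[int]    # the user's codec tokens, modeled as a parallel stream


def flatten(frames: list[Frame]) -> list[int]:
    """Flatten frames into the single joint sequence a decoder would model,
    keeping the text token ahead of the audio tokens within each step."""
    seq: list[int] = []
    for f in frames:
        seq.append(f.moshi_text)
        seq.extend(f.moshi_audio)
        seq.extend(f.user_audio)
    return seq


# Toy example with 2 codebooks per stream.
frames = [
    Frame(moshi_text=42, moshi_audio=[7, 93], user_audio=[3, 55]),
    Frame(moshi_text=TEXT_PAD, moshi_audio=[12, 8], user_audio=[61, 4]),
]
print(flatten(frames))  # [42, 7, 93, 3, 55, -1, 12, 8, 61, 4]
```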

Evaluation and Implications

Moshi's architecture demonstrates superior performance in both traditional benchmarks and new metrics devised for speech-text models. It achieves state-of-the-art results in spoken question answering while maintaining the capability to produce coherent and contextually rich dialogues. The extension of Moshi to tasks such as streaming ASR and TTS further illustrates its versatility, introducing new paradigms for real-time speech processing.

Conclusion

Moshi stands as a pioneering speech-text foundation model, pushing the boundaries of what is feasible in real-time dialogue systems. Its ability to integrate linguistic, semantic, and acoustic information in a cohesive framework represents a notable step towards truly natural conversational AI. With its publicly available model, Moshi lays the groundwork for future innovations in AI-driven spoken dialogue systems (Figure 5).

Figure 5: Representation of the joint sequence modeled by Moshi in different contexts.

Moshi's release is intended to foster further research and development in speech-to-speech interaction models, inviting exploration into its application across a variety of real-world scenarios. The potential for extending Moshi's methodologies to other languages and contexts remains an open and exciting avenue for future exploration.
