Moshi: Unified Speech-Text Dialogue Model
- Moshi is a full-duplex speech-text foundation model that unifies ASR, LLM, and TTS into a single autoregressive architecture for natural, overlapping dialogue.
- It employs a hierarchical multi-stream generative framework and an inner monologue method to reduce latency and preserve paralinguistic cues.
- The model responds in real time (~200 ms end-to-end latency) and delivers competitive streaming ASR and TTS plus strong spoken QA results from a single model.
Moshi is a speech-text foundation model and full-duplex spoken dialogue framework that redefines machine spoken interaction by replacing the cascaded pipeline architecture with a unified, autoregressive, speech-to-speech model. Traditional spoken dialogue systems chain independent voice activity detectors, automatic speech recognition (ASR), LLMs, and text-to-speech (TTS) modules, which compounds latency to multiple seconds, discards non-linguistic and paralinguistic cues, and enforces strict one-at-a-time turn-taking. Moshi instead listens and speaks continuously, models overlapping speech, interruptions, and interjections, and emits output in real time with a practical end-to-end latency of approximately 200 ms. It introduces a hierarchical, multi-stream generative framework built upon a large text LLM backbone, a streaming neural audio codec with residual quantization, and a parallel tokenization scheme for both system and user speech, enabling natural, realistic conversational dynamics (Défossez et al., 17 Sep 2024).
1. Motivation and Limitations of Pipeline Approaches
Conventional spoken dialogue systems operate as a pipeline: wake-word detection, ASR, textual LLM, and TTS. These cascaded architectures introduce three principal limitations:
- Latency Compounding: Round-trip dialogue latency accumulates across each independent stage, leading to several seconds of lag, in contrast to the 200 ms conversational gaps typical in human dialogues.
- Textual Bottleneck: Intermediate textual representation erases paralinguistic elements such as emotion, speaker style, and non-speech sounds, severely limiting naturalness and expressivity.
- Rigid Turn Segmentation: Pipeline systems segment conversation into discrete speaker turns, making it impossible to natively model the 10–20% overlap, interruptions, and interjections common in real human dialogue.
Moshi resolves these limitations by unifying input and output speech streams and casting conversation as speech-to-speech autoregression. Dialogue unfolds as parallel, interleaved user and system token streams, removing explicit turn boundaries and supporting overlapping, naturalistic exchanges (Défossez et al., 17 Sep 2024).
2. Model Architecture
Moshi’s architecture consists of three principal components: Helium (text LLM backbone), Mimi (streaming neural audio codec with residual quantization), and the multi-stream hierarchical token generative model.
Helium: Text LLM Backbone
- Parameters: 7B
- Vocabulary: 32K-token SentencePiece
- Context: 4096 tokens
- Architectural Features: RoPE positional encoding, gated linear units, FlashAttention, RMSNorm
- Pretraining Data: ~2.1T public English tokens (Wikipedia, StackExchange, filtered CommonCrawl)
- Performance: Matches or exceeds other ≤2.5T-token, 7B models on MMLU (54.3%), ARC, OBQA, NQ benchmarks
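For orientation, the reported Helium hyperparameters can be collected into a single configuration object. In the sketch below, the width, depth, and head counts are illustrative assumptions typical of a 7B decoder-only model, not values stated above:

```python
from dataclasses import dataclass

@dataclass
class HeliumConfig:
    """Illustrative configuration for a Helium-like decoder-only LLM.

    Width/depth/head counts are assumptions for a 7B model, not documented values.
    """
    vocab_size: int = 32_000         # SentencePiece vocabulary
    context_length: int = 4096       # maximum context in tokens
    dim: int = 4096                  # model width (assumed)
    n_layers: int = 32               # depth (assumed)
    n_heads: int = 32                # attention heads (assumed)
    positional: str = "rope"         # rotary positional embeddings
    norm: str = "rmsnorm"            # RMSNorm in place of LayerNorm
    ffn: str = "gated_linear"        # gated linear units in the feed-forward blocks
    attention_kernel: str = "flash"  # FlashAttention

print(HeliumConfig())
```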
Mimi: Streaming Neural Audio Codec
- Encoder/Decoder: SeaNet convolutional stacks + 2 causal Transformer bottlenecks (8 layers, 8 heads, 20s context)
- Codebooks: 8, each of size 2048, at 12.5 Hz (1.1 kbps)
- Residual Vector Quantization (RVQ): “Split RVQ”—a semantic level-1 VQ distilled from WavLM embeddings in parallel with a 7-level acoustic RVQ, ensuring orthogonalization of semantic and acoustic information (semantic ABX↓ 23.3%→8.1%)
- Training: Pure adversarial loss (no reconstruction loss) achieves a perceptual MUSHRA of 81.0; objective VisQOL of 1.84
- Streaming: Causal encoding/decoding in 80 ms frames
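The split-RVQ layout — one semantic codebook trained with distillation toward WavLM, in parallel with a residual stack of acoustic codebooks over the same latent — can be illustrated with a minimal NumPy sketch. Codebooks and latents here are random placeholders, not the trained Mimi quantizer, and the latent dimension is an assumption:

```python
import numpy as np

rng = np.random.default_rng(0)
DIM, CODEBOOK_SIZE, N_ACOUSTIC = 512, 2048, 7   # latent dim is illustrative

semantic_cb = rng.standard_normal((CODEBOOK_SIZE, DIM))            # level 1 (distilled toward WavLM)
acoustic_cbs = [rng.standard_normal((CODEBOOK_SIZE, DIM)) for _ in range(N_ACOUSTIC)]

def nearest(codebook, x):
    """Index of the codebook entry closest to vector x."""
    return int(np.argmin(((codebook - x) ** 2).sum(axis=1)))

def split_rvq_encode(latent):
    """Quantize one 80 ms frame latent into 1 semantic + 7 acoustic indices."""
    sem_idx = nearest(semantic_cb, latent)      # semantic branch, parallel to the acoustic stack
    residual = latent.copy()                    # acoustic branch quantizes the full latent...
    acoustic_idx = []
    for cb in acoustic_cbs:                     # ...residually, one codebook at a time
        idx = nearest(cb, residual)
        acoustic_idx.append(idx)
        residual = residual - cb[idx]
    return sem_idx, acoustic_idx                # 8 tokens/frame * 11 bits * 12.5 Hz ≈ 1.1 kbps

frame_latent = rng.standard_normal(DIM)         # stand-in for the SeaNet encoder output
print(split_rvq_encode(frame_latent))
```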
Hierarchical Multi-Stream Generative Model (RQ-Transformer)
At each timestep $s$, the model samples the following sub-tokens in parallel:
- $W_s$: text token (system transcript)
- $A_{s,1}$: semantic audio token (system)
- $A_{s,2},\dots,A_{s,8}$: delayed acoustic tokens (system)
- $A'_{s,1}$: semantic audio token (user)
- $A'_{s,2},\dots,A'_{s,8}$: delayed acoustic tokens (user)
Autoregression is factored along time (Temporal Transformer, 32 layers, initialized from Helium) and depth (Depth Transformer, 6 layers, 16 heads, 1024 dimensions). The acoustic delay is 2 frames (160 ms) during pretraining and 1 frame (80 ms) during fine-tuning, yielding a theoretical latency of 160 ms and a practical end-to-end latency of ~200 ms.
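The time/depth factorization can be pictured as a two-level sampling loop: the Temporal Transformer advances one 80 ms frame at a time, and the Depth Transformer fills in that frame's sub-tokens one level at a time. The sketch below uses random stand-ins for both Transformers, omits the actual delay bookkeeping, and uses an illustrative sub-token ordering:

```python
import random

N_ACOUSTIC = 7          # acoustic codebooks per stream (Mimi: 1 semantic + 7 acoustic)
ACOUSTIC_DELAY = 1      # frames = 80 ms at 12.5 Hz (2 frames / 160 ms during pretraining)

# Sub-token layout of one timestep (ordering is schematic): text first, then audio tokens.
LEVELS = (["text", "sem_moshi"] + [f"ac_moshi_{k}" for k in range(N_ACOUSTIC)]
          + ["sem_user"] + [f"ac_user_{k}" for k in range(N_ACOUSTIC)])

def temporal_step(history):
    """Stand-in for the 32-layer Temporal Transformer: one step over past frames."""
    return {"frames_seen": len(history)}            # dummy temporal context

def depth_sample(context, partial_frame, level):
    """Stand-in for the 6-layer Depth Transformer sampling one sub-token."""
    return random.randrange(32_000 if level == "text" else 2048)

def generate_frame(history):
    context = temporal_step(history)                # advance time by one 80 ms frame
    frame = {}
    for level in LEVELS:                            # then descend through the depth axis
        frame[level] = depth_sample(context, frame, level)
    return frame                                    # acoustic tokens would additionally be
                                                    # written ACOUSTIC_DELAY frames late

history = []
for _ in range(5):                                  # 5 frames = 400 ms of dialogue
    history.append(generate_frame(history))
print(len(history[0]), "sub-tokens per frame")      # 1 text + 2*(1+7) audio = 17
```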
| Component | Key Hyperparameters | Output Characteristics |
|---|---|---|
| Helium (LLM) | 7B params, 32K vocab, 4096 ctx | 54.3% MMLU, outperforms Llama 2 |
| Mimi (Codec) | 8 codebooks (size 2048), 12.5 Hz | ABX 8.1%, MUSHRA 64 @ 1.1 kbps |
| Multi-Stream Generator | 32-layer Temporal, 6-layer Depth, 1-frame acoustic delay | ~200 ms E2E latency, concurrent streams |
3. Hierarchical Tokenization and the "Inner Monologue" Method
To give speech generation precise linguistic control, Moshi employs the "Inner Monologue" method: time-aligned text tokens $W_s$ of the system's own speech are placed before the semantic audio tokens at each timestep. The alignment is derived from Whisper ASR timestamps, with PAD tokens filling frames that carry no text so the text stream keeps the 12.5 Hz frame rate.
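A minimal sketch of this alignment step, assuming word-level Whisper timestamps and one text token per word (both simplifications; the real system emits SentencePiece pieces and dedicated padding tokens):

```python
PAD = "<pad>"
FRAME_RATE_HZ = 12.5        # one slot every 80 ms

def align_text_to_frames(word_timestamps, n_frames):
    """Place each word's token at the frame of its start time; PAD fills the rest.

    `word_timestamps` is a list of (token, start_seconds) pairs, e.g. from Whisper.
    """
    frames = [PAD] * n_frames
    for token, start in word_timestamps:
        frames[min(int(start * FRAME_RATE_HZ), n_frames - 1)] = token
    return frames

words = [("hello", 0.10), ("there", 0.55)]
print(align_text_to_frames(words, n_frames=13))     # 'hello' at frame 1, 'there' at frame 6
```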
Training optimizes the joint cross-entropy

$$\mathcal{L} \;=\; -\sum_{s}\Big[\log p_\theta\big(W_s \mid \mathbf{V}_{<s}\big) \;+\; \sum_{k=1}^{K}\alpha_k\,\log p_\theta\big(A_{s,k} \mid \mathbf{V}_{<s}, W_s, A_{s,<k}\big)\Big],$$

where $\mathbf{V}_{<s}$ denotes all sub-tokens of earlier timesteps, with separate weights $\alpha_k$ for the semantic codebook ($k=1$) and the acoustic codebooks ($k>1$), and with the user streams handled analogously. Incorporating Inner Monologue reduces NLL from 4.36→2.77, extends the maximum transcript length (486→1920 chars), and substantially increases spoken QA performance (WebQ 9%→26.6%, LlamaQ 21%→62.3%, AudioTriv 7.3%→22.8%).
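In code, the objective is simply a weighted sum of per-stream cross-entropies. Below is a minimal PyTorch sketch covering the text stream and one speaker's audio codebooks; the semantic/acoustic weights are left as placeholders rather than the paper's actual settings:

```python
import torch
import torch.nn.functional as F

def moshi_loss(text_logits, text_targets, audio_logits, audio_targets,
               alpha_sem=1.0, alpha_ac=1.0):
    """Weighted cross-entropy over one batch of frames.

    text_logits  : (B, T, V_text)        text stream predictions
    audio_logits : (B, T, K, V_audio)    K codebooks; index 0 is the semantic level
    alpha_sem / alpha_ac are placeholders for the paper's semantic/acoustic weights.
    """
    loss = F.cross_entropy(text_logits.flatten(0, 1), text_targets.flatten())
    for k in range(audio_logits.shape[2]):
        weight = alpha_sem if k == 0 else alpha_ac
        loss = loss + weight * F.cross_entropy(
            audio_logits[:, :, k].flatten(0, 1), audio_targets[:, :, k].flatten())
    return loss

B, T, K, V_TEXT, V_AUDIO = 2, 25, 8, 32_000, 2048    # 25 frames = 2 s of audio
loss = moshi_loss(torch.randn(B, T, V_TEXT), torch.randint(V_TEXT, (B, T)),
                  torch.randn(B, T, K, V_AUDIO), torch.randint(V_AUDIO, (B, T, K)))
print(float(loss))
```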
Restoring explicit text control also underpins streaming ASR and TTS within the same architecture: swapping which stream (text or audio) is delayed relative to the other turns the joint model into a streaming speech recognizer or a streaming speech synthesizer.
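A toy illustration of that delay swap, with placeholder tokens and an arbitrary two-frame delay (the real delays are measured in whole frames, e.g. a 2 s look-ahead for ASR):

```python
PAD = "<pad>"

def delay_stream(tokens, delay, pad=PAD):
    """Shift a token stream `delay` frames later in time, padding the front."""
    return [pad] * delay + tokens[:len(tokens) - delay]

text  = ["hi", PAD, "there", PAD]
audio = ["a0", "a1", "a2", "a3"]                 # stand-ins for per-frame audio tokens

# Streaming ASR: audio leads, text lags, so each text token can attend to past audio.
asr_frames = list(zip(audio, delay_stream(text, delay=2)))
# Streaming TTS: text leads, audio lags, so the speech is conditioned on the transcript.
tts_frames = list(zip(delay_stream(audio, delay=2), text))

print(asr_frames)   # [('a0', '<pad>'), ('a1', '<pad>'), ('a2', 'hi'), ('a3', '<pad>')]
print(tts_frames)   # [('<pad>', 'hi'), ('<pad>', '<pad>'), ('a0', 'there'), ('a1', '<pad>')]
```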
4. Streaming, Full-Duplex, and Conversational Dynamics
Moshi’s dual-stream structure models system and user audio in parallel, never yielding the microphone, enabling true full-duplex interaction. Natural dialogue “gaps” and overlaps, interruptions, and backchanneling emerge as simultaneous sequences in the token stream, without explicit turn boundaries.
- Silence in either stream decodes to near-silence; PAD tokens indicate non-speaking status.
- Simultaneous autoregressive sampling enables realistic overlap and responsiveness.
- Theoretical latency is 160 ms and measured round-trip latency is approximately 200 ms from microphone input to speaker output, matching human conversational pacing.
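Operationally, the duplex loop ticks once per 80 ms frame: encode the newest user frame, sample Moshi's next frame conditioned on everything so far, and decode it to audio immediately. The sketch below uses dummy stand-ins for the codec and model; none of these classes are the released API:

```python
import random

FRAME_SECONDS = 0.08          # one Mimi frame at 12.5 Hz
N_CODEBOOKS = 8               # 1 semantic + 7 acoustic tokens per frame per speaker

# Minimal stand-ins so the loop below actually runs; not the released API.
class DummyCodec:
    def encode(self, audio):  return [random.randrange(2048) for _ in range(N_CODEBOOKS)]
    def decode(self, tokens): return [0.0] * int(24_000 * FRAME_SECONDS)   # 80 ms of silence

class DummyModel:
    def initial_frame(self):  return [0] * N_CODEBOOKS
    def step(self, user_tokens, prev_moshi_tokens):
        return [random.randrange(2048) for _ in range(N_CODEBOOKS)]

def duplex_loop(codec, model, n_frames):
    """One tick per 80 ms frame: encode user audio, sample Moshi's frame, decode it."""
    moshi_tokens = model.initial_frame()
    out = []
    for _ in range(n_frames):
        user_audio = [0.0] * int(24_000 * FRAME_SECONDS)      # stand-in for microphone input
        user_tokens = codec.encode(user_audio)                # user stream: always listening
        moshi_tokens = model.step(user_tokens, moshi_tokens)  # Moshi stream: always generating
        out.append(codec.decode(moshi_tokens))                # always playing (possibly silence)
        # No turn-taking logic: overlap, backchannels, and interruptions are simply
        # whatever both token streams contain at the same timestep.
    return out

frames = duplex_loop(DummyCodec(), DummyModel(), n_frames=25)   # 25 frames = 2 s
print(len(frames), "frames decoded")
```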
Dialogue dynamics such as interpausal unit (IPU), gap, and overlap statistics closely match human ground truth on the Fisher corpus. Perplexity of generated transcripts under DialoGPT is lower than for pipeline baselines (41.9 vs. 45.9), indicating greater naturalness (Défossez et al., 17 Sep 2024).
5. Training Regimen and Implementation Details
Moshi leverages a multi-stage, multi-source training pipeline:
- Unsupervised Pre-Training: 7M hours public audio and Whisper v3 transcripts in single-stream and mixture settings, 1M steps, 16h batch size.
- Post-Training (Multi-Stream): Simulated dialogues using PyAnnote diarization, 100K steps, 8h batch.
- Fine-tuning: Fisher corpus (2K hours phone calls, upsampled from 8 kHz to 24 kHz) to induce realistic full-duplex behavior, 10K steps.
- Instruct-Fine-Tuning: 20K hours synthetic TTS data (single professional voice plus 92 styles), scripted using Helium generations.
- Text Data: 12.5% curated sources (Wikipedia, StackExchange, scientific), 87.5% filtered CommonCrawl.
- Hardware/Optimization: Runs on H100 GPUs with FSDP, activation checkpointing, and FlashAttention; AdamW optimizer with separate text/audio optimizer state; learning rates from 3e-4 down to 2e-6.
Token vocabularies are 32,000 for text and 2048 per audio codebook, and the frame rate is 12.5 Hz.
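For reference, the staged schedule can be restated as a plain configuration list; the field names are illustrative, and the values simply repeat those given above, with unspecified entries left as None:

```python
TRAINING_STAGES = [
    {"stage": "unsupervised pretraining",
     "data": "7M h public audio + Whisper v3 transcripts (single-stream and mixtures)",
     "steps": 1_000_000, "batch": "16 h of audio"},
    {"stage": "multi-stream post-training",
     "data": "dialogues simulated with PyAnnote diarization",
     "steps": 100_000, "batch": "8 h of audio"},
    {"stage": "full-duplex fine-tuning",
     "data": "Fisher corpus, 2K h of phone calls upsampled 8 kHz -> 24 kHz",
     "steps": 10_000, "batch": None},
    {"stage": "instruct fine-tuning",
     "data": "20K h synthetic TTS (one main voice + 92 styles), scripted with Helium",
     "steps": None, "batch": None},
]

for stage in TRAINING_STAGES:
    print(f"{stage['stage']}: {stage['steps']} steps")
```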
6. Evaluation and Benchmarking
Moshi sets state-of-the-art quality in streaming, full-duplex speech generation and dialogue. Evaluation highlights include:
- Language Modeling: Helium achieves 54.3% MMLU, outperforming Llama 2, Falcon, and MPT in the 7B class.
- Codec Performance: Mimi’s split RVQ + Transformer bottlenecks + adversarial losses yield ABX 8.1%, MUSHRA 64%@1.1 kbps, outperforming SpeechTokenizer and SemantiCodec at comparable or lower framerates.
- Audio LM: Cold-start sWUGGY 74.8%, sBLIMP 59.9%, sTopic-StoryCloze 80.9%; text→audio warm-start further improves sStoryCloze to 83.6%. Multimodal Moshi yields 49.7% (5-shot) MMLU.
- Spoken QA: Without Inner Monologue, 9% (WebQ), 21% (LlamaQ); with it, 26.6% (WebQ), 62.3% (LlamaQ), outperforming Spectron and SpeechGPT while remaining streaming.
- ASR/TTS (Streaming): By interchanging the text/audio delays, the same model achieves, without modification, a streaming ASR WER of 5.7% (LibriSpeech clean, 2 s look-ahead) and a TTS WER of 4.7%.
- Compression/Quantization: 8-bit quantization halves model size with <2pt MMLU drop; 4 bits retains good audio at 2pt degradation; 2–3 bits yields artifacts (detected by entropy-based tests).
- Safety and Robustness: ALERT toxicity ∼83% (comparable to other open models, behind GPT-4 and Llama 2); deduplication avoids audio regurgitation; speaker embedding drift <1.3% over 1,000 turns; watermarking vulnerable to low-bit audio, with text-based watermarking suffering idempotence issues.
7. Applications, Limitations, and Prospective Developments
Moshi’s streaming, full-duplex, concurrent-token paradigm applies to use cases including always-on conversational assistants, customer support bots, in-car co-pilots, real-time meeting summarization, and voice-based information retrieval; a live demonstration is available at moshi.chat.
Identified limitations:
- English-only in the current version
- Reduced performance on some highly structured tasks (e.g., a TriviaQA gap relative to the Helium text baseline)
- Susceptibility to real-world environmental noise and remaining safety challenges
Planned and suggested extensions include:
- Multilingual and code-switching capabilities
- Larger context handling (>5 min sequences)
- Privacy-preserving on-device deployment
- Quantization-aware training, learned watermarking, and integration of vision for multi-modal interactions
- Fine-grained prosodic/emotional control
All source code, models, and data preparation recipes are publicly available at github.com/kyutai-labs/moshi, enabling external benchmarking and further research (Défossez et al., 17 Sep 2024).