
Moshi: Unified Speech-Text Dialogue Model

Updated 2 December 2025
  • Moshi is a full-duplex speech-text foundation model that unifies ASR, LLM, and TTS into a single autoregressive architecture for natural, overlapping dialogue.
  • It employs a hierarchical multi-stream generative framework and an inner monologue method to reduce latency and preserve paralinguistic cues.
  • The model achieves real-time responses (~200 ms latency) and outperforms traditional pipelines in ASR, TTS, and spoken QA benchmarks.

Moshi is a speech-text foundation model and full-duplex spoken dialogue framework that redefines machine spoken interaction by replacing the cascaded pipeline architecture with a unified, autoregressive, speech-to-speech model. Traditional spoken dialogue systems chain independent voice activity detectors, automatic speech recognition (ASR), LLMs, and text-to-speech (TTS) modules, which compounds latency into multiple seconds, discards non-linguistic and paralinguistic cues, and enforces strict one-at-a-time turn-taking. Moshi instead listens and speaks continuously, models overlapping speech, interruptions, and interjections, and emits output in real time with a practical end-to-end latency of approximately 200 ms. It introduces a hierarchical, multi-stream generative framework built upon a large text LLM backbone, a streaming neural audio codec with residual quantization, and a parallel tokenization scheme for both system and user speech, enabling natural, realistic conversational dynamics (Défossez et al., 17 Sep 2024).

1. Motivation and Limitations of Pipeline Approaches

Conventional spoken dialogue systems operate as a pipeline: wake-word detection, ASR, textual LLM, and TTS. These cascaded architectures introduce three principal limitations:

  • Latency Compounding: Round-trip dialogue latency accumulates across each independent stage, leading to several seconds of lag, in contrast to the 200 ms conversational gaps typical in human dialogues.
  • Textual Bottleneck: Intermediate textual representation erases paralinguistic elements such as emotion, speaker style, and non-speech sounds, severely limiting naturalness and expressivity.
  • Rigid Turn Segmentation: Pipeline systems segment conversation into discrete speaker turns, making it impossible to natively model the 10–20% overlap, interruptions, and interjections common in real human dialogue.

Moshi resolves these limitations by unifying input and output speech streams and casting conversation as speech-to-speech autoregression. Dialogue unfolds as parallel, interleaved user and system token streams, removing explicit turn boundaries and supporting overlapping, naturalistic exchanges (Défossez et al., 17 Sep 2024).

2. Model Architecture

Moshi’s architecture consists of three principal components: Helium (text LLM backbone), Mimi (streaming neural audio codec with residual quantization), and the multi-stream hierarchical token generative model.

Helium: Text LLM Backbone

  • Parameters: 7B
  • Vocabulary: 32K-token SentencePiece
  • Context: 4096 tokens
  • Architectural Features: RoPE positional encoding, gated linear units, FlashAttention, RMSNorm
  • Pretraining Data: ~2.1T public English tokens (Wikipedia, StackExchange, filtered CommonCrawl)
  • Performance: Matches or exceeds other 7B models trained on ≤2.5T tokens on MMLU (54.3%), ARC, OBQA, and NQ benchmarks
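
To illustrate the architectural features listed above, the following PyTorch sketch shows an RMSNorm layer and a gated-linear-unit feed-forward block of the kind used in Helium-style backbones; the exact activation, hidden dimensions, and initialization are assumptions rather than details from the released code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RMSNorm(nn.Module):
    """Root-mean-square normalization (no mean subtraction, no bias)."""
    def __init__(self, dim: int, eps: float = 1e-5):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # scale by the inverse RMS over the last dimension
        inv_rms = torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return x * inv_rms * self.weight

class GatedFeedForward(nn.Module):
    """Gated linear unit feed-forward block; SiLU gating is an assumption."""
    def __init__(self, dim: int, hidden_dim: int):
        super().__init__()
        self.gate = nn.Linear(dim, hidden_dim, bias=False)
        self.up = nn.Linear(dim, hidden_dim, bias=False)
        self.down = nn.Linear(hidden_dim, dim, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.down(F.silu(self.gate(x)) * self.up(x))
```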

Mimi: Streaming Neural Audio Codec

  • Encoder/Decoder: SeaNet convolutional stacks + 2 causal Transformer bottlenecks (8 layers, 8 heads, 20s context)
  • Codebooks: $Q = 8$, each of size $N_A = 2048$, at 12.5 Hz (1.1 kbps)
  • Residual Vector Quantization (RVQ): “Split RVQ”, in which a level-1 semantic VQ distilled from WavLM embeddings runs in parallel with a 7-level acoustic RVQ, keeping semantic and acoustic information disentangled (semantic ABX error drops from 23.3% to 8.1%)
  • Training: Pure adversarial loss (no reconstruction loss) achieves a perceptual MUSHRA score of 81.0 and an objective ViSQOL of 1.84
  • Streaming: Causal encoding/decoding in 80 ms frames
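
A minimal sketch of the split-RVQ idea described above: a single semantic vector quantizer runs in parallel with a residual acoustic quantizer rather than sharing one residual chain. The latent dimension, nearest-neighbour lookup, and absence of training losses are simplifying assumptions, not the released Mimi implementation.

```python
import torch

def vq_nearest(x: torch.Tensor, codebook: torch.Tensor):
    """Nearest-neighbour VQ: x is (frames, dim), codebook is (codebook_size, dim)."""
    idx = torch.cdist(x, codebook).argmin(dim=-1)
    return idx, codebook[idx]

def split_rvq(latent: torch.Tensor, semantic_cb: torch.Tensor, acoustic_cbs: list):
    """Split RVQ: the semantic VQ and the acoustic RVQ quantize the latent in
    parallel, so semantic information is not entangled in the acoustic residuals."""
    sem_idx, _ = vq_nearest(latent, semantic_cb)    # level 1: semantic codebook
    residual = latent                               # acoustic path starts from the raw latent
    acoustic_idx = []
    for cb in acoustic_cbs:                         # 7 acoustic levels for Mimi's Q = 8
        idx, q = vq_nearest(residual, cb)
        acoustic_idx.append(idx)
        residual = residual - q                     # standard residual-quantization update
    return sem_idx, acoustic_idx

# Toy usage: 2 s of audio at 12.5 Hz -> 25 frames; the 512-dim latent is an assumption.
latent = torch.randn(25, 512)
sem, acoustic = split_rvq(latent,
                          torch.randn(2048, 512),
                          [torch.randn(2048, 512) for _ in range(7)])
```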

Hierarchical Multi-Stream Generative Model (RQ-Transformer)

At each timestep $s = 1, \ldots, S$, $K = 2Q + 1$ parallel sub-tokens are sampled:

  • $k = 1$: Text token $W_s$ (system transcript)
  • $k = 2$: Semantic audio token $A_{s,1}$ (system)
  • $k = 3, \ldots, Q+1$: Delayed acoustic tokens $A_{s-\tau,\,2\ldots Q}$ (system)
  • $k = Q+2$: Semantic audio token $A'_{s,1}$ (user)
  • $k = Q+3, \ldots, 2Q+1$: Delayed acoustic tokens $A'_{s-\tau,\,2\ldots Q}$ (user)

Autoregression is factored along time (Temporal Transformer, 32 layers, initialized from Helium) and depth (Depth Transformer, 6 layers, 16 heads, 1024 dimensions). The acoustic delay is $\tau = 2$ frames (160 ms) during pretraining and $\tau = 1$ frame (80 ms) during fine-tuning, yielding a theoretical latency of 160 ms and a practical end-to-end latency of ~200 ms.
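
The per-step layout can be made concrete with a short sketch that assembles the $K = 2Q+1$ sub-tokens for one timestep and applies the acoustic delay $\tau$; the PAD handling and tensor conventions below are assumptions for illustration only.

```python
import torch

Q, TAU = 8, 1                  # codebooks per audio stream; acoustic delay (80 ms at 12.5 Hz)
K = 2 * Q + 1                  # 1 text + Q system + Q user sub-tokens per timestep
PAD = 0                        # hypothetical padding index for steps before the delay elapses

def build_frame(s: int, text: torch.Tensor, sys_audio: torch.Tensor, usr_audio: torch.Tensor):
    """Assemble the K sub-tokens consumed at timestep s.
    text: (S,) ints; sys_audio, usr_audio: (S, Q) ints of Mimi codes
    (column 0 = semantic, columns 1..Q-1 = acoustic)."""
    frame = torch.full((K,), PAD, dtype=torch.long)
    frame[0] = text[s]                            # k = 1: time-aligned text token W_s
    frame[1] = sys_audio[s, 0]                    # k = 2: system semantic token
    frame[Q + 1] = usr_audio[s, 0]                # k = Q+2: user semantic token
    if s >= TAU:                                  # acoustic levels are read TAU frames late
        frame[2:Q + 1] = sys_audio[s - TAU, 1:]   # k = 3 .. Q+1: delayed system acoustic
        frame[Q + 2:] = usr_audio[s - TAU, 1:]    # k = Q+3 .. 2Q+1: delayed user acoustic
    return frame
```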

| Component | Key Hyperparameters | Output Characteristics |
| --- | --- | --- |
| Helium (LLM) | 7B params, 32K vocab, 4096-token context | 54.3% MMLU, outperforms Llama 2 |
| Mimi (Codec) | 8 codebooks of size 2048, 12.5 Hz | ABX 8.1%, MUSHRA 64% at 1.1 kbps |
| Multi-Stream Generator | 32-layer Temporal, 6-layer Depth, $\tau = 1$ | ~200 ms end-to-end latency, concurrent streams |

3. Hierarchical Tokenization and the "Inner Monologue" Method

To provide precise linguistic control over speech generation, Moshi employs the "Inner Monologue" method: time-aligned text tokens $W_s$ are prepended to the semantic audio token at each timestep. The alignment is derived from Whisper word timestamps, with PAD tokens filling non-speech frames to maintain the 12.5 Hz frame rate.
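
For illustration, building the time-aligned text stream from word-level timestamps might look like the sketch below; the PAD id, the tokenizer interface, and the placement rule are assumptions, not the released preprocessing code.

```python
def align_text_to_frames(words, num_frames, tokenizer, frame_rate=12.5, pad_id=0):
    """Sketch of Inner Monologue alignment: place each word's text tokens at the
    12.5 Hz frame where the word starts (from Whisper word timestamps) and leave
    every other frame as PAD. `pad_id` and `tokenizer` are assumed interfaces."""
    stream = [pad_id] * num_frames
    for word in words:                              # e.g. {"word": "hello", "start": 0.10}
        frame = min(int(word["start"] * frame_rate), num_frames - 1)
        for i, tok in enumerate(tokenizer(word["word"])):
            if frame + i < num_frames:
                stream[frame + i] = tok             # unused frames keep PAD
    return stream

# Usage with a toy tokenizer (one id per word; the real model uses SentencePiece).
words = [{"word": "hello", "start": 0.10}, {"word": "there", "start": 0.58}]
print(align_text_to_frames(words, num_frames=25, tokenizer=lambda w: [len(w)]))
```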

Training optimizes the joint cross-entropy:

$$L_s = \mathrm{CE}(l_{s,1}, W_s) + \frac{1}{\sum_{k=2}^{K} \alpha_k} \sum_{k=2}^{K} \alpha_k\, \mathrm{CE}(l_{s,k}, V_{s,k})$$

with $\alpha_2 = 100$ (semantic) and $\alpha_{3 \ldots K} = 1$ (acoustic). Incorporating the Inner Monologue reduces NLL from 4.36 to 2.77, extends the maximum transcript length from 486 to 1920 characters, and substantially increases spoken QA performance (WebQ 9% → 26.6%, LlamaQ 21% → 62.3%, AudioTriv 7.3% → 22.8%).
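
The weighted objective above can be sketched compactly as follows; the tensor shapes, per-level vocabularies, and the handling of the semantic weight are assumptions for illustration, not the training code.

```python
import torch
import torch.nn.functional as F

def step_loss(logits: list, targets: torch.Tensor, alpha_semantic: float = 100.0):
    """Sketch of L_s: plain CE on the text sub-token (k = 1) plus a normalized,
    weighted CE over the audio sub-tokens (k = 2..K), with weight 100 on the
    semantic level and 1 on the acoustic levels.
    logits: list of K tensors of shape (batch, vocab_k); targets: (batch, K) ints."""
    K = len(logits)
    text_loss = F.cross_entropy(logits[0], targets[:, 0])
    alphas = torch.tensor([alpha_semantic] + [1.0] * (K - 2))   # weights for k = 2 .. K
    audio_ce = torch.stack([F.cross_entropy(logits[k], targets[:, k]) for k in range(1, K)])
    return text_loss + (alphas * audio_ce).sum() / alphas.sum()
```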

This explicit text control underpins both streaming ASR and streaming TTS within the same neural architecture: swapping which of the text and audio streams is delayed turns the model into a transcriber or a synthesizer.
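
As a schematic illustration of this delay swap (the specific frame counts below are assumptions, except that 2 s of look-ahead corresponds to 25 frames at 12.5 Hz):

```python
FRAME_RATE = 12.5  # Hz; 25 frames ≈ 2 s

# Which stream is delayed decides the task the same joint model performs.
STREAM_DELAYS = {
    "asr": {"text": 25, "audio": 0},   # text lags audio by ~2 s -> streaming transcription
    "tts": {"text": 0, "audio": 25},   # audio lags text -> streaming synthesis
}
```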

4. Streaming, Full-Duplex, and Conversational Dynamics

Moshi’s dual-stream structure models system and user audio in parallel, never yielding the microphone, enabling true full-duplex interaction. Natural dialogue “gaps” and overlaps, interruptions, and backchanneling emerge as simultaneous sequences in the token stream, without explicit turn boundaries.

  • Silence in either stream decodes to near-silence; PAD tokens indicate non-speaking status.
  • Simultaneous autoregressive sampling enables realistic overlap and responsiveness.
  • Theoretical and measured round-trip latency is 160–200 ms from microphone input to speaker output, matching human conversational pacing.
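
A minimal sketch of the resulting inference loop under these properties; `mic`, `speaker`, `encode`, `step`, and `decode` are placeholder interfaces standing in for audio I/O, Mimi encoding/decoding, and RQ-Transformer sampling, not the released API.

```python
FRAME_SEC = 0.08        # one Mimi frame = 80 ms at 12.5 Hz
SAMPLE_RATE = 24_000    # Mimi operates on 24 kHz audio

def duplex_loop(mic, speaker, encode, step, decode, max_frames=10_000):
    """Full-duplex inference: every 80 ms, encode the user's frame, sample the
    next multi-stream frame conditioned on both streams so far, and immediately
    decode and play the system's audio tokens. There is no turn-taking logic:
    overlap, backchannels, and interruptions are just patterns in the tokens."""
    state = None
    for _ in range(max_frames):
        user_frame = mic.read(int(SAMPLE_RATE * FRAME_SEC))  # 80 ms of input audio
        user_tokens = encode(user_frame)                     # (Q,) Mimi codes, user stream
        sys_tokens, state = step(user_tokens, state)         # sample text + system audio sub-tokens
        speaker.write(decode(sys_tokens))                    # play 80 ms of output audio
```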

Dialogue dynamics, such as interpausal unit (IPU), gap, and overlap statistics on the Fisher corpus, closely match human ground truth. Transcript perplexity measured with DialoGPT is lower than for pipeline baselines (41.9 vs. 45.9), indicating greater naturalness (Défossez et al., 17 Sep 2024).

5. Training Regimen and Implementation Details

Moshi leverages a multi-stage, multi-source training pipeline:

  • Unsupervised Pre-Training: 7M hours public audio and Whisper v3 transcripts in single-stream and mixture settings, 1M steps, 16h batch size.
  • Post-Training (Multi-Stream): Simulated dialogues using PyAnnote diarization, 100K steps, 8h batch.
  • Fine-tuning: Fisher corpus (2K hours phone calls, upsampled from 8 kHz to 24 kHz) to induce realistic full-duplex behavior, 10K steps.
  • Instruct-Fine-Tuning: 20K hours synthetic TTS data (single professional voice plus 92 styles), scripted using Helium generations.
  • Text Data: 12.5% curated sources (Wikipedia, StackExchange, scientific), 87.5% filtered CommonCrawl.
  • Hardware/Optimization: Runs on H100 GPUs with FSDP, activation checkpointing, and FlashAttention; AdamW optimizer, separate text/audio state, learning rates from 3e-4 down to 2e-6.

Token vocabularies are $|V_{\text{text}}| = 32\text{K}$ and $|V_{\text{audio}}| = 2048$, and the frame rate is 12.5 Hz.
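
These numbers imply the following back-of-the-envelope rates (pure arithmetic, no additional assumptions):

```python
import math

frame_rate = 12.5                     # frames per second
Q, codebook_size = 8, 2048            # audio codebooks per stream, entries per codebook
K = 2 * Q + 1                         # sub-tokens per frame: 1 text + 8 system + 8 user = 17

tokens_per_second = frame_rate * K                    # 212.5 sub-tokens generated per second
bits_per_codebook = math.log2(codebook_size)          # 11 bits
audio_bitrate = frame_rate * Q * bits_per_codebook    # 1100 bps ≈ 1.1 kbps per audio stream
print(tokens_per_second, audio_bitrate)               # 212.5 1100.0
```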

6. Evaluation and Benchmarking

Moshi sets state-of-the-art quality in streaming, full-duplex speech generation and dialogue. Evaluation highlights include:

  • Language Modeling: Helium achieves 54.3% MMLU, outperforming Llama 2, Falcon, and MPT in the 7B class.
  • Codec Performance: Mimi’s split RVQ + Transformer bottlenecks + adversarial losses yield ABX 8.1%, MUSHRA 64%@1.1 kbps, outperforming SpeechTokenizer and SemantiCodec at comparable or lower framerates.
  • Audio LM: Cold-start sWUGGY 74.8%, sBLIMP 59.9%, sTopic-StoryCloze 80.9%; text→audio warm-start further improves sStoryCloze to 83.6%. Multimodal Moshi yields 49.7% (5-shot) MMLU.
  • Spoken QA: Without Inner Monologue, 9% (WebQ), 21% (LlamaQ); with it, 26.6% (WebQ), 62.3% (LlamaQ), outperforming Spectron and SpeechGPT while remaining streaming.
  • ASR/TTS (Streaming): By interchanging the text/audio delays, the same unchanged model achieves a streaming ASR WER of 5.7% with 2 s of look-ahead (Libri-clean) and a TTS WER of 4.7%.
  • Compression/Quantization: 8-bit quantization halves model size with <2pt MMLU drop; 4 bits retains good audio at 2pt degradation; 2–3 bits yields artifacts (detected by entropy-based tests).
  • Safety and Robustness: ALERT toxicity score ~83% (comparable to other open models, behind GPT-4 and Llama 2); deduplication avoids audio regurgitation; speaker-embedding drift is <1.3% over 1,000 turns; audio watermarking is vulnerable at low bitrates, and text-based watermarking suffers idempotence issues.

7. Applications, Limitations, and Prospective Developments

Moshi’s streaming, full-duplex, concurrent-token paradigm applies to use cases including always-on conversational assistants, customer support bots, in-car co-pilots, real-time meeting summarization, and voice-based information retrieval; a live demonstration is available at moshi.chat.

Identified limitations:

  • English-only (current version)
  • Reduced performance on highly structured syntactic tasks (e.g., the TriviaQA gap vs. the Helium baseline)
  • Susceptibility to real-world environmental noise and remaining safety challenges

Planned and suggested extensions include:

  • Multilingual and code-switching capabilities
  • Larger context handling (>5 min sequences)
  • Privacy-preserving on-device deployment
  • Quantization-aware training, learned watermarking, and integration of vision for multi-modal interactions
  • Fine-grained prosodic/emotional control

All source code, models, and data preparation recipes are publicly available at github.com/kyutai-labs/moshi, enabling external benchmarking and further research (Défossez et al., 17 Sep 2024).

References

Défossez et al. (17 Sep 2024). Moshi: a speech-text foundation model for real-time dialogue.
