WavChat: Survey of Spoken Dialogue Models
- The survey charts a shift from modular, rule-based SDS to neural end-to-end architectures, enabling more natural and synchronous conversational behaviors.
- It traces the methodological progression from cascaded systems with separate ASR, LLM, and TTS components to unified models that jointly optimize speech and language outputs.
- The research outlines emerging trends such as full-duplex interaction and multimodal learning, supported by benchmarks and metrics for robust evaluation.
Spoken dialogue models, also referred to as spoken dialogue systems (SDS), are artificial agents that engage in real-time, multi-turn interactions with humans using speech as the primary modality. These systems map incoming speech to an internal representation, perform response planning, and synthesize speech in reply, often under stringent requirements for latency, turn-taking, and robustness to uncertainty. The field has shifted from deterministic, modular cascades to neural end-to-end architectures capable of simultaneous listening and speaking, supporting increasingly natural, synchronous, and expressive conversational behaviors (Ji et al., 15 Nov 2024, Chen et al., 18 Sep 2025, Patlan et al., 2021).
1. Historical Evolution and System Paradigms
Early SDS adopted rigid, rule-based architectures (e.g., ELIZA, PARRY) that matched patterns and produced deterministic replies, favoring interpretability at the expense of scalability and learning. The introduction of statistical models, notably Partially Observable Markov Decision Processes (POMDPs), enabled explicit uncertainty modeling across the ASR-to-dialogue-to-TTS pipeline, optimizing response selection with principled reward functions but at the cost of high dimensionality and data requirements (Patlan et al., 2021).
In the 2010s, deep learning models supplanted modular pipelines. Encoder–decoder architectures, transformers, and hierarchical sequence models facilitated both task-oriented and open-domain dialogue, often trained end-to-end on large-scale conversational corpora (Patlan et al., 2021, Ji et al., 15 Nov 2024). Recent milestones include:
- Cascaded systems (ASR → LLM → TTS): Modular, with each component trained independently; recent examples integrate paralinguistic features and LLM augmentation.
- End-to-end models: Direct mapping from speech input to speech output, often bypassing text entirely. Examples include dGSLM, SpeechGPT, Moshi, and full-duplex models with joint or parallel token streams (Ji et al., 15 Nov 2024, Chen et al., 18 Sep 2025).
These trends reveal a clear trajectory from isolated, component-based design to unified architectures emphasizing multi-modality, joint learning, and adaptive synchronization.
2. Core Technologies in Spoken Dialogue Modeling
2.1 Speech Representations
Modern systems rely on learned or quantized representations (see the feature-extraction sketch after this list), including:
- Short-Time Fourier Transform (STFT) and Mel-spectrograms: Conventional feature extractors.
- Self-supervised representations: Wav2Vec 2.0 and HuBERT employ contrastive or masked prediction to encode phonetic, prosodic, and speaker information.
- Discrete acoustic tokens: Vector quantization approaches (EnCodec, SoundStream, SNAC) compress waveforms into codebook indices for LLM-compatible modeling (Ji et al., 15 Nov 2024).
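A minimal sketch of both ends of this spectrum, assuming PyTorch and torchaudio are installed: mel-spectrogram frames are extracted from a waveform and each frame is mapped to the nearest entry of a toy random codebook. Real tokenizers such as EnCodec, SoundStream, and SNAC learn (residual) codebooks end to end; the random codebook and the 80-mel/16 kHz settings here are illustrative only.

```python
# Sketch: mel-spectrogram features plus toy frame-level quantization.
import torch
import torchaudio

wav = torch.randn(1, 16000)                            # 1 s of synthetic 16 kHz audio
mel = torchaudio.transforms.MelSpectrogram(
    sample_rate=16000, n_fft=400, hop_length=160, n_mels=80
)(wav)                                                 # shape (1, 80, frames)

frames = mel.squeeze(0).T                              # (frames, 80)
codebook = torch.randn(1024, 80)                       # toy 1024-entry codebook
tokens = torch.cdist(frames, codebook).argmin(dim=-1)  # nearest codeword per frame
print(tokens.shape, tokens[:10])                       # discrete "acoustic tokens" for an LLM
```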
2.2 Training Paradigms
- Cascaded: Separate ASR models (CTC loss on paired speech-text), dialogue LLMs (supervised cross-entropy or RL-based fine-tuning), and TTS systems, often trained on disjoint data sources (see the pipeline sketch after this list).
- End-to-end: Modality-aligned post-training on paired speech-text, joint or streaming optimization, shared contextual encodings across modalities, with optional preference optimization (e.g., DPO).
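The cascaded paradigm can be sketched as three components exchanging plain text; the class and method names below are hypothetical placeholders for independently trained ASR, dialogue LLM, and TTS models, not an interface from any cited system.

```python
# Sketch of a cascaded ASR -> LLM -> TTS turn; all components are toy stand-ins.
class ToyASR:
    def transcribe(self, wav):
        return "book a table for two"                # dummy transcript

class ToyDialogueLLM:
    def respond(self, text, history):
        return "Sure, I will book a table for two."  # dummy reply

class ToyTTS:
    def synthesize(self, text):
        return b"\x00" * 32000                       # dummy second of 16-bit PCM silence

def cascaded_turn(wav, asr, llm, tts, history):
    user_text = asr.transcribe(wav)                  # speech -> text; ASR errors propagate
    reply_text = llm.respond(user_text, history)     # text-only reasoning; prosody is lost
    history.append((user_text, reply_text))
    return tts.synthesize(reply_text)                # text -> speech

wav_in = b"\x00" * 32000                             # dummy input audio bytes
audio_out = cascaded_turn(wav_in, ToyASR(), ToyDialogueLLM(), ToyTTS(), history=[])
```

The text-only hand-off also marks where the paradigm loses information: recognition errors flow downstream unchecked, and paralinguistic cues in the input never reach the LLM.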
Recent work also investigates joint modeling of dialogue responses and prosodic, phonological, or linguistic features via unified outputs (e.g., text plus JSON- or token-based annotation for TTS pre-conditioning) (Zhou et al., 2023).
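As one illustration of such a unified output, a response can be emitted as text paired with a machine-readable prosody annotation consumed by the TTS front end; the JSON field names below are hypothetical, not a schema defined in the cited work.

```python
# Sketch: response text plus prosody/paralinguistic annotations as one output.
import json

unified_output = {
    "response": "Sure, I can help with that.",
    "prosody": {"rate": "fast", "pitch": "+2st", "emotion": "friendly"},
    "phrase_breaks": [2, 6],                 # token indices after which to pause
}
print(json.dumps(unified_output, indent=2))  # handed to the TTS pre-conditioning stage
```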
2.3 Duplex and Streaming Capabilities
- Half-duplex: Strict alternation between input and output.
- Full-duplex: Simultaneous, overlapping input and output streams with real-time arbitration; enables natural backchannels, interruptions, and conversational overlaps (Chen et al., 18 Sep 2025).
- Causal architectures: Causal convolutions and causal self-attention ensure outputs depend only on past (and sometimes anticipated future) context, supporting low latency and synchronous interaction (Ji et al., 15 Nov 2024, Chen et al., 18 Sep 2025); a minimal causal-attention sketch follows this list.
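A minimal causal-attention sketch, assuming PyTorch 2.x: with is_causal=True each frame attends only to itself and earlier frames, which is what lets a streaming model emit tokens while audio is still arriving.

```python
# Sketch: causal self-attention over a stream of frames (learned projections omitted).
import torch
import torch.nn.functional as F

B, T, D = 1, 50, 64                    # batch, time steps, model width
x = torch.randn(B, T, D)
q = k = v = x                          # single head, no learned projections
out = F.scaled_dot_product_attention(q, k, v, is_causal=True)  # lower-triangular mask
print(out.shape)                       # (1, 50, 64); frame t depends only on frames <= t
```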
3. Interaction Strategies and Synchronization
A central challenge for SDS is synchronizing inbound and outbound speech streams to enable natural conversational flow. Taxonomies distinguish:
- Engineered synchronization: Modular, often external, controllers (finite-state machines, binary activity classifiers) manage turn-taking decisions. Systems like FlexDuo and Freeze-Omni typify this approach (Chen et al., 18 Sep 2025).
- Learned synchronization: End-to-end architectures natively support dual-stream processing, with emergent arbitration learned from synchronous data. Examples include dual-tower transformers (dGSLM), joint autoregressive models (Moshi), and interleaved token streams (SyncLLM) (Chen et al., 18 Sep 2025).
Turn-taking, backchanneling, and overlap generation can be modeled with explicit objectives and metrics (e.g., Floor-Taking Offset, Interruption Success Rate), duration-prediction heads, and specialized input/output representations (Mitsui et al., 2023, Li et al., 2022).
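As a concrete example, Floor-Taking Offset can be measured as the signed gap between the end of the user's turn and the onset of the system's reply, with negative values indicating overlap. The turn boundaries and nearest-end pairing heuristic below are simplifying assumptions standing in for the output of a voice-activity detector.

```python
# Sketch: Floor-Taking Offset (FTO) from (start, end) turn boundaries in seconds.
def floor_taking_offsets(user_turns, system_turns):
    """For each system turn, signed offset from the nearest user turn end."""
    offsets = []
    for sys_start, _ in system_turns:
        nearest_end = min((end for _, end in user_turns),
                          key=lambda end: abs(end - sys_start))
        offsets.append(round(sys_start - nearest_end, 3))
    return offsets

user = [(0.0, 2.1), (4.0, 6.3)]
system = [(2.3, 3.8), (6.1, 8.0)]          # second reply starts 0.2 s early (overlap)
print(floor_taking_offsets(user, system))  # [0.2, -0.2]
```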
4. Datasets and Benchmarks
The progress of SDS research is tightly coupled to the availability of high-fidelity, well-annotated datasets:
| Dataset | Modality | Scale | Highlights |
|---|---|---|---|
| Switchboard | Speech, Text | 2,500 conv. | Real telephone speech, ASR noise |
| MultiWOZ | Text | 10k+ dialogs | Task-oriented, multi-domain |
| SpokenWOZ | Speech, Text | 5.7k dialogs, 249 h | Cross-turn, reasoning slots |
| Fisher | Speech, Dual-Stream | 964 h | Synchronous dialogues |
| MultiDialog | Audio-Visual, Text | 9k dialogs, 340 h | Face-to-face, emotion-annotated |
SpokenWOZ establishes benchmarks for dialogue state tracking (DST), end-to-end completion, and slot reasoning under realistic ASR noise and spontaneous conversational phenomena (Si et al., 2023). MultiDialog enables training of face-to-face, audio-visual dialogue models with fine-grained emotion annotation (Park et al., 12 Jun 2024).
Benchmarks and metrics include ASR WER, subjective MOS, BLEU/ROUGE/METEOR for textual alignment, Fréchet Audio Distance (FAD), and interaction-centric quantities such as Floor-Taking Offset (FTO), speaker latency, and backchannel duration ratios (Ji et al., 15 Nov 2024, Chen et al., 18 Sep 2025).
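For example, corpus-level WER can be computed with the jiwer package by pooling word-level edit operations across utterances; the reference/hypothesis pairs below are toy data.

```python
# Sketch: dataset-level word error rate with jiwer.
import jiwer

refs = ["book a table for two", "what time does it open"]
hyps = ["book the table for two", "what time does it open"]
print(jiwer.wer(refs, hyps))   # 1 substitution / 10 reference words = 0.1
```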
5. Evaluation, Metrics, and Limitations
Evaluation in spoken dialogue is multi-faceted, covering text, audio, and interactive dynamics:
- Textual metrics: BLEU, ROUGE, and METEOR for n-gram overlap with references; perplexity for model uncertainty; BERTScore and GPT-4 judge scores for open-ended responses.
- Speech/audio metrics: WER, CER, MOS (subjective and neural predictors), and mel-cepstral distortion (MCD).
- Duplex/interaction metrics: Floor-Taking Offset (FTO), speaker latency (SL), IRD, Interruption Success Rate (ISR), overlap/gap duration matching, and MAE on lead-time prediction (Li et al., 2022, Chen et al., 18 Sep 2025).
- Task completion: Joint goal accuracy (DST) and dialog-level completion (Si et al., 2023); a minimal joint-goal-accuracy sketch follows this list.
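A minimal joint-goal-accuracy sketch, counting a turn as correct only when the predicted dialogue state matches the gold state exactly; the slot names are illustrative.

```python
# Sketch: all-or-nothing joint goal accuracy for dialogue state tracking.
def joint_goal_accuracy(gold_states, pred_states):
    correct = sum(1 for g, p in zip(gold_states, pred_states) if g == p)
    return correct / len(gold_states)

gold = [{"restaurant-food": "thai"},
        {"restaurant-food": "thai", "restaurant-area": "north"}]
pred = [{"restaurant-food": "thai"},
        {"restaurant-food": "thai", "restaurant-area": "centre"}]
print(joint_goal_accuracy(gold, pred))   # 0.5: the second turn misses the area slot
```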
Current limitations span data scarcity for synchronous, spontaneous interaction; trade-offs between semantic and acoustic representations; high autoregressive decoding latency over long speech-token sequences; the lack of unified benchmarks for full-duplex evaluation; and brittle adaptation under domain shift or speech variation (Ji et al., 15 Nov 2024).
6. Emerging Directions and Open Challenges
Key frontiers for spoken dialogue modeling include:
- Full-duplex synchronization: Real-time systems require architectures capable of cognitive parallelism—encoding and generating audio simultaneously with minimal latency—with emergent behavioral arbitration and robust interruption management (Chen et al., 18 Sep 2025).
- Unified modeling: Moving toward end-to-end, text-free approaches that jointly plan response content and prosody, leveraging LLM-scale models for both semantic and phonological generation (Zhou et al., 2023, Park et al., 12 Jun 2024).
- Multimodal learning: Incorporating visual cues, gestures, and face-to-face synthesis substantially improves robustness in noise and user engagement (Park et al., 12 Jun 2024).
- Advanced tokenization: Transitioning to hybrid tokenizers (combining semantic and acoustic features) and low-frame-rate representations to balance compression and expressivity (Ji et al., 15 Nov 2024).
- Improved reward learning and RLHF: Human feedback and preference optimization for tuning speech quality and safety criteria remain underexplored but are required for practical deployment; a minimal preference-loss sketch follows this list.
- Evaluation standardization: There is a critical need for shared, multi-domain, open-source benchmarks and measurement suites that reflect both interactive naturalness and downstream task success (Si et al., 2023, Ji et al., 15 Nov 2024).
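As a sketch of the preference-optimization direction above, the following implements the standard DPO objective over paired preferred/rejected responses in PyTorch; treating speech-token sequences (and the toy log-probabilities shown) as its inputs is an assumption for illustration, not a recipe from the cited surveys.

```python
# Sketch: Direct Preference Optimization (DPO) loss over preference pairs.
import torch
import torch.nn.functional as F

def dpo_loss(policy_logp_w, policy_logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Sequence log-probs summed per response; w = preferred, l = rejected."""
    margin = (policy_logp_w - ref_logp_w) - (policy_logp_l - ref_logp_l)
    return -F.logsigmoid(beta * margin).mean()

# Toy per-example sequence log-probabilities.
loss = dpo_loss(torch.tensor([-12.0]), torch.tensor([-15.0]),
                torch.tensor([-13.0]), torch.tensor([-14.0]))
print(loss.item())
```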
7. Synthesis and Outlook
The WavChat research landscape encompasses deterministic modular cascades, advanced neural architectures for streaming, full-duplex interaction, and multi-modal generation. There is clear evidence that self-supervised and cross-modal learning unlocks robust representations for speech perception and production. Token-based and hybrid models bridge symbolic reasoning with prosodic and paralinguistic control, while benchmarks such as SpokenWOZ and MultiDialog drive research on ASR error tolerance, slot reasoning, real-time turn-taking, and expressive interaction (Ji et al., 15 Nov 2024, Si et al., 2023, Park et al., 12 Jun 2024).
Despite these advances, pressing challenges persist: seamless full-duplex integration, interpretability, robust few-shot/generalization, and reproducible evaluation. Addressing these will require harmonized efforts in architectural design, data collection, open evaluation, and human-in-the-loop preference optimization to realize conversational agents that truly “hear, think, and speak” in natural, synchronous dialogue (Ji et al., 15 Nov 2024, Chen et al., 18 Sep 2025).