End-to-End Spoken Dialogue Systems
- End-to-end spoken dialogue systems are unified architectures that jointly process raw audio to generate conversational responses using trainable neural pipelines.
- They leverage encoder-decoder or transformer-based models to convert speech into semantic units and integrate contextual and paralinguistic cues for multi-turn dialogue.
- Advanced strategies like retrieval-augmented generation and chain-of-thought decoding enhance response quality, robustness, and empathetic understanding.
End-to-end spoken dialogue systems (E2E-SDSs) are computational architectures that unify audio understanding, semantic interpretation, dialogue management, and synthetic speech generation within trainable neural pipelines. Eschewing discrete modular boundaries, E2E-SDSs aspire to jointly model the mapping from raw or minimally processed speech input to conversational speech output, capturing not only linguistic but also paralinguistic, contextual, and interactive features. This paradigm has advanced rapidly due to the availability of large-scale conversational datasets, improvements in foundation models for both text and audio, and systematic benchmarking efforts.
1. System Architectures and Training Paradigms
E2E-SDSs are typically instantiated as either encoder-decoder or transformer-based architectures, with input and output modalities comprising direct speech-to-speech mappings or hybridized text-token intermediaries. Systems like dGSLM, SpeechGPT, Moshi, and Mini-Omni tokenize input speech via approaches such as HuBERT, EnCodec, or discrete self-supervised representations, and perform dialogue processing using LLMs adapted for speech token streams (Ji et al., 15 Nov 2024). This model family is distinguished by:
- Speech Representation Stage: Conversion of raw audio to semantic or acoustic discrete units, e.g., HuBERT units for linguistic content and EnCodec for timbre, prosody, and emotion cues.
- Central Dialogue Model: Typically an LLM, pre-trained on text, then adapted to accept/emit speech tokens, with optional modality-bridging adapters or embedding projections.
- Speech Synthesis/Decoding: Autoregressive generation of output speech tokens from the LLM, which are subsequently decoded back to waveform.
- Unified Training: Models are pre-trained in stages—ASR or speech-to-semantics, text-based dialogue, TTS, then jointly fine-tuned for end-to-end dialogue (Arora et al., 31 May 2025).
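As a concrete, if schematic, illustration of these stages, the following sketch wires the three components together. `SpeechTokenizer`, `SpeechTokenLM`, and `UnitVocoder` are hypothetical stand-ins for a unit-based tokenizer, a speech-token-adapted LLM, and a unit vocoder; they are assumptions for illustration, not the APIs of any of the systems named above.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class E2ESpokenDialoguePipeline:
    """Sketch of a speech-token E2E-SDS: tokenize -> LLM over tokens -> vocode."""
    tokenizer: "SpeechTokenizer"   # audio -> discrete semantic/acoustic units (hypothetical)
    dialogue_lm: "SpeechTokenLM"   # LLM adapted to consume/emit speech tokens (hypothetical)
    vocoder: "UnitVocoder"         # discrete units -> waveform (hypothetical)

    def respond(self, waveform: Sequence[float], history: List[int]) -> Sequence[float]:
        # 1) Speech representation stage: raw audio -> discrete units.
        input_units = self.tokenizer.encode(waveform)
        # 2) Central dialogue model: autoregressive generation conditioned on
        #    dialogue history plus the new user turn.
        output_units = self.dialogue_lm.generate(history + input_units)
        # 3) Speech synthesis/decoding: output units -> waveform.
        return self.vocoder.decode(output_units)
```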
An important trend is aligning LLM pre-training tasks with E2E-SDS objectives, e.g., performing intermediate speech recognition and text-to-text reasoning before final speech synthesis, as formalized in factorized posteriors such as

$$P(S_{\text{out}} \mid S_{\text{in}}) = \sum_{T_{\text{in}},\, T_{\text{out}}} P(S_{\text{out}} \mid T_{\text{out}})\, P(T_{\text{out}} \mid T_{\text{in}})\, P(T_{\text{in}} \mid S_{\text{in}}),$$

where $S_{\text{in}}$ and $S_{\text{out}}$ denote the input and output speech (token sequences) and $T_{\text{in}}$, $T_{\text{out}}$ the corresponding intermediate transcripts; in practice the marginalization is approximated by sequentially decoding $T_{\text{in}}$, then $T_{\text{out}}$, then $S_{\text{out}}$.
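Read procedurally, this factorization corresponds to staged (chain-of-thought style) decoding. A minimal sketch, assuming a generic `generate(prefix, task)` callable over a mixed speech/text token vocabulary (a placeholder interface, not a real toolkit API):

```python
from typing import Callable, List

# Illustrative staged decoding: speech tokens -> transcript -> text response
# -> speech tokens, mirroring the factorized posterior above. `generate`
# stands in for any autoregressive LM over speech/text tokens.
def staged_decode(
    speech_in: List[int],
    generate: Callable[[List[int], str], List[int]],
) -> dict:
    # Stage 1: intermediate speech recognition, P(T_in | S_in)
    transcript_in = generate(speech_in, "asr")
    # Stage 2: text-to-text dialogue reasoning, P(T_out | T_in)
    transcript_out = generate(speech_in + transcript_in, "text_response")
    # Stage 3: speech synthesis, P(S_out | T_out)
    speech_out = generate(speech_in + transcript_in + transcript_out, "speech_response")
    return {
        "transcript_in": transcript_in,
        "transcript_out": transcript_out,
        "speech_out": speech_out,
    }
```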
2. Contextual Modeling and Dialogue History Integration
Capturing dialogue context is essential for robust multi-turn conversation. Solutions include:
- Explicit Dialog Context Embeddings: Use of h-vectors, BERT-prompted embeddings, or sequence-level context vectors drawn from previous system/user utterances, which are appended to acoustic or intermediate features within the processing pipeline (Tomashenko et al., 2020, Ganhotra et al., 2021).
- Hierarchical/Transformer Context Encoders: Multi-level encoders aggregate utterance-level (speech/text) embeddings into richer context vectors. High-level context encoding (e.g., via a six-layer transformer as in (Sunder et al., 2022)) allows history to be considered in full speech form, not only via text intermediaries.
- Attentive and Gated Fusion: Multi-head attention mechanisms weigh previous utterances, system acts, or slots, while gating lets the model modulate the contribution of context dynamically (Wei et al., 2021); a fusion sketch follows this list.
- Empathy-Specific Context: Recent systems track both linguistic and paralinguistic context (e.g., emotion, age, gender) via chain-of-thought reasoning, improving empathetic response generation with reasoning stages that emit paralinguistic-attribute tokens before the spoken response (Geng et al., 13 Aug 2025).
Such strategies have resulted in demonstrable gains in intent/slot F1, macro-F1, semantic error rate (SemER), and robustness to ambiguous or elliptical utterances (Ganhotra et al., 2021, Wei et al., 2021, Sunder et al., 2022, Geng et al., 13 Aug 2025).
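As an illustration of the attentive and gated fusion strategy above, the following PyTorch sketch attends over dialogue-history embeddings and gates their contribution to the current-turn representation; the dimensions, module layout, and names are assumptions for illustration rather than the cited models' exact designs.

```python
import torch
import torch.nn as nn


class GatedContextFusion(nn.Module):
    """Attend over dialogue-history embeddings, then gate their contribution."""

    def __init__(self, dim: int = 256, num_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())

    def forward(self, utterance: torch.Tensor, history: torch.Tensor) -> torch.Tensor:
        # utterance: (batch, 1, dim) current-turn embedding
        # history:   (batch, turns, dim) embeddings of previous user/system turns
        context, _ = self.attn(query=utterance, key=history, value=history)
        # The gate decides, per dimension, how much context to mix in.
        g = self.gate(torch.cat([utterance, context], dim=-1))
        return g * context + (1.0 - g) * utterance


# Usage: fused = GatedContextFusion()(torch.randn(2, 1, 256), torch.randn(2, 5, 256))
```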
3. Response Generation, Retrieval, and Grounding
Generating contextually appropriate, informative, and coherent system responses is a core challenge addressed via:
- Sequence-to-Sequence and Copy Mechanisms: Transformer-based seq2seq models with dual decoders (for belief state and response), using copy attention over the input so that slot values and database entries are reproduced accurately (Měkota et al., 2020).
- Retrieval-Augmented Generation (RAG): Integration of external textual knowledge by mapping speech queries and candidate documents into a shared embedding space using multi-encoders. Retrieval is performed directly in this speech-to-textual vector space, removing the need for explicit ASR decoding, thereby reducing latency and preserving speech cues (Feng et al., 27 Apr 2025); a minimal retrieval sketch follows this list.
- Chain-of-Thought Decoding: Structured intermediate reasoning (first ASR, then text response, then speech) enables the LLM to reason in stages, improving semantic coherence as reflected in ROUGE-1 and METEOR scores (Arora et al., 31 May 2025).
- Empathetic Reasoning Cascades: Incorporation of explicit paralinguistic reasoning prior to spoken response, leveraging chain-of-thought decoding to produce emotionally congruent outputs (Geng et al., 13 Aug 2025).
This coordinated approach to response generation draws on both in-situ dialogue context and external knowledge, as evidenced by improvements in response diversity and accuracy.
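A minimal sketch of retrieval in a shared speech-text embedding space, as referenced in the RAG bullet above. It assumes a speech encoder has already embedded the spoken query and a text encoder has embedded the candidate documents into the same vector space; this setup is an assumption for illustration, not the cited system's actual interfaces.

```python
import numpy as np


def retrieve(query_embedding: np.ndarray,
             doc_embeddings: np.ndarray,
             top_k: int = 3) -> np.ndarray:
    """Return indices of the top_k documents by cosine similarity.

    query_embedding: (dim,) vector produced by a speech encoder from the
                     spoken query -- no intermediate ASR transcript needed.
    doc_embeddings:  (num_docs, dim) matrix from a text encoder sharing the
                     same embedding space (aligned during training).
    """
    q = query_embedding / (np.linalg.norm(query_embedding) + 1e-8)
    d = doc_embeddings / (np.linalg.norm(doc_embeddings, axis=1, keepdims=True) + 1e-8)
    scores = d @ q                      # cosine similarities
    return np.argsort(-scores)[:top_k]  # best-matching documents first


# The retrieved passages are then prepended (as text or text tokens) to the
# dialogue model's context before response generation.
```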
4. Evaluation Methodologies and Benchmarks
Evaluation of E2E-SDSs has evolved beyond text metrics alone to encompass the full speech pipeline and paralinguistic understanding:
| Criterion | Metrics/Frameworks | Key Aspects Captured |
|---|---|---|
| Speech/ASR | WER, CER, FWER, automatic ASR-WER scoring | Speech recognition, anticipation, predictive timing |
| Semantic/Dialogue | Intent F1/Macro-F1, ROUGE-1, METEOR, perplexity | Intent, slot, and content relevance |
| Audio/Paralinguistics | UTMOS, MOS, DNS_overall, emotion2vec, GenEmotion | Speech quality, emotion/timbre, style similarity |
| Dialogue Quality | BERTScore, self-BLEU/auto-BLEU, backchannel rate | Coherence, diversity, naturalness |
| Latency | Real-Time Factor, first packet latency, module times | System responsiveness, streaming capabilities |
| Benchmarks | URO-Bench (Yan et al., 25 Feb 2025), EChat-eval (Geng et al., 13 Aug 2025), ESPnet-SDS (Arora et al., 11 Mar 2025) | Multilingualism, reasoning, empathy, multi-round dialogue |

Recent benchmarks such as URO-Bench provide multifaceted, scenario-rich evaluation of spoken dialogue models (SDMs), covering multilingualism, contextual tracking, reasoning, and paralinguistic expression (Yan et al., 25 Feb 2025). EChat-eval targets empathetic ability via a battery of paralinguistic and emotional cues (Geng et al., 13 Aug 2025). Toolkits like ESPnet-SDS enable standardized, component-level and end-to-end evaluation with direct human-in-the-loop feedback and on-the-fly computation of modular and overall metrics (Arora et al., 11 Mar 2025).
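For the Speech/ASR row of the table, word error rate is the ratio of word-level edit operations to reference length. A minimal, toolkit-independent implementation of the standard Levenshtein formulation:

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # Dynamic-programming edit distance over words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution or match
    return d[len(ref)][len(hyp)] / max(len(ref), 1)


# Example: one substitution over four reference words -> WER = 0.25
assert abs(word_error_rate("book a table tonight", "book a cable tonight") - 0.25) < 1e-9
```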
5. Practical Applications and System Robustness
E2E-SDSs have been deployed in task-oriented domains (e.g., task-completion settings such as movie-ticket booking and restaurant reservation (Li et al., 2018)), education (math tutoring (Okur et al., 2022)), empathetic agents (Nishimura et al., 2022, Geng et al., 13 Aug 2025), and question answering systems (You et al., 2022). Results show that:
- Fully differentiable E2E-SDSs, tightly coupling ASR, NLU, and TTS, propagate semantic objectives to lower layers, reducing downstream error cascades and improving performance on tasks with unseen slot arguments or rare vocabulary (Saxon et al., 2021); a joint-objective sketch closes this section.
- Retrieval-augmented E2E pipelines allow for low-latency access to factual knowledge, benefiting time-critical conversational responses even at modest expense in retrieval accuracy relative to cascaded methods (Feng et al., 27 Apr 2025).
- Streaming and duplex capabilities (e.g., causal convolutions, mask-based predictive ASR, anticipatory end-of-utterance (EOU) detection (Zink et al., 30 Sep 2024, Ji et al., 15 Nov 2024)) enable more human-like turn-taking by predicting utterance completion and starting response generation during the user's speech, as sketched below.
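A minimal sketch of the anticipatory turn-taking idea from the last bullet: a streaming loop that triggers response generation as soon as a predicted end-of-utterance probability crosses a threshold, rather than waiting for trailing silence. The `eou_probability` and `start_response` callables are illustrative placeholders, not components of the cited systems.

```python
from typing import Callable, Iterable, List, Optional


def duplex_turn_loop(
    audio_chunks: Iterable[List[float]],
    eou_probability: Callable[[List[float]], float],  # predicted prob. the user is finishing
    start_response: Callable[[], None],               # kick off response generation early
    threshold: float = 0.8,
) -> Optional[int]:
    """Consume streaming audio; trigger response generation at anticipated EOU.

    Returns the index of the chunk at which generation was started, or None.
    """
    buffered: List[float] = []
    for i, chunk in enumerate(audio_chunks):
        buffered.extend(chunk)
        # Causal, incremental prediction of end-of-utterance from audio so far.
        if eou_probability(buffered) >= threshold:
            start_response()   # overlap generation with the user's final words
            return i
    return None
```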
Studies of deployed systems reveal the impact of real-world noise, child speech variability, data scarcity, and error propagation. Robustness is further influenced by the choice of embeddings, context integration, and dialogue manager design (Okur et al., 2022).
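Returning to the first bullet above, the propagation of semantic objectives through a fully differentiable stack can be sketched as a weighted multi-task loss whose gradients reach the shared lower layers; the weights and loss names here are illustrative assumptions, not Saxon et al.'s exact formulation.

```python
import torch

# Illustrative: losses from ASR, NLU, and TTS heads that share upstream
# (acoustic encoder) parameters. Backpropagating the weighted sum pushes the
# semantic (NLU) objective into the lower, shared layers.
def joint_objective(asr_loss: torch.Tensor,
                    nlu_loss: torch.Tensor,
                    tts_loss: torch.Tensor,
                    weights=(0.3, 0.5, 0.2)) -> torch.Tensor:
    w_asr, w_nlu, w_tts = weights
    return w_asr * asr_loss + w_nlu * nlu_loss + w_tts * tts_loss

# Usage: optimizer.zero_grad(); joint_objective(l_asr, l_nlu, l_tts).backward(); optimizer.step()
```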
6. Challenges, Limitations, and Prospective Directions
Despite advances, E2E-SDSs face several open challenges:
- Data Scarcity and Modality Gap: Conversational speech data remains scarcer and less information-dense than text. Effectively bridging semantics and acoustic detail, especially for emotion and timbral cues, remains an active research area (Ji et al., 15 Nov 2024).
- Representation Trade-offs: Semantic representations are compact and efficient but sacrifice prosodic and stylistic information; acoustic representations preserve these cues but increase sequence length and computational cost (a back-of-the-envelope comparison follows this list).
- Instruction Following and Reasoning: Catastrophic forgetting, diminished instruction-following, and weak reasoning relative to text-based LLMs are prevalent issues (Yan et al., 25 Feb 2025).
- Diversity and Audio Quality: Current E2E systems tend to generate less diverse responses and have lower audio quality compared to modular cascaded architectures (Arora et al., 11 Mar 2025).
- Paralinguistic and Multimodal Integration: Speaker identification, environmental cue processing, and contextually correct emotion expression remain areas of weakness in open-source models (Yan et al., 25 Feb 2025, Geng et al., 13 Aug 2025).
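To make the representation trade-off concrete, the calculation below compares token-sequence lengths for a 10-second utterance; the rates used (50 semantic units/s; 75 codec frames/s with 8 codebooks) are illustrative values in the range of common HuBERT/EnCodec-style configurations, not measurements from the cited systems.

```python
# Illustrative sequence-length comparison for a 10-second utterance.
DURATION_S = 10

SEMANTIC_UNITS_PER_S = 50          # e.g., HuBERT-style units (assumed rate)
CODEC_FRAMES_PER_S = 75            # e.g., neural codec frame rate (assumed)
CODEC_CODEBOOKS = 8                # parallel residual codebooks (assumed)

semantic_len = DURATION_S * SEMANTIC_UNITS_PER_S                  # 500 tokens
acoustic_len = DURATION_S * CODEC_FRAMES_PER_S * CODEC_CODEBOOKS  # 6000 tokens

print(f"semantic tokens: {semantic_len}, acoustic tokens: {acoustic_len}")
print(f"acoustic/semantic length ratio: {acoustic_len / semantic_len:.0f}x")
# Longer sequences raise attention cost roughly quadratically, which is why
# acoustic-token systems pay more compute to keep prosody and timbre.
```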
Research directions include development of unified semantic-acoustic representations, more data-efficient training strategies (e.g., joint speech-text tokenization, chain-of-thought learning (Arora et al., 31 May 2025)), advanced retrieval schemes for knowledge augmentation, reinforcement learning for enhanced duplex interaction, and expanding evaluation to noisy, environment-rich and multimodal real-world data.
7. Impact on the Field and Future Outlook
The emergence of E2E spoken dialogue systems—with empirical advances in low-latency processing, integration of paralinguistic cues, and more expressive dialogue—has redefined the bounds of conversational AI architectures. Benchmarking platforms and open-source toolkits now facilitate rigorous, reproducible comparisons. While challenges remain in learning efficiency, context modeling, robustness, and speech quality, the field is rapidly evolving toward unified, contextually fluent, and emotionally intelligent conversational agents, positioning E2E-SDSs as a focal point for future research in human-computer interaction.