
End-to-End Spoken DST Framework

Updated 2 December 2025
  • End-to-End Spoken DST is a neural framework that directly maps raw speech and dialogue history into structured slot–value representations, unifying speech recognition and state tracking.
  • Advanced extraction mechanisms, including pointer networks, MRC approaches, and memory networks, enable accurate open-vocabulary slot value prediction.
  • Joint optimization using ASR pretraining, multi-modal training, and reinforcement learning enhances performance and domain generalization in real-time dialogue systems.

End-to-end spoken Dialogue State Tracking (DST) denotes a class of neural architectures and methodologies that directly map raw, turn-level speech input (audio or ASR hypotheses) and dialogue history into structured dialogue state representations, usually comprising slot–value pairs or serialized JSON objects. These frameworks eliminate or integrate the conventional separation between speech recognition (ASR), spoken language understanding (SLU), and state tracking, enabling joint optimization, reducing error propagation, and improving open-vocabulary tracking and generalization across domains. Recent advances encompass classic neural pointer networks, Transformer-based fusion, speech–LLM alignment pipelines, context compression techniques, and large-scale synthetic speech data strategies.
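
For illustration, a dialogue state after a hotel-booking turn might be serialized as the following slot–value structure (a hypothetical example; slot names loosely follow MultiWOZ/SpokenWOZ conventions):

```python
# Hypothetical state after a user turn such as
# "I need a cheap hotel in the centre for two nights."
dialogue_state = {
    "hotel": {
        "pricerange": "cheap",  # categorical slot
        "area": "centre",       # categorical slot
        "stay": "2",            # open-vocabulary slot
    }
}
```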

1. System Architectures and Input Encodings

End-to-end spoken DST systems are typically composed of:

  • A speech encoder (e.g., WavLM or Whisper) that converts turn-level audio into acoustic embeddings.
  • A connector or alignment module that projects these embeddings into the input space of the downstream language model, optionally compressing them.
  • A decoder or LLM backbone (e.g., T5, OLMo, Gemma-2) that generates the serialized dialogue state.

Input construction varies: some models process the entire audio history for both user and agent turns (Ghazal et al., 10 Oct 2025), while others use only current turn speech with preceding history in text form (Vendrame et al., 27 Nov 2025, Sedláček et al., 10 Jun 2025). Compression modules, based on TransformerDecoder cross-attention, reduce context dimensionality without losing salient slot–value content (Ghazal et al., 10 Oct 2025).
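
The compression mechanism admits a compact sketch: learned query vectors cross-attend over a turn's speech embeddings through a TransformerDecoder, yielding a fixed-size summary per turn. The hyperparameters and class names below are illustrative assumptions, not the cited system's exact configuration.

```python
import torch
import torch.nn as nn

class TurnCompressor(nn.Module):
    """Minimal sketch of TransformerDecoder-based context compression:
    a fixed set of learned queries cross-attends to a variable-length
    turn embedding sequence, producing n_queries summary vectors."""

    def __init__(self, d_model: int = 768, n_queries: int = 8,
                 n_heads: int = 8, n_layers: int = 2):
        super().__init__()
        # Learned query vectors; each attends over the turn's embeddings.
        self.queries = nn.Parameter(torch.randn(n_queries, d_model) * 0.02)
        layer = nn.TransformerDecoderLayer(
            d_model=d_model, nhead=n_heads, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=n_layers)

    def forward(self, turn_embeddings: torch.Tensor) -> torch.Tensor:
        # turn_embeddings: (batch, seq_len, d_model) from the speech encoder.
        batch = turn_embeddings.size(0)
        tgt = self.queries.unsqueeze(0).expand(batch, -1, -1)
        # Output: (batch, n_queries, d_model), a fixed-size turn summary.
        return self.decoder(tgt=tgt, memory=turn_embeddings)

# Example: compress a 300-frame turn into 8 context vectors.
compressor = TurnCompressor()
compressed = compressor(torch.randn(2, 300, 768))  # -> (2, 8, 768)
```

Increasing n_queries trades sequence-length savings against slot recall, mirroring the diminishing-returns behavior reported for larger per-turn query counts.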

2. Slot Value Extraction Mechanisms

Slot value prediction in end-to-end spoken DST comprises a range of neural techniques:

  • Pointer networks: Systems such as the BiLSTM-based pointer network (Xu et al., 2018) and BERT-based single-step pointer architectures (Yang et al., 2020) extract slot values by learning attention distributions that localize slot value spans in the input, robustly handling unknown or out-of-ontology values. The single-step pointer in STN4DST directly predicts the span start position, with span continuation informed by BIO token tags (Yang et al., 2020); a minimal decoding sketch follows this list. Targeted feature dropout regularizes against memorization and sharpens generalization to unseen values (Xu et al., 2018).
  • Machine reading comprehension (MRC): Some frameworks cast slot value extraction as span-based QA over dialogue context (e.g., XLNet (Ma et al., 2019)) with slot descriptions as natural-language questions, supporting unconstrained and zero-shot slot types. For non-categorical slots, a span is selected via start/end probabilities; for categorical slots, a wide & deep classifier integrates both hand-crafted features and Transformer embeddings (Ma et al., 2019).
  • Memory networks: MemN2N frames DST as a multi-hop question-answering problem over memory slots representing utterances, accommodating standard slot filling as well as counting and yes/no reasoning tasks (Perez et al., 2016).
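
The single-step pointer idea can be sketched as follows: a slot-conditioned scorer picks the span start token, and per-token BIO tags extend the span. This is a hedged PyTorch illustration of the mechanism, not the exact STN4DST architecture; layer choices and names are assumptions.

```python
import torch
import torch.nn as nn

class SingleStepPointer(nn.Module):
    """Sketch of single-step pointer extraction: a bilinear scorer over
    (slot, token) pairs yields start-position logits; a linear head
    produces B/I/O tags that determine span continuation."""

    def __init__(self, d_model: int = 768, n_bio_tags: int = 3):
        super().__init__()
        self.start_scorer = nn.Bilinear(d_model, d_model, 1)  # slot x token
        self.bio_head = nn.Linear(d_model, n_bio_tags)        # B / I / O

    def forward(self, token_states: torch.Tensor, slot_state: torch.Tensor):
        # token_states: (batch, seq_len, d); slot_state: (batch, d)
        seq_len = token_states.size(1)
        slot = slot_state.unsqueeze(1).expand(-1, seq_len, -1)
        start_logits = self.start_scorer(slot, token_states).squeeze(-1)
        bio_logits = self.bio_head(token_states)  # (batch, seq_len, 3)
        return start_logits, bio_logits

def extract_span(start_logits, bio_tags, tokens):
    """Decode: take the argmax start, then extend while tags read 'I'."""
    start = int(start_logits.argmax())
    end = start
    while end + 1 < len(tokens) and bio_tags[end + 1] == "I":
        end += 1
    return tokens[start:end + 1]
```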

3. Joint Optimization and Training Paradigms

End-to-end systems maximize the synergy between speech, text, and state modules via joint and multi-task objectives:

  • ASR pre-training: Speech encoders and connectors are first optimized for transcription via CTC or decoder cross-entropy, exposing the model to acoustic variability (Sedláček et al., 10 Jun 2025, Vendrame et al., 27 Nov 2025).
  • DST fine-tuning: Downstream state tracking is learned via cross-entropy over serializations of the dialogue state, using parameters shared with speech-to-text modules and optionally updated via adapter-based approaches (LoRA (Vendrame et al., 27 Nov 2025), Task-Optimized Adapters (Bang et al., 2023)).
  • Reinforcement learning: Reward-driven objectives (e.g., expected Joint Goal Accuracy) further refine DST predictions, allowing direct optimization of downstream metrics (Bang et al., 2023).
  • Multi-modal training: Joint training on both spoken and textual DST data (e.g., MultiWOZ text with SpokenWOZ speech) enables cross-domain generalization without explicit labeled audio for target domains (Vendrame et al., 27 Nov 2025). Connector and LoRA parameters are shared across both pipelines, and loss weights can be tuned to balance source/target domain performance; a minimal training-step sketch follows this list.
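
A hedged sketch of the weighted multi-modal objective, assuming a HuggingFace-style `model(batch).loss` interface shared across the text and speech pipelines (the interface and weight values are illustrative assumptions):

```python
def joint_training_step(model, text_batch, speech_batch,
                        w_text: float = 0.5, w_speech: float = 0.5):
    """One joint text+speech DST update: cross-entropy over serialized
    states is computed on a text-only batch (e.g., MultiWOZ) and a spoken
    batch (e.g., SpokenWOZ), then mixed with tunable loss weights. The
    shared connector/LoRA parameters receive gradients from both terms."""
    loss_text = model(text_batch).loss      # text pipeline
    loss_speech = model(speech_batch).loss  # speech pipeline
    loss = w_text * loss_text + w_speech * loss_speech
    loss.backward()
    return loss.item()
```

Tuning w_text versus w_speech is the lever for balancing source-domain (text) coverage against target-domain (speech) fidelity.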

4. Context Management and Propagation

Contextualization in spoken DST is crucial for resilient state tracking:

  • Full spoken history: Feeding connector-compressed or raw speech embeddings for all turns enables the LLM to attend to rich prosodic, lexical, and cross-turn cues, outperforming multimodal text/speech fusion baselines (Ghazal et al., 10 Oct 2025).
  • Attention pooling and compression: Cross-attention modules compress each turn's embeddings into fixed vectors using learned queries, maintaining slot recall while reducing sequence length. Increasing queries per turn boosts performance but with diminishing efficiency returns (Ghazal et al., 10 Oct 2025).
  • Uncertainty propagation: Fusing previous dialogue states as embedded vectors lets models re-attend to prior slot assignments, but unmodeled uncertainty in past state predictions can accumulate errors over long dialogues (Druart et al., 2023). Proposed future work includes integrating soft probability distributions and scheduled sampling (a minimal sketch follows this list).
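
Scheduled sampling, one of the remedies suggested above, admits a very small sketch: during training, the previous state fed to the model is sampled between the gold annotation and the model's own prediction, so that training conditions match inference-time error accumulation. This is a generic illustration of the technique, not a procedure from the cited papers.

```python
import random

def previous_state_for_training(gold_prev_state, predicted_prev_state,
                                p_model: float):
    """With probability p_model, condition the current turn on the model's
    own predicted previous state rather than the gold one. p_model is
    typically annealed upward from 0 over the course of training."""
    if random.random() < p_model:
        return predicted_prev_state
    return gold_prev_state
```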

5. Datasets, Evaluation Metrics, and Generalization

Benchmarking of end-to-end spoken DST spans diverse datasets, most prominently SpokenWOZ for natively spoken task-oriented dialogues, text-based MultiWOZ for joint text/speech training (Vendrame et al., 27 Nov 2025), and synthetic-speech corpora derived from text DST data (Lee et al., 2023).

Standard metrics include Joint Goal Accuracy (JGA: strict full-state match per turn), Slot F1 (categorical, span, open), and, for phonetic matching, PhonemeF1 (accounting for pronunciation similarity in slot values) (Lee et al., 2023). Fuzzy matching and normalization (e.g., token-level Levenshtein ratio) are routinely applied to mitigate transcription and string mismatches in outputs (Sedláček et al., 10 Jun 2025, Ghazal et al., 10 Oct 2025).
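
These metrics admit a compact reference implementation. The sketch below computes JGA with fuzzy slot-value matching, using Python's difflib similarity ratio as a stand-in for the exact Levenshtein-based normalization used in the cited papers; the threshold is an illustrative assumption.

```python
from difflib import SequenceMatcher

def slot_match(pred: str, gold: str, threshold: float = 0.9) -> bool:
    """Fuzzy slot-value match via a Levenshtein-style similarity ratio,
    mitigating transcription/string mismatches in generated values."""
    pred, gold = pred.strip().lower(), gold.strip().lower()
    return SequenceMatcher(None, pred, gold).ratio() >= threshold

def joint_goal_accuracy(pred_states, gold_states, threshold: float = 0.9):
    """JGA: a turn counts only if every gold slot is matched and no
    spurious slots are predicted. States are flat {slot: value} dicts,
    one per turn."""
    correct = 0
    for pred, gold in zip(pred_states, gold_states):
        correct += (pred.keys() == gold.keys() and all(
            slot_match(pred[s], gold[s], threshold) for s in gold))
    return correct / len(gold_states)
```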

6. Empirical Results and Comparative Analysis

Across recent studies:

  • End-to-end architectures routinely outperform cascade baselines (ASR→SLU→DST), especially in audio-native settings (Druart et al., 2023, Ghazal et al., 10 Oct 2025).
  • Full spoken history context delivers an absolute JGA gain of more than 4 percentage points over a multimodal text/speech baseline, with further gains possible via larger LLMs (e.g., Gemma-2-9B) or compressed context modules (Ghazal et al., 10 Oct 2025).
  • LoRA-adapted LLMs and alignment connectors close the gap between transcription and semantic state tracking, with the strongest configuration achieving 42.17% JGA on the SpokenWOZ test set (Sedláček et al., 10 Jun 2025).
  • Cross-domain transfer via joint speech/text training can recover 40–70% of the generalization gap when target domain speech data is unavailable (Vendrame et al., 27 Nov 2025).
  • Synthetic data training shows E2E models can approach human-speech test performance within 1.8 PhonemeF1 points and even surpass cascade pipelines on categorical slot types (Lee et al., 2023).

A table of JGA results on SpokenWOZ from representative models:

| Model | SWOZ Test JGA (%) | Reference |
|---|---|---|
| SPACE+WavLMalign | 25.65 | (Sedláček et al., 10 Jun 2025) |
| Whisper+T5 (E2E) | 24.10 | (Druart et al., 2023) |
| WavLM+Connector+OLMo-1B (LoRA+FT) | 34.66 | (Sedláček et al., 10 Jun 2025) |
| WavLM+Connector+Gemma-2-9B (+Fuzzy) | 42.17 | (Sedláček et al., 10 Jun 2025) |
| Compressed Spoken History | 36.49 | (Ghazal et al., 10 Oct 2025) |
| Full Spoken History | 39.32 | (Ghazal et al., 10 Oct 2025) |

Editor's term: “Full Spoken Context” denotes architectures utilizing all prior user and agent turns' speech embeddings as prompt context for LLM-based state tracking (Ghazal et al., 10 Oct 2025).

7. Open Challenges and Future Directions

Persistent challenges in end-to-end spoken DST include:

  • Context propagation error: Sequential models accumulate mistakes as the dialogue progresses, which is especially punishing under strictly scored metrics such as JGA.
  • Domain generalization: Models trained on available speech domains have limited transfer to unseen domains unless enhanced by large text DST corpora and joint training (Vendrame et al., 27 Nov 2025).
  • Slot recall on open-value and profile slots: Attention compression sacrifices fine-grained details, requiring adaptive pool sizes.
  • Prosodic and phonetic normalization: Synthetic data may lack sufficient acoustic diversity and real-world noise, motivating adversarial augmentation and phoneme-based supervision (Lee et al., 2023).
  • Computational scaling: Larger LLM backbones and deeper speech encoders improve JGA but incur substantial resource costs.
  • Explicit uncertainty modeling: Context sampling and probabilistic propagation of past dialogue states remain open research areas (Druart et al., 2023).

Planned avenues include adapter-based modality fusion, curriculum and retrieval-driven data sampling, hierarchical context summarization, and multilingual/multi-domain expansion (Sedláček et al., 10 Jun 2025, Ghazal et al., 10 Oct 2025, Vendrame et al., 27 Nov 2025).


End-to-end spoken DST represents a convergence of advanced speech modeling, neural sequence generation, and robust state tracking methodologies. Its ongoing evolution prioritizes open-vocabulary tracking, cross-modal prompt alignment, scalable context fusion, and fully synthetic data training, enabling increasingly resilient and domain-general task-oriented dialogue systems.
