SpokenWOZ: Dual-Modal Dialogue Benchmark
- SpokenWOZ is a dual-modal corpus integrating telephone audio and text annotations for evaluating task-oriented dialogue (TOD) systems under real-world conditions.
- It features extensive annotations including dialogue states, acts, cross-turn slots, and reasoning slots to address challenges from ASR noise and disfluencies.
- Benchmark results show dual-modal and LLM-based approaches achieving up to 42.17% JGA, highlighting ongoing challenges in robust speech-aware dialogue management.
SpokenWOZ is a large-scale dual-modal corpus and benchmark designed for robust evaluation and development of spoken task-oriented dialogue (TOD) agents. Distinct from written TOD datasets, SpokenWOZ contains human-to-human telephone conversations with aligned speech and text, extensive dialogue state and dialogue act annotations, and explicit consideration of the disfluency, segmentation, and reasoning phenomena characteristic of spoken language. Its ambitious scope and architectural lessons position it as a central testbed for the next generation of speech-aware dialogue modeling.
1. Corpus Construction and Annotation
SpokenWOZ encompasses 5,700 dialogues, over 203,000 turns, and 249 hours of speech audio spanning eight domains (including a dedicated “profile” domain for personal information). Data collection was performed with real human-to-human telephone conversations, ensuring authentic spoken phenomena such as incremental utterances, hesitations, and back-channel feedback.
Each dialogue is annotated with text transcriptions, dialogue states (domain, slots, values), and dialogue acts. The design diverges from previous benchmarks by labeling two novel slot classes:
- Cross-turn slots: Slot values distributed piecewise across multiple turns (e.g., a user spelling out an email address).
- Reasoning slots: Values that require non-trivial inference, such as temporal mapping (“tomorrow” referenced relative to a day) or semantic deduction (“sushi” implying a Japanese restaurant).
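To make the two slot classes concrete, here is a hypothetical, simplified annotation fragment; the field names (`source_turns`, `evidence`) are invented for exposition and do not reproduce the official schema:

```python
dialogue = {
    "turns": [
        {"speaker": "user", "text": "my email is j o h n dot doe"},
        {"speaker": "system", "text": "got it, john.doe at which provider?"},
        {"speaker": "user", "text": "at mail dot com, and find me sushi for tomorrow"},
    ],
    "state": {
        # Cross-turn slot: the value is assembled piecewise from turns 0 and 2.
        "profile-email": {"value": "john.doe@mail.com", "source_turns": [0, 2]},
        # Reasoning slots: values are inferred rather than copied verbatim.
        "restaurant-food": {"value": "japanese", "evidence": "sushi"},
        "restaurant-day": {"value": "tuesday", "evidence": "'tomorrow', call on monday"},
    },
}
```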
SpokenWOZ transcripts retain automatic speech recognition (ASR) errors as a reflection of real-world deployment noise. This ensures that models trained or evaluated on the corpus encounter naturally noisy textual input, capturing the true difficulty of spoken language understanding and state tracking.
2. Challenges Unique to SpokenWOZ
SpokenWOZ directly addresses linguistic and modeling difficulties inadequately reflected in written corpora:
- ASR Noise: Recognition errors, such as mis-segmentations or misspellings, impact downstream DST performance, simulating deployment conditions.
- Word-by-word and disfluent utterances: Unlike written dialogue, spoken interactions exhibit fragmented, incremental expressions and turn-level segmentation, complicating dialogue context modeling and slot accumulation.
- Cross-turn aggregation: Models must aggregate slot values whose evidence is distributed across several turns, a scenario rarely addressed in written benchmarks.
- Reasoning slot detection: Requires multi-turn reasoning (arithmetic, temporal, semantic) instead of explicit value extraction, demanding language understanding that goes beyond token-level matching.
- Benchmarking with robust metrics: The core metric is joint goal accuracy (JGA), defined as

  $$\mathrm{JGA} = \frac{1}{T} \sum_{t=1}^{T} \mathbb{1}\left[\hat{B}_t = B_t\right],$$

  where $\hat{B}_t$ and $B_t$ denote the predicted and gold dialogue states at turn $t$. For response generation, a combined metric is also used:

  $$\mathrm{Combined} = \frac{\mathrm{Inform} + \mathrm{Success}}{2} + \mathrm{BLEU},$$

  emphasizing both task accuracy and generation quality (a minimal scoring sketch follows below).

These aspects produce a dataset and benchmark that more accurately reflect practical, deployed spoken dialogue systems.
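A minimal scoring sketch of both metrics, assuming dialogue states are dictionaries mapping "domain-slot" names to values and that Inform, Success, and BLEU are computed elsewhere following the MultiWOZ convention:

```python
def joint_goal_accuracy(predicted_states, gold_states):
    """Fraction of turns whose predicted dialogue state matches gold exactly."""
    assert len(predicted_states) == len(gold_states)
    correct = sum(pred == gold for pred, gold in zip(predicted_states, gold_states))
    return correct / len(gold_states)

def combined_score(inform, success, bleu):
    """Combined response-generation metric: (Inform + Success) / 2 + BLEU."""
    return (inform + success) / 2 + bleu

# Example: two of three turn-level states match exactly -> JGA ~= 0.667.
preds = [{"hotel-area": "north"}, {"hotel-area": "north", "hotel-stars": "4"}, {}]
golds = [{"hotel-area": "north"}, {"hotel-area": "north", "hotel-stars": "5"}, {}]
print(joint_goal_accuracy(preds, golds))
```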
3. Baselines, Experimental Results, and State-of-the-Art
SpokenWOZ has served as the primary benchmark for a spectrum of text-modal, dual-modal, and LLM-based methods.
- Text-modal baselines: Models such as BERT+TripPy, UBAR, and SPACE, evaluated solely on text input, obtain suboptimal results because they cannot leverage audio cues or cope with ASR noise and disfluencies.
- Dual-modal approaches: Methods incorporating both textual and raw speech signal features (e.g., SPACE+WavLM, SPACE+WavLMalign) outperform text-only architectures. The SPACE+WavLMalign configuration achieved the strongest text+speech baseline for dialogue state tracking, with 25.65% JGA (28.15% when cross-turn slots are excluded), markedly below typical results on the written MultiWOZ benchmark.
- LLM-based models: LLMs (e.g., ChatGPT, InstructGPT003) in zero-shot regimes exhibit strong response diversity but suffer from notable hallucinations and reduced DST robustness, especially under ASR noise, yielding lower JGA relative to dual-modal models.
- End-to-end results: The most advanced end-to-end model on SpokenWOZ completed user requests correctly in only about 52.1% of dialogues using the combined response metric, highlighting the intrinsic difficulty of robust spoken dialogue management.
Follow-up work that aligns speech encoders with LLMs via a connector module (Sedláček et al., 10 Jun 2025) has further improved test JGA on SpokenWOZ to 34.66% (with OLMo-1B) and up to 42.17% (with Gemma-2-9B-Instruct and fuzzy matching), establishing new state-of-the-art results in the domain while still leaving a substantial gap relative to written-text benchmarks.
| Model Setup | JGA (%) |
|---|---|
| SPACE+WavLMalign | 25.65 |
| WavLM+Connector+OLMo-1B | 34.66 |
| WavLM+Connector+Gemma-2-9B-Instruct* | 42.17 |
*With fuzzy output matching post-processing.
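The fuzzy matching step maps noisy generated slot values onto their closest legal counterparts before scoring. The published pipeline is not reproduced here; the sketch below shows the general idea using Python's standard `difflib`, assuming a closed set of candidate values per slot:

```python
import difflib

def fuzzy_normalize(predicted_value, candidates, cutoff=0.8):
    """Map a noisy generated slot value onto the closest candidate value.

    Exact matches pass through unchanged; otherwise the closest candidate
    above `cutoff` similarity is substituted. `candidates` is the set of
    legal values for the slot (an assumption for this sketch).
    """
    if predicted_value in candidates:
        return predicted_value
    matches = difflib.get_close_matches(
        predicted_value, candidates, n=1, cutoff=cutoff
    )
    return matches[0] if matches else predicted_value

print(fuzzy_normalize("japanes", {"japanese", "chinese", "indian"}))  # japanese
```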
4. Methodological Implications and Lessons
Several methodological best practices and findings have emerged from development and competitive evaluation on SpokenWOZ:
- Data augmentation is essential to reduce overfitting and combat memorization. Techniques from previous work (Soltau et al., 2022), such as slot value randomization, enhance generalization and discourage reliance on lexical lookup (a sketch appears at the end of this section).
- Inclusion of raw audio and multi-layer annotations enables both cascaded ASR→NLP and joint (end-to-end) modeling. Releasing additional artifacts such as gold ASR transcripts, word time stamps, and latent speech representations facilitates experimentation with hybrid architectures.
- Connector-based alignment of modalities, in which a small transformer bridges speech encoder outputs with LLM inputs, has proven effective for robust DST, particularly when pre-trained on large-scale ASR objectives and fine-tuned for DST with the negative log-likelihood loss

  $$\mathcal{L}_{\mathrm{NLL}} = -\sum_{t=1}^{|y|} \log p_\theta\left(y_t \mid y_{<t}, \mathbf{e}, h\right),$$

  where $y_t$ are the output tokens, $\mathbf{e}$ denotes the speech-conditioned embeddings produced by the connector, and $h$ the dialogue history (see the connector sketch after this list).
- Cross-dataset augmentation: Augmenting with synthetic spoken corpora (e.g., Speech-Aware MultiWOZ, built via TTS from MultiWOZ 2.1) improves coverage and robustness, despite minor schema divergences.
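A minimal PyTorch sketch of such a connector; the class name `Connector` and all layer sizes are illustrative assumptions, not the published configuration:

```python
import torch
import torch.nn as nn

class Connector(nn.Module):
    """Small transformer that maps speech-encoder frames into the LLM's
    embedding space (illustrative sizes, not the published configuration)."""

    def __init__(self, speech_dim=768, llm_dim=2048, n_layers=2, n_heads=8):
        super().__init__()
        # Project speech features (e.g., WavLM outputs) to the LLM width.
        self.proj_in = nn.Linear(speech_dim, llm_dim)
        layer = nn.TransformerEncoderLayer(
            d_model=llm_dim, nhead=n_heads, batch_first=True
        )
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)

    def forward(self, speech_frames):      # (batch, frames, speech_dim)
        x = self.proj_in(speech_frames)    # (batch, frames, llm_dim)
        return self.encoder(x)             # speech-conditioned embeddings e

# Training-step sketch: prepend connector outputs e to the embedded dialogue
# history and minimize NLL of the serialized dialogue state y, e.g.:
#   logits = llm(inputs_embeds=torch.cat([e, history_embeds], dim=1)).logits
#   loss = nn.functional.cross_entropy(logits.view(-1, vocab_size), y.view(-1))
```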
In sum, the findings emphasize that speech-text dual modality with end-to-end neural architectures considerably narrows—but does not eliminate—the gap imposed by spoken conversation phenomena and ASR noise.
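As an illustration of slot value randomization, here is a minimal sketch; `randomize_slot_values` and the toy ontology are hypothetical, and the recipe of Soltau et al. (2022) may differ in detail (for instance, real spoken data requires aligning values via timestamps rather than verbatim string replacement):

```python
import random

def randomize_slot_values(utterance, state, ontology, rng=random):
    """Swap each annotated slot value for a random same-slot alternative,
    updating the utterance and state consistently. Assumes values appear
    verbatim in the text (a simplification for written transcripts)."""
    new_state = {}
    for slot, value in state.items():
        candidate = rng.choice(ontology[slot])
        utterance = utterance.replace(value, candidate)
        new_state[slot] = candidate
    return utterance, new_state

text = "book a table at nandos for two"
state = {"restaurant-name": "nandos"}
ontology = {"restaurant-name": ["nandos", "pizza hut", "sala thong"]}
print(randomize_slot_values(text, state, ontology))
```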
5. Broader Impact and Public Resources
SpokenWOZ provides a public, scalable environment for realistic spoken dialogue agent evaluation. The dataset, codebase, and live leaderboard are released at https://spokenwoz.github.io/, enabling:
- Transparent benchmarking and method comparison for the community.
- Reproducible research and direct measurement of progress in conversational speech understanding outside the written-domain paradigm.
- A foundation for future work on highly multi-modal, robust, and contextually aware dialogue management systems, including advances in cross-turn and reasoning slot modeling, joint speech-LLM modeling, and practical deployment resilience.
A plausible implication is that the dual-modal, naturalistic corpus design adopted by SpokenWOZ now constitutes the de facto gold standard for evaluation in speech-aware TOD and will inspire similar benchmarks in other languages and domains.
6. Relation to Prior and Contemporary Work
SpokenWOZ’s design responds directly to gaps identified in earlier spoken dialogue datasets and technology challenges. Specifically, the DSTC11 corpus (Soltau et al., 2022) introduced parallel speech versions of MultiWOZ (TTS, human-verbatim, human-paraphrased) and advocated for diversified audio-annotation pairings, layered ASR outputs, and evaluation set redesign to challenge memorization.
SpokenWOZ extends these insights by:
- Employing real, non-paraphrased, in-the-wild telephone conversation data, enhancing ecological validity.
- Introducing new slot annotation schemes to highlight reasoning and cross-turn inference as recurring spoken language challenges.
- Propelling method innovation, as seen in subsequent studies on speech encoder–LLM alignment (e.g., WavLM+connector+OLMo/Gemma; Sedláček et al., 10 Jun 2025) that have rapidly advanced state-of-the-art JGA on this benchmark and suggested a template for future end-to-end speech-NLP system design.
The dataset thus embodies progressive advances in spoken dialogue system evaluation and provides empirical grounding for the analysis and development of increasingly speech-robust task-oriented agents.