
MultiWOZ: Multi-Domain Dialogue Dataset

Updated 4 December 2025
  • MultiWOZ is a comprehensive, crowd-sourced dataset featuring fully labeled multi-domain dialogues that support research in dialog state tracking, natural language understanding, and dialog policy learning.
  • It was collected using a Wizard-of-Oz paradigm and refined through multiple revisions, ensuring precise annotations of dialogue states, acts, and slot-value pairs across diverse domains.
  • The dataset underpins extensive benchmarking in conversational AI, with evaluations based on metrics like joint goal accuracy and slot-level precision to drive methodological innovations.

The Multi-Domain Wizard-of-Oz Dataset (MultiWOZ) is a large-scale, crowd-sourced corpus of fully labeled human-human written dialogues for task-oriented conversational modeling across multiple domains. Since its introduction, MultiWOZ and its subsequent revisions have become the de facto benchmark for research in dialog state tracking, natural language understanding, dialog policy learning, and end-to-end modeling in goal-oriented dialog systems. The dataset’s size, multi-domain scope, and fine-grained semantic annotation (dialogue states, dialogue acts, slot-value pairs) have made it central to the empirical evaluation and methodological innovation in conversational AI.

1. Dataset Structure, Domains, and Statistics

MultiWOZ 1.0 comprises 10,438 human-human dialogues with a total of 115,434 turns and approximately 1.49 million tokens, making it at least an order of magnitude larger than previous task-oriented corpora (Budzianowski et al., 2018). Dialogues span up to five simultaneous domains per conversation, selected from Restaurant, Hotel, Attraction, Taxi, Train, Hospital, and Police. The average number of turns is 8.93 for single-domain and 15.39 for multi-domain dialogues. The dataset features 24 informable/requestable slots and over 4,500 canonical slot values.

Subsequent revisions extend domain coverage (e.g., adding Bus, increasing slots to 30–37), refine annotation (see below), and formalize data splits (8,438 train, 1,000 validation, 1,000 test). File format is JSON, comprising per-dialogue logs, turn-level utterances, belief states, system/user dialog acts, and, from 2.1 onward, slot descriptions and user-side dialog acts (Eric et al., 2019). MultiWOZ 2.2 and later impose stricter slot ontology division—categorical (enumerable) versus non-categorical (free-form, span-annotated)—with explicit schema-driven annotation (Zang et al., 2020).
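The per-dialogue JSON layout described above can be sketched as follows. The tiny inline dialogue and its field names (`log`, `text`, `metadata`) follow the commonly documented 2.x structure, but they are illustrative placeholders rather than an exact excerpt from the corpus:

```python
import json

# Minimal synthetic dialogue in a MultiWOZ-style layout: a dict keyed by
# dialogue ID, each holding a "log" of alternating user/system turns.
# Field names are an assumption for illustration, not verbatim data.
raw = json.dumps({
    "PMUL0001.json": {
        "log": [
            {"text": "I need a cheap restaurant in the centre.",
             "metadata": {}},
            {"text": "There are several cheap restaurants in the centre.",
             "metadata": {"restaurant": {"semi": {"pricerange": "cheap",
                                                  "area": "centre"}}}},
        ]
    }
})

dialogues = json.loads(raw)
for dial_id, dial in dialogues.items():
    # Even-indexed turns are user utterances; odd-indexed are system turns,
    # whose "metadata" carries the belief state after that exchange.
    for i, turn in enumerate(dial["log"]):
        speaker = "USER" if i % 2 == 0 else "SYS "
        print(speaker, turn["text"])
```

Iterating turn pairs this way is the usual starting point for extracting (utterance, belief state) training examples for DST.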

2. Data Collection and Annotation Methodology

MultiWOZ was collected in a Wizard-of-Oz paradigm on Amazon Mechanical Turk. Each dialogue involved two crowd workers: (1) the user, who followed an incrementally revealed composite goal defined over one or more domains, and (2) the wizard (agent), who used a GUI form to enter slot-value constraints, implicitly defining the dialog belief state in real time (Budzianowski et al., 2018).

The annotation pipeline included:

  • Explicit turn-level segmentation.
  • Belief state labeling: at each user turn $t$, the belief state $B^t$ is a set of slot–value constraints $(slot_i, op_i, value_i)$.
  • Dialogue act labeling: intents such as INFORM, REQUEST, OFFER_BOOK, and their slot–value arguments for both system and (in later releases) user turns.
  • Quality control: phased qualification tasks for annotators; manual post-annotation review.
  • Subsequent corrections using crowdsourcing (MultiWOZ 2.1), expert annotation (MultiWOZ 2.4 validation/test), and automated regular-expression or schema-based corrections (Zang et al., 2020, Ye et al., 2021).

From MultiWOZ 2.2 onwards, non-categorical slot values are span-annotated: for an utterance tokenized as $(u_1, \ldots, u_T)$, a slot value’s span $s = (i, j)$ with $1 \leq i \leq j \leq T$ is recorded, permitting direct supervision for extraction models (Zang et al., 2020).
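Recovering a non-categorical value from a recorded span can be sketched as below, assuming 1-indexed, inclusive span endpoints as in the definition above:

```python
# Given a tokenized utterance and a span s = (i, j) (1-indexed, inclusive),
# recover the annotated slot value as the corresponding token substring.
def value_from_span(tokens, span):
    i, j = span
    return " ".join(tokens[i - 1:j])  # convert to a 0-indexed Python slice

tokens = "i want a hotel in the north of town".split()
print(value_from_span(tokens, (7, 7)))  # -> "north"
```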

3. Annotation Corrections, Schema Evolution, and Co-reference

Early versions exhibited substantial annotation noise—delayed or missing slot values, inconsistent slot updates, misannotations, typos, and lack of normalization. Correction efforts are as follows:

  • MultiWOZ 2.1: Over 32% of slot value annotations and 40% of turns modified via fresh crowdworker annotation and automated canonicalization (Eric et al., 2019).
  • MultiWOZ 2.2: Corrections to 17.3% of user turns, especially hallucinated, misaligned, or inconsistent state updates, span annotation for non-categorical slots (Zang et al., 2020).
  • MultiWOZ 2.3: Systematic unification of belief state and dialog act annotations, explicit annotation of co-reference (e.g., "the same area") for slot-value copying; slot-level dialog act correctness exceeding 84% (strict) and 90% (relaxed) (Han et al., 2020).
  • MultiWOZ 2.4: Correction restricted to validation/test splits; 41.17% of turns and 65.78% of dialogues in val/test contain at least one correction; thorough normalization and de-duplication of slot values (Ye et al., 2021).

Co-reference annotation in MultiWOZ 2.3 is applied where slot values are filled by anaphoric or elliptical expressions, mapping current slot value, referred value, referred turn, and referred span. About 20% of dialogues contain at least one coreferential slot (Han et al., 2020).
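The co-reference record can be pictured as a small mapping like the following. The field names are hypothetical, chosen to mirror the four elements listed above (current value, referred value, referred turn, referred span) rather than the exact 2.3 schema:

```python
# Hypothetical co-reference record; keys are illustrative, not the
# literal MultiWOZ 2.3 annotation schema.
coref = {
    "slot": "attraction-area",
    "value": "the same area",    # anaphoric surface form in the current turn
    "referred_value": "centre",  # resolved value
    "referred_turn": 2,          # turn where "centre" was originally stated
    "referred_span": (8, 8),     # token span of the referred value
}

def resolve(record):
    """Replace the anaphoric surface form with the value it refers to."""
    return {record["slot"]: record["referred_value"]}

print(resolve(coref))  # -> {'attraction-area': 'centre'}
```

A tracker supervised with such records can learn to copy slot values across turns instead of re-extracting them from the anaphoric expression.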

4. Dataset Variants and Representation Extensions

MultiWOZ’s flat slot-value and dialog act formalism has been converted into richer representation frameworks for advanced semantic tasks:

  • MultiWOZ-DF (Meron et al., 2022): Transformation into an executable dataflow (DF) formalism, where each turn is mapped onto a computational graph via function calls (e.g., find(domain, params), revise_domain(params)). Graph execution updates state and enables compositional semantic parsing. Turn-level translation accuracy for the most learnable (simplified) variant is 78.3%; state match to the original labels is 87.8% at turn level.
  • MultiWOZ 3.0 / ThingTalk (Campagna et al., 2020): Re-annotation into a precise, canonical, and executable formal language with support for logical operators, sorting, cross-domain transfer, and API calls. Coverage is 98% of test turns; a contextual semantic parser achieves 79% turn-level exact match and 44.1% dialogue-level accuracy.

These extensions enable system development in compositional semantic parsing, executable dialogue agents, and formal evaluations of full conversation trajectories.

5. Benchmarks, Baselines, and Key Evaluation Metrics

Dialogue state tracking (DST) benchmarks remain central in MultiWOZ evaluation. Metrics include:

  • Joint Goal Accuracy (JGA):

$$\mathrm{JGA} = \frac{1}{N}\sum_{i=1}^{N}\mathbf{1}(\hat{g}_i = g_i)$$

with $N$ turns and $g_i$ the gold state at turn $i$.

  • Slot Accuracy/Slot-level F1:

$$\mathrm{SlotAcc} = \frac{1}{NS}\sum_{i=1}^{N}\sum_{s=1}^{S}\mathbf{1}(v_{i,s}^{pred} = v_{i,s}^{gold})$$
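A minimal sketch of both metrics, computed over toy predicted and gold states (the slot names and values are invented for illustration): JGA credits a turn only when the entire predicted state matches the gold state, while slot accuracy averages per-slot matches over all turns and slots.

```python
# Joint Goal Accuracy: fraction of turns whose full predicted state
# exactly equals the gold state.
def joint_goal_accuracy(pred_states, gold_states):
    hits = sum(p == g for p, g in zip(pred_states, gold_states))
    return hits / len(gold_states)

# Slot accuracy: fraction of (turn, slot) pairs predicted correctly.
def slot_accuracy(pred_states, gold_states, slots):
    hits = sum(p.get(s) == g.get(s)
               for p, g in zip(pred_states, gold_states) for s in slots)
    return hits / (len(gold_states) * len(slots))

slots = ["hotel-area", "hotel-pricerange"]
gold = [{"hotel-area": "north", "hotel-pricerange": "cheap"},
        {"hotel-area": "north", "hotel-pricerange": "moderate"}]
pred = [{"hotel-area": "north", "hotel-pricerange": "cheap"},
        {"hotel-area": "south", "hotel-pricerange": "moderate"}]

print(joint_goal_accuracy(pred, gold))   # 0.5: second turn misses one slot
print(slot_accuracy(pred, gold, slots))  # 0.75: 3 of 4 slot values correct
```

The gap between the two numbers illustrates why JGA is the stricter, headline metric: a single wrong slot zeroes out an otherwise correct turn.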

Canonical baseline models include semantic-similarity trackers, pointer-generator models (TRADE), BERT-based DSTs (SUMBT, DS-DST), schema-guided DST, and hybrid span/slot classifiers (Eric et al., 2019, Zang et al., 2020).

Sample JGA test-set values (MultiWOZ 2.1/2.2/2.3):

Model   v2.1   v2.2   v2.3
TRADE   45.6   46.6   49.2
SUMBT   49.2   49.7   52.9
STAR    -      -      73.6 (reported on v2.4)

Evaluation on MultiWOZ 2.4 (clean val/test) reveals substantial JGA boosts (up to +17 points) compared to 2.1, confirming that prior metrics underestimated attainable accuracy due to annotation noise (Ye et al., 2021).

6. Shortcomings, Entity Bias, and Data Quality Considerations

MultiWOZ exhibits structural biases and persistent annotation challenges:

  • Annotation Inconsistency: An estimated 70% of dialogues in the original data have at least one state annotation inconsistency across contexts. Automated and manual corrections yield 7–10 percentage point JGA gains (Qian et al., 2021).
  • Entity Bias: Head–tail skew is evident. For example, “cambridge” accounts for 50% of the train-destination slot’s values. Normalized entropy metrics ($H_1/H_0 \approx 0.75$) quantify this bias. As a result, generative DSTs may memorize frequent entities, severely degrading generalization: performance drops by 29 percentage points (JGA) when test entities are replaced with out-of-training-set values (Qian et al., 2021).
  • Slot Ontology Ambiguity: Merging of semantically distinct slots or overlapping slot values can confound both span-based and generation-based trackers. This led to explicit slot definitions, “dontcare” and “unknown” special values, and ontology simplification in 2.2+ (Zang et al., 2020).
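The normalized-entropy bias measure above can be sketched as follows, assuming $H_1$ is the Shannon entropy of the observed slot-value distribution and $H_0 = \log V$ its maximum over $V$ distinct values; the toy counts are invented to mimic the head–tail skew described, not drawn from the corpus:

```python
import math
from collections import Counter

# H1/H0 near 1 means a balanced value distribution; near 0 means one or a
# few head values dominate the slot.
def normalized_entropy(values):
    counts = Counter(values)
    if len(counts) < 2:
        return 0.0  # a single value carries no distributional information
    n = len(values)
    h1 = -sum((c / n) * math.log(c / n) for c in counts.values())
    h0 = math.log(len(counts))
    return h1 / h0

# Toy skew: one destination dominates, as "cambridge" does in MultiWOZ.
vals = ["cambridge"] * 50 + ["london"] * 25 + ["ely"] * 15 + ["norwich"] * 10
print(round(normalized_entropy(vals), 3))
```

Computing this per slot over the training split makes the head–tail skew quantifiable, and motivates hidden-entity evaluation splits as a generalization check.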

Best practices recommended for future schema design and dataset curation include pre-defining slot ontology, enforcing annotation via schema tools, mandatory span annotation for non-categorical slots, multi-annotator redundancy, and release of hidden-entity evaluation splits (Qian et al., 2021, Zang et al., 2020).

7. Practical Impact and Reference Implementations

MultiWOZ underpins multiple open-source toolkits, most notably ConvLab, which integrates fully annotated MultiWOZ (user dialog-acts included), reference modular NLU/NLG/DST/policy models (including supervised, RL, and hand-crafted policies), and user simulators (Lee et al., 2019). ConvLab demonstrates extensibility and reproducibility benefits, with end-to-end task success rates up to 69% for rule-based DST plus template NLG (DST accuracy 90.2%).

The standardization, scale, and ongoing correction efforts have established MultiWOZ as the principal benchmark for advancing task-oriented dialogue modeling, robust DST under noisy label conditions, schema-driven architectures, compositional graph-based semantic parsing, and the empirical study of data bias and model generalization.


(Budzianowski et al., 2018, Eric et al., 2019, Lee et al., 2019, Zang et al., 2020, Han et al., 2020, Qian et al., 2021, Ye et al., 2021, Meron et al., 2022, Campagna et al., 2020)
