MultiWOZ 2.2 Dialogue Dataset
- MultiWOZ 2.2 is a multi-domain, task-oriented dialogue dataset featuring over 10,000 dialogues with extensive annotation corrections and a redefined slot ontology.
- It employs automated detection, manual inspection, and crowd-sourcing to standardize slot-span labels and correct errors like hallucinations and inconsistencies.
- Benchmark evaluations of models such as TRADE and DS-DST quantify how the corrected annotations change dialogue state tracking accuracy and expose entity bias.
MultiWOZ 2.2 is a large-scale, multi-domain, task-oriented dialogue dataset that builds upon previous iterations of the MultiWOZ resource by introducing extensive annotation corrections, a redefined slot ontology, and standardized slot-span labeling. The dataset comprises over 10,000 human-human dialogues across eight domains, supporting research in dialogue state tracking (DST), natural language generation (NLG), and end-to-end dialogue modeling. MultiWOZ 2.2 addresses annotation noise and consistency issues identified in prior versions and is widely used as a benchmark for training and evaluating conversational AI models (Zang et al., 2020; Qian et al., 2021).
1. Structure and Scope
MultiWOZ 2.2 consists of 10,438 dialogues (matching MultiWOZ 2.0/2.1), containing around 115,000 utterances distributed across the following eight domains: Restaurant, Hotel, Attraction, Taxi, Train, Bus, Hospital, and Police. It preserves the domain coverage of earlier releases while introducing several enhancements:
- Corrected dialogue-state annotations in 17.3% of user turns, affecting 28.2% of dialogues.
- A redefined schema, partitioning slots into categorical and non-categorical types.
- Standardized slot span annotations for non-categorical slots, improving evaluation consistency for DST models.
The dataset’s JSON structure encodes dialogue goals, a sequence of user/system utterances, turn-level state triples, slot-span annotations, and dialog-act tags (Zang et al., 2020; Qian et al., 2021).
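A minimal sketch of this per-turn structure, assuming the schema-guided layout of the released files (field names abbreviated and the example dialogue invented for illustration):

```python
import json

# Hypothetical, abbreviated dialogue record in the MultiWOZ 2.2 style:
# each turn carries per-service frames with a dialogue state and
# (for non-categorical slots) span annotations.
dialogue = {
    "dialogue_id": "PMUL0001.json",
    "services": ["restaurant"],
    "turns": [
        {
            "turn_id": "0",
            "speaker": "USER",
            "utterance": "I want a cheap restaurant in the centre.",
            "frames": [
                {
                    "service": "restaurant",
                    "state": {
                        "active_intent": "find_restaurant",
                        "slot_values": {
                            "restaurant-pricerange": ["cheap"],
                            "restaurant-area": ["centre"],
                        },
                    },
                    # Span annotations for non-categorical slots would
                    # record character offsets into the utterance here.
                    "slots": [],
                }
            ],
        }
    ],
}

# Round-trip through JSON, as when reading the released files.
record = json.loads(json.dumps(dialogue))
state = record["turns"][0]["frames"][0]["state"]["slot_values"]
print(state["restaurant-area"])  # ['centre']
```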
2. Annotation Corrections and Error Typology
MultiWOZ 2.2 employs a mixture of automated detection, manual inspection, and crowd-sourcing to correct a range of annotation errors:
- Hallucinated Values: Detected via automated scans; values not grounded in prior dialog context were identified and flagged.
- Inconsistent Slot Tracking: Inconsistencies detected across dialogues, such as the same semantic slot being filled differently in similar interaction contexts.
- Error Types:
  - Early-Markup: Premature recording of slot values based on system suggestions before user confirmation.
  - Database-Injection: Slot values copied from backend APIs into the state without explicit user/system mention.
  - Typographical Issues: E.g., booktime mismatches (“15:00” annotated as “5:00”).
  - Implicit Calculations: Unjustified deductions or transformations (e.g., auto-deriving arrival time from departure time and trip duration).
Corrections were iteratively validated to ensure that all slot values appear verbatim, or as normalized paraphrases, in the dialogue. Across the train split, up to 74.2% of dialogues were affected by these corrections, with particularly high rates for “name” and “type” slots in the Attraction, Hotel, and Restaurant domains (Zang et al., 2020; Qian et al., 2021).
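The grounding check behind the hallucination scan can be sketched as a token-bounded search over the dialogue context; this is a simplified stand-in for the paper's pipeline, which additionally handles typographical variants and paraphrases:

```python
import re

def normalize(text: str) -> str:
    """Lowercase; keep letters, digits, and colons so times stay intact."""
    return re.sub(r"[^a-z0-9: ]", " ", text.lower())

def flag_ungrounded(context_utterances, slot_values):
    """Return slot values with no token-bounded match anywhere in the
    dialogue context -- candidates for hallucination errors."""
    context = normalize(" ".join(context_utterances))
    flagged = []
    for value in slot_values:
        pattern = r"\b" + re.escape(normalize(value).strip()) + r"\b"
        if not re.search(pattern, context):
            flagged.append(value)
    return flagged

utterances = [
    "I need a train to Cambridge leaving after 15:00.",
    "TR1234 departs at 15:17, shall I book it?",
]
# "5:00" never appears as a standalone token -> flagged as ungrounded;
# the word-boundary check stops "5:00" from matching inside "15:00".
print(flag_ungrounded(utterances, ["cambridge", "15:00", "5:00"]))  # ['5:00']
```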
3. Ontology Redefinition and Slot Schema
A central innovation in MultiWOZ 2.2 is the introduction of an explicit schema, designed to resolve challenges associated with open vocabularies and annotation drift in prior versions:
- Categorical Slots: Finite, pre-defined sets of values (e.g., pricerange, stars, area).
- Non-categorical Slots: Open class; values are extracted directly from language in the dialog (e.g., restaurant name, booktime).
Schema Table
| Domain | Categorical Slots | Non-categorical Slots |
|---|---|---|
| Restaurant | pricerange, area, bookday, bookpeople | food, name, booktime |
| Attraction | area, type | name |
| Hotel | pricerange, parking, internet, stars, area, type, bookpeople, bookday, bookstay | name |
| Taxi | – | destination, departure, arriveby, leaveat |
| Train | destination, departure, day, bookpeople | arriveby, leaveat |
| Bus | day | departure, destination, leaveat |
| Hospital | – | department |
| Police | – | name |
For non-categorical slots, each annotated value is grounded to a token span within the relevant utterance, with cross-turn copying chains explicitly recorded (Zang et al., 2020).
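The schema above lends itself to a simple lookup table. The snippet below encodes two domains for illustration; the remaining domains follow the same pattern:

```python
# Slot schema from the table above, encoded as a lookup table
# (two domains shown; the rest follow the same pattern).
SCHEMA = {
    "restaurant": {
        "categorical": {"pricerange", "area", "bookday", "bookpeople"},
        "non_categorical": {"food", "name", "booktime"},
    },
    "train": {
        "categorical": {"destination", "departure", "day", "bookpeople"},
        "non_categorical": {"arriveby", "leaveat"},
    },
}

def is_categorical(domain: str, slot: str) -> bool:
    """True if the slot takes values from a closed, pre-defined set."""
    return slot in SCHEMA[domain]["categorical"]

print(is_categorical("train", "destination"))   # True
print(is_categorical("restaurant", "booktime"))  # False
```

Categorical slots can then be predicted by classification over the closed set, while non-categorical slots require span extraction, which is exactly the modeling split the schema is meant to enable.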
4. Slot Span Annotation and Standardization
Slot span annotations in MultiWOZ 2.2 are generated by running a normalized string-matching pipeline over the dialog context, handling typographical variation and paraphrase. If multiple matches are possible, the most recent mention is used. These span annotations remove the need for custom string-matching heuristics and serve as gold-standard targets for all DST models, allowing robust and consistent evaluation.
Compared with earlier practices—where models implemented disparate string-matching approaches—the provision of gold spans standardizes evaluation and model development pipelines (Zang et al., 2020).
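The most-recent-mention rule can be sketched as a backwards scan over the dialogue; a real pipeline would also normalize typographical variants before matching:

```python
def find_span(utterances, value):
    """Locate the most recent mention of `value` in the dialogue,
    returning (turn_index, start, exclusive_end) character offsets,
    or None if the value never appears. Simplified: exact lowercase
    matching only, no typo or paraphrase handling."""
    target = value.lower()
    for turn_idx in range(len(utterances) - 1, -1, -1):  # newest turn first
        text = utterances[turn_idx].lower()
        start = text.rfind(target)  # last mention within the turn
        if start != -1:
            return turn_idx, start, start + len(target)
    return None

turns = [
    "Any Italian place in the centre?",
    "Pizza Hut City Centre is in the centre of town.",
]
# Two mentions of "centre" in turn 1; the later one wins.
print(find_span(turns, "centre"))  # (1, 32, 38)
```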
5. Annotation Consistency and Entity Bias
Despite 2.2's improvements, subsequent analysis revealed pervasive annotation inconsistencies, especially for “name” and “type” slots, and significant slot value distribution skew (“entity bias”):
- Annotation Inconsistency: Approximately 66–74% of dialogues required further slot-type normalization, particularly for slots appearing in system-provided utterances. Corrections systematically added missing slot-value annotations, unified type tags, and normalized entity variants.
- Entity Bias: Certain slot-values (e.g., “cambridge” for train-destination) dominate the data, as quantified by normalized Shannon entropy and min-entropy metrics. For example, “cambridge” accounts for ~50% of destination values despite 13 possible cities.
This bias encourages generative models to memorize and over-predict high-frequency entities, sometimes hallucinating them even without evidence in the dialog—a phenomenon confirmed in DST model evaluations (Qian et al., 2021).
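Both bias metrics can be computed directly from a slot's value counts. The distribution below is a toy example mimicking the ~50% “cambridge” skew:

```python
import math
from collections import Counter

def normalized_entropies(values):
    """Normalized Shannon entropy and min-entropy of a slot-value
    distribution. Both equal 1.0 for a uniform distribution and fall
    toward 0 as one value dominates. Assumes at least two distinct values."""
    counts = Counter(values)
    n = len(values)
    k = len(counts)  # number of distinct values
    probs = [c / n for c in counts.values()]
    shannon = -sum(p * math.log2(p) for p in probs) / math.log2(k)
    min_ent = -math.log2(max(probs)) / math.log2(k)
    return shannon, min_ent

# Skewed toy distribution: one destination holds half the mass.
sample = ["cambridge"] * 50 + ["ely"] * 25 + ["london"] * 25
shannon, min_ent = normalized_entropies(sample)
print(round(shannon, 3), round(min_ent, 3))  # 0.946 0.631
```

Note that min-entropy is the more sensitive of the two: it depends only on the single most frequent value, so it drops quickly when one entity dominates even while Shannon entropy stays near 1.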
6. Benchmarking and Robustness Evaluation
MultiWOZ 2.2 provides public benchmarks for multiple DST models, including TRADE, SGD-baseline, and DS-DST. The principal performance metric is Joint Goal Accuracy (JGA), the fraction of turns whose predicted dialogue state exactly matches the gold state:

$$\mathrm{JGA} = \frac{1}{T} \sum_{t=1}^{T} \mathbb{1}\!\left[\hat{B}_t = B_t\right]$$

where $B_t$ is the gold dialogue state at turn $t$, $\hat{B}_t$ the predicted state, and $T$ the total number of turns.
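JGA is straightforward to compute as a per-turn exact match between predicted and gold states; a minimal sketch:

```python
def joint_goal_accuracy(predicted_states, gold_states):
    """Fraction of turns whose predicted dialogue state matches the gold
    state exactly (every slot, every domain). States are dicts mapping
    slot names to values; dict equality gives the exact-match test."""
    assert len(predicted_states) == len(gold_states)
    correct = sum(
        int(pred == gold)
        for pred, gold in zip(predicted_states, gold_states)
    )
    return correct / len(gold_states)

gold = [
    {"restaurant-area": "centre"},
    {"restaurant-area": "centre", "restaurant-food": "italian"},
]
pred = [
    {"restaurant-area": "centre"},
    {"restaurant-area": "centre", "restaurant-food": "chinese"},  # one slot wrong
]
# Turn 2 fails entirely because a single slot is wrong.
print(joint_goal_accuracy(pred, gold))  # 0.5
```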
Key Benchmark Results
| Model | MultiWOZ 2.0 | MultiWOZ 2.1 | MultiWOZ 2.2 |
|---|---|---|---|
| TRADE | 0.486 | 0.460 | 0.454 |
| SGD-baseline | – | 0.434 | 0.420 |
| DS-DST | 0.522 | 0.512 | 0.517 |
Disaggregating joint accuracy by slot type, computed separately over categorical and non-categorical slots, shows consistent patterns across models.
Impact of Further Corrections and Entity Replacement
Recent work introduced an automated correction pipeline (“2.2+”), yielding a 7–10 percentage point increase in JGA for state-of-the-art models, together with a new “Test_Unseen” split featuring entities absent from training. DST-BART, for example, drops from 67.4% JGA on 2.2+ to 27.0% on Test_Unseen, direct evidence of entity memorization. After these corrections, error reductions are largest in heavily re-annotated slots (Qian et al., 2021).
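The idea behind such an entity-replacement split can be sketched as a joint substitution over utterances and states, so that surface form and annotation stay consistent (the mapping below is invented for illustration; the released split uses its own substitution lists):

```python
def replace_entities(utterance, state, mapping):
    """Swap seen entity strings for unseen ones in both the utterance
    and the dialogue state, keeping the two consistent -- the core idea
    of a Test_Unseen-style evaluation split."""
    for seen, unseen in mapping.items():
        utterance = utterance.replace(seen, unseen)
    new_state = {
        slot: mapping.get(value, value) for slot, value in state.items()
    }
    return utterance, new_state

# Hypothetical substitution: a frequent training entity -> an unseen one.
mapping = {"pizza hut city centre": "luigi's trattoria"}
utt, state = replace_entities(
    "i'd like to book pizza hut city centre for two.",
    {"restaurant-name": "pizza hut city centre"},
    mapping,
)
print(utt)    # i'd like to book luigi's trattoria for two.
print(state)  # {'restaurant-name': "luigi's trattoria"}
```

A model that has memorized frequent training entities will keep predicting the old name, failing the exact-match JGA test on the substituted data.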
7. Data Release, Usage, and Best Practices
MultiWOZ 2.2 and its corrected variants (“2.2+”) are distributed under an Apache-2.0 license. The standard release includes ontology.json (slot schema), schema.txt (slot descriptions), database files, the dialogue splits, and the Test_Unseen split. Dialogue states, slot spans, and dialog acts are fully annotated in JSON, making the data directly usable for DST, NLG, and end-to-end dialogue models.
Recommended best practices include defining the schema prior to annotation; enforcing controlled value selection (dropdown menus for categorical slots, span marking for non-categorical slots); automated validation flagging non-ontology values or missing provenance; and crowd-sourced correction pipelines. Such protocols are essential to mitigate annotation drift, slot-value hallucination, and overfitting to frequent entities (Zang et al., 2020; Qian et al., 2021).
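The automated-validation step can be sketched as an ontology membership check for categorical slots (the ontology entries below are illustrative and abbreviated; non-categorical slots are skipped since they have no closed value set):

```python
# Closed value sets for categorical slots (abbreviated, illustrative).
ONTOLOGY = {
    "restaurant-pricerange": {"cheap", "moderate", "expensive"},
    "hotel-stars": {"0", "1", "2", "3", "4", "5"},
}

def validate_state(state):
    """Flag categorical slot values outside the ontology -- the kind of
    automated check recommended before accepting an annotation. Slots
    with no closed set (non-categorical) are passed through unchecked."""
    errors = []
    for slot, value in state.items():
        allowed = ONTOLOGY.get(slot)
        if allowed is not None and value not in allowed:
            errors.append((slot, value))
    return errors

# "cheapest" is not a legal pricerange value -> flagged for correction.
print(validate_state({"restaurant-pricerange": "cheapest", "hotel-stars": "4"}))
# [('restaurant-pricerange', 'cheapest')]
```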