MultiWOZ 2.2 Dialogue Dataset
- MultiWOZ 2.2 is a multi-domain, task-oriented dialogue dataset featuring over 10,000 dialogues with extensive annotation corrections and a redefined slot ontology.
- It employs automated detection, manual inspection, and crowd-sourcing to standardize slot-span labels and correct errors like hallucinations and inconsistencies.
- Benchmark evaluations of models such as TRADE and DS-DST quantify how the corrected annotations change dialogue state tracking accuracy and expose entity bias.
MultiWOZ 2.2 is a large-scale, multi-domain, task-oriented dialogue dataset that builds upon previous iterations of the MultiWOZ resource by introducing extensive annotation corrections, a redefined slot ontology, and standardized slot-span labeling. The dataset comprises over 10,000 human-human dialogues across eight domains, supporting research in dialogue state tracking (DST), natural language generation (NLG), and end-to-end dialogue modeling. MultiWOZ 2.2 addresses annotation noise and consistency issues identified in prior versions and is widely used as a benchmark for training and evaluating conversational AI models (Zang et al., 2020; Qian et al., 2021).
1. Structure and Scope
MultiWOZ 2.2 consists of 10,438 dialogues (matching MultiWOZ 2.0/2.1), containing around 115,000 utterances distributed across the following eight domains: Restaurant, Hotel, Attraction, Taxi, Train, Bus, Hospital, and Police. It preserves the domain coverage of earlier releases while introducing several enhancements:
- Corrected dialogue-state annotations in 17.3% of user turns, affecting 28.2% of dialogues.
- A redefined schema, partitioning slots into categorical and non-categorical types.
- Standardized slot span annotations for non-categorical slots, improving evaluation consistency for DST models.
The dataset’s JSON structure encodes dialogue goals, a sequence of user/system utterances, turn-level state triples, slot-span annotations, and dialog-act tags (Zang et al., 2020; Qian et al., 2021).
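A minimal sketch of this per-turn structure, assuming the schema-guided layout of the released files (field names abbreviated and the example dialogue invented for illustration):

```python
import json

# Hypothetical, abbreviated dialogue record in the MultiWOZ 2.2 style:
# each turn carries per-service frames with a dialogue state and
# (for non-categorical slots) span annotations.
dialogue = {
    "dialogue_id": "PMUL0001.json",
    "services": ["restaurant"],
    "turns": [
        {
            "turn_id": "0",
            "speaker": "USER",
            "utterance": "I want a cheap restaurant in the centre.",
            "frames": [
                {
                    "service": "restaurant",
                    "state": {
                        "active_intent": "find_restaurant",
                        "slot_values": {
                            "restaurant-pricerange": ["cheap"],
                            "restaurant-area": ["centre"],
                        },
                    },
                    # Span annotations for non-categorical slots would
                    # record character offsets into the utterance here.
                    "slots": [],
                }
            ],
        }
    ],
}

# Round-trip through JSON, as when reading the released files.
record = json.loads(json.dumps(dialogue))
state = record["turns"][0]["frames"][0]["state"]["slot_values"]
print(state["restaurant-area"])  # ['centre']
```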
2. Annotation Corrections and Error Typology
MultiWOZ 2.2 employs a mixture of automated detection, manual inspection, and crowd-sourcing to correct a range of annotation errors:
- Hallucinated Values: Detected via automated scans; values not grounded in prior dialog context were identified and flagged.
- Inconsistent Slot Tracking: Inconsistencies detected across dialogues, such as the same semantic slot being filled differently in similar interaction contexts.
- Error Types:
  - Early-Markup: Premature recording of slot values based on system suggestions before user confirmation.
  - Database-Injection: Slot values copied from backend APIs into the state without explicit user/system mention.
  - Typographical Issues: E.g., booktime mismatches (“15:00” annotated as “5:00”).
  - Implicit Calculations: Unjustified deductions or transformations (e.g., auto-deriving arrival time from departure time and trip duration).
Corrections were iteratively validated to ensure that all slot values appear verbatim, or as normalized paraphrases, in the dialogue. Across the train split, up to 74.2% of dialogues were affected by these corrections, with particularly high rates for “name” and “type” slots in the Attraction, Hotel, and Restaurant domains (Zang et al., 2020; Qian et al., 2021).
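The grounding check behind the hallucination scan can be sketched as a token-bounded search over the dialogue context; this is a simplified stand-in for the paper's pipeline, which additionally handles typographical variants and paraphrases:

```python
import re

def normalize(text: str) -> str:
    """Lowercase; keep letters, digits, and colons so times stay intact."""
    return re.sub(r"[^a-z0-9: ]", " ", text.lower())

def flag_ungrounded(context_utterances, slot_values):
    """Return slot values with no token-bounded match anywhere in the
    dialogue context -- candidates for hallucination errors."""
    context = normalize(" ".join(context_utterances))
    flagged = []
    for value in slot_values:
        pattern = r"\b" + re.escape(normalize(value).strip()) + r"\b"
        if not re.search(pattern, context):
            flagged.append(value)
    return flagged

utterances = [
    "I need a train to Cambridge leaving after 15:00.",
    "TR1234 departs at 15:17, shall I book it?",
]
# "5:00" never appears as a standalone token -> flagged as ungrounded;
# the word-boundary check stops "5:00" from matching inside "15:00".
print(flag_ungrounded(utterances, ["cambridge", "15:00", "5:00"]))  # ['5:00']
```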
3. Ontology Redefinition and Slot Schema
A central innovation in MultiWOZ 2.2 is the introduction of an explicit schema, designed to resolve challenges associated with open vocabularies and annotation drift in prior versions:
- Categorical Slots: Finite, pre-defined sets of values (e.g., pricerange, stars, area).
- Non-categorical Slots: Open class; values are extracted directly from language in the dialog (e.g., restaurant name, booktime).
Schema Table
| Domain | Categorical Slots | Non-categorical Slots |
|---|---|---|
| Restaurant | pricerange, area, bookday, bookpeople | food, name, booktime |
| Attraction | area, type | name |
| Hotel | pricerange, parking, internet, stars, area, type, bookpeople, bookday, bookstay | name |
| Taxi | – | destination, departure, arriveby, leaveat |
| Train | destination, departure, day, bookpeople | arriveby, leaveat |
| Bus | day | departure, destination, leaveat |
| Hospital | – | department |
| Police | – | name |
For non-categorical slots, each annotated value is grounded to a token span within the relevant utterance, with cross-turn copying chains explicitly recorded (Zang et al., 2020).
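The schema above lends itself to a simple lookup table. The snippet below encodes two domains for illustration; the remaining domains follow the same pattern:

```python
# Slot schema from the table above, encoded as a lookup table
# (two domains shown; the rest follow the same pattern).
SCHEMA = {
    "restaurant": {
        "categorical": {"pricerange", "area", "bookday", "bookpeople"},
        "non_categorical": {"food", "name", "booktime"},
    },
    "train": {
        "categorical": {"destination", "departure", "day", "bookpeople"},
        "non_categorical": {"arriveby", "leaveat"},
    },
}

def is_categorical(domain: str, slot: str) -> bool:
    """True if the slot takes values from a closed, pre-defined set."""
    return slot in SCHEMA[domain]["categorical"]

print(is_categorical("train", "destination"))   # True
print(is_categorical("restaurant", "booktime"))  # False
```

Categorical slots can then be predicted by classification over the closed set, while non-categorical slots require span extraction, which is exactly the modeling split the schema is meant to enable.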
4. Slot Span Annotation and Standardization
Slot span annotations in MultiWOZ 2.2 are generated by running a normalized string-matching pipeline over the dialog context, handling typographical variation and paraphrase. If multiple matches are possible, the most recent mention is used. These span annotations remove the need for custom string-matching heuristics and serve as gold-standard targets for all DST models, allowing robust and consistent evaluation.
Compared with earlier practices—where models implemented disparate string-matching approaches—the provision of gold spans standardizes evaluation and model development pipelines (Zang et al., 2020).
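The most-recent-mention rule can be sketched as a backwards scan over the dialogue; a real pipeline would also normalize typographical variants before matching:

```python
def find_span(utterances, value):
    """Locate the most recent mention of `value` in the dialogue,
    returning (turn_index, start, exclusive_end) character offsets,
    or None if the value never appears. Simplified: exact lowercase
    matching only, no typo or paraphrase handling."""
    target = value.lower()
    for turn_idx in range(len(utterances) - 1, -1, -1):  # newest turn first
        text = utterances[turn_idx].lower()
        start = text.rfind(target)  # last mention within the turn
        if start != -1:
            return turn_idx, start, start + len(target)
    return None

turns = [
    "Any Italian place in the centre?",
    "Pizza Hut City Centre is in the centre of town.",
]
# Two mentions of "centre" in turn 1; the later one wins.
print(find_span(turns, "centre"))  # (1, 32, 38)
```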
5. Annotation Consistency and Entity Bias
Despite 2.2's improvements, subsequent analysis revealed pervasive annotation inconsistencies, especially for “name” and “type” slots, and significant slot value distribution skew (“entity bias”):
- Annotation Inconsistency: Approximately 66–74% of dialogues required further slot-type normalization, particularly for slots appearing in system-provided utterances. Corrections systematically added missing slot-value annotations, unified type tags, and normalized entity variants.
- Entity Bias: Certain slot-values (e.g., “cambridge” for train-destination) dominate the data, as quantified by normalized Shannon entropy and min-entropy metrics. For example, “cambridge” accounts for ~50% of destination values despite 13 possible cities.
This bias encourages generative models to memorize and over-predict high-frequency entities, sometimes hallucinating them even without evidence in the dialog—a phenomenon confirmed in DST model evaluations (Qian et al., 2021).
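Both bias metrics can be computed directly from a slot's value counts. The distribution below is a toy example mimicking the ~50% “cambridge” skew:

```python
import math
from collections import Counter

def normalized_entropies(values):
    """Normalized Shannon entropy and min-entropy of a slot-value
    distribution. Both equal 1.0 for a uniform distribution and fall
    toward 0 as one value dominates. Assumes at least two distinct values."""
    counts = Counter(values)
    n = len(values)
    k = len(counts)  # number of distinct values
    probs = [c / n for c in counts.values()]
    shannon = -sum(p * math.log2(p) for p in probs) / math.log2(k)
    min_ent = -math.log2(max(probs)) / math.log2(k)
    return shannon, min_ent

# Skewed toy distribution: one destination holds half the mass.
sample = ["cambridge"] * 50 + ["ely"] * 25 + ["london"] * 25
shannon, min_ent = normalized_entropies(sample)
print(round(shannon, 3), round(min_ent, 3))  # 0.946 0.631
```

Note that min-entropy is the more sensitive of the two: it depends only on the single most frequent value, so it drops quickly when one entity dominates even while Shannon entropy stays near 1.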
6. Benchmarking and Robustness Evaluation
MultiWOZ 2.2 provides public benchmarks for multiple DST models, including TRADE, SGD-baseline, and DS-DST. The principal performance metric is Joint Goal Accuracy (JGA), the fraction of turns whose predicted dialogue state exactly matches the gold state:

$$\mathrm{JGA} = \frac{1}{T} \sum_{t=1}^{T} \mathbb{1}\!\left[\hat{B}_t = B_t\right]$$

where $B_t$ is the gold dialogue state at turn $t$, $\hat{B}_t$ the predicted state, and $T$ the total number of turns.
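JGA is straightforward to compute as a per-turn exact match between predicted and gold states; a minimal sketch:

```python
def joint_goal_accuracy(predicted_states, gold_states):
    """Fraction of turns whose predicted dialogue state matches the gold
    state exactly (every slot, every domain). States are dicts mapping
    slot names to values; dict equality gives the exact-match test."""
    assert len(predicted_states) == len(gold_states)
    correct = sum(
        int(pred == gold)
        for pred, gold in zip(predicted_states, gold_states)
    )
    return correct / len(gold_states)

gold = [
    {"restaurant-area": "centre"},
    {"restaurant-area": "centre", "restaurant-food": "italian"},
]
pred = [
    {"restaurant-area": "centre"},
    {"restaurant-area": "centre", "restaurant-food": "chinese"},  # one slot wrong
]
# Turn 2 fails entirely because a single slot is wrong.
print(joint_goal_accuracy(pred, gold))  # 0.5
```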
Key Benchmark Results
| Model | MultiWOZ 2.0 | MultiWOZ 2.1 | MultiWOZ 2.2 |
|---|---|---|---|
| TRADE | 0.486 | 0.460 | 0.454 |
| SGD-baseline | – | 0.434 | 0.420 |
| DS-DST | 0.522 | 0.512 | 0.517 |
Disaggregating joint accuracy by slot type, computed separately over categorical and non-categorical slots, shows consistent patterns across models.
Impact of Further Corrections and Entity Replacement
Recent work introduced an automated correction pipeline (“2.2+”), yielding a 7–10 percentage point increase in JGA for state-of-the-art models, together with a new “Test_Unseen” split featuring entities absent from training. DST-BART, for example, drops from 67.4% JGA on 2.2+ to 27.0% on Test_Unseen, direct evidence of entity memorization. After these corrections, error reductions are largest in heavily re-annotated slots (Qian et al., 2021).
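The idea behind such an entity-replacement split can be sketched as a joint substitution over utterances and states, so that surface form and annotation stay consistent (the mapping below is invented for illustration; the released split uses its own substitution lists):

```python
def replace_entities(utterance, state, mapping):
    """Swap seen entity strings for unseen ones in both the utterance
    and the dialogue state, keeping the two consistent -- the core idea
    of a Test_Unseen-style evaluation split."""
    for seen, unseen in mapping.items():
        utterance = utterance.replace(seen, unseen)
    new_state = {
        slot: mapping.get(value, value) for slot, value in state.items()
    }
    return utterance, new_state

# Hypothetical substitution: a frequent training entity -> an unseen one.
mapping = {"pizza hut city centre": "luigi's trattoria"}
utt, state = replace_entities(
    "i'd like to book pizza hut city centre for two.",
    {"restaurant-name": "pizza hut city centre"},
    mapping,
)
print(utt)    # i'd like to book luigi's trattoria for two.
print(state)  # {'restaurant-name': "luigi's trattoria"}
```

A model that has memorized frequent training entities will keep predicting the old name, failing the exact-match JGA test on the substituted data.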
7. Data Release, Usage, and Best Practices
MultiWOZ 2.2 and its corrected variants (“2.2+”) are distributed under an Apache-2.0 license. The standard release includes ontology.json (slot schema), schema.txt (slot descriptions), database files, the dialogue splits, and the Test_Unseen split. Dialogue states, slot spans, and dialog acts are fully annotated in JSON, making the data directly usable for DST, NLG, and end-to-end dialogue models.
Recommended best practices include defining the schema prior to annotation; enforcing controlled value selection (dropdown menus for categorical slots, span marking for non-categorical slots); automated validation flagging non-ontology values or missing provenance; and crowd-sourced correction pipelines. Such protocols are essential to mitigate annotation drift, slot-value hallucination, and overfitting to frequent entities (Zang et al., 2020; Qian et al., 2021).
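The automated-validation step can be sketched as an ontology membership check for categorical slots (the ontology entries below are illustrative and abbreviated; non-categorical slots are skipped since they have no closed value set):

```python
# Closed value sets for categorical slots (abbreviated, illustrative).
ONTOLOGY = {
    "restaurant-pricerange": {"cheap", "moderate", "expensive"},
    "hotel-stars": {"0", "1", "2", "3", "4", "5"},
}

def validate_state(state):
    """Flag categorical slot values outside the ontology -- the kind of
    automated check recommended before accepting an annotation. Slots
    with no closed set (non-categorical) are passed through unchecked."""
    errors = []
    for slot, value in state.items():
        allowed = ONTOLOGY.get(slot)
        if allowed is not None and value not in allowed:
            errors.append((slot, value))
    return errors

# "cheapest" is not a legal pricerange value -> flagged for correction.
print(validate_state({"restaurant-pricerange": "cheapest", "hotel-stars": "4"}))
# [('restaurant-pricerange', 'cheapest')]
```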