MultiWOZ 2.2: A Dialogue Dataset with Additional Annotation Corrections and State Tracking Baselines (2007.12720v1)

Published 10 Jul 2020 in cs.CL and cs.AI

Abstract: MultiWOZ is a well-known task-oriented dialogue dataset containing over 10,000 annotated dialogues spanning 8 domains. It is extensively used as a benchmark for dialogue state tracking. However, recent works have reported the presence of substantial noise in the dialogue state annotations. MultiWOZ 2.1 identified and fixed many of these erroneous annotations and user utterances, resulting in an improved version of this dataset. This work introduces MultiWOZ 2.2, yet another improved version of this dataset. Firstly, we identify and fix dialogue state annotation errors across 17.3% of the utterances on top of MultiWOZ 2.1. Secondly, we redefine the ontology by disallowing vocabularies of slots with a large number of possible values (e.g., restaurant name, time of booking). In addition, we introduce slot span annotations for these slots to standardize them across recent models, which previously used custom string matching heuristics to generate them. We also benchmark a few state-of-the-art dialogue state tracking models on the corrected dataset to facilitate comparison for future work. Finally, we discuss best practices for dialogue data collection that can help avoid annotation errors.

An Analytical Overview of MultiWOZ 2.2: Enhancements in Dialogue Dataset Annotation and State Tracking

This overview examines the improvements introduced in the MultiWOZ 2.2 dataset, a significant upgrade over its predecessors, MultiWOZ 2.0 and 2.1. The corpus remains a cornerstone of task-oriented dialogue research and is extensively used to evaluate dialogue state tracking (DST) models. The paper introduces several pivotal adjustments aimed at rectifying the noise present in earlier iterations of MultiWOZ, thereby improving the quality and utility of the dataset.

Major Contributions of MultiWOZ 2.2

The paper underscores three primary contributions of the revised dataset:

  1. Correction of Annotation Errors: A comprehensive revision was undertaken to rectify inaccuracies in dialogue state annotations, including hallucinated values, early markups, and database-induced errors. Errors were found in approximately 17.3% of the utterances across 28.2% of dialogues, ranging from typographical mistakes to inconsistent state updates, all of which previously impeded the performance and reliability of DST models.
  2. Refined Ontology through Schema Adoption: Shifting from an exhaustive ontology to a schema-based definition was a key strategic modification. Slots are now split into categorical slots, which take values from a small closed set, and non-categorical slots, whose open-ended values (such as "restaurant-name" and "restaurant-booktime") are grounded by span annotations; a sketch of this distinction follows the list. This approach addresses issues of ontology completeness and consistency that are particularly acute for slots with large or dynamic sets of possible values, and it improves the scalability and robustness of DST models.
  3. Enhanced Annotations and Benchmarks: The dataset now includes span annotations over both user and system utterances, along with annotations of active user intents and requested slots. These additions support the development of more efficient and contextually aware dialogue systems. Furthermore, state-of-the-art DST models, including TRADE, SGD-baseline, and DS-DST, were benchmarked on the corrected dataset to provide a comparative baseline for future work.
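
To make the categorical versus non-categorical distinction concrete, the following sketch shows how a simplified schema entry and a slot span annotation might look in Python. The field names (is_categorical, possible_values, exclusive_end) are illustrative assumptions and do not necessarily match the exact keys of the released MultiWOZ 2.2 JSON files.

```python
# Illustrative sketch of the schema-based slot distinction in MultiWOZ 2.2.
# Field names are simplified for exposition, not copied from the dataset.

schema = {
    "hotel-pricerange": {          # categorical: small, closed set of values
        "is_categorical": True,
        "possible_values": ["cheap", "moderate", "expensive"],
    },
    "restaurant-name": {           # non-categorical: open vocabulary,
        "is_categorical": False,   # grounded by a span in the utterance
        "possible_values": [],
    },
}

utterance = "I'd like to book a table at Pizza Hut Fen Ditton."

# A non-categorical slot value is tied to character offsets in the utterance,
# replacing the per-model string-matching heuristics used previously.
value = "Pizza Hut Fen Ditton"
span_annotation = {
    "slot": "restaurant-name",
    "value": value,
    "start": utterance.index(value),
    "exclusive_end": utterance.index(value) + len(value),
}

assert utterance[span_annotation["start"]:span_annotation["exclusive_end"]] == value
```

Grounding non-categorical values in character offsets, as above, standardizes span extraction that earlier models handled with custom string-matching heuristics.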

Implications and Future Directions

The corrections introduced in MultiWOZ 2.2 have significant implications for the development of DST models, above all by enabling sounder evaluations and fairer comparisons between systems. In particular, the dataset's shift toward a schema-based ontology marks a methodological pivot that enhances the generalization capabilities of dialogue models. Moreover, addressing the dual challenges of annotation accuracy and ontology completeness paves the way for constructing more robust, scalable dialogue systems.
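
Such comparisons hinge on a shared metric: DST results on MultiWOZ are conventionally reported as joint goal accuracy, in which a turn counts as correct only if the entire predicted dialogue state matches the gold state. The short sketch below illustrates the computation; the function and variable names are illustrative and are not taken from the paper's evaluation code.

```python
# Sketch of joint goal accuracy, the standard metric for comparing DST models
# on MultiWOZ: a turn is correct only if every slot-value pair in the
# predicted state matches the gold state exactly.

def joint_goal_accuracy(predicted_states, gold_states):
    """Fraction of turns whose full predicted state equals the gold state."""
    assert len(predicted_states) == len(gold_states)
    correct = sum(pred == gold for pred, gold in zip(predicted_states, gold_states))
    return correct / len(gold_states)

gold = [{"restaurant-food": "italian"},
        {"restaurant-food": "italian", "restaurant-area": "centre"}]
pred = [{"restaurant-food": "italian"},
        {"restaurant-food": "italian", "restaurant-area": "north"}]

print(joint_goal_accuracy(pred, gold))   # 0.5: the second turn has a wrong area
```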

The paper also serves as a compendium of best practices for dialogue data collection, emphasizing the importance of a pre-defined ontology or schema for circumventing annotation errors. Such foresight not only improves dataset reliability but also ensures consistency, a factor critical to meaningful evaluation of dialogue models.
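
As a hypothetical illustration of why a pre-defined schema helps, the check below flags state values that fall outside a categorical slot's allowed set or refer to an undefined slot. It is a minimal sketch under assumed field names, not code from the paper or the dataset's tooling.

```python
# A minimal, hypothetical validation pass of the kind a pre-defined schema
# makes possible: report dialogue-state entries that violate the schema.

SCHEMA = {
    "hotel-pricerange": {"is_categorical": True,
                         "possible_values": ["cheap", "moderate", "expensive"]},
    "restaurant-name":  {"is_categorical": False, "possible_values": []},
}

def validate_state(state: dict, schema: dict = SCHEMA) -> list:
    """Return (slot, value) pairs in `state` that violate the schema."""
    violations = []
    for slot, value in state.items():
        spec = schema.get(slot)
        if spec is None:
            violations.append((slot, value))              # unknown slot name
        elif spec["is_categorical"] and value not in spec["possible_values"]:
            violations.append((slot, value))              # hallucinated value
    return violations

# "luxury" is not a legal price range, so the checker reports it.
print(validate_state({"hotel-pricerange": "luxury",
                      "restaurant-name": "Pizza Hut Fen Ditton"}))
# -> [('hotel-pricerange', 'luxury')]
```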

Looking forward, as dialogue systems expand to a broader range of applications and domains, there is substantial potential in methodologies for representing logical expressions within dialogue states, for instance a user who will accept either of two areas of town. Richer representation formats, and models capable of handling such complexity, would further improve the adaptability and precision of task-oriented dialogue systems.

In conclusion, MultiWOZ 2.2 marks a significant stride in the domain of dialogue datasets, addressing previous limitations and charting a course for future advances in dialogue state tracking. As a more robust benchmark, it offers the AI research community a cleaner, more reliable, and more representative dataset, while prompting ongoing discussion of best practices in dialogue system development and data annotation.

Authors (6)
  1. Xiaoxue Zang (28 papers)
  2. Abhinav Rastogi (29 papers)
  3. Srinivas Sunkara (12 papers)
  4. Raghav Gupta (24 papers)
  5. Jianguo Zhang (97 papers)
  6. Jindong Chen (21 papers)
Citations (259)