MultiWOZ - A Large-Scale Multi-Domain Wizard-of-Oz Dataset for Task-Oriented Dialogue Modelling
The paper "MultiWOZ - A Large-Scale Multi-Domain Wizard-of-Oz Dataset for Task-Oriented Dialogue Modelling" introduces the Multi-Domain Wizard-of-Oz dataset (MultiWOZ), a comprehensive resource designed to advance the field of task-oriented dialogue systems. The dataset contains around 10,000 dialogues and is significantly larger than previous labeled corpora, making it a valuable addition to the research community.
Data Collection and Structure
MultiWOZ was collected via a crowd-sourcing approach, which yielded natural, high-quality dialogues without the need for trained annotators or professional agents. The dialogues cover seven domains (attraction, hospital, hotel, police, restaurant, taxi, and train) and are annotated with both belief states and dialogue acts. This dual annotation allows the dataset to serve as a benchmark for several subtasks in dialogue modeling, including belief tracking and natural language generation.
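To make the dual annotation concrete, here is a minimal sketch of how a single annotated exchange could be represented. The field names and values are illustrative assumptions, not the exact schema shipped with the released corpus.

```python
# Illustrative only (not the exact released schema): one annotated exchange,
# showing how a user turn, the tracked belief state, and the system's
# dialogue act might be stored together.
annotated_turn = {
    "user": "I need a cheap restaurant in the centre that serves Italian food.",
    "belief_state": {                      # slot-value pairs accumulated so far
        "restaurant": {
            "pricerange": "cheap",
            "area": "centre",
            "food": "italian",
        }
    },
    "system_act": {                        # semantic annotation of the reply
        "Restaurant-Inform": [["name", "Pizza Hut City Centre"],
                              ["pricerange", "cheap"]],
    },
    "system": "Pizza Hut City Centre is a cheap Italian place in the centre. "
              "Would you like to book a table?",
}
```

A belief tracker is evaluated on how well it reconstructs the `belief_state` from the conversation so far, while a generator is evaluated on how well it verbalises the `system_act`.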
The data collection followed a Wizard-of-Oz setup, in which human participants played the roles of both the user (a tourist) and the system (a clerk), producing natural, goal-driven interactions. The crowd-sourcing pipeline also ensured diverse and semantically rich dialogues, addressing a common shortfall of earlier datasets, which often lacked linguistic variability or multi-domain coverage.
Dataset Statistics and Comparison
The MultiWOZ dataset is substantially larger and more diverse than earlier corpora such as DSTC2, SFX, WOZ 2.0, FRAMES, KVRET, and M2M. Specifically (a sketch of how such corpus statistics can be computed follows the list):
- Dialogue Count: 10,438 dialogues, significantly outnumbering the largest existing datasets.
- Turn Count: 115,434 turns, averaging 11.75 words per user turn and 15.12 per wizard turn.
- Unique Tokens: 23,689, indicating a rich vocabulary.
- Diversity: Dialogues span between 1 and 5 domains, promoting complexity and natural flow in interactions.
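As a rough illustration of where such figures come from, the sketch below computes dialogue, turn, and vocabulary counts from an in-memory corpus. It assumes each dialogue is a list of turns with hypothetical "speaker" and "text" fields and uses plain whitespace tokenisation, so it will not reproduce the paper's numbers exactly.

```python
from collections import Counter

def corpus_statistics(dialogues):
    """Compute basic corpus statistics.

    Assumes each dialogue is a list of turns, where a turn is a dict with a
    "speaker" key ("user" or "wizard") and a "text" key. Whitespace
    tokenisation only, so the results are approximate.
    """
    n_turns = 0
    turn_lengths = {"user": [], "wizard": []}
    vocab = Counter()

    for dialogue in dialogues:
        for turn in dialogue:
            tokens = turn["text"].lower().split()
            n_turns += 1
            turn_lengths[turn["speaker"]].append(len(tokens))
            vocab.update(tokens)

    return {
        "dialogues": len(dialogues),
        "turns": n_turns,
        "avg_user_turn_len": sum(turn_lengths["user"]) / max(len(turn_lengths["user"]), 1),
        "avg_wizard_turn_len": sum(turn_lengths["wizard"]) / max(len(turn_lengths["wizard"]), 1),
        "unique_tokens": len(vocab),
    }
```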
Benchmarking and Baselines
To illustrate the utility of the MultiWOZ dataset, the authors present benchmark results on three critical subtasks: dialogue state tracking, dialogue-act-to-text generation, and dialogue-context-to-text generation.
- Dialogue State Tracking: Using a state-of-the-art belief tracking model, the authors report a joint goal accuracy of 80.9% on the MultiWOZ restaurant sub-domain. This is lower than the 85.5% achieved on the smaller, single-domain WOZ 2.0 dataset, indicating the increased complexity of MultiWOZ (the metric is sketched after this list).
- Dialogue-Act-to-Text Generation: Using the SC-LSTM model, the authors report a BLEU score of 0.616 and a slot error rate (SER) of 4.378% on the MultiWOZ restaurant subset, considerably worse than on the SFX dataset (BLEU of 0.731 and SER of 0.46%) and a further sign of MultiWOZ's difficulty.
- Dialogue-Context-to-Text Generation: End-to-end models with oracle belief states yielded an Inform rate of 71.33% and a Success rate of 60.96% on MultiWOZ, a notable drop relative to results on smaller single-domain corpora such as Cam676. The complexity and linguistic richness of MultiWOZ call for more advanced architectures for effective response generation.
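For reference, the sketch below shows how the headline metrics of the first two subtasks are commonly defined: joint goal accuracy (a turn counts as correct only if every predicted slot-value pair matches the gold belief state) and the slot error rate used with SC-LSTM-style generators (missing plus redundant slots over the total slots in the act). The function names and data layout are assumptions of mine; the evaluation scripts released with the dataset may differ in details such as value normalisation.

```python
def joint_goal_accuracy(predicted_states, gold_states):
    """Fraction of turns whose full predicted belief state exactly
    matches the gold state (every slot and value must agree)."""
    correct = sum(pred == gold for pred, gold in zip(predicted_states, gold_states))
    return correct / len(gold_states)

def slot_error_rate(generated_slots, act_slots):
    """(missing + redundant slots) / total slots in the dialogue act,
    the SER definition commonly used with SC-LSTM-style generators."""
    missing = len(act_slots - generated_slots)
    redundant = len(generated_slots - act_slots)
    return (missing + redundant) / max(len(act_slots), 1)

# Example: two turns of belief tracking and one generated response.
preds = [{"food": "italian"}, {"food": "italian", "area": "centre"}]
golds = [{"food": "italian"}, {"food": "italian", "area": "north"}]
print(joint_goal_accuracy(preds, golds))                # 0.5

print(slot_error_rate({"name", "pricerange"},           # slots realised in the text
                      {"name", "pricerange", "area"}))  # 1 missing / 3 ≈ 0.33
```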
Implications and Future Research
The introduction of MultiWOZ provides a robust benchmark for the development and evaluation of dialogue systems. Its multi-domain nature, combined with the rich variability in language use, poses new challenges for both modular and end-to-end approaches in dialogue modeling. Future research may focus on refining state tracking models, improving multi-domain response generation, and exploring new methods for natural language understanding within such complex datasets. Furthermore, the free availability of MultiWOZ encourages extensive experimentation, which can lead to significant advancements in conversational AI.
In conclusion, MultiWOZ represents an essential resource for the community, addressing previous limitations and setting a new standard for task-oriented dialogue modeling datasets. It offers expansive opportunities for research, pushing the frontiers of natural and effective human-computer interaction.