
MultiWOZ -- A Large-Scale Multi-Domain Wizard-of-Oz Dataset for Task-Oriented Dialogue Modelling

Published 29 Sep 2018 in cs.CL (arXiv:1810.00278v3)

Abstract: Even though machine learning has become the major scene in dialogue research community, the real breakthrough has been blocked by the scale of data available. To address this fundamental obstacle, we introduce the Multi-Domain Wizard-of-Oz dataset (MultiWOZ), a fully-labeled collection of human-human written conversations spanning over multiple domains and topics. At a size of $10$k dialogues, it is at least one order of magnitude larger than all previous annotated task-oriented corpora. The contribution of this work apart from the open-sourced dataset labelled with dialogue belief states and dialogue actions is two-fold: firstly, a detailed description of the data collection procedure along with a summary of data structure and analysis is provided. The proposed data-collection pipeline is entirely based on crowd-sourcing without the need of hiring professional annotators; secondly, a set of benchmark results of belief tracking, dialogue act and response generation is reported, which shows the usability of the data and sets a baseline for future studies.

Citations (1,239)

Summary

  • The paper introduces MultiWOZ, a benchmark dataset of over 10,000 annotated dialogues that sets a new standard for task-oriented dialogue modeling.
  • The dataset carries dual annotations for belief tracking and dialogue-act-to-text generation, with reported baselines of 80.9% joint goal accuracy and a BLEU score of 0.616.
  • The work underscores the challenges of multi-domain dialogue systems and promotes further research with its diverse, semantically rich, and freely available resource.


The paper "MultiWOZ - A Large-Scale Multi-Domain Wizard-of-Oz Dataset for Task-Oriented Dialogue Modelling" introduces the Multi-Domain Wizard-of-Oz dataset (MultiWOZ), a comprehensive resource designed to advance the field of task-oriented dialogue systems. The dataset contains around 10,000 dialogues and is significantly larger than previous labeled corpora, making it a valuable addition to the research community.

Data Collection and Structure

MultiWOZ was collected via a crowd-sourcing approach, which facilitated the creation of high-quality, natural dialogues without the need for professional annotators. The dialogues encompass multiple domains such as hotel, restaurant, train, and taxi services and are annotated with belief states and dialogue acts. This dual annotation allows the dataset to serve as a benchmark for various subtasks in dialogue modeling, including belief tracking and natural language generation.
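To make the dual annotation concrete, the sketch below shows what a single annotated turn might look like. The field names and values here are illustrative, not the exact MultiWOZ JSON schema: each system turn pairs a per-domain belief state (the user's accumulated constraints) with dialogue acts (an intent plus slot-value arguments).

```python
# Hypothetical annotated turn (field names are illustrative, not the
# official MultiWOZ schema). Belief tracking predicts "belief_state"
# from the dialogue history; dialogue-act-to-text generation maps
# "dialogue_acts" to the system utterance.
turn = {
    "user": "I need a cheap restaurant in the centre.",
    "belief_state": {
        "restaurant": {"pricerange": "cheap", "area": "centre"},
    },
    "system": "Charlie Chan is a cheap Chinese place in the centre.",
    "dialogue_acts": {
        "Restaurant-Inform": [
            ("name", "Charlie Chan"),
            ("pricerange", "cheap"),
            ("area", "centre"),
        ],
    },
}

# The same turn thus serves two benchmarks at once.
print(sorted(turn["belief_state"]["restaurant"].items()))
```

Because both layers are present on every turn, the corpus supports belief tracking, act-to-text generation, and end-to-end modeling without re-annotation.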

The data collection followed a Wizard-of-Oz setup, where human participants played the roles of both the user (tourist) and the system (clerk), fostering natural interaction workflows. The use of a crowd-sourcing pipeline ensured diverse and semantically rich dialogues, addressing a common shortfall in earlier datasets that lacked linguistic variability or multi-domain coverage.

Dataset Statistics and Comparison

The MultiWOZ dataset is quantitatively superior to earlier datasets like DSTC2, SFX, WOZ 2.0, FRAMES, KVRET, and M2M. Specifically:

  • Dialogue Count: 10,438 dialogues, significantly outnumbering the largest existing datasets.
  • Turn Count: 115,434 turns, averaging 11.75 words per user turn and 15.12 per wizard turn.
  • Unique Tokens: 23,689, indicating a rich vocabulary.
  • Diversity: Dialogues span between 1 and 5 domains, promoting complexity and natural flow in interactions.

Benchmarking and Baselines

To illustrate the utility of the MultiWOZ dataset, the authors present benchmark results on three critical subtasks: dialogue state tracking, dialogue-act-to-text generation, and dialogue-context-to-text generation.

  1. Dialogue State Tracking: Leveraging a state-of-the-art model, the authors report a joint goal accuracy of 80.9% on the MultiWOZ restaurant sub-domain. This is lower than the 85.5% accuracy on the smaller, single-domain WOZ 2.0 dataset, indicating the increased complexity of MultiWOZ.
  2. Dialogue-Act-to-Text Generation: Using the SC-LSTM model, they achieved a BLEU score of 0.616 and a slot error rate (SER) of 4.378% on the MultiWOZ restaurant subset. These metrics suggest that MultiWOZ presents significant challenges compared to the SFX dataset (BLEU of 0.731 and SER of 0.46%).
  3. Dialogue-Context-to-Text Generation: End-to-end models with oracle belief states yielded an Inform metric of 71.33% and a Success metric of 60.96% on the MultiWOZ dataset, showing a notable drop from simpler datasets like Cam676. The complexity and linguistic richness of MultiWOZ require more advanced architectures for effective response generation.
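The joint goal metric from item 1 is strict: a turn only counts as correct if the entire predicted belief state matches the gold state, so a single wrong slot fails the whole turn. A minimal sketch of that computation (the function and example states are illustrative, not the authors' evaluation code):

```python
# Minimal sketch of joint goal accuracy for dialogue state tracking:
# a turn is correct only if *every* slot-value pair matches the gold
# belief state exactly.
def joint_goal_accuracy(predicted, gold):
    """Fraction of turns whose full predicted belief state equals the gold state."""
    correct = sum(p == g for p, g in zip(predicted, gold))
    return correct / len(gold)

gold = [
    {"food": "chinese", "area": "centre"},
    {"food": "chinese", "area": "centre", "pricerange": "cheap"},
]
pred = [
    {"food": "chinese", "area": "centre"},
    {"food": "chinese", "area": "north", "pricerange": "cheap"},  # one slot wrong
]

print(joint_goal_accuracy(pred, gold))  # 0.5: the second turn fails as a whole
```

This all-or-nothing scoring is one reason multi-domain tracking on MultiWOZ is harder than on single-domain corpora: more active domains mean more slots that must all be correct simultaneously.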

Implications and Future Research

The introduction of MultiWOZ provides a robust benchmark for the development and evaluation of dialogue systems. Its multi-domain nature, combined with the rich variability in language use, poses new challenges for both modular and end-to-end approaches in dialogue modeling. Future research may focus on refining state tracking models, improving multi-domain response generation, and exploring new methods for natural language understanding within such complex datasets. Furthermore, the free availability of MultiWOZ encourages extensive experimentation, which can lead to significant advancements in conversational AI.

In conclusion, MultiWOZ represents an essential resource for the community, addressing previous limitations and setting a new standard for task-oriented dialogue modeling datasets. It offers expansive opportunities for research, pushing the frontiers of natural and effective human-computer interaction.


Glossary

  • attention: A neural mechanism that focuses the model on relevant parts of its inputs when generating outputs. "The attention is conditioned on the oracle belief state and the database pointer."
  • BLEU score: An automatic metric that evaluates text generation quality by comparing n-grams to reference texts. "fluency is measured via BLEU score~\cite{papineni2002bleu}."
  • belief state: The system’s internal representation of the user’s goals and constraints at a given point in the dialogue. "the annotation of a belief state is performed implicitly while the wizard is allowed to fully focus on providing the required information."
  • belief tracker: A model component that infers or maintains the dialogue’s belief state from conversational context. "a sequence-to-sequence model~\citep{sutskever2014sequence} is augmented with a belief tracker and a discrete database accessing component"
  • belief tracking: The task of estimating and updating the dialogue’s belief state over time. "benchmark results of belief tracking, dialogue act and response generation is reported"
  • bidirectional GRU: A recurrent neural network variant that processes sequences in both forward and backward directions using Gated Recurrent Units. "The best results on the Cam676 corpus were obtained with bidirectional GRU cell."
  • co-referencing: The use of linguistic expressions that refer to previously mentioned entities in discourse. "Natural incorporation of co-referencing and lexical entailment into the dialogue was achieved through implicit mentioning of some slots in the goal."
  • database pointer: A signal indicating database query results or availability that conditions generation or decision-making. "The attention is conditioned on the oracle belief state and the database pointer."
  • Dialogue act: A structured label of communicative intent (e.g., inform, request) often with slot-value arguments. "a set of dialogue acts with slots per turn."
  • Dialogue-context-to-text generation: Generating system responses directly from dialogue history and auxiliary signals. "Dialogue-Context-to-Text Generation"
  • Dialogue management: The decision-making component that chooses system actions based on context, goals, and knowledge. "the next challenge becomes the dialogue management and response generation components."
  • Dialogue State Tracking: The process or task of maintaining the current dialogue belief state from user inputs. "A robust natural language understanding and dialogue state tracking is the first step towards building a good conversational system."
  • Dialogue State Tracking Challenge (DSTC): A benchmark series focused on tracking user goals and states in dialogue systems. "first Dialogue State Tracking Challenge \cite{williams2013dialog}."
  • end-to-end dialogue modelling: Training dialogue systems to map inputs to outputs directly without decomposing into hand-engineered modules. "and even end-to-end dialogue modelling~\cite{zhao2016towards,wen2016network,eric2017key}."
  • Fleiss' kappa: A statistical measure of inter-annotator agreement for categorical labels across multiple annotators. "we used Fleiss' kappa metric \cite{fleiss1971measuring} per single dialogue act."
  • global attention: An attention variant that attends over all encoder states when decoding. "with the global type of attention \cite{bahdanau2014neural}."
  • informable slots: Attributes that users can specify to constrain a search (e.g., area, price range). "In general, the slots may be divided into informable slots and requestable slots."
  • Inform rate: An evaluation metric indicating whether the system provided appropriate entities during a task. "the first two metrics relate to the dialogue task completion - whether the system has provided an appropriate entity (Inform rate) and then answered all the requested attributes (Success rate);"
  • inter-annotator agreement: A measure of consistency among different annotators labeling the same data. "To estimate the inter-annotator agreement, the averaged weighted kappa value for all dialogue acts was computed over $291$ turns."
  • joint goals: A metric assessing whether all slot constraints are correctly tracked simultaneously. "Joint goals"
  • lexical entailment: A semantic relation where one phrase logically implies another (used to ensure coherent dialogue). "Natural incorporation of co-referencing and lexical entailment into the dialogue was achieved through implicit mentioning of some slots in the goal."
  • Long Short-Term Memory (LSTM): A recurrent neural network architecture designed to capture long-range dependencies in sequences. "the LSTM cell serving as a decoder and an encoder achieved the highest score"
  • ontology: A structured specification of entities, slots, and values that define a task domain and database schema. "The domain of a task-oriented dialogue system is often defined by an ontology, a structured representation of the back-end database."
  • oracle belief state: A gold-standard belief state used to condition models during training or evaluation. "an oracle belief-state obtained from the wizard annotations as discussed in Section \ref{sec:sysside}."
  • oracle tracker: A perfect or ground-truth tracker used to isolate evaluation of downstream components. "the annotations of the dialogue state are used as an oracle tracker."
  • Reinforcement learning-based models: Systems that learn dialogue policies by optimizing rewards through interaction. "These create a new challenge for reinforcement learning-based models requiring them to operate on concurrent actions."
  • requestable slots: Attributes users can ask the system to provide for a selected entity (e.g., phone, address). "In general, the slots may be divided into informable slots and requestable slots."
  • SC-LSTM: The Semantically Conditioned LSTM architecture for mapping structured acts to natural language. "Semantically Conditioned Long Short-term Memory network (SC-LSTM) proposed by~\citet{wensclstm15}"
  • sequence-to-sequence model: A neural architecture that encodes an input sequence and decodes an output sequence. "a sequence-to-sequence model~\citep{sutskever2014sequence} is augmented with a belief tracker and a discrete database accessing component"
  • slot error rate (SER): A generation metric measuring incorrect, missing, or redundant slot realizations. "slot error rate (SER)~\cite{wensclstm15}."
  • slot-value pairs: Structured arguments in dialogue acts specifying attribute-value information. "a dialogue act consists of the intent (such as request or inform) and slot-value pairs."
  • Success rate: An evaluation metric indicating whether the system answered all requested attributes for provided entities. "the first two metrics relate to the dialogue task completion - whether the system has provided an appropriate entity (Inform rate) and then answered all the requested attributes (Success rate);"
  • weighted kappa: A variant of Cohen’s/Fleiss’ kappa that weights disagreements by severity. "Although the weighted kappa value averaged over dialogue acts was at a high level of $0.704$,"
  • Wizard-of-Oz (WOZ): A data collection paradigm where human “wizards” simulate system responses to capture natural dialogues. "The Wizard-of-Oz framework (WOZ)~\cite{kelley1984iterative} was first proposed as an iterative approach to improve user experiences when designing a conversational system."
