MultiWOZ -- A Large-Scale Multi-Domain Wizard-of-Oz Dataset for Task-Oriented Dialogue Modelling
Abstract: Even though machine learning has become the dominant approach in the dialogue research community, real breakthroughs have been held back by the scale of data available. To address this fundamental obstacle, we introduce the Multi-Domain Wizard-of-Oz dataset (MultiWOZ), a fully-labeled collection of human-human written conversations spanning multiple domains and topics. At a size of $10$k dialogues, it is at least one order of magnitude larger than all previous annotated task-oriented corpora. The contribution of this work, apart from the open-sourced dataset labelled with dialogue belief states and dialogue actions, is two-fold. Firstly, a detailed description of the data collection procedure is provided, along with a summary of the data structure and analysis. The proposed data-collection pipeline is based entirely on crowd-sourcing, without the need to hire professional annotators. Secondly, a set of benchmark results of belief tracking, dialogue act and response generation is reported, which shows the usability of the data and sets a baseline for future studies.
Glossary
- attention: A neural mechanism that focuses the model on relevant parts of its inputs when generating outputs. "The attention is conditioned on the oracle belief state and the database pointer."
- BLEU score: An automatic metric that evaluates text generation quality by comparing n-grams to reference texts. "fluency is measured via BLEU score~\cite{papineni2002bleu}."
- belief state: The system’s internal representation of the user’s goals and constraints at a given point in the dialogue. "the annotation of a belief state is performed implicitly while the wizard is allowed to fully focus on providing the required information."
- belief tracker: A model component that infers or maintains the dialogue’s belief state from conversational context. "a sequence-to-sequence model~\citep{sutskever2014sequence} is augmented with a belief tracker and a discrete database accessing component"
- belief tracking: The task of estimating and updating the dialogue’s belief state over time. "benchmark results of belief tracking, dialogue act and response generation is reported"
- bidirectional GRU: A recurrent neural network variant that processes sequences in both forward and backward directions using Gated Recurrent Units. "The best results on the Cam676 corpus were obtained with bidirectional GRU cell."
- co-referencing: The use of linguistic expressions that refer to previously mentioned entities in discourse. "Natural incorporation of co-referencing and lexical entailment into the dialogue was achieved through implicit mentioning of some slots in the goal."
- database pointer: A signal indicating database query results or availability that conditions generation or decision-making. "The attention is conditioned on the oracle belief state and the database pointer."
- Dialogue act: A structured label of communicative intent (e.g., inform, request) often with slot-value arguments. "a set of dialogue acts with slots per turn."
- Dialogue-context-to-text generation: Generating system responses directly from dialogue history and auxiliary signals. "Dialogue-Context-to-Text Generation"
- Dialogue management: The decision-making component that chooses system actions based on context, goals, and knowledge. "the next challenge becomes the dialogue management and response generation components."
- Dialogue State Tracking: The process or task of maintaining the current dialogue belief state from user inputs. "A robust natural language understanding and dialogue state tracking is the first step towards building a good conversational system."
- Dialogue State Tracking Challenge (DSTC): A benchmark series focused on tracking user goals and states in dialogue systems. "first Dialogue State Tracking Challenge \cite{williams2013dialog}."
- end-to-end dialogue modelling: Training dialogue systems to map inputs to outputs directly without decomposing into hand-engineered modules. "and even end-to-end dialogue modelling~\cite{zhao2016towards,wen2016network,eric2017key}."
- Fleiss' kappa: A statistical measure of inter-annotator agreement for categorical labels across multiple annotators. "we used Fleiss' kappa metric \cite{fleiss1971measuring} per single dialogue act."
- global attention: An attention variant that attends over all encoder states when decoding. "with the global type of attention \cite{bahdanau2014neural}."
- informable slots: Attributes that users can specify to constrain a search (e.g., area, price range). "In general, the slots may be divided into informable slots and requestable slots."
- Inform rate: An evaluation metric indicating whether the system provided appropriate entities during a task. "the first two metrics relate to the dialogue task completion - whether the system has provided an appropriate entity (Inform rate) and then answered all the requested attributes (Success rate);"
- inter-annotator agreement: A measure of consistency among different annotators labeling the same data. "To estimate the inter-annotator agreement, the averaged weighted kappa value for all dialogue acts was computed over $291$ turns."
- joint goals: A metric assessing whether all slot constraints are correctly tracked simultaneously. "Joint goals"
- lexical entailment: A semantic relation in which one word or phrase implies another, so that a slot value can be conveyed without being stated verbatim. "Natural incorporation of co-referencing and lexical entailment into the dialogue was achieved through implicit mentioning of some slots in the goal."
- Long Short-Term Memory (LSTM): A recurrent neural network architecture designed to capture long-range dependencies in sequences. "the LSTM cell serving as a decoder and an encoder achieved the highest score"
- ontology: A structured specification of entities, slots, and values that define a task domain and database schema. "The domain of a task-oriented dialogue system is often defined by an ontology, a structured representation of the back-end database."
- oracle belief state: A gold-standard belief state used to condition models during training or evaluation. "an oracle belief-state obtained from the wizard annotations as discussed in Section \ref{sec:sysside}."
- oracle tracker: A perfect or ground-truth tracker used to isolate evaluation of downstream components. "the annotations of the dialogue state are used as an oracle tracker."
- Reinforcement learning-based models: Systems that learn dialogue policies by optimizing rewards through interaction. "These create a new challenge for reinforcement learning-based models requiring them to operate on concurrent actions."
- requestable slots: Attributes users can ask the system to provide for a selected entity (e.g., phone, address). "In general, the slots may be divided into informable slots and requestable slots."
- SC-LSTM: The Semantically Conditioned LSTM architecture for mapping structured acts to natural language. "Semantically Conditioned Long Short-term Memory network (SC-LSTM) proposed by~\citet{wensclstm15}"
- sequence-to-sequence model: A neural architecture that encodes an input sequence and decodes an output sequence. "a sequence-to-sequence model~\citep{sutskever2014sequence} is augmented with a belief tracker and a discrete database accessing component"
- slot error rate (SER): A generation metric measuring incorrect, missing, or redundant slot realizations. "slot error rate (SER)~\cite{wensclstm15}."
- slot-value pairs: Structured arguments in dialogue acts specifying attribute-value information. "a dialogue act consists of the intent (such as request or inform) and slot-value pairs."
- Success rate: An evaluation metric indicating whether the system answered all requested attributes for provided entities. "the first two metrics relate to the dialogue task completion - whether the system has provided an appropriate entity (Inform rate) and then answered all the requested attributes (Success rate);"
- weighted kappa: A variant of Cohen’s/Fleiss’ kappa that weights disagreements by severity. "Although the weighted kappa value averaged over dialogue acts was at a high level of $0.704$,"
- Wizard-of-Oz (WOZ): A data collection paradigm where human “wizards” simulate system responses to capture natural dialogues. "The Wizard-of-Oz framework (WOZ)~\cite{kelley1984iterative} was first proposed as an iterative approach to improve user experiences when designing a conversational system."
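Two of the terms above, joint goals and Fleiss' kappa, can be made concrete with a short sketch. The code below is illustrative only and is not taken from the paper: the flat `"domain-slot" -> value` belief-state layout, the function names, and the example values are assumptions, and the kappa function is the plain (unweighted) Fleiss' kappa rather than the averaged weighted value the paper reports.

```python
from typing import Dict, List, Sequence

# Hypothetical flat belief state: "domain-slot" -> value (an assumed layout).
BeliefState = Dict[str, str]


def joint_goal_accuracy(predicted: Sequence[BeliefState],
                        gold: Sequence[BeliefState]) -> float:
    """Fraction of turns whose predicted belief state matches gold exactly."""
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must have the same number of turns")
    if not gold:
        return 0.0
    # A turn counts only if every slot-value pair is right at the same time.
    correct = sum(1 for p, g in zip(predicted, gold) if p == g)
    return correct / len(gold)


def fleiss_kappa(counts: List[List[int]]) -> float:
    """Unweighted Fleiss' kappa.

    counts[i][j] is the number of annotators who assigned item i to
    category j. Every item must be rated by the same number of annotators.
    """
    if not counts:
        raise ValueError("at least one item is required")
    n_items = len(counts)
    n_raters = sum(counts[0])
    if n_raters < 2:
        raise ValueError("at least two annotators per item are required")
    if any(sum(row) != n_raters for row in counts):
        raise ValueError("every item needs the same number of ratings")

    # Observed agreement: mean over items of the pairwise agreement P_i.
    p_items = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_bar = sum(p_items) / n_items

    # Chance agreement from the overall category proportions p_j.
    n_categories = len(counts[0])
    p_cats = [
        sum(row[j] for row in counts) / (n_items * n_raters)
        for j in range(n_categories)
    ]
    p_e = sum(p * p for p in p_cats)

    if p_e == 1.0:
        # All ratings fall in one category; kappa is undefined, and
        # returning 1.0 here is a convention assumed for this sketch.
        return 1.0
    return (p_bar - p_e) / (1.0 - p_e)


if __name__ == "__main__":
    gold = [{"hotel-area": "north"},
            {"hotel-area": "north", "hotel-pricerange": "cheap"}]
    pred = [{"hotel-area": "north"},
            {"hotel-area": "north", "hotel-pricerange": "moderate"}]
    print(joint_goal_accuracy(pred, gold))
    print(fleiss_kappa([[3, 0], [0, 3]]))
```

Exact dictionary equality makes the joint-goal measure strict: a single wrong slot value marks the whole turn as incorrect, which is why it is usually lower than per-slot accuracy.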