The paper introduces a hybrid imitation and reinforcement learning method for training task-oriented dialogue systems through online user interactions. The approach addresses the mismatch between the dialogue state distributions seen during offline supervised learning (SL) and during online interactive reinforcement learning (RL). The proposed method lets the dialogue agent learn from human teaching and feedback, improving its ability to complete tasks successfully.
The paper makes the following contributions:
- It presents a neural network-based task-oriented dialogue system that can be optimized end-to-end for natural language understanding (NLU), dialogue state tracking (DST), and dialogue policy learning.
- It introduces a hybrid imitation and reinforcement learning method to address the challenge of dialogue state distribution mismatch between offline training and interactive learning.
The system architecture comprises several key components:
- Utterance Encoding: A bidirectional LSTM encodes user utterances into continuous vector representations, capturing both forward and backward contextual information.
- Dialogue State Tracking: A dialogue-level LSTM maintains a continuous representation of the dialogue state, updated at each turn using the encoded user utterance and the previous system action. The model maintains a probability distribution over candidate values for each goal slot type $m$:

  $$s_k = \mathrm{LSTM}_D(s_{k-1}, [U_k, A_{k-1}])$$

  $$P(l_k^m \mid U_{\le k}, A_{<k}) = \mathrm{SlotDist}_m(s_k)$$

  where:
  - $s_k$ is the dialogue-level LSTM state at turn $k$
  - $U_k$ is the encoding of the user utterance at turn $k$
  - $A_{k-1}$ is the encoding of the previous turn's system output
  - $\mathrm{LSTM}_D$ is the dialogue-level LSTM
  - $l_k^m$ is the value of slot type $m$ at the $k$th turn
  - $U_{\le k}$ denotes the user utterances up to and including the $k$th turn
  - $A_{<k}$ denotes the system actions before the $k$th turn
  - $\mathrm{SlotDist}_m$ is a single hidden layer MLP with softmax activation over the candidate values of slot type $m$
- KB Operation: The DST outputs are used to formulate API calls that retrieve information from a knowledge base (KB). Symbolic queries are sent to the KB, and the ranking of KB entities is handled by an external recommender system. The model encodes a summary of the query results (item availability, number of matched items) as input to the policy network.
- Dialogue Policy: A deep neural network models the dialogue policy, selecting the next system action based on the dialogue-level LSTM state ($s_k$), the log probabilities of candidate values from the belief tracker ($v_k$), and the encoding of the query results summary ($E_k$). The policy network emits a system action in the form of a dialogue act conditioned on these inputs:

  $$P(a_k \mid U_{\le k}, A_{<k}, E_{\le k}) = \mathrm{PolicyNet}(s_k, v_k, E_k)$$

  where:
  - $a_k$ is the system action at the $k$th turn
  - $U_{\le k}$ denotes the user utterances up to and including the $k$th turn
  - $A_{<k}$ denotes the system actions before the $k$th turn
  - $E_{\le k}$ denotes the query results summary encodings up to and including the $k$th turn
  - $\mathrm{PolicyNet}$ is a single hidden layer MLP with softmax activation over all system actions

  An illustrative implementation sketch of the utterance encoder, state tracker, and policy network is given after this component list.
- Supervised Pre-training: The system is first trained in a supervised manner on task-oriented dialogue samples, minimizing a linear interpolation of cross-entropy losses for DST and system action prediction (the objective is sketched after this list).
- Imitation Learning with Human Teaching: To address the covariate shift between training and test data, the agent interacts with users; when it makes a mistake in tracking the user's goal, the user corrects it by demonstrating the correct actions. These user-corrected dialogue samples are added to the training corpus, and the dialogue policy is fine-tuned using dialogue sample aggregation (see the schematic loop after this list).
- Reinforcement Learning with Human Feedback: After the imitation learning stage, the system is further optimized with RL, learning from user feedback collected at the end of each dialogue (a positive reward for successful tasks, zero reward for failed ones), with a step penalty applied to each dialogue turn. The REINFORCE algorithm is used to optimize the network parameters (a minimal sketch follows).
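To make the data flow between these components concrete, here is a minimal PyTorch sketch of the utterance encoder, dialogue-level state tracker, slot-value heads, and policy network described above. All layer sizes, the number of candidate values per slot, and the class and method names are illustrative assumptions rather than details taken from the paper.

```python
import torch
import torch.nn as nn

class DialogueAgent(nn.Module):
    """Sketch of the end-to-end architecture; dimensions are illustrative."""

    def __init__(self, vocab_size, emb_dim=128, utt_dim=128, dlg_dim=256,
                 slot_value_counts=(10, 10), num_actions=20, kb_summary_dim=8):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        # Bidirectional utterance-level LSTM: U_k concatenates the final
        # forward and backward hidden states.
        self.utt_lstm = nn.LSTM(emb_dim, utt_dim, bidirectional=True, batch_first=True)
        # Dialogue-level LSTM: s_k = LSTM_D(s_{k-1}, [U_k, A_{k-1}]).
        self.dlg_cell = nn.LSTMCell(2 * utt_dim + num_actions, dlg_dim)
        # One single-hidden-layer MLP per goal slot (SlotDist_m).
        self.slot_heads = nn.ModuleList([
            nn.Sequential(nn.Linear(dlg_dim, dlg_dim), nn.Tanh(),
                          nn.Linear(dlg_dim, n_values))
            for n_values in slot_value_counts
        ])
        # PolicyNet: single-hidden-layer MLP over system actions, conditioned
        # on s_k, the slot log-probabilities v_k, and the KB summary E_k.
        policy_in = dlg_dim + sum(slot_value_counts) + kb_summary_dim
        self.policy = nn.Sequential(nn.Linear(policy_in, dlg_dim), nn.Tanh(),
                                    nn.Linear(dlg_dim, num_actions))

    def encode_utterance(self, token_ids):
        # token_ids: (batch, seq_len) -> U_k: (batch, 2 * utt_dim)
        emb = self.embed(token_ids)
        _, (h_n, _) = self.utt_lstm(emb)
        return torch.cat([h_n[0], h_n[1]], dim=-1)

    def step(self, token_ids, prev_action_onehot, kb_summary, state=None):
        u_k = self.encode_utterance(token_ids)
        s_k, c_k = self.dlg_cell(torch.cat([u_k, prev_action_onehot], dim=-1), state)
        # Belief tracker outputs: P(l_k^m | U_{<=k}, A_{<k}) for each slot m.
        slot_logits = [head(s_k) for head in self.slot_heads]
        v_k = torch.cat([torch.log_softmax(l, dim=-1) for l in slot_logits], dim=-1)
        # Softmax over action_logits gives P(a_k | U_{<=k}, A_{<k}, E_{<=k}).
        action_logits = self.policy(torch.cat([s_k, v_k, kb_summary], dim=-1))
        return slot_logits, action_logits, (s_k, c_k)
```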
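The supervised pre-training objective can be written in sketch form as a weighted sum of per-turn cross-entropy terms for belief tracking and action prediction; the interpolation weights $\lambda_m$ and $\lambda_a$ and the exact notation below are assumptions for illustration.

```latex
% Sketch of the supervised pre-training objective. Ground-truth slot values
% l_k^{m*} and system actions a_k^* come from the dialogue corpus; the
% interpolation weights \lambda_m and \lambda_a are illustrative.
\min_{\theta} \sum_{k} \Big[
  - \sum_{m} \lambda_{m} \log P\big(l_k^{m*} \mid U_{\le k}, A_{<k}; \theta\big)
  - \lambda_{a} \log P\big(a_k^{*} \mid U_{\le k}, A_{<k}, E_{\le k}; \theta\big)
\Big]
```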
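The imitation learning stage can be pictured as the schematic loop below. This is only a sketch: run_dialogue_with_user_corrections and train_supervised are hypothetical helpers standing in for the user-interaction and fine-tuning machinery, and the number of aggregation rounds is an assumption.

```python
def imitation_learning(agent, initial_corpus, user_sessions,
                       run_dialogue_with_user_corrections, train_supervised,
                       num_rounds=3):
    """DAgger-style dialogue sample aggregation with human teaching (sketch)."""
    dataset = list(initial_corpus)
    for _ in range(num_rounds):
        corrected_dialogues = []
        for user in user_sessions:
            # The agent runs its current policy; whenever it tracks the user's
            # goal incorrectly, the user demonstrates the correct annotation,
            # yielding a corrected dialogue sample.
            corrected_dialogues.append(
                run_dialogue_with_user_corrections(agent, user))
        # Aggregate the newly corrected dialogues with the existing corpus and
        # fine-tune with the same supervised objective used for pre-training.
        dataset.extend(corrected_dialogues)
        train_supervised(agent, dataset)
    return agent
```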
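Finally, a minimal REINFORCE sketch of the RL stage, assuming a positive terminal reward for task success, zero for failure, and a small per-turn step penalty as described above; the specific reward values, discount factor, and omission of a baseline are illustrative choices rather than the paper's exact settings.

```python
import torch

def reinforce_update(optimizer, log_probs, task_success,
                     success_reward=1.0, step_penalty=-0.05, gamma=0.95):
    """One policy-gradient update from a single dialogue episode (sketch).

    log_probs: list of scalar tensors, the log-probability of the system
    action sampled at each turn; task_success: end-of-dialogue user feedback.
    """
    # Per-turn rewards: a step penalty every turn, plus the terminal reward.
    rewards = [step_penalty] * len(log_probs)
    rewards[-1] += success_reward if task_success else 0.0
    # Discounted returns, accumulated backwards over the dialogue turns.
    returns, g = [], 0.0
    for r in reversed(rewards):
        g = r + gamma * g
        returns.insert(0, g)
    returns = torch.tensor(returns)
    # REINFORCE objective: minimize the negative return-weighted
    # log-likelihood of the sampled system actions.
    loss = -(torch.stack(log_probs) * returns).sum()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```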
The proposed method was evaluated on the DSTC2 dataset in the restaurant search domain and on an internally collected dialogue corpus in the movie booking domain. The movie booking corpus averages 8.4 turns per dialogue, with 100K dialogues in the training set and 10K dialogues each in the development and test sets.
Experimental results demonstrate the effectiveness of the proposed approach:
- The SL model achieves near state-of-the-art DST results on the DSTC2 corpus.
- In the movie booking domain, the SL model achieves promising performance on both individual and joint slot tracking, reaching 84.57% joint accuracy.
- Interactive learning with imitation and reinforcement learning improves the task success rate, reduces the average number of dialogue turns, and enhances DST accuracy.
- The results suggest that imitation learning with human teaching effectively adapts the supervised-trained model to the dialogue state distribution encountered during user interactions.
- RL optimization further improves dialogue state tracking performance and dialogue policy.
- End-to-end RL optimization achieves higher dialogue task success rates compared to policy-only training.
- Human evaluations using Amazon Mechanical Turk indicate that interactive learning with imitation and reinforcement learning improves the quality of the model, with mean human scores increasing from 3.987 for the SL model to 4.603 for the SL + IL + RL model.
In conclusion, the paper presents a hybrid learning approach for training task-oriented dialogue systems that leverages both imitation and reinforcement learning. The proposed method effectively addresses the dialogue state distribution mismatch issue and enables the agent to learn from human teaching and feedback, leading to improved task success rates and overall system performance.