Dialogue Learning with Human Teaching and Feedback in End-to-End Trainable Task-Oriented Dialogue Systems (1804.06512v1)

Published 18 Apr 2018 in cs.CL

Abstract: In this work, we present a hybrid learning method for training task-oriented dialogue systems through online user interactions. Popular methods for learning task-oriented dialogues include applying reinforcement learning with user feedback on supervised pre-training models. The efficiency of such a learning method may suffer from the mismatch of dialogue state distribution between offline training and online interactive learning stages. To address this challenge, we propose a hybrid imitation and reinforcement learning method, with which a dialogue agent can effectively learn from its interaction with users by learning from human teaching and feedback. We design a neural network based task-oriented dialogue agent that can be optimized end-to-end with the proposed learning method. Experimental results show that our end-to-end dialogue agent can learn effectively from the mistakes it makes via imitation learning from user teaching. Applying reinforcement learning with user feedback after the imitation learning stage further improves the agent's capability in successfully completing a task.

The paper introduces a hybrid imitation and reinforcement learning method for training task-oriented dialogue systems through online user interactions. The approach addresses the dialogue state distribution mismatch between offline supervised learning (SL) and online interactive reinforcement learning (RL) stages. The proposed method allows a dialogue agent to learn from human teaching and feedback, effectively improving its ability to complete tasks successfully.

The paper makes the following contributions:

  • It presents a neural network-based task-oriented dialogue system that can be optimized end-to-end for natural language understanding (NLU), dialogue state tracking (DST), and dialogue policy learning.
  • It introduces a hybrid imitation and reinforcement learning method to address the challenge of dialogue state distribution mismatch between offline training and interactive learning.

The system architecture comprises several key components:

  • Utterance Encoding: A bidirectional LSTM encodes user utterances into continuous vector representations, capturing both forward and backward contextual information.
  • Dialogue State Tracking: A dialogue-level LSTM maintains a continuous representation of the dialogue state, updated at each turn using the encoded user utterance and the previous system action. The model maintains a probability distribution $P(l^{m}_k)$ over candidate values for each goal slot type $m \in M$:

    $s_k = \operatorname{LSTM_D}(s_{k-1}, [U_k, A_{k-1}])$

    $P(l^{m}_k \mid \mathbf{U}_{\le k}, \mathbf{A}_{< k}) = \operatorname{SlotDist}_{m}(s_k)$

    • $s_k$ is the dialogue-level LSTM state at turn $k$
    • $U_k$ is the encoding of the user utterance at turn $k$
    • $A_{k-1}$ is the encoding of the previous turn's system output
    • $\operatorname{LSTM_D}$ is the dialogue-level LSTM
    • $l^{m}_k$ is the value of goal slot type $m \in M$ at the $k$th turn
    • $\mathbf{U}_{\le k}$ are the user utterances up to and including the $k$th turn
    • $\mathbf{A}_{< k}$ are the system actions before the $k$th turn
    • $\operatorname{SlotDist}_{m}$ is a single-hidden-layer MLP with a $\operatorname{softmax}$ activation over the candidate values of slot type $m \in M$
  • KB Operation: The DST outputs are used to formulate API call commands that retrieve information from a knowledge base (KB). Symbolic queries are sent to the KB, and the ranking of KB entities is handled by an external recommender system. The model encodes a summary of the query results (item availability, number of matched items) as input to the policy network.
  • Dialogue Policy: A deep neural network models the dialogue policy, selecting the next system action based on the dialogue-level LSTM state ($s_k$), the log probabilities of candidate values from the belief tracker ($v_k$), and the encoding of the query results summary ($E_k$). The policy network emits a system action in the form of a dialogue act conditioned on these inputs (the state tracker and the policy network are sketched in code after this list):

    $P(a_{k} \mid \mathbf{U}_{\le k}, \mathbf{A}_{< k}, \mathbf{E}_{\le k}) = \operatorname{PolicyNet}(s_{k}, v_{k}, E_{k})$

    • $a_k$ is the system action at turn $k$
    • $\mathbf{U}_{\le k}$ are the user utterances up to and including the $k$th turn
    • $\mathbf{A}_{< k}$ are the system actions before the $k$th turn
    • $\mathbf{E}_{\le k}$ are the encodings of the query results summaries up to and including the $k$th turn
    • $\operatorname{PolicyNet}$ is a single-hidden-layer MLP with a $\operatorname{softmax}$ activation over all system actions
  • Supervised Pre-training: The system is initially trained in a supervised manner using task-oriented dialogue samples, minimizing a linear interpolation of cross-entropy losses for DST and system action prediction.
  • Imitation Learning with Human Teaching: To address the covariate shift between training and test data, the agent interacts with users, and when it makes a mistake in tracking the user's goal, users correct the mistake by demonstrating the correct actions. These user-corrected dialogue samples are added to the training corpus, and the dialogue policy is fine-tuned using dialogue sample aggregation.
  • Reinforcement Learning with Human Feedback: After the imitation learning stage, the system is further optimized with RL, learning from user feedback collected at the end of each dialogue (a positive reward for successful tasks, zero reward for failed tasks). A step penalty is applied to each dialogue turn to encourage shorter dialogues. The REINFORCE algorithm is used to optimize the network parameters (both training objectives are sketched in code below).
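
To make the model side of this concrete, here is a minimal PyTorch sketch of the forward pass described above: a bidirectional LSTM utterance encoder, a dialogue-level LSTM that produces $s_k$, one $\operatorname{SlotDist}_m$ head per goal slot, and a $\operatorname{PolicyNet}$ over $[s_k, v_k, E_k]$. This is an illustrative reconstruction, not the authors' implementation; the dimensions and names (`slot_vocab_sizes`, `kb_summary_dim`, `num_actions`) are assumptions, and the KB query-results summary is taken as a precomputed vector.

```python
# Illustrative sketch of the agent architecture (assumed dimensions and names,
# not the paper's released code).
import torch
import torch.nn as nn


class TaskOrientedDialogueAgent(nn.Module):
    def __init__(self, vocab_size, embed_dim, utt_hidden, dlg_hidden,
                 num_actions, slot_vocab_sizes, kb_summary_dim):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        # Utterance encoding: bidirectional LSTM over the user's tokens.
        self.utt_encoder = nn.LSTM(embed_dim, utt_hidden,
                                   bidirectional=True, batch_first=True)
        # Dialogue-level LSTM: s_k = LSTM_D(s_{k-1}, [U_k, A_{k-1}]).
        self.dialogue_lstm = nn.LSTMCell(2 * utt_hidden + num_actions, dlg_hidden)
        # SlotDist_m: one single-hidden-layer MLP per goal slot type.
        self.slot_heads = nn.ModuleList([
            nn.Sequential(nn.Linear(dlg_hidden, dlg_hidden), nn.Tanh(),
                          nn.Linear(dlg_hidden, n_values))
            for n_values in slot_vocab_sizes
        ])
        # PolicyNet: single-hidden-layer MLP over [s_k, v_k, E_k].
        belief_dim = sum(slot_vocab_sizes)
        self.policy = nn.Sequential(
            nn.Linear(dlg_hidden + belief_dim + kb_summary_dim, dlg_hidden),
            nn.Tanh(),
            nn.Linear(dlg_hidden, num_actions),
        )

    def forward(self, user_tokens, prev_action_onehot, kb_summary, dlg_state=None):
        # user_tokens: (batch, seq_len) token ids for the current user turn.
        emb = self.embed(user_tokens)
        _, (h_n, _) = self.utt_encoder(emb)
        u_k = torch.cat([h_n[0], h_n[1]], dim=-1)                  # U_k
        h, c = self.dialogue_lstm(torch.cat([u_k, prev_action_onehot], dim=-1),
                                  dlg_state)                       # s_k
        slot_logits = [head(h) for head in self.slot_heads]        # -> P(l^m_k | .)
        v_k = torch.cat([torch.log_softmax(sl, dim=-1) for sl in slot_logits], dim=-1)
        action_logits = self.policy(torch.cat([h, v_k, kb_summary], dim=-1))
        return slot_logits, action_logits, (h, c)
```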

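The three training stages described above can be summarized by two objectives: a supervised loss that linearly interpolates the DST and system-action cross-entropies (used for pre-training and, during imitation learning, for fine-tuning on the aggregated user-corrected dialogues), and a REINFORCE objective driven by the end-of-dialogue user feedback with a per-turn step penalty. The sketch below follows the same assumptions as the model sketch; the interpolation weight `alpha`, the `step_penalty` value, and the undiscounted return are illustrative choices rather than the paper's reported hyperparameters.

```python
# Illustrative training objectives (assumed hyperparameters, not the paper's).
import torch
import torch.nn.functional as F


def supervised_loss(slot_logits, slot_labels, action_logits, action_label, alpha=0.5):
    """Linear interpolation of the DST and system-action cross-entropy losses."""
    dst_loss = sum(F.cross_entropy(logits, labels)
                   for logits, labels in zip(slot_logits, slot_labels))
    act_loss = F.cross_entropy(action_logits, action_label)
    return alpha * dst_loss + (1.0 - alpha) * act_loss


def reinforce_loss(action_log_probs, task_success, step_penalty=0.1):
    """REINFORCE objective for one dialogue episode.

    action_log_probs: list of log pi(a_k | .) values for the actions taken.
    task_success: 1.0 if the user marked the dialogue successful, else 0.0.
    """
    num_turns = len(action_log_probs)
    returns, g = [], 0.0
    for k in reversed(range(num_turns)):
        # Each turn pays a step penalty; the user feedback arrives at the end.
        reward = -step_penalty + (task_success if k == num_turns - 1 else 0.0)
        g = reward + g  # undiscounted return-to-go
        returns.append(g)
    returns.reverse()
    returns = torch.tensor(returns)
    log_probs = torch.stack(action_log_probs)
    # Policy gradient: maximize expected return, i.e. minimize -log pi * G.
    return -(log_probs * returns).mean()
```
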
The proposed method was evaluated on the DSTC2 dataset in the restaurant search domain and on an internally collected dialogue corpus in the movie booking domain. The movie booking corpus averages 8.4 turns per dialogue, with 100K dialogues in the training set and 10K dialogues each in the development and test sets.

Experimental results demonstrate the effectiveness of the proposed approach:

  • The SL model achieves near state-of-the-art DST results on the DSTC2 corpus.
  • In the movie booking domain, the SL model achieves strong performance on both individual and joint slot tracking, reaching a joint accuracy of 84.57%.
  • Interactive learning with imitation and reinforcement learning improves the task success rate, reduces the average number of turns per dialogue, and enhances DST accuracy.
  • The results suggest that imitation learning with human teaching effectively adapts the supervised training model to the dialogue state distribution during user interactions.
  • RL optimization further improves dialogue state tracking performance and dialogue policy.
  • End-to-end RL optimization achieves higher dialogue task success rates compared to policy-only training.
  • Human evaluations using Amazon Mechanical Turk indicate that interactive learning with imitation and reinforcement learning improves the quality of the model, with mean human scores increasing from 3.987 for the SL model to 4.603 for the SL + IL + RL model.

In conclusion, the paper presents a hybrid learning approach for training task-oriented dialogue systems that leverages both imitation and reinforcement learning. The proposed method effectively addresses the dialogue state distribution mismatch issue and enables the agent to learn from human teaching and feedback, leading to improved task success rates and overall system performance.

Authors (5)
  1. Bing Liu (211 papers)
  2. Gokhan Tur (47 papers)
  3. Pararth Shah (13 papers)
  4. Larry Heck (41 papers)
  5. Dilek Hakkani-Tur (94 papers)
Citations (151)