On-line Active Reward Learning for Policy Optimisation in Spoken Dialogue Systems
This paper presents an approach to optimizing dialogue policies with reinforcement learning (RL) in spoken dialogue systems (SDS). The authors propose an on-line learning framework that combines active reward learning with policy optimization, targeting task-oriented SDS. The central problem addressed is obtaining reliable user feedback for reward modeling, which is crucial for effective RL in dialogue systems.
The authors identify the inherent difficulty of obtaining explicit and reliable user feedback in real-world settings. Traditional RL approaches rely on predefined measures of task completion or on user ratings, but both have limitations: objective completion measures require prior knowledge of the user's goal, which is unavailable for real users, and ratings are often noisy or only partially informative. To address this, the paper introduces an active learning strategy coupled with a Gaussian process classification (GPC) model to learn the reward signal robustly from user feedback.
Key elements of the proposed approach include:
- Dialogue Representation: A recurrent neural network (RNN) encoder-decoder, trained in an unsupervised manner, maps each dialogue into a continuous-space representation. This converts variable-length dialogues into fixed-dimensional inputs suitable for consistent processing and reward estimation (a minimal encoder sketch follows this list).
- Active Learning and Gaussian Processes: A Gaussian process classifier serves as the reward model, so uncertainty in user feedback is modeled explicitly; this makes the reward estimate robust to noisy or erroneous ratings. Active learning prompts users for feedback only when the model is uncertain, reducing unnecessary queries and improving data efficiency (see the query-or-trust sketch after this list).
- Policy Training: The dialogue policy is optimized with GP-SARSA, a sample-efficient algorithm, so the system can learn effective dialogue strategies from a limited number of interactions (a simplified illustration of the GP value model follows this list).
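The dialogue-representation step can be illustrated with a minimal sketch. The PyTorch code below, with hypothetical dimensions, shows only the encoding half of an RNN encoder-decoder: a single LSTM reads per-turn feature vectors, and its final hidden state serves as the fixed-dimensional dialogue embedding. The authors' actual architecture and its unsupervised (reconstruction-based) training are not reproduced here.

```python
import torch
import torch.nn as nn

class DialogueEncoder(nn.Module):
    """Minimal sketch: compress a variable-length sequence of per-turn
    feature vectors into one fixed-dimensional dialogue embedding."""
    def __init__(self, turn_dim=50, embed_dim=32):   # hypothetical sizes
        super().__init__()
        self.rnn = nn.LSTM(turn_dim, embed_dim, batch_first=True)

    def forward(self, turns):             # turns: (batch, n_turns, turn_dim)
        _, (h_last, _) = self.rnn(turns)
        return h_last[-1]                 # (batch, embed_dim) final hidden state

encoder = DialogueEncoder()
dialogue = torch.randn(1, 7, 50)          # one dialogue of 7 turns
embedding = encoder(dialogue)             # 32-d representation, regardless of length
```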
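To make the active reward-learning idea concrete, the following sketch uses scikit-learn's Gaussian process classifier as a stand-in for the paper's reward model. The embeddings, labels, and confidence threshold are illustrative placeholders; the point is the decision rule: when the model is confident about task success it supplies the reward itself, and only uncertain dialogues trigger a user query.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessClassifier
from sklearn.gaussian_process.kernels import RBF

# Placeholder data: fixed-dimensional dialogue embeddings with binary
# success labels gathered from earlier user-rated dialogues.
rng = np.random.default_rng(0)
X_seen = rng.normal(size=(40, 32))           # 40 dialogues, 32-d embeddings
y_seen = rng.integers(0, 2, size=40)         # 1 = task success, 0 = failure

reward_model = GaussianProcessClassifier(kernel=1.0 * RBF(length_scale=1.0))
reward_model.fit(X_seen, y_seen)

def reward_or_query(embedding, threshold=0.85):
    """Return (reward, ask_user). Use the GP's label when its predictive
    probability is confident; otherwise flag the dialogue for an explicit
    user rating (the active-learning query)."""
    p_success = reward_model.predict_proba(embedding.reshape(1, -1))[0, 1]
    if max(p_success, 1.0 - p_success) >= threshold:
        return (1 if p_success >= 0.5 else 0), False   # trust the model
    return None, True                                   # ask the user

reward, ask_user = reward_or_query(rng.normal(size=32))
```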
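GP-SARSA itself relies on temporal-difference updates with sparse on-line Gaussian processes and is not reproduced here. The simplified sketch below only illustrates the underlying idea that makes GP-based policy learning sample-efficient: a Gaussian process over belief-action features yields both a mean Q-estimate and a variance, and sampling from that posterior drives exploration. The dimensions, action set, and use of scikit-learn are illustrative assumptions, not the authors' implementation.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

N_ACTIONS, BELIEF_DIM = 5, 8                  # hypothetical sizes

def phi(belief, action):
    """Concatenate a belief-state vector with a one-hot action encoding."""
    return np.concatenate([belief, np.eye(N_ACTIONS)[action]])

# Placeholder experience: belief-action features paired with observed returns.
rng = np.random.default_rng(1)
X_exp = np.stack([phi(rng.random(BELIEF_DIM), a)
                  for a in rng.integers(0, N_ACTIONS, size=30)])
y_exp = rng.random(30)                        # illustrative returns only

q_model = GaussianProcessRegressor(kernel=1.0 * RBF(length_scale=1.0), alpha=1e-2)
q_model.fit(X_exp, y_exp)

def select_action(belief):
    """Sample a Q-value for each action from the GP posterior and act
    greedily on the sample, so posterior variance drives exploration."""
    X = np.stack([phi(belief, a) for a in range(N_ACTIONS)])
    mean, std = q_model.predict(X, return_std=True)
    return int(np.argmax(rng.normal(mean, std)))

action = select_action(rng.random(BELIEF_DIM))
```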
Experimental results from live interactions with users demonstrate that the proposed system reduces annotation cost and handles noisy feedback effectively. Comparisons with other reward-modeling approaches, such as relying directly on subjective ratings or on models trained off-line with simulated data, highlight the advantages of the on-line GP system, particularly its higher success rates and reduced labeling effort.
The implications of this research are significant for real-world SDS applications. The framework decreases reliance on large annotated datasets and costly simulators, presenting a scalable solution for deploying dialogue systems that continuously learn and adapt from real user interactions. The combination of Bayesian modeling for uncertainty estimation and unsupervised neural network-based representations promises advancements in the robustness and efficiency of SDS policy training.
Looking ahead, this research opens avenues for reward functions that extend beyond task completion. Future work may integrate additional dimensions of dialogue quality, such as user satisfaction, to further improve policy optimization in SDS. Exploring the theoretical properties and the applicability of the proposed methods across varied domains could also deepen our understanding of adaptive dialogue systems.