On-line Active Reward Learning for Policy Optimisation in Spoken Dialogue Systems
This paper presents an approach to optimizing dialogue policies with reinforcement learning (RL) in spoken dialogue systems (SDS). The authors propose an on-line learning framework that combines active reward learning with policy optimization, targeting task-oriented SDS. The central problem addressed is obtaining reliable user feedback for reward modeling, which is crucial for effective RL in dialogue systems.
The authors identify the inherent difficulty of obtaining explicit and reliable user feedback in real-world settings. Traditional RL approaches rely on predefined measures of task completion or on user ratings, but both have limitations: objective completion measures require prior knowledge of the user's goal, which is unavailable for real users, and ratings are often noisy or only partially informative. To address this, the paper introduces an active learning strategy coupled with a Gaussian process classification (GPC) model to learn the reward signal robustly from user feedback.
Key elements of the proposed approach include:
- Dialogue Representation: A recurrent neural network (RNN) encoder-decoder, trained in an unsupervised manner, maps each dialogue into a continuous-space representation. This converts variable-length dialogues into fixed-dimensional inputs suitable for consistent processing and reward estimation (a minimal encoder sketch follows this list).
- Active Learning and Gaussian Processes: A Gaussian process classifier serves as the reward model, so uncertainty in user feedback is modeled explicitly; this makes the reward estimate robust to noisy or erroneous ratings. Active learning prompts users for feedback only when the model is uncertain, reducing unnecessary queries and improving data efficiency (see the query-or-trust sketch after this list).
- Policy Training: The dialogue policy is optimized with GP-SARSA, a sample-efficient algorithm, so the system can learn effective dialogue strategies from a limited number of interactions (a simplified illustration of the GP value model follows this list).
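The dialogue-representation step can be illustrated with a minimal sketch. The PyTorch code below, with hypothetical dimensions, shows only the encoding half of an RNN encoder-decoder: a single LSTM reads per-turn feature vectors, and its final hidden state serves as the fixed-dimensional dialogue embedding. The authors' actual architecture and its unsupervised (reconstruction-based) training are not reproduced here.

```python
import torch
import torch.nn as nn

class DialogueEncoder(nn.Module):
    """Minimal sketch: compress a variable-length sequence of per-turn
    feature vectors into one fixed-dimensional dialogue embedding."""
    def __init__(self, turn_dim=50, embed_dim=32):   # hypothetical sizes
        super().__init__()
        self.rnn = nn.LSTM(turn_dim, embed_dim, batch_first=True)

    def forward(self, turns):             # turns: (batch, n_turns, turn_dim)
        _, (h_last, _) = self.rnn(turns)
        return h_last[-1]                 # (batch, embed_dim) final hidden state

encoder = DialogueEncoder()
dialogue = torch.randn(1, 7, 50)          # one dialogue of 7 turns
embedding = encoder(dialogue)             # 32-d representation, regardless of length
```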
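To make the active reward-learning idea concrete, the following sketch uses scikit-learn's Gaussian process classifier as a stand-in for the paper's reward model. The embeddings, labels, and confidence threshold are illustrative placeholders; the point is the decision rule: when the model is confident about task success it supplies the reward itself, and only uncertain dialogues trigger a user query.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessClassifier
from sklearn.gaussian_process.kernels import RBF

# Placeholder data: fixed-dimensional dialogue embeddings with binary
# success labels gathered from earlier user-rated dialogues.
rng = np.random.default_rng(0)
X_seen = rng.normal(size=(40, 32))           # 40 dialogues, 32-d embeddings
y_seen = rng.integers(0, 2, size=40)         # 1 = task success, 0 = failure

reward_model = GaussianProcessClassifier(kernel=1.0 * RBF(length_scale=1.0))
reward_model.fit(X_seen, y_seen)

def reward_or_query(embedding, threshold=0.85):
    """Return (reward, ask_user). Use the GP's label when its predictive
    probability is confident; otherwise flag the dialogue for an explicit
    user rating (the active-learning query)."""
    p_success = reward_model.predict_proba(embedding.reshape(1, -1))[0, 1]
    if max(p_success, 1.0 - p_success) >= threshold:
        return (1 if p_success >= 0.5 else 0), False   # trust the model
    return None, True                                   # ask the user

reward, ask_user = reward_or_query(rng.normal(size=32))
```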
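GP-SARSA itself relies on temporal-difference updates with sparse on-line Gaussian processes and is not reproduced here. The simplified sketch below only illustrates the underlying idea that makes GP-based policy learning sample-efficient: a Gaussian process over belief-action features yields both a mean Q-estimate and a variance, and sampling from that posterior drives exploration. The dimensions, action set, and use of scikit-learn are illustrative assumptions, not the authors' implementation.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

N_ACTIONS, BELIEF_DIM = 5, 8                  # hypothetical sizes

def phi(belief, action):
    """Concatenate a belief-state vector with a one-hot action encoding."""
    return np.concatenate([belief, np.eye(N_ACTIONS)[action]])

# Placeholder experience: belief-action features paired with observed returns.
rng = np.random.default_rng(1)
X_exp = np.stack([phi(rng.random(BELIEF_DIM), a)
                  for a in rng.integers(0, N_ACTIONS, size=30)])
y_exp = rng.random(30)                        # illustrative returns only

q_model = GaussianProcessRegressor(kernel=1.0 * RBF(length_scale=1.0), alpha=1e-2)
q_model.fit(X_exp, y_exp)

def select_action(belief):
    """Sample a Q-value for each action from the GP posterior and act
    greedily on the sample, so posterior variance drives exploration."""
    X = np.stack([phi(belief, a) for a in range(N_ACTIONS)])
    mean, std = q_model.predict(X, return_std=True)
    return int(np.argmax(rng.normal(mean, std)))

action = select_action(rng.random(BELIEF_DIM))
```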
Experimental results from live interactions with users demonstrate that the proposed system reduces annotation cost and handles noisy feedback effectively. Comparisons with other reward-modeling approaches, such as relying directly on subjective ratings or on models trained off-line with simulated data, highlight the advantages of the on-line GP system, particularly its higher success rates and reduced labeling effort.
The implications of this research are significant for real-world SDS applications. The framework decreases reliance on large annotated datasets and costly simulators, presenting a scalable solution for deploying dialogue systems that continuously learn and adapt from real user interactions. The combination of Bayesian modeling for uncertainty estimation and unsupervised neural network-based representations promises advancements in the robustness and efficiency of SDS policy training.
Looking ahead, this research opens avenues for reward functions that extend beyond task completion. Future work may integrate additional dimensions of dialogue quality, such as user satisfaction, to further improve policy optimization in SDS. Exploring the theoretical properties and the applicability of the proposed methods across varied domains could also deepen our understanding of adaptive dialogue systems.