
Way Off-Policy Batch Deep Reinforcement Learning of Implicit Human Preferences in Dialog (1907.00456v2)

Published 30 Jun 2019 in cs.LG, cs.AI, and stat.ML

Abstract: Most deep reinforcement learning (RL) systems are not able to learn effectively from off-policy data, especially if they cannot explore online in the environment. These are critical shortcomings for applying RL to real-world problems where collecting data is expensive, and models must be tested offline before being deployed to interact with the environment -- e.g. systems that learn from human interaction. Thus, we develop a novel class of off-policy batch RL algorithms, which are able to effectively learn offline, without exploring, from a fixed batch of human interaction data. We leverage models pre-trained on data as a strong prior, and use KL-control to penalize divergence from this prior during RL training. We also use dropout-based uncertainty estimates to lower bound the target Q-values as a more efficient alternative to Double Q-Learning. The algorithms are tested on the problem of open-domain dialog generation -- a challenging reinforcement learning problem with a 20,000-dimensional action space. Using our Way Off-Policy algorithm, we can extract multiple different reward functions post-hoc from collected human interaction data, and learn effectively from all of these. We test the real-world generalization of these systems by deploying them live to converse with humans in an open-domain setting, and demonstrate that our algorithm achieves significant improvements over prior methods in off-policy batch RL.

An Analysis of Way Off-Policy Batch Deep Reinforcement Learning of Implicit Human Preferences in Dialog

The paper "Way Off-Policy Batch Deep Reinforcement Learning of Implicit Human Preferences in Dialog" presents a methodological advancement in the field of reinforcement learning (RL), specifically addressing the challenge of off-policy batch reinforcement learning (BRL) where models need to learn from a fixed batch of historical data without the opportunity to explore.

Key Contributions

The paper introduces a class of off-policy batch RL algorithms capable of learning effectively from human interaction data without active exploration. Central to this approach is a model pre-trained on large datasets, which serves as a strong prior; KL-control is then used to penalize the policy's divergence from this prior during RL training.
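
For concreteness, here is a minimal sketch of that KL-control penalty in PyTorch. The function name, tensor shapes, and the kl_weight coefficient are illustrative assumptions rather than the paper's actual code; the idea is simply to subtract the policy-versus-prior log-probability gap from the per-step reward.

```python
import torch
import torch.nn.functional as F

def kl_penalized_reward(reward, policy_logits, prior_logits, action, kl_weight=0.1):
    """Penalize the reward by how far the learned policy strays from the pre-trained prior.

    reward:        [batch] per-step rewards extracted from the interaction data
    policy_logits: [batch, vocab] logits of the policy being trained
    prior_logits:  [batch, vocab] logits of the frozen pre-trained prior
    action:        [batch] indices of the actions actually taken in the logged data
    """
    log_pi = F.log_softmax(policy_logits, dim=-1)
    log_prior = F.log_softmax(prior_logits, dim=-1)
    # log pi(a|s) - log p_prior(a|s) for the logged action: a sample-based KL term.
    gap = (log_pi.gather(-1, action.unsqueeze(-1)) -
           log_prior.gather(-1, action.unsqueeze(-1))).squeeze(-1)
    return reward - kl_weight * gap
```

In this framing, the coefficient trades off maximizing the extracted reward against staying close to plausible, human-like utterances from the prior.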

  1. KL-Control and Pre-trained Priors: The core contribution is the use of KL-control to penalize policy divergence from a pre-trained prior, keeping the learned policy close to realistic data distributions. This is particularly effective in avoiding overestimation bias, a common pitfall in batch RL.
  2. Dropout-Based Uncertainty Estimates: The research employs dropout-based uncertainty estimates to compute a lower bound on target Q-values, a computationally cheaper alternative to Double Q-Learning (a rough sketch follows this list).
  3. Application to Dialog Systems: Applying these methods, the paper tackles open-domain dialog generation, a problem with a vast action space (20,000-dimensional, per the abstract), and highlights the algorithm's ability to learn from implicit human preferences expressed in dialog interactions.
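
As referenced in point 2, below is a rough sketch of how dropout-based uncertainty can yield a pessimistic target Q-value. The exact statistic the paper uses for its lower bound is not restated here; this sketch assumes mean minus one standard deviation over Monte Carlo dropout samples, and the network and argument names are placeholders.

```python
import torch

def lower_bound_q_target(target_q_net, next_state, num_samples=10, num_std=1.0):
    """Monte Carlo dropout estimate of a lower bound on target Q-values."""
    target_q_net.train()  # keep dropout layers stochastic for MC sampling
    with torch.no_grad():
        # Repeated stochastic forward passes give a distribution over Q estimates.
        samples = torch.stack([target_q_net(next_state) for _ in range(num_samples)])
    q_mean = samples.mean(dim=0)
    q_std = samples.std(dim=0)
    # Pessimistic target: discount actions whose value estimates are uncertain.
    return q_mean - num_std * q_std
```

This plays a role analogous to the second estimator in Double Q-Learning: it dampens the maximization bias that otherwise compounds when bootstrapping from a fixed batch.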

Significant Findings

When the algorithms were tested in live, real-world dialog settings, they showed significant improvements over existing off-policy methods, particularly as measured by explicit human feedback and qualitative responses. The KL-control mechanism, in conjunction with an accurate prior, enabled the dialog systems to engage meaningfully by recognizing implicit signals such as sentiment and conversation length.

Theoretical and Practical Implications

Theoretically, this research strengthens the robustness of batch RL methods by integrating prior knowledge while mitigating overestimation bias. Practically, the implications are far-reaching for dialog systems that interact with humans in unpredictable, dynamic environments: the approach lets models refine policies offline on pre-collected data without risking real-world failures, which is crucial for deploying AI in sensitive or safety-critical applications.

Speculation on Future Developments

The paper's approach suggests potential expansion into other generative tasks beyond dialog systems, where large action spaces present similar RL challenges. Additionally, further exploration of KL-control's applications in multi-modal interaction settings could yield interesting insights into collaborative human-machine interaction paradigms.

In conclusion, this paper advances both the theoretical framework and practical application of off-policy batch RL, demonstrating substantial improvements in the capability of models to understand and adapt to human preferences through efficient offline learning. Such developments pave the way for more human-aligned AI systems across diverse domains.

Authors (8)
  1. Natasha Jaques (32 papers)
  2. Asma Ghandeharioun (19 papers)
  3. Judy Hanwen Shen (21 papers)
  4. Craig Ferguson (6 papers)
  5. Agata Lapedriza (26 papers)
  6. Noah Jones (3 papers)
  7. Shixiang Gu (23 papers)
  8. Rosalind Picard (26 papers)
Citations (321)