Interactive Learning from Policy-Dependent Human Feedback (1701.06049v2)

Published 21 Jan 2017 in cs.AI

Abstract: This paper investigates the problem of interactively learning behaviors communicated by a human teacher using positive and negative feedback. Much previous work on this problem has made the assumption that people provide feedback for decisions that is dependent on the behavior they are teaching and is independent from the learner's current policy. We present empirical results that show this assumption to be false -- whether human trainers give a positive or negative feedback for a decision is influenced by the learner's current policy. Based on this insight, we introduce Convergent Actor-Critic by Humans (COACH), an algorithm for learning from policy-dependent feedback that converges to a local optimum. Finally, we demonstrate that COACH can successfully learn multiple behaviors on a physical robot.

Authors (8)
  1. James MacGlashan (5 papers)
  2. Robert Loftin (12 papers)
  3. Bei Peng (34 papers)
  4. Guan Wang (52 papers)
  5. David Roberts (12 papers)
  6. Matthew E. Taylor (69 papers)
  7. Michael L. Littman (50 papers)
  8. Mark K. Ho (2 papers)
Citations (281)

Summary

Interactive Learning from Policy-Dependent Human Feedback

The paper "Interactive Learning from Policy-Dependent Human Feedback" addresses the intricacies of reinforcement learning (RL) in environments where human trainers provide feedback aimed at guiding agent behavior. Previous assumptions in this domain have typically posited that human feedback is independent of the learner's current policy, focusing instead on individual actions as independent entities. The core contribution of this research is the presentation of evidence that human feedback is indeed policy-dependent, motivating the introduction of the COACH (Convergent Actor-Critic by Humans) algorithm. COACH is designed to leverage this feedback dependency by converging to a local optimum through the incorporation of the advantage function, promoting improvements in action selection relative to the agent's current policy.

Experimental Insights

The empirical findings reveal that human trainers adjust their feedback not only in response to specific actions but also in light of the agent's observed behavior over time. This contradicts the policy-independence assumption built into prior algorithms, which can therefore produce suboptimal learning outcomes. In the conducted studies, participants' feedback for a given decision varied with whether the agent's performance was perceived as improving or degrading, a nuance that previous algorithms failed to account for.

COACH Algorithm

At the heart of COACH is an actor-critic framework modified to exploit policy-dependent feedback. The advantage function estimates how much an action improves on the current policy, and this aligns with how human trainers naturally provide guidance: advantage-like feedback exhibits properties such as diminishing returns and differential feedback that are emphasized in behavior analysis. The authors present a general update rule with a convergence guarantee in which human feedback is treated as an unbiased estimate of the advantage function, enabling policy improvement without a separately learned critic in the traditional sense.
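
A minimal sketch may make the resulting update concrete. The code below assumes a linear softmax policy over discrete actions and treats each scalar feedback signal (e.g. +1 or -1) as an advantage estimate; the function names, feature representation, and learning rate are illustrative rather than taken from the authors' implementation.

```python
import numpy as np

def softmax_policy(theta, features):
    """Action probabilities for a linear softmax policy.
    features: (n_actions, n_params) array of state-action features."""
    logits = features @ theta
    logits -= logits.max()            # numerical stability
    probs = np.exp(logits)
    return probs / probs.sum()

def grad_log_pi(theta, features, action):
    """Gradient of log pi(action | state) for the linear softmax policy."""
    probs = softmax_policy(theta, features)
    return features[action] - probs @ features

def coach_update(theta, features, action, feedback, alpha=0.05):
    """COACH-style step: the trainer's feedback stands in for the advantage
    of the chosen action, so parameters move along feedback * grad log pi."""
    return theta + alpha * feedback * grad_log_pi(theta, features, action)
```

Positive feedback makes the chosen action more probable in that state and negative feedback makes it less probable, with no learned value function serving as the critic.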

Practical Robotic Application

The paper extends COACH to a practical robotics scenario, using a TurtleBot to demonstrate several behaviors. These range from straightforward tasks such as object avoidance to more complex, compositional behaviors like alternating between targets. Notably, Real-time COACH performs robustly, adapting agent behavior to the challenges of real-world operation, including rapid decision cycles and partial observability. The algorithm's ability to generalize beyond simulated environments to physical robots underscores its potential applicability across practical human-robot interaction scenarios.
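
The real-time setting mainly adds bookkeeping for human reaction delay and fast decision cycles. The sketch below is an illustration under assumed hyperparameters, not the authors' code: it holds back recent policy gradients for a fixed reaction delay, folds them into a decaying eligibility trace, and applies any incoming feedback to that trace so a slightly late signal still credits the actions that triggered it.

```python
from collections import deque
import numpy as np

class RealTimeCoachSketch:
    """Illustrative real-time variant: gradients of recent actions are held
    back for a fixed reaction delay, folded into a decaying eligibility
    trace, and reinforced whenever nonzero feedback arrives."""

    def __init__(self, n_params, alpha=0.05, lam=0.9, delay=2):
        self.theta = np.zeros(n_params)
        self.trace = np.zeros(n_params)
        self.alpha, self.lam, self.delay = alpha, lam, delay
        self.recent_grads = deque(maxlen=delay + 1)

    def step(self, grad_log_pi, feedback=0.0):
        self.recent_grads.append(grad_log_pi)
        if len(self.recent_grads) > self.delay:
            # the gradient from `delay` steps ago becomes eligible for credit
            self.trace = self.lam * self.trace + self.recent_grads[0]
        if feedback != 0.0:
            self.theta += self.alpha * feedback * self.trace
        return self.theta
```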

Comparative Analysis

In experimental comparisons, COACH exhibits a distinct advantage over algorithms like TAMER and traditional Q-learning. TAMER, while effective under certain feedback models, struggles with feedback-driven policy forgetting and complexities arising from unintended feedback interpretations. COACH, by contrast, sustains learning through its nuanced handling of human feedback, avoiding pitfalls like positive cycling and policy forgetting. The algorithm also accommodates advanced training methodologies such as lure training, demonstrating flexibility and robustness in realizing desired behavior.

Implications and Future Directions

The introduction of a policy-aware feedback model prompts reconsideration of how human trainers should interact with learning systems, potentially allowing for more effective human-agent collaboration. The implications for designing interactive learning systems are profound, suggesting a paradigm where machine learning practitioners must consider the alignment of human cognitive models with algorithmic feedback mechanisms.

Future research could explore optimizing COACH's integration with demonstration-based learning, facilitating richer, hybrid learning systems that merge interactive feedback with observed demonstrations. Additionally, exploring human feedback dynamics further, particularly understanding the discrepancies between perceived and actual agent policies, promises to refine algorithmic design even further.

In essence, the paper underscores the necessity of algorithms that understand and adapt to the subtleties of human feedback dynamics, setting a foundation for more sophisticated, human-compatible learning models in the space of interactive reinforcement learning.