
Learning from Active Human Involvement through Proxy Value Propagation (2502.03369v1)

Published 5 Feb 2025 in cs.AI and cs.RO

Abstract: Learning from active human involvement enables the human subject to actively intervene and demonstrate to the AI agent during training. The interaction and corrective feedback from human brings safety and AI alignment to the learning process. In this work, we propose a new reward-free active human involvement method called Proxy Value Propagation for policy optimization. Our key insight is that a proxy value function can be designed to express human intents, wherein state-action pairs in the human demonstration are labeled with high values, while those agents' actions that are intervened receive low values. Through the TD-learning framework, labeled values of demonstrated state-action pairs are further propagated to other unlabeled data generated from agents' exploration. The proxy value function thus induces a policy that faithfully emulates human behaviors. Human-in-the-loop experiments show the generality and efficiency of our method. With minimal modification to existing reinforcement learning algorithms, our method can learn to solve continuous and discrete control tasks with various human control devices, including the challenging task of driving in Grand Theft Auto V. Demo video and code are available at: https://metadriverse.github.io/pvp

Summary

  • The paper introduces PVP, a method that encodes human intent by assigning high Q-values to human-demonstrated actions and low Q-values to intervened actions.
  • It employs a balanced dual-buffer strategy and temporal difference learning to propagate proxy values from sparsely labeled transitions.
  • Experimental results show PVP achieves superior efficiency across diverse tasks compared to traditional methods like TD3 and DQN.

The paper introduces Proxy Value Propagation (PVP), a reward-free, human-in-the-loop policy optimization method designed to learn from active human involvement. The core idea revolves around learning a proxy value function that encodes human intents, guiding policy learning to emulate human behaviors.

The PVP method assigns high Q values to state-action pairs demonstrated by humans and low Q values to agent actions that are subsequently intervened upon by human subjects. This proxy value function is then propagated to unlabeled state-action pairs generated by the agent's exploration using Temporal Difference (TD) learning. The method is designed to be integrated into existing value-based RL algorithms with minimal modifications.

The problem is formulated within the Markov Decision Process (MDP) framework, $\mathcal{M}=\left\langle \mathcal{S}, \mathcal{A}, \mathcal{P}, r, \gamma, d_{0}\right\rangle$, where $\mathcal{S}$ is the state space, $\mathcal{A}$ is the action space, $\mathcal{P}:\mathcal{S}\times\mathcal{A}\to\mathcal{S}$ is the state transition function, $r: \mathcal{S}\times\mathcal{A}\to[R_{\min}, R_{\max}]$ is the reward function, $\gamma\in(0,1)$ is the discount factor, and $d_0:\mathcal{S}\to[0,1]$ is the initial state distribution. Unlike conventional RL, PVP operates without a predefined reward function, instead using human interventions $I(s, a)$ and demonstrations $a_h \sim \pi_h(\cdot|s)$ as the sole sources of supervision.

PVP transforms a value-based RL method into a human-in-the-loop policy optimization method. During training, a human subject supervises the agent-environment interaction. The agent's exploratory transitions are stored in the novice buffer $\mathcal{B}_{n} = \{(s, a_n, s')\}$. At any time, the human subject can intervene in the agent's free exploration by pressing a button on the control device. While the button is held, the human takes over control and demonstrates how to behave. During human involvement, both the human and novice actions are recorded in the human buffer $\mathcal{B}_{h} = \{(s, a_n, a_h, s')\}$.
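The following sketch illustrates this data-collection scheme. The `env`, `novice_policy`, and `human_device` interfaces are hypothetical stand-ins, not the API of the paper's released code.

```python
# Minimal sketch of PVP-style data collection under active human involvement.
# `env`, `novice_policy`, and `human_device` are assumed interfaces (not the
# paper's codebase). While the takeover button is held, the human action is
# executed and both actions are stored in the human buffer; otherwise the
# novice acts freely and the transition goes to the novice buffer.

novice_buffer = []   # stores (s, a_n, s')
human_buffer = []    # stores (s, a_n, a_h, s')

s = env.reset()
for step in range(total_steps):
    a_n = novice_policy(s)                   # novice (agent) proposal
    if human_device.button_pressed():        # human takes over
        a_h = human_device.read_action()     # human demonstration
        s_next, done = env.step(a_h)         # human action is applied
        human_buffer.append((s, a_n, a_h, s_next))
    else:                                    # free exploration
        s_next, done = env.step(a_n)
        novice_buffer.append((s, a_n, s_next))
    s = env.reset() if done else s_next
```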

The overall loss function is $J(\theta) = J^{\text{PV}}(\theta) + J^{\text{TD}}(\theta)$, where $J^{\text{PV}}(\theta)$ is the proxy value loss and $J^{\text{TD}}(\theta)$ is the TD loss.

The proxy value loss is defined as:

$$J^{\text{PV}}(\theta) = \mathbb{E}_{(s, a_n, a_h)} \left[ \left| Q_\theta(s, a_h) - 1 \right|^2 + \left| Q_\theta(s, a_n) + 1 \right|^2 \right] I(s, a_n),$$

where $Q_\theta$ is the Q network parameterized by $\theta$, $a_h$ is the human action, $a_n$ is the novice action, and $I(s, a_n)$ is the intervention policy.
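A minimal sketch of this loss, assuming a continuous-action critic `q_net(s, a)` as in the TD3 instantiation; the function name and signature are illustrative, not from the paper's code.

```python
import torch

def proxy_value_loss(q_net, s, a_n, a_h):
    """Proxy value loss on human-involved transitions (sketch).
    Human actions are pushed toward Q = +1 and intervened novice actions
    toward Q = -1. The intervention term is implicitly 1 here because every
    sample is assumed to come from the human buffer, i.e. collected under
    intervention."""
    q_h = q_net(s, a_h)   # Q-values of human demonstrations
    q_n = q_net(s, a_n)   # Q-values of the intervened novice actions
    return ((q_h - 1.0) ** 2 + (q_n + 1.0) ** 2).mean()
```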

The TD loss is defined as:

$$J^{\text{TD}}(\theta) = \mathbb{E}_{(s, a, s')} \left| Q_\theta(s, a) - \gamma \max_{a'} Q_{\hat{\theta}}(s', a') \right|^2,$$

where $\gamma$ is the discount factor and $Q_{\hat{\theta}}$ is a delay-updated target network.
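A sketch of the reward-free TD loss under the same assumed continuous-action setup. The max over next actions is approximated by a target actor's action, as TD3 does; a DQN-style variant would instead take the max over the discrete action dimension. All names are illustrative.

```python
import torch

def td_loss(q_net, target_q_net, target_actor, s, a, s_next, gamma=0.99):
    """Reward-free TD loss (sketch): the bootstrapped target contains no
    environment reward, only the discounted target-network value, so labeled
    proxy values propagate to unlabeled transitions."""
    with torch.no_grad():
        a_next = target_actor(s_next)                 # stand-in for argmax_a' Q
        target = gamma * target_q_net(s_next, a_next) # note: no reward term
    return ((q_net(s, a) - target) ** 2).mean()
```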

The PVP objective bears resemblance to Conservative Q-Learning (CQL): it can be viewed as augmenting a CQL-style objective with an $L_2$ regularization term on the Q-values of human-involved transitions. This regularization bounds the Q-values, preventing unbounded growth and potential overfitting.
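One way to see this connection is to expand the per-sample proxy value loss directly:

$$\left|Q_\theta(s,a_h)-1\right|^2 + \left|Q_\theta(s,a_n)+1\right|^2 = 2\bigl(Q_\theta(s,a_n) - Q_\theta(s,a_h)\bigr) + \bigl(Q_\theta(s,a_h)^2 + Q_\theta(s,a_n)^2\bigr) + 2,$$

where the first term lowers the Q-value of the intervened novice action relative to the human action (a CQL-like conservative term) and the second is the $L_2$ regularizer on the Q-values.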

The method employs balanced buffers to address the sparsity of intervention signals as the agent learns. Interventions gradually become sparse as the agent improves and requires less human correction; however, these sparse intervention signals carry especially important information about how to behave in critical situations. In each training iteration, the method therefore samples two equally sized batches, one from the human buffer and one from the novice buffer, to balance the transitions, as sketched below.
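A sketch of one training iteration with balanced sampling, reusing the hypothetical `proxy_value_loss` and `td_loss` helpers above; `sample` is an assumed batch sampler, not an API from the paper's code.

```python
def train_step(q_net, target_q_net, target_actor, optimizer,
               human_buffer, novice_buffer, batch_size=64):
    """One PVP-style update (sketch): sample equally sized batches from each
    buffer, apply the proxy value loss to human-involved transitions, and the
    reward-free TD loss to the executed actions from both buffers."""
    s_h, a_n_h, a_h, s_next_h = sample(human_buffer, batch_size)   # hypothetical sampler
    s_n, a_n, s_next_n = sample(novice_buffer, batch_size)

    loss = proxy_value_loss(q_net, s_h, a_n=a_n_h, a_h=a_h)
    # TD loss on executed actions: human actions for human-buffer transitions,
    # novice actions for novice-buffer transitions.
    loss = loss + td_loss(q_net, target_q_net, target_actor, s_h, a_h, s_next_h)
    loss = loss + td_loss(q_net, target_q_net, target_actor, s_n, a_n, s_next_n)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```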

The method was evaluated in continuous and discrete action spaces by instantiating PVP on TD3 and DQN, respectively. These deterministic base algorithms were chosen because, according to human-subject feedback, a stochastic novice causes excessive fatigue: it is difficult to monitor and correct the agent's noisy actions.

Experiments were conducted on a range of control tasks, including continuous control tasks in MetaDrive, CARLA Town01, and Grand Theft Auto V (GTA V), and a discrete control task in MiniGrid Two Room. These environments provide varied observation spaces (sensory state vectors, bird's-eye-view images, semantic maps) and require different levels of agent exploration and exploitation.

Results indicate that PVP achieves higher learning efficiency and performance than the baselines, demonstrating its applicability across diverse task settings and human control devices. In MetaDrive, PVP reaches an episode return of 350 within 37K steps, whereas the TD3 baseline fails to achieve comparable results even after 300K steps. In CARLA, PVP agents learn to drive within 30 minutes, whereas TD3 cannot solve the task. In GTA V, PVP solves the task using 1.2K human-involved steps and 20K total steps in 16 minutes, compared to TD3, which uses 300K steps. In the MiniGrid tasks, PVP succeeds while DQN fails, demonstrating its applicability to discrete action spaces.

A user study was conducted to assess the human experience, evaluating compliance, performance, and stress. The results indicate that PVP is more user-friendly than other human-in-the-loop methods: the deterministic novice policy alleviates jitter and unexpected behaviors, while the balanced buffers and proxy values improve the user experience in terms of compliance and performance. Ablation studies validate the importance of TD learning, the balanced buffers, and the novice buffer for the success of PVP.
