Reinforcement Learning with Human Feedback: Learning Dynamic Choices via Pessimism (2305.18438v3)
Abstract: In this paper, we study offline Reinforcement Learning with Human Feedback (RLHF), where we aim to learn the human's underlying reward and the MDP's optimal policy from a set of trajectories induced by human choices. RLHF is challenging for multiple reasons: a large state space but limited human feedback, the bounded rationality of human decisions, and off-policy distribution shift. We focus on the Dynamic Discrete Choice (DDC) model for modeling and understanding human choices. DDC, rooted in econometrics and decision theory, is widely used to model human decision-making with forward-looking behavior and bounded rationality. We propose a Dynamic-Choice-Pessimistic-Policy-Optimization (DCPPO) method, which proceeds in three stages: first, estimate the human behavior policy and the state-action value function via maximum likelihood estimation (MLE); second, recover the human reward function by minimizing the Bellman mean squared error using the learned value functions; third, plug in the learned reward and invoke pessimistic value iteration to find a near-optimal policy. With only single-policy coverage of the dataset (i.e., coverage of the optimal policy), we prove that the suboptimality of DCPPO nearly matches that of classical pessimistic offline RL algorithms in its dependence on distribution shift and dimension. To the best of our knowledge, this paper presents the first theoretical guarantees for off-policy offline RLHF under the dynamic discrete choice model.
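The three-stage pipeline in the abstract can be illustrated with a short sketch. The code below is a minimal, hedged illustration only: it assumes a softmax (logit) DDC choice model, a finite action set, and a linear parameterization with a known feature map, none of which are spelled out in the abstract. All function names (`stage1_mle_q`, `stage2_reward_from_bellman`, `stage3_pessimistic_q`) and the particular gradient and bonus forms are illustrative assumptions, not the paper's exact estimators.

```python
# Illustrative sketch of the three-stage DCPPO pipeline (assumptions noted above).
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def stage1_mle_q(phi_sa, actions, n_actions, lr=0.1, iters=500):
    """Stage 1: fit Q (and hence the behavior policy) by maximizing the
    log-likelihood of observed human choices under a softmax/DDC model.
    phi_sa: (N, d) state features; actions: (N,) observed action indices."""
    d = phi_sa.shape[-1]
    theta = np.zeros((n_actions, d))            # one weight vector per action
    for _ in range(iters):
        logits = phi_sa @ theta.T               # (N, A) estimated Q-values
        probs = softmax(logits)                 # behavior policy estimate
        onehot = np.eye(n_actions)[actions]
        grad = (onehot - probs).T @ phi_sa / len(actions)   # MLE gradient
        theta += lr * grad                      # gradient ascent on likelihood
    return theta

def stage2_reward_from_bellman(q_sa, v_next, gamma=0.99):
    """Stage 2: recover the reward by minimizing Bellman mean squared error;
    with samples this reduces to r_hat = Q(s,a) - gamma * V(s')."""
    return q_sa - gamma * v_next

def stage3_pessimistic_q(target_sa, phi_sa_all, phi_data, beta=1.0, lam=1.0):
    """Stage 3: one pessimistic backup -- subtract an elliptical uncertainty
    bonus built from the offline data covariance, then act greedily.
    target_sa: (S, A) Bellman targets r_hat(s,a) + gamma * E[V_next]."""
    d = phi_data.shape[-1]
    Lambda = lam * np.eye(d) + phi_data.T @ phi_data
    Lambda_inv = np.linalg.inv(Lambda)
    # bonus is larger where the offline data covers (s, a) poorly
    bonus = beta * np.sqrt(
        np.einsum("sad,de,sae->sa", phi_sa_all, Lambda_inv, phi_sa_all)
    )
    q_pess = target_sa - bonus                  # pessimistic Q estimate
    return q_pess.argmax(axis=-1)               # greedy policy per state
```

The key design point mirrored here is that pessimism only requires the offline data to cover the optimal policy: the covariance-based bonus penalizes poorly covered state-action pairs rather than demanding uniform coverage.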
Authors: Zihao Li, Zhuoran Yang, Mengdi Wang