Reinforcement Learning with Human Feedback: Learning Dynamic Choices via Pessimism (2305.18438v3)
Abstract: In this paper, we study offline Reinforcement Learning with Human Feedback (RLHF), where we aim to learn the human's underlying reward and the MDP's optimal policy from a set of trajectories induced by human choices. RLHF is challenging for multiple reasons: a large state space but limited human feedback, the bounded rationality of human decisions, and off-policy distribution shift. We focus on the Dynamic Discrete Choice (DDC) model for modeling and understanding human choices. DDC, rooted in econometrics and decision theory, is widely used to model human decision-making with forward-looking behavior and bounded rationality. We propose a Dynamic-Choice-Pessimistic-Policy-Optimization (DCPPO) method, which proceeds in three stages: first, estimate the human behavior policy and the state-action value function via maximum likelihood estimation (MLE); second, recover the human reward function by minimizing the Bellman mean squared error using the learned value functions; third, plug in the learned reward and invoke pessimistic value iteration to find a near-optimal policy. With only single-policy coverage of the dataset (i.e., coverage of the optimal policy), we prove that the suboptimality of DCPPO nearly matches that of classical pessimistic offline RL algorithms in its dependence on distribution shift and dimension. To the best of our knowledge, this paper presents the first theoretical guarantees for off-policy offline RLHF under the dynamic discrete choice model.
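The three-stage pipeline in the abstract can be illustrated with a short sketch. The code below is a minimal, hedged illustration only: it assumes a softmax (logit) DDC choice model, a finite action set, and a linear parameterization with a known feature map, none of which are spelled out in the abstract. All function names (`stage1_mle_q`, `stage2_reward_from_bellman`, `stage3_pessimistic_q`) and the particular gradient and bonus forms are illustrative assumptions, not the paper's exact estimators.

```python
# Illustrative sketch of the three-stage DCPPO pipeline (assumptions noted above).
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def stage1_mle_q(phi_sa, actions, n_actions, lr=0.1, iters=500):
    """Stage 1: fit Q (and hence the behavior policy) by maximizing the
    log-likelihood of observed human choices under a softmax/DDC model.
    phi_sa: (N, d) state features; actions: (N,) observed action indices."""
    d = phi_sa.shape[-1]
    theta = np.zeros((n_actions, d))            # one weight vector per action
    for _ in range(iters):
        logits = phi_sa @ theta.T               # (N, A) estimated Q-values
        probs = softmax(logits)                 # behavior policy estimate
        onehot = np.eye(n_actions)[actions]
        grad = (onehot - probs).T @ phi_sa / len(actions)   # MLE gradient
        theta += lr * grad                      # gradient ascent on likelihood
    return theta

def stage2_reward_from_bellman(q_sa, v_next, gamma=0.99):
    """Stage 2: recover the reward by minimizing Bellman mean squared error;
    with samples this reduces to r_hat = Q(s,a) - gamma * V(s')."""
    return q_sa - gamma * v_next

def stage3_pessimistic_q(target_sa, phi_sa_all, phi_data, beta=1.0, lam=1.0):
    """Stage 3: one pessimistic backup -- subtract an elliptical uncertainty
    bonus built from the offline data covariance, then act greedily.
    target_sa: (S, A) Bellman targets r_hat(s,a) + gamma * E[V_next]."""
    d = phi_data.shape[-1]
    Lambda = lam * np.eye(d) + phi_data.T @ phi_data
    Lambda_inv = np.linalg.inv(Lambda)
    # bonus is larger where the offline data covers (s, a) poorly
    bonus = beta * np.sqrt(
        np.einsum("sad,de,sae->sa", phi_sa_all, Lambda_inv, phi_sa_all)
    )
    q_pess = target_sa - bonus                  # pessimistic Q estimate
    return q_pess.argmax(axis=-1)               # greedy policy per state
```

The key design point mirrored here is that pessimism only requires the offline data to cover the optimal policy: the covariance-based bonus penalizes poorly covered state-action pairs rather than demanding uniform coverage.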
Authors: Zihao Li, Zhuoran Yang, Mengdi Wang