Reinforcement Learning with Human Feedback: Learning Dynamic Choices via Pessimism (2305.18438v3)

Published 29 May 2023 in cs.LG, cs.AI, math.OC, math.ST, stat.ML, and stat.TH

Abstract: In this paper, we study offline Reinforcement Learning with Human Feedback (RLHF) where we aim to learn the human's underlying reward and the MDP's optimal policy from a set of trajectories induced by human choices. RLHF is challenging for multiple reasons: large state space but limited human feedback, the bounded rationality of human decisions, and the off-policy distribution shift. In this paper, we focus on the Dynamic Discrete Choice (DDC) model for modeling and understanding human choices. DDC, rooted in econometrics and decision theory, is widely used to model a human decision-making process with forward-looking and bounded rationality. We propose a Dynamic-Choice-Pessimistic-Policy-Optimization (DCPPO) method. The method involves a three-stage process: the first step is to estimate the human behavior policy and the state-action value function via maximum likelihood estimation (MLE); the second step recovers the human reward function via minimizing the Bellman mean squared error using the learned value functions; the third step is to plug in the learned reward and invoke pessimistic value iteration for finding a near-optimal policy. With only single-policy coverage (i.e., coverage of the optimal policy) of the dataset, we prove that the suboptimality of DCPPO almost matches the classical pessimistic offline RL algorithm in terms of the suboptimality's dependency on distribution shift and dimension. To the best of our knowledge, this paper presents the first theoretical guarantees for off-policy offline RLHF with the dynamic discrete choice model.

Reinforcement Learning with Human Feedback Under Dynamic Choices

The paper "Reinforcement Learning with Human Feedback: Learning Dynamic Choices via Pessimism" focuses on advancing offline reinforcement learning when incorporated with human feedback (RLHF). Specifically, the paper proposes and examines the Dynamic Discrete Choice (DDC) model, which provides a way to understand human decision-making processes characterized by bounded rationality and forward-looking behavior. Rooted in econometrics, this model is particularly beneficial when rewards are not directly observable, which contrasts with typical reinforcement learning setups.

Methodological Approach

The authors introduce the Dynamic-Choice-Pessimistic-Policy-Optimization (DCPPO) method to address the challenging aspects of RLHF in large state spaces with limited human feedback and off-policy distribution shifts. The DCPPO algorithm unfolds in three pivotal stages:

  1. Behavior Policy and State-Action Value Estimation: The human behavior policy and the state-action value function are first estimated via maximum likelihood. This step draws on the Conditional Choice Probability (CCP) method, a staple of the econometrics literature.
  2. Human Reward Function Recovery: The second stage minimizes the Bellman mean squared error to estimate the reward function, leveraging the learned value functions to infer the unobservable rewards through a (ridge) linear regression.
  3. Pessimistic Policy Optimization: With the reward estimated, the final stage applies pessimistic value iteration to derive a near-optimal policy. The algorithm subtracts a penalty that counterbalances the distribution shift and accounts for the uncertainty inherent in the estimated choice model. A schematic code sketch of these three stages follows this list.
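
Below is a minimal, self-contained sketch of the three stages under linear (here, one-hot tabular) function approximation. The toy data generation, feature map, optimization details, and the penalty constant beta are assumptions made for illustration, not the paper's exact construction.

```python
# Illustrative sketch of the three DCPPO stages with one-hot features.
# Data layout, hyperparameters, and the behavior model are assumptions
# made for this toy example, not the paper's exact algorithmic choices.
import numpy as np

rng = np.random.default_rng(0)
nS, nA, n = 6, 3, 2000                # states, actions, number of transitions
gamma = 0.9                           # discount in the human's DDC model
d = nS * nA
phi = np.eye(d).reshape(nS, nA, d)    # phi(s, a) in R^d

# Offline data: (s, a, s') triples from a softmax behavior policy; the
# human's rewards are never observed.
P_true = rng.dirichlet(np.ones(nS), size=(nS, nA))
logits_b = rng.normal(size=(nS, nA))  # stand-in for the human's Q-values
pi_b = np.exp(logits_b) / np.exp(logits_b).sum(axis=1, keepdims=True)
S = rng.integers(nS, size=n)
A = np.array([rng.choice(nA, p=pi_b[s]) for s in S])
S_next = np.array([rng.choice(nS, p=P_true[s, a]) for s, a in zip(S, A)])

# Stage 1: MLE of the behavior policy / choice-specific value function.
# Under the DDC model the behavior policy is softmax in Q, so a
# multinomial-logit fit over phi(s, .) estimates Q up to normalization.
w_q = np.zeros(d)
for _ in range(500):                  # plain gradient ascent on the log-likelihood
    logits = phi[S] @ w_q             # (n, nA)
    p = np.exp(logits - logits.max(axis=1, keepdims=True))
    p /= p.sum(axis=1, keepdims=True)
    w_q += (phi[S, A] - np.einsum('na,nad->nd', p, phi[S])).mean(axis=0)
Q_hat = phi @ w_q                              # (nS, nA)
V_hat = np.log(np.exp(Q_hat).sum(axis=1))      # ex-ante value (log-sum-exp)

# Stage 2: recover the reward by minimizing the Bellman mean squared error,
# r(s, a) ~ Q_hat(s, a) - gamma * V_hat(s'), via ridge regression.
Phi = phi[S, A]                                # (n, d)
Lambda = np.eye(d) + Phi.T @ Phi               # regularized feature covariance
w_r = np.linalg.solve(Lambda, Phi.T @ (Q_hat[S, A] - gamma * V_hat[S_next]))
r_hat = phi @ w_r                              # estimated reward table

# Stage 3: pessimistic value iteration with the plugged-in reward. The
# penalty beta * ||phi(s,a)||_{Lambda^{-1}} shrinks values at state-action
# pairs the offline data cover poorly (beta is a tuning constant here).
beta = 1.0
bonus = np.sqrt(np.einsum('sad,de,sae->sa', phi, np.linalg.inv(Lambda), phi))
V = np.zeros(nS)
for _ in range(100):                  # iterate the pessimistic Bellman operator
    w_v = np.linalg.solve(Lambda, Phi.T @ V[S_next])   # backup via ridge regression
    Q = r_hat + gamma * (phi @ w_v) - beta * bonus
    V = Q.max(axis=1)
pi_hat = Q.argmax(axis=1)             # pessimistically greedy policy
print("estimated policy:", pi_hat)
```

With one-hot features the ridge estimates reduce to count-based averages, but the same three stages carry over to general linear features, which is the setting the paper analyzes.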

Results and Theoretical Guarantees

Notably, the authors prove that DCPPO matches or closely rivals established pessimistic offline RL algorithms in terms of the suboptimality bound's dependence on distribution shift and dimension. They present their results within two frameworks:

  • Linear MDP: Here, DCPPO combines MLE and ridge regression to efficiently estimate the reward function and the associated policy. The suboptimality gap achieved is of order $\mathcal{O}(n^{-1/2})$, even without observable rewards, paralleling standard offline RL results (see the display after this list).
  • Reproducing Kernel Hilbert Space (RKHS): When the reward and value functions lie in an RKHS, the algorithm establishes analogous estimation guarantees under different eigenvalue-decay conditions on the kernel. Depending on whether the eigenvalues decay polynomially or exponentially, the suboptimality bounds change accordingly, remaining robust across model complexities.
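
For intuition, in the linear setting the pessimism penalty and the resulting guarantee take the familiar form from pessimistic offline RL. The display below is schematic: $\Lambda_h$ denotes the regularized empirical feature covariance at step $h$, $\tilde{\mathcal{O}}$ hides dimension and logarithmic factors, and the paper's exact constants and indexing may differ.

```latex
\Gamma_h(s,a) = \beta\, \big\| \phi(s,a) \big\|_{\Lambda_h^{-1}},
\qquad
\Lambda_h = \lambda I + \sum_{i=1}^{n} \phi(s_h^i, a_h^i)\, \phi(s_h^i, a_h^i)^{\top},
\qquad
\mathrm{SubOpt}(\widehat{\pi}) \;\lesssim\; \sum_{h=1}^{H} \mathbb{E}_{\pi^{*}}\big[ \Gamma_h(s_h, a_h) \big] = \tilde{\mathcal{O}}\big(n^{-1/2}\big).
```

The expectation is taken along the optimal policy's trajectories, which is exactly why single-policy coverage of $\pi^{*}$ suffices for the bound.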

Implications and Future Directions

The paper substantiates DCPPO with rigorous theoretical guarantees, presenting it as the first provable method for off-policy offline RLHF under the dynamic discrete choice model. Practically, this has implications for domains that rely on human feedback but lack observable rewards, such as autonomous driving and clinical decision-making.

From a theoretical standpoint, the discussion of applying pessimism when rewards are inferred rather than directly observed contributes a novel dimension to reinforcement learning strategies. Future efforts might refine the estimation accuracy of rewards and policies in large state-action spaces or consider alternative noise models beyond the Gumbel perturbation. Additionally, improving computational efficiency in larger-scale implementations is a promising research direction.

The paper's insights into combining econometric choice models with reinforcement learning principles illuminate new pathways for crafting human-centric machine learning models that are both adaptable and theoretically sound. As AI becomes more deeply integrated into decision-support systems, understanding and strategically applying human feedback remains crucial for developing reliable autonomous systems.

Authors (3)
  1. Zihao Li (161 papers)
  2. Zhuoran Yang (155 papers)
  3. Mengdi Wang (199 papers)
Citations (38)