Provable Reward-Agnostic Preference-Based Reinforcement Learning (2305.18505v3)
Abstract: Preference-based Reinforcement Learning (PbRL) is a paradigm in which an RL agent learns to optimize a task using pairwise preference feedback over trajectories, rather than explicit reward signals. While PbRL has demonstrated practical success in fine-tuning large language models (LLMs), existing theoretical work focuses on regret minimization and fails to capture most practical frameworks. In this study, we close this gap between theoretical PbRL and practical algorithms by proposing a reward-agnostic PbRL framework in which exploratory trajectories that enable accurate learning of the hidden reward function are acquired before any human feedback is collected. Theoretical analysis shows that our algorithm requires less human feedback to learn the optimal policy under preference-based models with linear parameterization and unknown transitions than the existing theoretical literature. In particular, our framework accommodates linear and low-rank MDPs with efficient sample complexity. Additionally, we investigate reward-agnostic RL with action-based comparison feedback and introduce an efficient querying algorithm tailored to this setting.
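To make the feedback model concrete, the sketch below fits a linearly parameterized preference model of the Bradley-Terry form, where trajectory tau_a is preferred over tau_b with probability sigma(theta^T (phi(tau_a) - phi(tau_b))) for a feature map phi and hidden reward parameter theta. This is a minimal illustrative sketch under that assumption, not the paper's algorithm: the function names, the synthetic features, and the gradient-ascent fitting loop are all placeholders for how pairwise comparisons can identify a hidden linear reward.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def fit_linear_preference_model(feat_a, feat_b, prefs, lr=0.1, n_iters=500):
    """Maximum-likelihood fit of a Bradley-Terry-style preference model.

    feat_a, feat_b: (n, d) arrays of trajectory features phi(tau).
    prefs: (n,) array, 1.0 if trajectory a was preferred, else 0.0.
    Returns an estimate of the hidden reward parameter theta (d,).
    """
    n, d = feat_a.shape
    theta = np.zeros(d)
    diff = feat_a - feat_b                 # phi(tau_a) - phi(tau_b)
    for _ in range(n_iters):
        p = sigmoid(diff @ theta)          # P(tau_a preferred | theta)
        grad = diff.T @ (prefs - p) / n    # gradient of the avg log-likelihood
        theta += lr * grad                 # ascend the log-likelihood
    return theta

# Usage: simulate comparisons from a hidden reward and recover it.
rng = np.random.default_rng(0)
d, n = 5, 2000
theta_true = rng.normal(size=d)
fa, fb = rng.normal(size=(n, d)), rng.normal(size=(n, d))
prefs = (rng.random(n) < sigmoid((fa - fb) @ theta_true)).astype(float)
theta_hat = fit_linear_preference_model(fa, fb, prefs)
print(np.corrcoef(theta_true, theta_hat)[0, 1])  # close to 1
```

The synthetic check at the end reflects the abstract's premise: with enough informative (here, random) trajectory pairs, pairwise feedback alone suffices to recover the hidden linear reward; the paper's contribution is to collect such informative trajectories deliberately, before querying humans.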
Authors: Wenhao Zhan, Masatoshi Uehara, Wen Sun, Jason D. Lee