Provable Offline Preference-Based Reinforcement Learning (2305.14816v2)
Abstract: In this paper, we investigate the problem of offline Preference-based Reinforcement Learning (PbRL) with human feedback, where feedback is available in the form of preferences between trajectory pairs rather than explicit rewards. Our proposed algorithm consists of two main steps: (1) estimate the implicit reward using Maximum Likelihood Estimation (MLE) with general function approximation from offline data and (2) solve a distributionally robust planning problem over a confidence set around the MLE. We consider the general reward setting where the reward can be defined over the whole trajectory and provide a novel guarantee that allows us to learn any target policy with a polynomial number of samples, as long as the target policy is covered by the offline data. This guarantee is the first of its kind with general function approximation. To measure the coverage of the target policy, we introduce a new single-policy concentrability coefficient, which can be upper bounded by the per-trajectory concentrability coefficient. We also establish lower bounds that highlight the necessity of such concentrability and the difference from standard RL, where state-action-wise rewards are directly observed. We further extend and analyze our algorithm when the feedback is given over action pairs.
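The two-step recipe in the abstract (MLE reward estimation from preference data, then distributionally robust planning over a confidence set) can be illustrated with a minimal sketch. The sketch below is not the paper's implementation: it assumes a Bradley-Terry preference model with a linear trajectory-reward class r_theta(tau) = theta . phi(tau), and the feature map, candidate-policy representation, confidence radius, and sampling-based inner minimization are hypothetical placeholders for illustration only.

```python
# Minimal sketch (not the paper's implementation) of the two-step approach:
#   Step 1: MLE of an implicit trajectory reward from preference data.
#   Step 2: pessimistic (distributionally robust) policy selection over a
#           likelihood-based confidence set around the MLE.
# Assumes a Bradley-Terry model and linear rewards r_theta(tau) = theta . phi(tau).
import numpy as np
from scipy.optimize import minimize


def mle_reward(phi_pairs, prefs, dim, reg=1e-3):
    """Step 1: estimate theta by maximum likelihood.

    phi_pairs: (n, 2, dim) trajectory features for each compared pair (tau0, tau1).
    prefs:     (n,) labels, 1 if tau1 was preferred over tau0, else 0.
    Returns the MLE theta_hat and the negative log-likelihood function.
    """
    def neg_log_lik(theta):
        diff = (phi_pairs[:, 1] - phi_pairs[:, 0]) @ theta   # r(tau1) - r(tau0)
        log_p = -np.logaddexp(0.0, -diff)                     # log sigmoid(diff)
        log_q = -np.logaddexp(0.0, diff)                      # log sigmoid(-diff)
        return -(prefs * log_p + (1.0 - prefs) * log_q).sum() + reg * theta @ theta

    theta_hat = minimize(neg_log_lik, np.zeros(dim), method="L-BFGS-B").x
    return theta_hat, neg_log_lik


def pessimistic_policy(policies, policy_feats, theta_hat, neg_log_lik, radius,
                       n_samples=500, scale=0.5, seed=0):
    """Step 2: pick the policy with the best worst-case value over the set
    { theta : NLL(theta) <= NLL(theta_hat) + radius }.

    Each policy is summarized by its expected trajectory feature E_pi[phi(tau)],
    so its value under theta is feat @ theta. The inner minimization is
    approximated here by random perturbations of theta_hat, purely for illustration.
    """
    rng = np.random.default_rng(seed)
    threshold = neg_log_lik(theta_hat) + radius
    best_pi, best_val = None, -np.inf
    for pi, feat in zip(policies, policy_feats):
        worst = feat @ theta_hat                      # theta_hat is always feasible
        for _ in range(n_samples):
            theta = theta_hat + scale * rng.standard_normal(theta_hat.shape)
            if neg_log_lik(theta) <= threshold:
                worst = min(worst, feat @ theta)
        if worst > best_val:
            best_pi, best_val = pi, worst
    return best_pi, best_val
```

In the paper, the confidence set is built from the MLE log-likelihood with general (not necessarily linear) function approximation, and the robust planning step over that set is what yields the single-policy-concentrability guarantee; the sampling search above is only a stand-in for that inner minimization.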
Authors: Wenhao Zhan, Masatoshi Uehara, Nathan Kallus, Jason D. Lee, Wen Sun