
Learning Optimal Advantage from Preferences and Mistaking it for Reward (2310.02456v1)

Published 3 Oct 2023 in cs.LG and cs.AI

Abstract: We consider algorithms for learning reward functions from human preferences over pairs of trajectory segments, as used in reinforcement learning from human feedback (RLHF). Most recent work assumes that human preferences are generated based only upon the reward accrued within those segments, or their partial return. Recent work casts doubt on the validity of this assumption, proposing an alternative preference model based upon regret. We investigate the consequences of assuming preferences are based upon partial return when they actually arise from regret. We argue that the learned function is an approximation of the optimal advantage function, $\hat{A}^*_r$, not a reward function. We find that if a specific pitfall is addressed, this incorrect assumption is not particularly harmful, resulting in a highly shaped reward function. Nonetheless, this incorrect usage of $\hat{A}^*_r$ is less desirable than the appropriate and simpler approach of greedy maximization of $\hat{A}^*_r$. From the perspective of the regret preference model, we also provide a clearer interpretation of fine-tuning contemporary LLMs with RLHF. This paper overall provides insight regarding why learning under the partial return preference model tends to work so well in practice, despite it conforming poorly to how humans give preferences.

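As context for the abstract's contrast between the two preference models, here is a minimal sketch in notation commonly used in this line of work; the symbols $\sigma_1, \sigma_2$ (trajectory segments), $r$ (reward), and $A^*_r$ (optimal advantage) follow convention rather than being quoted from the paper. The partial return model posits $P(\sigma_1 \succ \sigma_2) = \mathrm{logistic}\big(\sum_{(s,a) \in \sigma_1} r(s,a) - \sum_{(s,a) \in \sigma_2} r(s,a)\big)$, while the regret model (in the deterministic-transition case) replaces each segment's partial return with its summed optimal advantage, $\sum_{(s,a) \in \sigma} A^*_r(s,a)$. Fitting the partial return model to preferences that actually follow the regret model therefore tends to recover an estimate $\hat{A}^*_r$ rather than $r$, and the simpler use of that estimate is greedy action selection, $\pi(s) \in \arg\max_a \hat{A}^*_r(s,a)$, since greedy maximization of the true $A^*_r$ already yields an optimal policy.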
Authors (7)
  1. W. Bradley Knox (10 papers)
  2. Stephane Hatgis-Kessell (3 papers)
  3. Sigurdur Orn Adalgeirsson (2 papers)
  4. Serena Booth (11 papers)
  5. Anca Dragan (62 papers)
  6. Peter Stone (184 papers)
  7. Scott Niekum (67 papers)
Citations (10)
