Uncertainty-Penalized Reinforcement Learning from Human Feedback with Diverse Reward LoRA Ensembles (2401.00243v1)

Published 30 Dec 2023 in cs.LG

Abstract: Reinforcement learning from human feedback (RLHF) has emerged as a promising paradigm for aligning LLMs. However, a notable challenge in RLHF is overoptimization: beyond a certain threshold, the pursuit of higher rewards leads to a decline in human preferences. In this paper, we observe the weakness of the KL regularization commonly employed in existing RLHF methods to address overoptimization. To mitigate this limitation, we scrutinize the RLHF objective on the offline dataset and propose uncertainty-penalized RLHF (UP-RLHF), which incorporates uncertainty regularization during RL fine-tuning. To enhance the uncertainty quantification abilities of reward models, we first propose a diverse low-rank adaptation (LoRA) ensemble obtained by maximizing the nuclear norm of concatenated LoRA matrices. We then optimize policy models using penalized rewards, determined by both the rewards and the uncertainties provided by the diverse reward LoRA ensemble. Experimental results on two real human preference datasets showcase the effectiveness of diverse reward LoRA ensembles in quantifying reward uncertainty. Moreover, the uncertainty regularization in UP-RLHF proves pivotal in mitigating overoptimization, thereby improving overall performance.
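
Below is a minimal sketch of the two mechanisms the abstract describes: a nuclear-norm diversity term over concatenated LoRA updates, and an uncertainty-penalized reward formed from the ensemble mean and standard deviation. It assumes an ensemble of K reward heads whose LoRA factors (A_k, B_k) are accessible; the function names, the penalty coefficient beta, and the tensor shapes are illustrative assumptions, not the authors' implementation.

```python
# Illustrative sketch, not the authors' implementation.
import torch

def nuclear_norm_diversity(lora_updates):
    """Diversity regularizer for a reward LoRA ensemble: the negative nuclear
    norm of the column-wise concatenation of the LoRA updates B_k @ A_k.
    Minimizing this term maximizes the nuclear norm, pushing members apart."""
    deltas = torch.cat([B @ A for (A, B) in lora_updates], dim=-1)
    return -torch.linalg.matrix_norm(deltas, ord="nuc")

def uncertainty_penalized_reward(ensemble_rewards, beta=0.5):
    """Penalized reward for RL fine-tuning: ensemble mean minus beta times the
    ensemble standard deviation, used here as the uncertainty estimate."""
    r = torch.stack(ensemble_rewards)  # shape: (K, batch)
    return r.mean(dim=0) - beta * r.std(dim=0)

# Toy usage: K = 4 hypothetical rank-8 LoRA adapters on a 16x16 weight matrix.
lora_updates = [(torch.randn(8, 16), torch.randn(16, 8)) for _ in range(4)]
diversity_loss = nuclear_norm_diversity(lora_updates)        # add to reward-model loss
rewards = [torch.randn(3) for _ in range(4)]                  # per-member rewards for a batch of 3
penalized = uncertainty_penalized_reward(rewards, beta=0.5)   # use in place of the raw reward
```

In this reading, the diversity term is added to the reward-model training loss to decorrelate the LoRA ensemble members, while the penalized reward replaces the raw reward signal during RL fine-tuning of the policy.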

Authors (8)
  1. Yuanzhao Zhai (10 papers)
  2. Han Zhang (338 papers)
  3. Yu Lei (56 papers)
  4. Yue Yu (343 papers)
  5. Kele Xu (62 papers)
  6. Dawei Feng (19 papers)
  7. Bo Ding (18 papers)
  8. Huaimin Wang (37 papers)
Citations (27)