Towards Reliable Alignment: Uncertainty-aware RLHF (2410.23726v1)
Abstract: Recent advances in aligning LLMs with human preferences have benefited from larger reward models and better preference data. However, most of these methods rely on the accuracy of the reward model. Reward models used in Reinforcement Learning from Human Feedback (RLHF) are typically learned from small datasets using stochastic optimization algorithms, making them prone to high variability. We illustrate this inconsistency between reward models empirically on numerous open-source datasets. We show theoretically that reward-model fluctuation can be detrimental to alignment: the derived policies overfit to the reward model and are therefore riskier when the reward model itself is uncertain. We use concentration of measure to motivate an uncertainty-aware, conservative algorithm for policy optimization. The resulting policies are more risk-averse in the sense that they are more cautious about uncertain rewards. We prove theoretically that our proposed methodology incurs less risk than the vanilla method. We corroborate these results with experiments based on an ensemble of reward models: we use the ensemble to align an LLM with our methodology and observe that the empirical findings match our theoretical predictions.
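The abstract describes aligning an LLM against an ensemble of reward models while staying cautious about rewards the ensemble disagrees on. Below is a minimal sketch of how such an uncertainty-penalized reward could be computed; it assumes the common "mean minus scaled standard deviation" form of conservatism, and the function name `conservative_reward` and coefficient `beta` are illustrative assumptions rather than the paper's exact formulation.

```python
import torch

def conservative_reward(rewards: torch.Tensor, beta: float = 1.0) -> torch.Tensor:
    """Uncertainty-penalized reward from a reward-model ensemble (sketch).

    rewards: tensor of shape (n_models, batch), one scalar score per ensemble
             member for each prompt-response pair.
    beta:    assumed penalty weight on ensemble disagreement.
    """
    mean = rewards.mean(dim=0)                 # ensemble mean reward
    std = rewards.std(dim=0, unbiased=False)   # disagreement as an uncertainty proxy
    return mean - beta * std                   # conservative (pessimistic) reward

# Example: 4 reward models scoring a batch of 3 responses.
ensemble_scores = torch.tensor([
    [0.9, 0.2, 0.5],
    [1.1, 0.1, 0.6],
    [0.8, 0.3, 0.4],
    [1.0, 0.2, 0.5],
])
print(conservative_reward(ensemble_scores, beta=0.5))
```

In a full RLHF loop, this penalized score would stand in for the single reward-model score fed to the policy-optimization step (e.g., PPO), so that responses on which the ensemble disagrees are rewarded less.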