Self-Evolved Reward Learning for LLMs (2411.00418v3)
Abstract: Reinforcement Learning from Human Feedback (RLHF) is a crucial technique for aligning LLMs with human preferences, playing a pivotal role in the success of conversational models like GPT-4, ChatGPT, and Llama 2. A core challenge in employing RLHF lies in training a reliable reward model (RM), which relies on high-quality labels typically provided by human experts or advanced AI systems. These methods can be costly and may introduce biases that affect the LLM's responses. As LLMs improve, human input may become less effective in further enhancing their performance. In this paper, we propose Self-Evolved Reward Learning (SER), a novel approach in which the RM generates additional training data to iteratively improve itself. We conducted extensive experiments on multiple datasets such as HH-RLHF and UltraFeedback, using models like Mistral and Llama 3, and compared SER against various baselines. Our results demonstrate that even with limited human-annotated data, learning from self-feedback can robustly enhance RM performance, thereby boosting the capabilities of LLMs. Resources for this paper are available at https://aka.ms/ser
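The abstract describes SER only at a high level: starting from limited human-annotated data, the reward model labels additional preference pairs for itself and is retrained on the confident ones. The following is a minimal sketch of such a self-evolution loop, not the paper's exact recipe; the `rm.fit` / `rm.score` interface, the confidence margin `tau`, and the number of rounds are illustrative assumptions.

```python
def self_evolve(rm, labeled_pairs, unlabeled_pairs, rounds=3, tau=0.8):
    """Iteratively grow the RM's training set with its own confident labels.

    labeled_pairs:   seed set of (prompt, chosen, rejected) triples from humans
    unlabeled_pairs: (prompt, response_a, response_b) tuples with no preference label
    rm:              hypothetical reward model exposing fit(triples) and score(prompt, response)
    """
    train_set = list(labeled_pairs)              # seed: limited human-annotated data
    for _ in range(rounds):
        rm.fit(train_set)                        # (re)train the reward model
        new_labels = []
        for prompt, resp_a, resp_b in unlabeled_pairs:
            s_a = rm.score(prompt, resp_a)
            s_b = rm.score(prompt, resp_b)
            if abs(s_a - s_b) >= tau:            # keep only confident self-labels
                chosen, rejected = (resp_a, resp_b) if s_a > s_b else (resp_b, resp_a)
                new_labels.append((prompt, chosen, rejected))
        if not new_labels:                       # nothing confident left to add
            break
        train_set.extend(new_labels)             # self-generated data joins the pool
    return rm
```

The improved RM can then be plugged into a standard RLHF pipeline (e.g., PPO) to fine-tune the policy LLM, which is where the downstream capability gains reported in the abstract would come from.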