Self-Evolved Reward Learning for LLMs (2411.00418v3)

Published 1 Nov 2024 in cs.CL and cs.AI

Abstract: Reinforcement Learning from Human Feedback (RLHF) is a crucial technique for aligning LLMs with human preferences, playing a pivotal role in the success of conversational models like GPT-4, ChatGPT, and Llama 2. A core challenge in employing RLHF lies in training a reliable reward model (RM), which relies on high-quality labels typically provided by human experts or advanced AI systems. These methods can be costly and may introduce biases that affect the LLM's responses. As LLMs improve, human input may become less effective in further enhancing their performance. In this paper, we propose Self-Evolved Reward Learning (SER), a novel approach where the RM generates additional training data to iteratively improve itself. We conducted extensive experiments on multiple datasets such as HH-RLHF and UltraFeedback, using models like Mistral and Llama 3, and compared SER against various baselines. Our results demonstrate that even with limited human-annotated data, learning from self-feedback can robustly enhance RM performance, thereby boosting the capabilities of LLMs. Resources of this paper can be found at https://aka.ms/ser

Summary

  • The paper presents a self-labeling mechanism that allows reward models to generate additional training data and reduce the need for extensive human annotations.
  • It achieves a 7.88% average performance boost over initial models using only 15% of the annotated data required by traditional methods.
  • Experimental validation across multiple datasets demonstrates SER’s potential for scalable and efficient reinforcement learning in training large language models.

Self-Evolved Reward Learning for LLMs

This paper introduces Self-Evolved Reward Learning (SER), an approach that addresses a central challenge of Reinforcement Learning from Human Feedback (RLHF): training reliable reward models (RMs) for LLMs. SER lets the reward model improve itself through an internally generated feedback loop, reducing reliance on extensive human-annotated data.

Methodology and Approach

The core premise of SER is to let the RM act as both feedback provider and learner: it generates additional training data from unlabeled datasets and uses that data iteratively to improve itself via a self-labeling mechanism. The iterative process involves four key steps:

  1. Self-labeling: The RM assigns preference labels to unlabeled data based on what it has learned so far.
  2. Status Assessment and Data Selection: The RM's learning status is evaluated, and only high-confidence self-labeled data expected to improve performance is retained.
  3. Iterative Retraining: The RM is repeatedly updated on the selected self-labeled data.
  4. LLM Training: The refined RM guides the reinforcement learning of the LLM.

The paper reports that this method sharply reduces the need for human-annotated data, achieving competitive results while using only 15% of the labels required by a fully supervised reward model. A schematic sketch of the self-evolution loop is given below.
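
To make the loop concrete, here is a minimal, self-contained sketch of a SER-style self-evolution cycle. It is not the authors' implementation: responses are stood in for by toy feature vectors, the reward model is a small linear scorer trained with a pairwise Bradley-Terry loss, and names such as `train_rm` and `confidence_threshold` are illustrative assumptions rather than anything specified in the paper.

```python
# Minimal sketch of a SER-style self-evolution loop (illustrative, not the paper's code).
import torch
import torch.nn as nn

torch.manual_seed(0)
DIM = 16  # toy feature dimension standing in for a response representation


class RewardModel(nn.Module):
    def __init__(self, dim: int = DIM):
        super().__init__()
        self.scorer = nn.Linear(dim, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.scorer(x).squeeze(-1)  # one scalar reward per response


def pairwise_loss(rm: RewardModel, chosen: torch.Tensor, rejected: torch.Tensor) -> torch.Tensor:
    # Standard preference (Bradley-Terry) loss: -log sigmoid(r(chosen) - r(rejected))
    return -torch.nn.functional.logsigmoid(rm(chosen) - rm(rejected)).mean()


def train_rm(chosen: torch.Tensor, rejected: torch.Tensor, epochs: int = 200) -> RewardModel:
    rm = RewardModel()
    opt = torch.optim.Adam(rm.parameters(), lr=1e-2)
    for _ in range(epochs):
        opt.zero_grad()
        pairwise_loss(rm, chosen, rejected).backward()
        opt.step()
    return rm


# Small human-labeled seed set and a larger unlabeled pool of response pairs.
seed_chosen, seed_rejected = torch.randn(32, DIM), torch.randn(32, DIM)
pool_a, pool_b = torch.randn(512, DIM), torch.randn(512, DIM)

rm = train_rm(seed_chosen, seed_rejected)
confidence_threshold = 0.8  # illustrative cut-off for accepting a self-label

for it in range(3):  # a few self-evolution iterations
    with torch.no_grad():
        # Self-labeling: the RM's confidence that response A beats response B.
        p_a_wins = torch.sigmoid(rm(pool_a) - rm(pool_b))
    # Status assessment / data selection: keep only confident self-labels.
    confident = (p_a_wins > confidence_threshold) | (p_a_wins < 1 - confidence_threshold)
    prefer_a = p_a_wins[confident].unsqueeze(1) > 0.5
    pseudo_chosen = torch.where(prefer_a, pool_a[confident], pool_b[confident])
    pseudo_rejected = torch.where(prefer_a, pool_b[confident], pool_a[confident])
    # Iterative retraining on the human seed labels plus the accepted self-labels.
    rm = train_rm(torch.cat([seed_chosen, pseudo_chosen]),
                  torch.cat([seed_rejected, pseudo_rejected]))
    print(f"iteration {it}: kept {int(confident.sum())} self-labeled pairs")

# The refined RM would then supply rewards for RL fine-tuning (e.g., PPO) of the LLM.
```

In the paper's setting, the toy vectors would be replaced by an actual LLM-based reward model scoring prompt-response pairs, and the confidence filter plays the role of the learning-status assessment in step 2.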

Experimental Validation

Experiments were conducted on multiple datasets, including HH-RLHF and UltraFeedback, with models such as Mistral and Llama 3, to assess the efficacy of SER across different LLMs and data sizes. The results show an average performance improvement of 7.88% over the initial models trained with limited annotation. SER consistently matched or surpassed the performance of models trained on the complete human-labeled data, underscoring its potential to improve robustness and performance in data-scarce scenarios.

Functional and Theoretical Implications

Practically, SER offers a more efficient route to reward learning, reducing dependence on labor-intensive labeling while maintaining, or even improving, LLM performance. Theoretically, the work highlights a reinforcement learning setting in which reduced data dependency and iterative self-assessment are central, yielding a more robust reward-model training strategy.

Future Prospects

SER offers a clear path toward stronger model capabilities with less labeled data and opens further directions for AI self-improvement. The paper briefly notes the possibility of integrating the LLM more deeply into the self-evolution loop, for example through automatic response generation and reward assignment; such extensions would aim to improve self-labeling accuracy and strengthen the RL feedback mechanism.

Conclusion

Self-Evolved Reward Learning introduces a promising methodology for improving LLMs through strategic use of self-generated reward-model training data. By letting the reward model refine itself, SER substantially reduces reliance on human annotations while achieving results close to those obtained with full human-labeled data. The work points toward more self-sustained learning paradigms, in which autonomous model improvement supports efficiency and scalability, and suggests that such self-evolution could become a practical component of training future AI systems.