
Self-Rewarding Language Models (2401.10020v2)

Published 18 Jan 2024 in cs.CL and cs.AI

Abstract: We posit that to achieve superhuman agents, future models require superhuman feedback in order to provide an adequate training signal. Current approaches commonly train reward models from human preferences, which may then be bottlenecked by human performance level, and secondly these separate frozen reward models cannot then learn to improve during LLM training. In this work, we study Self-Rewarding LLMs, where the LLM itself is used via LLM-as-a-Judge prompting to provide its own rewards during training. We show that during Iterative DPO training that not only does instruction following ability improve, but also the ability to provide high-quality rewards to itself. Fine-tuning Llama 2 70B on three iterations of our approach yields a model that outperforms many existing systems on the AlpacaEval 2.0 leaderboard, including Claude 2, Gemini Pro, and GPT-4 0613. While there is much left still to explore, this work opens the door to the possibility of models that can continually improve in both axes.

Introduction

Aligning LLMs with human values and preferences is critical for their effective and safe deployment. LLM training has typically relied on human preference data to tune these models for better task compliance, using approaches such as Reinforcement Learning from Human Feedback (RLHF) and Direct Preference Optimization (DPO). However, these methods are limited by the finite scope of available human feedback and by the static nature of externally built reward models. This paper examines Self-Rewarding LLMs, in which the LLM acts both as respondent to tasks and as judge of its own responses, establishing a framework for self-improving, dynamic reward modeling.
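Since the method builds on DPO, it may help to see the per-pair loss concretely. Below is a minimal plain-Python sketch of the standard DPO objective for a single preference pair; the variable names and the example log-probabilities are illustrative and not taken from the paper.

```python
import math

def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO loss for one preference pair.

    Inputs are summed token log-probabilities of the chosen and rejected
    responses under the policy being trained and under the frozen
    reference model; beta scales the implicit reward.
    """
    # Implicit reward margin: how much more the policy prefers the chosen
    # response over the rejected one, relative to the reference model.
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    # Negative log-sigmoid of the margin: minimized when the policy assigns
    # a higher relative likelihood to the chosen response.
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# Illustrative values only: the policy already slightly prefers the chosen response.
print(dpo_loss(-42.0, -55.0, -44.0, -54.0))  # ≈ 0.55
```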

Training Self-Rewarding LLMs

The paper posits that endowing LLMs with dual capabilities, so that they not only generate responses to tasks but also appraise the quality of those responses, enables self-alignment. The approach uses Iterative DPO training, beginning with a base pretrained LLM supplemented by a limited set of human-annotated data. Each subsequent model iterates through a cycle of creating self-instruction examples and then rewarding them based on the model's own judgments. These evaluations are not arbitrary: they follow formulated criteria covering a response's relevance, completeness, perspective, and quality.
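To make the cycle concrete, here is a minimal sketch of one such iteration in Python. The helpers generate, judge_score, and dpo_train are caller-supplied stand-ins for sampling, LLM-as-a-Judge scoring, and the DPO update; they, the candidate count, and the toy usage values are assumptions for illustration, not the paper's implementation.

```python
def self_rewarding_iteration(prompts, generate, judge_score, dpo_train, n_candidates=4):
    """One Self-Rewarding iteration: the model answers prompts, judges its own
    answers, and is then trained with DPO on the resulting preference pairs.

    generate, judge_score, and dpo_train are caller-supplied callables standing
    in for sampling, LLM-as-a-Judge scoring, and the DPO update, respectively.
    """
    preference_pairs = []
    for prompt in prompts:
        # 1. Sample several candidate responses from the current model.
        candidates = [generate(prompt) for _ in range(n_candidates)]
        # 2. Score each candidate with the same model acting as judge.
        scored = sorted(((judge_score(prompt, c), c) for c in candidates), reverse=True)
        (best_score, best), (worst_score, worst) = scored[0], scored[-1]
        # 3. Ties carry no preference signal, so keep only clear wins.
        if best_score > worst_score:
            preference_pairs.append((prompt, best, worst))
    # 4. Train the next model on the self-generated preference data.
    return dpo_train(preference_pairs)

# Toy usage with stand-in callables (no real model involved).
pool = iter(["Ok.", "A longer, more detailed answer.", "Medium answer here.", "Short answer."])
next_model = self_rewarding_iteration(
    prompts=["Explain DPO in one sentence."],
    generate=lambda p: next(pool),
    judge_score=lambda p, c: len(c),      # pretend longer answers score higher
    dpo_train=lambda pairs: f"new model trained on {len(pairs)} preference pair(s)",
)
print(next_model)  # new model trained on 1 preference pair(s)
```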

Methodology Insights

In a series of experiments using Llama 2 70B as the base model, the researchers demonstrate gains in instruction-following performance as well as in the model's own reward-evaluating ability. Through self-generated feedback and Iterative DPO, each subsequent model surpassed its predecessor's capabilities, yielding increasingly capable LLMs. Notably, on AlpacaEval 2.0 the self-rewarded models outperform existing LLMs trained on larger, proprietary datasets.
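Because the reward is simply a score the model writes out in natural language, turning judge output into a usable number takes a small parsing step. The sketch below shows one way this might look; the prompt template, the "Score:" convention, and the regex are illustrative assumptions rather than the paper's exact implementation.

```python
import re
from typing import Optional

# Hypothetical judge prompt; the paper uses a more detailed additive rubric,
# so this template is only a stand-in for illustration.
JUDGE_TEMPLATE = (
    "Review the user's question and the candidate response, then give a "
    "total score out of 5 on the final line in the form 'Score: <total>'.\n\n"
    "Question: {prompt}\n\nResponse: {response}\n"
)

def parse_judge_score(judge_output: str) -> Optional[float]:
    """Extract a numeric reward from free-form judge output.

    Returns None when no score is found, so the caller can discard the
    candidate instead of training on an unreliable reward.
    """
    match = re.search(r"Score:\s*([0-5](?:\.\d+)?)", judge_output)
    return float(match.group(1)) if match else None

print(parse_judge_score("Relevant and complete. Score: 4"))   # 4.0
print(parse_judge_score("I cannot evaluate this response."))  # None
```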

Implications and Future Exploration

Early findings suggest that Self-Rewarding LLMs could redefine how LLMs are trained. By facilitating self-improvement, models may bypass the limitations imposed by human-derived reward systems, and the iterative process could enable continued quality gains beyond the ceiling set by the quality of human feedback. However, the long-term saturation of self-rewarding gains, the safety implications, and broader evaluation measures have yet to be fully assessed, leaving these findings preliminary but promising avenues for future research.

References (35)
  1. GPT-4 technical report. arXiv preprint arXiv:2303.08774, 2023.
  2. The CRINGE loss: Learning what language not to model. In Anna Rogers, Jordan Boyd-Graber, and Naoaki Okazaki, editors, Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 8854–8874, Toronto, Canada, July 2023. Association for Computational Linguistics. doi: 10.18653/v1/2023.acl-long.493. URL https://aclanthology.org/2023.acl-long.493.
  3. Anthropic. Claude 2. https://www.anthropic.com/index/claude-2, 2023.
  4. Training a helpful and harmless assistant with reinforcement learning from human feedback. arXiv preprint arXiv:2204.05862, 2022a.
  5. Constitutional AI: Harmlessness from AI feedback. arXiv preprint arXiv:2212.08073, 2022b.
  6. Benchmarking foundation models with language-model-as-an-examiner. In Thirty-seventh Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2023. URL https://openreview.net/forum?id=IiRHQ7gvnq.
  7. AlpaGasus: Training a better alpaca with fewer data. arXiv preprint arXiv:2307.08701, 2023.
  8. Self-play fine-tuning converts weak language models to strong language models. arXiv preprint arXiv:2401.01335, 2024.
  9. A unified architecture for natural language processing: Deep neural networks with multitask learning. In Proceedings of the 25th International Conference on Machine Learning, pages 160–167, 2008.
  10. Alpacafarm: A simulation framework for methods that learn from human feedback. arXiv preprint arXiv:2305.14387, 2023.
  11. The devil is in the errors: Leveraging large language models for fine-grained machine translation evaluation. arXiv preprint arXiv:2308.07286, 2023.
  12. Reinforced self-training (rest) for language modeling. arXiv preprint arXiv:2308.08998, 2023.
  13. Unnatural instructions: Tuning language models with (almost) no human labor. In Anna Rogers, Jordan Boyd-Graber, and Naoaki Okazaki, editors, Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 14409–14428, Toronto, Canada, July 2023. Association for Computational Linguistics. doi: 10.18653/v1/2023.acl-long.806. URL https://aclanthology.org/2023.acl-long.806.
  14. Prometheus: Inducing fine-grained evaluation capability in language models. arXiv preprint arXiv:2310.08491, 2023.
  15. OpenAssistant conversations–democratizing large language model alignment. arXiv preprint arXiv:2304.07327, 2023.
  16. RLAIF: Scaling reinforcement learning from human feedback with ai feedback. arXiv preprint arXiv:2309.00267, 2023.
  17. Self-alignment with instruction backtranslation. arXiv preprint arXiv:2308.06259, 2023a.
  18. Alpacaeval: An automatic evaluator of instruction-following models. https://github.com/tatsu-lab/alpaca_eval, 2023b.
  19. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35:27730–27744, 2022.
  20. Automatically correcting large language models: Surveying the landscape of diverse self-correction strategies. arXiv preprint arXiv:2308.03188, 2023.
  21. Language models are unsupervised multitask learners. OpenAI blog, 1(8):9, 2019.
  22. Direct preference optimization: Your language model is secretly a reward model. In Thirty-seventh Conference on Neural Information Processing Systems, 2023. URL https://openreview.net/forum?id=HPuSIXJaa9.
  23. Branch-solve-merge improves large language model evaluation and generation. arXiv preprint arXiv:2310.15123, 2023.
  24. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.
  25. Learning to summarize with human feedback. Advances in Neural Information Processing Systems, 33:3008–3021, 2020.
  26. Stanford alpaca: An instruction-following llama model. https://github.com/tatsu-lab/stanford_alpaca, 2023.
  27. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023.
  28. Laurens Van der Maaten and Geoffrey Hinton. Visualizing data using t-SNE. Journal of machine learning research, 9(11), 2008.
  29. Self-instruct: Aligning language model with self generated instructions. arXiv preprint arXiv:2212.10560, 2022.
  30. Some things are more cringe than others: Preference optimization with the pairwise cringe loss. arXiv preprint arXiv:2312.16682, 2023.
  31. RRHF: Rank responses to align language models with human feedback. In Thirty-seventh Conference on Neural Information Processing Systems, 2023. URL https://openreview.net/forum?id=EdIGMCHk4l.
  32. SLiC-HF: Sequence likelihood calibration with human feedback. arXiv preprint arXiv:2305.10425, 2023.
  33. Click: Controllable text generation with sequence likelihood contrastive learning. In Anna Rogers, Jordan Boyd-Graber, and Naoaki Okazaki, editors, Findings of the Association for Computational Linguistics: ACL 2023, pages 1022–1040, Toronto, Canada, July 2023a. Association for Computational Linguistics. doi: 10.18653/v1/2023.findings-acl.65. URL https://aclanthology.org/2023.findings-acl.65.
  34. Judging LLM-as-a-judge with MT-bench and chatbot arena. In Thirty-seventh Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2023b. URL https://openreview.net/forum?id=uccHPGDlao.
  35. Fine-tuning language models from human preferences. arXiv preprint arXiv:1909.08593, 2019.
Authors (7)
  1. Weizhe Yuan (25 papers)
  2. Richard Yuanzhe Pang (26 papers)
  3. Kyunghyun Cho (292 papers)
  4. Sainbayar Sukhbaatar (53 papers)
  5. Jing Xu (244 papers)
  6. Jason Weston (130 papers)
  7. Xian Li (115 papers)
Citations (218)

HackerNews

  1. Self-Rewarding Language Models (93 points, 60 comments)
  2. Self-Rewarding Language Models (3 points, 0 comments)