RewardBench: Evaluating Reward Models for Language Modeling (2403.13787v2)

Published 20 Mar 2024 in cs.LG

Abstract: Reward models (RMs) are at the crux of successfully using RLHF to align pretrained models to human preferences, yet there has been relatively little study that focuses on evaluation of those models. Evaluating reward models presents an opportunity to understand the opaque technologies used for alignment of LLMs and which values are embedded in them. Resources for reward model training and understanding are sparse in the nascent open-source community around them. To enhance scientific understanding of reward models, we present RewardBench, a benchmark dataset and code-base for evaluation. The RewardBench dataset is a collection of prompt-chosen-rejected trios spanning chat, reasoning, and safety, to benchmark how reward models perform on challenging, structured and out-of-distribution queries. We create specific comparison datasets for RMs that have subtle, but verifiable reasons (e.g. bugs, incorrect facts) why one answer should be preferred to another. On the RewardBench leaderboard, we evaluate reward models trained with a variety of methods, such as the direct MLE training of classifiers and the implicit reward modeling of Direct Preference Optimization (DPO). We present many findings on propensity for refusals, reasoning limitations, and instruction following shortcomings of various reward models towards a better understanding of the RLHF process.

Authors (12)
  1. Nathan Lambert (37 papers)
  2. Valentina Pyatkin (34 papers)
  3. Jacob Morrison (15 papers)
  4. LJ Miranda (2 papers)
  5. Bill Yuchen Lin (72 papers)
  6. Khyathi Chandu (17 papers)
  7. Nouha Dziri (40 papers)
  8. Sachin Kumar (68 papers)
  9. Tom Zick (31 papers)
  10. Yejin Choi (287 papers)
  11. Noah A. Smith (224 papers)
  12. Hannaneh Hajishirzi (176 papers)
Citations (128)

Summary

Evaluating Reward Models for Language Modeling with RewardBench

Introduction to RewardBench

RewardBench presents a comprehensive framework for evaluating reward models in the context of Reinforcement Learning from Human Feedback (RLHF). This benchmark includes a diverse set of prompts to test reward models across various domains such as chat, reasoning, safety, and out-of-distribution queries. One of the primary goals is to explore the limitations of contemporary reward models and how they align with human values within LLMs. Further, RewardBench seeks to establish a repository that encourages reproducibility and consistent benchmarking across reward models, addressing a gap in the current literature where few resources exist for such evaluations.

Dataset Construction and Evaluation

RewardBench is organized into five principal sections, with prompts drawn both from newly collected data and from existing benchmarks. Notably, the dataset emphasizes the role of refusals in safe content generation and includes instruction-following and reasoning tasks, as well as crafted adversarial prompts that probe how reward models handle nuanced language understanding.
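Each entry pairs a prompt with a chosen and a rejected completion, where the rejected answer is worse for a subtle but verifiable reason. The record below is an invented illustration of that structure, with assumed field names; it is not an item from the dataset itself.

```python
# Invented prompt-chosen-rejected trio in the style described above
# (a reasoning/code-style example): the rejected completion contains a
# subtle, verifiable bug. Field names are illustrative, not the official schema.
example = {
    "subset": "reasoning",
    "prompt": "Write a Python function that returns the maximum of a list.",
    "chosen": "def list_max(xs):\n    return max(xs)",
    "rejected": "def list_max(xs):\n    return min(xs)",  # wrong builtin
}
```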

The evaluation metric primarily used is accuracy, calculated as the percentage of instances where a reward model correctly identifies the preferred completion from a pair. This binary classification approach offers a straightforward measure of a reward model's effectiveness in aligning with human judgment. The final RewardBench score represents an average across the subset scores, presenting a holistic assessment of a reward model's performance across varied domains.
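As a minimal sketch of that scoring logic, assuming a generic scalar reward function and simple unweighted averaging over sections (the helper names and data layout are illustrative assumptions, not the official RewardBench implementation):

```python
# An example counts as correct when the reward model scores the chosen
# completion above the rejected one; each section score is the mean
# accuracy, and the overall score averages the section scores.
from statistics import mean


def section_accuracy(examples, reward_fn):
    """examples: list of dicts with 'prompt', 'chosen', 'rejected'.
    reward_fn(prompt, completion) -> float scalar reward."""
    correct = [
        reward_fn(ex["prompt"], ex["chosen"]) > reward_fn(ex["prompt"], ex["rejected"])
        for ex in examples
    ]
    return sum(correct) / len(correct)


def rewardbench_score(sections, reward_fn):
    """sections: dict mapping section name -> list of examples."""
    per_section = {name: section_accuracy(exs, reward_fn)
                   for name, exs in sections.items()}
    return per_section, mean(per_section.values())
```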

Key Findings and Insights

Significant variability exists in the performance of tested reward models across different categories within RewardBench. While some models demonstrate strong alignment with human preferences in certain domains, others falter, particularly with adversarial or nuanced prompts. This variability underscores the complexity of reward modeling and highlights areas for improvement in understanding human values and preferences.

The evaluation also sheds light on the distinction between models whose reward is defined implicitly by a policy trained with Direct Preference Optimization (DPO) and models trained as explicit classifiers over preference data. Interestingly, DPO models generally excel in the reasoning and safety categories but score lower on established preference test sets. This discrepancy points to a divide between models optimized as generative policies and those fine-tuned as classifiers, suggesting different avenues for refinement in each approach.
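A rough sketch of the two scoring routes contrasted above, assuming per-token log-probabilities and a scalar classifier output are already available (the function names are hypothetical, not from the RewardBench codebase):

```python
# Classifier-style RMs emit a scalar score directly; DPO-style models define
# an implicit reward from policy vs. reference log-probabilities:
#   r(x, y) = beta * (log pi(y | x) - log pi_ref(y | x))
# When ranking chosen vs. rejected completions for the same prompt, any
# beta > 0 leaves the ordering unchanged, so it can be dropped.

def classifier_reward(score_head_output: float) -> float:
    # A trained classifier already outputs a scalar reward.
    return score_head_output

def dpo_implicit_reward(policy_token_logprobs, ref_token_logprobs, beta: float = 1.0) -> float:
    # Sum per-token log-probs of the completion under both models,
    # then take the scaled difference.
    return beta * (sum(policy_token_logprobs) - sum(ref_token_logprobs))

# Example: the chosen completion is preferred when its implicit reward
# exceeds that of the rejected completion (numbers are illustrative).
chosen_r = dpo_implicit_reward([-0.2, -0.1, -0.3], [-0.5, -0.4, -0.6])
rejected_r = dpo_implicit_reward([-0.9, -1.1], [-0.7, -0.8])
print(chosen_r > rejected_r)  # True for these numbers
```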

Practical Implications and Future Directions

The RewardBench benchmark catalyzes further research into reward models, particularly in addressing their limitations in understanding complex instructions, safety considerations, and reasoning capabilities. Moreover, the observed differences between DPO and classifier-based models open pathways to exploring hybrid approaches or new training paradigms to enhance model alignment with human values.

Future work could expand RewardBench to include dynamic scenarios where reward models must adapt to evolving contexts or preferences, further pushing the boundaries of model evaluation. Additionally, incorporating broader datasets representing diverse global perspectives can ensure that reward models align with a more inclusive set of human values, addressing potential biases and promoting fairness in AI applications.

In conclusion, RewardBench contributes a valuable framework to the ongoing effort to develop and refine reward models in language technology. By highlighting current challenges and offering a basis for comparison, it paves the way for advancements in creating more aligned, ethical, and effective AI systems.
