RewardBench: Evaluating Reward Models for Language Modeling (2403.13787v2)
Abstract: Reward models (RMs) are at the crux of successfully using RLHF to align pretrained models to human preferences, yet relatively little study has focused on evaluating those models. Evaluating reward models presents an opportunity to understand the opaque technologies used for alignment of LLMs and which values are embedded in them. Resources for training and understanding reward models are sparse in the nascent open-source community around them. To enhance scientific understanding of reward models, we present RewardBench, a benchmark dataset and codebase for evaluation. The RewardBench dataset is a collection of prompt-chosen-rejected trios spanning chat, reasoning, and safety, to benchmark how reward models perform on challenging, structured, and out-of-distribution queries. We create specific comparison datasets for RMs with subtle but verifiable reasons (e.g., bugs, incorrect facts) why one answer should be preferred to another. On the RewardBench leaderboard, we evaluate reward models trained with a variety of methods, such as direct MLE training of classifiers and the implicit reward modeling of Direct Preference Optimization (DPO). We present many findings on the propensity for refusals, reasoning limitations, and instruction-following shortcomings of various reward models, towards a better understanding of the RLHF process.
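The core metric behind the benchmark described above is per-comparison accuracy: a reward model is credited for a prompt-chosen-rejected trio when it assigns a higher scalar reward to the chosen completion than to the rejected one. The sketch below illustrates that protocol, assuming a Hugging Face sequence-classification reward model; the model name and the toy trio are illustrative placeholders, not drawn from the RewardBench dataset itself.

```python
# Minimal sketch of RewardBench-style scoring: an RM "wins" a trio when it
# scores the chosen completion above the rejected one.
# The model name and the example trio are illustrative placeholders.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

MODEL = "OpenAssistant/reward-model-deberta-v3-large-v2"  # any scalar-output RM
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForSequenceClassification.from_pretrained(MODEL)
model.eval()

def reward(prompt: str, completion: str) -> float:
    """Scalar reward from a classifier-style RM for a prompt/completion pair."""
    inputs = tokenizer(prompt, completion, return_tensors="pt", truncation=True)
    with torch.no_grad():
        return model(**inputs).logits[0].item()

# One prompt-chosen-rejected trio (toy example, not from the actual dataset).
trios = [
    {
        "prompt": "What is 2 + 2?",
        "chosen": "2 + 2 equals 4.",
        "rejected": "2 + 2 equals 5.",
    },
]

# Accuracy = fraction of trios where the chosen completion is scored higher.
wins = sum(
    reward(t["prompt"], t["chosen"]) > reward(t["prompt"], t["rejected"])
    for t in trios
)
print(f"accuracy: {wins / len(trios):.3f}")
```

For DPO-trained models, which lack an explicit reward head, the same comparison can be run on the implicit reward, proportional to the difference in sequence log-probabilities between the policy and its reference model, in place of the classifier logit.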
Authors: Nathan Lambert, Valentina Pyatkin, Jacob Morrison, LJ Miranda, Bill Yuchen Lin, Khyathi Chandu, Nouha Dziri, Sachin Kumar, Tom Zick, Yejin Choi, Noah A. Smith, Hannaneh Hajishirzi