RewardBench: Evaluating Reward Models for Language Modeling (2403.13787v2)
Abstract: Reward models (RMs) are at the crux of successfully using RLHF to align pretrained models to human preferences, yet relatively little study has focused on evaluating those models. Evaluating reward models presents an opportunity to understand the opaque technologies used for alignment of LLMs and which values are embedded in them. Resources for training and understanding reward models are sparse in the nascent open-source community around them. To enhance scientific understanding of reward models, we present RewardBench, a benchmark dataset and codebase for evaluation. The RewardBench dataset is a collection of prompt-chosen-rejected trios spanning chat, reasoning, and safety, to benchmark how reward models perform on challenging, structured, and out-of-distribution queries. We create specific comparison datasets for RMs with subtle but verifiable reasons (e.g., bugs, incorrect facts) why one answer should be preferred to another. On the RewardBench leaderboard, we evaluate reward models trained with a variety of methods, such as direct MLE training of classifiers and the implicit reward modeling of Direct Preference Optimization (DPO). We present many findings on the propensity for refusals, reasoning limitations, and instruction-following shortcomings of various reward models, towards a better understanding of the RLHF process.
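The core metric behind the benchmark described above is per-comparison accuracy: a reward model is credited for a prompt-chosen-rejected trio when it assigns a higher scalar reward to the chosen completion than to the rejected one. The sketch below illustrates that protocol, assuming a Hugging Face sequence-classification reward model; the model name and the toy trio are illustrative placeholders, not drawn from the RewardBench dataset itself.

```python
# Minimal sketch of RewardBench-style scoring: an RM "wins" a trio when it
# scores the chosen completion above the rejected one.
# The model name and the example trio are illustrative placeholders.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

MODEL = "OpenAssistant/reward-model-deberta-v3-large-v2"  # any scalar-output RM
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForSequenceClassification.from_pretrained(MODEL)
model.eval()

def reward(prompt: str, completion: str) -> float:
    """Scalar reward from a classifier-style RM for a prompt/completion pair."""
    inputs = tokenizer(prompt, completion, return_tensors="pt", truncation=True)
    with torch.no_grad():
        return model(**inputs).logits[0].item()

# One prompt-chosen-rejected trio (toy example, not from the actual dataset).
trios = [
    {
        "prompt": "What is 2 + 2?",
        "chosen": "2 + 2 equals 4.",
        "rejected": "2 + 2 equals 5.",
    },
]

# Accuracy = fraction of trios where the chosen completion is scored higher.
wins = sum(
    reward(t["prompt"], t["chosen"]) > reward(t["prompt"], t["rejected"])
    for t in trios
)
print(f"accuracy: {wins / len(trios):.3f}")
```

For DPO-trained models, which lack an explicit reward head, the same comparison can be run on the implicit reward, proportional to the difference in sequence log-probabilities between the policy and its reference model, in place of the classifier logit.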
Authors: Nathan Lambert, Valentina Pyatkin, Jacob Morrison, LJ Miranda, Bill Yuchen Lin, Khyathi Chandu, Nouha Dziri, Sachin Kumar, Tom Zick, Yejin Choi, Noah A. Smith, Hannaneh Hajishirzi