Analysis of RM-Bench: Assessing Reward Models in LLMs with Subtlety and Style
The RM-Bench paper addresses a critical gap in the evaluation of reward models, which are essential components of techniques such as Reinforcement Learning from Human Feedback (RLHF) and inference scaling, where a reward model picks the best response among sampled candidates. Reward models are pivotal for aligning LLMs with human preferences and for selecting optimal responses. However, existing benchmarks often fall short in assessing a reward model's ability to detect subtle content changes and resist style biases, which can leave benchmark scores poorly correlated with the performance of the policy models the reward models are used to train.
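To make the second use case concrete, here is a minimal sketch of best-of-N response selection with a reward model. The names `generate_responses` and `score` are hypothetical stand-ins for an LLM sampler and a reward-model scorer, not interfaces from the paper.

```python
# Minimal sketch of best-of-N selection driven by a reward model.
# `generate_responses` and `score` are illustrative placeholders.
from typing import Callable, List

def best_of_n(
    prompt: str,
    generate_responses: Callable[[str, int], List[str]],  # samples n candidate responses
    score: Callable[[str, str], float],                    # reward model: (prompt, response) -> scalar
    n: int = 8,
) -> str:
    """Sample n candidates and return the one the reward model prefers."""
    candidates = generate_responses(prompt, n)
    return max(candidates, key=lambda response: score(prompt, response))
```

A miscalibrated reward model in this loop directly determines which response the user sees, which is why benchmark quality matters.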
Contribution of RM-Bench
The authors introduce RM-Bench, a robust benchmark designed to address these limitations. It evaluates reward models based on:
- Sensitivity to Subtle Changes: The benchmark assesses how well a reward model detects nuanced content differences, such as a fluent, well-structured response that contains a subtly introduced factual error. This is crucial for ensuring that models reward factual correctness rather than surface form.
- Resistance to Style Biases: RM-Bench tests a model's robustness to stylistic variation (for example, concise plain text versus detailed, markdown-formatted answers), ensuring that preferences are driven by content accuracy rather than presentation. A sketch of this style-controlled comparison appears after this list.
- Correlation with Policy Models: The benchmark is designed so that reward-model scores correlate strongly with the downstream performance of policy models fine-tuned against those reward models with Proximal Policy Optimization (PPO), so that choosing a higher-scoring reward model actually improves the trained policy.
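The following is a minimal sketch of the kind of style-controlled accuracy computation described above. It assumes each prompt comes with chosen and rejected responses rendered in three increasingly styled variants; the `score` function, sample layout, and style names are illustrative placeholders rather than the paper's actual interfaces.

```python
# Sketch of a style-matrix accuracy computation: every chosen-style /
# rejected-style combination is compared, and results are bucketed by
# whether style cues help, are neutral, or work against the correct answer.
from itertools import product
from typing import Callable, Dict, List

STYLES = ["concise", "detailed_plain", "detailed_markdown"]  # plainest -> most styled

def style_matrix_accuracy(
    score: Callable[[str, str], float],
    samples: List[Dict],  # each: {"prompt": str, "chosen": {style: str}, "rejected": {style: str}}
) -> Dict[str, float]:
    buckets = {"easy": [0, 0], "normal": [0, 0], "hard": [0, 0]}  # [correct, total]
    for sample in samples:
        for i, j in product(range(len(STYLES)), repeat=2):
            chosen = sample["chosen"][STYLES[i]]      # factually correct answer, style i
            rejected = sample["rejected"][STYLES[j]]  # subtly flawed answer, style j
            if i == j:
                key = "normal"   # both responses share the same style
            elif i > j:
                key = "easy"     # the correct answer is also the more styled one
            else:
                key = "hard"     # the flawed answer is the more styled one
            correct = score(sample["prompt"], chosen) > score(sample["prompt"], rejected)
            buckets[key][0] += int(correct)
            buckets[key][1] += 1
    return {k: hits / total for k, (hits, total) in buckets.items()}
```

In this framing, the "accuracy under style-bias interference" reported below roughly corresponds to the hard bucket, where the preferred answer is stated in a plainer style than the dispreferred one.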
Experimental Evaluation
Nearly 40 reward models were evaluated on RM-Bench, and the results expose clear gaps in current capabilities:
- Performance Analysis: Even state-of-the-art models achieve an average accuracy of only 46.6% when style bias works against content quality, below the 50% random-guess baseline for pairwise comparisons. This underscores how much room remains for reliably rewarding substance over style.
- Mathematics and Code Challenges: Reward models consistently struggled in the Math and Code domains, failing to beat the random baseline. An unreliable reward signal in these domains can actively mislead policy models during training, posing a considerable challenge for future development.
- DPO Advantages: Direct Preference Optimization (DPO) models, which define an implicit reward through the log-probability ratio between the trained policy and a reference model, outperformed sequence-classification reward models when a reference model was available. This suggests a promising direction for future reward modeling; a sketch of the implicit reward appears after this list.
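The DPO point rests on the fact that a DPO-trained policy induces an implicit reward proportional to the log-probability ratio between the policy and its reference model, r(x, y) = β·(log π_θ(y|x) − log π_ref(y|x)). The sketch below shows one way to compute that quantity with Hugging Face `transformers`; the checkpoint paths and β value are placeholders, and the token bookkeeping is simplified (it assumes the prompt tokenization is a prefix of the prompt-plus-response tokenization).

```python
# Sketch: scoring a response with a DPO model's implicit reward,
# r(x, y) = beta * (log pi_theta(y|x) - log pi_ref(y|x)).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def response_logprob(model, tokenizer, prompt: str, response: str) -> float:
    """Sum of log-probabilities the model assigns to the response tokens."""
    prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
    full_ids = tokenizer(prompt + response, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(full_ids).logits                      # [1, T, vocab]
    log_probs = torch.log_softmax(logits[:, :-1], dim=-1)    # predictions for tokens 1..T-1
    targets = full_ids[:, 1:]                                # next-token targets
    token_logps = log_probs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    response_start = prompt_ids.shape[1] - 1                 # position predicting the first response token
    return token_logps[:, response_start:].sum().item()

def dpo_implicit_reward(policy, reference, tokenizer, prompt, response, beta=0.1):
    """Implicit DPO reward; beta=0.1 is an assumed value, not taken from the paper."""
    return beta * (
        response_logprob(policy, tokenizer, prompt, response)
        - response_logprob(reference, tokenizer, prompt, response)
    )

# Placeholder checkpoints; substitute an actual DPO-trained model and its reference.
# policy = AutoModelForCausalLM.from_pretrained("path/to/dpo-policy")
# reference = AutoModelForCausalLM.from_pretrained("path/to/reference")
# tokenizer = AutoTokenizer.from_pretrained("path/to/dpo-policy")
```

The reference model matters here: dropping the log π_ref term reduces the score to a plain policy log-likelihood, which is roughly what scoring a DPO model without its reference amounts to.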
Theoretical and Practical Implications
The paper’s findings suggest practical steps for improving reward models. Addressing style bias matters not only academically but also in applications where accurate language understanding is essential, such as conversational agents. The theoretical implication is equally significant: reward models that track subtle content differences rather than stylistic cues push aligned models toward judgments closer to human ones, reducing misalignment and bias.
Future Directions
The paper highlights the need for continued exploration of biases not currently addressed, such as those concerning word choice or context. Future research could extend the benchmark to include these aspects, further refining the evaluation of reward models.
In conclusion, RM-Bench represents a meaningful step toward better reward-model benchmarks, emphasizing substance over style and tying reward-model evaluation to downstream policy performance. The insights from RM-Bench are expected to direct future research toward more robust, nuanced evaluation methodologies for reward models in AI systems.