- The paper introduces HelpSteer2-Preference, a dataset of preference annotations (with human-written justifications) that complements the existing HelpSteer2 ratings, enabling a controlled comparison of reward modeling paradigms.
- The study empirically compares Bradley-Terry style and Regression style reward models, showing that combining the two paradigms yields the strongest performance on RewardBench.
- The authors introduce a Scaled Bradley-Terry variant and open-source both the dataset and the trained reward model, providing actionable insights and resources for advancing AI alignment and reward model development.
Analysis of "HelpSteer2-Preference: Complementing Ratings with Preferences"
The paper "HelpSteer2-Preference: Complementing Ratings with Preferences" explores an innovative approach to developing reward models for aligning LLMs with human instructions. The authors specifically investigate the effectiveness of combining two prevalent training paradigms: Bradley-Terry style (BT) and Regression style, making a head-to-head comparison using a newly released dataset, HelpSteer2-Preference.
Core Contributions
- Introduction of a Complementary Dataset: The authors present HelpSteer2-Preference, a dataset of preference annotations suited to BT training, accompanied by human-written justifications. It complements the existing ratings in HelpSteer2, which target Regression training, and fills the gap of data collected under matched conditions that previously made a fair comparison of the BT and Regression paradigms difficult.
- Empirical Comparison of Reward Modeling Approaches: The paper compares BT models and Regression models on adequately matched data. The findings show that, when appropriately set up, each approach performs well on its own, while combining the two yields superior results, as demonstrated by top-ranking performance on the RewardBench leaderboard.
- Novel Modeling Technique: A new loss formulation, Scaled Bradley-Terry, is introduced to exploit the magnitude of each preference rather than only its direction, and it outperforms the standard BT objective (a sketch of both losses follows this list).
- Evaluation and Open-Sourcing of Models: A reward model trained from Llama-3.1-70B-Instruct with the combined approach scores 94.1 on RewardBench, placing it at the top of the leaderboard at the time of writing. Both the dataset and the trained reward model are publicly released, supporting reproducibility and further research.
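To make the distinction between the two losses concrete, here is a minimal PyTorch-style sketch, assuming rewards are scalar model outputs and preference magnitudes come from the HelpSteer2-Preference annotations; the function names and the exact form of the magnitude weighting are illustrative assumptions, not the authors' reference implementation.

```python
import torch
import torch.nn.functional as F

def bt_loss(reward_chosen: torch.Tensor, reward_rejected: torch.Tensor) -> torch.Tensor:
    """Standard Bradley-Terry loss: maximize the log-likelihood that the
    chosen response receives a higher reward than the rejected one."""
    return -F.logsigmoid(reward_chosen - reward_rejected).mean()

def scaled_bt_loss(reward_chosen: torch.Tensor,
                   reward_rejected: torch.Tensor,
                   preference_magnitude: torch.Tensor) -> torch.Tensor:
    """Scaled Bradley-Terry (illustrative): weight each pair's BT term by the
    annotated strength of preference, so strongly preferred pairs contribute
    more to the gradient than near-ties."""
    per_pair = -F.logsigmoid(reward_chosen - reward_rejected)
    return (preference_magnitude * per_pair).mean()
```

The design choice being illustrated is that strongly preferred pairs carry proportionally more gradient signal than near-ties, which is how Scaled BT uses the extra information in the preference annotations.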
Evaluation Methodology
The paper uses RewardBench to evaluate the different reward models. RewardBench scores a model by how often it assigns the higher reward to the preferred response across four task categories: Chat, Chat-Hard, Safety, and Reasoning, giving a broad view of model quality.
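As a rough illustration of this style of evaluation (the data format and helper names below are assumptions, not RewardBench's actual API), the following sketch computes per-category preference accuracy and averages the categories into an overall score.

```python
from collections import defaultdict

def category_accuracy(pairs, score_fn):
    """pairs: iterable of dicts with 'category', 'prompt', 'chosen', 'rejected'
    (a hypothetical format). score_fn(prompt, response) -> scalar reward.
    A pair counts as correct when the chosen response gets the higher score."""
    correct, total = defaultdict(int), defaultdict(int)
    for p in pairs:
        total[p["category"]] += 1
        if score_fn(p["prompt"], p["chosen"]) > score_fn(p["prompt"], p["rejected"]):
            correct[p["category"]] += 1
    per_category = {c: correct[c] / total[c] for c in total}
    overall = sum(per_category.values()) / len(per_category)  # mean over categories
    return per_category, overall
```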
Key Findings
- Regression vs. Bradley-Terry: With the best formulation of each, Scaled BT on the BT side and Helpfulness-Only SteerLM on the Regression side, the two styles reach similar overall performance on RewardBench. This parity suggests that tailoring the objective to the dataset matters more than rigid adherence to one modeling paradigm.
- Enhanced Performance through Combination: Models first trained with Helpfulness-Only SteerLM regression and then further trained with Scaled BT outperform either approach alone, producing the paper's best overall RewardBench result and demonstrating the value of combining the two training frameworks (see the two-stage sketch after this list).
- Preference Justifications: The authors also train pairwise justifier models directly on the human-written justifications. These experiments yield useful insights, but the justifier models generally perform less robustly than reward models trained with the more conventional BT and Regression objectives.
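Below is a minimal sketch of the two-stage recipe from the second finding, assuming the same scalar reward head is first fit to helpfulness ratings with a regression loss and then continued on preference pairs with the Scaled BT loss; the loop structure and data formats are simplified assumptions rather than the authors' released training code.

```python
import torch.nn.functional as F

def regression_phase(model, rating_batches, optimizer):
    """Phase 1 (Helpfulness-Only SteerLM regression, illustrative): fit the
    scalar head to predict the annotated helpfulness rating with an MSE loss."""
    for inputs, ratings in rating_batches:
        loss = F.mse_loss(model(inputs), ratings)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

def scaled_bt_phase(model, preference_batches, optimizer):
    """Phase 2 (Scaled BT, illustrative): continue training the same model on
    preference pairs, weighting each pair by its annotated preference magnitude."""
    for chosen, rejected, magnitude in preference_batches:
        margin = model(chosen) - model(rejected)
        loss = (magnitude * -F.logsigmoid(margin)).mean()
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```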
Theoretical and Practical Implications
The implications of this work are substantial for both theoretical exploration and applied AI. Theoretically, the research underscores the value of hybrid approaches in reward model training. Practically, the open-sourced dataset and model serve as valuable resources for further advancements in AI alignment research.
Future Directions
Looking ahead, there is room for further exploration in several areas:
- Scalability: Investigating how the proposed methodologies scale with larger model architectures or more complex datasets.
- Robustness: Enhancing the robustness of models by integrating additional contextual or domain-specific data.
- Human-In-The-Loop Interactions: Expanding the role of human feedback in training paradigms beyond static dataset annotations.
In conclusion, this paper advances the understanding and development of reward models by offering actionable insights and open resources. Its findings encourage a holistic view of reward model training, emphasizing the value of combining diverse annotation formats such as ratings, preferences, and justifications.