- The paper introduces HelpSteer2, a novel dataset featuring 10,000 response pairs annotated on five key attributes with high inter-annotator reliability (Cohen's κ up to 0.793).
- It employs a rigorous multi-turn data collection and annotation process that pairs crowd-sourced prompts (primarily from ShareGPT) with internally generated responses, capturing diverse real-world LLM use cases, including multi-turn conversations.
- The study shows that reward models trained on HelpSteer2, including a Nemotron-4 340B-based model, achieve top performance on Reward Bench, with an overall score of 92.0%.
The paper "HelpSteer2: Open-source dataset for training top-performing reward models," authored by Zhilin Wang et al., presents the HelpSteer2 dataset, which is engineered to train state-of-the-art (SOTA) reward models for aligning LLMs with human preferences. The authors highlight the growing necessity of having high-quality preference datasets, emphasizing that strong numerical results on benchmarks such as Reward Bench are essential indicators of a reward model's relevance and capability.
Contribution and Dataset Composition
HelpSteer2 is introduced as a CC-BY-4.0-licensed dataset of 10,000 response pairs annotated on five key attributes: helpfulness, correctness, coherence, complexity, and verbosity. Although it is an order of magnitude smaller than existing datasets such as HH-RLHF, its high annotation quality, reflected in Cohen's κ values of up to 0.793 on the key attributes, makes it highly effective for training reward models. Prompts are sourced primarily from ShareGPT, while responses are generated by several generations of internal models from the Nemotron family.
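To make the schema concrete, below is a minimal sketch of inspecting the five attributes, assuming the dataset is hosted on the Hugging Face Hub under the `nvidia/HelpSteer2` repository id with one integer (Likert-5, 0-4) column per attribute; the exact repository id and column names are assumptions rather than details stated in this summary.

```python
# Minimal sketch: load the dataset and print the five attribute scores of one example.
# Assumes a Hub repo id of "nvidia/HelpSteer2" with one 0-4 integer column per attribute.
from datasets import load_dataset

ATTRIBUTES = ["helpfulness", "correctness", "coherence", "complexity", "verbosity"]

ds = load_dataset("nvidia/HelpSteer2", split="train")
example = ds[0]
print(example["prompt"][:200])
print(example["response"][:200])
for attr in ATTRIBUTES:
    print(f"{attr}: {example[attr]}")
```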
Data Collection and Annotation
The data collection process pairs crowd-sourced prompts with internally generated responses, ensuring that the prompts represent a range of real-world LLM use cases. The dataset also includes multi-turn prompts, which strengthen the reward model's ability to handle complex, multi-part conversations. Annotation was rigorous: each response was rated by multiple annotators (up to five) on a Likert-5 scale for every attribute, and automated checks combined with manual reviews enforced quality and consistency.
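As a purely illustrative sketch of how multiple Likert-5 ratings per response might be collapsed into one score per attribute, the snippet below averages the annotators' ratings; the authors' exact aggregation rule is not described in this summary, so the simple mean is an assumption for illustration.

```python
# Hypothetical aggregation of up to five Likert-5 (0-4) ratings per attribute.
# The simple mean shown here is an illustrative choice, not the authors' exact rule.
from statistics import mean

def aggregate_ratings(ratings_per_attribute: dict[str, list[int]]) -> dict[str, float]:
    """Average each attribute's ratings across annotators."""
    return {attr: mean(scores) for attr, scores in ratings_per_attribute.items()}

annotations = {
    "helpfulness": [4, 3, 4],
    "correctness": [4, 4, 3],
    "coherence":   [4, 4, 4],
    "complexity":  [2, 1, 2],
    "verbosity":   [2, 2, 3],
}
print(aggregate_ratings(annotations))  # e.g. {'helpfulness': 3.67, ...}
```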
Quantitatively, HelpSteer2 annotations show a marked improvement in agreement over the initial annotations, with helpfulness and correctness reaching Cohen's κ scores of 0.791 and 0.793, respectively. An accompanying analysis indicates that responses in HelpSteer2 are generally more helpful, correct, and coherent than those in previous datasets.
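For readers unfamiliar with the agreement metric, the snippet below shows one standard way to compute Cohen's κ between two annotators using scikit-learn; quadratic weighting is a common choice for ordinal Likert data, and the ratings shown are made up, so neither should be read as the paper's exact setup.

```python
# Illustrative computation of Cohen's kappa between two annotators' Likert-5 ratings.
# Quadratic weighting is a common choice for ordinal scales; the data here is made up.
from sklearn.metrics import cohen_kappa_score

annotator_a = [4, 3, 4, 2, 1, 4, 3, 2, 4, 3]
annotator_b = [4, 3, 3, 2, 1, 4, 4, 2, 4, 3]

kappa = cohen_kappa_score(annotator_a, annotator_b, weights="quadratic")
print(f"quadratic-weighted Cohen's kappa: {kappa:.3f}")
```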
Reward Model Training and Evaluation
The reward models trained on HelpSteer2 build on two base models: Llama 3 70B and Nemotron-4 340B. Both are trained for two epochs and use scalar regression outputs for each fine-grained attribute, providing more detailed feedback than binary preferences. Extensive evaluation on Reward Bench validates these models, with the Nemotron-4 340B-based reward model achieving an overall score of 92.0%, surpassing both open-source and proprietary models at the time of publication and underlining the effectiveness and data efficiency of HelpSteer2.
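The scalar-regression setup can be pictured as a small linear head on top of an LLM backbone that predicts one score per attribute and is trained with a mean-squared-error loss against the annotated labels. The sketch below uses random tensors in place of a real backbone; the class name, dimensions, and last-token pooling are illustrative assumptions, not the authors' implementation.

```python
# Sketch of an attribute-regression reward head: one scalar prediction per attribute,
# trained with MSE against Likert-5 labels. Backbone activations are faked with randn.
import torch
import torch.nn as nn

ATTRIBUTES = ["helpfulness", "correctness", "coherence", "complexity", "verbosity"]

class AttributeRegressionHead(nn.Module):
    def __init__(self, hidden_size: int, num_attributes: int = len(ATTRIBUTES)):
        super().__init__()
        self.head = nn.Linear(hidden_size, num_attributes)

    def forward(self, last_hidden_state: torch.Tensor) -> torch.Tensor:
        # Pool by taking the final token's hidden state as the sequence representation.
        return self.head(last_hidden_state[:, -1, :])

head = AttributeRegressionHead(hidden_size=4096)
hidden = torch.randn(2, 128, 4096)              # (batch, seq_len, hidden) stand-in
targets = torch.tensor([[4., 4., 4., 2., 2.],
                        [1., 2., 3., 1., 3.]])  # per-attribute Likert-5 labels
loss = nn.functional.mse_loss(head(hidden), targets)
loss.backward()
print(loss.item())
```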
The authors also provide a comparative analysis with reward models trained on other permissively licensed datasets such as Open Assistant and HH-RLHF, showing that HelpSteer2-trained models perform significantly better in various categories of Reward Bench.
Model Alignment Techniques
Three alignment approaches are explored: Iterative Direct Preference Optimization (DPO), Proximal Policy Optimization (PPO), and SteerLM 2.0. Each method demonstrates different strengths:
- DPO shows strong performance on factuality benchmarks such as TruthfulQA, as well as on Arena Hard, owing to its alignment with the correctness attribute (a generic sketch of the DPO objective follows this list).
- PPO excels in AlpacaEval 2.0 LC, benefiting from its focus on response style and detail.
- SteerLM 2.0 performs best on complex, multi-requirement tasks such as those in MT Bench, because it trains the model to increase the likelihood of generating high-quality, attribute-aligned responses.
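To ground the DPO discussion, here is a minimal sketch of the standard DPO objective applied to summed log-probabilities of chosen and rejected responses under the policy and a frozen reference model; this is the generic loss from the DPO literature, not the paper's specific iterative training recipe, and the beta value and toy numbers are illustrative.

```python
# Generic DPO loss: prefer the chosen response's implicit reward over the rejected one.
# Inputs are summed per-response log-probs under the policy and a frozen reference model.
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta: float = 0.1):
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()

# Toy per-example summed log-probabilities (illustrative values).
loss = dpo_loss(torch.tensor([-12.3, -8.7]), torch.tensor([-15.1, -9.9]),
                torch.tensor([-13.0, -8.9]), torch.tensor([-14.8, -9.5]))
print(loss.item())
```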
Implications and Future Directions
The implications of this work are significant for both practical applications and future research. Practically, HelpSteer2 allows smaller organizations and researchers to train competitive reward models without the need for extensive computational resources or restricted datasets. Theoretically, the introduction of fine-grained attributes and methods such as SteerLM 2.0 opens avenues for more nuanced model alignment that goes beyond simple binary preferences.
Future directions could include expanding the dataset with multilingual annotations, increasing demographic diversity among annotators, and refining reward-model training techniques to further mitigate issues such as verbosity bias and factual errors. Integrating HelpSteer2 with real-world LLM applications could also be explored to ensure continued improvement in alignment and adherence to human preferences in practical settings.
In conclusion, the HelpSteer2 dataset represents a significant step forward in the efficient and effective training of reward models for LLMs. Its impact is evident in the strong performance of trained models on diverse benchmarks, and its accessibility ensures it will be a valuable resource for the broader research community.