HelpSteer2 Dataset for LLM Alignment
- HelpSteer2 is a multi-attribute human preference dataset featuring structured ratings and pairwise annotations for prompt-response pairs.
- It ensures high-quality data with rigorous aggregation, quality-control protocols, and evaluation across five distinct response dimensions.
- The dataset supports diverse modeling paradigms—regression, Bradley–Terry, and combined ranking—achieving state-of-the-art benchmarks in LLM alignment.
HelpSteer2 is a multi-attribute human preference dataset designed to facilitate the alignment and optimization of LLMs, with a focus on reward modeling, steering, and comparative evaluation. It provides structured ratings and pairwise preferences for user-generated and enterprise-style prompts, supporting a wide range of modeling paradigms including regression, Bradley–Terry (BT), and combined ranking approaches. The dataset is distinguished by its annotation quality, coverage of five distinct response dimensions, and rigorous aggregation and quality-control protocols. HelpSteer2 is open-sourced under the CC-BY-4.0 license and serves as both a benchmark and a training resource for state-of-the-art reward models and alignment pipelines (Wang et al., 2024a; 2024b; Lee et al., 2 Sep 2025).
1. Dataset Specification and Structure
HelpSteer2 comprises prompt–response pairs with comprehensive human ratings and pairwise preference annotations. The standard ratings dataset consists of 10,681 prompts, each paired with two candidate responses, resulting in 21,362 annotated entries (Wang et al., 2024a; 2024b). Additional splits include:
- “train_ratings.jsonl”: 20,000 response ratings
- “validation_ratings.jsonl”: reserved for evaluation
- “train_preferences.jsonl”: 6,766 pairwise preference comparisons
- “validation_preferences.jsonl”: 352 preference comparisons
The schema for a single rating entry is:
```
{
  "prompt": "<string>",
  "response": "<string>",
  "helpfulness": <int 0–4>,
  "correctness": <int 0–4>,
  "coherence": <int 0–4>,
  "complexity": <int 0–4>,
  "verbosity": <int 0–4>
}
```
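As a minimal sketch, a rating entry can be validated against this schema before training; the function and field names below are illustrative, not part of the released tooling:

```python
RATING_FIELDS = ("helpfulness", "correctness", "coherence", "complexity", "verbosity")

def validate_rating(entry: dict) -> None:
    """Check that a HelpSteer2-style rating entry matches the schema above."""
    assert isinstance(entry["prompt"], str) and isinstance(entry["response"], str)
    for field in RATING_FIELDS:
        score = entry[field]
        # Each attribute is an integer Likert score in the range 0-4.
        assert isinstance(score, int) and 0 <= score <= 4, f"{field} out of range: {score}"

example = {
    "prompt": "Explain TCP slow start.",
    "response": "TCP slow start ramps up the congestion window gradually...",
    "helpfulness": 3, "correctness": 4, "coherence": 4,
    "complexity": 2, "verbosity": 2,
}
validate_rating(example)  # passes silently for a well-formed entry
```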
Preference examples additionally specify both responses, an ordinal preference label ranging from $-3$ to $+3$, and a free-text justification:
```
{
  "prompt": "<string>",
  "response_1": "<string>",
  "response_2": "<string>",
  "preference": <int -3 … +3>,
  "justification": "<string>"
}
```
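A preference entry maps naturally onto a chosen/rejected pair for Bradley–Terry training. The sketch below assumes the convention that negative labels favor `response_1` and positive labels favor `response_2`, with the magnitude encoding preference strength:

```python
def to_bt_pair(entry: dict):
    """Map a signed preference label to (chosen, rejected, strength).

    Assumed convention: negative labels favor response_1, positive favor
    response_2; magnitude (1-3) is the preference strength. Ties (0) are
    skipped since they carry no ranking signal.
    """
    p = entry["preference"]
    if p == 0:
        return None  # no preference; unusable for pairwise training
    chosen, rejected = (
        (entry["response_2"], entry["response_1"]) if p > 0
        else (entry["response_1"], entry["response_2"])
    )
    return {"prompt": entry["prompt"], "chosen": chosen,
            "rejected": rejected, "strength": abs(p)}
```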
All data files are hosted at https://huggingface.co/datasets/nvidia/HelpSteer2.
2. Data Collection, Annotation, and Quality Control
Prompts are sampled primarily (≈95%) from the ShareGPT corpus, with the remainder representing enterprise-style tasks such as summarization, extraction, and QA (Wang et al., 2024). Non-English prompts are filtered using FastText; coding prompts are filtered by heuristics. Prompts are clustered into ~1,000 topics via BERTopic and distributed uniformly to ensure topical diversity. Complexity balancing is achieved with outputs from the Nemotron-2-43B classifier and deliberate sampling across complexity bins. Approximately 29% of prompts are multi-turn, with assistant turns replaced by domain-adapted completions.
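The topic-uniform distribution step can be sketched as a round-robin draw across clusters; the cluster assignment itself would come from BERTopic, which is abstracted here as a `topic_of` callback (an assumption for illustration):

```python
from collections import defaultdict
from itertools import chain, zip_longest

def sample_uniform_over_topics(prompts, topic_of, k):
    """Draw up to k prompts, round-robin across topic clusters, so that no
    single topic dominates the sample (mimicking topic-balanced sampling)."""
    buckets = defaultdict(list)
    for p in prompts:
        buckets[topic_of(p)].append(p)
    # Interleave one prompt per cluster per round; zip_longest pads short
    # clusters with None, which we filter out.
    interleaved = chain.from_iterable(zip_longest(*buckets.values()))
    return [p for p in interleaved if p is not None][:k]
```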
Responses are contributed by multiple sources:
- Internal LLMs (Megatron/NeMo-Aligner; Nemotron-2/3/4, Mixtral-8x7B-Instruct)
- Human raters (Scale AI)
- Responses per model source: Nemotron-2-43B (18.9%), Nemotron-3 (40.4%), Nemotron-4 (26.9%), Mixtral-8x7B-Instruct (7.9%), Scale AI (5.9%).
Each response is evaluated on five dimensions: helpfulness, correctness, coherence, complexity, and verbosity (Likert 0–4) (Wang et al., 2024, Lee et al., 2 Sep 2025). At least three annotators score each response, with outlier handling and mean aggregation enforced. Items with high variance in helpfulness ratings are flagged for further review and re-annotation. Quality control excludes ≈10% of pairs with high disagreement and ≈50% of the initial raw annotations. Annotator agreement is measured via quadratic-weighted Cohen's κ on the post-processed annotations, reaching $0.878$ for preferences.
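The aggregation step can be sketched as follows. The outlier rule here (drop the single score furthest from the median when the spread is large) is an illustrative assumption, not the paper's exact protocol:

```python
from statistics import mean, median

def aggregate_scores(scores, max_spread=2):
    """Average >=3 annotator scores, dropping one outlier when spread is high.

    Illustrative rule: if max-min exceeds max_spread, remove the score
    furthest from the median before averaging.
    """
    assert len(scores) >= 3, "protocol requires at least three annotators"
    if max(scores) - min(scores) > max_spread:
        outlier = max(scores, key=lambda s: abs(s - median(scores)))
        scores = list(scores)
        scores.remove(outlier)
    return mean(scores)
```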
Preference annotation presents annotators with both responses and a forced-choice scale, requiring selection of a preference strength and a free-text justification. Outlier- and spread-based filtering is then applied, excluding pairs whose inter-annotator spread or mean preference falls outside the protocol's thresholds.
3. Modeling Paradigms Enabled by HelpSteer2
HelpSteer2 was designed to support two principal reward modeling paradigms:
Regression-Style
Each prompt–response pair is scored on each of the five attribute dimensions. Supervised regression RMs predict the attribute vector and are trained to minimize mean-squared error:

$$\mathcal{L}_{\mathrm{MSE}} = \frac{1}{N}\sum_{i=1}^{N}\left\| r_\theta(x_i, y_i) - s_i \right\|^2$$

where $r_\theta(x_i, y_i)$ is the predicted attribute vector for prompt $x_i$ and response $y_i$, and $s_i$ is the vector of annotated ratings.
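A minimal sketch of the mean-squared-error objective over a batch of five-dimensional attribute vectors (plain Python, no framework assumed):

```python
def mse_loss(predicted, target):
    """Mean-squared error between predicted and annotated attribute vectors,
    summed over dimensions and averaged over the batch."""
    n = len(predicted)
    return sum(
        sum((p - t) ** 2 for p, t in zip(pred_vec, tgt_vec))
        for pred_vec, tgt_vec in zip(predicted, target)
    ) / n
```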
Bradley–Terry (Pairwise Preference)
Preference annotations enable training of BT models with the objective

$$\mathcal{L}_{\mathrm{BT}} = -\log \sigma\big(r_\theta(x, y_c) - r_\theta(x, y_r)\big)$$

where $y_c$ is the chosen response, $y_r$ the rejected response, and $\sigma$ is the sigmoid function. Variants include Margin-BT and Scaled-BT, which leverage the annotated preference strength $m \in \{1, 2, 3\}$:

$$\mathcal{L}_{\mathrm{Margin\text{-}BT}} = -\log \sigma\big(r_\theta(x, y_c) - r_\theta(x, y_r) - m\big), \qquad \mathcal{L}_{\mathrm{Scaled\text{-}BT}} = -\,m \log \sigma\big(r_\theta(x, y_c) - r_\theta(x, y_r)\big)$$
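A scalar sketch of the plain and strength-scaled BT losses, assuming reward values have already been computed for the chosen and rejected responses:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def bt_loss(r_chosen, r_rejected):
    """Plain Bradley-Terry negative log-likelihood for one pair."""
    return -math.log(sigmoid(r_chosen - r_rejected))

def scaled_bt_loss(r_chosen, r_rejected, strength):
    """Scaled-BT: weight the BT loss by the annotated preference strength."""
    return strength * bt_loss(r_chosen, r_rejected)
```

A larger reward gap in favor of the chosen response yields a smaller loss, and stronger annotated preferences contribute proportionally more gradient signal.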
Combined modeling employs two-stage synergy: regression pretraining followed by BT fine-tuning, optionally combined with "ExPO" (Weak-to-Strong Extrapolation), in which a grid-searched extrapolation factor $\alpha$ pushes the weights beyond the fine-tuned checkpoint:

$$\theta_{\mathrm{ExPO}} = \theta_{\mathrm{ft}} + \alpha\,(\theta_{\mathrm{ft}} - \theta_{\mathrm{init}})$$

where $\theta_{\mathrm{init}}$ and $\theta_{\mathrm{ft}}$ denote the model weights before and after fine-tuning.
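The extrapolation step itself is a simple per-parameter operation; the sketch below operates over flat weight lists for illustration (real checkpoints would be tensors):

```python
def expo_weights(theta_init, theta_ft, alpha):
    """Weak-to-strong extrapolation: move past the fine-tuned weights along
    the fine-tuning direction. alpha = 0 recovers the fine-tuned model;
    alpha is chosen by grid search on a validation set."""
    return [ft + alpha * (ft - init) for init, ft in zip(theta_init, theta_ft)]
```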
These approaches enable direct head-to-head, apples-to-apples comparison across modeling paradigms (Wang et al., 2024).
4. Benchmark Results and Empirical Impact
Reward models trained on HelpSteer2 achieve state-of-the-art results on public LLM evaluation benchmarks. Notably:
- Nemotron-4 340B (HelpSteer2): 92.0% overall accuracy (RewardBench, primary set) (Wang et al., 2024)
- Llama-3 70B: 88.8%
- Llama-3.1-70B-Instruct (Scaled-BT+ExPO): 94.1% (#1 on RewardBench as of 01 Oct 2024) (Wang et al., 2024)
- RLHF alignment of Llama-3.1-70B via REINFORCE yielded 85.0% Arena Hard win-rate versus GPT-4o (2024-05-13), surpassing GPT-4o and Claude-3.5-Sonnet on this task.
External open models (Open Assistant, HH-RLHF) and proprietary models (Cohere, Gemini, GPT-4 variants) score lower on RewardBench, underscoring HelpSteer2’s annotation and modeling effectiveness.
5. Integration, Usage, and Reproducibility
HelpSteer2 and HelpSteer2-Preference are openly available via Hugging Face under CC-BY-4.0, supporting both academic and commercial applications. The canonical repository is https://huggingface.co/datasets/nvidia/HelpSteer2. Data can be loaded programmatically using Hugging Face Datasets:
```python
from datasets import load_dataset

ds = load_dataset("nvidia/HelpSteer2")
train_ratings = ds["train_ratings"]
val_ratings = ds["validation_ratings"]
train_prefs = ds["train_preferences"]
val_prefs = ds["validation_preferences"]
```
Reward models and instruct-tuned models are publicly released (e.g., https://huggingface.co/nvidia/Llama-3.1-Nemotron-70B-Reward). Full training and alignment recipes utilizing MSE, BT, Scaled-BT, ExPO, and RLHF (REINFORCE/PPO) are documented in (Wang et al., 2024a; 2024b), with source code at https://github.com/NVIDIA/NeMo-Aligner.
6. Application Domains and Significance
HelpSteer2 is primarily utilized for:
- Training reward models for RLHF, DPO, PPO
- Multi-attribute steering via SteerLM 2.0 (policy optimization over 5D attribute vectors)
- Pairwise and multi-metric prompt optimization (e.g., CRPO: retrieval-augmented contrastive reasoning (Lee et al., 2 Sep 2025))
- Comparative evaluation and benchmarking for instruction-following, factuality, coherence, and other qualitative dimensions
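For RLHF-style use, the five attribute predictions are typically collapsed into a single scalar reward via a weighted sum. The weights below are hypothetical placeholders for illustration, not the released models' calibrated values:

```python
# Hypothetical attribute weights -- illustrative only; a real deployment
# would calibrate these against a validation benchmark.
WEIGHTS = {"helpfulness": 0.65, "correctness": 0.2, "coherence": 0.1,
           "complexity": 0.05, "verbosity": -0.05}

def scalar_reward(attrs: dict) -> float:
    """Collapse a five-attribute prediction into one scalar for RLHF."""
    return sum(WEIGHTS[k] * attrs[k] for k in WEIGHTS)
```

Note the negative weight on verbosity in this sketch: penalizing verbosity counteracts the tendency of preference-trained rewards to favor longer responses.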
Its data efficiency (≈10k prompt–response pairs), high inter-annotator agreement, multi-attribute design, and compatibility with both regression and BT modeling paradigms position HelpSteer2 as a standard resource for preference modeling, reward function learning, and steerable LLM alignment.
7. Related Resources and Datasets
HelpSteer2 augments and supersedes prior open preference datasets such as Open Assistant and HH-RLHF, offering an order of magnitude greater annotation efficiency and higher agreement (Wang et al., 2024a; 2024b). It enables direct paradigm comparison, multi-attribute steering, and integration into iterative alignment workflows. Model outputs and steering recipes trained on HelpSteer2 demonstrate robust generalization and improved alignment across domains, informing the development and evaluation of instruction-tuned LLMs.