
UltraFeedback Dataset: Multi-Dimensional RLHF Data

Updated 11 January 2026
  • UltraFeedback is a large-scale, multi-dimensional preference dataset that annotates responses along helpfulness, instruction-following, honesty, and truthfulness axes.
  • It comprises approximately 64,000 single-turn prompts, over 250,000 completions from 17 LLMs, and around 340,000 pairwise comparisons to support refined reward modeling.
  • The dataset underpins reward and critique model training with applications in best-of-n sampling and PPO while revealing significant trade-offs among alignment axes.

UltraFeedback is a large-scale, multi-dimensional preference dataset constructed for advancing Reinforcement Learning from Human Feedback (RLHF) and related alignment strategies for LLMs. Developed by Cui et al. and released in 2023, UltraFeedback fundamentally extends the prevailing two-dimensional evaluation schema—typically "helpful" versus "harmless"—by providing a richer set of human alignment axes. It supports training of reward models and critique generators, facilitates best-of-n sampling, and serves as a foundation for both vanilla and more advanced preference optimization protocols. The dataset’s structure, annotation, and methodological rigor have enabled new empirical investigations into alignment conflicts, reward modeling, and preference data efficiency (Cui et al., 2023; Jiang et al., 2024; Deng et al., 2025).

1. Dataset Structure and Composition

UltraFeedback encompasses approximately 64,000 user instructions (single-turn prompts), each accompanied by four distinct model-generated responses, yielding over 250,000 completions. Instructions were drawn from six sources (TruthfulQA, FalseQA, Evol-Instruct, UltraChat, ShareGPT, and FLAN) and decontaminated to remove overlap with major evaluation benchmarks. The four completions for each prompt were sampled from a pool of 17 LLMs, ranging from open-source models such as LLaMA2 and Vicuna to proprietary models such as GPT-4.

Annotations are twofold: each (prompt, response) pair receives scalar, aspect-wise scores along four axes (helpfulness, instruction-following, honesty, truthfulness), and each completion is paired with a critique, a GPT-4-generated textual rationale accompanied by a numerical score on a 1–10 scale. Preference information is distilled into approximately 340,000 pairwise labeled comparisons, each recording the partner completion, a Boolean "chosen" label, and a margin equal to the absolute score difference.

| Property | Value | Description |
| --- | --- | --- |
| Instructions | ≈64,000 | Unique single-turn prompts after decontamination |
| Completions per prompt | 4 | Model-generated replies per instruction |
| Preference comparisons | ≈340,000 | GPT-4-judged pairwise choices derived from scalar scores |
| Scalar axes | 4 | Helpfulness, instruction-following, honesty, truthfulness |
| Critiques | ≈255,000 | GPT-4 textual rationales, one per completion |
| Model pool | 17 | LLaMA2, GPT-4, gpt-3.5-turbo, Bard, Vicuna, WizardLM, UltraLM, and others |
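A single entry in the dataset can be pictured as the following nested structure. The field names here are illustrative approximations, not the exact schema of the released files:

```python
# Sketch of one UltraFeedback record (hypothetical field names).
record = {
    "instruction": "Explain why the sky appears blue.",
    "source": "FLAN",                      # one of the six instruction pools
    "completions": [
        {
            "model": "gpt-4",              # drawn from the pool of 17 LLMs
            "response": "...",
            "annotations": {               # integer scores on a 1-5 scale
                "helpfulness": 5,
                "instruction_following": 5,
                "honesty": 4,
                "truthfulness": 5,
            },
            "critique": "...",             # GPT-4 textual rationale
            "critique_score": 9,           # overall rating on a 1-10 scale
        },
        # ...three more completions per instruction in the real data
    ],
}
```

With four completions per instruction, roughly 64,000 prompts account for the 250,000+ completions and, after pairing, the ≈340,000 comparisons reported above.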

2. Data Collection and Annotation Methodology

UltraFeedback’s data pipeline emphasizes extensive scale and domain coverage while mitigating annotation bias. Instructions were stratified across source distributions; response diversity was achieved via random sampling from a broad model pool and varying system prompts emphasizing different alignment principles. Annotations were collected via GPT-4, which, for each instruction, assigned integer scores [1–5] on the four axes to all four responses simultaneously, providing scoring calibration within a batch.

Pairwise preference labels are derived from these scalar scores. For each prompt, all six possible response pairs are compared, with binary preference and a margin based on the absolute score gap. Textual critiques are produced by directing GPT-4 to generate targeted feedback, supporting downstream critique-model training.
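The pair-derivation step above can be sketched as follows, using a single overall score per response for simplicity (the dataset's own aggregation across axes may differ):

```python
from itertools import combinations

def scores_to_pairs(scores):
    """Turn per-response scalar scores into pairwise preference labels.

    Four responses yield C(4, 2) = 6 comparisons per prompt; the margin
    is the absolute score gap between the two responses.
    """
    pairs = []
    for i, j in combinations(range(len(scores)), 2):
        if scores[i] == scores[j]:
            continue  # ties carry no preference signal
        chosen, rejected = (i, j) if scores[i] > scores[j] else (j, i)
        pairs.append({
            "chosen": chosen,
            "rejected": rejected,
            "margin": abs(scores[i] - scores[j]),
        })
    return pairs

# Four completions for one prompt -> six comparisons when scores differ.
pairs = scores_to_pairs([5, 3, 4, 2])
print(len(pairs))  # -> 6
```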

Bias in the annotations is attenuated by grouping responses for calibrated comparison and by randomizing model/prompt exposure. Further, during reward model training, the margin between the chosen and rejected responses is incorporated directly as a weight in the ranking loss.

3. Alignment Dimensions and Theoretical Implications

UltraFeedback distinguishes itself by annotating four largely orthogonal alignment dimensions:

  1. Helpfulness: Relevance and user-intent alignment.
  2. Instruction-Following: Obedience to explicit instructions.
  3. Honesty: Absence of fabrication or deception.
  4. Truthfulness: Factual accuracy.

This richer annotation schema exposes the latent trade-offs among alignment objectives—an issue not addressed by prior two-axis datasets. Empirical results show that these axes, while conceptually distinct, exhibit non-trivial alignment conflicts: improving reward model performance on one often degrades performance on another.

Jiang et al. formalized these trade-offs via the Alignment Dimension Conflict (ADC) metric, which quantifies squared performance drop across "untouched" objectives after single-axis fine-tuning. For UltraFeedback, ADC is measured at approximately 67.2%, indicating substantial competition among axes and revealing the brittleness of Ultra-trained reward models under further objective-specific fine-tuning (Jiang et al., 2024).
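One plausible reading of the ADC metric is a mean squared relative drop on the untouched axes; the exact formula is given in Jiang et al. (2024), so the sketch below is illustrative rather than a faithful implementation:

```python
def alignment_dimension_conflict(before, after_by_axis):
    """Illustrative ADC-style metric (the exact definition is in
    Jiang et al., 2024; this is one plausible reading).

    before:        dict axis -> reward-model accuracy before fine-tuning
    after_by_axis: dict tuned_axis -> dict axis -> accuracy after
                   fine-tuning on `tuned_axis` only
    Returns the mean squared relative drop on the untouched axes.
    """
    drops = []
    for tuned, after in after_by_axis.items():
        for axis, acc in after.items():
            if axis == tuned:
                continue  # only "untouched" objectives count
            drop = max(0.0, (before[axis] - acc) / before[axis])
            drops.append(drop ** 2)
    return sum(drops) / len(drops)

# Toy example: tuning on helpfulness halves honesty accuracy.
before = {"helpfulness": 0.8, "honesty": 0.8}
after = {"helpfulness": {"helpfulness": 0.9, "honesty": 0.4}}
print(alignment_dimension_conflict(before, after))  # -> 0.25
```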

4. Downstream Usage: Reward Models, PPO, and Best-of-n Sampling

UltraFeedback underpins the training of reward models (UltraRM), critique models (UltraCM), and open-domain chat models using best-of-n selection and reinforcement learning via Proximal Policy Optimization (PPO). The main reward modeling protocol employs a margin-weighted ranking loss:

\mathcal{L}_\mathrm{ranking} = -\log\Bigl(\sigma\bigl( r_\theta(x, y_c) - r_\theta(x, y_r) - m(r) \bigr)\Bigr)

where m(r) is the normalized margin. Training data comprises UltraFeedback pairs and additional preference datasets (e.g., Stanford SHP, OpenAI Summ, Anthropic Helpful).
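On a single (chosen, rejected) pair, the loss can be computed directly from the formula above; the reward model itself is stubbed out here and only the scalar rewards appear:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def margin_ranking_loss(r_chosen, r_rejected, margin):
    """Margin-weighted ranking loss for one preference pair:
    L = -log(sigmoid(r(x, y_c) - r(x, y_r) - m)).
    A larger normalized margin m demands a larger reward gap
    before the loss approaches zero."""
    return -math.log(sigmoid(r_chosen - r_rejected - margin))

# When the reward gap exactly equals the margin, the loss is ln(2).
print(margin_ranking_loss(1.0, 0.5, margin=0.5))  # -> 0.6931...
# A gap well above the margin drives the loss toward zero.
assert margin_ranking_loss(2.0, 0.0, 0.5) < margin_ranking_loss(1.0, 0.8, 0.5)
```

Folding the margin into the logit, rather than weighting the loss externally, penalizes pairs whose reward gap falls short of their annotated score gap.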

Best-of-n sampling guided by UltraRM systematically improves win-rate on AlpacaEval versus text-davinci-003, rising from 76.53% (n=1) to 91.54% (n=16). PPO-aligned models (UltraLM-13B-PPO), using UltraRM as reward, demonstrate ≥86% win rate on AlpacaEval and up to 65% versus gpt-3.5-turbo on UltraChat according to GPT-4 judgment (Cui et al., 2023).
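The best-of-n protocol itself is simple: sample n completions from the policy and keep the one the reward model scores highest. A sketch with toy stand-ins (`generate` and `reward` here are placeholders for a policy model and UltraRM, not real APIs):

```python
def best_of_n(prompt, generate, reward, n=16):
    """Sample n candidate completions and return the one with the
    highest reward-model score (best-of-n / rejection sampling)."""
    candidates = [generate(prompt) for _ in range(n)]
    return max(candidates, key=lambda y: reward(prompt, y))

# Toy stand-ins: a "policy" that cycles through fixed strings and a
# "reward model" that simply prefers longer answers.
answers = iter(["ok", "a fuller answer", "a much more detailed answer"])
generate = lambda prompt: next(answers)
reward = lambda prompt, y: len(y)

print(best_of_n("Explain RLHF.", generate, reward, n=3))
# -> "a much more detailed answer"
```

Raising n trades inference compute for quality, which matches the reported win-rate climb from n=1 to n=16.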

5. Data Selection, Margin Principles, and Data Efficiency

Subsequent work identifies data selection as a limiting factor for preference optimization protocols such as Direct Preference Optimization (DPO). Deng et al. show that UltraFeedback contains substantial redundancy and label noise, particularly in low-margin pairs, resulting in parameter shrinkage and diminished data efficiency (Deng et al., 2025).

They introduce a margin-maximization principle: samples are scored by both external reward margins (m_\mathrm{ex}) and implicit DPO logit margins (m_\mathrm{im}), and these are fused into a single pseudo-probability via Bayesian aggregation:

P(y_w \succ y_l \mid m_\mathrm{ex}, m_\mathrm{im}) = \frac{P(m_\mathrm{ex}) \, P(m_\mathrm{im})}{P(m_\mathrm{ex}) \, P(m_\mathrm{im}) + \bigl(1-P(m_\mathrm{ex})\bigr)\bigl(1-P(m_\mathrm{im})\bigr)}

Selecting only the top 10% fused-margin pairs from UltraFeedback yields 3–8% empirical improvements in alignment benchmarks, and iterative DPO with 25% online sampling consistently outperforms baselines with the full dataset—demonstrating high redundancy and the value of margin-aware data filters.
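The fusion and selection steps can be sketched directly from the formula above. Here the two margins are assumed to be mapped into (0, 1) via a sigmoid; the paper's exact calibration of P(m_ex) and P(m_im) may differ:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fused_preference_prob(m_ex, m_im):
    """Bayesian aggregation of an external reward margin and an implicit
    DPO logit margin into one pseudo-probability that y_w beats y_l."""
    p_ex, p_im = sigmoid(m_ex), sigmoid(m_im)
    return (p_ex * p_im) / (p_ex * p_im + (1 - p_ex) * (1 - p_im))

def top_fraction(pairs, frac=0.10):
    """Keep the top `frac` of (m_ex, m_im) pairs by fused probability
    (margin-maximization selection; frac=0.10 matches the 10% setting)."""
    ranked = sorted(pairs, key=lambda p: fused_preference_prob(*p), reverse=True)
    return ranked[: max(1, int(len(ranked) * frac))]

# Two agreeing positive margins fuse to higher confidence than either alone.
assert fused_preference_prob(1.0, 1.0) > sigmoid(1.0)
# Fully ambiguous margins fuse to exactly 0.5.
print(fused_preference_prob(0.0, 0.0))  # -> 0.5
```

A useful property of this aggregation is that agreeing signals reinforce each other while conflicting signals cancel toward 0.5, so low-confidence, noisy pairs naturally fall out of the selected top fraction.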

| Sampling Strategy | Data Fraction | AlpacaEval2 Length-Controlled | AlpacaEval2 Raw Win Rate |
| --- | --- | --- | --- |
| Full | 100% | 17.32% | 15.30% |
| Random 10% | 10% | 12.33% | 10.96% |
| Top 10% DM-MUL | 10% | 19.53% | 19.09% |

6. Impact, Strengths, and Limitations

UltraFeedback’s principal strengths reside in its scale, task diversity (spanning open-domain chat, code, math, knowledge retrieval, and safety), multidimensional preference annotation, and automated pipeline for both aspect-wise scoring and critique generation. Its release facilitated broad improvements in open-source RLHF research and established a foundation for further work on reward models and preference data selection.

Key limitations include pronounced inter-objective competition (high ADC), which both restricts clean disentanglement of alignment axes and makes models fine-tuned on one axis vulnerable to degradation or to adversarial prompts exploiting the misalignment. The original axes, while richer than prior datasets, lack the granularity to prevent trade-offs between concepts such as helpfulness and instruction-following.

Recommendations for future use (Jiang et al., 2024):
  • Compute ADC prior to multi-axis RLHF deployment.
  • Refine annotation axes via iterative AI-driven loops.
  • Apply reward-gap thresholding to filter low-confidence pairs.
  • Investigate contrastive or unsupervised approaches for axis discovery and minimization of mutual interference.

7. Extensions and Derivatives

UltraFeedback has served as a launchpad for further work, notably the Hummer dataset, which applies a three-stage GPT-4 pipeline to disambiguate and reduce alignment conflicts. This involves re-annotating UltraFeedback pairs across six new axes selected for near-orthogonality (accuracy, conciseness, depth, empathy, tone, specificity) and yields a lower ADC (≈9%) and higher-confidence reward model training (Jiang et al., 2024). These developments suggest UltraFeedback is most effective as a curated resource for constructing low-conflict, task-adaptive reward datasets rather than as an immutable benchmark.
