RewardBench 2: Advanced RM Benchmark
- RewardBench 2 is a robust, multi-domain benchmark that assesses reward model accuracy for language models using a contamination-aware, best-of-4 evaluation format.
- It combines extensive human-sourced prompts with multi-model generated completions and rigorous filtering across domains like factuality, math, safety, and instruction following.
- The benchmark shows strong correlation with downstream LLM performance, providing actionable insights for model selection and RLHF training improvements.
RewardBench 2 is a comprehensive, multi-domain benchmark designed to rigorously evaluate reward models (RMs) for LLMs, with the explicit goal of achieving stronger alignment between offline reward model evaluation and downstream real-world model performance. It represents a significant advancement over prior evaluation frameworks, introducing new data, harder evaluation formats, and a direct focus on generalization, robustness, and practical relevance in domains ranging from factuality and safety to mathematical reasoning and precise instruction following.
1. Benchmark Design and Dataset Construction
RewardBench 2 was constructed through a multi-stage, contamination-aware pipeline focusing on quality, difficulty, and representativeness.
- Prompt sourcing:
70% of prompts are new, unreleased human-written queries collected through the WildChat pipeline, ensuring minimal overlap with existing evaluation sets and avoiding contamination between evaluation and training data. Prompts are domain-annotated using classifiers and manual inspection, with further filtering to maximize domain coverage. Prompts that overlap with more than 20 widely used downstream benchmarks are removed via the Tulu 3 decontamination toolkit (a generic overlap-check sketch follows this list).
- Dataset size and coverage:
The final dataset includes 1,865 prompts, carefully curated from an initial pool of ~3,000 human-written queries.
- Completion generation:
For each prompt, completions are generated by a pool of 20+ leading open- and closed-weight LLMs, supplemented by human-written completions in select domains. Completions are labeled “correct” (chosen) or “incorrect” (rejected) via LLM judges, majority voting, or rubric-based manual verification (a majority-voting sketch also follows this list).
- Domain structure:
Six domains are covered—Factuality, Precise Instruction Following, Math, Safety, Focus, and Ties—with domain-specific acquisition and filtering strategies:
| Domain                        | Prompt Source | Completion Filtering                   |
|-------------------------------|---------------|----------------------------------------|
| Factuality                    | Human         | Multi-LM-as-a-judge                    |
| Precise Instruction Following | Human         | Verifier functions                     |
| Math                          | Human         | Majority voting                        |
| Safety                        | CoCoNot       | LM-as-a-judge rubrics                  |
| Focus                         | Human         | System prompt variation (no filtering) |
| Ties                          | Manual        | Manual verification                    |
- Best-of-4 evaluation format:
Each prompt is paired with four completions—one correct and three incorrect (25% random baseline). In the Ties domain, several completions may be equally correct, rewarding RMs that avoid penalizing distinct but equally valid answers.
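To make the decontamination step concrete, the following is a minimal sketch of a generic n-gram overlap check; it is illustrative only and does not reproduce the Tulu 3 toolkit's actual API (the n-gram size, threshold, and example prompts are assumptions).

```python
# Minimal n-gram-overlap decontamination sketch (illustrative; not the Tulu 3 toolkit's API).

def ngrams(text: str, n: int = 8) -> set:
    """Return the set of lowercase word n-grams in a string."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def is_contaminated(prompt: str, benchmark_prompts: list, n: int = 8, threshold: float = 0.5) -> bool:
    """Flag a candidate prompt if a large share of its n-grams appear in any benchmark prompt."""
    candidate = ngrams(prompt, n)
    if not candidate:
        return False
    return any(
        len(candidate & ngrams(bench, n)) / len(candidate) >= threshold
        for bench in benchmark_prompts
    )

# Assumed example prompts: keep only candidates that do not overlap with downstream evaluation sets.
raw_prompts = [
    "Explain how gradient descent works in simple terms for a beginner.",
    "What is the capital of France and why is it historically important?",
]
downstream_prompts = ["What is the capital of France and why is it historically important?"]
clean_prompts = [p for p in raw_prompts if not is_contaminated(p, downstream_prompts)]
print(clean_prompts)  # only the first prompt survives
```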
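Similarly, the majority-voting filter used for the Math domain can be pictured as: extract each model's final answer, treat the most common answer as presumptively correct, and label completions accordingly. This is a hedged reconstruction of the general idea under an assumed data layout, not the paper's exact pipeline.

```python
from collections import Counter

# Illustrative majority-vote labeling (assumed data layout; not the paper's exact pipeline).
completions = [
    {"model": "model_a", "final_answer": "42"},
    {"model": "model_b", "final_answer": "42"},
    {"model": "model_c", "final_answer": "17"},
    {"model": "model_d", "final_answer": "42"},
]

# The most common extracted final answer is treated as presumptively correct.
majority_answer, _ = Counter(c["final_answer"] for c in completions).most_common(1)[0]

# Completions agreeing with the majority become "chosen" candidates; the rest are rejected.
for c in completions:
    c["label"] = "chosen" if c["final_answer"] == majority_answer else "rejected"

print(majority_answer, [c["label"] for c in completions])
```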
2. Evaluation Methodology and Metrics
RewardBench 2 employs a challenging, best-of-4 evaluation format, increasing the discriminative power of the benchmark compared to earlier pairwise test sets.
- Primary metric:
Accuracy is defined as the percentage of prompts on which the RM assigns its highest score to the “chosen” (correct) completion among the four options, reported per domain (excluding Ties, which uses a weighted formula to account for multiple correct completions); a minimal computation sketch follows this section’s list.
- Random baseline:
25% (versus 50% for pairwise evaluation), which reduces ceiling effects and better exposes practical differences between models.
- Domain-specific metrics:
Accuracy is reported for each domain; the Ties score combines the ability to rank correct completions above incorrect ones with the ability not to spuriously prefer one valid answer over another.
- General correlation:
RewardBench 2 accuracy is evaluated for correlation with downstream use-cases, particularly best-of-N (BoN) sampling and RLHF policy gradient training.
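The primary metric amounts to an argmax check over the reward model's scores for the four completions. The sketch below assumes a generic score_fn(prompt, completion) -> float and a simple example layout; the Ties domain's weighted variant is not reproduced.

```python
import random

# Best-of-4 accuracy sketch (assumed data layout and a generic score_fn; the Ties-weighted formula is omitted).
def best_of_4_accuracy(examples, score_fn):
    """examples: iterable of dicts with 'prompt', 'completions' (list of 4 strings),
    and 'chosen_idx' (index of the correct completion)."""
    hits, total = 0, 0
    for ex in examples:
        scores = [score_fn(ex["prompt"], c) for c in ex["completions"]]
        predicted = max(range(len(scores)), key=scores.__getitem__)
        hits += int(predicted == ex["chosen_idx"])
        total += 1
    return hits / total if total else 0.0

# A random-scoring "reward model" lands near the 25% baseline noted above.
dummy_examples = [
    {"prompt": f"p{i}", "completions": ["a", "b", "c", "d"], "chosen_idx": 0}
    for i in range(1000)
]
random_score = lambda prompt, completion: random.random()
print(best_of_4_accuracy(dummy_examples, random_score))  # roughly 0.25
```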
3. Downstream Correlation and Practical Relevance
A distinguishing characteristic of RewardBench 2 is its tight linkage with downstream LLM performance, assessed under both inference-time selection (best-of-N) and RLHF training.
- Best-of-N (BoN) inference correlation:
Across 113 reward models, RewardBench 2 scores correlate strongly with BoN performance on major benchmarks (e.g., GSM8K, MATH, HumanEval+, BBH), with an average Pearson correlation of 0.87 (a correlation sketch follows this section’s list).
- RLHF (PPO) performance:
Seventeen reward models and their associated PPO-trained policies were benchmarked. A strong offline RewardBench 2 score appears necessary for on-policy RLHF success but does not guarantee it: out-of-family or distribution-mismatched reward models can exhibit performance drops even with high offline scores.
- Domain-specific predictiveness:
The Factuality subset exhibits the strongest correlation with general downstream performance; Math is most predictive for math/coding tasks.
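The reported downstream correlations are standard Pearson coefficients computed across reward models between RewardBench 2 accuracy and downstream (e.g., best-of-N) scores. The numbers below are made-up placeholders to show the computation, not the paper's data.

```python
import numpy as np

# Placeholder per-reward-model scores (not the paper's data):
# RewardBench 2 accuracy vs. average best-of-N downstream score.
rb2_accuracy = np.array([0.62, 0.68, 0.71, 0.74, 0.77])
bon_downstream = np.array([0.55, 0.58, 0.63, 0.66, 0.70])

# Pearson correlation coefficient, the statistic behind the ~0.87 average reported above.
r = np.corrcoef(rb2_accuracy, bon_downstream)[0, 1]
print(f"Pearson r = {r:.2f}")
```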
4. Comparative Analysis with Prior Evaluation Frameworks
RewardBench 2 departs from prior reward model benchmarks on several key axes:
- Unseen, human prompts:
Unlike RM-Bench, RewardBench (v1), and RMB, which heavily reuse prompts from evaluation sets or rely on pairwise formats, RewardBench 2 prioritizes freshness and non-overlap with downstream benchmarks.
- Best-of-N over pairwise:
The 1-of-4 (best-of-4) setting is harder and less susceptible to accuracy inflation from random guessing, yielding more reliable differentiation between reward models.
- Domain breadth and tailored filtering:
Comprehensive domains, LLM-based and manual verification, and robust class balance further distinguish RewardBench 2.
- Summary comparison:
| Benchmark     | Best-of-N | Human Prompts | Unseen Prompts | Multi-Skill | Main Metric |
|---------------|-----------|---------------|----------------|-------------|-------------|
| RewardBench   | N         | N             | N              | Y           | Accuracy    |
| RM-Bench      | N         | N             | N              | Y           | Accuracy    |
| RMB           | Y         | Y             | N              | Y           | Accuracy    |
| RewardBench 2 | Y         | Y             | Y              | Y           | Accuracy    |
5. Performance of Existing Models
RewardBench 2 is intentionally challenging: even leading models from the previous RewardBench leaderboard experience up to a 20-point drop in accuracy.
- Top model scores (selected):
| Model                                  | Avg. | Factuality | IF   | Math | Safety | Focus | Ties |
|----------------------------------------|------|------------|------|------|--------|-------|------|
| google/gemini-2.5-flash-preview-04-17* | 77.2 | 65.7       | 55.3 | 81.1 | 90.9   | 86.7  | 83.4 |
| nicolinho/QRM-Gemma-2-27B              | 76.7 | 78.5       | 37.2 | 69.9 | 95.8   | 95.4  | 83.2 |
| infly/INF-ORM-Llama3.1-70B             | 76.5 | 74.1       | 41.9 | 69.9 | 96.4   | 90.3  | 86.2 |
| anthropic/claude-opus-4-20250514*      | 76.5 | 82.7       | 41.9 | 74.9 | 89.5   | 86.2  | 83.7 |
| openai/gpt-4o-2024-08-06*              | 64.9 | 56.8       | 33.1 | 62.3 | 86.2   | 72.9  | 78.2 |
- Domain trends:
Performance is lowest on Precise Instruction Following (~35–45%), moderate on Math (~60–81%), and higher on Safety and Focus, but no model approaches a perfect score.
- Incremental training observations:
Training reward models for more than one epoch continues to yield accuracy improvements, contravening earlier best practices of using minimal epoch counts.
6. Impact, Limitations, and Recommendations
RewardBench 2 establishes a new standard in reward model evaluation, with practical implications for both researchers and practitioners.
- Offline accuracy as a predictor:
When reward models are used for inference-time scaling (best-of-N sampling), RewardBench 2 accuracy can reliably guide model selection, accelerating model development and “hill climbing” for LLM alignment (a best-of-N sketch follows this section’s list).
- Cautions for RLHF policy training:
For policy-gradient methods (e.g., PPO), RewardBench 2 score is necessary but not sufficient; alignment between policy model, reward model, and the training distribution is essential for optimal RLHF outcomes.
- Field recommendations:
Practitioners should weigh both offline leaderboard performance and the reward model’s training and data recipe when applying it to RLHF (e.g., retraining with similar data, checking for domain/format alignment).
- Future evolution:
RewardBench 2’s rigorous construction and demonstrated correlation with practical performance set a new bar; future iterations should maintain decontamination, multi-domain design, and challenging evaluation formats as model and task diversity grow.
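Best-of-N selection, the inference-time use case where RewardBench 2 accuracy transfers most directly, amounts to sampling N candidate responses from a policy model and returning the one the reward model scores highest. In the sketch below, generate_fn and score_fn are assumed stand-ins for a real policy model and reward model.

```python
import random

# Best-of-N selection sketch; generate_fn and score_fn stand in for a real policy and reward model.
def best_of_n(prompt, generate_fn, score_fn, n: int = 4) -> str:
    """Sample n candidate responses and return the one the reward model prefers."""
    candidates = [generate_fn(prompt) for _ in range(n)]
    scores = [score_fn(prompt, c) for c in candidates]
    return candidates[scores.index(max(scores))]

# Toy usage with stand-in functions (placeholders, not real models).
toy_generate = lambda prompt: f"candidate response {random.randint(0, 9)}"
toy_score = lambda prompt, completion: random.random()  # a real RM would return a learned scalar score
print(best_of_n("Explain beam search in one sentence.", toy_generate, toy_score, n=4))
```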
7. Accessibility and Resources
RewardBench 2 is fully open source, with all data, code, and documentation available for the community.
- Code and data repository:
https://github.com/allenai/reward-bench
- Dataset:
https://huggingface.co/datasets/allenai/reward-bench-2
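A minimal way to pull the dataset is via the standard Hugging Face datasets library; the snippet below only loads and prints the dataset object, since split and field names should be checked against the dataset card rather than assumed.

```python
from datasets import load_dataset

# Load RewardBench 2 from the Hugging Face Hub (requires the `datasets` package).
ds = load_dataset("allenai/reward-bench-2")

# Inspect available splits and columns; consult the dataset card for exact field semantics.
print(ds)
```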
8. Illustrative Figures and Tables
- Main benchmark comparison figure
- Domain-wide correlation grid
- PPO performance comparison
- Domain and model accuracy table (see Table 3 in the paper for exact scores)
In summary, RewardBench 2 is a rigorously designed, multi-skill, contamination-aware benchmark enabling accurate, practical, and challenging evaluation of reward models across instruction following, factuality, reasoning, safety, and nuanced preference domains. It delivers strong predictive power for downstream best-of-N selection, provides critical diagnostic insight for RLHF pipelines, and sets a new standard for robust reward model evaluation in language modeling.