
RewardBench 2: Advanced RM Benchmark

Updated 30 June 2025
  • RewardBench 2 is a robust, multi-domain benchmark designed to assess the accuracy of reward models for language models using a contamination-aware, best-of-4 evaluation format.
  • It combines extensive human-sourced prompts with multi-model generated completions and rigorous filtering across domains like factuality, math, safety, and instruction following.
  • The benchmark shows strong correlation with downstream LLM performance, providing actionable insights for model selection and RLHF training improvements.

RewardBench 2 is a comprehensive, multi-domain benchmark designed to rigorously evaluate reward models (RMs) for LLMs, with the explicit goal of achieving stronger alignment between offline reward model evaluation and downstream real-world model performance. It represents a significant advancement over prior evaluation frameworks, introducing new data, harder evaluation formats, and a direct focus on generalization, robustness, and practical relevance in domains ranging from factuality and safety to mathematical reasoning and precise instruction following.

1. Benchmark Design and Dataset Construction

RewardBench 2 was constructed through a multi-stage, contamination-aware pipeline focusing on quality, difficulty, and representativeness.

  • Prompt sourcing:

70% of prompts are new, unreleased human-written queries from the WildChat pipeline, ensuring minimal overlap with existing evaluation sets and avoiding contamination between evaluation and training data. Prompts are domain-annotated using classifiers and manual inspection, with further filtering to maximize domain coverage. Prompts that overlap with any of more than 20 widely used downstream benchmarks are removed via the Tulu 3 decontamination toolkit.
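
The decontamination step can be pictured as an n-gram overlap check against benchmark documents. The sketch below is illustrative only and is not the Tulu 3 toolkit's actual interface; the n-gram size and overlap threshold are assumptions.

```python
# Illustrative sketch of n-gram-overlap decontamination (not the actual
# Tulu 3 toolkit): flag an eval prompt if it shares too many 8-grams
# with any downstream benchmark document.
from typing import Iterable, Set, Tuple

def ngrams(text: str, n: int = 8) -> Set[Tuple[str, ...]]:
    toks = text.lower().split()
    return {tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)}

def is_contaminated(prompt: str, benchmark_docs: Iterable[str],
                    n: int = 8, overlap_threshold: float = 0.5) -> bool:
    p_grams = ngrams(prompt, n)
    if not p_grams:
        return False
    for doc in benchmark_docs:
        shared = len(p_grams & ngrams(doc, n))
        if shared / len(p_grams) >= overlap_threshold:
            return True
    return False

# Keep only prompts that do not overlap with downstream benchmarks:
# clean_prompts = [p for p in prompts if not is_contaminated(p, benchmark_docs)]
```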

  • Dataset size and coverage:

The final dataset includes 1,865 prompts, carefully curated from an initial pool of ~3,000 human-written queries.

  • Completion generation:

For each prompt, completions are generated by a pool of 20+ leading open- and closed-weight LLMs, supplemented by human-written completions for select domains. Completions are classified as “correct” (chosen) or “incorrect” (rejected) via LLM judging, majority voting, or rubric-based manual verification.
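
As an illustration of the majority-voting route used for math prompts, the sketch below labels completions by whether their extracted final answer matches the majority answer; the answer-extraction helper is hypothetical and not the authors' exact pipeline.

```python
# Illustrative sketch of majority-vote labeling for math completions
# (not the authors' exact pipeline): a completion is labeled correct if
# its final answer matches the majority answer across all completions.
from collections import Counter
import re

def extract_answer(completion: str) -> str:
    """Hypothetical helper: treat the last number in the completion as its answer."""
    numbers = re.findall(r"-?\d+(?:\.\d+)?", completion)
    return numbers[-1] if numbers else ""

def label_by_majority(completions: list) -> list:
    answers = [extract_answer(c) for c in completions]
    majority_answer, _ = Counter(answers).most_common(1)[0]
    return [a == majority_answer for a in answers]

print(label_by_majority([
    "2 + 2 = 4", "The answer is 4", "I think it's 5", "So the result is 4",
]))  # [True, True, False, True]
```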

  • Domain structure:

Six domains are covered—Factuality, Precise Instruction Following, Math, Safety, Focus, and Ties—with domain-specific acquisition and filtering strategies:

| Domain                        | Prompt Source | Completion Filtering                    |
|-------------------------------|---------------|-----------------------------------------|
| Factuality                    | Human         | Multi-LM-as-a-judge                     |
| Precise Instruction Following | Human         | Verifier functions                      |
| Math                          | Human         | Majority voting                         |
| Safety                        | CoCoNot       | LM-as-a-judge rubrics                   |
| Focus                         | Human         | System prompt variation (no filtering)  |
| Ties                          | Manual        | Manual verification                     |

  • Best-of-4 evaluation format:

Each prompt is paired with four completions—one correct and three incorrect (25% random baseline). In the Ties domain, several completions may be equally correct, rewarding RMs that avoid penalizing distinct but equally valid answers.
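
A minimal sketch of how best-of-4 accuracy can be computed for a scalar reward model is shown below; the example fields (prompt, chosen, rejected) are illustrative and not necessarily the dataset's actual schema.

```python
# Minimal sketch of best-of-4 scoring: the reward model must rank the
# single "chosen" completion above the three "rejected" ones.
import random
from typing import Callable, Dict, List

def best_of_4_accuracy(examples: List[Dict], score: Callable[[str, str], float]) -> float:
    correct = 0
    for ex in examples:
        candidates = [ex["chosen"]] + ex["rejected"]      # 1 correct + 3 incorrect
        scores = [score(ex["prompt"], c) for c in candidates]
        correct += int(scores.index(max(scores)) == 0)    # did the RM pick the chosen one?
    return correct / len(examples)

# A random scorer lands near the 25% baseline:
dummy = [{"prompt": "p", "chosen": "a", "rejected": ["b", "c", "d"]} for _ in range(1000)]
print(best_of_4_accuracy(dummy, lambda p, c: random.random()))  # ~0.25
```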

2. Evaluation Methodology and Metrics

RewardBench 2 employs a challenging, best-of-4 evaluation format, increasing the discriminative power of the benchmark compared to earlier pairwise test sets.

  • Primary metric:

Accuracy is defined as the percentage of times the RM selects the “chosen” (correct) completion out of four options in each domain (excluding Ties, which uses a weighted formula to account for multiple correct completions).

  • Random baseline:

The random baseline is 25% (versus 50% for pairwise evaluation), which reduces ceiling effects and highlights practical differences between models.

  • Domain-specific metrics:

Each domain’s accuracy is reported; the Ties score combines the ability to prefer correct over incorrect completions with the ability to avoid spuriously discriminating among equally valid answers.

  • General correlation:

RewardBench 2 accuracy is evaluated for correlation with downstream use-cases, particularly best-of-N (BoN) sampling and RLHF policy gradient training.

3. Downstream Correlation and Practical Relevance

A distinguishing characteristic of RewardBench 2 is its tight linkage to downstream LLM performance under both inference-time selection (best-of-N) and RLHF training regimes.

  • Best-of-N (BoN) correlation:

Across 113 reward models, RewardBench 2 scores correlate strongly with BoN performance on major benchmarks (e.g., GSM8K, MATH, HumanEval+, BBH), with an average Pearson correlation of 0.87.
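
As a sketch of how such a correlation is computed, the snippet below compares offline benchmark accuracies against downstream best-of-N scores with NumPy; all numbers are made up for illustration (the paper reports an average Pearson r of 0.87 across 113 reward models).

```python
# Sketch: correlating offline RewardBench 2 accuracy with downstream
# best-of-N scores across a set of reward models. The arrays below are
# invented for illustration only.
import numpy as np

rb2_accuracy   = np.array([0.77, 0.72, 0.65, 0.58, 0.51])  # offline benchmark scores
bon_downstream = np.array([0.64, 0.61, 0.55, 0.50, 0.44])  # downstream best-of-N scores

r = np.corrcoef(rb2_accuracy, bon_downstream)[0, 1]
print(f"Pearson r = {r:.2f}")
```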

  • RLHF (PPO) performance:

Seventeen reward models and their associated PPO-trained policies were benchmarked. Offline RewardBench 2 score establishes a lower bound for on-policy RLHF success; however, out-of-family or distribution-mismatched reward models can exhibit performance drops even with high offline scores.

  • Domain-specific predictiveness:

The Factuality subset exhibits the strongest correlation with general downstream performance; Math is most predictive for math/coding tasks.

4. Comparative Analysis with Prior Evaluation Frameworks

RewardBench 2 departs from prior reward model benchmarks on several key axes:

  • Unseen, human prompts:

Unlike RM-Bench, RewardBench (v1), and RMB, which heavily reuse prompts from evaluation sets or rely on pairwise formats, RewardBench 2 prioritizes freshness and non-overlap with downstream benchmarks.

  • Best-of-N over pairwise:

The 1-of-4 (best-of-4) setting is harder and less susceptible to random accuracy inflation, yielding more reliable differentiation between reward models.

  • Domain breadth and tailored filtering:

Comprehensive domains, LLM-based and manual verification, and robust class balance further distinguish RewardBench 2.

  • Summary comparison:

| Benchmark     | Best-of-N | Human Prompts | Unseen Prompts | Multi Skill | Main Metric |
|---------------|-----------|---------------|----------------|-------------|-------------|
| RewardBench   | N         | N             | N              | Y           | Accuracy    |
| RM-Bench      | N         | N             | N              | Y           | Accuracy    |
| RMB           | Y         | Y             | N              | Y           | Accuracy    |
| RewardBench 2 | Y         | Y             | Y              | Y           | Accuracy    |

5. Performance of Existing Models

RewardBench 2 is intentionally challenging: even leading models from the previous RewardBench leaderboard experience up to a 20-point drop in accuracy.

  • Top model scores (selected):

| Model                                   | Avg. | Factuality | IF   | Math | Safety | Focus | Ties |
|-----------------------------------------|------|------------|------|------|--------|-------|------|
| google/gemini-2.5-flash-preview-04-17*  | 77.2 | 65.7       | 55.3 | 81.1 | 90.9   | 86.7  | 83.4 |
| nicolinho/QRM-Gemma-2-27B               | 76.7 | 78.5       | 37.2 | 69.9 | 95.8   | 95.4  | 83.2 |
| infly/INF-ORM-Llama3.1-70B              | 76.5 | 74.1       | 41.9 | 69.9 | 96.4   | 90.3  | 86.2 |
| anthropic/claude-opus-4-20250514*       | 76.5 | 82.7       | 41.9 | 74.9 | 89.5   | 86.2  | 83.7 |
| openai/gpt-4o-2024-08-06*               | 64.9 | 56.8       | 33.1 | 62.3 | 86.2   | 72.9  | 78.2 |

  • Domain trends:

Performance is lowest on Precise IF (~35–45%) and Math (~60–81%), higher on Safety and Focus, but no model achieves near-perfect scores.

  • Incremental training observations:

Training reward models for more than one epoch continues to yield accuracy improvements, contravening earlier best practices favoring minimal epoch counts.

6. Impact, Limitations, and Recommendations

RewardBench 2 establishes a new standard in reward model evaluation, with practical implications for both researchers and practitioners.

  • Offline accuracy as a predictor:

Used with inference-time scaling (best-of-N sampling), RewardBench 2 can reliably guide model selection, accelerating model development and “hillclimbing” for LLM alignment.
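
A minimal sketch of the best-of-N selection loop this refers to is given below; `generate` and `reward` are placeholders for model-specific calls, not a specific API.

```python
# Sketch of best-of-N (BoN) selection: sample N completions from a policy
# model, score each with a reward model, keep the highest-scoring one.
from typing import Callable, List

def best_of_n(prompt: str, n: int,
              generate: Callable[[str], str],
              reward: Callable[[str, str], float]) -> str:
    completions: List[str] = [generate(prompt) for _ in range(n)]
    return max(completions, key=lambda c: reward(prompt, c))
```

The stronger the reward model, as ranked by RewardBench 2, the more likely this argmax picks a genuinely better completion, which is why offline accuracy tracks BoN performance.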

  • Cautions for RLHF policy training:

For policy-gradient methods (e.g., PPO), RewardBench 2 score is necessary but not sufficient; alignment between policy model, reward model, and the training distribution is essential for optimal RLHF outcomes.

  • Field recommendations:

Practitioners should weigh both offline leaderboard performance and the reward model’s training and data recipe when applying reward models to RLHF (e.g., retraining on similar data, checking for domain and format alignment).

  • Future evolution:

RewardBench 2’s rigorous construction and demonstrated correlation with practical performance set a new bar; future iterations should maintain decontamination, multi-domain design, and challenging evaluation formats as model and task diversity grow.

7. Accessibility and Resources

RewardBench 2 is fully open source, with all data, code, and documentation available for the community.

  • Code and data repository:

https://github.com/allenai/reward-bench

  • Dataset:

https://huggingface.co/datasets/allenai/reward-bench-2
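
For orientation, the dataset can be loaded with the Hugging Face datasets library as in the sketch below; the dataset ID comes from the link above, while the split name and record layout are assumptions to verify against the dataset card.

```python
# Sketch: loading RewardBench 2 with Hugging Face `datasets`.
# The dataset ID matches the repository above; the split name and the
# record layout printed here are assumptions -- check the dataset card.
from datasets import load_dataset

ds = load_dataset("allenai/reward-bench-2", split="test")
print(ds.column_names)  # inspect the actual schema
print(ds[0])            # one prompt with its candidate completions
```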

8. Illustrative Figures and Tables

The paper’s illustrations include a main benchmark comparison figure, a domain-wide correlation grid, a PPO performance comparison, and a per-domain model accuracy table (see Table 3 in the paper for exact scores).


In summary, RewardBench 2 is a rigorously designed, multi-skill, contamination-aware benchmark enabling accurate, practical, and challenging evaluation of reward models across instruction following, factuality, reasoning, safety, and nuanced domains. It delivers strong predictive power for downstream best-of-N selection, provides critical diagnostic insight for RLHF pipelines, and sets a new standard for robust reward model evaluation in language modeling.