
RewardBench 2: Advanced RM Benchmark

Updated 30 June 2025
  • RewardBench 2 is a robust, multi-domain benchmark designed to assess the accuracy of reward models for language models using a contamination-aware, best-of-4 evaluation format.
  • It combines extensive human-sourced prompts with multi-model generated completions and rigorous filtering across domains like factuality, math, safety, and instruction following.
  • The benchmark shows strong correlation with downstream LLM performance, providing actionable insights for model selection and RLHF training improvements.

RewardBench 2 is a comprehensive, multi-domain benchmark designed to rigorously evaluate reward models (RMs) for LLMs, with the explicit goal of achieving stronger alignment between offline reward model evaluation and downstream real-world model performance. It represents a significant advancement over prior evaluation frameworks, introducing new data, harder evaluation formats, and a direct focus on generalization, robustness, and practical relevance in domains ranging from factuality and safety to mathematical reasoning and precise instruction following.

1. Benchmark Design and Dataset Construction

RewardBench 2 was constructed through a multi-stage, contamination-aware pipeline focusing on quality, difficulty, and representativeness.

  • Prompt sourcing:

70% of prompts are new, unreleased human-written queries from the WildChat pipeline, ensuring minimal overlap with existing evaluation sets and avoiding contamination between evaluation and training data. Prompts are domain-annotated using classifiers and manual inspection, with further filtering to maximize domain coverage. Prompts that overlap with any of more than 20 widely used downstream benchmarks are removed via the Tulu 3 decontamination toolkit.
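
The decontamination step can be pictured as an n-gram overlap check against benchmark documents. The sketch below is illustrative only and is not the Tulu 3 toolkit's actual interface; the n-gram size and overlap threshold are assumptions.

```python
# Illustrative sketch of n-gram-overlap decontamination (not the actual
# Tulu 3 toolkit): flag an eval prompt if it shares too many 8-grams
# with any downstream benchmark document.
from typing import Iterable, Set, Tuple

def ngrams(text: str, n: int = 8) -> Set[Tuple[str, ...]]:
    toks = text.lower().split()
    return {tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)}

def is_contaminated(prompt: str, benchmark_docs: Iterable[str],
                    n: int = 8, overlap_threshold: float = 0.5) -> bool:
    p_grams = ngrams(prompt, n)
    if not p_grams:
        return False
    for doc in benchmark_docs:
        shared = len(p_grams & ngrams(doc, n))
        if shared / len(p_grams) >= overlap_threshold:
            return True
    return False

# Keep only prompts that do not overlap with downstream benchmarks:
# clean_prompts = [p for p in prompts if not is_contaminated(p, benchmark_docs)]
```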

  • Dataset size and coverage:

The final dataset includes 1,865 prompts, carefully curated from an initial pool of ~3,000 human-written queries.

  • Completion generation:

For each prompt, completions are generated by a pool of 20+ leading open- and closed-weight LLMs, supplemented by human-written completions for select domains. Completions are classified as “correct” (chosen) or “incorrect” (rejected) via LLM judging, majority voting, or rubric-based manual verification.
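
As an illustration of the majority-voting route used for math prompts, the sketch below labels completions by whether their extracted final answer matches the majority answer; the answer-extraction helper is hypothetical and not the authors' exact pipeline.

```python
# Illustrative sketch of majority-vote labeling for math completions
# (not the authors' exact pipeline): a completion is labeled correct if
# its final answer matches the majority answer across all completions.
from collections import Counter
import re

def extract_answer(completion: str) -> str:
    """Hypothetical helper: treat the last number in the completion as its answer."""
    numbers = re.findall(r"-?\d+(?:\.\d+)?", completion)
    return numbers[-1] if numbers else ""

def label_by_majority(completions: list) -> list:
    answers = [extract_answer(c) for c in completions]
    majority_answer, _ = Counter(answers).most_common(1)[0]
    return [a == majority_answer for a in answers]

print(label_by_majority([
    "2 + 2 = 4", "The answer is 4", "I think it's 5", "So the result is 4",
]))  # [True, True, False, True]
```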

  • Domain structure:

Six domains are covered—Factuality, Precise Instruction Following, Math, Safety, Focus, and Ties—with domain-specific acquisition and filtering strategies:

| Domain                        | Prompt Source | Completion Filtering                    |
|-------------------------------|---------------|-----------------------------------------|
| Factuality                    | Human         | Multi-LM-as-a-judge                     |
| Precise Instruction Following | Human         | Verifier functions                      |
| Math                          | Human         | Majority voting                         |
| Safety                        | CoCoNot       | LM-as-a-judge rubrics                   |
| Focus                         | Human         | System prompt variation (no filtering)  |
| Ties                          | Manual        | Manual verification                     |

  • Best-of-4 evaluation format:

Each prompt is paired with four completions—one correct and three incorrect (25% random baseline). In the Ties domain, several completions may be equally correct, rewarding RMs that avoid penalizing distinct but equally valid answers.
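
A minimal sketch of how best-of-4 accuracy can be computed for a scalar reward model is shown below; the example fields (prompt, chosen, rejected) are illustrative and not necessarily the dataset's actual schema.

```python
# Minimal sketch of best-of-4 scoring: the reward model must rank the
# single "chosen" completion above the three "rejected" ones.
import random
from typing import Callable, Dict, List

def best_of_4_accuracy(examples: List[Dict], score: Callable[[str, str], float]) -> float:
    correct = 0
    for ex in examples:
        candidates = [ex["chosen"]] + ex["rejected"]      # 1 correct + 3 incorrect
        scores = [score(ex["prompt"], c) for c in candidates]
        correct += int(scores.index(max(scores)) == 0)    # did the RM pick the chosen one?
    return correct / len(examples)

# A random scorer lands near the 25% baseline:
dummy = [{"prompt": "p", "chosen": "a", "rejected": ["b", "c", "d"]} for _ in range(1000)]
print(best_of_4_accuracy(dummy, lambda p, c: random.random()))  # ~0.25
```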

2. Evaluation Methodology and Metrics

RewardBench 2 employs a challenging, best-of-4 evaluation format, increasing the discriminative power of the benchmark compared to earlier pairwise test sets.

  • Primary metric:

Accuracy is defined as the percentage of times the RM selects the “chosen” (correct) completion out of four options in each domain (excluding Ties, which uses a weighted formula to account for multiple correct completions).

  • Random baseline:

The random baseline is 25% (versus 50% for pairwise evaluation), which reduces ceiling effects and highlights practical differences between models.

  • Domain-specific metrics:

Each domain’s accuracy is reported; the Ties score combines the ability to prefer correct over incorrect completions with the ability to avoid spuriously discriminating among equally valid answers.

  • General correlation:

RewardBench 2 accuracy is evaluated for correlation with downstream use-cases, particularly best-of-N (BoN) sampling and RLHF policy gradient training.

3. Downstream Correlation and Practical Relevance

A distinguishing characteristic of RewardBench 2 is its tight linkage to downstream LLM performance under both inference-time selection (best-of-N) and RLHF training regimes.

  • Best-of-N (BoN) correlation:

Across 113 reward models, RewardBench 2 scores correlate strongly with BoN performance on major benchmarks (e.g., GSM8K, MATH, HumanEval+, BBH), with an average Pearson correlation of 0.87.
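
As a sketch of how such a correlation is computed, the snippet below compares offline benchmark accuracies against downstream best-of-N scores with NumPy; all numbers are made up for illustration (the paper reports an average Pearson r of 0.87 across 113 reward models).

```python
# Sketch: correlating offline RewardBench 2 accuracy with downstream
# best-of-N scores across a set of reward models. The arrays below are
# invented for illustration only.
import numpy as np

rb2_accuracy   = np.array([0.77, 0.72, 0.65, 0.58, 0.51])  # offline benchmark scores
bon_downstream = np.array([0.64, 0.61, 0.55, 0.50, 0.44])  # downstream best-of-N scores

r = np.corrcoef(rb2_accuracy, bon_downstream)[0, 1]
print(f"Pearson r = {r:.2f}")
```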

  • RLHF (PPO) performance:

Seventeen reward models and their associated PPO-trained policies were benchmarked. Offline RewardBench 2 score establishes a lower bound for on-policy RLHF success; however, out-of-family or distribution-mismatched reward models can exhibit performance drops even with high offline scores.

  • Domain-specific predictiveness:

The Factuality subset exhibits the strongest correlation with general downstream performance; Math is most predictive for math/coding tasks.

4. Comparative Analysis with Prior Evaluation Frameworks

RewardBench 2 departs from prior reward model benchmarks on several key axes:

  • Unseen, human prompts:

Unlike RM-Bench, RewardBench (v1), and RMB, which heavily reuse prompts from evaluation sets or rely on pairwise formats, RewardBench 2 prioritizes freshness and non-overlap with downstream benchmarks.

  • Best-of-N over pairwise:

The 1-of-4 (best-of-4) setting is harder and less susceptible to random accuracy inflation, yielding more reliable differentiation between reward models.

  • Domain breadth and tailored filtering:

Comprehensive domains, LLM-based and manual verification, and robust class balance further distinguish RewardBench 2.

  • Summary comparison:

| Benchmark     | Best-of-N | Human Prompts | Unseen Prompts | Multi Skill | Main Metric |
|---------------|-----------|---------------|----------------|-------------|-------------|
| RewardBench   | N         | N             | N              | Y           | Accuracy    |
| RM-Bench      | N         | N             | N              | Y           | Accuracy    |
| RMB           | Y         | Y             | N              | Y           | Accuracy    |
| RewardBench 2 | Y         | Y             | Y              | Y           | Accuracy    |

5. Performance of Existing Models

RewardBench 2 is intentionally challenging: even leading models from the previous RewardBench leaderboard experience up to a 20-point drop in accuracy.

  • Top model scores (selected):

| Model                                   | Avg. | Factuality | IF   | Math | Safety | Focus | Ties |
|-----------------------------------------|------|------------|------|------|--------|-------|------|
| google/gemini-2.5-flash-preview-04-17*  | 77.2 | 65.7       | 55.3 | 81.1 | 90.9   | 86.7  | 83.4 |
| nicolinho/QRM-Gemma-2-27B               | 76.7 | 78.5       | 37.2 | 69.9 | 95.8   | 95.4  | 83.2 |
| infly/INF-ORM-Llama3.1-70B              | 76.5 | 74.1       | 41.9 | 69.9 | 96.4   | 90.3  | 86.2 |
| anthropic/claude-opus-4-20250514*       | 76.5 | 82.7       | 41.9 | 74.9 | 89.5   | 86.2  | 83.7 |
| openai/gpt-4o-2024-08-06*               | 64.9 | 56.8       | 33.1 | 62.3 | 86.2   | 72.9  | 78.2 |

  • Domain trends:

Performance is lowest on Precise IF (~35–45%) and Math (~60–81%), higher on Safety and Focus, but no model achieves near-perfect scores.

  • Incremental training observations:

Training reward models for more than one epoch continues to yield accuracy improvements, contravening earlier best practices favoring minimal epoch counts.

6. Impact, Limitations, and Recommendations

RewardBench 2 establishes a new standard in reward model evaluation, with practical implications for both researchers and practitioners.

  • Offline accuracy as a predictor:

Used with inference-time scaling (best-of-N sampling), RewardBench 2 can reliably guide model selection, accelerating model development and “hillclimbing” for LLM alignment.
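
A minimal sketch of the best-of-N selection loop this refers to is given below; `generate` and `reward` are placeholders for model-specific calls, not a specific API.

```python
# Sketch of best-of-N (BoN) selection: sample N completions from a policy
# model, score each with a reward model, keep the highest-scoring one.
from typing import Callable, List

def best_of_n(prompt: str, n: int,
              generate: Callable[[str], str],
              reward: Callable[[str, str], float]) -> str:
    completions: List[str] = [generate(prompt) for _ in range(n)]
    return max(completions, key=lambda c: reward(prompt, c))
```

The stronger the reward model, as ranked by RewardBench 2, the more likely this argmax picks a genuinely better completion, which is why offline accuracy tracks BoN performance.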

  • Cautions for RLHF policy training:

For policy-gradient methods (e.g., PPO), RewardBench 2 score is necessary but not sufficient; alignment between policy model, reward model, and the training distribution is essential for optimal RLHF outcomes.

  • Field recommendations:

Practitioners should weigh both offline leaderboard performance and the reward model’s training and data recipe when applying reward models to RLHF (e.g., retraining on similar data, checking for domain and format alignment).

  • Future evolution:

RewardBench 2’s rigorous construction and demonstrated correlation with practical performance set a new bar; future iterations should maintain decontamination, multi-domain design, and challenging evaluation formats as model and task diversity grow.

7. Accessibility and Resources

RewardBench 2 is fully open source, with all data, code, and documentation available for the community.

  • Code and data repository:

https://github.com/allenai/reward-bench

  • Dataset:

https://huggingface.co/datasets/allenai/reward-bench-2
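
For orientation, the dataset can be loaded with the Hugging Face datasets library as in the sketch below; the dataset ID comes from the link above, while the split name and record layout are assumptions to verify against the dataset card.

```python
# Sketch: loading RewardBench 2 with Hugging Face `datasets`.
# The dataset ID matches the repository above; the split name and the
# record layout printed here are assumptions -- check the dataset card.
from datasets import load_dataset

ds = load_dataset("allenai/reward-bench-2", split="test")
print(ds.column_names)  # inspect the actual schema
print(ds[0])            # one prompt with its candidate completions
```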

8. Illustrative Figures and Tables

The paper’s illustrations include a main benchmark comparison figure, a domain-wide correlation grid, a PPO performance comparison, and a per-domain model accuracy table (see Table 3 in the paper for exact scores).


In summary, RewardBench 2 is a rigorously designed, multi-skill, contamination-aware benchmark enabling accurate, practical, and challenging evaluation of reward models across instruction following, factuality, reasoning, safety, and nuanced domains. It delivers strong predictive power for downstream best-of-N selection, provides critical diagnostic insight for RLHF pipelines, and sets a new standard for robust reward model evaluation in language modeling.