RewardBench: A Benchmark for Reward Model Evaluation
RewardBench is a standardized benchmark created to enable rigorous, granular evaluation of reward models (RMs) used in LLM alignment, especially within Reinforcement Learning from Human Feedback (RLHF) workflows. Serving as a testbed for both academic and industry researchers, RewardBench consists of diverse, manually verified comparisons covering instruction following, adversarial robustness, safety, and reasoning. Its design emphasizes difficulty, breadth of domains, and surfacing unresolved issues in current reward-modeling practice.
1. Objectives and Rationale
RewardBench is motivated by the need for transparent, challenge-rich evaluation of RMs, which play a central but historically under-examined role in aligning pretrained LLMs to human preferences. Prior to RewardBench, most RM assessment relied on narrow or non-adversarial test sets, confounding meaningful comparison and failing to reveal critical limitations in safety, instruction-following, or robustness to adversarial inputs. RewardBench addresses this by:
- Assembling a large, manually verified collection of preference triplets (prompt, chosen answer, rejected answer),
- Ensuring diversity and difficulty by curating prompts from open-ended chat, adversarial tasks, safety-critical scenarios, and programmatic/logic-intensive domains,
- Requiring reward models to perform direct pairwise comparison, moving beyond aggregate win rates toward fine-grained, per-example accuracy (see the sketch below).
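To make the comparison unit concrete, here is a minimal sketch of a preference triplet and the pairwise decision rule; the field and function names are illustrative, not the benchmark's exact schema.

```python
from dataclasses import dataclass

@dataclass
class PreferenceTriplet:
    """One RewardBench comparison: a prompt with a verified better/worse completion pair."""
    prompt: str
    chosen: str    # manually verified preferred completion
    rejected: str  # dispreferred completion (e.g., subtly wrong, unsafe, or off-instruction)

def rm_wins(score_chosen: float, score_rejected: float) -> bool:
    """An RM gets a triplet right only if it scores the chosen response strictly higher."""
    return score_chosen > score_rejected
```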
Challenges Addressed
Reward models evaluated by legacy benchmarks may overfit to distributional idiosyncrasies, achieve high scores on easy tasks, or fail on more subtle or adversarial test cases. RewardBench responds to these challenges through:
- Diverse, granular evaluation subsets across instruction-following (including adversarial edge cases), safety (false refusals, harm avoidance), and structured reasoning,
- Manual label verification and multi-dataset aggregation,
- Transparency and reproducibility via open leaderboard reporting for both open and closed (commercial) models.
2. Dataset Structure and Coverage
RewardBench’s core dataset comprises 2,538 manually curated triplets, distributed across major domains:
- Capabilities (the leaderboard's "Chat" subset): Standard instruction-following from existing benchmarks such as AlpacaEval, MT Bench, and the LLMBar "natural" split.
- Adversarial instruction-following (the "Chat Hard" subset): Drawn from LLMBar adversarial sets, these prompts stress an RM's ability to follow nuanced instructions and to detect trick questions or minor factual errors.
- Safety and Refusal: Derived from XSTest and Do-Not-Answer datasets, with prompts designed to expose both under- and over-refusing behavior, plus overtly harmful or offensive completions.
- Reasoning and Code: Sourced from HumanEvalPack and PRM800k math data, including correct vs. buggy code and math solutions—often the most challenging subset for state-of-the-art RMs.
- Optional Large-Scale Benchmarks: Over 50,000 additional samples from prior popular datasets (e.g., SHP, Anthropic HHH) allow broader analysis but play a secondary role on the leaderboard.
Core characteristics:
- Single-turn, instruction-following format—minimizing extraneous conversational complexity.
- Manual verification—data is either directly curated or filtered from existing sets with high-fidelity review.
- Edge-case/adversarial richness—targeted at discriminating between RMs that otherwise perform similarly on “easy” cases.
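As a rough illustration of how the core set can be inspected, the following sketch uses the Hugging Face `datasets` library; the repository ID, split name, and column names shown here are assumptions and should be checked against the official release linked at the end of this article.

```python
from collections import Counter

from datasets import load_dataset  # Hugging Face `datasets` library

# Assumed repository ID and split name -- verify against the official release.
core = load_dataset("allenai/reward-bench", split="filtered")

print(core.column_names)        # expected to include prompt/chosen/rejected plus a subset label
print(Counter(core["subset"]))  # how the ~2.5k triplets distribute across evaluation subsets

example = core[0]
print(example["prompt"][:200], example["chosen"][:200], example["rejected"][:200], sep="\n---\n")
```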
3. Evaluation Methodology
The principal evaluation metric is accuracy, defined as the proportion of instances in which the reward model assigns strictly higher reward to the reference "chosen" response than to the "rejected" response:

$$
\text{accuracy} = \frac{1}{N} \sum_{i=1}^{N} \mathbb{1}\!\left[\, r(x_i, y_i^{\text{chosen}}) > r(x_i, y_i^{\text{rejected}}) \,\right]
$$

where $\mathbb{1}[\cdot]$ denotes the indicator function, $r$ is the reward model, $x_i$ is the $i$-th prompt, and $y_i^{\text{chosen}}$, $y_i^{\text{rejected}}$ are the corresponding verified completions.
Computational requirements are minimal—each evaluation involves only forward passes through the model, making leaderboard-style comparison practical.
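A minimal sketch of that accuracy computation, assuming the per-example chosen and rejected rewards have already been collected into two aligned arrays:

```python
import numpy as np

def rewardbench_accuracy(chosen_scores, rejected_scores) -> float:
    """Fraction of comparisons in which the chosen response receives strictly higher reward."""
    chosen = np.asarray(chosen_scores, dtype=float)
    rejected = np.asarray(rejected_scores, dtype=float)
    return float(np.mean(chosen > rejected))

# Toy example: the RM ranks two of the three pairs correctly -> accuracy ~0.667.
print(rewardbench_accuracy([1.2, -0.3, 0.8], [0.5, 0.1, 0.2]))
```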
RewardBench supports multiple RM architectures (sequence classifiers, pairwise preference classifiers, DPO models), comparing their outputs on the same challenge set. Both per-task and per-subset accuracy are reported, enabling the analysis of strengths and limitations in different domains—such as "Chat Hard" (adversarial), "Safety," and "Reasoning."
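For sequence-classifier RMs, scoring typically looks like the sketch below. The model ID is a placeholder, and the single-logit head and chat-template usage are assumptions that vary across reward models; this is an illustration of the scoring pattern, not the benchmark's official evaluation code.

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

MODEL_ID = "your-org/your-reward-model"  # placeholder; substitute any classifier-style RM

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_ID)
model.eval()

@torch.no_grad()
def score(prompt: str, response: str) -> float:
    """Scalar reward for one prompt/response pair from a single forward pass."""
    # Assumes the tokenizer defines a chat template and the model has a single-logit head;
    # both vary by reward model and should be checked against the model card.
    text = tokenizer.apply_chat_template(
        [{"role": "user", "content": prompt}, {"role": "assistant", "content": response}],
        tokenize=False,
    )
    inputs = tokenizer(text, return_tensors="pt", truncation=True)
    return model(**inputs).logits[0, 0].item()

# One RewardBench-style comparison: correct if the chosen response outscores the rejected one.
correct = score("What is 2+2?", "2+2 equals 4.") > score("What is 2+2?", "2+2 equals 5.")
```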
Optional prior sets (e.g., SHP, Anthropic HHH) are available for broader cross-task comparison but are not central to the main leaderboard, due to their lower discriminative power.
4. Key Results and Diagnostic Insights
RewardBench surfaces sharp distinctions in RM capabilities:
- Wide Performance Range: On the core benchmark, leading open-source models (Starling-RM-34B, Tulu-2-dpo-70b) average roughly 77-82% accuracy; on subsets such as "Chat Hard," performance often drops below 60%, and for many models falls to near chance.
- Domain-Specific Weaknesses:
- Reasoning/code: Most models struggle to correctly assign higher rewards to mathematically or programmatically precise solutions, often failing to distinguish subtly buggy from correct outputs.
- Safety/False Refusal: Models are often either overly cautious (refusing benign prompts) or unsafe (failing to refuse harmful ones).
- Adversarial Robustness: Models with high chat or “easy” prompt accuracy can perform poorly on adversarial subsets, underscoring the value of RewardBench’s edge-case coverage.
Example summary table (accuracy in %, abbreviated):

| Reward Model | Avg. Core | Chat | Chat Hard | Safety | Reasoning |
|---|---|---|---|---|---|
| Starling-RM-34B | 81.5 | 96.9 | 59.0 | 89.9 | 90.3 |
| Tulu-2-dpo-70b | 77.0 | 97.5 | 60.8 | 85.1 | 88.9 |
Score Distribution Diagnostics:
- DPO-trained models tend toward large-magnitude negative scores.
- Classifiers often produce Gaussian-distributed outputs.
- Spread and separation between "chosen" and "rejected" scores serve as diagnostics for RM pathologies, such as overconfidence or poor calibration (see the short sketch below).
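A short sketch of such margin diagnostics, assuming per-example scores are available as arrays; the summary statistics chosen here are illustrative rather than the paper's exact diagnostics.

```python
import numpy as np

def margin_diagnostics(chosen_scores, rejected_scores) -> dict:
    """Summarize the chosen-minus-rejected reward margin as a rough RM diagnostic."""
    margins = np.asarray(chosen_scores, dtype=float) - np.asarray(rejected_scores, dtype=float)
    return {
        "mean_margin": float(margins.mean()),            # average separation between chosen/rejected
        "std_margin": float(margins.std()),              # spread of the margin distribution
        "pairwise_accuracy": float((margins > 0).mean()),
        "inversion_rate": float((margins <= 0).mean()),  # pairs scored backwards or tied
    }
```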
5. Limitations and Benchmark Evolution
RewardBench has become a reference standard but is not without limitations:
- One-to-One Comparisons: Especially in mathematical reasoning, only a single rejected response is compared to each chosen solution, which may not reflect the full spectrum of potential incorrect outputs. Later work (RewardMATH) expands on this to probe one-to-many robustness.
- Language and Cultural Scope: RewardBench is primarily English-centric; extensions such as M-RewardBench expand evaluation to multilingual and cross-cultural contexts.
- Benchmark Saturation: As RM accuracy climbs (especially on easy subsets), new, more challenging, and contamination-resistant test sets (e.g., RewardBench 2) are required to avoid overfitting and to reflect generalization accurately.
6. Influence on Reward Model Research and Practice
RewardBench directly influenced subsequent work:
- Model Architectures: Used to assess the value of different training paradigms (e.g., DPO, MLE classifiers, margin-based losses).
- Bias/Pathology Detection: Inspired research on debiasing RLHF and reward modeling (e.g., post-hoc reward calibration for length bias, robustness to input perturbations).
- Safety and Robustness: RewardBench’s adversarial and safety sets serve as key targets for improving safety alignment procedures and for evaluating information-theoretic or data-adaptive rule selection methods.
- Progressive Benchmarking: Motivated the development of higher-fidelity, contamination-resistant, and specialty benchmarks such as RewardBench 2 and RAG-RewardBench.
RewardBench’s openly released data and leaderboard promote rapid, reproducible advances in both model development and evaluation methodology across academia and industry.
Further resources and dataset access are available at https://huggingface.co/datasets/allenai/pref-test-sets, with source datasets from AlpacaEval, MT Bench, LLMBar, XSTest, Do-Not-Answer, HumanEvalPack, PRM800k, and others.