RM-Bench: Evaluating Reward Models for Alignment
- RM-Bench is a benchmark that rigorously tests reward models' sensitivity to subtle content differences and factual nuances.
- It employs a triple-style matrix evaluation to distinguish substantive quality from superficial stylistic features.
- Empirical findings show state-of-the-art reward models exhibit significant style bias and challenges in technical domains like math and code.
RM-Bench is a benchmark specifically constructed to rigorously evaluate reward models (RMs) for LLMs in the context of alignment, with a critical focus on sensitivity to subtle content differences and robustness to stylistic biases. Unlike earlier benchmarks, which primarily assess a reward model's ability to discriminate between coarse differences in output quality from models of varying strength, RM-Bench is designed to uncover both the capabilities and the limitations of RMs in real-world alignment scenarios, including their correlation with downstream policy model performance. The benchmark probes whether reward models prioritize substantive content over superficial style, thus serving as a diagnostic and selection tool for alignment in RLHF and related paradigms.
1. Motivation and Underlying Deficiencies in Prior Benchmarks
RM-Bench addresses two fundamental shortcomings in standard reward model evaluation:
- Insensitivity to Factual Subtleties: Previously, reward models were often assessed by their ability to pick the superior output from a strong LLM over that of a weaker baseline. This regime does not capture the requirement to discriminate between responses that are nearly identical but differ in critical factual, logical, or mathematical aspects.
- Style/Formatting Bias: Many RMs exhibit a confounding bias toward outputs that appear more sophisticated, verbose, or are formatted more attractively (e.g., Markdown, enumerated lists), even if the substantive content is inferior. This bias is exacerbated when stylistic features correlate superficially with model capability.
As a result, "success" in prior reward model benchmarks often fails to predict or correlate strongly with policy model improvements under RLHF or similar alignment approaches.
2. Benchmark Construction: Principles and Methodology
Sensitivity to Subtle Content Differences
RM-Bench constructs challenging test cases where both the "chosen" and "rejected" responses are generated by a powerful LLM (GPT-4o), not across models of different strength. Rejected responses are adversarially crafted to include minimal, subtle factual, logical, or safety-critical errors, verified through manual review for tasks where automated checking is infeasible (e.g., Chat, Safety).
- Example: Two nearly identical factual answers to a science question, differing by a single key concept ("quantum superposition" vs. "quantum entanglement"), challenge the RM's ability to distinguish correctness at high granularity.
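As a rough illustration, such a test case could be represented as a simple record; the field names below are hypothetical and not the benchmark's actual data schema.

```python
# Hypothetical representation of one RM-Bench-style test case.
# Field names are illustrative, not the benchmark's actual schema.
test_case = {
    "prompt": "Which quantum phenomenon allows a qubit to represent "
              "a combination of two states at once?",
    "chosen": "A qubit can occupy a combination of |0> and |1> simultaneously "
              "because of quantum superposition.",
    "rejected": "A qubit can occupy a combination of |0> and |1> simultaneously "
                "because of quantum entanglement.",  # single-concept factual error
    "domain": "chat",
}
```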
Systematic Style Variation
Every (prompt, chosen, rejected) triplet is realized across three styles for both responses:
- Concise (only essential factual material)
- Detailed (no Markdown)
- Detailed with Markdown
This results in a 3×3 evaluation matrix per prompt, where the "off-diagonal" cells stress-test the RM's tendency to privilege style over substance, e.g., when the incorrect response is longer, employs Markdown, or mimics the "style" of a more advanced model.
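A hedged sketch of how one prompt might expand into the nine style-controlled comparisons; the `STYLES` labels, the variant dictionaries, and the cell indexing are assumptions for illustration, not the benchmark's actual interface.

```python
# Hypothetical style labels; index order goes from plainest to fanciest.
STYLES = ["concise", "detailed_plain", "detailed_markdown"]

def style_matrix_cells(chosen_variants: dict, rejected_variants: dict):
    """Yield all nine (chosen_style, rejected_style) comparisons for one prompt.

    Each argument maps a style name to the corresponding response text.
    """
    for ci, c_style in enumerate(STYLES):
        for ri, r_style in enumerate(STYLES):
            yield {
                "cell": (ci, ri),  # row = chosen style, column = rejected style
                "chosen": chosen_variants[c_style],
                "rejected": rejected_variants[r_style],
            }
```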
Domain Coverage
RM-Bench spans four representative domains:
- Chat: Open-ended queries requiring factual/reasoned responses.
- Code: Programming and code reasoning tasks sampled from HumanEvalPack.
- Math: Math word problems sampled from MATH benchmark.
- Safety: Two subclasses: prompts requiring helpful engagement ("safety-should-respond") and prompts requiring refusal ("safety-should-refuse").
Domain-specific adversarial data generation ensures that subtleties relevant to each field are appropriately tested.
3. Evaluation Metrics and Technical Criteria
Given the pivotal role of reward assignment in alignment, RM-Bench uses pairwise accuracy, calculated as:

$$\text{Accuracy} = \frac{1}{|\mathcal{D}|} \sum_{(x,\, y_c,\, y_r) \in \mathcal{D}} \mathbb{1}\left[\, r(x, y_c) > r(x, y_r) \,\right],$$

where $r$ denotes the reward model, $y_c$ the chosen response, and $y_r$ the rejected response.
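In code, this metric can be sketched as follows, assuming a hypothetical `reward_model(prompt, response)` scoring function and a list of (prompt, chosen, rejected) triples:

```python
def pairwise_accuracy(reward_model, dataset):
    """Fraction of pairs where the chosen response outscores the rejected one.

    `reward_model(prompt, response)` is assumed to return a scalar reward;
    `dataset` is a list of (prompt, chosen, rejected) triples.
    """
    correct = sum(
        reward_model(prompt, chosen) > reward_model(prompt, rejected)
        for prompt, chosen, rejected in dataset
    )
    return correct / len(dataset)
```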
To measure robustness and identify failure modes, three accuracy metrics are reported (see the sketch after this list):
- Easy Accuracy: The lower triangle of the style matrix; the chosen response is fancier in style than the rejected one.
- Normal Accuracy: Matrix diagonal; both responses are of matched style.
- Hard Accuracy: Upper triangle; chosen response is stylistically simpler than the rejected—directly measuring resistance to style bias.
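A minimal sketch of this decomposition, assuming each prompt yields a 3×3 boolean matrix of comparison outcomes with styles ordered from concise (index 0) to detailed-with-Markdown (index 2); this indexing convention is an assumption for illustration.

```python
import numpy as np

def easy_normal_hard(correct: np.ndarray) -> dict:
    """Decompose a 3x3 correctness matrix into Easy/Normal/Hard accuracies.

    correct[i, j] is True when the reward model prefers the chosen response in
    style i over the rejected response in style j (0 = concise, 1 = detailed
    plain text, 2 = detailed with Markdown).
    """
    lower = np.tril_indices(3, k=-1)   # chosen fancier than rejected -> Easy
    diag = np.diag_indices(3)          # matched styles               -> Normal
    upper = np.triu_indices(3, k=1)    # chosen plainer than rejected -> Hard
    return {
        "easy": correct[lower].mean(),
        "normal": correct[diag].mean(),
        "hard": correct[upper].mean(),
    }
```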
For multi-objective RMs (outputting vectors for e.g., correctness and verbosity), element-wise comparison is used.
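One plausible reading of element-wise comparison is sketched below; the aggregation rule (prefer the chosen response only if it is no worse on every objective and strictly better on at least one) is an illustrative assumption, not necessarily the benchmark's exact rule.

```python
import numpy as np

def chosen_preferred(chosen_scores, rejected_scores) -> bool:
    """Element-wise comparison of multi-objective reward vectors.

    Both inputs are vectors of per-objective scores (e.g., correctness,
    verbosity). The chosen response counts as preferred only if it is no worse
    on every objective and strictly better on at least one (assumed rule).
    """
    chosen = np.asarray(chosen_scores, dtype=float)
    rejected = np.asarray(rejected_scores, dtype=float)
    return bool(np.all(chosen >= rejected) and np.any(chosen > rejected))
```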
The core pairwise preference loss used to train RMs is:

$$\mathcal{L} = -\,\mathbb{E}_{(x,\, y_c,\, y_r)}\!\left[ \log \sigma\big( r(x, y_c) - r(x, y_r) \big) \right],$$

where $\sigma$ denotes the sigmoid function.
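A minimal PyTorch sketch of this Bradley-Terry-style objective, assuming the reward model has already produced scalar rewards for the chosen and rejected responses:

```python
import torch
import torch.nn.functional as F

def pairwise_preference_loss(r_chosen: torch.Tensor,
                             r_rejected: torch.Tensor) -> torch.Tensor:
    """-log sigmoid(r_chosen - r_rejected), averaged over the batch.

    `r_chosen` and `r_rejected` are batches of scalar rewards for the chosen
    and rejected responses to the same prompts.
    """
    return -F.logsigmoid(r_chosen - r_rejected).mean()
```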
When DPO (Direct Preference Optimization) models are used as reward models, the reward signal relies on the implicit reward

$$r(x, y) = \beta \log \frac{\pi_\theta(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)},$$

where $\pi_\theta$ is the policy, $\pi_{\mathrm{ref}}$ the reference LM, and $\beta$ a regularization/temperature parameter.
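A small sketch of that implicit reward, assuming the summed token log-probabilities of the response under the policy and the frozen reference model are already available:

```python
import torch

def dpo_implicit_reward(policy_logprob: torch.Tensor,
                        reference_logprob: torch.Tensor,
                        beta: float = 0.1) -> torch.Tensor:
    """Implicit reward of a DPO-trained model used as a reward model.

    r(x, y) = beta * (log pi_theta(y|x) - log pi_ref(y|x)), where the inputs
    are the summed token log-probabilities of response y under the policy and
    the reference model; beta is the DPO temperature/regularization weight.
    """
    return beta * (policy_logprob - reference_logprob)
```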
Correlation of RM-Bench accuracy, especially Hard Accuracy, with real-world alignment performance is measured using the Pearson correlation coefficient.
4. Empirical Findings Across Nearly 40 Reward Models
- Substantial Difficulty for SOTA Models: Even state-of-the-art RMs, such as Skywork-Reward-Llama-3.1-8B, achieve only 46.6% Hard Accuracy—worse than random guessing (50%)—under style bias interference. Average accuracy is 70.1%, but this is dominated by Easy and Normal (less challenging) conditions.
- Performance Decomposition:
- In style-matched (Normal) evaluation, SOTA models report 74.7% accuracy.
- Hard cases are especially difficult in code and math (e.g., Hard Accuracy drops to 28.4% for math), exposing weaknesses in nuanced technical discernment.
- Multi-objective RMs are able to output correctness and verbosity scores separately, but in non-trivial tasks correct/incorrect responses often receive similar scores, suggesting imperfect disentanglement.
- Effectiveness of DPO: RMs trained with DPO outperform vanilla sequence classifiers, highlighting the benefit of preference-based optimization.
- Correlation with Downstream Alignment: Hard Accuracy correlates positively with policy model improvements, whereas previous benchmarks (e.g., RewardBench) show only weak correlation.
5. Comparative Analysis and Unique Properties
| Feature | RM-Bench | Prior Benchmarks |
|---|---|---|
| Response source | GPT-4o for both chosen/rejected | Mixed model quality |
| Subtle content difference | Yes | No |
| Systematic style control | Triple-style matrix per sample | Uncontrolled |
| Policy performance correlation | Strong (in Hard metric) | Weak/None |
| Coverage (domain) | Chat, code, math, safety | Often limited |
Key innovations are the adversarial construction of nearly content-identical response pairs and the systematic style matrix, which jointly expose weaknesses in both content sensitivity and style robustness. Because prior benchmarks conflate stylistic artifacts with quality, they do not reveal these deficiencies and can mislead policy optimization efforts.
6. Implications for Alignment Research and Reward Model Development
- Persistent Style Bias: The failure of state-of-the-art RMs to exceed random accuracy under style-biased conditions demonstrates pervasive vulnerabilities: LLM-aligned policies may select for verbose/Markdown/“model-like” style, rather than substance, if tuned with such RMs.
- Necessity of Subtlety and Style Control: RM-Bench evidences that progress in alignment will require advances that explicitly measure and mitigate RM susceptibility to non-substantive cues.
- Guidance for Reward Model Selection: Policy models trained with RMs scoring well on RM-Bench (notably Hard Accuracy) are likely to be better aligned, less style-biased, and more reliable in downstream deployment.
- Partial Success of Current Solutions: DPO and multi-objective reward modeling contribute marginal improvements, but do not eliminate style bias, particularly in code and mathematical reasoning.
- Limitations: RM-Bench currently controls only for length and Markdown, not more subtle artifacts (e.g., “think step by step”). The observed correlations hold when policy models are trained under comparable optimization regimes.
7. Resources and Availability
RM-Bench is open-source and available at https://github.com/THU-KEG/RM-Bench. The repository provides the benchmark data, evaluation code, and documentation for use in comparative research, model selection, and diagnostic evaluation of alignment methodologies.
RM-Bench provides a robust and discriminating standard for the evaluation of reward modeling in LLM alignment, directly exposing limitations in content sensitivity and style bias that were previously obscured. Its construction and methodology tightly couple evaluation outcomes to practical policy improvements, thereby informing both the critique and development of future reward modeling strategies in the LLM alignment discipline (Liu et al., 21 Oct 2024).