RewardMath Benchmark
- RewardMath is a benchmark and evaluation framework that measures reward model reliability, robustness, and signal quality in mathematical reasoning tasks.
- It employs a one-to-many task structure with step-aligned solutions, challenging models to distinguish correct solutions from nine closely matched incorrect answers.
- Its metrics, such as accuracy and mean reciprocal rank, closely correlate with downstream RLHF policy performance, helping to diagnose reward hacking and overoptimization.
RewardMath is a benchmark and evaluation framework designed to measure the reliability, robustness, and signal quality of reward models, particularly in the context of mathematical reasoning tasks for LLMs. It arose from the recognition that existing reward model evaluation methods, such as those in RewardBench, fail to accurately capture the subtleties of reward model performance, especially regarding reward hacking, overoptimization, and the ability to generalize on challenging math problems. RewardMath’s methodological innovations and empirical validation have established it as a diagnostic tool and standard for the rigor required of reward models in modern reinforcement learning from human feedback (RLHF) pipelines.
1. Motivation and Rationale for RewardMath
RewardMath addresses two central limitations observed in prior benchmarks for reward model evaluation:
- Representation discrepancy: Human-annotated “chosen” (correct) and “rejected” (incorrect) solutions in previous benchmarks like RewardBench often differ not just in correctness but in how they express solutions—e.g., humans tend to omit detailed intermediate steps while machine-generated answers are step-by-step. This mismatch introduces shortcuts for reward models, letting them key on style rather than mathematical validity.
- Pairwise comparison shortcomings: Previous benchmarks are fundamentally one-to-one; they offer only a single correct versus a single incorrect answer per problem. In these conditions, naive scoring or superficial distinction can yield high accuracy, permitting “reward hacking” (overfitting to spurious cues) and rendering the signal unreliable with respect to downstream RL policies.
RewardMath was devised as a remedy: a benchmark with one-to-many comparison structure and step-aligned (machine-generated) representations, specifically constructed to robustly challenge and reveal the true capabilities and vulnerabilities of reward models used in RLHF for mathematics (Kim et al., 2 Oct 2024).
2. Structure and Methodology
The construction of the RewardMath benchmark involves several deliberate steps to maximize its informativeness and resilience against reward hacking:
- One-to-many task structure: For each selected problem from the MATH500 dataset (with easy instances filtered), RewardMath provides one correct, step-explicit solution and nine incorrect solutions. The correct solution is generated in a step-by-step manner from curated human annotations—using carefully engineered GPT-4 prompts for maximal representational consistency.
- Sampling of incorrect answers: Nine diverse rejected solutions per problem are sourced from 14 different LLMs, including both open and closed-source, as well as specialist math models. When needed, selected outputs are further perturbed by prompting GPT-4 to introduce errors. This ensures the incorrect candidates closely mirror the style of correct ones, forcing the reward model to home in on actual correctness rather than surface features.
- Metrics: Two core metrics are used:
- Accuracy: Measures whether the reward model scores the correct solution above all nine incorrect ones for a problem (random chance is 10% given the 1:9 structure).
- Mean Reciprocal Rank (MRR): Averages the inverse of the rank assigned to the correct solution.
Mathematically,

$$\mathrm{MRR} = \frac{1}{N}\sum_{i=1}^{N} \frac{1}{\mathrm{rank}_i},$$

where $N$ is the number of problems and $\mathrm{rank}_i$ is the position of the correct solution among the ten candidates for problem $i$.
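A minimal sketch of how both metrics can be computed over the one-to-many structure, assuming each benchmark entry holds one correct solution and nine rejected candidates and that `score_fn` is some reward-model scoring function (the data class and function names here are illustrative, not the released API):

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class RewardMathEntry:
    """One-to-many item: a problem, one step-aligned correct solution,
    and nine closely matched incorrect solutions."""
    problem: str
    chosen: str          # correct, step-by-step solution
    rejected: List[str]  # nine incorrect candidates


def evaluate(entries: List[RewardMathEntry],
             score_fn: Callable[[str, str], float]) -> dict:
    """Compute accuracy (chosen ranked above all rejected) and MRR."""
    correct, reciprocal_ranks = 0, []
    for e in entries:
        chosen_score = score_fn(e.problem, e.chosen)
        rejected_scores = [score_fn(e.problem, r) for r in e.rejected]
        # Rank of the correct solution among all ten candidates (1 = best);
        # ties count against the chosen solution.
        rank = 1 + sum(s >= chosen_score for s in rejected_scores)
        correct += (rank == 1)
        reciprocal_ranks.append(1.0 / rank)
    n = len(entries)
    return {"accuracy": correct / n, "mrr": sum(reciprocal_ranks) / n}
```

Under this scoring, a reward model assigning random scores lands near the 10% chance baseline noted above.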
RewardMath’s construction places particular emphasis on comparability and realistic diversity, thereby elevating the bar for reward modeling in mathematical domains.
3. Evaluation Findings and Correlation to Policy Optimization
The paper evaluates a variety of reward model types—generative (LLM-as-a-judge), classifier-based, and process-level—on RewardMath and earlier baselines:
- Direct assessment weakness in RewardBench: Classifier and generative models routinely achieve high scores on RewardBench but often fail to distinguish correct from incorrect answers in RewardMath’s one-to-many setup—commonly assigning near-identical scores to both.
- Predictive power for policy optimization: RewardMath’s scores, unlike those from RewardBench, exhibit a strong linear correlation on MATH500 with the policy accuracy or “oracle” rewards achieved by RLHF-optimized models trained against proxy rewards. This validates RewardMath accuracy as a direct proxy for practical reward signal quality (a correlation check of this kind is sketched after this list).
- Overoptimization diagnosis: Models excelling at RewardMath are less prone to reward overoptimization—i.e., less likely to yield policies that overfit the reward model at the expense of actual correctness. Conversely, high RewardBench scores often accompany catastrophic overoptimization, with gold rewards and pass@1 rates collapsing as the optimization budget increases.
The robust predictive utility of RewardMath is due to its stringent structure, which aligns evaluation directly with the practical requirements of RLHF for mathematical reasoning.
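As an illustration only, the kind of correlation check behind this claim can be reproduced with a simple fit across reward models: pair each model’s RewardMath accuracy with the downstream accuracy of the policy optimized against it. The numbers below are placeholders, not results from the paper:

```python
import numpy as np

# Placeholder scores, one entry per reward model (not the paper's data).
rewardmath_accuracy = np.array([0.12, 0.25, 0.41, 0.58, 0.67])
policy_accuracy = np.array([0.18, 0.24, 0.35, 0.44, 0.49])  # e.g., MATH500 pass@1 after RLHF

# Pearson correlation and a least-squares line quantify how well the
# benchmark score predicts downstream policy performance.
r = np.corrcoef(rewardmath_accuracy, policy_accuracy)[0, 1]
slope, intercept = np.polyfit(rewardmath_accuracy, policy_accuracy, 1)
print(f"Pearson r = {r:.2f}; policy ≈ {slope:.2f} * benchmark + {intercept:.2f}")
```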
4. Implications for Robustness, Overoptimization, and Reward Hacking
RewardMath exposes vulnerabilities and limitations that would otherwise go undetected:
- Reward hacking: One-to-many evaluation neutralizes superficial cues, compelling reward models to verify actual correctness rather than rely on stylistic or lexical artifacts; this makes “reward hacking” (exploiting accidental correlations) much harder.
- Overoptimization: Because RewardMath scores closely track the point at which optimized RLHF policies begin to diverge from true correctness while still improving under the proxy reward, it can be used to select or tune reward models that are less susceptible to overoptimization.
- Policy-alignment: By correlating evaluation not with abstract model behavior but with how well a reward model supports optimization of downstream policy (i.e., improves answer accuracy for new problems), RewardMath operationalizes robustness for mathematical LLM alignment.
This benchmark is thus not simply a test set, but a diagnostic for the internal resilience and practical utility of reward models in alignment pipelines.
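One way to operationalize this diagnostic role in an RLHF pipeline is to log both the proxy (reward-model) score and a gold signal (ground-truth answer correctness) over the optimization budget and flag the point where they diverge. The sketch below is a rough heuristic under that assumption; the checkpoint format and field names are illustrative:

```python
def detect_overoptimization(checkpoints):
    """checkpoints: list of dicts with 'step', 'proxy_reward', 'gold_reward',
    ordered by training step.

    Returns the first step at which the proxy reward keeps rising while the
    gold reward has dropped below its running best -- a crude divergence flag.
    """
    best_gold = float("-inf")
    for prev, curr in zip(checkpoints, checkpoints[1:]):
        best_gold = max(best_gold, prev["gold_reward"])
        proxy_rising = curr["proxy_reward"] > prev["proxy_reward"]
        gold_degrading = curr["gold_reward"] < best_gold
        if proxy_rising and gold_degrading:
            return curr["step"]
    return None  # no proxy/gold divergence observed within the budget
```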
5. Applications, Availability, and Future Directions
RewardMath is now publicly available (Kim et al., 2 Oct 2024), offering code and data at https://huggingface.co/spaces/RewardMATH/RewardMATH_project so that researchers can benchmark, debug, and qualitatively analyze their reward models. Its current applications include:
- Reward model selection: Serving as a standard for direct comparison across model architectures and training strategies, particularly when mathematical reasoning is the optimization target.
- RLHF stability research: Evaluating new approaches to reward modeling, pruning, or RLHF pipeline adjustments for improved policy robustness.
- Broader benchmarking: The one-to-many structure and the focus on high-fidelity, step-aligned responses provide a blueprint for constructing more robust benchmarks in adjacent domains (e.g., code or long-form factual reasoning).
Proposed directions for expansion include many-to-many comparisons (for even finer-grained detection of reward model weaknesses), scaling to other scientific fields, and study of the optimal trade-off between comparison pool size and compute budget in practice.
6. Methodological and Mathematical Distinctions
RewardMath’s methodology is rooted in strict adherence to best practices for reward model evaluation:
- Model training objectives: For classifier-based models, the benchmark uses the classical Bradley–Terry likelihood, whose negative log-likelihood loss is

$$\mathcal{L}_{\mathrm{RM}}(\theta) = -\,\mathbb{E}_{(x,\,y_w,\,y_l)}\big[\log \sigma\big(r_\theta(x, y_w) - r_\theta(x, y_l)\big)\big],$$

where $y_w$ and $y_l$ are the chosen and rejected solutions for prompt $x$, $r_\theta$ is the scalar reward, and $\sigma$ is the logistic function.
- RL update for policy: RLHF policy optimization is performed via objectives combining proxy reward maximization with a KL penalty toward the reference policy:

$$\max_{\pi_\phi}\; \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi_\phi(\cdot\mid x)}\big[r_\theta(x, y)\big] \;-\; \beta\, \mathbb{D}_{\mathrm{KL}}\big[\pi_\phi(y \mid x)\,\|\,\pi_{\mathrm{ref}}(y \mid x)\big],$$

where $\pi_{\mathrm{ref}}$ is the reference (initial) policy and $\beta$ controls the strength of the KL penalty (both objectives are sketched in code below).
These formalizations underpin both the evaluation and policy optimization loops that rely on high-fidelity reward signals.
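A minimal PyTorch sketch of these two objectives, assuming scalar reward scores and summed sequence log-probabilities are already computed elsewhere; the function names are illustrative and not tied to the released codebase:

```python
import torch
import torch.nn.functional as F


def bradley_terry_loss(chosen_rewards: torch.Tensor,
                       rejected_rewards: torch.Tensor) -> torch.Tensor:
    """Negative log-likelihood that the chosen solution outranks the rejected one.

    Both inputs are shape (batch,) scalar scores r_theta(x, y).
    """
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()


def kl_penalized_objective(proxy_rewards: torch.Tensor,
                           policy_logprobs: torch.Tensor,
                           ref_logprobs: torch.Tensor,
                           beta: float = 0.1) -> torch.Tensor:
    """Sample-based estimate of E[r_theta(x, y)] - beta * KL(pi || pi_ref), to be maximized.

    policy_logprobs / ref_logprobs: summed token log-probabilities of each
    sampled response under the current policy and the frozen reference model.
    """
    kl_estimate = policy_logprobs - ref_logprobs  # single-sample KL estimator
    return (proxy_rewards - beta * kl_estimate).mean()
```

In practice these terms appear inside whatever policy-optimization method is used (e.g., PPO), with the KL term either folded into the reward or applied as a separate penalty.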
RewardMath’s structure and methodology thus embody a clear, rigorous mathematical framework for reward robustness in RLHF—specifically tailored to the demands of mathematical reasoning tasks in the era of LLMs.