
RoboRewardBench: Robust Reward Model Benchmark

Updated 8 January 2026
  • RoboRewardBench is a human-verified benchmark designed to evaluate reward models across NLP and robotics with controlled input perturbations to ensure robust ranking performance.
  • It systematically measures model stability using diverse, meaning-preserving transformations and calibrated annotations, highlighting brittleness in advanced reward models.
  • Applications span natural language transformations and robotic task evaluations, guiding improvements in reward model training and real-world deployment.

RoboRewardBench is a human-verified benchmark designed for the systematic evaluation of reward models in both NLP and robotics, with a focus on robustness, calibration, and real-world alignment utility. Originating as a robustness extension to NLP reward model evaluation and later adapted to video-language reward assessment in robotics, RoboRewardBench comprises diverse, meaning-preserving input transformations and a calibrated progression-scale annotation protocol, enabling a principled comparison of reward models across modalities and application domains (Wu et al., 14 Mar 2025; Lee et al., 2 Jan 2026).

1. Benchmark Composition and Design Objectives

RoboRewardBench systematically measures reward model consistency and reliability by exposing inputs to a variety of small, meaning-preserving or ranking-preserving transformations. Its design goals are to quantify the stability of pairwise ranking or score assignments for near-identical content, to detect spurious correlations learned from static datasets, and to motivate the development of reward models (RMs) that are insensitive to trivial or superficial changes in input representation.

NLP Setting

Within text-based assessment, RoboRewardBench includes taxonomy-driven perturbations across three categories:

  • Controlled templates: e.g., added quotes, altered punctuation, appended meaningless strings, tautologies, instruction reordering, and Caesar cipher encodings.
  • Naturalistic, model-agnostic transformations: e.g., paraphrasing (LLM-based), multi-round back-translation, TTS+ASR roundtrip, homoglyph substitutions, character-level noise, and word deletion.
  • Domain-targeted adversarials: e.g., code minification/comments, answer format alteration in math, and standard jailbreak templates in safety evaluation.

Transformed inputs are crafted to ensure semantic equivalence or at least preservation of correct pairwise rankings.
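A minimal sketch of two controlled-template perturbations and a ranking-preservation check; the function names and the `rm_score` callable are illustrative assumptions, not the benchmark's actual API:

```python
def append_meaningless_string(text: str) -> str:
    """Append a semantically empty suffix (a controlled-template perturbation)."""
    return text + " Thank you for reading."

def caesar_cipher(text: str, shift: int = 3) -> str:
    """Encode alphabetic characters with a fixed-shift Caesar cipher."""
    out = []
    for ch in text:
        if ch.isalpha():
            base = ord("a") if ch.islower() else ord("A")
            out.append(chr((ord(ch) - base + shift) % 26 + base))
        else:
            out.append(ch)
    return "".join(out)

def ranking_preserved(rm_score, prompt, y_win, y_lose, transform) -> bool:
    """True if the winning response still outscores the losing one
    after both responses receive the same perturbation."""
    return rm_score(prompt, transform(y_win)) > rm_score(prompt, transform(y_lose))
```

Because the perturbations are meaning-preserving, a robust RM should satisfy `ranking_preserved` on (nearly) every pair; failures are exactly the brittleness the benchmark counts.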

Robotics Setting

In the robotics domain, RoboRewardBench utilizes evaluation splits derived from real-robot corpora (Open X-Embodiment and RoboArena), covering over 2,800 human-verified episodes spanning 22 tasks and 14 embodiments. Reward labels are calibrated through manual rubric-based verification, with final scores representing progress on a discretized five-level scale. Input perturbations are less focused on superficial text but include variation via video-based counterfactual relabeling and partial progression through temporal clipping (Lee et al., 2 Jan 2026).

2. Evaluation Metrics and Protocols

RoboRewardBench employs modality-appropriate, mathematically rigorous metrics to quantify RM robustness, accuracy, and calibration.

NLP Robustness Metrics

  • Pairwise ranking accuracy:

acc_{\mathrm{original}} = \mathbb{P}\left[ RM(x, y_{\mathrm{winning}}) > RM(x, y_{\mathrm{losing}}) \right]

Post-transformation:

acc_{\mathrm{transformed}} = \mathbb{P}\left[ RM(\tilde{x}, \tilde{y}_{\mathrm{winning}}) > RM(\tilde{x}, \tilde{y}_{\mathrm{losing}}) \right]

  • Robustness degradation:

\Delta_{acc} = acc_{\mathrm{original}} - acc_{\mathrm{transformed}}

  • Normalized robustness score:

R = 1 - \frac{\Delta_{acc}}{acc_{\mathrm{original}}} = \frac{acc_{\mathrm{transformed}}}{acc_{\mathrm{original}}}

R = 1 indicates perfect invariance.
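The three metrics above can be computed directly from per-pair reward scores; a minimal sketch, assuming we already have (winning, losing) score tuples before and after transformation:

```python
def pairwise_accuracy(score_pairs):
    """Fraction of pairs where the winning response outscores the losing one."""
    return sum(sw > sl for sw, sl in score_pairs) / len(score_pairs)

def robustness(original_pairs, transformed_pairs):
    """Return (acc_original, acc_transformed, Delta_acc, normalized R)."""
    acc_orig = pairwise_accuracy(original_pairs)
    acc_trans = pairwise_accuracy(transformed_pairs)
    delta = acc_orig - acc_trans   # robustness degradation
    r = acc_trans / acc_orig       # normalized score; R = 1 means perfect invariance
    return acc_orig, acc_trans, delta, r
```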

Robotics Progress Prediction

  • Mean absolute error (MAE):

MAE = \frac{1}{N} \sum_{i=1}^{N} \left| \hat{y}_i - y_i \right|

with \hat{y}_i = \mathrm{argmax}_k \; p_\theta(r = k \mid v_i, t_i), evaluated on discrete progress labels (1–5).
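A minimal sketch of this computation, assuming the model outputs a probability vector over the five progress labels for each episode:

```python
def predict_progress(probs):
    """argmax_k p(r = k | v, t), mapped to the 1-5 label space."""
    return max(range(len(probs)), key=probs.__getitem__) + 1

def mae(prob_rows, labels):
    """Mean absolute error between argmax predictions and discrete labels."""
    preds = [predict_progress(p) for p in prob_rows]
    return sum(abs(p - y) for p, y in zip(preds, labels)) / len(labels)
```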

  • Offline Reward-Policy Correlation:

Spearman correlation (\rho) between reward-model accuracy and RL policy success, with \rho = 0.83 indicating a strong relationship in simulation.

3. Model Architectures, Training, and Data Regimes

RoboRewardBench is used to evaluate a broad spectrum of SOTA reward models.

NLP Reward Model Families

  • Backbones: Instruction-tuned Llama SFT LMs, sequence-classification and generative-classification architectures, with sizes ranging from 3B to 70B parameters.
  • Training regimes: Fine-tuning on large preference datasets (e.g., HelpSteer2, OpenAI RLHF) using Bradley–Terry-style objectives or MSE on scalar scores.

Robotics Reward Models

  • Backbone: Qwen3-VL vision-LLMs (4B and 8B), with frozen ResNet encoders for video and transformer-based fusion.
  • Head: Five-way classification token predicts progress (\in \{1, 2, 3, 4, 5\}).
  • Data augmentation: Counterfactual relabeling (scene/action rewrites, alternative tasks, monotonicity-enforced labeling) and negative temporal clipping (partial progression).
  • Supervision: Human-verified labels and VLM-based rubric validation for negative/near-miss episode synthesis.
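An illustrative sketch of the negative temporal clipping augmentation: truncating an episode's frames and lowering its 1–5 progress label accordingly. The proportional relabeling rule here is an assumption for illustration, not the benchmark's exact scheme:

```python
def temporal_clip(frames, final_label, keep_fraction):
    """Keep the first keep_fraction of frames and scale the 1-5 progress
    label down proportionally (never below 1), yielding a partial-progress
    negative example from a complete episode."""
    n_keep = max(1, int(len(frames) * keep_fraction))
    new_label = max(1, round(final_label * keep_fraction))
    return frames[:n_keep], new_label
```

For a fully successful episode (label 5) clipped to its first 40% of frames, this yields a partial-progress label of 2, giving the reward model supervision on incomplete trajectories.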

4. Empirical Findings and Model Behavior

NLP Robustness

  • SOTA reward models achieve \geq 95\% pairwise accuracy on unperturbed RewardBench instances.
  • Transformations cause substantial degradation (drops of 10–70 percentage points). For example, on the "Chat Hard" subset a leading RM falls from 70.6% to 54.1% under random naturalistic transformations; for "Ignore Above" and homoglyph substitutions, \Delta_{acc} reaches 43 and 39 points, respectively.
  • Under certain perturbations, many models perform below random chance, indicating profound brittleness (Wu et al., 14 Mar 2025).

Robotics Generalization and Policy Impact

| Model | MAE (overall) | MAE (RoboArena) |
| --- | --- | --- |
| RoboReward-8B | 0.665 | 0.768 |
| GPT-5 mini | 0.691 | 0.862 |
| RoboReward-4B | 0.845 | 0.806 |
| Qwen3-VL 8B | 0.892 | 0.847 |
| Gemini 2.5 Pro | 0.902 | 0.936 |
| Gemini Robotics-ER 1.5 | 0.906 | 1.002 |

RoboReward-8B outperforms all baselines, including models with up to 30B parameters. In deployment, RL agents trained with rewards from RoboReward-8B match or surpass those using Gemini Robotics-ER 1.5 across unseen physical tasks. For example, on the "open drawer" task, RoboReward-8B enabled an 80% success rate versus 45% for the proprietary alternative, narrowing the gap to human-provided reward learning (90%).

5. Robustness Improvement via Paraphrase-Alignment Regularization

A principled robustification approach is introduced in the NLP setting. Given data points (x, y, s), automatic paraphrases \tilde{y} augment the dataset. The proposed regularization penalizes deviation in RM scoring between the original and paraphrased responses:

\mathbb{E}_{(x, y, \tilde{y}, s)} \left[ \left( RM(x, y) - s \right)^2 + \alpha \left( RM(x, y) - RM(x, \tilde{y}) \right)^2 \right]

This alignment regularizer (\alpha = 10) enforces score invariance and yields up to 40% improvement in robustness (\Delta_{acc} reduction from \sim16.6 to \sim8.7 points on "Chat Hard"). Gains generalize to unseen transformations and methods, including RL-free alignment pipelines and best-of-n sampling, with win rates for regularized RMs up to 64% versus standard-trained RMs (Wu et al., 14 Mar 2025).
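A minimal sketch of the regularized objective above for a single example, assuming `rm(x, y)` returns a scalar score and using \alpha = 10 as in the paper:

```python
def alignment_loss(rm, x, y, y_para, s, alpha=10.0):
    """Squared score-fitting error plus an alpha-weighted penalty on the
    score deviation between the response y and its paraphrase y_para."""
    base = rm(x, y)
    fit = (base - s) ** 2                     # (RM(x, y) - s)^2
    invariance = (base - rm(x, y_para)) ** 2  # (RM(x, y) - RM(x, y~))^2
    return fit + alpha * invariance
```

In training, the expectation over the dataset is approximated by averaging this loss over minibatches; the invariance term pulls paraphrase scores toward the original response's score.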

6. Limitations and Prospective Directions

  • Semantic preservation: For certain transformations (e.g., paraphrases), meaning equivalence is filtered by cosine similarity (\geq 0.7) but not guaranteed.
  • Judging and annotation: NLP robustness relies on automatic LLM judgments; robotics annotation includes manual human verification but could benefit from broader human studies.
  • Generalization challenges: Substantial variation in cross-embodiment and cross-task performance remains in robotics.
  • Proposed extensions: RoboRewardBench v2 may introduce learned adversarial perturbations, human-in-the-loop ranking invariance audits, and adaptation to multimodal/code-synthesizing RMs or meaning-altering spectrum calibration.

7. Significance and Impact

RoboRewardBench exposes critical failure modes in current reward modeling—a high baseline accuracy does not imply robustness, with state-of-the-art models sensitive to trivial or irrelevant input changes. In text and video settings, this brittleness can fundamentally undermine the reliability of RLHF pipelines, policy improvement in robotics, and automated evaluation frameworks. By providing standardized, human-verified assessment under controlled and naturalistic perturbations, RoboRewardBench enables rigorous benchmarking and targeted development of robust, generalizable reward models for both NLP and robotics (Wu et al., 14 Mar 2025, Lee et al., 2 Jan 2026). The accompanying data augmentation and alignment techniques further establish effective baselines for increasing reward model invariance and utility in real-world applications.
