RoboReward-8B: Vision-Language Reward Model
- RoboReward-8B is an 8-billion-parameter vision-language model (VLM) that predicts reward signals for robotic reinforcement learning, combining a ViT-based visual encoder with an 8B-parameter Qwen3 language decoder.
- It employs a systematic pipeline with synthetic negatives and near-miss examples to enhance training data diversity and reduce reliance on human labeling.
- Benchmark evaluations show that RoboReward-8B achieves lower MAE and improved real-world RL performance compared to larger, off-the-shelf models.
RoboReward-8B is an 8-billion-parameter general-purpose vision-language model (VLM) designed to assign reward values in robotic reinforcement learning. Developed as part of the RoboReward project, it is trained to predict rewards for video-task pairs using large-scale datasets of real-robot demonstrations together with a systematic pipeline for generating and validating negative and near-miss examples. The model addresses the longstanding challenge of providing reliable, scalable, and informative reward signals without labor-intensive human labeling or brittle handcrafted objectives, and it demonstrates improved correlation with human evaluators as well as tangible gains in real-robot reinforcement learning (Lee et al., 2 Jan 2026).
1. Architecture and Parameterization
RoboReward-8B utilizes a multi-modal architecture with a Qwen3-VL backbone at its core. The visual encoder is a frozen ViT-style transformer comprising 12 layers with a 14×14 patch size and a hidden dimension of 1,024. The language decoder is implemented via an 8-billion-parameter Qwen3 LLM consisting of 48 transformer layers, each with a hidden dimension of 4,096, 32 attention heads, and a 2,048-dimensional feedforward layer.
Multi-modal fusion occurs through cross-attention layers inserted every two language layers, each comprising 32 heads attending from text to visual tokens (each of dimension 1,024). After the final language layer, a fully connected reward head is attached, which supports two output configurations: (1) five logits corresponding to discrete 1–5 progress predictions, passed through a softmax and trained via cross-entropy, and (2) a single real-valued regression output, trained via mean squared error. The combined parameter count totals approximately 8 billion, with ≈7.8B residing in the backbone and ≈0.2B in the reward head.
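As a minimal sketch of the classification configuration described above, the snippet below turns five logits into a softmax distribution and a discrete 1–5 progress score. The helper name `decode_progress`, the logit-to-score ordering, and the example logits are illustrative assumptions, not part of any released model code.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def decode_progress(logits):
    """Map five head logits to (discrete score in 1..5, class probabilities).

    Hypothetical helper: logit index k is assumed to correspond to score k + 1.
    """
    if len(logits) != 5:
        raise ValueError("expected five logits, one per progress score")
    probs = softmax(logits)
    best = max(range(5), key=lambda k: probs[k])
    return best + 1, probs


# Made-up logits, used only to exercise the function.
score, probs = decode_progress([-1.0, 0.2, 0.5, 2.3, 1.1])
print(score, [round(p, 3) for p in probs])
```

The regression configuration would instead read a single real-valued output directly, so no decoding step of this kind is needed there.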
2. Training Data and Negative Example Augmentation
Training data for RoboReward-8B is sourced principally from two corpora: Open X-Embodiment (OXE) and RoboArena. OXE contributes approximately 1 million real-robot demonstration episodes spanning 22 platforms, paired with natural-language task descriptions. Uniform subsampling up to 1,200 episodes per source ensures balanced representation; all are labeled as perfect demonstrations (score = 5). RoboArena comprises 2,800 human-scored real policy evaluation rollouts on the DROID (Franka) platform, covering both successes and failures on a 1–5 scale.
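The per-source cap can be illustrated with a short sketch. It assumes episodes arrive as `(source_name, episode_id)` pairs and that subsampling is uniform without replacement; the function name, record layout, and seed are hypothetical choices for illustration only.

```python
import random
from collections import defaultdict


def subsample_per_source(episodes, cap=1200, seed=0):
    """Keep at most `cap` episodes per source, chosen uniformly at random.

    `episodes` is an iterable of (source_name, episode_id) pairs; this record
    layout is an assumption made for the example.
    """
    by_source = defaultdict(list)
    for source, episode_id in episodes:
        by_source[source].append(episode_id)

    rng = random.Random(seed)
    kept = {}
    for source, ids in by_source.items():
        # Small sources are kept whole; large ones are sampled down to the cap.
        kept[source] = ids if len(ids) <= cap else rng.sample(ids, cap)
    return kept


# Toy input: one oversized source and one small source.
toy = [("source_a", i) for i in range(5000)] + [("source_b", i) for i in range(300)]
kept = subsample_per_source(toy)
print({name: len(ids) for name, ids in kept.items()})
```

Capping rather than reweighting keeps large sources from dominating while leaving small sources untouched.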
To address the success-heavy, low-diversity nature of available demonstrations, a negative-example augmentation pipeline is applied, comprising:
- Counterfactual relabeling: Each video of a perfect demonstration is paired with four alternative task texts, generated via a multi-stage LLM+VLM pipeline, with assigned progress scores of {1,2,3,4} according to a fixed rubric.
- Temporal clipping: Successful demonstration videos are truncated at 25%, 50%, and 75% of their length, each producing a partial-progress outcome labeled {1,2,3} as validated by an LLM+VLM-based scoring system.
- Validation: All synthetic negative and near-miss examples are filtered using a VLM-based end-of-episode scoring check, removing label-mismatched or semantically ungrounded pairs.
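The three steps above can be sketched as a small pipeline. This is a toy illustration under stated assumptions: videos are lists of frames, the alternative task texts and their rubric scores are supplied by hand (two are used here for brevity instead of four, and the LLM+VLM generation stage is not reproduced), and `toy_scorer` is a stub standing in for the VLM end-of-episode scorer. All function and field names are hypothetical.

```python
def clip_frames(frames, fraction):
    """Return the leading `fraction` of a frame sequence (at least one frame)."""
    keep = max(1, int(len(frames) * fraction))
    return frames[:keep]


def build_candidates(frames, task, alt_tasks):
    """Create synthetic candidates from one perfect demonstration.

    `alt_tasks` maps an alternative task text to its proposed rubric score.
    Clips start unlabeled and receive a score during validation.
    """
    candidates = [
        {"frames": frames, "task": text, "score": score, "kind": "counterfactual"}
        for text, score in alt_tasks.items()
    ]
    for fraction in (0.25, 0.50, 0.75):
        candidates.append(
            {"frames": clip_frames(frames, fraction), "task": task,
             "score": None, "kind": "clip"}
        )
    return candidates


def validate(candidates, scorer):
    """Drop candidates whose label the scorer does not support (assumed rule)."""
    kept = []
    for cand in candidates:
        predicted = scorer(cand["frames"], cand["task"])
        if cand["kind"] == "clip":
            # Clips are partial-progress examples, so only scores 1-3 are accepted.
            if predicted in (1, 2, 3):
                kept.append({**cand, "score": predicted})
        elif predicted == cand["score"]:
            kept.append(cand)
    return kept


FULL = list(range(40))  # stand-in for a 40-frame video
TASK = "put the red block in the bowl"
ALT = {"pick up the blue cup": 1, "move the red block near the bowl": 4}


def toy_scorer(frames, task):
    """Stub scorer; a real system would query a VLM here."""
    if task == "pick up the blue cup":
        return 1
    if task == "move the red block near the bowl":
        return 2  # disagrees with the proposed label of 4, so the pair is dropped
    return 1 + int(3 * len(frames) / len(FULL))


kept = validate(build_candidates(FULL, TASK, ALT), toy_scorer)
print([(c["kind"], c["score"]) for c in kept])
```

In this toy run, one counterfactual pair is filtered out because the stub scorer disagrees with its proposed label, which mirrors the role of the validation step.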
The final dataset (post-augmentation) comprises 54,135 labeled (video, task) pairs, split as 45,072 for training, 6,232 for validation, and 2,831 for testing. The distribution includes approximately 22,500 perfect examples, 11,000 negatives, and 11,500 near-misses.
3. Training Objectives and Loss Functions
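RoboReward-8B supports two principal training objectives for scalar reward prediction given a video $v$ and a task description $t$. For discrete label classification (scores 1–5), the model predicts logits $z = (z_1, \dots, z_5)$ and minimizes the cross-entropy loss against the ground-truth label $y \in \{1, \dots, 5\}$:

$$\mathcal{L}_{\mathrm{CE}} = -\log p_y, \qquad p_k = \frac{\exp(z_k)}{\sum_{j=1}^{5} \exp(z_j)},$$

where $p_k$ is the softmax-normalized probability over logits. For regression, the mean squared error is minimized:

$$\mathcal{L}_{\mathrm{MSE}} = (\hat{r} - y)^2,$$

where $\hat{r}$ is the real-valued output of the reward head for the pair $(v, t)$. Fine-tuning minimizes the loss that matches the selected head configuration; evaluation uses mean absolute error (MAE) over $N$ episodes:

$$\mathrm{MAE} = \frac{1}{N} \sum_{i=1}^{N} \lvert \hat{r}_i - y_i \rvert.$$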
Pairwise ranking or margin-based objectives were not used in the final 8B configuration.
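For concreteness, the three quantities named in this section (cross-entropy, mean squared error, and mean absolute error) can be written in a few lines of plain Python. This is a sketch of the standard definitions, not the project's training code; the function names and the example numbers are made up for illustration.

```python
import math


def cross_entropy(logits, label):
    """Cross-entropy for one example; `label` is a score in 1..5."""
    peak = max(logits)
    # log-sum-exp computed stably, then subtract the logit of the true class.
    log_norm = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return log_norm - logits[label - 1]


def mse(preds, targets):
    """Mean squared error over paired predictions and labels."""
    return sum((p - t) ** 2 for p, t in zip(preds, targets)) / len(targets)


def mae(preds, targets):
    """Mean absolute error over paired predictions and labels."""
    return sum(abs(p - t) for p, t in zip(preds, targets)) / len(targets)


# Uniform logits give a loss of log(5) regardless of the label.
print(cross_entropy([0.0] * 5, label=3))

# Two made-up regression outputs against labels 5 and 2.
print(mse([4.0, 2.5], [5, 2]), mae([4.0, 2.5], [5, 2]))
```

MAE is reported on the same 1–5 scale as the labels, which is why it is the more interpretable of the two error measures for evaluation.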
4. Benchmarking and Comparative Evaluation
RoboReward-8B is evaluated on RoboRewardBench, a held-out benchmark of 2,831 human-verified episodes spanning 14 robot platforms and approximately 30 short-horizon tasks such as pick-and-place, drawer opening, and object stacking. The primary metric is MAE between predicted and ground-truth reward labels on the 1–5 scale, with lower values signifying greater accuracy.
Key results are summarized below (lower MAE is better):
| Model | MAE (Overall) | MAE (RoboArena) | MAE (OXE) |
|---|---|---|---|
| RoboReward-8B | 0.665 | 0.768 | 0.660 |
| GPT-5 mini (API) | 0.691 | 0.862 | 0.683 |
| GPT-5 full (API) | 0.811 | 1.028 | 0.801 |
| RoboReward-4B | 0.845 | 0.806 | 0.847 |
| Gemini 2.5 Pro (API) | 0.902 | 0.936 | 0.900 |
| Gemini Robotics-ER 1.5 | 0.906 | 1.002 | 0.902 |
| Qwen3-VL Instruct (8B) | 0.892 | 0.847 | 0.894 |
RoboReward-8B outperforms all evaluated models, including larger proprietary VLMs, on per-episode reward prediction. However, generalization gaps remain substantial across embodiments and task classes: for example, MAE is as low as ≈0.3 on some exocentric pick-place tasks and exceeds 1.4 on bimanual or non-prehensile object interaction tasks.
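A per-group breakdown of the kind described above can be computed by grouping absolute errors before averaging. The sketch below assumes each evaluation record is a `(group, predicted_score, label)` triple; the group names and values are made up and are not benchmark results.

```python
from collections import defaultdict


def mae_by_group(records):
    """Return {group: MAE} for (group, predicted, label) triples."""
    totals = defaultdict(lambda: [0.0, 0])  # group -> [sum of |error|, count]
    for group, predicted, label in records:
        totals[group][0] += abs(predicted - label)
        totals[group][1] += 1
    return {group: err_sum / count for group, (err_sum, count) in totals.items()}


# Made-up records used only to exercise the function.
toy_records = [
    ("pick_place", 5, 5),
    ("pick_place", 4, 5),
    ("bimanual", 2, 4),
    ("bimanual", 5, 3),
]
breakdown = mae_by_group(toy_records)
print(breakdown)
```

Reporting MAE per embodiment or task class in this way exposes gaps that a single overall average would hide.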
5. Deployment in Real-Robot Reinforcement Learning
The model is tested in real-world robotic policy learning on the WidowX 250 arm on two previously unseen tasks: (1) picking up a small monkey toy and placing it on a yellow towel, and (2) pulling a drawer out. The base policy is a diffusion transformer pretrained on BridgeData V2 with goal-image conditioning, and RL is conducted using DSRL-SAC with sparse, end-of-episode rewards.
The reward evaluation scheme applies RoboReward-8B as a zero-shot, episode-level reward function, mapping its discrete 1–5 output linearly to the interval [0, 1]. This is compared against Gemini Robotics-ER 1.5 under the same reward mapping and against an oracle human reward (+1 if a human judges the episode successful, otherwise 0). The protocol includes replay buffer warm-up, 6,000 RL gradient steps, and up to 70 environment steps per episode, with standard SAC hyperparameters.
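A linear mapping from the discrete score to a sparse episode reward might look like the sketch below. The target interval [0, 1] is an assumption, chosen so that the model-based reward shares the 0/1 scale of the human oracle; the function names are hypothetical.

```python
def score_to_reward(score):
    """Linearly map a discrete progress score in 1..5 to a reward in [0, 1].

    Assumed mapping: 1 -> 0.0, 5 -> 1.0, with equal spacing in between.
    """
    if score not in (1, 2, 3, 4, 5):
        raise ValueError("score must be an integer from 1 to 5")
    return (score - 1) / 4


def human_oracle_reward(success):
    """Oracle baseline: +1 if a human judges the episode successful, else 0."""
    return 1.0 if success else 0.0


# End-of-episode rewards for each possible model score.
print([score_to_reward(s) for s in range(1, 6)])
```

Unlike the binary oracle, a mapping of this form gives partial credit to near-miss episodes.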
Success rates (%), averaged over 20 held-out trials post-finetuning, are:
| Task | Base Policy | +Human Reward | +RoboReward-8B | +Gemini ER 1.5 |
|---|---|---|---|---|
| Pick-and-place monkey | 5 | 75 (+70 pp) | 50 (+45 pp) | 10 (+5 pp) |
| Pull drawer out | 10 | 90 (+80 pp) | 80 (+70 pp) | 45 (+35 pp) |
RoboReward-8B achieves substantial real-world RL improvements, closing most of the gap to human-labeled rewards while requiring no human-in-the-loop label generation. It significantly outperforms Gemini Robotics-ER 1.5 in this reward optimization setting.
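The "gap closed" claim can be checked directly from the success rates in the table: divide each method's improvement over the base policy by the improvement obtained with human rewards. The dictionary keys and function name below are illustrative; the numbers are copied from the table.

```python
# Success rates (%) from the table above.
SUCCESS = {
    "Pick-and-place monkey": {"base": 5, "human": 75, "roboreward": 50, "gemini": 10},
    "Pull drawer out": {"base": 10, "human": 90, "roboreward": 80, "gemini": 45},
}


def gap_recovered(row, method):
    """Fraction of the human-reward improvement recovered by `method`."""
    return (row[method] - row["base"]) / (row["human"] - row["base"])


for task, row in SUCCESS.items():
    shares = {m: round(gap_recovered(row, m), 3) for m in ("roboreward", "gemini")}
    print(task, shares)
```

On these figures RoboReward-8B recovers roughly 64% of the human-reward improvement on the pick-and-place task and 87.5% on the drawer task, compared with about 7% and 44% for Gemini Robotics-ER 1.5.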
6. Significance and Implications
The empirical results from RoboReward-8B demonstrate that (a) targeted large-scale reward supervision—specifically, through the introduction of synthetic negatives and near-misses—enables the construction of compact (8B parameter) VLM reward models that outperform much larger off-the-shelf models for robotic applications, and (b) improved offline MAE performance serves as a reliable predictor for policy learning outcomes in downstream RL deployments.
A plausible implication is that comprehensive reward datasets incorporating diverse error modes, paired with rigorous automated validation mechanisms, can substantially improve the reliability and utility of VLMs for robotics, potentially reducing dependence on brittle manual reward engineering or resource-intensive human annotation frameworks (Lee et al., 2 Jan 2026).