Robustness of general-purpose VLMs as robot reward models

Determine whether state-of-the-art vision-language models, pretrained on large and diverse internet-scale datasets, can provide rewards for real-world robotic reinforcement learning that are accurate and reliable enough for effective policy training.

Background

Reinforcement learning for real-world robotics requires precise and reliable reward signals, but obtaining such rewards is labor-intensive. Vision-language models (VLMs) offer a potential automated alternative due to their broad perceptual and linguistic capabilities.

However, despite extensive pretraining on diverse data, it remains uncertain whether these general capabilities suffice to deliver rewards with the fidelity needed for reinforcement learning. This uncertainty motivates the creation of RoboReward and RoboRewardBench to rigorously evaluate reward accuracy across many robots and tasks.
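To make the setting concrete, the sketch below shows one generic way a VLM could be queried for per-frame rewards in a robot learning loop. This is a minimal illustration, not the paper's method: the query_vlm function, its prompt, and the Transition container are assumptions introduced here and do not reproduce RoboReward's models or evaluation protocol.

    """Minimal sketch (illustrative, not from the paper) of a VLM used as a
    reward model: each frame of a collected episode is scored against a
    natural-language task instruction."""

    from dataclasses import dataclass
    from typing import Callable
    import numpy as np


    @dataclass
    class Transition:
        observation: np.ndarray   # camera image of the scene, HxWx3
        instruction: str          # natural-language task description
        reward: float             # scalar reward assigned by the VLM


    def query_vlm(image: np.ndarray, instruction: str) -> float:
        """Hypothetical stand-in for a vision-language model call.

        A real system would send the image plus a prompt such as
        "On a scale of 0 to 1, how close is the robot to completing:
        {instruction}?" and parse the numeric answer from the response.
        """
        return 0.0  # placeholder value; no model is actually queried here


    def label_episode(frames: list[np.ndarray],
                      instruction: str,
                      reward_fn: Callable[[np.ndarray, str], float] = query_vlm
                      ) -> list[Transition]:
        """Assign a VLM-derived reward to every frame of an episode."""
        return [Transition(f, instruction, reward_fn(f, instruction)) for f in frames]


    if __name__ == "__main__":
        # Toy usage: three blank frames labelled for a pick-and-place instruction.
        frames = [np.zeros((224, 224, 3), dtype=np.uint8) for _ in range(3)]
        transitions = label_episode(frames, "put the red block in the bowl")
        print([t.reward for t in transitions])

Whether rewards produced this way are precise and consistent enough to drive policy training is exactly the question RoboRewardBench is built to evaluate.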

References

While VLMs are pretrained on large datasets drawn from a diverse set of sources, endowing them with general vision-language abilities, it is not clear that these general abilities enable them, at present, to robustly provide rewards at the level of precision and reliability required by RL training.

RoboReward: General-Purpose Vision-Language Reward Models for Robotics (2601.00675 - Lee et al., 2 Jan 2026) in Introduction