TOPReward: Token Probabilities as Hidden Zero-Shot Rewards for Robotics
This presentation introduces TOPReward, a breakthrough framework that extracts dense reward signals from pretrained vision-language models by mining their internal token probabilities rather than their unreliable text outputs. We explore how this zero-shot approach achieves robust progress estimation across diverse robot platforms and manipulation tasks, outperforming prior methods on large-scale benchmarks while enabling practical applications in success detection and policy optimization, all without additional model training.

Script
Most vision-language models generate unreliable numeric outputs when you ask them to score robotic task progress. But what if the answer were hiding inside their internal probability distributions all along? This paper reveals that open-source vision-language models already understand temporal progress at a level their text outputs never expose.
Training robots with reinforcement learning hits a wall when rewards arrive only at task completion. Current solutions either demand costly human labels for every new task or rely on vision-language models to output numeric progress scores—a strategy that fails spectacularly on open-source models whose instruction-following abilities can't be trusted for precise numeric reasoning.
The authors sidestep this entire problem by looking where no one else did.
TOPReward reframes progress estimation as a probability-mining task. At each time step, the method asks the vision-language model whether the instruction has been completed and, instead of letting it generate text, reads off the log-probability the model assigns to an affirmative token. Stringing these confidences together yields a smooth, temporally dense progress curve without the model ever generating unreliable numeric text.
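To make the idea concrete, here is a minimal sketch of that probability-mining step, assuming a Hugging Face causal-LM interface. The model name, prompt wording, and the choice of " yes" as the affirmative token are illustrative assumptions; the paper's actual pipeline feeds video frames through a multimodal VLM processor rather than a text placeholder.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Text-only stand-in for the video-capable VLM; a real setup would also
# pass the frames observed up to step t through the model's processor.
tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def affirmative_logprob(prompt: str, affirm: str = " yes") -> float:
    """Log-probability the model assigns to an affirmative next token."""
    inputs = tok(prompt, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits[0, -1]        # next-token logits
    logprobs = torch.log_softmax(logits, dim=-1)
    affirm_id = tok.encode(affirm)[0]                 # first subword of " yes"
    return logprobs[affirm_id].item()

instruction = "fold the towel"
num_frames = 10
progress_curve = []
for t in range(num_frames):                           # one query per time step
    prompt = (
        f"Instruction: {instruction}\n"
        f"Observation: <frames up to step {t}>\n"     # placeholder for video input
        "Has the instruction been completed? Answer yes or no:"
    )
    progress_curve.append(affirmative_logprob(prompt))
# progress_curve now traces the model's confidence over time.
```

Because the score is read directly from the output distribution, it is dense and continuous even when the model's generated text would be erratic.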
The framework works across wildly different robot platforms and tasks. Whether it's a single-arm manipulator folding towels or a bimanual system assembling objects, TOPReward produces smooth progress curves that align with actual task completion. The key insight is that these curves emerge from the model's internal confidence about instruction satisfaction, not from fragile text generation.
The authors validate this on an unprecedented scale.
On 39 datasets from Open X-Embodiment and a new 130-task benchmark called Mani, TOPReward dominates. While the prior state-of-the-art method called GVL achieves near-random performance on open-source vision-language models, TOPReward delivers strong correlation with ground-truth progress using the same backbones. The difference is dramatic: GVL produces noisy, unreliable signals while TOPReward tracks actual task stages with remarkable consistency.
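As an illustration of how such curves can be scored, a common recipe is rank correlation between the predicted progress curve and ground-truth progress annotations. The sketch below assumes a linear ground-truth ramp and Spearman correlation; the paper's exact metric and annotation scheme may differ.

```python
import numpy as np
from scipy.stats import spearmanr

def progress_score(predicted: np.ndarray) -> float:
    """Rank correlation between a predicted progress curve and a linear
    ground-truth ramp (0 at the start of the episode, 1 at success)."""
    ground_truth = np.linspace(0.0, 1.0, len(predicted))
    rho, _ = spearmanr(predicted, ground_truth)
    return rho

# A smoothly rising curve scores 1.0; a noisy one scores near 0.
print(progress_score(np.array([0.1, 0.2, 0.5, 0.7, 0.9])))  # 1.0
print(progress_score(np.random.rand(50)))                    # ~0.0
```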
These example traces reveal the difference in signal quality. TOPReward's curves in orange rise smoothly as the robot progresses through subtasks, closely matching annotated ground truth. The competing method in blue jumps erratically, failing to capture the temporal structure of task execution. This stability matters enormously when you need reliable progress signals for learning algorithms.
The progress signals aren't just accurate; they're useful. When the authors fine-tune robot policies with advantage-weighted behavior cloning, using TOPReward to supply the advantage weights, they achieve perfect success rates on difficult manipulation tasks where standard behavior cloning trained on the same demonstrations fails repeatedly. The dense, temporally aligned rewards let the learning algorithm concentrate on the high-value segments of each demonstration.
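For intuition, here is a minimal sketch of an advantage-weighted behavior cloning objective, assuming per-step advantages come from deltas of the TOPReward progress curve and an AWR-style exponential weighting. The baseline, weighting scheme, and loss are illustrative choices, not the authors' exact recipe.

```python
import torch
import torch.nn.functional as F

def awbc_loss(policy, states, actions, progress, beta=1.0):
    """Advantage-weighted behavior cloning on one trajectory.

    Assumptions: per-step reward is the increase in the TOPReward
    progress curve, the baseline is the trajectory mean, and weights
    follow an AWR-style exponential (all illustrative choices).
    """
    rewards = progress[1:] - progress[:-1]                   # (T-1,)
    advantages = rewards - rewards.mean()                    # crude baseline
    weights = torch.exp(advantages / beta).clamp(max=20.0)   # stabilized

    pred = policy(states[:-1])                               # (T-1, act_dim)
    per_step = F.mse_loss(pred, actions[:-1], reduction="none").mean(-1)
    return (weights.detach() * per_step).mean()              # weighted BC loss

# Toy usage: a linear policy on random data with a rising progress curve.
policy = torch.nn.Linear(8, 2)
states, actions = torch.randn(50, 8), torch.randn(50, 2)
progress = torch.rand(50).sort().values                      # monotone stand-in
awbc_loss(policy, states, actions, progress).backward()
```

The key effect is that transitions which advance the progress curve receive large weights, so the cloned policy imitates the productive parts of a demonstration rather than every step equally.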
TOPReward proves that open-source vision-language models have been hiding robust progress understanding in their token probabilities all along, waiting for someone to look past their unreliable text outputs. Visit EmergentMind.com to explore more research and create your own presentation videos.