TOPReward: Temporal Progress for Robotics
- TOPReward is a probabilistically motivated temporal value function that computes normalized log-probabilities to generate dense progress signals.
- It leverages pretrained video vision-language models to extract and normalize token prediction probabilities as effective dense rewards for reinforcement and imitation learning.
- Benchmark results show TOPReward achieves high Value-Order Correlation (VOC) scores and improved success detection across diverse robotic platforms compared to traditional methods.
TOPReward is a probabilistically motivated temporal value function designed to extract fine-grained, zero-shot progress estimates for robotic task execution by probing the internal token prediction probabilities of pretrained video vision-language models (VLMs). Previous approaches prompt VLMs to output numeric progress values directly, a process prone to misrepresenting numbers. TOPReward instead computes the log-probability of task completion at each trajectory prefix, normalizes these values, and uses the resulting curves as dense reward or progress signals for downstream reinforcement or imitation learning (Chen et al., 22 Feb 2026).
1. Theoretical Foundations
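TOPReward addresses the challenge of generalizing temporal value functions for open-world robotic manipulation, where reward models must provide dense, monotonic feedback across diverse tasks and hardware without requiring fine-tuning or domain-specific data. The method operates on a video-instruction pair $(o_{1:T}, u)$, where $u$ is a natural language instruction and $o_{1:T} = (o_1, \dots, o_T)$ is a sequence of video observations.
A prompt template $c(\cdot)$ combines a video prefix with the instruction and asks whether the task has been completed. At each prefix $o_{1:t}$, TOPReward forms the full context $c(o_{1:t}, u)$ and extracts the log-probability
$$\ell_t = \log p_\theta\big(a \mid c(o_{1:t}, u)\big),$$
where $a$ is the Boolean token "True" and $p_\theta$ is the VLM's next-token distribution. The progression of $\ell_t$ with $t$ typically reflects the accumulating "belief" of the VLM that the task is completed.
To ensure comparability within each trajectory (whose raw log-probabilities live on $(-\infty, 0]$), min–max normalization computes:
$$\hat{p}_{t_k} = \frac{\ell_{t_k} - \min_{j} \ell_{t_j}}{\max_{j} \ell_{t_j} - \min_{j} \ell_{t_j} + \epsilon}, \qquad k = 1, \dots, K,$$
where $K$ prefixes $t_1 < \dots < t_K$ are uniformly selected between the initial and final frame, and $\epsilon$ is a small constant for numerical stability.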
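As a small worked example of the min–max normalization, the sketch below maps one trajectory's log-probabilities onto the unit interval. The function name `min_max_normalize`, the default `eps=1e-8`, and the sample log-probabilities are illustrative assumptions, not values taken from the paper.

```python
def min_max_normalize(log_probs, eps=1e-8):
    """Min-max normalize one trajectory's log-probabilities.

    eps is an assumed default; the source only calls it a small constant.
    """
    lo, hi = min(log_probs), max(log_probs)
    return [(lp - lo) / (hi - lo + eps) for lp in log_probs]


# Hypothetical log-probabilities of the "True" token at K = 5 prefixes.
log_probs = [-6.0, -4.5, -3.0, -1.5, -0.5]
progress = min_max_normalize(log_probs)
print([round(p, 3) for p in progress])  # [0.0, 0.273, 0.545, 0.818, 1.0]
```

Because the minimum and maximum are taken within a single trajectory, the lowest-scoring prefix always maps to 0 and the highest to just under 1, whatever the raw magnitudes are.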
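For applications requiring stepwise, advantage-like weights (e.g., behavior cloning with advantage-weighted regression), TOPReward derives a weight for each step from the increment in normalized progress between consecutive prefixes. The increment is multiplied by a scaling constant and the result is limited by a capping constant; the precise expression is given by Chen et al.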
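The sketch below shows one plausible reading of scaled, capped progress increments. The linear form `min(scale * increment, cap)`, the function name `capped_increments`, and the default constants are assumptions made for illustration; the expression used in the paper may differ (for example, it could apply an exponential before capping).

```python
def capped_increments(progress, scale=1.0, cap=1.0):
    """Scaled, capped differences between consecutive progress values.

    ASSUMPTION: weight = min(scale * (p_k - p_{k-1}), cap). This is an
    illustrative form, not the paper's confirmed definition.
    """
    weights = []
    for prev, curr in zip(progress, progress[1:]):
        weights.append(min(scale * (curr - prev), cap))
    return weights


# Hypothetical normalized progress curve with two large jumps.
progress_curve = [0.0, 0.1, 0.5, 0.6, 1.0]
weights = capped_increments(progress_curve, scale=2.0, cap=0.5)
print(weights)
```

With these inputs the two large jumps hit the cap of 0.5, while the small steps keep their scaled value of roughly 0.2. Capping prevents a single abrupt change in the VLM's belief from dominating the weighting.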
2. Algorithmic Procedure
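TOPReward requires a VLM that exposes token-level logits or log-probabilities. The outline is as follows:
- Input: Instruction $u$, video $o_{1:T}$, VLM $p_\theta$, number of prefixes $K$ (e.g., 16), completion token $a$ = "True", small $\epsilon$, and (optionally) a scaling constant and a maximum (cap) for the dense rewards.
- Prefix Selection: Uniformly choose $K$ frame indices $t_1 < \dots < t_K$ between 1 and $T$.
- Log-probability Evaluation: For each $t_k$, form the context $c(o_{1:t_k}, u)$ from the prompt template and compute $\ell_{t_k} = \log p_\theta(a \mid c(o_{1:t_k}, u))$.
- Normalization: Compute $\ell_{\min} = \min_k \ell_{t_k}$ and $\ell_{\max} = \max_k \ell_{t_k}$, then calculate $\hat{p}_{t_k} = (\ell_{t_k} - \ell_{\min}) / (\ell_{\max} - \ell_{\min} + \epsilon)$ for all $k$, i.e., the min–max normalization of Section 1.
- Dense Reward (if needed): Compute advantage-like increments from the $\hat{p}_{t_k}$ sequence.
- Output: Normalized progress curve $(\hat{p}_{t_1}, \dots, \hat{p}_{t_K})$ and, optionally, the dense rewards.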
This procedure produces per-episode progress curves suitable for dense reward shaping, success detection, and behavior cloning.
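The steps above can be sketched end to end as follows. This is a minimal sketch under stated assumptions: `log_prob_true` is a hypothetical callback standing in for the VLM query (it must return the log-probability of the "True" token for a given frame prefix and instruction), and `select_prefix_ends`, `progress_curve`, and `fake_scorer` are illustrative names. No real VLM API is called.

```python
from typing import Callable, List, Sequence, Tuple


def select_prefix_ends(num_frames: int, k: int) -> List[int]:
    """Return k uniformly spaced, 1-based prefix end indices in [1, num_frames]."""
    if k < 2 or num_frames < 1:
        raise ValueError("need k >= 2 and at least one frame")
    return [1 + round(i * (num_frames - 1) / (k - 1)) for i in range(k)]


def progress_curve(
    frames: Sequence,
    instruction: str,
    log_prob_true: Callable[[Sequence, str], float],  # hypothetical VLM wrapper
    k: int = 16,
    eps: float = 1e-8,  # assumed value for the small stability constant
) -> Tuple[List[int], List[float]]:
    """Score k video prefixes and min-max normalize the log-probabilities."""
    ends = select_prefix_ends(len(frames), k)
    log_probs = [log_prob_true(frames[:t], instruction) for t in ends]
    lo, hi = min(log_probs), max(log_probs)
    progress = [(lp - lo) / (hi - lo + eps) for lp in log_probs]
    return ends, progress


def fake_scorer(prefix: Sequence, instruction: str) -> float:
    """Stand-in for a VLM: belief grows linearly with prefix length (100 frames)."""
    return -5.0 * (1.0 - len(prefix) / 100.0)


frames = list(range(100))  # placeholder frames
ends, progress = progress_curve(frames, "pick up the cube", fake_scorer, k=5)
print(ends)
print([round(p, 3) for p in progress])
```

Passing the scorer as a callback keeps the prefix selection and normalization independent of any particular VLM interface. If a video has fewer frames than `k`, some prefix indices repeat; a real implementation may want to deduplicate them.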
3. Value-Order Correlation (VOC) Metric
TOPReward progress is evaluated by the Value-Order Correlation (VOC), which measures the rank-correlation between predicted normalized scores and true temporal ordering. For a sequence $(\hat{p}_{t_1}, \dots, \hat{p}_{t_K})$ (normalized progress) and corresponding frame indices $(t_1, \dots, t_K)$:
$$\mathrm{VOC} = \mathrm{rank\text{-}corr}\big((\hat{p}_{t_1}, \dots, \hat{p}_{t_K}),\ (t_1, \dots, t_K)\big)$$
A VOC of $+1$ signifies perfect monotonic increase; $0$ indicates no alignment; $-1$ indicates anti-correlation.
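A rank-correlation of this kind can be computed with the standard library alone. The sketch below assumes Spearman's rank correlation (Pearson correlation of average ranks); the section only says "rank-correlation", so the choice of Spearman, the tie handling, and the function names are assumptions.

```python
def _ranks(values):
    """1-based average ranks, so tied values share the same rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for m in range(i, j + 1):
            ranks[order[m]] = avg
        i = j + 1
    return ranks


def _pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0  # constant input: correlation is undefined, reported as 0 here
    return cov / (vx * vy) ** 0.5


def value_order_correlation(progress, frame_indices=None):
    """Spearman correlation between progress values and temporal order."""
    if frame_indices is None:
        frame_indices = list(range(len(progress)))
    return _pearson(_ranks(progress), _ranks(frame_indices))


print(value_order_correlation([0.0, 0.2, 0.5, 1.0]))  # monotonically increasing
print(value_order_correlation([0.9, 0.5, 0.1]))       # monotonically decreasing
```

A strictly increasing curve yields a VOC of 1 and a strictly decreasing curve yields -1, matching the interpretation given above.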
4. Benchmarking and Quantitative Evaluation
TOPReward is evaluated across two primary benchmarks: Open X-Embodiment (OXE) and Mani.
Robotics Platforms and Datasets:
- OXE: 39 academic manipulation datasets, 780 episodes.
- Mani: 130+ zero-shot tasks, 497 successful, 156 failure episodes, spanning Franka, SO-100/101, single-arm and bimanual YAM robots; each task annotated with subtask boundaries.
Vision-language model (VLM) Backbones:
- Open-source: Molmo2-8B, Qwen3-VL-8B
- Proprietary: Gemini-2.5-Pro
VOC Results:
| Method | Molmo2-8B | Qwen3-VL-8B | Gemini-2.5-Pro |
|---|---|---|---|
| GVL (0-shot) | −0.016 | 0.194 | 0.541 |
| TOPReward | 0.417 | 0.857 | 0.433 |
OXE mean VOC over 39 datasets.
| Dataset | GVL (Molmo2) | TOPR (Molmo2) | GVL (Qwen) | TOPR (Qwen) | GVL (Gemini) | TOPR (Gemini) |
|---|---|---|---|---|---|---|
| Franka | 0.000 | 0.662 | 0.242 | 0.942 | 0.695 | 0.448 |
| Bimanual YAM | 0.007 | 0.565 | 0.164 | 0.947 | 0.566 | 0.546 |
| Single-arm YAM | −0.017 | 0.642 | 0.544 | 0.945 | 0.752 | 0.488 |
| LeRobot | −0.001 | 0.595 | 0.332 | 0.954 | 0.620 | 0.578 |
Mani mean VOC over 113 tasks (497 episodes).
On Qwen3-VL-8B, TOPReward attains 0.942 to 0.954 VOC across all four platforms, well above GVL's 0.164 to 0.544.
Success Detection (ROC-AUC on Mani failures, 156 episodes):
| Method | Qwen3-VL-8B | Gemini-2.5-Pro |
|---|---|---|
| GVL (VOC) | 0.519 | 0.823 |
| TOPReward | 0.654 | 0.826 |
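For reference, ROC-AUC can be computed as the fraction of (success, failure) pairs in which the successful episode receives the higher score, with ties counted as half. The sketch below is a generic implementation; which per-episode score TOPReward feeds into this metric is not stated in this section, so the use of terminal log-probabilities and the sample values are assumptions.

```python
def roc_auc(scores, labels):
    """ROC-AUC via pairwise comparison of positive and negative scores.

    labels: 1 for a successful episode, 0 for a failed one.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one success and one failure")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical terminal log-probabilities for three successes and three failures.
scores = [-0.2, -0.5, -1.0, -0.8, -2.5, -3.0]
labels = [1, 1, 1, 0, 0, 0]
auc = roc_auc(scores, labels)
print(round(auc, 3))  # 0.889
```

An AUC of 0.5 corresponds to chance-level separation of successes from failures, which puts the 0.519 reported for GVL on Qwen3-VL-8B in context.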
Behavior Cloning with Advantage Weights (Single-arm SO-100 tasks):
| Task | Pretrained | BC | TOP-AWR |
|---|---|---|---|
| Place doll in box | 0 | 7 | 10 |
| Pick up cube | 4 | 7 | 10 |
| Put pen into cup | 1.67 | 5.67 | 6.33 |
| … (six tasks total) |
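TOP-AWR (using TOPReward's advantage-like weights as AWR weights) consistently outperforms standard behavior cloning on the listed tasks, sometimes by as much as three points (7 to 10 on "Place doll in box" and "Pick up cube").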
5. Comparative Analysis and Discussion
TOPReward demonstrates substantive performance improvements over previous generalized value learning (GVL) approaches:
- On open-source VLMs, GVL VOC collapses (OXE mean VOC of −0.016 on Molmo2-8B and 0.194 on Qwen3-VL-8B), while TOPReward achieves 0.417 and 0.857 on the same backbones.
- On proprietary Gemini, GVL's VOC (up to 0.752) exceeds TOPReward's (at most 0.578), which is attributed to enforced chat formatting that distorts the model's logit distribution. Eliminating this prompt wrapper restores TOPReward's advantage.
- TOPReward produces dense, monotonic progress signals that are better aligned with subtask boundaries, and these signals are directly useful for reward shaping, success detection, and curation in RL and imitation learning pipelines.
TOPReward's zero-shot operation requires no fine-tuning or application-specific data. However, performance is bounded by the VLM's semantic video comprehension capability; tasks demanding fine-grained spatial analysis (e.g., manipulation of minute components) are challenging. Because the progress scores are normalized per episode, normalized values are not globally comparable across episodes, but raw terminal log-probabilities remain interpretable.
6. Implementation and Practical Recommendations
Implementation of TOPReward involves selecting a VLM with direct access to token logits, such as Qwen3-VL, Molmo2, or an equivalent proprietary model. Recommended settings are $K$ = 16–32 uniformly spaced prefixes and a small $\epsilon$ for normalization. Prompt formatting is crucial: minimal, direct templates as described in Section 1 are preferred. Chat-based wrappers or conversational templates are discouraged because they degrade logit calibration.
Integration into RL or imitation learning workflows proceeds as follows:
- Reward shaping: Use the per-step dense rewards derived from the normalized progress curve as the reward signal for any RL algorithm.
- Imitation learning: Apply the advantage-like weights as sample weights in advantage-weighted regression (AWR).
- Dataset curation: Rank and select demonstration trajectories by terminal log-probability or by the average of the final log-probabilities.
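The dataset-curation step can be illustrated with a short sketch that ranks episodes by their raw (unnormalized) terminal log-probability, or by the mean of the last few values. The function name `rank_episodes`, the `tail` parameter, and the episode data are hypothetical.

```python
def rank_episodes(episode_log_probs, tail=1):
    """Sort episode names from most to least likely completed.

    episode_log_probs: dict mapping episode name -> list of raw log-probabilities
    tail: how many final prefixes to average (1 = terminal value only)
    """
    def score(log_probs):
        window = log_probs[-tail:]
        return sum(window) / len(window)

    return sorted(
        episode_log_probs,
        key=lambda name: score(episode_log_probs[name]),
        reverse=True,
    )


# Hypothetical raw log-probabilities for three demonstration episodes.
episodes = {
    "ep_a": [-5.0, -2.0, -0.3],
    "ep_b": [-5.5, -4.0, -2.8],
    "ep_c": [-4.0, -1.0, -0.1],
}
ranking = rank_episodes(episodes)
print(ranking)  # ['ep_c', 'ep_a', 'ep_b']
```

Raw log-probabilities are used here rather than normalized progress because min–max normalization is applied per episode, which makes normalized values unsuitable for comparing episodes with each other.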
TOPReward generalizes across over 130 tasks, four distinct robot platforms, and numerous public datasets without requiring per-task adaptation. It thus provides dense, smooth, and semantically aligned temporal rewards unattainable by prior prompt-based value estimation methods (Chen et al., 22 Feb 2026).