TOPReward: Temporal Progress for Robotics

Updated 3 July 2026

TOPReward is a probabilistically motivated temporal value function that computes normalized log-probabilities to generate dense progress signals.
It leverages pretrained video vision-language models to extract and normalize token prediction probabilities as effective dense rewards for reinforcement and imitation learning.
Benchmark results show TOPReward achieves high Value-Order Correlation (VOC) scores and improved success detection across diverse robotic platforms compared to traditional methods.

TOPReward is a probabilistically motivated temporal value function designed to extract fine-grained, zero-shot progress estimates for robotic task execution by probing the internal token prediction probabilities of pretrained video vision-language models (VLMs). Unlike previous approaches that prompt VLMs to output numeric progress values directly, a process prone to numeric misrepresentation, TOPReward computes the log-probability of task completion at each trajectory prefix, normalizes these values, and uses the resulting curves as dense reward or progress signals for downstream reinforcement or imitation learning (Chen et al., 22 Feb 2026).

1. Theoretical Foundations

TOPReward addresses the challenge of generalizing temporal value functions for open-world robotic manipulation, where reward models must provide dense, monotonic feedback across diverse tasks and hardware without requiring fine-tuning or domain-specific data. The method operates on an instruction-trajectory pair $(x, \tau_{1:T})$, where $x$ is a natural language instruction and $\tau_{1:T}$ is a sequence of $T$ video observations.

A prompt template $u$ is defined that poses task completion as a true/false question. At each prefix $\tau_{1:t_k}$, TOPReward forms the full context $c(\tau_{1:t_k}, u)$ and extracts the log-probability $r_{t_k} = \log p_\theta(a \mid c(\tau_{1:t_k}, u))$, where $a$ is the Boolean token "True". The progression of $r_{t_k}$ with $t_k$ typically reflects the VLM's accumulating "belief" that the task has been completed.
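As a rough illustration of the quantity $r_{t_k}$, the sketch below computes the log-probability of one token from a set of next-token logits via a numerically stable log-softmax. The tiny three-token vocabulary and its logit values are made up for the example; a real VLM would return logits over its full vocabulary, and the helper name `log_prob_of_token` is hypothetical.

```python
import math

def log_prob_of_token(logits, token):
    """Return log softmax(logits)[token] for a dict mapping token -> raw logit."""
    # Subtract the largest logit before exponentiating to avoid overflow.
    m = max(logits.values())
    log_z = m + math.log(sum(math.exp(v - m) for v in logits.values()))
    return logits[token] - log_z

# Toy next-token logits (made-up values, not taken from any model).
toy_logits = {"True": 2.0, "False": 1.0, "Maybe": -1.0}
r = log_prob_of_token(toy_logits, "True")
print(round(r, 3))
```

Because the result is a log-probability, it is never positive; values closer to zero indicate stronger model belief in the "True" token.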

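To ensure comparability within each trajectory (whose raw log-probabilities live on $(-\infty, 0]$), min–max normalization computes:

$$\tilde{r}_{t_k} = \frac{r_{t_k} - \min_{j} r_{t_j}}{\max_{j} r_{t_j} - \min_{j} r_{t_j} + \epsilon}$$

where the minimum and maximum run over the $K$ prefixes $j = 1, \dots, K$, which are uniformly selected between the initial and final frame, and $\epsilon$ is a small constant for numerical stability.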
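A minimal sketch of this per-trajectory min-max normalization follows. The function name, the default `eps=1e-8`, and the example log-probabilities are assumptions chosen for illustration; the source does not state a specific value for the stabilizing constant.

```python
def minmax_normalize(log_probs, eps=1e-8):
    """Rescale one trajectory's raw log-probabilities to roughly [0, 1]."""
    lo, hi = min(log_probs), max(log_probs)
    # eps keeps the division defined when all values are identical.
    return [(r - lo) / (hi - lo + eps) for r in log_probs]

# Made-up raw log-probabilities for four prefixes of one episode.
raw = [-4.0, -3.0, -1.0, -0.5]
progress = minmax_normalize(raw)
print(progress)

# A flat curve maps to all zeros rather than raising ZeroDivisionError.
print(minmax_normalize([-2.0, -2.0]))
```

Because the minimum and maximum are taken per episode, the output describes relative progress within that episode only.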

For applications requiring stepwise, advantage-like weights (e.g., behavior cloning with advantage-weighted regression), TOPReward defines:

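$$\Delta_{t_k} = \tilde{r}_{t_k} - \tilde{r}_{t_{k-1}}, \qquad w_{t_k} = \min\big(\exp(\alpha\, \Delta_{t_k}),\; w_{\max}\big)$$

Here $\tilde{r}_{t_k}$ is the normalized progress at prefix $t_k$, and the parameters $\alpha$ and $w_{\max}$ are scaling and capping constants, respectively.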
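The sketch below shows one way such stepwise weights could be computed from a normalized progress curve. It assumes the weights are exponentiated, scaled increments of consecutive progress values, capped at a maximum, which is the usual shape of AWR weights; the exact functional form, the names `alpha` and `w_max`, and their default values are assumptions rather than the published definition.

```python
import math

def advantage_like_weights(progress, alpha=1.0, w_max=5.0):
    """Map consecutive progress increments to capped, positive weights.

    ASSUMPTION: weight = min(exp(alpha * increment), w_max).
    Returns one weight per consecutive pair, so len(progress) - 1 values.
    """
    weights = []
    for prev, curr in zip(progress, progress[1:]):
        delta = curr - prev  # positive when the step made progress
        weights.append(min(math.exp(alpha * delta), w_max))
    return weights

# Made-up normalized progress with one regression (0.5 -> 0.4).
weights = advantage_like_weights([0.0, 0.5, 0.4, 1.0], alpha=2.0, w_max=3.0)
print(weights)
```

Under this assumed form, steps that increase progress receive weights above 1, regressions receive weights below 1, and the cap prevents a single large jump from dominating a weighted loss.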

2. Algorithmic Procedure

The procedural logic of TOPReward requires access to a VLM with token logit output interfaces. The outline is as follows:

Input: Instruction $x$, video $\tau_{1:T}$, VLM $p_\theta$, number of prefixes $K$ (e.g., 16), completion token $a$, small $\epsilon$, (optional) scaling $\alpha$, and maximum $w_{\max}$.
Prefix Selection: Uniformly choose $K$ frame indices $t_1, \dots, t_K$ between 1 and $T$.
Log-probability Evaluation: For each $t_k$, form $c(\tau_{1:t_k}, u)$ and compute $r_{t_k} = \log p_\theta(a \mid c(\tau_{1:t_k}, u))$.
Normalization: Compute $r_{\min} = \min_k r_{t_k}$ and $r_{\max} = \max_k r_{t_k}$, then calculate the normalized progress $\tilde{r}_{t_k}$ for all $k$ using the min–max formula of Section 1.
Dense Reward (if needed): Compute advantage-like increments $w_{t_k}$ from the $\tilde{r}_{t_k}$ sequence.
Output: Normalized progress curve $\tilde{r}_{t_1}, \dots, \tilde{r}_{t_K}$ and, optionally, dense rewards $w_{t_k}$.

This procedure produces per-episode progress curves suitable for dense reward shaping, success detection, and behavior cloning.
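The steps above can be sketched end to end as follows. The VLM call is abstracted behind a hypothetical callable `score_prefix(t)` that is assumed to return the log-probability of the completion token given the first `t` frames; `fake_score_prefix` is a made-up stand-in so the example runs without a model, and `eps=1e-8` is an assumed value.

```python
def uniform_prefix_indices(num_frames, k):
    """Pick k prefix end-frames (1-indexed), evenly spaced from 1 to num_frames."""
    if k < 2 or num_frames < k:
        raise ValueError("need 2 <= k <= num_frames")
    # Integer arithmetic: first index is 1, last index is num_frames.
    return [1 + (i * (num_frames - 1)) // (k - 1) for i in range(k)]

def topreward_progress(num_frames, score_prefix, k=16, eps=1e-8):
    """Return (prefix indices, raw log-probs, normalized progress)."""
    idx = uniform_prefix_indices(num_frames, k)
    raw = [score_prefix(t) for t in idx]  # one model query per prefix
    lo, hi = min(raw), max(raw)
    progress = [(r - lo) / (hi - lo + eps) for r in raw]
    return idx, raw, progress

def fake_score_prefix(t, num_frames=64):
    """Stand-in for a VLM query (hypothetical): rises linearly toward 0."""
    return -5.0 * (1.0 - t / num_frames)

idx, raw, progress = topreward_progress(64, fake_score_prefix, k=16)
print(idx)
print([round(p, 2) for p in progress])
```

In practice `score_prefix` would build the context from the video prefix and the prompt template and read the completion-token log-probability from the model's output logits; each prefix costs one forward pass, so the number of prefixes trades resolution against compute.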

3. Value-Order Correlation (VOC) Metric

TOPReward progress is evaluated by the Value-Order Correlation (VOC), which measures the rank-correlation between predicted normalized scores and true temporal ordering. For a sequence $(\tilde{r}_{t_1}, \dots, \tilde{r}_{t_K})$ (normalized progress) and corresponding frame indices $(t_1, \dots, t_K)$:

$$\mathrm{VOC} = \operatorname{rank\text{-}corr}\big((\tilde{r}_{t_1}, \dots, \tilde{r}_{t_K}),\; (t_1, \dots, t_K)\big)$$

A VOC of $+1$ signifies perfect monotonic increase; $0$ indicates no alignment; $-1$ indicates anti-correlation.
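A small sketch of the metric follows, assuming the rank correlation is Spearman's (Pearson correlation of ranks, with tied values given their average rank). The helper names are hypothetical, and returning 0.0 for a constant sequence is a choice made here because the correlation is undefined in that case.

```python
import math

def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for pos in range(i, j + 1):
            ranks[order[pos]] = avg
        i = j + 1
    return ranks

def pearson(a, b):
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0  # undefined for constant input; 0.0 chosen by convention here
    return cov / math.sqrt(var_a * var_b)

def value_order_correlation(progress, frame_indices):
    """Spearman rank correlation between progress values and frame order."""
    return pearson(average_ranks(progress), average_ranks(frame_indices))

frames = [1, 22, 43, 64]
voc_up = value_order_correlation([0.0, 0.2, 0.7, 1.0], frames)
voc_down = value_order_correlation([1.0, 0.7, 0.2, 0.0], frames)
print(voc_up, voc_down)
```

A monotonically increasing curve scores +1 and a monotonically decreasing one scores -1, matching the interpretation given above.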

4. Benchmarking and Quantitative Evaluation

TOPReward is evaluated across two primary benchmarks: Open X-Embodiment (OXE) and Mani.

Robotics Platforms and Datasets:

OXE: 39 academic manipulation datasets, 780 episodes.
Mani: 130+ zero-shot tasks, 497 successful, 156 failure episodes, spanning Franka, SO-100/101, single-arm and bimanual YAM robots; each task annotated with subtask boundaries.

Vision-Language Model (VLM) Backbones:

Open-source: Molmo2-8B, Qwen3-VL-8B
Proprietary: Gemini-2.5-Pro

VOC Results:

Method	Molmo2-8B	Qwen3-VL-8B	Gemini-2.5-Pro
GVL (0-shot)	−0.016	0.194	0.541
TOPReward	0.417	0.857	0.433

OXE mean VOC over 39 datasets.

Dataset	GVL (Molmo2)	TOPR (Molmo2)	GVL (Qwen)	TOPR (Qwen)	GVL (Gemini)	TOPR (Gemini)
Franka	0.000	0.662	0.242	0.942	0.695	0.448
Bimanual YAM	0.007	0.565	0.164	0.947	0.566	0.546
Single-arm YAM	−0.017	0.642	0.544	0.945	0.752	0.488
LeRobot	−0.001	0.595	0.332	0.954	0.620	0.578

Mani mean VOC over 113 tasks (497 episodes).

On Qwen3-VL-8B, TOPReward attains $0.942$--$0.954$ VOC across all four platforms, far exceeding GVL's $0.164$--$0.544$.

Success Detection (ROC-AUC on Mani failures, 156 episodes):

Method	Qwen3-VL-8B	Gemini-2.5-Pro
GVL (VOC)	0.519	0.823
TOPReward	0.654	0.826
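For readers unfamiliar with the metric in this table, ROC-AUC can be computed directly from per-episode scores and success labels as the fraction of (success, failure) pairs in which the successful episode scores higher, counting ties as half. The sketch below assumes the per-episode score is a terminal log-probability; the scores and labels are made up and are not benchmark data.

```python
def roc_auc(scores, labels):
    """Pairwise ROC-AUC: P(score of a positive > score of a negative)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))

# Made-up terminal log-probabilities; label 1 = success, 0 = failure.
scores = [-0.1, -0.6, -2.0, -0.5]
labels = [1, 1, 0, 0]
auc = roc_auc(scores, labels)
print(auc)
```

A value of 0.5 corresponds to chance-level separation and 1.0 to perfect separation of successes from failures.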

Behavior Cloning with Advantage Weights (Single-arm SO-100 tasks):

Task	Pretrained	BC	TOP-AWR
Place doll in box	0	7	10
Pick up cube	4	7	10
Put pen into cup	1.67	5.67	6.33
… (six tasks total)

TOP-AWR (using the advantage-like weights $w_{t_k}$ as AWR weights) consistently outperforms standard behavior cloning, sometimes improving by three subtasks.

5. Comparative Analysis and Discussion

TOPReward demonstrates substantive performance improvements over the earlier Generative Value Learning (GVL) approach:

On open-source VLMs, GVL's VOC collapses (approximately $0$ on Molmo2-8B), while TOPReward achieves VOC in the $0.417$--$0.954$ range.
On proprietary Gemini, GVL's VOC (up to 0.752) exceeds TOPReward's (up to 0.578), an effect attributed to enforced chat formatting that distorts the model's logit distribution. Eliminating this prompt wrapper restores TOPReward's advantage.
TOPReward produces dense, monotonic progress signals that are better aligned with subtask boundaries, and these signals are directly useful for reward shaping, success detection, and curation in RL and imitation learning pipelines.

TOPReward's zero-shot operation requires no fine-tuning or application-specific data. However, performance is bounded by the VLM's semantic video comprehension capability; tasks demanding fine-grained spatial analysis (e.g., manipulation of minute components) are challenging. The per-episode normalization of $r_{t_k}$ means normalized scores are not comparable across episodes, but the raw terminal log-probabilities remain interpretable.

6. Implementation and Practical Recommendations

Implementing TOPReward requires a VLM that exposes token logits, such as Qwen3-VL, Molmo2, or an equivalent proprietary model. Recommended settings are $K = 16$–$32$ uniformly spaced prefixes and a small constant $\epsilon$ for normalization. Prompt formatting is crucial: minimal, direct templates such as $u$ in Section 1 are preferred. Chat-based wrappers and conversational templates are discouraged because they degrade logit calibration.

Integration into RL or imitation learning workflows proceeds as follows:

Reward shaping: Use the normalized progress $\tilde{r}_{t_k}$, or its stepwise increments, as dense rewards for any RL algorithm.
Imitation learning: Apply the advantage-like weights $w_{t_k}$ as weights in advantage-weighted regression (AWR).
Dataset curation: Rank and select demonstration trajectories by the terminal log-probability $r_{t_K}$ or by the average of the final log-probabilities.
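As an illustration of the curation use case, the sketch below ranks episodes by their terminal raw log-probability and keeps the top fraction. The function name, the `keep_top` parameter, and the episode data are hypothetical; raw rather than normalized values are used because per-episode normalization makes normalized scores incomparable across episodes.

```python
def curate_by_terminal_logprob(episodes, keep_top=0.5):
    """Return episode ids with the highest terminal raw log-probability.

    episodes: dict mapping episode id -> list of raw prefix log-probs,
    ordered by time so the last entry belongs to the full trajectory.
    """
    ranked = sorted(episodes, key=lambda eid: episodes[eid][-1], reverse=True)
    n_keep = max(1, int(len(ranked) * keep_top))  # always keep at least one
    return ranked[:n_keep]

# Made-up raw log-probabilities for four demonstrations.
episodes = {
    "ep_a": [-3.0, -1.0, -0.2],
    "ep_b": [-3.0, -2.5, -2.8],
    "ep_c": [-4.0, -2.0, -0.6],
    "ep_d": [-2.0, -2.0, -3.5],
}
kept = curate_by_terminal_logprob(episodes, keep_top=0.5)
print(kept)
```

Averaging the last few log-probabilities instead of taking only the final one, as the list above also suggests, would make the ranking less sensitive to a single noisy prefix.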

TOPReward generalizes across over 130 tasks, four distinct robot platforms, and numerous public datasets without requiring per-task adaptation. It thus provides dense, smooth, and semantically aligned temporal rewards unattainable by prior prompt-based value estimation methods (Chen et al., 22 Feb 2026).


References

Chen et al. (2026). TOPReward: Token Probabilities as Hidden Zero-Shot Rewards for Robotics.
