
TOPReward: Temporal Progress for Robotics

Updated 3 July 2026
  • TOPReward is a probabilistically motivated temporal value function that computes normalized log-probabilities to generate dense progress signals.
  • It leverages pretrained video vision-language models to extract and normalize token prediction probabilities as effective dense rewards for reinforcement and imitation learning.
  • Benchmark results show TOPReward achieves high Value-Order Correlation (VOC) scores and improved success detection across diverse robotic platforms compared to traditional methods.

TOPReward is a probabilistically motivated temporal value function designed to extract fine-grained, zero-shot progress estimates for robotic task execution by probing the internal token prediction probabilities of pretrained video vision-language models (VLMs). Previous approaches prompt VLMs to output numeric progress values directly, a process subject to numeric misrepresentation. TOPReward instead computes the log-probability of completion at each trajectory prefix, normalizes these values, and uses the resulting curves as dense reward or progress signals for downstream reinforcement or imitation learning (Chen et al., 22 Feb 2026).

1. Theoretical Foundations

TOPReward addresses the challenge of generalizing temporal value functions for open-world robotic manipulation, where reward models must provide dense, monotonic feedback across diverse tasks and hardware without requiring fine-tuning or domain-specific data. The method operates on the robot-instruction pair $(x, \tau_{1:T})$, where $x$ is a natural language instruction and $\tau_{1:T}$ is a sequence of video observations.

A prompt template $u$ is defined that poses task completion as a true/false question. At each prefix $\tau_{1:t_k}$, TOPReward forms the full context $c(\tau_{1:t_k}, u)$ and extracts the log-probability $r_{t_k} = \log p_\theta(a \mid c(\tau_{1:t_k}, u))$, where $a$ is the Boolean token "True". The progression of $r_{t_k}$ with $t_k$ typically reflects the accumulating "belief" of the VLM that the task is completed.
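The log-probability of a single token can be read off a model's next-token logits with a log-softmax. The sketch below shows only that step, on a toy logit vector; the vocabulary size and `TRUE_TOKEN_ID` are assumptions made for illustration, and obtaining real logits from a VLM is outside its scope.

```python
import math

def log_prob_of_token(logits, token_id):
    """Return log p(token_id) under softmax(logits), computed stably."""
    m = max(logits)
    # log-sum-exp with the maximum subtracted to avoid overflow
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    return logits[token_id] - log_z

# Toy next-token logits over a 4-token vocabulary (illustrative values only).
toy_logits = [1.0, 0.5, 3.0, -2.0]
TRUE_TOKEN_ID = 2  # hypothetical index standing in for the "True" token

r = log_prob_of_token(toy_logits, TRUE_TOKEN_ID)
print(r)  # a non-positive number; closer to 0 means higher belief in "True"
```

Working in log space keeps very small probabilities distinguishable, which matters early in a trajectory when the model assigns little mass to "True".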

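To ensure comparability within each trajectory (whose raw log-probabilities live on $(-\infty, 0]$), min–max normalization computes:

$$\tilde{r}_{t_k} = \frac{r_{t_k} - \min_{j} r_{t_j}}{\max_{j} r_{t_j} - \min_{j} r_{t_j} + \epsilon}$$

where the minimum and maximum range over the $K$ prefixes, which are uniformly selected between the initial and final frame, and $\epsilon$ is a small constant for numerical stability.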
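A minimal sketch of the per-trajectory min–max normalization, assuming the small stability constant sits in the denominator. The function name and the default `eps=1e-8` are illustrative choices, not values taken from the source.

```python
def min_max_normalize(log_probs, eps=1e-8):
    """Map raw per-prefix log-probabilities to roughly [0, 1] within one episode."""
    lo, hi = min(log_probs), max(log_probs)
    # eps keeps the division defined when all log-probabilities are equal
    return [(r - lo) / (hi - lo + eps) for r in log_probs]

# Raw log-probabilities at three prefixes of one trajectory (made-up numbers).
curve = min_max_normalize([-4.0, -2.0, -1.0])
print(curve)
```

Because the minimum and maximum are taken per episode, a flat trajectory maps to all zeros rather than to an undefined value.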

For applications requiring stepwise, advantage-like weights (e.g., behavior cloning with advantage-weighted regression), TOPReward defines:

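$$\Delta_{t_k} = \min\big(\alpha\,(\tilde{r}_{t_k} - \tilde{r}_{t_{k-1}}),\ \delta_{\max}\big)$$

where $\tilde{r}_{t_k}$ is the normalized progress at prefix $t_k$. Parameters $\alpha$ and $\delta_{\max}$ are scaling and capping constants, respectively.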
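One plausible reading of the scaling and capping constants is a scaled first difference of the normalized progress, capped from above. The sketch below implements that reading only; the parameter names `alpha` and `delta_max`, their defaults, and the exact functional form are assumptions rather than the confirmed definition.

```python
def capped_increments(progress, alpha=1.0, delta_max=1.0):
    """Scaled step-to-step progress differences, capped at delta_max.

    Assumed form: min(alpha * (p_k - p_{k-1}), delta_max) for k = 2..K.
    """
    return [
        min(alpha * (cur - prev), delta_max)
        for prev, cur in zip(progress, progress[1:])
    ]

# Normalized progress at four prefixes (made-up numbers).
weights = capped_increments([0.0, 0.2, 0.9, 1.0], alpha=2.0, delta_max=1.0)
print(weights)  # the large middle jump is capped at delta_max
```

The cap prevents a single abrupt jump in the progress curve from dominating a weighted training objective.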

2. Algorithmic Procedure

TOPReward requires a VLM that exposes token logits. The procedure is as follows:

  1. Input: Instruction $x$, video $\tau_{1:T}$, VLM $p_\theta$, number of prefixes $K$ (e.g., 16), completion token $a$, small $\epsilon$, (optional) scaling $\alpha$, and maximum $\delta_{\max}$.
  2. Prefix Selection: Uniformly choose $K$ frame indices $t_1 < \dots < t_K$ between 1 and $T$.
  3. Log-probability Evaluation: For each $t_k$, form $c(\tau_{1:t_k}, u)$ and compute $r_{t_k} = \log p_\theta(a \mid c(\tau_{1:t_k}, u))$.
  4. Normalization: Compute $r_{\min} = \min_k r_{t_k}$ and $r_{\max} = \max_k r_{t_k}$, then calculate the normalized progress $\tilde{r}_{t_k}$ for all $k$ using the min–max normalization above.
  5. Dense Reward (if needed): Compute advantage-like increments $\Delta_{t_k}$ from the $\tilde{r}_{t_k}$ sequence.
  6. Output: Normalized progress curve $\{\tilde{r}_{t_k}\}_{k=1}^{K}$ and, optionally, dense rewards $\{\Delta_{t_k}\}$.

This procedure produces per-episode progress curves suitable for dense reward shaping, success detection, and behavior cloning.
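The steps above can be sketched end to end with the VLM call abstracted away. In this sketch `score_prefix` is a hypothetical callable that returns the log-probability of the completion token for the prefix ending at a given frame, and `toy_scorer` is a made-up stand-in so the example runs without a model.

```python
def select_prefix_indices(num_frames, k):
    """Return k roughly uniformly spaced 1-based frame indices in [1, num_frames]."""
    if k == 1:
        return [num_frames]
    # Indices may repeat if k exceeds num_frames; callers can deduplicate.
    return [round(1 + i * (num_frames - 1) / (k - 1)) for i in range(k)]

def progress_curve(num_frames, score_prefix, k=16, eps=1e-8):
    """Query each prefix, then min-max normalize within the episode."""
    indices = select_prefix_indices(num_frames, k)
    raw = [score_prefix(t) for t in indices]          # one model query per prefix
    lo, hi = min(raw), max(raw)
    normalized = [(r - lo) / (hi - lo + eps) for r in raw]
    return indices, raw, normalized

def toy_scorer(t, num_frames=100):
    """Stand-in for a VLM query: log-probability rises linearly toward 0."""
    return -5.0 * (1.0 - t / num_frames)

indices, raw, normalized = progress_curve(100, toy_scorer, k=4)
print(indices)      # prefix end frames
print(normalized)   # per-episode progress values
```

The cost is one forward pass per prefix, so the number of prefixes trades temporal resolution against inference time.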

3. Value-Order Correlation (VOC) Metric

TOPReward progress is evaluated by the Value-Order Correlation (VOC), which measures the rank correlation between predicted normalized scores and true temporal ordering. For a sequence $(\tilde{r}_{t_1}, \dots, \tilde{r}_{t_K})$ (normalized progress) and corresponding frame indices $(t_1, \dots, t_K)$:

$$\mathrm{VOC} = \operatorname{rank\text{-}corr}\big((\tilde{r}_{t_1}, \dots, \tilde{r}_{t_K}),\ (t_1, \dots, t_K)\big)$$

A VOC of $+1$ signifies perfect monotonic increase; $0$ indicates no alignment; $-1$ indicates anti-correlation.
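A rank correlation of this kind can be computed as the Pearson correlation of ranks, which is Spearman's coefficient. The text only says "rank correlation", so the choice of Spearman, the average-rank handling of ties, and the convention of returning 0 for a constant sequence are assumptions of this sketch.

```python
import math

def average_ranks(values):
    """1-based ranks, with tied values sharing the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for pos in range(i, j + 1):
            ranks[order[pos]] = avg
        i = j + 1
    return ranks

def pearson(a, b):
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    if va == 0 or vb == 0:
        return 0.0  # assumed convention for a constant sequence
    return cov / math.sqrt(va * vb)

def value_order_correlation(progress, frame_indices):
    """Spearman rank correlation between progress values and frame order."""
    return pearson(average_ranks(progress), average_ranks(frame_indices))

voc_up = value_order_correlation([0.0, 0.3, 0.7, 1.0], [1, 2, 3, 4])
voc_down = value_order_correlation([1.0, 0.7, 0.3, 0.0], [1, 2, 3, 4])
print(voc_up, voc_down)
```

Because only ranks enter the computation, VOC is unchanged by any monotonic rescaling of the progress values, including the min–max normalization.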

4. Benchmarking and Quantitative Evaluation

TOPReward is evaluated across two primary benchmarks: Open X-Embodiment (OXE) and Mani.

Robotics Platforms and Datasets:

  • OXE: 39 academic manipulation datasets, 780 episodes.
  • Mani: 130+ zero-shot tasks, 497 successful, 156 failure episodes, spanning Franka, SO-100/101, single-arm and bimanual YAM robots; each task annotated with subtask boundaries.

Vision-Language Model (VLM) Backbones: Molmo2-8B, Qwen3-VL-8B, and Gemini-2.5-Pro.

VOC Results:

| Method | Molmo2-8B | Qwen3-VL-8B | Gemini-2.5-Pro |
|---|---|---|---|
| GVL (0-shot) | −0.016 | 0.194 | 0.541 |
| TOPReward | 0.417 | 0.857 | 0.433 |

OXE mean VOC over 39 datasets.

| Dataset | GVL (Molmo2) | TOPR (Molmo2) | GVL (Qwen) | TOPR (Qwen) | GVL (Gemini) | TOPR (Gemini) |
|---|---|---|---|---|---|---|
| Franka | 0.000 | 0.662 | 0.242 | 0.942 | 0.695 | 0.448 |
| Bimanual YAM | 0.007 | 0.565 | 0.164 | 0.947 | 0.566 | 0.546 |
| Single-arm YAM | −0.017 | 0.642 | 0.544 | 0.945 | 0.752 | 0.488 |
| LeRobot | −0.001 | 0.595 | 0.332 | 0.954 | 0.620 | 0.578 |

Mani mean VOC over 113 tasks (497 episodes).

On Qwen3-VL-8B, TOPReward attains $0.942$--$0.954$ VOC across all four platforms, well above GVL's $0.164$--$0.544$.

Success Detection (ROC-AUC on Mani failures, 156 episodes):

| Method | Qwen3-VL-8B | Gemini-2.5-Pro |
|---|---|---|
| GVL (VOC) | 0.519 | 0.823 |
| TOPReward | 0.654 | 0.826 |
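ROC-AUC for success detection can be computed directly from per-episode scores and success labels as the fraction of (success, failure) pairs that the score orders correctly. The sketch below shows that pairwise form. Which per-episode score TOPReward uses for this benchmark is not stated here, so the scores in the example are made-up placeholders.

```python
def roc_auc(scores, labels):
    """Pairwise ROC-AUC: P(score of a success > score of a failure), ties count 0.5."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one success and one failure episode")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# Hypothetical per-episode scores; label 1 = success, 0 = failure.
auc = roc_auc([-0.1, -0.3, -2.0, -0.2], [1, 1, 0, 0])
print(auc)  # 0.75
```

An AUC of 0.5 corresponds to a score that cannot separate successes from failures, which puts the 0.519 entry in the table close to chance.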

Behavior Cloning with Advantage Weights (Single-arm SO-100 tasks):

| Task | Pretrained | BC | TOP-AWR |
|---|---|---|---|
| Place doll in box | 0 | 7 | 10 |
| Pick up cube | 4 | 7 | 10 |
| Put pen into cup | 1.67 | 5.67 | 6.33 |
… (six tasks total)

TOP-AWR (using TOPReward's advantage-like weights in AWR) consistently outperforms standard behavior cloning on the tasks shown, sometimes improving by three subtasks.

5. Comparative Analysis and Discussion

TOPReward demonstrates substantive performance improvements over the earlier Generative Value Learning (GVL) approach:

  • On open-source VLMs, GVL's VOC collapses (mean OXE VOC of −0.016 on Molmo2-8B and 0.194 on Qwen3-VL-8B), while TOPReward achieves 0.417 and 0.857 on the same backbones.
  • On proprietary Gemini, GVL's VOC (up to 0.752) exceeds TOPReward's (at most 0.578 on Mani). This is attributed to enforced chat formatting that distorts the model's logit distribution; eliminating this prompt wrapper restores TOPReward's advantage.
  • TOPReward produces dense, monotonic progress signals that are better aligned with subtask boundaries, and these signals are directly useful for reward shaping, success detection, and curation in RL and imitation learning pipelines.

TOPReward's zero-shot operation requires no fine-tuning or application-specific data. However, performance is bounded by the VLM's semantic video comprehension capability; tasks demanding fine-grained spatial analysis (e.g., manipulation of minute components) are challenging. Because the log-probabilities $r_{t_k}$ are normalized per episode, normalized scores are not comparable across episodes, but terminal raw log-probabilities remain interpretable.

6. Implementation and Practical Recommendations

Implementation of TOPReward involves selecting a VLM with direct access to token logits, such as Qwen3-VL, Molmo2, or an equivalent proprietary model. Recommended settings are $K = 16$ to $32$ uniformly spaced prefixes and a small $\epsilon$ for normalization. Prompt formatting is crucial: minimal, direct templates are preferred. Chat-based wrappers or conversational templates are discouraged because they degrade logit calibration.

Integration into RL or imitation learning workflows proceeds as follows:

  • Reward shaping: Use the advantage-like increments $\Delta_{t_k}$ as dense rewards for any RL algorithm.
  • Imitation learning: Apply $\Delta_{t_k}$ as weights in advantage-weighted regression (AWR).
  • Dataset curation: Rank and select demonstration trajectories by the terminal log-probability $r_{t_K}$ or the average of the final log-probabilities.
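The dataset-curation use can be sketched as a sort over episodes by their terminal raw log-probability. The episode names, the numbers, and the `keep_top` helper are hypothetical and serve only to illustrate the ranking step.

```python
def rank_by_terminal_log_prob(episodes):
    """Order episode names from most to least confident completion.

    episodes: dict mapping episode name -> list of raw per-prefix log-probabilities.
    """
    return sorted(episodes, key=lambda name: episodes[name][-1], reverse=True)

def keep_top(episodes, n):
    """Select the n episodes with the highest terminal log-probability."""
    return rank_by_terminal_log_prob(episodes)[:n]

# Made-up raw log-probabilities for three demonstrations.
demos = {
    "demo_a": [-4.0, -2.5, -0.2],
    "demo_b": [-3.0, -3.1, -2.8],
    "demo_c": [-5.0, -1.0, -0.6],
}
print(rank_by_terminal_log_prob(demos))  # ['demo_a', 'demo_c', 'demo_b']
print(keep_top(demos, 2))                # ['demo_a', 'demo_c']
```

Ranking uses raw rather than normalized values because per-episode normalization removes the scale needed to compare one episode with another.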

TOPReward generalizes across over 130 tasks, four distinct robot platforms, and numerous public datasets without requiring per-task adaptation. It thus provides dense, smooth, and semantically aligned temporal rewards unattainable by prior prompt-based value estimation methods (Chen et al., 22 Feb 2026).

