Dense Temporal Value Estimation
- Dense temporal value estimation is the process of predicting task-aligned progress signals at a fine temporal resolution from visual observations.
- One approach uses per-frame normalized progress via chunked prediction, while another estimates frame-to-frame temporal distances as dense rewards.
- These methods capture subtle dynamics like hesitation, regression, and completion, thereby enhancing policy learning in robotic control tasks.
Dense temporal value estimation is the estimation of task-aligned scalar signals at fine temporal resolution, typically from visual observations and without direct access to environment rewards or action labels. In recent robotic learning work, two closely related formulations have been emphasized. One defines value as per-frame normalized task progress, yielding a dense value curve over an entire trajectory; the other predicts frame-wise temporal distance between pairs of observations and uses that quantity as a dense transition-level progress signal. In both cases, the central objective is to replace temporally sparse supervision with estimates that track incremental advancement, hesitation, regression, and completion at the granularity of frames or short windows rather than whole trajectories (Wang et al., 23 Jun 2026, Liu et al., 30 Sep 2025).
1. Definition and formal scope
In the formulation introduced by "World Value Models for Robotic Manipulation" (Wang et al., 23 Jun 2026), value estimation is framed as per-frame task progress prediction. For a demonstration trajectory of length $T$, the ground-truth scalar value at frame $t$ is defined as normalized task progress,
$$v_t = \frac{t}{T}.$$
Under a sparse negative reward setting until completion (a reward of $-1$ per step, undiscounted), this quantity is equivalent, up to an affine transformation, to the usual value function
$$V(s_t) = -(T - t), \qquad v_t = 1 + \frac{V(s_t)}{T}.$$
The paper therefore treats dense progress supervision as a practical proxy for negative expected remaining time.
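The affine relationship between normalized progress and negative remaining time can be checked numerically. The following is a minimal sketch, assuming frames are indexed $t = 1, \dots, T$, a reward of $-1$ per step, and no discounting; the function names are illustrative and do not come from the paper.

```python
def normalized_progress(num_frames):
    """Per-frame progress t / T for t = 1..T (assumed indexing)."""
    return [t / num_frames for t in range(1, num_frames + 1)]


def negative_remaining_time(num_frames):
    """Undiscounted value under a reward of -1 per step: V(s_t) = -(T - t)."""
    return [-(num_frames - t) for t in range(1, num_frames + 1)]


T = 5
progress = normalized_progress(T)
values = negative_remaining_time(T)

# The affine map v = 1 + V / T recovers progress from the value function.
recovered = [1 + v / T for v in values]
```

Because the map is affine with positive slope, ranking frames by progress and ranking them by value give the same order.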
A distinct but related formulation appears in "TimeRewarder: Learning Dense Reward from Passive Videos via Frame-wise Temporal Distance" (Liu et al., 30 Sep 2025). There, the central target is the normalized temporal distance between two frames $o_u$ and $o_v$ from an expert trajectory of length $T$,
$$d_{uv} = \frac{v - u}{T - 1} \in [-1, 1].$$
Positive values denote forward progression, negative values denote backward progression, and the quantity is used directly as a transition-level proxy reward. The paper does not explicitly output a state-value function $V(s)$; instead, it predicts a local progress increment that can be interpreted as a temporal difference of a latent potential function.
Taken together, these formulations suggest a broad definition of dense temporal value estimation: the learning of temporally fine-grained progress signals that are aligned with task completion and usable either as state values, transition rewards, or both. A crucial feature is that the signal is dense in time. It is not restricted to final success, coarse pairwise ranking, or a single scalar for an entire episode.
A useful distinction is between per-frame dense value and per-transition dense reward. The former assigns a scalar to each observation, producing a trajectory-wise curve $\{v_t\}_{t=1}^{T}$. The latter scores the change between adjacent observations, producing a reward-like sequence $\{r_t\}_{t=1}^{T-1}$. The two are conceptually adjacent because both measure temporal progress, but they place supervision at different points in the trajectory.
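The relationship between the two views can be illustrated with a small sketch: differencing a value curve gives a transition-level signal, and accumulating that signal recovers the curve up to its starting value. This is an illustration of the general idea only, not either paper's procedure, and the example curve is made up.

```python
def transition_rewards(values):
    """Difference adjacent per-frame values to get per-transition increments."""
    return [nxt - cur for cur, nxt in zip(values, values[1:])]


def accumulate(rewards, initial=0.0):
    """Cumulative sum of increments, starting from an initial value."""
    curve = [initial]
    for reward in rewards:
        curve.append(curve[-1] + reward)
    return curve


# A made-up curve with a plateau (hesitation) and a dip (regression).
value_curve = [0.0, 0.25, 0.25, 0.1, 0.5]
rewards = transition_rewards(value_curve)
rebuilt = accumulate(rewards, initial=value_curve[0])
```

The plateau shows up as a zero increment and the dip as a negative one, which is the sense in which both representations carry the same temporal progress information.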
2. Mathematical formulations of progress and value
The per-frame formulation in WVM is explicitly chunk-wise. Rather than predicting a single scalar at one time index, the model predicts a value chunk of length $K$,
$$\mathbf{v}_{t:t+K-1} = (v_t, v_{t+1}, \dots, v_{t+K-1}).$$
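This makes dense temporal value estimation a sequence prediction problem over short windows. The intended representational consequences are stated directly: the model can encode smooth increases as the task progresses, plateaus corresponding to hesitation, decreases corresponding to regress or retry, and more general non-monotonic local profiles. Overlapping windows are then aggregated into per-frame estimates by standard overlapped-chunk decoding,
$$\hat{v}_t = \frac{1}{|\mathcal{C}_t|} \sum_{c \in \mathcal{C}_t} \hat{v}_t^{(c)},$$
where $\mathcal{C}_t$ is the set of predicted chunks covering frame $t$ and $\hat{v}_t^{(c)}$ is the value that chunk $c$ assigns to that frame.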
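A minimal sketch of overlapped-chunk decoding follows. It assumes that overlapping predictions are combined by a plain per-frame average, which is one common choice and not necessarily the exact rule used in the paper; the function name and the `(start, values)` chunk representation are hypothetical.

```python
def decode_overlapping_chunks(chunks, num_frames):
    """Average all chunk predictions that cover each frame.

    chunks: list of (start_index, values) pairs, where values[k] is the
            prediction for frame start_index + k.
    Returns one estimate per frame, or None for frames no chunk covers.
    """
    sums = [0.0] * num_frames
    counts = [0] * num_frames
    for start, values in chunks:
        for offset, value in enumerate(values):
            idx = start + offset
            if 0 <= idx < num_frames:
                sums[idx] += value
                counts[idx] += 1
    return [s / c if c else None for s, c in zip(sums, counts)]


# Three chunks of length 3 with stride 1 over a 5-frame clip.
chunks = [
    (0, [0.0, 0.2, 0.4]),
    (1, [0.3, 0.5, 0.7]),
    (2, [0.4, 0.6, 0.8]),
]
per_frame = decode_overlapping_chunks(chunks, num_frames=5)
```

Frames near the clip boundaries are covered by fewer chunks, so their estimates average over fewer predictions than interior frames.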
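The TimeRewarder formulation is pairwise rather than chunk-wise. It learns the scalar temporal distance $d_{uv}$ between two frames $o_u$ and $o_v$ and decodes that quantity as a step-wise reward during RL,
$$r_t = \hat{d}(o_t, o_{t+1}).$$
The paper connects this prediction to potential-based shaping through
$$F(s_t, s_{t+1}) = \gamma\,\Phi(s_{t+1}) - \Phi(s_t)$$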
and introduces a theoretical potential function on expert trajectories,
5
with Bellman relation
6
With discount factor $\gamma = 1$, temporal distance along expert trajectories becomes approximately the per-step temporal difference of a latent progress value.
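The temporal-difference reading can be made concrete with a short worked derivation. This is a sketch under stated assumptions, not the paper's own derivation: it assumes an expert trajectory of length $T$ with frames indexed $0, \dots, T-1$, the normalized distance $d_{uv} = (v - u)/(T - 1)$, and a hypothetical potential $\Phi(o_t) = t/(T-1)$ that simply reads off normalized time.

```latex
\begin{align*}
F(o_t, o_{t+1}) &= \gamma\,\Phi(o_{t+1}) - \Phi(o_t) \\
                &= \frac{t+1}{T-1} - \frac{t}{T-1}
                 = \frac{1}{T-1}
                 = d_{t,t+1} \qquad (\gamma = 1), \\
\sum_{t=u}^{v-1} F(o_t, o_{t+1}) &= \Phi(o_v) - \Phi(o_u)
                 = \frac{v-u}{T-1}
                 = d_{uv}.
\end{align*}
```

Under these assumptions the per-step shaping terms telescope, so summing adjacent-frame distances between frames $u$ and $v$ recovers the pairwise distance $d_{uv}$.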
The two formulations differ in what they assume about supervision. WVM directly supervises normalized progress $v_t$ and can handle piecewise-linear non-monotonic curves on suboptimal trajectories. TimeRewarder assumes that expert trajectories are near-optimal and monotonic toward completion, so temporal order itself is the supervision signal. This difference is central. WVM is designed to model local regressions explicitly; TimeRewarder uses reversed frame pairs as implicit negatives but retains a trajectory-wise monotonic interpretation of progress.
| Formulation | Prediction target | Temporal granularity |
|---|---|---|
| WVM | Per-frame value $v_t$ or value chunk $\mathbf{v}_{t:t+K-1}$ | Per frame / chunk |
| TimeRewarder | Normalized temporal distance $d_{uv}$ | Frame pair / transition |
This comparison indicates that dense temporal value estimation need not be tied to a single mathematical object. It can be instantiated as a value curve over observations or as a progress increment over transitions, provided the signal remains temporally resolved and task-aligned.
3. Temporal representation learning and model architectures
A central issue in dense temporal value estimation is whether the backbone can represent temporal context and future evolution rather than isolated visual snapshots. WVM is explicitly motivated by the claim that most existing robotic value models are built on vision-language model (VLM) backbones pretrained primarily on static or temporally sparse observations, and therefore lack the requisite temporal modeling capabilities for value estimation (Wang et al., 23 Jun 2026). Its proposed alternative is a world-model-based backbone.
WVM is built on Wan2.2, comprising a Video VAE and a Video DiT. The Video VAE encodes a video clip to temporally and spatially compressed latents, and the Video DiT is a large Video Diffusion Transformer with 30 layers and approximately 5B parameters. For a value chunk anchored on frames 2, the model consumes
3
consisting of one prefix frame, 4 current frames, and 5 future frames. The latent dynamics are described conceptually as
6
with observations recovered via the VAE decoder. The key temporal claim is that the backbone is co-trained to predict future latents and therefore encodes information useful for anticipating scene evolution.
On top of this world model, WVM adds a Value DiT with hidden dimension 512 and 30 layers. The coupling between video and value streams is realized through Mixture-of-Transformers. Attention is asymmetric: value tokens attend to video tokens, but video tokens do not attend to value tokens. Formally, for value tokens,
$$\mathrm{Attn}\big(Q_{\text{val}},\ [K_{\text{vid}}; K_{\text{val}}],\ [V_{\text{vid}}; V_{\text{val}}]\big),$$
whereas video tokens use only $(K_{\text{vid}}, V_{\text{vid}})$. This architecture is designed to exploit world-model features without degrading video generation quality.
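The asymmetric attention pattern can be expressed as a boolean mask. The sketch below assumes tokens are ordered as all video tokens followed by all value tokens, and that value tokens also attend to one another; both are assumptions made for illustration, and the function name is hypothetical.

```python
def asymmetric_mask(num_video, num_value):
    """Build mask[i][j], True when query token i may attend to key token j.

    Tokens are assumed to be ordered as [video..., value...].
    """
    total = num_video + num_value
    mask = []
    for i in range(total):
        query_is_value = i >= num_video
        row = []
        for j in range(total):
            key_is_value = j >= num_video
            # Video queries see only video keys; value queries see every key.
            row.append(query_is_value or not key_is_value)
        mask.append(row)
    return mask


mask = asymmetric_mask(num_video=2, num_value=1)
```

Because no video query can read a value key, the video stream's outputs are unchanged by the presence of the value stream, which is the property the design relies on to protect video generation quality.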
TimeRewarder uses a much lighter architecture. Its visual backbone is CLIP-pretrained ViT-B/16. Each frame is encoded independently to a 512-dimensional feature, two frame embeddings are concatenated into a 1024-dimensional vector, and a single linear head outputs logits over discretized temporal distance bins (Liu et al., 30 Sep 2025). Both encoder and head are trainable. This architecture does not maintain explicit memory across longer sequences and does not model future video generation; instead, it learns temporal structure from frame-pair supervision.
The architectural contrast is substantive. WVM embeds value prediction inside a generative video world model with explicit future-latent prediction, whereas TimeRewarder learns relative temporal distance from static frame pairs. A plausible implication is that the former is optimized for dense state-value estimation on heterogeneous, suboptimal trajectories, while the latter is optimized for inexpensive transition-level reward shaping from passive videos.
4. Training objectives and supervision regimes
WVM treats the value chunk 9 as a continuous multi-dimensional target and trains the value head with flow matching rather than scalar regression or categorical value distributions. For either a future latent target 0 or a value chunk 1, the model defines the stochastic interpolation
2
and optimizes
3
The instantiated losses are
4
with joint objective
5
At inference time, the paper reports that only one explicit Euler step is used to map a noisy initialization to a prediction.
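The mechanics of flow matching with a single Euler step can be sketched on a toy value chunk. The example assumes the common linear path from noise at $\tau = 0$ to the target at $\tau = 1$, with velocity equal to target minus noise; the paper's exact convention may differ. The `oracle` velocity function is a stand-in for a trained network, and all names are hypothetical.

```python
import random


def interpolate(noise, target, tau):
    """Linear path x_tau = (1 - tau) * noise + tau * target (assumed convention)."""
    return [(1 - tau) * n + tau * x for n, x in zip(noise, target)]


def target_velocity(noise, target):
    """Velocity of the linear path: d x_tau / d tau = target - noise."""
    return [x - n for n, x in zip(noise, target)]


def flow_matching_loss(predicted_velocity, noise, target):
    """Mean squared error between predicted and true path velocity."""
    true_velocity = target_velocity(noise, target)
    squared = [(p - v) ** 2 for p, v in zip(predicted_velocity, true_velocity)]
    return sum(squared) / len(squared)


def one_step_euler(noise, velocity_fn):
    """Single explicit Euler step from tau = 0 to tau = 1 (step size 1)."""
    velocity = velocity_fn(noise, 0.0)
    return [n + v for n, v in zip(noise, velocity)]


rng = random.Random(0)
value_chunk = [0.1, 0.2, 0.3, 0.4]          # toy target chunk
noise = [rng.gauss(0.0, 1.0) for _ in value_chunk]


def oracle(x, tau):
    """Stand-in for a trained network: returns the exact path velocity."""
    return target_velocity(noise, value_chunk)


midpoint = interpolate(noise, value_chunk, 0.5)
prediction = one_step_euler(noise, oracle)
loss_at_oracle = flow_matching_loss(oracle(midpoint, 0.5), noise, value_chunk)
```

With a straight path the true velocity is constant in $\tau$, which is why a single Euler step can in principle reach the target when the learned velocity is accurate.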
The supervision regime in WVM is heterogeneous. Training uses approximately 1.4k hours of video from RoboCOIN, EgoDex, RoboReward expert subsets, and self-collected RoboSuite, AgileX, and ARX data, with no action labels. The Video VAE and T5 text encoder are frozen; the world model is fine-tuned jointly with the value head. The training mixture includes expert trajectories, multiple embodiments, simulation and real-world data, and mixed-quality segments. This is why the model is described as a generalist value model.
TimeRewarder uses a different supervision pipeline. It samples frame pairs 6 within expert trajectories, with frame interval 7 drawn from
8
a procedure termed Exponentially Weighted Pair Sampling. The scalar temporal distance 9 is mapped to a two-hot target over 0 bins, and the model is trained with cross-entropy: 1 The training hyperparameters reported for the reward model are: 10,000 training pairs per epoch, 100 epochs, 5 warm-up epochs, batch size 16, Adam, and learning rate 2 (Liu et al., 30 Sep 2025).
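Two-hot encoding of a scalar temporal distance can be sketched as follows. The sketch assumes evenly spaced bin centers over $[-1, 1]$; the bin count of 5 is a placeholder chosen for readability, not the paper's setting, and the function names are hypothetical.

```python
import math


def two_hot(value, low=-1.0, high=1.0, num_bins=5):
    """Split `value` across its two nearest bin centers.

    The weights are chosen so the expectation over bin centers equals `value`.
    """
    value = min(max(value, low), high)
    step = (high - low) / (num_bins - 1)
    position = (value - low) / step
    lower = min(int(position), num_bins - 2)
    weight_upper = position - lower
    target = [0.0] * num_bins
    target[lower] = 1.0 - weight_upper
    target[lower + 1] = weight_upper
    return target


def expected_value(probs, low=-1.0, high=1.0):
    """Decode a distribution over bins back to a scalar distance."""
    step = (high - low) / (len(probs) - 1)
    return sum(p * (low + i * step) for i, p in enumerate(probs))


def cross_entropy(probs, target):
    """Cross-entropy between predicted bin probabilities and a two-hot target."""
    return -sum(t * math.log(p) for p, t in zip(probs, target) if t > 0.0)


target = two_hot(0.25)
decoded = expected_value(target)
uniform_loss = cross_entropy([0.2] * 5, target)
```

Decoding by expectation is what lets a categorical head return a continuous signed distance, so a distance that falls between two bin centers is not rounded to either of them.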
The supervision regimes encode different assumptions about temporal structure. WVM relies on dense labels, including explicit labels for hesitation and retry. TimeRewarder relies on pairwise temporal order inside expert videos and derives suboptimal awareness from the inclusion of both forward and backward ordered pairs. The former is label-intensive but expressly supports non-monotonic progress; the latter is label-efficient but rests on the monotonicity of expert demonstrations.
5. Dense temporal phenomena: monotonic progress, hesitation, and retry
A defining feature of dense temporal value estimation is whether it can represent local deviations from monotonic progress. WVM formalizes this issue explicitly through chunk-wise dense value prediction and through the construction of Suboptimal-Value-Bench, which contains approximately 800 trajectories with dense human frame-level value annotations for suboptimal behaviors (Wang et al., 23 Jun 2026).
For expert trajectories, the ground-truth curve is simply 3, so progress is monotonic. For suboptimal trajectories, WVM introduces piecewise-linear progress curves for two modes. In hesitation, a plateau segment 4 is inserted during which the robot does not make progress but later resumes at the same effective speed. If 5, then
6
with linear growth before 7, a constant plateau on 8, and linear growth afterward. In retry, a failed attempt is followed by backward movement and then re-approach. If 9, the effective forward frames are 0, the per-step rate is 1, and
2
with piecewise-linear interpolation between 3, 4, 5, and 6. This yields the characteristic plateau for hesitation and a V-shaped dip, potentially to 0, for retry.
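The two curve shapes can be sketched with a generic piecewise-linear interpolator. The hesitation construction below assumes frames indexed from 0 and a forward rate of one over the number of non-plateau steps, so that progress resumes at the same speed after the pause. The retry keypoints are purely illustrative values picked to produce a V-shaped dip; they are not the paper's formula, and all names are hypothetical.

```python
def piecewise_linear(keypoints, num_frames):
    """Interpolate (frame, value) keypoints into one value per frame.

    Keypoints must be sorted by frame and span frames 0..num_frames - 1.
    """
    curve = []
    for t in range(num_frames):
        for (t0, v0), (t1, v1) in zip(keypoints, keypoints[1:]):
            if t0 <= t <= t1:
                alpha = 0.0 if t1 == t0 else (t - t0) / (t1 - t0)
                curve.append(v0 + alpha * (v1 - v0))
                break
    return curve


def hesitation_curve(num_frames, start, end):
    """Linear growth, a flat plateau on [start, end], then growth at the same rate."""
    last = num_frames - 1
    effective_steps = last - (end - start)   # steps spent moving forward
    rate = 1.0 / effective_steps
    plateau = start * rate
    keypoints = [(0, 0.0), (start, plateau), (end, plateau), (last, 1.0)]
    return piecewise_linear(keypoints, num_frames)


hesitation = hesitation_curve(num_frames=11, start=2, end=4)

# Illustrative retry: rise to 0.5, fall back to 0.0, then re-approach to 1.0.
retry = piecewise_linear([(0, 0.0), (4, 0.5), (6, 0.0), (10, 1.0)], 11)
```

In the hesitation example the slope is 0.125 per frame both before frame 2 and after frame 4, which is the "same effective speed" property described above.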
TimeRewarder does not encode hesitation and retry with explicit frame-level labels. Instead, it treats reversed frame ordering as a negative instance of progress and predicts signed temporal distance over $[-1, 1]$. This means that it can assign negative progress to backward-looking transitions, but it does not construct a trajectory-level piecewise-linear value curve with labeled local plateaus or dips (Liu et al., 30 Sep 2025). The paper also explicitly states limitations on tasks with frequent back-and-forth motions and on multi-goal or branching tasks, where temporal order may not correspond to a single scalar progress variable.
This distinction addresses a common misconception: dense temporal value estimation is not identical to monotonic progress tracking. In WVM, dense values are designed to represent localized non-monotonicity. In TimeRewarder, dense progress remains fundamentally tied to the ordering of expert trajectories. The two approaches therefore occupy different points on the spectrum between monotonic temporal ranking and explicitly non-monotonic value modeling.
6. Evaluation protocols and empirical performance
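WVM evaluates dense temporal value estimation with metrics designed for both expert and suboptimal trajectories. On Suboptimal-Value-Bench, Hesitation-RMSE is computed over the set $\mathcal{H}$ of frames in hesitation segments,
$$\text{Hesitation-RMSE} = \sqrt{\frac{1}{|\mathcal{H}|} \sum_{t \in \mathcal{H}} \left(\hat{v}_t - v_t\right)^2},$$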
penalizing deviations from the true flat plateau. For retry segments, the paper uses Retry-VOC, a Value-Order Correlation restricted to windows with monotonically decreasing ground truth. Standard expert VOC is also computed on monotonic expert trajectories. The reported results are: average Hesitation-RMSE of 0.05 for WVM versus 0.14 for the next best baselines; average Retry-VOC of 0.78 for WVM versus 0.62 for GVL; and average expert VOC of 0.95, compared with 0.88 for the best baseline on expert-only datasets. The paper further reports that WVM exceeds 0.99 on all self-collected robot datasets and is slightly below RoboReward on EgoDex, 0.92 versus 0.95 (Wang et al., 23 Jun 2026).
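Both kinds of metric can be sketched in a few lines. The RMSE follows directly from its definition over hesitation frames. For VOC, the sketch assumes a Spearman-style rank correlation between predicted and ground-truth values, with ties broken by frame order; the paper's exact VOC definition may differ, and the function names and toy numbers are hypothetical.

```python
import math


def hesitation_rmse(pred, truth, hesitation_frames):
    """Root-mean-square error restricted to the given frame indices."""
    errors = [(pred[t] - truth[t]) ** 2 for t in hesitation_frames]
    return math.sqrt(sum(errors) / len(errors))


def _ranks(xs):
    """Rank positions 0..n-1; ties are broken by original order."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    ranks = [0] * len(xs)
    for rank, idx in enumerate(order):
        ranks[idx] = rank
    return ranks


def value_order_correlation(pred, truth):
    """Pearson correlation of ranks (Spearman), for sequences of length >= 2."""
    rank_pred, rank_truth = _ranks(pred), _ranks(truth)
    mean = (len(rank_pred) - 1) / 2
    cov = sum((a - mean) * (b - mean) for a, b in zip(rank_pred, rank_truth))
    var = sum((a - mean) ** 2 for a in rank_pred)
    return cov / var


# Toy hesitation example: ground truth is flat on frames 2..4.
truth = [0.0, 0.125, 0.25, 0.25, 0.25, 0.375]
pred = [0.0, 0.1, 0.2, 0.3, 0.25, 0.4]
rmse = hesitation_rmse(pred, truth, hesitation_frames=[2, 3, 4])

# Toy retry window: ground truth decreases monotonically.
voc_good = value_order_correlation([0.6, 0.3, 0.1], [0.5, 0.25, 0.0])
voc_bad = value_order_correlation([0.1, 0.3, 0.6], [0.5, 0.25, 0.0])
```

A model that keeps rising through a retry window scores a negative correlation here, so a rank-based metric restricted to decreasing segments isolates the ability to detect regression.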
The WVM ablations are particularly diagnostic for dense temporal modeling. Removing video co-training worsens Hesitation-RMSE from 0.05 to 0.08 and lowers Retry-VOC from 0.78 to 0.68. Training Video DiT from scratch reduces Retry-VOC further to 0.62. Freezing Video DiT yields Hesitation-RMSE 0.12, Retry-VOC 0.45, and Expert-VOC 0.92. Prefix randomization with intermediate probability 9 gives the best overall trade-off, while 0 inflates Expert-VOC to 0.98 but degrades suboptimal metrics, indicating that the model can otherwise exploit prefix information rather than the visually observed chunk. Replacing the flow-matching head with an HL-Gaussian head lowers all metrics, especially Retry-VOC from 0.78 to 0.59.
TimeRewarder evaluates dense temporal signals primarily through downstream RL and held-out temporal coherence tests. On ten Meta-World tasks with 100 action-free expert videos each, it reports nearly perfect success in 9/10 tasks with only 200,000 interactions per task with the environment, and states that it outperformed previous methods and even the manually designed environment dense reward on both final success rate and sample efficiency (Liu et al., 30 Sep 2025). For unseen expert videos, it evaluates VOC and reports the highest VOC across tasks versus VIP and GVL. Qualitative analyses compare successful and failed rollouts on tasks such as basketball and window-open, with TimeRewarder yielding smooth, monotone increases on successful rollouts and low or declining rewards on failures.
The evaluation protocols reveal different standards of evidence. WVM measures fidelity to dense human-labeled value curves, including plateau and regression segments. TimeRewarder measures the utility and coherence of temporally ordered reward estimates, especially through RL improvement. This suggests two complementary notions of success: curve accuracy for state-value estimation and control utility for reward shaping.
7. Policy learning, adjacent paradigms, and limitations
Dense temporal value estimation is operationally important because it can guide policy optimization from sparse or mixed-quality data. WVM uses dense per-frame predictions to define an advantage proxy over action chunks,
1
with 2 and 3. This proxy is inserted into weighted behavior cloning losses of the form
4
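The paper instantiates three choices for the per-chunk weight $w_t$ as a function of the advantage proxy $A_t$. Binary filter weighting is
$$w_t = \mathbb{1}[A_t > 0],$$
percentile filter weighting is
$$w_t = \mathbb{1}[A_t \ge \tau],$$
with the threshold $\tau$ chosen to keep the top 70% of chunks by $A_t$, and Advantage-Weighted Regression weighting is
$$w_t = \exp(A_t / \beta).$$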
Using only suboptimal trajectories, the paper reports that all WVM-guided variants outperform plain BC on simulated RoboSuite and real-world AgileX tasks (Wang et al., 23 Jun 2026).
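The three weighting schemes can be sketched on toy numbers. The advantage proxy below is assumed to be the predicted progress gained between the start and end of an action chunk. The zero threshold, the temperature `beta`, the weight clip, and the normalized form of the loss are illustrative choices rather than values from the paper, and all function names are hypothetical.

```python
import math


def advantage_proxy(values, start, horizon):
    """Predicted progress gained over an action chunk (assumed form)."""
    end = min(start + horizon, len(values) - 1)
    return values[end] - values[start]


def binary_weights(advantages, threshold=0.0):
    """Keep chunks whose advantage exceeds a fixed threshold."""
    return [1.0 if a > threshold else 0.0 for a in advantages]


def percentile_weights(advantages, keep_fraction=0.7):
    """Keep the top `keep_fraction` of chunks ranked by advantage."""
    k = max(1, round(keep_fraction * len(advantages)))
    cutoff = sorted(advantages, reverse=True)[k - 1]
    return [1.0 if a >= cutoff else 0.0 for a in advantages]


def awr_weights(advantages, beta=1.0, max_weight=20.0):
    """Exponential advantage weights, clipped for numerical stability."""
    return [min(math.exp(a / beta), max_weight) for a in advantages]


def weighted_bc_loss(per_chunk_losses, weights):
    """Weight-normalized average of per-chunk behavior cloning losses."""
    total = sum(weights)
    if total == 0:
        return 0.0
    return sum(w * l for w, l in zip(weights, per_chunk_losses)) / total


advantages = [0.3, -0.1, 0.0, 0.2, 0.1, -0.2, 0.4, 0.05, 0.15, -0.05]
w_binary = binary_weights(advantages)
w_percentile = percentile_weights(advantages)
w_awr = awr_weights(advantages)
```

The filters discard low-advantage chunks outright, while the exponential weighting keeps every chunk and only rescales its contribution, which matters when the suboptimal dataset is small.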
TimeRewarder uses the predicted temporal distance as a dense reward during DrQ-v2 training,
0
with the reward model frozen throughout RL. The dense progress term shapes exploration by rewarding transitions that resemble forward progress in expert videos and penalizing regressions. The paper emphasizes that the model can also leverage real-world human videos: on three selected Meta-World tasks, combining human videos with one in-domain Meta-World expert video significantly improves success and sample efficiency relative to human-only or Meta-World-only training (Liu et al., 30 Sep 2025).
These systems are situated relative to several adjacent paradigms. WVM explicitly contrasts dense temporal value estimation with VLM-based robotic value models such as GVL, VLAC, RoboReward, Robo-Dopamine, Robometer, and TopReward, which often operate on static images, short clips, or frame pairs and may use sparse temporal sampling or coarse labels. TimeRewarder further positions itself against TCN, VIP, Rank2Reward, PROGRESSOR, GAIfO, OT, ADS, and video-prediction-based reward methods. The central contrast is that dense temporal value estimation aims to encode temporally resolved task progress rather than generic similarity, global success labels, or only pairwise order without magnitude.
The principal limitations are also explicit. WVM notes that training data scale remains moderate, that the approach is video-centric, that longer credit assignment still relies on overlapping chunks and model attention, and that no explicit imagined rollouts are used for value. TimeRewarder notes breakdowns on tasks with frequent back-and-forth motions, multi-goal or branching tasks, partial observability or occlusions, and severe domain shift. These constraints indicate that dense temporal value estimation is most reliable when progress is strongly legible in visual dynamics and can be represented by a relatively low-dimensional scalar notion of advancement.
A broader reading of the two papers suggests that dense temporal value estimation is becoming a convergence point for model-based RL, distributional prediction, progress-based reward learning, and offline imitation from heterogeneous visual logs. In WVM, this convergence takes the form of a world-model-based, distributional, dense state-value function. In TimeRewarder, it takes the form of a frozen frame-pair temporal-distance model reused as a dense reward signal. The shared premise is that temporal structure itself can be a primary supervisory resource for learning progress-sensitive control signals.