Generalist Forecasting with Frozen Video Models via Latent Diffusion (2507.13942v1)

Published 18 Jul 2025 in cs.CV, cs.AI, and cs.LG

Abstract: Forecasting what will happen next is a critical skill for general-purpose systems that plan or act in the world at different levels of abstraction. In this paper, we identify a strong correlation between a vision model's perceptual ability and its generalist forecasting performance over short time horizons. This trend holds across a diverse set of pretrained models, including those trained generatively, and across multiple levels of abstraction, from raw pixels to depth, point tracks, and object motion. The result is made possible by a novel generalist forecasting framework that operates on any frozen vision backbone: we train latent diffusion models to forecast future features in the frozen representation space, which are then decoded via lightweight, task-specific readouts. To enable consistent evaluation across tasks, we introduce distributional metrics that compare distributional properties directly in the space of downstream tasks and apply this framework to nine models and four tasks. Our results highlight the value of bridging representation learning and generative modeling for temporally grounded video understanding.

Summary

  • The paper presents a two-stage latent diffusion framework that forecasts future video latent trajectories using frozen video models.
  • Empirical results show a strong correlation between perceptual performance and short-horizon forecasting accuracy across diverse tasks.
  • Ablations reveal that diffusion models outperform deterministic regression in capturing the multimodal distribution of future video representations.

Generalist Forecasting with Frozen Video Models via Latent Diffusion

This paper presents a unified framework for generalist video forecasting by leveraging frozen pretrained video models and a diffusion-based forecasting module operating in latent representation space. The work systematically investigates the relationship between perceptual ability and forecasting performance across a diverse set of state-of-the-art image and video models, spanning multiple abstraction levels and tasks. The authors introduce new evaluation protocols and metrics to rigorously assess stochastic, temporally extended forecasting, and provide extensive empirical results and ablations.

Methodological Contributions

The core methodological innovation is a two-stage pipeline:

  1. Task Readout Heads: For each frozen video backbone, lightweight attention-based readout heads are trained to decode task-specific outputs (pixels, depth, point tracks, bounding boxes) from the frozen representations. These heads are trained only on observed frames and remain fixed during forecasting.
  2. Latent Diffusion Forecasting: A conditional denoising diffusion model is trained to forecast future latent trajectories in the frozen representation space, conditioned on a context of past frames. The diffusion model is architecture-agnostic and models the joint distribution of future representations, capturing the inherent stochasticity of the future.

This modular approach decouples representation learning, task decoding, and forecasting, enabling fair comparison across models with different pretraining paradigms (masking, synthesis, language supervision, etc.).
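
As a concrete illustration of this decoupling, the sketch below shows the two stages in PyTorch: a frozen backbone produces latents, a small attention readout decodes them into task outputs, and a transformer denoiser forecasts future latents conditioned on context latents. The class names, query-based readout, 8-head/6-layer denoiser, and omitted timestep embedding are illustrative assumptions, not the authors' exact architecture.

```python
import torch
import torch.nn as nn

class ReadoutHead(nn.Module):
    """Stage 1: lightweight attention-based readout that decodes a task output
    (pixels, depth, tracks, boxes) from frozen backbone features.
    Illustrative only; the paper's exact head design may differ."""
    def __init__(self, latent_dim: int, num_queries: int, out_dim: int):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, latent_dim))
        self.attn = nn.MultiheadAttention(latent_dim, num_heads=8, batch_first=True)
        self.proj = nn.Linear(latent_dim, out_dim)

    def forward(self, latents):                     # latents: (B, tokens, D); backbone stays frozen
        q = self.queries.unsqueeze(0).expand(latents.size(0), -1, -1)
        attended, _ = self.attn(q, latents, latents)
        return self.proj(attended)                  # task-specific prediction

class LatentDenoiser(nn.Module):
    """Stage 2: conditional denoiser for a latent diffusion model that forecasts
    future latents given context-frame latents (timestep embedding omitted for brevity)."""
    def __init__(self, latent_dim: int):
        super().__init__()
        layer = nn.TransformerEncoderLayer(latent_dim, nhead=8, batch_first=True)
        self.net = nn.TransformerEncoder(layer, num_layers=6)

    def forward(self, noisy_future, context):       # (B, Tf, D), (B, Tc, D)
        x = torch.cat([context, noisy_future], dim=1)
        return self.net(x)[:, context.size(1):]     # denoised future latents only
```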

Evaluation Protocol

The evaluation is designed to capture both the accuracy and the distributional realism/diversity of predicted futures:

  • Per-example metrics: For each input, 10 samples are drawn from the diffusion model. Metrics such as PSNR (pixels), mean absolute relative error (depth), Jaccard distance (point tracks), and IoU (boxes) are computed, reporting mean, min, and max across samples.
  • Distributional metrics: Fréchet Distance (FD) is computed between the distributions of predicted and ground-truth trajectories in the output space, along with the variance of the predicted samples. This directly measures the alignment of the model's stochastic predictions with the true data distribution.
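
The sketch below illustrates both metric families under simplifying assumptions: predicted and ground-truth trajectories are flattened into vectors in the task's output space, and the Fréchet Distance uses the standard FID-style Gaussian approximation. The function names and details are ours and may not match the paper's implementation.

```python
import numpy as np
from scipy import linalg

def frechet_distance(pred, real):
    """FID-style Fréchet distance between Gaussian fits of predicted and
    ground-truth trajectories, both flattened to (num_samples, feature_dim)
    in the downstream task's output space."""
    mu_p, mu_r = pred.mean(axis=0), real.mean(axis=0)
    cov_p = np.cov(pred, rowvar=False)
    cov_r = np.cov(real, rowvar=False)
    covmean = linalg.sqrtm(cov_p @ cov_r).real      # matrix square root of the covariance product
    diff = mu_p - mu_r
    return float(diff @ diff + np.trace(cov_p + cov_r - 2.0 * covmean))

def per_example_summary(errors):
    """Per-example protocol: given one error value per diffusion sample
    (e.g. 10 samples per input), report mean, min (best-of-N), and max."""
    e = np.asarray(errors, dtype=float)
    return {"mean": e.mean(), "min": e.min(), "max": e.max()}
```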

Empirical Findings

Correlation Between Perception and Forecasting

A central empirical result is the strong linear correlation between a model's perception performance and its short-horizon forecasting ability across all tasks and abstraction levels. This trend is especially pronounced for models pretrained with temporally coherent video data. Models trained solely on static images or with language supervision (e.g., DINOv2, SigLIP) consistently underperform in forecasting, highlighting the necessity of temporal supervision.

Model Comparisons

  • Masked Modeling vs. Synthesis: Among masked modeling approaches, 4DS-e achieves the best overall forecasting performance (lowest FD) across most tasks, except for point tracks where VideoMAE is superior.
  • Video Synthesis Models: W.A.L.T., a video synthesis model, outperforms masked models on low-level tasks (pixel, depth forecasting) but is less competitive on mid- and high-level tasks (point tracks, bounding boxes) when compared to similarly sized masked models. This is attributed to W.A.L.T.'s training regime, which is dominated by text-to-video generation rather than direct forecasting.
  • Variance Gap: All models, including W.A.L.T., fail to match the ground-truth variance in their predictions, especially for depth forecasting, indicating a limitation in modeling the full diversity of plausible futures.

Deterministic Regression vs. Diffusion

Ablation studies show that deterministic regression models can optimize mean-based metrics but fail to capture the variance and stochasticity of the future. Diffusion models, in contrast, achieve better best-of-N and FD scores, demonstrating superior modeling of multimodal futures.
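
A toy numerical example of this gap: when the future is bimodal, the MSE-optimal deterministic prediction is the conditional mean, which matches neither mode, while a stochastic forecaster that draws N samples attains a low best-of-N error and a variance close to the data. The synthetic bimodal data and N=10 below are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy bimodal future: an object moves either left (-1) or right (+1) with small noise.
true_future = rng.choice([-1.0, 1.0], size=1000) + 0.05 * rng.standard_normal(1000)

# A deterministic regressor trained with MSE converges toward the conditional mean (~0),
# which lies between the modes and matches neither outcome.
regression_pred = true_future.mean()
print("regression MSE:", np.mean((true_future - regression_pred) ** 2))   # ~1.0

# A stochastic forecaster draws N samples per input; its best-of-N error is small
# and its sample variance matches the data, unlike the single point estimate.
N = 10
samples = rng.choice([-1.0, 1.0], size=(1000, N)) + 0.05 * rng.standard_normal((1000, N))
best_of_n_err = np.min((true_future[:, None] - samples) ** 2, axis=1)
print(f"stochastic best-of-{N} MSE:", best_of_n_err.mean())               # close to 0
print("predicted vs. true variance:", samples.var(), true_future.var())
```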

Implementation Considerations

  • Computational Requirements: Training the diffusion forecasting module and readout heads is computationally intensive, requiring extensive TPU resources (144 days aggregated across experiments).
  • Layer Normalization: Applying layer normalization to frozen latents is critical for stable and performant diffusion forecasting.
  • Context and Horizon: All experiments use 4 context frames and forecast 12 future frames, with the effective time horizon varying by dataset (up to ~3 seconds).
  • Frozen Backbones: No fine-tuning of the base video models is performed; all adaptation is via the readout and diffusion modules.
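
A minimal sketch of the data preparation these points imply, assuming per-frame latents of shape (T, tokens, D) from the frozen backbone: the layer normalization placement and the 4/12 context-to-forecast split follow the text above, while the function and constant names are illustrative.

```python
import torch
import torch.nn.functional as F

CONTEXT_FRAMES, FORECAST_FRAMES = 4, 12            # setup reported in the paper

def prepare_latents(frozen_latents: torch.Tensor):
    """frozen_latents: (T, tokens, D) features from the frozen backbone (no fine-tuning).
    Layer-normalize the frozen latents, then split them into conditioning context
    and forecasting targets for the diffusion module."""
    latents = F.layer_norm(frozen_latents, frozen_latents.shape[-1:])
    context = latents[:CONTEXT_FRAMES]
    targets = latents[CONTEXT_FRAMES:CONTEXT_FRAMES + FORECAST_FRAMES]
    return context, targets
```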

Limitations

  • Short Horizons: The paper is limited to short-horizon forecasting (≤3 seconds). The observed correlation between perception and forecasting may not hold for longer horizons.
  • Dataset Complexity: The datasets used lack highly complex or ambiguous motion, potentially limiting the generality of the findings.
  • Computational Cost: Diffusion-based forecasting is resource-intensive and sensitive to the number of samples drawn at evaluation.
  • Representation Mismatch: Forecasting in frozen representation space may be suboptimal for tasks requiring fine-grained temporal dynamics.

Implications and Future Directions

This work establishes that strong perceptual representations, especially those learned with temporal supervision, are a prerequisite for effective short-term forecasting. The proposed framework provides a unified testbed for evaluating both perception- and synthesis-based models on temporally grounded tasks, and the distributional metrics introduced are well-suited for assessing stochastic generative models.

Future research directions include:

  • Extending the framework to longer-horizon forecasting and more complex, ambiguous scenarios.
  • Investigating joint training of representation, readout, and forecasting modules to mitigate representational mismatches.
  • Exploring more efficient or scalable generative forecasting architectures.
  • Developing improved metrics for evaluating the diversity and realism of predicted futures, especially for high-level semantic tasks.

Summary Table: Model Performance (Fréchet Distance, Lower is Better)

Model        Pixels   Depth   Points   Boxes
DINOv2        82.3    626.8    1.90    3.08
SigLIP       182.7    872.1    3.00    3.22
VideoPrism    83.4    948.1    0.80    2.72
VJEPA         61.5    710.5    0.63    2.85
VideoMAE      33.8    643.8    0.55    2.62
VideoMAEv2    45.6    652.5    0.74    2.92
4DS-h         32.5    616.2    7.88    3.24
4DS-e         32.3    599.4    0.68    1.87
W.A.L.T.       8.60   222.9    1.34    2.47
N-WALT        10.9    264.0    1.30    2.31

Note: 4DS-e and W.A.L.T. achieve the best FD on most tasks, with W.A.L.T. excelling in pixel and depth forecasting.

Conclusion

The paper provides a rigorous, extensible framework for generalist video forecasting using frozen video models and latent diffusion. The empirical results demonstrate that perceptual ability, especially when learned with temporal supervision, is a strong predictor of forecasting performance in the short term. The work highlights the need for temporally grounded representation learning and sets a new standard for evaluating stochastic video forecasting across multiple abstraction levels.
