DINO-world: Latent Space Video Prediction
- DINO-world is a generalist video world model that utilizes a pre-trained DINOv2 encoder to extract latent features and predict future frames with temporal consistency.
- It achieves state-of-the-art performance on segmentation, depth forecasting, and intuitive physics benchmarks using an autoregressive transformer architecture.
- The design supports direct adaptation to action-conditioned planning, providing a scalable foundation for reinforcement learning and robotics.
DINO-world is a generalist video world model that utilizes the latent space of a pre-trained DINOv2 image encoder to predict future frames in videos. Designed as a foundation for video predictive modeling, DINO-world addresses temporal dynamics across diverse real-world and simulated scenes and establishes robust performance on multiple dense forecasting benchmarks, including segmentation, depth forecasting, and intuitive physics tasks. The model architecture allows direct adaptation to action-conditioned world modeling, facilitating planning and control by simulating candidate future trajectories entirely in latent space (Baldassarre et al., 25 Jul 2025).
1. Model Architecture and Latent Space Formulation
DINO-world processes each video frame through a frozen DINOv2 image encoder to produce a grid of patch embeddings, each with spatial coordinates (i, j) and embedding dimension D. This grid forms the latent “state” of the world model at each timestep. The predictor is an autoregressive transformer composed of N residual pre-norm cross-attention blocks, responsible for forecasting future patch embeddings given a specified temporal offset.
For each prediction, a learned query is constructed for the target time $t'$ and location $(i, j)$. This query interacts with all prior patch tokens (for all locations $(i, j)$ and times $t < t'$) via cross-attention. Positions are encoded with three-axial Rotary Position Encoding (RoPE): absolute timestamps are used for the temporal axis (with rotation periods spanning a wide range of timescales), while the spatial indices $i$ and $j$ are linearly mapped to a fixed normalized range for the two spatial axes. This positional encoding gives the predictor invariance to variable frame rates and spatial resolutions.
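The query/cross-attention step can be sketched as follows in a minimal numpy toy, assuming illustrative dimensions and omitting the RoPE position encoding (in the real model, positional information is injected into queries and keys before attention):

```python
import numpy as np

rng = np.random.default_rng(0)
D = 16          # embedding dimension (illustrative; DINOv2 features are larger)
T, P = 4, 9     # number of past frames and patches per frame

# Past patch tokens, flattened to (T*P, D). Positional info (t, i, j) would
# normally be added via three-axial RoPE; omitted here for brevity.
context = rng.normal(size=(T * P, D))

# Learned query for the target coordinate (t', i, j).
query = rng.normal(size=(D,))

def cross_attention(q, ctx):
    """Single-head cross-attention: the query reads from all context tokens."""
    scores = ctx @ q / np.sqrt(q.shape[0])   # similarity to each past token
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                 # softmax over the context
    return weights @ ctx                     # predicted patch embedding, (D,)

pred = cross_attention(query, context)
print(pred.shape)  # (16,)
```

In the full predictor, this attention is wrapped in N residual pre-norm blocks and followed by an MLP, but the core read operation is the same.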
Training employs teacher forcing: given a sequence of DINOv2 patch tokens $x_{t,i,j}$ and their timestamps, the model predicts the token at a queried coordinate $(t', i, j)$:

$$\hat{x}_{t',i,j} = f_\theta\big(t', i, j;\ \{x_{t,\cdot,\cdot} \mid t < t'\}\big)$$
A block-triangular attention mask enforces causality, ensuring each prediction utilizes only past states.
The loss function is a smooth L1 loss averaged over all predicted patch tokens:

$$\mathcal{L} = \frac{1}{|\mathcal{Q}|} \sum_{(t',i,j) \in \mathcal{Q}} \mathrm{SmoothL1}\big(\hat{x}_{t',i,j},\, x_{t',i,j}\big)$$

where $\mathcal{Q}$ is the set of queried coordinates.
During training, future prediction targets are sampled uniformly from a range to encourage the model to reason about both short- and long-term dynamics.
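The causal masking and loss described above can be sketched in numpy; the mask shape and beta value here are illustrative, not the paper's exact settings:

```python
import numpy as np

def block_causal_mask(num_frames, patches_per_frame):
    """Block-triangular mask: a query at frame t may attend only to tokens of
    strictly earlier frames (True = attention allowed). Frame 0 has no context,
    so in practice it is not predicted."""
    frame_idx = np.repeat(np.arange(num_frames), patches_per_frame)
    return frame_idx[:, None] > frame_idx[None, :]

def smooth_l1(pred, target, beta=1.0):
    """Smooth L1 (Huber) loss, averaged over all predicted patch tokens."""
    diff = np.abs(pred - target)
    loss = np.where(diff < beta, 0.5 * diff**2 / beta, diff - 0.5 * beta)
    return loss.mean()

mask = block_causal_mask(num_frames=3, patches_per_frame=2)
print(mask.astype(int))  # 6x6 block-lower-triangular pattern
```

The block structure (rather than a token-level triangle) lets all patches of a frame be predicted in parallel while still hiding the future.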
2. Training Methodology and Dataset Scope
DINO-world is trained on a large-scale, uncurated video collection comprising approximately 60–66 million web videos, encompassing driving, indoor, outdoor, and simulated environments over a wide variety of frame rates and durations. The DINOv2 encoder is frozen throughout; only the transformer-based predictor model is trained. Training is conducted in a fully unsupervised, next-frame prediction paradigm, leveraging the vast diversity of dynamics in the web-scale data sources to build a generalist predictive world model.
3. Performance Across Benchmarks and Tasks
DINO-world demonstrates state-of-the-art performance across several dense forecasting benchmarks:
- Semantic Segmentation Forecasting: On VSPW mid-term prediction (0.5s ahead), it outperforms the next-best model by 6.3 percentage points in mIoU.
- Depth Forecasting: Benchmarks on KITTI and Cityscapes demonstrate higher quality latent predictions for depth.
- Comparative Models: Evaluations include DINO-Foresight (domain-specific latent predictor), V-JEPA (joint encoder–predictor training), and pixel-space generative models like COSMOS, with DINO-world exhibiting superior temporal modeling quality.
Additionally, DINO-world is assessed on intuitive physics benchmarks (IntPhys, GRASP, InfLevel), where it achieves lower “surprise” scores on physically plausible video sequences, reflecting advanced understanding of object permanence, causality, and other core physical reasoning attributes.
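One way such a surprise metric can be formulated (the paper's exact scoring may differ; this is an illustrative sketch with synthetic latents) is to score each frame by the prediction error between the model's forecast and the observed latent, then compare mean surprise across clips:

```python
import numpy as np

def surprise(pred_latents, observed_latents):
    """Per-frame surprise: L2 distance between the model's forecast and the
    actually observed latent (an illustrative choice of error measure)."""
    return np.linalg.norm(pred_latents - observed_latents, axis=-1)

rng = np.random.default_rng(1)
obs = rng.normal(size=(8, 16))                      # 8 frames of 16-d latents

# Accurate forecasts stand in for a physically plausible clip; large forecast
# errors stand in for an implausible event the model could not anticipate.
good_pred = obs + 0.05 * rng.normal(size=obs.shape)
bad_pred = obs + 1.0 * rng.normal(size=obs.shape)

print(surprise(good_pred, obs).mean() < surprise(bad_pred, obs).mean())  # True
```

A model with internalized physics assigns low surprise to plausible sequences and high surprise where object permanence or causality is violated.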
4. Action Conditioning and Decision-Time Planning
DINO-world is explicitly designed to be fine-tuned for control settings. Once pretrained in an unsupervised fashion, the predictor is adapted on observation-action trajectories from specific environments by inserting “action blocks” after each cross-attention block. These blocks take the actions (e.g., agent controls) as input and are zero-initialized, so that at the start of fine-tuning the adapted model exactly reproduces the pretrained world model’s predictions.
The action-conditioned predictor allows rollouts in latent space:
- Forward Simulation: The model recursively predicts future patch tokens for a hypothesized action sequence.
- Planning: A search method (e.g., the Cross-Entropy Method, CEM) optimizes a sequence of actions $a_{1:T}$ that minimizes the latent distance to a desired goal embedding $z_{\text{goal}}$:

$$a_{1:T}^{*} = \arg\min_{a_{1:T}} \left\| \hat{z}_T(a_{1:T}) - z_{\text{goal}} \right\|^2$$

where $\hat{z}_T(a_{1:T})$ is the latent state predicted after rolling out the candidate action sequence.
Demonstrations are provided on Push-T, Wall, and PointMaze environments, supporting direct planning in latent space for navigation or manipulation tasks.
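The planning loop can be sketched with a toy linear latent dynamics standing in for the action-conditioned predictor (all dimensions, the dynamics, and the CEM hyperparameters here are illustrative assumptions, not the paper's):

```python
import numpy as np

rng = np.random.default_rng(2)
LATENT_DIM, ACTION_DIM, HORIZON = 4, 2, 5

# Toy stable linear dynamics as a stand-in for the learned latent predictor.
A = rng.normal(size=(LATENT_DIM, LATENT_DIM))
A *= 0.9 / np.abs(np.linalg.eigvals(A)).max()
B = rng.normal(size=(LATENT_DIM, ACTION_DIM))

def rollout(z0, actions):
    """Recursively predict future latent states for an action sequence."""
    z = z0
    for a in actions:
        z = A @ z + B @ a
    return z

def cem_plan(z0, z_goal, iters=20, pop=64, elites=8):
    """Cross-Entropy Method: iteratively refit a Gaussian over action
    sequences toward the lowest-cost (closest-to-goal) candidates."""
    mu = np.zeros((HORIZON, ACTION_DIM))
    sigma = np.ones((HORIZON, ACTION_DIM))
    for _ in range(iters):
        cand = mu + sigma * rng.normal(size=(pop, HORIZON, ACTION_DIM))
        costs = np.array([np.linalg.norm(rollout(z0, c) - z_goal) for c in cand])
        elite = cand[np.argsort(costs)[:elites]]
        mu, sigma = elite.mean(axis=0), elite.std(axis=0) + 1e-6
    return mu

z0 = np.zeros(LATENT_DIM)
z_goal = rng.normal(size=LATENT_DIM)
plan = cem_plan(z0, z_goal)
final = rollout(z0, plan)
print(np.linalg.norm(final - z_goal))  # residual distance shrinks toward the goal
```

The key property is that every candidate trajectory is evaluated entirely in latent space; no pixels are ever decoded during planning.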
5. Architectural Details and Technical Specifics
The DINO-world predictor follows a residual block structure, with each block:
- Pre-norm cross-attention between a learned query for the prediction coordinate and all previous patch tokens, with additive temporal and spatial RoPE encodings.
- An MLP applied after each cross-attention.
- In the action-conditioned version: after every block, concatenation of action vectors to the query and an MLP update.
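The zero-initialization of the action blocks can be illustrated with a minimal numpy sketch (dimensions and the MLP shape are assumptions for illustration): because the output projection starts at zero, the block is an identity map at initialization and the pretrained predictor's behavior is preserved.

```python
import numpy as np

rng = np.random.default_rng(3)
D, A = 16, 4   # embedding and action dimensions (illustrative)

class ActionBlock:
    """Residual action block: concatenate the action to the query and apply
    an MLP whose output projection is zero-initialized, so the block starts
    as the identity and only gradually injects action information."""
    def __init__(self):
        self.w1 = rng.normal(size=(D + A, 2 * D)) * 0.02
        self.w2 = np.zeros((2 * D, D))   # zero-init output projection

    def __call__(self, query, action):
        h = np.concatenate([query, action]) @ self.w1
        h = np.maximum(h, 0.0)           # ReLU
        return query + h @ self.w2       # residual update; zero at init

block = ActionBlock()
q = rng.normal(size=D)
a = rng.normal(size=A)
print(np.allclose(block(q, a), q))  # True: no change at initialization
```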
For segmentation or depth forecasting from predicted latents, a linear “present time” head is trained to map DINOv2 tokens to the relevant output (e.g. class logits or depth values) for performance evaluation, as the actual next frame in pixel-space is not decoded.
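A sketch of such a linear readout, with illustrative token, class, and patch counts: the head is trained on present-time DINOv2 tokens and then applied unchanged to the predictor's forecast tokens to score forecasting quality.

```python
import numpy as np

rng = np.random.default_rng(4)
D, K, P = 16, 5, 9   # token dim, number of classes, patches (illustrative)

# Linear "present time" head: one weight matrix mapping each patch token to
# class logits (for depth, K would instead be a single regression output).
W = rng.normal(size=(D, K)) * 0.02
b = np.zeros(K)

predicted_tokens = rng.normal(size=(P, D))   # output of the latent predictor
logits = predicted_tokens @ W + b            # per-patch class logits
labels = logits.argmax(axis=-1)              # per-patch segmentation forecast
print(logits.shape, labels.shape)  # (9, 5) (9,)
```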
6. Implications and Generalization Capacity
DINO-world demonstrates that world model learning can shift from pixel-level generative forecasting to a structured, pre-trained latent-space predictive paradigm. This foundation provides strong semantic structure, sample efficiency, and broad domain coverage by leveraging the invariances and generality of DINOv2 features. The model’s high performance on intuitive physics benchmarks suggests it has robustly internalized scene dynamics.
A plausible implication is that frozen, large-scale visual encoders combined with scalable latent dynamics predictors can enable generalist world models that serve as the basis for perception, forecasting, and control in multi-domain settings, avoiding the computational burdens of pixel reconstruction and simplifying the adaptation path to action-conditioned reasoning.
7. Prospects and Research Directions
DINO-world exemplifies a shift toward using pre-trained representation spaces for general world modeling, prediction, and planning. Its modularity enables rapid adaptation to downstream reinforcement learning, robotics, and control applications by fine-tuning a small subset of parameters. Future research may explore hierarchical policy learning over latent states, unsupervised “action discovery” using naturalistic video, or applications in domains where dense reward signals are unavailable but visual dynamics are central to agent performance (Baldassarre et al., 25 Jul 2025).