Grounding Video Plans with World Models

Updated 8 February 2026

The paper introduces a novel two-stage optimization that grounds zero-shot video plans into physically consistent latent trajectories for control policies.
It uses latent encoding and collocation via an augmented Lagrangian method to align video trajectories with learned dynamics.
Empirical results in simulated domains demonstrate improved success rates by rectifying spatial and temporal inconsistencies in video plans.

Grounding Video Plans with World Models (GVP-WM) is a planning paradigm at the intersection of large-scale video generative models and learned action-conditioned world models, designed to address the gap between pixel-space video plans and dynamically feasible action trajectories in robotic and embodied agent tasks. By formulating the grounding process as a constrained optimization in the latent space of a world model, GVP-WM enables the transformation of zero-shot or synthesized video plans, which frequently violate physical and temporal constraints, into actionable and physically consistent plans suitable for real-world or simulated execution (Ziakas et al., 2 Feb 2026).

1. Motivation and Problem Statement

Recent advancements in video generative models have demonstrated powerful zero-shot visual planning capabilities. Models such as diffusion-based video generators can synthesize video plans between an initial scene and a desired goal state. However, when such plans are used directly to derive control policies, they often yield infeasible, temporally inconsistent, or physically implausible action sequences due to their lack of grounding in real system dynamics. This misalignment limits their utility in executable planning for robotics and other embodied settings.

GVP-WM addresses this challenge by introducing a systematic procedure to “ground” these video-generated plans—those produced in pixel space—onto the manifold of feasible latent-space trajectories dictated by a learned world model. The aim is to retain the semantic intent of the video plan while enforcing alignment with learned dynamics to produce actionable control policies.

2. Core Methodology

GVP-WM operates as a two-stage, test-time process:

Video Plan Generation: Given an initial observation $o_0$ and goal image $o_g$ , a pre-trained diffusion video model $\mathcal{G}$ generates a video plan $\hat{\tau}_{\rm vid} = \{ o_0, o_1, \ldots, o_{T-1}, o_g \}$ , potentially conditioned on context $c$ (e.g., text prompts).
Latent Encoding and Collocation: Each frame in the generated video is mapped through a frozen encoder $E_p$ from a pre-trained, action-conditioned world model into a latent state sequence $\{\hat z_t\}$ . The system then treats both the trajectory $\{z_t\}$ and the actions $\{a_t\}$ as primal variables and formulates a trajectory optimization problem:

$\text{Find } \{z_t, a_t\}_{t=0}^{T-1} \text{ such that } \begin{cases} z_{t+1} = f_u(z_{t-H:t}, a_{t-H:t}),\ \forall t \ z_t \approx \hat z_t,\ \forall t \ z_T \approx z_g = E_p(o_g) \end{cases}$

The optimization aims to minimize a cost function combining video-plan alignment, action regularization, and goal achievement:

$\mathcal{C}(Z, A) = \sum_{t=0}^{T-1} \left[ \lambda_{\rm vid} L_{\rm vid}(z_t, \hat{z}_t) + \frac{\lambda_a}{2} \|a_t\|^2 \right] + \lambda_{g} L_{\rm goal}(z_T, z_g)$

with scale-invariant cosine loss $L_{\rm vid}$ and quadratic goal loss $L_{\rm goal}$ . The dynamics constraints are enforced using an Augmented Lagrangian Method (ALM):

$\mathcal{L}_p(Z, A, \Lambda) = \mathcal{C}(Z, A) + \sum_{t=0}^{T-1} \langle \Lambda_t, z_{t+1} - f_u(z_{t-H:t}, a_{t-H:t}) \rangle + \frac{p}{2} \sum_{t=0}^{T-1} \| z_{t+1} - f_u(z_{t-H:t}, a_{t-H:t}) \|^2$

Alternating updates are performed on trajectory, actions, and Lagrange multipliers, increasing $p$ geometrically to enforce constraint satisfaction.

Receding-Horizon Execution: The first $K$ steps of the optimized action sequence are executed using MPC; then the process is repeated with the updated state.

3. World Model Architecture and Training

GVP-WM utilizes a DINO-WM architecture:

Encoder $E_p$ : Frozen DINOv2 ViT-S/14 visual transformer (patch size 14, hidden dim 384), mapping observations to latent space $\mathbb{R}^{384}$ .
Latent Dynamics $f_u$ : Six-layer Transformer (16 heads, MLP hidden size 2048), modeling evolution of latent states conditioned on histories $(z_{t-H:t}, a_{t-H:t})$ .
(Optional) Decoder: VQ-VAE for visualization.
Training: Conducted offline on expert or randomly sampled trajectories (e.g., 18,500 Push-T settings, 1,920 Wall settings) using Adam optimizer, with learning rates tailored for encoder and dynamics.

This modular structure allows the optimization to be performed in the latent space, which is more compact and semantically meaningful than raw pixel space.

4. Video-Guided Latent Collocation Algorithm

The core grounding loop is encapsulated in an alternating (primal-dual) optimization algorithm:

Encode the video plan into latents $\{\hat z_t\}$ .
Initialize trajectory and action variables from the video plan.
For each outer iteration, increase the penalty parameter $p$ (up to $p_{\max}$ ). For each inner iteration, perform gradient steps on $(Z, A)$ to minimize the augmented Lagrangian, then update Lagrange multipliers.
On convergence, extract the optimized action sequence and execute a receding-horizon segment, repeating the process for long-horizon plans.

Key hyperparameters: $\lambda_{\rm vid}=1.0$ , $\lambda_g=10.0$ , $\lambda_a=0.05...0.1$ , $I=25$ , $O=25$ , $p_0=1.0$ , $y=1.9$ , $p_{\max}=1e3$ .

An optional sample-based refinement uses Gaussian disturbances and open-loop rollouts to improve action robustness.

5. Empirical Evaluation and Results

Experimental Domains

Push-T: 2D contact-rich block manipulation.
Wall: 2D visual navigation around barriers.

Baselines

MPC-CEM: Cross-Entropy Method in latent space.
MPC-GD: Gradient-based shooting.
UniPi: Direct inverse-dynamics from video frames.

Video Plan Sources

WAN-0S: Zero-shot video generation.
WAN-FT: LoRA fine-tuned video generation.
ORACLE: Expert video sequences.

Metrics

Success Rate: Fraction of rollouts meeting final-state thresholds.

Baseline/Method	Push-T Success Rate ( $T=25$ )	Wall Success Rate
GVP-WM (WAN-FT)	up to 6–10 pts > CEM	Higher
GVP-WM (WAN-0S, $T=25$ )	0.56 (CEM: 0.74)	Comparable
GVP-WM (ORACLE, $T=25$ )	98% (CEM: 74%)	-
UniPi (Wall)	Only succeeds in-distribution	-

Extensive ablations demonstrate the necessity of collocation (no collocation: 12% success), video plan initialization, and video loss for competitive performance. Under motion blur, GVP-WM maintains robustness (e.g., 82% at $T=25$ , 46% at $T=50$ ), whereas UniPi collapses to near 0%.

Qualitatively, GVP-WM corrects or ignores inconsistencies such as object teleportation or spatial bilocation in unconstrained video plans. Domain-adapted video guidance results in higher consistency and success rates.

6. Relation to Broader Approaches and Extensions

GVP-WM belongs to a wider class of approaches that bridge pixel-space video generation (from models trained on expansive internet data) and embodied world models for planning. For instance, BridgeV2W enhances grounding by introducing pixel-aligned embodiment masks derived from coordinate-space actions using robot URDF and camera geometries (Chen et al., 3 Feb 2026). Masks are injected into pretrained video diffusion backbones via a ControlNet-style pathway, aligning actions and pixel-space video priors.

These frameworks also introduce specialized loss terms (e.g., flow-based motion loss, dynamics-consistency) and architectural conditioning (e.g., view embeddings) to address challenges such as viewpoint sensitivity and static background overfitting. BridgeV2W demonstrates improved video generation fidelity, Mask-IoU, and policy evaluation correlation, substantiating the effectiveness of grounding mechanisms in diverse settings.

A plausible implication is that as world models and video planners scale, grounding via latent collocation or mask conditioning will be integral to leveraging their semantic competence for real agents.

7. Limitations and Prospects

GVP-WM depends critically on the accuracy of the offline world model. If there is a significant out-of-distribution (OOD) gap, the resulting grounded plans may still be infeasible. The process is computationally more intensive than feed-forward control due to its reliance on test-time trajectory optimization. Tasks with high-dimensional action spaces or requiring precise orientation control may stress current optimization and representation limits.

Future directions include deployment on real robots—where video plans are less OOD relative to simulation—policy distillation to amortize test-time optimization into rapid inference, hierarchical video planning, and integration with value-based or search-based planning for long-horizon, compositional tasks (Ziakas et al., 2 Feb 2026). Architectural innovations such as embodiment-masked conditioning and relaxation of calibration requirements (as explored in BridgeV2W and related work) are further promising directions (Chen et al., 3 Feb 2026).

GVP-WM represents a convergence of generative video modeling and model-based control, setting a foundation for robust, semantically grounded planning in vision-based embodied systems.

Markdown Report Issue Upgrade to Chat

References (2)

Grounding Generated Videos in Feasible Plans via World Models (2026)

BridgeV2W: Bridging Video Generation Models to Embodied World Models via Embodiment Masks (2026)

Topic to Video (Beta)

No one has generated a video about this topic yet.

Whiteboard

No one has generated a whiteboard explanation for this topic yet.

Follow Topic

Get notified by email when new papers are published related to Grounding Video Plans with World Models (GVP-WM).