Depth-Aware Trajectory-Conditioned Video Generation
- Integrating depth cues into diffusion models significantly improves geometric consistency and motion control in video generation.
- Representative systems employ advanced encoding strategies, including monocular depth estimation, 3D trajectory parameterization, and explicit camera pose encoding, to align video frames with precise spatial paths.
- These frameworks achieve state-of-the-art visual fidelity and spatial coherence, supporting applications in robotics, 3D reconstruction, and creative media synthesis.
Depth-aware trajectory-conditioned video generation refers to a class of generative models—predominantly built upon diffusion frameworks—that synthesize video sequences aligned to a user-specified spatial or camera trajectory, using depth or geometric information as an explicit conditioning signal. These systems are designed to overcome the deficiencies of purely 2D or trajectory-latent-based controls by introducing 3D scene understanding and precise geometric supervision, enabling the creation of video outputs with high visual fidelity, spatial coherence, and controllable motion or viewpoint.
1. Foundational Principles and Motivations
Traditional trajectory-conditioned video generation architectures exhibit significant limitations when using only 2D spatial cues or latent manipulations, as these frequently yield videos with unrealistic motions, geometric inconsistencies, or broken subject-background alignments. The integration of depth information, whether derived from monocular estimation, explicit depth maps, or 3D scene representations, constrains motion and synthesis to respect scene geometry, perspective scaling, and camera or object trajectories defined in 3D space.
Motivating use cases include controllable motion transfer in creative media, robotic demonstration synthesis, simulation for embodied AI, and viewpoint-driven scene generation for 3D reconstruction or autonomous driving. Depth-aware conditioning enables models to maintain on-manifold image realism while precisely following user-specified spatial paths, addressing issues of drift, perspective distortion, and implausible object dynamics seen in prior efforts (Zhang et al., 8 Sep 2025, Bai et al., 16 Dec 2025, Liu et al., 6 Aug 2025, Li et al., 3 Dec 2025).
2. Trajectory and Depth Encoding Mechanisms
Depth-aware trajectory conditioning operates via multiple encoding strategies tailored to the intended form of motion control, object-centric manipulation, or camera movement:
- Monocular Depth Estimation and Kinematic Projection: Models such as Zo3T extract scene depth from initial frames using pre-trained monocular estimators, mapping 2D trajectory annotations into the 3D scene via known camera intrinsics. Affine or projective transformations are then applied to define per-frame spatial correspondences, enabling perspective-correct motion, scaling, and mask-based region priors for subsequent diffusion steps (Zhang et al., 8 Sep 2025); a minimal lifting sketch follows this list.
- 3D Trajectory Parameterization: Systems in robotics settings (e.g., DRAW2ACT) represent user-specified motion as a 3D curve $\{(x_t, y_t, d_t)\}_{t=1}^{T}$, where $d_t$ encodes relative depth; trajectories are color-coded and embedded in reference images, and semantic/appearance features are extracted and propagated along the path (Bai et al., 16 Dec 2025).
- Explicit Camera Pose and Plücker-ray Encoding: Scene-level video generation frameworks (IDC-Net, ReCamDriving) condition on a sequence of extrinsic camera poses $\{[R_t \mid \mathbf{t}_t]\}_{t=1}^{T}$, representing camera rotation and translation. Dense ray representations or derived tokens (Plücker coordinates) are injected into diffusion transformers via dedicated attention interfaces to ensure fine-grained geometric and temporal alignment (Liu et al., 6 Aug 2025, Li et al., 3 Dec 2025); a Plücker-ray sketch is given after this list.
- 3DGS and Geometry-Driven Rendering: Dense geometric supervision is accomplished by constructing 3D Gaussian Splatting (3DGS) models from monocular sequences; these enable the rendering of explicit depth-aware RGB frames for arbitrary novel viewpoints, which, when encoded and injected into the generative pipeline, provide strong priors for structure and continuity (Li et al., 3 Dec 2025).
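As referenced in the first item above, the following is a minimal NumPy sketch of depth-based trajectory lifting under a pinhole camera model with known intrinsics; the function names and the scale-prior heuristic are illustrative, not the Zo3T implementation.

```python
import numpy as np

def lift_trajectory(points_2d, depth, K):
    """Unproject 2D trajectory waypoints (u, v) into 3D camera coordinates
    using a monocular depth map and pinhole intrinsics K (3x3)."""
    fx, fy, cx, cy = K[0, 0], K[1, 1], K[0, 2], K[1, 2]
    pts_3d = []
    for u, v in points_2d:
        z = depth[int(round(v)), int(round(u))]   # estimated depth at the waypoint
        pts_3d.append(((u - cx) * z / fx, (v - cy) * z / fy, z))
    return np.asarray(pts_3d)

def project(pts_3d, K):
    """Project 3D camera-space points back onto the image plane."""
    uv = pts_3d @ K.T
    return uv[:, :2] / uv[:, 2:3]

def scale_prior(pts_3d, ref_idx=0):
    """Perspective-correct scale factor per waypoint: an object moving toward
    the camera (smaller z) should appear proportionally larger."""
    return pts_3d[ref_idx, 2] / pts_3d[:, 2]
```

The reprojected 2D positions and per-frame scale factors can then be rasterized into the mask-based region priors mentioned above.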
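Similarly, here is a hedged sketch of per-pixel Plücker-ray conditioning, assuming world-to-camera extrinsics (R, t) and a pinhole intrinsic matrix K; the tensor layout is an assumption rather than the exact IDC-Net or ReCamDriving interface.

```python
import torch

def plucker_rays(K, R, t, H, W):
    """Per-pixel Plücker ray embedding (6 channels: direction, moment).

    K: (3, 3) intrinsics; R, t: world-to-camera rotation (3, 3) and translation (3,),
    so the camera center in world coordinates is o = -R^T t.
    Returns a (6, H, W) tensor usable as a dense conditioning map.
    """
    device = K.device
    v, u = torch.meshgrid(
        torch.arange(H, device=device, dtype=torch.float32),
        torch.arange(W, device=device, dtype=torch.float32),
        indexing="ij",
    )
    pix = torch.stack([u + 0.5, v + 0.5, torch.ones_like(u)], dim=-1)    # (H, W, 3) pixel coords
    dirs_cam = pix @ torch.linalg.inv(K).T                               # camera-space ray directions
    dirs_world = dirs_cam @ R                                            # apply R^T to move into the world frame
    dirs_world = dirs_world / dirs_world.norm(dim=-1, keepdim=True)
    origin = (-R.T @ t).view(1, 1, 3).expand_as(dirs_world)              # camera center in world coords
    moment = torch.cross(origin, dirs_world, dim=-1)                     # Plücker moment o x d
    return torch.cat([dirs_world, moment], dim=-1).permute(2, 0, 1)      # (6, H, W)
```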
3. Diffusion Model Adaptations and Architecture
Depth-aware, trajectory-conditioned video generation predominantly employs diffusion-based architectures, with task-specific modifications to incorporate geometric conditioning and inter-modal alignment:
- Latent Diffusion with Joint RGB-Depth VAE Embeddings: Input RGB and depth maps are encoded together into a shared latent, with coordinated noise injection and denoising across modalities, enabling the model to synthesize consistent paired video and depth sequences. The denoising backbone is frequently a Diffusion Transformer (DiT) or U-Net variant, with auxiliary cross-attention heads for trajectory and geometry tokens (Bai et al., 16 Dec 2025, Liu et al., 6 Aug 2025).
- Geometry-Aware Attention Blocks: Cross-attention layers integrate camera or trajectory tokens into each denoising step, aligning the generated latent with prescribed path geometry (IDC-Net, ReCamDriving); a minimal cross-attention sketch follows this list. In some approaches, the diffusion transformer includes dedicated "rendering attention" for feature fusion from rendered depth or 3DGS priors, with core denoising weights partially frozen to prevent overfitting (Liu et al., 6 Aug 2025, Li et al., 3 Dec 2025).
- Test-Time Adaptation and Guidance Rectification: Zo3T utilizes test-time LoRA injection (ephemeral low-rank adapters optimized per instance; see the adapter sketch below) in conjunction with "regional feature consistency" losses and Fourier-domain fusion to maintain both high-frequency detail and on-manifold sampling during motion dragging. Guidance field rectification leverages a single-step lookahead gradient and latent blending to correct classifier-free guidance directionality for precise local control (Zhang et al., 8 Sep 2025).
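As noted in the list above, the following is a minimal PyTorch sketch of a geometry-aware cross-attention block; the gating scheme and projection layer are illustrative assumptions, not the exact IDC-Net or ReCamDriving modules.

```python
import torch
import torch.nn as nn

class GeometryCrossAttention(nn.Module):
    """Minimal cross-attention block: video latent tokens (queries) attend to
    camera/trajectory tokens (keys/values), e.g. pose embeddings or pooled
    Plücker-ray features, alongside a (possibly frozen) denoising backbone."""

    def __init__(self, dim: int, geo_dim: int, num_heads: int = 8):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.to_kv = nn.Linear(geo_dim, dim)       # project geometry tokens into the latent width
        self.gate = nn.Parameter(torch.zeros(1))   # zero-init gate: block starts as an identity

    def forward(self, x, geo_tokens):
        # x:          (B, N, dim)     video latent tokens at one denoising step
        # geo_tokens: (B, M, geo_dim) per-frame camera poses / trajectory embeddings
        kv = self.to_kv(geo_tokens)
        attn_out, _ = self.attn(self.norm(x), kv, kv, need_weights=False)
        return x + self.gate * attn_out            # residual injection of geometric context
```

Zero-initializing the gate lets the new conditioning pathway be grafted onto a pretrained, partially frozen backbone without perturbing its behavior at the start of fine-tuning, consistent with the partial-freezing strategy described above.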
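And here is a rough sketch of ephemeral test-time LoRA adaptation driven by a regional feature-consistency loss, in the spirit of Zo3T; the adapter placement, rank, loss form, and optimizer settings are all assumptions made for illustration.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen linear layer plus a low-rank adapter; only A and B are optimized
    at test time and can be discarded afterwards (ephemeral adaptation)."""

    def __init__(self, base: nn.Linear, rank: int = 4, scale: float = 1.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)                # backbone weights stay frozen
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 1e-3)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))  # zero-init: adapter starts as a no-op
        self.scale = scale

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

def adapt_per_instance(adapters, render_features, src_feats, region_mask,
                       steps=20, lr=1e-3):
    """Optimize only the adapter parameters so denoiser features inside the
    dragged region stay close to the source-frame features (regional consistency)."""
    params = [p for m in adapters for p in m.parameters() if p.requires_grad]
    opt = torch.optim.AdamW(params, lr=lr)
    for _ in range(steps):
        feats = render_features()                  # differentiable features from the adapted model
        loss = ((feats - src_feats) * region_mask).pow(2).mean()
        opt.zero_grad()
        loss.backward()
        opt.step()
    return adapters
```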
4. Training Objectives and Loss Functions
Training of depth-aware, trajectory-conditioned video diffusion models employs a combination of generative losses and geometric or semantic consistency constraints (a combined-loss sketch follows this list):
- Diffusion/Score-Matching Objectives: Standard denoising score-matching is adopted, minimized jointly over RGB and depth latents, with random noise levels and time steps (Liu et al., 6 Aug 2025, Bai et al., 16 Dec 2025).
- Depth and Geometric Consistency Losses:
- Pixelwise depth reconstruction: an $\ell_1$ (or $\ell_2$) penalty between predicted and ground-truth depth maps, e.g., $\mathcal{L}_{\text{depth}} = \lVert \hat{D} - D \rVert_1$, averaged over frames.
- Metric consistency: Multi-view losses enforce that reprojected depth maps match rendered views from alternate camera poses, enhancing geometric alignment across frames (Liu et al., 6 Aug 2025).
- Regional feature or cross-modality consistency: Feature distances between target-region features (e.g., via DINOv2 or U-Net activations) maintain object identity and trajectory adherence through temporal evolution (Bai et al., 16 Dec 2025, Zhang et al., 8 Sep 2025).
- Policy Learning: For robotic demonstration settings, generated RGB+depth sequences condition a downstream policy network to predict kinematic trajectories or joint angles, trained via regression and classification losses over real or simulated action data (Bai et al., 16 Dec 2025).
- Regularization for Stability: Stage-wise model freezing and secondary adaptation are used to separate global motion alignment from fine-scale structure (ReCamDriving), mitigating overfitting and ensuring model capacity is directed toward geometry rather than appearance matching (Li et al., 3 Dec 2025).
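As forward-referenced above, here is a hedged sketch of how these terms might be combined in a single training step; the `model.add_noise`, `model.predict_clean`, `vae.encode`, and `vae.decode` interfaces, the 4-channel RGB-D layout, and the loss weights are assumptions for illustration rather than any paper's exact objective.

```python
import torch
import torch.nn.functional as F

def training_step_loss(model, vae, x_rgb, x_depth, cond, depth_gt=None,
                       feat_fn=None, src_feat=None, region_mask=None,
                       w_depth=0.5, w_feat=0.1):
    """Composite objective: joint RGB-D denoising score matching, plus optional
    pixelwise depth and regional feature-consistency terms (weights illustrative)."""
    # 1. Encode RGB and depth into a shared latent and apply the standard
    #    epsilon-prediction (denoising score-matching) loss.
    z0 = vae.encode(torch.cat([x_rgb, x_depth], dim=1))        # shared RGB-D latent (assumed interface)
    t = torch.randint(0, 1000, (z0.shape[0],), device=z0.device)
    noise = torch.randn_like(z0)
    z_t = model.add_noise(z0, noise, t)                        # forward diffusion q(z_t | z_0)
    eps_hat = model(z_t, t, cond)                              # cond: trajectory / camera tokens
    loss = F.mse_loss(eps_hat, noise)

    # 2./3. Auxiliary terms computed on the predicted clean sample.
    z0_hat = model.predict_clean(z_t, eps_hat, t)              # x0-estimate from epsilon (assumed helper)
    decoded = vae.decode(z0_hat)                               # (B, 4, H, W) per frame: RGB + depth (layout assumed)

    if depth_gt is not None:                                   # pixelwise depth reconstruction (L1)
        loss = loss + w_depth * F.l1_loss(decoded[:, 3:4], depth_gt)

    if feat_fn is not None:                                    # regional feature consistency (e.g., DINOv2)
        feat = feat_fn(decoded[:, :3])
        loss = loss + w_feat * ((feat - src_feat) * region_mask).pow(2).mean()
    return loss
```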
5. Datasets, Evaluation Protocols, and Benchmarks
Proper evaluation of these systems relies on datasets providing temporally aligned RGB, depth, and camera trajectory supervision:
- Curated Scene Datasets:
- RealEstate10K and DL3DV-10K: Provide indoor and real-world scenes with metric-aligned RGB, depth maps, and camera poses (Liu et al., 6 Aug 2025).
- ParaDrive: Comprises over 110K parallel-trajectory video pairs constructed via monocular 3DGS reconstruction and lateral offset rendering from autonomous driving datasets (Waymo Open Dataset, NuScenes), addressing the cross-trajectory generalization gap (Li et al., 3 Dec 2025).
- Robotic datasets: BridgeDataV2, Berkeley Autolab, and simulation environments supply paired RGB, depth, and action trajectories for manipulation tasks (Bai et al., 16 Dec 2025).
- Metrics:
- Visual Quality: FID, FVD, PSNR, SSIM, LPIPS.
- Geometric/Camera Metrics: Rotation error ($R_{\text{err}}$), translation error ($t_{\text{err}}$), view consistency via re-estimated poses.
- Task/Trajectory Accuracy: Object motion conformity (ObjMC), trajectory error (mean $\ell_2$ distance between generated and target paths), and downstream manipulation or navigation success rates (see the metric sketch after this list).
- Temporal coherence: CLIP-based frame and view similarity, background/subject consistency, temporal flicker stability.
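As referenced in the metrics list, the short NumPy sketch below shows how the pose and trajectory errors above are commonly computed; the pose conventions and the ObjMC analogy are assumptions.

```python
import numpy as np

def rotation_error_deg(R_est, R_gt):
    """Geodesic angle (degrees) between estimated and ground-truth rotation matrices."""
    cos = (np.trace(R_est.T @ R_gt) - 1.0) / 2.0
    return np.degrees(np.arccos(np.clip(cos, -1.0, 1.0)))

def translation_error(t_est, t_gt):
    """Euclidean distance between estimated and ground-truth camera positions."""
    return float(np.linalg.norm(np.asarray(t_est) - np.asarray(t_gt)))

def mean_trajectory_error(traj_est, traj_gt):
    """Mean pointwise L2 distance between generated and target trajectories
    (ObjMC-style object-motion conformity is computed analogously in image space)."""
    return float(np.mean(np.linalg.norm(np.asarray(traj_est) - np.asarray(traj_gt), axis=-1)))
```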
Results indicate that depth-aware conditioning yields consistent improvements across visual fidelity, motion adherence, and geometric stability benchmarks, often matching or outperforming supervised and zero-shot baselines (Zhang et al., 8 Sep 2025, Bai et al., 16 Dec 2025, Liu et al., 6 Aug 2025, Li et al., 3 Dec 2025).
6. Applications and Practical Utility
Depth-aware trajectory-conditioned video generation directly enables:
- Controllable object manipulation videos (DRAW2ACT): Synthesizing robotic demonstrations from single frames and 3D depth trajectories, supporting multimodal policy learning for robotic control and imitation (Bai et al., 16 Dec 2025).
- Synthesizing new viewpoints for 3D reconstruction (IDC-Net): Jointly generating metric-aligned RGB and depth video under explicit camera paths, allowing immediate downstream 3D point cloud or surface reconstruction (Liu et al., 6 Aug 2025).
- Novel trajectory video synthesis in driving scenarios: Precise camera-controlled scene generation with continuity across widely separated vehicle paths for simulation and planning (Li et al., 3 Dec 2025).
- Trajectory-guided creative media generation: Zero-shot dragging or animation of specified entities within video frames, maintaining visual fidelity and physical plausibility (Zhang et al., 8 Sep 2025).
7. Limitations and Future Directions
Common limitations include:
- Depth Estimation Sensitivity: Errors in monocular or cross-modal depth can cause mis-scaled objects, particularly on reflective or transparent surfaces (Zhang et al., 8 Sep 2025, Bai et al., 16 Dec 2025).
- Occlusion and Non-Rigid Effects: Severe occlusions, complex deformations, or non-rigid objects (cloth, hair) may undermine region consistency or drift under trajectory propagation.
- Scalability and Efficiency: Test-time optimization, especially with adaptation layers or fine-grained geometric control, increases computational requirements and may limit sequence horizon (Zhang et al., 8 Sep 2025, Bai et al., 16 Dec 2025).
- Single-object Constraints: Many current systems control only one trajectory or region per sequence; multi-object control and long-horizon generation remain open challenges (Bai et al., 16 Dec 2025).
- Dataset Dependence: Accurate camera, depth, and semantic alignment supervision is critical; synthetic or poorly curated data induce geometry hallucinations or performance degradation (Liu et al., 6 Aug 2025, Li et al., 3 Dec 2025).
Future work targets enhanced multi-object and long-range planning, efficient test-time scheduling, joint audio-visual conditioning, and tighter integration with real-time physics and multi-view supervision (Zhang et al., 8 Sep 2025, Bai et al., 16 Dec 2025, Liu et al., 6 Aug 2025).
Key References:
- "Zero-shot 3D-Aware Trajectory-Guided image-to-video generation via Test-Time Training" (Zhang et al., 8 Sep 2025)
- "DRAW2ACT: Turning Depth-Encoded Trajectories into Robotic Demonstration Videos" (Bai et al., 16 Dec 2025)
- "IDCNet: Guided Video Diffusion for Metric-Consistent RGBD Scene Generation with Precise Camera Control" (Liu et al., 6 Aug 2025)
- "ReCamDriving: LiDAR-Free Camera-Controlled Novel Trajectory Video Generation" (Li et al., 3 Dec 2025)