Depth-Aware Trajectory-Conditioned Video Generation

Updated 16 March 2026
  • The central idea is to combine depth cues with trajectory conditioning so that generated video preserves consistent 3D structure and exhibits fewer artifacts.
  • Representative methods use dual-stream conditioning and diffusion-based architectures with LoRA adapters to control camera motion while maintaining geometric fidelity.
  • Empirical results on datasets such as MultiCam-WarpData show improved pose accuracy, better background consistency, and lower reprojection error.

Depth-aware trajectory-conditioned video generation comprises a family of methodologies for synthesizing temporally coherent videos whose content evolves in strict accordance with explicitly defined spatial trajectories, leveraging either explicit or implicit scene depth. These methods constitute the state of the art for precise geometric control in video synthesis tasks, particularly under novel camera viewpoints or object motion directives. The core challenge they address is the faithful preservation of content and 3D structure during controlled viewpoint changes, while avoiding artifacts endemic to shallow 2D techniques and traditional inpainting-based conditioning.

1. Foundations and Core Problem

Depth-aware trajectory-conditioned video generation aims to produce videos in which the camera or scene follows a user-specified trajectory, with the generative model leveraging depth cues to maintain geometric consistency and content identity. Two problems motivate the integration of accurate depth signals: direct pixel-space warping is ill-posed, and 2D-only control cannot account for parallax, occlusion, or correct perspective. The depth signal may take the form of explicit rendered maps, geometric proxies, or high-dimensional learned latents derived from reconstructed 3D/4D models (Chen et al., 15 Jan 2026, Bai et al., 16 Dec 2025, Li et al., 3 Dec 2025, Xie et al., 21 Jan 2026, Zhang et al., 8 Sep 2025).
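
To make the parallax argument concrete, the following is a minimal sketch of depth-based reprojection under a pinhole camera model. It is not taken from any of the cited systems; the function name `reproject`, the intrinsics `K`, and the camera motion are illustrative assumptions. It shows that the same pixel moves by very different amounts depending on its depth, which is information a 2D-only control signal does not carry.

```python
import numpy as np

def reproject(uv, depth, K, R, t):
    """Map source-camera pixels into a target camera using per-pixel depth.

    uv:    (N, 2) pixel coordinates in the source image
    depth: (N,)   depth along the source camera's z-axis
    K:     (3, 3) pinhole intrinsics (assumed shared by both cameras)
    R, t:  rotation (3, 3) and translation (3,) taking source-camera
           coordinates to target-camera coordinates
    """
    ones = np.ones((uv.shape[0], 1))
    pix_h = np.hstack([uv, ones])          # homogeneous pixel coordinates (N, 3)
    rays = pix_h @ np.linalg.inv(K).T      # back-project to rays with z = 1
    pts_src = rays * depth[:, None]        # 3D points in the source frame
    pts_tgt = pts_src @ R.T + t            # same points in the target frame
    proj = pts_tgt @ K.T                   # apply intrinsics
    return proj[:, :2] / proj[:, 2:3]      # perspective divide

# Illustrative intrinsics and motion (assumptions, not values from any paper).
K = np.array([[500.0, 0.0, 320.0],
              [0.0, 500.0, 240.0],
              [0.0, 0.0, 1.0]])
R = np.eye(3)
t = np.array([-0.1, 0.0, 0.0])   # camera translates 0.1 units to the right

uv = np.array([[320.0, 240.0], [320.0, 240.0]])   # same pixel, two hypothetical depths
depth = np.array([1.0, 10.0])                     # near surface vs. far surface

uv_new = reproject(uv, depth, K, R, t)
shift = uv_new[:, 0] - uv[:, 0]
print(shift)   # the near point shifts by -50 px, the far point by -5 px
```

The tenfold difference in horizontal shift is parallax: without a depth value per pixel, there is no principled way to decide how far each pixel should move for a given camera motion.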

Key challenges include:

  • Subject and background consistency under viewpoint changes,
  • Robust control over camera or object motion,
  • Avoidance of the “Inpainting Trap,” where models fix artifacts via content hallucination, leading to drifting identities or spatial distortions,
  • Scalability to dynamic scenes and multi-view settings.

2. Explicit Depth Supervision and Dual-Stream Conditioning

Explicit depth supervision directly leverages rendered depth maps and associated occlusion masks as conditioning input for generative models. In DepthDirector (Chen et al., 15 Jan 2026), this is operationalized via a dual-stream conditioning architecture:

  • The View branch processes warped depth and occlusion mask sequences rendered from the reconstructed 3D mesh under the new camera trajectory. Depth maps are normalized, transformed into the RGB domain (for VAE compatibility), and VAE-encoded to produce “view tokens.”
  • The Content branch provides encoded latent representations of the original source video, capturing appearance and motion priors.
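
The depth preprocessing in the View branch can be sketched as follows. The text above only states that depth maps are normalized and moved into the RGB domain; the specific choices here (sequence-level min-max normalization, channel replication, a [-1, 1] output range, and zeroing occluded pixels) are assumptions for illustration, and `depth_to_pseudo_rgb` is a hypothetical helper rather than DepthDirector's code.

```python
import numpy as np

def depth_to_pseudo_rgb(depth, mask, eps=1e-6):
    """Convert a depth sequence into a 3-channel tensor a video VAE could ingest.

    depth: (T, H, W) rendered depth in arbitrary positive units
    mask:  (T, H, W) boolean, True where the warped view has valid geometry
    Returns a float32 array of shape (T, 3, H, W) with values in [-1, 1].
    """
    valid = depth[mask]
    lo, hi = valid.min(), valid.max()             # range over valid pixels only
    norm = (depth - lo) / max(hi - lo, eps)       # min-max normalize
    norm = np.clip(norm, 0.0, 1.0)                # invalid pixels may fall outside
    norm = np.where(mask, norm, 0.0)              # occluded pixels are set to 0
    rgb = np.repeat(norm[:, None], 3, axis=1)     # replicate depth into 3 channels
    return (rgb * 2.0 - 1.0).astype(np.float32)   # rescale [0, 1] to [-1, 1]

rng = np.random.default_rng(0)
depth = rng.uniform(0.5, 20.0, size=(4, 8, 8))    # toy depth sequence
mask = depth < 15.0                               # toy validity mask
frames = depth_to_pseudo_rgb(depth, mask)
print(frames.shape)                               # (4, 3, 8, 8)
```

In a real pipeline the resulting frames would then be passed through the video VAE encoder to obtain the view tokens; that step is omitted here because it depends on the specific backbone.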

These two branches are fused: view tokens are linearly projected and added to the noise latents at each denoising step, while the patchified content branch is concatenated along the frame dimension, exposing the denoising transformer (e.g., Wan 2.2) to both appearance and geometry. This design enables the model to resolve ambiguities associated with occlusion, preserve dynamic scene content, and align rendered views to the precise target trajectory.
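
The fusion described above can be sketched at the level of tensor shapes. This is an illustrative sketch only: the dimensions, the zero-initialized projection `W_view`, and the function name `fuse` are assumptions, and the real model operates on patchified VAE latents inside a diffusion transformer.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes: frames, tokens per frame, model width, view-token width.
F, N, D, D_view = 4, 16, 32, 8

noise_latents = rng.standard_normal((F, N, D))      # noisy target-video tokens
view_tokens = rng.standard_normal((F, N, D_view))   # encoded depth + occlusion mask
content_tokens = rng.standard_normal((F, N, D))     # encoded source video

# Hypothetical linear projection. Zero initialization (an assumption) makes the
# geometry signal a no-op at the start of finetuning.
W_view = np.zeros((D_view, D))

def fuse(noise_latents, view_tokens, content_tokens, W_view):
    """Add projected view tokens to the noise latents, then append content frames."""
    geometry = view_tokens @ W_view                  # (F, N, D)
    target_stream = noise_latents + geometry         # additive geometric conditioning
    # Concatenate along the frame axis so attention sees both streams.
    return np.concatenate([target_stream, content_tokens], axis=0)

tokens = fuse(noise_latents, view_tokens, content_tokens, W_view)
print(tokens.shape)   # (8, 16, 32)
```

Adding the view tokens keeps the geometry signal spatially aligned with each target frame, while concatenating the content tokens along the frame dimension lets the transformer attend to source appearance without forcing pixel alignment.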

Ablation studies confirm that this dual-stream view-content fusion prevents degradation observed when relying on warped RGB alone, which propagates reprojection artifacts, or using only content without explicit geometric cues, which induces misaligned or distorted output (Chen et al., 15 Jan 2026).

3. Implicit Depth Conditioning and 4D Scene Latents

Some frameworks avoid explicit depth prediction and mesh rendering, instead leveraging high-dimensional scene representations that subsume geometry, appearance, and dynamics in learned latent spaces. LaVR (Xie et al., 21 Jan 2026) integrates the per-frame latent tokens of a large pretrained 4D reconstruction model (CUT3R). These latents are adapter-compressed and fed, along with pose-encoded trajectory vectors, into a denoising diffusion transformer.

The implicit depth and parallax priors within CUT3R tokens enable the model to generate spatially consistent video along new trajectories, while allowing the pretrained diffusion backbone to regularize and correct inconsistencies. Unlike with explicit conditioning, errors in geometric estimation do not produce catastrophic pixel-level artifacts; discrepancies are softened by the continuous, distributed nature of the latent space. Training combines denoising, perceptual, cycle-consistency, and pose-alignment loss terms to balance fidelity and geometric alignment.

Empirical results demonstrate superior cycle-consistency (PSNR/LPIPS), pose estimation accuracy, and avoidance of “holey” or warped regions compared to both explicit 3D and unconditioned models (Xie et al., 21 Jan 2026).

4. Flow-Matching, Diffusion Architectures, and Adapter Efficiency

Most contemporary methods employ diffusion-based generative backbones trained with flow-matching objectives rather than direct pixel-level regression. During training, the target latents are interpolated with Gaussian noise at a sampled timestep; the network predicts a velocity field in latent space and is trained to match the difference vector between the noise sample and the target latents. This setup is central in DepthDirector (Chen et al., 15 Jan 2026), ReCamDriving (Li et al., 3 Dec 2025), and related systems.
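
A minimal sketch of how one flow-matching training pair is formed may help. It assumes the common linear (rectified-flow) path in which t = 0 corresponds to clean latents and t = 1 to pure noise; the cited papers may use a different time convention or timestep sampling scheme, and the function names are illustrative.

```python
import numpy as np

def flow_matching_pair(x0, rng):
    """Build (x_t, t, v_target) for one training step.

    x0: (B, ...) clean target latents.
    Assumes the linear path x_t = (1 - t) * x0 + t * eps.
    """
    eps = rng.standard_normal(x0.shape)                  # Gaussian noise sample
    t_shape = (x0.shape[0],) + (1,) * (x0.ndim - 1)      # one t per batch element
    t = rng.uniform(0.0, 1.0, size=t_shape)
    x_t = (1.0 - t) * x0 + t * eps                       # point on the noise-data path
    v_target = eps - x0                                  # constant velocity along the path
    return x_t, t, v_target

def flow_matching_loss(v_pred, v_target):
    """Mean squared error between predicted and target velocity."""
    return float(np.mean((v_pred - v_target) ** 2))

rng = np.random.default_rng(0)
x0 = rng.standard_normal((2, 4, 8, 8))                   # toy batch of latents
x_t, t, v_target = flow_matching_pair(x0, rng)

# A network would map (x_t, t, conditioning) to v_pred; a zero predictor stands in here.
loss = flow_matching_loss(np.zeros_like(v_target), v_target)
print(loss)
```

Because the path is linear, the clean latent can be recovered as `x_t - t * v_target`, which is why predicting velocity is equivalent to predicting the endpoint of the path.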

To adapt large pretrained diffusion transformers to geometric conditioning, low-rank adaptation (LoRA) modules are introduced, either finetuned offline on top of a frozen backbone or injected at test time. This significantly reduces the trainable parameter count and preserves learned generative priors. In DepthDirector, only the adapter modules are updated during finetuning (Chen et al., 15 Jan 2026), which reduces data requirements and accelerates convergence. The same idea appears in zero-shot, test-time training regimes such as Zo3T (Zhang et al., 8 Sep 2025), where ephemeral LoRA adapters are co-optimized alongside the latent at inference for rapid local adaptation.
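
The parameter saving from LoRA can be illustrated with a single linear layer. This is a generic sketch of the standard LoRA formulation (frozen weight plus a scaled low-rank update), not code from any cited system; the layer width, rank, and `alpha` value are arbitrary assumptions.

```python
import numpy as np

class LoRALinear:
    """Linear layer y = x W^T with a trainable low-rank update (alpha / r) * B A."""

    def __init__(self, W, rank, alpha, rng):
        d_out, d_in = W.shape
        self.W = W                                            # frozen pretrained weight
        self.A = rng.standard_normal((rank, d_in)) * 0.01     # trainable, small random init
        self.B = np.zeros((d_out, rank))                      # trainable, zero init
        self.scale = alpha / rank

    def __call__(self, x):
        base = x @ self.W.T                                   # frozen path
        update = (x @ self.A.T) @ self.B.T                    # low-rank path
        return base + self.scale * update

    def trainable_params(self):
        return self.A.size + self.B.size

rng = np.random.default_rng(0)
W = rng.standard_normal((1024, 1024))                         # stand-in for a pretrained weight
layer = LoRALinear(W, rank=16, alpha=16.0, rng=rng)

x = rng.standard_normal((4, 1024))
y = layer(x)

ratio = layer.trainable_params() / W.size
print(layer.trainable_params(), ratio)                        # 32768 0.03125
```

With `B` initialized to zero the adapted layer initially reproduces the pretrained layer exactly, so finetuning starts from the backbone's existing generative prior.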

5. Variants: RGB+Depth Co-Generation, Robotic Policy Integration, and Multi-Modal Control

Expansions of the basic depth-aware trajectory-conditioned paradigm integrate multi-modal targets, such as co-generated RGB and depth sequences for downstream policy learning. DRAW2ACT (Bai et al., 16 Dec 2025) exemplifies this approach by producing both RGB and aligned depth videos from depth-encoded trajectory conditions, semantic object features, and coordinate-augmented textual prompts. These outputs are then consumed by a multimodal spatial-temporal transformer, fusing RGB and depth representations for closed-loop robotic manipulation policy training.

Video and depth LPIPS/SSIM/PSNR scores, together with trajectory and task-level success metrics (e.g., robot joint regression accuracy), show significant performance gains from co-generated depth and 3D-aware trajectory representations relative to 2D-only or unimodal approaches (Bai et al., 16 Dec 2025).

6. Datasets, Evaluation Protocols, and Empirical Outcomes

Representative datasets include:

  • MultiCam-WarpData (Chen et al., 15 Jan 2026): Synthetic 8K-video corpus with synchronized multi-trajectory dynamic scenes (Unreal Engine), designed for re-rendering and camera control supervision.
  • ParaDrive (Li et al., 3 Dec 2025): 110K driving video pairs constructed via 3DGS-based rendering and cross-trajectory pairing, circumventing depth sparsity and enabling large-scale, camera-controllable training.
  • MultiCamVideo (Xie et al., 21 Jan 2026), BridgeData V2 (Bai et al., 16 Dec 2025), and others focused on scene re-rendering and embodied robotic action.

Benchmarking combines geometric accuracy metrics (rotation and translation error via MegaSaM, trajectory L₁ error), image- and video-quality metrics (PSNR, SSIM, LPIPS, FVD), and domain-specific indices such as VBench subject/background consistency and robotic success rates.
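
For reference, common definitions of three of these metrics can be sketched as below. These are standard textbook forms (geodesic rotation angle, Euclidean translation distance, PSNR); the cited papers may differ in details such as per-frame averaging or scale alignment of the estimated poses, which are not specified here.

```python
import numpy as np

def rotation_error_deg(R_pred, R_gt):
    """Geodesic angle in degrees between two 3x3 rotation matrices."""
    cos = (np.trace(R_pred.T @ R_gt) - 1.0) / 2.0
    return float(np.degrees(np.arccos(np.clip(cos, -1.0, 1.0))))

def translation_error(t_pred, t_gt):
    """Euclidean distance between predicted and reference camera translations."""
    return float(np.linalg.norm(np.asarray(t_pred) - np.asarray(t_gt)))

def psnr(img, ref, max_val=1.0):
    """Peak signal-to-noise ratio in dB for images with values in [0, max_val]."""
    mse = np.mean((np.asarray(img) - np.asarray(ref)) ** 2)
    if mse == 0:
        return float("inf")
    return float(10.0 * np.log10(max_val ** 2 / mse))

def rot_z(deg):
    """Rotation about the z-axis by `deg` degrees (helper for the example)."""
    a = np.radians(deg)
    return np.array([[np.cos(a), -np.sin(a), 0.0],
                     [np.sin(a), np.cos(a), 0.0],
                     [0.0, 0.0, 1.0]])

rot_err = rotation_error_deg(rot_z(2.54), np.eye(3))          # about 2.54 degrees
trans_err = translation_error([3.0, 4.0, 0.0], [0.0, 0.0, 0.0])
quality = psnr(np.full((8, 8), 0.1), np.zeros((8, 8)))        # MSE = 0.01
print(rot_err, trans_err, quality)
```

In practice the predicted poses would come from running a pose estimator such as MegaSaM on the generated video and comparing against the requested trajectory.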

Across domains, explicit and implicit depth-aware methods outperform 2D/warp-only or inpainting-centric baselines in trajectory accuracy, visual quality, and multi-view consistency. For instance, DepthDirector achieves RotErr = 2.54°, RS = 0.689, and background consistency of 94.66% on MultiCam-WarpData, surpassing alternatives in both fidelity and geometric alignment (Chen et al., 15 Jan 2026). DRAW2ACT yields a median trajectory error of 19.88 px and a task success rate of 65.2%, versus 36.8% for the next-best method, Tora (Bai et al., 16 Dec 2025). LaVR exhibits the lowest pose errors (Abs t = 14.39 mm, Rel t = 7.8%, Rel R = 0.411°) and the best multi-view VBench scores (Xie et al., 21 Jan 2026).

Method Rot/Trans Err ↓ Background Cons. ↑ Matched Pixels ↑ Task Success ↑
DepthDirector 2.54°, 94.66% 94.66% 989
DRAW2ACT 0.9473 65.2% (robotics)
ReCamDriving 1.32°, 2.37 m 97.96 (CLIP-V)
LaVR 0.411°, 14.39 mm

These outcomes confirm that integrating depth-aware representations—explicit or implicit—enables precise, physically plausible video generation under user-controlled spatial trajectories, across both synthetic and real-world scenarios.

7. Zero-Shot and Test-Time Adaptation

Recent developments expand depth-aware trajectory control to the zero-shot setting, eliminating extensive offline training. Zo3T (Zhang et al., 8 Sep 2025) infers dense depth maps from a single image, computes calibrated affine transformations for user-directed trajectories, and injects LoRA-modulated adaptations at test time. This pipeline includes interactive latent/adapter optimization via regional feature consistency and rectified diffusion guidance fields, achieving 3D-realistic, perspective-correct motion and overcoming the limitations of precomputed supervised paradigms. This suggests a frontier where trajectory compliance and geometric realism can be realized in arbitrarily defined scenes from minimal input.
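
The role of depth in making a 2D trajectory perspective-correct can be illustrated with a simple pinhole argument: apparent size scales inversely with depth. The sketch below builds a 2x3 affine transform from a trajectory's start and end points and the depths sampled there. It is a hypothetical illustration of the general idea, not Zo3T's actual procedure; the function names and the example values are assumptions.

```python
import numpy as np

def depth_calibrated_affine(p_start, p_end, z_start, z_end):
    """2x3 affine moving a region from p_start to p_end with perspective scaling.

    Under a pinhole model an object at depth z appears with size proportional
    to 1 / z, so moving from z_start to z_end rescales it by z_start / z_end.
    """
    p_start = np.asarray(p_start, dtype=float)
    p_end = np.asarray(p_end, dtype=float)
    s = z_start / z_end
    # Maps x to s * (x - p_start) + p_end.
    offset = p_end - s * p_start
    return np.array([[s, 0.0, offset[0]],
                     [0.0, s, offset[1]]])

def apply_affine(A, pts):
    """Apply a 2x3 affine matrix to an (N, 2) array of pixel coordinates."""
    pts = np.asarray(pts, dtype=float)
    return pts @ A[:, :2].T + A[:, 2]

# Hypothetical example: an object dragged away from the camera (depth 2 to 4).
A = depth_calibrated_affine(p_start=(100.0, 200.0), p_end=(300.0, 180.0),
                            z_start=2.0, z_end=4.0)
corners = np.array([[100.0, 200.0],    # region centre
                    [110.0, 200.0]])   # a point 10 px to the right of the centre
moved = apply_affine(A, corners)
print(moved)   # centre lands on (300, 180); the 10 px offset shrinks to 5 px
```

Here the depths would come from a monocular depth estimate sampled along the user's trajectory; a purely 2D translation would move the region without shrinking it, which is the perspective error that depth-aware control is meant to avoid.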

References

  • "Beyond Inpainting: Unleash 3D Understanding for Precise Camera-Controlled Video Generation" (Chen et al., 15 Jan 2026)
  • "DRAW2ACT: Turning Depth-Encoded Trajectories into Robotic Demonstration Videos" (Bai et al., 16 Dec 2025)
  • "ReCamDriving: LiDAR-Free Camera-Controlled Novel Trajectory Video Generation" (Li et al., 3 Dec 2025)
  • "LaVR: Scene Latent Conditioned Generative Video Trajectory Re-Rendering using Large 4D Reconstruction Models" (Xie et al., 21 Jan 2026)
  • "Zero-shot 3D-Aware Trajectory-Guided image-to-video generation via Test-Time Training" (Zhang et al., 8 Sep 2025)
