
Zero-shot 3D Trajectory Video Generation

Updated 16 March 2026
  • Recent papers introduce frameworks that synthesize high-fidelity videos from a single image without test-time fine-tuning by leveraging latent diffusion models and explicit 3D representations.
  • They integrate user-provided 3D trajectory information using methods such as sparse mask clustering and affine warping to ensure geometric fidelity and temporal consistency.
  • Empirical evaluations show robust performance with improved visual coherence, demonstrating flexible control over object and camera motion in a zero-shot regime.

Zero-shot 3D-aware trajectory-guided image-to-video generation encompasses methods that synthesize temporally coherent, photo-realistic videos from a single input image, with object or scene motion constrained to faithfully follow user-specified 3D spatial trajectories or camera paths—while requiring no test-time fine-tuning or access to paired 3D+photograph datasets. Recent approaches leverage a combination of latent diffusion models, explicit 3D representations, and novel guidance and training mechanisms to achieve fine-grained 3D motion control in a zero-shot regime. The principal challenge is to reconcile geometric fidelity and temporal consistency while enabling flexible, intuitive user control over both object and viewpoint trajectories.

1. 3D Trajectory Representation and User Interaction

Methods in this class depend on precise representations of the intended 3D motion, encoded directly from user input or scene geometry.

  • LeviTor (Wang et al., 2024) abstracts each object's 2D segmentation mask into a sparse set of cluster points using $K$-means clustering in image space. Each point is augmented with relative depth $d_t^i$ obtained from a monocular depth prediction. At inference, users provide a 2D stroke with per-point depth annotation; these points are lifted via the camera intrinsics $K$ into 3D, translated according to the specified trajectory $T$, and reprojected into 2D view space via the pinhole projection

$$[X'_i, Y'_i, Z'_i]^\top = K^{-1}[x_i, y_i, 1]^\top d_i + T, \qquad [x'_i, y'_i]^\top = \Pi([X'_i, Y'_i, Z'_i]^\top).$$

This process generates dense, physically plausible mask warps over time and fully encodes out-of-plane motion and occlusion effects (a minimal numerical sketch of this lift-and-reproject step follows the list below).

  • Zo3T (Zhang et al., 8 Sep 2025) proposes 3D-Aware Kinematic Projection, where user-specified object boxes or points are warped across frames using affine transformations parameterized by depth-inferred scales and translations:

$$A_k = \begin{pmatrix} \sigma_k & 0 & u_k - \sigma_k u_0 \\ 0 & \sigma_k & v_k - \sigma_k v_0 \\ 0 & 0 & 1 \end{pmatrix},$$

with $\sigma_k = d_0/d_k$. Masks for guiding motion are rasterized from the warped boxes for each frame (see the affine-warp sketch after this list).

  • VideoFrom3D (Kim et al., 22 Sep 2025) incorporates full scene-level 3D: users provide a coarse triangle mesh $M$, a camera trajectory $\tau = \{p_0, \dots, p_N\}$, and a style image $I_0$. The geometry is preprocessed to extract structural edge maps per view via HED, and dense 3D-2D correspondences for optical flow via mesh back-projection.
  • Pixel-to-4D (Almeida et al., 2 Jan 2026) lifts the input image into a 3D Gaussian splat field by leveraging monocular depth (Depth-Pro) and DINOv2 features in a U-Net encoder/decoder; object motions are sampled and assigned per splat via a conditional VAE, and future state is evaluated by classical kinematic equations, supporting fully continuous, camera-guided motion trajectories.
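To make the LeviTor-style lift-translate-reproject step concrete, here is a minimal NumPy sketch; the intrinsics, point coordinates, depths, and translation are illustrative placeholders, and $\Pi$ is taken to include the intrinsics.

```python
import numpy as np

def lift_translate_reproject(points_2d, depths, K, T):
    """Lift 2D cluster points with per-point depth into 3D camera space,
    translate them by the user trajectory T, and reproject with the pinhole model."""
    K_inv = np.linalg.inv(K)
    homog = np.hstack([points_2d, np.ones((points_2d.shape[0], 1))])   # [x_i, y_i, 1]
    pts_3d = (K_inv @ homog.T).T * depths[:, None] + T                 # K^{-1} [x_i, y_i, 1]^T d_i + T
    proj = (K @ pts_3d.T).T                                            # Pi(.), with intrinsics folded in
    return proj[:, :2] / proj[:, 2:3], pts_3d[:, 2]                    # new 2D points and their depths

# Illustrative 3-point cluster pushed away from the camera and to the right.
K = np.array([[500.0, 0.0, 320.0],
              [0.0, 500.0, 240.0],
              [0.0,   0.0,   1.0]])
pts = np.array([[300.0, 220.0], [330.0, 240.0], [310.0, 260.0]])
depths = np.array([2.0, 2.1, 2.0])
new_xy, new_depth = lift_translate_reproject(pts, depths, K, T=np.array([0.3, 0.0, 0.5]))
```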
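Similarly, a brief sketch of how a Zo3T-style per-frame affine $A_k$ might be assembled and applied to a reference box; the box corners, target centers, and depths below are placeholders rather than values from the paper.

```python
import numpy as np

def kinematic_affine(u_k, v_k, d_k, u_0, v_0, d_0):
    """Per-frame affine A_k with scale sigma_k = d_0 / d_k, mapping the reference box
    (centered at (u_0, v_0) at depth d_0) to its location at frame k."""
    sigma = d_0 / d_k
    return np.array([[sigma, 0.0,   u_k - sigma * u_0],
                     [0.0,   sigma, v_k - sigma * v_0],
                     [0.0,   0.0,   1.0]])

# Illustrative: warp the reference box corners for a frame where the object has moved closer.
corners = np.array([[100.0,  80.0, 1.0], [180.0,  80.0, 1.0],
                    [180.0, 160.0, 1.0], [100.0, 160.0, 1.0]])   # homogeneous coordinates
A_5 = kinematic_affine(u_k=150.0, v_k=120.0, d_k=1.5, u_0=140.0, v_0=120.0, d_0=2.0)
warped_corners = (A_5 @ corners.T).T[:, :2]   # corners of the rasterized guidance mask at frame 5
```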

2. Integration with Video Synthesis Architectures

Different frameworks for 3D-aware, trajectory-guided image-to-video generation distinguish themselves by the method of integrating trajectory information into video generation backbones:

LeviTor augments the backbone Stable Video Diffusion (latent DDPM) by concatenating a dense control map, which consists of rasterized cluster-point heatmaps (for $(x, y)$), depth, and instance labels, to the U-Net at each block via ControlNet. Conditioning is applied as:

$$f_l' = f_l + g_l(\mathrm{Downsample}_l(c_{\mathrm{traj}})),$$

enabling flexible injection of time-varying, spatially localized 3D trajectory signals at all U-Net scales (a PyTorch sketch of this injection appears after the list below).

  • Test-Time Adaptive Guidance:

Zo3T introduces Trajectory-Guided Test-Time LoRA adapters, low-rank decompositions injected into the main U-Net, which are optimized together with the latent variable during early denoising steps. A regional feature-consistency loss,

$$\mathcal{J}_{\mathrm{TTT}}(z_t, \theta') = \sum_{b=1}^{M} \sum_{l\in\mathcal{L}} w_l \sum_{k=2}^{N_f} \big\| G_b \odot \big( F_{l,k}[M_{b,k}] - F_{l,1}^{(\mathrm{frozen})}[M_{b,1}] \big) \big\|_F^2,$$

aligns the evolving latent with the desired mask regions, while a Guidance Field Rectification module corrects the diffusion vector field via one-step lookahead gradient descent on a kinematic consistency objective (a simplified sketch of the consistency term follows this list).

  • Hybrid Image/Video Diffusion Pipelines:

VideoFrom3D divides the synthesis process into two stages: (1) Sparse Anchor-view Generation (SAG), producing high-fidelity anchor frames with image diffusion (FLUX-dev + ControlNet, augmented by prompt-tuned LoRA), and (2) Geometry-guided Generative Inbetweening (GGI), using a video diffusion prior (CogVideoX-5B) conditioned on endpoints and optical flow–encoded guidance derived from the 3D mesh and camera trajectory.

  • Non-Diffusion, Explicit 3D Field Rendering:

Pixel-to-4D eschews iterative, stochastic refinement. Given the initial 4D (3D+time) Gaussian splat field, future video frames are rendered deterministically for arbitrary camera and object motion via fast Gaussian splatting, with one forward network pass and no diffusion.
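As a concrete illustration of the additive conditioning pathway $f_l' = f_l + g_l(\mathrm{Downsample}_l(c_{\mathrm{traj}}))$ used by LeviTor, here is a minimal PyTorch sketch; the module name, channel counts, and zero-initialization choice are assumptions for the example, not the released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TrajectoryInjection(nn.Module):
    """Illustrative ControlNet-style branch: the trajectory control map is resized to
    each U-Net scale and added to that block's features through a zero-initialized
    1x1 convolution g_l, so training starts from the unmodified base model."""
    def __init__(self, control_channels, feature_channels):
        super().__init__()
        self.g = nn.Conv2d(control_channels, feature_channels, kernel_size=1)
        nn.init.zeros_(self.g.weight)
        nn.init.zeros_(self.g.bias)

    def forward(self, f_l, c_traj):
        c_l = F.interpolate(c_traj, size=f_l.shape[-2:], mode="bilinear",
                            align_corners=False)   # Downsample_l(c_traj)
        return f_l + self.g(c_l)                   # f_l' = f_l + g_l(...)

# Placeholder usage: a 6-channel control map (heatmaps, depth, instance labels).
inject = TrajectoryInjection(control_channels=6, feature_channels=320)
f_l_prime = inject(torch.randn(1, 320, 32, 32), torch.randn(1, 6, 64, 64))
```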
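The Zo3T regional feature-consistency term can likewise be approximated in a few lines; this sketch pools features inside each object mask rather than indexing and aligning masked pixels, so it is a simplification of $\mathcal{J}_{\mathrm{TTT}}$ rather than the paper's exact objective.

```python
import torch

def masked_mean(feat, mask, eps=1e-6):
    """Average a (C, H, W) feature map over a (1, H, W) mask region, returning (C,)."""
    return (feat * mask).sum(dim=(-2, -1)) / (mask.sum() + eps)

def regional_feature_consistency(feats, frozen_feats, masks, layer_weights):
    """For each object, U-Net layer, and frame k > 1, pull the mask-pooled features of
    the adapted model toward the frozen first-frame features pooled over the object's
    initial mask; the per-layer weights w_l mirror the formula above."""
    loss = torch.zeros(())
    for l, w_l in enumerate(layer_weights):
        F_l = feats[l]                  # (num_frames, C, H, W) from the adapted U-Net
        F_l_frozen = frozen_feats[l]    # same shape, from the frozen reference pass
        for mask_seq in masks:          # one (num_frames, 1, H, W) mask track per object
            ref = masked_mean(F_l_frozen[0], mask_seq[0])
            for k in range(1, F_l.shape[0]):
                cur = masked_mean(F_l[k], mask_seq[k])
                loss = loss + w_l * (cur - ref).pow(2).sum()
    return loss
```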
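Finally, the deterministic Pixel-to-4D rendering loop can be caricatured as a kinematic advance of splat centers followed by projection under the requested camera path; projecting point centers with a pinhole model stands in for full Gaussian splatting here, and all values (splat field, velocities, intrinsics, poses) are placeholders.

```python
import numpy as np

def advance_splats(positions, velocities, accelerations, t):
    """Constant-acceleration update applied independently to every splat center:
    p(t) = p0 + v t + 0.5 a t^2 (stand-in for the per-splat motion sampled by the VAE)."""
    return positions + velocities * t + 0.5 * accelerations * t ** 2

def project_points(points, K, world_to_cam):
    """Pinhole projection of splat centers for a 4x4 world-to-camera pose; a real
    renderer would rasterize the full 3D Gaussians instead of their centers."""
    homog = np.hstack([points, np.ones((points.shape[0], 1))])
    cam = (world_to_cam @ homog.T).T[:, :3]
    pix = (K @ cam.T).T
    return pix[:, :2] / pix[:, 2:3]

# Placeholder field of 1,000 splats drifting forward, seen from a camera panning right.
rng = np.random.default_rng(0)
pos = rng.uniform(-1, 1, size=(1000, 3)) + np.array([0.0, 0.0, 4.0])
vel = np.tile(np.array([0.0, 0.0, 0.5]), (1000, 1))
acc = np.zeros_like(pos)
K = np.array([[500.0, 0.0, 320.0], [0.0, 500.0, 240.0], [0.0, 0.0, 1.0]])
frames = []
for t in np.linspace(0.0, 1.0, 14):      # 14 output frames, one diffusion-free pass each
    pose = np.eye(4)
    pose[0, 3] = -0.2 * t                 # camera translating along x over time
    frames.append(project_points(advance_splats(pos, vel, acc, t), K, pose))
```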

3. Training Paradigms and Zero-Shot Generalization

The zero-shot property is operationalized by various combinations of conditional training and inference-phase adaptation:

  • Conditional Denoising Only:

LeviTor is fully trained on VOS videos using only the conditional denoising loss,

$$\mathcal{L}(\theta) = \mathbb{E}_{z^0, t, \epsilon \sim \mathcal{N}(0, I)} \big[ \|\epsilon - \epsilon_\theta(z_t; t, z^0, c_{\text{traj}})\|^2 \big],$$

and, after convergence, requires neither further data nor parameter updates per test case (a schematic of this objective appears after the list below).

  • Test-Time Training:

Zo3T enables rapid, zero-shot, trajectory-constrained adaptation by optimizing LoRA adapter weights and the current latent variable at each guided timestep of the diffusion process, using feature- and geometry-consistency losses. All adaptation is ephemeral; the underlying model is never retrained between tasks (see the adaptation-step sketch after this list).

  • Hybrid Diffusion with Minimal Supervision:

VideoFrom3D's image diffusion modules are prompt-fine-tuned via LoRA for style transfer on the input image only, while the main modules for geometry and video are trained entirely on unstructured video data with synthetic structural cues.

  • End-to-End Scene Representation:

Pixel-to-4D trains its scene-lifting and motion modules on large, generic datasets of video+depth pairs (KITTI, Waymo, RealEstate10K, DL3DV-10K), requiring no paired scene–photo supervision. At test time, any single image and trajectory suffice.
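A schematic PyTorch rendering of the conditional denoising objective above; the noise schedule, argument order, and model signature are assumptions for illustration and do not reproduce the Stable Video Diffusion training code.

```python
import torch

def conditional_denoising_loss(eps_model, z_video, z_first, c_traj, alphas_cumprod):
    """Noise the clean video latent at a random timestep and regress the added noise,
    conditioning on the first-frame latent z^0 and the trajectory control map c_traj."""
    b = z_video.shape[0]
    t = torch.randint(0, alphas_cumprod.shape[0], (b,), device=z_video.device)
    eps = torch.randn_like(z_video)
    a_bar = alphas_cumprod[t].view(b, *([1] * (z_video.dim() - 1)))
    z_t = a_bar.sqrt() * z_video + (1.0 - a_bar).sqrt() * eps   # forward diffusion q(z_t | z_0)
    eps_pred = eps_model(z_t, t, z_first, c_traj)               # eps_theta(z_t; t, z^0, c_traj)
    return torch.mean((eps - eps_pred) ** 2)
```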
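And a sketch of the Zo3T-style test-time adaptation step; the inner-loop count, optimizer, and learning rate are placeholders, and the consistency objective is passed in as a callable (for instance, a regional feature term like the one sketched in Section 2).

```python
import torch

def test_time_adapt_step(z_t, lora_params, consistency_loss, num_inner_steps=3, lr=1e-2):
    """At a guided denoising step, jointly refine the transient LoRA parameters and the
    current latent z_t against a trajectory/feature consistency objective. The frozen
    backbone is untouched, and the adapters are re-initialized for every new task."""
    z_t = z_t.detach().requires_grad_(True)
    opt = torch.optim.Adam([z_t, *lora_params], lr=lr)
    for _ in range(num_inner_steps):
        opt.zero_grad()
        loss = consistency_loss(z_t, lora_params)
        loss.backward()
        opt.step()
    return z_t.detach()
```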

4. Loss Functions and Structural Objectives

Losses and guidance objectives are tailored to enforce both reconstruction fidelity and adherence to 3D-movement constraints.

  • LeviTor and VideoFrom3D primarily employ standard DDPM denoising (or VAE) loss, optionally augmented by perceptual LPIPS loss and edge map/flow-based synthetic guidance.
  • Zo3T supplements conditional diffusion losses with a feature-consistency objective over region masks and a guidance-field correction loss to ensure kinematic correctness over latent dynamics.
  • Pixel-to-4D jointly supervises LPIPS-based RGB consistency, depth reconstruction (mean relative error), L1 frame-difference constraints, and VAE latent regularization:

$$\mathcal{L} = \lambda_{\mathrm{rgb}}\mathcal{L}_{\mathrm{rgb}} + \lambda_{\mathrm{depth}}\mathcal{L}_{\mathrm{depth}} + \lambda_{\mathrm{rgbDiff}}\mathcal{L}_{\mathrm{rgbDiff}} + \lambda_{\mathrm{kl}}\mathcal{L}_{\mathrm{kl}}.$$
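A hedged sketch of how these four terms might be combined; the exact definitions of the relative-depth and frame-difference terms, the weights, and the LPIPS callable are assumptions for illustration, not the paper's precise formulation.

```python
import torch

def pixel_to_4d_style_loss(pred_rgb, gt_rgb, pred_depth, gt_depth, mu, logvar,
                           lpips_fn, weights):
    """Weighted combination of perceptual RGB, relative-depth, frame-difference,
    and KL regularization terms. pred_rgb / gt_rgb are (B, T, C, H, W)."""
    l_rgb = lpips_fn(pred_rgb, gt_rgb).mean()                                    # LPIPS-based RGB term
    l_depth = ((pred_depth - gt_depth).abs() / gt_depth.clamp(min=1e-6)).mean()  # mean relative error
    l_rgb_diff = (pred_rgb[:, 1:] - pred_rgb[:, :-1]
                  - (gt_rgb[:, 1:] - gt_rgb[:, :-1])).abs().mean()               # L1 on frame differences
    l_kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())              # VAE latent regularization
    return (weights["rgb"] * l_rgb + weights["depth"] * l_depth
            + weights["rgbDiff"] * l_rgb_diff + weights["kl"] * l_kl)
```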

5. Empirical Results and Ablation Studies

Comparative evaluations highlight key strengths and trade-offs among approaches in the emergent landscape:

| Model | FID (↓) | FVD (↓) | ObjMC (↓) | Notes |
|---|---|---|---|---|
| LeviTor | 25.41–28.33 | 190.44 | 25.97 | DAVIS, multi-point |
| DragAnything | 36.04 | 324.95 | 23.12 | Less accurate 3D control |
| Zo3T | 74.83 | 197.63 | 12.74 | VIPSeg, zero-shot |
| VideoFrom3D | — | — | — | High style/geometry match |
| Pixel-to-4D | — | 24–36.4 | — | PSNR 14.9–19.4; SOTA on driving datasets |
  • Consistently, including depth and multi-instance information is necessary for robust 3D occlusion and object-specific control (Wang et al., 2024, Zhang et al., 8 Sep 2025).
  • Removing depth supervision or 3D projection mechanisms degrades both FID/FVD and the visual plausibility of perspective (e.g., LeviTor ablations, Zo3T: FID increases from 74.83 to 76.12, ObjMC from 12.74 to 12.98).
  • Ablation of test-time LoRA or guidance rectification leads to motion artifacts, off-manifold collapse, and notable reductions in quantitative and qualitative fidelity (Zhang et al., 8 Sep 2025).
  • Explicit 3D scene models (Pixel-to-4D) yield the highest geometric and temporal coherence without flicker, and benefit from fast inference without diffusion (Almeida et al., 2 Jan 2026).

6. Methodological Insights and Significance

The field is rapidly exploring novel 3D representations and modular training schemes:

  • LeviTor advances sparse mask clustering plus per-point depth as a compact, practical form of 3D control signal for video diffusion, showing that mask abstraction into cluster points is effective for both flexibility and model adherence (Wang et al., 2024).
  • Zo3T demonstrates that local test-time U-Net adaptation (via transient LoRA adapters) unlocks robust, on-manifold generative alignment to user motion trajectories even for frozen video diffusion models (Zhang et al., 8 Sep 2025).
  • VideoFrom3D's two-stage procedure mirrors traditional animation pipelines (keyframe anchoring and inbetweening) but achieves style and shape consistency in a zero-shot way due to the synergy of image and video diffusion, reinforcing the value of compositional generation (Kim et al., 22 Sep 2025).
  • Pixel-to-4D establishes that explicit, differentiable 4D Gaussian splat fields allow for direct rendering of plausible, continuous-time camera and object trajectories without stochastic sampling or iterative refinement, representing a distinct move away from diffusion-centric regimes (Almeida et al., 2 Jan 2026).

7. Limitations and Future Directions

Despite significant advances, current approaches exhibit open challenges:

  • Perspective-correct occlusion and appearance changes remain imperfect where depth prediction, segmentation, or clustering is ambiguous (e.g., thin/small objects or complex topology).
  • Object-centric methods (e.g., LeviTor, Zo3T) rely on effective instance separation and mask tracking; failure modes may include mask leakage, blurred deformations, or rigid motion artifacts with too few or too many cluster points.
  • Test-time optimization (Zo3T) incurs significant computational overhead (e.g., ≈175 s per 576×1024×14-frame video) and elevated memory demands. Further acceleration and stabilization are warranted.
  • 3D scene–level approaches such as Pixel-to-4D depend on accurate monocular depth and segmentation; complex real-world occlusion or reflective/translucent surfaces may challenge current representations.
  • Style and perceptual matching across unseen domains may require further hybridization of image/video guidance, LoRA fine-tuning, or cross-modal priors.

Continued progress in scalable 3D scene understanding, adaptive diffusion control, and explicit neural representation is expected to expand the fidelity, applicability, and controllability of zero-shot, 3D-aware trajectory-guided image-to-video generation frameworks (Wang et al., 2024, Zhang et al., 8 Sep 2025, Kim et al., 22 Sep 2025, Almeida et al., 2 Jan 2026).
