Multi-View Trajectory-Video Control

Updated 24 November 2025
  • Multi-view trajectory-video control is defined by methods that use explicit 3D trajectory and SE(3) representations to synchronize multiple video views.
  • The approach integrates diffusion-based generative models with cross-view and temporal attention to ensure geometric and spatiotemporal consistency.
  • Applications range from robotics and autonomous driving simulations to virtual production, with challenges in computational cost and calibration accuracy.

Multi-view trajectory-video control defines a class of methods and models that deliver explicit, high-fidelity control over camera or object trajectories across multiple synchronized video views. Central to this paradigm is the tight linkage between 3D trajectories—often represented as per-frame SE(3) extrinsics, 6-DoF object pose sequences, or trajectory videos projected via camera geometry—and spatiotemporally consistent generative video models, particularly in diffusion-based frameworks. This technology is foundational in domains ranging from robot-embodiment world models and virtual production to controllable driving simulation, novel-view synthesis, and multi-entity motion editing. Multi-view trajectory-video control models are distinguished by their explicit trajectory parameterizations, geometric and cross-view attention mechanisms, and the capacity to generate or edit long, coherent videos under complex, user-driven camera or object motion constraints.

1. Canonical Problem Formulation and Trajectory Representations

Multi-view trajectory-video control formalizes the problem as generating a set of (potentially synchronized) videos, each associated with a unique, user-specified camera or object trajectory expressed in SE(3) or as explicit control signals. Core abstractions include:

  • Camera Trajectories: Represented as per-frame extrinsics $C_t = [R_t \mid T_t] \in \mathrm{SE}(3)$ and intrinsics $K$, or in spherical/polar coordinates (φ, θ, r) for canonicalization (Yang et al., 3 Apr 2025).
  • Object Trajectories/6-DoF Control: For N entities, each with 6-DoF poses $P_n = [R_n^{1:F}; T_n^{1:F}]$, fused via gated self-attention and plug-and-play object injectors (Fu et al., 10 Dec 2024).
  • Trajectory Videos: In robotics, multi-view trajectory video control projects 3D end-effector Cartesian trajectories into N synchronized 2D “glowing point” videos, one per calibrated camera, preserving temporal and spatial correspondences between action and observation (Su et al., 17 Nov 2025).

Explicit mathematical mappings are used to transform world-coordinate trajectories into image-plane paths for each view: $u_t = \Pi(K, R, t)\, X_t$, where $\Pi(K, R, t) = K\,[R \mid t]$, $X_t \in \mathbb{R}^{n \times 3}$ denotes the 3D entity positions, and projections are performed per frame and per camera.
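
A minimal sketch of this per-view projection, assuming known intrinsics $K$ and world-to-camera extrinsics $[R \mid t]$ for each calibrated camera (the array names, shapes, and example poses below are illustrative, not drawn from any specific paper):

```python
import numpy as np

def project_trajectory(X_world, K, R, t):
    """Project a 3D trajectory into one camera's image plane.

    X_world: (F, 3) world-frame positions, one per frame.
    K:       (3, 3) camera intrinsics.
    R, t:    (3, 3) rotation and (3,) translation of the world-to-camera extrinsics.
    Returns  (F, 2) pixel coordinates u_t = K [R | t] X_t after the perspective divide.
    """
    X_cam = X_world @ R.T + t          # world -> camera coordinates
    uvw = X_cam @ K.T                  # apply intrinsics
    return uvw[:, :2] / uvw[:, 2:3]    # perspective divide

# Example: one end-effector trajectory projected into two calibrated views.
F = 16
X = np.stack([np.linspace(0.0, 0.3, F),    # x sweeps sideways
              np.full(F, 0.1),             # constant height
              np.full(F, 1.5)], axis=1)    # 1.5 m in front of the cameras
K = np.array([[500.0, 0.0, 320.0],
              [0.0, 500.0, 240.0],
              [0.0, 0.0, 1.0]])
views = [(np.eye(3), np.zeros(3)),                    # camera 1 at the origin
         (np.eye(3), np.array([-0.2, 0.0, 0.0]))]     # camera 2 with a 0.2 m baseline
pixel_tracks = [project_trajectory(X, K, R, t) for R, t in views]  # one 2D track per view
```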

2. Core Architectures and Conditioning Mechanisms

Multi-view trajectory-video control models centre on large-scale, latent space video diffusion backbones, typically enhanced with specialized conditioning and attention for trajectory injection and cross-view consistency:

  • Trajectory Conditioning:
    • Camera trajectories are injected as explicit ray encodings or Plücker-coordinate embeddings aligned with the latent video grid (Chen et al., 6 Dec 2024, Xu et al., 14 Oct 2024).
    • Object and end-effector motion is supplied as 6-DoF pose token sequences or as per-view projected trajectory videos (Fu et al., 10 Dec 2024, Su et al., 17 Nov 2025).
  • Cross-View/Frame Attention:
    • View-integrated attention modules perform full dot-product attention across both the temporal (frames) and spatial (views) axes (Xu et al., 14 Oct 2024); a minimal sketch follows this list.
    • Epipolar-attention masks restrict inter-view attention to geometry-consistent correspondences, enforcing 3D-coherence without explicit losses (Kuang et al., 27 May 2024, Yao et al., 10 Sep 2024).
  • ControlNet/Injection Blocks:
    • Trajectory parameters are encoded via neural adapters, injected via learnable linear layers, and combined with the pre-trained backbone using minimal-train, zero-mean initializations (Yao et al., 10 Sep 2024, Xu et al., 16 Oct 2025).
    • The main denoiser is conditioned at early and intermediate steps with explicit trajectory tokens, with additional CLIP/text, lighting, and relighting controls for scene stylization (Xu et al., 16 Oct 2025).
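
A minimal sketch of the view-integrated attention idea, assuming latents shaped (batch, views, frames, tokens, channels); the module name and dimensions are illustrative, not a specific paper's implementation:

```python
import torch
import torch.nn as nn

class ViewIntegratedAttention(nn.Module):
    """Joint attention over the view and frame axes.

    Flattening the view/frame/token axes into one sequence lets every spatial
    token attend to every other token in every view and frame, which is the
    coupling that cross-view and temporal attention blocks rely on for
    spatiotemporal consistency.
    """
    def __init__(self, channels: int, heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(channels, heads, batch_first=True)

    def forward(self, z: torch.Tensor) -> torch.Tensor:
        b, v, f, n, c = z.shape
        seq = z.reshape(b, v * f * n, c)      # one long sequence per sample
        out, _ = self.attn(seq, seq, seq)     # full dot-product attention
        return out.reshape(b, v, f, n, c)

z = torch.randn(2, 2, 4, 16, 320)             # 2 views, 4 frames, 16 tokens, 320 channels
y = ViewIntegratedAttention(320)(z)           # same shape, now coupled across views/frames
```

Epipolar-masked variants pass an attention mask to the same call so that each token attends only to geometry-consistent correspondences in the other views.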

3. Training Objectives, Data, and Factorization

Models are trained via standard diffusion denoising losses or rectified flows, with architectural disentanglement of trajectory geometry and appearance/content. Notable strategies include:

  • Training Losses:
    • Denoising loss $L_{\text{diff}} = \mathbb{E}\big[\| \epsilon_\theta(z_t, t, c) - \epsilon \|_2^2\big]$, applied jointly over all views and control streams (Chen et al., 6 Dec 2024, Yang et al., 3 Apr 2025); a minimal sketch of this objective appears after this list.
    • Multi-view/object consistency terms, e.g., feature matching, pose error (RotErr, TransErr), and multi-term losses for identity, pose, and lighting (Xu et al., 16 Oct 2025, Xu et al., 14 Oct 2024).
  • Hybrid/Fine-Tuning Schemes:
    • Factorized fine-tuning alternates between spatial (multi-view image) and temporal (video clip) updates, permitting spatial blocks to generalize over viewpoints and temporal blocks to encode dynamics (Seo et al., 16 Jun 2025).
    • Domain adaptors (LoRA) and domain-specific plug-and-play injectors preserve the generalization capacity of frozen base models while adapting to domain shifts in multi-entity or synthetic content (Fu et al., 10 Dec 2024).
  • Training Data:
    • Mixtures of multi-view driving corpora (e.g., nuScenes), monocular camera-trajectory datasets (e.g., RealEstate10K), and synthetic multi-entity or trajectory-annotated sets such as 360°-Motion and OmniTr (Chen et al., 6 Dec 2024, Xu et al., 14 Oct 2024, Fu et al., 10 Dec 2024, Yang et al., 3 Apr 2025).
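
A hedged sketch of the joint denoising objective over all views; the denoiser interface, tensor shapes, and scheduler handling below are assumptions for illustration:

```python
import torch
import torch.nn.functional as F

def multiview_denoising_loss(denoiser, z0, cond, alphas_cumprod):
    """L_diff = E[ || eps_theta(z_t, t, c) - eps ||_2^2 ] over all views at once.

    z0:             (B, V, F, C, H, W) clean multi-view video latents.
    cond:           trajectory / text conditioning, passed through to the denoiser.
    alphas_cumprod: (T,) cumulative alpha-bar schedule of the diffusion process.
    """
    b = z0.shape[0]
    t = torch.randint(0, alphas_cumprod.shape[0], (b,), device=z0.device)
    abar = alphas_cumprod[t].view(b, 1, 1, 1, 1, 1)
    eps = torch.randn_like(z0)
    z_t = abar.sqrt() * z0 + (1.0 - abar).sqrt() * eps   # forward noising of every view
    eps_hat = denoiser(z_t, t, cond)                     # predicts noise for all views jointly
    return F.mse_loss(eps_hat, eps)
```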

4. Multi-Entity and Safety-Critical Extensions

The framework generalizes to multi-agent and safety-critical scenarios:

  • Multi-entity Control: 3DTrajMaster conducts plug-and-play fusion of prompt–trajectory pairs for N entities, with each trajectory governed by its own 6-DoF pose sequence. The object injector’s residual gated self-attention ensures pose conditioning minimally perturbs the video prior (Fu et al., 10 Dec 2024); a minimal gated-injection sketch follows this list.
  • Safety-critical Planning: SafeMVDrive integrates a vision-augmented, GRPO-finetuned trajectory selector with a two-stage (collision → evasion) controllable trajectory generator and a control-conditioned, UniMLVG-based video synthesizer. Diffusion models for trajectory and video stages are linked by 3D bounding-box projections, rasterized maps, and camera conditions to enforce scene realism and safety (Zhou et al., 23 May 2025).
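
A minimal sketch of the residual gated self-attention idea behind such plug-and-play pose injection: a zero-initialized, tanh-gated residual leaves the frozen video prior untouched at initialization and is gradually opened during training. Module structure and shapes are assumptions, not the exact 3DTrajMaster implementation:

```python
import torch
import torch.nn as nn

class GatedPoseInjector(nn.Module):
    """Residual gated self-attention over [video tokens ; pose tokens].

    The learnable gate starts at zero, so the block is initially an identity on
    the video tokens and the pretrained video prior is preserved; training
    opens the gate to let 6-DoF pose tokens steer the generation.
    """
    def __init__(self, channels: int, heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(channels, heads, batch_first=True)
        self.gate = nn.Parameter(torch.zeros(1))   # zero-init residual gate

    def forward(self, video_tokens: torch.Tensor, pose_tokens: torch.Tensor) -> torch.Tensor:
        joint = torch.cat([video_tokens, pose_tokens], dim=1)       # (B, N + M, C)
        out, _ = self.attn(joint, joint, joint)
        n = video_tokens.shape[1]
        return video_tokens + torch.tanh(self.gate) * out[:, :n]   # residual on video tokens only
```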

5. Consistency Mechanisms and Evaluation

Strict 3D and temporal consistency is paramount. Cross-view and epipolar attention mechanisms supply the geometric coupling between synchronized views, while evaluation pairs appearance metrics (FID, FVD, PSNR, SSIM, VBench) with trajectory-adherence metrics such as RotErr, TransErr, and epipolar precision, and with task-level scores such as Jaccard overlap for robotic interaction (Xu et al., 14 Oct 2024, Su et al., 17 Nov 2025); a sketch of the pose-error metrics follows.
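
A hedged sketch of how such pose-error metrics can be computed, assuming RotErr is the geodesic rotation angle in degrees and TransErr the Euclidean distance between predicted and ground-truth poses (exact definitions vary across papers):

```python
import numpy as np

def rot_err_deg(R_pred, R_gt):
    """Geodesic angle between two rotation matrices, in degrees."""
    cos = (np.trace(R_pred @ R_gt.T) - 1.0) / 2.0
    return float(np.degrees(np.arccos(np.clip(cos, -1.0, 1.0))))

def trans_err(t_pred, t_gt):
    """Euclidean distance between predicted and ground-truth translations."""
    return float(np.linalg.norm(t_pred - t_gt))

def trajectory_errors(poses_pred, poses_gt):
    """Average per-frame RotErr / TransErr over a trajectory of (R, t) pairs."""
    rot = [rot_err_deg(Rp, Rg) for (Rp, _), (Rg, _) in zip(poses_pred, poses_gt)]
    trans = [trans_err(tp, tg) for (_, tp), (_, tg) in zip(poses_pred, poses_gt)]
    return float(np.mean(rot)), float(np.mean(trans))
```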

6. Applications and Limitations

Typical applications include:

  • Controllable driving data generation (UniMLVG, SafeMVDrive, MyGo, DiVE): Generation of 6-view, long-horizon videos conditioned on BEV layouts, 3D bounding box sequences, or user-specified trajectories for end-to-end AD and planner stress-testing (Chen et al., 6 Dec 2024, Zhou et al., 23 May 2025, Yao et al., 10 Sep 2024, Jiang et al., 3 Sep 2024).
  • Robotics and embodied world models (MTV-World): Visuomotor forecasting with multi-view trajectory videos, enabling high-consistency robotic action simulation and object interaction modeling (Su et al., 17 Nov 2025).
  • Virtual production and character customization (Virtually Being): Multi-view identity preservation and camera/lighting control using 4DGS rendering and customized ControlNet architectures (Xu et al., 16 Oct 2025).
  • Scene and object manipulation (ObjCtrl-2.5D): Training-free, camera-pose-based object motion control by mapping 2D+depth trajectories to local camera extrinsic sequences, isolating object regions for local control (Wang et al., 10 Dec 2024).
  • Controllable multi-view video editing (TrajectoryCrafter, Vid-CamEdit): Arbitrary camera trajectory “redirection” in real videos, blending geometric priors with generative image completion to enable in-the-wild, user-driven novel view synthesis (YU et al., 7 Mar 2025, Seo et al., 16 Jun 2025).

Limitations include geometry estimator imperfections (ghosting, occlusions at large angles), computational expense due to multi-step diffusion, and the requirement for high-quality calibration or pose estimation. Future directions identified include learned 4D radiance fields, fast samplers for longer-term control, high-fidelity object manipulation, and adaptive trajectory extraction from weakly annotated or unconstrained video sources (YU et al., 7 Mar 2025, Yang et al., 3 Apr 2025, Su et al., 17 Nov 2025).

7. Representative Frameworks and Comparative Metrics

A comparative summary of key frameworks, their core innovations, and benchmark performance illustrates the state of the art in multi-view trajectory-video control:

| Framework | Key Innovations | Main Quantitative Highlights |
| --- | --- | --- |
| UniMLVG (Chen et al., 6 Dec 2024) | Temporal and cross-view DiT; explicit ray encoding; 20 s 6-view driving videos | FID = 8.8, FVD = 60.1 (nuScenes 6-view); mAP_obj = 22.50, IoU_road = 70.81 |
| Cavia (Xu et al., 14 Oct 2024) | View-integrated attention, Plücker coordinates, multi-source training | 20–30% FID/FVD improvement, >3× epipolar precision over prior work on RealEstate10K |
| SafeMVDrive (Zhou et al., 23 May 2025) | Visual context–augmented adversarial trajectories, two-stage generator, UniMLVG video synthesis | FID: 20.63; sample-level planner CR: 0.303 (vs. 0.007 on the original data) |
| 3DTrajMaster (Fu et al., 10 Dec 2024) | Plug-and-play object injector, LoRA adaptor, 360°-Motion synthetic dataset | RotErr: 0.277°, TransErr: 0.398 m; FID: 96.8 (vs. ~105 baseline); multi-entity control |
| MTV-World (Su et al., 17 Nov 2025) | Trajectory-video control for robotic world models, Jaccard evaluation, CLIP fusion | Jaccard: 53.9 (vs. 18.3 single-view), FID: 23.2, FVD: 39.1 |
| TrajectoryCrafter (YU et al., 7 Mar 2025) | Dual-stream DiT, double-reprojection monocular training | PSNR: 14.24 (vs. 10–11), SSIM: 0.417 (vs. 0.34–0.36), VBench ∼15–20% gain |
| OmniCam (Yang et al., 3 Apr 2025) | Multimodal text/video-to-camera trajectory, LoRA+diffusion, OmniTr dataset | Avg. trajectory adherence ∼0.92, FID: 8.26 on OmniTr, RotErr: 3.1° |

These frameworks collectively define the cutting edge of multi-view trajectory-video control, enabling efficient, reliable, and highly controllable video generation across robotics, autonomous systems, virtual production, and visual understanding.
