Multi-View Trajectory-Video Control

Updated 24 November 2025
  • Multi-view trajectory-video control is defined by methods that use explicit 3D trajectory and SE(3) representations to synchronize multiple video views.
  • The approach integrates diffusion-based generative models with cross-view and temporal attention to ensure geometric and spatiotemporal consistency.
  • Applications range from robotics and autonomous driving simulations to virtual production, with challenges in computational cost and calibration accuracy.

Multi-view trajectory-video control defines a class of methods and models that deliver explicit, high-fidelity control over camera or object trajectories across multiple synchronized video views. Central to this paradigm is the tight linkage between 3D trajectories—often represented as per-frame SE(3) extrinsics, 6-DoF object pose sequences, or trajectory videos projected via camera geometry—and spatiotemporally consistent generative video models, particularly in diffusion-based frameworks. This technology is foundational in domains ranging from robot-embodiment world models and virtual production to controllable driving simulation, novel-view synthesis, and multi-entity motion editing. Multi-view trajectory-video control models are distinguished by their explicit trajectory parameterizations, geometric and cross-view attention mechanisms, and the capacity to generate or edit long, coherent videos under complex, user-driven camera or object motion constraints.

1. Canonical Problem Formulation and Trajectory Representations

Multi-view trajectory-video control formalizes the problem as generating a set of (potentially synchronized) videos, each associated with a unique, user-specified camera or object trajectory expressed in SE(3) or as explicit control signals. Core abstractions include:

  • Camera Trajectories: Represented as per-frame extrinsics $C_t = [R_t \mid T_t] \in \mathrm{SE}(3)$ and intrinsics $K$, or in spherical/polar coordinates (φ, θ, r) for canonicalization (Yang et al., 3 Apr 2025).
  • Object Trajectories/6-DoF Control: For N entities, each with 6-DoF poses $P_n = [R_n^{1:F}; T_n^{1:F}]$, fused via gated self-attention and plug-and-play object injectors (Fu et al., 10 Dec 2024).
  • Trajectory Videos: In robotics, multi-view trajectory video control projects 3D end-effector Cartesian trajectories into N synchronized 2D “glowing point” videos, one per calibrated camera, preserving temporal and spatial correspondences between action and observation (Su et al., 17 Nov 2025).

Explicit mathematical mappings are used to transform world-coordinate trajectories into image-plane paths for each view: $u_t = \Pi(K, R, t)\, X_t$, where $\Pi(K, R, t) = K\,[R \mid t]$, $X_t \in \mathbb{R}^{n \times 3}$ denotes the 3D entity positions, and projections are performed per frame and per camera.
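
A minimal sketch of this per-view projection, assuming known intrinsics $K$ and world-to-camera extrinsics $[R \mid t]$ for each calibrated camera (the array names, shapes, and example poses below are illustrative, not drawn from any specific paper):

```python
import numpy as np

def project_trajectory(X_world, K, R, t):
    """Project a 3D trajectory into one camera's image plane.

    X_world: (F, 3) world-frame positions, one per frame.
    K:       (3, 3) camera intrinsics.
    R, t:    (3, 3) rotation and (3,) translation of the world-to-camera extrinsics.
    Returns  (F, 2) pixel coordinates u_t = K [R | t] X_t after the perspective divide.
    """
    X_cam = X_world @ R.T + t          # world -> camera coordinates
    uvw = X_cam @ K.T                  # apply intrinsics
    return uvw[:, :2] / uvw[:, 2:3]    # perspective divide

# Example: one end-effector trajectory projected into two calibrated views.
F = 16
X = np.stack([np.linspace(0.0, 0.3, F),    # x sweeps sideways
              np.full(F, 0.1),             # constant height
              np.full(F, 1.5)], axis=1)    # 1.5 m in front of the cameras
K = np.array([[500.0, 0.0, 320.0],
              [0.0, 500.0, 240.0],
              [0.0, 0.0, 1.0]])
views = [(np.eye(3), np.zeros(3)),                    # camera 1 at the origin
         (np.eye(3), np.array([-0.2, 0.0, 0.0]))]     # camera 2 with a 0.2 m baseline
pixel_tracks = [project_trajectory(X, K, R, t) for R, t in views]  # one 2D track per view
```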

2. Core Architectures and Conditioning Mechanisms

Multi-view trajectory-video control models centre on large-scale, latent space video diffusion backbones, typically enhanced with specialized conditioning and attention for trajectory injection and cross-view consistency:

  • Trajectory Conditioning:
    • Camera trajectories are injected as explicit ray encodings or Plücker-coordinate embeddings aligned with the latent video grid (Chen et al., 6 Dec 2024, Xu et al., 14 Oct 2024).
    • Object and end-effector motion is supplied as 6-DoF pose token sequences or as per-view projected trajectory videos (Fu et al., 10 Dec 2024, Su et al., 17 Nov 2025).
  • Cross-View/Frame Attention:
    • View-integrated attention modules perform full dot-product attention across both the temporal (frames) and spatial (views) axes (Xu et al., 14 Oct 2024); a minimal sketch follows this list.
    • Epipolar-attention masks restrict inter-view attention to geometry-consistent correspondences, enforcing 3D-coherence without explicit losses (Kuang et al., 27 May 2024, Yao et al., 10 Sep 2024).
  • ControlNet/Injection Blocks:
    • Trajectory parameters are encoded via neural adapters, injected via learnable linear layers, and combined with the pre-trained backbone using minimal-train, zero-mean initializations (Yao et al., 10 Sep 2024, Xu et al., 16 Oct 2025).
    • The main denoiser is conditioned at early and intermediate steps with explicit trajectory tokens, with additional CLIP/text, lighting, and relighting controls for scene stylization (Xu et al., 16 Oct 2025).
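
A minimal sketch of the view-integrated attention idea, assuming latents shaped (batch, views, frames, tokens, channels); the module name and dimensions are illustrative, not a specific paper's implementation:

```python
import torch
import torch.nn as nn

class ViewIntegratedAttention(nn.Module):
    """Joint attention over the view and frame axes.

    Flattening the view/frame/token axes into one sequence lets every spatial
    token attend to every other token in every view and frame, which is the
    coupling that cross-view and temporal attention blocks rely on for
    spatiotemporal consistency.
    """
    def __init__(self, channels: int, heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(channels, heads, batch_first=True)

    def forward(self, z: torch.Tensor) -> torch.Tensor:
        b, v, f, n, c = z.shape
        seq = z.reshape(b, v * f * n, c)      # one long sequence per sample
        out, _ = self.attn(seq, seq, seq)     # full dot-product attention
        return out.reshape(b, v, f, n, c)

z = torch.randn(2, 2, 4, 16, 320)             # 2 views, 4 frames, 16 tokens, 320 channels
y = ViewIntegratedAttention(320)(z)           # same shape, now coupled across views/frames
```

Epipolar-masked variants pass an attention mask to the same call so that each token attends only to geometry-consistent correspondences in the other views.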

3. Training Objectives, Data, and Factorization

Models are trained via standard diffusion denoising losses or rectified flows, with architectural disentanglement of trajectory geometry and appearance/content. Notable strategies include:

  • Training Losses:
    • Denoising loss $L_{\text{diff}} = \mathbb{E}\big[\| \epsilon_\theta(z_t, t, c) - \epsilon \|_2^2\big]$, applied jointly over all views and control streams (Chen et al., 6 Dec 2024, Yang et al., 3 Apr 2025); a minimal sketch of this objective appears after this list.
    • Multi-view/object consistency terms, e.g., feature matching, pose error (RotErr, TransErr), and multi-term losses for identity, pose, and lighting (Xu et al., 16 Oct 2025, Xu et al., 14 Oct 2024).
  • Hybrid/Fine-Tuning Schemes:
    • Factorized fine-tuning alternates between spatial (multi-view image) and temporal (video clip) updates, permitting spatial blocks to generalize over viewpoints and temporal blocks to encode dynamics (Seo et al., 16 Jun 2025).
    • Domain adaptors (LoRA) and domain-specific plug-and-play injectors preserve the generalization capacity of frozen base models while adapting to domain shifts in multi-entity or synthetic content (Fu et al., 10 Dec 2024).
  • Training Data:
    • Mixtures of multi-view driving corpora (e.g., nuScenes), monocular camera-trajectory datasets (e.g., RealEstate10K), and synthetic multi-entity or trajectory-annotated sets such as 360°-Motion and OmniTr (Chen et al., 6 Dec 2024, Xu et al., 14 Oct 2024, Fu et al., 10 Dec 2024, Yang et al., 3 Apr 2025).
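
A hedged sketch of the joint denoising objective over all views; the denoiser interface, tensor shapes, and scheduler handling below are assumptions for illustration:

```python
import torch
import torch.nn.functional as F

def multiview_denoising_loss(denoiser, z0, cond, alphas_cumprod):
    """L_diff = E[ || eps_theta(z_t, t, c) - eps ||_2^2 ] over all views at once.

    z0:             (B, V, F, C, H, W) clean multi-view video latents.
    cond:           trajectory / text conditioning, passed through to the denoiser.
    alphas_cumprod: (T,) cumulative alpha-bar schedule of the diffusion process.
    """
    b = z0.shape[0]
    t = torch.randint(0, alphas_cumprod.shape[0], (b,), device=z0.device)
    abar = alphas_cumprod[t].view(b, 1, 1, 1, 1, 1)
    eps = torch.randn_like(z0)
    z_t = abar.sqrt() * z0 + (1.0 - abar).sqrt() * eps   # forward noising of every view
    eps_hat = denoiser(z_t, t, cond)                     # predicts noise for all views jointly
    return F.mse_loss(eps_hat, eps)
```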

4. Multi-Entity and Safety-Critical Extensions

The framework generalizes to multi-agent and safety-critical scenarios:

  • Multi-entity Control: 3DTrajMaster conducts plug-and-play fusion of prompt–trajectory pairs for N entities, with each trajectory governed by its own 6-DoF pose sequence. The object injector’s residual gated self-attention ensures pose conditioning minimally perturbs the video prior (Fu et al., 10 Dec 2024); a minimal gated-injection sketch follows this list.
  • Safety-critical Planning: SafeMVDrive integrates a vision-augmented, GRPO-finetuned trajectory selector with a two-stage (collision → evasion) controllable trajectory generator and a control-conditioned, UniMLVG-based video synthesizer. Diffusion models for trajectory and video stages are linked by 3D bounding-box projections, rasterized maps, and camera conditions to enforce scene realism and safety (Zhou et al., 23 May 2025).
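
A minimal sketch of the residual gated self-attention idea behind such plug-and-play pose injection: a zero-initialized, tanh-gated residual leaves the frozen video prior untouched at initialization and is gradually opened during training. Module structure and shapes are assumptions, not the exact 3DTrajMaster implementation:

```python
import torch
import torch.nn as nn

class GatedPoseInjector(nn.Module):
    """Residual gated self-attention over [video tokens ; pose tokens].

    The learnable gate starts at zero, so the block is initially an identity on
    the video tokens and the pretrained video prior is preserved; training
    opens the gate to let 6-DoF pose tokens steer the generation.
    """
    def __init__(self, channels: int, heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(channels, heads, batch_first=True)
        self.gate = nn.Parameter(torch.zeros(1))   # zero-init residual gate

    def forward(self, video_tokens: torch.Tensor, pose_tokens: torch.Tensor) -> torch.Tensor:
        joint = torch.cat([video_tokens, pose_tokens], dim=1)       # (B, N + M, C)
        out, _ = self.attn(joint, joint, joint)
        n = video_tokens.shape[1]
        return video_tokens + torch.tanh(self.gate) * out[:, :n]   # residual on video tokens only
```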

5. Consistency Mechanisms and Evaluation

Strict 3D and temporal consistency is paramount. Cross-view and epipolar attention mechanisms supply the geometric coupling between synchronized views, while evaluation pairs appearance metrics (FID, FVD, PSNR, SSIM, VBench) with trajectory-adherence metrics such as RotErr, TransErr, and epipolar precision, and with task-level scores such as Jaccard overlap for robotic interaction (Xu et al., 14 Oct 2024, Su et al., 17 Nov 2025); a sketch of the pose-error metrics follows.
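
A hedged sketch of how such pose-error metrics can be computed, assuming RotErr is the geodesic rotation angle in degrees and TransErr the Euclidean distance between predicted and ground-truth poses (exact definitions vary across papers):

```python
import numpy as np

def rot_err_deg(R_pred, R_gt):
    """Geodesic angle between two rotation matrices, in degrees."""
    cos = (np.trace(R_pred @ R_gt.T) - 1.0) / 2.0
    return float(np.degrees(np.arccos(np.clip(cos, -1.0, 1.0))))

def trans_err(t_pred, t_gt):
    """Euclidean distance between predicted and ground-truth translations."""
    return float(np.linalg.norm(t_pred - t_gt))

def trajectory_errors(poses_pred, poses_gt):
    """Average per-frame RotErr / TransErr over a trajectory of (R, t) pairs."""
    rot = [rot_err_deg(Rp, Rg) for (Rp, _), (Rg, _) in zip(poses_pred, poses_gt)]
    trans = [trans_err(tp, tg) for (_, tp), (_, tg) in zip(poses_pred, poses_gt)]
    return float(np.mean(rot)), float(np.mean(trans))
```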

6. Applications and Limitations

Typical applications include:

  • Controllable driving data generation (UniMLVG, SafeMVDrive, MyGo, DiVE): Generation of 6-view, long-horizon videos conditioned on BEV layouts, 3D bounding box sequences, or user-specified trajectories for end-to-end AD and planner stress-testing (Chen et al., 6 Dec 2024, Zhou et al., 23 May 2025, Yao et al., 10 Sep 2024, Jiang et al., 3 Sep 2024).
  • Robotics and embodied world models (MTV-World): Visuomotor forecasting with multi-view trajectory videos, enabling high-consistency robotic action simulation and object interaction modeling (Su et al., 17 Nov 2025).
  • Virtual production and character customization (Virtually Being): Multi-view identity preservation and camera/lighting control using 4DGS rendering and customized ControlNet architectures (Xu et al., 16 Oct 2025).
  • Scene and object manipulation (ObjCtrl-2.5D): Training-free, camera-pose-based object motion control by mapping 2D+depth trajectories to local camera extrinsic sequences, isolating object regions for local control (Wang et al., 10 Dec 2024).
  • Controllable multi-view video editing (TrajectoryCrafter, Vid-CamEdit): Arbitrary camera trajectory “redirection” in real videos, blending geometric priors with generative image completion to enable in-the-wild, user-driven novel view synthesis (YU et al., 7 Mar 2025, Seo et al., 16 Jun 2025).

Limitations include geometry estimator imperfections (ghosting, occlusions at large angles), computational expense due to multi-step diffusion, and the requirement for high-quality calibration or pose estimation. Future directions identified include learned 4D radiance fields, fast samplers for longer-term control, high-fidelity object manipulation, and adaptive trajectory extraction from weakly annotated or unconstrained video sources (YU et al., 7 Mar 2025, Yang et al., 3 Apr 2025, Su et al., 17 Nov 2025).

7. Representative Frameworks and Comparative Metrics

A comparative summary of key frameworks, their core innovations, and benchmark performance illustrates the state of the art in multi-view trajectory-video control:

| Framework | Key Innovations | Main Quantitative Highlights |
| --- | --- | --- |
| UniMLVG (Chen et al., 6 Dec 2024) | Temporal and cross-view DiT; explicit ray encoding; 20 s 6-view driving videos | FID = 8.8, FVD = 60.1 (nuScenes 6-view); mAP_obj = 22.50, IoU_road = 70.81 |
| Cavia (Xu et al., 14 Oct 2024) | View-integrated attention, Plücker coordinates, multi-source training | 20–30% FID/FVD improvement, >3× epipolar precision over prior work on RealEstate10K |
| SafeMVDrive (Zhou et al., 23 May 2025) | Visual context–augmented adversarial trajectories, two-stage generator, UniMLVG video synthesis | FID: 20.63; sample-level planner CR: 0.303 (vs. 0.007 on the original data) |
| 3DTrajMaster (Fu et al., 10 Dec 2024) | Plug-and-play object injector, LoRA adaptor, 360°-Motion synthetic dataset | RotErr: 0.277°, TransErr: 0.398 m; FID: 96.8 (vs. ~105 baseline); multi-entity control |
| MTV-World (Su et al., 17 Nov 2025) | Trajectory-video control for robotic world models, Jaccard evaluation, CLIP fusion | Jaccard: 53.9 (vs. 18.3 single-view), FID: 23.2, FVD: 39.1 |
| TrajectoryCrafter (YU et al., 7 Mar 2025) | Dual-stream DiT, double-reprojection monocular training | PSNR: 14.24 (vs. 10–11), SSIM: 0.417 (vs. 0.34–0.36), VBench ∼15–20% gain |
| OmniCam (Yang et al., 3 Apr 2025) | Multimodal text/video-to-camera trajectory, LoRA+diffusion, OmniTr dataset | Avg. trajectory adherence ∼0.92, FID: 8.26 on OmniTr, RotErr: 3.1° |

These frameworks collectively define the cutting edge of multi-view trajectory-video control, enabling efficient, reliable, and highly controllable video generation across robotics, autonomous systems, virtual production, and visual understanding.
