Multi-view Diffusion Trajectories (MDTs)

Updated 20 April 2026

Multi-view Diffusion Trajectories (MDTs) are a framework of interacting diffusion operators that couple stochastic processes across multiple views, enabling robust multi-view data integration.
They underpin manifold learning and generative modeling by synthesizing 3D-consistent images and videos through precise camera trajectory conditioning and pose-adaptive attention.
MDTs employ explicit operator products, score coupling, and unsupervised trajectory policy learning to maintain ergodicity, integrate geometric data, and stabilize 4D content synthesis.

Multi-view Diffusion Trajectories (MDTs) are a unifying abstraction for describing sequential or coupled stochastic processes operating across multiple distinct data views or camera perspectives, in which the evolution of system states is governed by interacting diffusion operators. In practice, MDTs appear in at least two technically distinct but structurally analogous domains: (1) multi-view geometric data analysis, where diffusion operators encode inter-view and intra-view affinities, and (2) generative modeling for vision, where diffusion-based neural architectures synthesize images or videos along prescribed multi-view spatiotemporal paths. MDTs provide a principled mechanism for fusing, regularizing, and propagating uncertainty or content across views, enabling tasks such as manifold learning, 3D-consistent scene synthesis, camera-controllable video generation, and risk-aware world modeling.

1. Mathematical Formulation and Operator Mechanisms

In the context of multi-view data geometry, MDTs are defined as inhomogeneous, trajectory-dependent diffusion processes over a set of discrete views. Consider a dataset $X = \{x_1, \ldots, x_N\}$ observed under $V$ distinct views, each associated with a kernel $K_v$ and a Markov operator $P_v = D_v^{-1} K_v$, where $D_v$ is the degree matrix (Debaussart-Joniec et al., 1 Dec 2025). A multi-view diffusion trajectory of length $t$ is constructed as

$T^{(t)} = P_t P_{t-1} \cdots P_1,$

where each $P_\ell \in \{P_v\}$ may be chosen by a deterministic, stochastic, or learnable policy. The transition probabilities $T^{(t)}_{ij}$ quantify the likelihood of diffusing from $x_i$ to $x_j$ after $t$ steps alternating among multiple views. Ergodicity is preserved under mild conditions, ensuring unique stationary distributions and nontrivial mixing behavior even with arbitrary view switching (Debaussart-Joniec et al., 1 Dec 2025).
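The operator product above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the Gaussian kernel, its bandwidth, the toy data, and the helper names `markov_operator` and `mdt_operator` are all assumptions introduced here.

```python
import numpy as np

def markov_operator(X, bandwidth=1.0):
    """Row-stochastic operator P = D^{-1} K from an (assumed) Gaussian kernel on one view."""
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    K = np.exp(-sq_dists / (2.0 * bandwidth ** 2))
    return K / K.sum(axis=1, keepdims=True)

def mdt_operator(operators, view_sequence):
    """Compose T^(t) = P_t ... P_1 for a given sequence of view indices."""
    T = np.eye(operators[0].shape[0])
    for v in view_sequence:
        # Each new step is multiplied on the left, so step 1 ends up rightmost.
        T = operators[v] @ T
    return T

rng = np.random.default_rng(0)
views = [rng.normal(size=(50, 3)), rng.normal(size=(50, 5))]  # two toy views of N = 50 points
P = [markov_operator(Xv) for Xv in views]
sequence = rng.integers(0, len(P), size=6)                    # uniformly random policy, t = 6
T = mdt_operator(P, sequence)
```

Because every factor is row-stochastic, the product `T` is row-stochastic as well, whichever view sequence the policy selects.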

For diffusion-based generative models (e.g., image/video synthesis), MDTs encode the simultaneous or sequential evolution of scene latents conditioned on multi-view camera trajectories. For instance, in CausNVS (Kong et al., 8 Sep 2025), a set of $n$ RGB views, poses, and latent variables $\{z_i\}_{i=1}^{n}$ is diffused forward and denoised autoregressively, with each $z_i$ noised independently at its own noise level $\tau_i$:

$z_i^{(\tau_i)} = \sqrt{\bar\alpha_{\tau_i}}\, z_i + \sqrt{1 - \bar\alpha_{\tau_i}}\, \epsilon_i, \qquad \epsilon_i \sim \mathcal{N}(0, I).$

The denoising trajectory of each view $i$ depends causally on the sequence of previously decoded views and their relative poses, as modulated through attention mechanisms and architectural constraints (e.g., causal masking, sliding-window key-value caches).
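Independent per-view noising can be illustrated with the standard DDPM forward process. The sketch below assumes a linear beta schedule, toy latent sizes, and a hypothetical helper name `noise_views_independently`; none of these are taken from CausNVS itself.

```python
import numpy as np

def noise_views_independently(latents, alpha_bar, rng):
    """Noise each view at its own level: sqrt(abar) * z + sqrt(1 - abar) * eps."""
    n_views = latents.shape[0]
    tau = rng.integers(0, len(alpha_bar), size=n_views)        # one noise level per view
    a = alpha_bar[tau].reshape(n_views, *([1] * (latents.ndim - 1)))
    eps = rng.standard_normal(latents.shape)
    noised = np.sqrt(a) * latents + np.sqrt(1.0 - a) * eps
    return noised, eps, tau

rng = np.random.default_rng(0)
betas = np.linspace(1e-4, 2e-2, 1000)          # assumed linear schedule
alpha_bar = np.cumprod(1.0 - betas)
latents = rng.standard_normal((8, 4, 16, 16))  # 8 views of a 4x16x16 latent (toy sizes)
noised, eps, tau = noise_views_independently(latents, alpha_bar, rng)
```

Drawing a separate level per view, rather than one shared level, is what lets an autoregressive model see clean context views next to a noisy target view during training.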

2. Embedding, Distances, and Learning

A primary use of MDTs in geometric machine learning is to define view-intertwined embeddings and diffusion distances. Given a multi-view diffusion operator $T^{(t)}$, the trajectory-dependent diffusion distance is

$D_t^2(x_i, x_j) = \sum_{k=1}^{N} \frac{\left(T^{(t)}_{ik} - T^{(t)}_{jk}\right)^2}{\pi_t(k)},$

where $\pi_t$ is the stationary distribution at step $t$ (Debaussart-Joniec et al., 1 Dec 2025). Spectral or SVD-based embeddings $\Psi^{(t)}$ preserve these distances as Euclidean distances between embedded points. In manifold learning, trajectories $T^{(t)}$ reveal the dynamical geometry induced by multiview coupling (Lindenbaum et al., 2015).
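A small numerical sketch makes the link between the diffusion distance and an SVD embedding concrete. It assumes the usual diffusion-maps form of the distance (squared row differences of the operator weighted by the inverse stationary distribution) and takes the stationary distribution to be the left fixed point of the composed operator; the kernel, data, and function names are illustrative.

```python
import numpy as np

def markov_operator(X, bandwidth=1.0):
    sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    K = np.exp(-sq / (2.0 * bandwidth ** 2))
    return K / K.sum(axis=1, keepdims=True)

def stationary_distribution(T, iters=10_000, tol=1e-13):
    """Left fixed point pi = pi T by power iteration (assumes T is ergodic)."""
    pi = np.full(T.shape[0], 1.0 / T.shape[0])
    for _ in range(iters):
        nxt = pi @ T
        if np.abs(nxt - pi).sum() < tol:
            break
        pi = nxt
    return nxt / nxt.sum()

def diffusion_distances(T, pi):
    """Pairwise distances between rows of T in the 1/pi-weighted norm."""
    A = T / np.sqrt(pi)[None, :]
    diff = A[:, None, :] - A[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

def svd_embedding(T, pi):
    """Coordinates U * s of T diag(pi)^{-1/2}; row distances equal diffusion distances."""
    U, s, _ = np.linalg.svd(T / np.sqrt(pi)[None, :], full_matrices=False)
    return U * s

rng = np.random.default_rng(0)
P = [markov_operator(rng.normal(size=(30, 3))) for _ in range(2)]
T = P[1] @ P[0] @ P[1] @ P[0]          # one fixed trajectory of length t = 4
pi = stationary_distribution(T)
D = diffusion_distances(T, pi)
Psi = svd_embedding(T, pi)
```

Keeping all singular directions reproduces the distances exactly; truncating `Psi` to its leading columns gives the usual low-dimensional approximation.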

Learning trajectory policies $(P_1, \ldots, P_t)$ or convex combinations of the view operators is typically performed via unsupervised optimization of internal quality measures, such as clustering indices or contrastive losses. Random MDTs serve as robust baselines, and results are reported relative to them as the Performance Ratio to Random (PRR) (Debaussart-Joniec et al., 1 Dec 2025).
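As a rough illustration of unsupervised selection, the sketch below grid-searches the weight of a two-view convex combination and keeps the weight with the best internal score. The eigengap score is a hypothetical stand-in for the clustering indices or contrastive losses mentioned above, and the two toy views (one clustered, one pure noise) are assumptions made for the example.

```python
import numpy as np

def markov_operator(X, bandwidth=1.0):
    sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    K = np.exp(-sq / (2.0 * bandwidth ** 2))
    return K / K.sum(axis=1, keepdims=True)

def eigengap_score(T):
    """Hypothetical internal measure: gap between the 2nd and 3rd largest
    eigenvalue magnitudes, which tends to be large for two separated clusters."""
    lam = np.sort(np.abs(np.linalg.eigvals(T)))[::-1]
    return lam[1] - lam[2]

def search_convex_weights(operators, t, grid):
    """Grid search over w for T = (w P_1 + (1 - w) P_2)^t, keeping the best score."""
    best = None
    for w in grid:
        weights = np.array([w, 1.0 - w])
        P = sum(wv * Pv for wv, Pv in zip(weights, operators))
        T = np.linalg.matrix_power(P, t)
        score = eigengap_score(T)
        if best is None or score > best[0]:
            best = (score, weights, T)
    return best

rng = np.random.default_rng(0)
labels = np.repeat([0, 1], 20)
informative = rng.normal(size=(40, 2)) + 4.0 * labels[:, None]  # view with two clusters
noisy = rng.normal(size=(40, 2))                                # view with no structure
operators = [markov_operator(informative), markov_operator(noisy)]
best_score, best_weights, best_T = search_convex_weights(
    operators, t=3, grid=np.linspace(0.0, 1.0, 11))
```

No labels enter the search: `labels` is used only to generate the toy data, so the selection is unsupervised in the sense described above.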

3. Camera Trajectory Conditioning and Geometric Integration

In neural generative models for 3D content or video, MDTs are intimately linked to camera trajectory specification and spatial-temporal consistency. Practically, each frame or view is indexed by its camera extrinsics (rotation and translation); these are embedded via Plücker coordinates or pairwise-relative transforms to control pose-adaptive attention and frame synthesis (Xu et al., 16 Oct 2025, Xu et al., 2024, Kong et al., 8 Sep 2025). For example, in Cavia (Xu et al., 2024), per-pixel Plücker rays condition a coupled U-Net across all views and frames, enabling joint diffusion of a single multi-view, multi-frame latent tensor. Integration of cross-frame and cross-view attention is essential to prevent geometric drift and ensure trajectory-controlled consistency.
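Per-pixel Plücker conditioning can be sketched for a pinhole camera as follows. The ordering (direction first, then moment), the half-pixel offset, and the function name `plucker_rays` are conventions assumed here; individual models may order or normalize the six channels differently.

```python
import numpy as np

def plucker_rays(K, R_c2w, t_c2w, height, width):
    """Per-pixel Plücker coordinates (d, o x d) for a pinhole camera.

    K: 3x3 intrinsics; R_c2w: camera-to-world rotation; t_c2w: camera centre in world space.
    """
    u, v = np.meshgrid(np.arange(width) + 0.5, np.arange(height) + 0.5)
    pixels = np.stack([u, v, np.ones_like(u)], axis=-1)   # (H, W, 3) homogeneous pixels
    dirs_cam = pixels @ np.linalg.inv(K).T                # back-project through K^{-1}
    dirs_world = dirs_cam @ R_c2w.T                       # rotate into the world frame
    dirs_world /= np.linalg.norm(dirs_world, axis=-1, keepdims=True)
    origin = np.broadcast_to(t_c2w, dirs_world.shape)
    moment = np.cross(origin, dirs_world)                 # o x d, orthogonal to d
    return np.concatenate([dirs_world, moment], axis=-1)  # (H, W, 6)

K = np.array([[100.0, 0.0, 32.0], [0.0, 100.0, 24.0], [0.0, 0.0, 1.0]])
rays = plucker_rays(K, np.eye(3), np.array([0.5, -0.2, 1.0]), height=48, width=64)
```

The resulting `(H, W, 6)` map can be concatenated to, or injected alongside, the latent channels of each view, giving the network a dense description of where every pixel's ray lies in space.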

Autoregressive approaches (e.g., CausNVS) advance one view at a time, leveraging causal masking and per-frame noise to stabilize long trajectories (Kong et al., 8 Sep 2025). Score composition and coupling mechanisms allow modular fusion of dynamic priors (motion, appearance, geometry) as in Diffusion$^2$ (Yang et al., 2024) and Coupled Diffusion Sampling (Alzayer et al., 16 Oct 2025), where multi-view and edited-view samplers are linked via quadratic energy coupling during denoising.
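The effect of a quadratic coupling energy can be seen in a deliberately simplified setting. The sketch below is noise-free and uses Gaussian toy scores in place of learned denoisers, so it illustrates only the coupling term, not the sampler of Coupled Diffusion Sampling; `coupled_sampling` and its parameters are hypothetical names.

```python
import numpy as np

def coupled_sampling(score_a, score_b, x_a, x_b, coupling, steps=400, step_size=0.05):
    """Each chain follows its own score plus the gradient of
    -(coupling / 2) * ||x_a - x_b||^2, which pulls the two samples together."""
    for _ in range(steps):
        pull = coupling * (x_a - x_b)
        # Simultaneous update: both right-hand sides use the previous iterates.
        x_a, x_b = (x_a + step_size * (score_a(x_a) - pull),
                    x_b + step_size * (score_b(x_b) + pull))
    return x_a, x_b

mu_a, mu_b = np.array([1.0, 0.0]), np.array([-1.0, 0.0])

def score_a(x):
    return mu_a - x   # score of N(mu_a, I), standing in for one denoiser

def score_b(x):
    return mu_b - x   # score of N(mu_b, I), standing in for the other

rng = np.random.default_rng(0)
x0_a, x0_b = rng.normal(size=2), rng.normal(size=2)
free_a, free_b = coupled_sampling(score_a, score_b, x0_a, x0_b, coupling=0.0)
tied_a, tied_b = coupled_sampling(score_a, score_b, x0_a, x0_b, coupling=1.0)
gap_free = np.linalg.norm(free_a - free_b)
gap_tied = np.linalg.norm(tied_a - tied_b)
```

In this toy case the fixed-point gap shrinks by a factor of $1 + 2\lambda$ for coupling strength $\lambda$, so `gap_tied` is one third of `gap_free`: the chains stay close without collapsing onto a single sample.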

4. Losses, Regularization, and Training Paradigms

Losses in MDT-based generative frameworks are structured to preserve geometry, temporal coherence, and multi-view identity:

Denoising score-matching objectives predominate, e.g.,

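$\mathcal{L} = \mathbb{E}_{z_0, \epsilon, \tau}\left[\left\| \epsilon - \epsilon_\theta(z_\tau, \tau, \text{pose}) \right\|_2^2\right],$

where $\epsilon_\theta$ is the noise predictor, $z_\tau$ is the latent noised to level $\tau$, and 'pose' is the pose embedding (Xu et al., 16 Oct 2025, Kong et al., 8 Sep 2025, Zhang et al., 2024).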
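A pose-conditioned noise-prediction loss of this kind can be written as a short Monte Carlo estimate. The sketch assumes the standard DDPM forward process and a linear schedule, and `zero_predictor` is a placeholder for a trained network; the pose embedding is passed through untouched because the toy predictor ignores it.

```python
import numpy as np

def denoising_loss(eps_model, z0, pose, alpha_bar, rng):
    """Mean squared error between the true noise and the predicted noise."""
    tau = rng.integers(0, len(alpha_bar), size=z0.shape[0])      # one level per sample
    a = alpha_bar[tau].reshape(-1, *([1] * (z0.ndim - 1)))
    eps = rng.standard_normal(z0.shape)
    z_tau = np.sqrt(a) * z0 + np.sqrt(1.0 - a) * eps             # forward noising
    return float(np.mean((eps - eps_model(z_tau, tau, pose)) ** 2))

def zero_predictor(z_tau, tau, pose):
    """Placeholder for a trained network: always predicts zero noise."""
    return np.zeros_like(z_tau)

rng = np.random.default_rng(0)
alpha_bar = np.cumprod(1.0 - np.linspace(1e-4, 2e-2, 1000))      # assumed linear schedule
z0 = rng.standard_normal((16, 4, 8, 8))                          # toy latents
pose = rng.standard_normal((16, 6))                              # toy pose embeddings
loss = denoising_loss(zero_predictor, z0, pose, alpha_bar, rng)
```

With the zero predictor the loss is close to 1, the variance of the injected noise; a predictor that recovers the noise exactly would drive it to 0.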

Auxiliary losses encode appearance consistency (e.g., AdaFace identity), geometry-appearance alignment, and regularization to prevent overfitting or catastrophic forgetting (Xu et al., 16 Oct 2025, Lin et al., 12 Mar 2026).
Specialized loss terms are introduced for scenario-driven MDTs, such as region-aware Direct Preference Optimization (DPO) focusing on risk-critical, motion-aware masked regions (Lin et al., 12 Mar 2026). Per-frame or region-weighted denoising, geometry-informed masking, and regularization terms collectively enforce local and global coherence.

Data pipelines leverage simulated or relit multi-view captures (e.g., 4D Gaussian Splatting), diverse camera trajectories (random linear/arc/bespoke paths), and injection of dynamic NeRFs or radiance fields for 4D content (Zhang et al., 2024, Xu et al., 16 Oct 2025). Inference-time techniques such as sliding-window context, noise-conditioning augmentation, and noise-blending further stabilize long MDT rollouts (Kong et al., 8 Sep 2025).
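Sliding-window context amounts to a banded causal attention mask at the frame level. The following sketch builds such a mask; the function name and the choice to count the current frame inside the window are assumptions, and real models apply the same pattern per token block rather than per frame.

```python
import numpy as np

def sliding_window_causal_mask(n_frames, window):
    """mask[i, j] is True when frame i may attend to frame j,
    i.e. j is not in the future (j <= i) and lies within the window (i - j < window)."""
    idx = np.arange(n_frames)
    offset = idx[:, None] - idx[None, :]
    return (offset >= 0) & (offset < window)

mask = sliding_window_causal_mask(n_frames=6, window=3)
```

Bounding the window keeps the key-value cache at a fixed size during long rollouts, at the cost of forgetting frames that fall out of the window.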

5. Applications and Empirical Results

MDTs underpin a broad spectrum of applications:

3D-Consistent Image and Video Synthesis: MDTs ensure frame, view, and camera trajectory consistency in neural rendering pipelines. CausNVS attains stable rollouts over long trajectories, with PSNR degradation remaining limited at later frames (Kong et al., 8 Sep 2025). Cavia yields substantial gains in FID/FVD, halving the COLMAP failure rate relative to Stable Video Diffusion (SVD) and markedly increasing geometric and translation AUCs (Xu et al., 2024).
Virtual Character Capture and Customization: Virtually Being demonstrates multi-subject, multi-view, and lighting-adaptive person modeling through customized MDT-aware pipelines (AdaFace=0.351, translation error=0.267 m, rotation error=0.047 rad) (Xu et al., 16 Oct 2025).
Dynamic 4D Content and NeRFs: Variants such as Diffusion$^2$ and 4Diffusion combine video and multi-view models via score-composition or motion-module augmentation, enabling direct 4D Gaussian splatting or dynamic NeRF optimization from MDT-sampled multi-view video (Yang et al., 2024, Zhang et al., 2024).
Autonomous Driving Scenario Generation: RiskMV-DPO uses risk-anchored MDTs to synthesize rare, high-stakes multi-view driving situations, with 3D detection mAP increasing from 18.17 to 30.50 compared to baseline, FID reduced to 15.70 (Lin et al., 12 Mar 2026).
Image Editing: Coupled Diffusion Sampling applies MDTs to regularize independent image edits, delivering multi-view-consistent edits via quadratic coupling energy between diffusion trajectories (Alzayer et al., 16 Oct 2025).
Manifold Learning and Data Fusion: The MDT framework enables dynamic, unsupervised manifold embedding, clustering, and robust multi-view fusion, often outperforming fixed-fusion baselines (PRR > 1) and offering strong performance under view corruption (Debaussart-Joniec et al., 1 Dec 2025, Lindenbaum et al., 2015).

6. Algorithmic Schemes and Operator Space Exploration

Implementation of MDTs encompasses distinct algorithmic strategies:

Explicit Operator Products: Sequential composition of Markov operators (stochastic or learned), exact tracking of transition probabilities, and spectral/SVD embedding.
Parametric Score Combination and Coupling: Modular score interpolation (e.g., convex-combination), quadratic energy regularizers, or proximal updates in high-dimensional latent spaces (Yang et al., 2024, Alzayer et al., 16 Oct 2025).
Attention and Memory Architectures: Cross-view and cross-frame attention towers (or attention masking/caches) synchronize state evolution, while relative pose encodings or Plücker ray conditioning inject geometric information directly into backbone activations.
Random and Learned Trajectories: Neutral baselines employ uniformly random MDT sequences; learning approaches optimize sequence selection or convex combinations using internal, unsupervised quality measures (Debaussart-Joniec et al., 1 Dec 2025).
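Score interpolation by convex combination, listed above, can be checked on a case with a known answer. The sketch uses two Gaussian scores as stand-ins for a video prior and a multi-view prior; the names, the weight, and the noise-free update rule are illustrative assumptions rather than any paper's sampler.

```python
import numpy as np

def gaussian_score(mean):
    """Score of N(mean, I): grad_x log p(x) = mean - x."""
    return lambda x: mean - x

def compose_scores(score_a, score_b, weight):
    """Convex combination weight * score_a + (1 - weight) * score_b."""
    return lambda x: weight * score_a(x) + (1.0 - weight) * score_b(x)

mu_video = np.array([2.0, 0.0])       # toy stand-in for a video prior
mu_multiview = np.array([0.0, 2.0])   # toy stand-in for a multi-view prior
combined = compose_scores(gaussian_score(mu_video), gaussian_score(mu_multiview), weight=0.25)

x = np.zeros(2)
for _ in range(500):                  # noise-free ascent along the composed score
    x = x + 0.05 * combined(x)
```

For unit-covariance Gaussians the composed score is again a Gaussian score centred at the weighted mean, so `x` converges to `0.25 * mu_video + 0.75 * mu_multiview`.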

Selection of operator sets (e.g., including idle or teleportation moves), the trajectory length $t$, and regularization (entropy-based selection via $S(\mathbb{E}[T^{(t)}])$) are empirically guided, with $t$ drawn from a bounded range tuned for most clustering or manifold tasks (Debaussart-Joniec et al., 1 Dec 2025). Performance is robust for both homogeneous (equal view reliability) and heterogeneous (noisy/missing/unequal) view scenarios.
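For trajectories whose steps are drawn uniformly and independently, the expected operator has a closed form: it is the $t$-th power of the average view operator. The sketch below verifies this by enumeration and evaluates an entropy of the result for several lengths. Reading $S$ as the Shannon entropy of the normalized singular values is an assumption made here, since the text does not define it, and the random operators are toy data.

```python
import itertools
import numpy as np

def random_stochastic(n, rng):
    M = rng.random((n, n))
    return M / M.sum(axis=1, keepdims=True)

def expected_operator(operators, t):
    """E[T^(t)] for i.i.d. uniform view choices: the t-th power of the mean operator."""
    return np.linalg.matrix_power(np.mean(operators, axis=0), t)

def brute_force_expectation(operators, t):
    """Average of P_t ... P_1 over every possible view sequence of length t."""
    total = np.zeros_like(operators[0])
    sequences = list(itertools.product(range(len(operators)), repeat=t))
    for seq in sequences:
        T = np.eye(operators[0].shape[0])
        for v in seq:
            T = operators[v] @ T
        total += T
    return total / len(sequences)

def spectral_entropy(T):
    """Assumed reading of S(.): Shannon entropy of the normalised singular values."""
    s = np.linalg.svd(T, compute_uv=False)
    p = s / s.sum()
    p = p[p > 0]
    return float(-(p * np.log(p)).sum())

rng = np.random.default_rng(0)
operators = [random_stochastic(6, rng) for _ in range(2)]
closed_form = expected_operator(operators, t=3)
enumerated = brute_force_expectation(operators, t=3)
entropy_by_length = {t: spectral_entropy(expected_operator(operators, t)) for t in range(1, 9)}
```

Scanning `entropy_by_length` is one way to compare candidate lengths, since the expected operator loses spectral content as it approaches its stationary limit.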

7. Theoretical and Practical Considerations

MDTs generalize static multi-view fusion approaches, providing greater flexibility, noise robustness, and expressivity; e.g., convex MDTs can automatically down-weight unreliable views (Debaussart-Joniec et al., 1 Dec 2025). They retain key theoretical properties: ergodicity, unique stationary distributions, and metric preservation in trajectory-dependent diffusion spaces. Because even random MDTs often match or outperform classic fixed-fusion techniques, principled operator-space exploration and learning are essential for state-of-the-art performance.

In MDT-driven generative architectures, ablation studies confirm critical dependencies on attention design, pose conditioning, noise scheduling, and training curriculum: removal of causal mechanisms, geometric injection, or context augmentation leads to degraded consistency, drift, or collapse (Kong et al., 8 Sep 2025, Xu et al., 2024). For manifold learning, MDTs enable smooth embedding of heterogeneous or missing-view data (Lindenbaum et al., 2015, Debaussart-Joniec et al., 1 Dec 2025).

Overall, Multi-view Diffusion Trajectories constitute a comprehensive, theoretically grounded, and highly flexible paradigm for multi-view data geometry, 3D-aware generative modeling, and structured multi-agent or physical scenario synthesis, with formal operator-space design, unsupervised learning procedures, and strong practical performance across domains.