Multi-View Diffusion Trajectories

Updated 4 July 2026

Multi-View Diffusion Trajectories (MDTs) are a family of diffusion processes that generate coupled multi-view states through shared latent representations and structured reverse diffusion paths.
They encompass varied methods—from fixed-view joint denoising and view-time video models to autoregressive camera-conditioned rollouts—ensuring cross-view consistency via architectural and geometric conditioning.
Empirical results highlight improvements in metrics like FID, CLIP, and geometric fidelity, demonstrating MDTs’ impact in novel view synthesis, autonomous planning, and multi-sensor integration.

Multi-View Diffusion Trajectories (MDTs) is not yet a standardized formal object in the arXiv literature. The phrase is best used as an umbrella description for diffusion processes whose denoising path is jointly structured across multiple viewpoints, synchronized video streams, or camera-conditioned trajectory queries, rather than over isolated images. Under that reading, MDT-related systems range from fixed-grid multi-view latent refiners such as Sharp-It to camera-controllable multi-view video models such as Cavia, autoregressive novel-view systems such as CausNVS, and dynamic 4D generators such as Diffusion² and 4Diffusion; what unifies them is that the evolving variable during reverse diffusion is a coupled multi-view state, and not a collection of independent per-view samples.

1. Conceptual scope and defining properties

In current usage, MDTs are best understood as a family resemblance rather than a single formalism. Some systems denoise a fixed ordered set of rendered views packed into one latent tensor; some denoise a view-time lattice; some roll out camera-conditioned views autoregressively; and some use diffusion to generate physical or control trajectories that then condition multi-view rendering. A recurring distinction is whether the coupled object is a set of views, a video tensor with a view dimension, or a trajectory/control representation that later anchors image generation.

A second distinction concerns how cross-view consistency is obtained. In several image-first systems, consistency is largely implicit, learned, and architectural: all views are processed jointly, and attention over the packed representation serves as cross-view communication. In other systems, camera geometry is injected explicitly through Plücker coordinates, pairwise-relative pose encodings, or epipolar weighting. A third distinction is whether denoising is joint and parallel over a fixed bundle of views, or causal/autoregressive over a camera trajectory. These differences matter because they determine whether an MDT can support arbitrary camera queries, streaming inference, or only a fixed camera layout.
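As a hedged illustration of explicit geometric conditioning, the sketch below computes per-pixel Plücker ray coordinates from a pinhole intrinsics matrix and a camera-to-world pose. The function name `plucker_rays`, the pixel-center convention, and the (moment, direction) channel order are assumptions made for this example, not the convention of any specific system discussed here.

```python
import numpy as np

def plucker_rays(K, cam_to_world, height, width):
    """Return an (H, W, 6) map of Plücker coordinates (o x d, d) per pixel."""
    # Pixel centers in homogeneous image coordinates.
    u, v = np.meshgrid(np.arange(width) + 0.5, np.arange(height) + 0.5)
    pix = np.stack([u, v, np.ones_like(u)], axis=-1)           # (H, W, 3)
    # Back-project to camera-frame directions, then rotate into the world frame.
    dirs_cam = pix @ np.linalg.inv(K).T
    R, origin = cam_to_world[:3, :3], cam_to_world[:3, 3]
    dirs = dirs_cam @ R.T
    dirs = dirs / np.linalg.norm(dirs, axis=-1, keepdims=True)
    # The moment o x d does not depend on which point of the ray is chosen.
    moment = np.cross(np.broadcast_to(origin, dirs.shape), dirs)
    return np.concatenate([moment, dirs], axis=-1)

# Illustrative camera: made-up intrinsics and a translated, unrotated pose.
K = np.array([[50.0, 0.0, 16.0], [0.0, 50.0, 12.0], [0.0, 0.0, 1.0]])
pose = np.eye(4)
pose[:3, 3] = [0.5, -0.2, 1.0]
rays = plucker_rays(K, pose, height=24, width=32)
```

A six-channel map of this kind can be resized to the latent resolution and concatenated channel-wise with the latent input, which is the general pattern the text describes for camera-conditioned models.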

MDT regime	Representative systems	Core coupling mechanism
Fixed-view joint latent denoising	Sharp-It, MultiImageDream, MVDiff	Shared latent tensor, self-attention or dense 3D attention, sometimes epipolar weighting
Multi-view video denoising	Cavia, Diffusion², 4Diffusion	Joint view-time tensor or row/column score composition
Autoregressive camera-trajectory rollout	CausNVS, Virtually Being	Causal attention with camera encodings and sequential view generation
Trajectory-conditioned control and rendering	RiskMV-DPO, TransDiffuser, MBD	Diffusion over planned trajectories or trajectory-conditioned controls

A common misconception is that any multi-view diffusion model is automatically an MDT. That is too broad. A fixed view grid with no arbitrary pose support and no explicit trajectory variable is MDT-like only in the weaker sense that its reverse diffusion path is jointly defined over multiple views. Conversely, a planner such as TransDiffuser is not a canonical multi-view image model, but it is relevant because diffusion is used to generate a trajectory distribution conditioned on rich multi-sensor context, and that trajectory view of diffusion transfers directly to MDT design.

2. Joint latent denoising over fixed view sets

A major MDT branch operates on a fixed, ordered set of target views and treats the entire set as one denoised object. Sharp-It is exemplary. It starts from a coarse but 3D-consistent object produced by Shap-E, renders it into six predefined camera views arranged as a grid, and applies a multi-view-to-multi-view latent diffusion refiner before reconstructing with a feed-forward sparse-view model such as InstantMesh. The model is built on Zero123++, uses Stable Diffusion’s VAE latent space, expands the UNet input to 8 channels (4 for noisy latents and 4 for VAE-encoded degraded renderings), and relies on global self-attention over the packed six-view grid. The training objective is standard latent diffusion with v-prediction, with a CFG drop probability of 0.1. The method preserves consistency through the coarse 3D prior, joint processing, and conditioning on the degraded view set, but it does not introduce explicit epipolar attention, triplanes, pose embeddings, differentiable reprojection, or geometric consistency losses. On its Objaverse-derived paired dataset, Sharp-It reports FID, CLIP, and DINO scores against the strongest reported Zero123++ with SDEdit baseline, with runtime on the order of seconds.
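To make the v-prediction objective and the 4 + 4 channel input concrete, here is a minimal NumPy sketch. It assumes a variance-preserving schedule with alpha² + sigma² = 1; the tile size, the cosine parameterization of the schedule, and the helper names `v_prediction_pair` and `v_loss` are illustrative choices, not details taken from Sharp-It.

```python
import numpy as np

rng = np.random.default_rng(0)

def v_prediction_pair(x0, eps, alpha, sigma):
    """Noisy latent z_t and v-target under a variance-preserving schedule."""
    z_t = alpha * x0 + sigma * eps
    v = alpha * eps - sigma * x0
    return z_t, v

def v_loss(v_pred, v_true):
    """Mean squared error between predicted and target v."""
    return float(np.mean((v_pred - v_true) ** 2))

# Hypothetical packed multi-view latent: 4 channels on a 24x16 tile.
clean = rng.normal(size=(4, 24, 16))        # clean target latents
degraded = rng.normal(size=(4, 24, 16))     # stand-in for encoded degraded renderings
eps = rng.normal(size=clean.shape)

t = 0.3                                      # illustrative diffusion time in [0, 1]
alpha, sigma = np.cos(0.5 * np.pi * t), np.sin(0.5 * np.pi * t)
z_t, v_target = v_prediction_pair(clean, eps, alpha, sigma)

# Channel-wise concatenation gives the 8-channel denoiser input (4 + 4).
unet_input = np.concatenate([z_t, degraded], axis=0)

# A perfect v-prediction recovers the clean latent in closed form.
x0_from_v = alpha * z_t - sigma * v_target
```

Because the whole view grid lives in one tensor, a single noise level and a single loss apply to all views at once, which is what makes the reverse path a joint multi-view trajectory rather than six independent ones.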

MultiImageDream shows the same fixed-bundle logic under image prompting rather than 3D refinement. It extends ImageDream, itself derived from MVDream, from one prompt image to multiple prompt images without fine-tuning. MVDream jointly denoises four orthogonal target views with densely connected 3D attention over a stacked feature map. ImageDream adds a local controller based on resampled CLIP features and a pixel controller that appends a prompt-image latent, which changes the shape of the attention tensor. MultiImageDream generalizes this by concatenating multiple local-token banks and stacking multiple prompt-image latents, so the attention tensor grows with the number of prompt images. The generated four-view latents therefore evolve under a shared denoising trajectory that is continuously anchored by multiple observed-view conditions. Quantitatively, the strongest gains appear on synthesized multi-view imagery: for example, the “2-ImageDream - pixel(f) + local(fb)” configuration improves on the single-image baseline across the reported QIS, CLIP(TX), and CLIP(IM) scores, while 3D gains after SDS-based NeRF optimization are present but more limited.

MVDiff occupies a related but more geometry-explicit position. It builds a Scene Representation Transformer that aggregates one or more source views into a latent scene representation, predicts a coarse target latent, and feeds multiple target views jointly into a latent diffusion UNet. Its most explicit geometric device is epipolar attention: for each pair of views, it builds a weighted affinity correction from inverse epipolar distance and uses it to modify the attention affinities. The paper states that target views are predicted simultaneously rather than sequentially. In ablation, removing epipolar attention worsens the reported PSNR/SSIM/LPIPS, and removing multi-view attention is evaluated as a separate variant. On GSO novel-view synthesis with one reference view, MVDiff reports PSNR, SSIM, and LPIPS that improve over Zero123-XL; for downstream GSO reconstruction it reports Chamfer Distance and Volume IoU with one input view, and improves further with more reference views.
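The idea of weighting cross-view affinities by inverse epipolar distance can be sketched as follows. This is an illustration under stated assumptions, not MVDiff's formula: the `1 / (distance + eps)` weighting, the helper names, and the toy two-camera setup are all hypothetical.

```python
import numpy as np

def skew(t):
    """Cross-product matrix so that skew(t) @ x equals np.cross(t, x)."""
    return np.array([[0.0, -t[2], t[1]], [t[2], 0.0, -t[0]], [-t[1], t[0], 0.0]])

def fundamental_matrix(K_a, K_b, R, t):
    """F with x_b^T F x_a = 0, where points transform as X_b = R X_a + t."""
    return np.linalg.inv(K_b).T @ skew(t) @ R @ np.linalg.inv(K_a)

def epipolar_weights(pix_a, pix_b, F, eps=1.0):
    """(Na, Nb) weights from inverse point-to-epipolar-line distance in view b."""
    ha = np.concatenate([pix_a, np.ones((len(pix_a), 1))], axis=1)
    hb = np.concatenate([pix_b, np.ones((len(pix_b), 1))], axis=1)
    lines = ha @ F.T                        # row i: epipolar line of pixel i in view b
    normal = np.linalg.norm(lines[:, :2], axis=1, keepdims=True)
    dist = np.abs(lines @ hb.T) / normal    # distances in pixels
    return 1.0 / (dist + eps)

K = np.array([[100.0, 0.0, 32.0], [0.0, 100.0, 32.0], [0.0, 0.0, 1.0]])
R, t = np.eye(3), np.array([0.2, 0.0, 0.0])   # view b is a sideways shift of view a
F = fundamental_matrix(K, K, R, t)

X_a = np.array([0.1, -0.05, 2.0])             # a 3D point in view-a coordinates
x_a = (K @ X_a)[:2] / X_a[2]
X_b = R @ X_a + t
x_b = (K @ X_b)[:2] / X_b[2]
off_line = x_b + np.array([0.0, 10.0])        # 10 px away from the epipolar line

w = epipolar_weights(x_a[None], np.stack([x_b, off_line]), F)
```

The true correspondence receives the maximum weight and the off-line pixel a much smaller one. One plausible use, assumed here rather than taken from the paper, is to add the logarithm of such weights to cross-view attention logits so that geometrically implausible matches are suppressed.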

These fixed-view systems illustrate a narrow but important MDT regime. Their trajectory is the reverse diffusion path of a packed multi-view latent, usually under a fixed camera layout. This suggests that synchronized denoising alone can be a strong source of multi-view coherence, but also that flexibility in camera graphs, trajectory length, and online querying is limited unless the formulation is widened beyond a fixed bundle.

3. Multi-view video denoising on view-time lattices

A stronger MDT interpretation appears when diffusion is defined over both view and time. Cavia does this explicitly. It extends Stable Video Diffusion to camera-controllable multi-view video generation and represents the latent state as a tensor with one axis for the number of views and one for the frames per view, in addition to the channel and spatial axes. Camera control is encoded by Plücker ray coordinates derived from extrinsics and intrinsics, concatenated channel-wise with latent inputs. The architectural core is View-Integrated Attention: cross-frame attention rearranges features so that attention spans all spatiotemporal tokens within a view, and cross-view attention rearranges them so that synchronized timesteps from all views attend jointly. This gives direct communication between different camera trajectories of the same scene. Cavia is trained with EDM-style denoising score matching on a mixture of static scene/object multi-view videos, synthetic multi-view dynamic videos, and monocular dynamic videos with estimated poses. Its ablations are unusually MDT-relevant: removing cross-view attention causes different object motions to appear in different views, and removing cross-frame attention causes severe distortions. On RealEstate10K monocular camera control, Cavia reports FID, FVD, and COLMAP error; on two-view generation it reports FID, FVD, Prec., and MS. on Real10K, outperforming CameraCtrl in the reported table.
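The two rearrangements behind cross-frame and cross-view attention can be sketched with plain reshapes. The version below is a simplified stand-in, assuming single-head attention with identity projections and a `(views, frames, tokens, channels)` layout; these choices and the function names are illustrative, not Cavia's implementation.

```python
import numpy as np

def attention(x):
    """Single-head self-attention with identity projections over axis -2."""
    scores = x @ np.swapaxes(x, -1, -2) / np.sqrt(x.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ x

def cross_frame_attention(z):
    """Tokens attend across all frames and positions within each view."""
    V, F, N, C = z.shape
    return attention(z.reshape(V, F * N, C)).reshape(V, F, N, C)

def cross_view_attention(z):
    """Tokens attend across all views and positions at each synchronized frame."""
    V, F, N, C = z.shape
    per_frame = z.transpose(1, 0, 2, 3).reshape(F, V * N, C)
    out = attention(per_frame).reshape(F, V, N, C)
    return out.transpose(1, 0, 2, 3)

rng = np.random.default_rng(0)
z = rng.normal(size=(2, 3, 4, 8))         # (views, frames, spatial tokens, channels)
z_other = z.copy()
z_other[1] = rng.normal(size=(3, 4, 8))   # change only the second view

frame_out = cross_frame_attention(z)
frame_out_other = cross_frame_attention(z_other)
view_out = cross_view_attention(z)
view_out_other = cross_view_attention(z_other)
```

Changing the second view leaves the cross-frame output of the first view untouched but alters its cross-view output, which is the information path that the reported ablation removes when cross-view attention is disabled.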

Diffusion² addresses the same joint structure from a different angle. Rather than training a native multi-view video model, it composes a pretrained video diffusion prior and a pretrained multi-view diffusion prior over a dense image array indexed by view and frame. Its key theorem assumes conditional independence between the same-frame multi-view context and same-view temporal context given a center cell, yielding a score identity that relates the joint score over the array to the row score, the column score, and the single-image marginal score. In practice the unknown single-image marginal score is approximated by a convex combination of the row and column scores, with the mixing weight following a logistic schedule whose best-performing parameters are reported in the paper. This is presented as a way to decouple geometry-consistent generation and temporally smooth appearance during denoising. The output is a dense multi-view, multi-frame lattice used to optimize 4D Gaussian Splatting. The paper reports an end-to-end runtime on the order of minutes and gives user-study evidence that Diffusion² improves geometric consistency and overall model quality over Animate124 and DreamGaussian4D, and it also compares CLIP similarity on video-to-4D against Efficient4D.
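The score identity follows from the conditional-independence assumption alone, and the convex-combination approximation then collapses it to a weighted sum of the two pretrained scores. In the derivation below, $x$ (center cell), $\mathcal{V}$ (same-frame multi-view context), $\mathcal{F}$ (same-view temporal context), and the logistic parameters $k$ and $t_0$ are notation assumed for this sketch and may differ from the paper's.

```latex
\begin{aligned}
p(x, \mathcal{V}, \mathcal{F})
  &= p(x)\, p(\mathcal{V} \mid x)\, p(\mathcal{F} \mid x)
   = \frac{p(x, \mathcal{V})\, p(x, \mathcal{F})}{p(x)}, \\
\nabla_x \log p(x, \mathcal{V}, \mathcal{F})
  &= \nabla_x \log p(x, \mathcal{V}) + \nabla_x \log p(x, \mathcal{F}) - \nabla_x \log p(x), \\
\nabla_x \log p(x)
  &\approx s\, \nabla_x \log p(x, \mathcal{V}) + (1 - s)\, \nabla_x \log p(x, \mathcal{F}),
  \qquad s \in [0, 1], \\
\nabla_x \log p(x, \mathcal{V}, \mathcal{F})
  &\approx (1 - s)\, \nabla_x \log p(x, \mathcal{V}) + s\, \nabla_x \log p(x, \mathcal{F}), \\
s(t) &= \frac{1}{1 + \exp\!\big(-k\,(t - t_0)\big)} \quad \text{(assumed logistic form)}.
\end{aligned}
```

Under this reading, the schedule simply shifts emphasis between the multi-view prior and the video prior as denoising proceeds, without requiring any single-image score model.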

4Diffusion replaces score composition with a learned unified denoiser. It starts from ImageDream, inserts zero-initialized motion modules into a frozen 3D-aware UViT, and trains a multi-view video diffusion model, 4DM, on a curated set of animated Objaverse assets rendered as synchronized multi-view videos. The latent tensor has a viewpoint axis of size five, covering one source monocular view and four target views, alongside its frame, channel, and spatial axes. Spatial modules reuse ImageDream’s 3D self-attention across views, while motion modules reshape the tensor to apply temporal self-attention across frames. The resulting denoiser is used in a 4D-aware SDS objective to optimize a hash-encoded dynamic NeRF, together with an anchor loss and regularizers. In direct multi-view video evaluation against ImageDream, 4DM reports CLIP-I, LPIPS, CLIP-C, and FVD alongside the corresponding ImageDream figures. On full 4D generation, the final system reports CLIP-I, CLIP-C, and FVD, outperforming Consistent4D, DreamGaussian4D, and 4D-fy in the reported benchmark.

Together, these systems show two distinct routes to MDTs over view-time structure. One route composes orthogonal score fields over rows and columns; the other trains a single denoiser whose latent trajectory already spans views and frames. This suggests that “trajectory” can refer both to the reverse diffusion path and to the induced coupling topology over a view-time grid.

4. Camera-controlled and autoregressive trajectory formulations

CausNVS pushes MDTs toward open-ended novel-view synthesis. It addresses the limits of non-autoregressive multi-view diffusion by generating target views sequentially, conditioned on accumulated context and target poses. Given input views

PRESERVED_PLACEHOLDER_36

and target poses

PRESERVED_PLACEHOLDER_37

it represents each frame as

PRESERVED_PLACEHOLDER_38

uses causal masking in frame-wise attention, and trains with independent per-frame noise levels under

PRESERVED_PLACEHOLDER_39

Camera conditioning is handled by pairwise-relative camera pose encoding (CaPE),

PRESERVED_PLACEHOLDER_40

At inference, CausNVS combines pose-aware sliding windows, key-value caching, and noise conditioning augmentation to mitigate drift. It is trained with PRESERVED_PLACEHOLDER_41 and evaluated up to PRESERVED_PLACEHOLDER_42, including rollouts up to PRESERVED_PLACEHOLDER_43 training length. The paper reports strong flexible PRESERVED_PLACEHOLDER_44-to-PRESERVED_PLACEHOLDER_45 synthesis on RealEstate10K, DL3DV, and LLFF, and shows that causal training generalizes more robustly than non-causal alternatives across different sequence lengths (CausNVS, 2025).
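The two training ingredients named above, causal masking in frame-wise attention and independent per-frame noise levels, can be sketched in a few lines of NumPy. This is a minimal illustration rather than the CausNVS implementation: the function names, the fixed number of tokens per frame, and the linear `alpha_bar` schedule are assumptions made for the example.

```python
import numpy as np


def causal_frame_mask(num_frames: int, tokens_per_frame: int) -> np.ndarray:
    """Boolean mask; entry [i, j] is True if query token i may attend to key token j.

    A token may attend to tokens of its own frame and of earlier frames only.
    """
    frame_ids = np.repeat(np.arange(num_frames), tokens_per_frame)
    return frame_ids[:, None] >= frame_ids[None, :]


def noise_frames_independently(frames, alpha_bar, rng):
    """Draw one diffusion step per frame and noise each frame at its own level.

    frames: array of shape (num_frames, dim).
    alpha_bar: 1-D array with the cumulative signal level of each diffusion step.
    """
    steps = rng.integers(0, len(alpha_bar), size=frames.shape[0])
    signal = np.sqrt(alpha_bar[steps])[:, None]
    sigma = np.sqrt(1.0 - alpha_bar[steps])[:, None]
    noise = rng.standard_normal(frames.shape)
    return signal * frames + sigma * noise, steps, noise


rng = np.random.default_rng(0)
alpha_bar = np.linspace(0.999, 0.01, 50)  # hypothetical schedule, illustration only
mask = causal_frame_mask(num_frames=3, tokens_per_frame=2)
noisy, steps, noise = noise_frames_independently(np.zeros((3, 4)), alpha_bar, rng)
```

Because every frame carries its own noise level, a clean context frame and a fully noised target frame can coexist in one sequence, which is what makes sequential rollout possible at inference.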

Virtually Being addresses a related problem from the customization side rather than the NVS side. Its main contribution is a data pipeline that uses 4D Gaussian Splatting to re-render the same captured performance under many virtual camera trajectories; the re-rendered videos are then used to fine-tune camera-controllable video diffusion models for multi-view identity preservation. Human capture uses a synchronized face rig of […] cameras and a synchronized full-body rig of […] cameras; each subject performs […] multi-view sequences lasting about […] to […] frames at […] fps. New training trajectories are created by randomly sampling start and end camera positions within a […] meter radius and linearly interpolating between them, while lighting diversity is added with Lux Post Facto and HDRI maps. Camera information is represented using Plücker coordinates, encoded by a fully convolutional encoder, and injected through a ControlNet-style path into CogVideoX; the paper states that camera conditioning is applied only during the first […]% of denoising timesteps and only into the first […]% of DiT blocks. On evaluation, the customized model reports AdaFace […] versus […] for a frontal-only variant, and its pretrained camera-conditioned version reports TransErr […] and RotErr […]; dynamic-camera customization remains markedly better than static-camera-only customization in camera-control metrics (Virtually Being, 2025).
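The trajectory sampling and the Plücker representation described above can be illustrated with a short sketch. It is not the paper's pipeline: the function names, the radius bounds `r_min` and `r_max`, and the `(moment, direction)` ordering of the six Plücker components are assumptions chosen for the example.

```python
import numpy as np


def sample_linear_camera_path(rng, num_frames, r_min, r_max):
    """Sample start and end camera positions in a spherical shell and interpolate linearly."""

    def sample_position():
        direction = rng.standard_normal(3)
        direction /= np.linalg.norm(direction)
        return rng.uniform(r_min, r_max) * direction

    start, end = sample_position(), sample_position()
    weights = np.linspace(0.0, 1.0, num_frames)[:, None]
    return (1.0 - weights) * start + weights * end


def plucker_ray(origin, direction):
    """Return the 6-D Plücker coordinates (moment, unit direction) of a camera ray."""
    d = direction / np.linalg.norm(direction)
    moment = np.cross(origin, d)
    return np.concatenate([moment, d])


rng = np.random.default_rng(0)
path = sample_linear_camera_path(rng, num_frames=5, r_min=1.0, r_max=2.0)  # hypothetical radii
ray = plucker_ray(origin=path[0], direction=-path[0])  # ray looking toward the scene origin
```

In practice one such ray is computed per pixel, giving a six-channel map per frame that a convolutional encoder can consume.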

The contrast between CausNVS and Virtually Being is instructive. CausNVS is a general autoregressive camera-trajectory diffusion model with explicit causal rollout and arbitrary PRESERVED_PLACEHOLDER_46-to-PRESERVED_PLACEHOLDER_47 support. Virtually Being, by contrast, is a customization framework that binds identity to camera-conditioned video denoising through richly re-rendered supervision. This suggests two complementary MDT strategies: one can either build camera-trajectory flexibility into the sampler itself, or shape a strong conditional diffusion backbone with trajectory-rich data.

5. Trajectory-conditioned control and scenario generation beyond rendering

A second MDT-adjacent research line treats the trajectory itself as the primary generated object. TransDiffuser is a diffusion-based end-to-end planner for autonomous driving whose conditioning is inherently multi-modal and partially multi-view. It predicts future ego trajectories

PRESERVED_PLACEHOLDER_48

uses an action-space parameterization following TrajHF, and conditions a denoising decoder on fused scene features

PRESERVED_PLACEHOLDER_49

The reverse update follows standard DDPM noise prediction, while the main novelty is a decorrelation regularizer over the fused multi-modal representation: PRESERVED_PLACEHOLDER_50 with PRESERVED_PLACEHOLDER_51, and total loss

PRESERVED_PLACEHOLDER_52

using PRESERVED_PLACEHOLDER_53. The paper reports PDMS […] on NAVSIM, or […] in the main table, with […] denoising steps and […] trajectory candidates. Its own discussion is explicit that “multi-modal representation” here means fused sensor/modal features rather than a true per-view camera-token model; this makes it relevant to MDTs by analogy rather than by direct multi-view image coupling (Jiang et al., 14 May 2025).
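To make the idea of a decorrelation regularizer concrete, the sketch below penalizes the off-diagonal entries of the feature correlation matrix computed over a batch. This is one common form of such a penalty and is an assumption for illustration; it is not claimed to match the exact regularizer or weighting used in TransDiffuser, and the function name is hypothetical.

```python
import numpy as np


def decorrelation_penalty(features: np.ndarray, eps: float = 1e-8) -> float:
    """Sum of squared off-diagonal correlations between feature dimensions.

    features: array of shape (batch, dim) holding fused representations.
    """
    centered = features - features.mean(axis=0, keepdims=True)
    normalized = centered / (centered.std(axis=0, keepdims=True) + eps)
    corr = normalized.T @ normalized / features.shape[0]
    off_diagonal = corr - np.diag(np.diag(corr))
    return float(np.sum(off_diagonal ** 2))


x = np.array([1.0, -1.0, 1.0, -1.0])
y = np.array([1.0, 1.0, -1.0, -1.0])
redundant = decorrelation_penalty(np.column_stack([x, x]))    # duplicated feature
independent = decorrelation_penalty(np.column_stack([x, y]))  # uncorrelated features
```

A penalty of this kind would be added to the denoising loss with a small weight, so that the fused representation spreads information across dimensions instead of duplicating it.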

RiskMV-DPO is closer to an MDT pattern in the rendering sense because it generates risk-conditioned future trajectories and 3D boxes first, then uses them as geometric anchors for multi-view scenario diffusion. The decomposition is explicit: PRESERVED_PLACEHOLDER_54 Risk is defined from relative displacement, velocities, approach cues, lateral attenuation, and type coefficients; the per-agent risk at time PRESERVED_PLACEHOLDER_55 is

PRESERVED_PLACEHOLDER_56

and the motion generator is trained so that generated modes match a target risk level. Diffusion training then adds geometry-appearance alignment and Region-Aware DPO with a fused 3D multi-view mask. On nuScenes, the final system reports FID […], FVD […], and mAP […], versus FID […], FVD […], and mAP […] for MagicDriveV2. The ablations also show monotonic gains in MV-SSIM from […] to […] and improved Depth AbsRel from […] to […] as motion-aware masking, 3D multi-view masks, VGGT geometry features, and alignment are added (RiskMV-DPO, 2026).

Model-Based Diffusion for Trajectory Optimization is not a multi-view image model, but it is highly relevant to MDT methodology because it shows that diffusion trajectories can be driven directly by known model information rather than learned denoisers. It defines a target trajectory density

PRESERVED_PLACEHOLDER_57

over full trajectory vectors PRESERVED_PLACEHOLDER_58, approximates the score of the noised density by Monte Carlo,

PRESERVED_PLACEHOLDER_59

and performs deterministic reverse updates. With PRESERVED_PLACEHOLDER_60, the update reduces to a CEM-type weighted mean, which the paper uses to explain its connection to sampling-based optimization. This suggests a broader MDT design principle: different “views” of a trajectory distribution (dynamics, cost, constraints, demonstrations) can be fused as probabilistic factors inside diffusion-style iterative refinement, even when no trained multi-view denoiser exists (Model-Based Diffusion for Trajectory Optimization, 2024).
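The CEM-type weighted mean mentioned above can be sketched as follows. The sketch assumes a Gaussian proposal around the current iterate and exponential weights of the form exp(-cost / temperature); the function name, the quadratic toy cost, and all parameter values are hypothetical and do not reproduce the paper's exact estimator.

```python
import numpy as np


def weighted_mean_update(y, cost_fn, sigma, temperature, num_samples, rng):
    """One sampling-based refinement step: perturb, weight by cost, and average."""
    candidates = y + sigma * rng.standard_normal((num_samples, y.size))
    costs = np.array([cost_fn(c) for c in candidates])
    # Subtract the minimum cost before exponentiating for numerical stability.
    weights = np.exp(-(costs - costs.min()) / temperature)
    weights /= weights.sum()
    return weights @ candidates


def toy_cost(trajectory):
    """Hypothetical quadratic cost with its minimum at the all-ones trajectory."""
    return float(np.sum((trajectory - 1.0) ** 2))


rng = np.random.default_rng(0)
y0 = np.zeros(4)
y1 = weighted_mean_update(y0, toy_cost, sigma=1.0, temperature=0.1, num_samples=512, rng=rng)
```

No denoiser is trained here: the cost function alone steers the iterate, which is the sense in which known model information replaces a learned score.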

These trajectory-conditioned systems expand MDTs beyond novel-view synthesis. They imply that “multi-view” need not mean only multiple cameras; it can also mean multiple structured factors over the same future trajectory. That is an inference from their formulations, but it is a practically important one for planning-oriented diffusion research.

6. Mathematical foundations, diagnostics, and open issues

The deepest theoretical precursor is "MultiView Diffusion Maps," which defines a diffusion process on a multiview state space of paired sample-view indices PRESERVED_PLACEHOLDER_61. For PRESERVED_PLACEHOLDER_62 views and PRESERVED_PLACEHOLDER_63 samples per view, it constructs a block kernel PRESERVED_PLACEHOLDER_64 with zero diagonal blocks and off-diagonal blocks PRESERVED_PLACEHOLDER_65, row-normalizes it to obtain a Markov operator

PRESERVED_PLACEHOLDER_66

and interprets PRESERVED_PLACEHOLDER_67 as a cross-view random walk in which “staying in the same view … is forbidden.” Diffusion coordinates are then

PRESERVED_PLACEHOLDER_68

and multi-view diffusion distances compare rows of PRESERVED_PLACEHOLDER_69. This is not generative diffusion in the DDPM sense, but it is a rigorous operator-theoretic model of cross-view diffusion trajectories and remains one of the clearest mathematical templates for MDTs as trajectories on a coupled multi-view state space (Lindenbaum et al., 2015).
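The block construction above can be made concrete with a small NumPy sketch. It assumes Gaussian per-view kernels and takes each off-diagonal block to be the product of the two corresponding per-view kernels; both choices, along with the function names and the bandwidth, are assumptions for illustration rather than a transcription of the paper's definitions.

```python
import numpy as np


def gaussian_kernel(points: np.ndarray, bandwidth: float) -> np.ndarray:
    """Pairwise Gaussian affinities between the rows of `points`."""
    sq_dists = np.sum((points[:, None, :] - points[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq_dists / (2.0 * bandwidth ** 2))


def multiview_markov_operator(views, bandwidth=1.0):
    """Build a row-stochastic operator whose walk must change view at every step.

    views: list of arrays, one per view, each of shape (num_samples, dim_of_view).
    """
    num_views, num_samples = len(views), views[0].shape[0]
    kernels = [gaussian_kernel(v, bandwidth) for v in views]
    size = num_views * num_samples
    block_kernel = np.zeros((size, size))
    for a in range(num_views):
        for b in range(num_views):
            if a != b:  # diagonal blocks stay zero: no same-view transitions
                rows = slice(a * num_samples, (a + 1) * num_samples)
                cols = slice(b * num_samples, (b + 1) * num_samples)
                block_kernel[rows, cols] = kernels[a] @ kernels[b]
    return block_kernel / block_kernel.sum(axis=1, keepdims=True)


rng = np.random.default_rng(0)
views = [rng.standard_normal((5, 2)), rng.standard_normal((5, 3))]
P = multiview_markov_operator(views)
```

Powers of `P` then describe multi-step cross-view walks, and comparing its rows gives a distance between sample-view pairs.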

A complementary trajectory-centric perspective comes from "Let us Build Bridges," which treats diffusion models as latent variable models over full trajectories PRESERVED_PLACEHOLDER_70 and develops PRESERVED_PLACEHOLDER_71-bridges and PRESERVED_PLACEHOLDER_72-bridges. The learnable constrained model

PRESERVED_PLACEHOLDER_73

is built by starting from a bridge process that already satisfies endpoint constraints and then adding a learnable drift while preserving support in the constrained domain. This suggests that MDTs with hard multi-view consistency or structured-output constraints may be more naturally formulated as bridge processes than as free reverse-time denoisers (Let us Build Bridges, 2022).
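As a reminder of what a process that already satisfies endpoint constraints looks like, the sketch below simulates a standard Brownian bridge on the unit time interval with an Euler-Maruyama scheme. It is a textbook construction used here for illustration only: the learnable drift discussed above is omitted, and the function name and step count are assumptions.

```python
import numpy as np


def brownian_bridge_path(start, end, num_steps, rng):
    """Simulate dZ = (end - Z) / (1 - t) dt + dW on [0, 1], pinned at `end` when t = 1."""
    dt = 1.0 / num_steps
    z = np.array(start, dtype=float)
    end = np.asarray(end, dtype=float)
    path = [z.copy()]
    for k in range(num_steps):
        t = k * dt
        z = z + (end - z) / (1.0 - t) * dt
        if k < num_steps - 1:
            # Noise is skipped on the final step so the path lands on the endpoint.
            z = z + np.sqrt(dt) * rng.standard_normal(z.shape)
        path.append(z.copy())
    return np.stack(path)


rng = np.random.default_rng(0)
path = brownian_bridge_path(start=[0.0, 0.0], end=[1.0, -1.0], num_steps=100, rng=rng)
```

Every sampled path reaches the prescribed endpoint regardless of the noise draws, which is the property a constrained generative model can inherit before any drift is learned.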

A more diagnostic view is provided by "Tracing the Roots," which treats a diffusion trajectory itself as a discriminative object. For each time step PRESERVED_PLACEHOLDER_74, it extracts

PRESERVED_PLACEHOLDER_75

concatenates them over time, and shows that the full temporal signature supports origin attribution better than single-step thresholds. On CIFAR-10 membership inference, the paper reports AUC […] using all features and all steps, versus […] for the Matsumoto single-step baseline; on CelebA-HQ 256, it reports […] versus […]. While this is not a multi-view image-generation paper, it demonstrates that diffusion trajectories carry structured information beyond terminal samples, which is directly relevant if MDTs are to be analyzed rather than merely built (Floros et al., 2024).

Across the literature, three open issues recur. First, many systems remain tied to a fixed view layout: Sharp-It uses six predefined views in a fixed PRESERVED_PLACEHOLDER_76 grid, MVDiff is evaluated on preset novel-view bundles, and 4Diffusion uses orthogonal views in canonical coordinates. Second, consistency is often architectural rather than explicitly geometric: several methods lack epipolar losses, reprojection constraints, or depth supervision. Third, scalability with longer horizons, more views, or arbitrary camera graphs remains uneven: Cavia and CausNVS address this more directly, but even they retain synchronized timesteps and equal-length sequences. A related misconception is that better visual quality automatically implies better geometric fidelity. The reported mAP gains in RiskMV-DPO, the depth and MV-SSIM ablations, and the detector-compatibility improvements suggest that geometry-aware conditioning matters separately from appearance realism (RiskMV-DPO, 2026).

The field therefore contains both a practical and a theoretical split. Practically, MDTs are already realized in several incompatible but productive forms: fixed-grid joint denoisers, view-time video models, autoregressive camera-conditioned rollouts, and trajectory-first control pipelines. Theoretically, the strongest unifying ideas currently come from multiview Markov operators, bridge processes, and explicit analysis of diffusion trajectories. This suggests that a future formal theory of MDTs would likely need to combine all three: a coupled state space over views, a trajectory law over denoising time, and constraint mechanisms strong enough to preserve geometry under flexible camera or control trajectories.


References (14)

1. Sharp-It: A Multi-view to Multi-view Diffusion Model for 3D Synthesis and Manipulation (2024)
2. Cavia: Camera-controllable Multi-view Video Diffusion with View-Integrated Attention (2024)
3. Multi-view Image Prompted Multi-view Diffusion for Improved 3D Generation (2024)
4. MVDiff: Scalable and Flexible Multi-View Diffusion for 3D Object Reconstruction from Single-View (2024)
5. Diffusion²: Dynamic 3D Content Generation via Score Composition of Video and Multi-view Diffusion Models (2024)
6. 4Diffusion: Multi-view Video Diffusion Model for 4D Generation (2024)
7. TransDiffuser: End-to-end Trajectory Generation with Decorrelated Multi-modal Representation for Autonomous Driving (2025)
8. MultiView Diffusion Maps (2015)
9. Tracing the Roots: Leveraging Temporal Dynamics in Diffusion Trajectories for Origin Attribution (2024)
10. CausNVS: Autoregressive Multi-view Diffusion for Flexible 3D Novel View Synthesis (2025)
11. Virtually Being: Customizing Camera-Controllable Video Diffusion Models with Multi-View Performance Captures (2025)
12. Risk-Controllable Multi-View Diffusion for Driving Scenario Generation (2026)
13. Model-Based Diffusion for Trajectory Optimization (2024)
14. Let us Build Bridges: Understanding and Extending Diffusion Generative Models (2022)
