
Multi-View Diffusion Trajectories

Updated 4 July 2026
  • Multi-View Diffusion Trajectories (MDTs) are a family of diffusion processes that generate coupled multi-view states through shared latent representations and structured reverse diffusion paths.
  • They encompass varied methods—from fixed-view joint denoising and view-time video models to autoregressive camera-conditioned rollouts—ensuring cross-view consistency via architectural and geometric conditioning.
  • Empirical results highlight improvements in metrics like FID, CLIP, and geometric fidelity, demonstrating MDTs’ impact in novel view synthesis, autonomous planning, and multi-sensor integration.

Multi-View Diffusion Trajectories (MDTs) are not yet a standardized formal object in the arXiv literature. The phrase is best used as an umbrella description for diffusion processes whose denoising path is jointly structured across multiple viewpoints, synchronized video streams, or camera-conditioned trajectory queries, rather than over isolated images.
Under that reading, MDT-related systems range from fixed-grid multi-view latent refiners such as Sharp-It to camera-controllable multi-view video models such as Cavia, autoregressive novel-view systems such as CausNVS, and dynamic 4D generators such as Diffusion$^2$ and 4Diffusion; what unifies them is that the evolving variable during reverse diffusion is a coupled multi-view state, not a collection of independent per-view samples (Edelstein et al., 2024; Xu et al., 2024; Yang et al., 2024; Zhang et al., 2024).

1. Conceptual scope and defining properties

In current usage, MDTs are best understood as a family resemblance rather than a single formalism. Some systems denoise a fixed ordered set of rendered views packed into one latent tensor; some denoise a view-time lattice; some roll out camera-conditioned views autoregressively; and some use diffusion to generate physical or control trajectories that then condition multi-view rendering. A recurring distinction is whether the coupled object is a set of views, a video tensor with a view dimension, or a trajectory/control representation that later anchors image generation.

A second distinction concerns how cross-view consistency is obtained. In several image-first systems, consistency is largely implicit, learned, and architectural: all views are processed jointly, and attention over the packed representation serves as cross-view communication. In other systems, camera geometry is injected explicitly through Plücker coordinates, pairwise-relative pose encodings, or epipolar weighting. A third distinction is whether denoising is joint and parallel over a fixed bundle of views, or causal/autoregressive over a camera trajectory. These differences matter because they determine whether an MDT can support arbitrary camera queries, streaming inference, or only a fixed camera layout.
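As a concrete illustration of explicit camera conditioning, the following is a minimal sketch of per-pixel Plücker ray coordinates computed from intrinsics and extrinsics. The function name `plucker_rays`, the world-to-camera convention `x_cam = R @ x_world + t`, and the half-pixel offset are assumptions made for this example and are not taken from any of the cited systems.

```python
import numpy as np

def plucker_rays(K, R, t, height, width):
    """Per-pixel Plücker coordinates (direction, moment) for a pinhole camera.

    Assumed convention: x_cam = R @ x_world + t, pixel centers at (u + 0.5, v + 0.5).
    Returns an array of shape (height, width, 6).
    """
    origin = -R.T @ t  # camera center in world coordinates
    u, v = np.meshgrid(np.arange(width) + 0.5, np.arange(height) + 0.5)
    pixels = np.stack([u, v, np.ones_like(u)], axis=-1)       # (H, W, 3) homogeneous pixels
    dirs_cam = pixels @ np.linalg.inv(K).T                    # back-project through K^-1
    dirs_world = dirs_cam @ R                                 # row-vector form of R.T @ d
    dirs_world /= np.linalg.norm(dirs_world, axis=-1, keepdims=True)
    moment = np.cross(np.broadcast_to(origin, dirs_world.shape), dirs_world)  # o x d
    return np.concatenate([dirs_world, moment], axis=-1)
```

The resulting six-channel map has the same spatial layout as an image, which is what makes channel-wise concatenation with image or latent inputs straightforward once it is resized to the latent resolution.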

| MDT regime | Representative systems | Core coupling mechanism |
| --- | --- | --- |
| Fixed-view joint latent denoising | Sharp-It, MultiImageDream, MVDiff | Shared latent tensor, self-attention or dense 3D attention, sometimes epipolar weighting |
| Multi-view video denoising | Cavia, Diffusion$^2$, 4Diffusion | Joint view-time tensor or row/column score composition |
| Autoregressive camera-trajectory rollout | CausNVS, Virtually Being | Causal attention with camera encodings and sequential view generation |
| Trajectory-conditioned control and rendering | RiskMV-DPO, TransDiffuser, MBD | Diffusion over planned trajectories or trajectory-conditioned controls |

A common misconception is that any multi-view diffusion model is automatically an MDT. That is too broad. A fixed view grid with no arbitrary pose support and no explicit trajectory variable is MDT-like only in the weaker sense that its reverse diffusion path is jointly defined over multiple views. Conversely, a planner such as TransDiffuser is not a canonical multi-view image model, but it is relevant because diffusion is used to generate a trajectory distribution conditioned on rich multi-sensor context, and that trajectory view of diffusion transfers directly to MDT design (Jiang et al., 14 May 2025).

2. Joint latent denoising over fixed view sets

A major MDT branch operates on a fixed, ordered set of target views and treats the entire set as one denoised object. Sharp-It is exemplary. It starts from a coarse but 3D-consistent object produced by Shap-E, renders it into six predefined camera views arranged as a $3 \times 2$ grid, and applies a multi-view-to-multi-view latent diffusion refiner before reconstructing with a feed-forward sparse-view model such as InstantMesh. The model is built on Zero123++, uses Stable Diffusion’s VAE latent space, expands the UNet input to 8 channels (4 for noisy latents and 4 for VAE-encoded degraded renderings), and relies on global self-attention over the packed six-view grid. The training objective is standard latent diffusion with v-prediction, with a classifier-free guidance (CFG) drop probability of 0.1. The method preserves consistency through the coarse 3D prior, joint processing, and conditioning on the degraded view set, but it does not introduce explicit epipolar attention, triplanes, pose embeddings, differentiable reprojection, or geometric consistency losses. On its Objaverse-derived paired dataset, Sharp-It reports better FID, CLIP, and DINO scores than the strongest reported Zero123++ with SDEdit baseline, with a runtime on the order of ten seconds (Edelstein et al., 2024).
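The channel layout and v-prediction target described above can be sketched in a few lines. This is a generic illustration of v-prediction under a variance-preserving noise schedule, not Sharp-It's training code; the function name, argument names, and the latent shapes are assumptions made for the example.

```python
import numpy as np

def v_prediction_example(clean, degraded, alpha_t, sigma_t, rng):
    """Build an 8-channel denoiser input and the v-prediction regression target.

    clean, degraded: latents of shape (4, H, W) for the packed multi-view grid.
    alpha_t, sigma_t: signal and noise scales, assumed to satisfy
    alpha_t**2 + sigma_t**2 == 1 (variance-preserving schedule).
    """
    eps = rng.standard_normal(clean.shape)
    noisy = alpha_t * clean + sigma_t * eps                  # forward diffusion sample
    v_target = alpha_t * eps - sigma_t * clean               # standard v-prediction target
    unet_input = np.concatenate([noisy, degraded], axis=0)   # 4 noisy + 4 condition channels
    return unet_input, v_target, noisy
```

Under the stated schedule assumption, the clean latent can be recovered as `alpha_t * noisy - sigma_t * v`, which is why a network trained on this target can be used directly for denoising.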

MultiImageDream shows the same fixed-bundle logic under image prompting rather than 3D refinement. It extends ImageDream, itself derived from MVDream, from one prompt image to multiple prompt images without fine-tuning. MVDream jointly denoises four orthogonal target views with densely connected 3D attention over a feature map that stacks all four views. ImageDream adds a local controller based on resampled CLIP features and a pixel controller that appends a prompt-image latent, which enlarges the attention tensor accordingly. MultiImageDream generalizes this by concatenating multiple local-token banks and stacking multiple prompt-image latents, so that the attention tensor grows with the number of prompt images. The generated four-view latents therefore evolve under a shared denoising trajectory that is continuously anchored by multiple observed-view conditions.
Quantitatively, the strongest gains appear on synthesized multi-view imagery: for example, the “2-ImageDream - pixel(f) + local(fb)” configuration improves on the single-image baseline in QIS, CLIP(TX), and CLIP(IM), while 3D gains after SDS-based NeRF optimization are present but more limited (Kim et al., 2024).

MVDiff occupies a related but more geometry-explicit position. It builds a Scene Representation Transformer that aggregates one or more source views into a latent scene representation, predicts a coarse target latent, and feeds multiple target views jointly into a latent diffusion UNet. Its most explicit geometric device is epipolar attention: for each pair of views, it builds a weighted affinity correction from inverse epipolar distance and uses it to modify the attention affinities between the two views.

The paper states that target views are predicted simultaneously rather than sequentially. In ablation, removing epipolar attention degrades PSNR, SSIM, and LPIPS relative to the full model, and a variant without multi-view attention is reported alongside it. On GSO novel-view synthesis with one reference view, MVDiff reports PSNR, SSIM, and LPIPS that improve over Zero123-XL; for downstream GSO reconstruction it reports Chamfer Distance and Volume IoU with one input view, and improves further with more reference views (Bourigault et al., 2024).
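To make the idea of epipolar weighting concrete, here is a minimal sketch that computes point-to-epipolar-line distances between two views and turns them into an attention bias. The fundamental-matrix construction is standard two-view geometry; the additive form `scale / (1 + distance)` and all function names are assumptions for illustration and should not be read as MVDiff's exact formula.

```python
import numpy as np

def skew(t):
    """Cross-product matrix [t]_x such that skew(t) @ v == np.cross(t, v)."""
    return np.array([[0.0, -t[2], t[1]], [t[2], 0.0, -t[0]], [-t[1], t[0], 0.0]])

def fundamental_matrix(K1, K2, R, t):
    """F with p2^T F p1 = 0, assuming X2 = R @ X1 + t maps camera-1 to camera-2 coordinates."""
    return np.linalg.inv(K2).T @ skew(t) @ R @ np.linalg.inv(K1)

def epipolar_distance(F, p1, p2):
    """Distance from each pixel in p2 (M, 2) to the epipolar line of each pixel in p1 (N, 2)."""
    h1 = np.concatenate([p1, np.ones((len(p1), 1))], axis=1)
    h2 = np.concatenate([p2, np.ones((len(p2), 1))], axis=1)
    lines = h1 @ F.T                                   # (N, 3) line coefficients in view 2
    num = np.abs(lines @ h2.T)                         # (N, M) unnormalized distances
    den = np.linalg.norm(lines[:, :2], axis=1, keepdims=True)
    return num / np.maximum(den, 1e-12)

def epipolar_weighted_affinity(affinity, dist, scale=1.0):
    """Add an inverse-distance bonus to raw attention logits (assumed functional form)."""
    return affinity + scale / (1.0 + dist)
```

Tokens that lie close to the epipolar line of the query pixel receive a larger logit, so after the softmax the attention mass concentrates on geometrically plausible correspondences.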

These fixed-view systems illustrate a narrow but important MDT regime. Their trajectory is the reverse diffusion path of a packed multi-view latent, usually under a fixed camera layout. This suggests that synchronized denoising alone can be a strong source of multi-view coherence, but also that flexibility in camera graphs, trajectory length, and online querying is limited unless the formulation is widened beyond a fixed bundle.

3. Multi-view video denoising on view-time lattices

A stronger MDT interpretation appears when diffusion is defined over both view and time. Cavia does this explicitly. It extends Stable Video Diffusion to camera-controllable multi-view video generation and represents the latent state as a single tensor with separate axes for the number of views and the number of frames per view. Camera control is encoded by Plücker ray coordinates derived from extrinsics and intrinsics, concatenated channel-wise with latent inputs. The architectural core is View-Integrated Attention: cross-frame attention rearranges features so that attention spans all spatiotemporal tokens within a view, and cross-view attention rearranges them so that synchronized timesteps from all views attend jointly. This gives direct communication between different camera trajectories of the same scene. Cavia is trained with EDM-style denoising score matching on a mixture of static scene/object multi-view videos, synthetic multi-view dynamic videos, and monocular dynamic videos with estimated poses. Its ablations are unusually MDT-relevant: removing cross-view attention causes different object motions to appear in different views, and removing cross-frame attention causes severe distortions. On RealEstate10K monocular camera control, Cavia reports FID, FVD, and COLMAP error; on two-view generation on RealEstate10K it reports FID, FVD, Prec., and MS., outperforming CameraCtrl in the reported table (Xu et al., 2024).
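The two rearrangements behind View-Integrated Attention can be sketched with plain reshapes. The axis order `(V, T, H, W, C)`, the function names, and the projection-free single-head attention are assumptions chosen to keep the example short; they are not Cavia's implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(tokens):
    """Single-head attention without learned projections; tokens: (batch, n, c)."""
    logits = tokens @ tokens.transpose(0, 2, 1) / np.sqrt(tokens.shape[-1])
    return softmax(logits) @ tokens

def cross_frame_attention(x):
    """x: (V, T, H, W, C). Each view attends over all of its own T*H*W tokens."""
    V, T, H, W, C = x.shape
    out = self_attention(x.reshape(V, T * H * W, C))
    return out.reshape(V, T, H, W, C)

def cross_view_attention(x):
    """Each timestep attends over the V*H*W tokens of all synchronized views."""
    V, T, H, W, C = x.shape
    tokens = x.transpose(1, 0, 2, 3, 4).reshape(T, V * H * W, C)
    out = self_attention(tokens).reshape(T, V, H, W, C)
    return out.transpose(1, 0, 2, 3, 4)
```

Only the batch axis changes between the two calls: batching over views isolates each camera trajectory, while batching over timesteps lets information flow between views at the same instant.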

Diffusion$^2$ addresses the same joint structure from a different angle. Rather than training a native multi-view video model, it composes a pretrained video diffusion prior and a pretrained multi-view diffusion prior over a dense view-by-frame image array.

Its key theorem assumes conditional independence between the same-frame multi-view context and same-view temporal context given a center cell. Writing $x$ for the center cell, $\mathcal{V}$ for its same-frame multi-view context, and $\mathcal{T}$ for its same-view temporal context, this assumption yields a score identity of the form

$$\nabla_x \log p(x \mid \mathcal{V}, \mathcal{T}) = \nabla_x \log p(x \mid \mathcal{V}) + \nabla_x \log p(x \mid \mathcal{T}) - \nabla_x \log p(x).$$

In practice the unknown single-image marginal score is approximated by a convex combination of the row and column scores, weighted by a logistic schedule over the denoising process, and the paper reports a best-performing setting of the schedule parameters. This is presented as a way to decouple geometry-consistent generation and temporally smooth appearance during denoising. The output is a dense multi-view, multi-frame lattice used to optimize 4D Gaussian Splatting. The paper reports an end-to-end runtime on the order of ten minutes and gives user-study evidence that Diffusion$^2$ improves geometric consistency and overall model quality over Animate124 and DreamGaussian4D, and it compares CLIP similarity on video-to-4D against Efficient4D (Yang et al., 2024).
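A small numerical sketch shows what the composition reduces to once the marginal score is replaced by a convex combination. The logistic parameters `k` and `t0`, the normalized time variable `t`, and the choice of which prior the weight favors at each end of the schedule are hypothetical placeholders, not the values or conventions reported in the paper.

```python
import numpy as np

def logistic_weight(t, k=10.0, t0=0.5):
    """Hypothetical logistic schedule mapping normalized time t to a weight in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-k * (t - t0)))

def composed_score(score_view, score_time, t, k=10.0, t0=0.5):
    """Combine a multi-view score and a temporal score for one cell of the lattice.

    Uses score_view + score_time - marginal, with the marginal approximated by
    lam * score_view + (1 - lam) * score_time.
    """
    lam = logistic_weight(t, k, t0)
    marginal = lam * score_view + (1.0 - lam) * score_time
    return score_view + score_time - marginal
```

Algebraically the result equals `(1 - lam) * score_view + lam * score_time`, so under this approximation the composed score is itself a time-dependent blend of the two pretrained priors.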

4Diffusion replaces score composition with a learned unified denoiser. It starts from ImageDream, inserts zero-initialized motion modules into a frozen 3D-aware UViT, and trains a multi-view video diffusion model, 4DM, on a curated set of 926 animated Objaverse assets rendered as synchronized multi-view videos. The latent tensor carries both a view axis and a frame axis, with five viewpoints: one source monocular view and four target views. Spatial modules reuse ImageDream’s 3D self-attention across views, while motion modules reshape the tensor to apply temporal self-attention across frames. The resulting denoiser is used in a 4D-aware SDS objective to optimize a hash-encoded dynamic NeRF, together with an anchor loss and regularizers. In direct multi-view video evaluation, 4DM is compared with ImageDream on CLIP-I, LPIPS, CLIP-C, and FVD. On full 4D generation, the final system reports CLIP-I, CLIP-C, and FVD, outperforming Consistent4D, DreamGaussian4D, and 4D-fy in the reported benchmark (Zhang et al., 2024).
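The role of zero initialization can be illustrated with a toy residual temporal module: because its output projection starts at zero, inserting it leaves the frozen spatial model's activations unchanged until training updates that projection. The class name, the `(V, T, N, C)` layout, and the single-head attention are assumptions for this sketch rather than the 4Diffusion architecture.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

class ZeroInitTemporalModule:
    """Residual temporal self-attention whose output projection is initialized to zero."""

    def __init__(self, channels, rng):
        self.w_in = rng.standard_normal((channels, channels)) / np.sqrt(channels)
        self.w_out = np.zeros((channels, channels))  # zero init: module starts as identity

    def __call__(self, x):
        """x: (V, T, N, C) with V views, T frames, N spatial tokens, C channels."""
        V, T, N, C = x.shape
        seq = x.transpose(0, 2, 1, 3).reshape(V * N, T, C)   # frames become the sequence axis
        h = seq @ self.w_in
        attn = softmax(h @ h.transpose(0, 2, 1) / np.sqrt(C))
        mixed = (attn @ h) @ self.w_out
        residual = mixed.reshape(V, N, T, C).transpose(0, 2, 1, 3)
        return x + residual
```

Starting from an exact identity means the pretrained multi-view behavior is preserved at the first training step, and temporal mixing is introduced only as far as the data requires.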

Together, these systems show two distinct routes to MDTs over view-time structure. One route composes orthogonal score fields over rows and columns; the other trains a single denoiser whose latent trajectory already spans views and frames. This suggests that “trajectory” can refer both to the reverse diffusion path and to the induced coupling topology over a view-time grid.

4. Camera-controlled and autoregressive trajectory formulations

CausNVS pushes MDTs toward open-ended novel-view synthesis. It addresses the limits of non-autoregressive multi-view diffusion by generating target views sequentially, conditioned on accumulated context and target poses. Given a set of input views and a sequence of target poses, it builds a per-frame representation, uses causal masking in frame-wise attention, and trains with independent per-frame noise levels. Camera conditioning is handled by pairwise-relative camera pose encoding (CaPE).

At inference, CausNVS combines pose-aware sliding windows, key-value caching, and noise conditioning augmentation to mitigate drift. It is trained at a fixed sequence length and evaluated on longer sequences, including rollouts that exceed the training length. The paper reports strong synthesis under flexible input and target view counts on RealEstate10K, DL3DV, and LLFF, and shows that causal training generalizes more robustly than non-causal alternatives across different sequence lengths.
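The interaction between causal masking, a bounded context window, and independent per-frame noise levels can be sketched as follows. This version selects the window by frame index only, which is a simplification of the pose-aware windows mentioned above; the function names and the additive noise parameterization are likewise assumptions for illustration.

```python
import numpy as np

def causal_window_mask(num_frames, window):
    """mask[i, j] is True when frame i may attend to frame j (j <= i and i - j < window)."""
    i = np.arange(num_frames)[:, None]
    j = np.arange(num_frames)[None, :]
    return (j <= i) & (i - j < window)

def per_frame_noise(latents, sigmas, rng):
    """Add Gaussian noise with an independent level per frame.

    latents: (F, ...) array of frame latents; sigmas: (F,) noise levels.
    """
    shape = (-1,) + (1,) * (latents.ndim - 1)
    return latents + sigmas.reshape(shape) * rng.standard_normal(latents.shape)

def masked_logits(logits, mask):
    """Block disallowed frame pairs by setting their logits to -inf before the softmax."""
    return np.where(mask, logits, -np.inf)
```

Because every frame has its own noise level during training, the same network can later treat earlier frames as clean or lightly noised context while denoising the current frame, which is what makes autoregressive rollout with a cache possible.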

Virtually Being addresses a related problem from the customization side rather than the NVS side. Its main contribution is a data pipeline that uses 4D Gaussian Splatting to re-render the same captured performance under many virtual camera trajectories, thereby fine-tuning camera-controllable video diffusion models for multi-view identity preservation. Human capture uses a 12max_results12query12^ synchronized camera face rig and a 12all:multi-view diffusion trajectories OR all:multi-view diffusion OR all:camera-controllable multi-view video diffusion OR all:autoregressive multi-view diffusion novel view synthesis12(Edelstein et al., 2024) OR (Xu et al., 2024) OR (Kim et al., 2024) OR (Bourigault et al., 2024) OR (Yang et al., 2024) OR (Zhang et al., 2024) OR (Jiang et al., 14 May 2025) OR (Lindenbaum et al., 2015) OR (Floros et al., 2024)12query12^ synchronized camera full-body rig; each subject performs 12sort_by1212(Edelstein et al., 2024) OR (Xu et al., 2024) OR (Kim et al., 2024) OR (Bourigault et al., 2024) OR (Yang et al., 2024) OR (Zhang et al., 2024) OR (Jiang et al., 14 May 2025) OR (Lindenbaum et al., 2015) OR (Floros et al., 2024)12^ multi-view sequences lasting about 12query12query12^ to 12all:multi-view diffusion trajectories OR all:multi-view diffusion OR all:camera-controllable multi-view video diffusion OR all:autoregressive multi-view diffusion novel view synthesis12sort_by12query12^ frames at 12max_results12relevance12^ fps. New training trajectories are created by randomly sampling start and end camera positions within a 12max_results1212all:multi-view diffusion trajectories OR all:multi-view diffusion OR all:camera-controllable multi-view video diffusion OR all:autoregressive multi-view diffusion novel view synthesis12query12^ meter radius and linearly interpolating between them, while lighting diversity is added with Lux Post Facto and HDRI maps.
Camera information is represented using Plücker coordinates, encoded by a fully convolutional encoder, and injected through a ControlNet-style path into CogVideoX; the paper states that camera conditioning is applied only during the first 12relevance12query12% of denoising timesteps and only into the first 12max_results12query12% of DiT blocks. On evaluation, the customized model reports AdaFace 12query12.12sort_by12query12all:multi-view diffusion trajectories OR all:multi-view diffusion OR all:camera-controllable multi-view video diffusion OR all:autoregressive multi-view diffusion novel view synthesis12^ versus 12query12.12sort_by12max_results12max_results12 for a frontal-only variant, and its pretrained camera-conditioned version reports TransErr 12query12.12max_results12(Edelstein et al., 2024) OR (Xu et al., 2024) OR (Kim et al., 2024) OR (Bourigault et al., 2024) OR (Yang et al., 2024) OR (Zhang et al., 2024) OR (Jiang et al., 14 May 2025) OR (Lindenbaum et al., 2015) OR (Floros et al., 2024)12max_results12^ and RotErr 12query12.12query12relevance12max_results12; dynamic-camera customization remains markedly better than static-camera-only customization in camera-control metrics (&&&12max_results12max_results12&&&).
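The trajectory sampling and Plücker ray encoding mentioned above can be illustrated with a short sketch. It is not the Virtually Being pipeline: the helper names, the unit radius, and the (moment, direction) ordering of the six Plücker components are assumptions chosen for the example, and the ray directions here are random stand-ins for per-pixel rays.

```python
import numpy as np

def sample_in_ball(radius, rng):
    """Uniformly sample a 3D point inside a ball of the given radius."""
    direction = rng.standard_normal(3)
    direction /= np.linalg.norm(direction)
    return direction * radius * rng.uniform() ** (1.0 / 3.0)

def linear_camera_path(start, end, num_frames):
    """Linearly interpolate camera positions from start to end (inclusive)."""
    t = np.linspace(0.0, 1.0, num_frames)[:, None]
    return (1.0 - t) * start + t * end

def plucker_rays(origin, directions):
    """Encode rays as 6D Plücker coordinates (o x d, d) with unit direction d."""
    d = directions / np.linalg.norm(directions, axis=-1, keepdims=True)
    moment = np.cross(origin, d)
    return np.concatenate([moment, d], axis=-1)

rng = np.random.default_rng(0)
radius = 1.0  # hypothetical value for illustration only
start, end = sample_in_ball(radius, rng), sample_in_ball(radius, rng)
path = linear_camera_path(start, end, num_frames=5)
rays = plucker_rays(path[2], rng.standard_normal((10, 3)))
```

The moment component is always orthogonal to the direction, so the encoding identifies a ray independently of which point along it is chosen as the origin.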

The contrast between CausNVS and Virtually Being is instructive. CausNVS is a general autoregressive camera-trajectory diffusion model with explicit causal rollout and arbitrary PRESERVED_PLACEHOLDER_12relevance12(Edelstein et al., 2024) OR (Xu et al., 2024) OR (Kim et al., 2024) OR (Bourigault et al., 2024) OR (Yang et al., 2024) OR (Zhang et al., 2024) OR (Jiang et al., 14 May 2025) OR (Lindenbaum et al., 2015) OR (Floros et al., 2024)12-to-PRESERVED_PLACEHOLDER_12relevance12max_results12^ 12query12^ support. Virtually Being, by contrast, is a customization framework that binds identity to camera-conditioned video denoising through richly re-rendered supervision. This suggests two complementary MDT strategies: one can either build camera-trajectory flexibility into the sampler itself, or shape a strong conditional diffusion backbone with trajectory-rich data.

5. Trajectory-conditioned control and scenario generation beyond rendering

A second MDT-adjacent research line treats the trajectory itself as the primary generated object. TransDiffuser is a diffusion-based end-to-end planner for autonomous driving whose conditioning is inherently multi-modal and partially multi-view. It predicts future ego trajectories

PRESERVED_PLACEHOLDER_12relevance12sort_by12^

uses an action-space parameterization following TrajHF, and conditions a denoising decoder on fused scene features

PRESERVED_PLACEHOLDER_12relevance12relevance12^

The reverse update follows standard DDPM noise prediction, while the main novelty is a decorrelation regularizer over the fused multi-modal representation: PRESERVED_PLACEHOLDER_12query12query12^ with PRESERVED_PLACEHOLDER_12query12all:multi-view diffusion trajectories OR all:multi-view diffusion OR all:camera-controllable multi-view video diffusion OR all:autoregressive multi-view diffusion novel view synthesis12, and total loss

PRESERVED_PLACEHOLDER_12query12max_results12^

using PRESERVED_PLACEHOLDER_12query12sort_by12. The paper reports PDMS 12relevance12relevance12.12sort_by12query12 on NAVSIM, or 12relevance12relevance12.12relevance12 in the main table, with 12all:multi-view diffusion trajectories OR all:multi-view diffusion OR all:camera-controllable multi-view video diffusion OR all:autoregressive multi-view diffusion novel view synthesis12query12^ denoising steps and 12sort_by12query12^ trajectory candidates. Its own discussion is explicit that “multi-modal representation” here means fused sensor/modal features rather than a true per-view camera-token model; this makes it relevant to MDTs by analogy rather than by direct multi-view image coupling (&&&12(Edelstein et al., 2024) OR (Xu et al., 2024) OR (Kim et al., 2024) OR (Bourigault et al., 2024) OR (Yang et al., 2024) OR (Zhang et al., 2024) OR (Jiang et al., 14 May 2025) OR (Lindenbaum et al., 2015) OR (Floros et al., 2024)12&&&).
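To make the idea of a decorrelation regularizer concrete, the sketch below penalizes off-diagonal entries of the feature correlation matrix computed over a batch. This is a generic decorrelation penalty under stated assumptions, not necessarily the exact regularizer used by TransDiffuser; the function name and the normalization by the number of off-diagonal entries are illustrative choices.

```python
import numpy as np

def decorrelation_penalty(features, eps=1e-8):
    """Mean squared off-diagonal correlation of a (batch, dim) feature matrix.

    Returns roughly 0 when feature dimensions are uncorrelated across the batch
    and roughly 1 when all dimensions are perfectly correlated.
    """
    n, d = features.shape
    z = features - features.mean(axis=0)
    z = z / (z.std(axis=0) + eps)          # standardize each dimension
    corr = z.T @ z / n                     # (dim, dim) correlation matrix
    off_diag = corr - np.diag(np.diag(corr))
    return float((off_diag ** 2).sum() / (d * (d - 1)))

rng = np.random.default_rng(0)
independent = rng.standard_normal((5000, 4))
redundant = np.repeat(rng.standard_normal((5000, 1)), 4, axis=1)
penalty_independent = decorrelation_penalty(independent)
penalty_redundant = decorrelation_penalty(redundant)
```

In a training loop, a term of this kind would be added to the denoising loss with a small weight, so that the fused representation is discouraged from collapsing onto a few redundant directions.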

RiskMV-DPO is closer to an MDT pattern in the rendering sense because it generates risk-conditioned future trajectories and 3D boxes first, then uses them as geometric anchors for multi-view scenario diffusion. The decomposition is explicit: PRESERVED_PLACEHOLDER_12query12relevance12^ Risk is defined from relative displacement, velocities, approach cues, lateral attenuation, and type coefficients; the per-agent risk at time PRESERVED_PLACEHOLDER_12query12query12^ is

PRESERVED_PLACEHOLDER_12query12(Edelstein et al., 2024) OR (Xu et al., 2024) OR (Kim et al., 2024) OR (Bourigault et al., 2024) OR (Yang et al., 2024) OR (Zhang et al., 2024) OR (Jiang et al., 14 May 2025) OR (Lindenbaum et al., 2015) OR (Floros et al., 2024)12^

and the motion generator is trained so that generated modes match a target risk level. Diffusion training then adds geometry-appearance alignment and Region-Aware DPO with a fused 3D multi-view mask. On nuScenes, the final system reports FID 12all:multi-view diffusion trajectories OR all:multi-view diffusion OR all:camera-controllable multi-view video diffusion OR all:autoregressive multi-view diffusion novel view synthesis12query12.12max_results12query12^, FVD 12sort_by12max_results12.12(Edelstein et al., 2024) OR (Xu et al., 2024) OR (Kim et al., 2024) OR (Bourigault et al., 2024) OR (Yang et al., 2024) OR (Zhang et al., 2024) OR (Jiang et al., 14 May 2025) OR (Lindenbaum et al., 2015) OR (Floros et al., 2024)12query12^, and mAP 12sort_by12query12.12query12query12, versus FID 12max_results12query12.12relevance12all:multi-view diffusion trajectories OR all:multi-view diffusion OR all:camera-controllable multi-view video diffusion OR all:autoregressive multi-view diffusion novel view synthesis12^, FVD 12relevance12relevance12.12sort_by12relevance12, and mAP 12all:multi-view diffusion trajectories OR all:multi-view diffusion OR all:camera-controllable multi-view video diffusion OR all:autoregressive multi-view diffusion novel view synthesis12sort_by12.12all:multi-view diffusion trajectories OR all:multi-view diffusion OR all:camera-controllable multi-view video diffusion OR all:autoregressive multi-view diffusion novel view synthesis12max_results12^ for MagicDriveV2.
The ablations also show monotonic gains in MV-SSIM from 12query12.12sort_by12all:multi-view diffusion trajectories OR all:multi-view diffusion OR all:camera-controllable multi-view video diffusion OR all:autoregressive multi-view diffusion novel view synthesis12max_results12^ to 12query12.12sort_by12query12(Edelstein et al., 2024) OR (Xu et al., 2024) OR (Kim et al., 2024) OR (Bourigault et al., 2024) OR (Yang et al., 2024) OR (Zhang et al., 2024) OR (Jiang et al., 14 May 2025) OR (Lindenbaum et al., 2015) OR (Floros et al., 2024)12^ and improved Depth AbsRel from 12query12.12max_results12query12query12 to 12query12.12max_results12query12relevance12 as motion-aware masking, 3D multi-view masks, VGGT geometry features, and alignment are added (&&&12max_results12relevance12&&&).

Model-Based Diffusion for Trajectory Optimization is not a multi-view image model, but it is highly relevant to MDT methodology because it shows that diffusion trajectories can be driven directly by known model information rather than learned denoisers. It defines a target trajectory density

PRESERVED_PLACEHOLDER_12query12max_results12^

over full trajectory vectors PRESERVED_PLACEHOLDER_12query12sort_by12, approximates the score of the noised density by Monte Carlo,

PRESERVED_PLACEHOLDER_12query12relevance12^

and performs deterministic reverse updates. With PRESERVED_PLACEHOLDER_12(Edelstein et al., 2024) OR (Xu et al., 2024) OR (Kim et al., 2024) OR (Bourigault et al., 2024) OR (Yang et al., 2024) OR (Zhang et al., 2024) OR (Jiang et al., 14 May 2025) OR (Lindenbaum et al., 2015) OR (Floros et al., 2024)12query12, the update reduces to a CEM-type weighted mean, which the paper uses to explain its connection to sampling-based optimization. This suggests a broader MDT design principle: different “views” of a trajectory distribution—dynamics, cost, constraints, demonstrations—can be fused as probabilistic factors inside diffusion-style iterative refinement, even when no trained multi-view denoiser exists (&&&12max_results12query12&&&).
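The Monte Carlo score estimate and deterministic reverse update can be sketched on a toy problem. The code below is an illustration under assumptions rather than the paper's implementation: the target density is taken to be proportional to exp(-cost / temperature), the noise schedule, sample count, and temperature are arbitrary choices, and the quadratic cost stands in for a real trajectory objective with dynamics and constraints.

```python
import numpy as np

def model_based_diffusion(cost, dim, n_steps=50, n_samples=2048,
                          temperature=0.5, seed=0):
    """Denoise a trajectory vector using only evaluations of a known cost function."""
    rng = np.random.default_rng(seed)
    betas = np.linspace(1e-3, 0.1, n_steps)       # assumed noise schedule
    alphas = 1.0 - betas
    alpha_bars = np.cumprod(alphas)
    y = rng.standard_normal(dim)                  # start from pure noise
    for i in range(n_steps - 1, -1, -1):
        a_bar = alpha_bars[i]
        # Candidate clean trajectories consistent with the current noisy iterate.
        mean = y / np.sqrt(a_bar)
        std = np.sqrt(1.0 / a_bar - 1.0)
        candidates = mean + std * rng.standard_normal((n_samples, dim))
        # Weight candidates by the (unnormalized) target density.
        log_w = -cost(candidates) / temperature
        w = np.exp(log_w - log_w.max())
        w /= w.sum()
        y0_hat = w @ candidates                   # weighted mean of candidates
        # Monte Carlo score estimate, then a deterministic reverse step.
        score = -(y - np.sqrt(a_bar) * y0_hat) / (1.0 - a_bar)
        y = (y + (1.0 - a_bar) * score) / np.sqrt(alphas[i])
    return y

target = np.array([1.0, -1.0, 0.5, 2.0])
toy_cost = lambda Y: np.sum((Y - target) ** 2, axis=-1)
solution = model_based_diffusion(toy_cost, dim=4)
```

Substituting the score estimate into the update shows that each step rescales the weighted candidate mean, which is the sense in which the procedure resembles a cross-entropy-method update with an annealed sampling radius.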

These trajectory-conditioned systems expand MDTs beyond novel-view synthesis. They imply that “multi-view” need not mean only multiple cameras; it can also mean multiple structured factors over the same future trajectory. That is an inference from their formulations, but it is a practically important one for planning-oriented diffusion research.

The deepest theoretical precursor is "MultiView Diffusion Maps," which defines a diffusion process on a multiview state space of paired sample-view indices PRESERVED_PLACEHOLDER_12(Edelstein et al., 2024) OR (Xu et al., 2024) OR (Kim et al., 2024) OR (Bourigault et al., 2024) OR (Yang et al., 2024) OR (Zhang et al., 2024) OR (Jiang et al., 14 May 2025) OR (Lindenbaum et al., 2015) OR (Floros et al., 2024)12all:multi-view diffusion trajectories OR all:multi-view diffusion OR all:camera-controllable multi-view video diffusion OR all:autoregressive multi-view diffusion novel view synthesis12. For PRESERVED_PLACEHOLDER_12(Edelstein et al., 2024) OR (Xu et al., 2024) OR (Kim et al., 2024) OR (Bourigault et al., 2024) OR (Yang et al., 2024) OR (Zhang et al., 2024) OR (Jiang et al., 14 May 2025) OR (Lindenbaum et al., 2015) OR (Floros et al., 2024)12max_results12^ views and PRESERVED_PLACEHOLDER_12(Edelstein et al., 2024) OR (Xu et al., 2024) OR (Kim et al., 2024) OR (Bourigault et al., 2024) OR (Yang et al., 2024) OR (Zhang et al., 2024) OR (Jiang et al., 14 May 2025) OR (Lindenbaum et al., 2015) OR (Floros et al., 2024)12sort_by12^ samples per view, it constructs a block kernel PRESERVED_PLACEHOLDER_12(Edelstein et al., 2024) OR (Xu et al., 2024) OR (Kim et al., 2024) OR (Bourigault et al., 2024) OR (Yang et al., 2024) OR (Zhang et al., 2024) OR (Jiang et al., 14 May 2025) OR (Lindenbaum et al., 2015) OR (Floros et al., 2024)12relevance12^ with zero diagonal blocks and off-diagonal blocks PRESERVED_PLACEHOLDER_12(Edelstein et al., 2024) OR (Xu et al., 2024) OR (Kim et al., 2024) OR (Bourigault et al., 2024) OR (Yang et al., 2024) OR (Zhang et al., 2024) OR (Jiang et al., 14 May 2025) OR (Lindenbaum et al., 2015) OR (Floros et al., 2024)12query12, row-normalizes to obtain a Markov operator

PRESERVED_PLACEHOLDER_12(Edelstein et al., 2024) OR (Xu et al., 2024) OR (Kim et al., 2024) OR (Bourigault et al., 2024) OR (Yang et al., 2024) OR (Zhang et al., 2024) OR (Jiang et al., 14 May 2025) OR (Lindenbaum et al., 2015) OR (Floros et al., 2024)12(Edelstein et al., 2024) OR (Xu et al., 2024) OR (Kim et al., 2024) OR (Bourigault et al., 2024) OR (Yang et al., 2024) OR (Zhang et al., 2024) OR (Jiang et al., 14 May 2025) OR (Lindenbaum et al., 2015) OR (Floros et al., 2024)12^

and interprets PRESERVED_PLACEHOLDER_12(Edelstein et al., 2024) OR (Xu et al., 2024) OR (Kim et al., 2024) OR (Bourigault et al., 2024) OR (Yang et al., 2024) OR (Zhang et al., 2024) OR (Jiang et al., 14 May 2025) OR (Lindenbaum et al., 2015) OR (Floros et al., 2024)12max_results12^ as a cross-view random walk in which “staying in the same view … is forbidden.” Diffusion coordinates are then

PRESERVED_PLACEHOLDER_12(Edelstein et al., 2024) OR (Xu et al., 2024) OR (Kim et al., 2024) OR (Bourigault et al., 2024) OR (Yang et al., 2024) OR (Zhang et al., 2024) OR (Jiang et al., 14 May 2025) OR (Lindenbaum et al., 2015) OR (Floros et al., 2024)12sort_by12^

and multi-view diffusion distances compare rows of PRESERVED_PLACEHOLDER_12(Edelstein et al., 2024) OR (Xu et al., 2024) OR (Kim et al., 2024) OR (Bourigault et al., 2024) OR (Yang et al., 2024) OR (Zhang et al., 2024) OR (Jiang et al., 14 May 2025) OR (Lindenbaum et al., 2015) OR (Floros et al., 2024)12relevance12. This is not generative diffusion in the DDPM sense, but it is a rigorous operator-theoretic model of cross-view diffusion trajectories and remains one of the clearest mathematical templates for MDTs as trajectories on a coupled multi-view state space (&&&12max_results12&&&).
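The construction can be made concrete with a small numerical sketch. It assumes Gaussian per-view kernels and takes each off-diagonal block to be the product of the two corresponding per-view kernels; the bandwidth, the function names, and the choice to drop the trivial leading eigenvector are illustrative assumptions rather than details confirmed here.

```python
import numpy as np

def gaussian_kernel(x, sigma):
    """Dense Gaussian affinity matrix for the rows of x."""
    sq = np.sum((x[:, None, :] - x[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq / (2.0 * sigma ** 2))

def multiview_diffusion_map(views, sigma=1.0, n_coords=2, t=1):
    """Cross-view Markov operator and diffusion coordinates for aligned views."""
    kernels = [gaussian_kernel(v, sigma) for v in views]
    n_views, m = len(views), views[0].shape[0]
    K = np.zeros((n_views * m, n_views * m))
    for a in range(n_views):
        for b in range(n_views):
            if a != b:  # diagonal blocks stay zero: no step within the same view
                K[a * m:(a + 1) * m, b * m:(b + 1) * m] = kernels[a] @ kernels[b]
    degrees = K.sum(axis=1)
    P = K / degrees[:, None]                      # row-stochastic operator
    # Symmetric conjugate of P shares its eigenvalues and is safer to decompose.
    S = K / np.sqrt(np.outer(degrees, degrees))
    eigvals, eigvecs = np.linalg.eigh(S)
    eigvals, eigvecs = eigvals[::-1], eigvecs[:, ::-1]
    psi = eigvecs / np.sqrt(degrees)[:, None]     # right eigenvectors of P
    coords = (eigvals[1:n_coords + 1] ** t) * psi[:, 1:n_coords + 1]
    return P, eigvals, coords

rng = np.random.default_rng(0)
views = [rng.standard_normal((20, 3)), rng.standard_normal((20, 2))]
P, eigvals, coords = multiview_diffusion_map(views)
```

Each row of `coords` embeds one (sample, view) pair, so Euclidean distances between rows approximate multi-view diffusion distances truncated to the leading non-trivial eigenvectors.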

A complementary trajectory-centric perspective comes from "Let us Build Bridges," which treats diffusion models as latent variable models over full trajectories PRESERVED_PLACEHOLDER_12max_results12query12^ and develops PRESERVED_PLACEHOLDER_12max_results12all:multi-view diffusion trajectories OR all:multi-view diffusion OR all:camera-controllable multi-view video diffusion OR all:autoregressive multi-view diffusion novel view synthesis12-bridges and PRESERVED_PLACEHOLDER_12max_results12max_results12-bridges. The learnable constrained model

PRESERVED_PLACEHOLDER_12max_results12sort_by12^

is built by starting from a bridge process that already satisfies endpoint constraints and then adding a learnable drift while preserving support in the constrained domain. This suggests that MDTs with hard multi-view consistency or structured-output constraints may be more naturally formulated as bridge processes than as free reverse-time denoisers (&&&12max_results12max_results12&&&).
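As a rough illustration of the bridge idea, the sketch below simulates a Brownian bridge with an optional bounded extra drift using Euler-Maruyama steps. It is a simplified stand-in, not the bridge construction from the paper: the unit time horizon, the fixed `tanh` drift (a placeholder for a learned network), and the omission of noise on the final step (to avoid discretization error at the endpoint) are all assumptions.

```python
import numpy as np

def simulate_bridge(x0, x1, n_steps, rng, extra_drift=None):
    """Simulate a path on [0, 1] that starts at x0 and is pinned to x1 at time 1."""
    dt = 1.0 / n_steps
    x = np.array(x0, dtype=float)
    x1 = np.array(x1, dtype=float)
    path = [x.copy()]
    for k in range(n_steps):
        t = k * dt
        drift = (x1 - x) / (1.0 - t)              # Brownian bridge pinning drift
        if extra_drift is not None:
            drift = drift + extra_drift(x, t)     # bounded drift keeps the endpoint pinned
        # No noise on the last step so the discrete path lands on the endpoint.
        noise = rng.standard_normal(x.shape) if k < n_steps - 1 else 0.0
        x = x + drift * dt + np.sqrt(dt) * noise
        path.append(x.copy())
    return np.stack(path)

rng = np.random.default_rng(0)
endpoint = np.array([2.0, -1.0])
bridge_path = simulate_bridge(np.zeros(2), endpoint, n_steps=1000, rng=rng,
                              extra_drift=lambda x, t: np.tanh(x))
```

The pinning drift grows as time approaches 1 and dominates any bounded extra drift, which is why the endpoint constraint survives the added term up to a discretization error of order `dt`.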

A more diagnostic view is provided by "Tracing the Roots," which treats a diffusion trajectory itself as a discriminative object. For each time step PRESERVED_PLACEHOLDER_12max_results12relevance12, it extracts

PRESERVED_PLACEHOLDER_12max_results12query12^

concatenates them over time, and shows that the full temporal signature supports origin attribution better than single-step thresholds. On CIFAR-10 membership inference, the paper reports AUC 12sort_by12sort_by12.12sort_by12 using all features and all steps, versus 12(Edelstein et al., 2024) OR (Xu et al., 2024) OR (Kim et al., 2024) OR (Bourigault et al., 2024) OR (Yang et al., 2024) OR (Zhang et al., 2024) OR (Jiang et al., 14 May 2025) OR (Lindenbaum et al., 2015) OR (Floros et al., 2024)12sort_by12.12max_results12^ for the Matsumoto single-step baseline; on CelebA-HQ 256, it reports 12all:multi-view diffusion trajectories OR all:multi-view diffusion OR all:camera-controllable multi-view video diffusion OR all:autoregressive multi-view diffusion novel view synthesis12query12query12.12query12^ versus 12sort_by12query12.12max_results12. While this is not a multi-view image-generation paper, it demonstrates that diffusion trajectories carry structured information beyond terminal samples, which is directly relevant if MDTs are to be analyzed rather than merely built (&&&12sort_by12&&&).

Across the literature, three open issues recur. First, many systems remain tied to a fixed view layout: Sharp-It uses six predefined views in a fixed PRESERVED_PLACEHOLDER_12max_results12(Edelstein et al., 2024) OR (Xu et al., 2024) OR (Kim et al., 2024) OR (Bourigault et al., 2024) OR (Yang et al., 2024) OR (Zhang et al., 2024) OR (Jiang et al., 14 May 2025) OR (Lindenbaum et al., 2015) OR (Floros et al., 2024)12^ grid, MVDiff is evaluated on preset novel-view bundles, and 4Diffusion uses orthogonal views in canonical coordinates. Second, consistency is often architectural rather than explicitly geometric: several methods lack epipolar losses, reprojection constraints, or depth supervision. Third, scalability with longer horizons, more views, or arbitrary camera graphs remains uneven: Cavia and CausNVS address this more directly, but even they retain synchronized timesteps and equal-length sequences. A final misconception is that better visual quality automatically implies better geometric fidelity. The reported mAP gains in RiskMV-DPO, the depth and MV-SSIM ablations, and the detector-compatibility improvements suggest that geometry-aware conditioning matters separately from appearance realism (&&&12max_results12relevance12&&&).

The field therefore contains both a practical and a theoretical split. Practically, MDTs are already realized in several incompatible but productive forms: fixed-grid joint denoisers, view-time video models, autoregressive camera-conditioned rollouts, and trajectory-first control pipelines. Theoretically, the strongest unifying ideas currently come from multiview Markov operators, bridge processes, and explicit analysis of diffusion trajectories. This suggests that a future formal theory of MDTs would likely need to combine all three: a coupled state space over views, a trajectory law over denoising time, and constraint mechanisms strong enough to preserve geometry under flexible camera or control trajectories.
