CausNVS: Autoregressive Multi-view Diffusion for Flexible 3D Novel View Synthesis (2509.06579v1)

Published 8 Sep 2025 in cs.CV

Abstract: Multi-view diffusion models have shown promise in 3D novel view synthesis, but most existing methods adopt a non-autoregressive formulation. This limits their applicability in world modeling, as they only support a fixed number of views and suffer from slow inference due to denoising all frames simultaneously. To address these limitations, we propose CausNVS, a multi-view diffusion model in an autoregressive setting, which supports arbitrary input-output view configurations and generates views sequentially. We train CausNVS with causal masking and per-frame noise, using pairwise-relative camera pose encodings (CaPE) for precise camera control. At inference time, we combine a spatially-aware sliding-window with key-value caching and noise conditioning augmentation to mitigate drift. Our experiments demonstrate that CausNVS supports a broad range of camera trajectories, enables flexible autoregressive novel view synthesis, and achieves consistently strong visual quality across diverse settings. Project page: https://kxhit.github.io/CausNVS.html.


Summary

  • The paper presents a causal, autoregressive multi-view diffusion framework that leverages per-frame noise conditioning and relative camera pose encoding for flexible 3D view synthesis.
  • It employs a UNet-based latent diffusion backbone with causal masking and KV caching to support arbitrary input-output configurations and efficient streaming inference.
  • Experimental results on datasets like RealEstate10K and LLFF demonstrate robust spatial consistency, improved generalization, and stable performance over extended autoregressive rollouts.

CausNVS: Autoregressive Multi-view Diffusion for Flexible 3D Novel View Synthesis

Introduction and Motivation

CausNVS introduces a causal, autoregressive multi-view diffusion framework for novel view synthesis (NVS), addressing the limitations of non-autoregressive models that require fixed input-output view configurations and suffer from inefficient inference. The model is designed to support arbitrary numbers of input and output views, enabling sequential, streaming, and interactive applications in 3D scene understanding and world modeling. The core technical contributions include causal masking, per-frame noise conditioning, and relative camera pose encoding (CaPE), which together facilitate robust autoregressive generation and efficient spatial memory via key-value (KV) caching.

Model Architecture and Training

CausNVS builds upon a latent diffusion backbone, specifically a UNet architecture operating in VAE latent space. Frame-wise attention layers are inserted into deeper blocks, equipped with causal masking to ensure each frame is conditioned only on past information. The model is trained on sequences of $F=8$ frames, where each frame is assigned an independently sampled noise level. This per-frame noise conditioning enables the model to learn denoising from partially noisy contexts, reducing the training-inference gap typical of autoregressive generation.
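
A minimal PyTorch-style sketch of such a frame-wise causal attention layer is shown below. The layer placement, dimensions, and the residual wiring are illustrative assumptions for this summary, not the authors' implementation.

```python
import torch
import torch.nn as nn

class FrameWiseCausalAttention(nn.Module):
    """Attention over the frame axis with a causal mask.

    Illustrative sketch of the frame-wise layers described above; the
    paper's exact layer placement and dimensions are not specified here.
    """

    def __init__(self, dim, num_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x):
        # x: (B * H * W, n_frames, dim) -- spatial positions are folded
        # into the batch, so this layer mixes information across frames.
        n_frames = x.shape[1]
        # Boolean mask where True marks disallowed entries: each frame
        # may attend only to itself and to earlier frames.
        mask = torch.triu(
            torch.ones(n_frames, n_frames, dtype=torch.bool, device=x.device),
            diagonal=1)
        out, _ = self.attn(x, x, x, attn_mask=mask)
        return x + out  # residual connection, typical of UNet attention blocks
```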

The training objective is formulated as:

$$\mathcal{L}_{\text{causal}} = \mathbb{E}_{\{(\bm{x}_i, \bm{p}_i), t_i, \bm{\epsilon}_i\}_{i=1}^{F}} \sum_{i=1}^{F} \left\| \hat{\bm{\epsilon}}_{\theta}\left(\bm{v}_i \mid \bm{v}_{<i}\right) - \bm{\epsilon}_i \right\|_2^{2}$$

where $\bm{v}_i = (\bm{z}_i^{t_i}, \bm{p}_i)$ is the latent representation and pose of frame $i$, and $\bm{\epsilon}_i$ is the added noise.
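
For concreteness, a hedged sketch of computing this objective follows. Here `model` stands in for the causal UNet denoiser and `alphas_cumprod` for the diffusion noise schedule; both names and their call signatures are assumptions for illustration, and the paper's actual interfaces may differ.

```python
import torch

def causal_diffusion_loss(model, z0, poses, alphas_cumprod):
    """Epsilon-prediction loss with per-frame timesteps under a causal mask.

    z0:    (B, n_frames, C, H, W) clean VAE latents for a training clip
    poses: (B, n_frames, ...)     per-frame camera poses
    """
    B, n_frames = z0.shape[:2]

    # Independent diffusion timestep per frame, so the model learns to
    # denoise from partially noisy contexts.
    t = torch.randint(0, len(alphas_cumprod), (B, n_frames), device=z0.device)
    eps = torch.randn_like(z0)

    # Forward diffusion applied per frame with its own timestep t_i.
    a = alphas_cumprod[t].view(B, n_frames, 1, 1, 1)
    z_t = a.sqrt() * z0 + (1.0 - a).sqrt() * eps

    # Frame-level causal mask: frame i attends only to frames j <= i.
    causal_mask = torch.tril(
        torch.ones(n_frames, n_frames, dtype=torch.bool, device=z0.device))

    # Predict the noise of every frame given (possibly noisy) earlier
    # frames, matching the L_causal objective above.
    eps_hat = model(z_t, t, poses, attn_mask=causal_mask)
    return ((eps_hat - eps) ** 2).mean()
```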

Relative Camera Pose Encoding (CaPE)

A key innovation is the use of CaPE for pose conditioning. Unlike absolute pose encodings (e.g., Plücker rays, raw extrinsics), CaPE encodes only pairwise-relative camera relationships, making the attention mechanism invariant to global coordinate changes. This enables efficient KV caching and sliding-window attention, as cached attention remains valid even as the reference frame shifts during autoregressive rollout.
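
To make the invariance concrete, here is a minimal sketch in the spirit of CaPE: queries and keys are reshaped into 4-vectors and multiplied blockwise by the camera matrices, so the attention logit between views $i$ and $j$ depends only on the relative pose $P_i^{-1} P_j$. The blockwise scheme and shapes here are illustrative assumptions; the exact CaPE variant used in the paper may differ.

```python
import torch

def cape_transform(q, k, poses):
    """Pairwise-relative camera pose encoding (CaPE), minimal sketch.

    q, k:  (B, n_frames, D) per-frame features, with D divisible by 4
    poses: (B, n_frames, 4, 4) camera-to-world matrices
    """
    B, n, D = q.shape
    q4 = q.view(B, n, D // 4, 4)
    k4 = k.view(B, n, D // 4, 4)

    # q_i <- P_i^{-T} q_i and k_j <- P_j k_j, applied blockwise to 4-dim
    # chunks, so that <q_i', k_j'> = q_i^T (P_i^{-1} P_j) k_j: the attention
    # logit sees only the relative pose, never the global frame.
    inv_T = torch.linalg.inv(poses).transpose(-1, -2)          # (B, n, 4, 4)
    q_out = torch.einsum('bnij,bndj->bndi', inv_T, q4).reshape(B, n, D)
    k_out = torch.einsum('bnij,bndj->bndi', poses, k4).reshape(B, n, D)
    return q_out, k_out
```

Because the logits are invariant to the global frame, cached keys remain reusable when the working coordinate frame is re-anchored during rollout, which is exactly what the sliding-window KV cache described below relies on.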

(Figure 1)

Figure 1: CaPE attention scores vary periodically with rotation and linearly with translation, providing SE(3)-aware inductive bias for spatial consistency.

Inference: KV Caching and Spatial Memory

During inference, CausNVS employs a spatially-aware sliding-window attention mechanism, selecting the top-$K$ nearest views in pose space for context aggregation. KV caching is used to store attention computations from previously generated views, significantly reducing computational overhead. Unlike external memory systems, this approach integrates spatial memory directly into the transformer architecture, supporting efficient and scalable autoregressive generation.
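
A hedged sketch of the window selection follows. The pose-space distance used here is plain camera-center distance, which is an assumption for illustration; a practical implementation might also weight rotational difference between views.

```python
import torch

def select_context_views(query_pose, cached_poses, k=7):
    """Spatially-aware sliding window: pick the top-K cached views whose
    cameras are nearest to the query view (illustrative sketch).

    query_pose:   (4, 4)    camera-to-world matrix of the view to generate
    cached_poses: (N, 4, 4) poses of views whose KV entries are cached
    """
    q_center = query_pose[:3, 3]                 # query camera position
    centers = cached_poses[:, :3, 3]             # (N, 3) cached positions
    dist = torch.linalg.norm(centers - q_center, dim=-1)
    k = min(k, dist.numel())
    idx = torch.topk(dist, k, largest=False).indices
    # Attention for the new view then runs against only these cached
    # key/value entries; CaPE keeps the cache valid as the reference
    # frame shifts during rollout.
    return idx
```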

(Figure 2)

Figure 2: Causal multi-view diffusion pipeline with frame-wise attention, causal masking, CaPE, and KV caching for efficient autoregressive denoising.

Experimental Results

CausNVS is evaluated on RealEstate10K, DL3DV, and LLFF datasets, demonstrating strong performance across diverse scenes and camera trajectories. The model supports flexible $N$-to-$M$ synthesis, generalizing to arbitrary input-output configurations without retraining. Quantitative results show competitive PSNR, SSIM, and LPIPS metrics compared to state-of-the-art baselines, with consistent improvements as more input views are provided.

Notably, CausNVS maintains stable quality over long autoregressive rollouts, up to $10\times$ longer than the training horizon, with only moderate degradation. In contrast, non-causal models exhibit significant drift and instability when evaluated outside their training configuration.

(Figure 3)

Figure 3: Visual comparison of CausNVS and baselines on diverse scenes and trajectories, highlighting robust spatial consistency and generalization.

Flexible Trajectory and Spatial Consistency

CausNVS generalizes to customized camera trajectories, including those with revisited views and non-monotonic motion, maintaining spatial consistency and appearance coherence. The combination of CaPE, causal masking, and KV caching enables the model to retrieve geometrically relevant prior content, supporting interactive and streaming NVS scenarios.

(Figure 4)

Figure 4: Novel view synthesis on diverse customized trajectories, demonstrating spatial consistency even for trajectories that revisit earlier viewpoints.

(Figure 5)

Figure 5: Novel view synthesis on Re10K with customized trajectories, showcasing spatial consistency under diverse camera motions.

Ablation: Autoregressive vs. Non-autoregressive

Ablation studies confirm that causal masking and autoregressive training are critical for generalization to variable-length input-output configurations. The causal model maintains performance across a wide range of sequence lengths and input views, while the non-causal variant degrades rapidly outside its training setup. The spatial attention window further enables efficient inference, achieving results comparable to global attention with reduced FLOPs.

(Figure 6)

Figure 6: Key properties of CausNVS: robust generalization to variable sequence lengths, stable autoregressive rollout, and efficient spatial attention windowing.

Limitations and Future Directions

CausNVS relies on multi-step denoising, which constrains real-time applicability. Future work may explore consistency training or distillation for faster generation. Scaling to longer sequences and more diverse datasets would further improve generalization. Integrating multimodal signals (audio, language, actions) could enable fully grounded world models with controllable rollout in complex environments.

Conclusion

CausNVS advances the state of flexible 3D novel view synthesis by introducing a causal, autoregressive multi-view diffusion framework with relative pose encoding and efficient spatial memory. The model supports arbitrary input-output view configurations, robust autoregressive rollout, and efficient inference, making it suitable for streaming, interactive, and generative 3D applications. Theoretical implications include improved generalization and scalability in world modeling, while practical impacts span AR/VR, content creation, and simulation. Future research should address real-time generation, multimodal integration, and broader deployment considerations.
