Consistency of 4DiM outputs across multiple views

Determine whether 4DiM (Controlling Space and Time with Diffusion Models) produces multi-view-consistent outputs when synthesizing images at specified viewpoints and timestamps. Specifically, establish whether its generations from multiple cameras at given times are mutually consistent, so that they can be used as multi-view videos for reconstruction.

Background

Multi-view consistency is critical for using generative outputs in 4D reconstruction, because inconsistencies across views can cause artifacts or make the optimization fail. The CAT4D authors note that although various video generation models offer camera control, they generally cannot generate multi-view videos that are mutually consistent. They single out 4DiM, which synthesizes images under novel views and timestamps, as a model for which it is unclear whether the necessary cross-view consistency is achieved.
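One way to make "mutually consistent" concrete is an epipolar check between two views generated for the same timestamp. The sketch below is illustrative only and is not an evaluation protocol taken from 4DiM or CAT4D. It assumes the relative camera pose (R, t) is known, that points are given in normalized image coordinates (intrinsics already removed), and that matched points between the two generated frames are available from some feature matcher. All function names here are hypothetical.

```python
import math

def cross_matrix(t):
    """Skew-symmetric matrix [t]_x such that [t]_x v = t x v."""
    tx, ty, tz = t
    return [[0.0, -tz, ty], [tz, 0.0, -tx], [-ty, tx, 0.0]]

def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]

def matvec(a, v):
    return [sum(a[i][k] * v[k] for k in range(3)) for i in range(3)]

def essential_matrix(R, t):
    """E = [t]_x R for a camera-2 point X2 = R * X1 + t."""
    return matmul(cross_matrix(t), R)

def epipolar_distance(E, x1, x2):
    """Distance from x2 to the epipolar line of x1 in view 2."""
    line = matvec(E, [x1[0], x1[1], 1.0])
    return abs(line[0] * x2[0] + line[1] * x2[1] + line[2]) / math.hypot(line[0], line[1])

def mean_epipolar_error(E, matches):
    """Average epipolar distance over (x1, x2) correspondences."""
    return sum(epipolar_distance(E, a, b) for a, b in matches) / len(matches)

def project(R, t, X):
    """Project a 3D point into normalized image coordinates."""
    Xc = [c + ti for c, ti in zip(matvec(R, X), t)]
    return (Xc[0] / Xc[2], Xc[1] / Xc[2])

def rotation_y(angle):
    c, s = math.cos(angle), math.sin(angle)
    return [[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]]

# Synthetic stand-in for matches between two generated views
# (a real test would extract matches from the generated frames).
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
R, t = rotation_y(0.1), (0.5, 0.0, 0.0)
points = [(0.0, 0.0, 4.0), (1.0, -0.5, 5.0), (-0.8, 0.3, 3.0)]

E = essential_matrix(R, t)
matches_consistent = [(project(identity, (0.0, 0.0, 0.0), X), project(R, t, X))
                      for X in points]
# Simulate a cross-view inconsistency by shifting every point in view 2.
matches_perturbed = [(a, (b[0], b[1] + 0.05)) for a, b in matches_consistent]

consistent_err = mean_epipolar_error(E, matches_consistent)
perturbed_err = mean_epipolar_error(E, matches_perturbed)
print(consistent_err, perturbed_err)
```

A low epipolar error is a necessary rather than sufficient signal: it says matched content obeys the two-view geometry at one timestamp, not that appearance or motion agree across views and over time.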

Resolving this question would clarify whether 4DiM can serve as a viable multi-view video source for downstream dynamic 3D reconstruction tasks, similar to the role envisioned for multi-view video diffusion models in CAT4D.

References

4DiM trained a diffusion model for synthesizing images under novel views and timestamps, but it's unclear whether their model can produce consistent multi-view videos.

CAT4D: Create Anything in 4D with Multi-View Video Diffusion Models (2411.18613 - Wu et al., 2024) in Section 2 (Related Work) – Video Generation Models with Camera Control