ViewCrafter: Taming Video Diffusion Models for High-fidelity Novel View Synthesis (2409.02048v1)
Abstract: Despite recent advancements in neural 3D reconstruction, the dependence of existing methods on dense multi-view captures restricts their broader applicability. In this work, we propose ViewCrafter, a novel method for synthesizing high-fidelity novel views of generic scenes from single or sparse images using the prior of a video diffusion model. Our method takes advantage of the powerful generation capabilities of a video diffusion model and the coarse 3D clues offered by a point-based representation to generate high-quality video frames with precise camera pose control. To further enlarge the generation range of novel views, we tailor an iterative view synthesis strategy together with a camera trajectory planning algorithm to progressively extend the 3D clues and the areas covered by the novel views. With ViewCrafter, we can facilitate various applications, such as immersive experiences with real-time rendering, by efficiently optimizing a 3D-GS representation using the reconstructed 3D points and the generated novel views, as well as scene-level text-to-3D generation for more imaginative content creation. Extensive experiments on diverse datasets demonstrate the strong generalization capability and superior performance of our method in synthesizing high-fidelity and consistent novel views.
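The abstract outlines an iterative pipeline: lift the input images to a point cloud, render it along a planned camera trajectory, refine the coarse renders with a video diffusion prior, and fold the results back into the 3D clues before optionally fitting a 3D-GS scene. The sketch below illustrates that loop at a pseudocode level; every helper name (`build_point_cloud`, `plan_next_trajectory`, `render_point_cloud`, `refine_with_video_diffusion`, `update_point_cloud`, `fit_3dgs`) is a hypothetical placeholder introduced here for illustration, not ViewCrafter's actual API.

```python
# Pseudocode-level sketch of the pipeline described in the abstract.
# All helpers below are hypothetical placeholders (assumed, not ViewCrafter's API):
#   build_point_cloud, plan_next_trajectory, render_point_cloud,
#   refine_with_video_diffusion, update_point_cloud, fit_3dgs

def synthesize_novel_views(reference_images, num_rounds=3):
    # Coarse 3D clues: a point cloud and camera poses estimated from the
    # single or sparse input images.
    point_cloud, known_poses = build_point_cloud(reference_images)
    generated_views = []

    for _ in range(num_rounds):
        # Camera trajectory planning: pick poses that extend coverage beyond
        # the regions already explained by the current point cloud.
        trajectory = plan_next_trajectory(point_cloud, known_poses)

        # Render the incomplete point cloud along the trajectory; these frames
        # carry precise camera pose control but contain holes and artifacts.
        coarse_frames = [render_point_cloud(point_cloud, pose) for pose in trajectory]

        # The video diffusion prior turns the coarse renders into high-fidelity,
        # temporally consistent novel views.
        refined_frames = refine_with_video_diffusion(coarse_frames, reference_images)

        # Iterative view synthesis: back-project the refined frames to enlarge
        # the 3D clues so the next round can move the camera further.
        point_cloud = update_point_cloud(point_cloud, refined_frames, trajectory)
        known_poses = list(known_poses) + list(trajectory)
        generated_views.extend(refined_frames)

    # Downstream use mentioned in the abstract: optimize a 3D-GS scene from the
    # reconstructed points and generated views for real-time rendering.
    scene = fit_3dgs(point_cloud, generated_views, known_poses)
    return scene, generated_views
```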