Overview of "Novel View Synthesis with Diffusion Models"
The paper "Novel View Synthesis with Diffusion Models" introduces 3DiM, an innovative approach utilizing diffusion models for synthesizing novel views. This method seeks to enhance 3D consistency and visual fidelity using diffusion models, a notable departure from traditional geometry-aware models such as Neural Radiance Fields (NeRF) which rely on explicit 3D representations. This research paves the way for generating realistic and sharp images across multiple views without complex and computationally expensive 3D geometry-based methods.
Core Innovations
- 3DiM Architecture: The paper presents a diffusion model tailored specifically to 3D novel view synthesis. The model takes a source image and its pose as input and generates images at new poses through a pose-conditional image-to-image diffusion process, allowing it to produce coherent, high-fidelity completions from a single input view.
- Stochastic Conditioning: One of the paper's pivotal contributions is the stochastic conditioning technique, which markedly improves the 3D consistency of generated images. Rather than conditioning on a fixed view, the sampler draws a conditioning view at random at each denoising step, so that over the course of sampling the model effectively draws on all available conditioning views (see the first sketch after this list).
- Evaluation Methodology: The paper also introduces a 3D consistency scoring metric. The protocol trains a neural field on a model's generated views and measures how well it predicts held-out generated views, giving a quantitative gauge of how 3D-consistent a sampler's outputs are (second sketch below).
- X-UNet Architecture: The research also proposes X-UNet, a UNet variant tailored to pose-conditional view synthesis. It incorporates cross-attention layers that let the input and target views exchange information efficiently, boosting the quality of the generated images (third sketch below).
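To make the sampling mechanics concrete, here is a minimal Python sketch of stochastic conditioning. It assumes a trained pose-conditional denoiser is available behind the hypothetical `denoise_step(x, t, cond_img, cond_pose, target_pose)` callable; the function names, step count, and image shape are illustrative, not the paper's code.

```python
import random

import numpy as np


def sample_view(denoise_step, target_pose, views, shape=(64, 64, 3), num_steps=256):
    """Ancestral sampling of one novel view with stochastic conditioning.

    A conditioning (image, pose) pair is redrawn uniformly at random at every
    denoising step, so all available views influence the final sample.
    `denoise_step` stands in for the trained pose-conditional denoiser.
    """
    x = np.random.randn(*shape)  # start from pure Gaussian noise
    for t in reversed(range(num_steps)):
        cond_img, cond_pose = random.choice(views)  # fresh conditioner each step
        x = denoise_step(x, t, cond_img, cond_pose, target_pose)
    return x


def autoregressive_rollout(denoise_step, input_view, target_poses):
    """Generate frames one at a time, growing the conditioning set as we go."""
    views = [input_view]  # start from the single given (image, pose) pair
    frames = []
    for pose in target_poses:
        frame = sample_view(denoise_step, pose, views)
        views.append((frame, pose))  # completed frames become future conditioners
        frames.append(frame)
    return frames
```

The key design choice is that the conditioner changes at every step of a single sampling trajectory: a denoiser trained only on image pairs can thereby approximate conditioning on the full, growing set of views.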
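The 3D consistency scoring protocol can likewise be summarized in a few lines. In this sketch, `fit_neural_field`, `psnr`, and `field.render` are hypothetical stand-ins for a neural-field trainer and an image metric, and the even/odd split is just one possible train/held-out partition.

```python
import numpy as np


def consistency_score(generated_views, fit_neural_field, psnr):
    """3D consistency scoring (sketch): fit a neural field to half of the
    generated (image, pose) pairs, then measure how well it re-renders the
    held-out half. A 3D-consistent sampler yields views that a single field
    can explain, so higher held-out PSNR means better consistency."""
    train, held_out = generated_views[::2], generated_views[1::2]
    field = fit_neural_field(train)  # e.g. a small NeRF trained on the samples
    return float(np.mean([psnr(field.render(pose), img) for img, pose in held_out]))
```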
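Finally, a simplified PyTorch sketch of the cross-attention exchange in an X-UNet-style block. It assumes flattened per-view feature maps and shares weights across the two streams (mirroring the paper's weight-sharing design), while omitting the convolutional backbone and the pose/time conditioning of the full architecture; layer sizes are illustrative.

```python
import torch
import torch.nn as nn


class CrossViewAttention(nn.Module):
    """Cross-attention block exchanging information between two view streams.

    The same weights process both streams: each view's features act as
    queries against the other view's features as keys and values.
    """

    def __init__(self, dim: int, heads: int = 4):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, target_feats: torch.Tensor, input_feats: torch.Tensor):
        # *_feats: (batch, h * w, dim) flattened UNet feature maps
        t2i, _ = self.attn(self.norm(target_feats), input_feats, input_feats)
        i2t, _ = self.attn(self.norm(input_feats), target_feats, target_feats)
        # residual connections keep each stream's own features intact
        return target_feats + t2i, input_feats + i2t
```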
Results and Analysis
The model was benchmarked against prior approaches on the SRN ShapeNet dataset. The results indicate that 3DiM achieves higher visual fidelity and better 3D consistency than traditional regression-based models, which often suffer from blurriness. The paper supports this with a comprehensive analysis and visual comparisons highlighting the improved sample quality over prior methods such as PixelNeRF and Light Field Networks.
The paper also conducts ablation studies to isolate the effect of individual components. Removing stochastic conditioning, or replacing the diffusion objective with a simpler regression-based one, significantly degrades performance, both in qualitative sample quality and in quantitative measures such as the Fréchet Inception Distance (FID).
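For context, FID measures the distance between Gaussian fits to Inception-network features of real and generated images, with means $\mu_r, \mu_g$ and covariances $\Sigma_r, \Sigma_g$:

$$\mathrm{FID} = \lVert \mu_r - \mu_g \rVert_2^2 + \operatorname{Tr}\!\left(\Sigma_r + \Sigma_g - 2\,(\Sigma_r \Sigma_g)^{1/2}\right)$$

Blurry, averaged outputs of the kind regression objectives tend to produce shift these feature statistics away from those of real images, which is why FID is a natural metric for this ablation.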
Implications and Future Directions
3DiM's ability to perform high-quality novel view synthesis without explicit 3D representations simplifies the pipeline and broadens its potential applicability to real-world settings where data may be scarce or noisy. The findings suggest that diffusion models, once tuned for such tasks, offer a robust alternative to more complex 3D reconstruction pipelines, with practical implications for virtual reality, augmented reality, and gaming.
The paper points to promising directions for future research, such as scaling the model to large, real-world 3D datasets and exploring how the stochastic sampling process could afford finer-grained control over generated scenes. End-to-end approaches that guarantee 3D consistency by design also remain an open problem, aiming at more expressive and flexible applications of diffusion to 3D rendering and synthesis. As large-scale datasets become increasingly available, methods like 3DiM are likely to play an important role in advancing AI-driven 3D content creation.