- The paper presents Viewset Diffusion, a model that maps multi-view 2D images to coherent 3D reconstructions without requiring 3D ground-truth data.
- It uses a neural encoder-decoder architecture built around a radiance field discretized on a voxel grid, exploiting the one-to-one mapping between viewsets and 3D models to improve reconstruction quality.
- Experiments on ShapeNet-SRN, CO3D, and the newly introduced Minens benchmark show that it outperforms existing single-view reconstruction methods, particularly on ambiguous inputs.
Viewset Diffusion: (0-)Image-Conditioned 3D Generative Models from 2D Data
The paper "Viewset Diffusion: (0-)Image-Conditioned 3D Generative Models from 2D Data" by Szymanowicz et al., presents an innovative approach to 3D object generation and reconstruction using diffusion models trained solely on multi-view 2D data. The pivotal idea underlying this work is the one-to-one mapping between viewsets, which are collections of 2D views of an object, and 3D models. By exploiting this mapping, the authors achieve 3D model generation and reconstruction from 2D images without requiring access to 3D ground truth data.
Methodology Overview
The core contribution of this paper is Viewset Diffusion, a diffusion-based generative model that extends Denoising Diffusion Probabilistic Models (DDPMs), which have been highly successful for image generation, to 3D object modeling. Rather than denoising a 3D representation directly, the model denoises viewsets, bridging the gap between 2D and 3D data modalities. A single neural network generates a viewset and the corresponding 3D model simultaneously, enabling single-view and few-view 3D reconstruction as well as unconditional 3D generation.
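As a rough illustration, the sketch below shows one DDPM-style training step over viewsets under simplifying assumptions: a viewset is a tensor of shape (B, V, C, H, W), all views of an object share one noise level, and the denoiser is trained to predict the clean viewset directly. The linear noise schedule and the `viewset_diffusion_step` helper are illustrative, not the authors' implementation; in particular, the paper can keep conditioning views at zero noise (hence "(0-)image-conditioned"), which is omitted here for brevity.

```python
# Minimal sketch of DDPM-style training over viewsets (illustrative, not the
# paper's code). `model` stands in for the 3D-aware denoiser sketched further
# below: it takes noisy views plus a timestep and returns a clean viewset.
import torch
import torch.nn.functional as F

T = 1000                                     # number of diffusion steps (assumed)
betas = torch.linspace(1e-4, 0.02, T)        # standard linear DDPM schedule
alphas_bar = torch.cumprod(1.0 - betas, 0)   # cumulative signal retention

def viewset_diffusion_step(model, views, optimizer):
    """One training step: noise every view of the viewset with a shared
    timestep, then ask the model to recover the clean viewset (x0-prediction)."""
    B, V, C, H, W = views.shape
    t = torch.randint(0, T, (B,), device=views.device)      # one timestep per object
    a_bar = alphas_bar.to(views.device)[t].view(B, 1, 1, 1, 1)
    noise = torch.randn_like(views)
    noisy_views = a_bar.sqrt() * views + (1 - a_bar).sqrt() * noise

    pred_clean = model(noisy_views, t)       # denoiser outputs a clean viewset
    loss = F.mse_loss(pred_clean, views)     # reconstruction loss on all views

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```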
Concretely, the denoising network is structured as an encoder that maps a noisy viewset to a 3D model, followed by a decoder that renders that model back into clean views, so the one-to-one mapping is enforced in a learnable, differentiable manner. The 3D model itself is a neural radiance field discretized over a voxel grid, which keeps the rendering step differentiable and efficient.
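The sketch below mirrors that encoder-decoder structure in a deliberately simplified form: a hypothetical `ViewsetDenoiser` pools features from the noisy views into a voxel grid of densities and colours, then alpha-composites the grid back into images. The layer sizes are arbitrary, timestep conditioning is omitted, and the renderer composites along a fixed axis instead of marching per-camera rays, so this is a structural illustration of the idea rather than the paper's network.

```python
# Structural sketch of the encoder-decoder denoiser (illustrative only).
import torch
import torch.nn as nn
import torch.nn.functional as F

class ViewsetDenoiser(nn.Module):
    def __init__(self, grid=32, feat=16):
        super().__init__()
        self.grid = grid
        # Encoder: pools features over the V noisy views into one latent
        # vector, then expands it into a (density + RGB) voxel grid.
        self.view_encoder = nn.Sequential(
            nn.Conv2d(3, feat, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(feat, feat, 4, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.to_grid = nn.Linear(feat, 4 * grid ** 3)    # 4 channels: sigma + RGB

    def render(self, voxels, H, W):
        """Alpha-composite the voxel grid along its depth axis (a simplified,
        orthographic stand-in for per-camera volume rendering)."""
        sigma = torch.relu(voxels[:, 0])                 # (B, D, Hg, Wg) densities
        rgb = torch.sigmoid(voxels[:, 1:4])              # (B, 3, D, Hg, Wg) colours
        alpha = 1 - torch.exp(-sigma)                    # per-voxel opacity
        trans = torch.cumprod(
            torch.cat([torch.ones_like(alpha[:, :1]), 1 - alpha[:, :-1]], dim=1),
            dim=1)                                       # accumulated transmittance
        weights = (alpha * trans).unsqueeze(1)           # (B, 1, D, Hg, Wg)
        image = (weights * rgb).sum(dim=2)               # composite along depth
        return F.interpolate(image, size=(H, W), mode='bilinear',
                             align_corners=False)

    def forward(self, noisy_views, t):                   # t: timestep (unused here)
        B, V, C, H, W = noisy_views.shape
        feats = self.view_encoder(noisy_views.flatten(0, 1)).view(B, V, -1).mean(1)
        voxels = self.to_grid(feats).view(B, 4, self.grid, self.grid, self.grid)
        # With real cameras each view would be rendered from its own pose;
        # here every view shares the simplified renderer.
        image = self.render(voxels, H, W)
        return image.unsqueeze(1).expand(B, V, 3, H, W)
```

An instance of this module can be passed directly as `model` to the training step above, since it consumes a (B, V, C, H, W) viewset and a timestep and returns a clean viewset of the same shape.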
Numerical Results
The authors evaluate Viewset Diffusion on ShapeNet-SRN, CO3D, and a newly introduced dataset called Minens. Despite being trained only on 2D views, the model produces sharper, perceptually better 3D reconstructions than deterministic single-view baselines such as PixelNeRF. On Minens, which is designed specifically to probe ambiguity and diversity in reconstruction, the model samples multiple plausible 3D structures consistent with an ambiguous 2D input, highlighting the value of probabilistic modeling.
Implications and Future Work
The Viewset Diffusion model opens new possibilities for 3D generative models trained on 2D datasets, which are more accessible than their 3D counterparts. This research suggests potential advancements for AI applications in areas such as virtual reality, augmented reality, and autonomous robotics, where understanding and interacting with the 3D world from visual inputs is crucial.
The paper lays a promising foundation for future work on more complex object categories and, possibly, on integrating multimodal data. There is room to refine the implementation for higher-resolution outputs, which could further close the gap to photorealistic 3D model synthesis from images. Extending the framework to dynamic and deformable objects could also broaden its applicability significantly.
In conclusion, the Viewset Diffusion model provides a significant step forward in the efficient generation and reconstruction of 3D objects from 2D data alone, offering a solution that is both resource-efficient and versatile for various practical applications in computer vision and beyond.