- The paper introduces a persistent 3D feature embedding that synthesizes coherent novel views without requiring explicit 3D geometry reconstruction.
- It employs a Cartesian 3D grid of features, a volume lifting layer, and soft occlusion reasoning to integrate multi-view geometry, yielding large PSNR and SSIM gains over strong baselines.
- The approach enhances practical applications in VR/AR and guides future research into efficient 3D scene representation in neural networks.
DeepVoxels: Learning Persistent 3D Feature Embeddings
The paper "DeepVoxels: Learning Persistent 3D Feature Embeddings" addresses a fundamental challenge in generative neural networks: the synthesis of coherent image sequences representing the same 3D scene from diverse viewpoints. Traditional methods have struggled with this aspect due to the reliance on 2D convolutions and the lack of an inherent understanding of 3D scene geometry. DeepVoxels introduces a persistent 3D feature embedding system that encodes view-dependent appearances without requiring explicit 3D geometry modeling.
At its core, DeepVoxels relies on a Cartesian 3D grid of persistent embedded features that learn to exploit the underlying 3D structure of the scene. This approach synthesizes novel views by using multi-view geometry principles, adversarial loss functions, and a supervised training setup without needing a 3D reconstruction.
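To make the idea concrete, a deliberately simplified sketch of such a persistent embedding is a learnable per-scene feature volume. The grid resolution and feature width below are illustrative assumptions rather than the paper's configuration, and the multi-view feature integration that fills the volume from observed images is omitted:

```python
import torch
import torch.nn as nn

class PersistentVoxelEmbedding(nn.Module):
    """Sketch of a persistent per-scene 3D feature volume.

    The 32^3 resolution and 64-channel feature width are illustrative
    assumptions; the paper's machinery for lifting observed views into
    the volume is not shown here.
    """
    def __init__(self, grid_size: int = 32, feature_dim: int = 64):
        super().__init__()
        # One feature vector per voxel, shared by every rendered view of
        # the scene -- this is what makes the embedding "persistent".
        self.features = nn.Parameter(
            0.01 * torch.randn(1, feature_dim, grid_size, grid_size, grid_size)
        )

    def forward(self) -> torch.Tensor:
        # The same volume is returned for every target view; view dependence
        # enters later, when the volume is resampled along camera rays.
        return self.features
```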
The proposed method excels at novel view synthesis, producing high-quality results even for complex scenes, as demonstrated on synthetic datasets and real captures. The system consists of several key components: a volume lifting layer that converts 2D image features into 3D space, a fully convolutional 3D network that processes these features, and an occlusion module that reasons about visibility using softmax-weighted depth predictions. The resulting representation is an intermediate, structured latent space that is inherently tied to the 3D properties of the scene.
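A minimal sketch of that occlusion step, assuming ray features have already been gathered at D depth samples per target pixel and that a small network (not shown) has produced a scalar occlusion score per sample; tensor shapes and names here are assumptions, not the paper's exact architecture:

```python
import torch
import torch.nn.functional as F

def occlusion_weighted_projection(ray_features, occlusion_logits):
    """Collapse per-depth ray features into a 2D feature map using soft
    visibility weights.

    ray_features:     (B, C, D, H, W) features at D depth steps along
                      every pixel ray of the target view.
    occlusion_logits: (B, 1, D, H, W) scalar score per depth sample.
    """
    # Softmax over the depth dimension yields per-ray visibility weights
    # that sum to one -- a soft decision about which depth is visible.
    visibility = F.softmax(occlusion_logits, dim=2)
    # The weighted sum collapses the depth axis into a 2D feature map.
    projected = (ray_features * visibility).sum(dim=2)           # (B, C, H, W)
    # An expected (soft) depth per pixel can be read off the same weights.
    depth_steps = torch.linspace(0.0, 1.0, ray_features.shape[2],
                                 device=ray_features.device).view(1, 1, -1, 1, 1)
    expected_depth = (visibility * depth_steps).sum(dim=2)       # (B, 1, H, W)
    return projected, expected_depth
```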
Strong Numerical Outcomes
Quantitative measures, including PSNR and SSIM, indicate that DeepVoxels outperforms several strong baselines by significant margins. Evaluations on high-quality 3D scans reveal average PSNR improvements exceeding 7 dB over competing models, demonstrating the system's ability to preserve fine detail and remain robust across viewpoints.
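For reference, PSNR compares a rendered image with ground truth via mean squared error on a logarithmic scale; a minimal implementation, assuming images normalized to [0, 1], looks like this (a 7 dB gap corresponds to roughly a 5x lower mean squared error):

```python
import torch

def psnr(pred: torch.Tensor, target: torch.Tensor, max_val: float = 1.0) -> torch.Tensor:
    """Peak signal-to-noise ratio in dB; higher is better.

    Both images are assumed to share the same shape and to lie in
    [0, max_val].
    """
    mse = torch.mean((pred - target) ** 2)
    return 10.0 * torch.log10(max_val ** 2 / mse)
```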
Notable Claims and Contributions
The paper puts forward several key contributions:
- A persistent 3D feature representation that effectively incorporates 3D scene information for high-quality image synthesis.
- An effective occlusion reasoning mechanism based on learned soft visibility weights, enhancing both result quality and generalization to novel viewpoints.
- The enforcement of perspective and multi-view geometry in a structured and interpretable manner during training, without relying on 3D supervision (see the projection sketch after this list).
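One way such geometric constraints can be enforced is through a fixed, differentiable resampling of the voxel volume along the rays of the target camera. The sketch below assumes the per-ray sampling grid has already been precomputed from the known camera intrinsics and pose (that precomputation is not shown), and the tensor shapes are illustrative:

```python
import torch
import torch.nn.functional as F

def project_voxels_to_view(voxel_features, sampling_grid):
    """Resample the persistent voxel volume along target-camera rays.

    voxel_features: (B, C, Dv, Hv, Wv) persistent feature volume.
    sampling_grid:  (B, D, H, W, 3) normalized xyz coordinates in [-1, 1],
                    one per depth step of every target pixel ray, derived
                    from the camera intrinsics and pose.

    Returns a (B, C, D, H, W) tensor of per-depth ray features, which can
    then be collapsed by the occlusion module sketched earlier.
    """
    # Trilinear interpolation into the volume; the projection itself is a
    # fixed geometric operation, so no 3D supervision is required.
    return F.grid_sample(voxel_features, sampling_grid, align_corners=True)
```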
Implications and Future Directions
The implications of DeepVoxels extend to both practical and theoretical domains in AI. Practically, the ability to synthesize coherent novel views has the potential to significantly benefit applications in virtual reality, augmented reality, and other spatial computing domains. Theoretically, DeepVoxels offers a pathway for further exploration into integrating 3D geometric understanding within neural architectures, possibly inspiring new frameworks that balance 3D reasoning with computational efficiency.
There remains room for future advancements, particularly in addressing the memory inefficiency of the dense voxel grids and enhancing generalization capabilities. Progress in sparse neural networks could offer solutions to these challenges. Moreover, extending the approach to handle more complex, non-Lambertian scenes could broaden its applicability.
In conclusion, the DeepVoxels approach presents a significant step forward in generative model capabilities by incorporating persistent 3D feature embeddings, paving the way for future developments in scene representation and novel view synthesis.