- The paper introduces a differentiable volumetric rendering technique that leverages analytic depth gradients to learn implicit 3D representations using only 2D images.
- The method is memory-efficient and performs robustly in both single-view and multi-view reconstruction, matching approaches trained with full 3D supervision.
- The approach offers practical benefits for applications in augmented reality, robotics, and medical imaging by enabling 3D reconstruction from limited supervision.
Differentiable Volumetric Rendering: Learning Implicit 3D Representations without 3D Supervision
The paper "Differentiable Volumetric Rendering: Learning Implicit 3D Representations without 3D Supervision" by Michael Niemeyer, Lars Mescheder, Michael Oechsle, and Andreas Geiger introduces a novel approach for 3D reconstruction from 2D images using implicit representations. This work addresses the limitations of existing methods that typically require explicit 3D supervision in the form of detailed 3D ground truth data. The proposed Differentiable Volumetric Rendering (DVR) technique leverages implicit differentiation to derive depth gradients analytically, thus enabling the training of 3D shape and texture representations directly from 2D RGB images.
Key Contributions
- Differentiable Rendering for Implicit Representations: The paper proposes a differentiable rendering approach specifically tailored for implicit representations of shape and texture. Implicit representations define the geometry and appearance of objects continuously in space, offering advantages such as flexibility in object topology and scalability to high resolutions.
- Analytic Depth Gradients: A significant insight of the paper is the use of implicit differentiation for computing depth gradients analytically. This allows for the direct optimization of the 3D representation based on 2D image supervision, bridging the gap between 2D observations and 3D model learning.
- Memory Efficiency: A further advantage of the DVR method is its memory efficiency. Unlike voxel-based methods, which require large memory footprints to store volumetric data during the forward pass, DVR computes gradients analytically at the surface point, so its memory footprint stays constant regardless of the sampling accuracy in the depth prediction step.
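The implicit-differentiation idea behind the depth gradients can be illustrated on a toy field. Below is a minimal sketch (not the paper's implementation) that uses a hand-written sphere level set in place of a learned occupancy network: with the surface condition f(o + d·w, θ) = 0, implicit differentiation gives dd/dθ = −(∂f/∂θ) / (∂f/∂p · w), which we verify against finite differences. The function names and the bisection-based root finder are illustrative choices, not from the paper.

```python
import numpy as np

def occupancy(p, r):
    """Toy stand-in for a network: signed field of a sphere of radius r
    centered at the origin; the surface is the zero level set r - ||p||."""
    return r - np.linalg.norm(p)

def surface_depth(o, w, r, lo=0.0, hi=3.0, iters=60):
    """Find d with occupancy(o + d*w, r) = 0 by bisection.
    The bracket [lo, hi] is assumed to contain the first crossing."""
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if occupancy(o + mid * w, r) < 0.0:  # still outside the sphere
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def depth_gradient_wrt_r(o, w, r):
    """Implicit differentiation: with f(o + d*w, r) = 0 at the surface,
    dd/dr = -(df/dr) / (df/dp . w). For this toy field df/dr = 1 and
    df/dp = -p / ||p||."""
    d = surface_depth(o, w, r)
    p = o + d * w
    df_dr = 1.0
    df_dp = -p / np.linalg.norm(p)
    return -df_dr / np.dot(df_dp, w), d

o = np.array([-3.0, 0.0, 0.0])  # camera at x = -3 looking along +x
w = np.array([1.0, 0.0, 0.0])
grad, d = depth_gradient_wrt_r(o, w, r=1.0)

# Sanity check against finite differences: here d(r) = 3 - r, so dd/dr = -1.
eps = 1e-4
fd = (surface_depth(o, w, 1.0 + eps) - surface_depth(o, w, 1.0 - eps)) / (2 * eps)
print(d, grad, fd)  # depth ≈ 2, both gradients ≈ -1
```

Because the gradient falls out of a single surface point per ray, no intermediate samples along the ray need to be stored for backpropagation, which is the source of the constant memory footprint noted above.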
Methodology
The DVR framework builds on the concept of implicit neural representations, such as occupancy networks for shape and texture fields for color information. Training proceeds by volumetric rendering: a ray is cast through each pixel, the occupancy network is evaluated along the ray to locate the surface depth, and the texture field is then queried at the identified surface point to predict the corresponding color value.
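The per-pixel rendering step above can be sketched as follows. This is a simplified illustration, not the authors' code: hand-written functions stand in for the occupancy network and texture field, a uniform march finds the first threshold crossing, and a secant refinement (the paper uses an iterative root-finding step of this flavor) localizes the surface depth.

```python
import numpy as np

def occupancy(p):
    """Stand-in for the occupancy network: a soft unit sphere,
    values in (0, 1), greater than 0.5 inside."""
    return 1.0 / (1.0 + np.exp(4.0 * (np.linalg.norm(p) - 1.0)))

def texture(p):
    """Stand-in for the texture field: map a surface point to RGB."""
    return 0.5 * (p + 1.0)  # toy coloring by position

def render_pixel(o, w, tau=0.5, near=0.0, far=6.0, n_steps=64, refine=10):
    """Cast a ray o + d*w, find the first crossing of occupancy = tau by
    uniform stepping, then refine the depth with secant iterations."""
    ds = np.linspace(near, far, n_steps)
    vals = np.array([occupancy(o + d * w) for d in ds])
    crossings = np.where((vals[:-1] < tau) & (vals[1:] >= tau))[0]
    if len(crossings) == 0:
        return None, None  # ray misses the object
    i = crossings[0]
    d0, d1 = ds[i], ds[i + 1]
    f0, f1 = vals[i] - tau, vals[i + 1] - tau
    for _ in range(refine):  # secant iterations on occupancy(d) - tau = 0
        if abs(f1 - f0) < 1e-12:
            break
        d = d1 - f1 * (d1 - d0) / (f1 - f0)
        f = occupancy(o + d * w) - tau
        d0, f0, d1, f1 = d1, f1, d, f
    p_surf = o + d1 * w
    return d1, texture(p_surf)

o = np.array([-3.0, 0.0, 0.0])
w = np.array([1.0, 0.0, 0.0])
depth, rgb = render_pixel(o, w)
print(depth, rgb)  # depth ≈ 2.0: the near surface of the unit sphere
```

In the actual method both `occupancy` and `texture` are neural networks conditioned on the input image, and the predicted color is compared against the observed pixel to drive learning.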
Loss Functions
The authors define several loss functions to train the DVR model:
- RGB Loss: A photometric reconstruction loss is used to measure the consistency between predicted and observed pixel values.
- Depth Loss: When depth maps are available, an additional depth consistency loss can be used to improve the geometry accuracy.
- Freespace and Occupancy Losses: The freespace loss penalizes predicted occupancy along rays through pixels outside the object mask, while the occupancy loss encourages occupied space along rays inside the mask that fail to intersect a surface, together driving the network toward valid surface points.
- Normal Loss: An optional smoothness prior is included to regularize surface normals and enforce natural shapes.
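The loss terms above can be sketched numerically. The sketch below is illustrative only: the per-pixel values, the L1/BCE choices, and the 0.1 weight on the normal term are assumptions for the example, not the paper's exact formulations or hyperparameters.

```python
import numpy as np

def bce(pred, target, eps=1e-6):
    """Binary cross-entropy on occupancy probabilities."""
    pred = np.clip(pred, eps, 1.0 - eps)
    return -(target * np.log(pred) + (1 - target) * np.log(1 - pred)).mean()

# Hypothetical per-pixel quantities for one small batch.
pred_rgb = np.array([[0.8, 0.2, 0.1], [0.4, 0.4, 0.4]])
true_rgb = np.array([[0.9, 0.2, 0.0], [0.5, 0.5, 0.5]])
occ_freespace = np.array([0.1, 0.3])  # occupancy sampled on background rays
occ_inside = np.array([0.7, 0.6])     # occupancy for mask rays missing a surface
pred_depth = np.array([2.1, 3.0])
true_depth = np.array([2.0, 3.2])
normals = np.array([[0.0, 0.0, 1.0], [0.1, 0.0, 0.99]])
neighbor_normals = np.array([[0.0, 0.1, 0.99], [0.0, 0.0, 1.0]])

loss_rgb = np.abs(pred_rgb - true_rgb).mean()       # photometric reconstruction
loss_depth = np.abs(pred_depth - true_depth).mean() # only if depth maps exist
loss_free = bce(occ_freespace, np.zeros_like(occ_freespace))  # push to empty
loss_occ = bce(occ_inside, np.ones_like(occ_inside))          # push to occupied
loss_normal = np.linalg.norm(normals - neighbor_normals, axis=1).mean()  # smoothness

total = loss_rgb + loss_depth + loss_free + loss_occ + 0.1 * loss_normal
print(loss_rgb, loss_depth, loss_free, loss_occ, total)
```

Depending on the available supervision, individual terms are switched on or off: with RGB images alone only the photometric and mask-based terms apply, while depth maps enable the depth consistency term.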
Experimental Evaluation
The paper reports thorough experimental evaluations on both synthetic (ShapeNet) and real-world (DTU) datasets. Key findings include:
- Single-View Reconstruction: DVR was shown to produce high-quality 3D reconstructions from single 2D images, performing comparably or better than existing methods with 3D supervision. The approach generalizes across various object categories and is not restricted by object topology.
- Learning from Limited Supervision: Remarkably, DVR can accurately reconstruct 3D shapes using as few as one view per object during training. This is achieved by aggregating information over multiple training instances, demonstrating the robustness of the method.
- Multi-View Reconstruction: The method was successfully applied to multi-view stereo tasks, generating watertight meshes and leveraging texture information to surpass the visual hull, indicating its applicability to real-world scenarios.
Implications and Future Directions
The DVR method presents a scalable and flexible approach for 3D reconstruction without the need for 3D ground truth, making it highly relevant for applications where obtaining detailed 3D data is challenging. Practically, this can benefit areas such as augmented reality, robotics, and medical imaging, where acquiring detailed 3D scans may not always be feasible.
Theoretically, this research opens new avenues for investigating how implicit representations can be further refined and utilized, especially in contexts involving more complex material properties and dynamic scenes. Future work could explore extending DVR to handle varying lighting conditions, soft masks, and automatically estimated camera parameters, thereby broadening its applicability even further.
In summary, the paper introduces a significant advancement in the field of 3D computer vision by providing a robust and efficient framework for learning 3D representations directly from 2D images. This capability is particularly valuable for advancing real-world applications requiring detailed 3D understanding from limited data inputs.