- The paper introduces a hybrid scene representation using TSDF for geometry and a volumetric radiance field for appearance to enhance 3D reconstruction.
- It extends the NeRF framework by integrating RGB-D sensor depth data, significantly improving the accuracy and completeness of reconstructed scenes.
- Joint optimization of camera poses and the scene representation mitigates errors in the initial pose estimates, yielding better alignment and recovering fine structural detail.
A Summary of "Neural RGB-D Surface Reconstruction"
This paper presents a new approach for reconstructing high-quality 3D models of room-scale scenes from RGB-D input captured by consumer-level cameras. The authors propose a method that combines a neural implicit surface representation with differentiable volumetric rendering to better capture scene geometry. By leveraging both color and depth information, the approach addresses a limitation of existing methods that rely solely on RGB images or depth data and therefore often produce incomplete or inaccurate geometry.
Key Contributions
- Hybrid Scene Representation: The authors introduce a hybrid neural representation composed of a truncated signed distance function (TSDF) for surface geometry and a volumetric radiance field for appearance. This design allows for effective incorporation of depth measurements while supporting differentiable volumetric rendering, enabling high-quality scene reconstruction.
- Integration with NeRF: Building on the Neural Radiance Fields (NeRF) framework, the authors replace NeRF's density-based volume rendering with a differentiable rendering formulation driven by the signed distance function, so that rendering weights peak at the surface. This yields well-defined surfaces that are difficult to extract reliably from density fields alone (a minimal sketch of this formulation appears after this list).
- Depth Information Utilization: The method extends NeRF to incorporate depth measurements from an RGB-D sensor, such as a Kinect. Depth directly constrains the reconstructed surface, while the color signal helps complete geometry in regions where depth measurements are sparse or missing.
- Pose and Camera Refinement: To mitigate errors stemming from inaccurate initial camera poses, the authors jointly optimize the camera poses and the scene representation. This refinement significantly improves alignment across frames and overall reconstruction quality.
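To make the SDF-based rendering concrete, here is a minimal PyTorch sketch of the hybrid representation and the sigmoid-product weighting it implies: an MLP predicts a truncated SDF value and a color per sample, and the SDF values along a ray are converted into rendering weights that peak at the zero crossing. The network architecture, the `truncation` value, and the omission of positional encoding, per-frame latent codes, and multi-surface handling are simplifications for illustration, not the paper's exact implementation.

```python
import torch
import torch.nn as nn

class HybridField(nn.Module):
    """Illustrative MLP mapping a 3D point and view direction to a truncated
    SDF value and an RGB color (positional encoding and the paper's exact
    architecture are omitted)."""
    def __init__(self, hidden=256):
        super().__init__()
        self.geo = nn.Sequential(
            nn.Linear(3, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1 + hidden),       # SDF value + geometry feature
        )
        self.rgb = nn.Sequential(
            nn.Linear(hidden + 3, hidden), nn.ReLU(),
            nn.Linear(hidden, 3), nn.Sigmoid(),  # view-dependent color
        )

    def forward(self, x, view_dir):
        h = self.geo(x)
        sdf, feat = h[..., 0], h[..., 1:]
        color = self.rgb(torch.cat([feat, view_dir], dim=-1))
        return sdf, color


def render_ray(sdf, color, z_vals, truncation=0.05):
    """Turn per-sample truncated SDF values into rendering weights that peak
    at the zero crossing (a product of two sigmoids), then composite color
    and depth along each ray. Handling of multiple surface crossings is
    omitted for brevity."""
    w = torch.sigmoid(sdf / truncation) * torch.sigmoid(-sdf / truncation)
    w = w / (w.sum(dim=-1, keepdim=True) + 1e-8)   # normalize along the ray
    rgb = (w[..., None] * color).sum(dim=-2)       # expected color per ray
    depth = (w * z_vals).sum(dim=-1)               # expected depth per ray
    return rgb, depth
```

During training, the rendered color and depth would be compared against the captured RGB image and sensor depth; the paper additionally supervises the SDF with free-space and near-surface terms, which this sketch leaves out.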
Experimental Results
The paper provides quantitative and qualitative evaluations on both real (ScanNet) and synthetic datasets. The proposed method outperforms traditional depth fusion methods such as BundleFusion, as well as learned methods that rely solely on RGB or depth information. It improves both reconstruction completeness and accuracy, with the hybrid representation proving particularly effective at capturing thin structures and filling in missing geometry.
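Accuracy and completeness are standard surface-reconstruction metrics; the sketch below shows one common way to compute them from point clouds sampled on the reconstructed and ground-truth surfaces. It is illustrative only, not necessarily the paper's exact evaluation protocol, and the 5 cm threshold is an assumption.

```python
import numpy as np
from scipy.spatial import cKDTree

def accuracy_completeness(pred_pts, gt_pts, threshold=0.05):
    """Chamfer-style metrics between two N x 3 point clouds (in meters)
    sampled from the reconstructed and ground-truth surfaces."""
    d_pred_to_gt = cKDTree(gt_pts).query(pred_pts)[0]   # accuracy distances
    d_gt_to_pred = cKDTree(pred_pts).query(gt_pts)[0]   # completeness distances
    return {
        "accuracy": d_pred_to_gt.mean(),
        "completeness": d_gt_to_pred.mean(),
        # fraction of ground-truth points matched within the threshold
        "completion_ratio": (d_gt_to_pred < threshold).mean(),
    }
```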
In terms of pose estimation, the jointly refined camera poses align better with the ground truth than the noisy initial estimates, as demonstrated on synthetic scenes for which ground-truth camera trajectories and geometry are known.
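As an illustration of how jointly optimized camera poses can be set up, the following sketch adds learnable per-frame pose corrections on top of the initial camera-to-world matrices, using an axis-angle parameterization and Rodrigues' formula. The class name and parameterization are assumptions for this sketch, not the paper's exact scheme.

```python
import torch
import torch.nn as nn

class PoseCorrections(nn.Module):
    """Learnable per-frame pose corrections (axis-angle rotation plus
    translation) applied to the initial camera-to-world poses and optimized
    jointly with the scene network. The parameterization is an assumption."""
    def __init__(self, num_frames):
        super().__init__()
        self.rot = nn.Parameter(torch.zeros(num_frames, 3))    # axis-angle
        self.trans = nn.Parameter(torch.zeros(num_frames, 3))  # translation

    def forward(self, init_c2w, frame_idx):
        # Rodrigues' formula: axis-angle -> rotation matrix
        theta = self.rot[frame_idx].norm(dim=-1, keepdim=True).clamp(min=1e-8)
        k = self.rot[frame_idx] / theta                       # unit axis
        K = torch.zeros(frame_idx.shape[0], 3, 3, device=k.device)
        K[:, 0, 1], K[:, 0, 2] = -k[:, 2], k[:, 1]
        K[:, 1, 0], K[:, 1, 2] = k[:, 2], -k[:, 0]
        K[:, 2, 0], K[:, 2, 1] = -k[:, 1], k[:, 0]
        theta = theta[..., None]
        R = (torch.eye(3, device=k.device) + torch.sin(theta) * K
             + (1 - torch.cos(theta)) * K @ K)
        delta = torch.eye(4, device=k.device).repeat(frame_idx.shape[0], 1, 1)
        delta[:, :3, :3] = R
        delta[:, :3, 3] = self.trans[frame_idx]
        return delta @ init_c2w   # refined camera-to-world matrices
```

In practice these correction parameters would be registered with the same optimizer as the scene network, so that they receive gradients through the color and depth rendering losses.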
Implications and Future Directions
The integration of TSDFs with the NeRF framework introduces a promising direction for enhancing 3D reconstruction systems in environments where depth sensors are used. This advancement translates to potential improvements in applications such as augmented reality (AR) and virtual reality (VR), where accurate scene geometry is crucial.
Looking forward, the paper hints at future work integrating more sophisticated scene representations that can handle high-frequency details in extremely large environments. Moreover, the authors recognize the need for more computationally efficient methods, highlighting the potential of voxel grid optimizations or other sparse representations to expedite convergence and reduce memory footprint.
Overall, the paper presents a clear, methodical innovation in 3D scene reconstruction, with significant implications for real-world applications that demand precise and comprehensive geometric modeling.