- The paper represents 3D scene geometry with an implicit neural signed distance function (SDF), yielding multi-view-consistent reconstruction.
- The method integrates 2D semantic segmentation to impose Manhattan-world planar constraints, enhancing reconstruction in low-textured indoor areas.
- Results on ScanNet and 7-Scenes show significant improvements in accuracy and completeness compared to prior MVS approaches.
Neural 3D Scene Reconstruction with the Manhattan-world Assumption
The paper presents a method for reconstructing 3D indoor scenes from multi-view images, targeting the low-textured regions that are common indoors and that traditional MVS approaches handle poorly because stereo matching is unreliable there. The authors propose integrating planar constraints, derived from the Manhattan-world assumption, into an implicit neural representation so that planar regions such as floors and walls are reconstructed more faithfully.
Technical Approach
- Implicit Neural Representations: An MLP represents the scene geometry as a signed distance function (SDF). Unlike MVS methods that optimize per-view depth maps, this encodes geometry (and, below, semantics) directly in 3D space, so multi-view consistency holds by construction; a minimal sketch of the networks follows this list.
- Semantic Segmentation Integration: A 2D semantic segmentation network predicts planar regions (floors and walls) in the input images, and these predictions supervise a second MLP that encodes semantics in 3D. Geometry and semantics are thus optimized jointly in 3D space.
- Loss Function Design: A joint optimization loss is designed to be robust to errors in the initial 2D semantic predictions, improving both reconstruction and segmentation accuracy. In regions predicted as floor or wall, it encourages surface normals to follow the Manhattan-world assumption, keeping the reconstruction aligned with the scene's dominant directions; a sketch of such a normal loss also follows this list.
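The following is a minimal sketch, not the authors' released code, of the two implicit networks described above: one MLP maps a 3D point to an SDF value plus a feature vector, and a second MLP maps the point and that feature to semantic logits for floor, wall, and other. Layer sizes, activations, and the class set are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GeometryMLP(nn.Module):
    """Maps a 3D point to a signed distance value and a feature vector."""
    def __init__(self, in_dim=3, hidden=256, feat_dim=256, depth=8):
        super().__init__()
        layers, d = [], in_dim
        for _ in range(depth):
            layers += [nn.Linear(d, hidden), nn.Softplus(beta=100)]
            d = hidden
        self.body = nn.Sequential(*layers)
        self.sdf_head = nn.Linear(hidden, 1)          # signed distance to the surface
        self.feat_head = nn.Linear(hidden, feat_dim)  # feature consumed by other heads

    def forward(self, x):
        h = self.body(x)
        return self.sdf_head(h), self.feat_head(h)

class SemanticMLP(nn.Module):
    """Maps a 3D point (plus geometry feature) to logits for {floor, wall, other}."""
    def __init__(self, feat_dim=256, hidden=256, num_classes=3):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(3 + feat_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, num_classes),
        )

    def forward(self, x, feat):
        return self.net(torch.cat([x, feat], dim=-1))

def sdf_normals(geometry, x):
    """Surface normals as the normalized gradient of the SDF (via autograd)."""
    x = x.requires_grad_(True)
    sdf, _ = geometry(x)
    (grad,) = torch.autograd.grad(sdf.sum(), x, create_graph=True)
    return F.normalize(grad, dim=-1)
```

In this kind of setup, both heads are typically queried along camera rays and composited with SDF-derived volume-rendering weights, so the rendered colors and semantic probabilities can be supervised by the input images and the 2D segmentation predictions.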
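The Manhattan-world normal constraint itself can be sketched as below. This is a simplified illustration under assumptions not spelled out in the summary: the z axis is taken as vertical, per-point floor/wall probabilities come from rendering the semantic head, and the weighting is illustrative. A fuller Manhattan formulation would also snap wall normals to two dominant horizontal axes; this sketch keeps only the vertical-versus-horizontal distinction.

```python
import torch

def manhattan_normal_loss(normals, p_floor, p_wall, up=None):
    """Encourage floor normals to point along `up` and wall normals to be horizontal.

    normals:          (N, 3) unit surface normals at rendered surface points
    p_floor, p_wall:  (N,) probabilities that each point lies on a floor / wall
    """
    if up is None:
        up = normals.new_tensor([0.0, 0.0, 1.0])   # assume z is the vertical axis
    cos_up = (normals * up).sum(dim=-1)            # n . up
    floor_term = p_floor * (1.0 - cos_up).abs()    # floor: normal parallel to up
    wall_term = p_wall * cos_up.abs()              # wall: normal perpendicular to up
    return (floor_term + wall_term).mean()

# Example with dummy inputs:
n = torch.nn.functional.normalize(torch.randn(1024, 3), dim=-1)
p_floor, p_wall = torch.rand(1024), torch.rand(1024)
print(manhattan_normal_loss(n, p_floor, p_wall))
```

Weighting the two terms by the predicted class probabilities is what lets the optimization tolerate noisy initial segmentations: points with uncertain labels contribute little to the constraint.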
Results and Comparisons
The approach was tested on the ScanNet and 7-Scenes datasets, demonstrating significant improvements in 3D reconstruction quality over previous state-of-the-art methods. Quantitative metrics show gains in both accuracy and completeness, with the largest improvements in planar regions where the planar constraints take effect.
The integration of semantic segmentation into the 3D reconstruction workflow also improved semantic segmentation accuracy, indicating that the proposed method benefits from both geometric and semantic refinement.
Implications
The integration of geometric constraints using the Manhattan-world assumption into neural 3D reconstruction offers promising advancements in scenarios where traditional methods falter. The ability to maintain high reconstruction fidelity in low-textured areas unlocks new potential for applications in augmented reality, robotics, and even autonomous navigation where accurate scene understanding is crucial.
Future Directions
Expanding this approach to accommodate more general geometric assumptions could further increase its applicability, especially in environments that do not strictly adhere to the Manhattan-world model. Future work may explore integrating other architectural assumptions, such as the Atlanta-world model, broadening the scope of neural 3D reconstruction.
This method sets a new benchmark in combining neural scene reconstruction with semantic priors, paving the way for more robust and accurate models capable of understanding and rendering 3D environments seamlessly.