DPSNet: End-to-end Deep Plane Sweep Stereo

Published 2 May 2019 in cs.CV and cs.RO | (1905.00538v1)

Abstract: Multiview stereo aims to reconstruct scene depth from images acquired by a camera under arbitrary motion. Recent methods address this problem through deep learning, which can utilize semantic cues to deal with challenges such as textureless and reflective regions. In this paper, we present a convolutional neural network called DPSNet (Deep Plane Sweep Network) whose design is inspired by best practices of traditional geometry-based approaches for dense depth reconstruction. Rather than directly estimating depth and/or optical flow correspondence from image pairs as done in many previous deep learning methods, DPSNet takes a plane sweep approach that involves building a cost volume from deep features using the plane sweep algorithm, regularizing the cost volume via a context-aware cost aggregation, and regressing the dense depth map from the cost volume. The cost volume is constructed using a differentiable warping process that allows for end-to-end training of the network. Through the effective incorporation of conventional multiview stereo concepts within a deep learning framework, DPSNet achieves state-of-the-art reconstruction results on a variety of challenging datasets.


Authors (4)

Citations (219)


Summary

The paper introduces DPSNet, a neural network integrating a differentiable plane sweep algorithm for end-to-end depth estimation from multiple unstructured images.
DPSNet achieves state-of-the-art performance on challenging datasets including MVS, SUN3D, and RGBD, demonstrating superior accuracy in preserving structural details and object boundaries.
This hybrid approach combining classical geometry and deep learning opens potential for future 3D reconstruction research, including incorporating semantics and pose estimation.

End-to-end Deep Plane Sweep Stereo with DPSNet

The paper introduces DPSNet (Deep Plane Sweep Network), a multiview stereo method built on concepts from traditional, non-learning-based dense depth reconstruction. Its goal is a convolutional neural network that estimates scene depth end to end from multiple unstructured images, including in difficult cases such as textureless areas and reflective surfaces.

Technical Overview

DPSNet integrates the plane-sweep algorithm into the network itself: deep features extracted from the input images are warped onto a set of depth planes through a differentiable warping process, and a cost volume is built from the warped features. Because the warping is differentiable, the classical method becomes trainable end to end. Unlike several existing methods that require plane-sweep volumes to be computed externally and supplied as input, DPSNet models the cost volume by applying 3D convolutions to concatenated deep features. The network can therefore estimate dense depth maps from multiple views without that preprocessing step.
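The warping step can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names `plane_sweep_warp` and `build_cost_volume`, the pinhole intrinsics `K`, the relative pose `(R, t)` from reference to source camera, and the list of depth planes are all assumptions made for illustration. It also uses nearest-neighbour sampling for brevity, whereas a trainable network would need a differentiable (for example bilinear) sampler.

```python
import numpy as np

def plane_sweep_warp(src_feat, K, R, t, depth):
    """Warp source-view features (C, H, W) onto the reference view,
    assuming every reference pixel lies on the plane z = depth."""
    C, H, W = src_feat.shape
    # Homogeneous pixel grid of the reference view, shape (3, H*W).
    v, u = np.meshgrid(np.arange(H), np.arange(W), indexing="ij")
    pix = np.stack([u.ravel(), v.ravel(), np.ones(H * W)]).astype(np.float64)
    # Back-project each reference pixel to a 3D point at the given depth.
    pts_ref = np.linalg.inv(K) @ pix * depth
    # Move the points into the source camera frame and project them.
    proj = K @ (R @ pts_ref + t.reshape(3, 1))
    z = proj[2]
    in_front = z > 1e-8
    safe_z = np.where(in_front, z, 1.0)
    # Nearest-neighbour lookup (an illustrative simplification).
    us = np.rint(proj[0] / safe_z).astype(int)
    vs = np.rint(proj[1] / safe_z).astype(int)
    valid = in_front & (us >= 0) & (us < W) & (vs >= 0) & (vs < H)
    warped = np.zeros((C, H * W), dtype=src_feat.dtype)
    warped[:, valid] = src_feat[:, vs[valid], us[valid]]
    return warped.reshape(C, H, W)

def build_cost_volume(ref_feat, src_feat, K, R, t, depths):
    """Concatenate reference and warped source features per depth plane.
    Returns an array of shape (2C, D, H, W) suitable for 3D convolutions."""
    slices = [
        np.concatenate([ref_feat, plane_sweep_warp(src_feat, K, R, t, d)], axis=0)
        for d in depths
    ]
    return np.stack(slices, axis=1)
```

If the assumed depth of a plane matches the true depth at a pixel, the warped source feature lines up with the reference feature there, which is the signal the later 3D convolutions can learn to detect.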

A key component of DPSNet is its cost aggregation step, which applies context-aware filtering to regularize the cost volume and reduce the influence of unreliable matches. The aggregation is implemented as a series of dilated convolutions that refine each cost slice. This improves depth accuracy, especially in weakly textured regions, which are often difficult for traditional stereo techniques.
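To make the role of dilation concrete, here is a toy NumPy sketch of refining a single cost slice with a stack of dilated 3x3 filters. Everything in it is an assumption for illustration: the names `dilated_conv2d` and `aggregate_slice`, the single-channel filters, the dilation rates, the absence of nonlinearities, and the residual connection. The actual network uses learned multi-channel filters.

```python
import numpy as np

def dilated_conv2d(x, kernel, dilation):
    """Same-size 2D cross-correlation of x (H, W) with a k x k kernel
    whose taps are spaced `dilation` pixels apart (zero padding)."""
    k = kernel.shape[0]
    pad = dilation * (k // 2)
    xp = np.pad(x, pad)
    H, W = x.shape
    out = np.zeros((H, W), dtype=np.float64)
    for i in range(k):
        for j in range(k):
            r, c = i * dilation, j * dilation
            out += kernel[i, j] * xp[r:r + H, c:c + W]
    return out

def aggregate_slice(cost_slice, kernels, dilations):
    """Refine one cost slice with a chain of dilated filters and add the
    result back to the input (residual form, assumed here)."""
    h = cost_slice.astype(np.float64)
    for kern, d in zip(kernels, dilations):
        h = dilated_conv2d(h, kern, d)
    return cost_slice + h
```

With 3x3 kernels and dilations 1, 2, and 4, each output pixel depends on a 15x15 input neighbourhood (1 + 2·(1 + 2 + 4)), so context from a wide area reaches each cost value with only three small filters.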

Experimental Results

DPSNet reports state-of-the-art results on several challenging datasets, including MVS, SUN3D, and RGBD. The paper provides quantitative evidence that it consistently outperforms existing methods such as COLMAP, DeMoN, and DeepMVS on metrics such as absolute relative error and root mean square error (RMSE). The experiments also show that DPSNet preserves structural detail in homogeneous regions and delineates object boundaries accurately, advantages largely attributable to its cost aggregation module.
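For reference, the two metrics named above are commonly computed as in the following sketch. The function name `depth_errors` and the convention of ignoring pixels whose ground-truth depth is not positive are assumptions, not details taken from the paper.

```python
import numpy as np

def depth_errors(pred, gt):
    """Return (absolute relative error, RMSE) over pixels with valid ground truth."""
    mask = gt > 0  # assumed convention: non-positive depth marks missing data
    p, g = pred[mask], gt[mask]
    abs_rel = np.mean(np.abs(p - g) / g)
    rmse = np.sqrt(np.mean((p - g) ** 2))
    return abs_rel, rmse
```

Absolute relative error normalizes each error by the true depth, so it weights near and far surfaces evenly, while RMSE is dominated by large absolute errors, which tend to occur at distant points.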

Implications and Future Work

The success of DPSNet in translating a traditionally geometry-based process into a deep learning context suggests significant potential for advancements in both practical applications and theoretical exploration in 3D scene reconstruction. The paper identifies promising avenues for extending DPSNet, such as incorporating semantic segmentation for cost aggregation and enhancing depth prediction through intelligent viewpoint selection. Additionally, lifting the requirement for pre-calibrated camera parameters by incorporating pose estimation into the end-to-end framework remains an intriguing future target.

In summary, DPSNet advances dense depth estimation from multiple views by adapting the classical plane-sweep method to a neural network architecture. Its results across diverse datasets support the value of hybrid approaches that combine the strengths of traditional algorithms and deep learning.
