DELTAS: Depth Estimation by Learning Triangulation And densification of Sparse points (2003.08933v2)

Published 19 Mar 2020 in cs.CV, cs.LG, and eess.IV

Abstract: Multi-view stereo (MVS) is the golden mean between the accuracy of active depth sensing and the practicality of monocular depth estimation. Cost volume based approaches employing 3D convolutional neural networks (CNNs) have considerably improved the accuracy of MVS systems. However, this accuracy comes at a high computational cost which impedes practical adoption. Distinct from cost volume approaches, we propose an efficient depth estimation approach by first (a) detecting and evaluating descriptors for interest points, then (b) learning to match and triangulate a small set of interest points, and finally (c) densifying this sparse set of 3D points using CNNs. An end-to-end network efficiently performs all three steps within a deep learning framework and is trained with intermediate 2D image and 3D geometric supervision, along with depth supervision. Crucially, our first step complements pose estimation using interest point detection and descriptor learning. We demonstrate state-of-the-art results on depth estimation with lower compute for different scene lengths. Furthermore, our method generalizes to newer environments and the descriptors output by our network compare favorably to strong baselines. Code is available at https://github.com/magicleap/DELTAS

Summary

The paper presents a novel method that integrates interest point detection, matching, and algebraic triangulation for efficient depth estimation.
It uses a ResNet-50-based network to extract high-level features, and the full pipeline is trained end to end at a lower computational cost than cost volume approaches.
Results on the ScanNet and Sun3D datasets show that DELTAS outperforms prior cost-volume-based MVS methods in both depth accuracy and computational efficiency.

An Overview of DELTAS: Depth Estimation by Learning Triangulation and Densification of Sparse Points

The paper "DELTAS: Depth Estimation by Learning Triangulation And Densification of Sparse Points" proposes a novel approach to depth estimation using multi-view stereo (MVS) techniques. The authors address common challenges associated with existing depth estimation methods, including computational cost and the reliance on heavy cost volumes, by introducing a more efficient framework that combines interest point triangulation with dense depth map generation.

Methodology

The presented method is composed of three key steps: interest point detection and description, point matching and triangulation, and densification of sparse depth points.

Interest Point Detection and Description: The authors utilize a network reminiscent of SuperPoint's structure but with a deeper ResNet-50 backbone to detect interest points and compute descriptors. This decision allows for leveraging high-level features, essential for subsequent matching and pose estimation tasks.
Point Matching and Triangulation: The approach efficiently matches points across multiple views by leveraging geometric constraints, specifically along epipolar lines, thus narrowing the search space and enhancing computational efficiency. A differentiable algebraic triangulation step produces sparse 3D points, allowing for end-to-end training of the network.
Densification of Sparse Depth Points: The sparse 3D points derived from triangulation are refined into a dense depth map. This is achieved using an encoder-decoder network that combines RGB image data with sparse depth information, enabling high-quality depth estimation without the enormous computational overhead linked to large cost volumes.
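
The geometry behind the second and third steps can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names (`epipolar_line`, `triangulate_dlt`, `sparse_depth_map`), the camera intrinsics, and the two-view setup are all hypothetical, and the learned parts (descriptor matching along the line, the densification CNN) are left out. It shows how an epipolar line restricts where a match can lie, how a linear (DLT) triangulation recovers a 3D point with an SVD, and how triangulated points could be scattered into a sparse depth image of the kind a densification network might take as input alongside the RGB frame.

```python
import numpy as np


def skew(t):
    """Return the 3x3 cross-product matrix [t]_x of a 3-vector."""
    return np.array([[0.0, -t[2], t[1]],
                     [t[2], 0.0, -t[0]],
                     [-t[1], t[0], 0.0]])


def epipolar_line(K_ref, K_src, R, t, x_ref):
    """Epipolar line (a, b, c) in the source image for reference pixel x_ref.

    R, t map reference-camera coordinates to source-camera coordinates.
    """
    E = skew(t) @ R                                         # essential matrix
    F = np.linalg.inv(K_src).T @ E @ np.linalg.inv(K_ref)   # fundamental matrix
    return F @ np.array([x_ref[0], x_ref[1], 1.0])


def triangulate_dlt(projections, pixels):
    """Linear (DLT) triangulation of one 3D point from two or more views."""
    rows = []
    for P, (u, v) in zip(projections, pixels):
        rows.append(u * P[2] - P[0])    # each view contributes two equations
        rows.append(v * P[2] - P[1])
    _, _, vt = np.linalg.svd(np.stack(rows))
    X = vt[-1]                          # right singular vector of the smallest singular value
    return X[:3] / X[3]


def sparse_depth_map(points_cam, K, height, width):
    """Scatter camera-frame 3D points into an (H, W) depth image; 0 means no sample."""
    depth = np.zeros((height, width))
    for X in points_cam:
        if X[2] <= 0:
            continue                    # behind the camera
        u, v = (K @ X)[:2] / X[2]
        col, row = int(round(u)), int(round(v))
        if 0 <= row < height and 0 <= col < width:
            depth[row, col] = X[2]
    return depth


def project(P, X):
    """Project a 3D point with a 3x4 projection matrix."""
    x = P @ np.append(X, 1.0)
    return x[:2] / x[2]


# Hypothetical two-view setup: identical intrinsics, pure sideways translation.
K = np.array([[500.0, 0.0, 320.0], [0.0, 500.0, 240.0], [0.0, 0.0, 1.0]])
R = np.eye(3)
t = np.array([-0.2, 0.0, 0.0])          # source camera shifted 0.2 m to the right
P_ref = K @ np.hstack([np.eye(3), np.zeros((3, 1))])
P_src = K @ np.hstack([R, t[:, None]])

X_true = np.array([0.3, -0.1, 2.0])     # point in the reference camera frame
x_ref = project(P_ref, X_true)
x_src = project(P_src, X_true)

line = epipolar_line(K, K, R, t, x_ref)
residual = line @ np.append(x_src, 1.0)  # ~0: the true match lies on the line
X_hat = triangulate_dlt([P_ref, P_src], [x_ref, x_src])
sparse = sparse_depth_map([X_hat], K, 480, 640)
```

With noise-free correspondences the residual is zero up to floating-point error and `X_hat` reproduces `X_true`. Because the triangulation is a chain of linear algebra operations, the same computation can be written in an automatic differentiation framework, which is what makes end-to-end training through this step possible.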

Results and Discussion

The authors train on the ScanNet dataset and evaluate on both ScanNet and Sun3D. Their method consistently outperforms prior learning-based MVS methods such as DPSNet, MVDepthNet, and GPMVSNet across a range of depth estimation metrics. The improvement is both quantitative and qualitative: the depth maps are detailed and coherent, and they are produced with fewer computational resources.
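
The summary does not name the metrics used. As an assumption, the sketch below computes three that are common in the depth estimation literature: absolute relative error, root mean squared error, and the fraction of pixels whose prediction-to-ground-truth ratio is within a threshold of 1.25. The function name `depth_metrics` and the tiny example arrays are hypothetical and are not results from the paper.

```python
import numpy as np


def depth_metrics(pred, gt, min_depth=1e-3):
    """Common depth metrics over pixels with valid ground truth.

    Assumes pred is strictly positive wherever gt is valid.
    """
    valid = gt > min_depth               # ignore pixels without ground truth
    p, g = pred[valid], gt[valid]
    abs_rel = np.mean(np.abs(p - g) / g)
    rmse = np.sqrt(np.mean((p - g) ** 2))
    ratio = np.maximum(p / g, g / p)     # symmetric ratio, always >= 1
    delta1 = np.mean(ratio < 1.25)
    return {"abs_rel": float(abs_rel), "rmse": float(rmse), "delta1": float(delta1)}


# Toy example: a 2x2 depth map in metres, with one pixel lacking ground truth.
gt = np.array([[1.0, 2.0], [4.0, 0.0]])
pred = np.array([[1.1, 2.0], [3.0, 5.0]])
metrics = depth_metrics(pred, gt)
```

Lower is better for `abs_rel` and `rmse`, while higher is better for `delta1`. The pixel with zero ground truth is excluded from all three.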

Learning the triangulation and dropping the expensive cost volume construction significantly reduces computational demands while still achieving state-of-the-art accuracy. The paper also evaluates the method on datasets and conditions not seen during training, which indicates that it generalizes well.

Implications and Future Directions

This research holds substantial implications for the development of efficient depth sensing in practical applications, including autonomous driving and AR/VR systems, where power and computational efficiency are crucial. The shift from traditional dense cost volume processing to sparse feature triangulation combined with CNN-based densification heralds a promising direction for stereo and multi-view depth estimation technologies.

Potential future developments could focus on further optimizing the matcher and triangulation components or integrating more advanced attention mechanisms for improved feature matching accuracy. Additionally, extending this framework to other domains, such as real-time video depth estimation, presents an intriguing opportunity for advancing 3D vision technologies further. The integration with SLAM systems could also provide a more comprehensive understanding and reconstruction of dynamic environments in real-time applications.

GitHub

GitHub - magicleap/DELTAS: Inference Code for DELTAS: Depth Estimation by Learning Triangulation And densification of Sparse points (ECCV 2020) (101 stars)
