Monocular Depth Estimation Based on Deep Learning: An Overview
Monocular depth estimation has attracted considerable research interest, primarily due to its relevance in autonomous systems and visual applications. Traditional approaches such as structure from motion (SfM) and stereo vision depend on multi-view geometric constraints and therefore often produce sparse, viewpoint-dependent depth maps. In contrast, monocular depth estimation, which infers depth from a single image, is inherently ill-posed: infinitely many 3D scenes can project to the same 2D image. Deep learning methods address this ambiguity by learning scene priors from data, enabling dense and detailed depth maps to be produced in an end-to-end fashion.
The paper, "Monocular Depth Estimation Based On Deep Learning: An Overview," provides a comprehensive review of deep learning techniques that have been employed for monocular depth estimation. It categorizes the existing literature by the training paradigm used: supervised, unsupervised, and semi-supervised methods. Each of these paradigms involves distinct challenges, such as the requirement for large labeled datasets in supervised learning, or issues like scale ambiguity and inconsistency intrinsic to monocular video scenarios in unsupervised techniques.
Datasets and Evaluation Metrics
The paper first surveys commonly used datasets such as KITTI, NYU Depth, Cityscapes, and Make3D, emphasizing their role in benchmarking depth estimation models. KITTI (outdoor driving scenes with LiDAR-derived ground truth) and NYU Depth (indoor scenes captured with an RGB-D sensor) are highlighted as pivotal benchmarks due to their diversity and data richness. Standard evaluation metrics, including root mean squared error (RMSE), absolute relative error (Abs Rel), and threshold accuracy (the fraction of pixels whose predicted-to-true depth ratio falls within δ < 1.25^k), enable consistent quantitative comparison across methods.
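To make these metrics concrete, the following is a minimal NumPy sketch of the usual KITTI-style evaluation protocol; the function name and the convention that invalid ground-truth pixels are marked as zero are assumptions for illustration.

```python
import numpy as np

def evaluate_depth(pred, gt):
    """Standard monocular depth metrics, computed over valid pixels.

    pred, gt: NumPy arrays of predicted and ground-truth depth (e.g., in meters).
    """
    mask = gt > 0                                   # ignore pixels with no ground truth
    pred, gt = pred[mask], gt[mask]

    abs_rel = np.mean(np.abs(pred - gt) / gt)       # Absolute Relative Error
    rmse = np.sqrt(np.mean((pred - gt) ** 2))       # Root Mean Squared Error
    rmse_log = np.sqrt(np.mean((np.log(np.maximum(pred, 1e-6)) - np.log(gt)) ** 2))

    ratio = np.maximum(pred / gt, gt / pred)        # per-pixel max(d/d*, d*/d)
    deltas = [np.mean(ratio < 1.25 ** k) for k in (1, 2, 3)]
    return {"abs_rel": abs_rel, "rmse": rmse, "rmse_log": rmse_log, "deltas": deltas}
```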
Supervised Learning Approaches
Supervised methods rely on annotated datasets in which ground-truth depths guide the learning process. They have been shown to deliver high-accuracy depth maps by leveraging sophisticated network architectures, such as fully convolutional and residual networks, and by incorporating multi-task learning. Specialized loss functions such as the BerHu (reverse Huber) loss have been proposed to better capture depth information: BerHu behaves like the L1 norm for small residuals and like a scaled L2 norm for large ones, balancing sensitivity across the depth range.
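A minimal PyTorch sketch of the BerHu loss as it is commonly formulated (e.g., by Laina et al.); the rule of setting the threshold to 20% of the largest residual in the batch is the typical choice, though implementations vary.

```python
import torch

def berhu_loss(pred, target):
    """BerHu (reverse Huber): L1 for small residuals, scaled L2 beyond a threshold."""
    diff = torch.abs(pred - target)
    c = 0.2 * diff.max().clamp(min=1e-6)        # threshold: 20% of the largest residual
    l2_branch = (diff ** 2 + c ** 2) / (2 * c)  # quadratic branch, continuous at |x| = c
    return torch.where(diff <= c, diff, l2_branch).mean()
```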
Additionally, techniques employing Conditional Random Fields (CRFs) and adversarial learning frameworks refine the structure of predictions, yielding sharper and more realistic depth maps. However, the supervised paradigm's dependency on large-scale, labeled depth datasets constrains its scalability.
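As a sketch of how an adversarial term can be attached to a depth regressor: the function, the discriminator argument, and the weight lam below are illustrative placeholders, not a specific method from the survey.

```python
import torch
import torch.nn.functional as F

def refinement_loss(pred_depth, gt_depth, discriminator, lam=0.01):
    """Regression loss plus an adversarial term encouraging realistic structure.

    discriminator: any nn.Module mapping a depth map to real/fake logits.
    """
    recon = F.l1_loss(pred_depth, gt_depth)
    logits = discriminator(pred_depth)              # scores for the predicted map
    adv = F.binary_cross_entropy_with_logits(logits, torch.ones_like(logits))
    return recon + lam * adv                        # lam balances the two terms
```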
Unsupervised and Semi-Supervised Learning Paradigms
Unsupervised learning circumvents the need for labeled data by exploiting geometric consistency and temporal coherence in video sequences. Typical architectures pair a depth network with a pose estimation network: predicted depth and relative camera pose are used to warp nearby frames into the target view, and the photometric difference between the synthesized and observed images serves as the supervisory signal. Despite promising advances, challenges persist, including occlusions caused by dynamic scenes and the inherent scale ambiguity of monocular reconstruction, motivating further work on geometric and semantic constraints.
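The core of this self-supervision can be condensed as follows, in the spirit of SfMLearner-style pipelines; all names are illustrative, and real systems add multi-scale losses, SSIM terms, and masking of occluded or moving pixels.

```python
import torch
import torch.nn.functional as F

def inverse_warp(source, depth, T, K):
    """Synthesize the target view by sampling `source` (B,3,H,W) at locations
    given by target `depth` (B,1,H,W), relative pose T (B,4,4), intrinsics K (3,3)."""
    b, _, h, w = source.shape
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    pix = torch.stack([xs, ys, torch.ones_like(xs)]).float().view(3, -1)  # (3, H*W)
    cam = torch.linalg.inv(K) @ pix * depth.view(b, 1, -1)   # back-project to 3D
    cam = torch.cat([cam, torch.ones(b, 1, h * w)], dim=1)   # homogeneous coords
    proj = K @ (T @ cam)[:, :3]                              # into source pixel coords
    x = proj[:, 0] / proj[:, 2].clamp(min=1e-6)
    y = proj[:, 1] / proj[:, 2].clamp(min=1e-6)
    grid = torch.stack([2 * x / (w - 1) - 1, 2 * y / (h - 1) - 1], dim=-1)
    return F.grid_sample(source, grid.view(b, h, w, 2), align_corners=True)

def photometric_loss(target, source, depth, T, K):
    warped = inverse_warp(source, depth, T, K)
    return torch.mean(torch.abs(warped - target))  # L1; SSIM is often added in practice
```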
Semi-supervised methods, often incorporating stereo pairs, blend the strengths of supervised and unsupervised learning. Methods in this paradigm combine sparse ground-truth data with stereo correspondence to improve accuracy, addressing the limitations of purely monocular supervision. They benefit further from consistency losses, for example enforcing agreement between the disparity maps predicted for the left and right views, as sketched below.
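A compact sketch of such a left-right consistency term, in the spirit of Godard et al.'s Monodepth; the (B, 1, H, W) shapes and the width-normalized disparity convention are assumptions.

```python
import torch
import torch.nn.functional as F

def lr_consistency_loss(disp_left, disp_right):
    """Penalize disagreement between the left disparity map and the right
    disparity map warped into the left view.

    disp_*: (B, 1, H, W) disparities expressed as fractions of image width.
    """
    b, _, h, w = disp_left.shape
    ys, xs = torch.meshgrid(
        torch.linspace(-1, 1, h), torch.linspace(-1, 1, w), indexing="ij")
    base = torch.stack([xs, ys], dim=-1).expand(b, h, w, 2)
    # Shift sampling x-coordinates by the predicted disparity
    # (grid coordinates span [-1, 1], i.e., a width of 2).
    offset = torch.zeros_like(base)
    offset[..., 0] = 2.0 * disp_left.squeeze(1)
    warped_right = F.grid_sample(disp_right, base - offset, align_corners=True)
    return torch.mean(torch.abs(disp_left - warped_right))
```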
Theoretical and Practical Implications
On the practical side, deep learning-based monocular depth estimation has advanced toward real-world applications, notably SLAM and autonomous navigation systems, where accurate environmental perception is crucial. Compared with active depth sensors such as LiDAR, monocular cameras offer significant savings in cost and hardware complexity while enhancing dense mapping capabilities.
Future Directions and Challenges
Future directions emphasize improving the accuracy, robustness, and real-time performance of monocular depth estimation models. The paper suggests further exploration of network architectures with stronger spatial and semantic awareness, improved data synthesis for training, and novel unsupervised loss formulations. It also encourages research into monocular depth perception mechanisms and the integration of symbolic knowledge to strengthen model generalization, alongside the practical deployment of efficient models in resource-constrained environments.
Conclusion
This review paper positions deep learning as a pivotal tool for tackling the challenges of monocular depth estimation. By surveying the landscape of learning methodologies and their open problems, it charts a trajectory for future research. With continued advances, the transition toward robust, scalable, and efficient monocular depth estimation systems appears promising, with broad implications for computer vision and autonomous systems.