A Survey on Deep Learning Techniques for Stereo-based Depth Estimation (2006.02535v1)

Published 1 Jun 2020 in cs.CV and cs.GR

Abstract: Estimating depth from RGB images is a long-standing ill-posed problem, which has been explored for decades by the computer vision, graphics, and machine learning communities. Among the existing techniques, stereo matching remains one of the most widely used in the literature due to its strong connection to the human binocular system. Traditionally, stereo-based depth estimation has been addressed through matching hand-crafted features across multiple images. Despite the extensive amount of research, these traditional techniques still suffer in the presence of highly textured areas, large uniform regions, and occlusions. Motivated by their growing success in solving various 2D and 3D vision problems, deep learning for stereo-based depth estimation has attracted growing interest from the community, with more than 150 papers published in this area between 2014 and 2019. This new generation of methods has demonstrated a significant leap in performance, enabling applications such as autonomous driving and augmented reality. In this article, we provide a comprehensive survey of this new and continuously growing field of research, summarize the most commonly used pipelines, and discuss their benefits and limitations. In retrospect of what has been achieved so far, we also conjecture what the future may hold for deep learning-based stereo for depth estimation research.

Summary

The paper reviews deep learning methods for stereo-based depth estimation, highlighting key contributions such as cost volume formulation and end-to-end architectures.
It examines the evolution from traditional hand-crafted feature matching to CNN-driven approaches that effectively address challenges like occlusions and textureless regions.
The survey outlines current challenges and future directions, emphasizing self-supervised learning and domain adaptation for robust real-world applications.

Deep Learning Techniques for Stereo-based Depth Estimation

Stereo-based depth estimation remains a difficult problem in computer vision. It is ill-posed because occlusions, textureless or repetitive surfaces, and changing illumination make pixel correspondences ambiguous. The paper "A Survey on Deep Learning Techniques for Stereo-based Depth Estimation" gives a comprehensive overview of the deep learning methods applied to this problem, covering the rapid progress made between 2014 and 2019.

Traditional vs. Deep Learning Approaches

Historically, stereo matching relied on hand-crafted features to establish pixel correspondences across a stereo pair. These approaches, although effective in many settings, struggled in ambiguous areas such as textureless regions or regions with repetitive patterns. Deep learning has emerged as a strong alternative: convolutional neural networks (CNNs) learn both feature extraction and matching, which has substantially improved stereo depth estimation performance.

Crafting Cost Volumes and Regularization

A core component of stereo depth estimation is the cost volume, which stores a matching cost for every pixel at every candidate disparity. The disparity map, and from it the depth map, is computed from this volume. The paper discusses both 3D and 4D cost volumes and highlights methods that regularize them with 2D or 3D CNNs to improve disparity estimation. Classical regularization techniques such as semi-global matching (SGM) and conditional random fields (CRFs) are also used inside deep network pipelines to refine the estimates further.
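
To make the idea of a cost volume concrete, here is a minimal sketch. It is not taken from the survey or from any specific network. It assumes rectified grayscale images stored as nested lists, uses the absolute intensity difference as the matching cost (a deep method would compare learned features instead), and selects disparities with a plain winner-take-all rule in place of learned regularization. The names `build_cost_volume` and `winner_take_all` are illustrative.

```python
def build_cost_volume(left, right, max_disp):
    """Return cost[d][y][x] = |left[y][x] - right[y][x - d]| for d in [0, max_disp).

    Assumes rectified images, so a left pixel at column x matches the right
    pixel at column x - d on the same row. Candidates that fall outside the
    right image get an infinite cost.
    """
    height, width = len(left), len(left[0])
    volume = []
    for d in range(max_disp):
        plane = []
        for y in range(height):
            row = []
            for x in range(width):
                if x - d >= 0:
                    row.append(abs(left[y][x] - right[y][x - d]))
                else:
                    row.append(float("inf"))
            plane.append(row)
        volume.append(plane)
    return volume


def winner_take_all(volume):
    """Pick, for each pixel, the disparity with the lowest matching cost."""
    max_disp = len(volume)
    height, width = len(volume[0]), len(volume[0][0])
    return [
        [min(range(max_disp), key=lambda d: volume[d][y][x]) for x in range(width)]
        for y in range(height)
    ]


# Toy scanline: the left view is the right view shifted by two pixels.
right = [[10, 20, 30, 40, 50, 60, 70, 80]]
left = [[0, 0, 10, 20, 30, 40, 50, 60]]
disparity = winner_take_all(build_cost_volume(left, right, max_disp=4))
```

In this toy case, winner-take-all recovers a disparity of 2 at every column from 2 onward, where the true match lies inside the image. Real images are far noisier, which is why the volume is regularized before a disparity is read out.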

Integration of Multiscale and Feature Learning

The survey describes various architectures that enlarge the receptive field and combine features at multiple scales. Both are important for capturing context and improving correspondence accuracy in challenging scenarios. Methods that use spatial pyramid pooling (SPP) or dilated convolutions are noted for handling multiscale features efficiently, which in turn contributes to better depth estimation.
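
The effect of dilation on the receptive field can be checked with simple arithmetic. The sketch below assumes a stack of stride-1 convolutions, for which the receptive field along one axis is 1 plus the sum of (kernel size - 1) × dilation over the layers. The function name `receptive_field` and the layer configurations are illustrative and are not taken from the survey.

```python
def receptive_field(kernel_sizes, dilations):
    """Receptive field along one axis of stride-1 convolutions applied in sequence."""
    if len(kernel_sizes) != len(dilations):
        raise ValueError("kernel_sizes and dilations must have the same length")
    field = 1
    for kernel, dilation in zip(kernel_sizes, dilations):
        # Each layer widens the field by the span of its (dilated) kernel minus one.
        field += (kernel - 1) * dilation
    return field


plain = receptive_field([3, 3, 3], [1, 1, 1])    # three ordinary 3x3 layers
dilated = receptive_field([3, 3, 3], [1, 2, 4])  # same layers, growing dilation
```

Here `plain` is 7 and `dilated` is 15: the dilated stack sees roughly twice as much context with the same number of weights.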

End-to-end Learning Architectures

End-to-end frameworks feature prominently because they replace the separate stages of the classical pipeline with a single differentiable network whose stages are trained jointly. Architectures such as DispNet and PSMNet follow this approach, predicting a disparity map from a stereo pair in a single forward pass, which in some cases allows real-time operation. Such architectures often add hierarchical refinement or cascaded stages, which help recover fine spatial detail and accurate depth at the same time.
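
End-to-end training needs a differentiable way to turn matching costs into a disparity. One widely used option is soft-argmin regression, which takes the expected disparity under a softmax over negated costs. The summary above does not name this operation, so treat the sketch as one plausible ingredient rather than the method of any particular network. The function name `soft_argmin` is illustrative, and the code handles a single pixel in plain Python.

```python
import math


def soft_argmin(costs):
    """Expected disparity under softmax(-cost) for one pixel.

    `costs[d]` is the matching cost of candidate disparity d. Unlike a hard
    argmin, the result varies smoothly with the costs and can be sub-pixel.
    """
    lowest = min(costs)
    # Subtracting the minimum keeps exp() numerically stable.
    weights = [math.exp(-(cost - lowest)) for cost in costs]
    total = sum(weights)
    return sum(d * weight for d, weight in enumerate(weights)) / total


# Two equally good neighbouring candidates give a sub-pixel estimate between them.
estimate = soft_argmin([4.0, 1.0, 1.0, 4.0])
```

For the symmetric costs above, the estimate lies halfway between disparities 1 and 2, something a hard argmin cannot express.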

Self-supervision and Domain Adaptation

The paper stresses the importance of self-supervised learning schemes, which often use image reconstruction losses to avoid the need for expensive ground-truth depth data. This fits well with emerging methods for unsupervised domain adaptation, which are critical for handling the domain shift that arises when moving from synthetic training datasets to real-world applications.
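
An image reconstruction loss can be sketched on a single scanline: the right view is warped toward the left view using the predicted disparity, and the difference from the actual left view is penalized. The code below is a simplified illustration under several assumptions: rectified views, a plain L1 penalty, linear interpolation, border clamping, and no occlusion handling. Published losses often add further terms, such as structural similarity or smoothness. The function names are illustrative.

```python
def warp_right_to_left(right_row, disparity_row):
    """Reconstruct a left scanline by sampling the right scanline at x - d(x)."""
    width = len(right_row)
    warped = []
    for x, d in enumerate(disparity_row):
        # Clamp the sampling position to the image border.
        src = min(max(x - d, 0.0), width - 1.0)
        x0 = int(src)                # floor, since src is non-negative
        x1 = min(x0 + 1, width - 1)
        frac = src - x0
        # Linear interpolation between the two nearest right-image pixels.
        warped.append((1.0 - frac) * right_row[x0] + frac * right_row[x1])
    return warped


def photometric_l1(left_row, right_row, disparity_row):
    """Mean absolute difference between the left scanline and its reconstruction."""
    warped = warp_right_to_left(right_row, disparity_row)
    return sum(abs(l - w) for l, w in zip(left_row, warped)) / len(left_row)


right_row = [10, 20, 30, 40, 50, 60]
left_row = [10, 10, 10, 20, 30, 40]   # right view shifted by two pixels
loss_good = photometric_l1(left_row, right_row, [2.0] * 6)
loss_bad = photometric_l1(left_row, right_row, [0.0] * 6)
```

The correct disparity reproduces the left scanline, so `loss_good` is lower than `loss_bad`. That difference is the training signal, and it requires no ground-truth depth.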

Evaluation and Comparison

Evaluation across diverse datasets, including KITTI and ApolloScape, reveals performance gaps that point to open problems: achieving sub-pixel accuracy, handling high-resolution inputs, and adapting to varying conditions. These issues remain open for further research and optimization. Techniques such as hierarchical disparity processing are noted for their efficiency, but retaining fine details and minimizing artifacts remain challenging.
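
Comparisons of this kind usually rest on a small set of disparity error metrics. The sketch below computes two common ones, the mean end-point error and the fraction of pixels whose error exceeds a threshold. The summary does not say which metrics the survey reports, so the 3-pixel default threshold and the convention that a ground-truth value of 0 marks an invalid pixel are assumptions made here for illustration.

```python
def _valid_errors(pred, truth):
    """Absolute disparity errors at pixels with valid ground truth (truth > 0)."""
    errors = [abs(p - t) for p, t in zip(pred, truth) if t > 0]
    if not errors:
        raise ValueError("no pixels with valid ground truth")
    return errors


def end_point_error(pred, truth):
    """Mean absolute disparity error in pixels."""
    errors = _valid_errors(pred, truth)
    return sum(errors) / len(errors)


def bad_pixel_rate(pred, truth, threshold=3.0):
    """Fraction of valid pixels whose error exceeds `threshold` pixels."""
    errors = _valid_errors(pred, truth)
    return sum(1 for e in errors if e > threshold) / len(errors)


pred = [10.0, 12.0, 20.0, 5.0]
truth = [10.0, 10.0, 15.0, 0.0]   # last pixel has no ground truth
epe = end_point_error(pred, truth)
bad = bad_pixel_rate(pred, truth)
```

The two numbers answer different questions: the end-point error rewards sub-pixel precision everywhere, while the bad-pixel rate counts only gross mismatches.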

In conclusion, stereo-based depth estimation continues to be a vibrant research domain, invigorated by the advances in deep learning frameworks. Future efforts are set to refine these methodologies, focusing on enhanced scalability, robustness under diverse settings, and reduced computational demands, ultimately broadening the application spectrum across industries reliant on accurate depth perception.
