
Self-supervised Monocular Trained Depth Estimation using Self-attention and Discrete Disparity Volume (2003.13951v1)

Published 31 Mar 2020 in cs.CV and cs.LG

Abstract: Monocular depth estimation has become one of the most studied applications in computer vision, where the most accurate approaches are based on fully supervised learning models. However, the acquisition of accurate and large ground truth data sets to model these fully supervised methods is a major challenge for the further development of the area. Self-supervised methods trained with monocular videos constitute one of the most promising approaches to mitigate the challenge mentioned above due to the widespread availability of training data. Consequently, they have been intensively studied, where the main ideas explored consist of different types of model architectures, loss functions, and occlusion masks to address non-rigid motion. In this paper, we propose two new ideas to improve self-supervised monocular trained depth estimation: 1) self-attention, and 2) discrete disparity prediction. Compared with the usual localised convolution operation, self-attention can explore more general contextual information that allows the inference of similar disparity values at non-contiguous regions of the image. Discrete disparity prediction has been shown by fully supervised methods to provide a more robust and sharper depth estimation than the more common continuous disparity prediction, besides enabling the estimation of depth uncertainty. We show that the extension of the state-of-the-art self-supervised monocular trained depth estimator Monodepth2 with these two ideas allows us to design a model that produces the best results in the field in KITTI 2015 and Make3D, closing the gap with respect to self-supervised stereo training and fully supervised approaches.

Authors (2)
  1. Adrian Johnston (1 paper)
  2. Gustavo Carneiro (129 papers)
Citations (221)

Summary

Self-supervised Monocular Trained Depth Estimation using Self-attention and Discrete Disparity Volume

The paper presents an innovative approach to monocular depth estimation, a fundamental problem in computer vision with substantial applications in perception for autonomous systems. The methodology leverages self-supervised learning from monocular videos to sidestep the prohibitive data demands of fully supervised approaches that rely on extensive ground truth datasets. The key advancements proposed involve the incorporation of self-attention mechanisms and discrete disparity prediction into the self-supervised monocular depth estimation framework.

Methodological Innovations

The proposed model improves upon existing self-supervised methods by integrating two novel components: a self-attention module and a discrete disparity volume (DDV).

  1. Self-Attention Module: Unlike convolutional layers, which are limited by their local receptive fields, self-attention facilitates the inclusion of non-contiguous contextual information. This approach allows the model to infer coherent disparity values across separate regions of an image, enhancing depth estimation, particularly in scenarios involving complex and non-rigid motion.
  2. Discrete Disparity Volume: The paper introduces a discrete disparity estimation approach, drawing on successes in fully supervised settings. By discretizing the disparity space, the model gains the ability to produce sharper depth maps and estimate pixel-wise depth uncertainty.
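The self-attention idea in point 1 can be sketched as a single non-local attention layer over a flattened feature map, so that every spatial position can attend to every other one. This is a minimal NumPy illustration under our own assumptions (the function name `self_attention_2d`, the projection shapes, and the residual connection are ours, not the paper's exact architecture):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention_2d(feat, w_q, w_k, w_v):
    """Non-local self-attention over a C x H x W feature map.

    Every spatial position attends to every other position, so
    disparity cues can propagate between non-contiguous regions
    of the image, unlike a local convolution.
    """
    c, h, w = feat.shape
    x = feat.reshape(c, h * w)          # flatten spatial dims: C x N
    q = w_q @ x                         # C' x N queries
    k = w_k @ x                         # C' x N keys
    v = w_v @ x                         # C  x N values
    # N x N attention map: row j weights all positions i for output j
    attn = softmax(q.T @ k / np.sqrt(q.shape[0]), axis=-1)
    out = v @ attn.T                    # aggregate values: C x N
    return feat + out.reshape(c, h, w)  # residual connection

# Example: an 8-channel 4x4 feature map with random projections
rng = np.random.default_rng(0)
feat = rng.normal(size=(8, 4, 4))
w_q, w_k = rng.normal(size=(4, 8)), rng.normal(size=(4, 8))
w_v = rng.normal(size=(8, 8))
refined = self_attention_2d(feat, w_q, w_k, w_v)  # same shape as input
```

The output keeps the input's shape, so the layer can be dropped between encoder and decoder stages of a depth network.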
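The discrete disparity prediction in point 2 can be illustrated with a soft-argmax over a volume of disparity bins: the network outputs per-pixel logits over K candidate disparities, and the final estimate is the expectation under the resulting categorical distribution, whose spread also yields an uncertainty map. The bin count, disparity range, and entropy-based uncertainty proxy below are illustrative assumptions, not the paper's exact formulation:

```python
import numpy as np

def softmax(x, axis=0):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def disparity_from_volume(logits, d_min=0.01, d_max=0.3):
    """Convert a K x H x W volume of per-bin logits into a continuous
    disparity map via soft-argmax, plus a per-pixel uncertainty map
    taken here as the entropy of the distribution over bins."""
    k = logits.shape[0]
    bins = np.linspace(d_min, d_max, k).reshape(k, 1, 1)  # disparity of each bin
    p = softmax(logits, axis=0)                           # distribution over bins
    disparity = (p * bins).sum(axis=0)                    # expected disparity
    uncertainty = -(p * np.log(p + 1e-12)).sum(axis=0)    # entropy per pixel
    return disparity, uncertainty

# A sharply peaked volume gives a confident estimate at the peaked bin;
# a flat volume gives the same expectation but high uncertainty.
peaked = np.zeros((5, 2, 2)); peaked[2] = 10.0
d_peaked, u_peaked = disparity_from_volume(peaked)
d_flat, u_flat = disparity_from_volume(np.zeros((5, 2, 2)))
```

Because the expectation mixes adjacent bins, the output stays continuous even though the candidate disparities are discrete, which is what lets the DDV keep sharp boundaries while still training with a continuous photometric loss.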

Implementation and Results

The efficacy of these enhancements is validated through comprehensive experiments on benchmark datasets such as KITTI 2015 and Make3D. Notably, the augmented model exhibits state-of-the-art performance in self-supervised monocular depth estimation on KITTI 2015. The quantitative evaluation demonstrates substantial performance gains across critical metrics, with improvements in absolute relative error (Abs Rel) and root mean squared error (RMSE) over the baseline model Monodepth2, making a strong case for the proposed innovations.
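Abs Rel and RMSE are standard in the KITTI depth evaluation protocol; as a minimal sketch (the valid-pixel masking convention is assumed, and pipelines typically also clamp depths and apply median scaling before this step):

```python
import numpy as np

def depth_metrics(pred, gt):
    """Absolute relative error and RMSE over valid ground-truth pixels.

    pred, gt: arrays of predicted and ground-truth depth in metres;
    pixels with gt <= 0 are treated as missing and excluded.
    """
    mask = gt > 0
    pred, gt = pred[mask], gt[mask]
    abs_rel = np.mean(np.abs(pred - gt) / gt)     # scale-relative error
    rmse = np.sqrt(np.mean((pred - gt) ** 2))     # error in metres
    return abs_rel, rmse

# Example: a prediction off by 1 m on a single 1 m-deep pixel
abs_rel, rmse = depth_metrics(np.array([2.0]), np.array([1.0]))
```

Lower is better for both; Abs Rel penalises errors relative to the true depth, while RMSE weights large absolute errors (typically on distant pixels) more heavily.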

Furthermore, qualitative results underscore the model's capability to accurately resolve finer geometric details, such as thin structures and object boundaries, a traditionally challenging aspect of depth prediction tasks. These results suggest significant enhancement in handling occlusions and dynamic scene elements, attributed to the improved contextual reasoning and depth resolution mechanisms.

Practical and Theoretical Implications

From a practical standpoint, the model's ability to accurately predict depth from monocular images without requiring ground truth depth data or stereo image pairs opens avenues for deployment in cost-sensitive and resource-constrained environments, such as unmanned aerial vehicles and mobile robots. The inherent estimation of depth uncertainty further highlights the model's applicability in risk-sensitive scenarios, providing a mechanism for uncertainty-aware decision-making.

Theoretically, the integration of self-attention into depth estimation tasks represents a promising intersection of techniques typically reserved for NLP, emphasizing the versatility and potential of attention mechanisms in expanding contextual understanding in vision tasks.

Future Developments

Future work could explore extending the self-attention paradigm into the temporal domain, potentially leveraging attention-based sequence processing techniques for improved depth estimation across video frames. Additionally, further work may examine the model's adaptability across diverse environments and conditions, particularly focusing on generalization across different visual domains and scenes with varied textural and lighting conditions.

In conclusion, this paper delineates a meaningful evolution in self-supervised monocular depth estimation, underpinning the potential of self-attention and discrete disparity techniques to enhance model precision and reliability. These advancements promise considerable impact on the fields of autonomous navigation and 3D scene understanding, paving the way for robust, efficient, and deployable vision systems.