HR-Depth: High Resolution Self-Supervised Monocular Depth Estimation (2012.07356v1)

Published 14 Dec 2020 in cs.CV and cs.AI

Abstract: Self-supervised learning shows great potential in monocular depth estimation, using image sequences as the only source of supervision. Although people try to use the high-resolution image for depth estimation, the accuracy of prediction has not been significantly improved. In this work, we find the core reason comes from the inaccurate depth estimation in large gradient regions, making the bilinear interpolation error gradually disappear as the resolution increases. To obtain more accurate depth estimation in large gradient regions, it is necessary to obtain high-resolution features with spatial and semantic information. Therefore, we present an improved DepthNet, HR-Depth, with two effective strategies: (1) re-design the skip-connection in DepthNet to get better high-resolution features and (2) propose feature fusion Squeeze-and-Excitation (fSE) module to fuse feature more efficiently. Using Resnet-18 as the encoder, HR-Depth surpasses all previous state-of-the-art (SoTA) methods with the least parameters at both high and low resolution. Moreover, previous state-of-the-art methods are based on fairly complex and deep networks with a mass of parameters which limits their real applications. Thus we also construct a lightweight network which uses MobileNetV3 as encoder. Experiments show that the lightweight network can perform on par with many large models like Monodepth2 at high-resolution with only 20% parameters. All codes and models will be available at https://github.com/shawLyu/HR-Depth.

Summary

The paper introduces HR-Depth, a novel network that enhances high-resolution monocular depth estimation via refined feature fusion and redesigned skip connections.
It incorporates a lightweight fSE module to re-weight feature channels, achieving superior accuracy compared to many state-of-the-art self-supervised methods.
Experimental results on the KITTI dataset show more accurate depth in large-gradient regions and accuracy that surpasses previous self-supervised methods while using fewer parameters.

An Overview of "HR-Depth: High Resolution Self-Supervised Monocular Depth Estimation"

This paper introduces HR-Depth, a convolutional network for high-resolution, self-supervised monocular depth estimation. The authors address a limitation of existing methods: feeding in high-resolution images rarely improves accuracy much, because depth is still estimated poorly in large-gradient regions, where interpolation errors concentrate. By redesigning the skip connections and the feature fusion step, HR-Depth outperforms previous state-of-the-art (SoTA) self-supervised methods and competes favorably with supervised methods.
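
To make the interpolation argument concrete, here is a minimal sketch, not taken from the paper, of what happens when a low-resolution depth prediction is upsampled across a depth discontinuity. It uses a 1D profile and linear interpolation as a stand-in for 2D bilinear interpolation; the scene depths (5 m and 20 m), the downsampling factor, and the helper `upsample_linear` are all illustrative assumptions.

```python
import numpy as np

def upsample_linear(low_res, factor):
    """Linearly interpolate a 1D depth profile to `factor` times its length."""
    n = low_res.size
    # Centres of the low-res pixels, expressed in high-res pixel coordinates.
    src = (np.arange(n) + 0.5) * factor - 0.5
    dst = np.arange(n * factor)
    return np.interp(dst, src, low_res)

# Hypothetical scene: an object at 5 m in front of a background at 20 m.
edge = 30
high_res = np.where(np.arange(64) < edge, 5.0, 20.0)

factor = 4
# Stand-in for a low-resolution prediction: block average of the true depth.
low_res = high_res.reshape(-1, factor).mean(axis=1)
restored = upsample_linear(low_res, factor)

abs_err = np.abs(restored - high_res)
near_edge = abs_err[edge - 8: edge + 8].max()
far_from_edge = max(abs_err[: edge - 8].max(), abs_err[edge + 8:].max())

print(f"max error near the edge:      {near_edge:.3f} m")
print(f"max error away from the edge: {far_from_edge:.3f} m")
```

In this toy setup the error is confined to the few pixels around the step, which is consistent with the summary's point that boundary regions, not smooth surfaces, are where accuracy is lost.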

Technical Contributions

HR-Depth's contributions center on handling high-resolution inputs for depth estimation. The paper identifies accurate estimation in large-gradient regions, especially object boundaries, as the core challenge in predicting high-quality depth maps. To address this, HR-Depth introduces:

Redesigned Skip-Connections: Borrowing from dense skip-connection architectures, such as U-Net++, this approach mitigates the semantic and spatial information gap between encoder and decoder layers, thereby enhancing feature fusion and boundary accuracy in the resulting depth maps.
Feature Fusion Squeeze-and-Excitation (fSE) Module: This module reduces computational cost while improving feature aggregation by re-weighting feature channels according to their estimated importance. The design needs fewer parameters than fusing features with standard convolutions and improves depth prediction accuracy.
HR-Depth Design and Evaluation: With ResNet-18 as the encoder, HR-Depth surpasses previous methods while using fewer parameters. A lightweight variant, Lite-HR-Depth, uses MobileNetV3 as the encoder and, with the help of knowledge distillation, remains competitive with more complex models at a fraction of their parameter count.
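
The fusion step described above can be sketched roughly as follows. This is an illustrative NumPy version of squeeze-and-excitation style channel re-weighting followed by a 1x1 projection, not the authors' implementation. The function names, channel counts, reduction ratio `r`, nearest-neighbour upsampling, and the 3x3 convolution used as the parameter-count baseline are all assumptions made for the example.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def upsample_nearest(x, factor=2):
    """Nearest-neighbour upsampling of a (C, H, W) feature map (assumed choice)."""
    return x.repeat(factor, axis=1).repeat(factor, axis=2)

def fse_fuse(features, w1, w2, w_proj):
    """Fuse same-resolution (C_i, H, W) feature maps by channel re-weighting.

    w1: (C_total // r, C_total) and w2: (C_total, C_total // r) form the gate MLP.
    w_proj: (C_out, C_total) is a 1x1 convolution written as a matrix.
    """
    x = np.concatenate(features, axis=0)       # (C_total, H, W)
    squeezed = x.mean(axis=(1, 2))             # squeeze: global average pooling
    hidden = np.maximum(w1 @ squeezed, 0.0)    # bottleneck layer with ReLU
    gates = sigmoid(w2 @ hidden)               # per-channel weights in (0, 1)
    reweighted = x * gates[:, None, None]      # excite: scale each channel
    # A 1x1 convolution applies the same linear map at every pixel.
    return np.einsum("oc,chw->ohw", w_proj, reweighted)

rng = np.random.default_rng(0)
enc = rng.standard_normal((16, 8, 8))    # encoder feature at this scale
mid = rng.standard_normal((16, 8, 8))    # intermediate node of the dense skip path
deep = rng.standard_normal((32, 4, 4))   # coarser feature from one level deeper

# A dense skip connection hands the decoder node all of these at once.
feats = [enc, mid, upsample_nearest(deep)]

c_total, r, c_out = 64, 16, 16           # hypothetical sizes
w1 = rng.standard_normal((c_total // r, c_total)) * 0.1
w2 = rng.standard_normal((c_total, c_total // r)) * 0.1
w_proj = rng.standard_normal((c_out, c_total)) * 0.1

fused = fse_fuse(feats, w1, w2, w_proj)

# Weight counts (biases ignored) versus one 3x3 convolution over the same channels.
fse_params = w1.size + w2.size + w_proj.size
conv3x3_params = 3 * 3 * c_total * c_out
print(fused.shape, fse_params, conv3x3_params)
```

With these assumed sizes the gate and projection together hold far fewer weights than the 3x3 convolution, which illustrates the kind of parameter saving the summary describes. The actual ratio depends on the real layer sizes.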

Empirical Validation and Performance

HR-Depth is evaluated on the KITTI dataset, a widely used benchmark for depth estimation models. It outperforms Monodepth2 (a widely used baseline) and PackNet-SfM at both standard and high resolutions. The gains are most pronounced at high resolution, which is attributed to its effective handling of depth gradients and interpolation errors.
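
The summary does not say which metrics are reported. As an assumption, the sketch below computes three measures commonly used for monocular depth evaluation: absolute relative error, RMSE, and the share of pixels with ratio threshold below 1.25. The function name `depth_metrics` and the sample depths are made up for illustration and are not results from the paper.

```python
import numpy as np

def depth_metrics(gt, pred):
    """Common depth-evaluation measures for positive depths (in metres)."""
    ratio = np.maximum(gt / pred, pred / gt)
    return {
        "abs_rel": float(np.mean(np.abs(gt - pred) / gt)),    # lower is better
        "rmse": float(np.sqrt(np.mean((gt - pred) ** 2))),    # lower is better
        "a1": float(np.mean(ratio < 1.25)),                   # higher is better
    }

# Made-up ground truth and predictions, not data from the paper.
gt = np.array([5.0, 10.0, 20.0, 40.0])
pred = np.array([5.5, 9.0, 30.0, 40.0])

metrics = depth_metrics(gt, pred)
print(metrics)
```

Relative measures such as `abs_rel` weight a 1 m error at 5 m more heavily than the same error at 40 m, which is one reason several metrics are usually reported together.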

The paper includes a comprehensive section on ablation studies to highlight the significance of each architectural innovation. These studies affirm that both the redesigned skip-connections and the fSE module contribute measurably to the overall performance improvements.

Implications and Future Research

The advancements introduced by HR-Depth have several implications:

Practical Deployment: With its small parameter count, Lite-HR-Depth is a candidate for deployment in resource-constrained environments such as mobile devices or embedded systems used in autonomous navigation.
Theoretical Insights: By elucidating the impact of high-resolution features with enriched semantic information on depth estimation tasks, HR-Depth outlines potential pathways for future research in architecture design and feature processing in computer vision.

The proposed methodologies could inspire further exploration into efficient model architectures for dense prediction tasks under self-supervised settings. Potential future work includes extending the architecture to different datasets and domains, integrating broader context or sequence information, and exploring more efficient training methodologies.

In summary, HR-Depth marries architectural efficiency with innovative semantic-spatial feature fusion to set a new benchmark for self-supervised monocular depth estimation. It is a substantial contribution to the computer vision domain, offering insights into both practical applications and theoretical underpinnings of high-resolution deep learning models.
