Essay: Learning Monocular Depth by Distilling Cross-domain Stereo Networks
The paper "Learning Monocular Depth by Distilling Cross-domain Stereo Networks" proposes an innovative methodology for enhancing monocular depth estimation using stereo networks. Traditional approaches to depth estimation through monocular images encounter significant challenges due to the difficulties in acquiring comprehensive depth datasets for supervised learning and accuracy limitations in unsupervised methods. The authors present a novel pipeline that leverages synthetic datasets in conjunction with stereo networks to mitigate these obstacles.
The proposed framework uses a stereo matching network as an intermediary that turns synthetic data into a useful training signal for a monocular depth network, thereby easing the domain gap that arises when synthetic data is used directly. The approach exploits the observation that stereo networks generalize better across domains because they rely on local pixel correspondences rather than the high-level semantic cues that monocular depth estimation depends on.
Methodology
The methodology unfolds over three sequential steps:
- Training a Stereo Network on Synthetic Data: A modified DispNet predicts both disparity maps and occlusion masks from synthetic stereo image pairs. The synthetic data, drawn from the Scene Flow datasets, supplies abundant ground-truth disparities and removes the need for real-world depth annotations (a minimal supervised-loss sketch follows this list).
- Domain-specific Fine-tuning: The stereo network is then fine-tuned on realistic data, either supervised or unsupervised. For the unsupervised case, the authors handle occlusions explicitly and add regularization terms that improve predictions in occluded regions, a typical weakness of photometric-loss approaches (see the photometric-loss sketch after this list).
- Monocular Depth Estimation via Distillation: Finally, the fine-tuned stereo network supervises the training of a monocular network, distilling the stereo network's learned disparity predictions so that the monocular network inherits the depth knowledge accumulated in the earlier steps (sketched after this list).
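To make the first step concrete, the sketch below shows what a supervised training loss on synthetic stereo data could look like. It is a minimal PyTorch sketch, not the paper's exact formulation: the function name `synthetic_stereo_loss`, the equal weighting of the two terms, and the assumption that occlusion ground truth is available as a binary mask are illustrative choices.

```python
import torch
import torch.nn.functional as F

def synthetic_stereo_loss(pred_disp, pred_occ_logits, gt_disp, gt_occ):
    """Supervised loss for the modified DispNet on synthetic stereo pairs (illustrative).

    pred_disp:       (B, 1, H, W) predicted disparity
    pred_occ_logits: (B, 1, H, W) predicted occlusion logits
    gt_disp:         (B, 1, H, W) ground-truth disparity from the synthetic renderer
    gt_occ:          (B, 1, H, W) binary occlusion mask (1 = occluded)
    """
    disp_loss = F.l1_loss(pred_disp, gt_disp)                               # disparity regression
    occ_loss = F.binary_cross_entropy_with_logits(pred_occ_logits, gt_occ)  # occlusion classification
    return disp_loss + occ_loss                                             # assumed 1:1 weighting
```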
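The unsupervised fine-tuning step reconstructs the left image by warping the right image with the predicted disparity and penalizes the photometric error outside occluded regions, together with a regularizer on the disparity. The sketch below illustrates this idea under simple assumptions: `warp_right_to_left`, `unsupervised_finetune_loss`, the first-order smoothness term, and the weight `w_smooth` are placeholders rather than the paper's exact loss.

```python
import torch
import torch.nn.functional as F

def warp_right_to_left(img_right, disp_left):
    """Reconstruct the left image by sampling the right image at x - d (positive disparity assumed)."""
    b, _, h, w = img_right.shape
    device = img_right.device
    ys, xs = torch.meshgrid(
        torch.arange(h, device=device, dtype=torch.float32),
        torch.arange(w, device=device, dtype=torch.float32),
        indexing="ij",
    )
    xs = xs.unsqueeze(0) - disp_left.squeeze(1)                   # shift columns by disparity
    grid_x = 2.0 * xs / (w - 1) - 1.0                             # normalise to [-1, 1] for grid_sample
    grid_y = 2.0 * ys.unsqueeze(0).expand_as(grid_x) / (h - 1) - 1.0
    grid = torch.stack((grid_x, grid_y), dim=-1)                  # (B, H, W, 2)
    return F.grid_sample(img_right, grid, align_corners=True)

def unsupervised_finetune_loss(img_left, img_right, disp_left, occ_mask, w_smooth=0.1):
    """Photometric loss masked by the predicted occlusion, plus a smoothness regularizer."""
    recon_left = warp_right_to_left(img_right, disp_left)
    valid = 1.0 - occ_mask                                        # ignore occluded pixels
    photo = (valid * (recon_left - img_left).abs()).sum() / (valid.sum() * 3 + 1e-7)
    # first-order smoothness on the disparity map
    dx = (disp_left[..., :, 1:] - disp_left[..., :, :-1]).abs().mean()
    dy = (disp_left[..., 1:, :] - disp_left[..., :-1, :]).abs().mean()
    return photo + w_smooth * (dx + dy)
```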
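Finally, distillation can be sketched as training the monocular network to regress the fine-tuned stereo network's disparity predictions from the left image alone. The interface `stereo_net(img_left, img_right) -> (disparity, occlusion logits)`, the helper names, and the occlusion-weighted L1 loss are assumptions for illustration; the paper's actual supervision term may differ.

```python
import torch

@torch.no_grad()
def stereo_pseudo_labels(stereo_net, img_left, img_right):
    """Run the fine-tuned stereo network to obtain disparity targets (no gradients)."""
    disp, occ_logits = stereo_net(img_left, img_right)
    return disp, torch.sigmoid(occ_logits)

def distillation_step(mono_net, stereo_net, img_left, img_right, optimizer):
    """Train the monocular network to reproduce the stereo network's disparity."""
    target_disp, occ = stereo_pseudo_labels(stereo_net, img_left, img_right)
    pred_disp = mono_net(img_left)                                # monocular input only
    valid = 1.0 - occ                                             # down-weight occluded pixels
    loss = (valid * (pred_disp - target_disp).abs()).sum() / (valid.sum() + 1e-7)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Keeping the stereo network frozen via `torch.no_grad()` ensures gradients flow only into the monocular network, which is the essence of the distillation step.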
Results
Experiments on the KITTI dataset show that the proposed framework achieves state-of-the-art results. Notably, with only 100 ground-truth images used for fine-tuning, the stereo-to-monocular distillation pipeline surpasses existing supervised and unsupervised methods. The approach is further evaluated on Cityscapes and Make3D, underscoring its robustness and generalizability across domains.
Implications and Future Work
The research introduces a compelling method for cross-domain knowledge transfer in depth estimation, potentially reducing reliance on extensive labeled datasets. The combination of synthetic data, stereo matching, and distillation principles presents a framework that could be adapted and expanded to other visual recognition domains.
Future work might integrate more sophisticated stereo matching architectures and extend the fine-tuning strategies. In addition, incorporating confidence measures for the stereo network's outputs could refine the distillation process and improve the reliability of the resulting monocular depth estimates.
In conclusion, the paper contributes a method that combines the utility of synthetic data with the cross-domain generalization of stereo matching to advance monocular depth estimation, offering insights and directions for future computer vision research.