
Unsupervised Learning of Monocular Depth Estimation and Visual Odometry with Deep Feature Reconstruction (1803.03893v3)

Published 11 Mar 2018 in cs.CV

Abstract: Despite learning based methods showing promising results in single view depth estimation and visual odometry, most existing approaches treat the tasks in a supervised manner. Recent approaches to single view depth estimation explore the possibility of learning without full supervision via minimizing photometric error. In this paper, we explore the use of stereo sequences for learning depth and visual odometry. The use of stereo sequences enables the use of both spatial (between left-right pairs) and temporal (forward backward) photometric warp error, and constrains the scene depth and camera motion to be in a common, real-world scale. At test time our framework is able to estimate single view depth and two-view odometry from a monocular sequence. We also show how we can improve on a standard photometric warp loss by considering a warp of deep features. We show through extensive experiments that: (i) jointly training for single view depth and visual odometry improves depth prediction because of the additional constraint imposed on depths and achieves competitive results for visual odometry; (ii) deep feature-based warping loss improves upon simple photometric warp loss for both single view depth estimation and visual odometry. Our method outperforms existing learning based methods on the KITTI driving dataset in both tasks. The source code is available at https://github.com/Huangying-Zhan/Depth-VO-Feat

Citations (610)

Summary

  • The paper presents a novel unsupervised learning framework that simultaneously estimates monocular depth and visual odometry using stereo sequences.
  • It introduces a deep feature reconstruction loss that leverages contextual cues beyond simple photometric error to improve robustness.
  • Evaluated on the KITTI dataset, the method achieves competitive performance for both depth estimation and visual odometry.

Unsupervised Learning of Monocular Depth Estimation and Visual Odometry with Deep Feature Reconstruction

The paper presents a novel framework for unsupervised learning of monocular depth estimation and visual odometry (VO) using stereo sequences. Prior approaches often treated these tasks in a supervised manner, requiring ground truth data that is expensive to obtain. To address this limitation, the authors adopt a self-supervised scheme that minimizes photometric error between views, using both spatial (stereo pairs) and temporal (consecutive frames) photometric warp errors to train convolutional neural networks (CNNs) for depth and odometry estimation.
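The temporal part of this supervision can be sketched as follows: the depth CNN's output for the target frame, together with the odometry CNN's relative pose, defines a warp of the source frame into the target view, and the per-pixel discrepancy is the training signal. The snippet below is a minimal numpy illustration, not the authors' implementation: the function names and the use of nearest-neighbour sampling (rather than the differentiable bilinear sampler used in practice) are simplifications for clarity.

```python
import numpy as np

def inverse_warp(src_img, depth, K, T):
    """Synthesize the target view by warping a source image, given
    per-pixel depth in the target frame, intrinsics K, and a 4x4
    relative pose T (target -> source). Returns the warped image
    and a mask of pixels that project inside the source image."""
    H, W = depth.shape
    K_inv = np.linalg.inv(K)
    # Homogeneous pixel grid of the target frame.
    ys, xs = np.mgrid[0:H, 0:W]
    pix = np.stack([xs, ys, np.ones_like(xs)], axis=0).reshape(3, -1)
    # Back-project to 3-D, move into the source frame, re-project.
    pts = (K_inv @ pix) * depth.reshape(1, -1)
    pts_src = T[:3, :3] @ pts + T[:3, 3:4]
    proj = K @ pts_src
    u = proj[0] / proj[2]
    v = proj[1] / proj[2]
    # Nearest-neighbour sampling for brevity; training uses a
    # differentiable bilinear sampler instead.
    ui = np.round(u).astype(int)
    vi = np.round(v).astype(int)
    valid = (ui >= 0) & (ui < W) & (vi >= 0) & (vi < H)
    warped = np.zeros_like(src_img)
    warped.reshape(-1)[valid] = src_img[vi[valid], ui[valid]]
    return warped, valid.reshape(H, W)

def photometric_loss(tgt_img, src_img, depth, K, T):
    """Mean absolute photometric error over validly warped pixels."""
    warped, mask = inverse_warp(src_img, depth, K, T)
    return np.abs(tgt_img - warped)[mask].mean()
```

The spatial (left-right) term has the same structure, except the "pose" is the fixed, known stereo baseline, which is what anchors the predicted depths to metric scale.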

Key Contributions

  1. Joint Depth and Odometry Estimation:
    • The framework utilizes binocular stereo sequences, enabling the depth CNN and odometry CNN to estimate their respective outputs in real-world scale. This approach circumvents scale ambiguity issues inherent in purely monocular systems.
  2. Feature-Based Reconstruction Loss:
    • Moving beyond simple photometric error, the paper introduces a deep feature reconstruction loss. This considers contextual information rather than per-pixel color matching alone, thereby improving the robustness of depth and odometry predictions.
  3. Performance on KITTI Dataset:
    • The proposed method outperforms existing learning-based methods on the KITTI driving dataset, demonstrating strong numerical results. The framework achieves competitive VO results and enhances single-view depth performance, verifying the efficacy of the stereo-based unsupervised approach and the feature reconstruction loss.
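The feature-based loss in contribution 2 replaces color sampling with feature sampling: instead of comparing raw pixel intensities, dense CNN feature maps of the source view are sampled at the warped locations and compared to the target's features. The sketch below is illustrative only; the correspondence map `corr` is assumed to be precomputed (e.g. by projecting the predicted depth through the predicted pose, as above), and the paper combines this term with the photometric loss rather than using it alone.

```python
import numpy as np

def feature_reconstruction_loss(tgt_feat, src_feat, corr):
    """L1 discrepancy between target features and source features
    sampled at warped locations.

    tgt_feat, src_feat: (C, H, W) dense feature maps.
    corr: (2, H, W) integer (u, v) source-view coordinates for every
          target pixel, assumed precomputed from depth and pose."""
    C, H, W = tgt_feat.shape
    u, v = corr
    valid = (u >= 0) & (u < W) & (v >= 0) & (v < H)
    uc = np.clip(u, 0, W - 1)
    vc = np.clip(v, 0, H - 1)
    warped = src_feat[:, vc, uc]                    # (C, H, W)
    diff = np.abs(tgt_feat - warped).mean(axis=0)   # per-pixel L1 over channels
    return diff[valid].mean()
```

Because features summarize a local context window rather than a single color, this term is more tolerant of lighting changes and texture-less regions than per-pixel photometric error, which is the robustness gain the summary refers to.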

Implications and Future Work

The utilization of stereo sequences for training without relying on ground truth presents practical implications for scalability in robotic vision systems, particularly for autonomous vehicles. The integration of deep feature matching showcases potential advancements in robust depth estimation even in unstructured environments. Future work may delve into addressing limitations such as handling occlusions and non-rigid scenes, as the current framework assumes a static world. Additionally, exploring improvements in CNN architectures tailored for odometry and augmenting the system with SLAM-inspired methodologies for map consistency could further enhance performance.

In summary, the paper successfully establishes a new direction in unsupervised learning for depth and odometry, pushing the boundaries on how self-supervised techniques can be extended to real-world robotic applications.
