UnDeepVO: Monocular Visual Odometry through Unsupervised Deep Learning (1709.06841v2)

Published 20 Sep 2017 in cs.CV

Abstract: We propose a novel monocular visual odometry (VO) system called UnDeepVO in this paper. UnDeepVO is able to estimate the 6-DoF pose of a monocular camera and the depth of its view by using deep neural networks. There are two salient features of the proposed UnDeepVO: one is the unsupervised deep learning scheme, and the other is the absolute scale recovery. Specifically, we train UnDeepVO by using stereo image pairs to recover the scale but test it by using consecutive monocular images. Thus, UnDeepVO is a monocular system. The loss function defined for training the networks is based on spatial and temporal dense information. A system overview is shown in Fig. 1. The experiments on KITTI dataset show our UnDeepVO achieves good performance in terms of pose accuracy.

Authors (4)
  1. Ruihao Li (4 papers)
  2. Sen Wang (164 papers)
  3. Zhiqiang Long (1 paper)
  4. Dongbing Gu (6 papers)
Citations (489)

Summary

  • The paper presents UnDeepVO, an innovative unsupervised deep learning method that recovers absolute scale for 6-DoF visual odometry.
  • It leverages a dual-branch CNN architecture with separate pose and dense depth estimators trained on stereo image pairs using photometric and disparity consistency losses.
  • Experimental results on the KITTI dataset show competitive performance against traditional and supervised methods, highlighting its potential for scalable visual SLAM applications.

Overview of UnDeepVO: Monocular Visual Odometry through Unsupervised Deep Learning

The paper presents a new approach to monocular visual odometry (VO) called UnDeepVO, which uses unsupervised deep learning to estimate the 6-DoF camera pose and scene depth from a monocular camera. A key advance is its ability to recover absolute scale, typically a difficult problem for monocular systems. By training on stereo image pairs but testing on monocular sequences, UnDeepVO bridges the gap between geometric and deep learning methods for visual odometry.

Methodology

UnDeepVO employs a deep neural network architecture composed of two primary components: a pose estimator and a depth estimator. The pose estimator is a VGG-style CNN that predicts the 6-DoF transformation between consecutive monocular frames. The depth estimator uses an encoder-decoder structure and outputs dense depth maps directly, rather than disparity maps, which helps training converge.
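
For intuition, here is a minimal PyTorch sketch of such a two-branch design. The module names (PoseNet, DepthNet), layer sizes, and the downsampled KITTI-like input resolution are illustrative assumptions, not the authors' exact configuration.

```python
# Minimal sketch of a two-branch UnDeepVO-style network in PyTorch.
# Layer sizes, module names, and the 416x128 input resolution are
# illustrative assumptions, not the authors' exact configuration.
import torch
import torch.nn as nn

class PoseNet(nn.Module):
    """VGG-style CNN mapping two stacked RGB frames to a 6-DoF pose."""
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(6, 16, 7, stride=2, padding=3), nn.ReLU(inplace=True),
            nn.Conv2d(16, 32, 5, stride=2, padding=2), nn.ReLU(inplace=True),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool2d(1),
        )
        self.fc_trans = nn.Linear(128, 3)  # translation (x, y, z)
        self.fc_rot = nn.Linear(128, 3)    # rotation (Euler angles)

    def forward(self, frame_pair):          # (B, 6, H, W): frames t and t+1 stacked
        feat = self.encoder(frame_pair).flatten(1)
        return self.fc_trans(feat), self.fc_rot(feat)

class DepthNet(nn.Module):
    """Encoder-decoder producing a dense depth map for a single frame."""
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(inplace=True),
        )
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.ConvTranspose2d(32, 1, 4, stride=2, padding=1), nn.Softplus(),  # positive depth
        )

    def forward(self, frame):                # (B, 3, H, W) -> (B, 1, H, W) depth
        return self.decoder(self.encoder(frame))

# Example forward pass on dummy, downsampled KITTI-sized input.
pose_net, depth_net = PoseNet(), DepthNet()
frames = torch.randn(1, 6, 128, 416)         # consecutive frames t and t+1
translation, rotation = pose_net(frames)
depth = depth_net(frames[:, :3])
print(translation.shape, rotation.shape, depth.shape)
```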

One of the core innovations is the unsupervised training scheme utilizing stereo images. By defining a loss function grounded in spatial and temporal dense information—specifically, photometric consistency, disparity consistency, and pose consistency losses—UnDeepVO ensures robust scale recovery without relying on expensive ground-truth data.
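
As a rough illustration of how such terms might combine, the total objective can be written as a weighted sum of spatial (stereo-pair) and temporal (consecutive-frame) terms. The weights and the SSIM-plus-L1 form of the photometric term below are assumptions in line with common practice, not the paper's exact formulation.

```latex
% Sketch of an UnDeepVO-style objective; the weights \lambda, \alpha are illustrative.
\begin{aligned}
L_{\text{total}} &= L_{\text{spatial}} + L_{\text{temporal}},\\
L_{\text{spatial}} &= \lambda_{p}\, L_{\text{photo}}(I_{l}, \hat I_{l})
                    + \lambda_{d}\, L_{\text{disp}}
                    + \lambda_{q}\, L_{\text{pose}},\\
L_{\text{photo}}(I, \hat I) &= \alpha\, \frac{1 - \operatorname{SSIM}(I, \hat I)}{2}
                             + (1 - \alpha)\, \lVert I - \hat I \rVert_{1}.
\end{aligned}
```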

Experimental Evaluation

Experiments on the KITTI dataset demonstrate the efficacy of UnDeepVO, showing competitive pose-estimation performance against traditional model-based methods and supervised learning approaches. The results indicate that UnDeepVO reliably recovers absolute scale in both pose and depth estimation, even though it operates on lower-resolution input images. Under monocular conditions, the system also compares favorably with VISO2-M and ORB-SLAM-M, outperforming them in certain metrics.

Depth estimation results show that UnDeepVO performs respectably against both supervised and other unsupervised methods. Although it does not surpass every benchmark on every metric, the combination of fully unsupervised training with absolute scale recovery is an important contribution to the field.

Implications and Future Directions

The implications of the UnDeepVO system are significant for applications where labeled data and ground-truth measurements are sparse or unavailable. By eliminating the requirement for such data, this approach facilitates the deployment of monocular visual odometry in a wider array of real-world scenarios, potentially lowering costs and broadening the accessibility of advanced visual SLAM systems.

The paper suggests further exploration into training with larger datasets to enhance performance, particularly in challenging conditions such as variable lighting and dynamic environments. Additionally, extending the system towards a complete visual SLAM solution is a proposed future direction, which could address the drift issues inherent in current VO methods.

In sum, this work represents a substantial step towards more flexible, scalable, and practical unsupervised visual odometry solutions, laying a foundation for future research in unsupervised deep learning for robotic perception.
