- The paper introduces a self-supervised network that estimates depth, pose, and photometric uncertainty to improve the accuracy of monocular visual odometry.
- It integrates the network's predictions into both the front-end tracking and the back-end photometric bundle adjustment of a sparse direct VO system.
- The framework achieves state-of-the-art performance on the KITTI and EuRoC benchmarks, rivaling stereo and LiDAR-based systems on the former and visual-inertial systems on the latter.
An Overview of D3VO: Deep Depth, Deep Pose, and Deep Uncertainty for Monocular Visual Odometry
This paper introduces D3VO, a framework for monocular visual odometry (VO) that leverages deep learning on three fronts: deep depth estimation, deep pose estimation, and deep uncertainty estimation. The authors argue that traditional geometry-based VO can be substantially strengthened by deep networks that provide metric-scale depth maps and reliable pose estimates alongside predictive uncertainty measures.
Core Methodology
D3VO is built on a self-supervised network that predicts depth, pose, and photometric uncertainty from monocular images. The network is trained on stereo videos without any direct depth supervision. To counteract the illumination changes that commonly occur between the images used for self-supervision, it additionally predicts affine brightness transformation parameters, which improves depth estimation accuracy. It also estimates photometric uncertainty to account for inherent ambiguities and noise in the visual data, which is particularly beneficial for non-Lambertian surfaces and dynamic scenes.
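To make the training objective concrete, here is a minimal PyTorch-style sketch of an uncertainty-weighted photometric loss with predicted affine brightness parameters. The function name, tensor shapes, and plain L1 residual are our own illustration of the idea, not the authors' implementation, whose full loss also mixes in a structural similarity term and other refinements.

```python
import torch

def photometric_loss(target, warped, sigma, a, b):
    """Uncertainty-weighted photometric loss with affine brightness alignment.

    target, warped: (B, 3, H, W) target image and the source image warped
                    into the target view via predicted depth and pose.
    sigma:          (B, 1, H, W) predicted photometric uncertainty, sigma > 0.
    a, b:           (B,) predicted affine brightness parameters per image pair.
    """
    # Align the brightness of the warped source to the target image,
    # compensating for exposure/illumination changes between the two views.
    warped = a.view(-1, 1, 1, 1) * warped + b.view(-1, 1, 1, 1)

    # Per-pixel photometric residual (plain L1 here for brevity).
    residual = (target - warped).abs().mean(dim=1, keepdim=True)

    # Heteroscedastic weighting: pixels with high predicted uncertainty are
    # down-weighted, while the log(sigma) term keeps the network from
    # inflating sigma everywhere.
    return (residual / sigma + torch.log(sigma)).mean()

# Toy usage with random tensors:
B, H, W = 2, 64, 64
loss = photometric_loss(torch.rand(B, 3, H, W), torch.rand(B, 3, H, W),
                        0.1 + torch.rand(B, 1, H, W),
                        torch.ones(B), torch.zeros(B))
```

The division by sigma is what lets the network "explain away" non-Lambertian surfaces and moving objects: instead of forcing the depth to absorb an error that no geometry can explain, the loss simply assigns those pixels high uncertainty.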
These predictions are integrated into both the front-end tracking and the back-end photometric bundle adjustment of a sparse direct VO framework: the predicted pose initializes and regularizes tracking, the predicted depth anchors metric scale, and the predicted uncertainty weights the photometric residuals, as sketched below. This tight integration distinguishes D3VO from prior hybrid approaches that incorporated deep learning outputs only superficially.
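Conceptually, the integration amounts to augmenting the direct VO energy with terms derived from the network. The sketch below is a simplified, hypothetical composition of such an energy (the helper names, residual inputs, and weights are our assumptions, not the paper's notation): an uncertainty-weighted photometric term, a depth term playing the role of a virtual stereo constraint on scale, and a pose-prior term used during front-end tracking.

```python
import numpy as np

def huber(r, k=1.345):
    """Huber cost, the kind of robust norm commonly used in direct methods."""
    a = np.abs(r)
    return np.where(a <= k, 0.5 * r**2, k * (a - 0.5 * k))

def total_energy(photo_res, photo_sigma, depth_res, pose_res,
                 w_depth=1.0, w_pose=1.0):
    """Illustrative composition of a D3VO-style energy (weights hypothetical).

    photo_res:   photometric errors of sparse points across keyframes.
    photo_sigma: network-predicted uncertainties that down-weight residuals.
    depth_res:   differences between estimated and network-predicted depth,
                 anchoring the metric scale like a virtual stereo camera.
    pose_res:    deviation of the tracked relative pose from the pose
                 network's prediction, acting as a prior during tracking.
    """
    e_photo = huber(photo_res / photo_sigma).sum()
    e_depth = huber(depth_res).sum()
    e_pose = (pose_res ** 2).sum()
    return e_photo + w_depth * e_depth + w_pose * e_pose

# Toy usage with random residuals:
rng = np.random.default_rng(0)
E = total_energy(rng.normal(size=100), 1.0 + rng.random(100),
                 rng.normal(size=100), rng.normal(size=6))
```

In the actual system such an energy is minimized over poses and inverse depths with Gauss-Newton-style optimization; the sketch only shows how the network's three outputs enter the objective rather than being bolted on afterwards.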
Evaluation and Performance
D3VO is evaluated on two prominent datasets: the KITTI odometry benchmark and the EuRoC MAV dataset. On KITTI, D3VO outperforms existing monocular VO methods and achieves results comparable to stereo and LiDAR-based systems. On EuRoC MAV, it competes closely with state-of-the-art visual-inertial odometry (VIO) systems, remaining robust and accurate despite the fast motion and strong illumination changes that make this dataset challenging.
The authors also conduct comprehensive ablation studies that quantify the individual contributions of deep depth, pose, and uncertainty to overall VO performance. In particular, additional experiments on the EuRoC dataset show that the predictive brightness transformation significantly reduces the photometric error.
Implications and Future Directions
D3VO has significant implications for visual odometry. By effectively combining the geometric consistency of traditional methods with the adaptability of deep learning, it opens new avenues for high-accuracy VO with monocular setups. The inclusion of deep uncertainty not only improves tracking robustness but also demonstrates how deep network predictions can feed into conventional optimization pipelines, signaling a potential shift toward hybrid systems in VO and SLAM research.
Looking forward, the framework established by D3VO could be extended to more complex sensor setups, potentially benefiting multi-modal odometry methods. Additionally, as self-supervised learning techniques continue to evolve, further refinement of the depth and pose networks could yield even greater robustness and accuracy without reliance on expensive ground-truth data.
In conclusion, this paper provides a robust framework that demonstrates how targeted integration of deep learning components can significantly enhance the capabilities of monocular visual odometry, offering results that closely rival more sensor-laden systems. The research holds the promise of advancing the field towards more efficient, scalable, and accurate navigation solutions.