- The paper introduces an iterative framework that couples monocular SLAM with CNN-based depth prediction so that each improves the other's pose and depth accuracy.
- It simulates a pseudo RGB-D system by feeding CNN-predicted depths into feature-based SLAM, mitigating the scale drift of monocular SLAM without the cost and range limits of real depth sensors.
- Experimental results on KITTI and TUM datasets demonstrate significant performance gains, underscoring its potential for augmented reality, robotics, and autonomous navigation.
An Expert Overview of "Pseudo RGB-D for Self-Improving Monocular SLAM and Depth Prediction"
The paper "Pseudo RGB-D for Self-Improving Monocular SLAM and Depth Prediction" by Tiwari et al. presents a novel methodology for enhancing monocular Simultaneous Localization And Mapping (SLAM) and monocular depth prediction by integrating these two distinct fields. Traditional approaches in monocular SLAM and convolutional neural networks (CNNs) for depth prediction have limitations, particularly when utilized independently. This work proposes a synergistic framework that leverages the strengths of both geometric SLAM and CNN-based depth estimation, achieving improvements in robustness and accuracy from monocular video input without requiring labeled data.
Framework Overview
The framework introduced by the authors adopts a unique self-improving strategy where monocular SLAM and depth prediction aid each other iteratively. This involves two main components:
- Pseudo RGB-D SLAM: CNN-predicted depths are fed into a feature-based SLAM system, simulating the input of an RGB-D sensor. This addresses the scale drift and tracking-robustness issues typical of monocular SLAM and improves pose estimation.
- Depth Network Refinement: SLAM-derived 3D structure and camera poses are used to fine-tune the CNN depth predictor through wide-baseline losses. The two new losses, symmetric depth transfer and depth consistency, exploit geometric consistency over a longer temporal range, improving depth accuracy on distant scene elements (a sketch of the transfer loss follows this list).
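The core of the wide-baseline refinement is transferring predicted depth between distant keyframes using SLAM poses and penalizing the disagreement. Below is a minimal PyTorch sketch of that idea; the function names (`backproject`, `transfer_depth`, `symmetric_depth_transfer`) and the exact L1 penalty are illustrative assumptions, not the authors' code, and occlusion/out-of-view handling as well as the companion depth consistency loss are omitted.

```python
import torch
import torch.nn.functional as F

def backproject(depth, K_inv):
    """Lift a depth map (B,1,H,W) to camera-space 3D points (B,3,H*W)."""
    b, _, h, w = depth.shape
    ys, xs = torch.meshgrid(torch.arange(h, device=depth.device),
                            torch.arange(w, device=depth.device), indexing="ij")
    pix = torch.stack([xs, ys, torch.ones_like(xs)], dim=0).float()  # (3,H,W)
    rays = K_inv @ pix.view(3, -1)                                   # (3,H*W)
    return depth.view(b, 1, -1) * rays.unsqueeze(0)                  # (B,3,H*W)

def transfer_depth(depth_src, T_src_to_tgt, K, K_inv):
    """Project source-view depths into the target view; returns pixel
    coordinates in the target image and the depth each point takes there."""
    pts = backproject(depth_src, K_inv)                  # (B,3,N)
    R, t = T_src_to_tgt[:, :3, :3], T_src_to_tgt[:, :3, 3:]
    pts_tgt = R @ pts + t
    z = pts_tgt[:, 2:3, :].clamp(min=1e-6)               # transferred depth
    uv = (K.unsqueeze(0) @ (pts_tgt / z))[:, :2, :]      # (B,2,N)
    return uv, z

def symmetric_depth_transfer(depth_i, depth_j, T_ij, T_ji, K, K_inv):
    """L1 discrepancy between transferred and predicted depth across a
    wide-baseline keyframe pair, applied in both directions."""
    loss = 0.0
    for d_src, d_tgt, T in [(depth_i, depth_j, T_ij), (depth_j, depth_i, T_ji)]:
        uv, z = transfer_depth(d_src, T, K, K_inv)
        b, _, h, w = d_tgt.shape
        grid = torch.stack([uv[:, 0] / (w - 1) * 2 - 1,   # normalize to [-1,1]
                            uv[:, 1] / (h - 1) * 2 - 1], dim=-1).view(b, h, w, 2)
        d_at_proj = F.grid_sample(d_tgt, grid, align_corners=True)
        loss = loss + (d_at_proj - z.view(b, 1, h, w)).abs().mean()
    return loss
```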
The framework alternates between these two components, with each round's SLAM output supervising the next round of depth refinement, until the gains in depth prediction and pose accuracy saturate.
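At a high level, the alternation can be pictured as the loop below. Every helper here (`run_pseudo_rgbd_slam`, `finetune_depth_net`, the identity `depth_net`) is a hypothetical stub standing in for the paper's components, so only the control flow reflects the method; the real system runs a feature-based SLAM backend and gradient-based fine-tuning in those slots.

```python
def run_pseudo_rgbd_slam(frames, depths):
    """Stub: would run feature-based SLAM on (RGB, predicted-depth) pairs."""
    keyframes = frames[::10]                        # pretend every 10th frame is a keyframe
    return ["pose"] * len(frames), ["map_point"], keyframes

def finetune_depth_net(depth_net, keyframes, poses, map_points):
    """Stub: would refine the network with photometric + wide-baseline losses."""
    return depth_net

def self_improve(depth_net, frames, num_iters=3):
    poses = None
    for _ in range(num_iters):                      # the paper iterates until gains saturate
        depths = [depth_net(f) for f in frames]     # CNN depth becomes the pseudo depth channel
        poses, map_points, keyframes = run_pseudo_rgbd_slam(frames, depths)
        depth_net = finetune_depth_net(depth_net, keyframes, poses, map_points)
    return depth_net, poses

depth_net, poses = self_improve(lambda frame: frame, frames=list(range(100)))
```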
Methodological Contributions
The paper delineates several innovations critical to the success of the proposed framework:
- Narrow and Wide Baseline Integration: A combination of narrow-baseline photometric losses with wide-baseline geometric losses, so that both short- and long-range scene information supervises depth refinement.
- Pseudo RGB-D for SLAM: An adaptation of RGB-D SLAM that consumes CNN-predicted depths in place of sensor depth, yielding greater robustness and accuracy than traditional monocular SLAM (see the pseudo-depth sketch after this list).
- Incremental Learning Loop: An iterative refinement procedure in which SLAM output and depth prediction feed back into each other, improving both with every pass.
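As a concrete illustration of the pseudo RGB-D idea, the sketch below quantizes metric depth predictions into the 16-bit depth PNGs that an ORB-SLAM-style RGB-D frontend reads, where `DepthMapFactor` scales metres to PNG units (5000 is the TUM RGB-D convention); `predict_depth` is a hypothetical stand-in for the depth network, not part of the paper's released code.

```python
import numpy as np
import cv2

DEPTH_MAP_FACTOR = 5000.0  # metres -> 16-bit PNG units, matching the SLAM config

def save_pseudo_depth(depth_metres: np.ndarray, path: str) -> None:
    """Quantize a metric depth map (H, W) float32 into the 16-bit PNG format
    a feature-based RGB-D SLAM system expects as its depth channel."""
    depth_png = np.clip(depth_metres * DEPTH_MAP_FACTOR, 0, 65535).astype(np.uint16)
    cv2.imwrite(path, depth_png)

# Example: build the depth stream for every frame before launching SLAM.
# for i, rgb in enumerate(frames):
#     save_pseudo_depth(predict_depth(rgb), f"depth/{i:06d}.png")
```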
Experimental Validation
The effectiveness of this framework is validated through extensive experiments on the KITTI and TUM RGB-D datasets. The results show clear gains in depth prediction over state-of-the-art monocular- and stereo-trained depth estimation methods, and in camera-pose accuracy over established SLAM systems such as ORB-SLAM. Notably, the proposed system achieves these results with monocular video alone, underscoring its applicability in real-world scenarios where stereo rigs or depth cameras are impractical.
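For context on how such depth gains are typically measured, the snippet below computes the standard KITTI-style metrics (AbsRel, RMSE, and the δ < 1.25 accuracy), with per-image median scaling as is customary for scale-ambiguous monocular predictions; the paper's exact protocol may differ in details such as depth caps and evaluation crops.

```python
import numpy as np

def depth_metrics(pred: np.ndarray, gt: np.ndarray, min_d=1e-3, max_d=80.0):
    """Standard monocular depth metrics with median scale alignment."""
    mask = (gt > min_d) & (gt < max_d)          # keep pixels with valid ground truth
    pred, gt = pred[mask], gt[mask]
    pred = pred * np.median(gt) / np.median(pred)  # align prediction scale to GT
    pred = np.clip(pred, min_d, max_d)
    abs_rel = np.mean(np.abs(pred - gt) / gt)
    rmse = np.sqrt(np.mean((pred - gt) ** 2))
    ratio = np.maximum(pred / gt, gt / pred)
    delta1 = np.mean(ratio < 1.25)              # fraction of "close enough" pixels
    return {"abs_rel": abs_rel, "rmse": rmse, "delta1": delta1}
```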
Implications and Future Directions
This research points toward more capable 3D perception systems, particularly for applications that need robust depth and localization from minimal sensor input. The practical implications extend to augmented reality, autonomous navigation, and robotic perception, where monocular cameras are preferred for cost and form-factor reasons.
Looking forward, this framework presents several avenues for future research:
- Real-time Implementations: Enhancing the framework for online, real-time applications could broaden its applicability across various domains.
- Generalization to Diverse Environments: Further optimization and testing across different environmental conditions and camera setups, including rolling shutter effects and dynamic scenes, can help refine and generalize this approach.
- Incorporation of Additional Modalities: Future work may explore integrating other sensory data, such as IMU readings, to further enhance robustness and accuracy.
Overall, this paper provides a compelling argument for the integration of geometric and learning-based approaches in monocular SLAM and depth prediction, offering insights that could inspire further advancements in 3D computer vision techniques.