Unsupervised Learning of Depth and Ego-Motion from Monocular Video Using 3D Geometric Constraints (1802.05522v2)

Published 15 Feb 2018 in cs.CV

Abstract: We present a novel approach for unsupervised learning of depth and ego-motion from monocular video. Unsupervised learning removes the need for separate supervisory signals (depth or ego-motion ground truth, or multi-view video). Prior work in unsupervised depth learning uses pixel-wise or gradient-based losses, which only consider pixels in small local neighborhoods. Our main contribution is to explicitly consider the inferred 3D geometry of the scene, enforcing consistency of the estimated 3D point clouds and ego-motion across consecutive frames. This is a challenging task and is solved by a novel (approximate) backpropagation algorithm for aligning 3D structures. We combine this novel 3D-based loss with 2D losses based on photometric quality of frame reconstructions using estimated depth and ego-motion from adjacent frames. We also incorporate validity masks to avoid penalizing areas in which no useful information exists. We test our algorithm on the KITTI dataset and on a video dataset captured on an uncalibrated mobile phone camera. Our proposed approach consistently improves depth estimates on both datasets, and outperforms the state-of-the-art for both depth and ego-motion. Because we only require a simple video, learning depth and ego-motion on large and varied datasets becomes possible. We demonstrate this by training on the low quality uncalibrated video dataset and evaluating on KITTI, ranking among top performing prior methods which are trained on KITTI itself.

Citations (709)

Summary

  • The paper introduces a novel unsupervised method that uses a 3D geometric loss to enforce consistency in point cloud estimates, significantly improving depth and ego-motion accuracy.
  • The paper employs principled validity masks to exclude occluded or unmatched regions, thereby enhancing training robustness in complex visual scenarios.
  • The paper develops an approximate backpropagation algorithm using ICP, which effectively aligns 3D structures without relying on ground-truth labels.

Unsupervised Learning of Depth and Ego-Motion from Monocular Video Using 3D Geometric Constraints

The paper "Unsupervised Learning of Depth and Ego-Motion from Monocular Video Using 3D Geometric Constraints" authored by Reza Mahjourian, Martin Wicke, and Anelia Angelova introduces a novel unsupervised approach to infer depth and ego-motion from monocular video sequences. This research provides significant advancements in the field of computer vision, particularly in scenarios where acquiring labeled data is challenging.

Methodology and Contributions

This work differentiates itself through its substantial reliance on 3D geometric constraints rather than solely on the 2D photometric errors common in prior work. The core contributions of this paper include:

  1. 3D Geometric Loss Integration: The authors introduce a new 3D geometric loss that enforces consistency of the inferred 3D point clouds across consecutive video frames. By comparing the point clouds directly, this loss penalizes inconsistencies in the estimated depth and improves the accuracy of both depth and ego-motion estimates without any ground-truth data (a minimal sketch of this loss follows the list).
  2. Incorporation of Principled Masks: To improve the robustness of training, the authors incorporate validity masks that exclude parts of the scene that cannot be matched across frames due to occlusions or field-of-view changes (see the mask sketch after the list). This sidesteps the shortcomings of the learned or post-processed masks used in the literature to mitigate similar issues.
  3. Novel Backpropagation Algorithm: A novel (approximate) backpropagation algorithm is developed to make the alignment of 3D structures trainable. It leverages Iterative Closest Point (ICP) computations to produce gradients for optimizing the depth and ego-motion predictions.
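To ground the 3D loss concretely, below is a minimal NumPy/SciPy sketch of its core idea: backproject both depth maps into point clouds, carry the frame-t cloud forward with the estimated ego-motion, and take the mean closest-point residual against the frame-(t+1) cloud. This is an illustration under simplified assumptions, not the authors' implementation; the function names are hypothetical, and a single nearest-neighbor query stands in for full ICP alignment.

```python
import numpy as np
from scipy.spatial import cKDTree

def backproject(depth, K):
    """Lift a depth map (H, W) into a point cloud (H*W, 3) in camera coordinates."""
    H, W = depth.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3)  # homogeneous pixels
    rays = pix @ np.linalg.inv(K).T      # per-pixel viewing rays
    return rays * depth.reshape(-1, 1)   # scale each ray by its depth

def point_cloud_loss(depth_t, depth_t1, K, R, t):
    """Mean closest-point distance between the frame-t cloud, moved by the
    estimated ego-motion (R, t), and the frame-(t+1) cloud."""
    P_t = backproject(depth_t, K) @ R.T + t   # express frame-t points in frame t+1
    P_t1 = backproject(depth_t1, K)
    dists, _ = cKDTree(P_t1).query(P_t)       # closest-point residuals, as in ICP
    return dists.mean()
```

In the paper itself, the residuals produced by ICP alignment are reused as approximate gradients with respect to the predicted depth and ego-motion, which is what makes the otherwise non-differentiable alignment step trainable; the single closest-point query above corresponds to one step of that alignment.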
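Similarly, here is a sketch of the field-of-view component of the principled masks, reusing the hypothetical `backproject` above: a frame-t pixel is kept only if, after applying the estimated ego-motion and reprojecting through the camera intrinsics, it still lands inside the frame-(t+1) image. The paper derives its masks analytically from the depth and ego-motion estimates; this simplified version ignores occlusions, a limitation the authors themselves note (see Future Work below).

```python
def validity_mask(depth, K, R, t):
    """Boolean (H, W) mask: True where a frame-t pixel, carried into frame t+1
    by the estimated ego-motion, reprojects inside the image bounds."""
    H, W = depth.shape
    P = backproject(depth, K) @ R.T + t   # frame-t points in frame-(t+1) coordinates
    proj = P @ K.T                        # pinhole projection (homogeneous)
    u, v = proj[:, 0] / proj[:, 2], proj[:, 1] / proj[:, 2]
    valid = (proj[:, 2] > 0) & (u >= 0) & (u < W) & (v >= 0) & (v < H)
    return valid.reshape(H, W)
```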

Experimental Evaluation and Results

The method is evaluated on the KITTI dataset and on a custom video dataset captured with an uncalibrated mobile phone camera. The results indicate that it consistently outperforms state-of-the-art methods in both depth and ego-motion estimation, highlighting the effectiveness of the 3D loss component.

  • Performance on KITTI Dataset:

The method achieved an absolute relative error (Abs Rel) of 0.159 and a root mean square error (RMSE) of 5.912 when evaluated on the KITTI dataset. These quantitative metrics indicate substantial improvements over prior unsupervised methods.
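For context, both are standard depth-evaluation metrics on KITTI (RMSE in meters). With predicted depths $\hat{d}_i$, ground-truth depths $d_i$, and $N$ valid pixels:

$$\text{Abs Rel} = \frac{1}{N}\sum_{i=1}^{N}\frac{\lvert \hat{d}_i - d_i \rvert}{d_i},\qquad \text{RMSE} = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(\hat{d}_i - d_i\right)^2}$$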

  • Generalization to Diverse Data:

With an emphasis on practical applicability, the paper demonstrates the robustness of the proposed method by training on the uncalibrated phone camera dataset and evaluating on the KITTI dataset. The model trained exclusively on this diverse and low-quality data converged to performance levels comparable to models trained on high-quality datasets, showcasing its potential for widespread deployment on varied and large-scale video data sources.

Implications and Future Directions

The practical implications of this research are significant for various fields, including autonomous driving and robotics, where understanding depth and motion from monocular cameras is crucial. The removal of reliance on labeled data or multiple calibrated cameras paves the way for scalable and cost-effective deployment of depth and ego-motion estimation systems.

Future Work Directions:

  1. Handling Dynamic Scenes: One limitation is the model's handling of dynamic objects, which can bias depth estimates. Future efforts could explicitly model dynamic elements in the scene to accurately separate their motion from the camera's ego-motion.
  2. Occlusion Handling: The current principled masks do not fully account for occlusions or disocclusions caused by viewpoint shifts. Enhancing the mask computation to consider these phenomena could further improve estimation accuracy.
  3. Integration with Other Sensors: While this work focuses on monocular video, combining the 3D geometric constraints method with other sensor data like IMU or radar could yield even more robust depth and ego-motion systems.

This paper establishes a robust framework for unsupervised learning using 3D geometric constraints, contributing substantially to the fields of computer vision and machine learning. The integration of geometric consistency into the learning process presents a promising advancement in unsupervised depth and ego-motion estimation, setting a strong foundation for future explorations and enhancements.
