Unsupervised Scale-consistent Depth and Ego-motion Learning from Monocular Video (1908.10553v2)

Published 28 Aug 2019 in cs.CV

Abstract: Recent work has shown that CNN-based depth and ego-motion estimators can be learned using unlabelled monocular videos. However, the performance is limited by unidentified moving objects that violate the underlying static scene assumption in geometric image reconstruction. More significantly, due to lack of proper constraints, networks output scale-inconsistent results over different samples, i.e., the ego-motion network cannot provide full camera trajectories over a long video sequence because of the per-frame scale ambiguity. This paper tackles these challenges by proposing a geometry consistency loss for scale-consistent predictions and an induced self-discovered mask for handling moving objects and occlusions. Since we do not leverage multi-task learning like recent works, our framework is much simpler and more efficient. Comprehensive evaluation results demonstrate that our depth estimator achieves the state-of-the-art performance on the KITTI dataset. Moreover, we show that our ego-motion network is able to predict a globally scale-consistent camera trajectory for long video sequences, and the resulting visual odometry accuracy is competitive with the recent model that is trained using stereo videos. To the best of our knowledge, this is the first work to show that deep networks trained using unlabelled monocular videos can predict globally scale-consistent camera trajectories over a long video sequence.

Citations (466)

Summary

  • The paper introduces a novel geometry consistency loss that aligns depth predictions across frames for scale-consistent estimation.
  • It employs a self-discovered mask to effectively manage dynamic objects and occlusions, improving image reconstruction accuracy.
  • The approach achieves state-of-the-art depth estimation on the KITTI dataset and robust global camera trajectory prediction using monocular videos.

Unsupervised Scale-consistent Depth and Ego-motion Learning from Monocular Video

The paper presents a methodology for unsupervised learning of depth and ego-motion from monocular video sequences, addressing challenges related to scale inconsistency and moving objects in visual scenes. The authors propose a geometry consistency loss to achieve scale-consistent predictions and introduce a self-discovered mask to handle dynamic objects and occlusions, yielding a framework that is simpler and more efficient than previous approaches. The depth estimator achieves state-of-the-art performance on the KITTI dataset, while the ego-motion network predicts globally scale-consistent camera trajectories over long sequences.

Key Contributions and Methodology

  1. Geometry Consistency Loss: The paper introduces a geometry consistency loss to tackle the per-frame scale ambiguity inherent in monocular systems. The predicted depth map of one frame is projected into an adjacent frame using the predicted relative pose and compared with the depth map predicted for that frame; penalizing the normalized difference between the two forces consecutive predictions onto a common scale, so scale consistency propagates through the entire video sequence (see the sketch after this list).
  2. Self-discovered Mask: The depth inconsistency computed for the geometry consistency loss doubles as a self-discovered mask that identifies regions affected by moving objects, occlusions, or estimation errors. Down-weighting these regions in the photometric reconstruction loss improves accuracy without requiring an additional network.
  3. Simplified Framework: Unlike multi-task learning approaches, the framework trains only the depth and ego-motion networks, avoiding the computational overhead of auxiliary tasks such as optical flow or semantic segmentation.
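
To make the interplay between the first two contributions concrete, here is a minimal PyTorch-style sketch, not the authors' implementation. The tensor names (computed_depth, sampled_depth, photo_loss, valid_mask) are assumptions for illustration; the normalized inconsistency and the mask M = 1 - D_diff follow the paper's definitions.

```python
import torch

def geometry_consistency_terms(computed_depth, sampled_depth, photo_loss, valid_mask):
    """Sketch of the geometry consistency loss and self-discovered mask.

    Assumed tensor shapes (all (B, 1, H, W)):
      computed_depth: depth of frame a's pixels projected into frame b
                      using the predicted relative pose
      sampled_depth:  frame b's predicted depth, interpolated at the
                      projected pixel locations
      photo_loss:     per-pixel photometric reconstruction error
      valid_mask:     1 where the projection lands inside the image
    """
    # Normalized depth inconsistency, D_diff = |D_a^b - D_b'| / (D_a^b + D_b'),
    # which lies in [0, 1).
    depth_diff = (computed_depth - sampled_depth).abs() / (computed_depth + sampled_depth)

    # Geometry consistency loss: mean inconsistency over valid pixels.
    gc_loss = (depth_diff * valid_mask).sum() / valid_mask.sum().clamp(min=1)

    # Self-discovered mask: down-weight pixels where the two depth maps
    # disagree (moving objects, occlusions, bad estimates).
    weight_mask = (1.0 - depth_diff) * valid_mask

    # Photometric loss weighted by the self-discovered mask.
    masked_photo_loss = (weight_mask * photo_loss).sum() / valid_mask.sum().clamp(min=1)
    return gc_loss, masked_photo_loss
```

In the full training objective these terms would be combined with a depth smoothness regularizer, with the scalar weights on each term treated as hyperparameters.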

Performance and Evaluation

The proposed method was evaluated on the KITTI dataset, achieving state-of-the-art results for depth estimation and visual odometry accuracy competitive with a recent model trained using stereo videos. Comparisons against ORB-SLAM and stereo-supervised methods show that consistent camera trajectories can be recovered over long sequences using only monocular video data.

Implications and Future Directions

The ability to predict scale-consistent trajectories from monocular videos opens possibilities for visual SLAM and autonomous driving applications where sensing with minimal hardware is crucial. Future research could focus on further improving visual odometry accuracy, for example by incorporating drift correction mechanisms into the proposed framework.

Conclusion

This paper contributes to the field of computer vision by addressing crucial limitations in monocular depth and ego-motion estimation. The proposed geometry consistency loss and self-discovered mask represent significant methodological advancements, providing a more streamlined and efficient approach to understanding scale in monocular vision systems. This work lays the groundwork for future developments in the domain, especially in applications requiring robust and scalable visual understanding.