- The paper presents an unsupervised CNN that leverages left-right disparity consistency to predict depth without ground truth supervision.
- It reformulates depth estimation as an image reconstruction problem, integrating appearance matching, smoothness, and consistency losses for robust multi-scale disparity maps.
- The approach outperforms several supervised methods on the KITTI dataset, demonstrating high accuracy and generalization across different environments.
Unsupervised Monocular Depth Estimation with Left-Right Consistency
The paper "Unsupervised Monocular Depth Estimation with Left-Right Consistency" addresses a fundamental challenge in computer vision: estimating depth from a single image without supervised training on ground-truth depth data. Rather than demanding the extensive labeled datasets that traditional methods require, it trains on the more readily available binocular stereo image pairs.
Novel Contributions
The authors propose a convolutional neural network (CNN) trained with a novel objective that exploits epipolar geometry constraints to generate disparity images. The key contributions of this work are as follows:
- End-to-End Network Architecture: The architecture is fully convolutional and performs monocular depth estimation without any direct supervision of depth data. This is achieved by training the network with an image reconstruction loss.
- Left-Right Disparity Consistency: A novel training loss enforces consistency between the left and right image disparities. This enhancement significantly improves the quality and robustness of the depth estimates.
- State-of-the-Art Results: The proposed method outperforms several existing supervised methods on the KITTI driving dataset, demonstrating robustness and high performance.
Methodology
Depth Estimation as an Image Reconstruction Problem
The network learns to predict depth by synthesizing disparity fields that warp one image of a rectified stereo pair into the other: the left image is reconstructed from the right image, and vice versa, so that reconstruction error can supervise the disparity. The warp is performed with a bilinear sampler, which keeps the whole pipeline differentiable during training. At test time only a single image is required.
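The warping step can be sketched as follows. This is a minimal NumPy illustration rather than the paper's implementation, assuming rectified grayscale images where a left-view pixel at column x corresponds to the right-view pixel at column x − d:

```python
import numpy as np

def reconstruct_left(right, disp):
    """Reconstruct the left view by sampling the right image at x - d.

    right: (H, W) grayscale right image.
    disp:  (H, W) left-view disparities in pixels (non-negative).
    """
    H, W = right.shape
    xs = np.arange(W)[None, :] - disp              # sampling coordinates x - d
    x0 = np.clip(np.floor(xs).astype(int), 0, W - 2)
    frac = np.clip(xs - x0, 0.0, 1.0)
    rows = np.arange(H)[:, None]
    # Linear interpolation along the scanline: this is what makes the
    # sampling differentiable in the real network (cf. grid_sample in PyTorch).
    return (1 - frac) * right[rows, x0] + frac * right[rows, x0 + 1]
```

With a constant disparity of 2 pixels, the reconstruction is simply the right image shifted two columns to the right, which is an easy sanity check for the sampler.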
Network Structure and Training
The architecture is an encoder-decoder with skip connections, inspired by DispNet, with enhancements for depth prediction. The decoder emits disparity maps at four scales, each at progressively higher resolution.
Key components of the loss function include:
- Appearance Matching Loss: This encourages the reconstructed image to resemble the input image, using a weighted combination of an L1 photometric term and an SSIM term.
- Disparity Smoothness Loss: This term enforces local smoothness in the predicted disparities, weighted by the gradients of the input image.
- Left-Right Consistency Loss: Novel to this work, it ensures the disparity generated from the left image is consistent with that from the right image, enforcing a bidirectional consistency constraint during training.
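The three terms above can be sketched at a single scale as follows. These are simplified hypothetical stand-ins, not the paper's code: the paper uses a 3×3 local SSIM rather than a whole-image one, penalizes smoothness in both x and y, and uses a differentiable sampler for the consistency check; the α = 0.85 weighting follows the paper's appearance loss.

```python
import numpy as np

def ssim_global(x, y, c1=0.01**2, c2=0.03**2):
    # Whole-image SSIM (the paper uses a 3x3 local version).
    mx, my = x.mean(), y.mean()
    vx, vy = x.var(), y.var()
    cov = ((x - mx) * (y - my)).mean()
    return ((2*mx*my + c1) * (2*cov + c2)) / ((mx**2 + my**2 + c1) * (vx + vy + c2))

def appearance_loss(recon, target, alpha=0.85):
    # Weighted mix of SSIM dissimilarity and L1 photometric error.
    ssim_term = (1 - ssim_global(recon, target)) / 2
    return alpha * ssim_term + (1 - alpha) * np.abs(recon - target).mean()

def smoothness_loss(disp, img):
    # Edge-aware: disparity gradients are down-weighted where the image
    # itself has strong gradients (likely depth discontinuities).
    dx_d = np.abs(np.diff(disp, axis=1))
    dx_i = np.abs(np.diff(img, axis=1))
    return (dx_d * np.exp(-dx_i)).mean()

def lr_consistency_loss(disp_l, disp_r):
    # Project the right-view disparity map into the left view (here with a
    # crude nearest-neighbour sample at x - d) and compare the two maps.
    H, W = disp_l.shape
    xs = np.clip(np.round(np.arange(W)[None, :] - disp_l).astype(int), 0, W - 1)
    rows = np.arange(H)[:, None]
    return np.abs(disp_l - disp_r[rows, xs]).mean()
```

In the paper, each term is computed for both the left and right views and the total loss is summed across the multi-scale disparity outputs.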
Experimental Evaluation
The proposed method was evaluated on the KITTI dataset and compared to several state-of-the-art supervised and unsupervised methods. Notable results include:
- KITTI Split: The method achieves superior performance on commonly used depth error metrics compared to both supervised methods and previous unsupervised techniques.
- Generalization: The model also generalizes well to other datasets such as Make3D and Cityscapes, highlighting its robustness across different environmental conditions and camera setups.
Implications and Future Directions
Practical Implications:
- Ease of Data Collection: By eliminating the need for large amounts of labeled depth data, the method significantly reduces the effort and cost involved in training depth prediction models.
- Wide Applicability: The approach can be used in varied applications, from robotic navigation to augmented reality, where depth perception from a single camera is crucial.
Theoretical Implications:
- Advancement in Unsupervised Learning: The left-right consistency check offers a novel direction for ensuring high-quality predictions, opening avenues for other unsupervised learning tasks.
- Improvement over Supervised Methods: By outperforming several supervised baselines without any ground-truth depth, it challenges the assumption that supervised learning is indispensable for high-quality depth estimation.
Future Developments:
- Temporal Consistency: Extending this work to videos by incorporating temporal consistency could further stabilize and improve depth predictions.
- Occlusion Handling: Integrating explicit occlusion modeling could resolve the remaining artifacts at depth discontinuities and occlusion boundaries.
- Sparse Input Training: Incorporating sparse depth inputs, such as LiDAR returns, could provide additional robustness and accuracy.
In conclusion, the paper presents a significant advance in unsupervised depth estimation, leveraging stereo image pairs and a novel consistency enforcement mechanism to achieve superior accuracy and generalization. This research underscores the potential of unsupervised methods to meet or even exceed the performance of supervised counterparts, fostering further innovation in the field of computer vision.