- The paper presents an unsupervised framework that jointly learns depth and optical flow using a novel cross-task consistency loss.
- The method combines geometric relationships between depth, pose, and flow with standard photometric and smoothness priors, using a forward-backward consistency check to synthesize rigid flow and validate motion predictions where they are reliable.
- Results on the KITTI and Make3D datasets demonstrate competitive performance, underscoring its potential for real-world autonomous applications.
Unsupervised Joint Learning of Depth and Flow Using Cross-Task Consistency
This paper presents a novel framework, termed DF-Net, for the simultaneous unsupervised learning of single-view depth prediction and optical flow estimation. Using geometric consistency as a supervisory signal, the framework couples two tasks that are often tackled independently despite their inherent correlation. The core contribution of the work is the introduction of a cross-task consistency loss that improves the training of both models using only unlabeled monocular videos.
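At a high level, the training objective can be read as a sum of per-task photometric terms, regularizing priors, and the cross-task coupling term. The sketch below is only a schematic of that composition under assumed placeholder weights (`w_smooth`, `w_fb`, `w_cross` are illustrative, not the paper's settings):

```python
def joint_objective(l_photo_depth, l_photo_flow, l_smooth, l_fb, l_cross,
                    w_smooth=0.5, w_fb=0.2, w_cross=0.1):
    """Schematic joint loss: a photometric term per task, smoothness and
    forward-backward priors, and the cross-task consistency coupling.
    All weights here are illustrative placeholders."""
    return (l_photo_depth + l_photo_flow
            + w_smooth * l_smooth + w_fb * l_fb + w_cross * l_cross)
```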
Methodology
The authors propose a joint learning system comprising a depth network and a flow network, optimized concurrently so that each task benefits from the other. The method hinges on the geometric relationship between scene depth, camera pose, and optical flow: in rigid regions, the predicted depth and estimated camera motion determine a synthetic 2D optical flow, obtained by back-projecting each pixel to 3D, applying the camera motion, and re-projecting onto the image plane. Discrepancies between this synthesized rigid flow and the flow network's estimate are penalized by the cross-task consistency loss.
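A minimal numpy sketch of this rigid-flow synthesis, assuming known intrinsics `K` and a network-predicted relative pose `(R, t)` (the function name and numerical guards are illustrative, not the paper's implementation):

```python
import numpy as np

def synthesize_rigid_flow(depth, K, R, t):
    """Synthesize the 2D flow induced purely by camera motion.

    depth : (H, W) predicted depth of the source frame
    K     : (3, 3) camera intrinsics
    R, t  : (3, 3) rotation and (3,) translation from source to target view
    """
    H, W = depth.shape
    # Homogeneous pixel grid of the source frame.
    xs, ys = np.meshgrid(np.arange(W, dtype=np.float64),
                         np.arange(H, dtype=np.float64))
    pix = np.stack([xs, ys, np.ones_like(xs)]).reshape(3, -1)   # (3, H*W)
    # Back-project each pixel to a 3D point using its predicted depth.
    pts = (np.linalg.inv(K) @ pix) * depth.reshape(1, -1)
    # Apply the estimated camera motion and re-project to the target view.
    proj = K @ (R @ pts + t[:, None])
    uv = proj[:2] / np.clip(proj[2:3], 1e-6, None)
    # Rigid flow is the displacement of each pixel under this motion.
    return (uv - pix[:2]).reshape(2, H, W)
```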
The framework employs standard priors common in unsupervised methods, such as brightness constancy and spatial smoothness, and extends them with a forward-backward consistency check to cope with non-rigid motion and occlusions. This check identifies valid regions where the cross-task consistency constraint can be reliably enforced, improving the robustness of both depth and flow estimation.
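The sketch below illustrates how such a forward-backward check can gate the cross-task consistency term; the nearest-neighbor warping and threshold constants are simplifying assumptions rather than the paper's exact choices:

```python
import numpy as np

def warp_backward_flow(flow_bw, flow_fw):
    """Look up the backward flow at each pixel's forward-displaced location
    (nearest-neighbor sampling for brevity; bilinear in practice)."""
    _, H, W = flow_fw.shape
    xs, ys = np.meshgrid(np.arange(W), np.arange(H))
    xt = np.clip(np.rint(xs + flow_fw[0]).astype(int), 0, W - 1)
    yt = np.clip(np.rint(ys + flow_fw[1]).astype(int), 0, H - 1)
    return flow_bw[:, yt, xt]

def forward_backward_mask(flow_fw, flow_bw, alpha=0.01, beta=0.5):
    """Mark a pixel valid when the forward flow and the warped backward flow
    roughly cancel; large residuals indicate occlusion or unreliable flow.
    Thresholds alpha/beta are illustrative, not the paper's values."""
    bw_warped = warp_backward_flow(flow_bw, flow_fw)
    residual = ((flow_fw + bw_warped) ** 2).sum(0)
    magnitude = (flow_fw ** 2).sum(0) + (bw_warped ** 2).sum(0)
    return residual < alpha * magnitude + beta

def cross_task_loss(rigid_flow, net_flow, valid):
    """Penalize disagreement between the depth/pose-synthesized flow and the
    flow network's output, but only inside the valid (rigid, non-occluded)
    region selected by the forward-backward check."""
    diff = np.abs(rigid_flow - net_flow).sum(0)
    return (valid * diff).sum() / max(valid.sum(), 1)
```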
Results
Extensive evaluations show that models trained with this joint framework achieve competitive results against state-of-the-art unsupervised methods. DF-Net performs well on single-view depth prediction on the KITTI and Make3D datasets and on optical flow estimation on the KITTI flow benchmarks. Notably, it outperforms several baseline models and offers a practical strategy for unsupervised pre-training when ground-truth data is limited.
Implications
The implications of this research are multifaceted. Practically, the proposed joint training mechanism can support more accurate real-world systems, such as autonomous vehicles, where depth and motion estimation are crucial. Theoretically, the cross-task consistency loss could inspire further research into multi-task learning and geometric consistency in computer vision, improving how related tasks are combined to learn stronger shared representations.
Future Directions
The integration of stereo video data, which the authors suggest as a potential future avenue, could further improve the framework by providing depth and pose supervision from calibrated stereo pairs. Additionally, exploring more advanced architectures for the depth and flow networks might better address real-world challenges such as occlusions and dynamic scenes, thereby improving model robustness and generalization.
By focusing on the geometric linkage between tasks, this work not only contributes to advancements in individual task performance but also opens the door to a broader understanding of multi-task learning strategies in computer vision through unsupervised frameworks.