GeoNet: Unsupervised Learning of Dense Depth, Optical Flow and Camera Pose (1803.02276v2)

Published 6 Mar 2018 in cs.CV

Abstract: We propose GeoNet, a jointly unsupervised learning framework for monocular depth, optical flow and ego-motion estimation from videos. The three components are coupled by the nature of 3D scene geometry, jointly learned by our framework in an end-to-end manner. Specifically, geometric relationships are extracted over the predictions of individual modules and then combined as an image reconstruction loss, reasoning about static and dynamic scene parts separately. Furthermore, we propose an adaptive geometric consistency loss to increase robustness towards outliers and non-Lambertian regions, which resolves occlusions and texture ambiguities effectively. Experimentation on the KITTI driving dataset reveals that our scheme achieves state-of-the-art results in all of the three tasks, performing better than previously unsupervised methods and comparably with supervised ones.

Citations (1,092)

Summary

  • The paper presents a joint unsupervised framework that simultaneously estimates dense depth, optical flow, and camera pose from monocular videos.
  • The methodology uses a two-stage architecture combining rigid scene reconstruction with residual flow estimation to handle dynamic objects.
  • The approach leverages an adaptive geometric consistency loss to improve robustness against occlusions and non-Lambertian surfaces.

GeoNet: Unsupervised Learning of Dense Depth, Optical Flow, and Camera Pose

The paper "GeoNet: Unsupervised Learning of Dense Depth, Optical Flow and Camera Pose" presents an innovative framework for 3D geometry understanding from monocular video sequences. Authored by Zhichao Yin and Jianping Shi, the research focuses on leveraging the geometric relationships inherent in 3D scenes to jointly learn monocular depth, optical flow, and ego-motion in an unsupervised manner.

Overview and Major Contributions

GeoNet is built upon the observation that multiple computer vision tasks like depth estimation, optical flow, and camera motion estimation inherently share geometric regularities given by the 3D structure of the scene. The principal contributions of the paper can be summarized as follows:

  1. Joint Learning Framework: It introduces an end-to-end unsupervised framework that simultaneously estimates monocular depth, optical flow, and camera pose. This joint estimation leverages the geometric relationships among the tasks, improving performance over handling each task in isolation.
  2. Two-stage Architecture: GeoNet employs a cascaded architecture where the first stage reconstructs the rigid scene geometry using DepthNet and PoseNet, while the second stage uses ResFlowNet to refine the motion estimation for dynamic objects by learning residual optical flow.
  3. Adaptive Geometric Consistency Loss: The paper introduces a novel adaptive geometric consistency loss to handle occlusions and non-Lambertian surfaces effectively. This loss is crucial to improving the robustness of the framework against outliers.
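The cascade in contribution 2 can be sketched numerically. The helper below, `rigid_flow`, is a hypothetical illustration (not from the paper's released code) of the standard view-synthesis formulation it relies on: each target pixel is back-projected with its predicted depth, transformed by the predicted relative camera pose, and reprojected into the source view; the pixel displacement is the rigid flow to which ResFlowNet's residual would be added.

```python
import numpy as np

def rigid_flow(depth, K, T_t2s):
    """Rigid flow induced by camera motion (illustrative sketch).

    depth  : (H, W) predicted depth of the target frame
    K      : (3, 3) camera intrinsics
    T_t2s  : (4, 4) relative pose from target to source frame
    returns: (2, H, W) flow field, channels = (dx, dy)
    """
    H, W = depth.shape
    ys, xs = np.mgrid[0:H, 0:W]
    # Homogeneous pixel grid, shape (3, H*W)
    pix = np.stack([xs.ravel(), ys.ravel(), np.ones(H * W)])
    # Back-project to 3D camera coordinates, scaled by depth
    cam = np.linalg.inv(K) @ pix * depth.ravel()
    # Apply the relative pose, then reproject with the intrinsics
    cam_h = np.vstack([cam, np.ones(H * W)])
    proj = K @ (T_t2s @ cam_h)[:3]
    proj = proj[:2] / proj[2]
    # Rigid flow = displacement between projected and original pixels
    return (proj - pix[:2]).reshape(2, H, W)
```

With an identity pose the flow is zero everywhere, and a pure sideways translation at unit depth shifts every pixel by the same amount, which matches the intuition that rigid flow captures only camera-induced motion.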

Methodology

GeoNet utilizes the following components in its methodology:

  • Rigid Structure Reconstruction: The first stage of GeoNet involves estimating the static components of the scene through DepthNet and PoseNet. DepthNet predicts depth maps from single frames, and PoseNet estimates the relative camera poses between frames. The rigid scene flow is derived using these predictions, followed by a robust image reconstruction loss that takes photometric consistency into account.
  • Residual Flow Estimation: Given that dynamic objects disrupt the smoothness of rigid scene flow, the second stage introduces ResFlowNet. This network learns the residual optical flow, which accounts for the motion of dynamic objects. This residual flow is then added to the rigid flow to obtain the final flow estimation.
  • Geometric Consistency Enforcement: GeoNet employs an adaptive geometric consistency loss that operates by enforcing the coherence of forward-backward flow predictions. This helps in filtering out inconsistent predictions due to occlusions and maintains smoothness in the flow field.
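The forward-backward coherence idea in the last bullet can be sketched as a per-pixel consistency mask: a pixel is trusted only if following the forward flow and then the backward flow roughly returns to the starting point. The threshold constants `alpha` and `beta` below are illustrative values for this style of adaptive check, not guaranteed to match the paper's exact settings, and the nearest-neighbor warp stands in for the bilinear sampling a real implementation would use.

```python
import numpy as np

def fb_consistency_mask(flow_fwd, flow_bwd, alpha=3.0, beta=0.05):
    """Adaptive forward-backward consistency check (illustrative sketch).

    flow_fwd, flow_bwd: (2, H, W) flow fields, channels = (dx, dy)
    returns: (H, W) boolean mask, True where the two flows agree
    """
    _, H, W = flow_fwd.shape
    ys, xs = np.mgrid[0:H, 0:W]
    # Where each pixel lands under the forward flow (nearest neighbor)
    xt = np.clip(np.rint(xs + flow_fwd[0]).astype(int), 0, W - 1)
    yt = np.clip(np.rint(ys + flow_fwd[1]).astype(int), 0, H - 1)
    # Backward flow sampled at the forward-warped locations
    bwd_warped = flow_bwd[:, yt, xt]
    # Round-trip error: ideally f_bwd(p + f_fwd(p)) == -f_fwd(p)
    diff = np.linalg.norm(flow_fwd + bwd_warped, axis=0)
    mag = np.linalg.norm(flow_fwd, axis=0) + np.linalg.norm(bwd_warped, axis=0)
    # Adaptive threshold: tolerate larger mismatch where flows are large
    return diff < np.maximum(alpha, beta * mag)
```

Pixels failing the check (typically occlusions or specular regions) can then be excluded from the photometric and consistency losses, which is how the framework filters out the outliers mentioned above.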

Experimental Results

The framework is rigorously evaluated on the KITTI dataset across three tasks: monocular depth estimation, optical flow prediction, and camera pose estimation.

  • Monocular Depth Estimation: GeoNet outperforms prior unsupervised methods and even some supervised approaches. Notably, the method achieves state-of-the-art results by exploiting the geometric regularities shared among the tasks.
  • Optical Flow Prediction: The method shows strong performance in predicting optical flow, particularly in handling occlusions and maintaining consistency in texture-ambiguous regions. The experimental results underscore the impact of incorporating geometric consistency and the two-stage refinement process.
  • Camera Pose Estimation: Results on the KITTI odometry dataset demonstrate that GeoNet performs comparably to or better than other unsupervised methods, achieving precise estimates even in challenging scenarios involving dynamic objects.

Implications and Future Directions

The research highlights several significant implications for computer vision and AI:

  • Unified Framework: The successful joint learning framework opens avenues for developing unified models that can handle multiple tasks efficiently by leveraging shared geometric constraints. This reduces redundancy in training multiple isolated models.
  • Towards Complete Scene Understanding: By integrating depth, motion, and pose estimation, GeoNet moves towards comprehensive scene understanding, which is critical for applications in autonomous driving, augmented reality, and robotics.
  • Unsupervised Learning Advantage: The unsupervised learning paradigm in GeoNet avoids the need for expensive ground-truth data, demonstrating its effectiveness in large-scale deployment scenarios.

Future advancements could focus on refining the handling of non-rigid motions and extending the framework to incorporate semantic understanding. Integrating semantic cues could further enhance the model's capability to understand complex scenes with high-level contextual information.

In summary, GeoNet represents a significant step forward in unsupervised 3D scene understanding by effectively leveraging the geometric interdependencies of depth, optical flow, and camera pose estimation tasks.
