- The paper presents an unsupervised framework that leverages view synthesis as a supervisory signal for jointly estimating depth and camera motion without labeled data.
- It employs separate CNNs for per-pixel depth and relative camera pose, trained jointly from video via a view synthesis loss but usable independently at inference, achieving results competitive with supervised methods.
- Empirical evaluations on the KITTI dataset show monocular depth estimates comparable to supervised baselines and ego-motion estimates competitive with established SLAM systems.
Unsupervised Learning of Depth and Ego-Motion from Video
The paper "Unsupervised Learning of Depth and Ego-Motion from Video," authored by Tinghui Zhou, Matthew Brown, Noah Snavely, and David G. Lowe, presents an innovative approach for monocular depth and camera motion estimation using an unsupervised learning framework. This work leverages unstructured video sequences and bypasses the need for manual labeling or ground-truth pose information, distinguishing it from prior methods that require such supervision.
Summary of the Approach
The primary contribution of this paper is its unsupervised learning framework, which uses view synthesis as the supervisory signal. The framework couples a single-view depth network with a multi-view pose network. The pivotal element is a loss that warps nearby views to the target view using the estimated depth and pose, tying the two networks together during training; at inference, each network can be applied independently. Both are convolutional neural networks (CNNs) that predict depth maps and camera poses directly from image sequences.
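In the paper's notation (paraphrased from its view synthesis formulation), the photometric loss compares the target frame with each nearby source frame warped into the target view, and the warp is obtained by projecting target pixels into the source view using the predicted depth and relative pose:

```latex
\mathcal{L}_{vs} = \sum_{s} \sum_{p} \left| I_t(p) - \hat{I}_s(p) \right|,
\qquad
p_s \sim K \, \hat{T}_{t \to s} \, \hat{D}_t(p_t) \, K^{-1} p_t
```

Here $I_t$ is the target frame, $\hat{I}_s$ is the source frame $I_s$ warped to the target view by sampling it at the projected coordinates $p_s$, $\hat{D}_t$ is the predicted depth, $\hat{T}_{t \to s}$ the predicted relative camera pose, and $K$ the camera intrinsics.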
Key Components of the Framework
- Depth and Pose Networks: The depth network predicts a per-pixel depth map from a single image, while the pose network predicts the relative camera poses between the target frame and nearby source frames in a short snippet.
- View Synthesis Supervision: The view synthesis loss ensures that the transformed views align well with the target view, enforced by differentiable depth image-based rendering. This process inherently drives the network to learn accurate depth and pose estimation.
- Explainability Mask: To account for factors such as dynamic objects, occlusions, and non-Lambertian surfaces, an explainability mask is jointly predicted and used to weight the view synthesis objective, discounting parts of the scene that are unlikely to satisfy the model’s assumptions (the resulting weighted warping loss is sketched after this list).
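The following is a minimal NumPy sketch (my own illustration, not the authors' code) of the warping step and the explainability-weighted photometric loss described above. In practice this is implemented with a differentiable bilinear sampler inside a deep learning framework so that gradients flow back into the depth, pose, and explainability networks.

```python
import numpy as np

def project_to_source(depth_t, K, R, t):
    """Project every target pixel into the source view using the predicted
    depth map and relative pose (R, t): p_s ~ K [R|t] D(p_t) K^-1 p_t."""
    h, w = depth_t.shape
    ys, xs = np.mgrid[0:h, 0:w]
    pix_h = np.stack([xs, ys, np.ones_like(xs)], axis=0).reshape(3, -1)  # homogeneous pixel coords
    cam_pts = np.linalg.inv(K) @ pix_h * depth_t.reshape(1, -1)          # back-project to 3D (target frame)
    cam_pts_src = R @ cam_pts + t.reshape(3, 1)                          # transform into source frame
    proj = K @ cam_pts_src                                               # project into source image
    px = proj[0] / np.clip(proj[2], 1e-6, None)
    py = proj[1] / np.clip(proj[2], 1e-6, None)
    return px.reshape(h, w), py.reshape(h, w)

def bilinear_sample(img, px, py):
    """Sample img at continuous coordinates (px, py) with bilinear interpolation."""
    h, w = img.shape
    x0 = np.clip(np.floor(px).astype(int), 0, w - 2)
    y0 = np.clip(np.floor(py).astype(int), 0, h - 2)
    wx, wy = np.clip(px - x0, 0, 1), np.clip(py - y0, 0, 1)
    return ((1 - wx) * (1 - wy) * img[y0, x0] + wx * (1 - wy) * img[y0, x0 + 1]
            + (1 - wx) * wy * img[y0 + 1, x0] + wx * wy * img[y0 + 1, x0 + 1])

def view_synthesis_loss(I_t, I_s, depth_t, K, R, t, expl_mask):
    """Explainability-weighted L1 photometric loss between the target frame
    and the source frame warped into the target view."""
    px, py = project_to_source(depth_t, K, R, t)
    I_s_warped = bilinear_sample(I_s, px, py)
    return np.mean(expl_mask * np.abs(I_t - I_s_warped))
```

In the paper, the explainability mask is itself predicted by a network branch and regularized toward one so that the trivial all-zero weighting is penalized; in this sketch it is simply passed in as an array.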
Numerical Results
Empirical evaluation using the KITTI dataset demonstrates the effectiveness of the proposed method:
- Monocular Depth Estimation: The unsupervised depth estimates compare favorably with several supervised approaches (e.g., Eigen et al.). Notably, the Cityscapes-pretrained, KITTI-finetuned model achieves an Abs Rel error of 0.201, indicating accurate depth estimation without any ground-truth depth supervision (the metric itself is sketched after this list).
- Pose Estimation: The pose network, benchmarked against ORB-SLAM on the KITTI odometry split, achieves competitive Absolute Trajectory Error, particularly on snippets with little side-rotation.
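For context, Abs Rel is the standard mean absolute relative error used in monocular depth benchmarks. A small sketch of how it is typically computed, including the per-image median scaling applied to scale-ambiguous monocular predictions, is shown below; the function name and depth caps are illustrative rather than taken from the authors' evaluation code.

```python
import numpy as np

def abs_rel_error(pred_depth, gt_depth, min_depth=1e-3, max_depth=80.0):
    """Mean absolute relative error with per-image median scaling,
    as used for scale-ambiguous monocular depth predictions."""
    valid = (gt_depth > min_depth) & (gt_depth < max_depth)
    pred, gt = pred_depth[valid], gt_depth[valid]
    pred = pred * np.median(gt) / np.median(pred)   # align the unknown global scale
    pred = np.clip(pred, min_depth, max_depth)
    return np.mean(np.abs(pred - gt) / gt)
```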
Implications and Future Work
The implications of this research are significant for various fields, such as autonomous driving, robotics, and augmented reality, where monocular camera setups are prevalent, and acquiring labeled data is challenging. The unsupervised nature of the method enables leveraging vast quantities of video data readily available on the internet.
Future work could extend this framework to handle more complex scenes with significant dynamic objects and occlusions by integrating methods for explicit scene dynamics and motion segmentation. Another exciting direction is to explore unsupervised learning of full 3D volumetric representations, transcending the limitations posed by depth maps.
Further developments in this area of research could also investigate the internal representations learned by the system. Understanding and repurposing these representations for tasks such as object detection and semantic segmentation could unlock new applications and enhance the versatility of the trained models.
In conclusion, this paper marks a significant step toward flexible, scalable depth and ego-motion estimation from monocular video, demonstrating through empirical validation that a fully unsupervised learning paradigm can be highly effective.