DeMoN: Depth and Motion Network for Learning Monocular Stereo

Published 7 Dec 2016 in cs.CV | (1612.02401v2)

Abstract: In this paper we formulate structure from motion as a learning problem. We train a convolutional network end-to-end to compute depth and camera motion from successive, unconstrained image pairs. The architecture is composed of multiple stacked encoder-decoder networks, the core part being an iterative network that is able to improve its own predictions. The network estimates not only depth and motion, but additionally surface normals, optical flow between the images and confidence of the matching. A crucial component of the approach is a training loss based on spatial relative differences. Compared to traditional two-frame structure from motion methods, results are more accurate and more robust. In contrast to the popular depth-from-single-image networks, DeMoN learns the concept of matching and, thus, better generalizes to structures not seen during training.


Authors (7)

Citations (678)


Summary

The paper introduces a learning-based framework that jointly estimates depth and camera motion from pairs of successive monocular images, outperforming traditional two-frame SfM approaches.
The methodology employs a three-stage architecture (bootstrap, iterative, and refinement nets) to progressively improve prediction accuracy on benchmarks such as SUN3D and RGB-D SLAM.
The results show sharper recovery of depth discontinuities and motion edges, which is relevant to 3D scene reconstruction and autonomous navigation.

Overview of DeMoN: Depth and Motion Network for Learning Monocular Stereo

The paper frames the "structure from motion" (SfM) problem as a learning task and introduces DeMoN, a convolutional network trained end-to-end to estimate depth and camera motion from a pair of successive, unconstrained monocular images. Whereas traditional SfM methods rely on carefully engineered multi-step pipelines, DeMoN learns the whole mapping, and the authors report that it is more accurate and more robust than two-frame SfM baselines.

Network Architecture

DeMoN employs a multi-component deep learning architecture composed of:

Bootstrap Net: This network takes the image pair as input and produces the initial estimates. A first encoder-decoder predicts dense optical flow and its confidence; a second encoder-decoder uses that flow to predict initial depth, surface normals, and egomotion.
Iterative Net: Building upon initial predictions, this network refines depth and motion estimates iteratively, enhancing the robustness and accuracy of the predictions.
Refinement Net: The final stage operates at full image resolution, upsampling and sharpening the low-resolution depth from the iterative net into a high-resolution depth map.
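
The control flow of the three stages can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function `run_pipeline`, its parameters, the default of three iterations, and the toy stand-in networks are all hypothetical names and values chosen for the example, and the real networks are convolutional encoder-decoders that also pass optical flow, confidence, and normals between stages.

```python
from typing import Callable, List, Tuple

Depth = List[float]          # toy stand-in for a depth map
Motion = Tuple[float, ...]   # toy stand-in for rotation + translation


def run_pipeline(
    image_pair,
    bootstrap: Callable,
    iterative: Callable,
    refinement: Callable,
    num_iterations: int = 3,  # assumed default, for illustration only
) -> Tuple[Depth, Motion]:
    """Chain the three stages: initialize, refine repeatedly, then upsample."""
    # Stage 1: initial low-resolution depth and motion from the image pair.
    depth, motion = bootstrap(image_pair)
    # Stage 2: the same network is applied repeatedly to its own predictions.
    for _ in range(num_iterations):
        depth, motion = iterative(image_pair, depth, motion)
    # Stage 3: produce the full-resolution depth map; motion is left unchanged.
    return refinement(image_pair, depth), motion


# Toy stand-ins (hypothetical) so the sketch runs without a deep learning library.
def toy_bootstrap(image_pair):
    return [0.0, 0.0], (0.0, 0.0, 1.0)


def toy_iterative(image_pair, depth, motion):
    return [d + 1.0 for d in depth], motion


def toy_refinement(image_pair, depth):
    # Nearest-neighbour "upsampling": repeat every value twice.
    return [d for d in depth for _ in range(2)]


full_depth, final_motion = run_pipeline(
    None, toy_bootstrap, toy_iterative, toy_refinement
)
```

Passing the stages in as callables keeps the sketch independent of any particular framework; the point is only that the iterative stage consumes its own previous output while the other two stages run once.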

Numerical Results and Empirical Findings

DeMoN is compared against traditional two-frame SfM methods and single-image depth networks:

On multiple datasets such as SUN3D, RGB-D SLAM, and MVS, DeMoN achieves lower L1-relative and scale-invariant depth errors, and more accurate motion estimates, than the baseline methods.
The architecture's iterative design allows recovery and refinement of motion edges and depth discontinuities, providing notable improvements even under challenging scenarios with minimal motion.
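
For readers unfamiliar with the depth error measures named above, the sketch below computes them on flat lists of depth values. The formulas are the commonly used definitions (mean relative error, mean absolute error in inverse depth, and the scale-invariant log error); treating them as exactly the variants evaluated in the paper is an assumption, and the function names are illustrative.

```python
import math
from typing import Sequence


def l1_rel(pred: Sequence[float], gt: Sequence[float]) -> float:
    """Mean absolute depth error relative to the ground-truth depth."""
    return sum(abs(p - g) / g for p, g in zip(pred, gt)) / len(gt)


def l1_inv(pred: Sequence[float], gt: Sequence[float]) -> float:
    """Mean absolute error in inverse depth; emphasizes nearby points."""
    return sum(abs(1.0 / p - 1.0 / g) for p, g in zip(pred, gt)) / len(gt)


def sc_inv(pred: Sequence[float], gt: Sequence[float]) -> float:
    """Scale-invariant error: standard deviation of the log-depth differences."""
    z = [math.log(p) - math.log(g) for p, g in zip(pred, gt)]
    n = len(z)
    mean_of_squares = sum(v * v for v in z) / n
    square_of_mean = (sum(z) / n) ** 2
    # Clamp at zero to guard against tiny negative values from rounding.
    return math.sqrt(max(mean_of_squares - square_of_mean, 0.0))


# A prediction that is correct up to a global scale factor of 2.
gt_depth = [1.0, 2.0, 4.0]
pred_depth = [2.0, 4.0, 8.0]
rel = l1_rel(pred_depth, gt_depth)
inv = l1_inv(pred_depth, gt_depth)
sci = sc_inv(pred_depth, gt_depth)
```

In this example the relative and inverse-depth errors are large because every depth is doubled, while the scale-invariant error is essentially zero. That difference matters for two-view reconstruction, where absolute scale cannot be recovered from the images alone.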

Practical and Theoretical Implications

The implications of DeMoN extend beyond depth estimation. By showing that a network can learn to exploit motion parallax between two views, the work supports a move toward more integrated, learning-based SfM pipelines. Its depth and motion predictions are relevant to applications ranging from 3D scene reconstruction to autonomous navigation, especially where traditional methods struggle because visual information is sparse or ambiguous.

Future Directions

Future research could explore extending DeMoN to handle varying camera intrinsics and multi-image sequences, potentially enhancing its applicability across diverse datasets and conditions. There is also scope for integrating DeMoN within SLAM systems, leveraging its strengths in depth and motion estimation to improve localization and mapping accuracy.

In summary, DeMoN addresses limitations of traditional two-frame SfM with a single learning-based framework that estimates depth and camera motion jointly. Its stacked, iterative design and its reported results make it a useful reference point for later work on learned depth and motion estimation.
