- The paper introduces the MegaDepth dataset, leveraging SfM and MVS from Internet photos to support robust single-view depth prediction.
- It presents innovative depth map refinement techniques, including semantic filtering and conservative MVS, to reduce noise and errors.
- A depth prediction network with a scale-invariant loss, multi-scale gradient matching, and ordinal depth loss achieves superior generalization across benchmarks.
MegaDepth: Learning Single-View Depth Prediction from Internet Photos
The paper "MegaDepth: Learning Single-View Depth Prediction from Internet Photos" by Zhengqi Li and Noah Snavely investigates a novel approach to the problem of single-view depth prediction, leveraging large-scale datasets derived from Internet photos. This research attempts to address the limitations inherent in existing depth datasets, such as those restricted to indoor images or limited by the number of samples and coverage.
Key Contributions
The primary contributions of this paper are threefold:
- MegaDepth Dataset: The introduction of the MegaDepth (MD) dataset, a large-scale depth dataset generated using multi-view Internet photo collections through structure-from-motion (SfM) and multi-view stereo (MVS). This dataset contains diverse and extensive data, overcoming constraints seen in traditional datasets like NYU, Make3D, and KITTI.
- Depth Map Refinement Techniques: To counter challenges such as noise and unreconstructable objects inherent in MVS data, novel data cleaning methods are proposed. These include a modified MVS algorithm and semantic segmentation to remove spurious depth information and to generate ordinal depth relationships automatically.
- Depth Prediction Network: A robust depth prediction network trained on the MD dataset. The network is trained with a loss that combines a scale-invariant data term, multi-scale gradient matching, and an ordinal depth loss term to improve depth prediction accuracy and generalization across different scenes.
Methodological Details
Dataset Creation
The MD dataset was curated from well-photographed landmarks across the globe. Using state-of-the-art SfM and MVS pipelines, the authors reconstructed these scenes to produce dense depth maps. Because raw MVS output is noisy, they introduced two depth map refinement methods to clean the resulting training depth maps (a sketch of the filtering step follows this list):
- Semantic Filtering: By categorizing pixels into foreground, background, and sky through semantic segmentation, the authors effectively filtered out spurious depths attached to transient foreground objects and erroneous sky depths.
- Depth Map Cleaning: To refine depth continuity and remove noise, the authors proposed a conservative variant of MVS that keeps closer depth estimates, and they integrated semantic segmentation to automatically derive reliable ordinal depth data.
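A minimal sketch of what the semantic filtering step might look like, assuming the segmentation labels and MVS depth map are available as NumPy arrays. The label ids and the reconstruction-fraction threshold below are illustrative assumptions in the spirit of the paper's criteria, not values quoted from it:

```python
import numpy as np

# Hypothetical label ids from a generic semantic segmentation model;
# the paper's exact label sets may differ.
SKY_LABELS = {2}                 # assumed id for "sky"
TRANSIENT_FG_LABELS = {11, 12}   # assumed ids for transient objects (people, cars, ...)

def filter_depth_with_semantics(depth, labels, min_valid_frac=0.5):
    """Remove spurious MVS depths using a semantic segmentation map.

    depth:  (H, W) float array, 0 where MVS produced no estimate.
    labels: (H, W) int array of per-pixel semantic labels.
    """
    cleaned = depth.copy()

    # Sky should have no finite depth; drop anything reconstructed there.
    cleaned[np.isin(labels, list(SKY_LABELS))] = 0.0

    # For transient foreground classes, keep depths only if a large enough
    # fraction of those pixels was reconstructed; sparse depths on such
    # objects are likely "bleeding" from the background.
    for lbl in TRANSIENT_FG_LABELS:
        mask = labels == lbl
        if mask.any():
            valid_frac = (cleaned[mask] > 0).mean()
            if valid_frac < min_valid_frac:
                cleaned[mask] = 0.0
    return cleaned
```

The paper reasons about individual object regions rather than whole label classes; the per-label loop above is a simplification that conveys the idea.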
Loss Function
A key element of the proposed network is its loss function, designed around the scale ambiguity inherent in SfM-based 3D reconstructions. The loss comprises three terms (a minimal sketch follows this list):
- Scale-invariant Data Term (L_si): Measures the error between predicted and ground-truth log depths in a way that is invariant to a global multiplicative scale, which appears as an additive offset in log space.
- Multi-scale Gradient Matching Term (L_mgm): Encourages smooth depth transitions and sharp depth discontinuities by penalizing differences in depth gradients across multiple scales.
- Ordinal Depth Loss (L_ord): Leverages automatically generated ordinal depth relationships to improve the network's ability to predict relative depths accurately, especially for dynamic or unreconstructable objects.
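Putting the three terms together, the training objective has the form L = L_si + α·L_mgm + β·L_ord. Below is a minimal PyTorch sketch of one possible implementation, assuming the network predicts log depth, a binary float mask marks pixels with valid MVS depth, and ordinal pairs come precomputed from the semantic processing step. The weights `alpha` and `beta` are placeholders, and the ordinal term here is the plain logistic form (the paper uses a robust variant), so treat this as an illustration rather than the authors' exact loss:

```python
import torch
import torch.nn.functional as F

def data_term(pred_log, gt_log, mask):
    """Scale-invariant MSE in log depth (in the style of Eigen et al.).

    pred_log, gt_log, mask: (H, W) tensors; mask is 1.0 where MVS depth exists.
    """
    r = (pred_log - gt_log) * mask
    n = mask.sum().clamp(min=1.0)
    return (r ** 2).sum() / n - r.sum() ** 2 / n ** 2

def gradient_term(pred_log, gt_log, mask, scales=4):
    """L1 matching of log-depth residual gradients at several scales."""
    loss, r, m = 0.0, pred_log - gt_log, mask
    for _ in range(scales):
        dx = (r[:, 1:] - r[:, :-1]).abs() * m[:, 1:] * m[:, :-1]
        dy = (r[1:, :] - r[:-1, :]).abs() * m[1:, :] * m[:-1, :]
        loss = loss + (dx.sum() + dy.sum()) / m.sum().clamp(min=1.0)
        # Halve resolution for the next scale (simple 2x average pooling).
        r = F.avg_pool2d(r[None, None], 2)[0, 0]
        m = F.avg_pool2d(m[None, None], 2)[0, 0]
    return loss

def ordinal_term(pred_log, ij, kl, sign):
    """Logistic ordinal loss over sampled pixel pairs.

    ij, kl: (K, 2) integer pixel coordinates of each pair's two points.
    sign:   (K,) float in {+1, -1}; +1 means the first point should be farther.
    """
    diff = pred_log[ij[:, 0], ij[:, 1]] - pred_log[kl[:, 0], kl[:, 1]]
    # softplus(-sign * diff) = log(1 + exp(-sign * diff)), computed stably.
    return F.softplus(-sign * diff).mean()

def megadepth_style_loss(pred_log, gt_log, mask, ij, kl, sign,
                         alpha=0.5, beta=0.1):
    # alpha and beta are illustrative weights, not the paper's exact values.
    return (data_term(pred_log, gt_log, mask)
            + alpha * gradient_term(pred_log, gt_log, mask)
            + beta * ordinal_term(pred_log, ij, kl, sign))
```

Note how the ordinal term touches only pixels in the sampled pairs, which is what lets it supervise regions (people, cars) where MVS produces no metric depth at all.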
Evaluation and Results
The network, trained exclusively on the MD dataset, demonstrates strong generalization. It performs well not only on images of unseen MD scenes but also across diverse datasets such as Make3D, KITTI, and DIW, significantly outperforming prior methods trained on traditional datasets. For instance, the network improves RMS and absolute relative error on both Make3D and KITTI, indicating better accuracy and robustness.
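For concreteness, the two error metrics cited above are standard and easy to compute; here is a small sketch, assuming NumPy arrays of predicted and ground-truth depths with any scale alignment already applied (the network's output is defined only up to scale):

```python
import numpy as np

def abs_rel(pred, gt):
    """Mean absolute relative error: mean(|pred - gt| / gt)."""
    return np.mean(np.abs(pred - gt) / gt)

def rmse(pred, gt):
    """Root-mean-square error, in the ground truth's depth units."""
    return np.sqrt(np.mean((pred - gt) ** 2))
```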
Implications and Future Work
Practically, this research suggests a paradigm shift in training depth prediction models: harnessing vast and varied Internet photo collections offers a scalable and diverse dataset that can enhance model performance across different environments. Theoretically, it underscores the importance of incorporating scene semantics and ordinal relationships in depth prediction to address reconstruction limits in MVS-based training data.
Future avenues for this research might include extending the dataset to encompass more varied types of scenes beyond landmarks, and integrating learning-based SfM to handle scale ambiguity and improve depth map accuracy. Moreover, refining methods to handle oblique and complex surfaces will further enhance prediction reliability and applicability in real-world scenarios.
In summary, "MegaDepth: Learning Single-View Depth Prediction from Internet Photos" offers a comprehensive methodology and dataset that pave the way for more accurate and generalizable single-view depth prediction models, utilizing the internet's vast image resources to achieve enriched training diversity.