- The paper introduces the MegaDepth dataset, leveraging SfM and MVS from Internet photos to support robust single-view depth prediction.
- It presents innovative depth map refinement techniques, including semantic filtering and conservative MVS, to reduce noise and errors.
- A depth prediction network with a scale-invariant loss, multi-scale gradient matching, and ordinal depth loss achieves superior generalization across benchmarks.
MegaDepth: Learning Single-View Depth Prediction from Internet Photos
The paper "MegaDepth: Learning Single-View Depth Prediction from Internet Photos" by Zhengqi Li and Noah Snavely investigates a novel approach to the problem of single-view depth prediction, leveraging large-scale datasets derived from Internet photos. This research attempts to address the limitations inherent in existing depth datasets, such as those restricted to indoor images or limited by the number of samples and coverage.
Key Contributions
The primary contributions of this paper are threefold:
- MegaDepth Dataset: The introduction of the MegaDepth (MD) dataset, a large-scale depth dataset generated using multi-view Internet photo collections through structure-from-motion (SfM) and multi-view stereo (MVS). This dataset contains diverse and extensive data, overcoming constraints seen in traditional datasets like NYU, Make3D, and KITTI.
- Depth Map Refinement Techniques: To counter challenges such as noise and unreconstructable objects inherent in MVS data, novel data cleaning methods are proposed. These include a modified MVS algorithm and semantic segmentation to remove spurious depth information and to generate ordinal depth relationships automatically.
- Depth Prediction Network: A robust depth prediction network trained on the MD dataset. The network is trained with a loss that combines a scale-invariant data term, multi-scale gradient matching, and an ordinal depth loss term to improve depth prediction accuracy and generalization across different scenes.
Methodological Details
Dataset Creation
The MD dataset was curated from well-photographed landmarks across the globe. Using state-of-the-art SfM and MVS pipelines, the authors reconstructed these scenes to produce dense depth maps. Because raw MVS output is noisy, they introduced two depth map refinement methods to clean the resulting training depth maps (a sketch of the filtering step follows this list):
- Semantic Filtering: By categorizing pixels into foreground, background, and sky through semantic segmentation, the authors effectively filtered out spurious depths attached to transient foreground objects and erroneous sky depths.
- Depth Map Cleaning: To refine depth continuity and remove noise, the authors proposed a conservative variant of MVS that keeps closer depth estimates, and they integrated semantic segmentation to automatically derive reliable ordinal depth data.
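A minimal sketch of what the semantic filtering step might look like, assuming the segmentation labels and MVS depth map are available as NumPy arrays. The label ids and the reconstruction-fraction threshold below are illustrative assumptions in the spirit of the paper's criteria, not values quoted from it:

```python
import numpy as np

# Hypothetical label ids from a generic semantic segmentation model;
# the paper's exact label sets may differ.
SKY_LABELS = {2}                 # assumed id for "sky"
TRANSIENT_FG_LABELS = {11, 12}   # assumed ids for transient objects (people, cars, ...)

def filter_depth_with_semantics(depth, labels, min_valid_frac=0.5):
    """Remove spurious MVS depths using a semantic segmentation map.

    depth:  (H, W) float array, 0 where MVS produced no estimate.
    labels: (H, W) int array of per-pixel semantic labels.
    """
    cleaned = depth.copy()

    # Sky should have no finite depth; drop anything reconstructed there.
    cleaned[np.isin(labels, list(SKY_LABELS))] = 0.0

    # For transient foreground classes, keep depths only if a large enough
    # fraction of those pixels was reconstructed; sparse depths on such
    # objects are likely "bleeding" from the background.
    for lbl in TRANSIENT_FG_LABELS:
        mask = labels == lbl
        if mask.any():
            valid_frac = (cleaned[mask] > 0).mean()
            if valid_frac < min_valid_frac:
                cleaned[mask] = 0.0
    return cleaned
```

The paper reasons about individual object regions rather than whole label classes; the per-label loop above is a simplification that conveys the idea.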
Loss Function
A key element of the proposed network is its loss function, designed around the scale ambiguity inherent in SfM-based 3D reconstructions. The loss comprises three terms (a minimal sketch follows this list):
- Scale-invariant Data Term (L_si): Measures the error between predicted and ground-truth log depths in a way that is invariant to a global multiplicative scale, which appears as an additive offset in log space.
- Multi-scale Gradient Matching Term (L_mgm): Encourages smooth depth transitions and sharp depth discontinuities by penalizing differences in depth gradients across multiple scales.
- Ordinal Depth Loss (L_ord): Leverages automatically generated ordinal depth relationships to improve the network's ability to predict relative depths accurately, especially for dynamic or unreconstructable objects.
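Putting the three terms together, the training objective has the form L = L_si + α·L_mgm + β·L_ord. Below is a minimal PyTorch sketch of one possible implementation, assuming the network predicts log depth, a binary float mask marks pixels with valid MVS depth, and ordinal pairs come precomputed from the semantic processing step. The weights `alpha` and `beta` are placeholders, and the ordinal term here is the plain logistic form (the paper uses a robust variant), so treat this as an illustration rather than the authors' exact loss:

```python
import torch
import torch.nn.functional as F

def data_term(pred_log, gt_log, mask):
    """Scale-invariant MSE in log depth (in the style of Eigen et al.).

    pred_log, gt_log, mask: (H, W) tensors; mask is 1.0 where MVS depth exists.
    """
    r = (pred_log - gt_log) * mask
    n = mask.sum().clamp(min=1.0)
    return (r ** 2).sum() / n - r.sum() ** 2 / n ** 2

def gradient_term(pred_log, gt_log, mask, scales=4):
    """L1 matching of log-depth residual gradients at several scales."""
    loss, r, m = 0.0, pred_log - gt_log, mask
    for _ in range(scales):
        dx = (r[:, 1:] - r[:, :-1]).abs() * m[:, 1:] * m[:, :-1]
        dy = (r[1:, :] - r[:-1, :]).abs() * m[1:, :] * m[:-1, :]
        loss = loss + (dx.sum() + dy.sum()) / m.sum().clamp(min=1.0)
        # Halve resolution for the next scale (simple 2x average pooling).
        r = F.avg_pool2d(r[None, None], 2)[0, 0]
        m = F.avg_pool2d(m[None, None], 2)[0, 0]
    return loss

def ordinal_term(pred_log, ij, kl, sign):
    """Logistic ordinal loss over sampled pixel pairs.

    ij, kl: (K, 2) integer pixel coordinates of each pair's two points.
    sign:   (K,) float in {+1, -1}; +1 means the first point should be farther.
    """
    diff = pred_log[ij[:, 0], ij[:, 1]] - pred_log[kl[:, 0], kl[:, 1]]
    # softplus(-sign * diff) = log(1 + exp(-sign * diff)), computed stably.
    return F.softplus(-sign * diff).mean()

def megadepth_style_loss(pred_log, gt_log, mask, ij, kl, sign,
                         alpha=0.5, beta=0.1):
    # alpha and beta are illustrative weights, not the paper's exact values.
    return (data_term(pred_log, gt_log, mask)
            + alpha * gradient_term(pred_log, gt_log, mask)
            + beta * ordinal_term(pred_log, ij, kl, sign))
```

Note how the ordinal term touches only pixels in the sampled pairs, which is what lets it supervise regions (people, cars) where MVS produces no metric depth at all.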
Evaluation and Results
The network, trained exclusively on the MD dataset, demonstrates strong generalization. It performs well not only on images of unseen MD scenes but also across diverse datasets such as Make3D, KITTI, and DIW, significantly outperforming prior methods trained on traditional datasets. For instance, the network improves RMS and absolute relative error on both Make3D and KITTI, indicating better accuracy and robustness.
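For concreteness, the two error metrics cited above are standard and easy to compute; here is a small sketch, assuming NumPy arrays of predicted and ground-truth depths with any scale alignment already applied (the network's output is defined only up to scale):

```python
import numpy as np

def abs_rel(pred, gt):
    """Mean absolute relative error: mean(|pred - gt| / gt)."""
    return np.mean(np.abs(pred - gt) / gt)

def rmse(pred, gt):
    """Root-mean-square error, in the ground truth's depth units."""
    return np.sqrt(np.mean((pred - gt) ** 2))
```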
Implications and Future Work
Practically, this research suggests a paradigm shift in training depth prediction models: harnessing vast and varied Internet photo collections offers a scalable and diverse dataset that can enhance model performance across different environments. Theoretically, it underscores the importance of incorporating scene semantics and ordinal relationships in depth prediction to address reconstruction limits in MVS-based training data.
Future avenues for this research might include extending the dataset to encompass more varied types of scenes beyond landmarks, and integrating learning-based SfM to handle scale ambiguity and improve depth map accuracy. Moreover, refining methods to handle oblique and complex surfaces will further enhance prediction reliability and applicability in real-world scenarios.
In summary, "MegaDepth: Learning Single-View Depth Prediction from Internet Photos" offers a comprehensive methodology and dataset that pave the way for more accurate and generalizable single-view depth prediction models, utilizing the internet's vast image resources to achieve enriched training diversity.