- The paper introduces BlendedMVS, a synthetic dataset that tackles generalization challenges in MVS networks by blending real-world lighting with rendered imagery.
- It employs a low-cost pipeline to generate over 17,000 high-resolution images with paired ground-truth depth maps, providing rich training data for diverse 3D reconstruction scenarios.
- Experimental results show that models trained on BlendedMVS achieve lower depth errors and improved point cloud reconstructions compared to traditional datasets.
BlendedMVS: A Large-scale Dataset for Generalized Multi-view Stereo Networks
The paper introduces BlendedMVS, a large-scale synthetic dataset explicitly designed to address generalization barriers in learning-based multi-view stereo (MVS) approaches. The dataset targets limitations in existing training data that have historically constrained how well deep MVS models transfer to diverse, unseen scenarios.
Dataset Composition and Methodology
BlendedMVS contains over 17,000 high-resolution images covering varied scenes such as urban landscapes, architectural structures, sculptures, and small objects. The dataset is generated with a novel low-cost pipeline: high-quality textured meshes are first reconstructed from real-world image sets using a 3D reconstruction pipeline, and these meshes are then rendered back to the input viewpoints to produce color images and their corresponding ground-truth depth maps.
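To make the rendering step concrete, the sketch below renders a color image and its aligned depth map from a textured mesh at a known camera pose. It uses the trimesh and pyrender libraries as an assumption for illustration; the paper's own pipeline is not specified to use these tools, and the mesh path, resolution, and intrinsics here are placeholders.

```python
# Hedged sketch: render (color, depth) from a reconstructed textured mesh.
# Library choice (trimesh/pyrender) and all parameters are illustrative
# assumptions, not details taken from the paper's pipeline.
import numpy as np
import trimesh
import pyrender

def render_view(mesh_path, pose, fx, fy, cx, cy, width=768, height=576):
    """Render a color image and depth map for one camera viewpoint.

    pose: 4x4 camera-to-world matrix in pyrender's OpenGL convention
    (camera looks down -z). Depth is 0 where no surface is hit.
    """
    mesh = pyrender.Mesh.from_trimesh(trimesh.load(mesh_path, force="mesh"))
    scene = pyrender.Scene(ambient_light=np.ones(3))
    scene.add(mesh)
    camera = pyrender.IntrinsicsCamera(fx=fx, fy=fy, cx=cx, cy=cy)
    scene.add(camera, pose=pose)
    renderer = pyrender.OffscreenRenderer(width, height)
    color, depth = renderer.render(scene)
    renderer.delete()
    return color, depth
```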
The innovative aspect of BlendedMVS is its image blending technique. Because the rendered images are geometrically consistent with the ground-truth depth while the original input images carry realistic ambient lighting, the two are fused in the frequency domain: a high-pass filter extracts detailed visual cues from the rendered image, and a low-pass filter extracts ambient lighting from the input image. The resulting blended images retain realistic lighting effects, which contributes to the enhanced generalization of models trained on the dataset.
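A minimal sketch of this frequency-domain blending follows, assuming Gaussian low-/high-pass masks (the paper's exact filter design and cut-off are not reproduced here). The rendered image contributes the high-frequency content and the input photograph the low-frequency content.

```python
# Hedged sketch of frequency-domain image blending: the Gaussian mask
# and sigma value are assumptions standing in for the paper's filters.
import numpy as np

def blend_images(rendered, photo, sigma=10.0):
    """Blend per channel: F^-1[ H_hp * F(rendered) + H_lp * F(photo) ].

    rendered, photo: HxWx3 uint8 arrays of the same size.
    """
    h, w = rendered.shape[:2]
    # Gaussian low-pass mask laid out like the unshifted FFT (DC at [0, 0]).
    fy = np.fft.fftfreq(h)[:, None] * h
    fx = np.fft.fftfreq(w)[None, :] * w
    low_pass = np.exp(-(fx**2 + fy**2) / (2.0 * sigma**2))
    high_pass = 1.0 - low_pass
    blended = np.empty(rendered.shape, dtype=np.float64)
    for c in range(rendered.shape[2]):
        fr = np.fft.fft2(rendered[..., c])   # high frequencies: visual cues
        fp = np.fft.fft2(photo[..., c])      # low frequencies: ambient lighting
        blended[..., c] = np.fft.ifft2(high_pass * fr + low_pass * fp).real
    return np.clip(blended, 0, 255).astype(np.uint8)
```

A smaller sigma keeps more of the rendered image's structure; a larger sigma lets more of the photograph's lighting through, trading geometric consistency against photorealism.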
Significance in Multi-view Stereo Networks
Learning-based MVS methods, unlike classical methods, have shown promise in learning scene priors such as specularity and complex lighting, which are critical for reconstructing textureless or non-Lambertian surfaces. However, their reliance on datasets like DTU, which was captured with a fixed, structured camera trajectory under controlled laboratory conditions, limits their adaptability to unstructured and diverse scenes. BlendedMVS, with its diverse scenarios and unstructured camera trajectories, provides training data that better imitate real-world conditions, bolstering the generalization of trained models.
Experiments with state-of-the-art networks such as MVSNet, R-MVSNet, and Point-MVSNet demonstrate that models trained on BlendedMVS outperform those trained on alternative datasets, including DTU, ETH3D, and MegaDepth, as evidenced by improved performance on diverse validation sets.
Numerical Results and Evaluation
The quantitative evaluation, using metrics such as endpoint error (EPE), pixel-error rates at fixed thresholds, and f-scores on the Tanks and Temples benchmark, underscores the robustness of models trained on BlendedMVS. Compared with other training datasets, models trained on BlendedMVS exhibit lower depth-estimation errors and more accurate point cloud reconstructions, confirming the dataset's effectiveness at improving the flexibility of learning-based approaches.
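For reference, here is a hedged sketch of how such depth metrics are commonly computed. The valid-pixel masking and the >1/>3 thresholds measured in depth-interval units follow MVSNet-style evaluation conventions and are assumptions, not details taken verbatim from the paper.

```python
# Hedged sketch of depth-map evaluation metrics; thresholds and units
# are assumptions based on common MVSNet-style practice.
import numpy as np

def depth_metrics(pred, gt, interval, valid=None):
    """Return EPE and >1/>3 error rates over valid ground-truth pixels.

    pred, gt: HxW depth maps; interval: per-scene depth resolution used
    to normalize errors into comparable units across scenes.
    """
    if valid is None:
        valid = gt > 0  # pixels that have ground-truth depth
    err = np.abs(pred[valid] - gt[valid]) / interval
    return {
        "EPE": float(err.mean()),          # average endpoint error
        "e1": float((err > 1.0).mean()),   # fraction of pixels off by >1 unit
        "e3": float((err > 3.0).mean()),   # fraction of pixels off by >3 units
    }
```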
Implications and Future Directions
BlendedMVS's ability to enhance the generalization of MVS networks has significant implications for computer vision tasks beyond MVS, such as image feature detection, semantic segmentation, and 3D object reconstruction. Its integration of synthetically generated, yet highly realistic, training data represents a leap forward in dataset curation practices for vision-related machine learning tasks.
Future work could refine the texture and blending techniques to incorporate more detailed and dynamic environmental interactions. Furthermore, expanding BlendedMVS with auxiliary ground truth such as occlusion masks and surface normals could enrich training paradigms for next-generation MVS systems.
In summary, the introduction of BlendedMVS is a vital step toward advancing the general applicability and robustness of MVS networks, setting a precedent for future synthetic datasets in the evolving computer vision landscape.