
BlendedMVS: Photorealistic MVS Benchmark

Updated 1 June 2026
  • BlendedMVS is a comprehensive photorealistic MVS dataset offering over 17,000 high-resolution image-depth map pairs aligned through a detailed frequency-based blending pipeline.
  • It employs a three-stage construction pipeline integrating textured mesh generation, rendering, and frequency-based image fusion to produce precise images and depth maps at 1536×2048 resolution.
  • Empirical evaluations demonstrate that models trained on BlendedMVS improve generalization performance across benchmarks, yielding higher geometric accuracy and more robust 3D reconstructions.

BlendedMVS is a large-scale, photo-realistic dataset designed to address the generalization limitations of deep learning approaches for multi-view stereo (MVS). It unites ground-truth depth supervision, produced via photogrammetric mesh reconstruction and rendering, with the photometric complexity of real-world imagery using a frequency-based image fusion strategy. Comprising over 17,000 high-resolution, pixel-aligned image and depth-map pairs across 113 scenes, BlendedMVS enables robust learning and evaluation of generalized MVS networks across diverse real-world scales, scene types, and lighting conditions (Yao et al., 2019).

1. Construction Pipeline and Methodology

The BlendedMVS dataset is constructed through a three-stage pipeline:

  1. Textured Mesh Generation: For each of 113 selected scenes (106 for training, 7 for validation), between 20 and 1,000 high-resolution hand-held photographs are collected with unstructured camera trajectories. These images are processed by the Altizure-powered online SfM + MVS service, which performs feature extraction, incremental bundle adjustment, dense stereo fusion, and mesh texturing. The resulting artifacts are watertight, textured triangular mesh models with calibrated camera intrinsics and extrinsics, obviating the need for active scanning hardware.
  2. Rendering: Each mesh is rendered from the original camera viewpoints to produce a rendered image ($I_r$), containing pure surface albedo under nominal lighting, and an aligned ground-truth depth map ($D_r$) in camera pixel coordinates. Renderings are resampled and centrally cropped to 1536×2048 pixels, preserving geometric detail.
  3. Frequency-based Image Fusion:

Since $I_r$ lacks realistic, view-dependent illumination, BlendedMVS combines the rendered image $I_r$ and the original photograph $I_o$ using frequency separation:

  • Low-frequency components (ambient lighting) are extracted from $I_o$ by a Gaussian low-pass filter $L$ with kernel parameter $D_0 = 5000$.
  • High-frequency components (texture/shading) are retained from $I_r$ via the complementary high-pass filter $H = 1 - L$.

The blended training image $I_b$ is computed as:

$$I_b = \mathcal{F}^{-1}\big(\mathcal{F}(I_r) \cdot H\big) + \mathcal{F}^{-1}\big(\mathcal{F}(I_o) \cdot L\big)$$

where $\mathcal{F}$ and $\mathcal{F}^{-1}$ denote the FFT and inverse FFT, $L$ is the Gaussian low-pass filter, and $H = 1 - L$ is its high-pass complement. This produces images exhibiting realistic lighting while remaining pixel-aligned with their synthetic depth maps (Yao et al., 2019).
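The frequency-domain blend described above can be sketched in NumPy. This is a minimal illustration, not the authors' implementation: the Gaussian form $L(u, v) = \exp(-D^2 / (2 D_0^2))$, the frequency units of `d0`, and the helper names `gaussian_lowpass` and `blend` are assumptions made for this example, and a small `d0` is used so the effect is visible on a tiny array.

```python
import numpy as np

def gaussian_lowpass(shape, d0):
    """Centered 2D Gaussian low-pass filter (assumed textbook form)."""
    h, w = shape
    v = np.arange(h) - h // 2          # vertical frequency offsets
    u = np.arange(w) - w // 2          # horizontal frequency offsets
    dist2 = v[:, None] ** 2 + u[None, :] ** 2
    return np.exp(-dist2 / (2.0 * d0 ** 2))

def blend(rendered, original, d0):
    """High frequencies from `rendered`, low frequencies from `original`."""
    rendered = np.asarray(rendered, dtype=np.float64)
    original = np.asarray(original, dtype=np.float64)
    low = gaussian_lowpass(rendered.shape[:2], d0)
    high = 1.0 - low                   # complementary high-pass filter
    if rendered.ndim == 3:             # broadcast filters over color channels
        low, high = low[..., None], high[..., None]
    axes = (0, 1)
    f_r = np.fft.fftshift(np.fft.fft2(rendered, axes=axes), axes=axes)
    f_o = np.fft.fftshift(np.fft.fft2(original, axes=axes), axes=axes)
    # By linearity, one inverse FFT of the sum equals the sum of two inverse FFTs.
    mixed = np.fft.ifftshift(f_r * high + f_o * low, axes=axes)
    return np.real(np.fft.ifft2(mixed, axes=axes))

# Toy inputs standing in for a rendered image and its source photograph.
rng = np.random.default_rng(0)
rendered = rng.random((16, 20))
original = rng.random((16, 20))
blended = blend(rendered, original, d0=3.0)
```

Because the low-pass filter equals 1 at zero frequency, the mean intensity of `blended` comes entirely from `original`, which matches the idea that ambient lighting is inherited from the photograph. Blending an image with itself returns it unchanged, since $H + L = 1$.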

2. Composition and Scene Diversity

BlendedMVS encompasses:

  • Scenes: 113 total, stratified as 106 for training and 7 for validation (scene lists and trajectories are provided as supplemental files).
  • Images and Depth Maps: 17,818 blended image/depth map pairs.
  • Image Resolution: Uniform 1536×2048, after resampling and central cropping.
  • Scene Categories: Urban street-views and cityscapes, architectural facades (courtyards, ruins), sculptures/statues, and small objects such as jewelry and mechanical parts.
  • Depth Ranges: Per-view depth bounds $[d_{\min}, d_{\max}]$ (typically 1–100 m, depending on scene scale) are specified in accompanying metadata.
  • Camera Geometry: All camera baselines and view angles are unstructured, reflecting unconstrained exploratory photogrammetry.

This diversity ensures broad coverage of real-world phenomena and supports analysis of generalization across environments.

3. Data Organization, Formats, and Loading

The dataset’s hierarchical file structure provides explicit separation of modality and metadata:

| Folder/File | Contents | Format/Notes |
|---|---|---|
| /scene_xxxx/ | Data for one scene | |
| /images/ | Blended color inputs ($I_b$) | .jpg |
| /depths/ | Ground-truth depth maps ($D_r$) | .pfm (floating-point) |
| /cams/ | Camera intrinsics/extrinsics | cam_*.txt; 4×4 extrinsics, 3×3 intrinsics |
| depth_range.txt | Per-view $d_{\min}$, $d_{\max}$ | Two-column ASCII |
| scene_list.txt | Train/validation split enumeration | |
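Depth maps are stored as .pfm files, so a reader for that format is usually the first utility needed. The sketch below follows the general Portable FloatMap layout (a `Pf` or `PF` magic line, a `width height` line, a scale line whose sign encodes endianness, then float32 rows stored bottom to top). The function names `read_pfm` and `write_pfm` are illustrative and are not taken from the official tooling.

```python
import os
import tempfile
import numpy as np

def write_pfm(path, depth):
    """Write a single-channel float32 map as a little-endian PFM file."""
    depth = np.asarray(depth, dtype="<f4")
    height, width = depth.shape
    with open(path, "wb") as f:
        f.write(b"Pf\n")                                  # grayscale magic
        f.write(f"{width} {height}\n".encode("ascii"))
        f.write(b"-1.0\n")                                # negative scale: little-endian
        f.write(np.flipud(depth).tobytes())               # rows go bottom to top

def read_pfm(path):
    """Read a PFM file into a float32 array with the top row first."""
    with open(path, "rb") as f:
        magic = f.readline().strip()
        if magic not in (b"Pf", b"PF"):
            raise ValueError("not a PFM file")
        channels = 3 if magic == b"PF" else 1
        width, height = map(int, f.readline().split())
        scale = float(f.readline().strip())
        dtype = "<f4" if scale < 0 else ">f4"
        data = np.frombuffer(f.read(), dtype=dtype, count=width * height * channels)
    shape = (height, width, 3) if channels == 3 else (height, width)
    return np.flipud(data.reshape(shape)).astype(np.float32)

# Round-trip a toy depth map through a temporary file.
depth = np.arange(12, dtype=np.float32).reshape(3, 4)
with tempfile.TemporaryDirectory() as tmp:
    pfm_path = os.path.join(tmp, "example.pfm")
    write_pfm(pfm_path, depth)
    restored = read_pfm(pfm_path)
```

The vertical flip matters: forgetting it yields depth maps that are upside down relative to the .jpg images, which silently breaks the pixel alignment the dataset is built around.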

Official loaders (e.g., PyTorch-based in dataset.py) provide direct access to images, depths, intrinsics ($K$), extrinsics ($[R \mid t]$), and view-dependent depth bounds. The standard usage pattern follows the MVSNet-style Dataset API.
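To make the camera metadata concrete, the following sketch parses a camera text file and projects a 3D point into the image. It assumes an MVSNet-style layout (the keyword `extrinsic` followed by a 4×4 matrix, then `intrinsic` followed by a 3×3 matrix) and a world-to-camera extrinsic convention. Both are assumptions to verify against the actual files, and `parse_cam`, `project`, and `SAMPLE_CAM` are hypothetical names with made-up values.

```python
import numpy as np

# Hypothetical file contents in the assumed MVSNet-style layout.
SAMPLE_CAM = """extrinsic
1 0 0 0
0 1 0 0
0 0 1 2.5
0 0 0 1

intrinsic
1000 0 1024
0 1000 768
0 0 1
"""

def parse_cam(text):
    """Return (K, E): 3x3 intrinsics and 4x4 extrinsics from camera text."""
    tokens = text.split()
    if len(tokens) < 27 or tokens[0] != "extrinsic" or tokens[17] != "intrinsic":
        raise ValueError("unexpected camera file layout")
    extrinsic = np.array([float(t) for t in tokens[1:17]]).reshape(4, 4)
    intrinsic = np.array([float(t) for t in tokens[18:27]]).reshape(3, 3)
    # Any trailing tokens (for example depth parameters) are ignored here.
    return intrinsic, extrinsic

def project(K, E, point_world):
    """Project a world point to pixel coordinates; also return its depth."""
    p_cam = E[:3, :3] @ point_world + E[:3, 3]   # world -> camera (assumed)
    uvw = K @ p_cam
    return uvw[:2] / uvw[2], p_cam[2]

K, E = parse_cam(SAMPLE_CAM)
pixel, depth = project(K, E, np.zeros(3))
```

In this toy camera the world origin sits 2.5 units in front of the lens on the optical axis, so it projects onto the principal point and its depth is the value a ground-truth depth map would hold at that pixel.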

4. Benchmark Splits and Generalization Protocol

BlendedMVS is split for robust supervised and generalization benchmarks:

  • Training: 106 scenes (≈16,500 views)
  • Validation: 7 scenes (≈1,300 views), held out for hyperparameter search and early stopping
  • External Generalization: No overlap with Tanks & Temples, ETH3D, or DTU benchmarks, which serve as independent cross-dataset generalization targets

Validation metrics include mean L₁ endpoint error (EPE), and proportions of pixels with error exceeding 1 or 3 pixels. Point cloud reconstructions are evaluated using F₁-scores, precision, and recall on standard benchmarks (e.g., Tanks & Temples) (Yao et al., 2019).
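The depth-map metrics above can be computed along the following lines. This is a sketch under stated assumptions: pixels with non-positive ground truth are treated as invalid, and errors are divided by a `unit` argument so the thresholds of 1 and 3 can be expressed in whatever unit the evaluation uses. The function name `depth_metrics` and the output keys are illustrative.

```python
import numpy as np

def depth_metrics(pred, gt, unit=1.0, thresholds=(1.0, 3.0)):
    """Mean L1 error and fraction of valid pixels whose error exceeds each threshold."""
    pred = np.asarray(pred, dtype=np.float64)
    gt = np.asarray(gt, dtype=np.float64)
    valid = gt > 0                               # assumption: 0 marks missing depth
    err = np.abs(pred - gt)[valid] / unit
    metrics = {"epe": float(err.mean())}
    for t in thresholds:
        metrics[f"e{t:g}"] = float((err > t).mean())
    return metrics

# Toy example: one invalid pixel (gt == 0) and three valid ones.
gt = np.array([[1.0, 2.0], [0.0, 4.0]])
pred = np.array([[1.5, 4.5], [9.0, 8.0]])
metrics = depth_metrics(pred, gt)
```

Here the valid errors are 0.5, 2.5, and 4.0, so the mean error is 7/3, two of three pixels exceed the threshold of 1, and one of three exceeds the threshold of 3.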

5. Empirical Performance and Significance

Empirical assessments demonstrate that models trained on BlendedMVS consistently generalize better than those trained on prior datasets:

  • Depth-map Validation: Across DTU, ETH3D, and BlendedMVS validation, models trained with BlendedMVS data show lower mean L₁ EPE, and reduced percentages of high-error pixels. By contrast, training only on DTU or ETH3D induces overfitting to small indoor scenes, while MegaDepth’s greater image volume is offset by less reliable (SfM-derived) depth maps (Yao et al., 2019).
  • Point Cloud Reconstruction: Training R-MVSNet on BlendedMVS increases average F₁-score from 0.475 (DTU-trained) to 0.532 (+5.7 percentage points) on seven Tanks & Temples outdoor scenes. Both precision and recall improve, indicating better geometric accuracy and completeness.

This suggests that BlendedMVS’s blended photometric realism and mesh-derived geometry facilitate superior out-of-distribution MVS generalization.
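For readers unfamiliar with the point-cloud metrics, the sketch below shows the commonly used definition: precision is the fraction of reconstructed points within a distance threshold of the ground truth, recall is the fraction of ground-truth points within that threshold of the reconstruction, and the F-score is their harmonic mean. The benchmark's full protocol (alignment, cropping, resampling) is not reproduced, and the brute-force distance computation and the name `f_score` are choices made for this example only.

```python
import numpy as np

def f_score(recon, gt, tau):
    """Return (precision, recall, F) for two point sets and threshold tau."""
    recon = np.asarray(recon, dtype=np.float64)
    gt = np.asarray(gt, dtype=np.float64)
    # Pairwise distances; fine for small clouds, use a KD-tree for large ones.
    dist = np.linalg.norm(recon[:, None, :] - gt[None, :, :], axis=2)
    precision = float((dist.min(axis=1) < tau).mean())   # recon -> gt
    recall = float((dist.min(axis=0) < tau).mean())      # gt -> recon
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)

# Toy clouds: two reconstructed points are accurate, one is an outlier,
# and two ground-truth points are left uncovered.
gt_points = np.array([[0, 0, 0], [1, 0, 0], [2, 0, 0], [3, 0, 0]], dtype=float)
recon_points = np.array([[0, 0, 0.01], [1, 0, 0.01], [5, 0, 0]], dtype=float)
precision, recall, f1 = f_score(recon_points, gt_points, tau=0.1)
```

In this toy case precision is 2/3 and recall is 1/2, giving an F-score of 4/7. A gain in both terms, as reported above, therefore reflects fewer outlier points and fewer missing surfaces at the same time.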

6. Access, Integration, and Usage Recommendations

BlendedMVS is publicly available at https://github.com/YoYo000/BlendedMVS, accompanied by scripts (download_blendedmvs.sh) for bulk download and preprocessing (e.g., cropping, PyTorch .pt file generation). The dataset is directly compatible with Python/PyTorch MVS pipelines.

Best practices include applying random brightness (±50), contrast (×[0.3–1.5]), and motion blur (kernel size 1 or 3) augmentations to blended images during training to enhance model robustness. Researchers are advised to observe the provided depth_range.txt for per-view depth bounds and can extend the dataset by following the pipeline: SfM → mesh → render → blend, using any compatible mesh reconstruction and rendering tools. Additional resources such as precomputed occlusion masks and surface normals facilitate advanced visibility-aware loss design.
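The photometric augmentations listed above could be applied as in the sketch below. Several details are assumptions, since the text gives only the parameter ranges: brightness is taken as an additive offset on a 0–255 scale, contrast as scaling around the image mean, and motion blur as a horizontal box filter. The function name `augment` is illustrative. Only the image is modified, so the depth map and camera parameters stay valid.

```python
import numpy as np

def augment(image, rng):
    """Random brightness, contrast, and horizontal motion blur on a uint8 image."""
    img = np.asarray(image, dtype=np.float64)
    img = img + rng.uniform(-50.0, 50.0)             # brightness offset in [-50, 50]
    factor = rng.uniform(0.3, 1.5)                   # contrast factor in [0.3, 1.5]
    mean = img.mean()
    img = (img - mean) * factor + mean               # scale around the mean (assumed)
    k = int(rng.choice([1, 3]))                      # blur kernel size 1 or 3
    if k > 1:
        pad = k // 2
        pad_width = ((0, 0), (pad, pad)) + ((0, 0),) * (img.ndim - 2)
        padded = np.pad(img, pad_width, mode="edge")
        width = img.shape[1]
        img = sum(padded[:, i:i + width] for i in range(k)) / k
    return np.clip(img, 0, 255).astype(np.uint8)

# Toy example: a flat gray image stays flat, only its level shifts.
rng = np.random.default_rng(0)
image = np.full((4, 6, 3), 128, dtype=np.uint8)
augmented = augment(image, rng)
```

Geometric augmentations such as flips or crops are deliberately left out of this sketch, because they would also require updating the depth map and camera intrinsics.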

BlendedMVS constitutes a scalable framework for creating photorealistic, depth-supervised MVS benchmarks. Its methodology and public availability support continued advancements in generalizable, learning-based 3D reconstruction (Yao et al., 2019).
