- The paper presents a hierarchical cost volume pyramid that incrementally refines depth estimates from coarse to fine resolution.
- It achieves 6x faster runtime and significant memory savings compared to state-of-the-art methods like Point-MVSNet.
- Experimental results on DTU and Tanks and Temples benchmarks demonstrate competitive accuracy and robust generalization in 3D reconstruction.
Cost Volume Pyramid Based Depth Inference for Multi-View Stereo
The paper introduces CVP-MVSNet, a cost volume pyramid-based network for depth inference in multi-view stereo (MVS). The approach addresses the memory and runtime cost of building a single cost volume at a fixed, high resolution: instead, it constructs a pyramid of small cost volumes in a coarse-to-fine manner, enabling compact and efficient inference of high-resolution depth maps.
Key Contributions and Methodology
The authors propose a hierarchical method to build cost volumes across varying resolutions to infer depth information from multi-view images. The core contributions are:
- Cost Volume Pyramid: The network first infers a depth map from a cost volume built at the coarsest image resolution. At each finer level, the current depth estimate is upsampled and used to construct a new, narrower cost volume that refines the depth map incrementally. Because each cost volume in the pyramid is small, the network remains lightweight.
- Efficient Network Architecture: The compact structure enabled by the cost volume pyramid yields roughly a 6x speedup over comparable state-of-the-art systems such as Point-MVSNet. Memory efficiency improves substantially as well, allowing the network to handle high-resolution images with far less computational demand.
- Depth Residual Estimation: Rather than refining depth by processing point clouds as Point-MVSNet does, the method iteratively predicts depth residuals. Because the cost volumes are defined on a regular image-plane grid, refinement can use standard multi-scale 3D convolutions, which are more efficient than convolutions over irregular 3D point sets.
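To make the cost volume construction concrete, here is a minimal NumPy sketch of the variance-based matching cost commonly used in learned MVS, which the pyramid builds at each level. The function name, tensor shapes, and the assumption that source-view features have already been warped onto the reference view's depth hypotheses are illustrative choices, not the authors' actual code:

```python
import numpy as np

def variance_cost_volume(ref_feat, warped_src_feats):
    """Variance-over-views matching cost (illustrative sketch).

    ref_feat:         reference feature map, shape (C, H, W)
    warped_src_feats: list of source feature maps already warped to the
                      reference view's D depth hypotheses, each (C, D, H, W)
    returns:          cost volume of shape (C, D, H, W); low variance across
                      views means the features agree at that depth hypothesis
    """
    C, H, W = ref_feat.shape
    D = warped_src_feats[0].shape[1]
    # Replicate the reference features across all D depth hypotheses.
    ref = np.broadcast_to(ref_feat[:, None], (C, D, H, W))
    # Stack all views: shape (num_views, C, D, H, W).
    feats = np.stack([ref] + list(warped_src_feats), axis=0)
    # Per-element variance across views is the matching cost.
    return feats.var(axis=0)
```

At the coarsest level the depth hypotheses span the full scene range; at finer levels they only cover a narrow band around the upsampled depth estimate, which is what keeps each volume in the pyramid small.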
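The per-level refinement step, regressing a depth residual around the current estimate, can be sketched as follows. This is a hedged illustration: the function name `refine_depth`, the hypothesis spacing `interval`, and the assumption that `prob_volume` is already a softmax over residual hypotheses are mine, not taken from the paper:

```python
import numpy as np

def refine_depth(cur_depth, prob_volume, interval):
    """Soft-argmax regression of a depth residual (illustrative sketch).

    cur_depth:   (H, W) depth map upsampled from the coarser pyramid level
    prob_volume: (D, H, W) per-pixel probabilities over D residual hypotheses
                 (assumed to sum to 1 along the D axis)
    interval:    spacing between adjacent depth hypotheses at this level
    """
    D = prob_volume.shape[0]
    # Residual hypotheses centered on the current estimate,
    # e.g. D=5 gives offsets [-2, -1, 0, 1, 2] * interval.
    offsets = (np.arange(D) - (D - 1) / 2) * interval
    # Expected residual under the probability volume (soft argmax).
    residual = (prob_volume * offsets[:, None, None]).sum(axis=0)
    return cur_depth + residual
```

A uniform probability volume leaves the depth unchanged, while a peaked one shifts each pixel by up to half the hypothesis range; shrinking `interval` at each finer level is what makes the refinement progressively more precise.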
Experimental Results
The authors perform extensive evaluations on the DTU dataset and Tanks and Temples benchmark. Notably, their method achieves competitive performance:
- On the DTU dataset, the approach achieves an overall reconstruction score of 0.351 mm, outperforming leading methods such as Point-MVSNet in both accuracy and completeness.
- On Tanks and Temples, the DTU-trained model generalizes robustly, yielding one of the best F-scores among the tested methods.
- The method runs over six times faster while maintaining comparable or superior accuracy, and requires significantly less GPU memory.
Implications and Future Work
Practically, this efficient depth inference approach could benefit applications that require real-time 3D scene reconstruction, such as augmented reality, robotics, and autonomous driving. Theoretically, the cost volume pyramid offers a new way to handle scale variation and computational load, and may inspire further research into multi-scale processing within deep learning frameworks.
The integration of this technique into broader vision systems, such as structure-from-motion pipelines, stands as a promising area for future development. This could further streamline the process of 3D reconstruction while extending the efficiency gains demonstrated in the paper.
Overall, CVP-MVSNet presents a promising direction for improving both the efficiency and scalability of MVS systems, bridging a key gap in handling high-resolution imagery within dense depth inference frameworks.