- The paper presents R-MVSNet, a recurrent framework that overcomes memory inefficiency in depth inference with a GRU-based sequential regularization of cost volumes.
- It replaces conventional 3D CNNs with a convolutional GRU, reducing memory demands from cubic to quadratic with respect to model resolution and enabling high-resolution scenes.
- Benchmark results on DTU, Tanks and Temples, and ETH3D demonstrate the model's state-of-the-art performance and robust handling of wide depth ranges.
Recurrent MVSNet for High-resolution Multi-view Stereo Depth Inference
The paper "Recurrent MVSNet for High-resolution Multi-view Stereo Depth Inference" presents a novel methodology for scalable high-resolution multi-view stereo (MVS) reconstruction via a recurrent neural network framework. The authors address a significant limitation in current learning-based MVS methods: memory inefficiency associated with cost volume regularization. This inefficiency constrains the applicability of these methods to high-resolution scenes.
Core Contributions
The central innovation in this work is the introduction of the Recurrent Multi-view Stereo Network (R-MVSNet), which utilizes a recurrent neural network-based approach for depth inference. By sequentially processing 2D cost maps along the depth axis using a convolutional Gated Recurrent Unit (GRU), R-MVSNet effectively reduces memory consumption. This allows for the handling of large, high-resolution datasets that are prohibitively expensive for traditional 3D convolutional networks.
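To make the sequential idea concrete, here is a minimal PyTorch sketch of depth-wise cost regularization with a convolutional GRU. The cell structure, channel counts, and function names are illustrative assumptions, not the paper's exact architecture (the paper stacks several GRU layers of decreasing channel width):

```python
# Minimal sketch: depth-wise sequential cost regularization with a
# convolutional GRU (illustrative layer sizes, not the paper's exact ones).
import torch
import torch.nn as nn

class ConvGRUCell(nn.Module):
    """GRU cell whose gates are 3x3 convolutions over 2D cost maps."""
    def __init__(self, in_ch, hid_ch):
        super().__init__()
        self.gates = nn.Conv2d(in_ch + hid_ch, 2 * hid_ch, 3, padding=1)
        self.cand = nn.Conv2d(in_ch + hid_ch, hid_ch, 3, padding=1)

    def forward(self, x, h):
        z, r = torch.sigmoid(self.gates(torch.cat([x, h], dim=1))).chunk(2, dim=1)
        h_tilde = torch.tanh(self.cand(torch.cat([x, r * h], dim=1)))
        return (1 - z) * h + z * h_tilde

def regularize_cost_volume(cost_maps, cell, hid_ch):
    """Process D cost maps of shape (B, C, H, W), one depth plane at a time.

    Only the current 2D map and the GRU state are resident in memory,
    which is where the cubic-to-quadratic memory saving comes from.
    """
    B, _, H, W = cost_maps[0].shape
    h = cost_maps[0].new_zeros(B, hid_ch, H, W)
    regularized = []
    for c in cost_maps:  # sweep the depth axis sequentially
        h = cell(c, h)
        regularized.append(h)
    return regularized

# Usage sketch: 256 depth planes, each a (B, 32, H, W) matching-cost map
# cell = ConvGRUCell(in_ch=32, hid_ch=16)
# out = regularize_cost_volume(cost_maps, cell, hid_ch=16)
```

Because the recurrence carries context along the depth axis, the output at each plane is still informed by neighboring planes, approximating the smoothing effect of a 3D convolution without materializing the full volume.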
Methodological Overview
The R-MVSNet builds on the foundational architecture of MVSNet but modifies the regularization strategy:
- Cost Volume Construction: As in existing techniques, the network constructs a cost volume by warping deep image features into the reference camera's frustum (a variance-cost sketch follows this list).
- Recurrent Cost Volume Regularization: Instead of memory-intensive 3D CNNs, the method uses stacked GRU layers to regularize the cost maps in a depth-wise sequential manner, as sketched above, reducing runtime memory from cubic to quadratic growth with respect to model resolution.
- Training and Inference: The network is trained end-to-end with a cross-entropy loss, treating depth inference as per-pixel classification over the depth hypotheses. Inverse depth sampling lets the network cover wide depth ranges with a fixed number of hypotheses (see the sampling sketch after this list).
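For the cost volume construction step, the sketch below computes an MVSNet-style variance cost over feature maps at one depth hypothesis. The homography warping of source features onto the hypothesized plane is assumed to have happened upstream, and the function and argument names are illustrative:

```python
import torch

def variance_cost_map(ref_feat, warped_feats):
    """Variance-based matching cost over N views at one depth plane.

    ref_feat:     (B, C, H, W) reference-view feature map
    warped_feats: list of (B, C, H, W) source features warped to this plane
    Returns a (B, C, H, W) cost map; low variance means the features agree,
    i.e. the hypothesized depth is likely correct at that pixel.
    """
    feats = torch.stack([ref_feat] + warped_feats, dim=0)  # (N, B, C, H, W)
    return feats.var(dim=0, unbiased=False)                # variance across views
```

The variance metric is symmetric in the number of views, so the same network handles an arbitrary number of input images.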
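And for the training step, a small sketch of inverse depth sampling together with the per-pixel classification loss. The numeric depth range and tensor shapes are hypothetical examples, not values from the paper:

```python
import numpy as np
import torch
import torch.nn.functional as F

def inverse_depth_hypotheses(d_min, d_max, num_planes):
    """Sample depth planes uniformly in inverse depth.

    Near planes are spaced densely and far planes sparsely, so a fixed
    plane budget can cover a wide depth range.
    """
    inv = np.linspace(1.0 / d_max, 1.0 / d_min, num_planes)
    return 1.0 / inv[::-1]  # ascending depth order

planes = inverse_depth_hypotheses(0.5, 100.0, 256)  # hypothetical 0.5 m..100 m range

# Depth inference as per-pixel classification over the D hypotheses:
# logits: (B, D, H, W) regularized cost volume, gt: (B, H, W) plane indices
logits = torch.randn(2, 256, 32, 40)
gt = torch.randint(0, 256, (2, 32, 40))
loss = F.cross_entropy(logits, gt)
```

At inference time, the argmax (or a sub-pixel refinement around it) over the depth axis yields the final per-pixel depth estimate.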
Results and Benchmarking
R-MVSNet was evaluated on the standard DTU, Tanks and Temples, and ETH3D benchmarks, achieving performance on par with or surpassing state-of-the-art methods. Specifically:
- DTU Dataset: Demonstrated superior reconstruction completeness and overall performance metrics compared to previous methodologies.
- Tanks and Temples: Successfully handled complex, large-scale scenes, ranking highly on both intermediate and advanced sets.
- ETH3D: Achieved competitive performance without requiring additional fine-tuning.
Practical and Theoretical Implications
Practically, R-MVSNet removes the memory bottleneck that has limited learned MVS, allowing it to be applied to scenes with wide depth ranges and higher resolutions. This efficiency is directly relevant to fields requiring detailed 3D reconstruction, such as autonomous driving, augmented reality, and robotics.
Theoretically, this approach highlights the potential for sequential processing methods within deep learning paradigms to optimize computational resources, suggesting a shift toward exploring recurrent structures for spatial data processing.
Future Directions
The paper points to several directions for further exploration:
- Optimization and Scalability: Future work could support larger input image resolutions, which are currently limited by GPU memory.
- Application Diversity: Training on more diverse datasets could improve generalizability, which would be critical for varied real-world applications.
In conclusion, R-MVSNet represents a substantial advancement in high-resolution MVS reconstructions through its innovative use of recurrent structures, offering both an efficient and effective alternative to traditional deep learning approaches in this space.