Generalized Binary Search Network for Highly-Efficient Multi-View Stereo (2112.02338v1)

Published 4 Dec 2021 in cs.CV

Abstract: Multi-view Stereo (MVS) with known camera parameters is essentially a 1D search problem within a valid depth range. Recent deep learning-based MVS methods typically densely sample depth hypotheses in the depth range, and then construct prohibitively memory-consuming 3D cost volumes for depth prediction. Although coarse-to-fine sampling strategies alleviate this overhead issue to a certain extent, the efficiency of MVS is still an open challenge. In this work, we propose a novel method for highly efficient MVS that remarkably decreases the memory footprint, meanwhile clearly advancing state-of-the-art depth prediction performance. We investigate what a search strategy can be reasonably optimal for MVS taking into account of both efficiency and effectiveness. We first formulate MVS as a binary search problem, and accordingly propose a generalized binary search network for MVS. Specifically, in each step, the depth range is split into 2 bins with extra 1 error tolerance bin on both sides. A classification is performed to identify which bin contains the true depth. We also design three mechanisms to respectively handle classification errors, deal with out-of-range samples and decrease the training memory. The new formulation makes our method only sample a very small number of depth hypotheses in each step, which is highly memory efficient, and also greatly facilitates quick training convergence. Experiments on competitive benchmarks show that our method achieves state-of-the-art accuracy with much less memory. Particularly, our method obtains an overall score of 0.289 on DTU dataset and tops the first place on challenging Tanks and Temples advanced dataset among all the learning-based methods. The trained models and code will be released at https://github.com/MiZhenxing/GBi-Net.

Summary

The paper introduces a binary search framework that reduces the 3D cost volume size and memory usage while maintaining accurate depth estimation.
It implements error tolerance bins and gradient masking in the GBi-Net to effectively handle discrete classification errors during training.
Extensive experiments validate that GBi-Net achieves state-of-the-art results, cutting memory consumption by about 48% on benchmarks like DTU.

This paper addresses memory efficiency in multi-view stereo (MVS) depth estimation and proposes the Generalized Binary Search Network (GBi-Net). MVS reconstructs the 3D geometry of a scene from multiple overlapping images; with known camera parameters, it reduces to a one-dimensional search over a valid depth range. Recent deep learning-based MVS methods typically sample depth hypotheses densely over that range and build a 3D cost volume from them. Dense sampling helps depth prediction accuracy, but the cost volume grows with the number of hypotheses, so memory consumption becomes substantial, particularly for large-scale scenes.

Coarse-to-fine depth hypothesis sampling has been used to reduce this overhead. It still demands significant memory, however, and does not necessarily give the best trade-off between memory efficiency and depth estimation accuracy. The paper's contribution is a more memory-efficient search strategy: it formulates MVS as a binary search problem and then extends that formulation to a generalized binary search.
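To make the memory argument concrete, the following back-of-the-envelope sketch counts the entries of a single cost volume of shape C x D x H x W for a dense sampler and for a four-bin sampler. The channel count, resolution, and the dense hypothesis count of 128 are illustrative assumptions, not figures from the paper, and the sketch ignores the rest of the network, so it does not reproduce the reported end-to-end savings.

```python
def cost_volume_elements(channels: int, num_hypotheses: int,
                         height: int, width: int) -> int:
    """Number of scalar entries in a C x D x H x W cost volume."""
    return channels * num_hypotheses * height * width


def megabytes(num_elements: int, bytes_per_element: int = 4) -> float:
    """Size in MiB, assuming 32-bit floats by default."""
    return num_elements * bytes_per_element / (1024 ** 2)


# Hypothetical feature-map size and channel count (assumptions for illustration).
C, H, W = 8, 128, 160

dense = cost_volume_elements(C, 128, H, W)  # assumed dense sampling: 128 hypotheses
binned = cost_volume_elements(C, 4, H, W)   # 2 bins + 1 tolerance bin per side

print(f"dense : {megabytes(dense):.1f} MiB")
print(f"binned: {megabytes(binned):.1f} MiB")
print(f"ratio : {dense // binned}x")
```

The volume size is linear in the number of hypotheses D, so shrinking D per stage is the most direct lever on cost-volume memory.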

Key Contributions

Binary Search Formulation: The authors recast MVS depth estimation as a binary search. At each stage, the current depth range is bisected into two equal bins, and one error tolerance bin is added on each side, giving four bins in total. A classifier then selects the bin that contains the true depth. Because only a few depth hypotheses are sampled per stage, the 3D cost volume is much smaller, which lowers memory usage without compromising depth estimation accuracy.
Generalized Binary Search Network (GBi-Net): Beyond plain binary search, GBi-Net adds three mechanisms: error tolerance bins to recover from the classification errors inherent in discrete bin-based approaches, a gradient masking strategy that propagates gradients only for valid pixels (which handles out-of-range samples), and an efficient gradient updating scheme that reduces memory during training.
Substantial Experimental Validation: Extensive experiments show state-of-the-art performance with significantly lower memory consumption on the DTU dataset and the Tanks and Temples dataset. For instance, on DTU, GBi-Net reduces memory consumption by approximately 48% compared to previous best-performing methods while improving depth prediction accuracy.
Implications and Future Directions: This work presents both practical and theoretical implications. Practically, it offers a pathway to efficient large-scale 3D reconstructions which are particularly crucial in resource-constrained environments. Theoretically, the work introduces an efficient search methodology in the domain of deep learning-based geometric perception, paving the way for further exploration into search-based approaches for other computer vision applications.
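The binary search formulation with tolerance bins can be sketched for a single pixel as follows. This is a simplified illustration under stated assumptions, not the authors' implementation: the learned per-pixel classifier is replaced by a hypothetical oracle that knows the true depth, the first stage is treated like every other stage, and all function names are invented for this example.

```python
from typing import Callable, List, Tuple

Bin = Tuple[float, float]
Classifier = Callable[[List[Bin]], int]


def make_bins(lo: float, hi: float) -> List[Bin]:
    """Split [lo, hi] into 2 equal bins and add 1 tolerance bin on each side."""
    width = (hi - lo) / 2.0
    edges = [lo - width + i * width for i in range(5)]
    return [(edges[i], edges[i + 1]) for i in range(4)]


def oracle(true_depth: float) -> Classifier:
    """Stand-in for the network (assumption): picks the bin holding true_depth."""
    def classify(bins: List[Bin]) -> int:
        for i, (start, end) in enumerate(bins):
            if start <= true_depth < end:
                return i
        # Out of range: fall back to the nearest outer bin.
        return 0 if true_depth < bins[0][0] else len(bins) - 1
    return classify


def wrong_on_first_call(classify: Classifier) -> Classifier:
    """Simulate one classification error: shift the first answer down one bin."""
    calls = {"n": 0}

    def wrapped(bins: List[Bin]) -> int:
        index = classify(bins)
        calls["n"] += 1
        return max(index - 1, 0) if calls["n"] == 1 else index
    return wrapped


def generalized_binary_search(lo: float, hi: float,
                              classify: Classifier, num_stages: int) -> float:
    """Narrow the range to the selected bin at each stage; return its center."""
    for _ in range(num_stages):
        bins = make_bins(lo, hi)
        lo, hi = bins[classify(bins)]
    return (lo + hi) / 2.0


clean = generalized_binary_search(0.0, 8.0, oracle(5.3), num_stages=6)
recovered = generalized_binary_search(
    0.0, 8.0, wrong_on_first_call(oracle(4.6)), num_stages=6)
print(clean, recovered)
```

In the second call the first stage wrongly selects the bin [0, 4) although the true depth is 4.6. A plain binary search would stay inside that bin, whereas here the upper tolerance bin [4, 6) of the next stage brings the search back. In this sketch, recovery works only when the true depth lies within one bin width of the wrongly selected range.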

Practical Considerations

The reported results matter most where high-resolution images must be processed with limited computational resources. For fields that rely on fast and accurate 3D modeling, such as autonomous vehicles, immersive gaming, and augmented reality, GBi-Net offers a way to improve reconstruction quality while reducing memory cost.

Overall, GBi-Net moves learning-based 3D scene understanding toward a more memory-efficient design and gives later work an architecture and a set of principles to build on. Future work may explore hybrid search strategies or integrate the model into real-time systems, which would broaden its applicability to real-world computer vision scenarios.

GitHub

GitHub - MiZhenxing/GBi-Net: Codes for GBi-Net (CVPR2022) (125 stars)