Practical Stereo Matching via Cascaded Recurrent Network with Adaptive Correlation (2203.11483v1)

Published 22 Mar 2022 in cs.CV

Abstract: With the advent of convolutional neural networks, stereo matching algorithms have recently gained tremendous progress. However, it remains a great challenge to accurately extract disparities from real-world image pairs taken by consumer-level devices like smartphones, due to practical complicating factors such as thin structures, non-ideal rectification, camera module inconsistencies and various hard-case scenes. In this paper, we propose a set of innovative designs to tackle the problem of practical stereo matching: 1) to better recover fine depth details, we design a hierarchical network with recurrent refinement to update disparities in a coarse-to-fine manner, as well as a stacked cascaded architecture for inference; 2) we propose an adaptive group correlation layer to mitigate the impact of erroneous rectification; 3) we introduce a new synthetic dataset with special attention to difficult cases for better generalizing to real-world scenes. Our results not only rank 1st on both Middlebury and ETH3D benchmarks, outperforming existing state-of-the-art methods by a notable margin, but also exhibit high-quality details for real-life photos, which clearly demonstrates the efficacy of our contributions.

Summary

The paper introduces a cascaded recurrent network, CREStereo, that hierarchically refines stereo disparity for fine-detail recovery.
It introduces an adaptive group correlation layer that matches features group-wise within a local, deformable search window to mitigate distortions from imperfect rectification.
Comprehensive evaluations on multiple benchmarks demonstrate superior accuracy and robustness in challenging real-world conditions.

Practical Stereo Matching via Cascaded Recurrent Network with Adaptive Correlation

Stereo matching, the task of estimating per-pixel disparity from a pair of images, is critical for applications such as autonomous driving and augmented reality. This paper proposes a cascaded recurrent network, named CREStereo, for practical stereo matching. Its design targets three challenges: recovering fine details, tolerating non-ideal rectification, and generalizing from training data to real-world scenes.
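For context, disparity matters because it converts directly to depth through the standard pinhole stereo relation, depth = focal length × baseline / disparity. The sketch below illustrates that relation only; the function name and the example focal length and baseline are assumptions chosen for illustration, not values from the paper.

```python
import numpy as np

def disparity_to_depth(disparity_px, focal_px, baseline_m):
    """Convert disparity (pixels) to depth (metres) for a rectified pair.

    Assumes an ideal pinhole stereo rig: depth = focal_px * baseline_m / d.
    Pixels with non-positive disparity are mapped to infinity.
    """
    disparity_px = np.asarray(disparity_px, dtype=np.float64)
    depth = np.full(disparity_px.shape, np.inf)
    positive = disparity_px > 0
    depth[positive] = focal_px * baseline_m / disparity_px[positive]
    return depth

# Hypothetical rig: 1000 px focal length, 10 cm baseline.
depth = disparity_to_depth([50.0, 0.0], focal_px=1000.0, baseline_m=0.1)
```

Because depth is inversely proportional to disparity, small disparity errors on distant or thin objects translate into large depth errors, which is one reason fine-detail accuracy is emphasized.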

Contribution Summary

Hierarchical Refinement Architecture: The paper introduces a hierarchical network that recurrently refines disparity estimates in a coarse-to-fine manner. This helps recover fine details, which matter most in real-world images that are high-resolution or contain thin structures.
Adaptive Group Correlation Layer: An adaptive correlation mechanism mitigates the geometric distortions caused by imperfect rectification of the stereo pair. It computes group-wise correlation within a local search window whose sampling positions are deformable, in the spirit of deformable convolutions, which improves matching accuracy when the two camera modules are inconsistent.
Comprehensive Synthetic Dataset: The authors build a new synthetic training dataset that pays special attention to hard cases, including challenging lighting conditions and varied textures. Training on these targeted cases helps the model generalize better to real-world data.
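To make the idea of local, group-wise correlation more concrete, here is a minimal NumPy sketch. It is not the paper's layer: it uses a fixed horizontal search window instead of learned deformable offsets, and the function name, the `groups` and `radius` parameters, and the `(C, H, W)` feature layout are all assumptions made for illustration.

```python
import numpy as np

def local_group_correlation(feat_left, feat_right, groups=4, radius=2):
    """Group-wise correlation over a fixed 1D horizontal window.

    feat_left, feat_right: arrays of shape (C, H, W), C divisible by groups.
    Returns an array of shape ((2 * radius + 1) * groups, H, W), ordered by
    horizontal offset first and by channel group second.
    """
    c, h, w = feat_left.shape
    assert c % groups == 0, "channels must split evenly into groups"
    assert radius < w, "search radius must be smaller than the image width"
    gl = feat_left.reshape(groups, c // groups, h, w)
    gr = feat_right.reshape(groups, c // groups, h, w)
    out = []
    for dx in range(-radius, radius + 1):
        # shifted[..., x] = gr[..., x + dx], zero-padded outside the image.
        shifted = np.zeros_like(gr)
        if dx > 0:
            shifted[..., : w - dx] = gr[..., dx:]
        elif dx < 0:
            shifted[..., -dx:] = gr[..., : w + dx]
        else:
            shifted[...] = gr
        # Mean dot product over the channels inside each group.
        out.append((gl * shifted).mean(axis=1))
    return np.concatenate(out, axis=0)

rng = np.random.default_rng(0)
feats = rng.standard_normal((8, 5, 6))
corr = local_group_correlation(feats, feats, groups=4, radius=2)
```

A deformable variant would add a learned, per-pixel offset to each `dx` (and to a vertical `dy`) before sampling, so the window can follow rows that are slightly misaligned by imperfect rectification.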

Experimental Outcomes

According to the abstract, CREStereo ranks first on both the Middlebury and ETH3D benchmarks, outperforming existing state-of-the-art methods by a notable margin on metrics such as average error and the percentage of pixels whose disparity error exceeds a threshold. The model also achieves competitive results on the KITTI 2012 and KITTI 2015 benchmarks.
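The two kinds of metric mentioned above, average error and accuracy at a threshold, can be computed as in the sketch below. The function name, the dictionary keys, and the default threshold of 2.0 pixels are illustrative assumptions; each benchmark defines its own thresholds and validity masks.

```python
import numpy as np

def disparity_metrics(pred, gt, valid=None, threshold=2.0):
    """Average absolute error and bad-pixel percentage over valid pixels.

    pred, gt: disparity maps of the same shape, in pixels.
    valid: optional boolean mask; defaults to pixels with finite ground truth.
    """
    pred = np.asarray(pred, dtype=np.float64)
    gt = np.asarray(gt, dtype=np.float64)
    if valid is None:
        valid = np.isfinite(gt)
    err = np.abs(pred - gt)[valid]
    return {
        "avg_err": float(err.mean()),
        "bad_pct": float((err > threshold).mean() * 100.0),
    }

# One pixel has no ground truth (inf) and is excluded from both metrics.
metrics = disparity_metrics(
    pred=[[10.0, 20.0], [30.0, 40.0]],
    gt=[[10.5, 23.0], [30.0, np.inf]],
)
```

Here the three valid pixels have errors of 0.5, 3.0 and 0.0 pixels, so one of the three counts as "bad" at the 2.0-pixel threshold.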

Technical Insights

The paper explores:

Recurrent Update Mechanisms: Recurrent units refine the disparity prediction iteratively, with each iteration improving on the previous estimate.
Alternative Correlation Approaches: The paper evaluates different strategies for computing correlation, comparing local search against all-pairs correlation at different levels of the cascade.
Inference Pipeline Innovations: A stacked cascade architecture takes multi-resolution inputs at inference time, improving predictions for high-resolution images without retraining the model.
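The control flow of coarse-to-fine recurrent refinement can be sketched as follows. This is a toy illustration, not the paper's network: `update_fn` is a hypothetical stand-in for the learned recurrent update module, `toy_update` is an invented residual rule, and nearest-neighbour upsampling is an assumption made for simplicity.

```python
import numpy as np

def upsample_disparity(disp, factor=2):
    """Nearest-neighbour upsampling of a disparity map.

    Values are multiplied by `factor` because a disparity measured in
    pixels grows in proportion to the image width.
    """
    up = np.repeat(np.repeat(disp, factor, axis=0), factor, axis=1)
    return up * factor

def coarse_to_fine(level_shapes, update_fn, iters=3):
    """Refine a disparity map from the coarsest level to the finest.

    level_shapes: (H, W) per level, each level twice the size of the last.
    update_fn(disp, level): returns a residual to add to the estimate.
    """
    disp = np.zeros(level_shapes[0], dtype=np.float32)
    for level, shape in enumerate(level_shapes):
        if level > 0:
            # Initialise the finer level from the coarser prediction.
            disp = upsample_disparity(disp)
        assert disp.shape == tuple(shape)
        for _ in range(iters):
            disp = disp + update_fn(disp, level)
    return disp

def toy_update(disp, level):
    # Hypothetical update: move halfway towards a known constant target
    # whose value doubles at every finer level.
    target = 4.0 * 2 ** level
    return 0.5 * (target - disp)

shapes = [(2, 3), (4, 6), (8, 12)]
final = coarse_to_fine(shapes, toy_update)
```

Because each level starts from the upsampled result of the previous one, the finer levels only need to correct small residuals, which is the intuition behind running most of the refinement cheaply at low resolution.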

Implications and Future Directions

The practical implications are significant: higher-fidelity stereo matching on consumer cameras makes it more feasible to deploy these algorithms directly in smartphone applications or in real-time settings such as VR/AR devices. However, the model's computational demands pose a challenge for mobile deployment, which leaves room for further work on making such models efficient without compromising accuracy.

Evaluations under disturbances such as blur, chromatic variations, and alignment errors underscore the robustness of CREStereo in real-world conditions. Future work could focus on reducing latency and power consumption while maintaining precision, which would broaden applicability in mobile and edge-computing scenarios.

In summary, the paper contributes a hierarchical recurrent architecture, an adaptive correlation layer, and a targeted synthetic dataset, which together make stereo matching more practical for consumer electronics.