Attention Concatenation Volume for Accurate and Efficient Stereo Matching (2203.02146v3)

Published 4 Mar 2022 in cs.CV

Abstract: Stereo matching is a fundamental building block for many vision and robotics applications. An informative and concise cost volume representation is vital for stereo matching of high accuracy and efficiency. In this paper, we present a novel cost volume construction method which generates attention weights from correlation clues to suppress redundant information and enhance matching-related information in the concatenation volume. To generate reliable attention weights, we propose multi-level adaptive patch matching to improve the distinctiveness of the matching cost at different disparities even for textureless regions. The proposed cost volume is named attention concatenation volume (ACV) which can be seamlessly embedded into most stereo matching networks, the resulting networks can use a more lightweight aggregation network and meanwhile achieve higher accuracy, e.g. using only 1/25 parameters of the aggregation network can achieve higher accuracy for GwcNet. Furthermore, we design a highly accurate network (ACVNet) based on our ACV, which achieves state-of-the-art performance on several benchmarks.

Summary

The paper introduces the Attention Concatenation Volume (ACV) to improve stereo matching accuracy by filtering out irrelevant information.
It employs Multi-level Adaptive Patch Matching (MAPM) and attention weight generation to construct a robust cost volume.
ACVNet reaches state-of-the-art accuracy on several benchmarks with a lighter aggregation network, and its real-time variant, ACVNet-Fast, outperforms other real-time stereo networks in both accuracy and speed.

Attention Concatenation Volume for Accurate and Efficient Stereo Matching

The paper "Attention Concatenation Volume for Accurate and Efficient Stereo Matching" introduces a method for constructing the cost volume in stereo matching, a core component of computer vision applications such as autonomous driving, augmented reality, and robotics. The method aims to improve both the accuracy and the efficiency of stereo matching networks by reducing the burden on the cost aggregation stage.

Overview of Methodology

Stereo matching networks typically construct a cost volume that measures, for each candidate disparity, the similarity between a pixel in the left image and its shifted counterpart in the right image. This paper introduces the Attention Concatenation Volume (ACV), which uses attention weights derived from a correlation volume to filter a concatenation volume. The ACV suppresses redundant information and highlights matching-related content, which allows the aggregation network to be much lighter. The key components of the methodology are:

Multi-level Adaptive Patch Matching (MAPM):
- The MAPM facilitates the construction of correlation volumes by using atrous patches with adaptive weights, tailored to different feature levels. This approach improves the robustness of similarity measures, particularly in textureless regions where traditional methods may falter.
Attention Weights Generation:
- Attention weights are derived from the correlation volume and serve to enhance the concatenation volume by filtering out irrelevant information. These weights are supervised by ground-truth disparities to optimize performance.
Disparity Prediction and Cost Aggregation:
- The stereo matching network, ACVNet, uses the ACV as its cost volume for disparity estimation. Because the ACV is already filtered, cost aggregation needs far fewer parameters; the abstract reports that GwcNet with ACV achieves higher accuracy using only 1/25 of the aggregation network's parameters.
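
The three components above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: a plain per-pixel dot-product correlation stands in for MAPM, the attention weights come from a softmax over the disparity axis rather than from learned 3D convolutions, and the function names and the `temperature` parameter are hypothetical choices made for this example only.

```python
import numpy as np


def correlation_volume(left, right, max_disp):
    """Dot-product correlation (a stand-in for MAPM, assumed for this sketch).

    left, right: feature maps of shape (C, H, W). Returns shape (D, H, W).
    """
    _, H, W = left.shape
    vol = np.zeros((max_disp, H, W), dtype=left.dtype)
    for d in range(max_disp):
        # Left pixel x is compared with right pixel x - d; columns x < d stay zero.
        vol[d, :, d:] = (left[:, :, d:] * right[:, :, :W - d]).mean(axis=0)
    return vol


def concatenation_volume(left, right, max_disp):
    """Stack left features with disparity-shifted right features: (2C, D, H, W)."""
    C, H, W = left.shape
    vol = np.zeros((2 * C, max_disp, H, W), dtype=left.dtype)
    for d in range(max_disp):
        vol[:C, d, :, d:] = left[:, :, d:]
        vol[C:, d, :, d:] = right[:, :, :W - d]
    return vol


def softmax(x, axis):
    """Numerically stable softmax along one axis."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)


def attention_concat_volume(left, right, max_disp, temperature=1.0):
    """Filter the concatenation volume with correlation-derived attention weights.

    The softmax normalization and temperature are assumptions of this sketch.
    Returns (acv, attention) with shapes (2C, D, H, W) and (D, H, W).
    """
    corr = correlation_volume(left, right, max_disp)
    attention = softmax(corr / temperature, axis=0)
    concat = concatenation_volume(left, right, max_disp)
    # Broadcast the (D, H, W) weights over the channel axis of the concat volume.
    return attention[None] * concat, attention


def expected_disparity(attention):
    """Probability-weighted mean disparity per pixel, shape (H, W)."""
    disparities = np.arange(attention.shape[0]).reshape(-1, 1, 1)
    return (attention * disparities).sum(axis=0)
```

Calling `attention_concat_volume(left, right, max_disp)` on two `(C, H, W)` feature maps returns the filtered volume together with the attention weights. Passing those weights to `expected_disparity` gives a per-pixel disparity estimate, the kind of quantity that could be compared against ground-truth disparities to supervise the attention weights.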

Numerical Results and Implications

Experimental evaluations cover Scene Flow, KITTI 2012 and 2015, and ETH3D. ACVNet achieves state-of-the-art results on these benchmarks, improving disparity accuracy metrics such as End-Point Error (EPE) and the percentage of disparity outliers (D1). The ACV also enables substantial reductions in model parameters and inference time.
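
For readers unfamiliar with the two metrics, the sketch below shows how they are commonly computed. The thresholds follow the usual KITTI 2015 convention (a pixel is an outlier when its error exceeds both 3 px and 5% of the true disparity), and treating `gt <= 0` as "no ground truth" is likewise an assumption of this example; neither detail is stated in the text above. The function names are hypothetical.

```python
def _valid_pairs(pred, gt):
    """Keep only pixels with ground truth (assumed convention: gt > 0 is valid)."""
    pairs = [(p, g) for p, g in zip(pred, gt) if g > 0]
    if not pairs:
        raise ValueError("no pixels with valid ground-truth disparity")
    return pairs


def end_point_error(pred, gt):
    """Mean absolute disparity error in pixels over valid pixels."""
    pairs = _valid_pairs(pred, gt)
    return sum(abs(p - g) for p, g in pairs) / len(pairs)


def d1_outlier_rate(pred, gt, abs_thresh=3.0, rel_thresh=0.05):
    """Percentage of valid pixels whose error exceeds BOTH thresholds."""
    pairs = _valid_pairs(pred, gt)
    outliers = sum(
        1
        for p, g in pairs
        if abs(p - g) > abs_thresh and abs(p - g) > rel_thresh * g
    )
    return 100.0 * outliers / len(pairs)
```

Both functions take flattened sequences of predicted and ground-truth disparities of equal length. A pixel with true disparity 100 and prediction 104 is not counted as an outlier, because its 4 px error is below 5% of the true value.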

The paper also introduces ACVNet-Fast, a real-time variant optimized for computational speed. ACVNet-Fast balances performance and efficiency, outperforming other established real-time stereo matching networks in terms of both accuracy and speed.

Implications for Future Research

The ACV is relevant to real-world applications that demand both high precision and low latency. By combining feature correlation with attention, the work shows that stereo networks can be made more compact and efficient without losing accuracy.

Future work may refine the attention mechanism or integrate the ACV into other stereo matching paradigms. Extending the ACV to more complex and varied scenes could also broaden its adoption in practical vision tasks.

In summary, the paper advances stereo matching with a more informative and concise cost volume, which lowers the computational cost of aggregation while improving accuracy.