Accurate and Efficient Stereo Matching via Attention Concatenation Volume (2209.12699v3)

Published 23 Sep 2022 in cs.CV

Abstract: Stereo matching is a fundamental building block for many vision and robotics applications. An informative and concise cost volume representation is vital for stereo matching of high accuracy and efficiency. In this paper, we present a novel cost volume construction method, named attention concatenation volume (ACV), which generates attention weights from correlation clues to suppress redundant information and enhance matching-related information in the concatenation volume. The ACV can be seamlessly embedded into most stereo matching networks; the resulting networks can use a more lightweight aggregation network and meanwhile achieve higher accuracy. We further design a fast version of ACV to enable real-time performance, named Fast-ACV, which generates high likelihood disparity hypotheses and the corresponding attention weights from low-resolution correlation clues to significantly reduce computational and memory cost and meanwhile maintain a satisfactory accuracy. The core idea of our Fast-ACV is volume attention propagation (VAP), which can automatically select accurate correlation values from an upsampled correlation volume and propagate these accurate values to the surrounding pixels with ambiguous correlation clues. Furthermore, we design a highly accurate network ACVNet and a real-time network Fast-ACVNet based on our ACV and Fast-ACV respectively, which achieve state-of-the-art performance on several benchmarks (i.e., our ACVNet ranks the 2nd on KITTI 2015 and Scene Flow, and the 3rd on KITTI 2012 and ETH3D among all the published methods; our Fast-ACVNet outperforms almost all state-of-the-art real-time methods on Scene Flow, KITTI 2012 and 2015 and meanwhile has better generalization ability).

Citations (31)


Summary

The paper introduces ACV, a novel attention-based cost volume method that enhances stereo matching accuracy while reducing computational demands.
It employs an attention filtering mechanism to refine feature concatenation, optimizing performance in existing stereo networks.
Fast-ACV integrates Volume Attention Propagation and a Fine-to-Important strategy to achieve real-time processing on standard benchmarks.


Stereo matching underpins many computer vision applications, such as 3D reconstruction, autonomous driving, and robot navigation. The paper, authored by Gangwei Xu et al., introduces Attention Concatenation Volume (ACV), a cost volume construction method that improves both the accuracy and the efficiency of stereo matching. A real-time variant, Fast-ACV, targets applications that need low-latency processing, such as robotics.

Introduction and Methodology

In stereo matching, constructing an informative and concise cost volume is pivotal for accurate disparity (and hence depth) estimation. The two traditional constructions each have a drawback: a correlation volume is cheap but discards much of the feature content, while a concatenation volume preserves the features but requires extensive computation and memory to aggregate. The proposed ACV balances the two by using attention weights generated from correlation clues to filter a concatenation volume, suppressing redundant information and emphasizing matching-related information.
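To make the size gap between the two constructions concrete, here is a small arithmetic sketch. The feature resolution, disparity range, and channel count below are hypothetical values chosen for illustration, not figures from the paper.

```python
def volume_elements(channels, max_disp, height, width):
    """Number of scalar entries in a cost volume of shape (channels, max_disp, height, width)."""
    return channels * max_disp * height * width

# Hypothetical sizes (assumptions, not taken from the paper):
# 64 x 128 feature maps, 48 disparity levels, 32 feature channels per image.
H, W, D, C = 64, 128, 48, 32

corr_size = volume_elements(1, D, H, W)        # one similarity score per (d, y, x)
concat_size = volume_elements(2 * C, D, H, W)  # left and right features stacked per (d, y, x)
ratio = concat_size // corr_size

print(corr_size, concat_size, ratio)
```

Under these assumed sizes the concatenation volume holds 2C times as many values as a single-channel correlation volume, which is why its aggregation network tends to dominate the cost.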

Attention Concatenation Volume

The ACV method integrates with most existing stereo networks, optimizing them for higher accuracy while reducing the need for complex aggregation networks. The authors demonstrate that by replacing the existing cost volume in networks like GwcNet, significant gains in accuracy can be realized with fewer 3D convolutions needed for cost aggregation.

The ACV process consists of:

Attention Weights Generation: Using a lightweight correlation volume to derive attention weights that prioritize matching-related information.
Initial Concatenation Volume Construction: Creating a concatenated representation from left and right image features.
Attention Filtering: Refining the concatenation volume with derived attention weights to enhance meaningful content and suppress irrelevant features.
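The three steps above can be sketched on a single scanline with plain Python lists. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, and the attention weights here are a plain softmax over raw correlation scores, whereas the paper derives them with a learned network.

```python
import math

def correlation_volume(left, right, max_disp):
    """corr[d][x]: channel-averaged dot product of left[x] and right[x - d] (0 if x - d < 0)."""
    channels = len(left[0])
    return [[sum(l * r for l, r in zip(left[x], right[x - d])) / channels if x >= d else 0.0
             for x in range(len(left))] for d in range(max_disp)]

def concatenation_volume(left, right, max_disp):
    """concat[d][x]: left[x] followed by right[x - d] (all zeros if x - d < 0)."""
    channels = len(left[0])
    return [[left[x] + right[x - d] if x >= d else [0.0] * (2 * channels)
             for x in range(len(left))] for d in range(max_disp)]

def attention_weights(corr):
    """Assumed stand-in for the learned module: softmax over disparity at each pixel."""
    max_disp, width = len(corr), len(corr[0])
    weights = [[0.0] * width for _ in range(max_disp)]
    for x in range(width):
        column = [corr[d][x] for d in range(max_disp)]
        peak = max(column)  # subtract the maximum for numerical stability
        exps = [math.exp(c - peak) for c in column]
        total = sum(exps)
        for d in range(max_disp):
            weights[d][x] = exps[d] / total
    return weights

def attention_filter(concat, weights):
    """Scale every feature vector in the concatenation volume by its attention weight."""
    return [[[weights[d][x] * v for v in concat[d][x]] for x in range(len(concat[d]))]
            for d in range(len(concat))]

# Toy scanline: the left features are the right features shifted by one pixel,
# so the true disparity is 1 wherever a match exists.
right = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0], [1.0, 1.0, 0.0]]
left = [[0.0, 0.0, 0.0]] + right[:3]

corr = correlation_volume(left, right, max_disp=3)
weights = attention_weights(corr)
acv = attention_filter(concatenation_volume(left, right, max_disp=3), weights)
best = [max(range(3), key=lambda d: weights[d][x]) for x in range(len(left))]
```

In this toy case the weights peak at disparity 1 for the pixels that have a match, so the filtered volume keeps the concatenated features at the likely disparity and damps the rest.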

Fast-ACV: Real-Time Adaptation

To facilitate real-time applications, Fast-ACV enhances ACV with Volume Attention Propagation (VAP) and a Fine-to-Important (F2I) strategy. This reduces computational and memory demands significantly:

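Volume Attention Propagation (VAP): Automatically selects accurate correlation values from an upsampled low-resolution correlation volume and propagates them to surrounding pixels whose correlation clues are ambiguous, compensating for the errors that interpolation introduces.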
Fine-to-Important (F2I) Strategy: Constructs a compact cost volume focused on high-likelihood disparity hypotheses, further optimizing computational efficiency.
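The hypothesis-selection idea behind Fast-ACV can be sketched for one pixel as follows. This is an illustrative sketch only: the helper names, the value of k, and the softmax over the selected scores are assumptions, and the learned VAP step that would refine the correlation scores beforehand is not modeled.

```python
import math

def select_hypotheses(corr_column, k):
    """Return the k disparities with the highest correlation and softmax weights over them."""
    top = sorted(range(len(corr_column)), key=lambda d: corr_column[d], reverse=True)[:k]
    peak = corr_column[top[0]]  # largest selected score, used for numerical stability
    exps = [math.exp(corr_column[d] - peak) for d in top]
    total = sum(exps)
    return top, [e / total for e in exps]

def compact_concat(left_vec, right_row, x, hypotheses):
    """Concatenate left[x] with right[x - d] for the selected disparities only.

    Out-of-range matches (x - d < 0) are padded with zeros on the right-feature half.
    """
    zeros = [0.0] * len(left_vec)
    return [left_vec + (right_row[x - d] if x >= d else zeros) for d in hypotheses]

# Hypothetical correlation scores for one pixel over six disparity levels.
corr_column = [0.1, 0.9, 0.2, 0.8, 0.0, 0.3]
hyps, hyp_weights = select_hypotheses(corr_column, k=2)

# Hypothetical two-channel features for a five-pixel right scanline and one left pixel at x = 4.
right_row = [[float(i), float(i) + 0.5] for i in range(5)]
left_vec = [3.0, 3.5]
compact = compact_concat(left_vec, right_row, x=4, hypotheses=hyps)
```

With k = 2 the per-pixel volume has two entries instead of six, which is the source of the compute and memory savings; the weights would then scale these entries in the same way as in the full ACV.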

Results and Implications

ACV and Fast-ACV are evaluated on several benchmarks, including KITTI, Scene Flow, and ETH3D. Among published methods at the time, ACVNet ranks 2nd on KITTI 2015 and Scene Flow and 3rd on KITTI 2012 and ETH3D, while using a lighter aggregation network. Fast-ACVNet outperforms almost all state-of-the-art real-time methods on Scene Flow, KITTI 2012, and KITTI 2015, and generalizes better.

The integration of these models into existing architectures offers a pathway for enhancing stereo matching performance and can influence implementations across a wide spectrum of artificial intelligence applications, from self-driving vehicles to augmented reality systems.

Conclusion and Future Directions

This research advances stereo matching by balancing efficiency and accuracy. Because ACV and Fast-ACV can replace the cost volume in most existing stereo networks, they are applicable across diverse stereo matching frameworks.

Future research could explore deeper integrations of attention mechanisms within various neural architectures, enhancing performance across related tasks in volumetric estimation and scene understanding. Considering the foundational impact of stereo matching in AI and robotics, these innovations may propel further advancements, driving practical and theoretical progress in 3D computer vision.
