ACVNet: Efficient Stereo Matching Architecture

Updated 1 June 2026
  • ACVNet is a stereo matching architecture that uses an Attention Concatenation Volume (ACV) to filter dense cost volumes using geometric correlation cues.
  • It achieves state-of-the-art accuracy with reduced computational cost, cutting parameters by up to 60% compared to similar methods.
  • The modular design enables real-time variants and seamless integration with existing frameworks for improved disparity predictions.

ACVNet is a stereo matching architecture based on the novel Attention Concatenation Volume (ACV), designed for high-accuracy and efficient depth estimation from rectified image pairs. ACVNet addresses the redundancy present in dense concatenation cost volumes by introducing an attention-driven filtering mechanism that leverages geometric priors derived from correlation cues. The method yields strong accuracy across major benchmarks, with empirically validated improvements in both computational efficiency and error reduction, and is also extensible to real-time and lightweight variants.

1. Network Architecture and Pipeline

ACVNet is structured as a sequential pipeline of four sub-networks, aligning with canonical deep stereo matching approaches but incorporating a specialized attention mechanism:

  1. Feature Extraction: A ResNet-like encoder generates three levels of 1/4-resolution features: $l_1 \in \mathbb{R}^{64 \times H/4 \times W/4}$, $l_2 \in \mathbb{R}^{128 \times H/4 \times W/4}$, $l_3 \in \mathbb{R}^{128 \times H/4 \times W/4}$. The concatenated feature $[l_1; l_2; l_3]$ forms a 320-channel tensor for attention prediction, while unary features $f_l, f_r \in \mathbb{R}^{32 \times H/4 \times W/4}$ are synthesized through two 3×3 convolutional layers for left and right images respectively.
  2. Attention Concatenation Volume (ACV) Construction: A lightweight, multi-level correlation volume is constructed from $l_1$, $l_2$, and $l_3$ using group-wise, patch-based inner-product correlations. A 4D concatenation volume $C_{concat}$ is built by stacking $[f_l(x,y); f_r(x-d,y)]$ for all candidate disparities $d$, and the learned attention map $A$ then filters $C_{concat}$ to yield the attention concatenation volume $C_{ACV}$.
  3. Cost Aggregation: Only two stacked 3D hourglass modules plus a pre-hourglass (four 3D convolutions) are required for spatial-disparity context fusion, a reduction enabled by the attention-driven filtering.
  4. Disparity Prediction: Three intermediate cost volumes are passed through final 3D convolutions, followed by softmax over the disparity axis and soft-argmin to regress disparities $d_0, d_1, d_2$. Supervision is also imposed on the attention-branch’s soft-argmin estimate $d_{att}$.
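
The soft-argmin regression in step 4 can be illustrated with a minimal NumPy sketch. It assumes a cost volume of shape (D, H, W) in which lower values mean better matches, so the softmax is taken over the negated costs; implementations that output matching scores instead would skip the negation. The function name `soft_argmin` and the toy values are illustrative, not taken from the ACVNet code.

```python
import numpy as np

def soft_argmin(cost):
    """Regress a sub-pixel disparity map from a cost volume of shape (D, H, W)."""
    logits = -cost                                        # assumption: low cost = good match
    logits = logits - logits.max(axis=0, keepdims=True)   # stabilise the exponentials
    prob = np.exp(logits)
    prob = prob / prob.sum(axis=0, keepdims=True)         # softmax over the disparity axis
    disparities = np.arange(cost.shape[0], dtype=prob.dtype).reshape(-1, 1, 1)
    return (prob * disparities).sum(axis=0)               # expected disparity, shape (H, W)

# Toy volume for a single pixel: costs are symmetric around d = 2.
cost = np.array([4.0, 1.0, 0.0, 1.0, 4.0]).reshape(5, 1, 1)
disparity = soft_argmin(cost)
```

Because the output is an expectation over all disparity levels rather than a hard argmin, it is differentiable and can take fractional values.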

2. Construction of the Attention Concatenation Volume

The central innovation in ACVNet is the ACV, which integrates multi-level adaptive patch matching (MAPM), attention weighting, and concatenation filtering:

  • MAPM Correlation Volume: The 320-channel concatenated feature tensor is divided into $N_g = 40$ groups (allocation: 8 from $l_1$, 16 from $l_2$, 16 from $l_3$). For each group $g$ at feature scale $k \in \{1, 2, 3\}$, correlation is computed via:

$$C^{l_k}_{patch}(g, d, x, y) = \sum_{(i,j) \in \Omega^k} \omega^k_{ij} \cdot \frac{1}{N_c / N_g} \left\langle l_k^{L,g}(x - i, y - j),\; l_k^{R,g}(x - i - d, y - j) \right\rangle$$

where $\Omega^k$ denotes the 3×3 patch with dilation $k$ and $\omega^k_{ij}$ are learned weights; $l_k^{L,g}$ and $l_k^{R,g}$ are the $g$-th channel groups of the left and right features at level $k$, and $N_c / N_g$ is the number of channels per group.

  • Attention Weight Map: The concatenated multi-level correlation volume $C_{patch}$ is regularized by two 3D convolutions, a 3D hourglass, and reduced to a single channel via another 3D convolution. The final attention map $A$ is normalized (softmax over disparity) and subjected to soft-argmin supervision.
  • Filtering Concatenation Volume: Unary features are used to construct the standard concatenation cost volume:

$$C_{concat}(\cdot, d, x, y) = [f_l(x, y);\; f_r(x - d, y)]$$

The final ACV is formed by element-wise multiplication:

$$C_{ACV}(c, d, x, y) = A(d, x, y) \cdot C_{concat}(c, d, x, y)$$
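
The three ingredients above (group-wise patch correlation, a disparity-normalized attention map, and a filtered concatenation volume) can be sketched in NumPy as follows. This is a simplified illustration under stated assumptions, not the authors' implementation: all function names are hypothetical, out-of-range pixels are zero-padded, the patch weights are passed in rather than learned, and the 3D convolutions and hourglass that produce the attention logits are replaced by a plain mean over groups.

```python
import numpy as np

def groupwise_correlation(feat_l, feat_r, num_groups, max_disp):
    """Mean inner product per channel group; returns shape (G, D, H, W)."""
    C, H, W = feat_l.shape
    gl = feat_l.reshape(num_groups, C // num_groups, H, W)
    gr = feat_r.reshape(num_groups, C // num_groups, H, W)
    vol = np.zeros((num_groups, max_disp, H, W), dtype=feat_l.dtype)
    for d in range(max_disp):
        # Left pixel x is matched against right pixel x - d; columns x < d stay zero.
        vol[:, d, :, d:] = (gl[..., d:] * gr[..., :W - d]).mean(axis=1)
    return vol

def patch_aggregate(corr, weights, dilation):
    """Weighted sum of correlations over a dilated 3x3 patch (weights: (3, 3))."""
    G, D, H, W = corr.shape
    k = dilation
    padded = np.pad(corr, ((0, 0), (0, 0), (k, k), (k, k)))
    out = np.zeros_like(corr)
    for a, i in enumerate((-k, 0, k)):        # horizontal offset
        for b, j in enumerate((-k, 0, k)):    # vertical offset
            # This slice reads corr[..., y - j, x - i] for every output (y, x).
            out += weights[a, b] * padded[:, :, k - j:k - j + H, k - i:k - i + W]
    return out

def concat_volume(f_l, f_r, max_disp):
    """Stack [f_l(x, y); f_r(x - d, y)] per disparity; returns (2C, D, H, W)."""
    C, H, W = f_l.shape
    vol = np.zeros((2 * C, max_disp, H, W), dtype=f_l.dtype)
    for d in range(max_disp):
        vol[:C, d, :, d:] = f_l[:, :, d:]
        vol[C:, d, :, d:] = f_r[:, :, :W - d]
    return vol

def attention_filter(att_logits, c_concat):
    """Softmax the logits over disparity, then scale every channel of c_concat."""
    z = att_logits - att_logits.max(axis=0, keepdims=True)
    att = np.exp(z)
    att = att / att.sum(axis=0, keepdims=True)
    return att[None] * c_concat               # broadcast over the channel axis

rng = np.random.default_rng(0)
feat_l, feat_r = rng.standard_normal((2, 8, 4, 6))   # toy multi-level features
f_l, f_r = rng.standard_normal((2, 2, 4, 6))         # toy unary features
corr = patch_aggregate(groupwise_correlation(feat_l, feat_r, 4, 3),
                       np.full((3, 3), 1.0 / 9.0), dilation=1)
att_logits = corr.mean(axis=0)                       # stand-in for the 3D conv branch
acv = attention_filter(att_logits, concat_volume(f_l, f_r, 3))
```

The result `acv` has the same shape as the concatenation volume, with each disparity slice scaled by the attention weight at that pixel.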

3. Cost Aggregation and Lightweight Network Design

The ACV provides a pruned and information-dense cost volume, allowing ACVNet to use a drastically reduced 3D aggregation backbone. Empirical ablation demonstrates:

  • Only two hourglass modules (vs. three in GwcNet) are required for state-of-the-art performance. GwcNet aggregates its combined volume with three hourglasses (28 3D convolutions, 5.4M parameters); reducing this to the one or two hourglasses used with the ACV achieves a 60% parameter reduction while also improving D1 and EPE.
  • Even with zero hourglass modules, ACVNet significantly outperforms the corresponding GwcNet configuration.

This efficient aggregation is made possible because the attention filtering step concentrates the network’s capacity on a small subset of high-likelihood disparities.

4. Real-Time Fast-ACV and Volume Attention Propagation

To further accelerate inference, Fast-ACVNet incorporates additional architectural and algorithmic optimizations for real-time stereo:

  • Multi-scale Feature Fusion: Features from a MobileNetV2 backbone at multiple scales are upsampled and fused into 1/8 and 1/4 resolution maps.
  • Low-resolution Correlation Volume and Lifting: A group-wise correlation volume is constructed at 1/8 resolution, regularized, and upsampled to 1/4 using bilinear interpolation.
  • Volume Attention Propagation (VAP): For each pixel, candidate disparities and their matching confidence are computed. Propagation weights combine similarity scores and learned confidence terms, and matching probabilities are “propagated” using softmax-weighted sums on upsampled cost volumes.
  • Fine-to-Important (F2I) Sampling: Only the top-K high-probability disparity candidates at each pixel are selected to build a compact concatenation volume. The value of $K$ is chosen to balance accuracy against speed.
  • Final aggregation is performed using a single hourglass module, followed by two-best softmax and superpixel-based upsampling.
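
The top-K selection step can be sketched in NumPy as below. It assumes an attention (probability) volume of shape (D, H, W) and unary features of shape (C, H, W); the function names, the zero-padding of out-of-range columns, and the toy inputs are illustrative assumptions rather than details of Fast-ACVNet.

```python
import numpy as np

def topk_candidates(att, k):
    """Per pixel, keep the k most probable disparities of att (D, H, W).

    Returns the candidate indices (sorted by disparity) and their attention
    values, both of shape (k, H, W).
    """
    idx = np.argpartition(-att, k - 1, axis=0)[:k]   # k largest entries along D
    idx = np.sort(idx, axis=0)
    return idx, np.take_along_axis(att, idx, axis=0)

def sparse_concat_volume(f_l, f_r, idx):
    """Build [f_l(x, y); f_r(x - d, y)] only at the selected disparities."""
    C, H, W = f_l.shape
    k = idx.shape[0]
    xs = np.arange(W)[None, None, :] - idx           # right-image column x - d
    valid = xs >= 0
    xs = np.clip(xs, 0, W - 1)
    ys = np.broadcast_to(np.arange(H)[None, :, None], xs.shape)
    right = np.where(valid[None], f_r[:, ys, xs], 0.0)   # (C, k, H, W), zero if x < d
    left = np.broadcast_to(f_l[:, None], (C, k, H, W))
    return np.concatenate([left, right], axis=0)     # (2C, k, H, W)

# Toy example: 4 disparity levels, one row of 3 pixels, keep the best 2.
col = np.array([0.1, 0.5, 0.3, 0.1])
att = np.tile(col.reshape(4, 1, 1), (1, 1, 3))
idx, weights = topk_candidates(att, 2)
f_l = np.arange(6.0).reshape(2, 1, 3)
f_r = f_l + 10.0
vol = sparse_concat_volume(f_l, f_r, idx)
```

The disparity axis of the resulting volume has length K instead of D, which is what shrinks the cost of the subsequent 3D aggregation.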

5. Quantitative Performance and Benchmark Results

ACVNet and Fast-ACVNet report strong results on major stereo benchmarks:

| Model | Scene Flow EPE | KITTI 2015 D1-all | KITTI 2012 3px-noc | ETH3D Bad 1.0 | Params (M) | Runtime |
|---|---|---|---|---|---|---|
| ACVNet | 0.48 px (2nd) | 1.65% (2nd) | 1.13% (3rd) | 2.58% (3rd) | 6.22–7.40 | 200 ms |
| Fast-ACVNet | 0.64 px | 2.17% | 2.17% | - | 5.35 | 39–48 ms |
  • ACVNet ranks 2nd on KITTI 2015 and Scene Flow, and 3rd on KITTI 2012 and ETH3D among published methods.
  • Ablations indicate that the attention-driven constructions offer EPE and D1 improvements to other backbones (e.g., GwcNet, PSMNet, CFNet).
  • Fast-ACVNet outperforms most published sub-50 ms methods, including DeepPrunerFast, AANet, and CoEx, in both accuracy and generalization.
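
For reference, the two error measures used above can be computed as in the following sketch. It follows the usual conventions: EPE is the mean absolute disparity error in pixels, and D1 counts a pixel as an outlier when its error exceeds both 3 px and 5% of the ground-truth disparity. The function names and sample values are illustrative.

```python
import numpy as np

def epe(pred, gt, valid=None):
    """End-point error: mean absolute disparity error over valid pixels (px)."""
    if valid is None:
        valid = np.ones(gt.shape, dtype=bool)
    return float(np.abs(pred - gt)[valid].mean())

def d1(pred, gt, valid=None):
    """Percentage of valid pixels with error > 3 px and > 5% of the true disparity."""
    if valid is None:
        valid = np.ones(gt.shape, dtype=bool)
    err = np.abs(pred - gt)
    outlier = (err > 3.0) & (err > 0.05 * np.abs(gt))
    return float(100.0 * outlier[valid].mean())

gt = np.array([10.0, 10.0, 100.0, 100.0])
pred = np.array([10.0, 14.0, 103.5, 106.0])
epe_value = epe(pred, gt)
d1_value = d1(pred, gt)
```

In this sample the third pixel is off by 3.5 px but stays within 5% of its ground truth, so it is not counted as a D1 outlier even though it contributes to EPE.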

6. Ablation Studies and Universality

Extensive ablation quantifies the contribution of each ACVNet module:

  • Multi-level patch correlation reduces D1 error by 10%.
  • Attention filtering decreases D1 from 2.31% (multi-level patch only) to 2.03%.
  • Adding hourglass regularization on correlations and soft-argmin supervision on the attention map provides additional error reduction.
  • Integrating ACV into existing stereo matching backbones universally improves EPE and D1. For example, PSMNet’s EPE drops from 1.09 to 0.63 and D1 from 3.89% to 2.17% when augmented with ACV.
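
To put the absolute numbers above in relative terms, a short calculation helps. The helper name `relative_reduction` is illustrative; the inputs are the figures quoted in the list.

```python
def relative_reduction(before, after):
    """Relative decrease from `before` to `after`, in percent."""
    return 100.0 * (before - after) / before

psmnet_epe = relative_reduction(1.09, 0.63)     # PSMNet EPE with ACV
psmnet_d1 = relative_reduction(3.89, 2.17)      # PSMNet D1 with ACV
attention_d1 = relative_reduction(2.31, 2.03)   # D1 after adding attention filtering

print(f"PSMNet EPE: {psmnet_epe:.1f}%  PSMNet D1: {psmnet_d1:.1f}%  "
      f"attention filtering D1: {attention_d1:.1f}%")
```

These work out to roughly 42% and 44% relative reductions for PSMNet's EPE and D1, and about 12% for the attention filtering step.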

The modularity and generality of the ACV mechanism suggest applicability in a broad range of stereo matching frameworks.

7. Significance and Practical Considerations

ACVNet demonstrates that learned attention can substantially reduce both the redundancy of the cost volume and the size of the cost aggregation network in stereo matching, with empirical support on Scene Flow, KITTI 2012, KITTI 2015, and ETH3D. Its design yields state-of-the-art or near state-of-the-art accuracy with fewer parameters and lower memory needs, and the Fast-ACVNet variant reaches real-time speeds at a moderate cost in accuracy. The combination of attention-derived filtering with compact aggregation structures presents a scalable strategy for future stereo vision architectures (Xu et al., 2022, Xu et al., 2022).

References (2)
