
IGEV-Stereo: Iterative Geometry Encoding Volume

Updated 3 December 2025
  • IGEV-Stereo is a deep network architecture for stereo matching that fuses local and non-local geometric cues using a Combined Geometry Encoding Volume.
  • It employs soft arg-min disparity initialization and a ConvGRU-based iterative updater to achieve subpixel-accurate depth estimation in just 3–8 iterations.
  • The system extends to IGEV++ and IGEV-MVS, demonstrating state-of-the-art performance on benchmarks like Scene Flow and KITTI with efficient inference.

Iterative Geometry Encoding Volume (IGEV-Stereo) refers to a deep network architecture designed for stereo matching that integrates recurrent updates with a geometry-aware and context-rich cost volume. By leveraging lightweight 3D convolutional regularization, multi-scale feature aggregation, and an efficient ConvGRU-based updater, IGEV-Stereo achieves state-of-the-art accuracy and rapid convergence on established benchmarks. Its advances are further extended to multi-range (IGEV++) and multi-view (IGEV-MVS) stereo, yielding strong performance and generalization in a variety of settings (Xu et al., 2023, Xu et al., 1 Sep 2024).

1. Combined Geometry Encoding Volume Construction

The principal innovation of IGEV-Stereo is the Combined Geometry Encoding Volume (CGEV), which synthesizes both local and non-local matching cues across multiple scales, enabling effective disambiguation in ill-posed regions and refinement of fine details. CGEV is constructed by fusing three principal components:

  • Local all-pairs correlation (APC) preserves granular matching evidence.
  • 3D-CNN–filtered cost volume (GEV) encodes non-local geometry and scene context.
  • Disparity-pooled pyramids of APC and GEV capture multi-scale and large-disparity structures.

Given left ($\mathbf f_{l,4}$) and right ($\mathbf f_{r,4}$) feature maps at $1/4$ resolution, group-wise correlation volumes are computed as

$$\mathbf C_{\rm corr}(g,d,x,y)=\frac{1}{C/N_g}\left\langle \mathbf f^g_{l,4}(x,y),\,\mathbf f^g_{r,4}(x-d,y)\right\rangle,\quad g=1,\dots,N_g$$

This volume is regularized by a lightweight 3D U-Net:

$$\mathbf C_G = \mathbf R(\mathbf C_{\rm corr})$$

At each 3D convolution stage, a channel-wise excitation modulates responses with the sigmoid of higher-level features:

$$\mathbf C_i' = \sigma(\mathbf f_{l,i}) \odot \mathbf C_i$$
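As a concrete illustration, the group-wise correlation above can be sketched in NumPy (a minimal toy version on raw arrays; the actual network computes this on learned $1/4$-resolution features):

```python
import numpy as np

def groupwise_correlation_volume(f_left, f_right, max_disp, num_groups):
    """Group-wise correlation: split channels into groups and take the
    normalized inner product <f_l^g(x, y), f_r^g(x - d, y)> per group.

    f_left, f_right: (C, H, W) feature maps.
    Returns a volume of shape (num_groups, max_disp, H, W).
    """
    C, H, W = f_left.shape
    assert C % num_groups == 0
    cpg = C // num_groups  # channels per group, C / N_g in the formula
    fl = f_left.reshape(num_groups, cpg, H, W)
    fr = f_right.reshape(num_groups, cpg, H, W)
    volume = np.zeros((num_groups, max_disp, H, W), dtype=f_left.dtype)
    for d in range(max_disp):
        # f_r shifted right by d: columns x < d have no match and stay zero
        corr = (fl[..., d:] * fr[..., :W - d]).sum(axis=1) / cpg
        volume[:, d, :, d:] = corr
    return volume
```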

A parallel APC volume is built:

$$\mathbf C_A(d,x,y) = \langle\mathbf f_{l,4}(x,y),\,\mathbf f_{r,4}(x-d,y)\rangle$$

Disparity pooling forms a two-level pyramid:

$$\mathbf C_G^p = \mathrm{Pool}_d\,\mathbf C_G, \qquad \mathbf C_A^p = \mathrm{Pool}_d\,\mathbf C_A$$

The full CGEV concatenates these at each disparity level:

$$\mathbf C_{\rm CGEV}(d) = \left[\mathbf C_G(d);\; \mathbf C_A(d);\; \mathbf C^p_G(d/2);\; \mathbf C^p_A(d/2)\right]$$

This fusion scheme encodes both global geometric context and fine local details, which is critical in low-texture, reflective, or occluded regions.
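The pooling and concatenation steps can be sketched as follows (a NumPy toy version with hypothetical function names; the real implementation keeps the volumes as separate lookup pyramids rather than one dense tensor):

```python
import numpy as np

def pool_disparity(volume, k=2):
    """Average-pool a (D, H, W) cost volume along the disparity axis."""
    D = (volume.shape[0] // k) * k
    v = volume[:D]
    return v.reshape(D // k, k, *v.shape[1:]).mean(axis=1)

def build_cgev(c_g, c_a):
    """Stack [C_G(d), C_A(d), C_G^p(d/2), C_A^p(d/2)] per disparity level.
    c_g, c_a: (D, H, W) with D even. Returns (D, 4, H, W)."""
    c_gp, c_ap = pool_disparity(c_g), pool_disparity(c_a)
    D = c_g.shape[0]
    levels = [np.stack([c_g[d], c_a[d], c_gp[d // 2], c_ap[d // 2]])
              for d in range(D)]
    return np.stack(levels)
```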

2. Disparity Initialization with Soft Arg Min

IGEV-Stereo applies a soft-argmin operation over the geometry encoding volume (GEV) to regress an initial estimate $\mathbf d_0$, in contrast with standard RAFT-Stereo, which starts all disparities at zero:

$$\mathbf d_0(x,y) = \sum_{d=0}^{D-1} d \times \mathrm{Softmax}\left(\mathbf C_G(d,x,y)\right)$$

A smooth-$\ell_1$ loss $\mathcal L_{\rm init}$ is used to explicitly supervise this initialization:

$$\mathcal L_{\rm init} = \mathrm{Smooth}_{\ell_1}(\mathbf d_0-\mathbf d_{\rm gt})$$

On Scene Flow, this yields $\mathbf d_0$ within $1$–$2$ pixels of ground truth. This accurate starting state ensures that the subsequent ConvGRU-based iterative updater requires fewer updates, significantly accelerating convergence.
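A minimal NumPy sketch of the soft-argmin regression (numerically stabilized softmax over the disparity axis; assumes the volume is already a single-channel score map):

```python
import numpy as np

def soft_argmin_disparity(c_g):
    """Regress initial disparity d0(x, y) = sum_d d * softmax(C_G(d, x, y)).
    c_g: (D, H, W) geometry encoding volume. Returns (H, W)."""
    e = np.exp(c_g - c_g.max(axis=0, keepdims=True))  # stable softmax over d
    p = e / e.sum(axis=0, keepdims=True)
    d_levels = np.arange(c_g.shape[0]).reshape(-1, 1, 1)
    return (d_levels * p).sum(axis=0)
```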

3. ConvGRU-based Iterative Disparity Refinement

For disparity refinement, IGEV-Stereo employs a multi-level ConvGRU stack. At each iteration $k$:

  1. CGEV is sampled (via linear interpolation) around the current $\mathbf d_k$ for each pixel $(x,y)$:

$$\mathbf G_f(x,y) = \sum_{i=-r}^{r} \mathrm{Concat}\bigl\{ \mathbf C_G(\mathbf d_k(x,y)+i),\ \mathbf C_A(\mathbf d_k(x,y)+i),\ \mathbf C_G^p(\mathbf d_k(x,y)/2+i),\ \mathbf C_A^p(\mathbf d_k(x,y)/2+i) \bigr\}$$

  2. Features $\mathbf G_f$ and the current disparity $\mathbf d_k$ are encoded by 2-layer CNNs and concatenated to form the input $x_k$.
  3. The ConvGRU cell evolves the hidden state $h_k$ according to:

$$\begin{aligned} z_k &= \sigma(\mathrm{Conv}([h_{k-1},x_k];W_z)+c_z) \\ r_k &= \sigma(\mathrm{Conv}([h_{k-1},x_k];W_r)+c_r) \\ \tilde h_k &= \tanh(\mathrm{Conv}([r_k \odot h_{k-1},x_k];W_h)+c_h) \\ h_k &= (1-z_k) \odot h_{k-1} + z_k \odot \tilde h_k \end{aligned}$$

  4. A decoder produces a residual $\Delta\mathbf d_k$, yielding

$$\mathbf d_{k+1} = \mathbf d_k + \Delta\mathbf d_k$$
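The per-pixel linear-interpolation lookup of step 1 can be sketched as follows (a NumPy toy version for a single volume; the real updater concatenates lookups from all four CGEV components):

```python
import numpy as np

def sample_volume(volume, disp, radius):
    """Linearly interpolate a (D, H, W) cost volume at fractional
    disparities disp(x, y) + i, for i = -radius..radius. disp: (H, W).
    Returns (2*radius + 1, H, W) lookup features."""
    D, H, W = volume.shape
    ys, xs = np.indices((H, W))
    samples = []
    for i in range(-radius, radius + 1):
        d = np.clip(disp + i, 0.0, D - 1.0)
        d0 = np.floor(d).astype(int)        # lower disparity bin
        d1 = np.minimum(d0 + 1, D - 1)      # upper disparity bin
        w = d - d0                          # interpolation weight
        samples.append((1.0 - w) * volume[d0, ys, xs]
                       + w * volume[d1, ys, xs])
    return np.stack(samples)
```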

By initializing with $\mathbf d_0$, subpixel-accurate results are typically achieved in $3$–$8$ iterations, a notable reduction from the $32$ updates required by vanilla RAFT-Stereo.
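One ConvGRU update can be sketched with $1 \times 1$ "convolutions" (per-pixel linear maps) standing in for the real spatial kernels, and bias terms omitted, to keep the example short:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def convgru_step(h_prev, x, Wz, Wr, Wh):
    """One ConvGRU update; 1x1 convolutions (per-pixel linear maps) stand in
    for the spatial kernels, and bias terms c_* are dropped for brevity.
    h_prev: (Ch, H, W), x: (Cx, H, W), W*: (Ch, Ch + Cx)."""
    hx = np.concatenate([h_prev, x], axis=0)
    z = sigmoid(np.einsum('oc,chw->ohw', Wz, hx))          # update gate z_k
    r = sigmoid(np.einsum('oc,chw->ohw', Wr, hx))          # reset gate r_k
    rhx = np.concatenate([r * h_prev, x], axis=0)
    h_tilde = np.tanh(np.einsum('oc,chw->ohw', Wh, rhx))   # candidate state
    return (1.0 - z) * h_prev + z * h_tilde                # new hidden h_k
```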

4. Network Architecture and Loss Formulation

IGEV-Stereo comprises several tightly integrated modules:

  • Feature extractor: MobileNetV2 backbone pretrained on ImageNet, upsampling with skip connections to deliver $1/4$-scale feature maps, with side outputs at $1/8$, $1/16$, $1/32$ to guide 3D-CNNs.
  • Context network: A compact ResNet trunk provides multi-scale context maps (width=128), used for ConvGRU initialization and recurrent updates.
  • Volume builder: Encodes group-wise correlation, all-pairs correlation, disparity pooling, and concatenates to form CGEV.
  • Iterative updater: Three ConvGRUs ($128$-dim hidden state each), recurrently updating disparity.
  • Upsampling head: Predicts a learned $3 \times 3$ convex combination kernel per pixel to upsample from $1/4$-scale.
  • Loss: Total loss is

$$\mathcal L = \mathcal L_{\rm init} + \sum_{k=1}^{N} \gamma^{N-k} \,\|\mathbf d_k - \mathbf d_{\rm gt}\|_1, \quad \gamma=0.9$$

The model comprises $\sim$12.6M parameters and achieves $0.18$s inference on $1242\times375$ KITTI images.
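The total loss can be sketched directly from the formula (NumPy; `d_preds` holds the $N$ refined disparity maps in iteration order):

```python
import numpy as np

def igev_loss(d_init, d_preds, d_gt, gamma=0.9):
    """L = L_init + sum_k gamma^(N-k) * |d_k - d_gt|_1, where L_init is
    the smooth-L1 loss on the soft-argmin initialization d_init."""
    diff = np.abs(d_init - d_gt)
    smooth_l1 = np.where(diff < 1.0, 0.5 * diff ** 2, diff - 0.5)
    loss = smooth_l1.mean()
    N = len(d_preds)
    for k, d_k in enumerate(d_preds, start=1):
        # later iterations receive exponentially larger weight gamma^(N-k)
        loss += gamma ** (N - k) * np.abs(d_k - d_gt).mean()
    return loss
```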

5. Empirical Results and Comparative Performance

IGEV-Stereo demonstrates high accuracy and speed across established benchmarks:

  • Scene Flow (test): EPE = $0.47$px (cf. PSMNet $1.09$, GwcNet $0.76$).
  • KITTI 2012 (2px, noc): $1.71\%$ (best among published methods).
  • KITTI 2015 D1-all: $1.59\%$ (ranked first at submission).
  • Inference Time: $0.18$s, fastest among top 10.
  • Ill-posed/reflective (KITTI 2012): $<10\%$ Out-Noc with $8$ iterations, vs. $13\%$ for RAFT-Stereo ($32$ iterations).
  • Cross-dataset performance: Middlebury half-res EPE $7.1$px vs. $8.7$px (RAFT); ETH3D $3.6$px vs. $3.2$px (RAFT).

This suggests that the architecture not only accelerates convergence but also provides robustness to cross-domain transfer and difficult regions (Xu et al., 2023).

6. Extensions: IGEV++, Multi-view & Multi-range Encoding

IGEV++ (Xu et al., 1 Sep 2024) generalizes the IGEV framework to Multi-range Geometry Encoding Volumes (MGEV), better handling large disparities and ill-posed regions:

  • MGEV encodes geometry at three scales: small ($D^s \sim 192$), medium ($D^m \sim 384$), and large ($D^l \sim 768$).
  • Adaptive Patch Matching (APM): Efficient matching in large disparity regimes by coarsely quantized, weighted-patch correlation:

$$C^l(g, d^l, x, y) = \frac{1}{N_c/N_g} \sum_{i=0}^{P-1} \omega_i \left\langle f_{l,4}^g(x,y),\ f_{r,4}^g\bigl(x-(d^l+i), y\bigr) \right\rangle$$

  • Selective Geometry Feature Fusion (SGFF): Per-pixel gating of contributions from $G^s$, $G^m$, $G^l$ based on learned weights from image features and initial disparities:

$$f_G(x,y) = s_s \odot f_G^s + s_m \odot f_G^m + s_l \odot f_G^l$$

  • The ConvGRU updater is retained, with each iteration using fused features for robust updates.
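The SGFF gating can be sketched as a per-pixel softmax over the three range-specific branches (the softmax normalization is an assumption for illustration; the paper's weighting network may normalize differently):

```python
import numpy as np

def sgff(f_s, f_m, f_l, logits):
    """Fuse range-specific geometry features with per-pixel weights:
    f_G = s_s * f_s + s_m * f_m + s_l * f_l, where (s_s, s_m, s_l)
    is a softmax over logits from a small weighting network.
    f_*: (C, H, W); logits: (3, H, W)."""
    e = np.exp(logits - logits.max(axis=0, keepdims=True))
    s = e / e.sum(axis=0, keepdims=True)   # s_s + s_m + s_l = 1 per pixel
    return s[0] * f_s + s[1] * f_m + s[2] * f_l
```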

Quantitative improvements are substantial, including:

  • EPE $0.67$ (Scene Flow, $<768$px), Bad 3.0 $= 2.21\%$ ($32$ iters), outperforming RAFT-Stereo's EPE $0.98$.
  • KITTI 2012 (2-noc): $1.56\%$; KITTI 2015 (D1-all): $1.51\%$ (in $0.28$s).
  • Middlebury “large-disp” Bad 2.0: $3.23\%$ (zero-shot), a $31.9\%$ error reduction over RAFT-Stereo.
  • Reflective regions (KITTI 2012, 3-noc): $3.71\%$ (IGEV++) vs. $5.40\%$ (RAFT-Stereo).

IGEV-MVS extends the approach to multi-view stereo by stacking pairwise CGEVs from $N$ views, evaluated on the DTU benchmark with an overall accuracy of $0.324$mm (best among learned methods at time of publication) (Xu et al., 2023).

7. Ablations and Design Insights

Ablation studies (Xu et al., 1 Sep 2024) show that:

  • Adding a single-range GEV to the baseline RAFT-Stereo brings a $\sim 15\%$ reduction in Scene Flow EPE.
  • Incorporating MGEV with APM improves accuracy on large disparities by a relative $2$–$3\%$.
  • Selective feature fusion (SGFF) further reduces errors, especially for ill-posed regions.

Each component thus contributes quantifiably to IGEV’s convergence speed and generalizability:

  • Geometric regularization with lightweight 3D-CNN is crucial for non-local reasoning.
  • Multi-scale, adaptive patch matching is necessary for handling large search spaces without prohibitive memory.
  • Learned per-pixel fusion provides context-sensitive updates essential for robust estimation in challenging scenes.

In summary, IGEV-Stereo and its derivatives (IGEV++, IGEV-MVS) combine geometry-aware volumetric encoding, efficient recurrent updating, and adaptive multi-scale matching to set new accuracy and speed benchmarks in stereo and multi-view depth estimation (Xu et al., 2023, Xu et al., 1 Sep 2024).
