
IGEV-MVS: Iterative Geometry Encoding in MVS

Updated 3 December 2025
  • IGEV-MVS is a deep learning framework that employs cost-volume geometry encoding and iterative ConvGRU updates to accurately estimate per-pixel depth for dense 3D reconstruction.
  • It combines group-wise correlation, pyramid representations, and soft-argmin initialization to efficiently refine depth estimates across multiple views.
  • Benchmark results and ablation studies demonstrate competitive accuracy on the DTU dataset while maintaining fast processing speeds on high-resolution images.

Multi-View Stereo (MVS) addresses dense 3D scene reconstruction from multiple calibrated images, requiring the recovery of per-pixel depth or disparity consistent across views. IGEV-MVS ("Iterative Geometry Encoding Volume for Multi-View Stereo," Editor's term) is an efficient deep learning approach that combines cost-volume geometry encoding with iterative update mechanisms to advance the accuracy and computational efficiency of state-of-the-art multi-view stereo pipelines (Xu et al., 2023). It builds on robust concepts from stereo matching, generalizes them to the MVS domain, and achieves competitive results on evaluations such as the DTU benchmark.

1. Geometry Encoding Volume Construction

IGEV-MVS employs group-wise correlation or feature concatenation to generate a multi-view cost volume at a discrete set of depth hypotheses. For each input image $I_i$ ($i = 1, \ldots, N$), features $f_i(x, y)$ are extracted at 1/4 scale via a shared backbone (MobileNetV2). Using camera intrinsics and extrinsics, pixels from the reference image $I_1$ are reprojected into the source images for each hypothesized depth $d$:

\pi_i(x, y; d) = K_i \, [R_i \mid t_i] \cdot \left[ d \cdot K_1^{-1} (x, y, 1)^\top \right]

Group-wise correlation then forms the cost volume $C_{\text{corr}}(g, d, x, y)$ across channel groups $g$:

C_{\text{corr}}(g, d, x, y) = \frac{1}{N_c / N_g} \cdot \frac{1}{N - 1} \sum_{i=2}^{N} \left\langle f_1^g(x, y),\, f_i^g\big(\pi_i(x, y; d)\big) \right\rangle

This cost volume is processed by a lightweight 3D U-Net using guided cost excitation to yield the Geometry Encoding Volume ($C_G$), leveraging multi-scale convolutional filtering to encode both geometric and contextual information.
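To make the construction concrete, here is a minimal PyTorch sketch of the group-wise correlation step. It assumes the source features have already been warped to the reference view at each depth hypothesis (e.g., via differentiable homography warping with grid_sample); the function name and tensor layout are illustrative, not taken from the official implementation.

```python
import torch

def groupwise_correlation(ref_feat, warped_src_feats, num_groups):
    """
    ref_feat:         (B, C, H, W) reference-view features f_1
    warped_src_feats: list of N-1 tensors, each (B, C, D, H, W): source
                      features f_i warped to the reference view at D depths
    Returns the cost volume C_corr of shape (B, G, D, H, W).
    """
    B, C, H, W = ref_feat.shape
    G = num_groups
    # Split reference features into G channel groups; broadcast over depth
    ref = ref_feat.view(B, G, C // G, 1, H, W)
    cost = 0.0
    for src in warped_src_feats:
        src = src.view(B, G, C // G, -1, H, W)
        # mean over in-group channels = inner product scaled by 1/(C/G)
        cost = cost + (ref * src).mean(dim=2)
    # Average over the N-1 source views
    return cost / len(warped_src_feats)

# Example shapes at 1/4 resolution with 64 depth hypotheses and N = 5 views
ref = torch.randn(1, 32, 128, 160)
srcs = [torch.randn(1, 32, 64, 128, 160) for _ in range(4)]
volume = groupwise_correlation(ref, srcs, num_groups=8)  # (1, 8, 64, 128, 160)
```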

2. Combined Cost Volume and Pyramid Representation

To model both non-local and local cues, IGEV-MVS constructs a two-level pyramid by pooling $C_G$ and the original all-pairs correlation volume (APC) along the depth axis. The resulting combined geometry encoding volume (CGEV) is a concatenation of multi-scale filtered geometry and correlation features, enabling the network to refine depth hypotheses with contextual awareness from multiple views.
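As a rough sketch under assumed tensor shapes, the depth-axis pooling that forms the pyramid could look like the following; the kernel size and level count are illustrative choices, not values quoted from the paper.

```python
import torch
import torch.nn.functional as F

def build_depth_pyramid(volume, levels=2):
    """volume: (B, C, D, H, W). Halve the depth dimension per extra level."""
    pyramid = [volume]
    for _ in range(levels - 1):
        pyramid.append(F.avg_pool3d(pyramid[-1],
                                    kernel_size=(2, 1, 1), stride=(2, 1, 1)))
    return pyramid

# Example: pool C_G (and, analogously, the APC volume); indexing and
# concatenating the per-level features then yields the CGEV lookup.
c_g = torch.randn(1, 8, 64, 128, 160)
levels = build_depth_pyramid(c_g)  # depth resolutions 64 and 32
```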

3. Iterative Depth Update via ConvGRU

Depth estimation proceeds in two stages:

  • Initial Depth via Soft-Argmin: The geometry encoding volume $C_G$ is collapsed along the depth axis using softmax-weighted averaging:

d_0(x, y) = \sum_{d=1}^{D} d \cdot \mathrm{softmax}\big(C_G(d, x, y)\big)

  • Multi-Scale Iterative Update: A sequence of ConvGRUs at 1/16, 1/8, and 1/4 spatial resolutions iteratively updates the depth map over $N_{\text{it}}$ iterations. At each step $k$:
    • CGEV is indexed around the current estimate $d_k(x, y)$ via linear interpolation in a small depth window (radius $r$), producing a feature tensor $G_f$.
    • Geometry and depth features are encoded, concatenated, and fed to the ConvGRU cell.
    • Residual depth $\Delta d_k$ is decoded from the GRU output and applied to the current estimate: $d_{k+1} = d_k + \Delta d_k$.

The initial estimate $d_0$ is directly supervised via a Smooth-L1 loss against ground truth, facilitating rapid convergence of the recurrent updates.
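The following schematic PyTorch sketch shows the two-stage procedure: soft-argmin initialization from $C_G$, then residual ConvGRU updates. The single-scale ConvGRUCell and the lookup/encoder/head callables are simplified stand-ins for the multi-scale modules described above, not the paper's actual code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def soft_argmin(c_g, depth_values):
    """c_g: (B, D, H, W); depth_values: (D,) hypothesized depths.
    Returns the initial depth d_0 of shape (B, 1, H, W)."""
    prob = F.softmax(c_g, dim=1)  # softmax over the depth axis
    return (prob * depth_values.view(1, -1, 1, 1)).sum(dim=1, keepdim=True)

class ConvGRUCell(nn.Module):
    """Single-scale ConvGRU; IGEV-MVS cascades cells at 1/16, 1/8, 1/4."""
    def __init__(self, hidden_dim, input_dim):
        super().__init__()
        self.convz = nn.Conv2d(hidden_dim + input_dim, hidden_dim, 3, padding=1)
        self.convr = nn.Conv2d(hidden_dim + input_dim, hidden_dim, 3, padding=1)
        self.convq = nn.Conv2d(hidden_dim + input_dim, hidden_dim, 3, padding=1)

    def forward(self, h, x):
        hx = torch.cat([h, x], dim=1)
        z = torch.sigmoid(self.convz(hx))  # update gate
        r = torch.sigmoid(self.convr(hx))  # reset gate
        q = torch.tanh(self.convq(torch.cat([r * h, x], dim=1)))
        return (1 - z) * h + z * q

def refine_depth(d0, h, gru, lookup, encoder, head, num_iters=8):
    """lookup(d) samples CGEV features in a radius-r window around depth d;
    encoder and head are small conv stacks. All three are placeholders."""
    d, preds = d0, []
    for _ in range(num_iters):
        x = encoder(torch.cat([lookup(d), d], dim=1))  # geometry + depth feats
        h = gru(h, x)
        d = d + head(h)                                # d_{k+1} = d_k + Δd_k
        preds.append(d)
    return preds
```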

4. Training and Evaluation Protocols

Training is conducted on standard benchmarks (e.g., DTU), typically with 5 input views per sample and image sizes of 640×512 for training and 1600×1152 for testing. The optimizer is AdamW with standard learning-rate scheduling. Losses during training include:

  • Initial depth supervision: $\mathcal{L}_{\text{init}} = \mathrm{SmoothL1}(d_0 - D_{gt})$
  • Iteratively weighted L1 supervision on all update steps:

\mathcal{L}_{\text{total}} = \mathcal{L}_{\text{init}} + \sum_{i=1}^{N_{\text{it}}} \gamma^{N_{\text{it}} - i} \left\| d_i - D_{gt} \right\|_1

Additional optional losses include smoothness and photometric consistency, though the main reported results use only L1 depth supervision.
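In code, the combined objective might be sketched as below; the discount γ = 0.9 is an assumed, RAFT-style choice rather than a value quoted here, and valid-pixel masking is omitted for brevity.

```python
import torch
import torch.nn.functional as F

def total_loss(d_init, iter_preds, d_gt, gamma=0.9):
    """d_init: soft-argmin depth (B, 1, H, W); iter_preds: list of the
    N_it recurrent predictions; d_gt: ground-truth depth."""
    loss = F.smooth_l1_loss(d_init, d_gt)  # L_init on the initial estimate
    n = len(iter_preds)
    for i, d in enumerate(iter_preds, start=1):
        # Later iterations receive exponentially larger weights
        loss = loss + gamma ** (n - i) * F.l1_loss(d, d_gt)
    return loss
```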

Benchmarking on DTU yields the following performance (error in mm; lower is better):

Method         Accuracy   Completeness   Overall
Camp (2008)    0.835      0.554          0.695
MVSNet         0.396      0.527          0.462
PatchMatchNet  0.427      0.277          0.352
CER-MVS        0.359      0.305          0.332
IGEV-MVS       0.331      0.316          0.324

5. Computational Efficiency and Ablation Analysis

IGEV-MVS is designed for high efficiency: a full 1600×1152 test sample (N=5, 32 update steps) runs in ≈0.6 s, consuming 7–8 GB GPU memory on RTX 3090 hardware. The geometry encoding U-Net stage adds only ≈20 ms overhead compared to raw correlation volume construction.

Model ablations reveal:

  • Removing the 3D U-Net (using only APC) increases error to ≈0.355 mm.
  • Omitting the soft-argmin initialization degrades accuracy by 4–6% and slows convergence.
  • Reducing the number of updates from 32 to 16 to 8 increases the error from 0.332 mm to ≈0.345 mm; even with 8 updates, performance remains competitive.

6. Integration with Classical and Hybrid Methods

Recent research explores the synergy between neural methods like IGEV-MVS and classical geometric pipelines (Orsingher et al., 2022). For instance:

  • SLAM-based initialization can provide sparse priors to reduce hypothesis search range.
  • Differentiable PatchMatch modules and geometric regularization (e.g., depth-normal consistency losses) may integrate with the IGEV framework for greater robustness in low-texture or planar regions.
  • Trainable CRF refinement and strict multi-view geometric consistency checks enhance point-cloud fusion.

This suggests future directions favor hybrid architectures leveraging both data-driven learning and explicit geometric constraints.

7. Extensions to Non-Rigid Scenes

Classical MVS assumes static scenes, but NRMVS ("Non-Rigid Multi-View Stereo") extends MVS to deformable objects by jointly optimizing per-view depth and deformation fields via deformation graphs (Innmann et al., 2019). This paradigm incorporates sparse and dense photometric consistency and as-rigid-as-possible regularization, enabling 4D reconstruction of dynamic scenes. A plausible implication is that future MVS networks, including IGEV-based designs, may benefit from explicit deformation modeling for broader applicability to real-world, non-static environments.


IGEV-MVS establishes a new baseline for efficient, accurate multi-view stereo by encoding robust geometry via cost volumes and iteratively refining estimates with recurrent neural network mechanisms. Continued integration of SLAM priors, PatchMatch refinement, and non-rigid modeling represents an active area for further research and application in 3D vision.
