MVSMamba: SSM-Driven Multi-View Stereo

Updated 10 November 2025
  • MVSMamba is a multi-view stereo network that integrates a state-space model (Mamba) to achieve efficient, linear-complexity global feature aggregation.
  • It employs a novel Dynamic Mamba module with reference-centered dynamic scanning and hierarchical, multi-scale feature aggregation for accurate depth estimation.
  • Evaluations on standard 3D reconstruction benchmarks demonstrate MVSMamba's state-of-the-art accuracy and efficiency compared to Transformer-based methods.

MVSMamba is a Multi-View Stereo (MVS) network that integrates a state space model (SSM) backbone—specifically, the Mamba architecture—into the MVS pipeline, enabling efficient global feature aggregation and omnidirectional multi-view feature interaction with linear computational complexity. MVSMamba is characterized by its novel Dynamic Mamba (DM) module with reference-centered dynamic scanning, a hierarchical multi-scale feature aggregation strategy, and its coarse-to-fine depth estimation framework. This architecture delivers state-of-the-art accuracy and efficiency on standard 3D reconstruction benchmarks, establishing Mamba-based SSMs as a compelling alternative to Transformer-based approaches in MVS (Jiang et al., 3 Nov 2025).

1. Architectural Overview and Underlying State-Space Model

MVSMamba operates on a set of $K$ calibrated images $\{I_0, \ldots, I_{K-1}\}$, with $I_0$ always designated as the reference view. The feature extraction front-end employs a standard 4-level Feature Pyramid Network (FPN) encoder, producing feature maps $F^{enc}_{k,s}$ for each view $k$ and pyramid scale $s$. The design diverges from conventional Transformer-MVS pipelines through its state-space model backbone.

Each Mamba block models a 1D feature sequence via a continuous-time linear SSM,
$$
\begin{aligned}
h'(t) &= A\,h(t) + B\,x(t), \\
y(t)  &= C\,h(t),
\end{aligned}
$$
which, after discretization and unrolling, produces an efficient sequence-wise convolution:
$$
\overline{K} = \left[CB,\; CAB,\; CA^{2}B,\; \ldots,\; CA^{N-1}B\right], \qquad y = x * \overline{K}.
$$
For a flattened input of length $L$, this yields $\mathcal{O}(L)$ computational complexity (linear in sequence length), in stark contrast to the $\mathcal{O}(L^{2})$ scaling of self-attention in Transformers. The Mamba architecture further supports content-aware, global feature mixing due to its input-dependent, token-wise SSM parameters.
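
To make the linear-complexity claim concrete, the following is a minimal, self-contained sketch (not the paper's implementation) of a time-invariant linear SSM in its recurrent form, which performs one state update per token and is therefore linear in $L$, together with the equivalent unrolled convolution kernel. Mamba itself uses input-dependent (selective) parameters and a hardware-aware parallel scan, which this toy version omits; all names and sizes below are illustrative.

```python
import torch

def ssm_scan(x, A, B, C):
    """Recurrent form h_t = A h_{t-1} + B x_t, y_t = C h_t.
    One fixed-cost state update per token, i.e. linear in sequence length L."""
    N = A.shape[0]
    h = torch.zeros(N, 1)
    y = torch.empty(len(x))
    for t, xt in enumerate(x):
        h = A @ h + B * xt
        y[t] = (C @ h).squeeze()
    return y

def ssm_kernel(A, B, C, L):
    """Equivalent convolution kernel K = [CB, CAB, CA^2 B, ..., CA^(L-1) B]."""
    K = torch.empty(L)
    A_pow = torch.eye(A.shape[0])
    for i in range(L):
        K[i] = (C @ A_pow @ B).squeeze()
        A_pow = A_pow @ A
    return K

# Toy check that the scan and the causal convolution agree (illustrative sizes).
A = 0.9 * torch.eye(4)
B = torch.ones(4, 1)
C = torch.ones(1, 4)
x = torch.randn(16)
K = ssm_kernel(A, B, C, len(x))
y_conv = torch.stack([torch.dot(K[: t + 1], x[t::-1][: t + 1]) for t in range(len(x))])
assert torch.allclose(ssm_scan(x, A, B, C), y_conv, atol=1e-4)
```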

2. Dynamic Mamba (DM) Module and Reference-Centered Dynamic Scanning

The DM-module is the centerpiece enabling cross-view and omnidirectional feature interaction at the coarsest FPN scale ($s=0$). The core procedure is as follows:

  • For each source view $k$ and scale $s=0$, pairs of reference/source features $\big(F^{enc}_{0,s}, F^{enc}_{k,s}\big)$ are concatenated in four spatial arrangements: horizontal right/left and vertical top/bottom.
  • Each concatenated map $X^{\ast}_{k,s}$ ($\ast \in \{\mathrm{hr}, \mathrm{hl}, \mathrm{vb}, \mathrm{vt}\}$) is flattened into a 1D sequence via four canonical “skip-scan” orderings (N, flipped-N, Z, flipped-Z), controlled by dynamic view-dependent offsets $(h_k, w_k)$.
  • This results in four sequences $S^{j}_{k,s} = \mathcal{R}_j\big(X^{\ast}_{k,s}; (h_k, w_k)\big)$ of length $L_s = H_s W_s / 2$ for each source $k$ and direction $j$.
  • Each sequence $S^{j}_{k,s}$ is processed by a Mamba block (1D-SSM scan), followed by an MLP+LayerNorm post-processing step: $\overline{S}^{j}_{k,s} = \hat{S}^{j}_{k,s} + \mathrm{LN}\big(\mathrm{MLP}(\hat{S}^{j}_{k,s})\big)$.
  • The four processed sequences are then inversely reshaped to recover updated reference and source features for recursive downstream processing.

By concatenating the reference to each source, dynamically arranging concatenation and scan patterns, and jointly processing all directions, the DM-module performs both inter-view (reference-source) and intra-view (self) global context aggregation in a single $\mathcal{O}(L)$ pass, achieving true omnidirectional fusion.
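
A minimal sketch of the reference-centered scanning idea, under assumed shapes and with a generic sequence mixer standing in for a Mamba block: the real DM-module additionally applies the four skip-scan orderings with view-dependent offsets $(h_k, w_k)$ and its own fusion of the directional outputs, which are simplified to flips and averaging here.

```python
import torch
import torch.nn as nn

def dynamic_scan_pair(f_ref, f_src, mixer: nn.Module):
    """Sketch of reference-centered scanning for one (reference, source) pair.

    f_ref, f_src: (C, H, W) feature maps of the reference and one source view.
    mixer:        any sequence model mapping (L, C) -> (L, C); a Mamba block
                  in the paper, a stand-in here.
    """
    C, H, W = f_ref.shape
    # Four concatenation arrangements: reference left/right, top/bottom.
    arrangements = [
        torch.cat([f_ref, f_src], dim=2),  # horizontal, reference on the left
        torch.cat([f_src, f_ref], dim=2),  # horizontal, reference on the right
        torch.cat([f_ref, f_src], dim=1),  # vertical, reference on top
        torch.cat([f_src, f_ref], dim=1),  # vertical, reference on the bottom
    ]
    updated = []
    for idx, x in enumerate(arrangements):
        c, h, w = x.shape
        seq = x.reshape(c, h * w).transpose(0, 1)  # flatten to an (L, C) sequence
        if idx % 2 == 1:                           # crude stand-in for the
            seq = seq.flip(0)                      # direction-dependent scan order
        out = mixer(seq)                           # global 1D mixing over ref+src tokens
        if idx % 2 == 1:
            out = out.flip(0)
        updated.append(out.transpose(0, 1).reshape(c, h, w))
    # Split each arrangement back into its reference/source halves and average.
    ref_parts = [updated[0][:, :, :W], updated[1][:, :, W:],
                 updated[2][:, :H, :], updated[3][:, H:, :]]
    src_parts = [updated[0][:, :, W:], updated[1][:, :, :W],
                 updated[2][:, H:, :], updated[3][:, :H, :]]
    return torch.stack(ref_parts).mean(0), torch.stack(src_parts).mean(0)

# Usage with a trivial linear layer standing in for a Mamba block.
mixer = nn.Linear(64, 64)
f_ref, f_src = torch.randn(64, 32, 40), torch.randn(64, 32, 40)
new_ref, new_src = dynamic_scan_pair(f_ref, f_src, mixer)
```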

3. Multi-Scale Feature Aggregation

MVSMamba applies a hierarchical scheme across FPN levels:

  • At $s=0$ (1/8 input resolution), the full DM-module aggregates all $K$ views.
  • At $s=1$, a “Simplified Dynamic Mamba” (SDM) module operates using the same multi-direction scan+Mamba mechanism, but restricted to individual views (no concatenation of reference and source).
  • Scales $s=2,3$ (finest resolutions) use standard 3×3 convolutions.
  • The decoder outputs $\{\overline{F}^{dec}_{k,s}\}$ are subsequently warped into the reference view at $D$ discrete depths per scale, enabling cost volume construction.

Cost volumes are fused via attention-based weighting and regularized by a compact 3D U-Net, producing a depth probability volume $P(d,h,w)$. Final per-pixel depth is estimated using softmax and winner-take-all selection. The overall design inherits coarse-to-fine regularization and resolution refinement typical of leading MVS architectures.
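
As an illustration of this final step, the sketch below (hypothetical shapes and names, not the authors' code) converts a regularized cost volume into the probability volume $P(d,h,w)$ via a softmax over depth hypotheses and then selects per-pixel depth by winner-take-all.

```python
import torch

def depth_from_cost_volume(cost, depth_hypotheses):
    """Winner-take-all depth selection from a regularized cost volume.

    cost:             (D, H, W) raw scores from the 3D U-Net (assumed shape).
    depth_hypotheses: (D,) candidate depths at this scale.
    Returns the per-pixel depth map (H, W) and the probability volume (D, H, W).
    """
    prob = torch.softmax(cost, dim=0)   # P(d, h, w)
    best = prob.argmax(dim=0)           # winner-take-all hypothesis index per pixel
    depth = depth_hypotheses[best]      # (H, W) selected depth values
    return depth, prob

# Toy usage: 32 inverse-depth hypotheses on a 64x80 coarse map.
cost = torch.randn(32, 64, 80)
hyps = torch.linspace(1.0, 0.1, 32)     # illustrative inverse-depth samples
depth, prob = depth_from_cost_volume(cost, hyps)
```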

4. Computational Complexity

Let $L = H_s W_s$ denote the number of spatial locations at FPN scale $s$. The per-view aggregation complexities compare as follows:

| Aggregation type | Complexity | Scaling in $L$ |
| --- | --- | --- |
| Self-attention | $\mathcal{O}(L^2 C)$ | Quadratic in $L$ |
| Cross-attention ($K-1$ source views) | $\mathcal{O}(K L^2 C)$ | Quadratic in $L$ |
| Mamba DM-module | $\mathcal{O}(K L C)$ | Linear in $L$ |

While Transformer-based aggregation grows rapidly with image resolution, MVSMamba's Mamba-based DM-module ensures linear scaling for both inter- and intra-view global context. This efficiency enables state-of-the-art performance using only a fraction of memory and computing resources compared to Transformer approaches.
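
A back-of-the-envelope comparison of the two scaling terms for an illustrative coarse-scale feature map; only the asymptotic forms come from the paper, the constants below are assumptions.

```python
# Back-of-the-envelope comparison of the scaling terms (illustrative constants).
H, W = 64, 80      # e.g. a 1/8-scale feature map for a 512x640 input (assumed)
C = 64             # feature channels (assumed)
K = 5              # number of views
L = H * W

cross_attention_term = K * L**2 * C    # O(K L^2 C)
mamba_dm_term        = K * L * C       # O(K L C)
print(f"L = {L}")
print(f"cross-attention term: {cross_attention_term:.2e}")
print(f"Mamba DM-module term: {mamba_dm_term:.2e}")
print(f"ratio: {cross_attention_term // mamba_dm_term}x  (= L)")
```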

5. Training Protocol and Optimization

MVSMamba is trained in a multi-stage regime:

  • Stage 1: Pretraining on the DTU dataset, 5-view inputs at $512 \times 640$, batch size 4, for 15 epochs; initial learning rate $1 \times 10^{-3}$, halved at epochs 10, 12, and 14.
  • Stage 2: Fine-tuning on BlendedMVS, 11-view inputs at $576 \times 768$, batch size 2, for 15 epochs; learning rate $5 \times 10^{-4}$, decayed at epochs 6, 8, 10, and 12.
  • Stage 3: High-resolution DTU, 5-view inputs at $1024 \times 1280$, for 10 epochs with staged learning-rate decay.

Inverse-depth hypotheses at each scale: {32, 16, 8, 4} with corresponding intervals {2, 1, 1, 0.5}; group correlation sizes {4, 4, 4, 4}.
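
These per-scale settings could be collected into a configuration mapping such as the following sketch (field names are illustrative; the values are the reported ones).

```python
# Per-scale depth-sampling configuration (field names are illustrative).
DEPTH_CONFIG = {
    # scale: number of inverse-depth hypotheses, hypothesis interval,
    #        group-correlation size
    0: {"num_hypotheses": 32, "interval": 2.0, "group_corr": 4},  # coarsest, 1/8
    1: {"num_hypotheses": 16, "interval": 1.0, "group_corr": 4},
    2: {"num_hypotheses": 8,  "interval": 1.0, "group_corr": 4},
    3: {"num_hypotheses": 4,  "interval": 0.5, "group_corr": 4},  # finest
}
```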

Loss function: At each scale, cross-entropy is applied to the predicted depth probabilities:
$$
L = \sum_{s=0}^{3} \alpha_s\, \mathrm{CrossEntropy}\left(P_s, D_{\mathrm{gt}}\right),
$$
with uniform weights $\alpha_s$ and ground-truth $D_{\mathrm{gt}}$ encoded as one-hot depth indices. Cross-entropy is empirically superior to $L_1$ depth supervision.
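
A minimal PyTorch sketch of this multi-scale cross-entropy supervision, assuming per-scale probability volumes and ground-truth hypothesis indices as inputs (shapes and names are assumptions, not the authors' code):

```python
import torch
import torch.nn.functional as F

def multi_scale_ce_loss(prob_volumes, gt_indices, weights=(1.0, 1.0, 1.0, 1.0)):
    """Sum of per-scale cross-entropy terms L = sum_s alpha_s * CE(P_s, D_gt).

    prob_volumes: list of (D_s, H_s, W_s) probability volumes, one per scale.
    gt_indices:   list of (H_s, W_s) long tensors holding the ground-truth
                  depth-hypothesis index per pixel (the one-hot target).
    """
    loss = 0.0
    for P, d_gt, alpha in zip(prob_volumes, gt_indices, weights):
        D = P.shape[0]
        # nll_loss on log-probabilities of shape (H*W, D) == cross-entropy.
        log_p = torch.log(P.clamp_min(1e-8)).reshape(D, -1).transpose(0, 1)
        loss = loss + alpha * F.nll_loss(log_p, d_gt.reshape(-1))
    return loss

# Toy usage at a single coarse scale.
P0 = torch.softmax(torch.randn(32, 64, 80), dim=0)
gt0 = torch.randint(0, 32, (64, 80))
print(multi_scale_ce_loss([P0], [gt0], weights=(1.0,)))
```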

Optimizer: Adam ($\beta_1 = 0.9$, $\beta_2 = 0.999$), weight decay $1 \times 10^{-4}$, standard augmentation (horizontal flip, color jitter), and gradient clipping.
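
These settings map naturally onto a standard PyTorch Adam plus MultiStepLR setup; a sketch of the Stage-1 schedule is below, where `model` is a placeholder for the MVSMamba network and the gradient-clipping norm is an assumed value, since it is not specified here.

```python
import torch

model = torch.nn.Linear(8, 8)  # placeholder for the MVSMamba network
optimizer = torch.optim.Adam(
    model.parameters(), lr=1e-3, betas=(0.9, 0.999), weight_decay=1e-4
)
# Stage 1: halve the learning rate at epochs 10, 12, and 14 of 15.
scheduler = torch.optim.lr_scheduler.MultiStepLR(
    optimizer, milestones=[10, 12, 14], gamma=0.5
)

for epoch in range(15):
    # Dummy training step; in practice this loops over DTU batches and uses
    # the multi-scale cross-entropy loss described above.
    loss = model(torch.randn(4, 8)).pow(2).mean()
    optimizer.zero_grad()
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)  # clip value assumed
    optimizer.step()
    scheduler.step()  # once per epoch
```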

6. Performance Benchmarks and Ablation

Quantitative results (DTU dataset, 5 views, $832 \times 1152$ inputs):

| Model variant | Overall (mm) | Accuracy (mm) | Completeness (mm) | GPU memory (GB) | Runtime (s) | Params (M) |
| --- | --- | --- | --- | --- | --- | --- |
| Low-res MVSMamba | 0.287 | 0.314 | 0.260 | 2.82 | 0.11 | 1.31 |
| High-res MVSMamba* | 0.280 | | | | | |

MVSMamba surpasses all Transformer-based MVS methods in the accuracy-efficiency trade-off. On Tanks-and-Temples (21 views, 2K resolution), MVSMamba posts F-scores of 67.67% (intermediate mean) and 43.32% (advanced mean), outperforming all CNN- and Transformer-based baselines.

Ablation Studies: Removing the DM-module degrades the DTU overall metric from 0.287 to 0.295 mm; omitting SDM yields 0.289 mm; removing the MLP head leads to 0.293 mm. Comparisons to deformable CNNs, FMT, ET, VMamba, EVMamba, and JamMa confirm the superiority of the reference-centered dynamic DM scan. Using DM at $s=0$ and SDM at $s=1$ is justified; further ablations on concatenation, weight-sharing, and feature arrangement show the necessity of independent, dynamically scanned Mamba blocks for optimal fusion.

7. Significance within MVS and SSM Research

MVSMamba is the first network to integrate an S4-derived, content-aware SSM (Mamba) for multi-view feature aggregation in MVS, dramatically reducing computational burden without sacrificing global context or accuracy. Unlike previous approaches reliant on explicit self-attention or hand-crafted cost aggregation, MVSMamba's SSM kernel efficiently absorbs information across spatial locations and views, demonstrating both hardware and sample efficiency at large input sizes. Its omnidirectional, reference-centered dynamic scanning strategy corrects for directional bias and leverages redundancy across multiple views, a property not previously exploited in SSM-based or Transformer-based MVS.

The overall design sets a new state-of-the-art for efficiency and accuracy in reconstructing dense 3D geometry from multi-view images and is extensible to higher-resolution, multi-scale, and multi-source (e.g., multi-camera, multi-modal) scenarios. Its architecture informs future directions for linear-complexity, globally aware visual inference, broadening the reach of SSMs within the domain of 3D computer vision.
