MVSMamba: SSM-Driven Multi-View Stereo

Updated 10 November 2025
  • MVSMamba is a multi-view stereo network that integrates a state-space model (Mamba) to achieve efficient, linear-complexity global feature aggregation.
  • It employs a novel Dynamic Mamba module with reference-centered dynamic scanning and hierarchical, multi-scale feature aggregation for accurate depth estimation.
  • Evaluations on standard 3D reconstruction benchmarks demonstrate MVSMamba's state-of-the-art accuracy and efficiency compared to Transformer-based methods.

MVSMamba is a Multi-View Stereo (MVS) network that integrates a state space model (SSM) backbone—specifically, the Mamba architecture—into the MVS pipeline, enabling efficient global feature aggregation and omnidirectional multi-view feature interaction with linear computational complexity. MVSMamba is characterized by its novel Dynamic Mamba (DM) module with reference-centered dynamic scanning, a hierarchical multi-scale feature aggregation strategy, and its coarse-to-fine depth estimation framework. This architecture delivers state-of-the-art accuracy and efficiency on standard 3D reconstruction benchmarks, establishing Mamba-based SSMs as a compelling alternative to Transformer-based approaches in MVS (Jiang et al., 3 Nov 2025).

1. Architectural Overview and Underlying State-Space Model

MVSMamba operates on a set of $K$ calibrated images $\{I_0, \ldots, I_{K-1}\}$, with $I_0$ always designated as the reference view. The feature extraction front-end employs a standard 4-level Feature Pyramid Network (FPN) encoder, producing feature maps $F^{enc}_{k,s}$ for each view $k$ and pyramid scale $s$. The design diverges from conventional Transformer-MVS pipelines through its state-space model backbone.

Each Mamba block models a 1D feature sequence via a continuous-time linear SSM,
$$
\begin{aligned}
h'(t) &= A\,h(t) + B\,x(t), \\
y(t)  &= C\,h(t),
\end{aligned}
$$
which, after discretization and unrolling, produces an efficient sequence-wise convolution:
$$
\overline{K} = \left[CB,\; CAB,\; CA^{2}B,\; \ldots,\; CA^{N-1}B\right], \qquad y = x * \overline{K}.
$$
For a flattened input of length $L$, this yields $\mathcal{O}(L)$ computational complexity (linear in sequence length), in stark contrast to the $\mathcal{O}(L^{2})$ scaling of self-attention in Transformers. The Mamba architecture further supports content-aware, global feature mixing due to its input-dependent, token-wise SSM parameters.
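
To make the linear-complexity claim concrete, the following is a minimal, self-contained sketch (not the paper's implementation) of a time-invariant linear SSM in its recurrent form, which performs one state update per token and is therefore linear in $L$, together with the equivalent unrolled convolution kernel. Mamba itself uses input-dependent (selective) parameters and a hardware-aware parallel scan, which this toy version omits; all names and sizes below are illustrative.

```python
import torch

def ssm_scan(x, A, B, C):
    """Recurrent form h_t = A h_{t-1} + B x_t, y_t = C h_t.
    One fixed-cost state update per token, i.e. linear in sequence length L."""
    N = A.shape[0]
    h = torch.zeros(N, 1)
    y = torch.empty(len(x))
    for t, xt in enumerate(x):
        h = A @ h + B * xt
        y[t] = (C @ h).squeeze()
    return y

def ssm_kernel(A, B, C, L):
    """Equivalent convolution kernel K = [CB, CAB, CA^2 B, ..., CA^(L-1) B]."""
    K = torch.empty(L)
    A_pow = torch.eye(A.shape[0])
    for i in range(L):
        K[i] = (C @ A_pow @ B).squeeze()
        A_pow = A_pow @ A
    return K

# Toy check that the scan and the causal convolution agree (illustrative sizes).
A = 0.9 * torch.eye(4)
B = torch.ones(4, 1)
C = torch.ones(1, 4)
x = torch.randn(16)
K = ssm_kernel(A, B, C, len(x))
y_conv = torch.stack([torch.dot(K[: t + 1], x[t::-1][: t + 1]) for t in range(len(x))])
assert torch.allclose(ssm_scan(x, A, B, C), y_conv, atol=1e-4)
```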

2. Dynamic Mamba (DM) Module and Reference-Centered Dynamic Scanning

The DM-module is the centerpiece enabling cross-view and omnidirectional feature interaction at the coarsest FPN scale ($s=0$). The core procedure is as follows:

  • For each source view $k$ and scale $s=0$, pairs of reference/source features $\big(F^{enc}_{0,s}, F^{enc}_{k,s}\big)$ are concatenated in four spatial arrangements: horizontal right/left and vertical top/bottom.
  • Each concatenated map $X^{\ast}_{k,s}$ ($\ast \in \{\mathrm{hr}, \mathrm{hl}, \mathrm{vb}, \mathrm{vt}\}$) is flattened into a 1D sequence via four canonical “skip-scan” orderings (N, flipped-N, Z, flipped-Z), controlled by dynamic view-dependent offsets $(h_k, w_k)$.
  • This results in four sequences $S^{j}_{k,s} = \mathcal{R}_j\big(X^{\ast}_{k,s}; (h_k, w_k)\big)$ of length $L_s = H_s W_s / 2$ for each source $k$ and direction $j$.
  • Each sequence $S^{j}_{k,s}$ is processed by a Mamba block (1D-SSM scan), followed by an MLP+LayerNorm post-processing step: $\overline{S}^{j}_{k,s} = \hat{S}^{j}_{k,s} + \mathrm{LN}\big(\mathrm{MLP}(\hat{S}^{j}_{k,s})\big)$.
  • The four processed sequences are then inversely reshaped to recover updated reference and source features for recursive downstream processing.

By concatenating the reference to each source, dynamically arranging concatenation and scan patterns, and jointly processing all directions, the DM-module performs both inter-view (reference-source) and intra-view (self) global context aggregation in a single $\mathcal{O}(L)$ pass, achieving true omnidirectional fusion.
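
A minimal sketch of the reference-centered scanning idea, under assumed shapes and with a generic sequence mixer standing in for a Mamba block: the real DM-module additionally applies the four skip-scan orderings with view-dependent offsets $(h_k, w_k)$ and its own fusion of the directional outputs, which are simplified to flips and averaging here.

```python
import torch
import torch.nn as nn

def dynamic_scan_pair(f_ref, f_src, mixer: nn.Module):
    """Sketch of reference-centered scanning for one (reference, source) pair.

    f_ref, f_src: (C, H, W) feature maps of the reference and one source view.
    mixer:        any sequence model mapping (L, C) -> (L, C); a Mamba block
                  in the paper, a stand-in here.
    """
    C, H, W = f_ref.shape
    # Four concatenation arrangements: reference left/right, top/bottom.
    arrangements = [
        torch.cat([f_ref, f_src], dim=2),  # horizontal, reference on the left
        torch.cat([f_src, f_ref], dim=2),  # horizontal, reference on the right
        torch.cat([f_ref, f_src], dim=1),  # vertical, reference on top
        torch.cat([f_src, f_ref], dim=1),  # vertical, reference on the bottom
    ]
    updated = []
    for idx, x in enumerate(arrangements):
        c, h, w = x.shape
        seq = x.reshape(c, h * w).transpose(0, 1)  # flatten to an (L, C) sequence
        if idx % 2 == 1:                           # crude stand-in for the
            seq = seq.flip(0)                      # direction-dependent scan order
        out = mixer(seq)                           # global 1D mixing over ref+src tokens
        if idx % 2 == 1:
            out = out.flip(0)
        updated.append(out.transpose(0, 1).reshape(c, h, w))
    # Split each arrangement back into its reference/source halves and average.
    ref_parts = [updated[0][:, :, :W], updated[1][:, :, W:],
                 updated[2][:, :H, :], updated[3][:, H:, :]]
    src_parts = [updated[0][:, :, W:], updated[1][:, :, :W],
                 updated[2][:, H:, :], updated[3][:, :H, :]]
    return torch.stack(ref_parts).mean(0), torch.stack(src_parts).mean(0)

# Usage with a trivial linear layer standing in for a Mamba block.
mixer = nn.Linear(64, 64)
f_ref, f_src = torch.randn(64, 32, 40), torch.randn(64, 32, 40)
new_ref, new_src = dynamic_scan_pair(f_ref, f_src, mixer)
```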

3. Multi-Scale Feature Aggregation

MVSMamba applies a hierarchical scheme across FPN levels:

  • At $s=0$ (1/8 input resolution), the full DM-module aggregates all $K$ views.
  • At $s=1$, a “Simplified Dynamic Mamba” (SDM) module operates using the same multi-direction scan+Mamba mechanism, but restricted to individual views (no concatenation of reference and source).
  • Scales $s=2,3$ (finest resolutions) use standard 3×3 convolutions.
  • The decoder outputs $\{\overline{F}^{dec}_{k,s}\}$ are subsequently warped into the reference view at $D$ discrete depths per scale, enabling cost volume construction.

Cost volumes are fused via attention-based weighting and regularized by a compact 3D U-Net, producing a depth probability volume $P(d,h,w)$. Final per-pixel depth is estimated using softmax and winner-take-all selection. The overall design inherits coarse-to-fine regularization and resolution refinement typical of leading MVS architectures.
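
As an illustration of this final step, the sketch below (hypothetical shapes and names, not the authors' code) converts a regularized cost volume into the probability volume $P(d,h,w)$ via a softmax over depth hypotheses and then selects per-pixel depth by winner-take-all.

```python
import torch

def depth_from_cost_volume(cost, depth_hypotheses):
    """Winner-take-all depth selection from a regularized cost volume.

    cost:             (D, H, W) raw scores from the 3D U-Net (assumed shape).
    depth_hypotheses: (D,) candidate depths at this scale.
    Returns the per-pixel depth map (H, W) and the probability volume (D, H, W).
    """
    prob = torch.softmax(cost, dim=0)   # P(d, h, w)
    best = prob.argmax(dim=0)           # winner-take-all hypothesis index per pixel
    depth = depth_hypotheses[best]      # (H, W) selected depth values
    return depth, prob

# Toy usage: 32 inverse-depth hypotheses on a 64x80 coarse map.
cost = torch.randn(32, 64, 80)
hyps = torch.linspace(1.0, 0.1, 32)     # illustrative inverse-depth samples
depth, prob = depth_from_cost_volume(cost, hyps)
```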

4. Computational Complexity

Let $L = H_s W_s$ denote the number of spatial locations at FPN scale $s$. The per-view aggregation complexities compare as follows:

| Aggregation type | Complexity | Scaling in $L$ |
| --- | --- | --- |
| Self-attention | $\mathcal{O}(L^2 C)$ | Quadratic in $L$ |
| Cross-attention ($K-1$ source views) | $\mathcal{O}(K L^2 C)$ | Quadratic in $L$ |
| Mamba DM-module | $\mathcal{O}(K L C)$ | Linear in $L$ |

While Transformer-based aggregation grows rapidly with image resolution, MVSMamba's Mamba-based DM-module ensures linear scaling for both inter- and intra-view global context. This efficiency enables state-of-the-art performance using only a fraction of memory and computing resources compared to Transformer approaches.
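
A back-of-the-envelope comparison of the two scaling terms for an illustrative coarse-scale feature map; only the asymptotic forms come from the paper, the constants below are assumptions.

```python
# Back-of-the-envelope comparison of the scaling terms (illustrative constants).
H, W = 64, 80      # e.g. a 1/8-scale feature map for a 512x640 input (assumed)
C = 64             # feature channels (assumed)
K = 5              # number of views
L = H * W

cross_attention_term = K * L**2 * C    # O(K L^2 C)
mamba_dm_term        = K * L * C       # O(K L C)
print(f"L = {L}")
print(f"cross-attention term: {cross_attention_term:.2e}")
print(f"Mamba DM-module term: {mamba_dm_term:.2e}")
print(f"ratio: {cross_attention_term // mamba_dm_term}x  (= L)")
```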

5. Training Protocol and Optimization

MVSMamba is trained in a multi-stage regime:

  • Stage 1: Pretraining on the DTU dataset, 5-view inputs at $512 \times 640$, batch size 4, for 15 epochs; initial learning rate $1 \times 10^{-3}$, halved at epochs 10, 12, and 14.
  • Stage 2: Fine-tuning on BlendedMVS, 11-view inputs at $576 \times 768$, batch size 2, for 15 epochs; learning rate $5 \times 10^{-4}$, decayed at epochs 6, 8, 10, and 12.
  • Stage 3: High-resolution DTU, 5-view inputs at $1024 \times 1280$, for 10 epochs with staged learning-rate decay.

Inverse-depth hypotheses at each scale: {32, 16, 8, 4} with corresponding intervals {2, 1, 1, 0.5}; group correlation sizes {4, 4, 4, 4}.
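
These per-scale settings could be collected into a configuration mapping such as the following sketch (field names are illustrative; the values are the reported ones).

```python
# Per-scale depth-sampling configuration (field names are illustrative).
DEPTH_CONFIG = {
    # scale: number of inverse-depth hypotheses, hypothesis interval,
    #        group-correlation size
    0: {"num_hypotheses": 32, "interval": 2.0, "group_corr": 4},  # coarsest, 1/8
    1: {"num_hypotheses": 16, "interval": 1.0, "group_corr": 4},
    2: {"num_hypotheses": 8,  "interval": 1.0, "group_corr": 4},
    3: {"num_hypotheses": 4,  "interval": 0.5, "group_corr": 4},  # finest
}
```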

Loss function: At each scale, cross-entropy is applied to the predicted depth probabilities:
$$
L = \sum_{s=0}^{3} \alpha_s\, \mathrm{CrossEntropy}\left(P_s, D_{\mathrm{gt}}\right),
$$
with uniform weights $\alpha_s$ and ground-truth $D_{\mathrm{gt}}$ encoded as one-hot depth indices. Cross-entropy is empirically superior to $L_1$ depth supervision.
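
A minimal PyTorch sketch of this multi-scale cross-entropy supervision, assuming per-scale probability volumes and ground-truth hypothesis indices as inputs (shapes and names are assumptions, not the authors' code):

```python
import torch
import torch.nn.functional as F

def multi_scale_ce_loss(prob_volumes, gt_indices, weights=(1.0, 1.0, 1.0, 1.0)):
    """Sum of per-scale cross-entropy terms L = sum_s alpha_s * CE(P_s, D_gt).

    prob_volumes: list of (D_s, H_s, W_s) probability volumes, one per scale.
    gt_indices:   list of (H_s, W_s) long tensors holding the ground-truth
                  depth-hypothesis index per pixel (the one-hot target).
    """
    loss = 0.0
    for P, d_gt, alpha in zip(prob_volumes, gt_indices, weights):
        D = P.shape[0]
        # nll_loss on log-probabilities of shape (H*W, D) == cross-entropy.
        log_p = torch.log(P.clamp_min(1e-8)).reshape(D, -1).transpose(0, 1)
        loss = loss + alpha * F.nll_loss(log_p, d_gt.reshape(-1))
    return loss

# Toy usage at a single coarse scale.
P0 = torch.softmax(torch.randn(32, 64, 80), dim=0)
gt0 = torch.randint(0, 32, (64, 80))
print(multi_scale_ce_loss([P0], [gt0], weights=(1.0,)))
```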

Optimizer: Adam ($\beta_1 = 0.9$, $\beta_2 = 0.999$), weight decay $1 \times 10^{-4}$, standard augmentation (horizontal flip, color jitter), and gradient clipping.
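
These settings map naturally onto a standard PyTorch Adam plus MultiStepLR setup; a sketch of the Stage-1 schedule is below, where `model` is a placeholder for the MVSMamba network and the gradient-clipping norm is an assumed value, since it is not specified here.

```python
import torch

model = torch.nn.Linear(8, 8)  # placeholder for the MVSMamba network
optimizer = torch.optim.Adam(
    model.parameters(), lr=1e-3, betas=(0.9, 0.999), weight_decay=1e-4
)
# Stage 1: halve the learning rate at epochs 10, 12, and 14 of 15.
scheduler = torch.optim.lr_scheduler.MultiStepLR(
    optimizer, milestones=[10, 12, 14], gamma=0.5
)

for epoch in range(15):
    # Dummy training step; in practice this loops over DTU batches and uses
    # the multi-scale cross-entropy loss described above.
    loss = model(torch.randn(4, 8)).pow(2).mean()
    optimizer.zero_grad()
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)  # clip value assumed
    optimizer.step()
    scheduler.step()  # once per epoch
```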

6. Performance Benchmarks and Ablation

Quantitative results (DTU dataset, 5 views, $832 \times 1152$ inputs):

| Model variant | Overall (mm) | Accuracy (mm) | Completeness (mm) | GPU memory (GB) | Runtime (s) | Params (M) |
| --- | --- | --- | --- | --- | --- | --- |
| Low-res MVSMamba | 0.287 | 0.314 | 0.260 | 2.82 | 0.11 | 1.31 |
| High-res MVSMamba* | 0.280 | | | | | |

MVSMamba surpasses all Transformer-based MVS methods in the accuracy-efficiency trade-off. On Tanks-and-Temples (21 views, 2K resolution), MVSMamba posts F-scores of 67.67% (intermediate mean) and 43.32% (advanced mean), outperforming all CNN- and Transformer-based baselines.

Ablation Studies: Removing the DM-module degrades the DTU overall metric from 0.287 to 0.295 mm; omitting SDM yields 0.289 mm; removing the MLP head leads to 0.293 mm. Comparisons to deformable CNNs, FMT, ET, VMamba, EVMamba, and JamMa confirm the superiority of the reference-centered dynamic DM scan. Using DM at $s=0$ and SDM at $s=1$ is justified; further ablations on concatenation, weight-sharing, and feature arrangement show the necessity of independent, dynamically scanned Mamba blocks for optimal fusion.

7. Significance within MVS and SSM Research

MVSMamba is the first network to integrate an S4-derived, content-aware SSM (Mamba) for multi-view feature aggregation in MVS, dramatically reducing computational burden without sacrificing global context or accuracy. Unlike previous approaches reliant on explicit self-attention or hand-crafted cost aggregation, MVSMamba's SSM kernel efficiently absorbs information across spatial locations and views, demonstrating both hardware and sample efficiency at large input sizes. Its omnidirectional, reference-centered dynamic scanning strategy corrects for directional bias and leverages redundancy across multiple views, a property not previously exploited in SSM-based or Transformer-based MVS.

The overall design sets a new state-of-the-art for efficiency and accuracy in reconstructing dense 3D geometry from multi-view images and is extensible to higher-resolution, multi-scale, and multi-source (e.g., multi-camera, multi-modal) scenarios. Its architecture informs future directions for linear-complexity, globally aware visual inference, broadening the reach of SSMs within the domain of 3D computer vision.
