
Surface Vision Mamba (SiM) Overview

Updated 22 February 2026
  • Surface Vision Mamba (SiM) is a vision architecture that uses Selective Structured State Space Models to capture long-range dependencies in high-resolution surface data with reduced computational complexity.
  • SiM integrates spherical and planar adaptations, employing specialized neuroimaging models and multi-directional scans to address tasks in cortical mapping, remote sensing, and crack segmentation.
  • SiM achieves significant efficiency gains, demonstrating up to 4.8× faster inference and over 90% reduction in GPU memory use compared to traditional attention-based models.

Surface Vision Mamba (SiM) is a class of vision architectures that leverage Selective Structured State Space Models (S4/S6 SSMs) in place of traditional attention-based modules, enabling efficient modeling of long-range dependencies in surface data modalities. SiM achieves linear or sub-quadratic computational complexity, facilitating scalable processing of high-resolution cortical, remote-sensing, or natural surface data. The framework encompasses both specialized spherical-manifold models for neuroimaging and general-purpose 2D/3D surface analysis in Earth observation, crack detection, and related fields (He et al., 24 Jan 2025; Chen et al., 2024; Liu et al., 2024; Rahman et al., 2024).

1. Mathematical Foundations and Core Architecture

SiM inherits from the Mamba family, employing state-space models defined by the continuous-time ODE

\frac{d}{dt}h(t) = A\,h(t) + B\,u(t), \quad y(t) = C\,h(t),

where $h(t)\in\mathbb{R}^N$ is the latent state, $u(t)\in\mathbb{R}$ the input, and $y(t)\in\mathbb{R}$ the output. Discretization via zero-order hold yields the updates

h_t = A_d\,h_{t-1} + B_d\,u_t, \quad y_t = C\,h_t,

with $A_d = e^{A\Delta}$ and $B_d = (A\Delta)^{-1}(e^{A\Delta}-I)\,\Delta B$. Here, $A, B, C$ are learned matrices, and $\Delta$ is a tunable step size.
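The discretization and recurrence above can be checked numerically. A minimal NumPy sketch, assuming a diagonal state matrix (standard practice in S4D/Mamba-style models); function names are illustrative, not from the papers:

```python
import numpy as np

def zoh_discretize(a, b, delta):
    # Zero-order-hold for diagonal A: a, b are length-N vectors.
    # A_d = exp(delta * a);  B_d = (exp(delta * a) - 1) / a * b  (elementwise)
    a_d = np.exp(delta * a)
    b_d = (a_d - 1.0) / a * b
    return a_d, b_d

def ssm_scan(a_d, b_d, c, u):
    # Sequential recurrence: h_t = A_d h_{t-1} + B_d u_t, y_t = <c, h_t>
    h, ys = np.zeros_like(a_d), []
    for ut in u:
        h = a_d * h + b_d * ut
        ys.append(float(c @ h))
    return np.array(ys)
```

For a constant input and stable $a<0$, the output rises monotonically toward a steady state, as expected of a leaky-integrator state space.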

The SSM update can be expressed as a 1D convolution, $y = u * K$, with kernel $K = (CB_d,\; CA_dB_d,\; CA_d^2B_d,\; \dots)$, enabling efficient, global dependency capture over long input sequences. For bidirectional modeling, SiM layers perform both forward and backward scans, summing their outputs and applying standard normalization, residual, and MLP stages (He et al., 24 Jan 2025, Rahman et al., 2024). The Selective SSM (S6) variant further makes $B$, $C$, and the step size $\Delta$ input-dependent and time-varying via shallow MLPs, introducing content-awareness (Liu et al., 2024).
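The convolution form and the recurrence are numerically identical; a short NumPy sketch demonstrating the equivalence (names are illustrative):

```python
import numpy as np

def ssm_kernel(Ad, Bd, C, L):
    # Unrolled impulse response: K_k = C Ad^k Bd for k = 0..L-1.
    K, M = [], Bd.copy()
    for _ in range(L):
        K.append(float(C @ M))
        M = Ad @ M
    return np.array(K)

def ssm_conv(u, K):
    # Causal 1D convolution: y_t = sum_k K_k u_{t-k}.
    return np.convolve(u, K)[: len(u)]

def ssm_recurrence(Ad, Bd, C, u):
    # Reference sequential scan for comparison.
    h, ys = np.zeros_like(Bd), []
    for ut in u:
        h = Ad @ h + Bd * ut
        ys.append(float(C @ h))
    return np.array(ys)
```

Training-time implementations prefer the convolutional (or parallel-scan) form for throughput; inference uses the constant-memory recurrence.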

2. Surface Domain Adaptations

Spherical Surface Representation

For neuroimaging, cortical hemispheres are represented as subdivided icospheres with genus-zero connectivity. Triangular patches are formed by grouping neighboring vertices, yielding $20\cdot 4^{k}$ triangular faces for an Ico-$k$ mesh (5120 at Ico-4), each patch covering a small, fixed set of vertices. Features per vertex (e.g., curvature, sulcal depth, thickness, myelination) are projected to fixed-dimensional patch embeddings via a learnable projection matrix. Canonical patch ordering is applied per hemisphere, with a global “class token” interposed for global context, and a 1D positional encoding is added to the initial patch sequence (He et al., 24 Jan 2025).
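The mesh sizes follow directly from icosahedral subdivision (each subdivision splits every triangle into four); a small sketch of the counting, with the caveat that the exact patch-grouping scheme is model-specific:

```python
def icosphere_counts(level):
    # A level-k icosphere has 20 * 4**k triangular faces and,
    # by Euler's formula for a genus-0 mesh, 10 * 4**k + 2 vertices.
    faces = 20 * 4 ** level
    vertices = 10 * 4 ** level + 2
    return faces, vertices
```

For example, Ico-4 gives 5120 faces and 2562 vertices, and Ico-6 gives the 40962-vertex grids commonly used for dHCP cortical data.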

2D Patch and Multi-Directional Scan

In remote sensing, industrial crack segmentation, and general planar images, non-overlapping square patches are extracted from the input image and linearized into a sequence. Four main scan orders (↘, ↙, ↗, ↖) reorder patches for diversity in receptive field (Select-Scan). Each directional scan feeds into the SSM block, the outputs are merged, and spatial topology is preserved through scanning and reassembly (Chen et al., 2024, Liu et al., 2024).
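The scan-and-merge step can be sketched with permutation indices. A minimal NumPy version, assuming the four directions correspond to row-major, column-major, and their reverses (a common choice; the papers' exact orderings may differ):

```python
import numpy as np

def four_way_scans(h, w):
    # Four orderings over an h x w patch grid: row-major, column-major,
    # and the reverse of each.
    idx = np.arange(h * w).reshape(h, w)
    rowwise = idx.reshape(-1)
    colwise = idx.T.reshape(-1)
    return [rowwise, rowwise[::-1], colwise, colwise[::-1]]

def scan_and_merge(tokens, h, w, block_fn):
    # Gather tokens along each ordering, apply the (stand-in) SSM block,
    # scatter results back to the original layout, and sum the directions.
    out = np.zeros_like(tokens, dtype=float)
    for order in four_way_scans(h, w):
        inv = np.argsort(order)              # inverse permutation
        out += np.asarray(block_fn(tokens[order]))[inv]
    return out
```

Because each ordering is undone by its inverse permutation before merging, the output stays aligned with the 2D patch grid regardless of scan direction.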

Hardware-Aware Implementation

SiM exploits two primary implementation modes: a batched, parallel “convolution” mode for training and a memory-conserving, sequential scan for inference. Sequences are segmented into SRAM-fitting windows for CUDA efficiency. Dynamic adaptation of the SSM parameters and step sizes $\Delta$ can be realized through MLPs over the input tokens (Liu et al., 2024).
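The input-dependent step size is the core of the selective (S6) mechanism. A single-channel NumPy sketch, assuming a softplus parameterization of $\Delta_t$ as in Mamba-style models (parameter names here are illustrative):

```python
import numpy as np

def selective_scan(u, a=-1.0, b=1.0, c=1.0, w=0.0, bias=0.0):
    # S6-style scalar recurrence: delta_t = softplus(w*u_t + bias) depends
    # on the token, so the effective decay exp(delta_t * a) is content-aware.
    h, ys = 0.0, []
    for ut in u:
        delta = np.log1p(np.exp(w * ut + bias))   # softplus, always > 0
        a_d = np.exp(delta * a)                   # scalar ZOH for A
        b_d = (a_d - 1.0) / a * b                 # scalar ZOH for B
        h = a_d * h + b_d * ut
        ys.append(c * h)
    return np.array(ys)
```

With `w = 0` this collapses to a fixed-step SSM; nonzero `w` lets the model shorten or lengthen its memory per token.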

3. Computational Complexity and Efficiency

Standard attention incurs $O(L^2)$ computational and memory complexity in the sequence length $L$; SiM’s SSM/S4 modules reduce this to $O(L)$ or $O(L\log L)$ using parallel scan algorithms or Fourier-domain convolution. In the SiM neuroimaging setting (Ico-4 input), SiM achieves:

  • Inference speed: up to 4.8× faster than the Surface Vision Transformer (SiT)
  • Peak GPU memory: over 90% lower than SiT

For surface crack segmentation (VMamba-UNet), parameter and floating-point operation (FLOP) counts drop substantially:

  • Markedly fewer parameters than CNN/Transformer comparators
  • Fewer multiply-accumulate operations (MACs) at a given input size, with FLOP savings growing at higher resolutions (Chen et al., 2024, Rahman et al., 2024)

These gains make SiM practical for high-resolution, high-throughput vision tasks that are infeasible for Transformer-based networks on contemporary hardware.

4. Practical Instantiations and Benchmark Performance

Cortical Surface Neuroimaging

Using targets such as postmenstrual age (PMA) and Bayley-III language/motor scores on neonatal brain surfaces (dHCP dataset, Ico-4), SiM is trained with mean squared error (MSE) loss and evaluated via mean absolute error (MAE) and MSE. Sensitivity analyses (feature-zeroing per patch/channel) highlight developmentally relevant cortical regions, supporting biological interpretability (He et al., 24 Jan 2025). Model scales include:

  • SiM-Tiny: the smallest configuration (fewest layers, narrowest embedding)
  • SiM-Small: an intermediate configuration
  • SiM-Base: the largest configuration (most layers and parameters)
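The feature-zeroing sensitivity analysis is simple to sketch: occlude one patch at a time and measure the shift in the scalar prediction. A hedged NumPy version (the `model` callable and array shapes are hypothetical stand-ins for the trained regressor):

```python
import numpy as np

def patch_sensitivity(model, x):
    # Feature-zeroing sensitivity: zero out one patch at a time and record
    # the absolute change in the scalar prediction (e.g., predicted PMA).
    # x: (num_patches, features); model: callable mapping x -> float.
    base = model(x)
    scores = np.zeros(len(x))
    for i in range(len(x)):
        x_occ = x.copy()
        x_occ[i] = 0.0                 # occlude patch i
        scores[i] = abs(model(x_occ) - base)
    return scores
```

High-scoring patches are the cortical regions whose features the model relies on most, which is how the developmental relevance maps are produced.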

Surface Crack Segmentation

VMamba-UNet, the SiM crack-segmentation pipeline, achieves:

  • Mean Dice score (mDS) improvements over CNN baselines on Crack500, Ozgenel, and MC448
  • Comparable mDS to Transformer baselines at much lower computation and parameter count

The segmentation head predicts crack probability maps via linear projection. Training uses Dice loss with AdamW and extensive augmentation on datasets such as Crack500 and Ozgenel (Chen et al., 2024).
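The soft Dice loss used here has a compact form; a minimal NumPy sketch (batch handling and class weighting omitted):

```python
import numpy as np

def dice_loss(probs, targets, eps=1e-6):
    # Soft Dice loss between predicted probability maps and binary masks:
    # 1 - 2|P ∩ T| / (|P| + |T|), with eps for numerical stability.
    p, t = probs.reshape(-1), targets.reshape(-1)
    inter = (p * t).sum()
    return 1.0 - (2.0 * inter + eps) / (p.sum() + t.sum() + eps)
```

Dice loss is preferred over plain cross-entropy for cracks because the positive class (crack pixels) is a tiny fraction of the image, and the overlap ratio is insensitive to that imbalance.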

Remote Sensing

SiM variants such as Pan-Mamba, RSMamba, and Samba achieve state-of-the-art results on pan-sharpening, image classification, and segmentation benchmarks:

Task | Dataset(s) | SiM metric(s) | Comparator
Pan-sharpening | WorldView-II | PSNR, SSIM | MTF-GLP
Image classification | UCMerced, AID, NWPU | OA, Kappa | ResNet50
Change detection | LEVIR-CD, WHU | F1, IoU | SNUNet
Semantic segmentation | ISPRS Potsdam, Vaihingen | mIoU | DeepLabV3+

On each benchmark, the Mamba variant matches or exceeds its comparator on the listed metrics.

(Liu et al., 2024).

5. Model Variants, Hyperparameter Choices, and Training Protocols

Key design and optimization choices for SiM include:

  • State dimensionality $N$ balances expressivity vs. efficiency and, together with model width, controls the SSM compute load
  • Discretization step size $\Delta$ is typically constrained to a bounded positive range for stable memory influence
  • 1D/2D scan pattern selection: bidirectional, multi-directional (four-way for 2D grids), and windowed local scans trade performance against latency
  • 4-stage hierarchical backbones are recommended for segmentation and patch-level classification
  • In hybrid models, 3×3 convolutions before/after Mamba blocks increase local feature sensitivity

Optimization follows standard practice with AdamW, cosine or linear learning-rate decay, dropout/stochastic depth, label smoothing, and aggressive data augmentation (He et al., 24 Jan 2025, Liu et al., 2024, Rahman et al., 2024).
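The cosine schedule referenced above is easy to state explicitly; a small sketch with optional linear warmup (the hyperparameter names are generic, not taken from the papers):

```python
import math

def cosine_lr(step, total_steps, base_lr, warmup_steps=0, min_lr=0.0):
    # Linear warmup to base_lr, then cosine decay to min_lr, as commonly
    # paired with AdamW for Mamba/ViT-style backbones.
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    t = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * t))
```

Called once per optimizer step, this reproduces the smooth rise-then-decay curve used in these training protocols.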

6. Limitations and Future Directions

Notable challenges for SiM include:

  • Optimal scan ordering is empirical and task-dependent; learnable or data-adaptive scanning may enhance spatial structural preservation
  • Pre-trained SiM backbone diversity lags behind CNNs and Transformers; large-scale pretraining could improve transfer performance
  • Interpretability remains an open issue as S4 kernels do not yield straightforward input–output attention maps analogous to self-attention
  • Robustness to adversarial perturbations is underexplored; targeted regularization of SSM parameters may alleviate some vulnerabilities
  • Extending SiM to multi-dimensional data (e.g., volumetric, video) necessitates devising efficient multi-axis SSM or scan schemes

Ongoing work is expected to address these aspects and further establish SiM as a foundation for scalable, context-rich surface vision modeling (Rahman et al., 2024, Liu et al., 2024).
