Surface Vision Mamba (SiM) Overview
- Surface Vision Mamba (SiM) is a vision architecture that uses Selective Structured State Space Models to capture long-range dependencies in high-resolution surface data with reduced computational complexity.
- SiM integrates spherical and planar adaptations, employing specialized neuroimaging models and multi-directional scans to address tasks in cortical mapping, remote sensing, and crack segmentation.
- SiM achieves significant efficiency gains, demonstrating up to 4.8× faster inference and over 90% reduction in GPU memory use compared to traditional attention-based models.
Surface Vision Mamba (SiM) is a class of vision architectures that leverage Selective Structured State Space Models (S4/S6) in place of traditional attention-based modules, enabling efficient modeling of long-range dependencies in surface data modalities. SiM achieves linear or sub-quadratic computational complexity, facilitating scalable processing of high-resolution cortical, remote sensing, or natural surface data. The framework encompasses both specialized spherical-manifold models for neuroimaging and general-purpose 2D/3D surface analysis in Earth observation, crack detection, and related fields (He et al., 24 Jan 2025, Chen et al., 2024, Liu et al., 2024, Rahman et al., 2024).
1. Mathematical Foundations and Core Architecture
SiM inherits from the Mamba family, employing state-space models defined by the continuous-time ODE

$$h'(t) = A h(t) + B x(t), \qquad y(t) = C h(t),$$

where $h(t)$ is the latent state, $x(t)$ the input, and $y(t)$ the output. Discretization via zero-order hold yields the updates

$$h_t = \bar{A} h_{t-1} + \bar{B} x_t, \qquad y_t = C h_t,$$

with $\bar{A} = \exp(\Delta A)$ and $\bar{B} = (\Delta A)^{-1}(\exp(\Delta A) - I)\,\Delta B$. Here, $A, B, C$ are learned matrices, and $\Delta$ is a tunable step size.
The SSM update can be expressed as a 1D convolution with kernel $\bar{K} = (C\bar{B},\, C\bar{A}\bar{B},\, \dots,\, C\bar{A}^{L-1}\bar{B})$, enabling efficient, global dependency capture over long input sequences. For bidirectional modeling, SiM layers perform both forward and backward scans, summing their outputs and applying standard normalization, residual, and MLP stages (He et al., 24 Jan 2025, Rahman et al., 2024). The Selective SSM (S6) variant further makes $B$, $C$, and $\Delta$ input-dependent and time-varying via shallow MLPs, introducing content-awareness (Liu et al., 2024).
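The discretized recurrence and the bidirectional summation above can be sketched as follows. This is a minimal NumPy illustration for a diagonal state matrix, not the paper's optimized CUDA implementation; all shapes and values are illustrative.

```python
import numpy as np

def ssm_scan(x, A, B, C, dt):
    """Sequential scan of a ZOH-discretized diagonal SSM.

    x: (L,) input sequence; A, B, C: (N,) diagonal/projection vectors;
    dt: scalar step size Delta.
    """
    A_bar = np.exp(dt * A)             # A_bar = exp(Delta * A), diagonal case
    B_bar = (A_bar - 1.0) / A * B      # B_bar = (Delta A)^{-1}(exp(Delta A) - I) Delta B
    h = np.zeros_like(A)
    y = np.empty_like(x)
    for t in range(len(x)):
        h = A_bar * h + B_bar * x[t]   # h_t = A_bar h_{t-1} + B_bar x_t
        y[t] = np.dot(C, h)            # y_t = C h_t
    return y

def bidirectional_ssm(x, A, B, C, dt):
    """Forward + backward scans, outputs summed (as in SiM layers)."""
    fwd = ssm_scan(x, A, B, C, dt)
    bwd = ssm_scan(x[::-1], A, B, C, dt)[::-1]
    return fwd + bwd
```

In practice the same recurrence is evaluated as the 1D convolution with kernel $\bar{K}$ during training; the loop form shown here corresponds to the sequential inference mode.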
2. Surface Domain Adaptations
Spherical Surface Representation
For neuroimaging, cortical hemispheres are represented as subdivided icospheres with genus-zero connectivity. Triangular patches are formed by grouping neighboring vertices; the number of patches and the vertex count per patch are fixed by the icosphere resolution (coarser meshes such as Ico-4 yield fewer, larger patches than Ico-5). Features per vertex (e.g., curvature, depth, thickness, myelination) are projected to fixed-dimensional patch embeddings via a learnable matrix $E$. Canonical patch ordering is applied per hemisphere, with a global “class token” prepended for global context. The initial sequence is $X_0 = [x_{\text{cls}};\, x_1 E;\, \dots;\, x_N E] + E_{\text{pos}}$, where $E_{\text{pos}}$ is a 1D positional encoding (He et al., 24 Jan 2025).
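The sequence construction above can be sketched in a few lines. The shapes here ($N$ patches, $V$ vertices per patch, $F$ features, $D$ embedding dimension) are illustrative placeholders, not the paper's actual mesh sizes:

```python
import numpy as np

# Illustrative shapes only: N patches, V vertices/patch, F features/vertex, D embed dim.
N, V, F, D = 20, 15, 4, 32
rng = np.random.default_rng(0)

patches = rng.normal(size=(N, V * F))      # flattened per-patch vertex features
E = rng.normal(size=(V * F, D))            # learnable patch-embedding matrix
cls_token = np.zeros((1, D))               # global "class token", prepended
E_pos = 0.01 * rng.normal(size=(N + 1, D)) # 1D positional encoding

X0 = np.concatenate([cls_token, patches @ E], axis=0) + E_pos  # (N+1, D)
```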
2D Patch and Multi-Directional Scan
In remote sensing, industrial crack segmentation, and general planar images, non-overlapping $p \times p$ patches are extracted from the input image and linearized into a sequence. Four main scan orders (↘, ↙, ↗, ↖) reorder patches for diversity in receptive field (Select-Scan). Each directional scan feeds into the SSM block, outputs are merged, and spatial topology is preserved through scanning and reassembly (Chen et al., 2024, Liu et al., 2024).
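The four directional orders can be expressed as index permutations over the patch grid. This sketch implements the arrows as raster scans with flipped row/column directions, a simplification of the published Select-Scan routing:

```python
import numpy as np

def select_scan_orders(H, W):
    """Four raster-style scan orders over an H x W patch grid,
    returned as index permutations (simplified Select-Scan sketch)."""
    idx = np.arange(H * W).reshape(H, W)
    return [
        idx.ravel(),              # left-to-right, top-to-bottom
        idx[:, ::-1].ravel(),     # right-to-left, top-to-bottom
        idx[::-1, :].ravel(),     # left-to-right, bottom-to-top
        idx[::-1, ::-1].ravel(),  # right-to-left, bottom-to-top
    ]
```

Each permutation reorders the patch sequence before its own SSM scan; applying the inverse permutation afterwards restores spatial layout so the four outputs can be merged.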
Hardware-Aware Implementation
SiM exploits two primary implementation modes: batched parallel “convolution” for training and memory-conserving, sequential scan for inference. Sequences are segmented into SRAM-fitting windows for CUDA efficiency. Dynamic adaptation of $B$, $C$, and step sizes $\Delta$ can be realized through MLPs over input tokens (Liu et al., 2024).
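The windowed sequential mode relies on the fact that a linear recurrence only needs its final state carried between chunks. A minimal scalar sketch (not the CUDA kernel, whose chunking targets SRAM capacity):

```python
import numpy as np

def chunked_scan(x, a, b, chunk=64):
    """Linear recurrence h_t = a*h_{t-1} + b*x_t evaluated chunk-by-chunk,
    carrying only the running state h between chunks."""
    h = 0.0
    out = []
    for start in range(0, len(x), chunk):
        block = x[start:start + chunk]
        ys = np.empty_like(block)
        for t, xt in enumerate(block):
            h = a * h + b * xt
            ys[t] = h
        out.append(ys)
    return np.concatenate(out)
```

Because only `h` crosses chunk boundaries, the result is identical for any chunk size, which is what makes SRAM-sized windows safe.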
3. Computational Complexity and Efficiency
Standard attention incurs $O(L^2)$ computational and memory complexity ($L$ = sequence length); SiM’s SSM/S4 modules reduce this to $O(L)$ or $O(L \log L)$ using scan algorithms or Fourier transforms. In the SiM neuroimaging setting (Ico-4 patch sequences), SiM achieves:
- Inference speedup: up to 4.8× faster than the Surface Vision Transformer (SiT)
- Peak GPU memory: over 90% lower than SiT
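The memory gap is easy to see with rough per-layer accounting: attention materializes an $L \times L$ score matrix, while a scan keeps only an $N$-dimensional state per position. The sequence length below is an arbitrary illustrative value:

```python
# Rough per-layer activation counts (floats) at sequence length L:
L, N = 5120, 16                 # illustrative sequence length and state dim
attn_floats = L * L             # attention score matrix: quadratic in L
ssm_floats = L * N              # scan states: linear in L
print(attn_floats / ssm_floats) # prints 320.0
```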
For surface crack segmentation (VMamba-UNet), parameters and floating-point operations (FLOPs) are substantially reduced:
- Fewer parameters than CNN/Transformer comparators
- Fewer MACs at standard input resolution, with the FLOP advantage widening at higher resolutions (Chen et al., 2024, Rahman et al., 2024).
These gains make SiM practical for high-resolution, high-throughput vision tasks that are infeasible for Transformer-based networks on contemporary hardware.
4. Practical Instantiations and Benchmark Performance
Cortical Surface Neuroimaging
Using regression targets such as postmenstrual age (PMA) and Bayley-III language/motor scores on neonatal brain surfaces (dHCP dataset, Ico-4), SiM is trained with mean squared error (MSE) loss and evaluated via mean absolute error (MAE) and MSE. Sensitivity analyses (feature-zeroing per patch/channel) highlight developmentally relevant cortical regions, supporting biological interpretability (He et al., 24 Jan 2025). Model scales include:
- SiM-Tiny, SiM-Small, and SiM-Base: progressively deeper and wider variants (layer counts, embedding dimensions, and parameter counts are reported in He et al., 24 Jan 2025)
Surface Crack Segmentation
VMamba-UNet, the SiM crack-segmentation pipeline, achieves:
- mDS improvements over CNN baselines on Crack500, Ozgenel, and MC448
- Comparable mDS to Transformer baselines with much lower computation and parameter count
The segmentation head predicts crack probability maps via linear projection. Training uses Dice loss with AdamW, and extensive augmentation on datasets such as Crack500 and Ozgenel (Chen et al., 2024).
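The Dice loss mentioned above compares the predicted probability map with the binary crack mask. A minimal soft-Dice sketch (the smoothing constant `eps` is a common convention, not taken from the paper):

```python
import numpy as np

def dice_loss(prob, target, eps=1e-6):
    """Soft Dice loss between a probability map and a binary mask.

    prob, target: arrays of the same shape; values in [0, 1].
    Returns 0 for a perfect match, approaching 1 for total mismatch.
    """
    p = prob.ravel()
    t = target.ravel()
    inter = (p * t).sum()
    return 1.0 - (2.0 * inter + eps) / (p.sum() + t.sum() + eps)
```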
Remote Sensing
SiM variants such as Pan-Mamba, RSMamba, and Samba achieve state-of-the-art results on pan-sharpening, image classification, and segmentation benchmarks:
| Task | Dataset(s) | SiM variant | Comparator |
|---|---|---|---|
| Pan-sharpening | WorldView-II | Pan-Mamba (higher PSNR/SSIM) | MTF-GLP |
| Image classification | UCMerced, AID, NWPU | RSMamba (higher OA/Kappa) | ResNet50 |
| Change detection | LEVIR-CD, WHU | Mamba-based detector (higher F1/IoU) | SNUNet |
| Segmentation | ISPRS Potsdam, Vaihingen | Samba (higher mIoU) | DeepLabV3+ |
5. Model Variants, Hyperparameter Choices, and Training Protocols
Key design and optimization choices for SiM include:
- State dimensionality $N$ balances expressivity vs. efficiency and directly controls SSM compute load
- Discretization step size $\Delta$ is typically constrained to a small positive range (e.g., via softplus and clipping) for stable memory influence
- 1D/2D scan pattern selection: bidirectional, multi-directional (four-way for 2D grids), and windowed local scans for performance–latency trade-offs
- 4-stage hierarchical backbones are recommended for segmentation and patch-level classification
- In hybrid models, 3×3 convolutions before/after Mamba blocks increase local feature sensitivity
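The $\Delta$ constraint in the list above can be parameterized per token. This sketch combines a linear map, softplus, and clipping, a hypothetical but S6-consistent arrangement; `W`, `bias`, and the range bounds are illustrative, not taken from the papers:

```python
import numpy as np

def softplus(z):
    """Numerically simple softplus: log(1 + exp(z))."""
    return np.log1p(np.exp(z))

def input_dependent_delta(tokens, W, bias=-2.0, dt_min=1e-3, dt_max=0.1):
    """Per-token step size Delta from a linear projection + softplus,
    clipped to a stable positive range (hypothetical parameterization)."""
    dt = softplus(tokens @ W + bias)
    return np.clip(dt, dt_min, dt_max)
```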
Optimization follows standard practice with AdamW, cosine or linear learning-rate decay, dropout/stochastic depth, label smoothing, and aggressive data augmentation (He et al., 24 Jan 2025, Liu et al., 2024, Rahman et al., 2024).
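The cosine schedule referred to above is standard; a self-contained sketch with linear warmup (the specific base/min rates and warmup length are placeholders, not values from the papers):

```python
import math

def cosine_lr(step, total_steps, base_lr=1e-3, min_lr=1e-5, warmup=500):
    """Cosine learning-rate decay with linear warmup (illustrative defaults)."""
    if step < warmup:
        return base_lr * step / warmup          # linear ramp to base_lr
    t = (step - warmup) / max(1, total_steps - warmup)
    return min_lr + 0.5 * (base_lr - min_lr) * (1 + math.cos(math.pi * t))
```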
6. Limitations and Future Directions
Notable challenges for SiM include:
- Optimal scan ordering is empirical and task-dependent; learnable or data-adaptive scanning may enhance spatial structural preservation
- Pre-trained SiM backbone diversity lags behind CNNs and Transformers; large-scale pretraining could improve transfer performance
- Interpretability remains an open issue as S4 kernels do not yield straightforward input–output attention maps analogous to self-attention
- Robustness to adversarial perturbations is underexplored; targeted regularization of SSM parameters may alleviate some vulnerabilities
- Extending SiM to multi-dimensional data (e.g., volumetric, video) necessitates devising efficient multi-axis SSM or scan schemes
Ongoing work is expected to address these aspects and further establish SiM as a foundation for scalable, context-rich surface vision modeling (Rahman et al., 2024, Liu et al., 2024).