Vision State-Space Module (VSSM)

Updated 22 June 2026

Vision State-Space Module (VSSM) is a deep neural sequence model that processes flattened visual feature maps via adaptive, linear-time state-space recurrences.
It generalizes attention and convolution by interpreting feature maps as sequences, enabling efficient global context aggregation and robust performance on high-resolution tasks.
VSSMs are implemented in diverse architectures, enhancing applications from dense prediction and medical imaging to image restoration with balanced global and local feature integration.

A Vision State-Space Module (VSSM) is a deep neural sequence modeling primitive derived from continuous- and discrete-time state-space models, engineered for image and dense visual data processing. In the VSSM paradigm, feature maps are interpreted as sequences (or collections of sequences) and processed by input-adaptive, linear-time recurrences. This approach generalizes attention, convolution, and classical SSMs to the vision domain and has enabled architectures with global context aggregation, high efficiency, and robust inductive bias for high-resolution and dense prediction tasks (Zhang et al., 2024).

1. Core Principles and Mathematical Formulation

A VSSM is fundamentally defined by a state-space recurrence that processes a spatial sequence (typically flattened from a 2D feature map) according to:

$s_0 = 0,\quad s_n = A_n\,s_{n-1} + K_n\,x_n,\quad y_n = C^\top_n s_n$

$x_n \in \mathbb{R}^C$ is the token (e.g., a pixel or patch embedding) at sequence position $n$.
$A_n$, $K_n$, $C_n$ are (potentially input-dependent) matrices computed at each step from $x_n$.
The recurrence adapts memory depth and context mixing at each token; for typical Mamba-variant VSSMs, $A_n = \exp(\Delta_n)$, $K_n = (\Delta_n)^{-1}(\exp(\Delta_n) - I)$, with small “hyper” networks producing $\Delta_n$ and $C_n$ from $x_n$.
The scan is applied along one or more spatial directions (“selective 2D scan,” SS2D), with output sequences reassembled back to the original grid structure (Zhang et al., 2024, Liu et al., 2024, Zafari et al., 22 Jul 2025).
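The recurrence above can be sketched in a few lines for the diagonal case, where $A_n$ and $K_n$ act elementwise on a state with one entry per channel. This is a minimal illustration under stated assumptions, not any cited model's implementation: the function name `selective_scan`, the per-channel weights `w_delta`, `b_delta`, `w_c`, and the use of a negated softplus to keep $\Delta_n$ negative are all hypothetical choices standing in for learned "hyper" networks.

```python
import math


def softplus(z):
    # Numerically stable log(1 + exp(z)); always non-negative.
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))


def selective_scan(xs, w_delta, b_delta, w_c):
    """Run s_n = A_n s_{n-1} + K_n x_n, y_n = C_n^T s_n with diagonal A_n, K_n.

    xs: list of tokens, each a list of C floats.
    w_delta, b_delta, w_c: per-channel weights of toy "hyper" networks
    (hypothetical; real models use learned projections of x_n).
    """
    channels = len(xs[0])
    state = [0.0] * channels  # s_0 = 0
    ys = []
    for x in xs:
        # Delta_n <= 0, so A_n = exp(Delta_n) lies in (0, 1].
        delta = [-softplus(w_delta[i] * x[i] + b_delta[i]) for i in range(channels)]
        a = [math.exp(d) for d in delta]  # A_n = exp(Delta_n)
        # K_n = Delta_n^{-1} (exp(Delta_n) - I); the limit as Delta_n -> 0 is 1.
        k = [math.expm1(d) / d if d != 0.0 else 1.0 for d in delta]
        c = [w_c[i] * x[i] for i in range(channels)]  # C_n computed from x_n
        state = [a[i] * state[i] + k[i] * x[i] for i in range(channels)]
        ys.append(sum(c[i] * state[i] for i in range(channels)))
    return ys


tokens = [[1.0, 0.5], [0.0, -1.0], [2.0, 0.25]]
outputs = selective_scan(tokens, [0.5, 0.5], [0.0, 0.0], [1.0, 1.0])
```

The loop visits each token once and does a constant amount of work per channel, which is where the linear cost in sequence length comes from.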

The VSSM may be augmented with additional context or spatial mixing layers—such as multi-scale depthwise convolution or structure-aware state fusion—to restore locality and strengthen inductive bias (Zhang et al., 2024, Xiao et al., 2024).

2. Architectural Instantiations

VSSMs are used as the backbone or key building block in various visual architectures:

Dynamic Visual State-Space (DVSS) Block: In HRVMamba, the DVSS combines a 2D-selective-scan SSM (SS2D) with a deformable convolution (DCN), followed by a grouped multi-scale depthwise convolution and a feedforward MLP head. This enables both input-adaptive global context and explicit local feature extraction. The DVSS is embedded within a multi-resolution (HRNet-style) parallel network for dense prediction (Zhang et al., 2024).
Hybrid CNN-VSSM Pipelines: In multi-view mammography analysis, a ResNet provides local features to a stack of VSSM blocks, capturing global dependencies efficiently and yielding substantially improved AUC/F1 scores compared to CNN-only or pure VSSM backbones (Zafari et al., 22 Jul 2025).
Grouped and Multi-Scan Variants: Architectures such as GroupMamba and Multi-scale VMamba split channels into groups or scan full and downsampled features across multiple directions, improving efficiency and stability while maintaining high accuracy (Shaker et al., 2024, Shi et al., 2024).
Variants for Low-Level Vision: For single-image deblurring, XYScanNet employs slice-and-scan VSSMs which alternate between intra-slice (pixel-level) and inter-slice (cross-row or cross-column) SSM recurrences, thus preserving locality and reducing spatial misalignment (Liu et al., 2024).
Spectral-Domain SSMs: HAMSA eliminates spatial scanning altogether, applying a learnable complex kernel in the Fourier (spectral) domain, with input-adaptive frequency gating and nonlinearity for $O(N \log N)$ global mixing over $N$ tokens (Patro et al., 16 Apr 2026).
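The multi-direction scanning used by SS2D-style blocks can be illustrated by listing the traversal orders explicitly and scattering each scan's output back to the grid. The sketch below is a simplification under stated assumptions: it handles one scalar feature per position, merges the four directions by averaging (the merge rule is an assumption here), and uses a toy `decayed_cumsum` in place of a real selective scan. All function names are hypothetical.

```python
def scan_orders(height, width):
    """Return four traversal orders over an H x W grid as lists of flat indices."""
    row_major = [r * width + c for r in range(height) for c in range(width)]
    col_major = [r * width + c for c in range(width) for r in range(height)]
    return [row_major, row_major[::-1], col_major, col_major[::-1]]


def multi_direction_scan(grid, scan_1d):
    """Apply scan_1d along each order, scatter back to the grid, and average."""
    height, width = len(grid), len(grid[0])
    flat = [value for row in grid for value in row]
    merged = [0.0] * (height * width)
    orders = scan_orders(height, width)
    for order in orders:
        out = scan_1d([flat[i] for i in order])
        # Position p in the scanned sequence belongs to grid cell order[p].
        for position, index in enumerate(order):
            merged[index] += out[position] / len(orders)
    return [merged[r * width:(r + 1) * width] for r in range(height)]


def decayed_cumsum(seq, decay=0.5):
    """Toy stand-in for an SSM scan: s_n = decay * s_{n-1} + x_n."""
    state, out = 0.0, []
    for x in seq:
        state = decay * state + x
        out.append(state)
    return out


impulse = [[1.0, 0.0], [0.0, 0.0]]
mixed = multi_direction_scan(impulse, decayed_cumsum)
```

Because each direction is causal along its own order, combining forward and reversed row and column scans lets every position receive information from every other position.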

3. Spatial and Structural Inductive Bias

VSSMs achieve a global receptive field at linear cost, but early instantiations suffered from two primary limitations:

Long-Range Forgetting: Due to the recurrent decay (negative $\Delta_n$, so that $A_n = \exp(\Delta_n)$ contracts the state), information from distant tokens can vanish, especially in deep networks.
Weak Local Inductive Bias: Flattened 1D sequences lose explicit 2D spatial relationships, degrading performance on structurally sensitive tasks.
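Both limitations follow from unrolling the recurrence of Section 1: with $s_0 = 0$, the state is $s_n = \sum_{m \le n} \big(\prod_{k=m+1}^{n} A_k\big) K_m x_m$, so a token that lies $n - m$ steps back is weighted by a product of $n - m$ contraction factors. The sketch below works this out for a constant scalar decay, which is a simplifying assumption (real models use input-dependent, per-channel decays). The function names and the example values are illustrative only.

```python
import math


def neighbor_weights(decay, width):
    """Weight carried from a token to its right and lower neighbors after a
    row-major flatten, assuming a constant scalar decay A_n = decay."""
    horizontal = decay ** 1      # adjacent in the flattened sequence
    vertical = decay ** width    # one full row later in the flattened sequence
    return horizontal, vertical


def steps_until_forgotten(decay, threshold):
    """Smallest distance d with decay**d < threshold, for 0 < decay < 1."""
    return math.floor(math.log(threshold) / math.log(decay)) + 1


right, below = neighbor_weights(0.9, 56)
distance = steps_until_forgotten(0.9, 0.01)
```

With an illustrative decay of 0.9 on a 56-token-wide grid, the right neighbor keeps a weight of 0.9 while the neighbor directly below keeps roughly 0.003, even though both are equally close in 2D.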

Recent work directly addresses these issues:

HRVMamba augments the SS2D scan with deformable convolution layers (DCNv4) to inject local and non-local spatial context at each scan step, significantly mitigating long-range forgetting. The MultiDW module introduces multi-scale local filters immediately after the SSM scan, enabling efficient detection of edges, textures, and fine structures that would otherwise be inaccessible to token-only sequential models (Zhang et al., 2024).
Spatial-Mamba employs a structure-aware state fusion (SASF) operator: a multi-scale, dilated, depthwise convolution in the latent state space following the SSM scan. This mechanism fuses neighborhood information in 2D, enabling explicit grid-graph adjacency for the hidden representation (Xiao et al., 2024).
Multi-scale VMamba flattens both “full-resolution” and downsampled feature maps, each scanned in several directions, and fuses via re-assembled and upsampled projections. This hierarchical, parameter-sharing scheme enables robust global mixing while nearly halving token count for reduced cost (Shi et al., 2024).
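The idea of fusing neighboring states in 2D after the scan can be sketched as a multi-scale, dilated, depthwise filter applied to one channel of the hidden-state map. This is a rough illustration in the spirit of the structure-aware fusion described above, not the operator from the cited work: the function name, the dictionary of dilation-indexed 3x3 kernels, the zero padding, and the summation across scales are all assumptions.

```python
def dilated_depthwise_fusion(state_map, kernels):
    """Fuse each state with its 2D neighbors using dilated 3x3 kernels.

    state_map: H x W list of floats (one channel of the hidden state).
    kernels: dict mapping dilation -> 3x3 weight list (hypothetical values).
    Out-of-bounds neighbors contribute nothing (zero padding); the
    contributions of all dilations are summed.
    """
    height, width = len(state_map), len(state_map[0])
    fused = [[0.0] * width for _ in range(height)]
    for r in range(height):
        for c in range(width):
            total = 0.0
            for dilation, kernel in kernels.items():
                for dr in (-1, 0, 1):
                    for dc in (-1, 0, 1):
                        rr, cc = r + dr * dilation, c + dc * dilation
                        if 0 <= rr < height and 0 <= cc < width:
                            total += kernel[dr + 1][dc + 1] * state_map[rr][cc]
            fused[r][c] = total
    return fused


ones = [[1.0] * 3 for _ in range(3)]
impulse = [[0.0] * 5 for _ in range(5)]
impulse[2][2] = 1.0
spread = dilated_depthwise_fusion(impulse, {1: ones, 2: ones})
```

Unlike the 1D scan, this step treats vertical and horizontal neighbors symmetrically, which is how grid adjacency is reintroduced into the hidden representation.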

4. Complexity, Efficiency, and Empirical Performance

The central promise of VSSMs is linear complexity with respect to sequence length $N$, in contrast to quadratically-scaling self-attention:

Each SSM scan per direction costs $O(N d^2)$ for a dense state of dimension $d$, but in practice $d \ll N$, and modern implementations use low-rank or diagonal parameterizations, dramatically reducing compute (Zhang et al., 2024, Liu et al., 2024).
Multi-scan, grouped, or spectral approaches further cut cost: e.g., GroupMamba partitions channels (four-way scan) and achieves up to 26% parameter reduction vs. standard Mamba, while HAMSA achieves $O(N \log N)$ complexity and 2–3× speedup relative to conventional transform-based models (Shaker et al., 2024, Patro et al., 16 Apr 2026).
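The scaling argument can be made concrete with a back-of-the-envelope operation count. The estimates below drop constant factors and are assumptions chosen for illustration, not measured FLOPs from any cited model: self-attention is counted as $N^2 C$, a diagonal scan as $N \cdot C \cdot d$ per direction, and FFT-based mixing as $C \cdot N \log_2 N$, for $N$ tokens, $C$ channels, and state dimension $d$. The function name and the example sizes are hypothetical.

```python
import math


def mixing_costs(height, width, channels, state_dim, directions=4):
    """Rough multiply-add counts for one global-mixing layer (constants dropped)."""
    n = height * width  # number of tokens N
    return {
        # Pairwise token interactions: quadratic in N.
        "self_attention": n * n * channels,
        # One state update per token, channel, and state entry: linear in N.
        "diagonal_ssm_scan": directions * n * channels * state_dim,
        # Transform-based mixing: N log N per channel.
        "fft_mixing": channels * n * math.ceil(math.log2(n)),
    }


costs = mixing_costs(56, 56, 96, 16)
ratio = costs["self_attention"] // costs["diagonal_ssm_scan"]
```

For this hypothetical 56 x 56 token grid, the attention estimate is 49 times the four-direction scan estimate, and the gap widens linearly as the token count grows.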

Empirical results across benchmarks:

Model	Params (M)	FLOPs (G)	Headline metric	Notable gains
HRVMamba-B	47	14.2	76.4 AP (pose)	+1.6 AP over VMamba-B
GroupMamba-T	23	4.6	83.3% top-1	+0.8% over VMamba-T
HAMSA-L	–	–	85.7% top-1	+1.3% over VMamba-B
XYScanNet	–	–	KID = 0.073	–17% vs. nearest competitor
L²FMamba	1.09	38	39.485 dB / 0.9873 (PSNR/SSIM)	50% fewer params, faster

VSSMs excel in tasks requiring both global context and fine-grained detail: dense prediction (COCO pose/segmentation), medical imaging (multi-view mammography), image restoration, and neural JSCC (Zhang et al., 2024, Zafari et al., 22 Jul 2025, Liu et al., 2024, Wu et al., 2024).

5. Interpretability and Theoretical Underpinnings

Recent research provides principled understanding and interpretable attributions for VSSMs:

Controllability Analysis: X-VMamba establishes a framework for measuring the influence of each token on downstream representations using Jacobian and Gramian-based metrics. This reveals a coarse-to-fine hierarchy: early layers distribute control diffusely, later layers concentrate influence on semantically salient regions. Closed-form influence scores are available for diagonal SSM parameterizations (Mabrok et al., 16 Nov 2025).
Matrix Multiplication Perspective: Both Mamba and its vision derivatives can be represented as “structured” matrix–vector products $y = Mx$, where $M$ encodes scan, decay, and contextual fusion; variants like Spatial-Mamba and Deformba explicitly modify this structure to encode 2D adjacency and adaptive, deformable spatial fusion (Xiao et al., 2024, Ke et al., 20 May 2026).
Spectral Methods: HAMSA replaces time/space recurrences with frequency-domain convolution, sidestepping discretization instabilities and enabling interpretable frequency gating (Patro et al., 16 Apr 2026).
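The matrix-multiplication view in the list above can be checked directly for a scalar-state recurrence: the scan is equivalent to multiplying the input by a lower-triangular matrix whose entry for row $n$ and column $m$ is $C_n \big(\prod_{k=m+1}^{n} A_k\big) K_m$. The sketch below builds that matrix and compares it with the step-by-step recurrence. The function names and the scalar simplification are assumptions made for illustration.

```python
def ssm_mixing_matrix(a, k, c):
    """Build lower-triangular M with y = M x for the scalar recurrence
    s_n = a[n] * s_{n-1} + k[n] * x[n],  y_n = c[n] * s_n  (0-indexed lists)."""
    length = len(a)
    matrix = [[0.0] * length for _ in range(length)]
    for n in range(length):
        weight = 1.0  # running product a[m+1] * ... * a[n]
        for m in range(n, -1, -1):
            matrix[n][m] = c[n] * weight * k[m]
            weight *= a[m]
    return matrix


def run_recurrence(a, k, c, xs):
    """Evaluate the same recurrence sequentially, starting from s = 0."""
    state, ys = 0.0, []
    for a_n, k_n, c_n, x_n in zip(a, k, c, xs):
        state = a_n * state + k_n * x_n
        ys.append(c_n * state)
    return ys


def matvec(matrix, xs):
    return [sum(w * x for w, x in zip(row, xs)) for row in matrix]


a = [0.9, 0.5, 0.8, 0.7]
k = [1.0, 0.5, 2.0, 1.5]
c = [1.0, -1.0, 0.5, 2.0]
xs = [1.0, -2.0, 3.0, 0.5]
M = ssm_mixing_matrix(a, k, c)
```

The zero upper triangle reflects the causality of a single scan direction; the variants named above can be read as changing which entries of this matrix are nonzero and how they are weighted.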

This theoretical underpinning facilitates principled extensions, architectural diagnosis, and connects VSSMs to established dynamical systems theory.

6. Application Domains and Representative Implementations

VSSMs are now foundational in a diverse range of visual and multimodal pipelines:

Dense Prediction: HRVMamba, Spatial-Mamba, and GroupMamba achieve state-of-the-art results on semantic segmentation, pose estimation, and object detection (Zhang et al., 2024, Xiao et al., 2024, Shaker et al., 2024).
Medical Imaging: CNN-VSSM hybrids and interpretability methods provide high-performance, robust diagnostics, and domain-aligned spatial selectivity in mammography and multimodal image registration pipelines (Zafari et al., 22 Jul 2025, Wang et al., 2024).
Image Restoration / Low-level Vision: XYScanNet’s VSSM yields leading perceptual quality in deblurring with greatly reduced computational and memory cost compared to flatten-and-scan MambaIR baselines (Liu et al., 2024).
Light Field Super-Resolution: L²FMamba leverages progressive spatial-angular VSSM modules to outperform Transformer and older Mamba-based approaches in accuracy, speed, and parameter count (Wei et al., 25 Mar 2025).
Neural Joint Source–Channel Coding: VSSM-CA (Visual State-Space Module with Channel Adaptation) provides an efficient, CSI-adaptive backbone for semantic communication, outperforming Transformer-based SwinJSCC while halving computational load and parameter count (Wu et al., 2024, Wu et al., 2024).
Scanning-Free Vision SSMs: HAMSA directly eliminates spatial scan overhead, affording FFT-based global mixing applicable to large-scale classification and dense transfer learning tasks (Patro et al., 16 Apr 2026).
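Frequency-domain global mixing of the kind attributed to scanning-free models can be sketched as: transform the token sequence, multiply each frequency by a complex kernel and a gate, and transform back. The code below is a generic illustration of that pattern, not HAMSA's actual layer. It uses a naive $O(N^2)$ DFT from the standard library for self-containment (an FFT computes the same result in $O(N \log N)$), takes the gate as a fixed argument rather than predicting it from the input, and all function names are hypothetical.

```python
import cmath


def dft(values, inverse=False):
    """Naive O(N^2) discrete Fourier transform; an FFT gives the same result faster."""
    n = len(values)
    sign = 1 if inverse else -1
    out = []
    for freq in range(n):
        total = sum(v * cmath.exp(sign * 2j * cmath.pi * freq * t / n)
                    for t, v in enumerate(values))
        out.append(total / n if inverse else total)
    return out


def spectral_mix(tokens, kernel, gate):
    """Multiply the spectrum by a complex kernel and a real gate, then invert.

    Only the real part is returned, so any imaginary residue produced by a
    kernel that is not conjugate-symmetric is discarded.
    """
    spectrum = dft(tokens)
    mixed = [s * w * g for s, w, g in zip(spectrum, kernel, gate)]
    return [z.real for z in dft(mixed, inverse=True)]


tokens = [1.0, 2.0, 3.0, 4.0]
smoothing_filter = [0.5, 0.5, 0.0, 0.0]
smoothed = spectral_mix(tokens, dft(smoothing_filter), [1.0] * 4)
```

By the convolution theorem, multiplying by the transform of `smoothing_filter` is a circular convolution with it, so every output can depend on every input without any scan order being chosen.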

7. Limitations and Ongoing Directions

Despite their substantial progress, VSSMs exhibit specific challenges:

Long-Range Context Decay: While mitigated by deformable or adaptive fusion modules, strong spatial decay remains an architectural concern for deep SSMs (Zhang et al., 2024).
Loss of Local Structure in Naive Flattening: Flatten-and-scan approaches can misalign spatially local features; slice-and-scan or structure-aware fusions are required for low-level tasks (Liu et al., 2024, Xiao et al., 2024).
Hardware Utilization and Channel Compression: Linear recurrence, while theoretically efficient, can be under-optimized on existing hardware due to bandwidth and memory constraints. Approaches like VMeanba compress activations to a single channel per scan, permitting up to 293× kernel-level speedup with negligible loss in accuracy, but may break down if too aggressively applied (Chi et al., 2024).
Interpretability in Nonlinear/Residual Architectures: While controllability is well-defined for linear or diagonal SSM blocks, attribution across complex residual, nonlinear, or hierarchical designs is an open problem (Mabrok et al., 16 Nov 2025).
Spectral Limitations: Spectral-domain models such as HAMSA may not natively encode certain fine-grained 2D or geometric priors, e.g., in medical segmentation or detection (Patro et al., 16 Apr 2026).

Research is ongoing to unify scanning-based and spectral methods, develop fully 2D analogues of SSMs, improve multimodal and cross-attention integration, and further bridge to hardware-efficient implementations.

For complete derivations, ablation metrics, and layerwise implementation specifics, consult the cited works (Zhang et al., 2024, Zafari et al., 22 Jul 2025, Chi et al., 2024, Shi et al., 2024, Ke et al., 20 May 2026, Patro et al., 16 Apr 2026, Xiao et al., 2024, Liu et al., 2024, Liu et al., 2024, Wei et al., 25 Mar 2025, Shaker et al., 2024, Wang et al., 2024, Wu et al., 2024, Wu et al., 2024, Wang et al., 2024).