Selective Scan (SS2D) in Vision Models
- Selective Scan (SS2D) is a family of linear-complexity, state-space operators that replace global self-attention with dynamic, content-aware recurrences.
- It employs various scan orders—global, windowed, and ring-based—to effectively combine local and non-local features, enhancing spatial and spatiotemporal modeling.
- SS2D has been applied in vision tasks, medical imaging, video summarization, and hardware security, demonstrating superior scalability and efficiency over quadratic attention methods.
Selective Scan (SS2D)
Selective Scan (SS2D) refers to a family of linear-complexity, state-space-based operators for spatial or spatiotemporal data that replace global self-attention by directional, content-aware, state-space recurrences. Originally motivated by the need to efficiently propagate global context in vision models without incurring quadratic complexity, SS2D and its variants scan visual or spatiotemporal feature grids in one or more directions, using input-dependent gating and dynamic recurrence to combine information. Recent vision architectures—including VMamba, LocalMamba, H-vmunet, SfMamba, BIMBA, and cross-modal state-space models—employ SS2D as their primary mechanism for combining local and non-local features while maintaining scalability to high-resolution images or long-form videos (Liu et al., 2024, Huang et al., 2024, Wu et al., 2024, Chen et al., 13 Jan 2026, Islam et al., 12 Mar 2025, Guo et al., 22 Jun 2025).
1. Mathematical and Algorithmic Foundations
The core of SS2D is a discretized state-space model (SSM) applied to sequences derived from 2D (or higher-dimensional) data. For a 1D input sequence $x_1, \dots, x_N$, the SSM follows

$$h_k = \bar{A}_k h_{k-1} + \bar{B}_k x_k, \qquad y_k = C_k h_k + D x_k,$$

where $\bar{A}_k$, $\bar{B}_k$, $C_k$ may be token-dependent (content-aware) matrices, typically generated by small neural networks conditioned on $x_k$ (Liu et al., 2024, Islam et al., 12 Mar 2025). The update is performed serially or in parallel (using prefix-scan algorithms) along a scan order imposed on the spatial (or spatiotemporal) domain.
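As a concrete illustration of this recurrence, the following sketch implements a minimal content-aware 1D selective scan over a scalar sequence in NumPy. The diagonal state matrix, softplus step size, and the projection weights `w_delta`, `W_B`, `W_C` are illustrative assumptions, not any specific published parameterization.

```python
import numpy as np

def selective_scan_1d(x, A, w_delta, W_B, W_C, D):
    """Minimal content-aware SSM over a scalar sequence x of length N.

    A:        (d,) diagonal (negative) state-matrix entries
    w_delta:  scalar weight producing a token-dependent step size
    W_B, W_C: (d,) weights producing token-dependent input/output vectors
    D:        scalar skip connection
    """
    N = x.shape[0]
    d = A.shape[0]
    h = np.zeros(d)
    y = np.zeros(N)
    for k in range(N):
        delta = np.log1p(np.exp(w_delta * x[k]))  # softplus: positive step
        B_k = W_B * x[k]                          # input-dependent B
        C_k = W_C * x[k]                          # input-dependent C
        A_bar = np.exp(delta * A)                 # ZOH-style discretization
        h = A_bar * h + delta * B_k * x[k]        # state update
        y[k] = C_k @ h + D * x[k]                 # readout plus skip path
    return y
```

Because `delta`, `B_k`, and `C_k` depend on the current token, the recurrence gates how much of each input is written into, and read out of, the hidden state.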
For vision tasks, SS2D operates on image/tensor inputs of shape $H \times W \times C$ by flattening the 2D spatial grid into one or more 1D sequences according to a structured scan order (e.g., row-major, column-major, diagonal, windowed, or more sophisticated traversals), running the SSM recurrence per sequence, and then merging the resulting outputs back onto the 2D grid (Liu et al., 2024, Huang et al., 2024, Wu et al., 2024, Chen et al., 13 Jan 2026).
Pseudocode for 4-way 2D SS2D (VMamba/SfMamba/CM-SSM):
```python
def SS2D(F, A, B, C_mat, D):
    # F: feature map of shape (ch, H, W); A, B, C_mat, D: SSM matrices
    # (token-dependent in practice; fixed here for clarity)
    ch, H, W = dims(F)
    N = H * W
    result = zeros(ch, H, W)
    for d in (1, 2, 3, 4):                       # four scan directions
        seq = extract_sequence(F, direction=d)   # shape (ch, N)
        h = zeros(ch)                            # hidden state
        y = zeros(ch, N)
        for k in range(N):
            f_k = seq[:, k]
            h = A @ h + B @ f_k                  # state update
            y[:, k] = C_mat @ h + D @ f_k        # readout
        # mapping back to 2D must invert the scan order of direction d
        result += unscan(y, direction=d, shape=(H, W))
    return result
```
The dynamic parameters ($\bar{A}$, $\bar{B}$, $C$, $D$, or their content-dependent equivalents) enable adaptive mixing of context at every spatial location.
2. Scan Order Strategies and Locality
The scan order defines how 2D (or 3D) data is linearized for recurrences:
- Global cross-scan: Four complementary traversals—top-left to bottom-right, etc.—are summed or averaged (Liu et al., 2024, Guo et al., 22 Jun 2025, Wu et al., 2024).
- Windowed/local scan: The grid is partitioned into fixed-size (or adaptive) windows; each window is scanned independently with a local ordering, followed by global merges (Huang et al., 2024).
- Ring-based/rotation-robust: Features are grouped into concentric rings; angular and radial SSMs provide order-invariant and rotation-robust traversals (Hsieh et al., 4 Feb 2026).
- Similarity- or feature-aware: Tokens are sorted by data-dependent saliency or similarity scores before applying the scan—for instance, using patchwise similarity in correspondence (Kim et al., 29 Sep 2025) or deformable feature activations in exposure correction (Dong et al., 2024).
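The global cross-scan can be sketched as follows. The four paths assumed here (row-major, column-major, and their reverses) follow the VMamba-style pattern; exact paths vary by implementation.

```python
import numpy as np

def cross_scan(F):
    """Linearize a (C, H, W) feature map along four scan paths.

    Returns four (C, H*W) sequences: row-major, column-major,
    and the reverse of each.
    """
    C, H, W = F.shape
    rows = F.reshape(C, H * W)                     # row-major
    cols = F.transpose(0, 2, 1).reshape(C, H * W)  # column-major
    return [rows, rows[:, ::-1], cols, cols[:, ::-1]]

def cross_merge(seqs, H, W):
    """Invert each scan path and sum the four outputs back onto the grid."""
    rows, rows_r, cols, cols_r = seqs
    C = rows.shape[0]
    out = np.zeros((C, H, W), dtype=rows.dtype)
    out += rows.reshape(C, H, W)
    out += rows_r[:, ::-1].reshape(C, H, W)                     # un-reverse
    out += cols.reshape(C, W, H).transpose(0, 2, 1)             # un-transpose
    out += cols_r[:, ::-1].reshape(C, W, H).transpose(0, 2, 1)
    return out
```

In a full block, each sequence returned by `cross_scan` would pass through its own SSM recurrence before `cross_merge`; scanning and merging with no processing in between simply reproduces the input four times over.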
Table: Notable SS2D Variants
| Architecture | Scan Pattern(s) | Merge/Fusion Strategy |
|---|---|---|
| VMamba (Liu et al., 2024) | 4-way global | Element-wise sum |
| LocalMamba (Huang et al., 2024) | Windowed + global | Weighted fusion (search/attention) |
| H-vmunet (Wu et al., 2024) | Diagonal, multi-order | Chained gating+scan, conv fusion |
| BIMBA (Islam et al., 12 Mar 2025) | Bidirectional (1D) | Interleaved queries + residual |
| SfMamba (Chen et al., 13 Jan 2026) | 4-way spatial + ch-vss | Additive, channel SSM, SCS shuffle |
| PRISMamba (Hsieh et al., 4 Feb 2026) | Ring & radial | Averaging, per-channel gating |
| MambaMatcher (Kim et al., 29 Sep 2025) | Similarity-sorted | Sequence-wise scan, unshuffle |
| ECMamba (Dong et al., 2024) | Activation-sorted | S6 on salient tokens, reordering |
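For the windowed pattern in the table above, a sketch of partitioning a grid into non-overlapping windows and making each window's tokens contiguous in the scanned sequence (the window size and row-major intra-window ordering are illustrative assumptions):

```python
import numpy as np

def windowed_scan(F, w):
    """Flatten a (C, H, W) map window-by-window: tokens inside each w x w
    window are contiguous in the output sequence of shape (C, H*W).
    Assumes H and W are divisible by w."""
    C, H, W = F.shape
    # (C, H//w, w, W//w, w) -> (C, H//w, W//w, w, w) -> (C, H*W)
    t = F.reshape(C, H // w, w, W // w, w).transpose(0, 1, 3, 2, 4)
    return t.reshape(C, H * W)

def windowed_unscan(seq, H, W, w):
    """Inverse of windowed_scan: scatter the sequence back onto the grid."""
    C = seq.shape[0]
    t = seq.reshape(C, H // w, W // w, w, w).transpose(0, 1, 3, 2, 4)
    return t.reshape(C, H, W)
```

Running the SSM over such a sequence keeps spatially nearby tokens close in scan order, which is the locality property the windowed variants exploit.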
3. Extensions: Cross-modal, Very Long Sequences, and High-Order
SS2D readily generalizes:
- Cross-modal fusion (CM-SS2D): Alternating scans between modalities (e.g., RGB and thermal) link hidden-state updates across modalities, enabling bidirectional context propagation at linear cost. For each scan direction, the hidden state for the RGB stream at a given step depends on the thermal state from the preceding step, and vice versa (Guo et al., 22 Jun 2025).
- Video and spatiotemporal compression: In BIMBA, SS2D enables bidirectional gated scans over tens of thousands of video tokens to extract a compact set of summary query tokens for LLM processing, with dynamic gating providing information selectivity (Islam et al., 12 Mar 2025).
- High-order interactions: H-vmunet chains multiple SS2D passes (gated/pruned streams) to progressively filter redundant information, integrating local convolutional detail via Local-SS2D modules (Wu et al., 2024).
- Ring or deformable scans: PRISMamba replaces global paths with order-agnostic ring traversals and a radial SSM to achieve rotation invariance (Hsieh et al., 4 Feb 2026); ECMamba uses deformable convolutions to rank token saliency and sorts the scan accordingly (Dong et al., 2024).
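The cross-modal alternation above can be sketched with a single shared state passed back and forth between two modality sequences. This is a hedged toy version with scalar SSM parameters; the interleaving pattern and shared-state handoff are assumptions about the general idea, not CM-SSM's exact formulation.

```python
import numpy as np

def cross_modal_scan(x_rgb, x_th, a, b, c):
    """Alternate one scalar SSM state between two modality sequences.

    x_rgb, x_th: (N,) token sequences; a, b, c: scalar SSM parameters.
    At each step the shared state is refreshed by one modality and then
    read by the other, so every output is conditioned on the other stream.
    """
    N = x_rgb.shape[0]
    h = 0.0
    y_rgb = np.zeros(N)
    y_th = np.zeros(N)
    for k in range(N):
        h = a * h + b * x_th[k]   # thermal writes the shared state
        y_rgb[k] = c * h          # RGB reads a thermal-conditioned state
        h = a * h + b * x_rgb[k]  # RGB writes the shared state
        y_th[k] = c * h           # thermal reads an RGB-conditioned state
    return y_rgb, y_th
```

With `a = 0` the memory collapses and each output reduces to the other modality's current token, which makes the cross-conditioning explicit.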
4. Complexity Analysis and Scalability
A principal advantage of SS2D is its linear time complexity in the number of spatial tokens ($O(HW)$ for an $H \times W$ grid). Each 1D scan costs $O(N d^2)$ with full $d$-dimensional state matrices (often diagonal or low-rank for efficiency, reducing this toward $O(N d)$), and usually a small number of scans (2 for row+column, 4 once diagonal/corner traversals are added) are performed in parallel (Liu et al., 2024, Guo et al., 22 Jun 2025, Chen et al., 13 Jan 2026).
By contrast:
- Convolutions: a $k \times k$ convolution is also linear in $HW$, but with a $k^2$-dependent constant factor.
- Self-attention: $O((HW)^2)$ for full spatial attention, often prohibitive at high resolution.
- Content-aware sorting adds $O(N \log N)$, dominated by the scan cost for typical $N$ (Kim et al., 29 Sep 2025, Dong et al., 2024).
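These asymptotics can be made concrete with a rough per-layer operation count. The constants below are illustrative assumptions (ignoring channel mixing, gating, and kernel-level details), not measured FLOPs of any published model.

```python
def op_counts(H, W, d=16, k=3, C=96):
    """Rough per-layer operation counts for an H x W grid.

    d: SSM state size (diagonal), k: conv kernel size, C: channels.
    Illustrative constants only.
    """
    N = H * W
    return {
        "ss2d_4way": 4 * N * C * d,     # four linear scans, diagonal state
        "conv_kxk": N * C * C * k * k,  # dense k x k convolution
        "attention": N * N * C,         # full spatial self-attention
    }

# growing the resolution 4x multiplies N by 16:
# the scan cost grows 16x, full attention 256x
for res in (56, 112, 224):
    print(res, {name: f"{v:.1e}" for name, v in op_counts(res, res).items()})
```

Quadrupling the side length scales the scan cost by 16 but full attention by 256, which is the gap the linear-complexity claim refers to.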
Empirically, SS2D models (VMamba, LocalMamba, H-vmunet, PRISMamba) outperform or match ViT and CNN baselines on classification, segmentation, and detection while using fewer FLOPs and parameters, with throughput scaling linearly with image resolution (Liu et al., 2024, Huang et al., 2024, Wu et al., 2024, Hsieh et al., 4 Feb 2026).
5. Applications and Empirical Results
SS2D and its variants have been used as the core recurrence module in vision backbone networks, medical image segmentation, video summarization/compression, domain adaptation, semantic correspondence, and robotics:
- Image Classification/Segmentation/Detection: VMamba, LocalMamba, SfMamba, H-vmunet, and PRISMamba report superior or comparable ImageNet Top-1 accuracy with reduced compute, confirming the scalability and global modeling power of SS2D (Liu et al., 2024, Huang et al., 2024, Wu et al., 2024, Chen et al., 13 Jan 2026, Hsieh et al., 4 Feb 2026).
- Long-form Video QA: BIMBA achieves over 10× token compression while maintaining VQA accuracy above competitive LLM-based alternatives (Islam et al., 12 Mar 2025).
- Cross-modal Semantic Segmentation: CM-SSM (with cross-modal SS2D) attains SOTA on RGB-thermal datasets at lower computation (Guo et al., 22 Jun 2025).
- Semantic Correspondence: MambaMatcher leverages similarity-aware SS2D to refine 4D correlation maps efficiently and surpass transformer correlation-based models (Kim et al., 29 Sep 2025).
- Robust/Adaptive Vision: PRISMamba's ring-based SS2D is robust to large input rotations, with higher accuracy and increased throughput relative to fixed-path scan variants (Hsieh et al., 4 Feb 2026).
- Medical Imaging: H-vmunet's high-order SS2D outperforms both vanilla VMamba-based and advanced U-Net variants, with substantial parameter reduction (Wu et al., 2024).
- Exposure Correction: ECMamba's Retinex-guided SS2D (Retinex-SS2D) demonstrates superior PSNR and SSIM for multiple-exposure correction compared to four-way scan approaches (Dong et al., 2024).
6. Limitations and Evolving Research Directions
While SS2D offers compelling trade-offs, several limitations and open research directions remain:
- SS2D imposes scan-induced adjacency; without sufficiently rich scan patterns or local-global fusion, subtle spatial relationships may be under-modeled (Huang et al., 2024, Hsieh et al., 4 Feb 2026).
- Explicit hierarchy (e.g., as in video or multiresolution tasks) may require multi-level or memory-augmented SSMs to handle extremely long-range dependencies (Islam et al., 12 Mar 2025).
- Certain variants, such as simple two-route SS2D in VM-UNet, may bottleneck on complex 2D spatial interactions, motivating multi-head or multi-route fusion (Ji, 2024).
- Dynamic scan pattern selection (e.g., differentiable search in LocalMamba) and adaptive scan order (e.g., sorting by relevance or similarity) are active research frontiers (Huang et al., 2024, Kim et al., 29 Sep 2025).
- Hardware and kernel optimization remain critical for full realization of the theoretical efficiency gains.
7. Variants Outside Perception: Hardware Security and Robotics
The term “Selective Scan” also appears outside neural modeling, notably:
- Hardware Security: SeqL applies selective scan locking to secure scan chains in sequential circuits, providing formal guarantees against SAT and multi-cycle attacks with minor area and power overhead. Its selective scan locking chooses flip-flops for dual-key protection, rendering recovery of functionally correct keys exponentially unlikely for attackers, as demonstrated on ISCAS, MCNC, ITC, and RISC-V designs (Potluri et al., 2020).
- Robotics: here, selective scanning refers to choosing "key scans" for mapping and navigation, clustering star-convex regions into a metric-topological pose graph and using scan-selection policies (frontier, bridging) to ensure provable safety and coverage in real environments (Latha et al., 2024).
References
- (Liu et al., 2024) – VMamba: Visual State Space Model
- (Huang et al., 2024) – LocalMamba: Visual State Space Model with Windowed Selective Scan
- (Wu et al., 2024) – H-vmunet: High-order Vision Mamba UNet for Medical Image Segmentation
- (Guo et al., 22 Jun 2025) – Cross-modal State Space Modeling for Real-time RGB-thermal Wild Scene Semantic Segmentation
- (Islam et al., 12 Mar 2025) – BIMBA: Selective-Scan Compression for Long-Range Video Question Answering
- (Kim et al., 29 Sep 2025) – Similarity-Aware Selective State-Space Modeling for Semantic Correspondence
- (Chen et al., 13 Jan 2026) – SfMamba: Efficient Source-Free Domain Adaptation via Selective Scan Modeling
- (Hsieh et al., 4 Feb 2026) – Partial Ring Scan: Revisiting Scan Order in Vision State Space Models
- (Dong et al., 2024) – ECMamba: Consolidating Selective State Space Model with Retinex Guidance for Efficient Multiple Exposure Correction
- (Ji, 2024) – MHS-VM: Multi-Head Scanning in Parallel Subspaces for Vision Mamba
- (Potluri et al., 2020) – SeqL: Secure Scan-Locking for IP Protection
- (Latha et al., 2024) – Key-Scan-Based Mobile Robot Navigation: Integrated Mapping, Planning, and Control using Graphs of Scan Regions