Selective Scan (SS2D) in Vision Models
- Selective Scan (SS2D) is a family of linear-complexity, state-space operators that replace global self-attention with dynamic, content-aware recurrences.
- It employs various scan orders—global, windowed, and ring-based—to effectively combine local and non-local features, enhancing spatial and spatiotemporal modeling.
- SS2D has been applied in vision tasks, medical imaging, video summarization, and hardware security, demonstrating superior scalability and efficiency over quadratic attention methods.
Selective Scan (SS2D)
Selective Scan (SS2D) refers to a family of linear-complexity, state-space-based operators for spatial or spatiotemporal data that replace global self-attention by directional, content-aware, state-space recurrences. Originally motivated by the need to efficiently propagate global context in vision models without incurring quadratic complexity, SS2D and its variants scan visual or spatiotemporal feature grids in one or more directions, using input-dependent gating and dynamic recurrence to combine information. Recent vision architectures—including VMamba, LocalMamba, H-vmunet, SfMamba, BIMBA, and cross-modal state-space models—employ SS2D as their primary mechanism for combining local and non-local features while maintaining scalability to high-resolution images or long-form videos (Liu et al., 2024, Huang et al., 2024, Wu et al., 2024, Chen et al., 13 Jan 2026, Islam et al., 12 Mar 2025, Guo et al., 22 Jun 2025).
1. Mathematical and Algorithmic Foundations
The core of SS2D is a discretized state-space model (SSM) applied to sequences derived from 2D (or higher-dimensional) data. For a 1D input sequence $x_1, \dots, x_N$, the SSM follows

$$h_k = \bar{A}_k h_{k-1} + \bar{B}_k x_k, \qquad y_k = C_k h_k + D x_k,$$

where $\bar{A}_k$, $\bar{B}_k$, $C_k$ may be token-dependent (content-aware) matrices, typically generated by small neural networks conditioned on $x_k$ (Liu et al., 2024, Islam et al., 12 Mar 2025). The update is performed serially or in parallel (using prefix-scan algorithms) along a scan order imposed on the spatial (or spatiotemporal) domain.
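As a concrete illustration of this recurrence, the following sketch implements a minimal content-aware 1D selective scan over a scalar sequence in NumPy. The diagonal state matrix, softplus step size, and the projection weights `w_delta`, `W_B`, `W_C` are illustrative assumptions, not any specific published parameterization.

```python
import numpy as np

def selective_scan_1d(x, A, w_delta, W_B, W_C, D):
    """Minimal content-aware SSM over a scalar sequence x of length N.

    A:        (d,) diagonal (negative) state-matrix entries
    w_delta:  scalar weight producing a token-dependent step size
    W_B, W_C: (d,) weights producing token-dependent input/output vectors
    D:        scalar skip connection
    """
    N = x.shape[0]
    d = A.shape[0]
    h = np.zeros(d)
    y = np.zeros(N)
    for k in range(N):
        delta = np.log1p(np.exp(w_delta * x[k]))  # softplus: positive step
        B_k = W_B * x[k]                          # input-dependent B
        C_k = W_C * x[k]                          # input-dependent C
        A_bar = np.exp(delta * A)                 # ZOH-style discretization
        h = A_bar * h + delta * B_k * x[k]        # state update
        y[k] = C_k @ h + D * x[k]                 # readout plus skip path
    return y
```

Because `delta`, `B_k`, and `C_k` depend on the current token, the recurrence gates how much of each input is written into, and read out of, the hidden state.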
For vision tasks, SS2D operates on image/tensor inputs of shape $H \times W \times C$ by flattening the 2D spatial grid into one or more 1D sequences according to a structured scan order (e.g., row-major, column-major, diagonal, windowed, or more sophisticated traversals), running the SSM recurrence per sequence, and then merging the resulting outputs back onto the 2D grid (Liu et al., 2024, Huang et al., 2024, Wu et al., 2024, Chen et al., 13 Jan 2026).
Pseudocode for 4-way 2D SS2D (VMamba/SfMamba/CM-SSM):
```python
def SS2D(F, A, B, C_mat, D):
    # F: feature map of shape (ch, H, W); A, B, C_mat, D: SSM matrices
    # (token-dependent in practice; fixed here for clarity)
    ch, H, W = dims(F)
    N = H * W
    result = zeros(ch, H, W)
    for d in (1, 2, 3, 4):                       # four scan directions
        seq = extract_sequence(F, direction=d)   # shape (ch, N)
        h = zeros(ch)                            # hidden state
        y = zeros(ch, N)
        for k in range(N):
            f_k = seq[:, k]
            h = A @ h + B @ f_k                  # state update
            y[:, k] = C_mat @ h + D @ f_k        # readout
        # mapping back to 2D must invert the scan order of direction d
        result += unscan(y, direction=d, shape=(H, W))
    return result
```
The dynamic parameters ($\bar{A}$, $\bar{B}$, $C$, $D$, or their content-dependent equivalents) enable adaptive mixing of context at every spatial location.
2. Scan Order Strategies and Locality
The scan order defines how 2D (or 3D) data is linearized for recurrences:
- Global cross-scan: Four complementary traversals—top-left to bottom-right, etc.—are summed or averaged (Liu et al., 2024, Guo et al., 22 Jun 2025, Wu et al., 2024).
- Windowed/local scan: The grid is partitioned into fixed-size (or adaptive) windows; each window is scanned independently with a local ordering, followed by global merges (Huang et al., 2024).
- Ring-based/rotation-robust: Features are grouped into concentric rings; angular and radial SSMs provide order-invariant and rotation-robust traversals (Hsieh et al., 4 Feb 2026).
- Similarity- or feature-aware: Tokens are sorted by data-dependent saliency or similarity scores before applying the scan—for instance, using patchwise similarity in correspondence (Kim et al., 29 Sep 2025) or deformable feature activations in exposure correction (Dong et al., 2024).
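The global cross-scan can be sketched as follows. The four paths assumed here (row-major, column-major, and their reverses) follow the VMamba-style pattern; exact paths vary by implementation.

```python
import numpy as np

def cross_scan(F):
    """Linearize a (C, H, W) feature map along four scan paths.

    Returns four (C, H*W) sequences: row-major, column-major,
    and the reverse of each.
    """
    C, H, W = F.shape
    rows = F.reshape(C, H * W)                     # row-major
    cols = F.transpose(0, 2, 1).reshape(C, H * W)  # column-major
    return [rows, rows[:, ::-1], cols, cols[:, ::-1]]

def cross_merge(seqs, H, W):
    """Invert each scan path and sum the four outputs back onto the grid."""
    rows, rows_r, cols, cols_r = seqs
    C = rows.shape[0]
    out = np.zeros((C, H, W), dtype=rows.dtype)
    out += rows.reshape(C, H, W)
    out += rows_r[:, ::-1].reshape(C, H, W)                     # un-reverse
    out += cols.reshape(C, W, H).transpose(0, 2, 1)             # un-transpose
    out += cols_r[:, ::-1].reshape(C, W, H).transpose(0, 2, 1)
    return out
```

In a full block, each sequence returned by `cross_scan` would pass through its own SSM recurrence before `cross_merge`; scanning and merging with no processing in between simply reproduces the input four times over.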
Table: Notable SS2D Variants
| Architecture | Scan Pattern(s) | Merge/Fusion Strategy |
|---|---|---|
| VMamba (Liu et al., 2024) | 4-way global | Element-wise sum |
| LocalMamba (Huang et al., 2024) | Windowed + global | Weighted fusion (search/attention) |
| H-vmunet (Wu et al., 2024) | Diagonal, multi-order | Chained gating+scan, conv fusion |
| BIMBA (Islam et al., 12 Mar 2025) | Bidirectional (1D) | Interleaved queries + residual |
| SfMamba (Chen et al., 13 Jan 2026) | 4-way spatial + ch-vss | Additive, channel SSM, SCS shuffle |
| PRISMamba (Hsieh et al., 4 Feb 2026) | Ring & radial | Averaging, per-channel gating |
| MambaMatcher (Kim et al., 29 Sep 2025) | Similarity-sorted | Sequence-wise scan, unshuffle |
| ECMamba (Dong et al., 2024) | Activation-sorted | S6 on salient tokens, reordering |
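For the windowed pattern in the table above, a sketch of partitioning a grid into non-overlapping windows and making each window's tokens contiguous in the scanned sequence (the window size and row-major intra-window ordering are illustrative assumptions):

```python
import numpy as np

def windowed_scan(F, w):
    """Flatten a (C, H, W) map window-by-window: tokens inside each w x w
    window are contiguous in the output sequence of shape (C, H*W).
    Assumes H and W are divisible by w."""
    C, H, W = F.shape
    # (C, H//w, w, W//w, w) -> (C, H//w, W//w, w, w) -> (C, H*W)
    t = F.reshape(C, H // w, w, W // w, w).transpose(0, 1, 3, 2, 4)
    return t.reshape(C, H * W)

def windowed_unscan(seq, H, W, w):
    """Inverse of windowed_scan: scatter the sequence back onto the grid."""
    C = seq.shape[0]
    t = seq.reshape(C, H // w, W // w, w, w).transpose(0, 1, 3, 2, 4)
    return t.reshape(C, H, W)
```

Running the SSM over such a sequence keeps spatially nearby tokens close in scan order, which is the locality property the windowed variants exploit.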
3. Extensions: Cross-modal, Very Long Sequences, and High-Order
SS2D readily generalizes:
- Cross-modal fusion (CM-SS2D): Alternating scans between modalities (e.g., RGB and thermal) link hidden-state updates across modalities, enabling bidirectional context propagation at linear cost. For each scan direction, the hidden state for the RGB stream at a given step depends on the thermal state from the preceding step, and vice versa (Guo et al., 22 Jun 2025).
- Video and spatiotemporal compression: In BIMBA, SS2D enables bidirectional gated scans over tens of thousands of video tokens to extract a compact set of summary query tokens for LLM processing, with dynamic gating providing information selectivity (Islam et al., 12 Mar 2025).
- High-order interactions: H-vmunet chains multiple SS2D passes (gated/pruned streams) to progressively filter redundant information, integrating local convolutional detail via Local-SS2D modules (Wu et al., 2024).
- Ring or deformable scans: PRISMamba replaces global paths with order-agnostic ring traversals and a radial SSM to achieve rotation invariance (Hsieh et al., 4 Feb 2026); ECMamba uses deformable convolutions to rank token saliency and sorts the scan accordingly (Dong et al., 2024).
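The cross-modal alternation above can be sketched with a single shared state passed back and forth between two modality sequences. This is a hedged toy version with scalar SSM parameters; the interleaving pattern and shared-state handoff are assumptions about the general idea, not CM-SSM's exact formulation.

```python
import numpy as np

def cross_modal_scan(x_rgb, x_th, a, b, c):
    """Alternate one scalar SSM state between two modality sequences.

    x_rgb, x_th: (N,) token sequences; a, b, c: scalar SSM parameters.
    At each step the shared state is refreshed by one modality and then
    read by the other, so every output is conditioned on the other stream.
    """
    N = x_rgb.shape[0]
    h = 0.0
    y_rgb = np.zeros(N)
    y_th = np.zeros(N)
    for k in range(N):
        h = a * h + b * x_th[k]   # thermal writes the shared state
        y_rgb[k] = c * h          # RGB reads a thermal-conditioned state
        h = a * h + b * x_rgb[k]  # RGB writes the shared state
        y_th[k] = c * h           # thermal reads an RGB-conditioned state
    return y_rgb, y_th
```

With `a = 0` the memory collapses and each output reduces to the other modality's current token, which makes the cross-conditioning explicit.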
4. Complexity Analysis and Scalability
A principal advantage of SS2D is its linear time complexity in the number of spatial tokens ($O(HW)$ for an $H \times W$ grid). Each 1D scan costs $O(N d^2)$ with full $d$-dimensional state matrices (often diagonal or low-rank for efficiency, reducing this toward $O(N d)$), and usually a small number of scans (2 for row+column, 4 once diagonal/corner traversals are added) are performed in parallel (Liu et al., 2024, Guo et al., 22 Jun 2025, Chen et al., 13 Jan 2026).
By contrast:
- Convolutions: a $k \times k$ convolution is also linear in $HW$, but with a $k^2$-dependent constant factor.
- Self-attention: $O((HW)^2)$ for full spatial attention, often prohibitive at high resolution.
- Content-aware sorting adds $O(N \log N)$, dominated by the scan cost for typical $N$ (Kim et al., 29 Sep 2025, Dong et al., 2024).
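These asymptotics can be made concrete with a rough per-layer operation count. The constants below are illustrative assumptions (ignoring channel mixing, gating, and kernel-level details), not measured FLOPs of any published model.

```python
def op_counts(H, W, d=16, k=3, C=96):
    """Rough per-layer operation counts for an H x W grid.

    d: SSM state size (diagonal), k: conv kernel size, C: channels.
    Illustrative constants only.
    """
    N = H * W
    return {
        "ss2d_4way": 4 * N * C * d,     # four linear scans, diagonal state
        "conv_kxk": N * C * C * k * k,  # dense k x k convolution
        "attention": N * N * C,         # full spatial self-attention
    }

# growing the resolution 4x multiplies N by 16:
# the scan cost grows 16x, full attention 256x
for res in (56, 112, 224):
    print(res, {name: f"{v:.1e}" for name, v in op_counts(res, res).items()})
```

Quadrupling the side length scales the scan cost by 16 but full attention by 256, which is the gap the linear-complexity claim refers to.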
Empirically, SS2D models (VMamba, LocalMamba, H-vmunet, PRISMamba) outperform or match ViT and CNN baselines on classification, segmentation, and detection while using fewer FLOPs and parameters, with throughput scaling linearly with image resolution (Liu et al., 2024, Huang et al., 2024, Wu et al., 2024, Hsieh et al., 4 Feb 2026).
5. Applications and Empirical Results
SS2D and its variants have been used as the core recurrence module in vision backbone networks, medical image segmentation, video summarization/compression, domain adaptation, semantic correspondence, and robotics:
- Image Classification/Segmentation/Detection: VMamba, LocalMamba, SfMamba, H-vmunet, and PRISMamba report superior or comparable ImageNet Top-1 accuracy with reduced compute, confirming the scalability and global modeling power of SS2D (Liu et al., 2024, Huang et al., 2024, Wu et al., 2024, Chen et al., 13 Jan 2026, Hsieh et al., 4 Feb 2026).
- Long-form Video QA: BIMBA achieves over 10× token compression while maintaining VQA accuracy above competitive LLM-based alternatives (Islam et al., 12 Mar 2025).
- Cross-modal Semantic Segmentation: CM-SSM (with cross-modal SS2D) attains SOTA on RGB-thermal datasets at lower computation (Guo et al., 22 Jun 2025).
- Semantic Correspondence: MambaMatcher leverages similarity-aware SS2D to refine 4D correlation maps efficiently and surpass transformer correlation-based models (Kim et al., 29 Sep 2025).
- Robust/Adaptive Vision: PRISMamba's ring-based SS2D is robust to large input rotations, with higher accuracy and increased throughput relative to fixed-path scan variants (Hsieh et al., 4 Feb 2026).
- Medical Imaging: H-vmunet's high-order SS2D outperforms both vanilla VMamba-based and advanced U-Net variants, with substantial parameter reduction (Wu et al., 2024).
- Exposure Correction: ECMamba's Retinex-guided SS2D (Retinex-SS2D) demonstrates superior PSNR and SSIM for multiple-exposure correction compared to four-way scan approaches (Dong et al., 2024).
6. Limitations and Evolving Research Directions
While SS2D offers compelling trade-offs, several limitations and open research directions remain:
- SS2D imposes scan-induced adjacency; without sufficiently rich scan patterns or local-global fusion, subtle spatial relationships may be under-modeled (Huang et al., 2024, Hsieh et al., 4 Feb 2026).
- Explicit hierarchy (e.g., as in video or multiresolution tasks) may require multi-level or memory-augmented SSMs to handle extremely long-range dependencies (Islam et al., 12 Mar 2025).
- Certain variants, such as simple two-route SS2D in VM-UNet, may bottleneck on complex 2D spatial interactions, motivating multi-head or multi-route fusion (Ji, 2024).
- Dynamic scan pattern selection (e.g., differentiable search in LocalMamba) and adaptive scan order (e.g., sorting by relevance or similarity) are active research frontiers (Huang et al., 2024, Kim et al., 29 Sep 2025).
- Hardware and kernel optimization remain critical for full realization of the theoretical efficiency gains.
7. Variants Outside Perception: Hardware Security and Robotics
The term “Selective Scan” also appears outside neural modeling, notably:
- Hardware Security: SeqL applies selective scan locking to secure scan chains in sequential circuits, providing formal guarantees against SAT and multi-cycle attacks with minor area and power overhead. Its selective scan locking chooses flip-flops for dual-key protection, rendering recovery of functionally correct keys exponentially unlikely for attackers, as demonstrated on ISCAS, MCNC, ITC, and RISC-V designs (Potluri et al., 2020).
- Robotics: here, selective scanning refers to choosing "key scans" for mapping and navigation, clustering star-convex regions into a metric-topological pose graph and using scan-selection policies (frontier, bridging) to ensure provable safety and coverage in real environments (Latha et al., 2024).
References
- (Liu et al., 2024) – VMamba: Visual State Space Model
- (Huang et al., 2024) – LocalMamba: Visual State Space Model with Windowed Selective Scan
- (Wu et al., 2024) – H-vmunet: High-order Vision Mamba UNet for Medical Image Segmentation
- (Guo et al., 22 Jun 2025) – Cross-modal State Space Modeling for Real-time RGB-thermal Wild Scene Semantic Segmentation
- (Islam et al., 12 Mar 2025) – BIMBA: Selective-Scan Compression for Long-Range Video Question Answering
- (Kim et al., 29 Sep 2025) – Similarity-Aware Selective State-Space Modeling for Semantic Correspondence
- (Chen et al., 13 Jan 2026) – SfMamba: Efficient Source-Free Domain Adaptation via Selective Scan Modeling
- (Hsieh et al., 4 Feb 2026) – Partial Ring Scan: Revisiting Scan Order in Vision State Space Models
- (Dong et al., 2024) – ECMamba: Consolidating Selective State Space Model with Retinex Guidance for Efficient Multiple Exposure Correction
- (Ji, 2024) – MHS-VM: Multi-Head Scanning in Parallel Subspaces for Vision Mamba
- (Potluri et al., 2020) – SeqL: Secure Scan-Locking for IP Protection
- (Latha et al., 2024) – Key-Scan-Based Mobile Robot Navigation: Integrated Mapping, Planning, and Control using Graphs of Scan Regions