Cross-Modal 2D-Selective-Scan (CM-SS2D)
- CM-SS2D is a cross-modal technique that fuses heterogeneous sensor data represented as 2D spatial feature maps, using selective scanning to capture long-range dependencies.
- It sequentializes features from modalities such as RGB, thermal, depth, and LiDAR and updates hidden states across them, improving efficiency in semantic segmentation and multi-object discovery.
- The approach achieves linear computational complexity and real-time performance, benefiting applications in robotics, autonomous vehicles, and surveillance.
Cross-Modal 2D-Selective-Scan (CM-SS2D) refers to a family of techniques and neural modules that perform information aggregation and fusion across multiple sensing modalities (e.g., RGB, thermal, LiDAR, depth) by organizing data into sequences suited for state space modeling or related architectures, using spatially ordered selective scans that efficiently and explicitly model long-range dependencies in two-dimensional (2D) domains. These advances have enabled real-time, resource-efficient semantic segmentation, multi-modal object discovery, and robust cross-modal correspondence under challenging sensor and environmental conditions, with demonstrated applications in robotics, autonomous vehicles, and surveillance.
1. Foundations and Motivation
The core challenge in cross-modal visual perception lies in harmoniously integrating signals from heterogeneous sensors—such as RGB cameras, depth maps, LiDAR, or thermal imagers—while maintaining computational and memory efficiency. Classical approaches (e.g., Transformer-based models) often perform dense pairwise attention, which incurs complexity quadratic in the number of spatial tokens. Traditional scan operations in state space models (SSMs) or sequence models are inherently one-dimensional and do not account for the spatial structure of images or the cross-modal interactions required in multi-sensor applications.
CM-SS2D was developed to overcome these bottlenecks. It generalizes the notion of "selective scan" from 1D sequences to 2D spatial feature maps and further extends it across modalities, enabling both efficient long-range dependency modeling and fine-grained cross-modal context propagation. This is achieved by devising scan routes or sequentializations that alternate or interleave different sensor feature channels and updating hidden states reciprocally between modalities.
2. Architectural Design and Methodological Advances
At the heart of CM-SS2D modules are two intertwined methodological pillars:
a) Cross-Modal Sequentialization
Spatial feature maps from each modality are jointly traversed in a spatially ordered sequence (commonly across four or more canonical scanning directions), often interleaving features such that, for instance, the order is {r_k, t_k, r_{k+1}, t_{k+1}, ...} for RGB (r) and thermal (t) modalities (Guo et al., 22 Jun 2025). This cross-modal spatial scanning ensures each sequential state receives information alternately from different modalities, providing a rich contextual basis for state propagation and global feature synthesis.
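A minimal sketch of this token-wise interleaving for a single row-major scan route (the four-direction variant repeats the same operation on transposed and reversed layouts); tensor names and shapes here are illustrative assumptions, not the published implementation:

```python
import torch

def interleave_modalities(feat_r: torch.Tensor, feat_t: torch.Tensor) -> torch.Tensor:
    """Flatten two (B, C, H, W) feature maps in row-major order and interleave
    them token-wise as {r_1, t_1, r_2, t_2, ...} along the sequence axis."""
    B, C, H, W = feat_r.shape
    seq_r = feat_r.flatten(2).transpose(1, 2)   # (B, H*W, C)
    seq_t = feat_t.flatten(2).transpose(1, 2)   # (B, H*W, C)
    inter = torch.stack((seq_r, seq_t), dim=2)  # (B, H*W, 2, C): pair r_k with t_k
    return inter.reshape(B, 2 * H * W, C)       # alternating modality tokens

# Example: 8x8 feature maps from an RGB and a thermal branch.
r = torch.randn(1, 32, 8, 8)
t = torch.randn(1, 32, 8, 8)
print(interleave_modalities(r, t).shape)  # torch.Size([1, 128, 32])
```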
b) Cross-Modal State Updates
Once cross-modal feature sequences are established, state transitions are computed using both the current input (from one modality) and the most recent hidden state (from the complementary modality). Formally, following the selective-scan recurrence, the updates can be expressed as
$$h_k^{r} = \bar{A}_k\, h_{k-1}^{t} + \bar{B}_k\, r_k, \qquad y_k^{r} = C_k\, h_k^{r},$$
and, symmetrically, for the other modality (Guo et al., 22 Jun 2025). Here, $\bar{A}_k$, $\bar{B}_k$, and $C_k$ are dynamically learned parameters shaped by the cross-modal input, and hidden states propagate contextual information directionally and across modalities.
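A sequential (deliberately unoptimized) sketch of this recurrence over an interleaved sequence, assuming a diagonal per-channel state and pre-computed input-dependent parameters; real implementations parallelize the scan with custom kernels, and the names below are illustrative:

```python
import torch

def cross_modal_selective_scan(x, A_bar, B_bar, C):
    """Sequential selective scan over an interleaved cross-modal sequence.

    x, A_bar, B_bar, C: tensors of shape (B, L, D), where x holds the
    interleaved tokens {r_1, t_1, r_2, t_2, ...} and A_bar, B_bar, C stand
    in for the input-dependent, discretized SSM parameters.  Because
    consecutive tokens come from different modalities, each update mixes
    the current modality's input with the complementary modality's state.
    """
    B_sz, L, D = x.shape
    h = x.new_zeros(B_sz, D)                          # shared hidden state
    ys = []
    for j in range(L):
        h = A_bar[:, j] * h + B_bar[:, j] * x[:, j]   # h_j = A̅_j h_{j-1} + B̅_j x_j
        ys.append(C[:, j] * h)                        # y_j = C_j h_j
    return torch.stack(ys, dim=1)                     # (B, L, D)

# Usage with the interleaved sequence from above (random parameters for illustration):
# seq = interleave_modalities(r, t)
# y = cross_modal_selective_scan(seq, *[torch.rand_like(seq) for _ in range(3)])
```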
Subsequent scan fusion (e.g., cross-merge) combines outputs from all scan routes to reconstruct a 2D feature map at each spatial position, synthesizing context from all traversals.
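A rough illustration of such a cross-merge, assuming each route's output has already been mapped back to a common row-major order and that a simple position-wise sum suffices (the actual fusion may differ):

```python
import torch

def cross_merge(route_outputs, H, W):
    """Fuse per-route scan outputs back into a 2D feature map.

    route_outputs: list of (B, H*W, D) tensors, one per scan route, each
    already re-ordered into a common row-major layout (reversed routes
    flipped back).  Outputs are summed position-wise and reshaped to 2D.
    """
    fused = torch.stack(route_outputs, dim=0).sum(dim=0)   # (B, H*W, D)
    B, _, D = fused.shape
    return fused.transpose(1, 2).reshape(B, D, H, W)       # (B, D, H, W)
```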
3. Integration with State Space and Related Models
CM-SS2D builds on the success of state space models (SSMs) such as Mamba, VMamba, and their vision variants (Liu et al., 18 Jan 2024, Ji, 10 Jun 2024). The 2D-Selective-Scan (SS2D) adopted in VMamba first unfolds images into non-overlapping patches and then performs selective scan recurrences along multiple spatial directions. CM-SS2D generalizes this to cross-modal cases, enforcing that state transitions across scan routes are not just spatially but also modality-aware.
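For intuition, the sketch below generates the four canonical scan orderings over a flattened patch grid; a scan route is then just an index permutation applied to the token sequence (illustrative bookkeeping only, not VMamba's CUDA implementation):

```python
import torch

def four_direction_orders(H: int, W: int):
    """Return four index permutations of a flattened H*W patch grid:
    row-major, column-major, and their reverses, matching the four
    canonical SS2D scan routes."""
    idx = torch.arange(H * W).reshape(H, W)
    row_major = idx.flatten()            # left-to-right, top-to-bottom
    col_major = idx.t().flatten()        # top-to-bottom, left-to-right
    return [row_major,
            col_major,
            row_major.flip([0]),         # reversed row-major
            col_major.flip([0])]         # reversed column-major

# A scan route is applied as: seq_route = seq[:, order, :]
orders = four_direction_orders(4, 4)
print([o[:4].tolist() for o in orders])
```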
Key architectural innovations include:
- Efficient hardware-oriented implementation (custom CUDA kernels; layout optimizations)
- Parallel multi-head scanning in lower-dimensional subspaces (MHS), where each head scans along distinct spatial paths, and outputs are integrated with attention mechanisms that emphasize patches exhibiting variability across scan routes (as measured by a coefficient of variation, or CV) (Ji, 10 Jun 2024); a sketch of this CV-based weighting follows this list
- Association modules (e.g., CM-SSA) to merge global (SS2D- or CM-SS2D-derived) features with local convolutions for spatially precise decoding (Guo et al., 22 Jun 2025)
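A rough sketch of scan-route weighting driven by the coefficient of variation; the exact attention form in MHS-VM differs in detail, and the shapes and sigmoid gating here are assumptions made for illustration:

```python
import torch

def cv_route_attention(route_feats: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Emphasize patches that vary across scan routes.

    route_feats: (R, B, L, D) outputs of R scan routes for L patches.
    The coefficient of variation (std / mean magnitude) across routes is
    turned into a per-patch weight applied to the route-averaged features.
    """
    mean = route_feats.mean(dim=0)                          # (B, L, D)
    std = route_feats.std(dim=0)                            # (B, L, D)
    cv = std / (mean.abs() + eps)                           # coefficient of variation
    weight = torch.sigmoid(cv.mean(dim=-1, keepdim=True))   # (B, L, 1) patch weight
    return mean * weight                                    # emphasized fused features
```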
4. Cross-Modal Fusion Strategies and Application Domains
CM-SS2D techniques have been instantiated in several domains and fusion strategies:
- RGB–Thermal Semantic Segmentation: The CM-SS2D module sequentializes RGB and thermal features, computing hidden states that efficiently integrate both global context and local spatial details. Integration with local convolutions via association modules achieves state-of-the-art mean Intersection over Union (mIoU) and real-time processing capability on benchmarks such as CART and PST900 (Guo et al., 22 Jun 2025).
- RGB–D Saliency Detection: Approaches using cross-modality feature modulation (cmFM) and adaptive feature selection (AFS) refine feature representations by modulating one modality (e.g., RGB) with affine parameters derived from the other (e.g., depth) and selectively fusing via channel and spatial gating (Li et al., 2020); a sketch of this affine modulation appears after this list.
- LiDAR–RGB Cross-Modal Segmentation: Cross-modal selective scan underpins models such as CoMoDaL, which leverage cross-domain distillation and guidance constraints (e.g., prototype-to-pixel and point-to-pixel alignments) for unsupervised 3D semantic segmentation from 2D annotated images (Chen et al., 2023).
- Unsupervised Multi-Object Discovery: In multi-object discovery, cross-modal distillation with late fusion enables robust object localization from 2D motion cues projected onto 3D geometry, with each modality learning to compensate for the other's deficiencies (Lahlali et al., 19 Mar 2025).
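As a rough sketch of the RGB–D feature-modulation idea above, in the spirit of FiLM-style affine conditioning; the module name, pooling choice, and layer shapes are assumptions rather than the cmFM implementation:

```python
import torch
import torch.nn as nn

class CrossModalModulation(nn.Module):
    """Modulate RGB features with affine parameters predicted from depth.

    Depth features are pooled and mapped to per-channel scale (gamma) and
    shift (beta), which are applied to the RGB feature map.
    """
    def __init__(self, channels: int):
        super().__init__()
        self.to_gamma_beta = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, 2 * channels, kernel_size=1),
        )

    def forward(self, feat_rgb: torch.Tensor, feat_depth: torch.Tensor) -> torch.Tensor:
        gamma, beta = self.to_gamma_beta(feat_depth).chunk(2, dim=1)  # (B, C, 1, 1) each
        return feat_rgb * (1 + gamma) + beta

# Example usage
mod = CrossModalModulation(32)
out = mod(torch.randn(1, 32, 16, 16), torch.randn(1, 32, 16, 16))
print(out.shape)  # torch.Size([1, 32, 16, 16])
```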
These methods consistently demonstrate that precise scan ordering and attention-based fusion strategies can yield robustness to modality-specific noise, partial data, and sparsity while maintaining computational tractability.
| Application Domain | Input Modalities | CM-SS2D Role/Benefit |
|---|---|---|
| Wild Scene Segmentation | RGB + Thermal | Contextual/global fusion via SSM scan |
| Salient Object Detection | RGB + Depth | Feature modulation & selection |
| LiDAR Target Segmentation | LiDAR + 2D RGB images | Cross-domain scan alignment |
| Multi-Object Discovery | 2D + 3D (projected) | Distillation, late fusion |
5. Empirical Performance and Computational Efficiency
Architectures featuring CM-SS2D demonstrate consistent performance and efficiency improvements:
- Segmentation: CM-SSM attains an mIoU of 74.6% on CART with only 12.59M parameters and 10.34 GFLOPs, running at 114 FPS on an RTX 4090 (Guo et al., 22 Jun 2025).
- Scaling: VMamba and MHS-VM show robust performance as input resolution increases, sustaining throughput where attention-based models degrade (Liu et al., 18 Jan 2024, Ji, 10 Jun 2024).
- Parameter/Compute Reduction: Substituting multi-head scan (MHS) modules for baseline SS2D blocks in VM-UNet reduces model size and FLOPs by ~50% while improving segmentation accuracy (Ji, 10 Jun 2024).
- Generalizability: CM-SS2D-based models generalize well across domains (e.g., wild and underground scenes) (Guo et al., 22 Jun 2025).
6. Comparative Analysis and Innovations
Relative to prior dense self-attention and early/late fusion strategies, CM-SS2D emphasizes:
- Explicit cross-modal sequence construction during scan traversal
- Recurrence-based state updates that inherently propagate cross-modal context
- Linear computational complexity w.r.t. spatial size, crucial for high-resolution and resource-constrained deployment
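For a concrete (illustrative) sense of scale: on a 1/8-resolution feature map of 80×60 tokens, dense pairwise attention grows with the squared token count, whereas a selective scan grows linearly in it,
$$N = 80 \times 60 = 4800, \qquad N^2 \approx 2.3 \times 10^{7} \ \text{pairwise attention terms} \quad \text{vs.} \quad N \cdot d_{\text{state}} \ \text{scan state updates},$$
so doubling the spatial resolution (2× in each dimension) multiplies the attention cost by roughly 16 but the scan cost by only 4.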
Additional innovations include scan route attention mechanisms (SRA) that leverage intra-patch response variability for improved structure extraction (Ji, 10 Jun 2024), and strategies where cross-modal pseudo-label supervision (e.g., from 2D optical flow to 3D point clouds) supplements weak or missing annotations (Lahlali et al., 19 Mar 2025).
7. Applications, Implications, and Future Directions
CM-SS2D is foundational in robotic perception, autonomous navigation, and multi-sensor surveillance, especially where efficiency, robustness to missing/weak sensor channels, and cross-modal generalization are required. The paradigm enables end-to-end trainable, real-time models for context-rich tasks without the computational burden of Transformer-based fusion.
Potential future directions include:
- Extending CM-SS2D paradigms to more than two modalities and to temporally ordered data
- Joint scan strategies that integrate semantic priors or task-driven constraints directly into scan routes or state update equations
- Incorporating dynamic scan route adaptation informed by scene context or external factors
Cross-modal 2D-selective-scan modules continue to advance the field by providing a principled and computationally efficient foundation for integrating heterogeneous sensor data for modern computer vision systems.