State Space Channel Swapping (SSCS)
- State Space Channel Swapping (SSCS) is a neural network module for shallow fusion that interleaves RGB and infrared channels to mitigate modality disparity.
- It employs a lightweight Visual State Space (VSS) block with directional 1D scanning to aggregate spatial and contextual features from mixed-channel tensors.
- Empirical studies on the FLIR-Aligned dataset show SSCS improves mAP performance, underscoring its effectiveness in early cross-modal alignment.
State Space Channel Swapping (SSCS) is a neural network module designed for shallow, semantically meaningful cross-modal fusion of feature representations—specifically developed for combining information from RGB and infrared modalities in object detection tasks. SSCS operates as the initial sub-block within each Fusion-Mamba Block (FMB), preceding deeper state-space fusion mechanisms, with the goal of explicitly reducing modality disparities that arise from inherent differences in image style, texture, luminance, and noise patterns between modalities. By interleaving feature channels and embedding them into a lightweight state-space model, SSCS enables downstream components to process features that are already partially aligned in representation, thereby facilitating more effective and robust cross-modality fusion (Dong et al., 2024).
1. Motivation and Purpose
The principal challenge addressed by SSCS is the pervasive issue of modality disparity. Conventional fusion strategies in CNN- and Transformer-based cross-modal detection—such as channel-wise concatenation or summation of raw features—are susceptible to domination or corruption by the modality with stronger signals, due to incompatible feature statistics (e.g., textures and noise). SSCS targets this bottleneck by introducing controlled channel mixing as an explicit, shallow fusion mechanism. The aim is to ensure that each modality’s early-stage features are already influenced by their counterpart, reducing the alignment burden on deeper, more complex fusion stages (such as the Dual State Space Fusion, DSSF).
2. Mathematical Formulation and Channel-Swapping Mechanism
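At a particular stage $i$, let $F_{R_i}$ and $F_{IR_i} \in \mathbb{R}^{B \times C \times H \times W}$ denote the feature maps extracted from RGB and infrared modalities, respectively, where $B$ is batch size, $C$ the number of channels, and $H \times W$ the spatial dimensions.
The SSCS channel-swapping operation proceeds as follows:
- Each feature map is split into four channel groups:
$$F_{R_i} = [F_{R_i}^{1}, F_{R_i}^{2}, F_{R_i}^{3}, F_{R_i}^{4}], \qquad F_{IR_i} = [F_{IR_i}^{1}, F_{IR_i}^{2}, F_{IR_i}^{3}, F_{IR_i}^{4}],$$
with each $F_{R_i}^{k}, F_{IR_i}^{k} \in \mathbb{R}^{B \times C/4 \times H \times W}$.
- The channel-swapping operator $\mathrm{CS}$ forms two mixed-channel tensors:
$$T_{R_i} = \mathrm{CS}(F_{R_i}, F_{IR_i}) = [F_{R_i}^{1}, F_{IR_i}^{2}, F_{R_i}^{3}, F_{IR_i}^{4}], \qquad T_{IR_i} = \mathrm{CS}(F_{IR_i}, F_{R_i}) = [F_{IR_i}^{1}, F_{R_i}^{2}, F_{IR_i}^{3}, F_{R_i}^{4}]$$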
- This process ensures that exactly half of the feature channels in each modality are swapped at this early stage.
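The split-and-swap step can be illustrated with a minimal NumPy sketch. It assumes the mixed tensors alternate channel groups (first and third from a modality's own features, second and fourth from the other modality); that group order and the function name `channel_swap` are illustrative assumptions rather than a reference implementation.

```python
import numpy as np

def channel_swap(f_rgb, f_ir):
    """Exchange the 2nd and 4th channel groups of two (B, C, H, W) maps.

    Assumption: groups alternate between modalities; the group order is
    chosen for illustration only.
    """
    if f_rgb.shape != f_ir.shape:
        raise ValueError("both modalities must share one shape")
    if f_rgb.shape[1] % 4 != 0:
        raise ValueError("channel count C must be divisible by 4")
    # Split each map into four equal groups along the channel axis.
    r1, r2, r3, r4 = np.split(f_rgb, 4, axis=1)
    i1, i2, i3, i4 = np.split(f_ir, 4, axis=1)
    # Each mixed tensor keeps half of its own channels and takes half
    # from the other modality.
    t_rgb = np.concatenate([r1, i2, r3, i4], axis=1)
    t_ir = np.concatenate([i1, r2, i3, r4], axis=1)
    return t_rgb, t_ir

# Toy inputs: RGB features are all zeros, infrared features all ones,
# so the origin of every channel is visible in the output.
f_rgb = np.zeros((2, 8, 4, 4))
f_ir = np.ones((2, 8, 4, 4))
t_rgb, t_ir = channel_swap(f_rgb, f_ir)
per_channel_rgb = t_rgb[0, :, 0, 0].tolist()
per_channel_ir = t_ir[0, :, 0, 0].tolist()
```

With `C = 8`, each group holds two channels, so `per_channel_rgb` shows two RGB channels, two infrared channels, and so on, while the tensor shape is unchanged.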
3. Visual State Space Block (VSS) Integration
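The channel-swapped features $T_{R_i}$ and $T_{IR_i}$ are then fed into a lightweight Visual State Space (VSS) block:
$$\tilde{F}_{R_i} = \mathrm{VSS}(T_{R_i}), \qquad \tilde{F}_{IR_i} = \mathrm{VSS}(T_{IR_i})$$
VSS processes each mixed-channel map through four directional 1D scans (top–down, bottom–up, left–right, right–left) using the Mamba-based S6 block as its core. These directional streams enable the model to aggregate spatial and contextual information efficiently. After separate directional processing, the results are merged back into a 2D feature map, resulting in fused features $\tilde{F}_{R_i}$ and $\tilde{F}_{IR_i}$. Writing $\mathrm{CS}$ for the channel-swapping operator and $F_{R_i}$, $F_{IR_i}$ for the RGB and infrared inputs, the SSCS operation at stage $i$ is fully described by:
$$\tilde{F}_{R_i} = \mathrm{VSS}\big(\mathrm{CS}(F_{R_i}, F_{IR_i})\big), \qquad \tilde{F}_{IR_i} = \mathrm{VSS}\big(\mathrm{CS}(F_{IR_i}, F_{R_i})\big)$$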
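The four-direction scanning pattern can be sketched in NumPy as follows. This is a toy illustration only: `toy_scan` is a hypothetical fixed-decay recurrence standing in for the learned, input-dependent S6 block, and merging the four streams by summation is an assumption.

```python
import numpy as np

def toy_scan(seq, decay=0.5):
    """Hypothetical stand-in for S6: h_t = decay * h_{t-1} + x_t along axis 0."""
    out = np.empty_like(seq)
    h = np.zeros_like(seq[0])
    for t in range(seq.shape[0]):
        h = decay * h + seq[t]
        out[t] = h
    return out

def four_direction_scan(x, scan=toy_scan):
    """Scan a (C, H, W) map in four directions and sum the results."""
    x = np.asarray(x, dtype=float)
    c, h, w = x.shape
    # Row-major flattening: left-to-right within each row, rows top to bottom.
    rows = x.reshape(c, h * w).T                      # (H*W, C)
    # Column-major flattening: top-to-bottom within each column.
    cols = x.transpose(0, 2, 1).reshape(c, h * w).T   # (H*W, C)

    y_lr = scan(rows)                 # left-to-right
    y_rl = scan(rows[::-1])[::-1]     # right-to-left, restored to original order
    y_td = scan(cols)                 # top-down
    y_bu = scan(cols[::-1])[::-1]     # bottom-up, restored to original order

    def rows_to_map(seq):
        return seq.T.reshape(c, h, w)

    def cols_to_map(seq):
        return seq.T.reshape(c, w, h).transpose(0, 2, 1)

    # Merge the four streams back into one 2D feature map (summation assumed).
    return rows_to_map(y_lr) + rows_to_map(y_rl) + cols_to_map(y_td) + cols_to_map(y_bu)

# A single impulse at the centre of a 3x3 map spreads along all four directions.
x = np.zeros((1, 3, 3))
x[0, 1, 1] = 1.0
merged = four_direction_scan(x)
```

Because each scan is a 1D recurrence over `H*W` positions, the cost grows linearly with the number of pixels, which is the property the VSS block relies on.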
4. Architectural and Implementation Considerations
The SSCS module imposes several constraints and design practices:
- Intermediate tensors retain the original shape.
- The channel count $C$ must be divisible by 4 to enable uniform channel-splitting.
- The VSS block uses directional processing with Mamba S6, an internal hidden dimension (typically $C$ or a small multiple), depth-wise convolutions, SiLU activations, and layer normalization, among other standard components.
- SSCS introduces no explicit learned gating; its parameterization exists entirely within the VSS’s state matrices ($A$, $B$, $C$) and projection weights.
- Initialization of all linear/convolutional layers follows standard Xavier or Kaiming routines (as implemented in PyTorch for convolution).
- The forward pass can be described succinctly in the sequence: input splitting, channel swapping, parallel VSS mapping, output of two shallow fused feature maps.
- The two SSCS output maps ($\tilde{F}_{R_i}$ and $\tilde{F}_{IR_i}$) are directly consumed by the DSSF module, whose role is deep fusion using gating weights and projection into a latent state space.
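The constraints and forward sequence listed above can be put together in a short NumPy sketch. The class name `SSCSSketch`, the alternating-group mask, and the identity default for the VSS step are all hypothetical; whether the two branches share VSS weights is not stated here, so the sketch simply accepts one callable per branch.

```python
import numpy as np

class SSCSSketch:
    """Illustrative forward pass: validate shapes, swap channels, apply VSS."""

    def __init__(self, channels, vss_rgb=None, vss_ir=None):
        if channels % 4 != 0:
            raise ValueError("channel count must be divisible by 4")
        self.channels = channels
        # Group index 0..3 for every channel; odd groups (the 2nd and 4th
        # quarters) come from the other modality. This order is an assumption.
        group = np.arange(channels) // (channels // 4)
        self.take_other = (group % 2 == 1)[None, :, None, None]
        identity = lambda t: t  # placeholder for a real VSS block
        self.vss_rgb = vss_rgb if vss_rgb is not None else identity
        self.vss_ir = vss_ir if vss_ir is not None else identity

    def __call__(self, f_rgb, f_ir):
        if f_rgb.shape != f_ir.shape or f_rgb.shape[1] != self.channels:
            raise ValueError("inputs must both have shape (B, channels, H, W)")
        # Channel swapping expressed as a broadcast mask over the channel axis.
        t_rgb = np.where(self.take_other, f_ir, f_rgb)
        t_ir = np.where(self.take_other, f_rgb, f_ir)
        # Parallel VSS mapping of the two mixed tensors.
        out_rgb, out_ir = self.vss_rgb(t_rgb), self.vss_ir(t_ir)
        # Intermediate and output tensors keep the original shape.
        if out_rgb.shape != f_rgb.shape or out_ir.shape != f_ir.shape:
            raise ValueError("VSS step must preserve the input shape")
        return out_rgb, out_ir

sscs = SSCSSketch(channels=8)
out_rgb, out_ir = sscs(np.zeros((1, 8, 2, 2)), np.ones((1, 8, 2, 2)))
```

The two returned maps have the same shape as the inputs, which is what allows a downstream fusion module to consume them without any reshaping.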
5. Empirical Performance and Synergy with Downstream Fusion
In ablation studies on the FLIR-Aligned dataset, eliminating SSCS (while retaining DSSF) reduces $\mathrm{mAP}_{50}$ from 84.9% to 82.9% (a drop of 2.0 percentage points) and the overall mAP by 1.1 percentage points. This indicates that explicit, shallow channel swapping combined with a state-space block improves early cross-modality consistency and boosts detection performance, even before deeper fusion layers operate. The decrement observed when removing SSCS is smaller than when removing DSSF, which suggests that the two modules are complementary: shallow channel mixing and deep state-space gating each contribute to modality alignment (Dong et al., 2024).
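As a small worked check on the quoted ablation figures (84.9% with SSCS, 82.9% without), the snippet below separates the absolute drop in percentage points from the relative drop. The variable names are illustrative.

```python
with_sscs = 84.9      # score with SSCS, in percent (from the ablation above)
without_sscs = 82.9   # score with SSCS removed, in percent

# Absolute difference, in percentage points.
abs_drop = round(with_sscs - without_sscs, 1)                        # 2.0
# Relative difference, as a percentage of the original score.
rel_drop = round(100 * (with_sscs - without_sscs) / with_sscs, 2)    # 2.36
```

The reported "2.0" is therefore a difference in percentage points; relative to the full model's score it corresponds to a reduction of roughly 2.4%.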
6. Advantages, Limitations, and Prospective Directions
SSCS offers the following advantages:
- Reduces raw modality bias via early, explicit interleaving of channels.
- Employs a lightweight, linear-cost state-space fusion layer that builds cross-modal context efficiently.
However, the method presents certain design choices and limitations:
- Fixed 50/50 channel swapping may not be optimal for all settings; more adaptive or learnable permutations could further alleviate modality disparities.
- VSS introduces additional computational overhead, which could be critical in resource-constrained deployments.
Potential future directions include developing learnable channel allocation (adaptive splits or permutations), incorporating early gating structures to dynamically control swapping magnitude, or integrating shallow cross-modal attention heads to refine the mixed channels prior to state-space mapping.
7. Context within Cross-Modality Fusion Architectures
The introduction of SSCS marks a novel approach within the context of cross-modality detection architectures. Unlike standard schemes that rely on late fusion or global attention over fully concatenated modalities, SSCS strategically performs controlled, shallow interleaving and state-space encoding at the outset. This reduces the subsequent burden on deeper fusion mechanisms (such as the DSSF gating net within FMB) and leads to a measurable improvement in performance benchmarks. SSCS’s principled combination of explicit channel mixing and compact state-space modeling establishes a robust paradigm for harmonizing multi-branch feature representations prior to deeper, gated cross-modal refinement (Dong et al., 2024).