Cross-Scan Module (CSM) for Global Context
- CSM is a state-space-model-driven primitive that efficiently encodes long-range spatial dependencies via structured, sequential scans.
- It significantly reduces computational cost compared to quadratic self-attention by processing ordered sequences in linear time.
- CSM has been applied in hyperspectral image classification and multi-modality medical registration, improving accuracy and efficiency.
The Cross-Scan Module (CSM) is a state-space-model-driven architectural primitive that facilitates efficient global context modeling over images and volumetric data via structured scan operations. By linearly traversing spatial or spatiotemporal domains with axis-aligned or corner-derived sequences, CSM achieves high information throughput and efficient context aggregation at substantially reduced computational cost compared to attention-based paradigms. Its variants are central to recent advances in hyperspectral image (HSI) analysis and multi-modality medical registration, where spatial continuity, centric feature aggregation, and memory efficiency are critical.
1. Motivation and Fundamental Principle
CSM originates from the need to efficiently encode long-range, structured dependencies in image and volume data, where vanilla 2D convolutions are locality-bound and self-attention incurs quadratic cost. In hyperspectral image classification, as in "Mamba-in-Mamba" (Zhou et al., 2024), large and redundant receptive fields often obscure the semantic focus for patch-centric tasks, while in medical registration ("VMambaMorph" (Wang et al., 2024)) global context across depth/height/width axes is essential to align correspondences across modalities and scales.
Central to CSM is the decomposition of dense 2D or 3D neighborhoods into a small set of scan-based ordered sequences—either following corner-to-center "snake" traversals or parallel axis-aligned lines—so that each sequence can be processed by an efficient linear-time State Space Model (SSM) such as Mamba, capturing contextual dependencies across the scan while preserving spatial structure.
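To make the traversal concrete, the corner-anchored snake orderings can be generated in a few lines of NumPy. The helper `snake_flatten_order` below is illustrative, not code from either paper; note that for odd p every ordering visits the central pixel exactly at the sequence midpoint, which is the property the centralized split exploits.

```python
import numpy as np

def snake_flatten_order(p, corner="TL"):
    """Visit order (flat indices) of a p×p grid, snake-scanned from a corner."""
    idx = np.arange(p * p).reshape(p, p)
    if corner in ("TR", "BR"):            # start on the right edge
        idx = idx[:, ::-1]
    if corner in ("BL", "BR"):            # start on the bottom edge
        idx = idx[::-1]
    snake = idx.copy()
    snake[1::2] = snake[1::2, ::-1].copy()  # reverse alternate rows
    return snake.ravel()

order = snake_flatten_order(3, "TL")
print(order)  # [0 1 2 5 4 3 6 7 8]
```

For p = 3, all four orderings place the central pixel (flat index 4) at position 4, i.e. the middle of the length-9 sequence.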
2. Architectural Variants and Processing Flow
Two principal CSM architectures dominate the literature:
a. Centralized Mamba–Cross–Scan (MCS) in HSI Classification
For each p×p×C patch X, MCS generates four "snake-flattened" sequences, each starting from a distinct patch corner (Type-1 through Type-4: TL→BR, TR→BL, BL→TR, BR→TL). Each full sequence S^(t) of length p² is split exactly in half about the central pixel, producing "forward" and "backward" subsequences (S_f^(t), S_b^(t)), each of length ⌈p²/2⌉ and ending on the center. This yields eight total streams, each processed by a Mamba (S6) SSM block; pairs of forward/backward outputs are merged, and the results of the four scan types are fused (using learnable fusion weights) before being consumed by downstream tokenization modules (Zhou et al., 2024).
b. Axis-wise Cross-Scan in VMamba and VMambaMorph
Given 2D features U ∈ R^{C×H×W} or 3D features U ∈ R^{C×D×H×W}, CSM interleaves state-space recurrences along each axis (horizontal and vertical in 2D; depth, height, and width in 3D), so that each axis is traversed independently with a linear kernel. Outputs from the axis-wise scans are fused (typically summation plus a nonlinearity, or a further projection), optionally followed by residual mixing. In 3D, all three axes are covered, and the CSM module is inserted into the U-Net architecture at every encoder/decoder depth (Wang et al., 2024).
3. Mathematical Formalism and Pseudocode
Corner-to-Center Centralized MCS:
- For a patch X ∈ R^{p×p×C} with odd p:
- Snake-flattening along the path from corner t ∈ {1, 2, 3, 4} gives S^(t) ∈ R^{p²×C}.
- Split:
  S_f^(t) = S^(t)[1:L],   S_b^(t) = reverse(S^(t)[p²−L+1 : p²]),   L = (p²+1)/2,
  where S_f^(t), S_b^(t) ∈ R^{L×C}, both ending at the center pixel.
- Each S_f^(t), S_b^(t) is passed through a Mamba/SSM; outputs are merged and projected:
  y^(t) = ½ (Y_f^(t)[L] + Y_b^(t)[L]),   Y = Σ_t w_t · y^(t),
  with Y_f^(t), Y_b^(t) the post-SSM sequences, y^(t) the average center output, and w_t the learnable fusion weights.
Pseudocode for Centralized MCS:
def MCS_Patch(X, p):
    scans = []
    for corner in {TL, TR, BL, BR}:
        S_full = SnakeFlatten(X, start=corner)
        L = (p * p + 1) // 2
        S_f = S_full[0:L]                       # ends at center
        S_b = reverse(S_full[p * p - L:p * p])  # ends at center
        scans.append((S_f, S_b))
    return scans  # four (forward, backward) pairs
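The pseudocode leaves `SnakeFlatten` abstract. A minimal runnable sketch for the top-left corner (illustrative names, odd p assumed), which also checks the invariant that both subsequences end on the central pixel:

```python
import numpy as np

def snake_flatten(X):
    """Snake-flatten a p×p×C patch row-wise from the top-left corner."""
    Xs = X.copy()
    Xs[1::2] = Xs[1::2, ::-1].copy()    # reverse alternate rows
    return Xs.reshape(-1, X.shape[-1])  # (p*p, C)

def centralized_split(X):
    """Split a snake-flattened sequence into forward/backward halves,
    both ending on the central pixel (odd p assumed)."""
    p = X.shape[0]
    S = snake_flatten(X)
    L = (p * p + 1) // 2
    S_f = S[:L]                  # forward: start corner → center
    S_b = S[p * p - L:][::-1]    # backward: opposite corner → center
    return S_f, S_b

p, C = 5, 4
X = np.random.randn(p, p, C)
S_f, S_b = centralized_split(X)
center = X[p // 2, p // 2]
assert np.allclose(S_f[-1], center) and np.allclose(S_b[-1], center)
```

The other three corners follow by mirroring the patch before flattening, as in the `snake_flatten_order` construction.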
Axis-aligned CSM3D in VMambaMorph:
def CSM3D(U):  # U: B×C′×D×H×W
    Up = Conv1x1x1_proj(U)
    h_d = h_h = h_w = 0  # initial hidden states
    # Depth-wise scan
    for k in 1…D:
        x_d = flatten_spatial(Up[:, :, k, :, :])
        h_d = A_d @ h_d + B_d @ x_d
        Y_d[:, :, k, :, :] = reshape(C_d @ h_d, (B, C', H, W))
    # Height-wise scan
    for i in 1…H:
        x_h = flatten_spatial(Up[:, :, :, i, :])
        h_h = A_h @ h_h + B_h @ x_h
        Y_h[:, :, :, i, :] = reshape(C_h @ h_h, (B, C', D, W))
    # Width-wise scan
    for j in 1…W:
        x_w = flatten_spatial(Up[:, :, :, :, j])
        h_w = A_w @ h_w + B_w @ x_w
        Y_w[:, :, :, :, j] = reshape(C_w @ h_w, (B, C', D, H))
    V = σ(Y_d + Y_h + Y_w)
    Vp = Conv1x1x1_out(V)
    return U + Vp  # residual connection
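Each per-axis loop above is an ordinary discrete linear SSM recurrence, h_t = A h_{t−1} + B x_t with readout y_t = C h_t. A minimal NumPy sketch with random (untrained) parameters, shown only to make the linear-time scan explicit:

```python
import numpy as np

def ssm_scan(x, A, B, C):
    """Linear-time SSM recurrence along the sequence axis.
    x: (T, c_in); returns y: (T, c_out)."""
    h = np.zeros(A.shape[0])
    ys = []
    for t in range(x.shape[0]):
        h = A @ h + B @ x[t]   # state update
        ys.append(C @ h)       # readout
    return np.stack(ys)

T, c_in, d, c_out = 16, 8, 4, 8
rng = np.random.default_rng(0)
A = 0.9 * np.eye(d)                       # stable diagonal transition
B = 0.1 * rng.standard_normal((d, c_in))
C = rng.standard_normal((c_out, d))
y = ssm_scan(rng.standard_normal((T, c_in)), A, B, C)
print(y.shape)  # (16, 8)
```

Mamba's selective (S6) variant makes A, B, C input-dependent, but the per-step cost remains constant, so each axis scan stays linear in sequence length.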
4. Computational Complexity and Comparison
CSM modules operate in linear time with respect to input size:
- Corner-to-center MCS (MiM): 4 snake-flatten passes over p² pixels (O(p²) each), split into 8 streams of length ⌈p²/2⌉, each processed by an SSM in time linear in its length; this yields only a modest constant-factor overhead over raster-scan flattening.
- Axis-wise CSM (VMamba/VMambaMorph): 2D: O(HW); 3D: O(DHW), as each spatial/volumetric dimension is traversed sequentially. Significantly more efficient than attention layers, which scale as O((HW)²) in 2D or O((DHW)²) in 3D (Wang et al., 2024, Zhou et al., 2024).
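A quick back-of-the-envelope comparison of the two scalings (counting only token interactions, with all constants and channel dimensions omitted):

```python
# Rough operation counts: axis-wise CSM grows linearly with volume size,
# while full self-attention grows quadratically in the token count.
def csm_ops_3d(D, H, W):
    return 3 * D * H * W      # one linear pass per axis

def attn_ops_3d(D, H, W):
    n = D * H * W
    return n * n              # pairwise token interactions

for size in (16, 32, 64):
    n_csm = csm_ops_3d(size, size, size)
    n_attn = attn_ops_3d(size, size, size)
    print(f"{size}^3 volume: CSM {n_csm:,} vs attention {n_attn:,} "
          f"(~{n_attn // n_csm:,}x)")
```

Even at a modest 64³ volume the attention count exceeds the CSM count by several orders of magnitude, which is the source of the "practical hardware budgets" implication below.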
A plausible implication is that for large images or volumes, CSM-based architectures can accommodate global-range modeling under practical hardware budgets, in contrast to transformer-based counterparts where quadratic cost rapidly becomes prohibitive.
5. Empirical Evaluations, Ablations, and Best Practices
Empirical ablations on HSI classification demonstrate that:
- Employing all four scan directions in MCS yields clear performance improvements—e.g., overall accuracy (OA) on the Indian Pines dataset increases from 89–91% (one to three scan types) to 92.08% with four scan types (Zhou et al., 2024).
- Mamba-scan (corner-snake scan) outperforms raster, diagonal, and zig-zag scans for centric aggregation and OA.
- Incorporation of downstream modules (Gaussian Decay Mask, Semantic Token Learner, Semantic Token Fuser) within T-Mamba further improves classification accuracy.
In VMambaMorph, CSM contributes to competitive registration quality on benchmark multi-modal brain MR-CT registration datasets, with a lower parameter count and linear scaling relative to the TransMorph ViT baseline (Wang et al., 2024).
Key hyperparameters include: patch size p (chosen per dataset), SSM hidden dimension d, channel reduction ratios, kernel choices in pre-/post-conv layers, the scan-fusion method, and the number of scan directions. Normalization (LayerNorm), nonlinearity (GeLU/SiLU), and learned low-rank or diagonal SSM parameterization are also critical for reproducibility.
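As a hedged illustration, these knobs might be grouped into a single config object; every default below is a placeholder, not a value reported in either paper:

```python
from dataclasses import dataclass

@dataclass
class CSMConfig:
    patch_size: int = 9           # p, chosen per dataset
    ssm_hidden_dim: int = 16      # SSM state dimension d
    channel_reduction: int = 4    # bottleneck ratio in pre/post conv layers
    num_scan_dirs: int = 4        # corner-wise or axis-wise scan directions
    fusion: str = "learnable"     # "learnable" | "sum" | "concat"
    norm: str = "layernorm"
    activation: str = "silu"

cfg = CSMConfig()
print(cfg)
```

Pinning these choices in one place makes ablations (e.g., one vs. four scan directions, as in the Indian Pines experiments above) straightforward to reproduce.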
6. Benefits, Limitations, and Significance
CSM modules provide global context aggregation over structured domains while maintaining linear computational complexity and high modularity. They can be inserted into existing CNN, SSM, or U-Net backbones, replacing locality-bound blocks. CSM's fusion of multiple directional or axis-wise scans enhances representational richness for central features, mitigating both the tendency of traditional RNNs to model only a single direction and the computational intractability of self-attention at scale.
Limitations include the restriction to axis-aligned or snake-scan connectivity, so off-axis contextual interactions still require architectural fusion or stacking of CSM layers. The expressivity of state-space kernels may also be less than that of full attention for highly irregular, heterogeneous patterns. This suggests that for scenarios demanding arbitrary spatial interactions, CSM may serve best in conjunction with other global modeling components.
7. Applications and Prospects
CSM-based architectures are deployed in:
- Hyperspectral image patch-centric classification, where centralized, multi-directional context is necessary for pixel-level semantic discrimination under small sample regimes (Zhou et al., 2024).
- Multi-modality deformable medical registration, enabling efficient volumetric feature modeling in U-shaped nets, as in VMambaMorph (Wang et al., 2024).
- General-purpose vision models where global coupling and efficiency are essential, potentially catalyzing further convergence of SSM and visual backbone research.
Future directions plausibly include the exploration of cross-scan fusion schemes beyond simple axis-wise or corner-wise traversals, hybridization with local attention, and principled extension to graph or non-Euclidean data domains.