CrossMamba: Efficient Cross-Modal SSM Fusion
- CrossMamba denotes a family of neural architectures that combine state space models with cross-modal, cross-channel, and cross-sequence fusion mechanisms to capture long-range dependencies efficiently.
- It leverages linear-complexity operations and innovative modules such as channel swapping, dual state space fusion, and cross-attention-inspired techniques for scalable performance.
- Empirical results demonstrate substantial improvements in accuracy and computational efficiency across diverse applications such as computer vision, time series, audio, and multi-agent perception.
CrossMamba refers to a family of recent neural architectures characterized by the integration of state space models, particularly the Mamba framework, with cross-modal, cross-channel, or cross-sequence information fusion mechanisms. Unlike traditional approaches limited by convolutional receptive fields or the quadratic complexity of Transformers, CrossMamba models employ linear-complexity operations to capture long-range dependencies and enable efficient, high-fidelity feature integration across modalities, views, agents, or feature hierarchies. The paradigm has been instantiated in diverse tasks across computer vision, time series analysis, audio processing, video understanding, and multi-modal learning, with empirical evidence showing substantial improvements in both prediction accuracy and computational efficiency.
1. Core Methodological Innovations
CrossMamba models stem from the selective state space modeling approach embodied in the Mamba architecture, whose fundamental recurrence is

$$h_t = \bar{A}_t\, h_{t-1} + \bar{B}_t\, x_t, \qquad y_t = C_t\, h_t,$$

where the input-dependent, dynamically parameterized matrices $\bar{A}_t$, $\bar{B}_t$, and $C_t$ allow selective, long-range context aggregation with complexity linear in the input length (a minimal code sketch of this recurrence is given below). The “cross” in CrossMamba typically denotes enhancements that enable these models to leverage and fuse complementary information from different
- Modalities (e.g., RGB and infrared in object detection (Dong et al., 14 Apr 2024)),
- Channels (as in multivariate time series forecasting (Zeng et al., 8 Jun 2024)),
- Views (in multi-view image classification (Zheng et al., 4 Mar 2025)),
- Agents (for collaborative perception (Li et al., 12 Sep 2024)),
- Sequences (e.g., audio mixture and clue sequence in target sound extraction (Wu et al., 7 Sep 2024)),
- or layers (cross-layer token fusion for efficiency (Shen et al., 15 Sep 2024)).
The fusion mechanisms span channel swapping, cross-attention inspired hidden state updates, cross-conditioning, and deformable alignments, but are unified by the principle of using the state space backbone as a context aggregation substrate.
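To make the recurrence above concrete, the following is a minimal, self-contained PyTorch sketch of a selective (input-dependent) state space scan. The module and projection names are illustrative assumptions, not the reference Mamba implementation, which additionally uses gating, convolutional mixing, and a hardware-aware parallel scan.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SelectiveScan(nn.Module):
    """Minimal selective SSM: h_t = Abar_t * h_{t-1} + Bbar_t * x_t,  y_t = C_t . h_t.
    Sketch only -- real Mamba adds gating, conv mixing, and a parallel scan kernel."""
    def __init__(self, d_model: int, d_state: int = 16):
        super().__init__()
        self.A_log = nn.Parameter(torch.randn(d_model, d_state))  # continuous-time A (log-parameterized)
        self.B_proj = nn.Linear(d_model, d_state)                 # input-dependent B_t
        self.C_proj = nn.Linear(d_model, d_state)                 # input-dependent C_t
        self.dt_proj = nn.Linear(d_model, d_model)                # input-dependent step size Delta_t

    def forward(self, x: torch.Tensor) -> torch.Tensor:          # x: (batch, length, d_model)
        b, L, d = x.shape
        A = -torch.exp(self.A_log)                                # negative for a stable decay
        h = torch.zeros(b, d, A.shape[-1], device=x.device)
        ys = []
        for t in range(L):                                        # O(L) sequential scan
            xt = x[:, t]                                          # (b, d)
            dt = F.softplus(self.dt_proj(xt)).unsqueeze(-1)       # (b, d, 1) discretization step
            Abar = torch.exp(dt * A)                              # discretized A_t: (b, d, n)
            Bt = self.B_proj(xt).unsqueeze(1)                     # (b, 1, n)
            Ct = self.C_proj(xt).unsqueeze(1)                     # (b, 1, n)
            h = Abar * h + dt * Bt * xt.unsqueeze(-1)             # selective state update
            ys.append((h * Ct).sum(-1))                           # y_t = C_t . h_t -> (b, d)
        return torch.stack(ys, dim=1)                             # (b, L, d)

y = SelectiveScan(d_model=32)(torch.randn(4, 128, 32))            # output shape (4, 128, 32)
```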
2. Architectural Design and Fusion Strategies
A defining characteristic of CrossMamba models is the explicit design of modules to handle cross-information propagation:
- Fusion-Mamba Block (FMB) (Dong et al., 14 Apr 2024):
- State Space Channel Swapping (SSCS): Shallow fusion realized by swapping selected channels between feature maps from different modalities.
- Dual State Space Fusion (DSSF): Deep fusion in a hidden state space, where cross-modal features interact via gating and dual attention, reducing disparity and enhancing representation consistency.
- CMamba (Zeng et al., 8 Jun 2024):
- Global Data-Dependent MLP (GDD-MLP): Learns global, input-dependent mixing weights and biases for interactions across a sample’s channels, extending beyond local or fixed MLP mixing.
- Channel Mixup: Augments training via virtual channel combinations within a sample to exploit cross-channel relationships and mitigate overfitting.
- Cross-Attention Inspired State Space (Wu et al., 7 Sep 2024, Chen et al., 21 Feb 2025):
- CrossMamba for Sound Extraction: Decomposes Mamba into query, key, and value analogues, using clue sequences and audio mixtures as respective sources, adhering to cross-attention formalism but realized with state space recurrences.
- TransMamba Cross-Mamba Module: Fuses language awareness into visual features for cross-modal adaptation, using Mamba layers as the foundation for cross-attention style fusion.
- Multi-View/Agent Fusion (Zheng et al., 4 Mar 2025, Li et al., 12 Sep 2024):
- Cross-View Swapping and Selective Scan: Channel-level interleaving and reweighting ensure robust information transfer and alignment across unregistered multi-view images (a minimal channel-swapping sketch follows this list).
- Cross-Agent Fusion via Mamba2D: BEV features sequenced and fused across agents using linear SSM modules, markedly reducing communication and computational overhead.
- Cross-Layer Fusion (Shen et al., 15 Sep 2024):
- Famba-V: Identifies and fuses similar tokens across different Vim (Vision Mamba) layers based on cosine similarity, deploying selective strategies (all-layer, interleaved, upper/lower-layer) to optimize the efficiency-accuracy trade-off.
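As referenced above for the SSCS and cross-view swapping modules, the shallow channel-swapping step can be expressed in a few lines. The sketch below assumes a fixed half-and-half exchange of channels; the published blocks instead select channels adaptively and follow the swap with deeper state space fusion.

```python
import torch

def channel_swap(feat_a: torch.Tensor, feat_b: torch.Tensor, ratio: float = 0.5):
    """Shallow cross-modal/cross-view fusion: exchange a fraction of channels between
    two feature maps of identical shape (batch, channels, H, W). Illustrative only;
    SSCS / cross-view swapping pair this step with state space fusion afterwards."""
    assert feat_a.shape == feat_b.shape
    c = feat_a.shape[1]
    keep = c - int(c * ratio)                        # channels kept from the stream's own modality
    fused_a = torch.cat([feat_a[:, :keep], feat_b[:, keep:]], dim=1)
    fused_b = torch.cat([feat_b[:, :keep], feat_a[:, keep:]], dim=1)
    return fused_a, fused_b

# Example: RGB and infrared feature maps of matching shape
rgb, ir = torch.randn(2, 64, 32, 32), torch.randn(2, 64, 32, 32)
rgb_fused, ir_fused = channel_swap(rgb, ir)          # each stream now carries half of the other's channels
```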
3. CrossMamba in Major Application Domains
CrossMamba’s methodology has been applied to multiple core domains:
| Domain | CrossMamba Instantiation | Representative Task / Dataset | Performance / Benefit Summary |
|---|---|---|---|
| Cross-modality fusion in vision | Fusion-Mamba (Dong et al., 14 Apr 2024) | M³FD, FLIR-Aligned object detection | +5.9% / +4.9% mAP, faster inference than attention-based fusion |
| Multivariate time series forecasting | CMamba (Zeng et al., 8 Jun 2024) | ETT, Electricity, Weather, Traffic | Top-1 in 65/70 settings, efficient cross-channel mixing |
| Text summarization (sequence-to-sequence) | CrossMamba (Dat et al., 25 Jun 2024) | CNN/DailyMail, arXiv | State-of-the-art ROUGE, faster inference |
| Multi-view medical image classification | XFMamba (Zheng et al., 4 Mar 2025) | MURA, CheXpert, DDSM | Highest AUROC, lowest FLOPs among compared models |
| Target sound extraction | CrossMamba (Wu et al., 7 Sep 2024) | VoxCeleb2, FSD Kaggle, TAU Urban | +0.16 dB SI-SNR at 60% of the MACs (AV-SepMamba vs. AV-SepFormer) |
| Multi-agent collaborative perception | CollaMamba (Li et al., 12 Sep 2024) | OPV2V, DAIR-V2X | Up to 71.9% FLOP reduction, 1/64 communication volume vs. SOTA, +4.1% AP |
| Spectro-temporal speech deepfake detection | BiCrossMamba-ST (Kheir et al., 20 May 2025) | ASVspoof LA21, DF21 | 67.7% / 26.3% relative improvement vs. AASIST, 6.8% vs. RawBMamba |
4. Comparative Advantages and Efficiency
CrossMamba architectures are designed to overcome the limitations of CNNs (locality and insufficient context modeling) and Transformers (expensive quadratic attention):
- Linear Complexity: Mamba-based state space processing ensures O(N) complexity for sequence length N, as opposed to O(N²) for standard self-attention and cross-attention modules. This makes such models practical for long sequences, high-resolution images, and real-time or multi-agent contexts.
- Performance Gains: Substantial improvements are observed empirically. For instance, Fusion-Mamba delivers up to 5.9% higher mAP and lower inference latency compared to Transformer-based fusion (Dong et al., 14 Apr 2024). In time series, CMamba achieves top performance in almost all tested regimes (Zeng et al., 8 Jun 2024). In collaborative perception, CollaMamba yields up to 71.9% FLOP savings and reduces communication by a factor of 64 (Li et al., 12 Sep 2024).
These efficiency gains are intrinsic to the model design—due to compact sequential representation, parameter re-use, and dynamic fusion—rather than to pruning or compression.
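As a concrete (and arbitrarily chosen) illustration of the scaling gap, for a sequence of $N = 16{,}384$ tokens:

$$
\underbrace{N^{2} \approx 2.7 \times 10^{8}}_{\text{pairwise attention scores per head}}
\qquad \text{vs.} \qquad
\underbrace{N \approx 1.6 \times 10^{4}}_{\text{sequential state updates per channel}} .
$$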
5. Theoretical Perspective and Generalization
Recent work has formalized the correspondence between Mamba’s input-dependent parameterization and classical attention (He et al., 22 Jul 2024, Wu et al., 7 Sep 2024). For example, the attention calculation

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d}}\right)V$$

can be mapped onto the state space convolutional view

$$y_t = \sum_{s \le t} C_t \Big( \prod_{r=s+1}^{t} \bar{A}_r \Big) \bar{B}_s\, x_s,$$

with $C_t$, $\bar{B}_s$, and $x_s$ acting as dynamically computed query, key, and value streams. Theoretical and empirical analyses confirm that CrossMamba blocks can achieve a global receptive field—shown by effective receptive field (ERF) visualizations and by explicit matrix roll-outs—which is on par with attention, but at reduced computational cost.
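This correspondence can be checked numerically. The sketch below (single channel, diagonal $\bar{A}$, shapes chosen only for illustration) runs the recurrence as a left-to-right scan and then materializes the implicit lower-triangular "attention" matrix $M_{t,s} = C_t^{\top}\big(\prod_{r=s+1}^{t}\bar{A}_r\big)\bar{B}_s$, confirming that both paths produce the same output.

```python
import torch

def ssm_scan(x, A_bar, B_bar, C):
    """Recurrent view: h_t = A_bar[t] * h_{t-1} + B_bar[t] * x[t],  y_t = C[t] . h_t."""
    h = torch.zeros(B_bar.shape[1])
    ys = []
    for t in range(x.shape[0]):
        h = A_bar[t] * h + B_bar[t] * x[t]
        ys.append(torch.dot(C[t], h))
    return torch.stack(ys)

def implicit_attention(A_bar, B_bar, C):
    """Roll-out view: lower-triangular M with M[t, s] = C_t . (prod_{r=s+1}^{t} A_bar_r) * B_bar_s,
    so that y = M @ x reproduces the scan output."""
    L = B_bar.shape[0]
    M = torch.zeros(L, L)
    for t in range(L):
        decay = torch.ones(B_bar.shape[1])             # running product of A_bar factors
        for s in range(t, -1, -1):
            M[t, s] = torch.dot(C[t], decay * B_bar[s])
            decay = decay * A_bar[s]                   # extend the product for the next earlier position
    return M

torch.manual_seed(0)
L, n = 6, 4
x = torch.randn(L)                                     # single-channel input
A_bar = torch.rand(L, n)                               # per-step diagonal decay in (0, 1)
B_bar, C = torch.randn(L, n), torch.randn(L, n)
assert torch.allclose(ssm_scan(x, A_bar, B_bar, C),
                      implicit_attention(A_bar, B_bar, C) @ x, atol=1e-5)
```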
This general perspective unifies vision (cross-modal, cross-view, cross-layer), sequential data (cross-channel, cross-agent), and even multi-modal fusion (language-vision (Chen et al., 21 Feb 2025)), giving rise to a generalized design pattern for efficient, globally-aware cross-information integration.
A plausible implication is that continued advances may further shrink the gap in flexibility between SSM-based models and Transformers while preserving the superior scalability properties of the former.
6. Limitations, Technical Challenges, and Future Directions
Challenges remain in extending CrossMamba-style architectures to certain complex or highly structured domains:
- Cross-modality registration: While channel swapping and selective scan suffice for unregistered views or easily aligned modalities (Zheng et al., 4 Mar 2025, Dong et al., 14 Apr 2024), precise registration in domains with subtle geometric or photometric misalignments remains difficult.
- Modeling very high-frequency interdependencies: Despite robust cross-channel/agent fusion, extremely high-dimensional or densely relational tasks can strain the fidelity of the fusion process, particularly when insufficient global context is retained.
- Domain adaptation and exogenous variable incorporation: For time series and collaborative perception, the integration of domain-specific priors, external signals, or graph-structured data is cited as a promising avenue (Zeng et al., 8 Jun 2024, Li et al., 12 Sep 2024).
Future directions highlighted in the literature include:
- Adapting cross-fusion SSM architectures to registered or geometry-constrained multi-view scenarios (Zheng et al., 4 Mar 2025).
- Exploring adaptive or learned fusion strategies (e.g., learned masks for channel/view swapping).
- Broadening CrossMamba modules for multi-modal, cross-lingual, and video understanding contexts (Chen et al., 21 Feb 2025, Tran et al., 28 Jun 2025).
- Hardware-efficient deployment and optimization for resource-constrained scenarios.
- Extending the interpretability of model decisions, leveraging the explicit mapping between SSM recurrences and attention.
7. Concluding Remarks
CrossMamba represents a significant evolution in the design of efficient, high-capacity neural architectures by leveraging state space models for cross-modal, cross-sequence, and cross-channel fusion. Its demonstrated empirical success across vision, audio, time series, and multi-agent domains—with documented gains in both accuracy and scalability—indicates a growing momentum for SSM-centric designs in scenarios demanding global context with tractable computational and memory footprints. Further research into the theoretical underpinnings, domain-specific challenges, and broader integration of cross-information mechanisms is anticipated to extend the impact of this paradigm.