Fusion-Mamba: Efficient Multimodal Fusion
- Fusion-Mamba is a class of multimodal fusion architectures that uses Mamba blocks for efficient, global, and input-adaptive integration across various modalities.
- It employs dual-level feature extraction and dual-phase fusion strategies to robustly combine local details with global dependencies while reducing computational costs.
- Empirical results demonstrate state-of-the-art performance in medical image fusion, remote sensing, and infrared–visible tasks with significant improvements in speed and memory efficiency.
Fusion-Mamba refers to a recent class of multimodal fusion architectures that leverage Mamba—selective scan state-space models (SSMs)—to achieve efficient and effective information integration across modalities in visual, medical, and cross-sensor fusion tasks. Mamba-based fusion models overcome key limitations of convolutional neural networks (CNNs) and Transformers: the locality bias and insufficient global context modeling of CNNs, and the quadratic computational/memory complexity of Transformer self-attention. Fusion-Mamba designs utilize linearly scalable Mamba blocks for modality-specific encoding and cross-modal interaction, employing dual-phase and dual-level fusion strategies to robustly capture both local and global dependencies at reduced computational cost. State-of-the-art results have been demonstrated in medical image fusion, remote sensing, infrared–visible fusion, and downstream applications such as object detection.
1. Mamba Block Fundamentals: Linear State Space Modeling
Fusion-Mamba architectures are grounded in the SSM framework, particularly the input-dependent, linearly parameterized Mamba block. The Mamba operator models a sequence $x_{1:L}$ as a state-space recurrence, $h_t = \bar{A} h_{t-1} + \bar{B} x_t$, $y_t = C h_t$, where $\bar{A}$, $\bar{B}$, and $C$ are derived from learned linear projections, with $\bar{A}$ and $\bar{B}$ often parameterized via input-dependent discretization so as to allow fast parallel scan and low-rank decomposition. The sequence can represent spatial patches, temporal slices, or tokens from different modalities. Mamba blocks support global input-adaptive modeling at $O(L)$ complexity, where $L$ is the sequence length, compared to the $O(L^2)$ cost of attention mechanisms in Transformers (Li et al., 2024).
The Mamba operator is typically implemented as a stack of blocks, each containing:
- Layer normalization,
- Parallel MLP branches producing token-wise projections,
- Discretization to derive the SSM scan form,
- Gating and residual connections to ensure gradient flow and stability.
This configuration enables both fine-grained locality (via convolutional MLP heads) and long-range dependency modeling critical for multimodal integration.
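As a toy illustration, the recurrence can be sketched in a few lines of Python. This is a scalar-state sketch with made-up projection parameters, not the optimized parallel-scan implementation; it only demonstrates the input-dependent update and the linear-in-length cost.

```python
# Minimal sketch of a Mamba-style selective state-space recurrence:
#   h_t = a_t * h_{t-1} + b_t * x_t,   y_t = c * h_t
# where a_t and b_t depend on the current input ("selective" scan).
# proj_a, proj_b, proj_c are hypothetical scalar projections.
import math

def selective_scan(xs, proj_a, proj_b, proj_c):
    """Run the SSM recurrence over a 1-D sequence in O(L) time."""
    h, ys = 0.0, []
    for x in xs:
        a = math.exp(-abs(proj_a * x))  # input-dependent decay in (0, 1]
        b = proj_b * x                  # input-dependent drive term
        h = a * h + b                   # state update
        ys.append(proj_c * h)           # linear readout
    return ys

ys = selective_scan([1.0, 0.5, -0.2, 0.8], proj_a=0.5, proj_b=1.0, proj_c=2.0)
```

One pass over the sequence produces one output per token, which is the source of the linear complexity discussed above.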
2. Dual-Level and Dual-Phase Fusion: MambaDFuse Architecture
The MambaDFuse model exemplifies core Fusion-Mamba concepts for cross-modal image fusion, introducing the following stages (Li et al., 2024):
2.1 Dual-Level Feature Extraction
Each input modality undergoes:
- Low-level pathway: Two Conv + LeakyReLU blocks extract local features (edges, textures, contrast).
- High-level pathway: Patch-embedding and normalization transform the input into token sequences; stacked Mamba blocks then yield high-level features capturing global input-adaptive dependencies.
- Channel concatenation: Concatenating the low- and high-level features fuses local and global information for each modality branch.
2.2 Dual-Phase Feature Fusion
After feature extraction:
- Shallow Fusion (Channel Exchange):
  - A binary mask swaps interleaved channels between the two modality feature maps.
  - Each swapped feature is reprocessed through a Mamba block, and the branches are fused by addition or normalization to yield the shallow-fused feature.
- Deep Fusion (Multi-modal Mamba Blocks):
  - The "M3" block takes the shallow-fused modality features, processes each with a Mamba SSM, computes gated outputs, and merges them via an MLP and residual connection.
  - Stacking multiple M3 blocks produces the final deep-fusion feature.
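The shallow-fusion channel exchange can be sketched as follows. This is a minimal illustration; the mask pattern (swap odd-indexed channels) and the flat channel-list layout are assumptions, not the paper's exact scheme.

```python
# Hypothetical sketch of shallow fusion by channel exchange: a fixed binary
# mask swaps interleaved channels between the two modality feature maps
# before each branch is reprocessed by a Mamba block.

def channel_exchange(feat_a, feat_b):
    """Swap every other channel between two equal-length channel lists."""
    out_a, out_b = list(feat_a), list(feat_b)
    for c in range(len(feat_a)):
        if c % 2 == 1:  # binary mask: exchange odd-indexed channels
            out_a[c], out_b[c] = feat_b[c], feat_a[c]
    return out_a, out_b

a, b = channel_exchange([1, 2, 3, 4], [5, 6, 7, 8])
# odd channels exchanged: a = [1, 6, 3, 8], b = [5, 2, 7, 4]
```

The exchange injects each modality's information into the other branch cheaply, before the heavier M3 deep-fusion blocks run.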
2.3 Reconstruction and Training
The spatial fused feature passes through an invertible path: patch embedding, additional Mamba blocks, and convolution layers generate the final fused image. Training losses combine intensity, texture, and SSIM terms, $\mathcal{L} = \alpha \mathcal{L}_{\text{int}} + \beta \mathcal{L}_{\text{text}} + \gamma \mathcal{L}_{\text{SSIM}}$, with scalar weights balancing the terms (Li et al., 2024).
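The weighted loss combination can be sketched in one line; the weight values below are placeholders, not the paper's settings.

```python
# Sketch of the training objective: a weighted sum of intensity, texture,
# and SSIM loss terms. alpha/beta/gamma are hypothetical placeholder weights.

def fusion_loss(l_int, l_text, l_ssim, alpha=1.0, beta=1.0, gamma=1.0):
    """Total loss = alpha * intensity + beta * texture + gamma * SSIM."""
    return alpha * l_int + beta * l_text + gamma * l_ssim

total = fusion_loss(0.3, 0.2, 0.1)  # unit weights: simple sum of the terms
```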
3. Specialized Fusion-Mamba Blocks: Generalizations and Variants
Fusion-Mamba innovations span various application domains and fusion tasks:
- FusionMamba Block (Remote Sensing): Implements dual-input SSMs by allowing one modality to modulate the SSM parameters for the other, enabling deep bidirectional cross-modal integration of spatial and spectral features (Peng et al., 2024). Used in spatial–spectral U-Net backbones for hyperspectral pansharpening.
- Difference-based Fusion Mamba: For layer-wise RGB–TIR fusion, DFM computes explicit feature differences, gates by Mamba SSM output, and projects back for enhanced token mixing while maintaining linear complexity (Lu et al., 2024).
- Dynamic Feature Fusion Module (DFFM): Combines dynamic convolution-based local enhancement with Mamba-based SSM cross-modal mixing, integrating efficient channel attention and difference maps (Xie et al., 2024).
- Multi-modal and Multi-scale Extensions: Architectures incorporate interaction between parallel spatial–frequency branches, mask-aware or order-dynamic Mamba scanning for 3D and region-specific fusion (Zhang et al., 5 Aug 2025, Li et al., 18 Aug 2025).
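The difference-based gating idea behind DFM can be illustrated with a toy sketch. Here a plain sigmoid gate stands in for the Mamba SSM, so this is a simplified analogue of the module rather than its actual form.

```python
# Illustrative difference-based fusion gate: gate each position by a function
# of the explicit RGB-TIR feature difference, then mix the two modalities.
# The real DFM derives the gate from a Mamba SSM; sigmoid is a stand-in.
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def difference_gated_fusion(rgb, tir):
    """Fuse two feature vectors via gates driven by their difference."""
    fused = []
    for r, t in zip(rgb, tir):
        g = sigmoid(r - t)                  # gate from explicit difference
        fused.append(g * r + (1.0 - g) * t)  # convex mix of the modalities
    return fused

f = difference_gated_fusion([2.0, 0.0], [0.0, 2.0])
```

Whichever modality carries the stronger response dominates the mix at that position, while the whole operation stays elementwise and therefore linear in feature size.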
4. Computational Complexity and Efficiency
Fusion-Mamba architectures consistently emphasize complexity that is linear in sequence length and spatial size. Representative per-layer complexities:
- Transformer self-attention: $O(L^2 d)$ per layer for token length $L$ and embedding dimension $d$.
- Mamba block (state-space): $O(L d)$ per layer due to efficient forward–backward scan implementations.
- Full pipelines (with $K$ layers): $O(K L d)$, with negligible memory overhead compared to attention-based models (Li et al., 2024, Peng et al., 2024).
Ablation studies and empirical profiling show that Mamba-based pipelines run substantially faster and consume considerably less memory than Transformer counterparts on long sequences or high-resolution images (Ye et al., 2024).
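The scaling gap can be made concrete with a back-of-the-envelope operation count (illustrative cost models only; real kernels and constant factors differ):

```python
# Rough per-layer cost models: attention scales as O(L^2 * d) because every
# token attends to every other token, while an SSM scan is O(L * d).

def attention_cost(L, d):
    return L * L * d  # pairwise token interactions

def ssm_cost(L, d):
    return L * d      # one scan pass over the sequence

L, d = 4096, 256
ratio = attention_cost(L, d) / ssm_cost(L, d)  # gap grows linearly with L
```

At 4096 tokens the modeled attention cost is already 4096x the scan cost, and the ratio doubles every time the sequence length doubles.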
5. Empirical Performance and Applications
Fusion-Mamba models establish state-of-the-art benchmarks in a wide range of tasks:
- Infrared–Visible Fusion: MambaDFuse achieves leading scores for EN, SD, SF, MI, SCD, VIF, Qabf, and SSIM on MSRS, RoadScene, and M³FD, improving both perceptual quality and downstream task utility (Li et al., 2024).
- Medical Fusion: On MRI–CT, MRI–PET, MRI–SPECT, and breast ultrasound datasets, Fusion-Mamba-based architectures (WDFFU-Mamba, ClinicalFMamba) deliver top performance in Dice coefficient, HD95, and SSIM, with strong cross-dataset generalization and low model complexity (Cai et al., 19 Dec 2025, Zhou et al., 5 Aug 2025).
- Multimodal Detection and Downstream Tasks: For cross-modality object detection and RGBT tracking, FMB integration improves mAP by 3–6% over state-of-the-art baselines, with fast real-time performance on large token sequences (e.g., 3840 tokens in AINet (Lu et al., 2024)).
- Remote Sensing Fusion: Dual-branch FusionMamba in spatial–spectral U-Nets outperforms deep CNN and Transformer baselines on pansharpening and hyperspectral super-resolution (WV3, Botswana) (Peng et al., 2024).
Example Table: Selected Empirical Results (Li et al., 2024)
| Task/Dataset | Metric | MambaDFuse | Prior Best |
|---|---|---|---|
| IVF (MSRS) | EN | 6.67 (best) | < 6.67 |
| IVF (RoadScene) | SF | 15.68 (best) | < 15.68 |
| MIF (MRI–CT) | SD | 90.95 (best) | < 90.95 |
| Object Detection (YOLOv5) | [email protected] | 0.935 (best) | 0.767 (IR only) |
6. Variants and Extensions: Cross-Domain Generality
Substantial derivative architectures validate the generality of the Fusion-Mamba paradigm:
- Medical Imaging: Adaptive multilevel feature fusion via modality and channel attention, tri-plane Mamba fusion for volumetric data (Ji et al., 30 Apr 2025, Zhou et al., 5 Aug 2025).
- Remote Sensing: Four-direction Mamba scanning, dual-branch U-Nets, and FusionMamba's plug-and-play dual-input SSM block for cross-band fusion (Peng et al., 2024, Wang et al., 17 Dec 2025).
- Video and Sequence Fusion: Multi-Mamba stacking with temporal T-to-V and V-to-T fusion for partially relevant video retrieval (Ying et al., 4 Jun 2025).
- Text and Audio-Visual Sentiment: Text-enhanced Fusion-Mamba leverages text-guided Mamba blocks for multimodal sentiment analysis under missing modalities (Li et al., 20 May 2025).
7. Limitations and Future Directions
Known challenges and avenues for exploration include:
- Model robustness under significant modality misalignment or noise, for example in scenarios with poor image registration (Sun et al., 10 Nov 2025).
- Further reduction of parameter overhead in hybrid or multi-branch configurations (Zhu et al., 2024).
- Extension to fully Mamba backbones for both feature extraction and fusion in tracking, detection, and 3D segmentation (Lu et al., 2024, Li et al., 18 Aug 2025).
- Exploration of end-to-end task-driven fusion, integrating joint detection/segmentation heads natively into Mamba-based fusion pipelines (Li et al., 2024, Zhu et al., 4 Feb 2026).
The Fusion-Mamba class provides a scalable, effective, and general solution to cross-modal fusion that rigorously unifies the local strengths of CNNs and global modeling capacity of SSMs. Advancements in linear-complexity sequence modeling, dual-input SSM blocks, and cross-modal/inter-branch interaction modules establish new baselines and deliver practical gains in medical, remote sensing, object detection, video, and multimodal analysis (Li et al., 2024, Peng et al., 2024, Dong et al., 2024, Cai et al., 19 Dec 2025, Zhou et al., 5 Aug 2025).