
Fusion-Mamba: Efficient Multimodal Fusion

Updated 5 March 2026
  • Fusion-Mamba is a class of multimodal fusion architectures that uses Mamba blocks for efficient, global, and input-adaptive integration across various modalities.
  • It employs dual-level feature extraction and dual-phase fusion strategies to robustly combine local details with global dependencies while reducing computational costs.
  • Empirical results demonstrate state-of-the-art performance in medical image fusion, remote sensing, and infrared–visible tasks with significant improvements in speed and memory efficiency.

Fusion-Mamba refers to a recent class of multimodal fusion architectures that leverage Mamba—selective scan state-space models (SSMs)—to achieve efficient and effective information integration across modalities in visual, medical, and cross-sensor fusion tasks. Mamba-based fusion models overcome key limitations of convolutional neural networks (CNNs) and Transformers: the locality bias and insufficient global context modeling of CNNs, and the quadratic computational/memory complexity of Transformer self-attention. Fusion-Mamba designs utilize linearly scalable Mamba blocks for modality-specific encoding and cross-modal interaction, employing dual-phase and dual-level fusion strategies to robustly capture both local and global dependencies at reduced computational cost. State-of-the-art results have been demonstrated in medical image fusion, remote sensing, infrared–visible fusion, and downstream applications such as object detection.

1. Mamba Block Fundamentals: Linear State Space Modeling

Fusion-Mamba architectures are grounded in the SSM framework, particularly the input-dependent, linearly parameterized Mamba block. The Mamba operator models a sequence as a state-space recurrence:

$$\begin{cases} h_t = \bar{A}\, h_{t-1} + \bar{B}\, x_t \\ y_t = C\, h_t \end{cases}$$

where $\bar{A}, \bar{B}, C$ are learned linear projections, with $\bar{A}$ and $\bar{B}$ often parameterized to allow fast parallel scans and low-rank decomposition. The sequence $\{x_t\}$ can represent spatial patches, temporal slices, or tokens from different modalities. Mamba blocks support global, input-adaptive modeling at $O(N)$ complexity, where $N$ is the sequence length, compared with the $O(N^2)$ cost of attention mechanisms in Transformers (Li et al., 2024).
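For concreteness, below is a minimal sequential sketch of this recurrence with input-dependent (selective) parameters. The shapes, the diagonal treatment of $\bar{A}$, and the exponential/first-order discretization are common implementation choices assumed for illustration, not the exact formulation of any cited model; in practice the loop is replaced by a hardware-efficient parallel scan.

```python
import numpy as np

def selective_ssm_scan(x, A, B, C, delta):
    """Sequential form of the selective SSM recurrence
       h_t = A_bar * h_{t-1} + B_bar * x_t,  y_t = C_t h_t.
    x:     (N, D)  input sequence (N tokens, D channels)
    A:     (D, S)  continuous state matrix (diagonal per channel)
    B:     (N, S)  input-dependent input projection
    C:     (N, S)  input-dependent output projection
    delta: (N, D)  input-dependent step sizes
    Returns y: (N, D)
    """
    N, D = x.shape
    S = A.shape[1]
    h = np.zeros((D, S))
    y = np.empty((N, D))
    for t in range(N):
        # Discretize with input-dependent step sizes (exponential for A,
        # first-order approximation for B, as in common Mamba implementations)
        A_bar = np.exp(delta[t][:, None] * A)        # (D, S)
        B_bar = delta[t][:, None] * B[t][None, :]    # (D, S)
        h = A_bar * h + B_bar * x[t][:, None]        # state update, (D, S)
        y[t] = h @ C[t]                              # readout, (D,)
    return y
```

Because the recurrence is linear in the state, the same computation can be expressed as an associative scan and parallelized over the sequence, which is what makes the $O(N)$ cost practical on GPUs.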

The Mamba operator is typically implemented as a stack of blocks, each containing:

  • Layer normalization,
  • Parallel MLP branches producing token-wise projections,
  • Discretization to derive the SSM scan form,
  • Gating and residual connections to ensure gradient flow and stability.

This configuration enables both fine-grained locality (via convolutional MLP heads) and long-range dependency modeling critical for multimodal integration.
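A minimal PyTorch sketch of such a block is shown below. The selective scan itself is left as a placeholder (`nn.Identity`), and the expansion factor, depthwise convolution width, and SiLU activations are assumptions made for the sketch rather than details taken from the cited papers.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MambaStyleBlock(nn.Module):
    """Illustrative Mamba-style block: norm -> two parallel projections
    (SSM branch with depthwise conv, gating branch) -> gated merge -> residual."""
    def __init__(self, dim, expand=2, d_conv=4):
        super().__init__()
        inner = expand * dim
        self.norm = nn.LayerNorm(dim)
        self.in_proj = nn.Linear(dim, 2 * inner)            # two parallel branches
        self.conv = nn.Conv1d(inner, inner, d_conv,
                              padding=d_conv - 1, groups=inner)  # local token mixing
        self.ssm = nn.Identity()                            # placeholder for the selective scan
        self.out_proj = nn.Linear(inner, dim)

    def forward(self, x):                                   # x: (B, N, dim)
        residual = x
        x = self.norm(x)
        u, gate = self.in_proj(x).chunk(2, dim=-1)          # (B, N, inner) each
        u = self.conv(u.transpose(1, 2))[..., :x.size(1)].transpose(1, 2)
        u = self.ssm(F.silu(u))                             # global, input-adaptive mixing
        y = u * F.silu(gate)                                # gating branch
        return residual + self.out_proj(y)                  # residual connection
```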

2. Dual-Level and Dual-Phase Fusion: MambaDFuse Architecture

The MambaDFuse model exemplifies core Fusion-Mamba concepts for cross-modal image fusion, introducing the following stages (Li et al., 2024):

2.1 Dual-Level Feature Extraction

Each modality $I_i$ ($i \in \{\text{A}, \text{B}\}$) undergoes:

  • Low-level pathway: Two $3\times3$ Conv + LeakyReLU blocks extract $F_i^l \in \mathbb{R}^{H\times W \times C_l}$ (edges, textures, contrast).
  • High-level pathway: Patch embedding and normalization transform $F_i^l$ into $X_i^0$. Stacked Mamba blocks yield $F_i^h$, capturing global, input-adaptive dependencies.
  • Channel concatenation: $F_i = \text{Proj}([F_i^l \,\|\, \text{UnpatchEmbed}(F_i^h)]) \in \mathbb{R}^{H \times W \times C}$ fuses local and global information for each modality branch (a minimal sketch of this extractor follows the list).
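The sketch below outlines this per-modality extractor, assuming the MambaStyleBlock from the Section 1 sketch is in scope. Channel widths, patch size, and block count are illustrative placeholders, not MambaDFuse's published hyperparameters.

```python
import torch
import torch.nn as nn

class DualLevelExtractor(nn.Module):
    """Sketch of the per-modality dual-level feature extraction described above."""
    def __init__(self, in_ch=1, c_low=32, dim=64, patch=4, n_blocks=2):
        super().__init__()
        # Low-level pathway: two 3x3 Conv + LeakyReLU (edges, textures, contrast)
        self.low = nn.Sequential(
            nn.Conv2d(in_ch, c_low, 3, padding=1), nn.LeakyReLU(0.2),
            nn.Conv2d(c_low, c_low, 3, padding=1), nn.LeakyReLU(0.2))
        # High-level pathway: patch embedding + stacked Mamba-style blocks
        self.patch_embed = nn.Conv2d(c_low, dim, patch, stride=patch)
        self.blocks = nn.ModuleList(MambaStyleBlock(dim) for _ in range(n_blocks))
        self.unpatch = nn.ConvTranspose2d(dim, dim, patch, stride=patch)
        # Channel concatenation + projection back to a shared width
        self.proj = nn.Conv2d(c_low + dim, dim, 1)

    def forward(self, img):                        # img: (B, in_ch, H, W)
        f_low = self.low(img)                      # (B, c_low, H, W)
        z = self.patch_embed(f_low)                # (B, dim, H/patch, W/patch)
        B, D, Hp, Wp = z.shape
        tokens = z.flatten(2).transpose(1, 2)      # (B, Hp*Wp, dim)
        for blk in self.blocks:
            tokens = blk(tokens)                   # global, input-adaptive mixing
        f_high = self.unpatch(tokens.transpose(1, 2).reshape(B, D, Hp, Wp))
        return self.proj(torch.cat([f_low, f_high], dim=1))   # F_i, (B, dim, H, W)
```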

2.2 Dual-Phase Feature Fusion

After feature extraction:

  • Shallow Fusion (Channel Exchange):
    • Binary mask $M$ swaps interleaved channels between $F_A$ and $F_B$.
    • Each swapped feature is reprocessed through a Mamba block and fused by addition or $L_1$ normalization to yield $F^s$.
  • Deep Fusion (Multi-modal Mamba Blocks):
    • The "M3" block takes (FA,FB,Fs)(F_A, F_B, F^s), processes each with a Mamba SSM, computes gated outputs, and merges via an MLP and residual connection.
    • Multiple M3 blocks create the final deep fusion feature F^d\hat F^d.
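The following sketch illustrates both phases: a channel-exchange function for shallow fusion and an M3-style block for deep fusion. The specific gating and merge operations are assumptions made for illustration, not the exact MambaDFuse formulation; MambaStyleBlock again refers to the Section 1 sketch.

```python
import torch
import torch.nn as nn

def channel_exchange(f_a, f_b, period=2):
    """Shallow fusion: swap interleaved channels between the two modality features."""
    mask = (torch.arange(f_a.size(1)) % period == 0)        # binary channel mask M
    f_a2, f_b2 = f_a.clone(), f_b.clone()
    f_a2[:, mask], f_b2[:, mask] = f_b[:, mask], f_a[:, mask]
    return f_a2, f_b2

class M3Block(nn.Module):
    """Sketch of a multi-modal Mamba ('M3') deep-fusion block: each stream goes
    through its own Mamba-style block, outputs are gated, merged by an MLP,
    and added back residually."""
    def __init__(self, dim):
        super().__init__()
        self.m_a, self.m_b, self.m_s = (MambaStyleBlock(dim) for _ in range(3))
        self.gate = nn.Linear(2 * dim, dim)
        self.mlp = nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))

    def forward(self, f_a, f_b, f_s):              # token sequences, (B, N, dim)
        a, b, s = self.m_a(f_a), self.m_b(f_b), self.m_s(f_s)
        g = torch.sigmoid(self.gate(torch.cat([a, b], dim=-1)))
        fused = g * a + (1 - g) * b + s            # gated cross-modal merge
        return f_s + self.mlp(fused)               # residual connection
```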

2.3 Reconstruction and Training

The spatially fused feature passes through an invertible path: patch embedding, additional Mamba blocks, and convolution layers generate the final fused image $I_{\mathrm{fused}}$. Losses combine intensity, texture, and SSIM terms:

$$\mathcal{L} = \alpha \|I_{\mathrm{fused}} - I_{\mathrm{ref}}\|_1 + \beta \|\nabla I_{\mathrm{fused}} - \nabla I_{\mathrm{ref}}\|_1 + \gamma \left(1 - \mathrm{SSIM}(I_{\mathrm{fused}}, I_{\mathrm{ref}})\right)$$

with typical weights $(\alpha, \beta, \gamma) = (1.0, 5.0, 10.0)$ (Li et al., 2024).
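A minimal sketch of this loss follows, assuming an external SSIM implementation `ssim_fn` (e.g., from pytorch-msssim) and a simple finite-difference gradient as the texture operator; how $I_{\mathrm{ref}}$ is constructed from the source images is task-dependent and not specified here.

```python
import torch
import torch.nn.functional as F

def gradient(img):
    """Finite-difference gradient magnitude (stand-in for the texture term)."""
    dx = img[..., :, 1:] - img[..., :, :-1]
    dy = img[..., 1:, :] - img[..., :-1, :]
    return F.pad(dx, (0, 1)).abs() + F.pad(dy, (0, 0, 0, 1)).abs()

def fusion_loss(fused, ref, ssim_fn, alpha=1.0, beta=5.0, gamma=10.0):
    """Intensity + texture + SSIM loss with the weights quoted above."""
    l_int = F.l1_loss(fused, ref)                           # intensity term
    l_tex = F.l1_loss(gradient(fused), gradient(ref))       # texture (gradient) term
    l_ssim = 1.0 - ssim_fn(fused, ref)                      # structural term
    return alpha * l_int + beta * l_tex + gamma * l_ssim
```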

3. Specialized Fusion-Mamba Blocks: Generalizations and Variants

Fusion-Mamba innovations span various application domains and fusion tasks:

  • FusionMamba Block (Remote Sensing): Implements dual-input SSMs by allowing one modality to modulate the SSM parameters for the other, enabling deep bidirectional cross-modal integration of spatial and spectral features (Peng et al., 2024). Used in spatial–spectral U-Net backbones for hyperspectral pansharpening.
  • Difference-based Fusion Mamba: For layer-wise RGB–TIR fusion, the DFM computes explicit feature differences, gates them by the output of a Mamba SSM, and projects back for enhanced token mixing while maintaining linear complexity (Lu et al., 2024); a rough sketch of this gating appears after the list.
  • Dynamic Feature Fusion Module (DFFM): Combines dynamic convolution-based local enhancement with Mamba-based SSM cross-modal mixing, integrating efficient channel attention and difference maps (Xie et al., 2024).
  • Multi-modal and Multi-scale Extensions: Architectures add interaction between parallel spatial and frequency branches, as well as mask-aware or order-dynamic Mamba scanning for 3D and region-specific fusion (Zhang et al., 5 Aug 2025, Li et al., 18 Aug 2025).
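As a rough illustration of the difference-based gating idea in the DFM item above, the snippet below computes an explicit modality difference, mixes it with a Mamba-style block (the Section 1 sketch), and uses the result to gate cross-modal injection. This is one interpretation of the described mechanism, not the published DFM implementation.

```python
import torch
import torch.nn as nn

class DifferenceFusion(nn.Module):
    """Illustrative difference-based fusion: explicit feature differences drive
    a gate that controls how much complementary information each stream absorbs."""
    def __init__(self, dim):
        super().__init__()
        self.diff_mixer = MambaStyleBlock(dim)     # mixes the difference tokens globally
        self.proj = nn.Linear(dim, dim)

    def forward(self, f_rgb, f_tir):               # (B, N, dim) token sequences
        diff = f_rgb - f_tir                       # explicit modality difference
        gate = torch.sigmoid(self.diff_mixer(diff))
        f_rgb = f_rgb + self.proj(gate * f_tir)    # inject gated TIR cues into RGB
        f_tir = f_tir + self.proj(gate * f_rgb)    # and vice versa
        return f_rgb, f_tir
```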

4. Computational Complexity and Efficiency

Fusion-Mamba architectures universally emphasize linearity in sequence length and spatial size. Representative complexities:

  • Transformer self-attention: $O(D N^2)$ per layer for token length $N$ and embedding dimension $D$.
  • Mamba block (state space): $O(D N)$ per layer, due to efficient forward–backward scan implementations.
  • Full pipelines (with $T$ layers): $O((T+t) D N)$, with negligible memory overhead compared to attention-based models (Li et al., 2024, Peng et al., 2024).
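Ignoring constant factors, the per-layer cost ratio between self-attention and an SSM scan is simply $N$; the snippet below makes the gap concrete for an assumed, illustrative image size and patching.

```python
# Illustrative scaling only; real FLOP counts depend on heads, state size, and constants.
D, N = 256, 4096                       # embedding dim, token count (e.g., a 256x256 image in 4x4 patches)
attention_cost = D * N ** 2            # O(D N^2) per layer
mamba_cost = D * N                     # O(D N) per layer
print(f"attention / mamba per-layer cost at N={N}: {attention_cost // mamba_cost}x")  # ratio equals N
```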

Ablation studies and empirical profiling show that Mamba-based pipelines are up to 15× faster and use up to 92% less memory than Transformer counterparts on long sequences or high-resolution images (Ye et al., 2024).

5. Empirical Performance and Applications

Fusion-Mamba models establish state-of-the-art benchmarks in a wide range of tasks:

  • Infrared–Visible Fusion: MambaDFuse achieves leading scores for EN, SD, SF, MI, SCD, VIF, Qabf, and SSIM on MSRS, RoadScene, and M³FD, improving both perceptual quality and task-specific performance (Li et al., 2024).
  • Medical Fusion: On MRI–CT, MRI–PET, MRI–SPECT, and breast ultrasound datasets, Fusion-Mamba-based architectures (WDFFU-Mamba, ClinicalFMamba) deliver top performance in Dice coefficient, HD95, and SSIM, with strong cross-dataset generalization and low model complexity (Cai et al., 19 Dec 2025, Zhou et al., 5 Aug 2025).
  • Multimodal Detection and Downstream Tasks: For cross-modality object detection and RGBT tracking, FMB integration improves mAP by 3–6% over state-of-the-art baselines, with fast real-time performance on large token sequences (e.g., 3840 tokens in AINet (Lu et al., 2024)).
  • Remote Sensing Fusion: Dual-branch FusionMamba in spatial–spectral U-Nets outperforms deep CNN and Transformer baselines on pansharpening and hyperspectral super-resolution (WV3, Botswana) (Peng et al., 2024).
Task/Dataset                 Metric    MambaDFuse      Prior Best
IVF (MSRS)                   EN        6.67 (best)     < 6.67
IVF (RoadScene)              SF        15.68 (best)    < 15.68
MIF (MRI–CT)                 SD        90.95 (best)    < 90.95
Object Detection (YOLOv5)    mAP@0.5   0.935 (best)    0.767 (IR only)

6. Variants and Extensions: Cross-Domain Generality

Substantial derivative architectures validate the generality of the Fusion-Mamba paradigm.

7. Limitations and Future Directions

Several known challenges and avenues for exploration remain.


The Fusion-Mamba class provides a scalable, effective, and general solution to cross-modal fusion that rigorously unifies the local strengths of CNNs and global modeling capacity of SSMs. Advancements in linear-complexity sequence modeling, dual-input SSM blocks, and cross-modal/inter-branch interaction modules establish new baselines and deliver practical gains in medical, remote sensing, object detection, video, and multimodal analysis (Li et al., 2024, Peng et al., 2024, Dong et al., 2024, Cai et al., 19 Dec 2025, Zhou et al., 5 Aug 2025).

References (17)
