Mamba-based Attention Fusion
- Mamba-based Attention Fusion is a hybrid approach combining linear state-space models with dynamic attention mechanisms to capture both long-range and local dependencies.
- It integrates dual-branch architectures, sequential fusion strategies, and token-level alignment to effectively merge multimodal and high-dimensional data.
- MAF architectures achieve superior computational efficiency and scalability, delivering high accuracy in domains like medical imaging, NLP, and video analysis.
Mamba-based Attention Fusion (MAF) refers to a class of hybrid architectures and mechanisms that synergistically combine Mamba state-space models (SSMs) with attention-based neural modules for efficient and effective modeling and fusion of complex dependencies in multimodal and high-dimensional data. The overarching philosophy of MAF frameworks is to harness the linear-time, long-range modeling power of Mamba and the expressivity of learned attention for application domains where both global and local feature fusion are required, often with strong constraints on computational complexity.
1. Theoretical Foundations and Core Designs
Mamba models are structured state-space models providing linear complexity in sequence length for inference, rooted in discretized ODEs:

$$h'(t) = A\,h(t) + B\,x(t), \qquad y(t) = C\,h(t),$$

and, after discretization,

$$h_t = \bar{A}\,h_{t-1} + \bar{B}\,x_t, \qquad y_t = C\,h_t,$$

where $\bar{A} = \exp(\Delta A)$ and $\bar{B} = (\Delta A)^{-1}\bigl(\exp(\Delta A) - I\bigr)\,\Delta B$ are discretized via zero-order hold. These frameworks capture global context efficiently by recursively summarizing past information via learned parameters.
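To make the recurrence concrete, the minimal NumPy sketch below discretizes a diagonal SSM with zero-order hold and runs the resulting linear-time scan. The function names and the diagonal parameterization are illustrative simplifications; the input-dependent (selective) parameterization of full Mamba is omitted.

```python
import numpy as np

def zoh_discretize(A, B, delta):
    """Zero-order-hold discretization of a diagonal continuous-time SSM.

    A: (d_state,) diagonal dynamics, B: (d_state,), delta: scalar step size.
    Returns (A_bar, B_bar) such that h_t = A_bar * h_{t-1} + B_bar * x_t.
    """
    A_bar = np.exp(delta * A)
    # (Delta*A)^{-1} (exp(Delta*A) - I) * Delta*B, simplified for diagonal A.
    B_bar = (A_bar - 1.0) / A * B
    return A_bar, B_bar

def ssm_scan(x, A, B, C, delta):
    """Run the discretized recurrence over a 1-D input sequence x of length L."""
    A_bar, B_bar = zoh_discretize(A, B, delta)
    h = np.zeros_like(A)
    y = np.empty_like(x)
    for t, x_t in enumerate(x):          # linear in sequence length
        h = A_bar * h + B_bar * x_t      # state update
        y[t] = np.dot(C, h)              # readout
    return y

# Example: a 16-state SSM applied to a random sequence of length 1024.
rng = np.random.default_rng(0)
A = -np.abs(rng.standard_normal(16))     # stable (negative) diagonal dynamics
B, C = rng.standard_normal(16), rng.standard_normal(16)
y = ssm_scan(rng.standard_normal(1024), A, B, C, delta=0.1)
```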
Attention mechanisms, primarily self-attention and cross-attention, work by content-based weighting: for input features $X$, queries $Q = XW_Q$, keys $K = XW_K$, and values $V = XW_V$ are generated (linear projections or normalized forms), and attention is computed as

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V.$$
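The formula maps directly onto a few lines of code. The single-head PyTorch sketch below assumes plain linear projections and omits multi-head reshaping and masking; cross-attention differs only in where the queries and keys/values are sourced from.

```python
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(x, w_q, w_k, w_v):
    """Single-head attention over features x of shape (batch, length, d_model)."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v                     # linear projections
    scores = q @ k.transpose(-2, -1) / k.shape[-1] ** 0.5   # (batch, length, length)
    return F.softmax(scores, dim=-1) @ v                    # content-based weighting

x = torch.randn(2, 128, 64)
w_q, w_k, w_v = (torch.randn(64, 64) for _ in range(3))
out = scaled_dot_product_attention(x, w_q, w_k, w_v)        # (2, 128, 64)
```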
The fusion of Mamba and attention is realized in various forms:
- Interleaving SSM and attention blocks (as in Matten (Gao et al., 5 May 2024), MambaCAFU (Bui et al., 4 Oct 2025))
- Cross-branch fusion between SSM/Mamba and attention, with explicit gating, swapping, or cross-modal alignment (as in Fusion-Mamba (Dong et al., 14 Apr 2024), Tmamba (Zhu et al., 5 Sep 2024))
- Attention-based selection or enhancement steps atop Mamba-extracted features, especially in adaptive fusion modules (e.g., DepthMamba (Meng et al., 28 Dec 2024), multi-expert strategies (Zhang et al., 21 Sep 2025))
A key unifying idea is that the state-space model in Mamba handles global or long-range dependencies at linear cost, while attention modules (often channel, spatial, or cross-attention) target more flexible, dynamic, local, or cross-modal relations.
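A schematic PyTorch sketch of the interleaving pattern is given below. The GRU is only a linear-time placeholder for a Mamba/SSM mixer so that the example stays self-contained, and the block structure is an assumption for exposition rather than any specific paper's design.

```python
import torch
import torch.nn as nn

class HybridBlock(nn.Module):
    """Schematic MAF block: a linear-time sequence mixer followed by attention.

    The GRU stands in for a Mamba/SSM layer (global context at O(L) cost);
    the multi-head attention handles flexible, content-dependent interactions.
    """
    def __init__(self, d_model: int, n_heads: int = 4):
        super().__init__()
        self.mixer = nn.GRU(d_model, d_model, batch_first=True)   # placeholder linear-time mixer
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm1, self.norm2 = nn.LayerNorm(d_model), nn.LayerNorm(d_model)

    def forward(self, x):                       # x: (batch, length, d_model)
        mixed, _ = self.mixer(self.norm1(x))    # recurrent/SSM-style global summary
        x = x + mixed                           # residual
        h = self.norm2(x)
        attended, _ = self.attn(h, h, h)        # self-attention refinement
        return x + attended                     # residual

stack = nn.Sequential(*[HybridBlock(64) for _ in range(4)])   # interleaved stack
y = stack(torch.randn(2, 256, 64))
```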
2. Principal Variants of Mamba-based Attention Fusion
MAF is instantiated according to the data domain and task requirements. Prominent architectural variants include:
Dual-Branch and Hybrid Pipelines
- Fusion-Mamba (Dong et al., 14 Apr 2024) uses a dual-path approach: one path per modality (e.g., RGB and IR), each employing Mamba-based state-space and convolutional blocks. Fusion occurs via:
- Channel swapping and shallow interaction (SSCS module)
- Deep, dual-gated fusion in a hidden state space (DSSF), with gates modulating the contribution of each modality in the hidden space; a minimal sketch of this channel-swap-plus-gating pattern follows this list.
- Tmamba (Zhu et al., 5 Sep 2024) similarly maintains parallel branches: linear Transformer for channel-aware extraction and Mamba for position-aware features. Information transfer is achieved through learnable weighting, convolutional mixing, and channel-attention cross-modal fusion.
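The following is a hypothetical PyTorch sketch of the channel-swap-plus-gating pattern described above for Fusion-Mamba. The module name, swap ratio, and gate construction are assumptions for illustration and do not reproduce the paper's actual operators.

```python
import torch
import torch.nn as nn

class GatedCrossModalFusion(nn.Module):
    """Illustrative channel-swap plus gated fusion for two modality branches
    (e.g., RGB and IR feature maps of shape (B, C, H, W))."""
    def __init__(self, channels: int, swap_ratio: float = 0.5):
        super().__init__()
        self.k = int(channels * swap_ratio)
        self.gate_a = nn.Sequential(nn.Conv2d(2 * channels, channels, 1), nn.Sigmoid())
        self.gate_b = nn.Sequential(nn.Conv2d(2 * channels, channels, 1), nn.Sigmoid())
        self.proj = nn.Conv2d(2 * channels, channels, 1)

    def forward(self, feat_a, feat_b):
        # Shallow interaction: exchange the first k channels between branches.
        swapped_a = torch.cat([feat_b[:, :self.k], feat_a[:, self.k:]], dim=1)
        swapped_b = torch.cat([feat_a[:, :self.k], feat_b[:, self.k:]], dim=1)
        # Deeper interaction: per-modality gates modulate each branch's contribution.
        ctx = torch.cat([swapped_a, swapped_b], dim=1)
        fused_a = self.gate_a(ctx) * swapped_a
        fused_b = self.gate_b(ctx) * swapped_b
        return self.proj(torch.cat([fused_a, fused_b], dim=1))

fuse = GatedCrossModalFusion(channels=64)
out = fuse(torch.randn(2, 64, 32, 32), torch.randn(2, 64, 32, 32))  # (2, 64, 32, 32)
```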
Sequential Fusion and Modular Integration
- MambaCAFU (Bui et al., 4 Oct 2025) features three encoder branches: CNN (local), Transformer (global), and a MAF branch with a stack of blocks performing Mamba-based state-space transformation, co-attention gating (mutual gating between CNN/Transformer features), and channel/spatial attention gates. This orchestrates progressive cross-scale and cross-branch fusion, regularized by efficient long-range modeling.
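A minimal sketch of such mutual (co-attention) gating between a local and a global feature stream is shown below; the gate construction is an assumed simplification, not the MambaCAFU implementation.

```python
import torch
import torch.nn as nn

class CoAttentionGate(nn.Module):
    """Mutual gating: each branch is re-weighted by a gate derived from the other."""
    def __init__(self, channels: int):
        super().__init__()
        self.gate_from_global = nn.Sequential(nn.Conv2d(channels, channels, 1), nn.Sigmoid())
        self.gate_from_local = nn.Sequential(nn.Conv2d(channels, channels, 1), nn.Sigmoid())

    def forward(self, local_feat, global_feat):
        gated_local = local_feat * self.gate_from_global(global_feat)   # CNN branch gated by Transformer context
        gated_global = global_feat * self.gate_from_local(local_feat)   # Transformer branch gated by CNN context
        return gated_local + gated_global

gate = CoAttentionGate(channels=48)
fused = gate(torch.randn(1, 48, 64, 64), torch.randn(1, 48, 64, 64))
```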
Token- and Instance-level Alignment
- Multi-expert MAF strategies, as in ME-Mamba (Zhang et al., 21 Sep 2025), process long instance sequences (e.g., slide patches, gene expression vectors) in parallel Mamba streams, with an attention-guided reordering branch that focuses on instances with high predicted relevance. Fusion across modalities is performed locally (via Optimal Transport coupling tokens) and globally (via Maximum Mean Discrepancy to align distributions).
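The two alignment objectives can be sketched compactly. The RBF-kernel MMD and the entropic Sinkhorn coupling below are generic formulations under assumed hyperparameters, not the ME-Mamba code.

```python
import torch

def mmd_rbf(x, y, sigma: float = 1.0):
    """Maximum Mean Discrepancy with an RBF kernel between token sets
    x: (n, d) and y: (m, d); used as a global distribution-alignment loss."""
    def kernel(a, b):
        return torch.exp(-torch.cdist(a, b).pow(2) / (2 * sigma ** 2))
    return kernel(x, x).mean() + kernel(y, y).mean() - 2 * kernel(x, y).mean()

def sinkhorn_plan(x, y, eps: float = 0.1, n_iters: int = 50):
    """Entropic optimal-transport coupling between token sets with uniform
    marginals; a minimal Sinkhorn sketch for local token-level alignment."""
    cost = torch.cdist(x, y)                         # (n, m) pairwise cost
    K = torch.exp(-cost / eps)
    u = torch.full((x.shape[0],), 1.0 / x.shape[0])  # uniform row marginal
    v = torch.full((y.shape[0],), 1.0 / y.shape[0])  # uniform column marginal
    a, b = torch.ones_like(u), torch.ones_like(v)
    for _ in range(n_iters):                         # alternating scaling updates
        a = u / (K @ b)
        b = v / (K.t() @ a)
    return a[:, None] * K * b[None, :]               # transport plan, shape (n, m)

slide_tokens, gene_tokens = torch.randn(200, 128), torch.randn(50, 128)
loss_global = mmd_rbf(slide_tokens, gene_tokens)     # distribution-level alignment
plan = sinkhorn_plan(slide_tokens, gene_tokens)      # soft token correspondences
```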
Task-specific Fusion
- In segmentation/super-resolution (SS-MAF (Zhang et al., 2022)), MAF modules operate in a dual-stream network, aligning segmentation and super-resolved features, performing multi-scale convolution (split spatial convolution), and merging via dual attention branches for each task.
- DepthMamba (Meng et al., 28 Dec 2024) fuses single-view and multi-view depth predictions via an attention-driven selection mechanism using cost volumes and attention-weighted variance volumes.
- For multimodal sentiment analysis (TF-Mamba (Li et al., 20 May 2025)), text-guided alignment, enhancement, and cross-modal querying are performed atop a Mamba backbone, with explicit reconstruction for missing modal data.
3. Mechanisms for Feature Fusion and Cross-Modal Interaction
The fusion of multi-source information in MAF is realized by combinations of:
- Attention gating: Channel, spatial, self-, and co-attention gates are used to recalibrate, enhance, or align intermediate features—often leveraging cross-branch or cross-modal context. For example, Co-Attention Gates in MambaCAFU (Bui et al., 4 Oct 2025) mutually gate CNN and Transformer branches, while depthwise or difference-perception attention is used in image fusion frameworks (Xie et al., 15 Apr 2024).
- State-space alignment: Projection to a shared or hidden state space allows features to interact while neutralizing disparities in dimension, geometry, or modality. In Fusion-Mamba (Dong et al., 14 Apr 2024), this enables deep, hidden-space fusion coupled with explicit gating.
- Statistical/transport-based matching: Local (Optimal Transport) or global (MMD) alignment strategies as in ME-Mamba (Zhang et al., 21 Sep 2025) enforce both token-level correspondences and consistency of distributions across modalities in the fused representation.
- Tunable weight/fusion schedules: Shared weights (as in MambAttention (Kühne et al., 1 Jul 2025)) and learnable layer scheduling (as in TransMamba (Li et al., 31 Mar 2025)) govern the blending of attention and Mamba processing, enabling dynamic adaptation across sequence positions or scales.
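As a toy illustration of a tunable fusion schedule, the sketch below learns a per-channel blend between an attention output and a Mamba output; the parameterization is an assumption for exposition, not the MambAttention or TransMamba mechanism.

```python
import torch
import torch.nn as nn

class LearnableBlend(nn.Module):
    """A learned per-channel weight decides how much of the attention output
    versus the SSM/Mamba output is kept in the fused representation."""
    def __init__(self, d_model: int):
        super().__init__()
        self.alpha = nn.Parameter(torch.zeros(d_model))  # starts as a 50/50 blend

    def forward(self, attn_out, mamba_out):
        w = torch.sigmoid(self.alpha)                    # (d_model,), values in (0, 1)
        return w * attn_out + (1.0 - w) * mamba_out

blend = LearnableBlend(d_model=64)
y = blend(torch.randn(2, 128, 64), torch.randn(2, 128, 64))
```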
4. Computational Efficiency and Scalability
The use of Mamba blocks with linear sequence complexity dramatically lowers computation and GPU memory cost versus pure Transformer architectures (with quadratic complexity in sequence length), especially in image, video, and whole-slide image (WSI) processing. Hybrid MAF designs maintain this efficiency without sacrificing the discriminative power of classical attention, through the following mechanisms (a rough scaling comparison is sketched after the list):
- State-space kernels (e.g., selective scan 2D, bidirectional SSM)
- Parameter sharing across time/frequency or modalities
- Sparse or blockwise attention modules applied only at select locations, interleaved with SSM modules
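A back-of-envelope comparison makes the scaling argument concrete. The FLOP formulas below are rough assumed estimates (constants and hardware effects ignored) intended only to show the linear-versus-quadratic trend.

```python
def rough_flops(seq_len: int, d_model: int, d_state: int = 16) -> dict:
    """Assumed rough estimates: self-attention ~ 4*L*d^2 (projections) + 2*L^2*d
    (scores and weighted sum); an SSM scan ~ c*L*d*d_state. Only the trend matters."""
    attention = 4 * seq_len * d_model ** 2 + 2 * seq_len ** 2 * d_model
    ssm_scan = 10 * seq_len * d_model * d_state
    return {"attention": attention, "ssm_scan": ssm_scan}

for length in (1_024, 16_384, 262_144):
    est = rough_flops(length, d_model=512)
    print(length, f"attention/ssm ratio ~ {est['attention'] / est['ssm_scan']:.0f}x")
```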
Empirical analyses in works such as FusionMamba (Xie et al., 15 Apr 2024), ME-Mamba (Zhang et al., 21 Sep 2025), and TransMamba (Li et al., 31 Mar 2025, Chen et al., 21 Feb 2025) show measurable reductions in inference-time FLOPs and memory, with sustained or improved task accuracy versus attention-only baselines.
5. Experimental Performance and Quantitative Outcomes
MAF-based architectures repeatedly surpass unimodal, homogeneous, or less-integrated fusion methods across domains:
- Medical imaging: In segmentation tasks (e.g., brain tumors—MambaCAFU (Bui et al., 4 Oct 2025), 3D tumor maps—(Ji et al., 30 Apr 2025), hard exudate detection—(Zhang et al., 2022)), MAF models consistently elevate Dice scores (e.g., >92% on BraTS2023) and reduce boundary errors.
- Multimodal fusion: On cross-modality object detection, Fusion-Mamba (Dong et al., 14 Apr 2024) yields mAP gains up to 5.9% over strong baselines. In multimodal sentiment analysis, TF-Mamba (Li et al., 20 May 2025) achieves superior accuracy and robustness under missing modality conditions.
- Video and time series: Matten (Gao et al., 5 May 2024) demonstrates that layering Mamba blocks with attention preserves or improves FVD in video generation at substantially reduced computational cost. Attention Mamba (Xiong et al., 2 Apr 2025) shows significant improvements (up to 14% MSE reduction) in time-series forecasting benchmarks.
- Generalization and domain transfer: Shared weight and memory converter strategies (MambAttention (Kühne et al., 1 Jul 2025), TransMamba (Chen et al., 21 Feb 2025, Li et al., 31 Mar 2025)) yield domain-invariant representations and maintain accuracy in out-of-domain conditions, supported by experimental rank tests and ablation studies.
6. Applications, Implications, and Future Directions
The adoption of MAF is growing in tasks demanding efficient, accurate, and robust fusion of heterogeneous or large-scale data:
- Medical and biomedical domains: Tumor segmentation, survival analysis (ME-Mamba (Zhang et al., 21 Sep 2025)), and bioimage fusion benefit from the dual properties of detailed local integration and global long-range alignment.
- Speech, audio, and language: In speech emotion recognition (SER), PARROT (Phukan et al., 1 Jun 2025) leverages the complementary strengths of Mamba and attention-based PTMs, boosting performance across languages and datasets. For long-context NLP, dynamic hybridization (TransMamba) affords speed and accuracy without compromising representational depth.
- Autonomous driving and scene understanding: HAMF (Mei et al., 21 May 2025) integrates scene context and motion tokens using joint-attention and Mamba-based refinement, resulting in state-of-the-art motion forecasting under strict latency and parameter budgets.
- Foundational models and cross-architecture transfer: Advanced transfer, distillation, and cross-modal fusion (e.g., TransMamba (Chen et al., 21 Feb 2025)) open avenues for universal pre-trained architectures that flexibly exploit both attention and SSM paradigms.
A plausible implication is that, as dataset and task scales increase and requirements for multi-source integration grow, MAF mechanisms will underpin a new generation of architectures that navigate the efficiency-accuracy frontier across domains.
Summary Table: Typical MAF Components and Roles
| Component | Purpose | Example References |
|---|---|---|
| Mamba SSM Module | Linear long-range modeling | (Xie et al., 15 Apr 2024, Zhang et al., 21 Sep 2025) |
| Channel/Spatial Attention | Local and global dynamic weighting | (Bui et al., 4 Oct 2025, Xie et al., 15 Apr 2024) |
| Cross-Attention | Cross-modal fusion | (Dong et al., 14 Apr 2024, Li et al., 20 May 2025) |
| State Space Fusion | Alignment of multi-branch/hidden representations | (Dong et al., 14 Apr 2024, Zhu et al., 5 Sep 2024) |
| Gating/Swapping | Modality interaction and redundancy reduction | (Dong et al., 14 Apr 2024, Kühne et al., 1 Jul 2025) |
7. Comparative Analysis and Distinctive Properties
MAF frameworks differ from classical fusion models mainly in three aspects:
- Modeling Capacity: By decoupling local detail and long-range interactions into separate but tightly-coupled modules, MAF architectures avoid the capacity bottleneck of local CNNs and the inefficiency of global-attention transformers.
- Fusion Granularity: Cross-branch/module-fusion mechanisms enable explicit token, channel, and instance-level coupling, supporting both shallow (input) and deep (hidden or output) integration—seen in strategies from channel swapping (Dong et al., 14 Apr 2024) to cross-modal alignment via OT (Zhang et al., 21 Sep 2025).
- Computational Efficiency: Linear-complexity Mamba serves as a drop-in replacement or adjunct to heavy self-attention, amplifying scalability without discarding flexible dynamic weighting.
These properties allow MAF-based models to advance the state-of-the-art in dense prediction, sequence modeling, multimodal inference, real-time decision-making, and large-scale representation learning while respecting operational constraints on compute and memory.