MambaCAFU: Adaptive Fusion Architectures

Updated 5 March 2026

MambaCAFU is a fusion architecture that integrates Mamba-based state space models with CNN and transformer branches for adaptive multi-modal, multi-scale fusion.
The architecture employs co-attention gates and selective SSMs in fusion modules, achieving linear complexity and improved segmentation performance.
Empirical results on medical imaging benchmarks demonstrate state-of-the-art accuracy, efficiency, and hardware-friendly deployment.

MambaCAFU refers to a class of architectures and fusion strategies that use Mamba-based State Space Models (SSMs) for efficient and expressive multi-modal or multi-scale fusion, with applications in medical image segmentation, neuroimaging, and cross-modal sequence analysis. The core innovation of MambaCAFU is the integration of selective SSMs into adaptive, multi-branch fusion modules, which gives linear complexity in sequence length (or spatial extent) while reportedly outperforming both pure transformer and CNN-based counterparts. The name covers specific architectural instantiations and is related to, but distinct from, CAF-Mamba, ClinicalFMamba, and hardware-efficient accelerators such as eMamba.

1. Architectural Foundations of MambaCAFU

MambaCAFU architectures are grounded in a multi-branch encoder-decoder scheme, prominently realized in the model described for medical image segmentation (Bui et al., 4 Oct 2025). The encoder branches are:

CNN branch: Employs a ResNet-18 backbone to extract local spatial and hierarchical features, producing multi-scale representations ( $r_0$ – $r_3$ ) at various resolutions.
Transformer branch: Utilizes PVTv2-B2 for global and contextual token interactions, generating features ( $t_0$ – $t_3$ ) with learned patch-wise positional embeddings.
Mamba-based Attention Fusion (MAF) branch: Incorporates a lightweight convolutional block (DConvB), followed by stacked CoASMamba blocks. These blocks leverage SSMs, spatial attention, and co-attention gating to integrate long-range, semantic, and spatial dependencies over progressively coarser spatial scales.

The decoder applies multi-scale attention-based CNNs, specifically the DoubleLCoA block, which recursively merges upsampled features with skip connections from each encoder branch, maintaining spatial detail and contextual integrity.

Key block operations, including the Co-Attention Gate (CoAG) and MambaConv, are formally defined in (Bui et al., 4 Oct 2025). MambaConv wraps the SSM in a residual form: $f_{out} = \text{ResB}\left(f_{in} + \text{SS2D}(\text{LN}(f_{in}))\right)$ where $\text{LN}$ is layer normalization, $\text{ResB}$ is a residual block, and $\text{SS2D}$ denotes a 2D selective scan driven by the Mamba SSM.
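As a rough illustration of the MambaConv residual form, the sketch below wires a toy scan into $\text{ResB}(f_{in} + \text{SS2D}(\text{LN}(f_{in})))$. It is not the SS2D operator from the paper. The function names, the sigmoid-based decay, the projection vectors `w_a`, `w_b`, `w_c`, the two-direction row-major scan, and the identity stand-in for ResB are all assumptions made for illustration. The point it shows is that each scan is a single pass over the flattened feature map, so cost grows linearly with the number of positions.

```python
import numpy as np

def selective_scan_1d(x, a, b, c):
    """Recurrence h_t = a_t * h_{t-1} + b_t * x_t, y_t = <c_t, h_t>.

    x: (L,) sequence; a, b, c: (L, N) input-dependent parameters.
    One pass over L steps with an N-sized state, so cost is O(L * N).
    """
    seq_len, n_state = a.shape
    h = np.zeros(n_state)
    y = np.empty(seq_len)
    for t in range(seq_len):
        h = a[t] * h + b[t] * x[t]
        y[t] = c[t] @ h
    return y

def toy_ss2d(f, w_a, w_b, w_c):
    """Toy 2D scan (assumption): flatten each channel row-major, scan forward
    and backward, and sum the two results. f has shape (H, W, C)."""
    f = np.asarray(f, dtype=float)
    height, width, channels = f.shape
    out = np.zeros_like(f)
    for ch in range(channels):
        x = f[:, :, ch].reshape(-1)
        # "Selective": the parameters depend on the input itself.
        a = 1.0 / (1.0 + np.exp(-np.outer(x, w_a)))  # decay in (0, 1)
        b = np.outer(x, w_b)
        c = np.outer(x, w_c)
        fwd = selective_scan_1d(x, a, b, c)
        bwd = selective_scan_1d(x[::-1], a[::-1], b[::-1], c[::-1])[::-1]
        out[:, :, ch] = (fwd + bwd).reshape(height, width)
    return out

def layer_norm(f, eps=1e-5):
    """Normalize over the channel (last) axis, without learned scale or shift."""
    mean = f.mean(axis=-1, keepdims=True)
    var = f.var(axis=-1, keepdims=True)
    return (f - mean) / np.sqrt(var + eps)

def mamba_conv_sketch(f_in, w_a, w_b, w_c, res_block=lambda z: z):
    """f_out = ResB(f_in + SS2D(LN(f_in))); ResB defaults to identity here."""
    return res_block(f_in + toy_ss2d(layer_norm(f_in), w_a, w_b, w_c))
```

Passing a real residual block as `res_block` and replacing `toy_ss2d` with a proper multi-directional selective scan would bring the sketch closer to the published block.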

2. Mamba-Based Fusion Mechanisms and Attention Gates

Fusion within MambaCAFU relies on co-attention and attention gating mechanisms that enable bidirectional semantic interaction and selective integration between branches:

Co-Attention Gate (CoAG) formulates bidirectional gating: $x' = \mathrm{CoAG}(x,t) = \mathrm{CA}\left([\mathrm{AG}(x,t),\,\mathrm{AG}(t,x)]\right)$ where CA is channel attention and AG (attention gate) generates learned spatial masks.
Attention Gate (AG) uses a gating function: $\hat x = \mathrm{AG}(g,x)=x \odot \alpha, \qquad \alpha = \sigma\left(\mathrm{Conv}_\psi(\mathrm{ReLU}(\mathrm{Conv}_x(x)+\mathrm{Conv}_g(g)))\right)$ where $\odot$ denotes element-wise multiplication and $\sigma$ is the sigmoid nonlinearity.
MambaConv injects SSM-driven long-range interactions, attaining linear complexity with respect to spatial/temporal length.

These attention gating and fusion operators are instantiated both in the encoding and decoding paths, including the CoASMamba and DoubleLCoA blocks.
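The gating equations in this section can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: features are laid out as `(C, H, W)`, the three convolutions are taken to be 1x1 (plain channel-mixing matrices), and the channel attention CA is approximated by a squeeze-style sigmoid gate over globally pooled channels, since the text does not specify its form. All function and parameter names are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def conv1x1(weight, x):
    """1x1 convolution as channel mixing: weight (C_out, C_in), x (C_in, H, W)."""
    return np.einsum("oc,chw->ohw", weight, x)

def attention_gate(g, x, w_x, w_g, w_psi):
    """AG(g, x) = x * alpha, alpha = sigmoid(Conv_psi(ReLU(Conv_x(x) + Conv_g(g))))."""
    hidden = np.maximum(conv1x1(w_x, x) + conv1x1(w_g, g), 0.0)
    alpha = sigmoid(conv1x1(w_psi, hidden))  # spatial mask, shape (1, H, W)
    return x * alpha                          # broadcast over channels

def channel_attention(f):
    """Assumed CA: global average pool per channel, then a sigmoid gate."""
    weights = sigmoid(f.mean(axis=(1, 2), keepdims=True))
    return f * weights

def co_attention_gate(x, t, params_xt, params_tx):
    """CoAG(x, t) = CA([AG(x, t), AG(t, x)]) with channel-wise concatenation."""
    gated_t = attention_gate(x, t, *params_xt)  # AG(x, t): x gates t
    gated_x = attention_gate(t, x, *params_tx)  # AG(t, x): t gates x
    return channel_attention(np.concatenate([gated_t, gated_x], axis=0))
```

Because $\alpha$ lies in $(0, 1)$, each gate can only attenuate its input, which is what makes the mask interpretable as a soft spatial selection.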

3. Training, Optimization, and Inference Protocols

MambaCAFU architectures are optimized with a weighted combination of Dice and binary cross-entropy (BCE) losses, parameterized as: $\mathcal{L} = \lambda_{\text{Dice}}\,\mathcal{L}_{\text{Dice}} + \lambda_{\text{BCE}}\,\mathcal{L}_{\text{BCE}}$ where $\lambda_{\text{Dice}}$ and $\lambda_{\text{BCE}}$ are the loss weights. Key training hyperparameters, loss schedules, and optimizer selection (AdamW or Adam) are detailed for each dataset in (Bui et al., 4 Oct 2025). Data augmentation involves geometric transformations (flips, rotations), and the loss weighting is dataset-specific (e.g., Synapse: 0.8 Dice, 0.2 BCE).
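A weighted Dice plus BCE objective of this kind can be sketched in plain Python as below. The default weights follow the Synapse example (0.8 Dice, 0.2 BCE). The soft-Dice form, the smoothing constants, and the function names are assumptions, since the exact formulation is left to the cited paper.

```python
import math

def dice_loss(probs, targets, eps=1e-6):
    """Soft Dice loss over flattened foreground probabilities and binary targets."""
    intersection = sum(p * t for p, t in zip(probs, targets))
    denominator = sum(probs) + sum(targets)
    return 1.0 - (2.0 * intersection + eps) / (denominator + eps)

def bce_loss(probs, targets, eps=1e-7):
    """Mean binary cross-entropy, with probabilities clamped away from 0 and 1."""
    total = 0.0
    for p, t in zip(probs, targets):
        p = min(max(p, eps), 1.0 - eps)
        total -= t * math.log(p) + (1 - t) * math.log(1.0 - p)
    return total / len(probs)

def combined_loss(probs, targets, w_dice=0.8, w_bce=0.2):
    """Weighted sum of the two terms; weights are dataset-specific in practice."""
    return w_dice * dice_loss(probs, targets) + w_bce * bce_loss(probs, targets)
```

For a maximally uncertain prediction (`probs` all 0.5 on a half-foreground mask), the Dice term is about 0.5 and the BCE term is $\ln 2$, so the combined value is roughly $0.8 \cdot 0.5 + 0.2 \cdot 0.693 \approx 0.539$.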

Training and evaluation use standard hardware such as NVIDIA V100 GPUs, with run time scaling approximately linearly in the number of tokens, consistent with the SSMs' linear computational cost.

4. Empirical Performance and Ablation

MambaCAFU demonstrates state-of-the-art performance on diverse medical imaging benchmarks, consistently matching or exceeding prior transformer, CNN, and hybrid models, summarized in Table 1 below.

Dataset	Main Metric	Score	#Params	FLOPs
Synapse	DSC (%)	84.87	66.7M	40.3G
BTCV	DSC (%)	76.86	66.7M	40.3G
ACDC	DSC (%)	92.37	66.7M	40.3G
ISIC'17	ACC (%)	94.07	66.7M	40.3G
GlaS	DSC (%)	96.76	66.7M	40.3G
MoNuSeg	DSC (%)	81.85	66.7M	40.3G

Ablation experiments in (Bui et al., 4 Oct 2025) report a 1–2% drop in segmentation Dice score when any major component (the ResNet CNN branch, CoAG, or MambaConv) or fusion block (CoASMamba, CoAMamba, DoubleLCoA) is omitted, indicating that each architectural component contributes to the final accuracy.

5. Extension to Cross-Modal Sequence Fusion: CAF-Mamba

Extensions of MambaCAFU, notably in CAF-Mamba (Zhou et al., 29 Jan 2026), generalize the approach for cross-modal sequence fusion tasks. Here, the architecture processes multiple input modalities (e.g., audio, facial landmarks, gaze) via:

Separate unimodal 1D-CNN + ResMamba encoders,
An explicit cross-modal interaction Mamba encoder (CIME) that fuses modalities via summation inside a ResMamba block,
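An adaptive attention fusion module that computes modality weights $\alpha_m$ with a softmax over pooled features: $\alpha_m = \frac{\exp(s_m)}{\sum_{m'} \exp(s_{m'})}$ where $s_m$ is a score derived from the pooled features of modality $m$. This is followed by a reweighted sum and projection into a fused sequence, and finally a Multimodal Mamba Encoder.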
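The adaptive attention fusion step (pool, softmax over modalities, reweighted sum, projection) can be sketched as follows. This is an illustration under assumptions rather than the CAF-Mamba implementation: temporal mean pooling, a single linear scoring vector `score_w`, and a linear projection `proj_w` are hypothetical choices, and all modalities are assumed to share the same `(T, D)` shape.

```python
import numpy as np

def adaptive_attention_fusion(features, score_w, proj_w):
    """Fuse per-modality sequences with softmax modality weights.

    features: list of M arrays, each (T, D), one per modality.
    score_w:  (D,) hypothetical scoring vector applied to pooled features.
    proj_w:   (D, D_out) hypothetical output projection.
    Returns the fused sequence (T, D_out) and the modality weights (M,).
    """
    pooled = np.stack([f.mean(axis=0) for f in features])  # (M, D), mean over time
    scores = pooled @ score_w                               # one score per modality
    scores = scores - scores.max()                          # numerical stability
    weights = np.exp(scores) / np.exp(scores).sum()         # softmax over modalities
    fused = sum(w * f for w, f in zip(weights, features))   # reweighted sum, (T, D)
    return fused @ proj_w, weights
```

The weights are computed once per sample, so the cost of this step stays linear in sequence length; the fused sequence would then be passed to the downstream multimodal encoder.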

Empirically, CAF-Mamba achieves state-of-the-art accuracy (78.69% on LMVD). Ablation results show a 4–5% accuracy drop if the explicit or adaptive fusion modules are omitted, and the model remains parameter-efficient (<1M params) with linear scaling in sequence length (Zhou et al., 29 Jan 2026).

6. Hardware Acceleration and Linear Complexity Realization

Efficient deployment of MambaCAFU models on edge devices is enabled by hardware-aware modifications detailed in the eMamba framework (Kim et al., 14 Aug 2025):

Range normalization substitutes for layer normalization, removing square-root-based mean-variance calculations and using only maximization and subtraction.
Piecewise-linear and low-cost surrogate activations replace SiLU and exponential operations, attaining <3% and <2% maximum error, respectively, with minimal digital logic.
Scale-aware quantization allows all arithmetic to be performed at 8-bit width (except SSM states), with parameter-friendly shifts and clamps.
Neural architecture search identifies Pareto-optimal architectures, balancing accuracy, latency, and resource constraints.
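Two of the ideas listed above, range-based normalization and shift-and-clamp 8-bit quantization, can be illustrated generically in plain Python. This sketch does not reproduce the eMamba formulation. It assumes that range normalization rescales by `max - min` instead of a standard deviation, and that quantization scales are powers of two so that rescaling becomes a bit shift. The function names are hypothetical.

```python
def range_normalize(xs):
    """Assumed form: (x - min) / (max - min), avoiding variance and square root."""
    lo, hi = min(xs), max(xs)
    span = hi - lo
    if span == 0:
        return [0.0 for _ in xs]  # constant input carries no range information
    return [(x - lo) / span for x in xs]

def clamp_int8(q):
    """Saturate to the signed 8-bit range."""
    return max(-128, min(127, q))

def quantize_int8(xs, shift):
    """Quantize with scale 2**-shift: q = clamp(round(x * 2**shift))."""
    return [clamp_int8(round(x * (1 << shift))) for x in xs]

def dequantize(qs, shift):
    """Map integer codes back to real values."""
    return [q / (1 << shift) for q in qs]

def requantize(acc, in_shift, out_shift):
    """Rescale an integer accumulator by a right shift, then clamp to int8.

    A product of two values quantized with shifts s1 and s2 carries shift
    s1 + s2; shifting right by (in_shift - out_shift) restores the target scale.
    """
    return clamp_int8(acc >> (in_shift - out_shift))
```

For example, 0.5 and -0.25 quantize to 8 and -4 at `shift=4`; their integer product -32 carries shift 8, and `requantize(-32, 8, 4)` gives -2, which dequantizes to -0.125, the exact real-valued product.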

FPGA/ASIC implementations achieve up to 5.62× lower latency and 9.95× higher throughput compared to transformer baselines (Kim et al., 14 Aug 2025).

7. Relations to Broader “Mamba Fusion” Architectures

The underlying principles found in MambaCAFU appear across related architectures for multimodal and multi-scale fusion:

MMMamba applies in-context Mamba fusion with multimodal interleaved scanning for pan-sharpening and image enhancement, showing consistent linear-time global fusion and SOTA results on several benchmarks (Wang et al., 17 Dec 2025).
ClinicalFMamba uses a CNN-Mamba hybrid with tri-plane scanning for 3D neuroimaging fusion and clinical grading/classification, demonstrating SSM-driven efficiency and superior fusion/diagnostic performance in both 2D and 3D settings (Zhou et al., 5 Aug 2025).

These models all rely on selective SSMs and adaptive fusion (by co-attention, in-context token concatenation, or interleaved scans), enabling deployment at scale and in computationally constrained environments while matching or exceeding the empirical performance of transformer-based fusion.

In summary, MambaCAFU designates a family of architectures centered on the integration of Mamba SSMs for adaptive, efficient, and expressive fusion across branches, modalities, or scales. These models achieve strong empirical results, low computational complexity, and hardware efficiency, as reported in several published works (Bui et al., 4 Oct 2025, Zhou et al., 29 Jan 2026, Kim et al., 14 Aug 2025, Wang et al., 17 Dec 2025, Zhou et al., 5 Aug 2025).