
Mamba-Based Modality Disentanglement Network

Updated 29 December 2025
  • The paper introduces a neural architecture that disentangles modality-specific and modality-invariant features using Mamba state-space models.
  • It integrates dedicated encoders, attention-based fusion, and subtractive residuals to enhance representation quality and computational efficiency.
  • Results demonstrate state-of-the-art performance in registration, segmentation, and generative synthesis across diverse modalities.

A Mamba-based modality disentanglement network is a neural architecture that uses Mamba state-space sequence models (SSMs) to extract, segregate, and fuse modality-specific and modality-shared representations from multi-modal data. Such architectures have advanced multi-modal learning across visual, auditory, medical imaging, and generative modeling domains by explicitly enforcing disentanglement through purpose-designed modules and attention mechanisms, while exploiting the long-range dependency modeling and computational efficiency of Mamba SSMs. Recent research demonstrates that Mamba-based architectures, when combined with modality-aware attention, sparse parameterization, or iterative subtraction mechanisms, achieve state-of-the-art results on challenging registration, segmentation, reconstruction, and generative synthesis tasks across diverse modalities.

1. Architectural Paradigms for Modality Disentanglement

Mamba-based modality disentanglement networks typically employ combinations of dedicated encoders for modality-dependent and modality-invariant (shared) features, fusion blocks with attention-based weighting, and specialized state-space sequence modules for efficient global context aggregation.

  • Feature Extractors: Separate encoders for each modality (e.g., MRI contrasts, RGB/IR, text/audio) extract features that capture either modality-specific or shared (invariant) structure. For instance, MambaReg (Wen et al., 3 Nov 2024) deploys both a Modality-Dependent Feature Extractor (MDFE, built with learned convolutional sparse coding and Bi-Mamba blocks) and a Modality-Invariant Feature Extractor (MIFE, operating on the residual after subtracting the MD portion).
  • Fusion Modules: Architectures such as the bi-level synergistic integration block (Ji et al., 30 Apr 2025), cross-modal channel attention (Zhu et al., 5 Sep 2024), and SEAD (Style & Emotion Aware Disentangled fusion) (Fu et al., 29 Jul 2024) use modality attention, channel attention, and cross-stream projections to dynamically control the contribution from each stream (a minimal encoder-and-fusion sketch follows this list).
  • Mamba or Bi-Mamba Blocks: SSM-based layers capture both local and long-range dependencies critical for spatial/temporal alignment or fusion, enabling linear-complexity global modeling absent in pure CNN or Transformer-based methods (Wen et al., 3 Nov 2024, Ji et al., 30 Apr 2025, Fu et al., 29 Jul 2024).
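The bullets above amount to a simple wiring pattern: per-modality specific and shared encoders followed by an attention-weighted fusion of the shared streams. The following PyTorch-style sketch illustrates that pattern; all module names, layer choices, and sizes are illustrative assumptions, not the published MambaReg or bi-level fusion implementations.

```python
# Illustrative sketch only: module names and sizes are assumptions, not published code.
import torch
import torch.nn as nn

class ModalityBranch(nn.Module):
    """One branch per modality: a modality-specific encoder and a shared encoder."""
    def __init__(self, in_ch: int, dim: int):
        super().__init__()
        self.specific = nn.Sequential(nn.Conv2d(in_ch, dim, 3, padding=1), nn.GELU())
        self.shared = nn.Sequential(nn.Conv2d(in_ch, dim, 3, padding=1), nn.GELU())

    def forward(self, x):
        return self.specific(x), self.shared(x)

class AttentionFusion(nn.Module):
    """Softmax modality-level weighting over the shared feature streams."""
    def __init__(self, dim: int):
        super().__init__()
        self.score = nn.Conv2d(dim, 1, kernel_size=1)  # scalar score map per stream

    def forward(self, shared_feats):
        # shared_feats: list of (B, dim, H, W) tensors, one per modality
        scores = torch.stack([self.score(f) for f in shared_feats])  # (M, B, 1, H, W)
        weights = torch.softmax(scores, dim=0)                       # normalize across modalities
        return sum(w * f for w, f in zip(weights, shared_feats))     # (B, dim, H, W)

# Usage with two hypothetical modalities (e.g., RGB and IR):
rgb, ir = torch.randn(2, 3, 64, 64), torch.randn(2, 1, 64, 64)
branch_rgb, branch_ir = ModalityBranch(3, 32), ModalityBranch(1, 32)
(_, shared_rgb), (_, shared_ir) = branch_rgb(rgb), branch_ir(ir)
fused = AttentionFusion(32)([shared_rgb, shared_ir])                 # (2, 32, 64, 64)
```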

2. Core Mechanisms for Disentanglement and Fusion

Explicit disentanglement is enforced through design choices at both the representational and parameterization levels:

  • Subtractive Residuals: Modality-invariant residuals are isolated by subtracting reconstructed modality-dependent features from the raw observation, as in $MI = I - MD$ (Wen et al., 3 Nov 2024), or via gated subtraction of reference-specific components after feature encoding (Lyu et al., 22 Dec 2025); a minimal sketch of this mechanism follows the list.
  • Attention and Parameter Decoupling: Mixture-of-Mamba (Liang et al., 27 Jan 2025) utilizes hard modality-aware sparsity, decoupling all major projection matrices for each modality, which prevents negative transfer and preserves the unique statistics of text, image, or speech data. In other cases, softmax (modality-level) and sigmoid (channel-level) attentions weigh and fuse features from each stream (Ji et al., 30 Apr 2025, Zhu et al., 5 Sep 2024).
  • Cross-Modal and Cross-Local Attention: Networks incorporate cross-modal self-attention (e.g., audio as query vs. style/emotion as key/value in SEAD (Fu et al., 29 Jul 2024), or channel attention blending in Tmamba (Zhu et al., 5 Sep 2024)) to achieve entangled yet traceable information routing.
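As a concrete reading of the subtractive-residual mechanism referenced above, the sketch below reconstructs modality-dependent (MD) content, subtracts it from the input through a learned gate (the gate is our assumption, standing in for the gated subtraction described by Lyu et al.), and encodes the residual as the modality-invariant (MI) stream.

```python
# Illustrative sketch of MI = I - MD; layers and the gate are assumptions, not published code.
import torch
import torch.nn as nn

class SubtractiveDisentangler(nn.Module):
    """Reconstruct modality-dependent (MD) content and encode the residual as MI."""
    def __init__(self, ch: int, dim: int = 32):
        super().__init__()
        self.md_encoder = nn.Conv2d(ch, dim, 3, padding=1)   # stand-in for the MD extractor
        self.md_decoder = nn.Conv2d(dim, ch, 3, padding=1)   # maps MD features back to image space
        self.mi_encoder = nn.Conv2d(ch, dim, 3, padding=1)   # encodes the MI residual
        self.gate = nn.Conv2d(ch, ch, kernel_size=1)         # learned gate (an added assumption)

    def forward(self, image):
        md_feat = self.md_encoder(image)
        md_recon = self.md_decoder(md_feat)                                 # modality-dependent content
        mi_residual = image - torch.sigmoid(self.gate(image)) * md_recon    # gated MI = I - MD
        mi_feat = self.mi_encoder(mi_residual)                              # modality-invariant features
        return md_feat, mi_feat, mi_residual

x = torch.randn(1, 1, 64, 64)                                # e.g., a single image slice
md, mi, residual = SubtractiveDisentangler(1)(x)
```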

3. State-Space Sequence Modeling via Mamba

Mamba SSMs provide a computational and inductive bias advantage:

  • Linear Complexity with Global Context: Mamba achieves global sequence modeling in $O(N)$ time in the sequence length via parameterized state-space transitions, unlike quadratic-cost Transformers or the locality of CNNs (Wen et al., 3 Nov 2024, Ji et al., 30 Apr 2025); a schematic scan is sketched after this list.
  • Bidirectionality and Multimodal Heads: Many architectures integrate bidirectional Mamba or Bi-Mamba layers at key junctions for enhanced dependency modeling, and multi-head settings for further flexibility (Wen et al., 3 Nov 2024).
  • Integration with Other Architectures: Dual-branch systems (e.g., Tmamba (Zhu et al., 5 Sep 2024)) pair Mamba-SSM positional encoders with channel-centric linear Transformers and facilitate interaction at both feature map and attention map levels.
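To make the linear-complexity claim concrete, here is a schematic selective state-space recurrence with input-dependent step size, input, and readout projections. It uses a plain Python loop for readability; the actual Mamba kernels use a parallel, hardware-aware scan, and all dimensions here are illustrative.

```python
# Schematic selective SSM scan; a readable reference loop, not the optimized Mamba kernel.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SelectiveSSM(nn.Module):
    """Diagonal state-space recurrence with input-dependent (selective) parameters."""
    def __init__(self, dim: int, state: int = 16):
        super().__init__()
        self.A_log = nn.Parameter(torch.randn(dim, state))   # log-magnitude of transition rates
        self.B = nn.Linear(dim, state)                        # input-dependent input matrix B_t
        self.C = nn.Linear(dim, state)                        # input-dependent readout C_t
        self.delta = nn.Linear(dim, dim)                      # input-dependent step size

    def forward(self, x):                                     # x: (batch, length, dim)
        b, length, dim = x.shape
        A = -torch.exp(self.A_log)                            # negative real transition rates
        dt = F.softplus(self.delta(x))                        # (b, length, dim), positive steps
        h = x.new_zeros(b, dim, self.A_log.shape[1])          # hidden state (b, dim, state)
        outputs = []
        for t in range(length):                               # linear in length: one update per token
            Bt, Ct = self.B(x[:, t]), self.C(x[:, t])         # (b, state) each
            dA = torch.exp(dt[:, t].unsqueeze(-1) * A)        # discretized transition (b, dim, state)
            h = dA * h + dt[:, t].unsqueeze(-1) * Bt.unsqueeze(1) * x[:, t].unsqueeze(-1)
            outputs.append((h * Ct.unsqueeze(1)).sum(-1))     # y_t: (b, dim)
        return torch.stack(outputs, dim=1)                    # (b, length, dim)

y = SelectiveSSM(dim=8)(torch.randn(2, 32, 8))                # (2, 32, 8)
```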

4. Training Objectives, Losses, and Data Regimens

Disentanglement networks are trained with composite losses reflecting domain-specific goals:

| Loss Component | Mathematical Formulation | Purpose |
| --- | --- | --- |
| Similarity (MSE) | $\mathcal{L}_{sim} = \tfrac{1}{2}\left(\mathrm{MSE}(\mathrm{warp}, I_y) + \dots\right)$ | Alignment/fidelity |
| Smoothness | $\mathcal{L}_{smooth} = \sum_{p \in \varphi} \lVert \nabla p \rVert^2$ | Regularization (registration) |
| Disentanglement | $\mathcal{L}_G = \mathrm{MSE}(MI^{AG}, MI^{MR})$ | MI feature supervision |
| Fusion losses | Pixel-, gradient-, or SSIM-based | Image/feature fusion quality |
| Adversarial / reconstruction | Not always used; e.g., $\mathcal{L}_{rec} = \lVert \hat{I}_{tar} - x_{tar} \rVert_1$ | Clean target construction |
| Task-specific | Cross-entropy (segmentation), Huber (gesture), diversity, FGD | Predictive/generative benchmarks |
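A minimal sketch of how terms like those in the table might be combined into one registration-style objective; the weighting coefficients, tensor layouts, and the choice of a plain gradient penalty are illustrative assumptions.

```python
# Illustrative composite loss; weights and field layout are assumptions, not a published recipe.
import torch
import torch.nn.functional as F

def smoothness(phi):
    """Gradient penalty on a dense 2-D deformation field phi: (B, 2, H, W)."""
    dx = phi[:, :, :, 1:] - phi[:, :, :, :-1]
    dy = phi[:, :, 1:, :] - phi[:, :, :-1, :]
    return (dx ** 2).mean() + (dy ** 2).mean()

def composite_loss(warped, target, phi, mi_a, mi_b, lam_smooth=1.0, lam_dis=0.1):
    l_sim = F.mse_loss(warped, target)   # similarity / fidelity term
    l_smooth = smoothness(phi)           # regularization of the deformation field
    l_dis = F.mse_loss(mi_a, mi_b)       # supervision tying MI features across modalities
    return l_sim + lam_smooth * l_smooth + lam_dis * l_dis
```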

Data regimes span unannotated/annotated multi-modal images (Wen et al., 3 Nov 2024, Ji et al., 30 Apr 2025), multi-contrast MRI (Lyu et al., 22 Dec 2025), gesture datasets (Fu et al., 29 Jul 2024), or tokenized multi-modal corpora (Liang et al., 27 Jan 2025).

5. Quantitative Benchmarking and Ablation Findings

Performance is established via domain-matched metrics such as Dice, MI, SSIM, FGD, and PSNR/SSIM for reconstructions; minimal reference implementations of Dice and PSNR are sketched at the end of this section. Repeated findings include:

  • Superior Registration and Smoothness: MambaReg achieves 83.44 Dice and 91.01 NCC for RGB-IR registration, outperforming baseline CNN and Transformer paradigms (Wen et al., 3 Nov 2024).
  • Segmentation Gains: Bi-level fusion and Mamba modality-encoders yield 2–4% Dice improvements on BraTS and Hecktor over state-of-the-art (Ji et al., 30 Apr 2025).
  • Multi-Contrast MRI Reconstruction: MambaMDN delivers a PSNR improvement of more than 1 dB over MC-VANet with variable-density masking (Lyu et al., 22 Dec 2025).
  • Modality Decoupling Synergy: Mixture-of-Mamba’s modality-aware projection yields matching or superior losses at 25–65% of the compute cost in three-modality scenarios (Liang et al., 27 Jan 2025).
  • Generative Synthesis and Diversity: MambaGesture attains an FGD of 22.11 (vs. 103.15 for the prior best), higher diversity, and tighter beat alignment in co-speech gesture generation (Fu et al., 29 Jul 2024).

Ablations consistently demonstrate that omitting Mamba-based blocks, decoupling, or multi-level attention results in degraded disentanglement, lower accuracy, and reduced cross-modal generalization.
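For reference, minimal implementations of two of the metrics quoted above, Dice and PSNR, are sketched below; the smoothing constant and intensity range are illustrative choices.

```python
# Reference metric sketches; epsilon and max_val are illustrative choices.
import torch

def dice(pred, target, eps=1e-6):
    """Dice coefficient for binary masks of matching shape."""
    pred, target = pred.float(), target.float()
    intersection = (pred * target).sum()
    return (2 * intersection + eps) / (pred.sum() + target.sum() + eps)

def psnr(x, y, max_val=1.0):
    """Peak signal-to-noise ratio in dB for images scaled to [0, max_val]."""
    mse = torch.mean((x - y) ** 2)
    return 10 * torch.log10(max_val ** 2 / mse)
```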

6. Application Domains and Example Systems

Mamba-based modality-disentanglement networks have demonstrated effectiveness in diverse, high-impact applications:

  • Image Registration and Segmentation: MambaReg (Wen et al., 3 Nov 2024) and the tumor segmentation network (Ji et al., 30 Apr 2025) define new accuracy standards in deformable, multi-modal alignment and volumetric labeling in RGB-IR, MRI, and PET/CT domains.
  • Multi-Contrast MRI: MambaMDN (Lyu et al., 22 Dec 2025) provides dual-domain feature completion and contrast-aware refinement for highly accelerated MRI.
  • Image Fusion: Tmamba (Zhu et al., 5 Sep 2024) leverages linear Transformer–Mamba duality for infrared-visible and medical image fusion.
  • Multi-modal Generative Models: MambaGesture’s (Fu et al., 29 Jul 2024) SEAD+MambaAttn architecture sets SOTA in conditional gesture generation over text, audio, style, and emotion.
  • Multi-modal Pretraining: Mixture-of-Mamba (Liang et al., 27 Jan 2025) extends modality disentanglement and expert sparsity to multi-modal pretraining across text, images, and speech.

7. Theoretical and Practical Implications

A characteristic attribute of Mamba-based modality-disentanglement is the explicit architectural and parameter-level isolation and controlled recombination of heterogeneous input streams. This enables:

  • Avoidance of Negative Transfer: Parameter decoupling and attention-based fusion preclude dominance and interference among modalities, yielding representations better suited for both unimodal and cross-modal tasks.
  • Computational Efficiency: SSM-based kernels and Mixture-of-Mamba’s sparse projections achieve linear scaling with respect to input size and a substantial reduction in training FLOPs for modality-rich settings (Liang et al., 27 Jan 2025); a minimal sketch of modality-decoupled projections follows this list.
  • Iterative Disentanglement: The stacking of dedicated refinement or subtraction modules leads to progressive purification of shared and private features, with empirically demonstrated gains (Lyu et al., 22 Dec 2025).
  • Compositional Flexibility: Modular blocks and attentional fusion permit easy adaptation to varying numbers and types of modalities, including high-order and token-based settings.
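The sketch below gives one plausible reading of such modality-aware parameter decoupling: each token is routed through the projection weights of its own modality, so no projection weights are shared across modalities. It is illustrative only and not the Mixture-of-Mamba implementation.

```python
# Illustrative modality-decoupled projection; names and routing are assumptions.
import torch
import torch.nn as nn

class ModalityDecoupledProjection(nn.Module):
    """Hard modality-aware sparsity: one projection expert per modality."""
    def __init__(self, dim: int, n_modalities: int):
        super().__init__()
        self.experts = nn.ModuleList(nn.Linear(dim, dim) for _ in range(n_modalities))

    def forward(self, x, modality_ids):
        # x: (tokens, dim); modality_ids: (tokens,) integer labels (0 = text, 1 = image, ...)
        out = torch.empty_like(x)
        for m, proj in enumerate(self.experts):
            mask = modality_ids == m
            if mask.any():
                out[mask] = proj(x[mask])   # only the matching expert touches these tokens
        return out

proj = ModalityDecoupledProjection(dim=16, n_modalities=3)
tokens = torch.randn(10, 16)
ids = torch.randint(0, 3, (10,))
y = proj(tokens, ids)                       # (10, 16)
```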

The progression of these architectures suggests the Mamba-based approach is poised to become a foundational framework for future multi-modal, cross-modal, and modality-agnostic neural representation learning.
